Paperid: 1, https://arxiv.org/pdf/2506.24127.pdf
Authors:Matthew Gwilliam, Roy Zhang, Namitha Padmanabhan, Hongyang Du, Abhinav Shrivastava
Title: How to Design and Train Your Implicit Neural Representation for Video Compression
Abstract:
Implicit neural representation (INR) methods for video compression have recently achieved visual quality and compression ratios that are competitive with traditional pipelines. However, due to the need for per-sample network training, the encoding speeds of these methods are too slow for practical adoption. We develop a library to allow us to disentangle and review the components of methods from the NeRV family, reframing their performance in terms of not only size-quality trade-offs, but also impacts on training time. We uncover principles for effective video INR design and propose a state-of-the-art configuration of these components, Rabbit NeRV (RNeRV). When all methods are given equal training time (equivalent to 300 NeRV epochs) for 7 different UVG videos at 1080p, RNeRV achieves +1.27% PSNR on average compared to the best-performing alternative for each video in our NeRV library. We then tackle the encoding speed issue head-on by investigating the viability of hyper-networks, which predict INR weights from video inputs, to disentangle training from encoding to allow for real-time encoding. We propose masking the weights of the predicted INR during training to allow for variable, higher quality compression, resulting in 1.7% improvements to both PSNR and MS-SSIM at 0.037 bpp on the UCF-101 dataset, and we increase hyper-network parameters by 0.4% for 2.5%/2.7% improvements to PSNR/MS-SSIM with equal bpp and similar speeds. Our project website is available at https://mgwillia.github.io/vinrb/ and our code is available at https://github.com/mgwillia/vinrb.
Chinese: 该研究提出了Rabbit NeRV(RNeRV),一种先进的隐式神经表示配置,用于提升视频压缩质量;并另外研究了通过超网络预测INR权重以实现实时编码,从而解决编码速度慢的问题。
English: The study introduces Rabbit NeRV (RNeRV), a state-of-the-art implicit neural representation configuration that improves video compression quality, and separately investigates hyper-networks that predict INR weights to enable real-time encoding and address slow encoding speeds.
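As a rough illustration of the weight-masking idea in this abstract, the sketch below (PyTorch, with hypothetical names; not the paper's released code) has a toy hyper-network predict flat INR weights and keeps only the largest-magnitude fraction of them, so the same predictor can be exposed to different compression levels during training.

```python
import torch
import torch.nn as nn

class TinyHyperNet(nn.Module):
    """Hypothetical hyper-network mapping a video embedding to flat INR weights."""
    def __init__(self, embed_dim: int, num_inr_weights: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, num_inr_weights),
        )

    def forward(self, video_embedding: torch.Tensor) -> torch.Tensor:
        return self.mlp(video_embedding)

def mask_predicted_weights(weights: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep only the largest-magnitude fraction of predicted INR weights.

    Varying keep_ratio during training exposes the predictor to several
    compression levels; the paper's exact masking rule may differ.
    """
    k = max(1, int(keep_ratio * weights.numel()))
    threshold = weights.abs().flatten().kthvalue(weights.numel() - k + 1).values
    return weights * (weights.abs() >= threshold).float()

hyper = TinyHyperNet(embed_dim=64, num_inr_weights=1024)
w = hyper(torch.randn(1, 64))                       # predicted INR weights
w_half = mask_predicted_weights(w, keep_ratio=0.5)  # roughly half of the weights kept
```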

Authors:Jiacheng Cui, Xinyue Bi, Yaxin Luo, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen
Title: FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation
Abstract:
Residual connection has been extensively studied and widely applied at the model architecture level. However, its potential in the more challenging data-centric approaches remains unexplored. In this work, we introduce the concept of Data Residual Matching for the first time, leveraging data-level skip connections to facilitate data generation and mitigate data information vanishing. This approach maintains a balance between newly acquired knowledge through pixel space optimization and existing core local information identification within raw data modalities, specifically for the dataset distillation task. Furthermore, by incorporating optimization-level refinements, our method significantly improves computational efficiency, achieving superior performance while reducing training time and peak GPU memory usage by 50%. Consequently, the proposed method Fast and Accurate Data Residual Matching for Dataset Distillation (FADRM) establishes a new state-of-the-art, demonstrating substantial improvements over existing methods across multiple dataset benchmarks in both efficiency and effectiveness. For instance, with ResNet-18 as the student model and a 0.8% compression ratio on ImageNet-1K, the method achieves 47.7% test accuracy in single-model dataset distillation and 50.0% in multi-model dataset distillation, surpassing RDED by +5.7% and outperforming state-of-the-art multi-model approaches, EDC and CV-DD, by +1.4% and +4.0%. Code is available at: https://github.com/Jiacheng8/FADRM.
Chinese: 本文首次提出数据残差匹配方法,通过数据级跳跃连接在数据集蒸馏任务中平衡新知识获取与原始数据核心信息,大幅提升计算效率并降低资源消耗,在多个基准测试中实现了最优性能。
English: This paper introduces Data Residual Matching, a novel data-centric approach that uses skip connections to enhance dataset distillation by balancing new knowledge acquisition with core data preservation, achieving state-of-the-art efficiency and accuracy with significant reductions in computational resources.
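A minimal sketch of what a data-level skip connection could look like during distilled-image optimization, assuming a simple convex blend between the optimized pixels and the raw reference image; the coefficients, update rule, and function names are illustrative, not FADRM's actual implementation.

```python
import torch

def data_residual_step(synthetic: torch.Tensor,
                       raw_reference: torch.Tensor,
                       grad: torch.Tensor,
                       lr: float = 0.1,
                       residual_weight: float = 0.3) -> torch.Tensor:
    """One hypothetical distillation update with a data-level skip connection.

    The synthetic image first moves along the matching gradient (new knowledge
    from pixel-space optimization), then is blended with the raw reference
    image so local structure from the original data is not washed out.
    """
    optimized = synthetic - lr * grad
    return (1.0 - residual_weight) * optimized + residual_weight * raw_reference

# toy usage with random tensors standing in for a distilled image, its raw
# counterpart, and the gradient of some matching loss
synthetic = torch.randn(3, 32, 32)
raw, grad = torch.randn(3, 32, 32), torch.randn(3, 32, 32)
synthetic = data_residual_step(synthetic, raw, grad)
```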

Authors:Sixun Dong, Wei Fan, Teresa Wu, Yanjie Fu
Title: Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives
Abstract:
Time series forecasting traditionally relies on unimodal numerical inputs, which often struggle to capture high-level semantic patterns due to their dense and unstructured nature. While recent approaches have explored representing time series as text using large language models (LLMs), these methods remain limited by the discrete nature of token sequences and lack the perceptual intuition humans typically apply, such as interpreting visual patterns. In this paper, we propose a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives. Rather than using natural language or real-world images, we construct both modalities directly from numerical sequences. We then align these views in a shared semantic space via contrastive learning, enabling the model to capture richer and more complementary representations. Furthermore, we introduce a variate selection module that leverages the aligned representations to identify the most informative variables for multivariate forecasting. Extensive experiments on fifteen short-term and six long-term forecasting benchmarks demonstrate that our approach consistently outperforms strong unimodal and cross-modal baselines, highlighting the effectiveness of multimodal alignment in enhancing time series forecasting. Code is available at: https://github.com/Ironieser/TimesCLIP.
中文摘要:本文提出一种多模态对比学习框架,将时间序列转化为对齐的视觉与文本表征,通过增强语义对齐和变量选择机制,在预测任务中实现了卓越性能。
English Summary: This paper introduces a multimodal contrastive learning framework that transforms time series data into aligned visual and textual representations, achieving superior forecasting performance through enhanced semantic alignment and variate selection.
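The core alignment step can be pictured with a small CLIP-style sketch: one encoder takes a rasterized "visual" view of the series, another takes a numeric "textual" view, and a symmetric contrastive loss pulls the two views of the same series together. The encoders, shapes, and temperature below are placeholder assumptions, not the TimesCLIP architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualEncoder(nn.Module):
    """Stand-in for an image encoder over a rasterized line plot of the series."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, dim))

    def forward(self, x):           # x: (B, 1, 64, 64) rasterized series
        return F.normalize(self.net(x), dim=-1)

class TextualEncoder(nn.Module):
    """Stand-in for an encoder over a token sequence built from the raw values."""
    def __init__(self, seq_len=96, dim=128):
        super().__init__()
        self.net = nn.Linear(seq_len, dim)

    def forward(self, x):           # x: (B, seq_len) numeric "tokens"
        return F.normalize(self.net(x), dim=-1)

def clip_style_loss(z_vis, z_txt, temperature=0.07):
    """Symmetric InfoNCE loss aligning the two views of the same series."""
    logits = z_vis @ z_txt.t() / temperature
    targets = torch.arange(z_vis.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy batch: 8 series, each with a visual raster and a numeric token view
z_v = VisualEncoder()(torch.randn(8, 1, 64, 64))
z_t = TextualEncoder()(torch.randn(8, 96))
loss = clip_style_loss(z_v, z_t)
```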

Authors:Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
Title: Calligrapher: Freestyle Text Image Customization
Abstract:
We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.
中文: Calligrapher是一种基于扩散的框架,将文本定制与艺术字体相结合,通过自蒸馏和局部风格注入技术,精确复现复杂风格和字形,在数字设计应用中超越传统模型。
English: Calligrapher is a diffusion-based framework that combines text customization with artistic typography, using self-distillation and localized style injection to accurately reproduce intricate styles and precise glyphs, surpassing traditional models in digital design applications.

Authors:Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin
Title: Epona: Autoregressive Diffusion World Model for Autonomous Driving
Abstract:
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with a 7.4% FVD improvement and prediction durations minutes longer than prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at https://github.com/Kevin-thu/Epona/.
Chinese: Epona模型通过解耦时空建模与模块化轨迹集成,实现了长周期高分辨率的世界模拟,在视频预测质量和实时运动规划方面均超越现有方法。
English: The proposed Epona model introduces an autoregressive diffusion framework with decoupled spatiotemporal factorization and modular trajectory integration, enabling long-horizon, high-resolution world modeling that outperforms existing methods in both video prediction quality and real-time motion planning capabilities.

Authors:Shai Yehezkel, Omer Dahary, Andrey Voynov, Daniel Cohen-Or
Title: Navigating with Annealing Guidance Scale in Diffusion Space
Abstract:
Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
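A minimal sketch of guidance-scale annealing in classifier-free guidance, assuming a simple linear schedule in place of the learned scheduling policy described in the abstract; the function name and scale values are illustrative.

```python
import torch

def annealed_cfg_step(eps_uncond: torch.Tensor,
                      eps_cond: torch.Tensor,
                      t: float,
                      w_start: float = 9.0,
                      w_end: float = 2.0) -> torch.Tensor:
    """Classifier-free guidance with a time-varying scale.

    t runs from 1.0 (start of sampling, most noise) to 0.0 (end). Here the
    scale anneals linearly between two values; the paper instead learns a
    scheduling policy from the conditional noisy signal, so this is only a
    shape-compatible placeholder for that policy.
    """
    w_t = w_end + (w_start - w_end) * t
    return eps_uncond + w_t * (eps_cond - eps_uncond)

# toy usage with random noise predictions at one denoising step
eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
eps_guided = annealed_cfg_step(eps_u, eps_c, t=0.8)
```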

Authors:Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi
Title: DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities. Several grounded caption datasets face the problems of missing detailed descriptions, relations, and massive object descriptions on high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates the object-level, detailed captions with the guidance of masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments on various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.
中文摘要:DenseWorld-1M数据集通过三阶段标注流程和专用视觉语言模型,解决了现有标注数据集中细节描述与空间关系缺失的问题,在多项视觉语言任务中验证了其有效性。
English Summary: The DenseWorld-1M dataset addresses limitations in existing caption datasets by providing detailed, dense grounded captions through a three-stage labeling pipeline and specialized VLM models, demonstrating effectiveness across multiple vision-language tasks.

Authors:Moein Heidari, Yasamin Medghalchi, Mahdi Khoursha, Reza Rezaeian, Ilker Hacihaliloglu
Title: WaRA: Wavelet Low Rank Adaptation
Abstract:
Parameter-efficient fine-tuning (PEFT) has gained widespread adoption across various applications. Among PEFT techniques, Low-Rank Adaptation (LoRA) and its extensions have emerged as particularly effective, allowing efficient model adaptation while significantly reducing computational overhead. However, existing approaches typically rely on global low-rank factorizations, which overlook local or multi-scale structure, failing to capture complex patterns in the weight updates. To address this, we propose WaRA, a novel PEFT method that leverages wavelet transforms to decompose the weight update matrix into a multi-resolution representation. By performing low-rank factorization in the wavelet domain and reconstructing updates through an inverse transform, WaRA obtains compressed adaptation parameters that harness multi-resolution analysis, enabling it to capture both coarse and fine-grained features while providing greater flexibility and sparser representations than standard LoRA. Through comprehensive experiments and analysis, we demonstrate that WaRA achieves superior performance on diverse vision tasks, including image generation, classification, and semantic segmentation, significantly enhancing generated image quality while reducing computational complexity. Although WaRA was primarily designed for vision tasks, we further showcase its effectiveness in language tasks, highlighting its broader applicability and generalizability. The code is publicly available at https://github.com/moeinheidari7829/WaRA.
中文: 提出的WaRA方法通过小波变换实现多分辨率分析,在视觉和语言任务中显著提升性能并降低计算成本,改进了参数高效微调技术。
English: The proposed WaRA method enhances parameter-efficient fine-tuning by applying wavelet transforms for multi-resolution analysis, achieving superior performance in vision and language tasks with reduced computational costs.
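To make the idea concrete, the sketch below parameterizes a LoRA-style weight update in a one-level Haar wavelet domain: each subband gets its own rank-r factors and the full update is recovered by the inverse transform. The wavelet family, number of levels, and initialization are assumptions for illustration and may differ from WaRA's actual design.

```python
import torch
import torch.nn as nn

def inverse_haar_2d(ll, lh, hl, hh):
    """One-level inverse 2D Haar transform, assembling a (2m, 2n) matrix from subbands."""
    m, n = ll.shape
    out = torch.empty(2 * m, 2 * n, dtype=ll.dtype, device=ll.device)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

class WaveletLowRankUpdate(nn.Module):
    """Sketch of a LoRA-style update parameterized in a Haar wavelet domain.

    Each of the four subbands of the weight update gets its own rank-r factors;
    the full update is obtained by the inverse wavelet transform. One factor is
    zero-initialized so the update starts at zero, as in LoRA.
    """
    def __init__(self, out_features: int, in_features: int, rank: int = 4):
        super().__init__()
        m, n = out_features // 2, in_features // 2
        self.factors = nn.ParameterList()
        for _ in range(4):  # LL, LH, HL, HH subbands
            self.factors.append(nn.Parameter(torch.randn(m, rank) * 0.01))
            self.factors.append(nn.Parameter(torch.zeros(rank, n)))

    def forward(self) -> torch.Tensor:
        bands = [self.factors[2 * i] @ self.factors[2 * i + 1] for i in range(4)]
        return inverse_haar_2d(*bands)

# adapt a frozen 128x64 linear layer: W_effective = W_frozen + delta
delta = WaveletLowRankUpdate(128, 64, rank=4)()   # zero at initialization
```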

Authors:Mayank V. Golhar, Lucas Sebastian Galeano Fretes, Loren Ayers, Venkata S. Akshintala, Taylor L. Bobrow, Nicholas J. Durr
Title: C3VDv2 -- Colonoscopy 3D video dataset with enhanced realism
Abstract:
Spatial computer vision techniques have the potential to improve the diagnostic performance of colonoscopy. However, the lack of 3D colonoscopy datasets for training and validation hinders their development. This paper introduces C3VDv2, the second version (v2) of the high-definition Colonoscopy 3D Video Dataset, featuring enhanced realism designed to facilitate the quantitative evaluation of 3D colon reconstruction algorithms. 192 video sequences totaling 169,371 frames were captured by imaging 60 unique, high-fidelity silicone colon phantom segments. Ground truth depth, surface normals, optical flow, occlusion, diffuse maps, six-degree-of-freedom pose, coverage map, and 3D models are provided for 169 colonoscopy videos. Eight simulated screening colonoscopy videos acquired by a gastroenterologist are provided with ground truth poses. Lastly, the dataset includes 15 videos with colon deformations for qualitative assessment. C3VDv2 emulates diverse and challenging scenarios for 3D reconstruction algorithms, including fecal debris, mucous pools, blood, debris obscuring the colonoscope lens, en-face views, and fast camera motion. The enhanced realism of C3VDv2 will allow for more robust and representative development and evaluation of 3D reconstruction algorithms. Project Page - https://durrlab.github.io/C3VDv2/

Authors:Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun
Title: A Survey on Vision-Language-Action Models for Autonomous Driving
Abstract:
The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is rapidly expanding. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer-style models to reasoning-centric VLA models, and (iii) compare over 20 representative models according to VLA's progress in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges - robustness, real-time efficiency, and formal verification - and outline future directions of VLA4AD. This survey provides a concise yet complete reference for advancing interpretable, socially aligned autonomous vehicles. The GitHub repo is available at https://github.com/JohnsonJiang1996/Awesome-VLA4AD.
中文: 本综述首次全面概述了自动驾驶中的视觉-语言-动作模型,通过分析其发展历程、架构组件及20多个代表性模型,同时指出了鲁棒性和实时效率等关键挑战。
English: This survey provides the first comprehensive overview of Vision-Language-Action models for autonomous driving, analyzing their evolution, architectural components, and over 20 representative models while identifying key challenges like robustness and real-time efficiency.

Authors:Hongxin Zhang, Zheyuan Zhang, Zeyuan Wang, Zunzhe Zhang, Lixing Fang, Qinhong Zhou, Chuang Gan
Title: Ella: Embodied Social Agents with Lifelong Memory
Abstract:
We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella's capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate with other agents well to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at https://umass-embodied-agi.github.io/Ella/.

Authors:Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
Title: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
Abstract:
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
Chinese: 近期语言模型的进展推动了对可解释图像描述评估指标的需求,因此开发了EXPERT这一无参考系统,它基于流畅性、相关性和描述性提供结构化评估,实现了顶尖性能和高品质解释。
English: Recent progress in language models has spurred the need for explainable image captioning metrics, leading to the development of EXPERT, a reference-free system that provides structured evaluations based on fluency, relevance, and descriptiveness, achieving top performance and high-quality explanations.

Authors:Lijun Sheng, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Title: The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models
Abstract:
Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and obscure their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP (a model trained with a sigmoid loss) and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.
中文: TTA-VLM作为一个综合性基准,旨在解决当前视觉语言模型测试时适应研究中的局限性,通过统一框架和多样化评估指标,促进对多种方法、数据集及模型可信度的公平全面比较。
English: TTA-VLM is introduced as a comprehensive benchmark to address limitations in current test-time adaptation research for vision-language models, enabling fair and holistic evaluations across multiple methods, datasets, and metrics beyond just accuracy.
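For readers unfamiliar with the episodic protocol the benchmark evaluates, here is a minimal, generic sketch of one such method (entropy minimization on a single test batch starting from a fresh copy of the source model); the actual TTA methods, backbones, and hyper-parameters in TTA-VLM differ.

```python
import copy
import torch

def episodic_tta_step(model, x, adapt_params, steps: int = 1, lr: float = 1e-3):
    """Entropy-minimization TTA on one test batch, episodic style.

    The model is deep-copied so every test sample starts from the same source
    weights; only parameters whose names are in adapt_params are updated.
    """
    adapted = copy.deepcopy(model)
    params = [p for n, p in adapted.named_parameters() if n in adapt_params]
    optimizer = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        probs = adapted(x).softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    with torch.no_grad():
        return adapted(x).argmax(dim=-1)

# toy usage with a stand-in classifier; real methods adapt prompts or norm layers of a VLM
model = torch.nn.Linear(32, 10)
preds = episodic_tta_step(model, torch.randn(4, 32), adapt_params={"bias"})
```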

Authors:Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu
Title: Visual and Memory Dual Adapter for Multi-Modal Object Tracking
Abstract:
Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from auxiliary modality to dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.
中文摘要:该双适配器框架通过频率引导的视觉提示和多级记忆存储,无需全微调即可增强跨模态交互和时间一致性,从而提升多模态跟踪性能。
English Summary: The proposed dual-adapter framework enhances multi-modal tracking by incorporating frequency-guided visual prompts and multi-level memory storage to improve cross-modal interaction and temporal consistency without full fine-tuning.

Authors:Zinuo You, Stamatios Georgoulis, Anpei Chen, Siyu Tang, Dengxin Dai
Title: GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering
Abstract:
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain they operate in, suffer from several issues (e.g. geometric distortions, excessive cropping, poor generalization) that degrade the user experience. To address these issues, we introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent "local reconstruction and rendering" paradigm. Given 3D camera poses, we augment a reconstruction model to predict Gaussian Splatting primitives, and finetune it at test-time, with multi-view dynamics-aware photometric supervision and cross-frame regularization, to produce temporally-consistent local reconstructions. The model is then used to render each stabilized frame. We utilize a scene extrapolation module to avoid frame cropping. Our method is evaluated on a repurposed dataset, instilled with 3D-grounded information, covering samples with diverse camera motions and scene dynamics. Quantitatively, our method is competitive with or superior to state-of-the-art 2D and 2.5D approaches in terms of conventional task metrics and new geometry consistency. Qualitatively, our method produces noticeably better results compared to alternatives, validated by the user study.

Authors:Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung
Title: Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Abstract:
Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
中文: 摘要阐述了多模态推理中的范式转变,即人工智能从仅对图像进行思考转向利用视觉信息作为动态认知工作空间进行思考,并概述了其三个关键发展阶段。
English: The abstract introduces a paradigm shift in multimodal reasoning where AI transitions from merely thinking about images to thinking with them, utilizing visual information as a dynamic cognitive workspace across three developmental stages.

Authors:Longliang Liu, Miaojie Feng, Junda Cheng, Jijun Xiang, Xuan Zhu, Xin Yang
Title: PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
Abstract:
Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: https://github.com/longliangLiu/PriOr-Flow.
中文: PriOr-Flow提出了一种双分支框架,通过DCCL算子和ODDC模块有效减少全景光流估计中的畸变,在公开数据集上取得了领先的性能。
English: PriOr-Flow introduces a dual-branch framework with a DCCL operator and ODDC module to mitigate panoramic distortion in optical flow estimation, achieving state-of-the-art performance on benchmark datasets.

Authors:Jiahao Ma, Lei Wang, Miaomiao Liu, David Ahmedt-Aristizabal, Chuong Nguyen
Title: Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction
Abstract:
Multi-view 3D reconstruction remains a core challenge in computer vision. Recent methods, such as DUST3R and its successors, directly regress pointmaps from image pairs without relying on known scene geometry or camera parameters. However, the performance of these models is constrained by the diversity and scale of available training data. In this work, we introduce Puzzles, a data augmentation strategy that synthesizes an unbounded volume of high-quality posed video-depth data from a single image or video clip. By simulating diverse camera trajectories and realistic scene geometry through targeted image transformations, Puzzles significantly enhances data variety. Extensive experiments show that integrating Puzzles into existing video-based 3D reconstruction pipelines consistently boosts performance without modifying the underlying network architecture. Notably, models trained on only ten percent of the original data augmented with Puzzles still achieve accuracy comparable to those trained on the full dataset. Code is available at https://jiahao-ma.github.io/puzzles/.

Authors:Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
Title: VMoBA: Mixture-of-Block Attention for Video Diffusion Models
Abstract:
The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.
中文: 本文提出的VMoBA是一种新型稀疏注意力机制,通过自适应时空模式显著加速视频扩散模型训练,在保持甚至提升视频生成质量的同时实现了可观的加速效果。
English: This paper introduces VMoBA, a novel sparse attention mechanism that accelerates Video Diffusion Models by adapting to spatio-temporal patterns, achieving significant speed improvements while maintaining or enhancing video generation quality.
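One way to read the threshold-based block selection described above is sketched below: per-query similarities to mean-pooled key blocks are sorted, and the smallest set of blocks whose cumulative similarity mass exceeds a threshold is attended. This is an illustrative interpretation, not the released VMoBA kernel.

```python
import torch

def threshold_block_selection(q: torch.Tensor,
                              block_keys: torch.Tensor,
                              tau: float = 0.9) -> torch.Tensor:
    """Per query, keep the fewest key blocks whose cumulative similarity mass exceeds tau.

    q:          (num_queries, d)  query vectors
    block_keys: (num_blocks, d)   mean-pooled key of each spatio-temporal block
    Returns a boolean mask of shape (num_queries, num_blocks).
    """
    sims = (q @ block_keys.t()).softmax(dim=-1)        # similarity mass per block
    sorted_sims, order = sims.sort(dim=-1, descending=True)
    cum = sorted_sims.cumsum(dim=-1)
    keep_sorted = cum <= tau
    keep_sorted[:, 0] = True                           # always keep the top block
    first_over = (cum > tau).float().argmax(dim=-1)    # block that crosses the threshold
    keep_sorted[torch.arange(q.size(0)), first_over] = True
    mask = torch.zeros_like(sims)
    mask.scatter_(1, order, keep_sorted.float())       # map back to original block order
    return mask.bool()

mask = threshold_block_selection(torch.randn(16, 64), torch.randn(12, 64))
```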

Authors:Ji Zhang, Shihan Wu, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen
Title: A Closer Look at Conditional Prompt Tuning for Vision-Language Models
Abstract:
Despite the great promise of Prompt Tuning (PT) in adapting large Vision-Language Pretrained Models (VLPMs) to downstream tasks, PT methods often struggle to overcome the Base-New Tradeoff (BNT) dilemma: as VLPMs are better tuned to a base task, their ability to generalize to new tasks diminishes. Recent work on conditional PT addresses this problem by replacing static prompts with dynamic Visual Image Information (VII)-conditioned prompts, improving the model's generalization to new tasks to some extent. In this work, we first identify a critical issue with existing conditional PT methods: using VII as the "condition" of prompts yields suboptimal performance, and even random noise-conditioned prompts can outperform the VII-conditioned counterparts. On further analysis, we find that learning dynamic prompts conditioned on Textual Class Information (TCI) is the key to solving the BNT problem. Motivated by this, we then propose Class-adaptive Prompt Tuning (CaPT), which enables fast adaptation of tuned models to new classes by learning TCI-conditioned prompts from base classes. Remarkably, CaPT can be used as a plugin to mitigate the BNT problem for existing unconditional PT schemes. Extensive experiments on 11 datasets show that CaPT consistently improves the performance of five strong unconditional PT baselines with negligible additional computational cost. Additionally, by integrating CaPT with our recently proposed DePT framework, we devise a new conditional PT approach, termed DeCaPT, which outperforms the state-of-the-art conditional PT scheme by 3.49% in harmonic-mean accuracy (H ACC), averaged over the 11 datasets. Code: https://github.com/Koorye/CaPT.
中文: 提示调优在视觉语言模型中常面临基础-新任务权衡困境,即基础任务性能提升会削弱对新任务的泛化能力,而提出的类自适应提示调优方法通过基于文本类别信息学习动态提示,有效解决了这一问题,以极小的额外计算成本在多个数据集上显著提升了性能。
English: Prompt Tuning for vision-language models often faces the Base-New Tradeoff dilemma, where improved base task performance reduces generalization to new tasks, but the proposed Class-adaptive Prompt Tuning (CaPT) method effectively addresses this by learning dynamic prompts conditioned on textual class information, enhancing performance across multiple datasets with minimal added cost.
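A minimal sketch of what TCI-conditioned prompts could look like: class-name embeddings from the frozen text encoder are mapped by a small meta-network to class-specific offsets added to shared learnable prompt tokens. Module names and sizes are assumptions, not CaPT's exact design.

```python
import torch
import torch.nn as nn

class ClassConditionedPrompt(nn.Module):
    """Sketch of prompts conditioned on Textual Class Information (TCI).

    Each class name is first embedded by the frozen text encoder; a small
    meta-network then maps that embedding to a class-specific offset that is
    added to shared learnable prompt tokens.
    """
    def __init__(self, n_tokens: int = 4, token_dim: int = 512, text_dim: int = 512):
        super().__init__()
        self.base_prompt = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.02)
        self.meta_net = nn.Sequential(
            nn.Linear(text_dim, token_dim // 4), nn.ReLU(),
            nn.Linear(token_dim // 4, token_dim),
        )

    def forward(self, class_text_features: torch.Tensor) -> torch.Tensor:
        # class_text_features: (num_classes, text_dim), from the frozen text encoder
        offsets = self.meta_net(class_text_features)                   # (num_classes, token_dim)
        return self.base_prompt.unsqueeze(0) + offsets.unsqueeze(1)    # (num_classes, n_tokens, token_dim)

prompts = ClassConditionedPrompt()(torch.randn(10, 512))  # prompts for 10 classes
```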

Authors:Jianing Jin, Jiangyong Ying, Huiyu Duan, Liu Yang, Sijing Wu, Yunhao Li, Yushuo Zheng, Xiongkuo Min, Guangtao Zhai
Title: RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment
Abstract:
As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We innovatively propose the concept of Robotic-Generated Content (RGC) to term these videos generated from egocentric perspective of robots. The perceptual quality of RGC videos is critical in human-robot interaction scenarios, and RGC videos exhibit unique distortions and visual requirements that differ markedly from those of professionally-generated content (PGC) videos and user-generated content (UGC) videos. However, dedicated research on quality assessment of RGC videos is still lacking. To address this gap and to support broader robotic applications, we establish the first Robotic-Generated Content Database (RGCD), which contains a total of 2,100 videos drawn from three robot categories and sourced from diverse platforms. A subjective VQA experiment is conducted subsequently to assess human visual perception of robotic-generated videos. Finally, we conduct a benchmark experiment to evaluate the performance of 11 state-of-the-art VQA models on our database. Experimental results reveal significant limitations in existing VQA models when applied to complex, robotic-generated content, highlighting a critical need for RGC-specific VQA models. Our RGCD is publicly available at: https://github.com/IntMeGroup/RGC-VQA.
中文: 本研究创新性地提出了机器人生成内容(RGC)概念,建立了首个包含2100个视频的RGC数据库,并通过实验证明现有视频质量评估模型对RGC内容表现不佳,亟需开发专用评估模型。
English: The study introduces Robotic-Generated Content (RGC) as a new video category from robots' perspectives, establishes the first RGC database with 2,100 videos, and reveals that current video quality assessment models perform poorly on RGC, highlighting the need for specialized models.

Authors:Ziwei Chen, Ziling Liu, Zitong Huang, Mingqi Gao, Feng Zheng
Title: SCORP: Scene-Consistent Object Refinement via Proxy Generation and Tuning
Abstract:
Missing viewpoints for objects are common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring high-fidelity object reconstruction. In this paper, we introduce Scene-Consistent Object Refinement via Proxy Generation and Tuning (SCORP), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. Starting with proxy generation by substituting degraded objects using a 3D generation model, SCORP then progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DoF pose, followed by correcting spatial and appearance inconsistencies through registration-constrained enhancement. This two-stage proxy tuning ensures the high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Across challenging benchmarks, SCORP achieves consistent gains over recent state-of-the-art baselines on both novel view synthesis and geometry completion tasks. SCORP is available at https://github.com/PolySummit/SCORP.
中文: SCORP是一种新颖的3D增强框架,通过生成和优化物体代理来解决场景重建中的视角缺失问题,在保持空间一致性的同时实现高保真几何与外观重建。
English: SCORP is a novel 3D enhancement framework that addresses viewpoint gaps in scene reconstruction by generating and refining object proxies to achieve high-fidelity geometry and appearance while maintaining spatial consistency.

Authors:Mingcheng Qu, Yuncong Wu, Donglin Di, Yue Gao, Tonghua Su, Yang Song, Lei Fan
Title: Spatially Gene Expression Prediction using Dual-Scale Contrastive Learning
Abstract:
Spatial transcriptomics (ST) provides crucial insights into tissue micro-environments, but is limited by its high cost and complexity. As an alternative, predicting gene expression from pathology whole slide images (WSI) is gaining increasing attention. However, existing methods typically rely on single patches or a single pathology modality, neglecting the complex spatial and molecular interactions between target and neighboring information (e.g., gene co-expression). This leads to a failure in establishing connections among adjacent regions and capturing intricate cross-modal relationships. To address these issues, we propose NH2ST, a framework that integrates spatial context and both pathology and gene modalities for gene expression prediction. Our model comprises a query branch and a neighbor branch to process paired target patch and gene data and their neighboring regions, where cross-attention and contrastive learning are employed to capture intrinsic associations and ensure alignments between pathology and gene expression. Extensive experiments on six datasets demonstrate that our model consistently outperforms existing methods, achieving over 20% in PCC metrics. Codes are available at https://github.com/MCPathology/NH2ST
中文: 提出的NH2ST框架通过整合空间上下文和多模态数据,利用交叉注意力和对比学习改进病理图像中的基因表达预测,在六个数据集上始终优于现有方法,PCC指标提升超过20%。
English: The proposed NH2ST framework enhances gene expression prediction from pathology images by integrating spatial context and multi-modal data through cross-attention and contrastive learning, consistently outperforming existing methods by over 20% in PCC metrics across six datasets.

Authors:Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin
Title: Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
Abstract:
Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at https://github.com/IVGSZ/Flash-VStream.
中文摘要:Flash-VStream是一种高效视频语言模型,通过双内存模块处理超长视频并实时响应用户查询,在多项基准测试中实现了最先进性能与显著效率提升。
English Summary: Flash-VStream is an efficient video language model that processes long videos in real-time using a dual-memory module to reduce latency while maintaining state-of-the-art performance on multiple benchmarks.

Authors:Shiming Chen, Bowen Duan, Salman Khan, Fahad Shahbaz Khan
Title: Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
Abstract:
Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization. Codes available at: https://github.com/shiming-chen/LaZSL.
中文摘要:LaZSL是一种新型视觉语言模型,通过最优传输实现局部视觉特征与语义属性的对齐,无需额外训练即可提升零样本学习的可解释性和准确性。
English Summary: LaZSL is a novel vision-language model that enhances interpretability in zero-shot learning by aligning local visual features with semantic attributes through optimal transport, improving both accuracy and explainability without extra training.
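The local visual-semantic alignment via optimal transport can be sketched with a few Sinkhorn iterations between region features and attribute embeddings; the uniform marginals, entropic regularizer, and final score definition here are illustrative assumptions rather than LaZSL's exact formulation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_alignment(region_feats: torch.Tensor,
                       attr_feats: torch.Tensor,
                       eps: float = 0.05,
                       n_iters: int = 50) -> torch.Tensor:
    """Entropic optimal transport between image regions and class attributes.

    Returns an interpretable class score: the transport-weighted sum of
    region-attribute cosine similarities.
    """
    r = F.normalize(region_feats, dim=-1)            # (R, d) local visual features
    a = F.normalize(attr_feats, dim=-1)              # (A, d) attribute text features
    sim = r @ a.t()                                  # (R, A) cosine similarities
    K = torch.exp(sim / eps)                         # Gibbs kernel (cost = -similarity)
    mu = torch.full((r.size(0),), 1.0 / r.size(0))   # uniform region marginal
    nu = torch.full((a.size(0),), 1.0 / a.size(0))   # uniform attribute marginal
    u = torch.ones_like(mu)
    for _ in range(n_iters):                         # Sinkhorn iterations
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan (R, A)
    return (plan * sim).sum()

# e.g. 49 patch features of one image vs 6 attribute embeddings of one class
score = sinkhorn_alignment(torch.randn(49, 512), torch.randn(6, 512))
```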

Authors:Mahshid Shiri, Cigdem Beyan, Vittorio Murino
Title: MadCLIP: Few-shot Medical Anomaly Detection with CLIP
Abstract:
An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. The code is available at https://github.com/mahshid1998/MadCLIP.
中文: 本研究提出了一种基于CLIP模型的创新少样本异常检测方法,通过双分支结构和可学习提示实现医学图像的精准分类与分割,在多项评估中均优于现有方法且无需合成数据。
English: This study introduces a novel few-shot anomaly detection method using the CLIP model for medical data, achieving superior performance in both image classification and pixel segmentation without synthetic data or memory banks.
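Below is a generic sketch of the pairwise sigmoid (SigLIP-style) loss mentioned in the abstract, which naturally handles several images matching a single unpaired text prompt; the temperature, bias, and prompt setup are placeholder choices, not MadCLIP's.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_feats: torch.Tensor,
                txt_feats: torch.Tensor,
                match: torch.Tensor,
                temperature: float = 10.0,
                bias: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid (SigLIP-style) loss.

    match[i, j] is 1 when image i should align with text prompt j and 0
    otherwise, so several images may share one prompt (the many-to-one case
    mentioned in the abstract).
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() * temperature + bias       # (num_images, num_prompts)
    signs = 2.0 * match - 1.0                         # {0, 1} -> {-1, +1}
    return F.softplus(-signs * logits).mean()         # -log sigmoid(sign * logit)

# 8 images vs 2 prompts ("normal", "abnormal"); first half normal, second abnormal
match = torch.zeros(8, 2)
match[:4, 0], match[4:, 1] = 1.0, 1.0
loss = siglip_loss(torch.randn(8, 512), torch.randn(2, 512), match)
```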

Authors:Yongjian Wu, Yang Zhou, Jiya Saiyin, Bingzheng Wei, Yan Xu
Title: Visual Textualization for Image Prompted Object Detection
Abstract:
We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization -- a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method maintains the original architecture of OVLM, maintaining its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with OVLM's pre-training data and achieves state-of-the-art results on few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at https://github.com/WitGotFlg/VisTex-OVLM.
中文:VisTex-OVLM通过视觉文本化将视觉样本投影到文本特征空间,在保持预训练对齐的同时增强对罕见类别的检测能力,无需改变原有架构即在少样本基准测试中取得了最优性能。
English: VisTex-OVLM introduces visual textualization to project visual exemplars into text space, enhancing object detection for rare categories while preserving pre-trained alignment, achieving state-of-the-art results on few-shot benchmarks without altering the original model architecture.

Authors:Shiao Wang, Ju Huang, Qingchuan Ma, Jinfeng Gao, Chunyi Xu, Xiao Wang, Lan Chen, Bo Jiang
Title: Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking
Abstract:
Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including short-term COESOT dataset and long-term datasets, i.e., FE108 and FELT V2, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released on https://github.com/Event-AHU/Mamba_FETrack
Chinese: 本文提出Mamba-FETrack V2,一种基于视觉Mamba网络和提示生成器的高效RGB-事件目标跟踪框架,在降低计算复杂度的同时实现了卓越的跟踪性能。
English: This paper introduces Mamba-FETrack V2, an efficient RGB-Event object tracking framework that uses a Vision Mamba network and a prompt generator to achieve superior performance with reduced computational complexity.

Authors:Guiyu Zhang, Chen Shi, Zijian Jiang, Xunzhi Xiang, Jingjing Qian, Shaoshuai Shi, Li Jiang
Title: Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
Abstract:
Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.

Authors:Chang'an Yi, Xiaohui Deng, Guohao Chen, Yan Zhou, Qinghua Lu, Shuaicheng Niu
Title: When Small Guides Large: Cross-Model Co-Learning for Test-Time Adaptation
Abstract:
Test-time Adaptation (TTA) adapts a given model to testing domain data with potential domain shifts through online unsupervised learning, yielding impressive performance. However, to date, existing TTA methods primarily focus on single-model adaptation. In this work, we investigate an intriguing question: how does cross-model knowledge influence the TTA process? Our findings reveal that, in TTA's unsupervised online setting, each model can provide complementary, confident knowledge to the others, even when there are substantial differences in model size. For instance, a smaller model like MobileViT (10.6M parameters) can effectively guide a larger model like ViT-Base (86.6M parameters). In light of this, we propose COCA, a Cross-Model Co-Learning framework for TTA, which consists of two main strategies. 1) Co-adaptation adaptively integrates complementary knowledge from other models throughout the TTA process, reducing individual model biases. 2) Self-adaptation enhances each model's unique strengths via unsupervised learning, enabling diverse adaptation to the target domain. Extensive experiments show that COCA, which can also serve as a plug-and-play module, significantly boosts existing SOTA methods on models of various sizes (including ResNets, ViTs, and MobileViTs) via cross-model co-learned TTA. For example, with MobileViT's guidance, COCA raises ViT-Base's average adaptation accuracy on ImageNet-C from 51.7% to 64.5%. The code is publicly available at https://github.com/ycarobot/COCA.
中文: COCA提出了一种跨模型协同学习的测试时自适应框架,通过整合不同模型间的互补知识并强化各自优势,显著提升了多种模型在目标域上的适应性能,例如将ViT-Base在ImageNet-C上的准确率从51.7%提高到64.5%。
English: Test-time Adaptation (TTA) is enhanced by COCA, a cross-model co-learning framework that leverages complementary knowledge between models of different sizes to boost adaptation accuracy, as demonstrated by significant improvements on benchmarks like ImageNet-C.
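A toy sketch of cross-model co-learning at test time, assuming the co-adaptation signal is simply a pseudo-label from the averaged predictions and the self-adaptation signal is each model's own entropy; COCA's actual knowledge-integration and weighting strategies are more involved.

```python
import torch
import torch.nn.functional as F

def coca_style_step(small_model, large_model, x, opt_small, opt_large, alpha: float = 0.5):
    """One illustrative cross-model co-learning TTA step on an unlabeled batch.

    Co-adaptation: a shared pseudo-label comes from the averaged predictions of
    both models. Self-adaptation: each model also minimizes its own entropy.
    """
    p_small = small_model(x).softmax(dim=-1)
    p_large = large_model(x).softmax(dim=-1)
    pseudo = ((p_small + p_large) / 2).detach().argmax(dim=-1)   # co-learned target

    for p, opt in ((p_small, opt_small), (p_large, opt_large)):
        log_p = p.clamp_min(1e-8).log()
        entropy = -(p * log_p).sum(dim=-1).mean()
        loss = F.nll_loss(log_p, pseudo) + alpha * entropy
        opt.zero_grad()
        loss.backward()
        opt.step()

# toy usage: a small and a large classifier adapting to the same test batch
small = torch.nn.Linear(32, 10)
large = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
coca_style_step(small, large, torch.randn(8, 32),
                torch.optim.SGD(small.parameters(), lr=1e-3),
                torch.optim.SGD(large.parameters(), lr=1e-3))
```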

Authors:Smriti Joshi, Richard Osuala, Lidia Garrucho, Kaisar Kushibar, Dimitri Kessler, Oliver Diaz, Karim Lekadir
Title: Single Image Test-Time Adaptation via Multi-View Co-Training
Abstract:
Test-time adaptation enables a trained model to adjust to a new domain during inference, making it particularly valuable in clinical settings where such on-the-fly adaptation is required. However, existing techniques depend on large target domain datasets, which are often impractical and unavailable in medical scenarios that demand per-patient, real-time inference. Moreover, current methods commonly focus on two-dimensional images, failing to leverage the volumetric richness of medical imaging data. Bridging this gap, we propose a Patch-Based Multi-View Co-Training method for Single Image Test-Time adaptation. Our method enforces feature and prediction consistency through uncertainty-guided self-training, enabling effective volumetric segmentation in the target domain with only a single test-time image. Validated on three publicly available breast magnetic resonance imaging datasets for tumor segmentation, our method achieves performance close to the upper bound supervised benchmark while also outperforming all existing state-of-the-art methods, on average by a Dice Similarity Coefficient of 3.75%. We publicly share our accessible codebase, readily integrable with the popular nnUNet framework, at https://github.com/smriti-joshi/muvi.git.
中文摘要:该研究提出的基于多视角协同训练的补丁方法,通过不确定性引导的自训练仅需单张测试图像即可实现有效的体积分割,在乳腺MRI数据集上以3.75%的Dice分数超越现有最优方法。
English Summary: The proposed patch-based multi-view co-training method enables effective volumetric segmentation using only a single test image through uncertainty-guided self-training, outperforming existing methods by 3.75% Dice score on breast MRI datasets.

Authors:Lingtong Zhang, Mengdie Song, Xiaohan Hao, Huayu Mai, Bensheng Qiu
Title: MDPG: Multi-domain Diffusion Prior Guidance for MRI Reconstruction
Abstract:
Magnetic Resonance Imaging (MRI) reconstruction is essential in medical diagnostics. Despite being the latest generation of generative models, diffusion models (DMs) have struggled to produce high-fidelity images due to their stochastic nature when operating directly in the image domain. Latent diffusion models (LDMs) yield both compact and detailed prior knowledge in latent domains, which can effectively guide the model towards more effective learning of the original data distribution. Inspired by this, we propose Multi-domain Diffusion Prior Guidance (MDPG) provided by pre-trained LDMs to enhance data consistency in MRI reconstruction tasks. Specifically, we first construct a Visual-Mamba-based backbone, which enables efficient encoding and reconstruction of under-sampled images. Then pre-trained LDMs are integrated to provide conditional priors in both latent and image domains. A novel Latent Guided Attention (LGA) is proposed for efficient fusion in multi-level latent domains. Simultaneously, to effectively utilize priors in both the k-space and image domains, under-sampled images are fused with generated fully-sampled images by the Dual-domain Fusion Branch (DFB) for self-adaptive guidance. Lastly, to further enhance the data consistency, we propose a k-space regularization strategy based on the non-auto-calibration signal (NACS) set. Extensive experiments on two public MRI datasets fully demonstrate the effectiveness of the proposed methodology. The code is available at https://github.com/Zolento/MDPG.
中文摘要:本研究提出的多域扩散先验引导方法通过整合潜在扩散模型与视觉曼巴骨干网络及双域融合技术,有效提升了磁共振成像重建的数据一致性和图像保真度。
English Summary: The proposed Multi-domain Diffusion Prior Guidance (MDPG) method enhances MRI reconstruction by integrating latent diffusion models with a Visual-Mamba backbone and dual-domain fusion, achieving improved data consistency and image fidelity.

Authors:Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen
Title: SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
Abstract:
Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subject transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines. Project page: https://lucaria-academy.github.io/SynMotion/
中文摘要:SynMotion是一种新型视频运动定制模型,通过结合语义引导和视觉适配来提升运动保真度和时序连贯性,在文本到视频和图像到视频生成任务中优于现有基准方法。
English Summary: SynMotion is a novel video motion customization model that integrates semantic guidance and visual adaptation to enhance motion fidelity and temporal coherence, outperforming existing baselines in text-to-video and image-to-video generation.
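The alternating, embedding-specific training can be sketched as below; the embedding shapes, the placeholder loss, and the even/odd schedule are illustrative assumptions rather than SynMotion's actual configuration.

```python
import torch

# Hypothetical learnable embeddings for the subject and the motion concept.
subject_emb = torch.nn.Parameter(torch.randn(1, 8, 768))
motion_emb = torch.nn.Parameter(torch.randn(1, 8, 768))
opt_subject = torch.optim.AdamW([subject_emb], lr=1e-4)
opt_motion = torch.optim.AdamW([motion_emb], lr=1e-4)

def diffusion_loss(cond):
    # Placeholder for the denoising loss of the video diffusion backbone,
    # conditioned on the concatenated subject and motion embeddings.
    return (cond ** 2).mean()

for step in range(100):
    if step % 2 == 0:                   # even steps: update the subject embedding only
        opt_subject.zero_grad()
        cond = torch.cat([subject_emb, motion_emb.detach()], dim=1)
        diffusion_loss(cond).backward()
        opt_subject.step()
    else:                               # odd steps: update the motion embedding only
        opt_motion.zero_grad()
        cond = torch.cat([subject_emb.detach(), motion_emb], dim=1)
        diffusion_loss(cond).backward()
        opt_motion.step()
```

Detaching the embedding that is not being optimized is one simple way to keep the two concepts from collapsing into each other during alternation.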

Authors:Mario Koddenbrock, Rudolf Hoffmann, David Brodmann, Erik Rodner
Title: On the Domain Robustness of Contrastive Vision-Language Models
Abstract:
In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.
中文: Deepbench框架利用大语言模型生成针对特定领域的图像干扰,以评估视觉语言模型的鲁棒性,结果发现在不同现实领域中性能存在显著差异,凸显了针对性评估的必要性。
English: Deepbench is a framework that uses a large language model to generate domain-specific image corruptions for evaluating the robustness of vision-language models, revealing significant performance variations across different real-world domains and emphasizing the need for targeted assessments.

Authors:Zhe Liu, Yuhao Huang, Lian Liu, Chengrui Zhang, Haotian Lin, Tong Han, Zhiyuan Zhu, Yanlin Chen, Yuerui Chen, Dong Ni, Zhongshan Gou, Xin Yang
Title: MReg: A Novel Regression Model with MoE-based Video Feature Mining for Mitral Regurgitation Diagnosis
Abstract:
Color Doppler echocardiography is a crucial tool for diagnosing mitral regurgitation (MR). Recent studies have explored intelligent methods for MR diagnosis to minimize user dependence and improve accuracy. However, these approaches often fail to align with clinical workflow and may lead to suboptimal accuracy and interpretability. In this study, we introduce an automated MR diagnosis model (MReg) developed on the 4-chamber cardiac color Doppler echocardiography video (A4C-CDV). It follows comprehensive feature mining strategies to detect MR and assess its severity, considering clinical realities. Our contribution is threefold. First, we formulate the MR diagnosis as a regression task to capture the continuity and ordinal relationships between categories. Second, we design a feature selection and amplification mechanism to imitate the sonographer's diagnostic logic for accurate MR grading. Third, inspired by the Mixture-of-Experts concept, we introduce a feature summary module to extract the category-level features, enhancing the representational capacity for more accurate grading. We trained and evaluated our proposed MReg on a large in-house A4C-CDV dataset comprising 1868 cases with three graded regurgitation labels. Compared to other weakly supervised video anomaly detection and supervised classification methods, MReg demonstrated superior performance in MR diagnosis. Our code is available at: https://github.com/cskdstz/MReg.
中文: 本研究提出MReg模型,通过四腔心彩色多普勒超声视频自动诊断二尖瓣反流,采用特征挖掘策略和回归方法,在提高诊断准确性的同时更好地契合临床工作流程。
English: This study introduces MReg, an automated model for diagnosing mitral regurgitation using 4-chamber cardiac color Doppler echocardiography videos, which employs feature mining strategies and a regression-based approach to improve accuracy and clinical alignment over existing methods.

Authors:Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang
Title: Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
Abstract:
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at https://github.com/spyflying/VCT_AVS.
中文: 提出的视觉中心Transformer(VCT)框架通过视觉驱动查询解决视听分割中的感知模糊和细节丢失问题,能更好区分混合音频中的发声物体并精确勾勒轮廓,在AVSBench数据集上实现了最优性能。
English: The proposed Vision-Centric Transformer (VCT) framework addresses audio-visual segmentation limitations by using vision-derived queries to better distinguish sounding objects from mixed audio and accurately delineate their contours, achieving state-of-the-art performance on AVSBench datasets.
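One way to picture a vision-centric decoder layer is vision-derived object queries that cross-attend first to audio features and then to visual features before self-refinement; the exact layer composition below is an illustrative assumption, not the actual VCT block.

```python
import torch
import torch.nn as nn

class VisionCentricLayer(nn.Module):
    """Queries fetch audio cues, then visual detail, then self-refine (sketch)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, audio_feats, visual_feats):
        queries = queries + self.audio_attn(queries, audio_feats, audio_feats)[0]
        queries = queries + self.visual_attn(queries, visual_feats, visual_feats)[0]
        queries = queries + self.self_attn(queries, queries, queries)[0]
        return queries + self.ffn(queries)

layer = VisionCentricLayer()
queries = torch.randn(1, 16, 256)        # vision-derived object queries
audio = torch.randn(1, 32, 256)          # audio tokens
visual = torch.randn(1, 1024, 256)       # patch-level visual features
print(layer(queries, audio, visual).shape)   # (1, 16, 256)
```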

Authors:Suyash Maniyar, Vishvesh Trivedi, Ajoy Mondal, Anand Mishra, C. V. Jawahar
Title: AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval
Abstract:
Lecture slide element detection and retrieval are key problems in slide understanding. Training effective models for these tasks often depends on extensive manual annotation. However, annotating large volumes of lecture slides for supervised training is labor intensive and requires domain expertise. To address this, we propose a large language model (LLM)-guided synthetic lecture slide generation pipeline, SynLecSlideGen, which produces high-quality, coherent and realistic slides. We also create an evaluation benchmark, namely RealSlide by manually annotating 1,050 real lecture slides. To assess the utility of our synthetic slides, we perform few-shot transfer learning on real data using models pre-trained on them. Experimental results show that few-shot transfer learning with pretraining on synthetic slides significantly improves performance compared to training only on real data. This demonstrates that synthetic data can effectively compensate for limited labeled lecture slides. The code and resources of our work are publicly available on our project website: https://synslidegen.github.io/.

Authors:Yawen Zou, Guang Li, Duo Su, Zi Wang, Jun Yu, Chao Zhang
Title: Dataset Distillation via Vision-Language Category Prototype
Abstract:
Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission costs, and computational consumption. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. The disregard for context hinders the model's generalization ability, particularly in tasks involving complex datasets, which may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes utilized in this study are derived from descriptive text information generated by an open-source large language model. This framework demonstrates broad applicability across datasets without pre-existing text descriptions, expanding the potential of dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization. Source code and generated data are available at https://github.com/zou-yawen/Dataset-Distillation-via-Vision-Language-Category-Prototype/
中文: 本研究通过融合视觉语言方法,利用大型语言模型生成的文本原型与图像原型协同合成数据,提升了数据集蒸馏的性能、泛化能力和生成图像的逻辑一致性,实现了最先进的验证效果。
English: This study enhances dataset distillation by integrating vision-language methods, using text prototypes from a large language model to collaboratively synthesize data with image prototypes, which improves performance, generalization, and logical coherence in generated images, achieving state-of-the-art results.

Authors:Nuo Chen, Chao Xiao, Yimian Dai, Shiman He, Miao Li, Wei An
Title: Event-based Tiny Object Detection: A Benchmark Dataset and Baseline
Abstract:
Small object detection (SOD) in anti-UAV task is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large target sizes, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small Object Detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 × 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose the Event-based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at https://github.com/ChenYichen9527/Ev-UAV.
中文摘要:本文提出了首个用于反无人机任务的大规模事件型小目标检测数据集EV-UAV,包含极小目标和多样化场景,同时开发了一种基于时空事件点云分割的新网络及相关损失函数,实验证明了该方法的优越性。
English Summary: This paper introduces EV-UAV, the first large-scale event-based dataset for small object detection in anti-UAV tasks, featuring extremely small targets and diverse scenarios, along with a novel segmentation network and spatiotemporal loss function that demonstrate superior performance.

Authors:Mingqian Ji, Jian Yang, Shanshan Zhang
Title: OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving
Abstract:
Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or a 3D position encoder, but in a fully data-driven and implicit manner, which limits the detection performance. Inspired by the success of radiance fields on 3D reconstruction, we assume they can be used to enhance the detector's ability to estimate 3D geometry. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by the strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noise. Specifically, we employ Object-centric Radiance Fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, a side product of rendering, to enhance the 2D foreground BEV features via Height-aware Opacity-based Attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2% mAP and 64.8% NDS on the nuScenes test benchmark. Code will be available at https://github.com/Mingqj/OcRFDet.
中文摘要:现有多视角3D物体检测方法因隐式数据驱动方式存在性能局限,本文提出的以物体为中心的辐射场(OcRF)通过聚焦前景物体建模,结合渲染任务和基于不透明度的注意力机制增强特征,在nuScenes基准测试中实现了最优性能。
English Summary: Current multi-view 3D object detection methods face performance limitations due to implicit data-driven approaches, which the proposed Object-centric Radiance Fields (OcRF) addresses by focusing on foreground objects and enhancing features through rendering and opacity-based attention, achieving state-of-the-art results on nuScenes benchmarks.

Authors:Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao
Title: MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Abstract:
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
Chinese: MMReason基准通过提供需要多步骤推理的多样化挑战性问题,结合详细解答和三值评分机制,旨在弥补现有MLLM评估在长链推理能力检测上的不足,实现精准全面的能力评估。
English: The MMReason benchmark is introduced to address the limitations of existing MLLM evaluations by providing diverse, challenging questions that require multi-step reasoning, incorporating detailed solutions and a ternary scoring mechanism to accurately assess long-chain reasoning capabilities.

Authors:Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu
Title: Pyramidal Patchification Flow for Visual Generation
Abstract:
Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach in two training settings. Training from scratch achieves a 1.6× (2.0×) inference speed-up over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with a small amount of training time. The code and checkpoint are at https://github.com/fudan-generative-vision/PPFlow.
中文: PPFlow方法根据扩散模型中不同噪声水平自适应调整补丁大小,在保持相近生成性能的同时,将推理速度提升高达2倍并降低计算成本。
English: The PPFlow method adapts patch sizes based on noise levels in diffusion transformers, improving inference speed by up to 2x while maintaining similar performance and reducing computational costs.
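The core mechanism, choosing the patch size from the noise timestep and keeping a separate linear projection per patch size, can be sketched as below; the two-level schedule, the 0.5 threshold, and the tensor sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PyramidalPatchify(nn.Module):
    """Patchify latents with a timestep-dependent patch size (two-level sketch)."""

    def __init__(self, channels=4, hidden=768, patch_sizes=(4, 2), threshold=0.5):
        super().__init__()
        self.threshold = threshold          # assumed split point on t in [0, 1]
        self.patch_sizes = patch_sizes      # (high-noise size, low-noise size)
        self.proj = nn.ModuleDict({
            str(p): nn.Conv2d(channels, hidden, kernel_size=p, stride=p)
            for p in patch_sizes
        })

    def forward(self, x, t):
        # Large patches (fewer tokens) early in denoising, small patches later.
        p = self.patch_sizes[0] if t > self.threshold else self.patch_sizes[1]
        tokens = self.proj[str(p)](x)                       # (B, hidden, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2), p         # (B, N, hidden)

patchify = PyramidalPatchify()
latent = torch.randn(1, 4, 32, 32)
tokens_hi, _ = patchify(latent, t=0.9)   # 64 tokens with patch size 4
tokens_lo, _ = patchify(latent, t=0.1)   # 256 tokens with patch size 2
print(tokens_hi.shape, tokens_lo.shape)
```

Since most denoising steps sit at high noise levels, spending fewer tokens there is where the inference savings come from, while small patches at low noise preserve fine detail.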

Authors:Weida Wang, Changyong He, Jin Zeng, Di Qiu
Title: Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention
Abstract:
Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset. Source code will be released at https://github.com/davidweidawang/GIGA-ToF.
中文: 本文提出了一种基于运动不变图融合的ToF深度去噪网络,通过跨帧几何注意力和噪声分布优化,有效提升时间一致性与空间清晰度,在合成与真实数据集上均达到最优性能。
English: This paper introduces a motion-invariant graph fusion network for ToF depth denoising that enhances temporal stability and spatial sharpness by leveraging cross-frame geometric attention and noise-aware optimization, achieving state-of-the-art performance on synthetic and real datasets.
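As a rough illustration of the MAP formulation, the objective can be written as below, under assumed notation: y is the noisy depth, Σ a covariance taken from the ToF noise model, L the Laplacian of the fused motion-invariant graph, and λ the prior weight. The exact data-fidelity weighting in the paper may differ.

```latex
% Sketch of the MAP denoising objective: ToF noise fidelity plus a
% graph-Laplacian smoothness prior on the fused, motion-invariant graph.
\hat{\mathbf{x}} \;=\; \arg\min_{\mathbf{x}}\;
  (\mathbf{x}-\mathbf{y})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\mathbf{y})
  \;+\; \lambda\, \mathbf{x}^{\top}\mathbf{L}\,\mathbf{x}
```

Unrolling a fixed number of solver iterations for an objective of this form (e.g., gradient or conjugate-gradient steps) yields the interpretable, learned filtering layers described above.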

Authors:Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
Title: Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound
Abstract:
Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method, in terms of plane localization and CUA diagnosis. Code is available at https://github.com/yuhoo0302/CUA-US.
中文: 本文提出一种智能系统,通过去噪扩散模型和强化学习,自动完成三维超声中的平面定位和先天性子宫异常诊断,实验证明其方法高效可靠。
English: This paper introduces an intelligent system that uses a denoising diffusion model and reinforcement learning to automate both plane localization and diagnosis of congenital uterine anomalies from 3D ultrasound data, demonstrating high effectiveness in experiments.

Authors:Xinyue Li, Zhangkai Ni, Wenhan Yang
Title: AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm
Abstract:
Existing learning-based methods effectively reconstruct HDR images from multi-exposure LDR inputs with extended dynamic range and improved detail, but they rely on empirical design rather than on a theoretical foundation, which can impact their reliability. To address these limitations, we propose the cross-iterative Alignment and Fusion deep Unfolding Network (AFUNet), where HDR reconstruction is systematically decoupled into two interleaved subtasks -- alignment and fusion -- optimized through alternating refinement, achieving synergy between the two subtasks to enhance the overall performance. Our method formulates multi-exposure HDR reconstruction from a Maximum A Posteriori (MAP) estimation perspective, explicitly incorporating spatial correspondence priors across LDR images and naturally bridging the alignment and fusion subproblems through joint constraints. Building on the mathematical foundation, we reimagine traditional iterative optimization through unfolding -- transforming the conventional solution process into an end-to-end trainable AFUNet with carefully designed modules that work progressively. Specifically, each iteration of AFUNet incorporates an Alignment-Fusion Module (AFM) that alternates between a Spatial Alignment Module (SAM) for alignment and a Channel Fusion Module (CFM) for adaptive feature fusion, progressively bridging misaligned content and exposure discrepancies. Extensive qualitative and quantitative evaluations demonstrate AFUNet's superior performance, consistently surpassing state-of-the-art methods. Our code is available at: https://github.com/eezkni/AFUNet
中文: 所提出的AFUNet方法通过深度展开技术将HDR重建系统解耦为对齐与融合子任务,通过交替优化和明确数学基础实现了超越现有方法的性能表现。
English: The proposed AFUNet method systematically decouples HDR reconstruction into alignment and fusion subtasks through deep unfolding, achieving superior performance by alternating refinement with explicit mathematical foundations.

Authors:Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Title: ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models
Abstract:
Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.
Chinese: 本文提出了一种新颖框架,利用预训练的视角视频模型,通过引入ViewPoint地图表示和全景-视角注意力机制,有效弥合模态差异,在生成动态且空间一致的360度全景视频方面实现了最先进的性能。
English: This paper introduces a novel framework that leverages pretrained perspective video models to generate high-quality panoramic videos by employing a ViewPoint map representation and a Pano-Perspective attention mechanism, effectively bridging the modality gap and achieving state-of-the-art results in dynamic and spatially consistent 360-degree video synthesis.

Authors:ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang
Title: ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding
Abstract:
In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources, including mobile, desktop, and web GUI screenshots, that effectively addresses data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. ZonUI-3B is available at: https://github.com/Han1018/ZonUI-3B
中文: ZonUI-3B 是一种轻量级视觉语言模型,通过创新的数据策略和两阶段微调方法,在图形用户界面定位任务上达到与更大模型相当的性能,且仅需单个消费级GPU即可完成训练。
English: ZonUI-3B is a lightweight vision-language model that achieves performance comparable to larger models on GUI grounding tasks through innovative data strategies and two-stage fine-tuning, while being trainable on a single consumer GPU.

Authors:Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu
Title: TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity
Abstract:
AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. However, the widespread adoption and advancing capabilities of generative image editing tools have amplified malicious tampering risks, while simultaneously posing new challenges to passive tampering detection and watermark robustness. To address these challenges, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, the watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and the tamper-aware decoding (TAD) guided by localization results. The experimental results demonstrate that TAG-WM achieves state-of-the-art performance in both tampering robustness and localization capability even under distortion, while preserving lossless generation quality and maintaining a watermark capacity of 256 bits. The code is available at: https://github.com/Suchenl/TAG-WM.
中文摘要:本文提出TAG-WM篡改感知生成式水印方法,通过嵌入双重水印保护AI生成图像免受恶意篡改,在保持生成质量的同时实现了最优的鲁棒性。
English Summary: This paper introduces TAG-WM, a tamper-aware generative watermarking method that embeds dual watermarks to protect AI-generated images against malicious tampering while maintaining generation quality and achieving state-of-the-art robustness.

Authors:Xian Zhang, Xiang Cheng
Title: Evaluation of Geolocation Capabilities of Multimodal Large Language Models and Analysis of Associated Privacy Risks
Abstract:
Objectives: The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly enhanced their reasoning capabilities, enabling a wide range of intelligent applications. However, these advancements also raise critical concerns regarding privacy and ethics. MLLMs are now capable of inferring the geographic location of images -- such as those shared on social media or captured from street views -- based solely on visual content, thereby posing serious risks of privacy invasion, including doxxing, surveillance, and other security threats. Methods: This study provides a comprehensive analysis of existing geolocation techniques based on MLLMs. It systematically reviews relevant literature and evaluates the performance of state-of-the-art visual reasoning models on geolocation tasks, particularly in identifying the origins of street view imagery. Results: Empirical evaluation reveals that the most advanced visual large models can successfully localize the origin of street-level imagery with up to 49% accuracy within a 1-kilometer radius. This performance underscores the models' powerful capacity to extract and utilize fine-grained geographic cues from visual data. Conclusions: Building on these findings, the study identifies key visual elements that contribute to successful geolocation, such as text, architectural styles, and environmental features. Furthermore, it discusses the potential privacy implications associated with MLLM-enabled geolocation and discusses several technical and policy-based countermeasures to mitigate associated risks. Our code and dataset are available at https://github.com/zxyl1003/MLLM-Geolocation-Evaluation.
中文: 多模态大语言模型在地理定位方面展现出强大能力,能通过视觉线索以高达49%的准确率识别街景图像位置,这引发了严重的隐私风险,亟需技术和政策层面的应对措施。
English: Multimodal Large Language Models (MLLMs) demonstrate significant geolocation capabilities, achieving up to 49% accuracy in pinpointing street-level imagery within a 1-kilometer radius, which raises serious privacy concerns and necessitates countermeasures.

Authors:Xuan Yao, Junyu Gao, Changsheng Xu
Title: NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
Abstract:
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN-CE benchmarks. Code is available at https://github.com/Feliciaxyao/NavMorph.
中文摘要:NavMorph是一个自演进的世界模型框架,通过采用紧凑潜在表征和上下文记忆来增强环境理解与自适应决策,在连续环境中的视觉语言导航任务上实现了显著的性能提升。
English Summary: NavMorph is a self-evolving world model framework that improves navigation in continuous environments by using compact latent representations and contextual memory to enhance environmental understanding and adaptive decision-making, achieving superior performance on VLN-CE benchmarks.

Authors:Vikram Rangarajan, Shishira Maiya, Max Ehrlich, Abhinav Shrivastava
Title: SIEDD: Shared-Implicit Encoder with Discrete Decoders
Abstract:
Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions, but their adoption is crippled by impractically slow encoding times. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control essential for adaptive streaming and transcoding. We introduce SIEDD (Shared-Implicit Encoder with Discrete Decoders), a novel architecture that fundamentally accelerates INR encoding without these compromises. SIEDD first rapidly trains a shared, coordinate-based encoder on sparse anchor frames to efficiently capture global, low-frequency video features. This encoder is then frozen, enabling massively parallel training of lightweight, discrete decoders for individual frame groups, further expedited by aggressive coordinate-space sampling. This synergistic design delivers a remarkable 20-30X encoding speed-up over state-of-the-art INR codecs on HD and 4K benchmarks, while maintaining competitive reconstruction quality and compression ratios. Critically, SIEDD retains full coordinate-based control, enabling continuous resolution decoding and eliminating costly transcoding. Our approach significantly advances the practicality of high-fidelity neural video compression, demonstrating a scalable and efficient path towards real-world deployment. Our codebase is available at https://github.com/VikramRangarajan/SIEDD .
中文摘要:SIEDD提出了一种新颖架构,通过共享编码器和并行训练轻量解码器的方法,将隐式神经表示的编码速度提升20-30倍,同时保持重建质量和完整的坐标控制能力,显著推进了神经视频压缩技术的实际应用。
English Summary: SIEDD introduces a novel architecture that accelerates implicit neural representation encoding by 20-30 times while maintaining reconstruction quality and full coordinate-based control, making high-fidelity neural video compression practical for real-world deployment.
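A minimal sketch of the encode-then-freeze workflow follows, assuming tiny MLP encoder and decoders and plain MSE reconstruction; the real SIEDD architecture, coordinate sampling, and losses are considerably more involved.

```python
import torch
import torch.nn as nn

coords = torch.rand(4096, 2)                          # sampled (x, y) coordinates
anchor_rgb = torch.rand(4096, 3)                      # pixels from sparse anchor frames
group_rgb = [torch.rand(4096, 3) for _ in range(3)]   # pixels per frame group

# Stage 1: train a shared coordinate-based encoder on the anchor frames.
encoder = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 3)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = ((head(encoder(coords)) - anchor_rgb) ** 2).mean()
    loss.backward()
    opt.step()

# Stage 2: freeze the encoder; train one lightweight decoder per frame group.
for p in encoder.parameters():
    p.requires_grad_(False)

decoders = [nn.Linear(256, 3) for _ in group_rgb]     # could be trained fully in parallel
shared_feats = encoder(coords).detach()               # shared, frozen features
for dec, rgb in zip(decoders, group_rgb):
    opt = torch.optim.Adam(dec.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = ((dec(shared_feats) - rgb) ** 2).mean()
        loss.backward()
        opt.step()
```

Because each decoder only depends on the frozen shared features, the per-group fits are independent, which is what makes massively parallel encoding and aggressive coordinate subsampling possible.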

Authors:Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille
Title: OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
Abstract:
Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, namely how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, remains less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus
中文: 本文提出了VideoCus-Factory数据构建流程和OmniVCus扩散Transformer框架,实现了多主体视频定制与可控编辑,在性能上显著超越了现有先进方法。
English: This paper introduces VideoCus-Factory, a data construction pipeline for multi-subject video customization, and OmniVCus, a diffusion Transformer framework that enables instructive editing and surpasses existing methods in performance.

Authors:Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Masato Taki, Yutaka Matsuo
Title: GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
Abstract:
The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. The code is available at https://snskysk.github.io/GeoProg3D/.

Authors:Yi Liu, Shengqian Li, Zuzeng Lin, Feng Wang, Si Liu
Title: CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation
Abstract:
The current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and the causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose using Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation, enabling iterative refinement across scales, and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass. Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, e.g., CycleGAN-Turbo.
中文摘要:本文提出CycleVAR方法,通过Softmax松弛量化替代传统离散量化保持梯度传播,将图像翻译重构为条件自回归生成,其并行单步生成模式在无监督场景下实现了优于CycleGAN-Turbo等模型的图像转换质量与推理速度。
English Summary: This paper introduces CycleVAR, a novel unsupervised image translation method that replaces discrete vector quantization with Softmax Relaxed Quantization to maintain gradient flow and enables both serial and parallel generation modes, achieving superior performance over existing models like CycleGAN-Turbo.
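Softmax Relaxed Quantization can be sketched as replacing the hard nearest-codeword lookup with a temperature-controlled softmax mixture over the codebook, which keeps the operation differentiable end-to-end; the Euclidean distance metric and the temperature value below are illustrative assumptions.

```python
import torch

def softmax_relaxed_quantize(z, codebook, tau=1.0):
    """Differentiable stand-in for a VQ codebook lookup.

    z:        (B, N, D) continuous encoder outputs.
    codebook: (K, D) learned code vectors.
    Returns a convex mixture of codewords instead of a hard nearest neighbour,
    so gradients flow from the decoder back into the encoder and the codebook.
    """
    d = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, N, K) distances
    w = torch.softmax(-d / tau, dim=-1)                                  # soft assignments
    return w @ codebook                                                  # (B, N, D)

z = torch.randn(2, 16, 64, requires_grad=True)
codebook = torch.randn(512, 64, requires_grad=True)
z_q = softmax_relaxed_quantize(z, codebook, tau=0.5)
z_q.sum().backward()                      # gradients reach both z and the codebook
print(z.grad.shape, codebook.grad.shape)
```

Lowering the temperature tau pushes the soft assignment toward the hard argmax lookup of standard VQ, so the relaxation can be annealed if a near-discrete code is desired.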

Authors:Zhenghao Peng, Yuxin Liu, Bolei Zhou
Title: InfGen: Scenario Generation as Next Token Group Prediction
Abstract:
Realistic and interactive traffic simulation is essential for training and evaluating autonomous driving systems. However, most existing data-driven simulation methods rely on static initialization or log-replay data, limiting their ability to model dynamic, long-horizon scenarios with evolving agent populations. We propose InfGen, a scenario generation framework that outputs agent states and trajectories in an autoregressive manner. InfGen represents the entire scene as a sequence of tokens, including traffic light signals, agent states, and motion vectors, and uses a transformer model to simulate traffic over time. This design enables InfGen to continuously insert new agents into traffic, supporting infinite scene generation. Experiments demonstrate that InfGen produces realistic, diverse, and adaptive traffic behaviors. Furthermore, reinforcement learning policies trained in InfGen-generated scenarios achieve superior robustness and generalization, validating its utility as a high-fidelity simulation environment for autonomous driving. More information is available at https://metadriverse.github.io/infgen/.
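A minimal sketch of serializing a scene into token groups and autoregressively sampling one group (e.g., one agent's state) at a time is shown below; the tokenization, group layout, and the tiny causal Transformer are illustrative assumptions, not InfGen's actual design.

```python
import torch
import torch.nn as nn

VOCAB, GROUP, D = 256, 4, 128           # assumed vocab size, tokens per group, model width
embed = nn.Embedding(VOCAB, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(D, VOCAB)

def sample_next_group(tokens):
    """Autoregressively sample the next group of GROUP tokens (one agent's state)."""
    for _ in range(GROUP):
        x = embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = backbone(x, mask=mask)                         # causal self-attention
        probs = torch.softmax(head(h[:, -1]), dim=-1)      # distribution over the next token
        nxt = torch.multinomial(probs, 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

# Scene prefix: e.g., traffic-light tokens plus agents already placed in the scene.
scene = torch.randint(0, VOCAB, (1, 12))
scene = sample_next_group(scene)        # insert one more agent into the running scene
print(scene.shape)                      # torch.Size([1, 16])
```

Because new groups can always be appended to the growing token sequence, the same loop supports continuously inserting agents, which is the property that enables open-ended, long-horizon scene generation.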

Authors:Yiming Huang, Long Bai, Beilei Cui, Kun Yuan, Guankun Wang, Mobarak I. Hoque, Nicolas Padoy, Nassir Navab, Hongliang Ren
Title: SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting
Abstract:
In contemporary surgical research and practice, accurately comprehending 3D surgical scenes with text-promptable capabilities is particularly crucial for surgical planning and real-time intra-operative guidance, where precisely identifying and interacting with surgical tools and anatomical structures is paramount. However, existing works focus on surgical vision-language model (VLM), 3D reconstruction, and segmentation separately, lacking support for real-time text-promptable 3D queries. In this paper, we present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We introduce a 3D semantics feature learning strategy incorporating the Segment Anything model and state-of-the-art vision-language models. We extract the segmented language features for 3D surgical scene reconstruction, enabling a more in-depth understanding of the complex surgical environment. We also propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction for both texture and semantic features. Furthermore, we present semantic region-aware optimization, which utilizes regional-based semantic information to supervise the training, particularly promoting the reconstruction quality and semantic smoothness. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods, highlighting its potential to revolutionize surgical practices. SurgTPGS paves the way for developing next-generation intelligent surgical systems by enhancing surgical precision and safety. Our code is available at: https://github.com/lastbasket/SurgTPGS.
中文: 本文提出SurgTPGS方法,通过融合三维语义特征学习和形变跟踪技术,实现了可文本提示的实时三维手术场景理解,显著提升了手术精准度和安全性。
English: This paper introduces SurgTPGS, a text-promptable Gaussian Splatting method that integrates 3D semantic feature learning and deformation tracking to enable real-time 3D surgical scene understanding, demonstrating superior performance in surgical precision and safety.

Authors:Yiming Huang, Long Bai, Beilei Cui, Yanheng Li, Tong Chen, Jie Wang, Jinlin Wu, Zhen Lei, Hongbin Liu, Hongliang Ren
Title: Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting
Abstract:
Accurate reconstruction of soft tissue is crucial for advancing automation in image-guided robotic surgery. The recent 3D Gaussian Splatting (3DGS) techniques and their variants, 4DGS, achieve high-quality renderings of dynamic surgical scenes in real-time. However, 3D-GS-based methods still struggle in scenarios with varying illumination, such as low light and over-exposure. Training 3D-GS in such extreme light conditions leads to severe optimization problems and devastating rendering quality. To address these challenges, we present Endo-4DGX, a novel reconstruction method with illumination-adaptive Gaussian Splatting designed specifically for endoscopic scenes with uneven lighting. By incorporating illumination embeddings, our method effectively models view-dependent brightness variations. We introduce a region-aware enhancement module to model the sub-area lightness at the Gaussian level and a spatial-aware adjustment module to learn the view-consistent brightness adjustment. With the illumination adaptive design, Endo-4DGX achieves superior rendering performance under both low-light and over-exposure conditions while maintaining geometric accuracy. Additionally, we employ an exposure control loss to restore the appearance from adverse exposure to the normal level for illumination-adaptive optimization. Experimental results demonstrate that Endo-4DGX significantly outperforms combinations of state-of-the-art reconstruction and restoration methods in challenging lighting environments, underscoring its potential to advance robot-assisted surgical applications. Our code is available at https://github.com/lastbasket/Endo-4DGX.
中文: Endo-4DGX提出了一种光照自适应的高斯溅射方法,显著提升了内窥镜场景在多变光照下的三维重建效果,实现了卓越的渲染质量和几何精度。
English: Endo-4DGX introduces an illumination-adaptive Gaussian splatting method that enhances 3D reconstruction in endoscopic scenes under varying light conditions, achieving superior rendering quality and geometric accuracy.

Authors:Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, Tal Arbel
Title: Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification
Abstract:
Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs' predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets (PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification) demonstrate CALIN's effectiveness at ensuring fair confidence calibration in its prediction, while improving its overall prediction accuracies and exhibiting a minimal fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.
中文: 本研究提出的CALIN方法在推理时通过双层校准机制有效减轻了多模态大语言模型在医学图像分类中的校准偏差和人口统计不公平性,在提升准确率的同时确保了预测的公平性。
English: This study introduces CALIN, an inference-time calibration method that effectively mitigates calibration biases and demographic unfairness in multimodal large language models for medical image classification, improving accuracy and fairness across diverse datasets.
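The bi-level idea, applying a population-level correction followed by a subgroup-level one to the predicted confidences, can be sketched as below; the calibration matrices here are simply assumed rather than estimated, and the application rule is a plausible stand-in for the method's actual procedure.

```python
import torch

def apply_calibration(probs, pop_matrix, sub_matrices, group_ids):
    """Calibrate predicted class probabilities with a bi-level correction.

    probs:        (B, C) confidence scores from the model's in-context predictions.
    pop_matrix:   (C, C) population-level calibration matrix.
    sub_matrices: dict group_id -> (C, C) subgroup-level calibration matrix.
    group_ids:    length-B list with the demographic subgroup of each sample.
    """
    out = []
    for p, g in zip(probs, group_ids):
        q = sub_matrices[g] @ (pop_matrix @ p)      # population then subgroup correction
        out.append(q / q.sum())                     # renormalize to a distribution
    return torch.stack(out)

# Toy example with 2 classes and 2 subgroups; matrices are assumed, not estimated here.
probs = torch.tensor([[0.9, 0.1], [0.6, 0.4]])
pop = torch.tensor([[0.8, 0.2], [0.2, 0.8]])
subs = {"A": torch.eye(2), "B": torch.tensor([[0.7, 0.3], [0.3, 0.7]])}
print(apply_calibration(probs, pop, subs, ["A", "B"]))
```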

Authors:Changtao Miao, Yi Zhang, Weize Gao, Man Luo, Weiwei Feng, Zhiya Tan, Jianshu Li, Ajian Liu, Yunfeng Diao, Qi Chu, Tao Gong, Zhe Li, Weibin Yao, Joey Tianyi Zhou
Title: DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios
Abstract:
Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. In critical domains such as law, interpretability is crucial for enhancing the credibility and authority of decisions. Recent studies attempt to improve the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, the practical effectiveness of these methods remains suboptimal due to limitations of the forgery data. Most current deepfake datasets predominantly offer binary labels, with only a few providing localization annotations. However, even these suffer from restricted forgery scenarios, limited diversity in deepfake types, and insufficient data scale, making them inadequate for complex real-world scenarios. To address this predicament, we construct a novel large-scale deepfake detection and localization (DDL) dataset containing over 1.8M forged samples and encompassing up to 75 distinct deepfake methods. The DDL design incorporates four key innovations: (1) Diverse Forgery Scenarios, (2) Comprehensive Deepfake Methods, (3) Varied Manipulation Modes, and (4) Fine-grained Forgery Annotations. Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries, but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods. The DDL dataset project page is at https://deepfake-workshop-ijcai2025.github.io/main/index.html.

Authors:Chengyou Jia, Xin Shen, Zhuohang Dang, Changliang Xia, Weijia Wu, Xinyu Zhang, Hangwei Qian, Ivor W. Tsang, Minnan Luo
Title: Why Settle for One? Text-to-ImageSet Generation and Evaluation
Abstract:
Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose AutoT2IS, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency requirements challenge all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project at https://chengyou-jia.github.io/T2IS-Home.

Authors:Marko Mihajlovic, Siwei Zhang, Gen Li, Kaifeng Zhao, Lea Müller, Siyu Tang
Title: VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions
Abstract:
Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms prior volumetric occupancy model COAP with 10x faster inference, 6x lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL's strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained motion synthesis, and (4) resolving self-intersections. Our results highlight its broad applicability and significant performance and efficiency gains.
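A rough sketch of how a Neural-Blend-Weights-style decoder could look: a small bank of learned weight matrices is mixed by shape/pose-conditioned coefficients, keeping the decoder compact. The layer sizes, the `BlendCoeffPredictor`, and all variable names below are illustrative assumptions, not VolumetricSMPL's actual code.

```python
# Minimal sketch of a Neural-Blend-Weights-style decoder (assumptions: layer sizes,
# the coefficient predictor, and variable names are illustrative, not the authors' code).
import torch
import torch.nn as nn


class BlendedLinear(nn.Module):
    """Linear layer whose weight is a blend of K learned weight matrices."""

    def __init__(self, in_dim, out_dim, num_bases=8):
        super().__init__()
        self.weight_bank = nn.Parameter(torch.randn(num_bases, out_dim, in_dim) * 0.02)
        self.bias_bank = nn.Parameter(torch.zeros(num_bases, out_dim))

    def forward(self, x, coeffs):
        # x: (B, N, in_dim) query-point features, coeffs: (B, num_bases) per-subject blend weights
        w = torch.einsum('bk,koi->boi', coeffs, self.weight_bank)   # (B, out_dim, in_dim)
        b = torch.einsum('bk,ko->bo', coeffs, self.bias_bank)       # (B, out_dim)
        return torch.einsum('bni,boi->bno', x, w) + b.unsqueeze(1)


class BlendCoeffPredictor(nn.Module):
    """Predicts blend coefficients from shape/pose parameters (dimensions assumed)."""

    def __init__(self, cond_dim=82, num_bases=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cond_dim, 64), nn.ReLU(), nn.Linear(64, num_bases))

    def forward(self, cond):
        return torch.softmax(self.net(cond), dim=-1)


# Toy usage: occupancy-style decoding of 3D query points.
points = torch.rand(2, 1024, 3)            # (B, N, 3)
cond = torch.rand(2, 82)                   # shape + pose parameters (dimension assumed)
coeffs = BlendCoeffPredictor()(cond)       # (B, 8)
layer = BlendedLinear(3, 1)
occupancy_logits = layer(points, coeffs)   # (B, N, 1)
```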

Authors:Lunhao Duan, Shanshan Zhao, Xingxing Weng, Jing Zhang, Gui-Song Xia
Title: High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation
Abstract:
This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on the ScanNet v2 and S3DIS datasets under scene-level annotation demonstrate the effectiveness of our method. Additionally, comprehensive ablation studies validate the contributions of our approach's individual components. The code is available at https://github.com/LHDuan/WSegPC.
中文摘要:本文针对场景级标注下的室内点云语义分割提出新框架,通过跨模态特征引导和区域-点语义一致性模块生成高质量伪标签,在多个数据集上实现了显著性能提升。
English Summary: This paper proposes a novel framework for indoor point cloud semantic segmentation using scene-level annotations, employing cross-modal feature guidance and region-point consistency modules to generate high-quality pseudo-labels and achieve state-of-the-art performance.
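The region-voting idea can be illustrated with a short sketch: point-level predictions are averaged inside each region, the scene-level labels mask out absent classes, and the winning regional class supervises the points. The averaging-based vote, the `region_vote_loss` name, and the masking scheme are assumptions for illustration, not the released WSegPC implementation.

```python
# Minimal sketch of region-voting consistency (assumptions: the averaging-based vote,
# the masking by scene-level labels, and all names are illustrative).
import torch
import torch.nn.functional as F


def region_vote_loss(point_logits, region_ids, scene_labels):
    """
    point_logits: (N, C) per-point class logits
    region_ids:   (N,)   region index for each point (e.g., from an over-segmentation)
    scene_labels: (C,)   multi-hot scene-level annotation; absent classes are masked out
    """
    num_regions = int(region_ids.max().item()) + 1
    probs = F.softmax(point_logits, dim=-1)

    # Region-level "votes": average point probabilities inside each region.
    region_probs = torch.zeros(num_regions, probs.size(1), device=probs.device)
    region_probs.index_add_(0, region_ids, probs)
    counts = torch.bincount(region_ids, minlength=num_regions).clamp(min=1).unsqueeze(1)
    region_probs = region_probs / counts

    # Only classes present at scene level can win the vote.
    region_probs = region_probs * scene_labels.unsqueeze(0)
    region_pseudo = region_probs.argmax(dim=-1)                 # (R,)

    # Point-level predictions are pulled toward their region's voted label.
    point_targets = region_pseudo[region_ids]                   # (N,)
    return F.cross_entropy(point_logits, point_targets)


# Toy usage
logits = torch.randn(5000, 20)
regions = torch.randint(0, 300, (5000,))
scene = torch.zeros(20); scene[[0, 3, 7]] = 1.0
loss = region_vote_loss(logits, regions, scene)
```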

Authors:Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
Title: UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Abstract:
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we also extend the existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
中文摘要:UrbanLLaVA是一个多模态大语言模型,通过构建城市指令数据集和多阶段训练框架,能够同时处理多种城市数据类型,在各类城市任务中表现优于现有模型。
English Summary: UrbanLLaVA is a multi-modal large language model designed to process diverse urban data types simultaneously, outperforming existing models across various urban tasks through a curated dataset and multi-stage training framework.

Authors:Vladislav Bargatin, Egor Chistov, Alexander Yakovenko, Dmitriy Vatolin
Title: MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
Abstract:
Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at https://github.com/msu-video-group/memfof.
中文:MEMFOF是一种内存高效的多帧光流方法,在显著降低GPU内存消耗的同时,于多个基准测试中达到最先进精度,并实现了无需裁剪或下采样的全1080p训练。
English: MEMFOF is a memory-efficient multi-frame optical flow method that achieves state-of-the-art accuracy on multiple benchmarks while significantly reducing GPU memory consumption, enabling full 1080p training without cropping or downsampling.

Authors:Shahad Hardan, Darya Taratynova, Abdelmajid Essofi, Karthik Nandakumar, Mohammad Yaqub
Title: Forget-MI: Machine Unlearning for Forgetting Multimodal Information in Healthcare Settings
Abstract:
Privacy preservation in AI is crucial, especially in healthcare, where models rely on sensitive patient data. In the emerging field of machine unlearning, existing methodologies struggle to remove patient data from trained multimodal architectures, which are widely used in healthcare. We propose Forget-MI, a novel machine unlearning method for multimodal medical data, by establishing loss functions and perturbation techniques. Our approach unlearns unimodal and joint representations of the data requested to be forgotten while preserving knowledge from the remaining data and maintaining comparable performance to the original model. We evaluate our results using performance on the forget dataset, performance on the test dataset, and Membership Inference Attack (MIA), which measures the attacker's ability to distinguish the forget dataset from the training dataset. Our model outperforms existing approaches at reducing MIA and degrading performance on the forget dataset, while keeping equivalent performance on the test set. Specifically, our approach reduces MIA by 0.202 and decreases AUC and F1 scores on the forget set by 0.221 and 0.305, respectively. Additionally, our performance on the test set matches that of the retrained model, while allowing forgetting. Code is available at https://github.com/BioMedIA-MBZUAI/Forget-MI.git
Chinese: 提出的Forget-MI方法通过消除特定表征,在保持剩余数据性能的同时,有效从医疗多模态AI模型中移除敏感患者数据,并在降低成员推理攻击和遗忘集指标方面优于现有方法。
English: The proposed Forget-MI method effectively removes sensitive patient data from multimodal AI models in healthcare by unlearning specific representations while maintaining performance on retained data and outperforming existing approaches in reducing membership inference attacks and forget set metrics.

Authors:Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, Yong Li
Title: RoboScape: Physics-informed Embodied World Model
Abstract:
World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. The code is available at: https://github.com/tsinghua-fib-lab/RoboScape.
中文:RoboScape提出了一种统一的物理感知世界模型,通过整合三维几何与运动动力学来提升视频生成的真实性,并有效支持机器人策略的训练与评估。
English: RoboScape introduces a unified physics-informed world model that enhances video generation by integrating 3D geometry and motion dynamics, improving realism and enabling effective robotic policy training and evaluation.

Authors:Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge
Title: CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation
Abstract:
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.
中文摘要:CRISP-SAM2是一种新颖的多器官分割模型,通过跨模态交互和语义提示技术提升医学图像分析的细节精度并摆脱对几何提示的依赖,在多个公开数据集上展现出优越性能。
English Summary: CRISP-SAM2 is a novel multi-organ segmentation model that enhances medical image analysis through cross-modal interaction and semantic prompting, achieving superior performance by improving detail accuracy and eliminating reliance on geometric prompts.

Authors:Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou
Title: MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
Abstract:
Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.
中文摘要:提出的MoCa框架通过模态感知持续预训练和异构对比微调,将预训练视觉语言模型转化为双向多模态嵌入模型,在解决注意力机制、数据可扩展性和训练多样性限制的同时,在多个基准测试中实现了最先进的性能。
English Summary: The proposed MoCa framework transforms pre-trained Vision Language Models into bidirectional multimodal embedding models through modality-aware continual pre-training and heterogeneous contrastive fine-tuning, achieving state-of-the-art performance on benchmarks while addressing limitations in attention mechanisms, data scalability, and training diversity.

Authors:Jian Shi, Tianqi You, Pingping Zhang, Hongli Zhang, Rui Xu, Haojie Li
Title: Frequency-enhanced Multi-granularity Context Network for Efficient Vertebrae Segmentation
Abstract:
Automated and accurate segmentation of individual vertebrae in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced Multi-granularity Context Network (FMC-Net) to improve the accuracy of vertebrae segmentation. Specifically, we first apply wavelet transform for lossless downsampling to reduce the feature distortion in blurred images. The decomposed high and low-frequency components are then processed separately. For the high-frequency components, we apply a High-frequency Feature Refinement (HFR) module to amplify the prominence of key features and filter out noise, restoring fine-grained details in blurred images. For the low-frequency components, we use a Multi-granularity State Space Model (MG-SSM) to aggregate feature representations with different receptive fields, extracting spatially-varying contexts while capturing long-range dependencies with linear complexity. The utilization of multi-granularity contexts is essential for distinguishing similar vertebrae and improving segmentation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on both CT and MRI vertebrae segmentation datasets. The source code is publicly available at https://github.com/anaanaa/FMCNet.
中文:FMC-Net模型通过小波变换分别处理图像的高频和低频成分,优化细节并捕捉长距离依赖,从而提升三维CT和MRI中椎骨分割的准确性。
English: The FMC-Net model enhances vertebrae segmentation in 3D CT and MRI by using wavelet transforms to process high and low frequencies separately, refining details and capturing long-range dependencies to improve accuracy.
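The wavelet-based lossless downsampling and the separate high/low-frequency branches can be sketched as follows; the Haar basis and the plain convolutional branches are illustrative stand-ins for the HFR and MG-SSM modules.

```python
# Minimal sketch of a single-level 2D Haar wavelet split (assumptions: Haar basis and
# the simple per-branch processing shown here are illustrative, not the FMC-Net modules).
import torch
import torch.nn as nn


def haar_dwt2d(x):
    """x: (B, C, H, W) with even H, W -> (LL, LH, HL, HH), each (B, C, H/2, W/2)."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh


class FrequencySplitBlock(nn.Module):
    """Processes the low-frequency band and the stacked high-frequency bands separately."""

    def __init__(self, channels):
        super().__init__()
        self.low_branch = nn.Conv2d(channels, channels, 3, padding=1)           # long-range context stand-in
        self.high_branch = nn.Conv2d(3 * channels, 3 * channels, 3, padding=1)  # detail refinement stand-in

    def forward(self, x):
        ll, lh, hl, hh = haar_dwt2d(x)
        low = self.low_branch(ll)
        high = self.high_branch(torch.cat([lh, hl, hh], dim=1))
        return low, high


feat = torch.randn(1, 16, 64, 64)
low, high = FrequencySplitBlock(16)(feat)   # (1, 16, 32, 32), (1, 48, 32, 32)
```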

Authors:Suofei Zhang, Xinxin Wang, Xiaofu Wu, Quan Zhou, Haifeng Hu
Title: Dynamic Contrastive Learning for Hierarchical Retrieval: A Case Study of Distance-Aware Cross-View Geo-Localization
Abstract:
Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching, rather than enabling models to comprehensively capture contextual information around the target and minimize the cost of localization errors. To support systematic research into this Distance-Aware Cross-View Geo-Localization (DACVGL) problem, we construct Distance-Aware Campus (DA-Campus), the first benchmark that pairs multi-view imagery with precise distance annotations across three spatial resolutions. Based on DA-Campus, we formulate DACVGL as a hierarchical retrieval problem across different domains. Our study further reveals that, due to the inherent complexity of spatial relationships among buildings, this problem can only be addressed via a contrastive learning paradigm, rather than conventional metric learning. To tackle this challenge, we propose Dynamic Contrastive Learning (DyCL), a novel framework that progressively aligns feature representations according to hierarchical spatial margins. Extensive experiments demonstrate that DyCL is highly complementary to existing multi-scale metric learning methods and yields substantial improvements in both hierarchical retrieval performance and overall cross-view geo-localization accuracy. Our code and benchmark are publicly available at https://github.com/anocodetest1/DyCL.
中文摘要:现有跨视角地理定位方法侧重匹配精度而忽略环境上下文,为此我们构建了首个距离感知基准DA-Campus并提出动态对比学习框架DyCL,通过分层特征对齐显著提升了跨域层次检索与定位性能。
English Summary: Current cross-view geo-localization methods prioritize matching accuracy over contextual understanding, prompting the development of the DA-Campus benchmark and Dynamic Contrastive Learning (DyCL) framework to address hierarchical distance-aware localization through progressive feature alignment.
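One plausible reading of hierarchical-margin contrastive alignment is a triplet-style loss whose margin grows with the spatial tier of the negative; the margin values and the `hierarchical_margin_loss` formulation below are assumptions, not the DyCL objective itself.

```python
# Minimal sketch of a distance-aware contrastive loss with hierarchical margins
# (assumptions: margin values, the triplet-style formulation, and names are illustrative).
import torch
import torch.nn.functional as F


def hierarchical_margin_loss(query, reference, tier, margins=(0.2, 0.4, 0.6)):
    """
    query:     (B, D) embeddings of query-view images
    reference: (B, D) embeddings of reference-view images (row i matches row i)
    tier:      (B, B) integer spatial tier between query i and reference j
               (0 = same place / positive, 1 = near, 2 = medium, 3 = far)
    """
    q = F.normalize(query, dim=-1)
    r = F.normalize(reference, dim=-1)
    sim = q @ r.t()                                   # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarity to the true match

    losses = []
    for level, m in enumerate(margins, start=1):      # farther tiers require a larger similarity gap
        neg_mask = (tier == level)
        if neg_mask.any():
            violation = F.relu(sim[neg_mask] - pos.expand_as(sim)[neg_mask] + m)
            losses.append(violation.mean())
    return torch.stack(losses).mean() if losses else sim.new_zeros(())


q = torch.randn(8, 128)
r = torch.randn(8, 128)
tiers = torch.randint(0, 4, (8, 8))
tiers.fill_diagonal_(0)
loss = hierarchical_margin_loss(q, r, tiers)
```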

Authors:Yu Zheng, Boyang Gong, Fanye Kong, Yueqi Duan, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jiwen Lu, Jie Zhou
Title: Learning Counterfactually Decoupled Attention for Open-World Model Attribution
Abstract:
In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks. Source code: https://github.com/yzheng97/CDAL.
中文摘要:本文提出反事实解耦注意力学习(CDAL)方法,通过因果建模分离判别性模型特征与混杂偏差,在计算开销极小的前提下显著提升开放世界模型溯源性能,尤其针对未知攻击具有卓越泛化能力。
English Summary: This paper introduces Counterfactually Decoupled Attention Learning (CDAL), a novel method that uses causal modeling to separate discriminative model artifacts from confounding biases, significantly improving open-world model attribution performance for unseen attacks with minimal computational cost.

Authors:Jiazhen Liu, Yuchuan Deng, Long Chen
Title: Empowering Small VLMs to Think with Dynamic Memorization and Exploration
Abstract:
Empowering Small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging due to their limited parameter capacity and weak instruction-following abilities. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capabilities of SVLMs. Consequently, directly applying these paradigms to SVLMs often suffers from severe pseudo thinking traces and advantage collapse, ultimately undermining both thinking reliability and task performance. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, as their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, ensuring that every update contributes to the trade-off. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME
中文摘要:DyME训练范式通过在优化过程中动态切换记忆与探索模式,有效解决了小规模视觉语言模型的能力限制,在不同领域均实现了显著的性能提升。
English Summary: The proposed DyME training paradigm dynamically alternates between memorization and exploration modes during optimization to overcome the limitations of small-scale vision-language models, achieving significant performance improvements across various domains.

Authors:Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
Title: Ovis-U1 Technical Report
Abstract:
In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
Chinese: Ovis-U1 是一个 30 亿参数的统一模型,通过创新的统一训练方法,在多模态理解、文生图和图像编辑任务中均实现了领先性能,并在多个基准测试中取得最优成绩。
English: Ovis-U1 is a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing, achieving state-of-the-art performance across multiple benchmarks through a novel unified training approach.

Authors:Jie Liu, Jiayi Shen, Pan Zhou, Jan-Jakob Sonke, Efstratios Gavves
Title: Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation
Abstract:
Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototype learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that our proposed FewCLIP significantly outperforms state-of-the-art approaches across both GFSS and class-incremental settings. The code is available at https://github.com/jliu4ai/FewCLIP.
中文:FewCLIP提出了一种概率原型校准框架,通过视觉校准和分布正则化优化多模态原型,在广义少样本语义分割中显著提升了适应性和泛化能力,在基准数据集上取得了领先性能。
English: FewCLIP introduces a probabilistic prototype calibration framework that enhances adaptability and generalization in Generalized Few-Shot Semantic Segmentation by refining multi-modal prototypes with visual calibration and distribution regularization, achieving state-of-the-art results on benchmark datasets.
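A minimal sketch of probabilistic prototype calibration, assuming a Gaussian parameterization of the learnable calibration prototypes, an additive refinement of the frozen text prototypes, and a KL term as the distribution regularizer; the actual FewCLIP formulation may differ.

```python
# Minimal sketch of probabilistic prototype calibration (assumptions: the Gaussian
# parameterization, the additive refinement, and the KL regularizer are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbabilisticCalibration(nn.Module):
    def __init__(self, num_classes, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_classes, dim))        # calibration mean
        self.logvar = nn.Parameter(torch.zeros(num_classes, dim))    # calibration (log) variance

    def forward(self, text_prototypes):
        # Reparameterized sample of the visual calibration prototypes.
        eps = torch.randn_like(self.mu)
        delta = self.mu + eps * torch.exp(0.5 * self.logvar)
        proto = F.normalize(text_prototypes + delta, dim=-1)          # refined prototypes
        # Distribution regularization: keep calibration close to a standard Gaussian.
        kl = -0.5 * (1 + self.logvar - self.mu.pow(2) - self.logvar.exp()).mean()
        return proto, kl


# Toy usage: classify pixel/region features against calibrated prototypes.
text_protos = F.normalize(torch.randn(21, 512), dim=-1)   # frozen CLIP text prototypes (assumed shape)
calib = ProbabilisticCalibration(21, 512)
protos, kl_reg = calib(text_protos)
feats = F.normalize(torch.randn(1024, 512), dim=-1)
logits = feats @ protos.t() / 0.07                         # temperature-scaled similarities
```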

Authors:AmirHossein Naghi Razlighi, Elaheh Badali Golezani, Shohreh Kasaei
Title: Confident Splatting: Confidence-Based Compression of 3D Gaussian Splatting via Learnable Beta Distributions
Abstract:
3D Gaussian Splatting enables high-quality real-time rendering but often produces millions of splats, resulting in excessive storage and computational overhead. We propose a novel lossy compression method based on learnable confidence scores modeled as Beta distributions. Each splat's confidence is optimized through reconstruction-aware losses, enabling pruning of low-confidence splats while preserving visual fidelity. The proposed approach is architecture-agnostic and can be applied to any Gaussian Splatting variant. In addition, the average confidence values serve as a new metric to assess the quality of the scene. Extensive experiments demonstrate favorable trade-offs between compression and fidelity compared to prior work. Our code and data are publicly available at https://github.com/amirhossein-razlighi/Confident-Splatting
中文摘要:本文提出了一种基于可学习置信度分数的创新性有损压缩方法,通过优化高斯分布的置信度实现高效场景压缩,在保持视觉质量的同时显著提升了压缩性能。
English Summary: This paper introduces a novel lossy compression method for 3D Gaussian Splatting that uses learnable confidence scores to prune unnecessary splats while maintaining visual quality, achieving superior compression-performance trade-offs compared to existing methods.
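The confidence mechanism can be sketched compactly: each splat carries a Beta(alpha, beta) whose mean acts as its confidence, modulates opacity during training, and thresholds pruning afterwards. The use of the Beta mean as the confidence score and the 0.1 threshold are illustrative assumptions.

```python
# Minimal sketch of Beta-distributed per-splat confidence and pruning (assumptions:
# using the Beta mean as the confidence score and the threshold value are illustrative).
import torch
import torch.nn as nn


class SplatConfidence(nn.Module):
    def __init__(self, num_splats):
        super().__init__()
        # Beta(alpha, beta) per splat, parameterized in log-space for positivity.
        self.log_alpha = nn.Parameter(torch.zeros(num_splats))
        self.log_beta = nn.Parameter(torch.zeros(num_splats))

    def confidence(self):
        alpha, beta = self.log_alpha.exp(), self.log_beta.exp()
        return alpha / (alpha + beta)            # Beta mean in (0, 1)

    def modulate_opacity(self, opacity):
        # During training the confidence scales splat opacity, so low-confidence splats
        # contribute less to the rendering and the reconstruction loss can shape them.
        return opacity * self.confidence()

    def prune_mask(self, threshold=0.1):
        return self.confidence() > threshold     # keep-mask for splats


conf = SplatConfidence(num_splats=100_000)
opacity = torch.rand(100_000)
rendered_opacity = conf.modulate_opacity(opacity)
keep = conf.prune_mask(threshold=0.1)
print(f"kept {keep.sum().item()} of {keep.numel()} splats")
```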

Authors:Amir Aghdam, Vincent Tao Hu, Björn Ommer
Title: ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Abstract:
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas--the most diverse benchmark of fine-grained actions across multiple sports--where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
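The core alignment step can be sketched as cosine-cost dynamic time warping between frame embeddings and an ordered list of sub-action embeddings; the plain DP below and the normalization by T+S are illustrative choices, not necessarily the exact variant used in the paper.

```python
# Minimal sketch of scoring a class via DTW alignment between an ordered sub-action
# sequence and video frames (assumptions: the cost definition and plain DP are illustrative).
import numpy as np


def dtw_alignment_cost(frame_emb, step_emb):
    """
    frame_emb: (T, D) L2-normalized frame embeddings from an image-language model
    step_emb:  (S, D) L2-normalized embeddings of the LLM-generated, ordered sub-actions
    Returns the normalized DTW cost (lower = better alignment).
    """
    cost = 1.0 - frame_emb @ step_emb.T                 # (T, S) cosine distance
    T, S = cost.shape
    acc = np.full((T + 1, S + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # stay on the same sub-action
                                                 acc[i, j - 1],      # skip ahead in the script
                                                 acc[i - 1, j - 1])  # advance both
    return acc[T, S] / (T + S)


def classify(frame_emb, class_step_embs):
    """class_step_embs: {class_name: (S_c, D)}; pick the class whose script aligns best."""
    scores = {c: -dtw_alignment_cost(frame_emb, e) for c, e in class_step_embs.items()}
    return max(scores, key=scores.get)


rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 512)); frames /= np.linalg.norm(frames, axis=1, keepdims=True)
classes = {}
for name, steps in [("windmill_dunk", 5), ("layup", 3)]:
    e = rng.normal(size=(steps, 512)); e /= np.linalg.norm(e, axis=1, keepdims=True)
    classes[name] = e
print(classify(frames, classes))
```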

Authors:Mai A. Shaaban, Tausifa Jan Saleem, Vijay Ram Papineni, Mohammad Yaqub
Title: MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering
Abstract:
Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.
Chinese: MOTOR提出了一种多模态检索与重排序方法,利用基础描述和最优传输融合视觉与文本信息,显著提升了医疗视觉问答中检索内容的临床相关性,平均准确率比现有最优方法高出6.45%。
English: MOTOR introduces a multimodal retrieval and re-ranking method that integrates visual and textual information through grounded captions and optimal transport, significantly enhancing the relevance of retrieved contexts for medical visual question answering and achieving a 6.45% average accuracy improvement over existing methods.
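A minimal sketch of optimal-transport re-ranking: each retrieved context is scored by the entropic OT cost between its token embeddings and the query's. The uniform-marginal Sinkhorn solver and the negative-cost relevance score are assumptions for illustration.

```python
# Minimal sketch of optimal-transport re-ranking of retrieved contexts (assumptions:
# Sinkhorn with uniform marginals and the cost/score definitions are illustrative).
import torch
import torch.nn.functional as F


def sinkhorn(cost, eps=0.1, iters=100):
    """Entropic OT with uniform marginals. cost: (n, m) -> transport plan (n, m)."""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)


def ot_relevance(query_tokens, context_tokens):
    """Higher = more relevant. Tokens combine textual and grounded-caption/visual features."""
    q = F.normalize(query_tokens, dim=-1)
    c = F.normalize(context_tokens, dim=-1)
    cost = 1.0 - q @ c.t()                     # cosine cost between token sets
    plan = sinkhorn(cost)
    return -(plan * cost).sum()                # negative transport cost as relevance score


# Toy re-ranking of retrieved contexts for a multimodal MedVQA query.
query = torch.randn(12, 256)                               # query (text + image) tokens
retrieved = [torch.randn(torch.randint(8, 20, (1,)).item(), 256) for _ in range(5)]
scores = torch.stack([ot_relevance(query, ctx) for ctx in retrieved])
reranked = scores.argsort(descending=True)
print("re-ranked context order:", reranked.tolist())
```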

Authors:Senkang Hu, Yihang Tao, Guowen Xu, Xinyuan Qian, Yiqin Deng, Xianhao Chen, Sam Tak Wu Kwong, Yuguang Fang
Title: CP-uniGuard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems
Abstract:
Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-uniGuard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict with an ego agent's perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define a collaborative consistency loss (CCLoss) for the object detection and bird's eye view (BEV) segmentation tasks to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose an online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the system in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at https://github.com/CP-Security/CP-uniGuard.
Chinese: 协作感知(CP)通过共享感知数据提升多智能体系统性能,但易受恶意攻击;为此提出的CP-uniGuard框架采用共识机制和自适应阈值,能有效检测并排除恶意智能体,确保系统在动态环境中的可靠性。
English: Collaborative Perception (CP) enhances multi-agent systems by sharing perception data but is vulnerable to malicious attacks, leading to the development of CP-uniGuard, a framework that detects and eliminates such threats through consensus-based methods and adaptive thresholds to ensure system reliability.
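The online adaptive threshold can be sketched with two sliding windows of recent (benign) consistency losses; the window sizes, the mean-plus-k-sigma rule, and taking the looser of the two bounds are illustrative assumptions rather than the published mechanism.

```python
# Minimal sketch of an online adaptive threshold with dual sliding windows (assumptions:
# window sizes, the mean + k*std rule, and how the windows are combined are illustrative).
from collections import deque
import statistics


class DualWindowThreshold:
    def __init__(self, short_len=20, long_len=200, k=2.0):
        self.short = deque(maxlen=short_len)   # reacts quickly to scene changes
        self.long = deque(maxlen=long_len)     # stable long-term baseline
        self.k = k

    def update(self, consistency_loss):
        """Record the consistency loss of a collaboration verified as benign."""
        self.short.append(consistency_loss)
        self.long.append(consistency_loss)

    def threshold(self):
        if len(self.long) < 5:
            return float("inf")                # not enough history: accept everything

        def bound(win):
            m = statistics.fmean(win)
            s = statistics.pstdev(win) if len(win) > 1 else 0.0
            return m + self.k * s

        # Take the looser of the two bounds so brief scene changes don't cause false alarms.
        return max(bound(self.short), bound(self.long))

    def is_suspicious(self, consistency_loss):
        return consistency_loss > self.threshold()


guard = DualWindowThreshold()
for loss in [0.12, 0.10, 0.14, 0.11, 0.13, 0.12]:
    guard.update(loss)
print(guard.is_suspicious(0.45))   # a large disagreement with the ego view is flagged
```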

Authors:Dang Jisheng, Wu Xudong, Wang Bimei, Lv Ning, Chen Jiayu, Jingwen Zhao, Yichu Liu, Jizhao Liu, Juncheng Li, Teng Wang
Title: Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder
Abstract:
Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, we first devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our codes are available at https://github.com/longmalongma/DeSa2VA.
中文总结:提出的DeSa2VA方法通过文本预训练和线性解耦模块的增强提示方案,有效解决了现有模型中视觉与语义信息纠缠的问题,在多项视觉语言任务中实现了最优性能。
English Summary: The proposed DeSa2VA method introduces a decoupling-enhanced prompting scheme with text pre-training and linear feature disentanglement to overcome the visual-semantic entanglement in existing models, achieving state-of-the-art performance across multiple vision-language tasks.

Authors:Ramya Hebbalaguppe, Tamoghno Kandar, Abhinav Nagpal, Chetan Arora
Title: Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration
Abstract:
Vision-language models (VLMs) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications. We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. To mitigate the problem, we propose careful initialization of the test-time prompt using prior knowledge about the target label attributes from a large language model (LLM). (2) To further maintain the quality of prompts during TPT, we propose a novel regularization loss to reduce intra-class distance and increase inter-class distance between the learnt prompts. (3) Through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR'24), 6.78 for DiffTPT (CVPR'23), and 8.43 for PromptAlign (NeurIPS'23). The code is publicly accessible at: https://github.com/rhebbalaguppe/TCA_PromptWithoutPanic.
中文: 视觉语言模型通过测试时提示调优可提升性能,但易导致过拟合和置信度校准不佳;我们提出的方法利用大语言模型初始化提示并结合新型正则化损失,有效改善了多个数据集的校准效果。
English: Vision-language models can be enhanced through test-time prompt tuning, but this often leads to overfitting and poor confidence calibration, which our proposed method mitigates by using large language model initialization and a novel regularization loss to significantly improve calibration across multiple datasets.
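The proposed regularizer can be sketched as an intra-class attraction term (class embedding vs. its LLM-generated attribute embeddings) plus an inter-class repulsion term with a margin; the exact form, margin, and tensor shapes below are assumptions.

```python
# Minimal sketch of the intra-/inter-class prompt regularizer (assumptions: using class
# text embeddings produced under the learned prompt, and the 1 - cosine / margin form, are illustrative).
import torch
import torch.nn.functional as F


def prompt_separation_loss(class_embeddings, attribute_embeddings, margin=0.5):
    """
    class_embeddings:     (C, D) text features of class prompts under the tuned prompt
    attribute_embeddings: (C, A, D) features of LLM-generated attribute prompts per class
    """
    cls = F.normalize(class_embeddings, dim=-1)
    attr = F.normalize(attribute_embeddings, dim=-1)

    # Intra-class: each class embedding stays close to its own attribute descriptions.
    intra = (1.0 - (attr * cls.unsqueeze(1)).sum(-1)).mean()

    # Inter-class: different class embeddings are pushed apart (up to a margin).
    sim = cls @ cls.t()
    off_diag = sim[~torch.eye(sim.size(0), dtype=torch.bool)]
    inter = F.relu(off_diag - margin).mean()

    return intra + inter


cls_emb = torch.randn(10, 512, requires_grad=True)
attr_emb = torch.randn(10, 4, 512)
loss = prompt_separation_loss(cls_emb, attr_emb)
loss.backward()
```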

Authors:Yueyang Li, Shengyu Gong, Weiming Zeng, Nizhuan Wang, Wai Ting Siok
Title: FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition
Abstract:
Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive traits, and emotional responses. We propose FreqDGT, a frequency-adaptive dynamic graph transformer that systematically addresses these limitations through an integrated framework. FreqDGT introduces frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands based on neuroscientific evidence, employs adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and implements a multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement to capture temporal dynamics while ensuring cross-subject robustness. Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy, confirming the effectiveness of integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences. The code is available at https://github.com/NZWANG/FreqDGT.
中文: 针对脑电情绪识别中的跨被试泛化难题,FreqDGT模型通过频率自适应处理、动态脑连接学习和多尺度时序解耦,有效提升了识别精度与个体差异鲁棒性。
English: EEG-based emotion recognition faces cross-subject generalization challenges, which FreqDGT addresses through frequency-adaptive processing, dynamic brain connectivity learning, and multi-scale temporal modeling to significantly improve accuracy and robustness.
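Frequency-adaptive processing can be illustrated by splitting the EEG signal into standard bands with an FFT mask and recombining them with learnable softmax weights; the band limits, sampling rate, and gating form are illustrative assumptions.

```python
# Minimal sketch of frequency-adaptive band weighting for EEG (assumptions: the FFT-mask
# band split, the standard band limits, and the softmax gating are illustrative).
import torch
import torch.nn as nn

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}


class FrequencyAdaptiveProcessing(nn.Module):
    def __init__(self, fs=128):
        super().__init__()
        self.fs = fs
        self.band_logits = nn.Parameter(torch.zeros(len(BANDS)))   # learnable band importance

    def forward(self, x):
        # x: (B, channels, time) raw EEG; returns a band-weighted reconstruction.
        spec = torch.fft.rfft(x, dim=-1)
        freqs = torch.fft.rfftfreq(x.size(-1), d=1.0 / self.fs)
        weights = torch.softmax(self.band_logits, dim=0)
        out = torch.zeros_like(x)
        for w, (lo, hi) in zip(weights, BANDS.values()):
            mask = ((freqs >= lo) & (freqs < hi)).float()
            band = torch.fft.irfft(spec * mask, n=x.size(-1), dim=-1)
            out = out + w * band                                    # emphasize informative bands
        return out


eeg = torch.randn(4, 62, 512)          # batch of 4, 62 channels, 4 s at 128 Hz
weighted = FrequencyAdaptiveProcessing()(eeg)
```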

Authors:Byung Hyun Lee, Sungjin Lim, Seunggyu Lee, Dong Un Kang, Se Young Chun
Title: Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate
Abstract:
Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding \emph{linear} modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding \emph{nonlinear} Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit content demonstrate that the proposed CPE outperforms prior methods by keeping diverse remaining concepts while deleting the target concepts, with robustness against attack prompts. Code is available at https://github.com/Hyun1A/CPE
中文摘要:本文提出概念精准擦除器(CPE),通过非线性残差注意力门控和对抗性训练,在扩散模型中有效擦除目标概念的同时保持其余概念的多样性,并具备抗攻击鲁棒性。
English summary: This paper introduces Concept Pinpoint Eraser (CPE), a novel framework that uses nonlinear Residual Attention Gates and adversarial training to effectively erase target concepts from diffusion models while preserving diverse remaining concepts and maintaining robustness against attacks.
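A rough sketch of a residual attention gate: a small nonlinear module, conditioned on the text tokens, emits a gated residual that is added to the cross-attention output so that only the targeted concept is altered. The gate architecture and insertion point below are assumptions, not the released CPE module.

```python
# Minimal sketch of a Residual-Attention-Gate-style module on a cross-attention output
# (assumptions: the gate architecture, where it is inserted, and the gating form are illustrative).
import torch
import torch.nn as nn


class ResidualAttentionGate(nn.Module):
    """Produces a nonlinear, text-conditioned residual added to the attention output,
    so the target concept can be suppressed while unrelated prompts pass through nearly unchanged."""

    def __init__(self, text_dim=768, hidden=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(text_dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))
        self.delta = nn.Sequential(nn.Linear(text_dim, hidden), nn.SiLU(), nn.Linear(hidden, text_dim))

    def forward(self, attn_out, text_emb):
        # attn_out: (B, L, D) cross-attention output; text_emb: (B, L, D) token embeddings
        g = torch.sigmoid(self.gate(text_emb))        # ~0 for benign prompts, ~1 for the target concept
        return attn_out + g * self.delta(text_emb)    # residual correction only where the gate is on


attn_out = torch.randn(2, 77, 768)
text_emb = torch.randn(2, 77, 768)
edited = ResidualAttentionGate()(attn_out, text_emb)
```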

Authors:Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Lin Mei, Peiyi Shen, Liang Zhang
Title: Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding
Abstract:
Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.
Chinese: 本文提出了增强人机互理解的概念瓶颈模型(CBM-HNMU),通过自动识别并优化有害概念来提升模型可解释性与准确率,在多个数据集上最高实现了2.64%的精度提升。
English: This paper introduces the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU), which automatically identifies and refines detrimental concepts to improve both model interpretability and accuracy, achieving up to a 2.64% accuracy gain across multiple datasets.

Authors:Sicong Du, Jiarun Liu, Qifeng Chen, Hao-Xiang Chen, Tai-Jiang Mu, Sheng Yang
Title: RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors
Abstract:
A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expansion a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts the Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source code will be made publicly available at https://github.com/CN-ADLab/RGE-GS.
Chinese: 提出的RGE-GS框架通过将扩散先验与奖励引导的高斯优化相结合,克服了场景扩展中的局限性,借助选择性模式保留和自适应训练策略实现了最先进的重建质量。
English: The proposed RGE-GS framework overcomes limitations in scene expansion by integrating diffusion priors with reward-guided Gaussian optimization, achieving state-of-the-art reconstruction quality through selective pattern retention and adaptive training strategies.

Authors:Minchao Jiang, Shunyu Jia, Jiaming Gu, Xiaoyuan Lu, Guangming Zhu, Anqi Dong, Liang Zhang
Title: VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding
Abstract:
3D Gaussian Splatting (3DGS) has become a workhorse for high-quality, real-time rendering for novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, the Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects, and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open-vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing training costs associated with high-dimensional CLIP features while preserving semantic unambiguity. Extensive experiments and ablation studies demonstrate the effectiveness of VoteSplat in open-vocabulary 3D instance localization, 3D point cloud understanding, click-based 3D object localization, and hierarchical segmentation. Our code is available at https://sy-ja.github.io/votesplat/

Authors:Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
Title: PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection
Abstract:
Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. This highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and open-source a bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and more precise localization than existing models, making it a scalable solution.
中文: 深度伪造攻击日益严重,但现有数据集缺乏真实性,因此我们提出了PhonemeFake,通过语言推理操纵关键语音段,显著降低人类感知和基准准确率,并推出高效检测模型,提升性能与速度。
English: Deepfake attacks are advancing, but current datasets lack realism, prompting the introduction of PhonemeFake, which manipulates speech segments to significantly reduce human perception and benchmark accuracy, alongside an efficient detection model that improves performance and speed.

Authors:Jianing Zhang, Jiayi Zhu, Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan
Title: Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography
Abstract:
Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside \textit{pseudo} data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: https://dmdiff.github.io/.

Authors:Jiang Yuan, JI Ma, Bo Wang, Guanzhou Ke, Weiming Hu
Title: LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning
Abstract:
Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve effectiveness, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during the teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inference. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR can achieve outstanding performance with minimal complexity across a range of blind SR tasks. Our code is accessible at: https://github.com/MJ-NCEPU/LightBSR.
中文摘要:本文提出了一种轻量级盲超分辨率模型LightBSR,通过知识蒸馏和特征对齐技术优化隐式退化表示的可区分性,在保持最低复杂度的同时实现了卓越的性能。
English Summary: The paper introduces LightBSR, a lightweight blind super-resolution model that enhances implicit degradation representation discriminability through knowledge distillation and feature alignment, achieving high performance with minimal complexity.
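The teacher-stage contrastive objective can be sketched as a supervised contrastive loss over implicit degradation representations, using the known synthetic degradation type of each LR patch as the label; this formulation is an illustrative assumption.

```python
# Minimal sketch of degradation-discriminative contrastive learning for IDRs (assumptions:
# the supervised-contrastive form over known synthetic degradation types is illustrative).
import torch
import torch.nn.functional as F


def degradation_contrastive_loss(idr, degradation_ids, temperature=0.1):
    """
    idr:             (B, D) implicit degradation representations from the teacher encoder
    degradation_ids: (B,)   index of the synthetic degradation applied to each LR patch
    """
    z = F.normalize(idr, dim=-1)
    sim = z @ z.t() / temperature                                  # (B, B)
    same = degradation_ids.unsqueeze(0) == degradation_ids.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = same & ~eye

    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_counts              # mean log-prob over positives
    return loss[pos_mask.any(1)].mean()                            # skip samples with no positive


idr = torch.randn(16, 256, requires_grad=True)
deg = torch.randint(0, 4, (16,))                                   # e.g., 4 blur/noise settings
loss = degradation_contrastive_loss(idr, deg)
loss.backward()
```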

Authors:Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Title: CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation
Abstract:
The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO$_2$), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.
中文摘要:CaO$_2$框架通过两阶段优化方法解决了基于扩散的数据集蒸馏中的目标与条件不一致问题,在ImageNet数据集上实现了最优性能,准确率比现有最佳方法平均提升2.3%。
English Summary: The CaO$_2$ framework addresses inconsistencies in diffusion-based dataset distillation by aligning the process with evaluation objectives through a two-stage approach, achieving state-of-the-art performance on ImageNet with a 2.3% accuracy improvement over baselines.
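One way to picture the probability-informed selection stage: score each generated candidate with a pretrained observer classifier and keep the top-IPC samples for the condition class. The ResNet-18 observer and the top-k rule below are assumptions for illustration, not the exact CaO$_2$ pipeline.

```python
# Minimal sketch of probability-informed sample selection for a distilled class (assumptions:
# scoring candidates with a pretrained observer classifier and keeping the top-IPC subset
# is an illustrative reading of the selection stage, not the exact CaO2 pipeline).
import torch
import torch.nn.functional as F
import torchvision.models as models


@torch.no_grad()
def select_candidates(candidates, class_idx, ipc, classifier):
    """
    candidates: (N, 3, H, W) images generated by the diffusion model for one class
    class_idx:  target class index used as the generation condition
    ipc:        images-per-class budget of the distilled dataset
    """
    logits = classifier(candidates)
    probs = F.softmax(logits, dim=-1)[:, class_idx]     # how well each sample matches its condition
    keep = probs.argsort(descending=True)[:ipc]
    return candidates[keep], probs[keep]


classifier = models.resnet18(weights=None).eval()       # stand-in observer model
cands = torch.randn(32, 3, 224, 224)
distilled, scores = select_candidates(cands, class_idx=207, ipc=10, classifier=classifier)
print(distilled.shape, scores.shape)
```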

Authors:Arunkumar Kannan, Martin A. Lindquist, Brian Caffo
Title: BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data
Abstract:
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at https://github.com/arunkumar-kannan/BrainMT-fMRI
中文:BrainMT框架通过结合双向Mamba模块和Transformer,有效捕捉fMRI数据中的长程时空特征,在大型神经影像数据集上的分类与回归任务中均实现了最先进的性能表现。
English: The BrainMT framework introduces a hybrid approach using bidirectional Mamba blocks and transformers to effectively capture long-range spatiotemporal dependencies in fMRI data, achieving state-of-the-art results in both classification and regression tasks on major neuroimaging datasets.

Authors:Vasilis Siomos, Jonathan Passerat-Palmbach, Giacomo Tarroni
Title: FedCLAM: Client Adaptive Momentum with Foreground Intensity Matching for Federated Medical Image Segmentation
Abstract:
Federated learning is a decentralized training approach that keeps data under stakeholder control while achieving superior performance over isolated training. While inter-institutional feature discrepancies pose a challenge in all federated settings, medical imaging is particularly affected due to diverse imaging devices and population variances, which can diminish the global model's effectiveness. Existing aggregation methods generally fail to adapt across varied circumstances. To address this, we propose FedCLAM, which integrates \textit{client-adaptive momentum} terms derived from each client's loss reduction during local training, as well as a \textit{personalized dampening factor} to curb overfitting. We further introduce a novel \textit{intensity alignment} loss that matches predicted and ground-truth foreground distributions to handle heterogeneous image intensity profiles across institutions and devices. Extensive evaluations on two datasets show that FedCLAM surpasses eight cutting-edge methods in medical segmentation tasks, underscoring its efficacy. The code is available at https://github.com/siomvas/FedCLAM.
中文: FedCLAM通过整合客户端自适应动量项和个性化抑制因子来缓解联邦学习中的特征差异和过拟合问题,同时引入强度对齐损失处理异构图像数据,在医学分割任务中显著优于现有方法。
English: FedCLAM enhances federated learning in medical imaging by incorporating client-adaptive momentum and a personalized dampening factor to mitigate feature discrepancies and overfitting, while introducing an intensity alignment loss to handle heterogeneous image data, achieving superior performance in segmentation tasks compared to existing methods.
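The abstract describes client-adaptive momentum derived from each client's loss reduction plus a personalized dampening factor, but not the exact update rule. The snippet below is a loose, hypothetical sketch of how such server-side aggregation could be wired up in PyTorch; the loss-drop weighting, the momentum and dampening constants, and all names are assumptions rather than the paper's formulation.
```python
import copy
import torch

def fedclam_style_aggregate(global_model, client_states, loss_drops,
                            momentum_buf, beta=0.9, damp=0.5):
    """Hypothetical server-side aggregation with client-adaptive momentum.

    client_states : list of state_dicts returned by the clients
    loss_drops    : list of (loss_before - loss_after) per client; a larger drop
                    gets a larger weight in this toy rule
    momentum_buf  : dict of running momentum tensors (same keys as the model)
    """
    total_drop = sum(max(d, 0.0) for d in loss_drops) + 1e-8
    new_state = copy.deepcopy(global_model.state_dict())

    for key, g_param in global_model.state_dict().items():
        # weighted average of client updates, weights proportional to loss drop
        update = torch.zeros_like(g_param)
        for state, drop in zip(client_states, loss_drops):
            w = max(drop, 0.0) / total_drop
            update += w * (state[key] - g_param)
        # momentum with a dampening factor applied to the fresh update
        momentum_buf[key] = beta * momentum_buf.get(key, torch.zeros_like(update)) + damp * update
        new_state[key] = g_param + momentum_buf[key]

    global_model.load_state_dict(new_state)
    return global_model, momentum_buf

# toy usage with two clients sharing a small linear model
net = torch.nn.Linear(4, 2)
clients = [copy.deepcopy(net) for _ in range(2)]
for c in clients:
    with torch.no_grad():
        for p in c.parameters():
            p.add_(0.01 * torch.randn_like(p))   # pretend local training happened
net, buf = fedclam_style_aggregate(net, [c.state_dict() for c in clients],
                                   loss_drops=[0.3, 0.1], momentum_buf={})
```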

Authors:Hang Xu, Jie Huang, Linjiang Huang, Dong Li, Yidi Liu, Feng Zhao
Title: FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment
Abstract:
Domain Adaptation(DA) for dense prediction tasks is an important topic, which enhances the dense prediction model's performance when tested on its unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, the exploration of DA designs tailored to this framework is worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics to domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence meeting variations of sampling noise, we utilize the statistics from the high-confidence regions progressively to guide the noise statistic adjustment during the sampling process. Notably, our method demonstrates the effectiveness of enhancing the DA capability of DDP models across four common dense prediction tasks. Code is available at \href{https://github.com/xuhang07/FreeDNA}{https://github.com/xuhang07/FreeDNA}.
Chinese: 本研究提出了一种无需训练的域噪声对齐(DNA)方法,通过在扩散采样过程中对齐域间的噪声统计特性,有效增强了基于扩散的密集预测模型的域适应能力,无需额外训练即可提升多种任务的性能。
English: This study introduces a training-free Domain Noise Alignment (DNA) method that enhances domain adaptation in diffusion-based dense prediction models by aligning noise statistics between domains during the sampling process, effectively improving performance across various tasks without requiring additional training.
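The core operation reported here, aligning noise-prediction statistics between domains during sampling, can be pictured as a per-channel mean/std re-normalization applied to the predicted noise at each step. The sketch below assumes 4D image tensors and a blending weight `alpha`; both the shapes and the blending rule are illustrative, not the paper's exact procedure.
```python
import torch

def align_noise_statistics(eps_pred, ref_mean, ref_std, alpha=1.0, eps=1e-6):
    """Shift/scale a predicted-noise tensor so its per-channel statistics move
    toward reference statistics (e.g. collected on the source domain).

    eps_pred : (B, C, H, W) noise predicted by the diffusion model at step t
    ref_mean, ref_std : (C,) reference per-channel statistics
    alpha : 0 keeps the prediction unchanged, 1 matches the reference exactly
    """
    cur_mean = eps_pred.mean(dim=(0, 2, 3))
    cur_std = eps_pred.std(dim=(0, 2, 3))
    tgt_mean = (1 - alpha) * cur_mean + alpha * ref_mean
    tgt_std = (1 - alpha) * cur_std + alpha * ref_std
    normed = (eps_pred - cur_mean[None, :, None, None]) / (cur_std[None, :, None, None] + eps)
    return normed * tgt_std[None, :, None, None] + tgt_mean[None, :, None, None]

# toy check: target-domain noise gets pulled toward N(0, 1) channel statistics
x = torch.randn(2, 3, 8, 8) * 1.5 + 0.3
y = align_noise_statistics(x, ref_mean=torch.zeros(3), ref_std=torch.ones(3), alpha=0.5)
print(y.mean().item(), y.std().item())
```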

Authors:Hassan Baker, Matthew S. Emigh, Austin J. Brockmeier
Title: Weakly Supervised Object Segmentation by Background Conditional Divergence
Abstract:
As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds onto backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The codebase for this work can be found at https://github.com/bakerhassan/WSOS.
Chinese: 本文提出了一种弱监督的二元目标分割方法,利用图像级目标存在标签并通过将分割目标与聚类背景融合生成反事实图像,在声纳等专业领域和自然图像上实现了优于无监督基线的性能,且无需预训练网络或对抗训练。
English: This paper introduces a weakly supervised method for binary object segmentation that uses image-level object presence labels and creates counterfactual images by blending segmented objects with clustered backgrounds, achieving improved performance on specialized domains like sonar and natural images without relying on pretrained networks or adversarial training.
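Two ingredients of the method are easy to picture in code: compositing a segmented object onto a background drawn from a target cluster, and a sample-based divergence between counterfactual and real images. The sketch below uses a simple RBF-kernel MMD as a stand-in for whichever sample-based divergence the authors employ; the masks, images, and kernel bandwidth are toy placeholders.
```python
import torch

def composite(object_img, mask, background_img):
    """Paste a softly masked object onto a background-only image.
    object_img, background_img : (B, C, H, W); mask : (B, 1, H, W) in [0, 1]."""
    return mask * object_img + (1.0 - mask) * background_img

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD with an RBF kernel between two batches of flattened images."""
    x = x.flatten(1)
    y = y.flatten(1)
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# toy usage: counterfactuals should be pulled toward real object images
obj = torch.rand(4, 3, 32, 32)
bg_target_cluster = torch.rand(4, 3, 32, 32)
mask = (torch.rand(4, 1, 32, 32) > 0.7).float()     # stand-in for the predicted mask
counterfactual = composite(obj, mask, bg_target_cluster)
real_with_target_bg = torch.rand(4, 3, 32, 32)
loss = rbf_mmd2(counterfactual, real_with_target_bg)
print(loss.item())
```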

Authors:Hassan Baker, Austin J. Brockmeier
Title: Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection
Abstract:
Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search for abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the relatively higher error and/or variance of the location prediction. This generates a heatmap that can be integrated into pixel-wise methods to achieve finer-grained segmentation. We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from BraTS2021 and MSLUB datasets and T1-weighted images from ATLAS and WMH datasets. We show that it outperforms the state-of-the-art in unsupervised segmentation. The codebase for this work can be found at https://github.com/bakerhassan/Patch2Loc.
中文: 本研究提出Patch2Loc这一无监督方法,通过训练神经网络预测脑部MRI图像中斑块的位置,并利用预测误差识别病变区域,在性能上超越了现有最先进的无监督分割技术。
English: This study introduces Patch2Loc, an unsupervised method that detects brain lesions in MRI by training a neural network to predict patch locations and identifying abnormalities through prediction errors, outperforming current state-of-the-art techniques.
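The training signal is straightforward to sketch: a small network regresses each patch's normalized location within a slice, and at inference the regression error serves as an abnormality score. The architecture, patch size, and squared-error score below are illustrative assumptions, not the released model.
```python
import torch
import torch.nn as nn

class PatchLocator(nn.Module):
    """Tiny regressor mapping an image patch to its normalized (y, x) location."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Sigmoid(),          # locations normalized to [0, 1]
        )
    def forward(self, patches):
        return self.net(patches)

model = PatchLocator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# toy training step on random "normal" patches with known locations
patches = torch.rand(32, 1, 15, 15)
locations = torch.rand(32, 2)
loss = nn.functional.mse_loss(model(patches), locations)
opt.zero_grad(); loss.backward(); opt.step()

# at inference, a high location-prediction error flags a potentially abnormal patch
with torch.no_grad():
    err = (model(patches) - locations).pow(2).sum(dim=1)   # per-patch anomaly score
print(err.shape)  # torch.Size([32])
```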

Authors:Weiyi Zhao, Xiaoyu Tan, Liang Liu, Sijia Li, Youwei Song, Xihe Qiu
Title: Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models
Abstract:
Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs' vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at https://github.com/zgg2577/VS-KC.
中文: 本研究提出OR-VSKC数据集,包含手术室安全违规场景的合成图像和人工标注图像,旨在解决多模态大语言模型的视觉-语义知识冲突问题,实验表明微调能提升模型对已知违规实体的检测能力,但对未训练实体类型的泛化能力仍显不足。
English: This study introduces the OR-VSKC dataset, comprising synthetic and human-annotated images of operating room safety violations, to address visual-semantic knowledge conflicts in multimodal large language models, improving their detection of trained risks while revealing limitations in generalizing to unseen entities.

Authors:Weizhi Gao, Zhichao Hou, Junqi Yin, Feiyi Wang, Linyu Peng, Xiaorui Liu
Title: Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization
Abstract:
Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at https://github.com/WeizhiGao/MoDiff.
中文: 本文提出MoDiff这一创新框架,通过调制量化和误差补偿加速扩散模型,在训练后量化中将激活位从8位降至3位且不损失性能。
English: This paper introduces MoDiff, a novel framework that accelerates diffusion models through modulated quantization and error compensation, effectively reducing activation bits from 8 to 3 without performance loss in post-training quantization.
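At the level of the abstract, the idea is to quantize a modulated signal (for example, the change between consecutive sampling steps) rather than the raw activation, while carrying the quantization error forward. The sketch below is a generic delta-quantization-with-error-feedback loop written only to illustrate that principle; the uniform 3-bit quantizer and the state handling are assumptions, not MoDiff itself.
```python
import torch

def uniform_quantize(x, bits=3):
    """Symmetric uniform quantizer used only for illustration."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

class ModulatedActQuant:
    """Quantize the change between consecutive sampling steps and compensate
    the accumulated quantization error (a rough sketch, not MoDiff itself)."""
    def __init__(self, bits=3):
        self.bits = bits
        self.prev = None      # dequantized activation from the previous step
        self.err = None       # running quantization error

    def __call__(self, act):
        if self.prev is None:
            self.prev = uniform_quantize(act, self.bits)
            self.err = act - self.prev
            return self.prev
        residual = act - self.prev + self.err        # modulate + compensate
        q_res = uniform_quantize(residual, self.bits)
        self.err = residual - q_res                  # error carried to the next step
        self.prev = self.prev + q_res
        return self.prev

q = ModulatedActQuant(bits=3)
acts = [torch.randn(1, 8, 16, 16) * (1 - 0.05 * t) for t in range(10)]  # slowly varying
recon = [q(a) for a in acts]
print(torch.mean((recon[-1] - acts[-1]).abs()).item())
```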

Authors:Sadra Safadoust, Fabio Tosi, Fatma Güney, Matteo Poggi
Title: WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields
Abstract:
We introduce WarpRF, a training-free general-purpose framework for quantifying the uncertainty of radiance fields. Built upon the assumption that photometric and geometric consistency should hold among images rendered by an accurate model, WarpRF quantifies its underlying uncertainty from an unseen point of view by leveraging backward warping across viewpoints, projecting reliable renderings to the unseen viewpoint and measuring the consistency with images rendered there. WarpRF is simple and inexpensive, does not require any training, and can be applied to any radiance field implementation for free. WarpRF excels at both uncertainty quantification and downstream tasks, e.g., active view selection and active mapping, outperforming any existing method tailored to specific frameworks.
中文:WarpRF是一种无需训练的多功能框架,通过反向扭曲测量视角间一致性来量化辐射场的不确定性,在主动视图选择等任务中表现卓越且无需额外训练。
English: WarpRF is a training-free framework that quantifies radiance field uncertainty by leveraging backward warping to measure consistency across viewpoints, excelling in tasks like active view selection without requiring any training.

Authors:Yuliang Huang, Imraj Singh, Thomas Joyce, Kris Thielemans, Jamie R. McClelland
Title: DIGS: Dynamic CBCT Reconstruction using Deformation-Informed 4D Gaussian Splatting and a Low-Rank Free-Form Deformation Model
Abstract:
3D Cone-Beam CT (CBCT) is widely used in radiotherapy but suffers from motion artifacts due to breathing. A common clinical approach mitigates this by sorting projections into respiratory phases and reconstructing images per phase, but this does not account for breathing variability. Dynamic CBCT instead reconstructs images at each projection, capturing continuous motion without phase sorting. Recent advancements in 4D Gaussian Splatting (4DGS) offer powerful tools for modeling dynamic scenes, yet their application to dynamic CBCT remains underexplored. Existing 4DGS methods, such as HexPlane, use implicit motion representations, which are computationally expensive. While explicit low-rank motion models have been proposed, they lack spatial regularization, leading to inconsistencies in Gaussian motion. To address these limitations, we introduce a free-form deformation (FFD)-based spatial basis function and a deformation-informed framework that enforces consistency by coupling the temporal evolution of Gaussian's mean position, scale, and rotation under a unified deformation field. We evaluate our approach on six CBCT datasets, demonstrating superior image quality with a 6x speedup over HexPlane. These results highlight the potential of deformation-informed 4DGS for efficient, motion-compensated CBCT reconstruction. The code is available at https://github.com/Yuliang-Huang/DIGS.
中文: 针对三维锥束CT中呼吸运动导致的伪影问题,本研究提出基于自由形变的4D高斯溅射框架,通过统一高斯运动场实现运动补偿重建,在提升图像质量的同时较现有方法提速六倍。
English: To address motion artifacts in 3D CBCT caused by breathing variability, this study introduces a deformation-informed 4D Gaussian Splatting framework that uses free-form deformation to unify Gaussian motion, achieving higher image quality and a sixfold speed improvement over existing methods.

Authors:Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt
Title: COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication
Abstract:
Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the \textit{Common Objects Out-of-Context (COOCO)} dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at \href{https://github.com/cs-nlp-uu/scenereg}{https://github.com/cs-nlp-uu/scenereg}.
中文摘要:本研究探讨视觉语言模型是否像人类一样利用场景上下文进行物体指代,发现模型会根据物体与场景的语义关联度及干扰程度自适应地依赖上下文,其注意力机制能动态平衡局部与整体信息。
English Summary: This study investigates whether Vision-Language Models utilize scene context for object reference like humans do, finding they adaptively rely on context based on object-scene congruence and noise levels, with attention mechanisms dynamically balancing local and contextual information.

Authors:Ajay Mittal, Raghav Mehta, Omar Todd, Philipp Seeböck, Georg Langs, Ben Glocker
Title: Cardiovascular disease classification using radiomics and geometric features from cardiac CT
Abstract:
Automatic detection and classification of Cardiovascular disease (CVD) from Computed Tomography (CT) images play an important part in facilitating better-informed clinical decisions. However, most of the recent deep learning based methods either directly work on raw CT data or utilize it in pair with anatomical cardiac structure segmentation by training an end-to-end classifier. As such, these approaches become much more difficult to interpret from a clinical perspective. To address this challenge, in this work, we break down the CVD classification pipeline into three components: (i) image segmentation, (ii) image registration, and (iii) downstream CVD classification. Specifically, we utilize the Atlas-ISTN framework and recent segmentation foundational models to generate anatomical structure segmentation and a normative healthy atlas. These are further utilized to extract clinically interpretable radiomic features as well as deformation field based geometric features (through atlas registration) for CVD classification. Our experiments on the publicly available ASOCA dataset show that utilizing these features leads to better CVD classification accuracy (87.50\%) when compared against a classification model trained directly on raw CT images (67.50\%). Our code is publicly available: https://github.com/biomedia-mira/grc-net
中文: 本研究提出一个三阶段心血管疾病分类流程,通过分割和图谱配准提取可解释的影像组学与几何特征,在ASOCA数据集上达到87.50%的准确率,显著优于直接使用原始CT图像的67.50%。
English: This study introduces a three-component pipeline for cardiovascular disease classification that uses interpretable radiomic and geometric features derived from segmentation and atlas registration, achieving 87.50% accuracy on the ASOCA dataset compared to 67.50% with raw CT images.
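Only the final, interpretable step of the pipeline is sketched below: concatenating precomputed radiomic and deformation-based geometric features and fitting a linear classifier with cross-validation. The feature arrays are synthetic placeholders; segmentation and atlas registration (the first two components) are out of scope here.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_patients = 40
radiomic = rng.normal(size=(n_patients, 30))     # e.g. shape/intensity/texture statistics
geometric = rng.normal(size=(n_patients, 10))    # e.g. deformation-field summaries
labels = rng.integers(0, 2, size=n_patients)     # CVD vs. normal (toy labels)

features = np.concatenate([radiomic, geometric], axis=1)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, features, labels, cv=5)
print("cross-validated accuracy:", scores.mean())
```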

Authors:Holly Dinkel, Raghavendra Navaratna, Jingyi Xiang, Brian Coltin, Trey Smith, Timothy Bretl
Title: KnotDLO: Toward Interpretable Knot Tying
Abstract:
This work presents KnotDLO, a method for one-handed Deformable Linear Object (DLO) knot tying that is robust to occlusion, repeatable for varying rope initial configurations, interpretable for generating motion policies, and requires no human demonstrations or training. Grasp and target waypoints for future DLO states are planned from the current DLO shape. Grasp poses are computed from indexing the tracked piecewise linear curve representing the DLO state based on the current curve shape and are piecewise continuous. KnotDLO computes intermediate waypoints from the geometry of the current DLO state and the desired next state. The system decouples visual reasoning from control. In 16 trials of knot tying, KnotDLO achieves a 50% success rate in tying an overhand knot from previously unseen configurations.

Authors:Ronald Fecso, José Morano, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Title: RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models
Abstract:
The rise of imaging techniques such as optical coherence tomography (OCT) and advances in deep learning (DL) have enabled clinicians and researchers to streamline retinal disease staging. A popular DL approach is self-supervised learning (SSL), where models learn from vast amounts of unlabeled data, avoiding costly annotation. SSL has allowed the development of foundation models (FMs), large models that can be used for a variety of downstream tasks. However, existing FMs for OCT, trained solely on image data, lack a comprehensive and robust semantic understanding of images, as evidenced by their downstream performance (especially for complex tasks), and thus require supervised fine-tuning (which may be unfeasible) to better adapt to specific applications and populations. To address this, we propose RetFiner, an SSL vision-language refinement scheme that improves the representations of existing FMs and enables their efficient and direct adaptation to specific populations for improved downstream performance. Our method uses a diverse set of training objectives which take advantage of the rich supervisory signal found in textual data. We tested RetFiner on the retinal FMs RETFound, UrFound, and VisionFM, showing significant improvements in linear probing performance on seven highly diverse OCT classification tasks, with an average increase of 5.8, 3.9, and 2.1 percentage points over their baselines, respectively. Our code and model weights are publicly available at https://github.com/ronnief1/RetFiner.
Chinese: RetFiner是一种自监督视觉语言优化方案,通过利用文本数据增强现有OCT基础模型,在无需监督微调的情况下显著提升了多种视网膜疾病分类任务的表现。
English: RetFiner is a self-supervised vision-language refinement scheme that enhances existing foundation models for OCT imaging by leveraging textual data, significantly improving their performance across diverse retinal disease classification tasks without requiring supervised fine-tuning.

Authors:Noora Sassali, Roel Pieters
Title: Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration
Abstract:
Pointing gestures are a common interaction method used in Human-Robot Collaboration for various tasks, ranging from selecting targets to guiding industrial processes. This study introduces a method for localizing pointed targets within a planar workspace. The approach employs pose estimation, and a simple geometric model based on shoulder-wrist extension to extract gesturing data from an RGB-D stream. The study proposes a rigorous methodology and comprehensive analysis for evaluating pointing gestures and target selection in typical robotic tasks. In addition to evaluating tool accuracy, the tool is integrated into a proof-of-concept robotic system, which includes object detection, speech transcription, and speech synthesis to demonstrate the integration of multiple modalities in a collaborative application. Finally, a discussion over tool limitations and performance is provided to understand its role in multimodal robotic systems. All developments are available at: https://github.com/NMKsas/gesture_pointer.git.
中文摘要:本研究提出了一种基于RGB-D数据流、通过姿态估计和肩腕几何模型定位人机协作中指向目标的方法,并通过集成多模态的机器人系统进行了全面评估与验证。
English Summary: This study presents a method for localizing pointed targets in human-robot collaboration using pose estimation and geometric modeling from RGB-D data, with comprehensive evaluation and multimodal integration demonstrated through a robotic system.
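The geometric core of shoulder-wrist extension is a ray-plane intersection: extend the shoulder-to-wrist direction and intersect it with the planar workspace. A minimal sketch follows, with the keypoints and plane parameters as placeholders for what the RGB-D pose estimator would supply.
```python
import numpy as np

def pointed_target(shoulder, wrist, plane_point, plane_normal):
    """Intersect the shoulder->wrist ray with a plane (all in the same frame).
    Returns the 3D target point, or None if the ray is parallel to the plane
    or points away from it."""
    shoulder, wrist = np.asarray(shoulder, float), np.asarray(wrist, float)
    n = np.asarray(plane_normal, float)
    d = wrist - shoulder                       # pointing direction
    denom = n.dot(d)
    if abs(denom) < 1e-9:
        return None
    t = n.dot(np.asarray(plane_point, float) - shoulder) / denom
    if t <= 0:                                 # intersection behind the arm
        return None
    return shoulder + t * d

# toy example: table plane z = 0, arm pointing down and forward
print(pointed_target(shoulder=[0, 0, 1.4], wrist=[0.2, 0.1, 1.1],
                     plane_point=[0, 0, 0], plane_normal=[0, 0, 1]))
```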

Authors:Hyeongji Kim, Stine Hansen, Michael Kampffmeyer
Title: Tied Prototype Model for Few-Shot Medical Image Segmentation
Abstract:
Common prototype-based medical image few-shot segmentation (FSS) methods model foreground and background classes using class-specific prototypes. However, given the high variability of the background, a more promising direction is to focus solely on foreground modeling, treating the background as an anomaly -- an approach introduced by ADNet. Yet, ADNet faces three key limitations: dependence on a single prototype per class, a focus on binary classification, and fixed thresholds that fail to adapt to patient and organ variability. To address these shortcomings, we propose the Tied Prototype Model (TPM), a principled reformulation of ADNet with tied prototype locations for foreground and background distributions. Building on its probabilistic foundation, TPM naturally extends to multiple prototypes and multi-class segmentation while effectively separating non-typical background features. Notably, both extensions lead to improved segmentation accuracy. Finally, we leverage naturally occurring class priors to define an ideal target for adaptive thresholds, boosting segmentation performance. Taken together, TPM provides a fresh perspective on prototype-based FSS for medical image segmentation. The code can be found at https://github.com/hjk92g/TPM-FSS.
中文: TPM模型通过绑定前景和背景的原型位置,支持多类别分割和自适应阈值,有效解决了医学图像中因患者和器官差异导致的背景多变问题。
English: The Tied Prototype Model (TPM) improves upon ADNet by introducing tied prototypes for foreground and background, enabling multi-class segmentation and adaptive thresholds to address variability in medical images.

Authors:Zipei Ma, Junzhe Jiang, Yurui Chen, Li Zhang
Title: BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
Abstract:
The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in both dynamic and static scene components reconstruction and novel view synthesis.
中文: 本文提出BézierGS方法,通过可学习的贝塞尔曲线建模动态物体运动轨迹,无需依赖精确位姿标注即可实现精确街景重建,在自动驾驶基准测试中展现出优越性能。
English: This paper introduces BézierGS, a method that uses learnable Bézier curves to model dynamic object motion, enabling accurate street scene reconstruction without relying on precise pose annotations and demonstrating superior performance on autonomous driving benchmarks.
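Representing a trajectory with a learnable Bézier curve reduces to evaluating the Bernstein basis at a normalized timestamp and blending the control points. The sketch below evaluates cubic curves for a batch of Gaussian centers; treating the control points as an `nn.Parameter` is one natural way to make them learnable, though the actual BézierGS parameterization may differ.
```python
import torch
from math import comb

def bezier_points(control_points, t):
    """Evaluate Bézier curves at times t.
    control_points : (G, K+1, 3) control points per Gaussian
    t              : (T,) normalized timestamps in [0, 1]
    returns        : (G, T, 3) positions along each curve"""
    G, K1, _ = control_points.shape
    K = K1 - 1
    # Bernstein basis B_i^K(t) for each control-point index i
    basis = torch.stack([comb(K, i) * t.pow(i) * (1 - t).pow(K - i) for i in range(K1)], dim=0)  # (K+1, T)
    return torch.einsum('gkd,kt->gtd', control_points, basis)

# toy usage: cubic curves (4 control points) for 5 Gaussians, 8 timestamps
ctrl = torch.nn.Parameter(torch.randn(5, 4, 3))   # would receive gradients during splatting
pos = bezier_points(ctrl, torch.linspace(0, 1, 8))
print(pos.shape)   # torch.Size([5, 8, 3])
```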

Authors:Hong Nie, Fuyuan Cao, Lu Chen, Fengxin Chen, Yuefeng Zou, Jun Yu
Title: Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field
Abstract:
Reconstruction and rendering-based talking head synthesis methods achieve high-quality results with strong identity preservation but are limited by their dependence on identity-specific models. Each new identity requires training from scratch, incurring high computational costs and reduced scalability compared to generative model-based approaches. To overcome this limitation, we propose FIAG, a novel 3D talking head synthesis framework that enables efficient identity-specific adaptation using only a small amount of training footage. FIAG incorporates Global Gaussian Field, which supports the representation of multiple identities within a shared field, and Universal Motion Field, which captures the common motion dynamics across diverse identities. Benefiting from the shared facial structure information encoded in the Global Gaussian Field and the general motion priors learned in the motion field, our framework enables rapid adaptation from canonical identity representations to specific ones with minimal data. Extensive comparative and ablation experiments demonstrate that our method outperforms existing state-of-the-art approaches, validating both the effectiveness and generalizability of the proposed framework. Code is available at: \textit{https://github.com/gme-hong/FIAG}.
中文摘要:提出的FIAG框架通过共享的全局场和运动场,仅需少量训练数据即可实现高效身份适配,突破了传统说话头合成的局限性。
English Summary: The proposed FIAG framework overcomes the limitations of traditional talking head synthesis by enabling efficient identity adaptation with minimal training data through shared global and motion fields.

Authors:Han Wang, Shengyang Li, Jian Yang, Yuxuan Liu, Yixuan Lv, Zhuang Zhou
Title: Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method
Abstract:
Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. This approach ensures shorter re-imaging cycles and enables all-weather tracking. HOSS ReID dataset includes images of the same ship captured over extended periods under diverse conditions, using different satellites of different modalities at varying times and angles. Furthermore, we propose a baseline method for cross-modal ship re-identification, TransOSS, which is built on the Vision Transformer architecture. It refines the patch embedding structure to better accommodate cross-modal tasks, incorporates additional embeddings to introduce more reference information, and employs contrastive learning to pre-train on large-scale optical-SAR image pairs, ensuring the model's ability to extract modality-invariant features. Our dataset and baseline method are publicly available on https://github.com/Alioth2000/Hoss-ReID.
中文摘要:HOSS ReID数据集通过融合光学与合成孔径雷达卫星影像,解决了现有船舶追踪技术受天气和覆盖范围限制的问题,同时提出的TransOSS基准方法改进了视觉Transformer架构,利用对比学习提升跨模态船舶重识别能力。
English Summary: The HOSS ReID dataset addresses limitations in maritime ship tracking by combining optical and SAR satellite imagery for all-weather monitoring with shorter re-imaging cycles, while the TransOSS baseline method enhances cross-modal ship re-identification through modified Vision Transformer architecture and contrastive learning.
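The contrastive pre-training step on optical-SAR pairs can be illustrated with a standard symmetric InfoNCE loss, where row i of each embedding batch is assumed to come from the same ship. The embeddings below stand in for the outputs of the modified ViT encoder; the temperature and batch size are arbitrary.
```python
import torch
import torch.nn.functional as F

def info_nce(opt_emb, sar_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired optical/SAR embeddings."""
    opt_emb = F.normalize(opt_emb, dim=1)
    sar_emb = F.normalize(sar_emb, dim=1)
    logits = opt_emb @ sar_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(opt_emb.size(0), device=opt_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage with stand-in encoders producing 128-d embeddings
opt_emb = torch.randn(16, 128, requires_grad=True)
sar_emb = torch.randn(16, 128, requires_grad=True)
loss = info_nce(opt_emb, sar_emb)
loss.backward()
print(loss.item())
```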

Authors:Qi Gao, Zhihao Chen, Dong Zeng, Junping Zhang, Jianhua Ma, Hongming Shan
Title: Noise-Inspired Diffusion Model for Generalizable Low-Dose CT Reconstruction
Abstract:
The generalization of deep learning-based low-dose computed tomography (CT) reconstruction models to doses unseen in the training data is important and remains challenging. Previous efforts heavily rely on paired data to improve the generalization performance and robustness through collecting either diverse CT data for re-training or a few test data for fine-tuning. Recently, diffusion models have shown promising and generalizable performance in low-dose CT (LDCT) reconstruction; however, they may produce unrealistic structures due to the CT image noise deviating from Gaussian distribution and imprecise prior information from the guidance of noisy LDCT images. In this paper, we propose a noise-inspired diffusion model for generalizable LDCT reconstruction, termed NEED, which tailors diffusion models for noise characteristics of each domain. First, we propose a novel shifted Poisson diffusion model to denoise projection data, which aligns the diffusion process with the noise model in pre-log LDCT projections. Second, we devise a doubly guided diffusion model to refine reconstructed images, which leverages LDCT images and initial reconstructions to more accurately locate prior information and enhance reconstruction fidelity. By cascading these two diffusion models for dual-domain reconstruction, our NEED requires only normal-dose data for training and can be effectively extended to various unseen dose levels during testing via a time step matching strategy. Extensive qualitative, quantitative, and segmentation-based evaluations on two datasets demonstrate that our NEED consistently outperforms state-of-the-art methods in reconstruction and generalization performance. Source code is made available at https://github.com/qgao21/NEED.
中文摘要:本文提出NEED模型,一种针对低剂量CT重建的噪声启发扩散方法,通过移位泊松扩散处理投影数据噪声和双重引导扩散优化重建图像,仅需常规剂量训练数据即可泛化至不同未知剂量水平,并在重建性能上超越现有技术。
English Summary: The paper introduces NEED, a noise-inspired diffusion model for generalizable low-dose CT reconstruction that uses a shifted Poisson diffusion for projection denoising and a doubly guided diffusion for image refinement, requiring only normal-dose training data and outperforming existing methods across unseen dose levels.

Authors:Tianyu Zhang, Xin Luo, Li Li, Dong Liu
Title: StableCodec: Taming One-Step Diffusion for Extreme Image Compression
Abstract:
Diffusion-based image compression has shown remarkable potential for achieving ultra-low bitrate coding (less than 0.05 bits per pixel) with high realism, by leveraging the generative priors of large pre-trained text-to-image diffusion models. However, current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme bitrate constraints, limiting their application in real-time compression scenarios. Additionally, these methods often sacrifice reconstruction fidelity, as diffusion models typically fail to guarantee pixel-level consistency. To address these challenges, we introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression with improved coding efficiency. To achieve ultra-low bitrates, we first develop an efficient Deep Compression Latent Codec to transmit a noisy latent representation for a single-step denoising process. We then propose a Dual-Branch Coding Structure, consisting of a pair of auxiliary encoder and decoder, to enhance reconstruction fidelity. Furthermore, we adopt end-to-end optimization with joint bitrate and pixel-level constraints. Extensive experiments on the CLIC 2020, DIV2K, and Kodak datasets demonstrate that StableCodec outperforms existing methods in terms of FID, KID and DISTS by a significant margin, even at bitrates as low as 0.005 bits per pixel, while maintaining strong fidelity. Additionally, StableCodec achieves inference speeds comparable to mainstream transform coding schemes. All source code is available at https://github.com/LuizScarlet/StableCodec.
中文:StableCodec提出了一种单步扩散方法,用于极低码率下的图像压缩,在显著提升真实感和保真度的同时,实现了与主流编码方案相当的推理速度。
English: StableCodec introduces a one-step diffusion method for extreme image compression that enhances both realism and fidelity at ultra-low bitrates, outperforming existing techniques in speed and quality metrics.

Authors:Liu Yang, Huiyu Duan, Jiarui Wang, Jing Liu, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Title: Quality Assessment and Distortion-aware Saliency Prediction for AI-Generated Omnidirectional Images
Abstract:
With the rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques, AI generated images (AIGIs) have attracted widespread attention, among which AI generated omnidirectional images (AIGODIs) hold significant potential for Virtual Reality (VR) and Augmented Reality (AR) applications. AI generated omnidirectional images exhibit unique quality issues; however, research on the quality assessment and optimization of AI-generated omnidirectional images is still lacking. To this end, this work first studies the quality assessment and distortion-aware saliency prediction problems for AIGODIs, and further presents a corresponding optimization process. Specifically, we first establish a comprehensive database to reflect human feedback for AI-generated omnidirectional images, termed OHF2024, which includes both subjective quality ratings evaluated from three perspectives and distortion-aware salient regions. Based on the constructed OHF2024 database, we propose two models with shared encoders based on the BLIP-2 model to evaluate the human visual experience and predict distortion-aware saliency for AI-generated omnidirectional images, named BLIP2OIQA and BLIP2OISal, respectively. Finally, based on the proposed models, we present an automatic optimization process that utilizes the predicted visual experience scores and distortion regions to further enhance the visual quality of an AI-generated omnidirectional image. Extensive experiments show that our BLIP2OIQA model and BLIP2OISal model achieve state-of-the-art (SOTA) results in the human visual experience evaluation task and the distortion-aware saliency prediction task for AI generated omnidirectional images, and can be effectively used in the optimization process. The database and codes will be released on https://github.com/IntMeGroup/AIGCOIQA to facilitate future research.
中文: 本研究针对AI生成的全景图像,建立了OHF2024数据库并开发了BLIP2OIQA和BLIP2OISal模型,在人类视觉体验评估和失真感知显著性预测方面取得领先性能,有效提升了图像视觉质量优化效果。
English: This study addresses the quality assessment and optimization of AI-generated omnidirectional images (AIGODIs) by establishing the OHF2024 database and developing BLIP2OIQA and BLIP2OISal models, which achieve state-of-the-art performance in evaluating human visual experience and predicting distortion-aware saliency for enhanced visual quality.

Authors:Juming Xiong, Ruining Deng, Jialin Yue, Siqi Lu, Junlin Guo, Marilyn Lionts, Tianyuan Yao, Can Cui, Junchao Zhu, Chongyu Qu, Mengmeng Yin, Haichun Yang, Yuankai Huo
Title: ZeroReg3D: A Zero-shot Registration Pipeline for 3D Consecutive Histopathology Image Reconstruction
Abstract:
Histological analysis plays a crucial role in understanding tissue structure and pathology. While recent advancements in registration methods have improved 2D histological analysis, they often struggle to preserve critical 3D spatial relationships, limiting their utility in both clinical and research applications. Specifically, constructing accurate 3D models from 2D slices remains challenging due to tissue deformation, sectioning artifacts, variability in imaging techniques, and inconsistent illumination. Deep learning-based registration methods have demonstrated improved performance but suffer from limited generalizability and require large-scale training data. In contrast, non-deep-learning approaches offer better generalizability but often compromise on accuracy. In this study, we introduced ZeroReg3D, a novel zero-shot registration pipeline tailored for accurate 3D reconstruction from serial histological sections. By combining zero-shot deep learning-based keypoint matching with optimization-based affine and non-rigid registration techniques, ZeroReg3D effectively addresses critical challenges such as tissue deformation, sectioning artifacts, staining variability, and inconsistent illumination without requiring retraining or fine-tuning. The code has been made publicly available at https://github.com/hrlblab/ZeroReg3D
Chinese: 本研究提出ZeroReg3D这一零样本配准流程,结合深度学习关键点匹配与优化技术,无需重新训练即可解决组织变形和成像差异问题,实现从二维组织切片到三维模型的精准重建。
English: This study introduces ZeroReg3D, a zero-shot registration pipeline that combines deep learning keypoint matching with optimization techniques to accurately reconstruct 3D models from 2D histological sections while addressing tissue deformation and imaging inconsistencies without requiring retraining.

Authors:Justin Reinman, Sunwoong Choi
Title: CERBERUS: Crack Evaluation & Recognition Benchmark for Engineering Reliability & Urban Stability
Abstract:
CERBERUS is a synthetic benchmark designed to help train and evaluate AI models for detecting cracks and other defects in infrastructure. It includes a crack image generator and realistic 3D inspection scenarios built in Unity. The benchmark features two types of setups: a simple Fly-By wall inspection and a more complex Underpass scene with lighting and geometry challenges. We tested a popular object detection model (YOLO) using different combinations of synthetic and real crack data. Results show that combining synthetic and real data improves performance on real-world images. CERBERUS provides a flexible, repeatable way to test defect detection systems and supports future research in automated infrastructure inspection. CERBERUS is publicly available at https://github.com/justinreinman/Cerberus-Defect-Generator.
中文:CERBERUS是一个用于训练和评估基础设施缺陷检测AI模型的合成基准,包含裂缝图像生成器和3D场景,结合真实数据可提升性能,并已公开可用。
English: CERBERUS is a synthetic benchmark for training and evaluating AI models in infrastructure defect detection, featuring a crack image generator and 3D scenarios that improve performance when combined with real data, and it is publicly accessible.

Authors:Mingquan Liu
Title: RAUM-Net: Regional Attention and Uncertainty-aware Mamba Network
Abstract:
Fine-Grained Visual Categorization (FGVC) remains a challenging task in computer vision due to subtle inter-class differences and fragile feature representations. Existing methods struggle in fine-grained scenarios, especially when labeled data is scarce. We propose a semi-supervised method combining Mamba-based feature modeling, region attention, and Bayesian uncertainty. Our approach enhances local-to-global feature modeling while focusing on key areas during learning. Bayesian inference selects high-quality pseudo-labels for stability. Experiments show strong performance on FGVC benchmarks with occlusions, demonstrating robustness when labeled data is limited. Code is available at https://github.com/wxqnl/RAUM-Net.
中文: 本文提出一种半监督方法,结合Mamba特征建模、区域注意力和贝叶斯不确定性,通过增强局部到全局特征表示并筛选高质量伪标签,在标注数据有限的情况下显著提升了细粒度视觉分类的鲁棒性。
English: This paper introduces a semi-supervised method that integrates Mamba-based feature modeling, region attention, and Bayesian uncertainty to improve fine-grained visual categorization by enhancing local-to-global feature representation and selecting high-quality pseudo-labels, demonstrating robust performance on benchmarks with limited labeled data.
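One common way to realize "Bayesian inference selects high-quality pseudo labels" is Monte Carlo dropout: run several stochastic forward passes and keep only unlabeled samples whose predictions are both confident and low-variance. The thresholds and the MC-dropout choice below are assumptions used purely to illustrate the selection step, not the paper's exact procedure.
```python
import torch
import torch.nn as nn

def mc_dropout_pseudo_labels(model, x, passes=8, conf_thr=0.9, var_thr=0.05):
    """Run several stochastic forward passes (dropout active) and keep samples whose
    mean confidence is high and predictive variance is low."""
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(passes)])
    mean_p = probs.mean(dim=0)                     # (N, C)
    var_p = probs.var(dim=0).mean(dim=1)           # (N,) average per-class variance
    conf, pseudo = mean_p.max(dim=1)
    keep = (conf > conf_thr) & (var_p < var_thr)
    return pseudo[keep], keep

# toy usage with a dropout-equipped classifier
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10))
labels, mask = mc_dropout_pseudo_labels(net, torch.randn(100, 32))
print(int(mask.sum()), "pseudo-labeled samples kept")
```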

Authors:Umihiro Kamoto, Tatsuya Ishibashi, Noriyuki Kugo
Title: DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025
Abstract:
In this report, we present the winning solution that achieved the 1st place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025. This challenge evaluates the ability to generate accurate natural language answers to questions about diverse, real-world video clips. It uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE (Deep-search Iterative Video Exploration), adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference. This enables our system to provide highly accurate and contextually appropriate answers to even the most complex queries. Applied to the CVRR-ES benchmark, our approach achieves 81.44% accuracy on the test set, securing the top position among all participants. This report details our methodology and provides a comprehensive analysis of the experimental results, demonstrating the effectiveness of our iterative reasoning framework in achieving robust video question answering. The code is available at https://github.com/PanasonicConnect/DIVE
中文摘要:我们的DIVE方法在2025年复杂视频推理挑战赛中荣获第一,通过迭代推理对复杂视频问题进行逐步分解,在CVRR-ES基准测试中实现了81.44%的准确率。
English Summary: Our DIVE method won first place in the 2025 Complex Video Reasoning Challenge by using iterative reasoning to achieve 81.44% accuracy on the CVRR-ES benchmark through stepwise decomposition of complex video questions.

Authors:Yuansheng Li, Yunhao Zou, Linwei Chen, Ying Fu
Title: Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer
Abstract:
Interferometric Hyperspectral Imaging (IHI) is a critical technique for large-scale remote sensing tasks due to its advantages in flux and spectral resolution. However, IHI is susceptible to complex errors arising from imaging steps, and its quality is limited by existing signal processing-based reconstruction algorithms. Two key challenges hinder performance enhancement: 1) the lack of training datasets; 2) the difficulty in eliminating IHI-specific degradation components through learning-based methods. To address these challenges, we propose a novel IHI reconstruction pipeline. First, based on imaging physics and radiometric calibration data, we establish a simplified yet accurate IHI degradation model and a parameter estimation method. This model enables the synthesis of realistic IHI training datasets from hyperspectral images (HSIs), bridging the gap between IHI reconstruction and deep learning. Second, we design the Interferometric Hyperspectral Reconstruction Unfolding Transformer (IHRUT), which achieves effective spectral correction and detail restoration through a stripe-pattern enhancement mechanism and a spatial-spectral transformer architecture. Experimental results demonstrate the superior performance and generalization capability of our method. The code is available at https://github.com/bit1120203554/IHRUT.
中文摘要:本研究提出了一种新型干涉高光谱成像重建流程,通过建立基于物理的精确退化模型生成训练数据,并设计专用的IHRUT变换器网络,有效解决了训练数据匮乏问题,实现了卓越的光谱空间重建性能。
English Summary: This study introduces a novel Interferometric Hyperspectral Imaging (IHI) reconstruction pipeline that combines a physics-based degradation model for realistic dataset synthesis with a specialized transformer network (IHRUT) to overcome training data limitations and achieve enhanced spectral-spatial reconstruction.

Authors:Yanguang Sun, Jiexi Yan, Jianjun Qian, Chunyan Xu, Jian Yang, Lei Luo
Title: Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images
Abstract:
Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is valuable research, but it presents several challenges, including the heterogeneity between the two types of features, high complexity, and large parameters of the model. However, these issues are often overlooked in existing ORSI methods, causing sub-optimal segmentation. To address this, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design the global-local mixed attention, which captures diverse information through two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase the expressive ability. Additionally, we construct a DPU-Former decoder to aggregate and strengthen features at different layers. Consequently, the DPU-Former model outperforms the state-of-the-art methods on multiple datasets. Code: https://github.com/CSYSI/DPU-Former.
中文: 提出的DPU-Former模型通过全局-局部混合注意力等创新组件,有效融合了卷积和Transformer特征的远程依赖与空间细节,在光学遥感图像分割任务中实现了最优性能。
English: The proposed DPU-Former model effectively integrates long-range dependencies and spatial details from both convolutional and Transformer features through innovative components like global-local mixed attention and a gated feed-forward network, achieving superior segmentation performance on optical remote sensing images.

Authors:Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
Title: LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Abstract:
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multi-choices benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
Chinese: 本文提出LLaVA-Scissor,一种无需训练的视频多模态令牌压缩方法,通过语义连通组件实现全面语义覆盖,在多种视频理解基准测试中表现出优越性能。
English: This paper introduces LLaVA-Scissor, a training-free token compression method for video multimodal models that uses Semantic Connected Components to achieve comprehensive semantic coverage and superior performance across video understanding benchmarks.
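One way to read "Semantic Connected Components" operationally: build a graph over visual tokens by thresholding cosine similarity, take its connected components, and keep one representative token per component. The similarity threshold and the mean-pooled representative below are illustrative choices, not the paper's exact algorithm.
```python
import torch
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def compress_tokens(tokens, sim_threshold=0.8):
    """Group visual tokens into connected components of a cosine-similarity graph
    and represent each component by its mean token.
    tokens : (N, D) visual tokens for one frame (or one temporal group)"""
    x = torch.nn.functional.normalize(tokens, dim=1)
    adj = (x @ x.t() >= sim_threshold).cpu().numpy()
    n_comp, labels = connected_components(csr_matrix(adj), directed=False)
    labels = torch.as_tensor(labels)
    reps = torch.stack([tokens[labels == c].mean(dim=0) for c in range(n_comp)])
    return reps, labels

# toy usage: 196 patch tokens compressed to one representative per component
toks = torch.randn(196, 64)
reps, comp = compress_tokens(toks, sim_threshold=0.6)
print(toks.shape, "->", reps.shape)
```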

Authors:Jiho Choi, Sang Jun Lee
Title: Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation
Abstract:
In this paper, we propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. The proposed framework employs the video masked autoencoder to learn a high-dimensional spatio-temporal representation of the facial region through self-supervised learning. Capturing quasi-periodic signals in the video is crucial for remote photoplethysmography (rPPG) estimation. To account for signal periodicity, we apply frame masking in terms of video sampling, which allows the model to capture resampled quasi-periodic signals during the pre-training stage. Moreover, the framework incorporates physiological bandlimit constraints, leveraging the property that physiological signals are sparse within their frequency bandwidth to provide pulse cues to the model. The pre-trained encoder is then transferred to the rPPG task, where it is used to extract physiological signals from facial videos. We evaluate the proposed method through extensive experiments on the PURE, UBFC-rPPG, MMPD, and V4V datasets. Our results demonstrate significant performance improvements, particularly in challenging cross-dataset evaluations. Our code is available at https://github.com/ziiho08/Periodic-MAE.
中文: 本文提出一种通过视频掩码自编码器的自监督框架,从无标签面部视频中学习周期性生理信号,结合生理频带约束与帧掩码技术,在跨数据集rPPG评估中取得显著性能提升。
English: This paper introduces a self-supervised framework using video masked autoencoders to learn periodic facial signals from unlabeled videos, achieving superior cross-dataset rPPG estimation performance through physiological constraints and frame masking techniques.

Authors:Rafael Sterzinger, Marco Peer, Robert Sablatnig
Title: Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models
Abstract:
As rich sources of history, maps provide crucial insights into historical changes, yet their diverse visual representations and limited annotated data pose significant challenges for automated processing. We propose a simple yet effective approach for few-shot segmentation of historical maps, leveraging the rich semantic embeddings of large vision foundation models combined with parameter-efficient fine-tuning. Our method outperforms the state-of-the-art on the Siegfried benchmark dataset in vineyard and railway segmentation, achieving +5% and +13% relative improvements in mIoU in 10-shot scenarios and around +20% in the more challenging 5-shot setting. Additionally, it demonstrates strong performance on the ICDAR 2021 competition dataset, attaining a mean PQ of 67.3% for building block segmentation, despite not being optimized for this shape-sensitive metric, underscoring its generalizability. Notably, our approach maintains high performance even in extremely low-data regimes (10- & 5-shot), while requiring only 689k trainable parameters - just 0.21% of the total model size. Our approach enables precise segmentation of diverse historical maps while drastically reducing the need for manual annotations, advancing automated processing and analysis in the field. Our implementation is publicly available at: https://github.com/RafaelSterzinger/few-shot-map-segmentation.
Chinese: 本研究提出了一种简单有效的历史地图少样本分割方法,通过结合大型视觉基础模型与参数高效微调,在基准数据集上实现了最优性能,同时大幅减少了对标注数据和可训练参数的需求。
English: This study introduces a simple yet effective few-shot segmentation method for historical maps by combining large vision foundation models with parameter-efficient fine-tuning, achieving state-of-the-art performance on benchmark datasets while requiring minimal annotated data and trainable parameters.
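The essence of linear probing for segmentation is to freeze the backbone and train only a light per-patch linear head (a 1x1 convolution), upsampling its logits to the mask resolution. The frozen convolutional stub below stands in for the large pretrained vision foundation model so the sketch stays self-contained and runnable; it is not the paper's training setup.
```python
import torch
import torch.nn as nn

# stand-in frozen backbone; in practice this would be a large pretrained foundation model
backbone = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
                         nn.Conv2d(64, 256, 3, stride=2, padding=1))
for p in backbone.parameters():
    p.requires_grad = False               # only the probe below is trained

num_classes = 2                            # e.g. vineyard vs. background
probe = nn.Conv2d(256, num_classes, kernel_size=1)   # 1x1 conv acts as a per-patch linear probe
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

imgs = torch.rand(4, 3, 256, 256)          # a tiny few-shot batch of map tiles
masks = torch.randint(0, num_classes, (4, 256, 256))

feats = backbone(imgs)                     # frozen features at reduced resolution
logits = nn.functional.interpolate(probe(feats), size=masks.shape[-2:],
                                   mode='bilinear', align_corners=False)
loss = nn.functional.cross_entropy(logits, masks)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```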

Authors:Remco F. Leijenaar, Hamidreza Kasaei
Title: Asymmetric Dual Self-Distillation for 3D Self-Supervised Representation Learning
Abstract:
Learning semantically meaningful representations from unstructured 3D point clouds remains a central challenge in computer vision, especially in the absence of large-scale labeled datasets. While masked point modeling (MPM) is widely used in self-supervised 3D learning, its reconstruction-based objective can limit its ability to capture high-level semantics. We propose AsymDSD, an Asymmetric Dual Self-Distillation framework that unifies masked modeling and invariance learning through prediction in the latent space rather than the input space. AsymDSD builds on a joint embedding architecture and introduces several key design choices: an efficient asymmetric setup, disabling attention between masked queries to prevent shape leakage, multi-mask sampling, and a point cloud adaptation of multi-crop. AsymDSD achieves state-of-the-art results on ScanObjectNN (90.53%) and further improves to 93.72% when pretrained on 930k shapes, surpassing prior methods.
中文:AsymDSD框架通过潜在空间预测整合掩码建模与不变性学习,在ScanObjectNN上实现90.53%的领先性能,经93万形状预训练后进一步提升至93.72%。
English: The AsymDSD framework integrates masked modeling and invariance learning through latent space prediction, achieving state-of-the-art performance on ScanObjectNN with 90.53% accuracy and 93.72% when pretrained on extensive data.

Authors:Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao
Title: APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization
Abstract:
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While Reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or "overthinking" reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model's existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model's explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7\% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at https://github.com/Indolent-Kawhi/View-R1.
中文: 本研究提出非对称策略优化(APO)方法,结合难度自适应散度塑形(DADS)和次优轨迹复杂度正则化(STCR),有效提升了多模态大语言模型的推理能力,在保持泛化性能的同时显著优于基准模型及更大规模模型。
English: This research introduces Asymmetric Policy Optimization (APO) with Difficulty-Adaptive Divergence Shaping (DADS) and Suboptimal Trajectory Complexity Regularization (STCR) to enhance multimodal reasoning in MLLMs, achieving significant performance gains while maintaining generalization across tasks.
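The two ingredients lend themselves to a small numerical sketch. The abstract does not give the exact shaping functions, so the forms below (a linear scaling of the KL weight by a difficulty score, and a linear length penalty for negative samples) are assumptions for illustration only.

```python
import torch

def dads_kl_weight(difficulty, base_weight=0.05, min_w=0.01, max_w=0.2):
    """Difficulty-Adaptive Divergence Shaping, illustrative form: scale the
    KL-penalty weight of positive samples by a per-prompt difficulty score
    in [0, 1] (e.g. one minus the empirical pass rate)."""
    return (base_weight * (1.0 + difficulty)).clamp(min_w, max_w)

def stcr_length_penalty(lengths, ref_length, coeff=1e-3):
    """Suboptimal Trajectory Complexity Regularization, illustrative form:
    penalize negative responses whose token length exceeds a reference
    length, discouraging overthinking."""
    return coeff * (lengths.float() - ref_length).clamp(min=0)

difficulty = torch.tensor([0.1, 0.9])   # easy vs. hard prompt
lengths = torch.tensor([180, 900])      # concise vs. overly long response
print(dads_kl_weight(difficulty), stcr_length_penalty(lengths, ref_length=400))
```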

Authors:Haiping Yang, Huaxing Liu, Wei Wu, Zuohui Chen, Ning Wu
Title: AeroLite-MDNet: Lightweight Multi-task Deviation Detection Network for UAV Landing
Abstract:
Unmanned aerial vehicles (UAVs) are increasingly employed in diverse applications such as land surveying, material transport, and environmental monitoring. Following missions like data collection or inspection, UAVs must land safely at docking stations for storage or recharging, which is an essential requirement for ensuring operational continuity. However, accurate landing remains challenging due to factors like GPS signal interference. To address this issue, we propose a deviation warning system for UAV landings, powered by a novel vision-based model called AeroLite-MDNet. This model integrates a multiscale fusion module for robust cross-scale object detection and incorporates a segmentation branch for efficient orientation estimation. We introduce a new evaluation metric, Average Warning Delay (AWD), to quantify the system's sensitivity to landing deviations. Furthermore, we contribute a new dataset, UAVLandData, which captures real-world landing deviation scenarios to support training and evaluation. Experimental results show that our system achieves an AWD of 0.7 seconds with a deviation detection accuracy of 98.6\%, demonstrating its effectiveness in enhancing UAV landing reliability. Code will be available at https://github.com/ITTTTTI/Maskyolo.git
中文:本文提出的AeroLite-MDNet系统通过基于视觉的偏差预警机制提升无人机着陆安全性,该系统以98.6%的检测精度和仅0.7秒的延迟实现精准预警,并配套开发了新型评估指标与专用数据集。
English: The proposed AeroLite-MDNet system enhances UAV landing safety through a vision-based deviation warning system that achieves 98.6% detection accuracy with only 0.7-second delay, supported by a new evaluation metric and dataset.
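The abstract introduces Average Warning Delay (AWD) without a formula; a plausible reading, sketched below, is the mean time from each ground-truth deviation onset to the first warning raised at or after it. How missed deviations are scored is not stated, so they are simply skipped here.

```python
def average_warning_delay(onsets, warnings):
    """Assumed form of AWD: mean delay between each deviation onset and the
    first warning at or after it; deviations with no warning are skipped.

    onsets:   list of deviation onset timestamps (seconds)
    warnings: sorted list of warning timestamps (seconds)
    """
    delays = []
    for t0 in onsets:
        later = [t for t in warnings if t >= t0]
        if later:
            delays.append(later[0] - t0)
    return sum(delays) / len(delays) if delays else float("inf")

# Toy example: warnings raised 0.5 s and 0.9 s after two onsets -> AWD ≈ 0.7 s
print(average_warning_delay([10.0, 42.0], [10.5, 42.9]))
```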

Authors:Yixin Sun, Li Li, Wenke E, Amir Atapour-Abarghouei, Toby P. Breckon
Title: TOMD: A Trail-based Off-road Multimodal Dataset for Traversable Pathway Segmentation under Challenging Illumination Conditions
Abstract:
Detecting traversable pathways in unstructured outdoor environments remains a significant challenge for autonomous robots, especially in critical applications such as wide-area search and rescue, as well as incident management scenarios like forest fires. Existing datasets and models primarily target urban settings or wide, vehicle-traversable off-road tracks, leaving a substantial gap in addressing the complexity of narrow, trail-like off-road scenarios. To address this, we introduce the Trail-based Off-road Multimodal Dataset (TOMD), a comprehensive dataset specifically designed for such environments. TOMD features high-fidelity multimodal sensor data -- including 128-channel LiDAR, stereo imagery, GNSS, IMU, and illumination measurements -- collected through repeated traversals under diverse conditions. We also propose a dynamic multiscale data fusion model for accurate traversable pathway prediction. The study analyzes the performance of early, cross, and mixed fusion strategies under varying illumination levels. Results demonstrate the effectiveness of our approach and the relevance of illumination in segmentation performance. We publicly release TOMD at https://github.com/yyyxs1125/TMOD to support future research in trail-based off-road navigation.
中文摘要:本研究针对非结构化户外环境中可通行路径检测的难题,推出了专门设计的越野小径多模态数据集TOMD和动态多尺度融合模型,通过不同光照条件下的实验验证了方法的有效性。
English Summary: The study introduces the Trail-based Off-road Multimodal Dataset (TOMD) and a dynamic multiscale fusion model to address autonomous robots' challenges in detecting traversable pathways in unstructured outdoor environments, demonstrating improved performance under varying illumination conditions.

Authors:Chenhao Zhang, Yezhi Shen, Fengqing Zhu
Title: ICP-3DGS: SfM-free 3D Gaussian Splatting for Large-scale Unbounded Scenes
Abstract:
In recent years, neural rendering methods such as NeRFs and 3D Gaussian Splatting (3DGS) have made significant progress in scene reconstruction and novel view synthesis. However, they heavily rely on preprocessed camera poses and 3D structural priors from structure-from-motion (SfM), which are challenging to obtain in outdoor scenarios. To address this challenge, we propose to incorporate Iterative Closest Point (ICP) with optimization-based refinement to achieve accurate camera pose estimation under large camera movements. Additionally, we introduce a voxel-based scene densification approach to guide the reconstruction in large-scale scenes. Experiments demonstrate that our approach ICP-3DGS outperforms existing methods in both camera pose estimation and novel view synthesis across indoor and outdoor scenes of various scales. Source code is available at https://github.com/Chenhao-Z/ICP-3DGS.
Chinese: 提出的ICP-3DGS方法结合迭代最近点算法与基于优化的位姿细化及体素密度化技术,无需预处理相机位姿即可实现更优的相机定位和新视角合成,在各类室内外场景中均超越现有方法。
English: The proposed ICP-3DGS method integrates Iterative Closest Point with optimization-based pose refinement and voxel-based densification to achieve superior camera pose estimation and novel view synthesis without relying on preprocessed camera poses, outperforming existing approaches across diverse indoor and outdoor scenes.
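For readers unfamiliar with the ICP component, here is a generic point-to-point ICP sketch (nearest-neighbour matching plus the Kabsch/SVD rigid-alignment step); ICP-3DGS pairs ICP with additional optimization-based refinement, which is not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(src, dst, iters=50, tol=1e-6):
    """Plain point-to-point ICP: alternately match nearest neighbours and
    solve the best rigid transform with the Kabsch/SVD solution.

    src, dst: (N, 3) and (M, 3) point clouds; returns rotation R and translation t.
    """
    R, t = np.eye(3), np.zeros(3)
    prev_err = np.inf
    tree = cKDTree(dst)
    cur = src.copy()
    for _ in range(iters):
        dist, idx = tree.query(cur)
        matched = dst[idx]
        # Kabsch: optimal rotation/translation between the matched sets
        mu_s, mu_d = cur.mean(0), matched.mean(0)
        H = (cur - mu_s).T @ (matched - mu_d)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R_step = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t_step = mu_d - R_step @ mu_s
        cur = cur @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
        err = dist.mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t
```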

Authors:Varun Mannam, Fang Wang, Xin Chen
Title: Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding
Abstract:
Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that optimal modality weighting with weights of 30% text, 15% image, 25% caption, and 30% OCR improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction, a vital consideration for reliable enterprise AI. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
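The reported optimum is a simple late-fusion weighting, which can be written out directly; how each per-modality relevance score is produced is left abstract here and is an assumption of this sketch.

```python
def fuse_modality_scores(text, image, caption, ocr,
                         weights=(0.30, 0.15, 0.25, 0.30)):
    """Late fusion of per-modality relevance scores with the weighting the
    abstract reports as optimal (30% text, 15% image, 25% caption, 30% OCR)."""
    wt, wi, wc, wo = weights
    return wt * text + wi * image + wc * caption + wo * ocr

# Example: one candidate page scored per modality on a 0-1 scale
print(fuse_modality_scores(text=0.8, image=0.4, caption=0.6, ocr=0.9))
```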

Authors:Duong Bach
Title: Hierarchical Patch Compression for ColPali: Efficient Multi-Vector Document Retrieval with Dynamic Pruning and Quantization
Abstract:
Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs due to their reliance on high-dimensional patch embeddings and late-interaction scoring. To address these challenges, we propose HPC-ColPali, a Hierarchical Patch Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy. Our approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embeddings into 1-byte centroid indices, achieving up to 32$\times$ storage reduction; (2) attention-guided dynamic pruning, utilizing Vision-Language Model attention weights to retain only the top-$p\%$ most salient patches, reducing late-interaction computation by up to 60\% with less than 2\% nDCG@10 loss; and (3) optional binary encoding of centroid indices into $b$-bit strings ($b=\lceil\log_2 K\rceil$), enabling rapid Hamming distance-based similarity search for resource-constrained environments. Evaluated on the ViDoRe and SEC-Filings datasets, HPC-ColPali achieves 30--50\% lower query latency under HNSW indexing while maintaining high retrieval precision. When integrated into a Retrieval-Augmented Generation pipeline for legal summarization, it reduces hallucination rates by 30\% and halves end-to-end latency. These advancements establish HPC-ColPali as a scalable and efficient solution for multi-vector document retrieval across diverse applications. Code is available at https://github.com/DngBack/HPC-ColPali.
Chinese: HPC-ColPali提出了一种分层补丁压缩框架,通过量化、剪枝和二进制编码技术,在保持高精度的同时显著降低了多向量文档检索的存储和计算成本。
English: HPC-ColPali introduces a hierarchical patch compression framework that significantly reduces storage and computational costs for multi-vector document retrieval while maintaining high accuracy through quantization, pruning, and binary encoding techniques.
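A rough sketch of stages (1) and (2), quantization to 1-byte centroid indices and attention-guided top-p% pruning, using scikit-learn K-Means; the attention weights are assumed to be precomputed, and this is not the released HPC-ColPali code.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_patches(patch_emb, k=256):
    """K-Means quantization: replace each patch embedding with a 1-byte
    centroid index (k <= 256 keeps indices within uint8)."""
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(patch_emb)
    return km.labels_.astype(np.uint8), km.cluster_centers_

def prune_by_attention(codes, attn, keep_percent=40.0):
    """Attention-guided dynamic pruning: keep only the top-p% most salient
    patches according to (assumed precomputed) VLM attention weights."""
    k = max(1, int(len(codes) * keep_percent / 100.0))
    keep = np.argsort(attn)[::-1][:k]
    return codes[np.sort(keep)]

# Toy usage: 1024 patches of dimension 128 with random attention scores
emb = np.random.randn(1024, 128)
attn = np.random.rand(1024)
codes, centroids = quantize_patches(emb, k=256)
print(prune_by_attention(codes, attn, keep_percent=40.0).shape)
```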

Authors:Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
Title: Whole-Body Conditioned Egocentric Video Prediction
Abstract:
We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.

Authors:Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
Title: SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Abstract:
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
中文:SiM3D是首个融合多视角和多模态信息的3D异常检测与分割基准,专注于单实例训练场景下的合成数据到真实数据泛化挑战,并提供工业级高质量数据集及基线评估方法。
English: SiM3D is the first benchmark integrating multiview and multimodal data for 3D anomaly detection and segmentation, addressing single-instance training with synthetic-to-real generalization and providing high-quality industrial data with baseline evaluations.

Authors:Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
Title: HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Abstract:
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.

Authors:Yansong Qu, Shaohui Dai, Xinyang Li, Yuze Wang, You Shen, Liujuan Cao, Rongrong Ji
Title: DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion
Abstract:
Reconstructing 3D objects from a single image remains challenging, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they typically assume fully visible inputs and fail when parts of the object are occluded, resulting in degraded 3D reconstruction quality. We propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation that synthesizes six structurally consistent novel views directly from a single occluded image, enabling reliable 3D reconstruction without prior inpainting or manual annotations. Our self-supervised training pipeline leverages occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, covering diverse occlusion levels, object categories, and masking patterns, providing a standardized protocol for future evaluation.
中文: DeOcc-1-to-3是一个端到端框架,能从单张遮挡图像生成六个结构一致的新视角,通过自监督训练实现可靠的3D重建,无需预先修复或人工标注。
English: DeOcc-1-to-3 is an end-to-end framework that generates six structurally consistent novel views from a single occluded image, enabling reliable 3D reconstruction through self-supervised training without requiring prior inpainting or manual annotations.

Authors:Chuxin Wang, Yixin Zha, Wenfei Yang, Tianzhu Zhang
Title: StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
Abstract:
Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.
中文摘要:StruMamba3D通过引入空间状态和自适应策略,解决了Mamba方法在点云学习中破坏空间邻接性和长序列记忆的问题,在ModelNet40和ScanObjectNN基准测试中取得了最优性能。
English Summary: StruMamba3D addresses limitations in Mamba-based point cloud learning by preserving spatial dependencies through spatial states and adaptive strategies, achieving state-of-the-art accuracy on benchmarks like ModelNet40 and ScanObjectNN.

Authors:Mohammed Baharoon, Jun Ma, Congyu Fang, Augustin Toma, Bo Wang
Title: Exploring the Design Space of 3D MLLMs for CT Report Generation
Abstract:
Multimodal Large Language Models (MLLMs) have emerged as a promising way to automate Radiology Report Generation (RRG). In this work, we systematically investigate the design space of 3D MLLMs, including visual input representation, projectors, Large Language Models (LLMs), and fine-tuning techniques for 3D CT report generation. We also introduce two knowledge-based report augmentation methods that improve performance on the GREEN score by up to 10%, achieving the 2nd place on the MICCAI 2024 AMOS-MM challenge. Our results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely independent of the size of LLM under the same training protocol. We also show that larger volume size does not always improve performance if the original ViT was pre-trained on a smaller volume size. Lastly, we show that using a segmentation mask along with the CT volume improves performance. The code is publicly available at https://github.com/bowang-lab/AMOS-MM-Solution
中文摘要:本研究系统探索了用于自动化CT报告生成的3D多模态大语言模型设计,提出的基于知识的报告增强方法在MICCAI 2024挑战赛中荣获第二名,并证明模型性能更取决于训练方案而非大语言模型规模。
English Summary: This study systematically explores the design of 3D multimodal large language models for automated CT report generation, introducing knowledge-based augmentation methods that achieved second place in a MICCAI 2024 challenge while demonstrating that model performance depends more on training protocols than LLM size.

Authors:Yihan Wang, Jia Deng
Title: WAFT: Warping-Alone Field Transforms for Optical Flow
Abstract:
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at https://github.com/princeton-vl/WAFT.
中文: WAFT是一种创新的光流方法,通过用高分辨率扭曲替代代价体积,在降低内存占用的同时实现了最优的基准测试性能,且速度比同类方法更快。
English: WAFT is a novel optical flow method that replaces cost volumes with high-resolution warping, achieving top benchmark performance with lower memory usage and faster speed than comparable approaches.
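The warping operation that replaces the cost volume is a standard flow-based bilinear warp; a generic PyTorch version is sketched below. This is the building block the abstract refers to, not WAFT's full meta-architecture.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a feature map by an optical-flow field with bilinear sampling.

    feat: (B, C, H, W) features from frame 2
    flow: (B, 2, H, W) flow from frame 1 to frame 2, in pixels
    """
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat, grid, align_corners=True)

# Zero flow leaves the features (approximately) unchanged
warped = warp_features(torch.randn(1, 8, 32, 32), torch.zeros(1, 2, 32, 32))
```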

Authors:Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna, Dmitry Baranchuk
Title: MADrive: Memory-Augmented Driving Scene Modeling
Abstract:
Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ${\sim}70$K 360° car videos captured in the wild and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/

Authors:Mohammed Rakib, Arunkumar Bagavathi
Title: G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation
Abstract:
Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.
Chinese: 本文提出梯度引导蒸馏(G²D)框架,通过定制化损失函数和动态顺序模态优先级技术解决多模态学习中的模态不平衡问题,有效增强弱模态并提升分类和回归任务的性能。
English: The paper introduces Gradient-Guided Distillation (G²D), a knowledge distillation framework that addresses modality imbalance in multimodal learning by using a custom loss function and dynamic sequential modality prioritization to enhance weak modalities and improve performance across tasks.
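A hedged sketch of a loss that fuses multimodal and unimodal objectives in the spirit of the custom loss described above; the per-modality weights stand in for whatever the sequential modality prioritization schedule would assign, and the gradient-guided distillation itself is not reproduced.

```python
import torch.nn.functional as F

def fused_multimodal_loss(mm_logits, uni_logits, targets, uni_weights):
    """Combine a multimodal objective with per-modality (unimodal) objectives.

    mm_logits:   (B, C) fused multimodal predictions
    uni_logits:  dict of modality name -> (B, C) unimodal predictions
    uni_weights: dict of modality name -> scalar weight, e.g. set by a
                 sequential modality prioritization schedule (assumed here)
    """
    loss = F.cross_entropy(mm_logits, targets)
    for name, logits in uni_logits.items():
        loss = loss + uni_weights.get(name, 1.0) * F.cross_entropy(logits, targets)
    return loss
```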

Authors:Wentao Hu, Shunkai Li, Ziqiao Peng, Haoxian Zhang, Fan Shi, Xiaoqiang Liu, Pengfei Wan, Di Zhang, Hui Tian
Title: GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation
Abstract:
Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.
中文: GGTalker通过两阶段的先验适应训练策略,结合通用先验和身份特定适应,有效解决了大角度头部旋转和分布外音频的挑战,合成了具有高质量和真实感的3D说话头部。
English: GGTalker overcomes the limitations of previous methods by using a two-stage training strategy with generalizable priors and identity-specific adaptation to synthesize high-quality, photorealistic 3D talking heads that handle large head rotations and out-of-distribution audio effectively.

Authors:Srikumar Sastry, Aayush Dhakal, Eric Xing, Subash Khanal, Nathan Jacobs
Title: Global and Local Entailment Learning for Natural World Imagery
Abstract:
Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.

Authors:Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Rutger H. J. Fick, Thomas Conrad, Jonas Ammeling, Nils Porsche, Robert Klopfleisch, Christopher Kaltenecker, Katharina Breininger, Marc Aubreville, Christof A. Bertram
Title: Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation
Abstract:
Atypical mitosis marks a deviation in the cell division process that has been shown to be an independent prognostic marker for tumor malignancy. However, atypical mitosis classification remains challenging due to low prevalence, at times subtle morphological differences from normal mitotic figures, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including end-to-end trained deep learning models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new held-out AMF datasets - AtNorM-Br, a dataset of mitotic figures from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitotic figures from a subset of the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively. Our work shows that atypical mitotic figure classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. We make all code and data used in this paper available in this GitHub repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
中文摘要:本研究针对乳腺癌非典型有丝分裂图像分类难题,通过迁移学习和微调技术建立了深度学习基准方法,在引入新数据集验证下取得了较高分类准确率。
English Summary: This study benchmarks deep learning methods for classifying atypical mitotic figures in breast cancer, achieving high accuracy through transfer learning and fine-tuning techniques while introducing new datasets for rigorous evaluation.
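One of the compared settings, linear probing of a frozen foundation model, reduces to a small scikit-learn recipe; the features, labels, and classifier choice below are illustrative, with balanced accuracy as the metric quoted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Linear probing of frozen features: fit a class-balanced logistic
    regression and report balanced accuracy."""
    clf = LogisticRegression(max_iter=2000, class_weight="balanced")
    clf.fit(train_feats, train_labels)
    return balanced_accuracy_score(test_labels, clf.predict(test_feats))

# Toy usage with random 512-d features for a binary (atypical vs. normal) task
rng = np.random.default_rng(0)
acc = linear_probe(rng.normal(size=(200, 512)), rng.integers(0, 2, 200),
                   rng.normal(size=(50, 512)), rng.integers(0, 2, 50))
print(f"balanced accuracy: {acc:.3f}")
```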

Authors:Samuel Joutard, Marijn Stollenga, Marc Balle Sanchez, Mohammad Farid Azampour, Raphael Prevost
Title: HyperSORT: Self-Organising Robust Training with hyper-networks
Abstract:
Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact deep segmentation networks' performance. Yet, identifying and characterizing such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets' parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT
中文: HyperSORT提出了一种超网络框架,通过学习UNet参数的分布来识别和表征数据集中的偏差,从而在医学影像中实现稳健的分割和系统性错误检测。
English: HyperSORT introduces a hyper-network framework that learns a distribution of UNet parameters to identify and characterize dataset biases, enabling robust segmentation and systematic error detection in medical imaging.
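To make the hyper-network idea concrete, here is a toy PyTorch version in which a per-sample latent vector generates the weights of a single linear layer; HyperSORT predicts full UNet parameters and learns the latents jointly with the hyper-network, which is beyond this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHyperNet(nn.Module):
    """Toy hyper-network: map a per-sample latent vector to the weights of a
    small target network (a single linear layer here, purely for illustration)."""

    def __init__(self, latent_dim=16, in_dim=8, out_dim=4):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.to_weights = nn.Linear(latent_dim, in_dim * out_dim + out_dim)

    def forward(self, z, x):
        # z: (latent_dim,) latent code for one training sample
        # x: (B, in_dim) input to the generated target network
        params = self.to_weights(z)
        w = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim:]
        return F.linear(x, w, b)

hyper = TinyHyperNet()
print(hyper(torch.randn(16), torch.randn(2, 8)).shape)  # torch.Size([2, 4])
```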

Authors:Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
Title: XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Abstract:
Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control for specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
中文:提出的XVerse模型通过将参考图像转换为文本流偏移量,实现了对文本到图像生成中多主体的精确独立控制,在不破坏图像特征的情况下完成高保真且可编辑的多主体合成。
English: The proposed XVerse model enables precise and independent control over multiple subjects in text-to-image generation by converting reference images into text-stream offsets, achieving high-fidelity and editable multi-subject synthesis without disrupting image features.

Authors:Zhirui Gao, Renjiao Yi, Yaqiao Dai, Xuening Zhu, Wei Chen, Chenyang Zhu, Kai Xu
Title: Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
Abstract:
This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential ``edge point cloud reconstruction and parametric curve fitting'' pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, \textbf{CurveGaussian}, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
中文摘要:本文提出CurveGaussian单阶段框架,通过参数曲线与高斯组件的双向耦合机制及自适应拓扑优化,直接从多视角边缘图重建三维参数曲线,相比两阶段方法实现了更高效率和更优重建质量。
English Summary: This paper introduces CurveGaussian, a one-stage framework that directly reconstructs 3D parametric curves from multi-view edge maps through a bi-directional coupling mechanism and adaptive topology optimization, achieving superior efficiency and reconstruction quality over two-stage methods.

Authors:Can Liu, Chunlin Da, Xiaoxiao Long, Yuxiao Yang, Yu Zhang, Yong Wang
Title: SimVecVis: A Dataset for Enhancing MLLMs in Visualization Understanding
Abstract:
Current multimodal large language models (MLLMs), while effective in natural image understanding, struggle with visualization understanding due to their inability to decode the data-to-visual mapping and extract structured information. To address these challenges, we propose SimVec, a novel simplified vector format that encodes chart elements such as mark type, position, and size. The effectiveness of SimVec is demonstrated by using MLLMs to reconstruct chart information from SimVec formats. Then, we build a new visualization dataset, SimVecVis, to enhance the performance of MLLMs in visualization understanding, which consists of three key dimensions: bitmap images of charts, their SimVec representations, and corresponding data-centric question-answering (QA) pairs with explanatory chain-of-thought (CoT) descriptions. We finetune state-of-the-art MLLMs (e.g., MiniCPM and Qwen-VL), using SimVecVis with different dataset dimensions. The experimental results show that it leads to substantial performance improvements of MLLMs with good spatial perception capabilities (e.g., MiniCPM) in data-centric QA tasks. Our dataset and source code are available at: https://github.com/VIDA-Lab/SimVecVis.
Chinese: 当前多模态大语言模型在可视化理解方面存在不足,因此我们提出了SimVec这一简化向量格式来编码图表元素,并构建了SimVecVis数据集,显著提升了模型在数据问答任务中的表现。
English: Current multimodal large language models (MLLMs) struggle with visualization understanding, so we propose SimVec, a simplified vector format that encodes chart elements, and build the SimVecVis dataset to significantly enhance MLLMs' performance in data-centric question-answering tasks.
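As an illustration of what a simplified vector encoding of chart marks might look like, the snippet below serializes mark type, position, and size into compact text records; the actual SimVec grammar is defined by the paper and dataset, and this hypothetical format should not be mistaken for it.

```python
def to_simvec_like(marks):
    """Hypothetical SimVec-style serialization of chart marks.

    marks: list of dicts like {"type": "bar", "x": 0, "y": 0, "w": 1, "h": 23}
    """
    return "\n".join(
        f'{m["type"]} x={m["x"]} y={m["y"]} w={m["w"]} h={m["h"]}' for m in marks
    )

print(to_simvec_like([{"type": "bar", "x": 0, "y": 0, "w": 1, "h": 23},
                      {"type": "bar", "x": 1, "y": 0, "w": 1, "h": 42}]))
```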

Authors:Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno
Title: LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning
Abstract:
Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.
中文: 当前视觉语言模型因缺乏专业数据而在复杂人体姿态和动作任务上表现不佳,但我们通过整合人体关键点与视觉特征的方法构建了专用数据集,微调后使模型性能显著提升了33.2%。
English: Current vision-language models struggle with complex human pose and action tasks due to a lack of specialized data, but our method integrating human keypoints with visual features creates a dataset that significantly improves model performance by 33.2% when fine-tuned.

Authors:Jiayi Zheng, Xiaodong Cun
Title: FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
Abstract:
We propose FairyGen, an automatic system for generating story-driven cartoon videos from a single child's drawing, while faithfully preserving its unique artistic style. Unlike previous storytelling methods that primarily focus on character consistency and basic motion, FairyGen explicitly disentangles character modeling from stylized background generation and incorporates cinematic shot design to support expressive and coherent storytelling. Given a single character sketch, we first employ an MLLM to generate a structured storyboard with shot-level descriptions that specify environment settings, character actions, and camera perspectives. To ensure visual consistency, we introduce a style propagation adapter that captures the character's visual style and applies it to the background, faithfully retaining the character's full visual identity while synthesizing style-consistent scenes. A shot design module further enhances visual diversity and cinematic quality through frame cropping and multi-view synthesis based on the storyboard. To animate the story, we reconstruct a 3D proxy of the character to derive physically plausible motion sequences, which are then used to fine-tune an MMDiT-based image-to-video diffusion model. We further propose a two-stage motion customization adapter: the first stage learns appearance features from temporally unordered frames, disentangling identity from motion; the second stage models temporal dynamics using a timestep-shift strategy with frozen identity weights. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful and narratively structured, with natural motion, highlighting its potential for personalized and engaging story animation. The code will be available at https://github.com/GVCLab/FairyGen
中文摘要:FairyGen是一个从单张儿童绘画自动生成故事驱动卡通视频的系统,通过风格传播适配器、镜头设计模块和运动定制技术,在保持原画独特艺术风格的同时实现连贯的叙事动画。
English Summary: FairyGen is an automated system that generates story-driven cartoon videos from a single child's drawing while preserving its unique artistic style through style propagation, cinematic shot design, and motion customization.

Authors:Xianghan Meng, Zhengyu Tong, Zhiyuan Huang, Chun-Guang Li
Title: Temporal Rate Reduction Clustering for Human Motion Segmentation
Abstract:
Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the sequences of frames in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ enjoy temporally consistency and are aligned well with a UoS structure, which is favorable for addressing the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performances with different feature extractors. The code is available at: https://github.com/mengxianghan123/TR2C.
Chinese Summary: 本文提出时间速率降低聚类(TR²C)方法,通过联合学习结构化表示和关联性来分割视频帧序列,其学得的表示具有时间一致性且符合子空间联合结构,在多个基准测试中实现了最优性能。
English Summary: This paper introduces Temporal Rate Reduction Clustering (TR²C), a novel method for human motion segmentation that learns structured representations aligned with Union-of-Subspaces to effectively partition video frames into distinct motions, achieving state-of-the-art results across multiple benchmarks.

Authors:Xiwei Xuan, Ziquan Deng, Kwan-Liu Ma
Title: ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
Abstract:
Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME .
Chinese: 本研究提出一种以数据为中心的框架,通过构建具有对齐片段-文本嵌入的高质量参考集,无需模型微调即可在十个基准测试中实现卓越性能,显著提升了免训练开放词汇语义分割的效果。
English: This study introduces a data-centric framework that enhances training-free open-vocabulary semantic segmentation by constructing a high-quality reference set with aligned segment-text embeddings, achieving superior performance across ten benchmarks without model fine-tuning.
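The "simple similarity-based retrieval" can be sketched as a cosine nearest-neighbour lookup over the reference set's paired segment-text embeddings; how the segments and embeddings are produced by the data pipeline is assumed here.

```python
import numpy as np

def retrieve_labels(segment_embs, ref_embs, ref_labels):
    """Cosine-similarity retrieval over a reference set of paired
    segment-text embeddings: each query segment takes the label of its
    nearest reference entry."""
    seg = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    nearest = (seg @ ref.T).argmax(axis=1)   # (num_segments,)
    return [ref_labels[i] for i in nearest]

# Toy usage: 3 query segments against a 5-entry reference set
labels = retrieve_labels(np.random.randn(3, 64), np.random.randn(5, 64),
                         ["cat", "dog", "car", "tree", "sky"])
print(labels)
```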

Authors:Yihong Cao, Jiaming Zhang, Xu Zheng, Hao Shi, Kunyu Peng, Hang Liu, Kailun Yang, Hui Zhang
Title: Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
Abstract:
Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.
中文: 本文提出UNLOCK方法,无需源数据或目标标签即可实现全景图像的无缝分割,有效应对扭曲和遮挡问题,其性能与依赖源数据的方法相当,达到领先水平。
English: This paper introduces UNLOCK, a source-free method for panoramic image segmentation that overcomes distortions and occlusions without requiring source data or target labels, achieving state-of-the-art performance comparable to source-dependent approaches.

Authors:Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Ruiping Liu, Fei Teng, Kai Luo, Zhiyong Li, Kailun Yang
Title: Out-of-Distribution Semantic Occupancy Prediction
Abstract:
3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD.
Chinese: 本研究提出了分布外语义占据预测方法和OccOoD框架,通过合成数据集增强自动驾驶系统在三维体素空间中的异常检测能力,在保持竞争力的占据预测性能同时实现了最先进的分布外检测效果。
English: This study introduces Out-of-Distribution Semantic Occupancy Prediction and the OccOoD framework to enhance autonomous driving safety by detecting anomalies in 3D voxel space, supported by synthetic datasets and achieving state-of-the-art OoD detection performance.

Authors:Longkun Zou, Kangjun Liu, Ke Chen, Kailing Guo, Kui Jia, Yaowei Wang
Title: Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition
Abstract:
Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks. The source code of this work will be available at https://github.com/zou-longkun/TAG.git.
中文摘要:本研究提出的拓扑感知建模框架通过利用全局空间拓扑结构和新型自监督学习任务,有效解决了三维点云分类中的仿真到现实领域差异问题,在多个基准测试中展现出优越性能。
English Summary: The proposed Topology-Aware Modeling framework addresses the Sim2Real domain gap in 3D point cloud classification by leveraging global spatial topology and novel self-supervised learning, demonstrating superior performance on benchmarks.

Authors:Hai Jiang, Binhao Guan, Zhen Liu, Xiaohong Liu, Jian Yu, Zheng Liu, Songchen Han, Shuaicheng Liu
Title: Learning to See in the Extremely Dark
Abstract:
Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains largely unexplored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.
中文摘要:本研究提出了一个名为SIED的新数据集,用于增强极暗环境下的RAW图像,并开发了一种基于扩散的框架,能够从极低信噪比的RAW输入中有效恢复高质量的视觉效果。
English Summary: This study introduces a new dataset called SIED for enhancing extremely low-light RAW images and proposes a diffusion-based framework that effectively restores high-quality visuals from very dark inputs.

Authors:Luosheng Xu, Dalin Zhang, Zhaohui Song
Title: Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection
Abstract:
Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, which means quick flick then get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Efficient Global Self-Attention (EGSA) to effectively capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.
中文: 本研究提出轻量级遥感变化检测模型FlickCD,通过增强差异模块和局部-全局融合块在显著降低计算与存储开销的同时保持高精度,以最小资源消耗实现了最优性能。
English: This study introduces FlickCD, a lightweight remote sensing change detection model that significantly reduces computational and storage demands while maintaining high accuracy through its Enhanced Difference Module and Local-Global Fusion Blocks, achieving state-of-the-art performance with minimal resource consumption.
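A hedged sketch of the difference-amplification idea behind the Enhanced Difference Module: gate the absolute bi-temporal feature difference with an attention map derived from both phases, so salient changes are emphasized and nuisance variation (lighting, weather) is damped. The layer choices are ours; FlickCD's actual EDM design may differ.

```python
import torch
import torch.nn as nn

class EnhancedDifferenceSketch(nn.Module):
    """Illustrative module only: gated absolute difference of bi-temporal features."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f1, f2):
        diff = torch.abs(f1 - f2)                      # raw change signal
        g = self.gate(torch.cat([f1, f2], dim=1))      # learned relevance gate
        return diff * g

m = EnhancedDifferenceSketch(16)
print(m(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)).shape)
```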

Authors:Demin Zhang, Jiahao Lyu, Zhijie Shen, Yu Zhou
Title: Class-Agnostic Region-of-Interest Matching in Document Images
Abstract:
Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named ``Class-Agnostic Region-of-Interest Matching'' (``RoI-Matching'' for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions, and propose the macro and micro metrics to evaluate. Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.
中文摘要:本文提出了一种名为“类别无关兴趣区域匹配”的新任务,通过构建基准测试和采用孪生网络框架,实现了灵活、多粒度的文档区域自定义匹配,为文档分析提供了有效的解决方案。
English Summary: This paper introduces a new task called "Class-Agnostic Region-of-Interest Matching" (RoI-Matching) to enable flexible, user-customized document analysis, proposing both a benchmark and a siamese network-based framework that effectively handles multi-granularity matching across documents.
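The cross-attention alignment step described above can be sketched with a standard multi-head attention layer in which reference-region tokens query target-document tokens; the siamese backbone and box-prediction head are omitted, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionAlign(nn.Module):
    """Sketch of the alignment step: reference-region tokens attend to
    target-document tokens through standard cross-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ref_tokens, tgt_tokens):
        # ref_tokens: (B, Nr, D) features of the prompted reference region
        # tgt_tokens: (B, Nt, D) features of the target document image
        aligned, _ = self.attn(query=ref_tokens, key=tgt_tokens, value=tgt_tokens)
        return aligned

m = CrossAttentionAlign()
print(m(torch.randn(1, 4, 256), torch.randn(1, 196, 256)).shape)
```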

Authors:Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, Yuanzhang Li
Title: Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
Abstract:
The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures, outperforming state-of-the-art methods. Code available at https://github.com/spencerwooo/dSVA.
中文: 本文提出dSVA攻击方法,利用自监督视觉变换器的全局与局部特征生成对抗样本,在多种架构的黑盒模型中展现出卓越的迁移性能。
English: This paper introduces dSVA, a generative attack that leverages both global and local features from self-supervised Vision Transformers to craft adversarial examples with superior black-box transferability across diverse model architectures.

Authors:Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Title: Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability
Abstract:
Detectors often suffer from performance drop due to domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on the object. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on the COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at https://github.com/heboyong/Fitness-Generalization-Transferability.
中文: 本文提出一种方法,利用单步扩散过程的中间特征,将推理时间减少75%的同时提升源域性能,并通过构建以物体为中心的辅助分支和一致性损失,有效提高跨域检测任务的泛化性和迁移能力。
English: This paper introduces a method that leverages intermediate features from a single-step diffusion process to significantly reduce inference time by 75% while enhancing performance on source domains, and employs an object-centered auxiliary branch with consistency loss to improve generalization and transferability in cross-domain detection tasks.

Authors:Wenzhou Lyu, Jialing Lin, Wenqi Ren, Ruihao Xia, Feng Qian, Yang Tang
Title: DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation
Abstract:
Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping.
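The zero-terminal-SNR rescaling mentioned above follows a known recipe: shift and rescale the square roots of the cumulative alphas so the final timestep carries no signal. A minimal sketch is below, with the caveat that DidSee's exact scheduler settings may differ.

```python
# Sketch of zero-terminal-SNR rescaling for a diffusion noise schedule, the
# kind of fix the DidSee abstract refers to for removing signal-leakage bias.
# Follows the widely used recipe from "Common Diffusion Noise Schedules and
# Sample Steps are Flawed"; the paper's scheduler may differ in detail.
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    ab_sqrt = alphas_bar.sqrt()

    ab_sqrt_0 = ab_sqrt[0].clone()
    ab_sqrt_T = ab_sqrt[-1].clone()

    # Shift so the terminal step has zero signal, then rescale so the first
    # step keeps its original value.
    ab_sqrt = ab_sqrt - ab_sqrt_T
    ab_sqrt = ab_sqrt * ab_sqrt_0 / (ab_sqrt_0 - ab_sqrt_T)

    alphas_bar = ab_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas

betas = torch.linspace(1e-4, 0.02, 1000)       # a standard linear schedule
betas_zt = rescale_zero_terminal_snr(betas)
print(float((1 - betas_zt).cumprod(0)[-1]))    # ~0: terminal SNR is now zero
```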

Authors:Lei Hao, Lina Xu, Chang Liu, Yanni Dong
Title: LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection
Abstract:
Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at https://github.com/leileilei2000/LASFNet.
Chinese: 本研究提出了一种轻量级多模态目标检测网络LASFNet,通过单一融合单元和注意力机制,在多个数据集上实现检测精度提升1%-3%的同时,将计算成本最高降低90%。
English: This study introduces LASFNet, a lightweight multimodal object detection network that uses a single fusion unit and attention mechanisms to significantly reduce computational costs by up to 90% while improving detection accuracy by 1%-3% across multiple datasets.

Authors:Ayaka Yasunaga, Hideo Saito, Shohei Mori
Title: User-in-the-Loop View Sampling with Error Peaking Visualization
Abstract:
Augmented reality (AR) provides ways to visualize missing view samples for novel view synthesis. Existing approaches present 3D annotations for new view samples and task users with taking images by aligning the AR display. This data collection task is known to be mentally demanding, and it restricts capture to small pre-defined areas due to the ideal but restrictive underlying sampling theory. To free users from 3D annotations and limited scene exploration, we propose using locally reconstructed light fields and visualizing errors to be removed by inserting new views. Our results show that the error-peaking visualization is less invasive, reduces disappointment in final results, and is satisfactory with fewer view samples in our mobile view synthesis system. We also show that our approach can contribute to recent radiance field reconstruction for larger scenes, such as 3D Gaussian splatting.

Authors:Tyler Ward, Xiaoqin Wang, Braxton McFarland, Md Atik Ahamed, Sahar Nozad, Talal Arshad, Hafsa Nebbache, Jin Chen, Abdullah Imran
Title: Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning
Abstract:
Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known malignancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at https://github.com/tbwa233/FFCL-SAM/.
中文: 一种结合Segment Anything模型与前向对比学习的新型深度学习框架,显著提高了乳腺癌手术中切缘评估的速度和准确性,其改进的分类和分割效果有望降低再切除率。
English: A novel deep learning framework combining the Segment Anything Model with Forward-Forward Contrastive Learning significantly improves the speed and accuracy of intraoperative margin assessment in breast cancer surgery, achieving enhanced classification and segmentation results that could reduce re-excision rates.

Authors:Qiuyi Qi, Xin Li, Ming Kong, Zikang Xu, Bingdi Chen, Qiang Zhu, S Kevin Zhou
Title: Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology
Abstract:
Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at https://github.com/Joey-Qi/SAIC.
中文摘要:本文提出的风格对齐图像合成(SAIC)方法通过生成高保真病理图像,有效解决数据质量不足和染色差异问题,显著提升了异常细胞检测模型的鲁棒性和综合性能,在临床应用中展现出良好泛化能力。
English Summary: This paper introduces the Style-Aligned Image Composition (SAIC) method, which generates high-fidelity pathological images to enhance abnormal cell detection by addressing data limitations and staining inconsistencies, thereby improving model robustness and performance across diverse clinical scenarios.

Authors:Wenjie Xuan, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Title: Rethink Sparse Signals for Pose-guided Text-to-image Generation
Abstract:
Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, a direction that remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be available at https://github.com/DREAMXFAR/SP-Ctrl.
中文摘要:本文提出SP-Ctrl方法,通过增强稀疏姿态信号实现更有效的姿态引导图像生成,在保持简单性的同时达到与密集信号方法相当的性能,并展现出优异的跨物种生成能力。
English Summary: This paper introduces SP-Ctrl, a novel method that enhances sparse pose signals for more effective pose-guided image generation, achieving performance comparable to dense signal approaches while offering improved simplicity and cross-species generation capabilities.

Authors:Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, Kai Han
Title: OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs
Abstract:
In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, 285 Chinese videos and 525 English videos; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. Then we conduct experiments on OmniEval with several omni-modality models. We hope that our OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. Code and data can be found at https://omnieval-benchmark.github.io/.

Authors:Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli
Title: MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans
Abstract:
Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation. The dataset and evaluation codes will be available at https://github.com/Qualcomm-AI-research/MultiHuman-Testbench.
中文: 本文提出了MultiHuman-Testbench这一包含1800个样本和5550张多样化人脸的基准测试,用于评估多人图像生成模型,通过创新的评估指标和技术显著提升了身份相似度,为该领域研究提供了标准化工具。
English: This paper introduces MultiHuman-Testbench, a comprehensive benchmark with 1,800 samples and 5,550 diverse face images to evaluate multi-human image generation models, proposing novel evaluation metrics and techniques that significantly improve identity similarity and provide standardized tools for advancing the field.
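As a small illustration of the Hungarian-matching step mentioned in the abstract, the sketch below scores identity similarity by assigning each reference face embedding to a distinct generated face so that total cosine similarity is maximized. Names, embedding sizes, and the scoring convention are illustrative, not the benchmark's actual code.

```python
# Sketch: identity similarity via Hungarian matching. Each reference face is
# assigned to a distinct detected face in the generated image, and the score
# is the mean cosine similarity under the best one-to-one assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def id_similarity(ref_embeds: np.ndarray, gen_embeds: np.ndarray) -> float:
    """Mean cosine similarity under the optimal one-to-one face assignment."""
    ref = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    gen = gen_embeds / np.linalg.norm(gen_embeds, axis=1, keepdims=True)
    sim = ref @ gen.T                         # (n_ref, n_gen) cosine similarities
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return float(sim[rows, cols].mean())

rng = np.random.default_rng(0)
refs = rng.normal(size=(3, 512))   # e.g. ArcFace-style reference embeddings
gens = rng.normal(size=(4, 512))   # embeddings of detected generated faces
print(id_similarity(refs, gens))
```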

Authors:Chengan He, Junxuan Li, Tobias Kirschstein, Artem Sevastopolsky, Shunsuke Saito, Qingyang Tan, Javier Romero, Chen Cao, Holly Rushmeier, Giljoo Nam
Title: 3DGH: 3D Head Generation with Composable Hair and Face
Abstract:
We present 3DGH, an unconditional generative model for 3D human heads with composable hair and face components. Unlike previous work that entangles the modeling of hair and face, we propose to separate them using a novel data representation with template-based 3D Gaussian Splatting, in which deformable hair geometry is introduced to capture the geometric variations across different hairstyles. Based on this data representation, we design a 3D GAN-based architecture with dual generators and employ a cross-attention mechanism to model the inherent correlation between hair and face. The model is trained on synthetic renderings using carefully designed objectives to stabilize training and facilitate hair-face separation. We conduct extensive experiments to validate the design choice of 3DGH, and evaluate it both qualitatively and quantitatively by comparing with several state-of-the-art 3D GAN methods, demonstrating its effectiveness in unconditional full-head image synthesis and composable 3D hairstyle editing. More details will be available on our project page: https://c-he.github.io/projects/3dgh/.

Authors:Qin Ren, Yifan Wang, Ruogu Fang, Haibin Ling, Chenyu You
Title: OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport
Abstract:
Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally, through long-tailed morphological distributions, and locally, through tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. Specifically, OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) a global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) a local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. We then recast the initial OT problem, augmented by these constraints, into an unbalanced OT formulation that can be solved with an efficient, hardware-friendly matrix scaling algorithm. Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. In addition, OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Our codes are available at https://github.com/Y-Research-SBU/OTSurv.
中文:OTSurv提出了一种基于最优传输的新型多示例学习框架,通过全局长尾约束和局部不确定性约束解决全切片图像中的病理异质性问题,在生存预测中实现了最先进的性能与高可解释性。
English: OTSurv introduces a novel multiple instance learning framework using optimal transport to address pathological heterogeneity in whole slide images, achieving state-of-the-art survival prediction with improved accuracy and interpretability.
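For readers unfamiliar with the "matrix scaling algorithm" referenced above, the sketch below shows plain entropic Sinkhorn iterations on a toy cost matrix. OTSurv itself solves an unbalanced OT problem with long-tail and uncertainty constraints, so treat this only as an illustration of the underlying solver, with made-up marginals.

```python
# Minimal entropic-OT Sinkhorn (matrix scaling) sketch. The balanced version
# shown here only illustrates the core scaling iterations; OTSurv's actual
# formulation is an unbalanced OT problem with additional constraints.
import torch

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Returns a transport plan P with row sums ~a and column sums ~b."""
    K = torch.exp(-cost / eps)            # Gibbs kernel
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

cost = torch.rand(6, 4)                   # e.g. tile-to-prototype costs
a = torch.full((6,), 1 / 6)               # tile marginal (uniform here)
b = torch.tensor([0.4, 0.3, 0.2, 0.1])    # long-tailed prototype marginal
P = sinkhorn(cost, a, b)
print(P.sum(dim=0))                       # ~b: column constraint satisfied
```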

Authors:Alexander Selivanov, Philip Müller, Özgün Turgut, Nil Stolt-Ansó, Daniel Rückert
Title: Global and Local Contrastive Learning for Joint Representations from Cardiac MRI and ECG
Abstract:
An electrocardiogram (ECG) is a widely used, cost-effective tool for detecting electrical abnormalities in the heart. However, it cannot directly measure functional parameters, such as ventricular volumes and ejection fraction, which are crucial for assessing cardiac function. Cardiac magnetic resonance (CMR) is the gold standard for these measurements, providing detailed structural and functional insights, but is expensive and less accessible. To bridge this gap, we propose PTACL (Patient and Temporal Alignment Contrastive Learning), a multimodal contrastive learning framework that enhances ECG representations by integrating spatio-temporal information from CMR. PTACL uses global patient-level contrastive loss and local temporal-level contrastive loss. The global loss aligns patient-level representations by pulling ECG and CMR embeddings from the same patient closer together, while pushing apart embeddings from different patients. Local loss enforces fine-grained temporal alignment within each patient by contrasting encoded ECG segments with corresponding encoded CMR frames. This approach enriches ECG representations with diagnostic information beyond electrical activity and transfers more insights between modalities than global alignment alone, all without introducing new learnable weights. We evaluate PTACL on paired ECG-CMR data from 27,951 subjects in the UK Biobank. Compared to baseline approaches, PTACL achieves better performance in two clinically relevant tasks: (1) retrieving patients with similar cardiac phenotypes and (2) predicting CMR-derived cardiac function parameters, such as ventricular volumes and ejection fraction. Our results highlight the potential of PTACL to enhance non-invasive cardiac diagnostics using ECG. The code is available at: https://github.com/alsalivan/ecgcmr
中文: PTACL是一种多模态对比学习框架,通过整合心脏磁共振的时空数据来增强心电图表征,无需额外参数即可提升心脏表型检索和功能参数预测的性能。
English: PTACL is a multimodal contrastive learning framework that enhances ECG representations by integrating spatio-temporal CMR data, enabling improved cardiac phenotype retrieval and function parameter prediction without additional parameters.
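A minimal sketch of the patient-level (global) objective described above, written as a symmetric InfoNCE loss over paired ECG and CMR embeddings. The temperature and dimensions are placeholders, and the local temporal loss would apply the same idea to per-segment and per-frame embeddings.

```python
# Sketch of a patient-level contrastive loss: ECG and CMR embeddings from the
# same patient are pulled together, different patients are pushed apart,
# via a symmetric InfoNCE objective. Illustrative, not PTACL's exact code.
import torch
import torch.nn.functional as F

def global_contrastive_loss(ecg_emb, cmr_emb, temperature=0.07):
    ecg = F.normalize(ecg_emb, dim=1)
    cmr = F.normalize(cmr_emb, dim=1)
    logits = ecg @ cmr.T / temperature          # (N, N) similarity matrix
    targets = torch.arange(ecg.size(0), device=ecg.device)
    # Symmetric: ECG->CMR and CMR->ECG retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

ecg = torch.randn(8, 256)   # per-patient ECG embeddings
cmr = torch.randn(8, 256)   # per-patient CMR embeddings
print(global_contrastive_loss(ecg, cmr).item())
```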

Authors:Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
Title: MMSearch-R1: Incentivizing LMMs to Search
Abstract:
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
中文:MMSearch-R1是一种创新的强化学习框架,它使大型多模态模型能够通过整合图像和文本工具进行高效、按需的搜索,在减少30%以上搜索调用的同时,性能超越了现有方法。
English: MMSearch-R1 is a novel reinforcement learning framework that enables large multimodal models to perform efficient, on-demand searches using integrated image and text tools, significantly reducing search calls while outperforming existing methods.
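The "outcome-based reward with a search penalty" can be pictured as a tiny function like the one below: correctness earns the reward and every tool call costs a little, so the policy is pushed toward searching only when necessary. The penalty coefficient is made up for illustration and is not the paper's value.

```python
# Toy sketch of an outcome-based reward with a search penalty: a correct
# answer earns full reward, each search/tool call subtracts a small cost.
def outcome_reward(answer_correct: bool, num_search_calls: int,
                   search_penalty: float = 0.1) -> float:
    base = 1.0 if answer_correct else 0.0
    return base - search_penalty * num_search_calls

# A correct answer found without searching beats one that needed two searches.
print(outcome_reward(True, 0), outcome_reward(True, 2), outcome_reward(False, 1))
```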

Authors:Jacopo Dapueto, Vito Paolo Pastore, Nicoletta Noceti, Francesca Odone
Title: Disentangled representations of microscopy images
Abstract:
Microscopy image analysis is fundamental for different applications, from diagnosis to synthetic engineering and environmental monitoring. Modern acquisition systems make it possible to acquire an escalating number of images, requiring the consequent development of a large collection of deep learning-based automatic image analysis methods. Although deep neural networks have demonstrated great performance in this field, interpretability, an essential requirement for microscopy image analysis, remains an open challenge. This work proposes a Disentangled Representation Learning (DRL) methodology to enhance model interpretability for microscopy image classification. Exploiting benchmark datasets from three different microscopic image domains (plankton, yeast vacuoles, and human cells), we show how a DRL framework, based on transferring a representation learnt from synthetic data, can provide a good trade-off between accuracy and interpretability in this domain.
中文摘要:本研究提出了一种解耦表示学习方法,通过三个显微图像数据集验证了该方法能在保持分类精度的同时有效提升模型可解释性。
English Summary: This study introduces a Disentangled Representation Learning approach to improve interpretability in microscopy image classification, demonstrating through three datasets that it effectively balances accuracy with explainability.

Authors:Ji Qi, Xinchang Zhang, Dingqi Ye, Yongjia Ruan, Xin Guo, Shaowen Wang, Haifeng Li
Title: SFNet: Fusion of Spatial and Frequency-Domain Features for Remote Sensing Image Forgery Detection
Abstract:
The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects like roads or buildings in RSI, or frequency-domain features to identify artifacts from up-sampling operations in generative adversarial networks (GANs). However, the nature of artifacts can significantly differ depending on geographic terrain, land cover types, or specific features within the RSI. Moreover, these complex artifacts evolve as generative models become more sophisticated. In short, over-reliance on a single visual cue makes existing forgery detectors struggle to generalize across diverse remote sensing data. This paper proposes a novel forgery detection framework called SFNet, designed to identify fake images in diverse remote sensing data by leveraging spatial and frequency domain features. Specifically, to obtain rich and comprehensive visual information, SFNet employs two independent feature extractors to capture spatial and frequency domain features from input RSIs. To fully utilize the complementary domain features, the domain feature mapping module and the hybrid domain feature refinement module (CBAM attention) of SFNet are designed to successively align and fuse the multi-domain features while suppressing redundant information. Experiments on three datasets show that SFNet achieves an accuracy improvement of 4%-15.18% over the state-of-the-art RS forgery detection methods and exhibits robust generalization capabilities. The code is available at https://github.com/GeoX-Lab/RSTI/tree/main/SFNet.
Chinese: 提出的SFNet框架通过融合空间和频率域特征,能有效检测多样化遥感图像中的伪造内容,在多个数据集上实现了显著精度提升并展现出强大泛化能力。
English: The proposed SFNet framework effectively detects fake remote sensing images by integrating spatial and frequency domain features, achieving significant accuracy improvements and robust generalization across diverse datasets.
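A small sketch of how the frequency branch's input could be derived, assuming a log-amplitude FFT spectrum as the frequency-domain view (a common choice for spotting up-sampling artifacts). SFNet's actual extractors and its mapping/refinement modules are not shown here.

```python
# Sketch of a frequency-domain view for a two-branch forgery detector: the
# spatial branch sees the image itself, the frequency branch sees the
# log-amplitude of its 2D FFT. Purely illustrative of the dual-domain idea.
import torch

def frequency_branch_input(img: torch.Tensor) -> torch.Tensor:
    """img: (B, C, H, W) in [0, 1] -> log-amplitude spectrum, same shape."""
    spec = torch.fft.fft2(img, norm="ortho")
    spec = torch.fft.fftshift(spec, dim=(-2, -1))   # center low frequencies
    return torch.log1p(spec.abs())

rsi = torch.rand(2, 3, 256, 256)        # a batch of remote sensing images
freq = frequency_branch_input(rsi)
print(freq.shape)                        # torch.Size([2, 3, 256, 256])
```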

Authors:Zhonghao Shi, Enyu Zhao, Nathaniel Dennler, Jingzhen Wang, Xinyang Xu, Kaleen Shrestha, Mengxue Fu, Daniel Seita, Maja Matarić
Title: HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction
Abstract:
Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this Github repository: https://github.com/interaction-lab/HRIBench.
中文: HRIBench是一个为评估人机交互中人类感知任务而设计的视觉问答基准,结果表明现有模型在感知能力和性能-延迟权衡方面均未达到实时部署的要求。
English: HRIBench is introduced as a VQA benchmark to evaluate VLMs on human perception tasks for HRI, revealing that current models lack both sufficient perceptual capabilities and satisfactory performance-latency trade-offs for real-time deployment.

Authors:Lei Zhu, Jun Zhou, Rick Siow Mong Goh, Yong Liu
Title: AdvMIM: Adversarial Masked Image Modeling for Semi-Supervised Medical Image Segmentation
Abstract:
Vision Transformer has recently gained tremendous popularity in medical image segmentation tasks due to its superior capability in capturing long-range dependencies. However, the transformer requires a large amount of labeled data to be effective, which hinders its applicability in the annotation-scarce semi-supervised learning scenario where only limited labeled data is available. State-of-the-art semi-supervised learning methods propose combinatorial CNN-Transformer learning to cross teach a transformer with a convolutional neural network, which achieves promising results. However, it remains a challenging task to effectively train the transformer with limited labeled data. In this paper, we propose an adversarial masked image modeling method to fully unleash the potential of the transformer for semi-supervised medical image segmentation. The key challenge in semi-supervised learning with a transformer lies in the lack of sufficient supervision signal. To this end, we propose to construct an auxiliary masked domain from the original domain with masked image modeling and train the transformer to predict the entire segmentation mask with masked inputs to increase the supervision signal. We leverage the original labels from labeled data and pseudo-labels from unlabeled data to learn the masked domain. To further benefit the original domain from the masked domain, we provide a theoretical analysis of our method from a multi-domain learning perspective and devise a novel adversarial training loss to reduce the domain gap between the original and masked domains, which boosts semi-supervised learning performance. We also extend adversarial masked image modeling to CNN networks. Extensive experiments on three public medical image segmentation datasets demonstrate the effectiveness of our method, where our method outperforms existing methods significantly. Our code is publicly available at https://github.com/zlheui/AdvMIM.
Chinese: 视觉Transformer在医学图像分割中表现出色,但在标注数据有限时效果不佳,因此本文提出一种对抗性掩码图像建模方法,通过增强监督信号和缩小域间差异,显著提升了半监督学习的性能。
English: Vision Transformer excels in medical image segmentation but struggles with limited labeled data, so this paper introduces an adversarial masked image modeling method to enhance supervision signals and bridge domain gaps, significantly improving semi-supervised learning performance.
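The "masked domain" construction can be illustrated with a simple patch-masking routine like the sketch below, where a random fraction of image patches is zeroed before the model predicts the full segmentation mask. Patch size and mask ratio are illustrative choices, not the paper's settings.

```python
# Sketch of building a masked input for masked image modeling: randomly drop
# a fraction of (patch x patch) blocks, keep the binary mask for bookkeeping.
import torch

def mask_patches(img: torch.Tensor, patch: int = 16, ratio: float = 0.6):
    """Zero out a random subset of patch blocks. img: (B, C, H, W)."""
    b, _, h, w = img.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=img.device) > ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return img * mask, mask              # masked image and its binary mask

x = torch.rand(4, 1, 224, 224)           # e.g. a batch of grayscale slices
x_masked, m = mask_patches(x)
print(x_masked.shape, m.mean().item())   # ~40% of pixels kept on average
```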

Authors:Manyi Li, Renshuai Tao, Yufan Liu, Chuangchuang Tan, Haotong Qin, Bing Li, Yunchao Wei, Yao Zhao
Title: Pay Less Attention to Deceptive Artifacts: Robust Detection of Compressed Deepfakes on Online Social Networks
Abstract:
With the rapid advancement of deep learning, particularly through generative adversarial networks (GANs) and diffusion models (DMs), AI-generated images, or ``deepfakes", have become nearly indistinguishable from real ones. These images are widely shared across Online Social Networks (OSNs), raising concerns about their misuse. Existing deepfake detection methods overlook the ``block effects" introduced by compression in OSNs, which obscure deepfake artifacts, and primarily focus on raw images, rarely encountered in real-world scenarios. To address these challenges, we propose PLADA (Pay Less Attention to Deceptive Artifacts), a novel framework designed to tackle the lack of paired data and the ineffective use of compressed images. PLADA consists of two core modules: Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to handle block effects, and Open Data Aggregation (ODA), which processes both paired and unpaired data to improve detection. Extensive experiments across 26 datasets demonstrate that PLADA achieves a remarkable balance in deepfake detection, outperforming SoTA methods in detecting deepfakes on OSNs, even with limited paired data and compression. More importantly, this work introduces the ``block effect" as a critical factor in deepfake detection, providing a robust solution for open-world scenarios. Our code is available at https://github.com/ManyiLee/PLADA.
中文:提出的PLADA框架通过消除块效应并利用配对与非配对数据,有效解决了在线社交网络中压缩图像深度伪造检测的难题,在多种数据集上实现了卓越性能。
English: The proposed PLADA framework effectively addresses deepfake detection challenges in compressed images from online social networks by eliminating block effects and leveraging both paired and unpaired data, achieving superior performance across diverse datasets.

Authors:Fangyijie Wang, Yuan Liang, Sourav Bhattacharjee, Abey Campbell, Kathleen M. Curran, Guénolé Silvestre
Title: Fusing Radiomic Features with Deep Representations for Gestational Age Estimation in Fetal Ultrasound Images
Abstract:
Accurate gestational age (GA) estimation, ideally through fetal ultrasound measurement, is a crucial aspect of providing excellent antenatal care. However, deriving GA from manual fetal biometric measurements depends on the operator and is time-consuming. Hence, automatic computer-assisted methods are needed in clinical practice. In this paper, we present a novel feature fusion framework to estimate GA using fetal ultrasound images without any measurement information. We adopt a deep learning model to extract deep representations from ultrasound images. We extract radiomic features to reveal patterns and characteristics of fetal brain growth. To harness the interpretability of radiomics in medical imaging analysis, we estimate GA by fusing radiomic features and deep representations. Our framework estimates GA with a mean absolute error of 8.0 days across three trimesters, outperforming current machine learning-based methods at these gestational ages. Experimental results demonstrate the robustness of our framework across different populations in diverse geographical regions. Our code is publicly available at https://github.com/13204942/RadiomicsImageFusion_FetalUS.
中文摘要:本文提出了一种新颖的特征融合框架,通过结合胎儿超声图像的深度学习表征和影像组学特征来准确估算孕周,在不同地理区域的人群中实现了仅8.0天的平均绝对误差,性能优于现有方法。
English summary: This paper introduces a novel feature fusion framework that combines deep learning representations with radiomic features from fetal ultrasound images to accurately estimate gestational age, achieving superior performance with a mean absolute error of 8.0 days across diverse populations.
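A minimal sketch of the fusion idea, assuming radiomic descriptors and a pooled deep embedding are simply concatenated and passed to a small regression head; the dimensions and the head are placeholders rather than the paper's architecture.

```python
# Sketch: concatenate hand-crafted radiomic features with a deep image
# embedding and regress gestational age with a small MLP head. Illustrative
# only; the paper's actual fusion framework may differ.
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    def __init__(self, deep_dim=512, radiomic_dim=100):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(deep_dim + radiomic_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),           # gestational age (e.g. in days)
        )
    def forward(self, deep_feat, radiomic_feat):
        return self.head(torch.cat([deep_feat, radiomic_feat], dim=1))

model = FusionRegressor()
deep = torch.randn(8, 512)      # e.g. pooled CNN features of the ultrasound
rad = torch.randn(8, 100)       # e.g. texture/shape radiomic descriptors
print(model(deep, rad).shape)   # torch.Size([8, 1])
```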

Authors:Changliang Xia, Chengyou Jia, Zhuohang Dang, Minnan Luo
Title: From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios
Abstract:
Dense prediction tasks hold significant importance in computer vision, aiming to learn pixel-wise annotated labels for an input image. Despite advances in this field, existing methods primarily focus on idealized conditions, with limited generalization to real-world scenarios, and face the challenging scarcity of real-world data. To systematically study this problem, we first introduce DenseWorld, a benchmark spanning a broad set of 25 dense prediction tasks that correspond to urgent real-world applications, featuring unified evaluation across tasks. Then, we propose DenseDiT, which maximally exploits generative models' visual priors to perform diverse real-world dense prediction tasks through a unified strategy. DenseDiT combines a parameter-reuse mechanism and two lightweight branches that adaptively integrate multi-scale context, working with less than 0.1% additional parameters. Evaluations on DenseWorld reveal significant performance drops in existing general and specialized baselines, highlighting their limited real-world generalization. In contrast, DenseDiT achieves superior results using less than 0.01% of the baselines' training data, underscoring its practical value for real-world deployment. Our data, checkpoints, and code are available at https://xcltql666.github.io/DenseDiTProj

Authors:Kun Yuan, Tingxuan Chen, Shi Li, Joel L. Lavanchy, Christian Heiliger, Ege Özsoy, Yiming Huang, Long Bai, Nassir Navab, Vinkle Srivastav, Hongliang Ren, Nicolas Padoy
Title: Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement
Abstract:
The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA
中文: SPA框架通过少量样本的空间对齐、时序一致性建模和动态测试时自适应,以轻量级方式提升手术基础模型在跨机构环境中的性能,实现了仅需少量标注即可达到最优的手术阶段识别效果。
English: The SPA framework introduces a lightweight adaptation method that enhances surgical foundation models' performance in cross-institutional settings through few-shot spatial alignment, temporal consistency modeling, and dynamic test-time adaptation, achieving state-of-the-art phase recognition with minimal annotations.

Authors:Siqiao Li, Chen Hui, Wei Zhang, Rui Liang, Chenyue Song, Feng Jiang, Haiqi Zhu, Zhixuan Li, Hong Huang, Xiang Li
Title: MS-IQA: A Multi-Scale Feature Fusion Network for PET/CT Image Quality Assessment
Abstract:
Positron Emission Tomography / Computed Tomography (PET/CT) plays a critical role in medical imaging, combining functional and anatomical information to aid in accurate diagnosis. However, image quality degradation due to noise, compression and other factors could potentially lead to diagnostic uncertainty and increase the risk of misdiagnosis. When evaluating the quality of a PET/CT image, both low-level features like distortions and high-level features like organ anatomical structures affect the diagnostic value of the image. However, existing medical image quality assessment (IQA) methods are unable to account for both feature types simultaneously. In this work, we propose MS-IQA, a novel multi-scale feature fusion network for PET/CT IQA, which utilizes multi-scale features from various intermediate layers of ResNet and Swin Transformer, enhancing its ability to perceive both local and global information. In addition, a multi-scale feature fusion module is also introduced to effectively combine high-level and low-level information through a dynamically weighted channel attention mechanism. Finally, to fill the gap in PET/CT IQA datasets, we construct PET-CT-IQA-DS, a dataset containing 2,700 varying-quality PET/CT images with quality scores assigned by radiologists. Experiments on our dataset and the publicly available LDCTIQAC2023 dataset demonstrate that our proposed model has achieved superior performance against existing state-of-the-art methods in various IQA metrics. This work provides an accurate and efficient IQA method for PET/CT. Our code and dataset are available at https://github.com/MS-IQA/MS-IQA/.
中文:本研究提出的MS-IQA模型通过动态融合多尺度特征,有效提升了PET/CT图像质量评估能力,在新构建的数据集上展现出优越性能。
English: The proposed MS-IQA model effectively enhances PET/CT image quality assessment by integrating multi-scale features through a dynamic fusion mechanism, demonstrating superior performance on a newly constructed dataset.

Authors:Deepak Ghimire, Kilho Lee, Seong-heum Kim
Title: Loss-Aware Automatic Selection of Structured Pruning Criteria for Deep Neural Network Acceleration
Abstract:
Structured pruning is a well-established technique for compressing neural networks, making it suitable for deployment in resource-limited edge devices. This paper presents an efficient Loss-Aware Automatic Selection of Structured Pruning Criteria (LAASP) for slimming and accelerating deep neural networks. The majority of pruning methodologies employ a sequential process consisting of three stages: 1) training, 2) pruning, and 3) fine-tuning, whereas the proposed pruning technique adopts a pruning-while-training approach that eliminates the first stage and integrates the second and third stages into a single cycle. The automatic selection of magnitude or similarity-based filter pruning criteria from a specified pool of criteria and the specific pruning layer at each pruning iteration is guided by the network's overall loss on a small subset of the training data. To mitigate the abrupt accuracy drop due to pruning, the network is retrained briefly after each reduction of a predefined number of floating-point operations (FLOPs). The optimal pruning rates for each layer in the network are automatically determined, eliminating the need for manual allocation of fixed or variable pruning rates for each layer. Experiments on the VGGNet and ResNet models on the CIFAR-10 and ImageNet benchmark datasets demonstrate the effectiveness of the proposed method. In particular, the ResNet56 and ResNet110 models on the CIFAR-10 dataset significantly improve the top-1 accuracy compared to state-of-the-art methods while reducing the network FLOPs by 52%. Furthermore, the ResNet50 model on the ImageNet dataset reduces FLOPs by more than 42% with a negligible 0.33% drop in top-5 accuracy. The source code of this paper is publicly available online - https://github.com/ghimiredhikura/laasp.
Chinese: 本文提出LAASP,一种高效的损失感知自动结构化剪枝方法,通过将剪枝与训练相结合来精简和加速深度神经网络,在基准数据集上实现了显著的计算量减少且精度损失极小。
English: The paper introduces LAASP, an efficient loss-aware automatic structured pruning method that integrates pruning with training to slim and accelerate deep neural networks, achieving significant FLOPs reduction with minimal accuracy loss on benchmark datasets.
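The loss-guided selection step can be sketched as follows: tentatively prune a few filters of one layer under each candidate criterion (L1 magnitude vs. filter similarity here), measure the loss on a small calibration batch, and keep the criterion that hurts least. The tiny network, layer naming, and two-criterion pool are illustrative; LAASP interleaves this with training and FLOPs-budgeted retraining.

```python
# Sketch of loss-aware selection among pruning criteria: score filters under
# each criterion, tentatively zero the lowest-scoring ones, and compare the
# resulting loss on a calibration batch. Illustrative of the idea only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def filter_scores(conv: nn.Conv2d, criterion: str) -> torch.Tensor:
    """Lower score = pruned first."""
    w = conv.weight.detach().flatten(1)              # (out_channels, k*k*in)
    if criterion == "l1":
        return w.abs().sum(dim=1)                    # small-magnitude filters first
    wn = F.normalize(w, dim=1)
    mean_sim = (wn @ wn.T).mean(dim=1)               # redundancy of each filter
    return -mean_sim                                 # most redundant filters first

def tentative_prune_loss(model, layer_name, criterion, n_prune, batch):
    """Zero the n_prune lowest-scoring filters of one layer, report the loss."""
    pruned = copy.deepcopy(model)
    conv = dict(pruned.named_modules())[layer_name]
    idx = filter_scores(conv, criterion).argsort()[:n_prune]
    with torch.no_grad():
        conv.weight[idx] = 0.0
        if conv.bias is not None:
            conv.bias[idx] = 0.0
    x, y = batch
    return F.cross_entropy(pruned(x), y).item()

# Tiny placeholder network and calibration batch, for illustration only.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
batch = (torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,)))
losses = {c: tentative_prune_loss(model, "0", c, n_prune=2, batch=batch)
          for c in ("l1", "similarity")}
print(min(losses, key=losses.get), losses)           # criterion that hurts least
```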

Authors:Haipeng Fan, Shiyuan Zhang, Baohunesitu, Zihang Guo, Huaiwen Zhang
Title: EAR: Erasing Concepts from Unified Autoregressive Models
Abstract:
Autoregressive (AR) models have achieved unified and strong performance across both visual understanding and image generation tasks. However, removing undesired concepts from AR models while maintaining overall generation quality remains an open challenge. In this paper, we propose the Erasure Autoregressive Model (EAR), a fine-tuning method for effective and utility-preserving concept erasure in AR models. Specifically, we introduce a Windowed Gradient Accumulation (WGA) strategy to align patch-level decoding with erasure objectives, and a Thresholded Loss Masking (TLM) strategy to protect content unrelated to the target concept during fine-tuning. Furthermore, we propose a novel benchmark, Erase Concept Generator and Visual Filter (ECGVF), aimed at providing a more rigorous and comprehensive foundation for evaluating concept erasure in AR models. Specifically, we first employ structured templates across diverse large language models (LLMs) to pre-generate a large-scale corpus of target-replacement concept prompt pairs. Subsequently, we generate images from these prompts and subject them to rigorous filtering via a visual classifier to ensure concept fidelity and alignment. Extensive experimental results conducted on the ECGVF benchmark with the AR model Janus-Pro demonstrate that EAR achieves marked improvements in both erasure effectiveness and model utility preservation. Code is available at: https://github.com/immc-lab/ear/
Chinese: 本文提出擦除自回归模型(EAR),一种通过窗口梯度累积和阈值损失掩码等策略,在自回归模型中有效消除指定概念同时保持生成质量的微调方法,并在新基准测试中验证了其显著提升。
English: This paper introduces the Erasure Autoregressive Model (EAR), a fine-tuning method that effectively removes unwanted concepts from autoregressive models while preserving generation quality through novel strategies like Windowed Gradient Accumulation and Thresholded Loss Masking, validated on a new benchmark showing significant improvements.

Authors:Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang
Title: BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos
Abstract:
Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial for both automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.
Chinese: BrokenVideos数据集通过提供3,254个带有像素级视觉伪影标注的AI生成视频,填补了该领域缺乏专门基准的空白,显著提升了模型在检测和定位伪影方面的性能。
English: The BrokenVideos dataset addresses the lack of a comprehensive benchmark for artifact localization in AI-generated videos by providing 3,254 videos with pixel-level annotations of visual corruptions, significantly improving model performance in detecting and localizing these artifacts.

Authors:Hsiang-Wei Huang, Wenhao Chai, Kuang-Ming Chen, Cheng-Yen Yang, Jenq-Neng Hwang
Title: ToSA: Token Merging with Spatial Awareness
Abstract:
Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token's feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens only possess weak visual information. In this paper, we propose ToSA, a novel token merging method that combines both semantic and spatial awareness to guide the token merging process. ToSA leverages the depth image as input to generate pseudo spatial tokens, which serve as auxiliary spatial information for the visual token merging process. With the introduced spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results demonstrate that ToSA outperforms previous token merging methods across multiple benchmarks on visual and embodied question answering while largely reducing the runtime of the ViT, making it an efficient solution for ViT acceleration. The code will be available at: https://github.com/hsiangwei0903/ToSA
Chinese Summary: ToSA是一种新颖的令牌合并方法,通过结合语义和空间感知来指导令牌合并过程,在保持关键场景结构的同时显著提升视觉Transformer的运行效率,并在多个基准测试中优于现有方法。
English Summary: ToSA is a novel token merging method that enhances Vision Transformer acceleration by integrating both semantic and spatial awareness, achieving superior performance on visual tasks while significantly reducing computational runtime.

Authors:Hang Zhang, Yuxi Zhang, Jiazheng Wang, Xiang Chen, Renjiu Hu, Xin Tian, Gaolei Li, Min Liu
Title: VoxelOpt: Voxel-Adaptive Message Passing for Discrete Optimization in Deformable Abdominal CT Registration
Abstract:
Recent developments in neural networks have improved deformable image registration (DIR) by amortizing iterative optimization, enabling fast and accurate DIR results. However, learning-based methods often face challenges with limited training data and large deformations, and tend to underperform compared to iterative approaches when label supervision is unavailable. While iterative methods can achieve higher accuracy in such scenarios, they are considerably slower than learning-based methods. To address these limitations, we propose VoxelOpt, a discrete optimization-based DIR framework that combines the strengths of learning-based and iterative methods to achieve a better balance between registration accuracy and runtime. VoxelOpt uses displacement entropy from local cost volumes to measure displacement signal strength at each voxel, which differs from earlier approaches in three key aspects. First, it introduces voxel-wise adaptive message passing, where voxels with lower entropy receive less influence from their neighbors. Second, it employs a multi-level image pyramid with 27-neighbor cost volumes at each level, avoiding exponential complexity growth. Third, it replaces hand-crafted features or contrastive learning with a pretrained foundational segmentation model for feature extraction. In abdominal CT registration, these changes allow VoxelOpt to outperform leading iterative methods in both efficiency and accuracy, while matching state-of-the-art learning-based methods trained with label supervision. The source code will be available at https://github.com/tinymilky/VoxelOpt
中文:VoxelOpt是一种新型可变形图像配准框架,融合了基于学习和迭代方法的优势,通过位移熵和预训练分割模型,在腹部CT配准中实现了更高的精度和效率。
English: VoxelOpt is a novel deformable image registration framework that integrates learning-based and iterative methods, utilizing displacement entropy and a pretrained segmentation model to achieve superior accuracy and efficiency in abdominal CT registration.
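The displacement entropy used to modulate message passing can be sketched directly from a local cost volume: turn the candidate costs at each voxel into a probability distribution and take its entropy, so confident (peaked) voxels get low values. The shapes below are illustrative, not VoxelOpt's actual tensors.

```python
# Sketch of per-voxel displacement entropy from a local cost volume: a peaked
# distribution over candidate displacements (low entropy) marks a confident
# voxel that should be influenced less by its neighbors.
import torch

def displacement_entropy(cost_volume: torch.Tensor) -> torch.Tensor:
    """cost_volume: (B, K, D, H, W) costs over K candidate displacements.
    Returns per-voxel entropy of shape (B, D, H, W)."""
    p = torch.softmax(-cost_volume, dim=1)       # low cost -> high probability
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=1)

cv = torch.rand(1, 27, 8, 16, 16)                # 27-neighbor cost volume
ent = displacement_entropy(cv)
print(ent.shape, ent.max().item())               # bounded by log(27) ~ 3.3
```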

Authors:Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma
Title: Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture
Abstract:
Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it's hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the language's inherent left-to-right structure. (2) Investigate architectural influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that while encoder-only MDMs model a simpler conditional probability space, decoder-only MDMs can achieve dramatic generation speedups ($\sim25\times$) and comparable perplexity with temperature annealing despite modeling a vastly larger space, highlighting key trade-offs. This work thus decouples core paradigm differences from architectural influences, offering insights for future model design. Code is available at https://github.com/scxue/AO-GPT-MDM.
中文摘要:本研究在仅解码器架构下比较掩码扩散模型与自回归模型,发现掩码扩散模型能实现显著生成加速且保持性能,同时强调了区分核心范式差异与架构影响的重要性。
English Summary: This study compares masked diffusion models (MDMs) with autoregressive models using a decoder-only architecture, revealing that MDMs can achieve significant generation speedups while maintaining performance, and highlights the importance of separating paradigm differences from architectural influences.

Authors:Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
Title: Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
Abstract:
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.
Chinese: 本文提出径向注意力机制,通过利用时空能量衰减现象,在保持视频质量的同时显著降低了视频扩散模型的计算复杂度,实现了高效的训练和推理。
English: This paper introduces Radial Attention, a scalable sparse attention mechanism that reduces computational complexity in video diffusion models by leveraging spatiotemporal energy decay, achieving significant efficiency gains while maintaining video quality.
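As a toy picture of the static mask, the sketch below indexes tokens by (frame, spatial position) and lets each token attend to spatial neighbours within a window that halves with temporal distance, mimicking the exponentially decaying compute density. The window sizes and one-dimensional layout here are made up for illustration and do not match the paper's exact mask.

```python
# Toy construction of a radial-style static attention mask: the spatial
# attention window shrinks geometrically with the temporal distance between
# tokens. Illustrative only; not Radial Attention's actual mask layout.
import torch

def radial_mask(n_frames: int, n_space: int, base_window: int = 8) -> torch.Tensor:
    f = torch.arange(n_frames).repeat_interleave(n_space)   # frame index per token
    s = torch.arange(n_space).repeat(n_frames)              # spatial index per token
    dt = (f[:, None] - f[None, :]).abs()                    # temporal distance
    ds = (s[:, None] - s[None, :]).abs()                    # spatial distance
    window = (base_window * (0.5 ** dt.float())).clamp(min=1.0)  # halve per frame gap
    return ds <= window                                     # boolean (N, N) mask

mask = radial_mask(n_frames=4, n_space=16)
print(mask.shape, mask.float().mean().item())   # (64, 64) and its density
```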

Authors:Zehuan Huang, Haoran Feng, Yangtian Sun, Yuanchen Guo, Yanpei Cao, Lu Sheng
Title: AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models
Abstract:
We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to the motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: https://anima-x.github.io/

Authors:Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin
Title: ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
Abstract:
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
中文: ScaleCap是一种推理时可扩展的图像描述策略,通过启发式问答和对比语句评分解决多模态和语言偏见,随着推理成本增加逐步生成更准确、平衡且信息丰富的图像描述。
English: ScaleCap is an inference-time scalable image captioning strategy that addresses multimodal and linguistic biases in LVLMs through heuristic question answering and contrastive sentence rating, progressively generating more accurate and detailed captions with increased inference cost.

Authors:Zekun Li, Rui Zhou, Rahul Sajnani, Xiaoyan Cong, Daniel Ritchie, Srinath Sridhar
Title: GenHSI: Controllable Generation of Human-Scene Interaction Videos
Abstract:
Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in using these models to generate long movie-like videos with rich human-object interactions that include unrealistic human-scene interaction, lack of subject identity preservation, and require expensive training. We propose GenHSI, a training-free method for controllable generation of long human-scene interaction videos (HSI). Taking inspiration from movie animation, our key insight is to overcome the limitations of previous work by subdividing the long video generation task into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene, a user description, and multiple images of a person, we use these three stages to generate long-videos that preserve human-identity and provide rich human-scene interactions. Script writing converts complex human tasks into simple atomic tasks that are used in the pre-visualization stage to generate 3D keyframes (storyboards). These 3D keyframes are rendered and animated by off-the-shelf video diffusion models for consistent long video generation with rich contacts in a 3D-aware manner. A key advantage of our work is that we alleviate the need for scanned, accurate scenes and create 3D keyframes from single-view images. We are the first to generate a long video sequence with a consistent camera pose that contains arbitrary numbers of character actions without training. Experiments demonstrate that our method can generate long videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene. Visit our project homepage https://kunkun0w0.github.io/project/GenHSI/ for more information.

Authors:Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
Title: Improving Progressive Generation with Decomposable Flow Matching
Abstract:
Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures increase the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multi-stage frameworks. On ImageNet-1K 512px, DFM achieves 35.2% improvements in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to the fine-tuning of large models, such as FLUX, DFM shows faster convergence speed to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
中文: 可分解流匹配(DFM)是一种简单有效的框架,在用户定义的多尺度表示的每个层级上独立应用流匹配,通过架构简洁性和最小修改显著提升了图像和视频的视觉质量。
English: Decomposable Flow Matching (DFM) is a simple yet effective framework that independently applies Flow Matching at each level of a multi-scale representation, improving visual quality for images and videos with architectural simplicity and minimal modifications.
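As a rough illustration of applying Flow Matching independently at each level of a user-defined multi-scale representation, the sketch below builds a Laplacian pyramid and sums a standard linear-interpolation flow-matching loss over its levels. The pyramid depth and the way the single model is conditioned on the level index are assumptions, not the paper's exact formulation.

```python
# Minimal sketch of per-level flow matching on a Laplacian pyramid, assuming the
# standard linear-interpolation flow-matching objective; `model(xt, t, level)` is a
# hypothetical single network conditioned on the pyramid level index.
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 3):
    pyr, cur = [], x
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, scale_factor=2, mode="bilinear", align_corners=False)
        pyr.append(cur - up)   # high-frequency residual at this scale
        cur = down
    pyr.append(cur)            # coarsest low-frequency band
    return pyr

def dfm_loss(model, x1: torch.Tensor, levels: int = 3) -> torch.Tensor:
    total = x1.new_zeros(())
    for level, target in enumerate(laplacian_pyramid(x1, levels)):
        x0 = torch.randn_like(target)                              # noise endpoint
        t = torch.rand(target.shape[0], 1, 1, 1, device=target.device)
        xt = (1 - t) * x0 + t * target                             # linear interpolation path
        v_target = target - x0                                     # ground-truth velocity field
        total = total + F.mse_loss(model(xt, t, level), v_target)
    return total
```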

Authors:Yitao Peng, Lianghua He, Hongzhou Chen
Title: ProtoSolo: Interpretable Image Classification via Single-Prototype Activation
Abstract:
Although interpretable prototype networks have improved the transparency of deep learning image classification, the need for multiple prototypes in collaborative decision-making increases cognitive complexity and hinders user understanding. To solve this problem, this paper proposes a novel interpretable deep architecture for image classification, called ProtoSolo. Unlike existing prototypical networks, ProtoSolo requires activation of only a single prototype to complete the classification. This design significantly simplifies interpretation, as the explanation for each class requires displaying only the prototype with the highest similarity score and its corresponding feature map. Additionally, the traditional full-channel feature vector is replaced with a feature map for similarity comparison and prototype learning, enabling the use of richer global information within a single-prototype activation decision. A non-projection prototype learning strategy is also introduced to preserve the association between the prototype and image patch while avoiding abrupt structural changes in the network caused by projection, which can affect classification performance. Experiments on the CUB-200-2011 and Stanford Cars datasets demonstrate that ProtoSolo matches state-of-the-art interpretable methods in classification accuracy while achieving the lowest cognitive complexity. The code is available at https://github.com/pyt19/ProtoSolo.
中文摘要:本文提出ProtoSolo这一新型可解释深度架构,通过仅激活单个原型完成图像分类决策,在保持竞争力的分类准确率同时显著降低了认知复杂度。
English Summary: This paper introduces ProtoSolo, a novel interpretable deep architecture that simplifies image classification by activating only a single prototype per decision, thereby reducing cognitive complexity while maintaining competitive accuracy.

Authors:Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Title: KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Abstract:
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
Chinese: 针对慢思考大语言模型的严重幻觉问题,我们提出KnowRL方法,通过知识增强的强化学习引入事实性奖励机制,引导模型进行基于事实的慢思考,在保持推理能力的同时有效减少错误输出。
English: To address severe hallucinations in slow-thinking Large Language Models, we propose KnowRL, a knowledge-enhanced reinforcement learning method that integrates factuality rewards to guide fact-based reasoning and reduce incorrect outputs while preserving reasoning capabilities.

Authors:Junwei Zhou, Xueting Li, Lu Qi, Ming-Hsuan Yang
Title: CoCo4D: Comprehensive and Complex 4D Scene Generation
Abstract:
Existing 4D synthesis methods primarily focus on object-level generation or dynamic scene synthesis with limited novel views, restricting their ability to generate multi-view consistent and immersive dynamic 4D scenes. To address these constraints, we propose a framework (dubbed CoCo4D) for generating detailed dynamic 4D scenes from text prompts, with the option to include images. Our method leverages the crucial observation that articulated motion typically characterizes foreground objects, whereas background alterations are less pronounced. Consequently, CoCo4D divides 4D scene synthesis into two responsibilities: modeling the dynamic foreground and creating the evolving background, both directed by a reference motion sequence. Given a text prompt and an optional reference image, CoCo4D first generates an initial motion sequence utilizing video diffusion models. This motion sequence then guides the synthesis of both the dynamic foreground object and the background using a novel progressive outpainting scheme. To ensure seamless integration of the moving foreground object within the dynamic background, CoCo4D optimizes a parametric trajectory for the foreground, resulting in realistic and coherent blending. Extensive experiments show that CoCo4D achieves comparable or superior performance in 4D scene generation compared to existing methods, demonstrating its effectiveness and efficiency. More results are presented on our website https://colezwhy.github.io/coco4d/.
中文: 现有4D合成方法难以生成多视角一致的动态场景,而CoCo4D通过运动序列引导的前景与背景分离建模,从文本或图像中创造出沉浸式4D场景。
English: Current 4D synthesis methods are limited in generating multi-view consistent dynamic scenes, but CoCo4D addresses this by separating foreground and background modeling guided by motion sequences to produce immersive 4D scenes from text or images.

Authors:Yihong Luo, Shuchen Xue, Tianyang Hu, Jing Tang
Title: Noise Consistency Training: A Native Approach for One-Step Generator in Learning Additional Controls
Abstract:
The pursuit of efficient and controllable high-quality content generation remains a central challenge in artificial intelligence-generated content (AIGC). While one-step generators, enabled by diffusion distillation techniques, offer excellent generation quality and computational efficiency, adapting them to new control conditions--such as structural constraints, semantic guidelines, or external inputs--poses a significant challenge. Conventional approaches often necessitate computationally expensive modifications to the base model and subsequent diffusion distillation. This paper introduces Noise Consistency Training (NCT), a novel and lightweight approach to directly integrate new control signals into pre-trained one-step generators without requiring access to original training images or retraining the base diffusion model. NCT operates by introducing an adapter module and employs a noise consistency loss in the noise space of the generator. This loss aligns the adapted model's generation behavior across noises that are conditionally dependent to varying degrees, implicitly guiding it to adhere to the new control. Theoretically, this training objective can be understood as minimizing the distributional distance between the adapted generator and the conditional distribution induced by the new conditions. NCT is modular, data-efficient, and easily deployable, relying only on the pre-trained one-step generator and a control signal model. Extensive experiments demonstrate that NCT achieves state-of-the-art controllable generation in a single forward pass, surpassing existing multi-step and distillation-based methods in both generation quality and computational efficiency. Code is available at https://github.com/Luo-Yihong/NCT
中文: 本文提出的噪声一致性训练(NCT)是一种轻量级方法,可在无需重新训练的情况下将新控制信号集成到预训练单步生成器中,实现了在生成质量和计算效率方面均超越现有方法的最优可控生成效果。
English: This paper introduces Noise Consistency Training (NCT), a lightweight method that enables pre-trained one-step generators to incorporate new control signals without retraining, achieving state-of-the-art controllable generation with superior quality and efficiency.
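The abstract describes a noise consistency loss that aligns the adapted generator's behavior across noises that are conditionally dependent to varying degrees. The sketch below is a loose, illustrative reading of that idea under our own simplifying assumptions (a correlated-noise construction and a stop-gradient target); it is not the paper's objective.

```python
# Loose, illustrative sketch of a noise-consistency-style objective. `adapted_generator`
# stands for the frozen one-step generator with a trainable adapter injected (hypothetical
# callable); only the adapter would receive gradients in practice.
import torch
import torch.nn.functional as F

def noise_consistency_loss(adapted_generator, control, z, alpha: float = 0.9):
    # Build a noise that is partially dependent on z (alpha controls the dependence).
    z_dep = alpha * z + (1 - alpha ** 2) ** 0.5 * torch.randn_like(z)
    x = adapted_generator(z, control)
    with torch.no_grad():
        x_ref = adapted_generator(z_dep, control)   # consistency target from the dependent noise
    return F.mse_loss(x, x_ref)
```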

Authors:Yang Xing, Jiong Wu, Yuheng Bu, Kuang Gong
Title: SAM2-SGP: Enhancing SAM2 for Medical Image Segmentation via Support-Set Guided Prompting
Abstract:
Although new vision foundation models such as Segment Anything Model 2 (SAM2) have significantly enhanced zero-shot image segmentation capabilities, reliance on human-provided prompts poses significant challenges in adapting SAM2 to medical image segmentation tasks. Moreover, SAM2's performance in medical image segmentation was limited by the domain shift issue, since it was originally trained on natural images and videos. To address these challenges, we proposed SAM2 with support-set guided prompting (SAM2-SGP), a framework that eliminated the need for manual prompts. The proposed model leveraged the memory mechanism of SAM2 to generate pseudo-masks using image-mask pairs from a support set via a Pseudo-mask Generation (PMG) module. We further introduced a novel Pseudo-mask Attention (PMA) module, which used these pseudo-masks to automatically generate bounding boxes and enhance localized feature extraction by guiding attention to relevant areas. Furthermore, a low-rank adaptation (LoRA) strategy was adopted to mitigate the domain shift issue. The proposed framework was evaluated on both 2D and 3D datasets across multiple medical imaging modalities, including fundus photography, X-ray, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound. The results demonstrated a significant performance improvement over state-of-the-art models, such as nnUNet and SwinUNet, as well as foundation models, such as SAM2 and MedSAM2, underscoring the effectiveness of the proposed approach. Our code is publicly available at https://github.com/astlian9/SAM_Support.
中文: 提出的SAM2-SGP框架通过支持集生成伪掩码并利用注意力机制,无需人工提示即可完成医学图像分割,同时采用LoRA策略解决领域偏移问题,在多种影像模态中均实现了优越性能。
English: The proposed SAM2-SGP framework eliminates the need for manual prompts in medical image segmentation by generating pseudo-masks from support sets and using attention mechanisms, while employing LoRA to address domain shift, achieving superior performance across multiple imaging modalities.

Authors:Oscar J. Pellicer-Valero, Cesar Aybar, Gustau Camps Valls
Title: Video Compression for Spatiotemporal Earth System Data
Abstract:
Large-scale Earth system datasets, from high-resolution remote sensing imagery to spatiotemporal climate model outputs, exhibit characteristics analogous to those of standard videos. Their inherent spatial, temporal, and spectral redundancies can thus be readily exploited by established video compression techniques. Here, we present xarrayvideo, a Python library for compressing multichannel spatiotemporal datasets by encoding them as videos. Our approach achieves compression ratios of up to 250x while maintaining high fidelity by leveraging standard, well-optimized video codecs through ffmpeg. We demonstrate the library's effectiveness on four real-world multichannel spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images), DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2 images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58, and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90, and 55.04 dB at 1 bpppb. We are redistributing two of these datasets, DeepExtremeCubes (2.3 TB) and DynamicEarthNet (525 GB), in the machine-learning-ready and cloud-ready TACO format through HuggingFace at significantly reduced sizes (270 GB and 8.5 GB, respectively) without compromising quality (PSNR 55.77-56.65 and 60.15). No performance loss is observed when the compressed versions of these datasets are used in their respective deep learning-based downstream tasks (next step reflectance prediction and landcover segmentation). In conclusion, xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. The library is available for use at https://github.com/IPL-UV/xarrayvideo
中文: xarrayvideo Python库通过将地球系统数据集转换为视频格式,利用标准编解码器实现了高达250倍的压缩率,同时保持数据质量并确保其在机器学习任务中的可用性。
English: The xarrayvideo Python library compresses large Earth system datasets by converting them into videos using standard codecs, achieving up to 250x compression while preserving data quality and usability for machine learning tasks.
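The fidelity numbers quoted above are standard PSNR values; for readers unfamiliar with the metric, the snippet below shows the textbook computation from mean squared error. This is generic NumPy, not xarrayvideo's API.

```python
# Textbook PSNR: 10 * log10(MAX^2 / MSE), applied to a reconstructed multichannel array.
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float) -> float:
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

x = np.random.rand(4, 256, 256)                    # 4 bands, values in [0, 1]
x_hat = x + np.random.normal(0, 1e-3, x.shape)     # simulated lossy reconstruction
print(round(psnr(x, x_hat, max_val=1.0), 2), "dB")
```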

Authors:Gaurav Sharma, Ravi Kothari, Josef Schmid
Title: Self-Supervised Multimodal NeRF for Autonomous Driving
Abstract:
In this paper, we propose a Neural Radiance Fields (NeRF) based framework, referred to as Novel View Synthesis Framework (NVSF). It jointly learns an implicit neural representation of the space- and time-varying scene for both LiDAR and camera data. We test this on a real-world autonomous driving scenario containing both static and dynamic scenes. Compared to existing multimodal dynamic NeRFs, our framework is self-supervised, thus eliminating the need for 3D labels. For efficient training and faster convergence, we introduce heuristic-based image pixel sampling to focus on pixels with rich information. To preserve the local features of LiDAR points, a Double Gradient based mask is employed. Extensive experiments on the KITTI-360 dataset show that, compared to the baseline models, our framework achieves the best performance in both the LiDAR and camera domains. The code is available at https://github.com/gaurav00700/Selfsupervised-NVSF
Chinese: 本文提出了一种名为NVSF的自监督神经辐射场框架,通过启发式采样和双梯度掩码技术联合建模激光雷达与相机的动态场景,在KITTI-360数据集上实现了最佳性能。
English: This paper introduces a self-supervised Neural Radiance Fields framework called NVSF that jointly models dynamic scenes for LiDAR and camera data, achieving state-of-the-art performance on the KITTI-360 dataset through heuristic sampling and gradient-based feature preservation.

Authors:Gencer Sumbul, Chang Xu, Emanuele Dalsasso, Devis Tuia
Title: SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images
Abstract:
From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is crucial for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models, whether task-specific or foundational, are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SMARTIES outperforms previous models that rely on sensor-specific pretraining. Our code and pretrained models are available at https://gsumbul.github.io/SMARTIES.

Authors:Tao Huang, Zhekun Liu, Rui Wang, Yang Zhang, Liping Jing
Title: Visual hallucination detection in large vision-language models via evidential conflict
Abstract:
Despite the remarkable multimodal capabilities of Large Vision-Language Models (LVLMs), discrepancies often occur between visual inputs and textual outputs--a phenomenon we term visual hallucination. This critical reliability gap poses substantial risks in safety-critical Artificial Intelligence (AI) applications, necessitating a comprehensive evaluation benchmark and effective detection methods. Firstly, we observe that existing visual-centric hallucination benchmarks mainly assess LVLMs from a perception perspective, overlooking hallucinations arising from advanced reasoning capabilities. We develop the Perception-Reasoning Evaluation Hallucination (PRE-HAL) dataset, which enables the systematic evaluation of both perception and reasoning capabilities of LVLMs across multiple visual semantics, such as instances, scenes, and relations. Comprehensive evaluation with this new benchmark exposed more visual vulnerabilities, particularly in the more challenging task of relation reasoning. To address this issue, we propose, to the best of our knowledge, the first Dempster-Shafer theory (DST)-based visual hallucination detection method for LVLMs through uncertainty estimation. This method aims to efficiently capture the degree of conflict in high-level features at the model inference phase. Specifically, our approach employs simple mass functions to mitigate the computational complexity of evidence combination on power sets. We conduct an extensive evaluation of state-of-the-art LVLMs, LLaVA-v1.5, mPLUG-Owl2 and mPLUG-Owl3, with the new PRE-HAL benchmark. Experimental results indicate that our method outperforms five baseline uncertainty metrics, achieving average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Our code is available at https://github.com/HT86159/Evidential-Conflict.
中文摘要:大型视觉语言模型存在视觉幻觉导致可靠性风险,为此开发了PRE-HAL评估基准和基于Dempster-Shafer理论的检测方法,该方法在多项测试中优于现有基准指标。
English Summary: Large Vision-Language Models suffer from visual hallucinations causing reliability risks, prompting the creation of the PRE-HAL benchmark for systematic evaluation and a novel Dempster-Shafer theory-based detection method that outperforms existing metrics.
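The detector above rests on Dempster-Shafer theory with simple mass functions, where the conflict produced when combining evidence flags a likely hallucination. The snippet below shows standard Dempster's rule on a two-element frame of discernment; the binary frame and the mass values are illustrative assumptions, not the paper's construction.

```python
# Standard Dempster's rule for two simple mass functions on the frame {F, H}
# (F = faithful, H = hallucinated, "FH" = ignorance). The conflict mass K is the
# kind of evidential conflict a DST-based detector can threshold.
from itertools import product

def combine(m1: dict, m2: dict):
    fused, conflict = {}, 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        inter = "".join(sorted(set(a) & set(b)))
        if inter:
            fused[inter] = fused.get(inter, 0.0) + pa * pb
        else:
            conflict += pa * pb                       # mass assigned to contradictory sets
    fused = {k: v / (1.0 - conflict) for k, v in fused.items()}  # renormalize by 1 - K
    return fused, conflict

# Hypothetical evidence from two high-level features:
m_vis = {"F": 0.7, "FH": 0.3}      # one feature mostly supports "faithful"
m_txt = {"H": 0.6, "FH": 0.4}      # another leans "hallucinated"
fused, K = combine(m_vis, m_txt)
print(fused, "conflict K =", round(K, 3))  # a large K flags a likely hallucination
```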

Authors:Pengfei Hao, Shuaibo Li, Hongqiu Wang, Zhizhuo Kou, Junhang Zhang, Guang Yang, Lei Zhu
Title: Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning
Abstract:
In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To address this issue, inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, including paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). Then, we propose the first Reasoning MLLM for Surgical-VQLA (Surgery-R1). In our Surgery-R1, we design a two-stage fine-tuning mechanism to enable the basic MLLM with complex reasoning abilities by utilizing supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Furthermore, for an efficient and high-quality rule-based reward system in our RFT, we design a Multimodal Coherence reward mechanism to mitigate positional illusions that may arise in surgical scenarios. Experiment results demonstrate that Surgery-R1 outperforms other existing state-of-the-art (SOTA) models in the Surgical-VQLA task and widely-used MLLMs, while also validating its reasoning capabilities and the effectiveness of our approach. The code and dataset will be organized in https://github.com/FiFi-HAO467/Surgery-R1.
中文: 本研究提出了首个用于手术视觉问答定位的推理多模态大语言模型Surgery-R1,采用两阶段微调方法和多模态一致性奖励机制,显著提升了推理能力并超越了现有最优模型。
English: This study introduces Surgery-R1, the first reasoning multimodal large language model for surgical visual question localized-answering, which employs a two-stage fine-tuning approach and a multimodal coherence reward mechanism to enhance reasoning capabilities and outperform existing models.

Authors:Ankita Raj, Harsh Swaika, Deepankar Varma, Chetan Arora
Title: Assessing Risk of Stealing Proprietary Models for Medical Imaging Tasks
Abstract:
The success of deep learning in medical imaging applications has led several companies to deploy proprietary models in diagnostic workflows, offering monetized services. Even though model weights are hidden to protect the intellectual property of the service provider, these models are exposed to model stealing (MS) attacks, where adversaries can clone the model's functionality by querying it with a proxy dataset and training a thief model on the acquired predictions. While extensively studied on general vision tasks, the susceptibility of medical imaging models to MS attacks remains inadequately explored. This paper investigates the vulnerability of black-box medical imaging models to MS attacks under realistic conditions where the adversary lacks access to the victim model's training data and operates with limited query budgets. We demonstrate that adversaries can effectively execute MS attacks by using publicly available datasets. To further enhance MS capabilities with limited query budgets, we propose a two-step model stealing approach termed QueryWise. This method capitalizes on unlabeled data obtained from a proxy distribution to train the thief model without incurring additional queries. Evaluation on two medical imaging models for Gallbladder Cancer and COVID-19 classification substantiates the effectiveness of the proposed attack. The source code is available at https://github.com/rajankita/QueryWise.
中文摘要:医疗影像中的深度学习模型易受模型窃取攻击,攻击者可通过代理数据集和有限查询克隆专有模型,为此提出的QueryWise方法能在无需额外查询下提升窃取效率。
English Summary: Deep learning models in medical imaging are vulnerable to model stealing attacks, where adversaries clone proprietary models using proxy datasets and limited queries, prompting the development of the QueryWise method to enhance theft efficiency without extra queries.

Authors:Zhifeng Wang, Renjiao Yi, Xin Wen, Chenyang Zhu, Kai Xu, Kunlun He
Title: Angio-Diff: Learning a Self-Supervised Adversarial Diffusion Model for Angiographic Geometry Generation
Abstract:
Vascular diseases pose a significant threat to human health, with X-ray angiography established as the gold standard for diagnosis, allowing for detailed observation of blood vessels. However, angiographic X-rays expose personnel and patients to higher radiation levels than non-angiographic X-rays. Thus, modality translation from non-angiographic to angiographic X-rays is desirable. Data-driven deep approaches are hindered by the lack of paired large-scale X-ray angiography datasets. This makes high-quality vascular angiography synthesis crucial, yet it remains challenging. We find that current medical image synthesis primarily operates at the pixel level and struggles to adapt to the complex geometric structure of blood vessels, resulting in unsatisfactory quality of blood vessel image synthesis, such as disconnections or unnatural curvatures. To overcome this issue, we propose a self-supervised method via diffusion models to transform non-angiographic X-rays into angiographic X-rays, mitigating data shortages for data-driven approaches. Our model comprises a diffusion model that learns the distribution of vascular data from the diffusion latent, a generator for vessel synthesis, and a mask-based adversarial module. To enhance geometric accuracy, we propose a parametric vascular model to fit the shape and distribution of blood vessels. The proposed method contributes a pipeline and a synthetic dataset for X-ray angiography. We conducted extensive comparative and ablation experiments to evaluate Angio-Diff. The results demonstrate that our method achieves state-of-the-art performance in synthetic angiography image quality and more accurately synthesizes the geometric structure of blood vessels. The code is available at https://github.com/zfw-cv/AngioDiff.
Chinese: 本研究提出了一种名为Angio-Diff的自监督扩散模型,能将非血管造影X光转换为血管造影图像,通过参数化血管模型和对抗训练有效解决了数据稀缺问题,显著提升了血管几何结构的合成精度。
English: This study introduces a self-supervised diffusion model called Angio-Diff that converts non-angiographic X-rays into angiographic images, effectively addressing data scarcity and improving geometric accuracy in blood vessel synthesis through a parametric vascular model and adversarial training.

Authors:Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
Title: Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
Abstract:
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce \textbf{Mem4Nav}, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
中文: Mem4Nav提出了一种层次化空间认知记忆系统,通过融合细粒度体素索引与语义地标连通性来增强视觉语言导航智能体,在多个基准测试中实现了显著的性能提升。
English: Mem4Nav introduces a hierarchical spatial-cognition memory system that enhances Vision-and-Language Navigation agents by combining fine-grained voxel indexing with semantic landmark connectivity, achieving significant performance improvements across multiple benchmarks.
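Mem4Nav anchors long-term memory tokens at sparse octree nodes; a common way to index such nodes is a Morton (Z-order) key obtained by interleaving the bits of the voxel coordinates. The sketch below shows only that generic indexing step, with the tree depth and voxel size as assumptions; it is not the paper's code.

```python
# Generic Morton (Z-order) key for sparse octree voxel indexing: interleave the bits of
# the three voxel indices into a single integer key that can address a node/memory slot.
def morton_key(ix: int, iy: int, iz: int, depth: int = 10) -> int:
    key = 0
    for bit in range(depth):
        key |= ((ix >> bit) & 1) << (3 * bit)
        key |= ((iy >> bit) & 1) << (3 * bit + 1)
        key |= ((iz >> bit) & 1) << (3 * bit + 2)
    return key

def voxel_of(x: float, y: float, z: float, voxel_size: float = 0.5):
    return tuple(int(c // voxel_size) for c in (x, y, z))

memory = {}                                       # octree-node key -> memory token handle
ix, iy, iz = voxel_of(12.3, 4.7, 0.9)
memory[morton_key(ix, iy, iz)] = "ltm_token_0"    # hypothetical token id
print(sorted(memory))
```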

Authors:Yin Zhang, Zian Ning, Xiaoyu Zhang, Shiliang Guo, Peidong Liu, Shiyu Zhao
Title: EvDetMAV: Generalized MAV Detection from Moving Event Cameras
Abstract:
Existing micro aerial vehicle (MAV) detection methods mainly rely on the target's appearance features in RGB images, whose diversity makes it difficult to achieve generalized MAV detection. We notice that different types of MAVs share the same distinctive features in event streams due to their high-speed rotating propellers, which are hard to see in RGB images. This paper studies how to detect different types of MAVs from an event camera by fully exploiting the features of propellers in the original event stream. The proposed method consists of three modules to extract the salient and spatio-temporal features of the propellers while filtering out noise from background objects and camera motion. Since there are no existing event-based MAV datasets, we introduce a novel MAV dataset for the community. This is the first event-based MAV dataset comprising multiple scenarios and different types of MAVs. Without training, our method significantly outperforms state-of-the-art methods and can deal with challenging scenarios, achieving a precision rate of 83.0\% (+30.3\%) and a recall rate of 81.5\% (+36.4\%) on the proposed testing dataset. The dataset and code are available at: https://github.com/WindyLab/EvDetMAV.
Chinese: 本研究提出了一种基于事件的新型微飞行器检测方法,通过利用事件流中独特的螺旋桨特征,在无需训练的情况下实现了卓越性能,并提供了首个基于事件的微飞行器数据集。
English: This study introduces a novel event-based method for detecting micro aerial vehicles by leveraging the distinctive propeller features in event streams, achieving superior performance without training and providing the first event-based MAV dataset.

Authors:Declan J. Curran, Sanaa Hobeichi, Hira Saleem, Hao Xue, Flora D. Salim
Title: Generate the Forest before the Trees -- A Hierarchical Diffusion model for Climate Downscaling
Abstract:
Downscaling is essential for generating the high-resolution climate data needed for local planning, but traditional methods remain computationally demanding. Recent years have seen impressive results from AI downscaling models, particularly diffusion models, which have attracted attention due to their ability to generate ensembles and overcome the smoothing problem common in other AI methods. However, these models typically remain computationally intensive. We introduce a Hierarchical Diffusion Downscaling (HDD) model, which adds an easily extensible hierarchical sampling process to the diffusion framework. A coarse-to-fine hierarchy is imposed via a simple downsampling scheme. HDD achieves competitive accuracy on ERA5 reanalysis datasets and CMIP6 models while significantly reducing computational load by running on up to half as many pixels. Additionally, a single model trained at 0.25° resolution transfers seamlessly across multiple CMIP6 models with much coarser resolution. HDD thus offers a lightweight alternative for probabilistic climate downscaling, facilitating affordable large-ensemble high-resolution climate projections. See a full code implementation at: https://github.com/HDD-Hierarchical-Diffusion-Downscaling/HDD-Hierarchical-Diffusion-Downscaling.
Chinese: 分层扩散降尺度(HDD)模型通过引入可扩展的分层采样过程,在保持竞争力的气候数据降尺度精度的同时显著降低计算负荷,为高分辨率气候预测提供了一种轻量级解决方案。
English: The Hierarchical Diffusion Downscaling (HDD) model introduces a computationally efficient hierarchical sampling process that achieves competitive accuracy in climate data downscaling while significantly reducing computational load, offering a lightweight solution for high-resolution climate projections.
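The coarse-to-fine hierarchy described above can be pictured as denoising at a heavily downsampled resolution first, then upsampling and refining at each finer level, so most steps touch far fewer pixels. The sketch below assumes a particular level schedule, bilinear upsampling, and a hypothetical `denoise_fn` conditioned on the level; it is not the paper's sampler.

```python
# Sketch of coarse-to-fine hierarchical diffusion sampling: start from noise at a coarse
# resolution, refine, upsample, and repeat at finer levels. Level schedule is assumed.
import torch
import torch.nn.functional as F

@torch.no_grad()
def hierarchical_sample(denoise_fn, shape, levels=(4, 2, 1), steps_per_level=10):
    """`denoise_fn(x, t, level)` is a hypothetical single model conditioned on the level."""
    x = None
    for level in levels:                                   # coarse -> fine
        h, w = shape[-2] // level, shape[-1] // level
        if x is None:
            x = torch.randn(*shape[:-2], h, w)             # start from pure noise, coarsest grid
        else:
            x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
        for step in reversed(range(steps_per_level)):
            t = torch.full((shape[0],), step / steps_per_level)
            x = denoise_fn(x, t, level)                    # one refinement step at this level
    return x
```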

Authors:Solveig Thrun, Stine Hansen, Zijun Sun, Nele Blum, Suaiba A. Salahuddin, Kristoffer Wickstrøm, Elisabeth Wetzer, Robert Jenssen, Maik Stille, Michael Kampffmeyer
Title: Reconsidering Explicit Longitudinal Mammography Alignment for Enhanced Breast Cancer Risk Prediction
Abstract:
Regular mammography screening is essential for early breast cancer detection. Deep learning-based risk prediction methods have sparked interest to adjust screening intervals for high-risk groups. While early methods focused only on current mammograms, recent approaches leverage the temporal aspect of screenings to track breast tissue changes over time, requiring spatial alignment across different time points. Two main strategies for this have emerged: explicit feature alignment through deformable registration and implicit learned alignment using techniques like transformers, with the former providing more control. However, the optimal approach for explicit alignment in mammography remains underexplored. In this study, we provide insights into where explicit alignment should occur (input space vs. representation space) and if alignment and risk prediction should be jointly optimized. We demonstrate that jointly learning explicit alignment in representation space while optimizing risk estimation performance, as done in the current state-of-the-art approach, results in a trade-off between alignment quality and predictive performance and show that image-level alignment is superior to representation-level alignment, leading to better deformation field quality and enhanced risk prediction accuracy. The code is available at https://github.com/sot176/Longitudinal_Mammogram_Alignment.git.
中文: 本研究探索纵向乳腺摄影的显式对齐策略,发现图像级对齐优于表征级对齐,既能提升形变场质量又能增强风险预测精度,避免了联合优化带来的性能权衡。
English: This study investigates explicit alignment strategies in longitudinal mammogram analysis, finding that image-level alignment outperforms representation-level alignment by improving deformation field quality and risk prediction accuracy without the trade-offs of joint optimization.
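Image-level explicit alignment, as favored by the study above, amounts to warping the prior mammogram onto the current one with a dense deformation field before risk prediction. The sketch below shows only that generic warping step with torch's `grid_sample`; where the deformation field comes from (e.g., a registration network) is left as an assumption.

```python
# Generic image-level warping with a dense deformation field (pixel displacements).
import torch
import torch.nn.functional as F

def warp(prior: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """prior: (B, 1, H, W); flow: (B, 2, H, W) displacements in pixels, (dx, dy)."""
    b, _, h, w = prior.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # absolute coordinates
    grid[:, 0] = 2 * grid[:, 0] / (w - 1) - 1                          # normalize x to [-1, 1]
    grid[:, 1] = 2 * grid[:, 1] / (h - 1) - 1                          # normalize y to [-1, 1]
    return F.grid_sample(prior, grid.permute(0, 2, 3, 1), align_corners=True)

prior = torch.rand(1, 1, 64, 64)
flow = torch.zeros(1, 2, 64, 64)          # identity deformation as a smoke test
assert torch.allclose(warp(prior, flow), prior, atol=1e-5)
```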

Authors:Jintao Rong, Xin Xie, Xinyi Yu, Linlin Ou, Xinyu Zhang, Chunhua Shen, Dong Gong
Title: Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation
Abstract:
Distilled video generation models offer fast and efficient synthesis but struggle with motion customization when guided by reference videos, especially under training-free settings. Existing training-free methods, originally designed for standard diffusion models, fail to generalize due to the accelerated generative process and large denoising steps in distilled models. To address this, we propose MotionEcho, a novel training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing. Our approach uses high-quality, slow teacher models to guide the inference of fast student models through endpoint prediction and interpolation. To maintain efficiency, we dynamically allocate computation across timesteps according to guidance needs. Extensive experiments across various distilled video generation models and benchmark datasets demonstrate that our method significantly improves motion fidelity and generation quality while preserving high efficiency. Project page: https://euminds.github.io/motionecho/

Authors:Mingcheng Qu, Guang Yang, Donglin Di, Yue Gao, Tonghua Su, Yang Song, Lei Fan
Title: Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning
Abstract:
Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://github.com/MCPathology/M2Surv
中文: 本研究提出了一种多模态生存预测框架,通过超图学习有效整合病理切片与基因组数据,利用记忆机制解决模态不平衡和缺失问题,在五种TCGA数据集上验证了其优越性能。
English: This study introduces a multimodal survival prediction framework using hypergraph learning to effectively integrate pathology slides and genomic data while addressing modality imbalance and compensating for incomplete data through a memory mechanism, achieving superior performance over existing methods.
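The memory mechanism described above stores previously learned paired pathology-genomic features and fills in a missing modality at inference. The sketch below illustrates one plausible reading: a memory bank of embedding pairs queried by cosine similarity, with the single-neighbour lookup being an assumption rather than the paper's design.

```python
# Minimal sketch of a paired-feature memory: store (pathology, genomic) embedding pairs
# during training; if genomics is missing at test time, borrow the genomic embedding of
# the most similar stored pathology embedding. Cosine retrieval is an assumption.
import torch
import torch.nn.functional as F

class PairedMemory:
    def __init__(self):
        self.path_keys, self.gene_vals = [], []

    def write(self, path_emb: torch.Tensor, gene_emb: torch.Tensor):
        self.path_keys.append(F.normalize(path_emb, dim=-1))
        self.gene_vals.append(gene_emb)

    def compensate_genomics(self, path_emb: torch.Tensor) -> torch.Tensor:
        keys = torch.stack(self.path_keys)                  # (M, D) stored pathology keys
        sims = keys @ F.normalize(path_emb, dim=-1)         # cosine similarity to the query
        return self.gene_vals[int(sims.argmax())]           # borrowed genomic feature

mem = PairedMemory()
for _ in range(8):
    mem.write(torch.randn(128), torch.randn(64))
print(mem.compensate_genomics(torch.randn(128)).shape)      # torch.Size([64])
```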

Authors:Yuang Yao, Ruiqi Wu, Yi Zhou, Tao Zhou
Title: Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities
Abstract:
Traditional fundus image analysis models focus on single-modal tasks, ignoring fundus modality complementarity, which limits their versatility. Recently, retinal foundation models have emerged, but most still remain modality-specific. Integrating multiple fundus imaging modalities into a single foundation model is valuable. However, in dynamic environments, data from different modalities often arrive incrementally, necessitating continual pre-training. To address this, we propose RetCoP, the first continual vision-language pre-training framework in the fundus domain, which incrementally integrates image and text features from different imaging modalities into a single unified foundation model. To mitigate catastrophic forgetting in continual pre-training, we introduce a rehearsal strategy utilizing representative image-text pairs and an off-diagonal information distillation approach. The former allows the model to revisit knowledge from previous stages, while the latter explicitly preserves the alignment between image and text representations. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate. The code can be found at https://github.com/Yuang-Yao/RetCoP.
中文: RetCoP是首个眼底领域的持续视觉语言预训练框架,通过代表性图文对复现和离对角线信息蒸馏策略,逐步整合多模态图像与文本数据,在实现最佳泛化性能的同时显著降低了遗忘率。
English: RetCoP is a pioneering continual vision-language pre-training framework for fundus analysis that incrementally integrates multimodal image and text data while mitigating catastrophic forgetting through rehearsal strategies and off-diagonal distillation, achieving superior generalization with minimal forgetting.
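One way to picture the off-diagonal information distillation mentioned above is to match the off-diagonal entries of the image-text similarity matrix between the previous-stage (teacher) encoders and the current (student) ones, preserving cross-sample alignment structure. The temperature and KL form in the sketch below are assumptions for illustration, not the paper's exact loss.

```python
# Hedged sketch of an off-diagonal distillation loss over image-text similarity matrices.
import torch
import torch.nn.functional as F

def off_diagonal_distill(img_s, txt_s, img_t, txt_t, tau: float = 0.07) -> torch.Tensor:
    sim_s = F.normalize(img_s, dim=-1) @ F.normalize(txt_s, dim=-1).T / tau  # student matrix
    sim_t = F.normalize(img_t, dim=-1) @ F.normalize(txt_t, dim=-1).T / tau  # teacher matrix
    mask = ~torch.eye(sim_s.size(0), dtype=torch.bool)                       # drop the diagonal
    p_t = F.softmax(sim_t[mask].view(sim_s.size(0), -1), dim=-1)
    logp_s = F.log_softmax(sim_s[mask].view(sim_s.size(0), -1), dim=-1)
    return F.kl_div(logp_s, p_t, reduction="batchmean")

b, d = 16, 256
loss = off_diagonal_distill(torch.randn(b, d), torch.randn(b, d),
                            torch.randn(b, d), torch.randn(b, d))
print(float(loss))
```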

Authors:Xiangbo Gao, Yuheng Wu, Fengze Yang, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu
Title: AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration
Abstract:
While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.
中文摘要:AirV2X-Perception提出了一种创新的无人机感知数据集,通过提供灵活的空中感知能力,解决了传统V2X系统在自动驾驶应用中的局限性。
English Summary: AirV2X-Perception introduces a novel drone-based dataset that overcomes the limitations of traditional V2X systems by providing flexible aerial perception capabilities for autonomous driving applications.

Authors:Rui Huang, Jincheng Zeng, Sen Gao, Yan Xing
Title: 3D-SSM: A Novel 3D Selective Scan Module for Remote Sensing Change Detection
Abstract:
Existing Mamba-based approaches in remote sensing change detection have enhanced scanning models, yet remain limited by their inability to capture long-range dependencies between image channels effectively, which restricts their feature representation capabilities. To address this limitation, we propose a 3D selective scan module (3D-SSM) that captures global information from both the spatial plane and channel perspectives, enabling a more comprehensive understanding of the data. Based on the 3D-SSM, we present two key components: a spatiotemporal interaction module (SIM) and a multi-branch feature extraction module (MBFEM). The SIM facilitates bi-temporal feature integration by enabling interactions between global and local features across images from different time points, thereby enhancing the detection of subtle changes. Meanwhile, the MBFEM combines features from the frequency domain, spatial domain, and 3D-SSM to provide a rich representation of contextual information within the image. Our proposed method demonstrates favourable performance compared to state-of-the-art change detection methods on five benchmark datasets through extensive experiments. Code is available at https://github.com/VerdantMist/3D-SSM
中文: 现有基于Mamba的遥感变化检测方法因无法有效捕捉图像通道间的长程依赖而受限,为此提出3D选择性扫描模块(3D-SSM),通过时空交互模块和多分支特征提取模块整合空间与通道的全局信息,在多个基准数据集上实现了优越的性能。
English: Current Mamba-based remote sensing change detection methods are constrained by inadequate long-range dependency capture across image channels, prompting the development of a 3D selective scan module (3D-SSM) that integrates spatial and channel perspectives to enhance feature representation through spatiotemporal interaction and multi-branch feature extraction, achieving superior performance on benchmark datasets.

Authors:Sunggu Kyung, Hyungbin Park, Jinyoung Seo, Jimin Sung, Jihyun Kim, Dongyeong Kim, Wooyoung Jo, Yoojin Nam, Sangah Park, Taehee Kwon, Sang Min Lee, Namkug Kim
Title: MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports
Abstract:
Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice. The code and datasets are available at https://github.com/babbu3682/MedErr-CT.
中文:MedErr-CT基准通过视觉问答评估医学多模态大模型在CT报告中识别和纠正错误的能力,揭示了不同错误类型间的显著性能差异,旨在提升临床应用的可靠性。
English: The MedErr-CT benchmark evaluates medical Multimodal Large Language Models' ability to identify and correct errors in CT reports through visual question answering, revealing significant performance variations across error types to enhance clinical reliability.

Authors:Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger
Title: PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Scenes
Abstract:
Large-scale 3D semantic scene generation has predominantly relied on voxel-based representations, which are memory-intensive, bound by fixed resolutions, and challenging to edit. In contrast, primitives represent semantic entities using compact, coarse 3D structures that are easy to manipulate and compose, making them an ideal representation for this task. In this paper, we introduce PrITTI, a latent diffusion-based framework that leverages primitives as the main foundational elements for generating compositional, controllable, and editable 3D semantic scene layouts. Our method adopts a hybrid representation, modeling ground surfaces in a rasterized format while encoding objects as vectorized 3D primitives. This decomposition is also reflected in a structured latent representation that enables flexible scene manipulation of ground and object components. To overcome the orientation ambiguities in conventional encoding methods, we introduce a stable Cholesky-based parameterization that jointly encodes object size and orientation. Experiments on the KITTI-360 dataset show that PrITTI outperforms a voxel-based baseline in generation quality, while reducing memory requirements by up to $3\times$. In addition, PrITTI enables direct instance-level manipulation of objects in the scene and supports a range of downstream applications, including scene inpainting, outpainting, and photo-realistic street-view synthesis.
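The stable Cholesky-based parameterization mentioned above can be illustrated on a ground-plane box: form the covariance-like matrix R diag(s^2) R^T from size s and yaw theta, and store its unique lower-triangular Cholesky factor, which sidesteps angle wrap-around. The sketch below is purely illustrative of this encoding idea, not the paper's exact parameterization.

```python
# Illustrative Cholesky-style joint size/orientation encoding for a 2D (ground-plane) box.
# Decoding recovers size and yaw up to the usual 180-degree box symmetry.
import numpy as np

def encode(sx: float, sy: float, theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    sigma = R @ np.diag([sx ** 2, sy ** 2]) @ R.T      # covariance-like shape matrix
    L = np.linalg.cholesky(sigma)                      # unique lower-triangular factor
    return L[np.tril_indices(2)]                       # 3 numbers: l11, l21, l22

def decode(params: np.ndarray):
    L = np.zeros((2, 2))
    L[np.tril_indices(2)] = params
    eigvals, eigvecs = np.linalg.eigh(L @ L.T)
    sx, sy = np.sqrt(eigvals[::-1])                    # sizes from eigenvalues (descending)
    theta = np.arctan2(eigvecs[1, -1], eigvecs[0, -1]) # yaw of the major axis
    return sx, sy, theta

print(np.round(decode(encode(4.0, 1.8, 0.5)), 3))      # recovers (4.0, 1.8, 0.5) up to symmetry
```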

Authors:Ilia Beletskii, Andrey Kuznetsov, Aibek Alanov
Title: Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models
Abstract:
Recent advances in image editing with diffusion models have achieved impressive results, offering fine-grained control over the generation process. However, these methods are computationally intensive because of their iterative nature. While distilled diffusion models enable faster inference, their editing capabilities remain limited, primarily because of poor inversion quality. High-fidelity inversion and reconstruction are essential for precise image editing, as they preserve the structural and semantic integrity of the source image. In this work, we propose a novel framework that enhances image inversion using consistency models, enabling high-quality editing in just four steps. Our method introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy and enables a controllable trade-off between editability and content preservation. We achieve state-of-the-art performance across various image editing tasks and datasets, demonstrating that our method matches or surpasses full-step diffusion models while being substantially more efficient. The code of our method is available on GitHub at https://github.com/ControlGenAI/Inverse-and-Edit.
中文摘要:本文提出了一种利用一致性模型增强图像反演的新框架,仅需四步即可实现高质量图像编辑,并在多种任务和数据集上达到最优性能。
English Summary: This paper introduces a novel framework that enhances image inversion using consistency models, enabling high-quality image editing in just four steps while achieving state-of-the-art performance across various tasks and datasets.

Authors:Georgii Bychkov, Khaled Abud, Egor Kovalev, Alexander Gushchin, Dmitriy Vatolin, Anastasia Antsiferova
Title: NIC-RobustBench: A Comprehensive Open-Source Toolkit for Neural Image Compression and Robustness Analysis
Abstract:
Adversarial robustness of neural networks is an increasingly important area of research, combining studies on computer vision models, large language models (LLMs), and others. With the release of JPEG AI -- the first standard for end-to-end neural image compression (NIC) methods -- the question of evaluating NIC robustness has become critically significant. However, previous research has been limited to a narrow range of codecs and attacks. To address this, we present \textbf{NIC-RobustBench}, the first open-source framework to evaluate NIC robustness and adversarial defenses' efficiency, in addition to comparing Rate-Distortion (RD) performance. The framework includes the largest number of codecs among all known NIC libraries and is easily scalable. The paper demonstrates a comprehensive overview of the NIC-RobustBench framework and employs it to analyze NIC robustness. Our code is available online at https://github.com/msu-video-group/NIC-RobustBench.
中文: NIC-RobustBench是首个开源框架,用于全面评估神经图像压缩方法的鲁棒性和对抗防御效率,通过整合多种编解码器和可扩展性,弥补了以往研究的不足。
English: NIC-RobustBench is the first open-source framework designed to comprehensively evaluate the robustness and adversarial defense efficiency of neural image compression methods, addressing limitations in prior research by incorporating a wide range of codecs and scalability.

Authors:Yang Liu, Chuanchen Luo, Zimo Tang, Yingyan Li, Yuran Yang, Yuanyong Ning, Lue Fan, Zhaoxiang Zhang, Junran Peng
Title: TC-Light: Temporally Coherent Generative Rendering for Realistic World Transfer
Abstract:
Illumination and texture editing are critical dimensions for world-to-world transfer, which is valuable for applications including sim2real and real2real visual data scaling up for embodied AI. Existing techniques generatively re-render the input video to realize the transfer, such as video relighting models and conditioned world generation models. Nevertheless, these models are predominantly limited to the domain of training data (e.g., portrait) or fall into the bottleneck of temporal consistency and computation efficiency, especially when the input video involves complex dynamics and long durations. In this paper, we propose TC-Light, a novel generative renderer to overcome these problems. Starting from the video preliminarily relighted by an inflated video relighting model, it optimizes appearance embedding in the first stage to align global illumination. Then it optimizes the proposed canonical video representation, i.e., Unique Video Tensor (UVT), to align fine-grained texture and lighting in the second stage. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method enables physically plausible re-rendering results with superior temporal coherence and low computation cost. The code and video demos are available at https://dekuliutesla.github.io/tclight/.
中文摘要:TC-Light是一种新型生成式渲染器,通过两阶段优化全局光照与细粒度纹理,实现了物理可信的视频重渲染效果,具有卓越的时间一致性和低计算成本优势。
English Summary: TC-Light is a novel generative renderer that achieves physically plausible video re-rendering with superior temporal coherence and low computation cost through two-stage optimization of global illumination and fine-grained texture.

Authors:Wenqiang Sun, Fangyun Wei, Jinjing Zhao, Xi Chen, Zilong Chen, Hongyang Zhang, Jun Zhang, Yan Lu
Title: From Virtual Games to Real-World Play
Abstract:
We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer: RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer: although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles. Project page: https://wenqsun.github.io/RealPlay/

Authors:Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao
Title: Light of Normals: Unified Feature Representation for Universal Photometric Stereo
Abstract:
Universal photometric stereo (PS) aims to recover high-quality surface normals from objects under arbitrary lighting conditions without relying on specific illumination models. Despite recent advances such as SDM-UniPS and Uni MS-PS, two fundamental challenges persist: 1) the deep coupling between varying illumination and surface normal features, where ambiguity in observed intensity makes it difficult to determine whether brightness variations stem from lighting changes or surface orientation; and 2) the preservation of high-frequency geometric details in complex surfaces, where intricate geometries create self-shadowing, inter-reflections, and subtle normal variations that conventional feature processing operations struggle to capture accurately.
中文:LINO UniPS方法通过光寄存器令牌和交错注意力块实现光照与法向信息的解耦,结合小波双分支架构和法向梯度感知损失保留高频细节,在PS-Verse数据集上采用课程训练后,在多个基准测试中取得了最优性能。
English: The proposed LINO UniPS method introduces Light Register Tokens with interleaved attention to decouple illumination from surface normals, alongside a wavelet-based architecture and gradient loss to preserve geometric details, achieving state-of-the-art performance on benchmarks through curriculum training on the new PS-Verse dataset.

Authors:Xinyu Zhang, Dong Gong, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Title: Let Your Video Listen to Your Music!
Abstract:
Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video's semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within 10 minutes with one epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone. Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness.
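A minimal, hypothetical sketch of MVAA's first step (aligning motion keyframes with audio beats): it uses frame-difference energy as a crude motion proxy and snaps each beat to the nearest detected motion keyframe. The function names, the motion proxy, and the assumption that beat times come from an external beat tracker are simplifications for illustration, not the authors' implementation; step 2 (rhythm-aware inpainting) is not shown.

```python
import numpy as np

def motion_keyframes(frames: np.ndarray, fps: float, window: int = 5) -> np.ndarray:
    """Return timestamps (s) of local maxima of frame-difference energy.

    frames: (T, H, W, C) video array; a crude proxy for 'motion keyframes'.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    peaks = []
    for t in range(window, len(diffs) - window):
        if diffs[t] == diffs[t - window:t + window + 1].max():
            peaks.append(t + 1)  # +1 because diffs[t] compares frame t and t+1
    return np.asarray(peaks) / fps

def align_keyframes_to_beats(keyframe_times: np.ndarray, beat_times: np.ndarray) -> list:
    """Greedy alignment: snap each audio beat to the nearest motion keyframe."""
    pairs = []
    if len(keyframe_times) == 0:
        return pairs
    for b in beat_times:
        k = keyframe_times[np.argmin(np.abs(keyframe_times - b))]
        pairs.append((float(b), float(k)))
    return pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.integers(0, 255, size=(120, 32, 32, 3), dtype=np.uint8)  # 5 s at 24 fps
    beats = np.arange(0.5, 5.0, 0.5)   # beat grid, assumed given by a beat tracker
    kf = motion_keyframes(video, fps=24.0)
    print(align_keyframes_to_beats(kf, beats)[:4])
```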

Authors:Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Title: OmniGen2: Exploration to Advanced Multimodal Generation
Abstract:
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
中文: OmniGen2 是一种开源生成模型,通过双解码路径和反射机制统一处理文本到图像、图像编辑及上下文生成任务,在保持文本生成能力的同时实现了领先性能。
English: OmniGen2 is an open-source generative model that unifies text-to-image, image editing, and in-context generation tasks through dual decoding pathways and a reflection mechanism, achieving state-of-the-art performance while preserving text generation capabilities.

Authors:Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi
Title: OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
Abstract:
Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.

Authors:Kuanning Wang, Yuqian Fu, Tianyu Wang, Yanwei Fu, Longfei Liang, Yu-Gang Jiang, Xiangyang Xue
Title: RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base
Abstract:
Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose comprises three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD-rendered images and attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose.
中文:RAG-6DPose通过融合多模态CAD知识库的视觉与几何线索,提升了机器人操作中的6D姿态估计精度,在遮挡和新视角场景下表现尤为鲁棒。
English: RAG-6DPose enhances 6D pose estimation for robotic manipulation by integrating a multi-modal CAD knowledge base with visual and geometric cues, improving accuracy in occluded and novel-view scenarios.

Authors:Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, Xinglong Wu
Title: Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset
Abstract:
Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce \textbf{Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset}, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.

Authors:Olivier Gamache, Jean-Michel Fortin, Matěj Boxan, François Pomerleau, Philippe Giguère
Title: Reproducible Evaluation of Camera Auto-Exposure Methods in the Field: Platform, Benchmark and Lessons Learned
Abstract:
Standard datasets often present limitations, particularly due to the fixed nature of input data sensors, which makes it difficult to compare methods that actively adjust sensor parameters to suit environmental conditions. This is the case with Automatic-Exposure (AE) methods, which rely on environmental factors to influence the image acquisition process. As a result, AE methods have traditionally been benchmarked in an online manner, rendering experiments non-reproducible. Building on our prior work, we propose a methodology that utilizes an emulator capable of generating images at any exposure time. This approach leverages BorealHDR, a unique multi-exposure stereo dataset, along with its new extension, in which data was acquired along a repeated trajectory at different times of the day to assess the impact of changing illumination. In total, BorealHDR covers 13.4 km over 59 trajectories in challenging lighting conditions. The dataset also includes lidar-inertial-odometry-based maps with pose estimation for each image frame, as well as Global Navigation Satellite System (GNSS) data for comparison. We demonstrate that by using images acquired at various exposure times, we can emulate realistic images with a Root-Mean-Square Error (RMSE) below 1.78% compared to ground truth images. Using this offline approach, we benchmarked eight AE methods, concluding that the classical AE method remains the field's best performer. To further support reproducibility, we provide in-depth details on the development of our backpack acquisition platform, including hardware, electrical components, and performance specifications. Additionally, we share valuable lessons learned from deploying the backpack over more than 25 km across various environments. Our code and dataset are available online at this link: https://github.com/norlab-ulaval/TFR24 BorealHDR
中文: 标准数据集难以比较自适应传感器方法,因此本研究利用BorealHDR数据集开发模拟器进行离线基准测试,发现传统自动曝光方法最优,并共享采集平台细节以支持重现性研究。
English: Standard datasets hinder reproducible comparisons of adaptive sensor methods like Automatic Exposure, so this study introduces an emulator using the BorealHDR dataset to benchmark AE methods offline, finding classical AE performs best and sharing detailed platform insights for reproducibility.
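A toy sketch of the exposure-emulation idea behind this kind of offline AE benchmarking: given a multi-exposure bracket with known exposure times and an (assumed) linear sensor response, estimate per-pixel radiance and re-expose at an arbitrary target time. The weighting scheme and function name are hypothetical; the BorealHDR emulator is considerably more careful than this.

```python
import numpy as np

def emulate_exposure(bracket: list, times: list, target_time: float,
                     bit_depth: int = 8) -> np.ndarray:
    """Emulate an image at `target_time` from a multi-exposure bracket.

    Assumes linear response: pixel ~= radiance * exposure_time (then clipped).
    Pixels near mid-gray get higher weight to avoid saturated/underexposed frames.
    """
    max_val = 2 ** bit_depth - 1
    stack = np.stack([img.astype(np.float64) for img in bracket])      # (N, H, W)
    t = np.asarray(times, dtype=np.float64).reshape(-1, 1, 1)          # (N, 1, 1)
    weights = 1.0 - np.abs(stack / max_val - 0.5) * 2.0                # mid-gray weighting
    weights = np.clip(weights, 1e-3, None)
    radiance = (weights * stack / t).sum(axis=0) / weights.sum(axis=0)
    return np.clip(radiance * target_time, 0, max_val).astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.uniform(0, 60, size=(48, 64))                  # latent radiance
    times = [1.0, 4.0, 16.0]                                   # example bracket (ms)
    bracket = [np.clip(scene * t, 0, 255).astype(np.uint8) for t in times]
    out = emulate_exposure(bracket, times, target_time=8.0)
    print(out.shape, out.dtype)
```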

Authors:Siao Tang, Xinyin Ma, Gongfan Fang, Xinchao Wang
Title: ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation
Abstract:
Recent advancements in large reasoning models (LRMs) such as DeepSeek-R1 and the OpenAI o1 series have achieved notable performance gains on complex reasoning tasks by scaling up the generation length via Chain-of-Thought (CoT). However, an emerging issue is their inclination to produce excessively verbose reasoning processes, leading to inefficiency. Existing literature on improving efficiency mainly adheres to before-reasoning paradigms, such as prompting before reasoning or fine-tuning before reasoning, but ignores the promising direction of directly encouraging the model to speak concisely by intervening during the generation of the reasoning. To fill this gap, we propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely by injecting a textual hint (manually designed or trained on concise data) during the token generation of the reasoning process. In addition, ConciseHint adapts to the complexity of the query by adjusting the hint intensity, which ensures it will not undermine model performance. Experiments on state-of-the-art LRMs, including the DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning processes while maintaining performance. For instance, we achieve a 65\% reduction in reasoning length on the GSM8K benchmark with Qwen-3 4B with nearly no accuracy loss.
中文:提出的ConciseHint框架通过在推理生成过程中注入自适应提示,有效解决大型推理模型冗长问题,能在保持性能的同时生成简洁输出,并可无缝兼容现有方法。
English: The proposed ConciseHint framework addresses the verbosity of large reasoning models by injecting adaptive hints during reasoning generation, effectively producing concise outputs while maintaining performance and compatibility with existing methods.
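A minimal sketch of the intervene-during-generation idea: a decoding loop that periodically injects hint tokens into the model's context, with the injection interval shrinking as the output grows as a crude stand-in for adaptive hint intensity. The `next_token_fn` callable, hint schedule, and token handling are hypothetical placeholders, not ConciseHint's actual mechanism.

```python
from typing import Callable, List

def generate_with_hint(prompt_ids: List[int],
                       next_token_fn: Callable[[List[int]], int],
                       hint_ids: List[int],
                       eos_id: int,
                       max_new_tokens: int = 256,
                       base_interval: int = 64) -> List[int]:
    """Decode step by step, periodically injecting a concise-reasoning hint.

    Injected hint tokens condition the model but are not kept in the output.
    The interval halves (down to 16) as the reasoning gets longer.
    """
    context = list(prompt_ids)
    output: List[int] = []
    since_last_hint = 0
    while len(output) < max_new_tokens:
        interval = max(16, base_interval - len(output) // 8)   # crude adaptive intensity
        if since_last_hint >= interval:
            context.extend(hint_ids)                           # e.g. tokens of "Be concise."
            since_last_hint = 0
        tok = next_token_fn(context)
        context.append(tok)
        output.append(tok)
        since_last_hint += 1
        if tok == eos_id:
            break
    return output

if __name__ == "__main__":
    # Dummy "model": emits ids 5..9 depending on context length; 9 acts as EOS.
    dummy = lambda ids: 5 + (len(ids) % 5)
    print(generate_with_hint([1, 2, 3], dummy, hint_ids=[100, 101], eos_id=9))
```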

Authors:Suyash Gaurav, Muhammad Farhan Humayun, Jukka Heikkonen, Jatin Chaudhary
Title: Focus Your Attention: Towards Data-Intuitive Lightweight Vision Transformers
Abstract:
The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, there remain significant challenges including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations coupled with energy inefficiencies mainly arise due to the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich, patch embeddings to effectively reduce the architectural complexity and improve efficiency. Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism allowing cross-attention operations to significantly reduce the time and space complexity of the attention module. By leveraging the data-intuitive patch embeddings coupled with dynamic positional encodings, our approach adaptively modulates the cross-attention process to focus on informative regions while maintaining the global semantic structure. This targeted attention improves training efficiency and accelerates convergence. Notably, the SPPP module is lightweight and can be easily integrated into existing transformer architectures. Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with the state-of-the-art approaches, highlighting its potential for energy-efficient transformers suitable for edge deployment. (The code is available on our GitHub repository: https://github.com/zser092/Focused-Attention-ViT).
中文摘要:本文提出基于超像素的补丁池化技术和轻量潜在注意力模块,有效降低视觉Transformer的复杂度,在保持与先进方法相当性能的同时显著提升计算效率。
English Summary: This paper introduces a Super-Pixel Based Patch Pooling technique and Light Latent Attention module to reduce Vision Transformer complexity, improving computational efficiency while maintaining competitive performance with state-of-the-art methods.
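As an illustration of the patch-pooling step only (not the full SPPP/LLA pipeline), the sketch below mean-pools dense per-pixel embeddings within each superpixel to form a small set of context-aware tokens. The label map is assumed to come from a superpixel algorithm such as SLIC; here a trivial grid stands in, and all names are illustrative.

```python
import numpy as np

def superpixel_patch_pool(features: np.ndarray, segments: np.ndarray) -> np.ndarray:
    """Mean-pool dense features inside each superpixel.

    features: (H, W, C) per-pixel embeddings.
    segments: (H, W) integer superpixel labels (e.g. from SLIC).
    Returns:  (num_superpixels, C) embeddings usable as transformer tokens.
    """
    labels = segments.reshape(-1)
    feats = features.reshape(-1, features.shape[-1]).astype(np.float64)
    num_sp = int(labels.max()) + 1
    sums = np.zeros((num_sp, feats.shape[1]))
    counts = np.zeros(num_sp)
    np.add.at(sums, labels, feats)
    np.add.at(counts, labels, 1.0)
    return sums / np.maximum(counts, 1.0)[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 64, 32))
    # An 8x8 grid stands in for SLIC superpixels in this toy example.
    seg = (np.arange(64)[:, None] // 8) * 8 + (np.arange(64)[None, :] // 8)
    tokens = superpixel_patch_pool(feats, seg)
    print(tokens.shape)   # (64, 32): far fewer tokens than 64*64 pixels
```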

Authors:Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou
Title: Matrix-Game: Interactive World Foundation Model
Abstract:
We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.
中文:Matrix-Game 是一种可控游戏世界生成模型,采用两阶段训练流程和超过170亿参数,能生成高质量、交互式视频并实现精确动作控制,在所有评估指标上均优于先前模型。
English: Matrix-Game is a controllable game world generation model that uses a two-stage training pipeline and over 17 billion parameters to produce high-quality, interactive videos with precise action control, outperforming previous models across all evaluation metrics.

Authors:Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen, Guohao Peng, Zhenyu Wu, Jingchuan Wang, Lihua Xie, Danwei Wang, Hesheng Wang, Weidong Chen
Title: MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation
Abstract:
Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios, and face difficulties in large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world Dense SLAM (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance research in SLAM, 3D reconstruction, and visual foundation models. Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset and code will be open-sourced at https://github.com/dtc111111/mcnslam.
中文摘要:本文提出了首个采用混合场景表示的分布式多智能体协同神经SLAM框架,并创建了新型真实世界数据集,在建模、跟踪和通信效率方面均展现出优越性能。
English Summary: This paper introduces the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation and a novel real-world dataset, demonstrating superior performance in mapping, tracking, and communication efficiency.

Authors:Ling Zhang, Boxiang Yun, Qingli Li, Yan Wang
Title: Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation
Abstract:
Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided \textbf{Bi}-modal Concurrent Learning Framework for Pathology Report \textbf{Gen}eration (BiGen) emulating pathologists' diagnostic reasoning, consisting of: (1) A knowledge retrieval mechanism to provide rich semantic content, which retrieves WSI-relevant knowledge from a pre-built medical knowledge bank by matching high-attention patches, and (2) A bi-modal concurrent learning strategy instantiated via a learnable visual token and a learnable textual token to dynamically extract key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder integrates both modalities for comprehensive diagnostic report generation. Experiments on the PathText (BRCA) dataset demonstrate our framework's superiority, achieving state-of-the-art performance with 7.4\% relative improvement in NLP metrics and 19.1\% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method's ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/BiGen.
Chinese: BiGen框架通过知识检索机制和双模态并行学习,解决了全切片图像自动生成病理报告中语义内容不足和信息冗余的问题,在自然语言处理和分类指标上均实现了显著提升,达到最优性能。
English: The BiGen framework addresses semantic deficiency and information redundancy in automated pathology report generation by incorporating a knowledge retrieval mechanism and bi-modal concurrent learning, achieving state-of-the-art performance with significant improvements in NLP and classification metrics.

Authors:Wenxu Qian, Chaoyue Wang, Hou Peng, Zhiyu Tan, Hao Li, Anxiang Zeng
Title: RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
Abstract:
Video generation techniques have achieved remarkable advancements in visual quality, yet faithfully reproducing real-world physics remains elusive. Preference-based model post-training may improve physical consistency, but requires costly human-annotated datasets or reward models that are not yet feasible. To address these challenges, we present Real Data Preference Optimisation (RDPO), an annotation-free framework that distills physical priors directly from real-world videos. Specifically, the proposed RDPO reverse-samples real video sequences with a pre-trained generator to automatically build preference pairs that are statistically distinguishable in terms of physical correctness. A multi-stage iterative training schedule then guides the generator to obey physical laws increasingly well. Benefiting from the dynamic information explored from real videos, our proposed RDPO significantly improves the action coherence and physical realism of the generated videos. Evaluations on multiple benchmarks and human evaluations have demonstrated that RDPO achieves improvements across multiple dimensions. The source code and demonstration of this paper are available at: https://wwenxu.github.io/RDPO/
Chinese: 本文提出真实数据偏好优化(RDPO),一种无需标注的框架,通过从真实视频中反向采样提取物理先验并进行多阶段迭代训练,显著提升了生成视频的动作连贯性和物理真实性,无需依赖昂贵的人工标注。
English: The paper introduces Real Data Preference Optimisation (RDPO), an annotation-free framework that enhances video generation by extracting physical priors from real videos through reverse-sampling and iterative training, significantly improving physical realism and coherence without costly human input.
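To make the preference-optimization step concrete, here is a minimal DPO-style loss in PyTorch where the real clip is preferred over its reverse-sampled re-generation, with sequence log-likelihoods assumed to be given. This is a generic stand-in: RDPO's multi-stage schedule and exact objective may differ from this sketch.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(logp_real: torch.Tensor, logp_gen: torch.Tensor,
                   ref_logp_real: torch.Tensor, ref_logp_gen: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Preference loss: real video (preferred) vs. its re-generation (dispreferred).

    Inputs are per-sample log-likelihoods under the trained policy (logp_*)
    and a frozen reference model (ref_logp_*).
    """
    margin = (logp_real - ref_logp_real) - (logp_gen - ref_logp_gen)
    return -F.logsigmoid(beta * margin).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    b = 4
    loss = dpo_style_loss(torch.randn(b), torch.randn(b),
                          torch.randn(b), torch.randn(b))
    print(float(loss))
```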

Authors:Mauricio Byrd Victorica, György Dán, Henrik Sandberg
Title: SpaNN: Detecting Multiple Adversarial Patches on CNNs by Spanning Saliency Thresholds
Abstract:
State-of-the-art convolutional neural network models for object detection and image classification are vulnerable to physically realizable adversarial perturbations, such as patch attacks. Existing defenses have focused, implicitly or explicitly, on single-patch attacks, leaving their sensitivity to the number of patches as an open question or rendering them computationally infeasible or inefficient against attacks consisting of multiple patches in the worst cases. In this work, we propose SpaNN, an attack detector whose computational complexity is independent of the expected number of adversarial patches. The key novelty of the proposed detector is that it builds an ensemble of binarized feature maps by applying a set of saliency thresholds to the neural activations of the first convolutional layer of the victim model. It then performs clustering on the ensemble and uses the cluster features as the input to a classifier for attack detection. Contrary to existing detectors, SpaNN does not rely on a fixed saliency threshold for identifying adversarial regions, which makes it robust against white box adversarial attacks. We evaluate SpaNN on four widely used data sets for object detection and classification, and our results show that SpaNN outperforms state-of-the-art defenses by up to 11 and 27 percentage points in the case of object detection and the case of image classification, respectively. Our code is available at https://github.com/gerkbyrd/SpaNN.
中文: 提出的SpaNN检测器通过使用二值化特征图集合和聚类技术,有效识别对抗性补丁攻击,其计算效率与补丁数量无关,性能显著优于现有防御方法。
English: The proposed SpaNN detector effectively identifies adversarial patch attacks by using an ensemble of binarized feature maps and clustering, outperforming existing defenses with computational efficiency independent of patch count.
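The sketch below illustrates the threshold-spanning idea on its own: binarize a first-layer saliency map at a range of thresholds and summarize each binarization with simple connected-component statistics that a downstream classifier could consume. The feature choices are illustrative; SpaNN's actual ensemble clustering and classifier are richer than this.

```python
import numpy as np
from scipy import ndimage

def spanned_threshold_features(activations: np.ndarray,
                               thresholds=np.linspace(0.1, 0.9, 9)) -> np.ndarray:
    """Summarise first-layer activations across a span of saliency thresholds.

    activations: (C, H, W) feature maps from the victim model's first conv layer.
    Returns a flat vector of [num_blobs, largest_blob_fraction, salient_fraction]
    per threshold, usable as input to a small attack classifier.
    """
    saliency = activations.mean(axis=0)                               # (H, W)
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    feats = []
    for th in thresholds:
        binary = saliency >= th
        labeled, num_blobs = ndimage.label(binary)                    # connected components
        largest = 0 if num_blobs == 0 else np.bincount(labeled.ravel())[1:].max()
        feats.append([num_blobs, largest / binary.size, binary.mean()])
    return np.asarray(feats).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    act = rng.random((64, 56, 56))
    print(spanned_threshold_features(act).shape)   # (27,) -> fed to a trained classifier
```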

Authors:Kaifeng Sheng, Zheng Zhou, Yingliang Peng, Qianwei Wang
Title: 2D Triangle Splatting for Direct Differentiable Mesh Training
Abstract:
Differentiable rendering with 3D Gaussian primitives has emerged as a powerful method for reconstructing high-fidelity 3D scenes from multi-view images. While it offers improvements over NeRF-based methods, this representation still encounters challenges with rendering speed and advanced rendering effects, such as relighting and shadow rendering, compared to mesh-based models. In this paper, we propose 2D Triangle Splatting (2DTS), a novel method that replaces 3D Gaussian primitives with 2D triangle facelets. This representation naturally forms a discrete mesh-like structure while retaining the benefits of continuous volumetric modeling. By incorporating a compactness parameter into the triangle primitives, we enable direct training of photorealistic meshes. Our experimental results demonstrate that our triangle-based method, in its vanilla version (without compactness tuning), achieves higher fidelity compared to state-of-the-art Gaussian-based methods. Furthermore, our approach produces reconstructed meshes with superior visual quality compared to existing mesh reconstruction methods. Please visit our project page at https://gaoderender.github.io/triangle-splatting.

Authors:Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim
Title: Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Abstract:
Audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, significantly outperforming existing methods in both single-source and multi-source localization scenarios. Code and generated detailed contextual information are available at: https://github.com/VisualAIKHU/OA-SSL.
中文: 本研究提出了一种新颖的视听声源定位框架,利用多模态大语言模型生成区分发声物体与静默物体的详细上下文信息,并通过两种新型损失函数增强定位效果,在基准数据集上显著优于现有方法。
English: This study introduces a novel audio-visual sound source localization framework that utilizes Multimodal Large Language Models to generate detailed contextual information distinguishing sound-making objects from silent ones, enhanced by two new loss functions which significantly outperform existing methods on benchmark datasets.

Authors:Muhao Xu, Xueying Zhou, Xizhan Gao, Weiye Song, Guang Feng, Sijie Niu
Title: Normality Prior Guided Multi-Semantic Fusion Network for Unsupervised Image Anomaly Detection
Abstract:
Recently, detecting logical anomalies is becoming a more challenging task compared to detecting structural ones. Existing encoder decoder based methods typically compress inputs into low-dimensional bottlenecks on the assumption that the compression process can effectively suppress the transmission of logical anomalies to the decoder. However, logical anomalies present a particular difficulty because, while their local features often resemble normal semantics, their global semantics deviate significantly from normal patterns. Thanks to the generalisation capabilities inherent in neural networks, these abnormal semantic features can propagate through low-dimensional bottlenecks. This ultimately allows the decoder to reconstruct anomalous images with misleading fidelity. To tackle the above challenge, we propose a novel normality prior guided multi-semantic fusion network for unsupervised anomaly detection. Instead of feeding the compressed bottlenecks to the decoder directly, we introduce the multi-semantic features of normal samples into the reconstruction process. To this end, we first extract abstract global semantics of normal cases by a pre-trained vision-language network, then the learnable semantic codebooks are constructed to store representative feature vectors of normal samples by vector quantisation. Finally, the above multi-semantic features are fused and employed as input to the decoder to guide the reconstruction of anomalies to approximate normality. Extensive experiments are conducted to validate the effectiveness of our proposed method, and it achieves the SOTA performance on the MVTec LOCO AD dataset with improvements of 5.7% in pixel-sPRO and 2.6% in image-AUROC. The source code is available at https://github.com/Xmh-L/NPGMF.
Chinese: 本文提出了一种新颖的无监督异常检测方法,通过将正常样本的多语义特征融入重建过程来解决逻辑异常检测难题,在MVTec LOCO AD数据集上实现了最先进的性能表现。
English: This paper introduces a novel unsupervised anomaly detection method that addresses the challenge of logical anomalies by integrating multi-semantic features from normal samples into the reconstruction process, achieving state-of-the-art performance on the MVTec LOCO AD dataset.

Authors:JiaKui Hu, Yuxiao Yang, Jialun Liu, Jinbo Wu, Chen Zhao, Yanye Lu
Title: Auto-Regressively Generating Multi-View Consistent Images
Abstract:
Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (\textbf{MV-AR}) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architectural design and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the ``Shuffle View'' data augmentation technique, thus significantly expanding the training data by several orders of magnitude. Experiments demonstrate the performance and versatility of our MV-AR, which consistently generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. The code and models are released at https://github.com/MILab-PKU/MVAR.
中文: 本文提出的多视图自回归(MV-AR)方法通过自回归模型、条件注入模块和数据增强技术,能够根据多种输入提示逐步生成一致的多视图图像,其性能与领先的基于扩散的模型相当。
English: This paper introduces the Multi-View Auto-Regressive (MV-AR) method, which progressively generates consistent multi-view images from various prompts using an auto-regressive model, condition injection modules, and data augmentation, achieving performance comparable to leading diffusion-based models.
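The "Shuffle View" idea can be sketched as a tiny augmentation that permutes the view axis of a multi-view sample while keeping images and camera poses consistent, so the auto-regressive model sees many orderings of the same capture. Shapes and the function name are assumptions for illustration; the paper's augmentation may involve more than a plain permutation.

```python
from typing import Optional, Tuple
import torch

def shuffle_view_augment(views: torch.Tensor, poses: torch.Tensor,
                         generator: Optional[torch.Generator] = None
                         ) -> Tuple[torch.Tensor, torch.Tensor]:
    """Randomly permute the view axis of a multi-view sample.

    views: (V, C, H, W) images; poses: (V, 4, 4) camera-to-world matrices.
    Views and poses are permuted together so each ordering remains a valid
    sequence for next-view (auto-regressive) prediction.
    """
    perm = torch.randperm(views.shape[0], generator=generator)
    return views[perm], poses[perm]

if __name__ == "__main__":
    g = torch.Generator().manual_seed(0)
    v, p = torch.randn(6, 3, 64, 64), torch.eye(4).repeat(6, 1, 1)
    sv, sp = shuffle_view_augment(v, p, g)
    print(sv.shape, sp.shape)
```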

Authors:Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu
Title: MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Abstract:
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
中文:MedTVT-R1是一种新型多模态大语言模型,通过整合临床数据和采用强化微调技术,实现了精准的多疾病诊断与可解释推理,在医疗应用中展现出卓越性能。
English: MedTVT-R1 is a novel multimodal large language model that integrates clinical data for accurate multi-disease diagnosis and interpretable reasoning, demonstrating superior performance through advanced modality fusion and reinforcement fine-tuning.
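The Jaccard reward mentioned in the abstract can be illustrated as a simple set overlap between predicted and reference disease labels; the paper's reward used inside GRPO fine-tuning may include additional terms, so treat this as a sketch.

```python
def jaccard_reward(predicted: set, reference: set) -> float:
    """Jaccard index between predicted and reference disease label sets.

    Returns a value in [0, 1]; defined as 1.0 when both sets are empty.
    """
    if not predicted and not reference:
        return 1.0
    return len(predicted & reference) / len(predicted | reference)

if __name__ == "__main__":
    pred = {"pneumonia", "pleural effusion"}
    gold = {"pneumonia", "cardiomegaly"}
    print(jaccard_reward(pred, gold))   # 1 / 3 ~= 0.333
```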

Authors:Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome
Title: DIP: Unsupervised Dense In-Context Post-training of Visual Representations
Abstract:
We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP
中文: DIP是一种无监督的后训练方法,通过结合预训练扩散模型生成伪任务来增强视觉编码器中的密集图像表示,在各种上下文场景理解任务中高效地实现了卓越性能。
English: DIP is an unsupervised post-training method that enhances dense image representations in vision encoders by using pseudo-tasks generated with a diffusion model, achieving superior performance in various in-context scene understanding tasks efficiently.

Authors:Yang Lyu, Zhenghao Zou, Yanfeng Li, Chunhui Zhao, Quan Pan
Title: Radar and Event Camera Fusion for Agile Robot Ego-Motion Estimation
Abstract:
Achieving reliable ego-motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond promptly and clearly to highly dynamic robot motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework to achieve aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors: an event camera and a millimeter-wave radar. First, we use instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and is more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model to fuse the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lag smoother fashion. Finally, we extensively validate our velometer framework on self-collected experimental datasets. The results indicate that our IMU-free and association-free ego-motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video, and dataset are available at https://github.com/ZzhYgwh/TwistEstimator.
中文摘要:本文提出了一种无需IMU和特征关联的框架,通过结合事件相机与毫米波雷达,在高度动态场景中实现鲁棒的自身运动速度估计,实验验证表明该方法在挑战性环境中具有可靠性能。
English Summary: This paper introduces an IMU-free and feature-association-free framework that combines event cameras and millimeter wave radar to achieve robust ego-motion velocity estimation in highly dynamic scenarios, demonstrating reliable performance in challenging environments through experimental validation.

Authors:Yiyao Wang, Bo Pan, Ke Wang, Han Liu, Jinyuan Mao, Yuxin Liu, Minfeng Zhu, Xiuqi Huang, Weifeng Chen, Bo Zhang, Wei Chen
Title: IntuiTF: MLLM-Guided Transfer Function Optimization for Direct Volume Rendering
Abstract:
Direct volume rendering (DVR) is a fundamental technique for visualizing volumetric data, where transfer functions (TFs) play a crucial role in extracting meaningful structures. However, designing effective TFs remains unintuitive due to the semantic gap between user intent and TF parameter space. Although numerous TF optimization methods have been proposed to mitigate this issue, existing approaches still face two major challenges: the vast exploration space and limited generalizability. To address these issues, we propose IntuiTF, a novel framework that leverages Multimodal Large Language Models (MLLMs) to guide TF optimization in alignment with user intent. Specifically, our method consists of two key components: (1) an evolution-driven explorer for effective exploration of the TF space, and (2) an MLLM-guided human-aligned evaluator that provides generalizable visual feedback on rendering quality. The explorer and the evaluator together establish an efficient Trial-Insight-Replanning paradigm for TF space exploration. We further extend our framework with an interactive TF design system. We demonstrate the broad applicability of our framework through three case studies and validate the effectiveness of each component through extensive experiments. We strongly recommend readers check our cases, demo video, and source code at: https://github.com/wyysteelhead/IntuiTF
中文摘要:IntuiTF是一种创新框架,利用多模态大语言模型通过进化驱动探索器和人类对齐评估器来指导传递函数优化,建立了高效的“试验-洞察-重规划”范式,有效解决了直接体绘制中的探索空间广阔和泛化性有限两大挑战。
English Summary: IntuiTF is a novel framework that uses Multimodal Large Language Models to guide transfer function optimization through an evolution-driven explorer and human-aligned evaluator, addressing challenges in direct volume rendering by creating an efficient Trial-Insight-Replanning paradigm.

Authors:Haoneng Lin, Cheng Xu, Jing Qin
Title: Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review
Abstract:
Modern Vision-Language Models (VLMs) exhibit unprecedented capabilities in cross-modal semantic understanding between visual and textual modalities. Given the intrinsic need for multi-modal integration in clinical applications, VLMs have emerged as a promising solution for a wide range of medical image analysis tasks. However, adapting general-purpose VLMs to the medical domain poses numerous challenges, such as large domain gaps, complicated pathological variations, and the diversity and uniqueness of different tasks. The central purpose of this review is to systematically summarize recent advances in adapting VLMs for medical image analysis, analyzing current challenges, and recommending promising yet urgent directions for further investigations. We begin by introducing core learning strategies for medical VLMs, including pretraining, fine-tuning, and prompt learning. We then categorize five major VLM adaptation strategies for medical image analysis. These strategies are further analyzed across eleven medical imaging tasks to illustrate their current practical implementations. Furthermore, we analyze key challenges that impede the effective adaptation of VLMs to clinical applications and discuss potential directions for future research. We also provide an open-access repository of related literature to facilitate further research, available at https://github.com/haonenglin/Awesome-VLM-for-MIA. It is anticipated that this article can help researchers interested in harnessing VLMs for medical image analysis gain a better understanding of their capabilities and limitations, as well as current technical barriers, and promote their innovative, robust, and safe application in clinical practice.
中文: 现代视觉语言模型在医学图像分析中潜力巨大,但面临领域差异和任务多样性等挑战,本综述旨在总结其适应策略、分析当前障碍并建议未来研究方向,以促进其在临床中的创新和安全应用。
English: Modern Vision-Language Models show great potential for medical image analysis but face challenges like domain gaps and task diversity, prompting this review to summarize adaptation strategies, analyze obstacles, and suggest future research directions to enhance their clinical application.

Authors:Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Title: OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
Abstract:
We introduce OpenEvents V1, a large-scale benchmark dataset designed to advance event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that focus on surface-level descriptions, the OpenEvents V1 dataset emphasizes contextual and temporal grounding through three primary tasks: (1) generating rich, event-aware image captions, (2) retrieving event-relevant news articles from image queries, and (3) retrieving event-relevant images from narrative-style textual queries. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for all tasks. OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep reasoning over complex real-world events. The dataset is publicly available at https://ltnghia.github.io/eventa/openevents-v1.
中文: OpenEvents V1是一个大规模事件导向的视觉语言数据集,包含20万篇新闻文章和40万张图片,旨在通过事件感知描述和跨模态检索等任务推动多模态人工智能系统的发展。
English: OpenEvents V1 is a large-scale event-centric vision-language dataset featuring over 200,000 news articles and 400,000 images, designed for tasks like event-aware captioning and cross-modal retrieval to advance multimodal AI systems.

Authors:Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, Sungroh Yoon
Title: RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
Abstract:
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC
中文: 本文提出了一种基于强化学习的后训练框架,显著提升了多模态大语言模型生成个性化图像描述的能力,尤其在现有方法表现不佳的复杂多概念场景中表现突出。
English: This paper introduces a reinforcement learning-based post-training framework that significantly improves multi-modal large language models' ability to generate personalized and accurate image captions, particularly excelling in complex multi-concept scenarios where existing methods falter.

Authors:Tongshun Zhang, Pingping Liu, Mengen Cai, Zijian Zhang, Yubing Lu, Qiuzhan Zhou
Title: BSMamba: Brightness and Semantic Modeling for Long-Range Interaction in Low-Light Image Enhancement
Abstract:
Current low-light image enhancement (LLIE) methods face significant limitations in simultaneously improving brightness while preserving semantic consistency, fine details, and computational efficiency. With the emergence of state-space models, particularly Mamba, image restoration has achieved remarkable performance, yet existing visual Mamba approaches flatten 2D images into 1D token sequences using fixed scanning rules, critically limiting interactions between distant tokens with causal relationships and constraining their ability to capture meaningful long-range dependencies. To address these fundamental limitations, we propose BSMamba, a novel visual Mamba architecture comprising two specially designed components: Brightness Mamba and Semantic Mamba. The Brightness Mamba revolutionizes token interaction patterns by prioritizing connections between distant tokens with similar brightness levels, effectively addressing the challenge of brightness restoration in LLIE tasks through brightness-guided selective attention. Complementing this, the Semantic Mamba establishes priority interactions between tokens sharing similar semantic meanings, allowing the model to maintain contextual consistency by connecting semantically related regions across the image, thus preserving the hierarchical nature of image semantics during enhancement. By intelligently modeling tokens based on brightness and semantic similarity rather than arbitrary scanning patterns, BSMamba transcends the constraints of conventional token sequencing while adhering to the principles of causal modeling. Extensive experiments demonstrate that BSMamba achieves state-of-the-art performance in LLIE while preserving semantic consistency. Code is available at https://github.com/bywlzts/BSMamba.
中文: BSMamba提出了一种新颖的视觉Mamba架构,包含亮度Mamba和语义Mamba组件,通过基于亮度和语义相似性的令牌交互优先机制,在增强低光图像的同时保持语义一致性,性能优于现有方法。
English: BSMamba introduces a novel visual Mamba architecture with Brightness Mamba and Semantic Mamba components, which prioritize interactions between tokens based on brightness and semantic similarity to enhance low-light images while maintaining semantic consistency and outperforming existing methods.
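One way to picture brightness-guided token sequencing is to reorder flattened image tokens by brightness before a 1D sequence model, so similarly lit but spatially distant regions become neighbours in the scan, then restore the original order afterwards. This is only an illustration: the actual Brightness Mamba uses selective state-space scans rather than a plain sort, and the GRU here merely stands in for a Mamba block.

```python
from typing import Callable
import torch
import torch.nn as nn

def brightness_ordered_scan(tokens: torch.Tensor, brightness: torch.Tensor,
                            seq_model: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    """Run a sequence model over tokens re-ordered by brightness.

    tokens:     (B, N, D) flattened image tokens.
    brightness: (B, N) per-token luminance.
    """
    order = brightness.argsort(dim=1)                                 # sorted order
    inverse = order.argsort(dim=1)                                    # inverse permutation
    gathered = torch.gather(tokens, 1, order.unsqueeze(-1).expand_as(tokens))
    out = seq_model(gathered)                                         # any (B, N, D) -> (B, N, D) model
    return torch.gather(out, 1, inverse.unsqueeze(-1).expand_as(out))

if __name__ == "__main__":
    torch.manual_seed(0)
    B, N, D = 2, 16, 8
    tok = torch.randn(B, N, D)
    bright = tok.mean(dim=-1)                    # toy luminance proxy
    gru = nn.GRU(D, D, batch_first=True)         # stand-in for a Mamba block
    out = brightness_ordered_scan(tok, bright, lambda x: gru(x)[0])
    print(out.shape)
```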

Authors:Saad Wazir, Daeyoung Kim
Title: Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention
Abstract:
Segmenting biomarkers in medical images is crucial for various biotech applications. Despite advances, Transformer and CNN based methods often struggle with variations in staining and morphology, limiting feature extraction. In medical image segmentation, where datasets often have limited sample availability, recent state-of-the-art (SOTA) methods achieve higher accuracy by leveraging pre-trained encoders, whereas end-to-end methods tend to underperform. This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. Our method, compatible with various encoders, outperforms SOTA methods, as demonstrated by experiments on four datasets and ablation studies. Specifically, our method achieves absolute performance gains of 2.76% on MoNuSeg, 3.12% on DSB, 2.87% on Electron Microscopy, and 4.03% on TNBC datasets compared to existing SOTA methods. Code: https://github.com/saadwazir/MCADS-Decoder
中文: 我们提出的架构采用新型解码器,能有效捕捉多尺度上下文信息并整合编码器特征,从而在四个数据集上显著提升了医学图像分割的准确性,超越了现有最优方法。
English: Our proposed architecture with a novel decoder effectively captures multi-scale contextual information and integrates encoder features to enhance medical image segmentation, outperforming state-of-the-art methods across four datasets.
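A generic sketch of two decoder ingredients named in the title: a residual linear-attention block (softmax-free, Katharopoulos-style feature maps) and a depth-to-space upsampling stage built on PixelShuffle. This is not the authors' MCADS decoder, just a plausible minimal version of the two ideas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualLinearAttention(nn.Module):
    """Softmax-free (linear) attention with a residual connection over (B, N, D) tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                  # positive feature maps
        kv = torch.einsum("bnd,bne->bde", k, v)            # (B, D, D) global summary
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        return x + self.proj(out)                          # residual

class DepthToSpaceUp(nn.Module):
    """Upsample by r: conv expands channels by r*r, then PixelShuffle rearranges them."""
    def __init__(self, in_ch: int, out_ch: int, r: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))

if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    attn, up = ResidualLinearAttention(64), DepthToSpaceUp(64, 32, r=2)
    tokens = attn(x.flatten(2).transpose(1, 2))            # (1, 1024, 64)
    feat = tokens.transpose(1, 2).reshape(1, 64, 32, 32)
    print(up(feat).shape)                                  # torch.Size([1, 32, 64, 64])
```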

Authors:Chao Li, Jiawei Fan, Anbang Yao
Title: Morse: Dual-Sampling for Lossless Acceleration of Diffusion Models
Abstract:
In this paper, we present Morse, a simple dual-sampling framework for accelerating diffusion models losslessly. The key insight of Morse is to reformulate the iterative generation (from noise to data) process via taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pre-trained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model, which is learnt to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse exhibits the merit of flexibly attaining desired image generation performance while improving overall runtime efficiency. With our proposed weight sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. Our method shows a lossless speedup of 1.78X to 3.31X on average over a wide range of sampling step budgets relative to 9 baseline diffusion models on 6 image generation tasks. Furthermore, we show that our method can be also generalized to improve the Latent Consistency Model (LCM-SDXL, which is already accelerated with consistency distillation technique) tailored for few-step text-to-image synthesis. The code and models are available at https://github.com/deep-optimization/Morse.
中文: Morse是一种无损加速扩散模型的双采样框架,通过跳跃采样与自适应残差反馈相结合,在保持图像质量的同时显著提升了生成效率。
English: Morse is a dual-sampling framework that accelerates diffusion models losslessly by combining fast jump sampling with adaptive residual feedback, achieving significant speedups while maintaining image quality.
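A toy sketch of the time-interleaved Dash/Dot interaction: the expensive (Dash) model is queried only at jump points, and a cheap (Dot) callable supplies residual corrections for the skipped steps. The callables, update rule, and schedule here are simplified placeholders rather than the exact Morse formulation.

```python
from typing import Callable, List
import torch

def morse_style_sampling(x_T: torch.Tensor,
                         dash_eps: Callable[[torch.Tensor, int], torch.Tensor],
                         dot_residual: Callable[[torch.Tensor, torch.Tensor, int], torch.Tensor],
                         step_fn: Callable[[torch.Tensor, torch.Tensor, int], torch.Tensor],
                         timesteps: List[int],
                         jump: int = 4) -> torch.Tensor:
    """Reverse process that calls the big model only every `jump` steps.

    dash_eps(x, t):          noise estimate from the pre-trained (Dash) model.
    dot_residual(x, eps, t): cheap residual correction conditioned on the last Dash output.
    step_fn(x, eps, t):      one solver update x_t -> x_{t-1} given a noise estimate.
    """
    x, eps = x_T, None
    for i, t in enumerate(timesteps):
        if i % jump == 0 or eps is None:
            eps = dash_eps(x, t)                  # expensive call at jump points only
        else:
            eps = eps + dot_residual(x, eps, t)   # fast correction between jump points
        x = step_fn(x, eps, t)
    return x

if __name__ == "__main__":
    torch.manual_seed(0)
    dash = lambda x, t: 0.1 * x
    dot = lambda x, eps, t: 0.01 * (x - eps)
    step = lambda x, eps, t: x - 0.05 * eps       # toy Euler-like update
    out = morse_style_sampling(torch.randn(1, 3, 8, 8), dash, dot, step,
                               timesteps=list(range(50, 0, -1)))
    print(out.shape)
```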

Authors:Ankit Sanjyal
Title: Limitations of NERF with pre-trained Vision Features for Few-Shot 3D Reconstruction
Abstract:
Neural Radiance Fields (NeRF) have revolutionized 3D scene reconstruction from sparse image collections. Recent work has explored integrating pre-trained vision features, particularly from DINO, to enhance few-shot reconstruction capabilities. However, the effectiveness of such approaches remains unclear, especially in extreme few-shot scenarios. In this paper, we present a systematic evaluation of DINO-enhanced NeRF models, comparing baseline NeRF, frozen DINO features, LoRA fine-tuned features, and multi-scale feature fusion. Surprisingly, our experiments reveal that all DINO variants perform worse than the baseline NeRF, achieving PSNR values around 12.9 to 13.0 compared to the baseline's 14.71. This counterintuitive result suggests that pre-trained vision features may not be beneficial for few-shot 3D reconstruction and may even introduce harmful biases. We analyze potential causes including feature-task mismatch, overfitting to limited data, and integration challenges. Our findings challenge common assumptions in the field and suggest that simpler architectures focusing on geometric consistency may be more effective for few-shot scenarios.
中文: DINO增强的NeRF模型在少样本三维重建中意外表现不如基线NeRF,表明预训练视觉特征可能带来有害偏差而非提升效果。
English: DINO-enhanced NeRF models surprisingly underperform baseline NeRF in few-shot 3D reconstruction, suggesting pre-trained vision features may introduce harmful biases rather than improvements.

Authors:Youjie Zhou, Guofeng Mei, Yiming Wang, Yi Wan, Fabio Poiesi
Title: Multimodal Fusion SLAM with Fourier Attention
Abstract:
Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes fast Fourier transform (FFT) to enhance algorithmic efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it with a security robot and fusing it with a global positioning module (GNSS-RTK) and global Bundle Adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions. Our code and datasets are available at https://github.com/youjie-zhou/FMF-SLAM.git.
中文: FMF-SLAM是一种高效的多模态融合SLAM方法,利用快速傅里叶变换和新型注意力机制,在噪声、变化光照和黑暗条件下提升性能,实现了最先进的结果并具备实时可行性。
English: FMF-SLAM is an efficient multimodal fusion SLAM method that uses fast Fourier transform and novel attention mechanisms to enhance performance in noisy, varying lighting, and dark conditions, achieving state-of-the-art results with real-time feasibility.
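
Fourier-based attention replaces explicit pairwise attention with spectral mixing, which is where the efficiency gain comes from. The snippet below is a rough FNet-style analogue given only as an illustration; it is not the authors' module, and the layer sizes are invented.

```python
import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """FNet-style token mixing: a 2D FFT over the token and channel axes,
    keeping the real part, followed by a small feed-forward block."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        mixed = torch.fft.fft2(x, dim=(-2, -1)).real      # spectral mixing, no learned attention
        x = self.norm1(x + mixed)
        return self.norm2(x + self.mlp(x))

# toy usage: mix 128-dim features of 196 RGB tokens
layer = FourierMixing(dim=128)
out = layer(torch.randn(2, 196, 128))
```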

Authors:Quan Zhou, Gan Luo, Qiang Hu, Qingyong Zhang, Jinhua Zhang, Yinjiao Tian, Qiang Li, Zhiwei Wang
Title: Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection
Abstract:
Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to synthesize high-value false positives. The extensive variability of negative backgrounds presents a significant challenge in false positive synthesis. To overcome this, we introduce two key innovations: First, we design a regional noise matching strategy to construct a negative synthesis space using polyp detection datasets. This strategy trains a negative-centric diffusion model by masking polyp regions, ensuring the model focuses exclusively on learning diverse background patterns. Second, we introduce the Detector-guided Adversarial Diffusion Attacker (DADA) module, which perturbs the negative synthesis process to disrupt a pre-trained detector's decision, guiding the negative-centric diffusion model to generate high-value, detector-confusing false positives instead of low-value, ordinary backgrounds. Our approach is the first to apply adversarial diffusion to lesion detection, establishing a new paradigm for targeted false positive synthesis and paving the way for more reliable clinical applications in colorectal cancer screening. Extensive results on public and in-house datasets verify the superiority of our method over the current state-of-the-arts, with our synthesized data improving the detectors by at least 2.6% and 2.7% in F1-score, respectively, over the baselines. Codes are at https://github.com/Huster-Hq/DADA.
Chinese: 本文提出了一种对抗性扩散框架,通过合成高价值假阳性样本以增强息肉检测能力,该方法专注于多样背景模式并干扰检测器决策,显著超越了现有技术的检测性能。
English: This paper introduces an adversarial diffusion framework that synthesizes high-value false positives to enhance polyp detection by focusing on diverse background patterns and confusing detectors, significantly improving detection performance over existing methods.

Authors:Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, Yao Mu
Title: RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Abstract:
Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.
中文: RoboTwin 2.0通过自动化任务合成和领域随机化提出了可扩展的双臂操作数据生成框架,在仿真到实物的迁移和策略鲁棒性方面实现了显著提升。
English: RoboTwin 2.0 introduces a scalable framework for generating diverse bimanual manipulation data through automated task synthesis and domain randomization, achieving significant improvements in sim-to-real transfer and policy robustness.
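
Structured domain randomization along the five axes named in the abstract (clutter, lighting, background, tabletop height, language) can be expressed as a small per-episode sampler. The snippet below is a hypothetical configuration sketch, not the released generator; the value ranges and asset names are invented.

```python
import random
from dataclasses import dataclass

@dataclass
class EpisodeRandomization:
    num_clutter_objects: int   # distractor objects placed on the table
    light_intensity: float     # relative illumination level
    background: str            # name of a background asset
    table_height_m: float      # tabletop height in meters
    instruction: str           # paraphrased language command

def sample_randomization(task_instruction: str) -> EpisodeRandomization:
    """Draw one randomized episode configuration along the five axes."""
    paraphrases = [task_instruction,
                   f"please {task_instruction.lower()}",
                   f"{task_instruction} carefully"]
    return EpisodeRandomization(
        num_clutter_objects=random.randint(0, 8),
        light_intensity=random.uniform(0.4, 1.6),
        background=random.choice(["kitchen", "workshop", "office", "lab"]),
        table_height_m=random.uniform(0.70, 0.85),
        instruction=random.choice(paraphrases),
    )

print(sample_randomization("Stack the red block on the blue block"))
```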

Authors:Wenzhuo Liu, Yicheng Qiao, Zhen Wang, Qiannan Guo, Zilong Chen, Meihua Zhou, Xinran Li, Letian Wang, Zhiwei Li, Huaping Liu, Wenshuo Wang
Title: TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving
Abstract:
Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks, maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
Chinese: 本文提出TEM^3-Learning多模态多任务学习框架,通过高效时空特征提取和自适应模态融合,在四个辅助驾驶任务中实现最优性能,同时保持高速度和低计算成本。
English: This paper introduces TEM^3-Learning, a multimodal multi-task framework that efficiently integrates temporal-spatial feature extraction and adaptive modality fusion to achieve state-of-the-art performance in four assistive driving tasks while maintaining high speed and low computational cost.
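
The task-specific multi-gating idea behind MGMI amounts to letting each task softly re-weight the modality features before fusion. The following is a minimal sketch under assumptions (feature sizes and a softmax gate are illustrative); it is not the released module.

```python
import torch
import torch.nn as nn

class TaskGatedFusion(nn.Module):
    """Each task owns a gate that softly weights the modality features,
    mitigating negative transfer by letting tasks down-weight unhelpful modalities."""
    def __init__(self, dim: int, num_modalities: int, num_tasks: int):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Linear(dim * num_modalities, num_modalities) for _ in range(num_tasks)])

    def forward(self, feats):                               # feats: list of (batch, dim) tensors
        stacked = torch.stack(feats, dim=1)                 # (batch, M, dim)
        flat = stacked.flatten(1)                           # (batch, M*dim)
        fused_per_task = []
        for gate in self.gates:
            w = torch.softmax(gate(flat), dim=-1)           # (batch, M) modality weights
            fused_per_task.append((w.unsqueeze(-1) * stacked).sum(dim=1))
        return fused_per_task                               # one fused feature per task

fusion = TaskGatedFusion(dim=64, num_modalities=3, num_tasks=4)
outputs = fusion([torch.randn(2, 64) for _ in range(3)])
```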

Authors:Jisheng Dang, Huilin Song, Junbin Xiao, Bimei Wang, Han Peng, Haoxuan Li, Xun Yang, Meng Wang, Tat-Seng Chua
Title: MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
Abstract:
Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.
中文摘要:MUPA框架通过多路径推理和反思代理协同工作,在保持答案准确性的同时显著提升了视频问答的视觉依据可靠性,以更少参数实现卓越性能。
English Summary: The proposed MUPA framework enhances Grounded VideoQA by integrating multi-path reasoning and reflection agents, achieving superior grounding fidelity and answer accuracy with fewer parameters.

Authors:Hangzhou He, Jiachen Tang, Lei Zhu, Kaiwen Li, Yanye Lu
Title: Training-free Test-time Improvement for Explainable Medical Image Classification
Abstract:
Deep learning-based medical image classification techniques are rapidly advancing in medical image analysis, making it crucial to develop accurate and trustworthy models that can be efficiently deployed across diverse clinical scenarios. Concept Bottleneck Models (CBMs), which first predict a set of explainable concepts from images and then perform classification based on these concepts, are increasingly being adopted for explainable medical image classification. However, the inherent explainability of CBMs introduces new challenges when deploying trained models to new environments. Variations in imaging protocols and staining methods may induce concept-level shifts, such as alterations in color distribution and scale. Furthermore, since CBM training requires explicit concept annotations, fine-tuning models solely with image-level labels could compromise concept prediction accuracy and faithfulness - a critical limitation given the high cost of acquiring expert-annotated concept labels in medical domains. To address these challenges, we propose a training-free confusion concept identification strategy. By leveraging minimal new data (e.g., 4 images per class) with only image-level labels, our approach enhances out-of-domain performance without sacrificing source domain accuracy through two key operations: masking misactivated confounding concepts and amplifying under-activated discriminative concepts. The efficacy of our method is validated on both skin and white blood cell images. Our code is available at: https://github.com/riverback/TF-TTI-XMed.
中文: 提出的免训练混淆概念识别策略通过屏蔽混淆概念和增强判别性概念,仅使用少量带图像级标签的新数据,即可提升概念瓶颈模型在医学图像分类中的跨域性能。
English: The proposed training-free confusion concept identification strategy enhances out-of-domain performance for Concept Bottleneck Models in medical image classification by masking confounding concepts and amplifying discriminative ones, using minimal new data with only image-level labels.
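
The two operations described above, masking misactivated confounding concepts and amplifying under-activated discriminative ones, reduce to an element-wise adjustment of the concept activation vector before the label predictor. The sketch below is a hypothetical illustration with a made-up amplification factor; in the paper, which concepts to adjust is identified from the few labeled target-domain images.

```python
import numpy as np

def adjust_concepts(concept_scores: np.ndarray,
                    confounding: list,
                    discriminative: list,
                    amplify: float = 1.5) -> np.ndarray:
    """Training-free test-time adjustment of a CBM's concept activations:
    zero out concepts identified as confounding on the new domain and
    scale up concepts identified as discriminative but under-activated."""
    adjusted = concept_scores.copy()
    adjusted[:, confounding] = 0.0            # mask misactivated confounders
    adjusted[:, discriminative] *= amplify    # amplify discriminative concepts
    return adjusted

# toy usage: 4 target-domain images, 6 concepts
scores = np.random.rand(4, 6)
print(adjust_concepts(scores, confounding=[2], discriminative=[0, 5]))
```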

Authors:Fenghe Tang, Wenxin Ma, Zhiyang He, Xiaodong Tao, Zihang Jiang, S. Kevin Zhou
Title: Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster
Abstract:
With the advancement of Large Language Model (LLM) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans. Our in-depth analysis reveals the potential of transferring LLM's semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek.
中文: 本文发现将预训练的大型语言模型冻结层整合到CNN分割框架中,能在仅少量增加参数的情况下提升多种医学影像模态的分割性能,通过迁移语言模型的语义感知能力增强全局理解和局部建模。
English: This paper demonstrates that integrating a frozen pre-trained Large Language Model layer into a CNN-based medical image segmentation framework enhances performance across multiple imaging modalities with minimal parameter increase, leveraging the LLM's semantic awareness for improved global and local modeling.
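
The hybrid design can be pictured as flattening the CNN bottleneck features into tokens, passing them through a frozen transformer block, and reshaping them back. The sketch below uses a generic nn.TransformerEncoderLayer as a stand-in for a pre-trained LLM layer (an assumption), with invented sizes; it is not the LLM4Seg implementation.

```python
import torch
import torch.nn as nn

class FrozenLayerBooster(nn.Module):
    """Insert a frozen transformer layer between a CNN encoder and decoder:
    project channel features to the layer width, run the frozen block over the
    flattened spatial tokens, and project back to the CNN feature size."""
    def __init__(self, cnn_channels: int, llm_dim: int = 512):
        super().__init__()
        self.proj_in = nn.Linear(cnn_channels, llm_dim)
        self.proj_out = nn.Linear(llm_dim, cnn_channels)
        self.block = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        for p in self.block.parameters():          # the stand-in "LLM" block stays frozen
            p.requires_grad_(False)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:   # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        tokens = self.proj_out(self.block(self.proj_in(tokens)))
        return feat + tokens.transpose(1, 2).reshape(b, c, h, w)  # residual add

booster = FrozenLayerBooster(cnn_channels=256)
y = booster(torch.randn(1, 256, 16, 16))
```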

Authors:Junjian Li, Hulin Kuang, Jin Liu, Hailin Yue, Mengshen He, Jianxin Wang
Title: MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis
Abstract:
Multiple instance learning (MIL) has shown significant promise in histopathology whole slide image (WSI) analysis for cancer diagnosis and prognosis. However, the inherent spatial heterogeneity of WSIs presents critical challenges, as morphologically similar tissue types are often dispersed across distant anatomical regions. Conventional MIL methods struggle to model these scattered tissue distributions and capture cross-regional spatial interactions effectively. To address these limitations, we propose a novel Multiple instance learning framework with Context-Aware Clustering (MiCo), designed to enhance cross-regional intra-tissue correlations and strengthen inter-tissue semantic associations in WSIs. MiCo begins by clustering instances to distill discriminative morphological patterns, with cluster centroids serving as semantic anchors. To enhance cross-regional intra-tissue correlations, MiCo employs a Cluster Route module, which dynamically links instances of the same tissue type across distant regions via feature similarity. These semantic anchors act as contextual hubs, propagating semantic relationships to refine instance-level representations. To eliminate semantic fragmentation and strengthen inter-tissue semantic associations, MiCo integrates a Cluster Reducer module, which consolidates redundant anchors while enhancing information exchange between distinct semantic groups. Extensive experiments on two challenging tasks across nine large-scale public cancer datasets demonstrate the effectiveness of MiCo, showcasing its superiority over state-of-the-art methods. The code is available at https://github.com/junjianli106/MiCo.
中文摘要:提出的MiCo框架通过聚类实例捕捉跨区域组织相关性并整合语义锚点,增强了组织病理学图像的多示例学习能力,在九个癌症数据集上展现出优越性能。
English Summary: The proposed MiCo framework enhances multiple instance learning for histopathology image analysis by clustering instances to capture cross-regional tissue correlations and consolidating semantic anchors, demonstrating superior performance across nine cancer datasets.

Authors:Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu
Title: PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
Abstract:
This report introduces PP-DocBee2, an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an $11.4\%$ performance boost on internal benchmarks for Chinese business documents, and reduce inference latency by $73.0\%$ compared to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at \href{https://github.com/PaddlePaddle/PaddleMIX}{https://github.com/PaddlePaddle/PaddleMIX}.
Chinese: PP-DocBee2通过优化数据质量和视觉特征融合策略,在中文商务文档理解任务中性能提升11.4%,推理延迟降低73.0%,显著提升了多模态文档理解能力。
English: PP-DocBee2 significantly improves multimodal document understanding with enhanced data quality and feature fusion, boosting performance by 11.4% and reducing latency by 73.0% compared to its predecessor.

Authors:Yang Li, Victor Cheung, Xinhai Liu, Yuguang Chen, Zhongjin Luo, Biwen Lei, Haohan Weng, Zibo Zhao, Jingwei Huang, Zhuo Chen, Chunchao Guo
Title: Auto-Regressive Surface Cutting
Abstract:
Surface cutting is a fundamental task in computer graphics, with applications in UV parameterization, texture mapping, and mesh decomposition. However, existing methods often produce technically valid but overly fragmented atlases that lack semantic coherence. We introduce SeamGPT, an auto-regressive model that generates cutting seams by mimicking professional workflows. Our key technical innovation lies in formulating surface cutting as a next token prediction task: sample point clouds on mesh vertices and edges, encode them as shape conditions, and employ a GPT-style transformer to sequentially predict seam segments with quantized 3D coordinates. Our approach achieves exceptional performance on UV unwrapping benchmarks containing both manifold and non-manifold meshes, including artist-created and 3D-scanned models. In addition, it enhances existing 3D segmentation tools by providing clean boundaries for part decomposition.
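
Framing seam generation as next-token prediction presupposes that continuous 3D seam coordinates are quantized into a discrete vocabulary. The snippet below is a minimal sketch of such a quantization under assumptions (uniform bins over a normalized bounding box, an invented bin count); the actual tokenizer and transformer are not shown.

```python
import numpy as np

def quantize_coords(points: np.ndarray, bbox_min: np.ndarray,
                    bbox_max: np.ndarray, bins: int = 1024) -> np.ndarray:
    """Map continuous xyz coordinates to integer tokens in [0, bins) per axis,
    so a seam segment becomes a short sequence of discrete tokens."""
    normalized = (points - bbox_min) / (bbox_max - bbox_min + 1e-8)
    return np.clip((normalized * bins).astype(np.int64), 0, bins - 1)

def dequantize_coords(tokens: np.ndarray, bbox_min, bbox_max, bins: int = 1024):
    # Recover approximate coordinates from bin centers.
    return bbox_min + (tokens + 0.5) / bins * (bbox_max - bbox_min)

pts = np.random.rand(5, 3) * 2.0 - 1.0                     # points in [-1, 1]^3
tok = quantize_coords(pts, np.array([-1.0] * 3), np.array([1.0] * 3))
rec = dequantize_coords(tok, np.array([-1.0] * 3), np.array([1.0] * 3))
```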

Authors:Chenyue Song, Chen Hui, Qing Lin, Wei Zhang, Siqiao Li, Haiqi Zhu, Zhixuan Li, Shengping Zhang, Shaohui Liu, Feng Jiang, Xiang Li
Title: LVPNet: A Latent-variable-based Prediction-driven End-to-end Framework for Lossless Compression of Medical Images
Abstract:
Autoregressive Initial Bits is a framework that integrates sub-image autoregression and latent variable modeling, demonstrating its advantages in lossless medical image compression. However, in existing methods, the image segmentation process leads to an even distribution of latent variable information across each sub-image, which in turn causes posterior collapse and inefficient utilization of latent variables. To deal with these issues, we propose a prediction-based end-to-end lossless medical image compression method named LVPNet, leveraging global latent variables to predict pixel values and encoding predicted probabilities for lossless compression. Specifically, we introduce the Global Multi-scale Sensing Module (GMSM), which extracts compact and informative latent representations from the entire image, effectively capturing spatial dependencies within the latent space. Furthermore, to mitigate the information loss introduced during quantization, we propose the Quantization Compensation Module (QCM), which learns the distribution of quantization errors and refines the quantized features to compensate for quantization loss. Extensive experiments on challenging benchmarks demonstrate that our method achieves superior compression efficiency compared to state-of-the-art lossless image compression approaches, while maintaining competitive inference speed. The code is at https://github.com/scy-Jackel/LVPNet.
中文: 提出的LVPNet方法通过全局潜在变量预测像素值,并采用多尺度感知模块和量化补偿模块来捕捉空间依赖性和弥补量化损失,从而在无损医学图像压缩中实现了更优的压缩效率。
English: The proposed LVPNet method enhances lossless medical image compression by utilizing global latent variables for pixel prediction and incorporating modules to capture spatial dependencies and compensate for quantization errors, achieving superior compression efficiency.

Authors:Mischa Dombrowski, Bernhard Kainz
Title: Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models
Abstract:
Synthetic data has recently reached a level of visual fidelity that makes it nearly indistinguishable from real data, offering great promise for privacy-preserving data sharing in medical imaging. However, fully synthetic datasets still suffer from significant limitations: First and foremost, the legal aspect of sharing synthetic data is often neglected and data regulations, such as the GDPR, are largely ignored. Secondly, synthetic models fall short of matching the performance of real data, even for in-domain downstream applications. Recent methods for image generation have focused on maximising image diversity instead of fidelity solely to improve the mode coverage and therefore the downstream performance of synthetic data. In this work, we shift perspective and highlight how maximizing diversity can also be interpreted as protecting natural persons from being singled out, which leads to predicate singling-out (PSO) secure synthetic datasets. Specifically, we propose a generalisable framework for training diffusion models on personal data which leads to unpersonal synthetic datasets achieving performance within one percentage point of real-data models while significantly outperforming state-of-the-art methods that do not ensure privacy. Our code is available at https://github.com/MischaD/Trichotomy.
中文: 本研究提出了一种基于隐私保护的扩散模型训练框架,能够生成既接近真实数据性能又防止个人被识别的合成数据集,其表现显著优于未保障隐私的现有方法。
English: This study introduces a privacy-focused framework for training diffusion models on personal data, producing synthetic datasets that achieve near-real data performance while ensuring protection against individual identification and significantly outperforming existing non-private methods.

Authors:Fei Zhou
Title: SegChange-R1: LLM-Augmented Remote Sensing Change Detection
Abstract:
Remote sensing change detection is used in urban planning, terrain analysis, and environmental monitoring by analyzing feature changes in the same area over time. In this paper, we propose a large language model (LLM) augmented inference approach (SegChange-R1), which enhances the detection capability by integrating textual descriptive information and guides the model to focus on relevant change regions, accelerating convergence. We designed a linear attention-based spatial transformation module (BEV) to address modal misalignment by unifying features from different times into a BEV space. Furthermore, we introduce DVCD, a novel dataset for building change detection from UAV viewpoints. Experiments on four widely-used datasets demonstrate significant improvements over existing methods. The code and pre-trained models are available at https://github.com/Yu-Zhouz/SegChange-R1.
中文摘要:本文提出SegChange-R1方法,通过集成文本描述信息和设计BEV空间对齐模块来增强遥感变化检测性能,并在包括新型无人机建筑变化数据集在内的多个数据集上验证了其优越性。
English Summary: This paper introduces SegChange-R1, a large language model-augmented method that improves remote sensing change detection by incorporating textual descriptions and aligning temporal features in a unified BEV space, validated on multiple datasets including a new UAV-based building change dataset.

Authors:Junho Park, Andrew Sangwoo Ye, Taein Kwon
Title: EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations
Abstract:
Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as necessity of initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce EgoWorld, a novel two-stage framework that reconstructs an egocentric view from rich exocentric observations, including projected point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. Evaluated on the H2O and TACO datasets, EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. Moreover, EgoWorld shows promising results even on unlabeled real-world examples.

Authors:Jianghong Huang, Luping Ji, Xin Ma, Mao Ye
Title: BeltCrack: the First Sequential-image Industrial Conveyor Belt Crack Detection Dataset and Its Baseline with Triple-domain Feature Learning
Abstract:
Conveyor belts are important equipment in modern industry, widely applied in production and manufacturing. Their health is critical to operational efficiency and safety. Cracks are a major threat to belt health. Currently, considering safety, how to intelligently detect belt cracks is attracting increasing attention. To implement intelligent detection with machine learning, real crack samples are believed to be necessary. However, existing crack datasets primarily focus on pavement scenarios or synthetic data, with no real-world industrial belt crack datasets at all. Furthermore, to validate usability and effectiveness, we propose a special baseline method with triple-domain ($i.e.$, time-space-frequency) feature hierarchical fusion learning for the two whole-new datasets. Experimental results demonstrate the availability and effectiveness of our datasets. Besides, they also show that our baseline is obviously superior to other similar detection methods. Our datasets and source codes are available at https://github.com/UESTC-nnLab/BeltCrack.
中文: 本研究提出了两个全新的工业传送带真实裂纹数据集及基于时-空-频三域特征分层融合的基线方法,实验证明该方法显著优于同类检测方案。
English: This study introduces two novel real-world industrial conveyor belt crack datasets and a baseline method using triple-domain feature fusion learning, which proves more effective than existing detection approaches.

Authors:Jiahao Lu, Jiacheng Deng
Title: Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation
Abstract:
3D instance segmentation aims to predict a set of object instances in a scene, representing them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines and superior predictions. However, these methods primarily focus on modeling the external relationships between scene features and query features through mask attention. They lack effective modeling of the internal relationships among scene features as well as between query features. In light of these disadvantages, we propose \textbf{Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation}. Specifically, we introduce an adaptive superpoint aggregation module and a contrastive learning-guided superpoint refinement module to better represent superpoint features (scene features) and leverage contrastive learning to guide the updates of these features. Furthermore, our relation-aware self-attention mechanism enhances the capabilities of modeling relationships between queries by incorporating positional and geometric relationships into the self-attention mechanism. Extensive experiments on the ScanNetV2, ScanNet++, ScanNet200 and S3DIS datasets demonstrate the superior performance of Relation3D.
中文: 本文提出了Relation3D方法,通过自适应超点聚合、对比学习引导的超点优化以及关系感知的自注意力机制来增强三维实例分割中的关系建模,在多个基准数据集上实现了优越性能。
English: This paper introduces Relation3D, a novel method for 3D instance segmentation that enhances relation modeling through adaptive superpoint aggregation, contrastive learning-guided refinement, and a relation-aware self-attention mechanism, achieving superior performance on multiple benchmark datasets.

Authors:Xiaodong Guo, Zi'ang Lin, Luwen Hu, Zhihong Deng, Tong Liu, Wujie Zhou
Title: Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation
Abstract:
The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g. Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduced CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduced a cross-modal 2D-selective-scan (CM-SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross-modal visual sequences and derives hidden state representations of one modality from the other. Second, we developed a cross-modal state space association (CM-SSA) module that effectively integrates global associations from CM-SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer-based approaches, CM-SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Codes are available at https://github.com/xiaodonguo/CMSSM.
中文摘要:提出的CM-SSM架构通过跨模态状态空间建模有效融合RGB与热成像数据,在实现线性计算复杂度的同时,以更少参数和更低计算成本获得了最先进的语义分割性能。
English Summary: The proposed CM-SSM architecture efficiently integrates RGB and thermal data for semantic segmentation using cross-modal state space modeling, achieving superior performance with linear computational complexity compared to Transformer-based methods.

Authors:Yingcheng Liu, Peiqi Wang, Sebastian Diaz, Esra Abaci Turk, Benjamin Billot, P. Ellen Grant, Polina Golland
Title: Fetuses Made Simple: Modeling and Tracking of Fetal Shape and Pose
Abstract:
Analyzing fetal body motion and shape is paramount in prenatal diagnostics and monitoring. Existing methods for fetal MRI analysis mainly rely on anatomical keypoints or volumetric body segmentations. Keypoints simplify body structure to facilitate motion analysis, but may ignore important details of full-body shape. Body segmentations capture complete shape information but complicate temporal analysis due to large non-local fetal movements. To address these limitations, we construct a 3D articulated statistical fetal body model based on the Skinned Multi-Person Linear Model (SMPL). Our algorithm iteratively estimates body pose in the image space and body shape in the canonical pose space. This approach improves robustness to MRI motion artifacts and intensity distortions, and reduces the impact of incomplete surface observations due to challenging fetal poses. We train our model on segmentations and keypoints derived from $19,816$ MRI volumes across $53$ subjects. Our model captures body shape and motion across time series and provides intuitive visualization. Furthermore, it enables automated anthropometric measurements traditionally difficult to obtain from segmentations and keypoints. When tested on unseen fetal body shapes, our method yields a surface alignment error of $3.2$ mm for $3$ mm MRI voxel size. To our knowledge, this represents the first 3D articulated statistical fetal body model, paving the way for enhanced fetal motion and shape analysis in prenatal diagnostics. The code is available at https://github.com/MedicalVisionGroup/fetal-smpl .
中文摘要:本研究首次开发了基于SMPL的三维关节式统计胎儿身体模型,通过迭代估计图像空间中的身体姿态和标准姿态空间中的身体形状,显著提升了对MRI运动伪影的鲁棒性,并实现了传统方法难以获得的自动化人体测量功能。
English Summary: This study introduces the first 3D articulated statistical fetal body model that simultaneously captures body shape and motion from MRI data, improving robustness to artifacts and enabling automated anthropometric measurements with 3.2 mm surface alignment accuracy.

Authors:Fadi Abdeladhim Zidi, Djamel Eddine Boukhari, Abdellah Zakaria Sellam, Abdelkrim Ouafi, Cosimo Distante, Salah Eddine Bekhouche, Abdelmalik Taleb-Ahmed
Title: LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging
Abstract:
Hyperspectral image classification remains a challenging task due to the high dimensionality of spectral data, significant inter-band redundancy, and the limited availability of annotated samples. While recent transformer-based models have improved the global modeling of spectral-spatial dependencies, their scalability and adaptability under label-scarce conditions remain limited. In this work, we propose \textbf{LoLA-SpecViT}(Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer that addresses these limitations through a parameter-efficient architecture tailored to the unique characteristics of hyperspectral imagery. Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency while reducing computational complexity. To further improve adaptability, we integrate low-rank adaptation (LoRA) into attention and projection layers, enabling fine-tuning with over 80\% fewer trainable parameters. A novel cyclical learning rate scheduler modulates LoRA adaptation strength during training, improving convergence and generalisation. Extensive experiments on three benchmark datasets WHU-Hi LongKou, WHU-Hi HongHu, and Salinas demonstrate that LoLA-SpecViT consistently outperforms state-of-the-art baselines, achieving up to 99.91\% accuracy with substantially fewer parameters and enhanced robustness under low-label regimes. The proposed framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics. Our code is available in the following \href{https://github.com/FadiZidiDz/LoLA-SpecViT}{GitHub Repository}.
中文: 本文提出的LoLA-SpecViT轻量化光谱视觉Transformer,通过结合三维卷积光谱前端与局部自注意力机制,并引入低秩自适应技术,在高光谱图像分类任务中以更少参数实现了最优精度。
English: This paper introduces LoLA-SpecViT, a lightweight spectral vision transformer that enhances hyperspectral image classification by combining 3D convolutional spectral processing with local self-attention and low-rank adaptation, achieving state-of-the-art accuracy with significantly reduced parameters.
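
Low-rank adaptation of an attention or projection layer, as used here to cut trainable parameters by over 80%, can be sketched in a few lines. The snippet below is a generic LoRA wrapper with invented rank and scaling defaults, not the authors' exact configuration; the cyclical scheduler mentioned in the abstract would modulate `scale` during training.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update:
    y = W x + scale * B(A(x)), with only A and B receiving gradients."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base, self.scale = base, scale
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)      # start as an exact identity of the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(nn.Linear(128, 128), rank=8)
out = layer(torch.randn(4, 128))
```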

Authors:Mengqi Lei, Siqi Li, Yihong Wu, Han Hu, You Zhou, Xinhu Zheng, Guiguang Ding, Shaoyi Du, Zongze Wu, Yue Gao
Title: YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception
Abstract:
The YOLO series models reign supreme in real-time object detection due to their superior accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention mechanism introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling, lacking the capability to capture global multi-to-multi high-order correlations, which limits detection performance in complex scenarios. In this paper, we propose YOLOv13, an accurate and lightweight object detector. To address the above-mentioned challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations and overcomes the limitation of previous methods that are restricted to pairwise correlation modeling based on hypergraph computation, achieving efficient global cross-location and cross-scale feature fusion and enhancement. Subsequently, we propose a Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm based on HyperACE, which effectively achieves fine-grained information flow and representation synergy within the entire network by distributing correlation-enhanced features to the full pipeline. Finally, we propose to leverage depthwise separable convolutions to replace vanilla large-kernel convolutions, and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. We conduct extensive experiments on the widely used MS COCO benchmark, and the experimental results demonstrate that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, our YOLOv13-N improves mAP by 3.0\% over YOLO11-N and by 1.5\% over YOLOv12-N. The code and models of our YOLOv13 model are available at: https://github.com/iMoonLab/yolov13.
Chinese: YOLOv13通过基于超图的自适应相关性增强机制和全流程聚合-分配范式,解决了先前模型在捕捉全局高阶相关性方面的局限,在MS COCO基准测试中以更少的计算量实现了最优性能。
English: YOLOv13 introduces a Hypergraph-based Adaptive Correlation Enhancement mechanism and a Full-Pipeline Aggregation-and-Distribution paradigm to overcome previous models' limitations in capturing global high-order correlations, achieving state-of-the-art performance with reduced computational complexity on the MS COCO benchmark.

Authors:Jihyun Kim, Junho Park, Kyeongbo Kong, Suk-Ju Kang
Title: Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models
Abstract:
We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each of a room's attributes, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and panorama texture images, and arranging furniture. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP). VP is a method that utilizes a large language model (LLM) to write a Python-like program which is an ordered list of necessary modules for the various tasks given in natural language. We develop most of the modules. Especially, for the texture generating module, we utilize a pretrained large-scale diffusion model to generate panorama images conditioned on text and visual prompts (i.e., layout, depth, and semantic map) simultaneously. Specifically, we enhance the panorama image generation quality by optimizing the training objective with a 1D representation of a panorama scene obtained from bidirectional LSTM. We demonstrate Programmable-Room's flexibility in generating and editing 3D room meshes, and prove our framework's superiority to an existing model quantitatively and qualitatively. Project page is available at https://jihyun0510.github.io/Programmable_Room_Page/.

Authors:Amirshayan Nasirimajd, Chiara Plizzari, Simone Alberto Peirone, Marco Ciccone, Giuseppe Averta, Barbara Caputo
Title: Domain Generalization using Action Sequences for Egocentric Action Recognition
Abstract:
Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model's generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. Additionally, we enhance the model's robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG leads to a +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieves +0.6% Top-1 accuracy over SOTA in intra-domain action recognition.
中文: 本文提出SeqDG方法,通过序列重构和混合训练提升第一人称动作识别模型在未知环境中的泛化能力,在EPIC-KITCHENS-100和EGTEA数据集上实现了显著准确率提升。
English: This paper introduces SeqDG, a domain generalization method for Egocentric Action Recognition that enhances model performance in unseen environments through sequence reconstruction and mixed training, achieving notable accuracy improvements on benchmark datasets.

Authors:Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang
Title: CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Abstract:
Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. It is a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.
中文: CLiViS是一种无需训练的框架,它结合了大型语言模型的任务规划能力和视觉语言模型的感知能力,通过动态认知地图连接感知与推理,以在复杂环境中实现高效的具身视觉推理。
English: CLiViS is a training-free framework that synergizes LLMs for task planning and VLMs for visual perception, using a dynamic Cognitive Map to bridge perception and reasoning for effective embodied visual reasoning in complex environments.

Authors:Mihir Godbole, Xiangbo Gao, Zhengzhong Tu
Title: DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving
Abstract:
Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations. To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle's reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
中文摘要:本研究提出DRAMA-X基准,用于评估自动驾驶车辆在安全关键场景中对行人及骑行者意图的预测能力,并通过结合场景图与语言模型的新框架,显著提升了意图推理和风险评估的准确性。
English Summary: This research introduces DRAMA-X, a benchmark for evaluating autonomous vehicles' ability to predict pedestrian and cyclist intentions in safety-critical scenarios, proposing a novel framework that uses scene graphs and language models to improve intent reasoning and risk assessment.

Authors:Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz
Title: Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction
Abstract:
Medical vision-language models (VLMs) have demonstrated unprecedented transfer capabilities and are being increasingly adopted for data-efficient image classification. Despite its growing popularity, its reliability aspect remains largely unexplored. This work explores the split conformal prediction (SCP) framework to provide trustworthiness guarantees when transferring such models based on a small labeled calibration set. Despite its potential, the generalist nature of the VLMs' pre-training could negatively affect the properties of the predicted conformal sets for specific tasks. While common practice in transfer learning for discriminative purposes involves an adaptation stage, we observe that deploying such a solution for conformal purposes is suboptimal since adapting the model using the available calibration data breaks the rigid exchangeability assumptions for test data in SCP. To address this issue, we propose transductive split conformal adaptation (SCA-T), a novel pipeline for transfer learning on conformal scenarios, which performs an unsupervised transductive adaptation jointly on calibration and test data. We present comprehensive experiments utilizing medical VLMs across various image modalities, transfer tasks, and non-conformity scores. Our framework offers consistent gains in efficiency and conditional coverage compared to SCP, maintaining the same empirical guarantees.
中文: 本研究提出了一种转导式分割共形适应(SCA-T)框架,通过在校准和测试数据上进行无监督转导适应,提升了医学视觉语言模型在迁移学习中的可靠性和预测效率,优于传统共形方法。
English: This study introduces a transductive split conformal adaptation (SCA-T) framework to enhance the reliability of medical vision-language models by ensuring efficient and conditionally accurate predictions during transfer learning, outperforming standard conformal methods.
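
The split conformal prediction baseline that SCA-T builds on can be summarized in a few lines: calibrate a nonconformity threshold on a small labeled set, then return prediction sets at test time. The sketch below shows vanilla SCP with a softmax-based score (an assumption; the paper studies several non-conformity scores) and does not include the proposed transductive adaptation.

```python
import numpy as np

def calibrate_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray,
                        alpha: float = 0.1) -> float:
    """Split conformal calibration: score = 1 - probability of the true class;
    the threshold is the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    k = int(np.ceil((n + 1) * (1.0 - alpha))) - 1        # 0-based index, finite-sample corrected
    return scores[min(k, n - 1)]

def prediction_sets(test_probs: np.ndarray, threshold: float) -> list:
    """Include every class whose score (1 - prob) does not exceed the threshold."""
    return [np.where(1.0 - p <= threshold)[0] for p in test_probs]

# toy usage with random "model" probabilities over 5 classes
cal_p = np.random.dirichlet(np.ones(5), size=50)
cal_y = np.random.randint(0, 5, size=50)
tau = calibrate_threshold(cal_p, cal_y, alpha=0.1)
sets = prediction_sets(np.random.dirichlet(np.ones(5), size=3), tau)
```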

Authors:Julio Silva-Rodríguez, Fereshteh Shakeri, Houda Bahig, Jose Dolz, Ismail Ben Ayed
Title: Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation
Abstract:
Vision-language models (VLMs) are gaining attention in medical image analysis. These are pre-trained on large, heterogeneous data sources, yielding rich and transferable representations. Notably, the combination of modality-specialized VLMs with few-shot adaptation has provided fruitful results, enabling the efficient deployment of high-performing solutions. However, previous works on this topic make strong assumptions about the distribution of adaptation data, which are unrealistic in the medical domain. First, prior art assumes access to a balanced support set, a condition that breaks the natural imbalance in disease prevalence found in real-world scenarios. Second, these works typically assume the presence of an additional validation set to fix critical hyper-parameters, which is highly data-inefficient. This work challenges these favorable deployment scenarios and introduces a realistic, imbalanced, validation-free adaptation setting. Our extensive benchmark across various modalities and downstream tasks demonstrates that current methods systematically compromise their performance when operating under realistic conditions, occasionally even performing worse than zero-shot inference. Also, we introduce a training-free linear probe that adaptively blends visual and textual supervision. Detailed studies demonstrate that the proposed solver is a strong, efficient baseline, enabling robust adaptation in challenging scenarios.
中文摘要:视觉语言模型在医学影像分析中潜力显著,但现有方法依赖平衡数据和验证集等不切实际的假设;本研究提出无需训练的自适应融合方法,在更贴近现实的不平衡场景下实现了鲁棒性能。
English Summary: Vision-language models show promise in medical imaging but rely on unrealistic assumptions about balanced data and validation sets, prompting this study to propose a training-free adaptive method that performs robustly under more practical, imbalanced conditions.
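
The training-free linear probe described above blends textual class embeddings with visual prototypes computed from the (possibly imbalanced) support set. The sketch below is a simplified, hypothetical version: a fixed blending coefficient instead of the paper's adaptive one, with cosine-similarity classification and a fallback to the text prototype for classes missing from the support set.

```python
import torch
import torch.nn.functional as F

def blended_probe(text_weights: torch.Tensor,    # (num_classes, dim), from the VLM text encoder
                  support_feats: torch.Tensor,   # (num_shots, dim), image features
                  support_labels: torch.Tensor,  # (num_shots,)
                  alpha: float = 0.5) -> torch.Tensor:
    """Build classifier weights as a convex combination of text embeddings and
    per-class visual means; no training or validation set is required."""
    num_classes, dim = text_weights.shape
    visual_proto = torch.zeros(num_classes, dim)
    for c in range(num_classes):
        mask = support_labels == c
        if mask.any():                                    # classes may be missing or imbalanced
            visual_proto[c] = support_feats[mask].mean(0)
        else:
            visual_proto[c] = text_weights[c]             # fall back to the text prototype
    w = alpha * F.normalize(text_weights, dim=-1) + (1 - alpha) * F.normalize(visual_proto, dim=-1)
    return F.normalize(w, dim=-1)

def classify(image_feats: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    return F.normalize(image_feats, dim=-1) @ weights.t()   # cosine-similarity logits

weights = blended_probe(torch.randn(10, 512), torch.randn(16, 512), torch.randint(0, 10, (16,)))
logits = classify(torch.randn(4, 512), weights)
```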

Authors:Zijun Sun, Solveig Thrun, Michael Kampffmeyer
Title: VMRA-MaR: An Asymmetry-Aware Temporal Framework for Longitudinal Breast Cancer Risk Prediction
Abstract:
Breast cancer remains a leading cause of mortality worldwide and is typically detected via screening programs where healthy people are invited at regular intervals. Automated risk prediction approaches have the potential to improve this process by facilitating dynamic screening of high-risk groups. While most models focus solely on the most recent screening, there is growing interest in exploiting temporal information to capture evolving trends in breast tissue, as inspired by clinical practice. Early methods typically relied on two time steps, and although recent efforts have extended this to multiple time steps using Transformer architectures, challenges remain in fully harnessing the rich temporal dynamics inherent in longitudinal imaging data. In this work, we propose to instead leverage Vision Mamba RNN (VMRNN) with a state-space model (SSM) and LSTM-like memory mechanisms to effectively capture nuanced trends in breast tissue evolution. To further enhance our approach, we incorporate an asymmetry module that utilizes a Spatial Asymmetry Detector (SAD) and Longitudinal Asymmetry Tracker (LAT) to identify clinically relevant bilateral differences. This integrated framework demonstrates notable improvements in predicting cancer onset, especially for the more challenging high-density breast cases and achieves superior performance at extended time points (years four and five), highlighting its potential to advance early breast cancer recognition and enable more personalized screening strategies. Our code is available at https://github.com/Mortal-Suen/VMRA-MaR.git.
中文: 本研究提出了一种新颖的Vision Mamba RNN框架,通过整合不对称性检测模块来捕捉乳腺组织的细微演化趋势和双侧差异,显著提升了高风险病例和长期筛查中的乳腺癌预测性能。
English: This study introduces a novel Vision Mamba RNN framework enhanced with asymmetry detection to improve breast cancer risk prediction by capturing nuanced tissue evolution trends and bilateral differences, demonstrating superior performance in challenging cases and long-term screenings.

Authors:Haitian Wang, Yiren Wang, Xinyu Wang, Yumeng Miao, Yuliang Zhang, Yu Zhang, Atif Mansoor
Title: P2MFDS: A Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments
Abstract:
By 2050, people aged 65 and over are projected to make up 16 percent of the global population. Aging is closely associated with increased fall risk, particularly in wet and confined environments such as bathrooms, where over 80 percent of falls occur. Although recent research has increasingly focused on non-intrusive, privacy-preserving approaches that do not rely on wearable devices or video-based monitoring, these efforts have not fully overcome the limitations of existing unimodal systems (e.g., WiFi-, infrared-, or mmWave-based), which are prone to reduced accuracy in complex environments. These limitations stem from fundamental constraints in unimodal sensing, including system bias and environmental interference, such as multipath fading in WiFi-based systems and drastic temperature changes in infrared-based methods. To address these challenges, we propose a Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments. First, we develop a sensor evaluation framework to select and fuse millimeter-wave radar with 3D vibration sensing, and use it to construct and preprocess a large-scale, privacy-preserving multimodal dataset in real bathroom settings, which will be released upon publication. Second, we introduce P2MFDS, a dual-stream network combining a CNN-BiLSTM-Attention branch for radar motion dynamics with a multi-scale CNN-SEBlock-Self-Attention branch for vibration impact detection. By uniting macro- and micro-scale features, P2MFDS delivers significant gains in accuracy and recall over state-of-the-art approaches. Code and pretrained models will be made available at: https://github.com/HaitianWang/P2MFDS-A-Privacy-Preserving-Multimodal-Fall-Detection-Network-for-Elderly-Individuals-in-Bathroom.
中文: 针对2050年全球65岁以上人口将占16%且浴室跌倒高发的现状,本研究提出P2MFDS多模态监测系统,通过融合毫米波雷达与振动传感技术,在保护隐私的前提下实现了较现有方法更优的检测精度与召回率。
English: By 2050, the global elderly population will reach 16%, with bathrooms being high-risk fall areas, prompting the development of P2MFDS—a privacy-preserving multimodal system using radar and vibration sensors that significantly outperforms existing methods in accuracy and recall.

Authors:Zhixiang Chi, Li Gu, Huan Liu, Ziqiang Wang, Yanan Wu, Yang Wang, Konstantinos N Plataniotis
Title: Learning to Adapt Frozen CLIP for Few-Shot Test-Time Domain Adaptation
Abstract:
Few-shot Test-Time Domain Adaptation focuses on adapting a model at test time to a specific domain using only a few unlabeled examples, addressing domain shift. Prior methods leverage CLIP's strong out-of-distribution (OOD) abilities by generating domain-specific prompts to guide its generalized, frozen features. However, since downstream datasets are not explicitly seen by CLIP, solely depending on the feature space knowledge is constrained by CLIP's prior knowledge. Notably, when using a less robust backbone like ViT-B/16, performance significantly drops on challenging real-world benchmarks. Departing from the state-of-the-art of inheriting the intrinsic OOD capability of CLIP, this work introduces learning directly on the input space to complement the dataset-specific knowledge for frozen CLIP. Specifically, an independent side branch is attached in parallel with CLIP and enforced to learn exclusive knowledge via revert attention. To better capture the dataset-specific label semantics for downstream adaptation, we propose to enhance the inter-dispersion among text features via greedy text ensemble and refinement. The text and visual features are then progressively fused in a domain-aware manner by a generated domain prompt to adapt toward a specific domain. Extensive experiments show our method's superiority on 5 large-scale benchmarks (WILDS and DomainNet), notably improving over smaller networks like ViT-B/16 with gains of \textbf{+5.1} in F1 for iWildCam and \textbf{+3.1\%} in WC Acc for FMoW.
中文摘要:本文提出了一种新方法,通过在CLIP冻结特征基础上增加输入空间学习分支,结合逆向注意力和改进的文本特征分散技术,显著提升了小样本测试时域自适应性能,在多个基准测试中取得突破性进展。
English Summary: This paper introduces a novel method that enhances few-shot test-time domain adaptation by learning directly in the input space alongside CLIP's frozen features, using a side branch with revert attention and improved text feature dispersion, achieving significant performance gains on challenging benchmarks.

Authors:Yijun Lin, Theresa Chen, Colby Brungard, Grunwald Sabine, Sue Ives, Matt Macander, Timm Nawrocki, Yao-Yi Chiang, Nic Jelinski
Title: Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning
Abstract:
Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region's ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at https://github.com/knowledge-computing/Peatland-permafrost.
中文: 本研究提出的MISO视觉机器学习模型,在阿拉斯加高分辨率土壤制图中优于传统方法,为气候变化下的冻土监测和基础设施规划提供了更有效的解决方案。
English: This study introduces MISO, a vision-based machine learning model that outperforms traditional methods in generating high-resolution soil maps for Alaska, enabling better permafrost monitoring and infrastructure planning amid climate change.

Authors:Yuqi Li, Junhao Dong, Zeyu Dong, Chuanguang Yang, Zhulin An, Yongjun Xu
Title: SRKD: Towards Efficient 3D Point Cloud Segmentation via Structure- and Relation-aware Knowledge Distillation
Abstract:
3D point cloud segmentation faces practical challenges due to the computational complexity and deployment limitations of large-scale transformer-based models. To address this, we propose a novel Structure- and Relation-aware Knowledge Distillation framework, named SRKD, that transfers rich geometric and semantic knowledge from a large frozen teacher model (>100M) to a lightweight student model (<15M). Specifically, we propose an affinity matrix-based relation alignment module, which distills structural dependencies from the teacher to the student through point-wise similarity matching, enhancing the student's capability to learn contextual interactions. Meanwhile, we introduce a cross-sample mini-batch construction strategy that enables the student to perceive stable and generalized geometric structure, aligned across diverse point cloud instances from the teacher rather than within a single sample. Additionally, KL divergence is applied to align semantic distributions, and ground-truth supervision further reinforces accurate segmentation. Our method achieves state-of-the-art performance with significantly reduced model complexity, demonstrating its effectiveness and efficiency in real-world deployment scenarios. Our code is available at https://github.com/itsnotacie/SRKD.
中文: 提出的SRKD框架通过关系对齐和跨样本训练,将大型教师模型的几何与语义知识高效迁移至轻量学生模型,以显著降低的复杂度实现了最先进的3D点云分割性能。
English: The proposed SRKD framework efficiently transfers geometric and semantic knowledge from a large teacher model to a lightweight student model through relation alignment and cross-sample training, achieving state-of-the-art 3D segmentation with significantly reduced complexity.
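To make the relation-alignment idea concrete, the sketch below pairs an affinity-matrix matching loss with the KL-based semantic alignment the abstract mentions. It is a minimal PyTorch illustration under assumed shapes (a shared feature dimension and 13 classes); the cross-sample mini-batch construction, loss weights, and the actual SRKD architecture are not reproduced, and all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def affinity_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Point-wise cosine-similarity affinity matrix for (N, D) point features."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.t()  # (N, N) structural dependencies between points

def distillation_losses(student_feats, teacher_feats,
                        student_logits, teacher_logits, tau: float = 4.0):
    # Relation alignment: match the student's point-wise similarity structure to the teacher's.
    rel_loss = F.mse_loss(affinity_matrix(student_feats),
                          affinity_matrix(teacher_feats).detach())
    # Semantic alignment: KL divergence between temperature-softened class distributions.
    kl_loss = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                       F.softmax(teacher_logits.detach() / tau, dim=-1),
                       reduction="batchmean") * tau * tau
    return rel_loss, kl_loss

# Toy usage: 2048 points, features already projected to a shared 96-d space, 13 classes.
s_f, t_f = torch.randn(2048, 96), torch.randn(2048, 96)
s_l, t_l = torch.randn(2048, 13), torch.randn(2048, 13)
print(distillation_losses(s_f, t_f, s_l, t_l))
```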

Authors:Jisu Nam, Soowon Son, Dahyun Chung, Jiyoung Kim, Siyoon Jin, Junhwa Hur, Seungryong Kim
Title: Emergent Temporal Correspondences from Video Diffusion Transformers
Abstract:
Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated videos with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching, and that this matching becomes increasingly prominent during the denoising process. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.
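As a rough illustration of the query-key matching that DiffTrack analyzes, the snippet below tracks a single point between two frames by taking the argmax of scaled dot-product similarity between one frame's query feature and the other frame's key features. The feature shapes and the layer/timestep choice are assumptions for illustration; DiffTrack's actual metrics, datasets, and guidance method are not reproduced.

```python
import torch

def track_point(q_feats: torch.Tensor, k_feats: torch.Tensor, src_xy):
    """Zero-shot point tracking via query-key similarity between two frames.

    q_feats: (H, W, D) query features of the source frame at one layer/timestep.
    k_feats: (H, W, D) key features of the target frame at the same layer.
    src_xy:  (x, y) location of the point to track in the source frame.
    """
    h, w, d = k_feats.shape
    q = q_feats[src_xy[1], src_xy[0]]              # (D,) query at the tracked point
    sim = (k_feats.reshape(-1, d) @ q) / d ** 0.5  # scaled dot-product similarity
    idx = int(sim.argmax())                        # best-matching target location
    return idx % w, idx // w                       # (x, y) in the target frame

# Toy usage on random features for a 32x32 latent grid.
print(track_point(torch.randn(32, 32, 64), torch.randn(32, 32, 64), (5, 7)))
```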

Authors:Xiuyu Yang, Shuhan Tan, Philipp Krähenbühl
Title: Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
Abstract:
An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen
中文: InfGen是一种统一的下一令牌预测模型,通过交替执行闭环运动模拟和场景生成来实现稳定的长期交通模拟,在短期和长期场景中均达到最优性能。
English: InfGen is a unified next-token prediction model that enables stable long-term traffic simulation by interleaving closed-loop motion simulation and scene generation, achieving state-of-the-art performance in both short-term and long-term scenarios.

Authors:Yukun Huang, Yanning Zhou, Jianan Wang, Kaiyi Huang, Xihui Liu
Title: DreamCube: 3D Panorama Generation via Multi-plane Synchronization
Abstract:
3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. Based on this design, we further introduce DreamCube, a multi-plane RGB-D diffusion model for 3D panorama generation, which maximizes the reuse of 2D foundation model priors to achieve diverse appearances and accurate geometry while maintaining multi-view consistency. Extensive experiments demonstrate the effectiveness of our approach in panoramic image generation, panoramic depth estimation, and 3D scene generation.

Authors:Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
Title: UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
Abstract:
Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from a progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension; in contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.
Chinese: 统一图像理解与生成模型面临模态对齐模式的根本冲突,UniFork通过Y形架构在浅层共享表示学习,同时在深层采用任务特定分支来平衡共享学习与专业化,有效解决了这一问题。
English: Unified image understanding and generation models face a fundamental conflict in modality alignment patterns, which UniFork addresses through a Y-shaped architecture that shares shallow layers while employing task-specific deep branches to balance shared learning and specialization.

Authors:Qing Xu, Yuxiang Luo, Wenting Duan, Zhen Chen
Title: Co-Seg++: Mutual Prompt-Guided Collaborative Learning for Versatile Medical Segmentation
Abstract:
Medical image analysis is critical yet challenging: it requires jointly segmenting organs or tissues as well as numerous instances for anatomical structure and tumor microenvironment analysis. Existing studies typically formulate different segmentation tasks in isolation, overlooking the fundamental interdependencies between these tasks and leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-temporal prompt encoder (STP-Encoder) to capture long-range spatial and temporal relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-art methods in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg-Plus.
中文摘要:Co-Seg++框架提出了一种新颖的协同分割范式,通过时空提示编码器和多任务协作解码器实现语义分割与实例分割任务的相互增强,在多种医学影像数据集上取得了最优的分割性能。
English Summary: The Co-Seg++ framework introduces a novel co-segmentation approach that enables mutual enhancement between semantic and instance segmentation tasks through a spatio-temporal prompt encoder and multi-task collaborative decoder, achieving state-of-the-art performance across multiple medical imaging datasets.

Authors:Teng Guo, Jingjin Yu
Title: RGBTrack: Fast, Robust Depth-Free 6D Pose Estimation and Tracking
Abstract:
We introduce a robust framework, RGBTrack, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input for such dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTrack integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTrack's scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTrack's novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution for application areas including robotics, augmented reality, and computer vision. The source code for our implementation will be made publicly available at https://github.com/GreatenAnoymous/RGBTrack.git.
中文:RGBTrack是一种仅使用RGB数据的实时6D姿态估计与跟踪框架,无需深度输入即可实现高精度和实时性能,适用于机器人和增强现实等领域。
English: RGBTrack is a robust, real-time 6D pose estimation and tracking framework that uses only RGB data, achieving competitive accuracy and efficiency without depth input, making it suitable for robotics and augmented reality applications.

Authors:Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
Title: MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
Abstract:
Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
中文:MEXA是一种无需训练的框架,能根据输入模态和任务需求动态选择和聚合专业专家模型,无需额外训练即可实现跨领域的透明高效多模态推理。
English: MEXA is a training-free framework that dynamically selects and aggregates specialized expert models based on input modality and task demands, enabling transparent and effective multimodal reasoning across diverse domains without additional training.
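A minimal sketch of the selection-and-aggregation pattern the abstract describes follows; every name here (EXPERTS, mexa_answer, ask_lrm) is a hypothetical stand-in rather than the paper's interface, and the real system's skill inference and expert pool are far richer.

```python
# Hypothetical expert registry keyed by (modality, skill); each expert returns
# interpretable textual reasoning about its slice of the input.
EXPERTS = {
    ("video", "temporal_reasoning"): lambda x: "expert notes on motion ...",
    ("audio", "speech"):             lambda x: "expert transcript/analysis ...",
    ("3d",    "geometry"):           lambda x: "expert description of layout ...",
}

def mexa_answer(inputs: dict, skills, ask_lrm):
    """Training-free aggregation: route to matching experts, then reason over their text."""
    notes = []
    for (modality, skill), expert in EXPERTS.items():
        if modality in inputs and skill in skills:     # modality- and task-aware selection
            notes.append(f"[{modality}/{skill}] {expert(inputs[modality])}")
    prompt = ("Question: " + inputs["question"] + "\n"
              + "\n".join(notes)
              + "\nAnswer using the expert evidence above.")
    return ask_lrm(prompt)                             # final answer from a Large Reasoning Model

# Usage with a stub in place of a real reasoning model.
print(mexa_answer({"video": "clip.mp4", "question": "What happens next?"},
                  ["temporal_reasoning"], ask_lrm=lambda p: "stub answer"))
```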

Authors:Ke Li, Chenyu Zhang, Yuxin Ding, Xianbiao Hu, Ruwen Qin
Title: Multi-label Scene Classification for Autonomous Vehicles: Acquiring and Accumulating Knowledge from Diverse Datasets
Abstract:
Driving scenes are inherently heterogeneous and dynamic. Multi-attribute scene identification, as a high-level visual perception capability, provides autonomous vehicles (AVs) with essential contextual awareness to understand, reason through, and interact with complex driving environments. Although scene identification is best modeled as a multi-label classification problem via multitask learning, it faces two major challenges: the difficulty of acquiring balanced, comprehensively annotated datasets and the need to re-annotate all training data when new attributes emerge. To address these challenges, this paper introduces a novel deep learning method that integrates Knowledge Acquisition and Accumulation (KAA) with Consistency-based Active Learning (CAL). KAA leverages monotask learning on heterogeneous single-label datasets to build a knowledge foundation, while CAL bridges the gap between single- and multi-label data, adapting the foundation model for multi-label scene classification. An ablation study on the newly developed Driving Scene Identification (DSI) dataset demonstrates a 56.1% improvement over an ImageNet-pretrained baseline. Moreover, KAA-CAL outperforms state-of-the-art multi-label classification methods on the BDD100K and HSD datasets, achieving this with 85% less data and even recognizing attributes unseen during foundation model training. The DSI dataset and KAA-CAL implementation code are publicly available at https://github.com/KELISBU/KAA-CAL .
中文: 本文提出新颖的KAA-CAL方法,通过整合单标签数据集的知识获取与主动学习,实现了高效的多标签驾驶场景分类,在显著减少数据需求的同时获得了卓越的性能提升。
English: This paper introduces a novel KAA-CAL method that integrates knowledge acquisition from single-label datasets with active learning to enable efficient multi-label driving scene classification, achieving significant performance improvements with substantially reduced data requirements.

Authors:Binbin Xiang, Maciej Wielgosz, Stefano Puliti, Kamil Král, Martin Krůček, Azim Missarov, Rasmus Astrup
Title: ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
Abstract:
The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code are publicly available at https://bxiang233.github.io/FF3D/.

Authors:Annika Thomas, Robaire Galliath, Aleksander Garbuz, Luke Anger, Cormac O'Neill, Trevor Johst, Dami Thomas, George Lordos, Jonathan P. How
Title: LunarLoc: Segment-Based Global Localization on the Moon
Abstract:
Global localization is necessary for autonomous operations on the lunar surface where traditional Earth-based navigation infrastructure, such as GPS, is unavailable. As NASA advances toward sustained lunar presence under the Artemis program, autonomous operations will be an essential component of tasks such as robotic exploration and infrastructure deployment. Tasks such as excavation and transport of regolith require precise pose estimation, but proposed approaches such as visual-inertial odometry (VIO) accumulate odometry drift over long traverses. Precise pose estimation is particularly important for upcoming missions such as the ISRU Pilot Excavator (IPEx) that rely on autonomous agents to operate over extended timescales and varied terrain. To help overcome odometry drift over long traverses, we propose LunarLoc, an approach to global localization that leverages instance segmentation for zero-shot extraction of boulder landmarks from onboard stereo imagery. Segment detections are used to construct a graph-based representation of the terrain, which is then aligned with a reference map of the environment captured during a previous session using graph-theoretic data association. This method enables accurate and drift-free global localization in visually ambiguous settings. LunarLoc achieves sub-cm level accuracy in multi-session global localization experiments, significantly outperforming the state of the art in lunar global localization. To encourage the development of further methods for global localization on the Moon, we release our datasets publicly with a playback module: https://github.com/mit-acl/lunarloc-data.
中文摘要:LunarLoc是一种月球表面全局定位方法,通过立体影像的零样本实例分割提取巨石地标,并采用基于图论的场景匹配技术,可在视觉模糊环境中实现厘米级精度的无漂移定位。
English Summary: LunarLoc is a global localization method for lunar surface operations that uses instance segmentation of boulder landmarks from stereo imagery and graph-based terrain matching to achieve sub-centimeter accuracy without odometry drift.

Authors:Bin Huang, Feihong Xu, Xinchong Shi, Shan Huang, Binxuan Li, Fei Li, Qiegen Liu
Title: PET Tracer Separation Using Conditional Diffusion Transformer with Multi-latent Space Learning
Abstract:
In clinical practice, single-radiotracer positron emission tomography (PET) is commonly used for imaging. Although multi-tracer PET imaging can provide supplementary information of radiotracers that are sensitive to physiological function changes, enabling a more comprehensive characterization of physiological and pathological states, the gamma-photon pairs generated by positron annihilation reactions of different tracers in PET imaging have the same energy, making it difficult to distinguish the tracer signals. In this study, a multi-latent space guided texture conditional diffusion transformer model (MS-CDT) is proposed for PET tracer separation. To the best of our knowledge, this is the first attempt to use texture condition and multi-latent space for tracer separation in PET imaging. The proposed model integrates diffusion and transformer architectures into a unified optimization framework, with the novel addition of texture masks as conditional inputs to enhance image details. By leveraging multi-latent space prior derived from different tracers, the model captures multi-level feature representations, aiming to balance computational efficiency and detail preservation. The texture masks, serving as conditional guidance, help the model focus on salient structural patterns, thereby improving the extraction and utilization of fine-grained image textures. When combined with the diffusion transformer backbone, this conditioning mechanism contributes to more accurate and robust tracer separation. To evaluate its effectiveness, the proposed MS-CDT is compared with several advanced methods on two types of 3D PET datasets: brain and chest scans. Experimental results indicate that MS-CDT achieved competitive performance in terms of image quality and preservation of clinically relevant information. Code is available at: https://github.com/yqx7150/MS-CDT.
中文: 本研究提出了一种多潜在空间引导的纹理条件扩散变换模型(MS-CDT),首次在PET多示踪剂成像中利用纹理条件和多潜在空间来区分难以分离的示踪剂信号,在保持图像质量和临床信息方面表现出优异性能。
English: This study introduces a multi-latent space guided texture conditional diffusion transformer model (MS-CDT), the first method to utilize texture conditions and multi-latent spaces for separating indistinguishable tracer signals in multi-tracer PET imaging, achieving competitive performance in preserving image quality and clinical information.

Authors:Jun Fu, Bin Tian, Haonan Chen, Shi Meng, Tingting Yao
Title: ParkFormer: A Transformer-Based Parking Policy with Goal Embedding and Pedestrian-Aware Control
Abstract:
Autonomous parking plays a vital role in intelligent vehicle systems, particularly in constrained urban environments where high-precision control is required. While traditional rule-based parking systems struggle with environmental uncertainties and lack adaptability in crowded or dynamic scenes, human drivers demonstrate the ability to park intuitively without explicit modeling. Inspired by this observation, we propose a Transformer-based end-to-end framework for autonomous parking that learns from expert demonstrations. The network takes as input surround-view camera images, goal-point representations, ego vehicle motion, and pedestrian trajectories. It outputs discrete control sequences including throttle, braking, steering, and gear selection. A novel cross-attention module integrates BEV features with target points, and a GRU-based pedestrian predictor enhances safety by modeling dynamic obstacles. We validate our method on the CARLA 0.9.14 simulator in both vertical and parallel parking scenarios. Experiments show our model achieves a high success rate of 96.57%, with average positional and orientation errors of 0.21 meters and 0.41 degrees, respectively. The ablation studies further demonstrate the effectiveness of key modules such as pedestrian prediction and goal-point attention fusion. The code and dataset will be released at: https://github.com/little-snail-f/ParkFormer.
中文摘要:本文提出了一种基于Transformer的端到端自动泊车框架,通过模仿专家驾驶行为,在集成鸟瞰图特征和行人预测模块后实现了96.57%的成功率,且位置与方向误差极小。
English Summary: This paper introduces a Transformer-based end-to-end autonomous parking framework that learns from expert demonstrations, achieving a 96.57% success rate with minimal positional and orientation errors through integrated BEV features and pedestrian prediction modules.
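The goal-point attention fusion can be pictured as a single cross-attention step in which a goal embedding queries the flattened BEV feature map. The sketch below is one plausible PyTorch realization with assumed dimensions; the paper's actual module, pedestrian GRU, and control decoder are not shown.

```python
import torch
import torch.nn as nn

class GoalBEVCrossAttention(nn.Module):
    """Fuse a goal-point embedding with flattened BEV features via cross-attention."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.goal_proj = nn.Linear(2, dim)                 # (x, y) goal point -> query embedding
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bev: torch.Tensor, goal_xy: torch.Tensor) -> torch.Tensor:
        # bev: (B, C, H, W) bird's-eye-view features; goal_xy: (B, 2) target parking slot.
        b, c, h, w = bev.shape
        tokens = bev.flatten(2).transpose(1, 2)            # (B, H*W, C) keys/values
        query = self.goal_proj(goal_xy).unsqueeze(1)       # (B, 1, C) goal query
        fused, _ = self.attn(query, tokens, tokens)        # goal attends over the BEV map
        return fused.squeeze(1)                            # (B, C) goal-conditioned context

m = GoalBEVCrossAttention()
print(m(torch.randn(2, 128, 32, 32), torch.rand(2, 2)).shape)  # torch.Size([2, 128])
```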

Authors:Chaonan Ji, Jinwei Qi, Peng Zhang, Bang Zhang, Liefeng Bo
Title: Controllable and Expressive One-Shot Video Head Swapping
Abstract:
In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, while preserving the original body and background of the target video, and further allows users to tweak head expressions and movements during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement, neglecting holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds; none of these methods allows users to modify the transplanted head expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combined with a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing. A scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method excels in seamless background integration while preserving the identity of the source portrait, as well as showcasing superior expression transfer capabilities applicable to both real and virtual characters.
中文: 本文提出一种基于扩散模型的视频头部替换框架,通过身份保持融合和解耦标志点重定向技术,在保留目标视频背景的同时实现头部表情编辑功能。
English: This paper introduces a diffusion-based framework for video head swapping that preserves target video backgrounds while enabling expression editing through identity-preserving fusion and disentangled landmark retargeting.

Authors:Chaehyeon Song, Dongjae Lee, Jongwoo Lim, Ayoung Kim
Title: Camera Calibration via Circular Patterns: A Comprehensive Framework with Measurement Uncertainty and Unbiased Projection Model
Abstract:
Camera calibration using planar targets has been widely favored, and two types of control points have been mainly considered as measurements: the corners of the checkerboard and the centroid of circles. Since a centroid is derived from numerous pixels, the circular pattern provides more precise measurements than the checkerboard. However, the existing projection model of circle centroids is biased under lens distortion, resulting in low performance. To surmount this limitation, we propose an unbiased projection model of the circular pattern and demonstrate its superior accuracy compared to the checkerboard. Complementing this, we introduce uncertainty into circular patterns to enhance calibration robustness and completeness. Defining centroid uncertainty improves the performance of calibration components, including pattern detection, optimization, and evaluation metrics. We also provide guidelines for performing good camera calibration based on the evaluation metric. The core concept of this approach is to model the boundary points of a two-dimensional shape as a Markov random field, considering its connectivity. The shape distribution is propagated to the centroid uncertainty through an appropriate shape representation based on Green's theorem. Consequently, the resulting framework achieves marked gains in calibration accuracy and robustness. The complete source code and demonstration video are available at https://github.com/chaehyeonsong/discocal.
中文: 本研究提出了一种无偏的圆形图案投影模型,以克服现有基于圆心的相机校准的局限性,通过引入圆心不确定性来提高精度和鲁棒性,结果显示其性能显著优于棋盘格方法。
English: The study introduces an unbiased projection model for circular patterns to overcome the limitations of existing centroid-based camera calibration, incorporating centroid uncertainty to enhance accuracy and robustness, with results showing significant improvements over checkerboard methods.
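The boundary-to-centroid step behind the circular-pattern measurement can be illustrated with Green's theorem, which turns the area integrals defining a centroid into sums over the detected boundary points. The snippet below shows only that step on a polygonal boundary; the paper's Markov-random-field boundary model and the propagation of boundary uncertainty to centroid uncertainty are not reproduced.

```python
import numpy as np

def region_centroid(boundary: np.ndarray):
    """Centroid of the region enclosed by a closed 2D boundary (N, 2) via Green's theorem."""
    x, y = boundary[:, 0], boundary[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)            # next boundary point (wraps around)
    cross = x * yn - xn * y
    area = 0.5 * cross.sum()
    cx = ((x + xn) * cross).sum() / (6.0 * area)
    cy = ((y + yn) * cross).sum() / (6.0 * area)
    return float(cx), float(cy)

# Sanity check: a unit square has its centroid at (0.5, 0.5).
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(region_centroid(square))
```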

Authors:Yuchu Jiang, Jiaming Chu, Jian Zhao, Xin Zhang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li
Title: Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection
Abstract:
The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts of the test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism by leveraging patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing the first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at https://github.com/Kamichanw/Loupe.
Chinese: Loupe是一种轻量级框架,通过融合块感知分类器和条件查询分割模块,实现了深度伪造的联合检测与定位,借助伪标签引导的测试时自适应机制,在多种伪造模式中取得最优性能。
English: Loupe is a lightweight framework that combines patch-aware classification and conditional query-based segmentation to jointly detect and localize deepfakes, achieving state-of-the-art performance through test-time adaptation and robust generalization across manipulation types.

Authors:Chao Chen, Nobel Dang, Juexiao Zhang, Wenkai Sun, Pengfei Zheng, Xuhang He, Yimeng Ye, Jiasheng Zhang, Taarun Srinivas, Chen Feng
Title: Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes
Abstract:
Humans exhibit a remarkable ability to recognize co-visibility, the 3D regions simultaneously visible in multiple images, even when these images are sparsely distributed across a complex scene. This ability is foundational to 3D vision and robotic perception, and relies not only on low-level feature matching but also on high-level spatial reasoning and cognitive integration. Yet, it remains unclear whether current vision models can replicate this human-level proficiency. In this work, we introduce the Co-VisiON benchmark, designed to evaluate human-inspired co-visibility reasoning across more than 1,000 sparse-view indoor scenarios. Our results show that while co-visibility is often approached as a low-level feature-matching task, it remains challenging for existing vision models under sparse conditions. Notably, a proprietary vision-language model surpasses all vision-only baselines, but all models fall significantly short of human performance. This gap underscores the limitations of current architectures and motivates the need for models that integrate spatial and semantic information in a human-like manner. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, cognitively inspired reasoning in challenging, sparse environments. Our dataset and source code can be found at https://ai4ce.github.io/CoVISION.

Authors:Xiaoyu Shi, Rahul Kumar Jain, Yinhao Li, Ruibo Hou, Jingliang Cheng, Jie Bai, Guohua Zhao, Lanfen Lin, Rui Xu, Yen-wei Chen
Title: TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration
Abstract:
Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field of brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. This limitation has hindered the exploration of multimodal approaches that leverage both imaging and textual data. To bridge this critical gap, we introduce the TextBraTS dataset, the first publicly available volume-level multimodal dataset that contains paired MRI volumes and rich textual annotations, derived from the widely adopted BraTS2020 benchmark. Building upon this novel dataset, we propose a novel baseline framework and sequential cross-attention method for text-guided volumetric medical image segmentation. Through extensive experiments with various text-image fusion strategies and templated text formulations, our approach demonstrates significant improvements in brain tumor segmentation accuracy, offering valuable insights into effective multimodal integration techniques. Our dataset, implementation code, and pre-trained models are publicly available at https://github.com/Jupitern52/TextBraTS.
中文摘要:TextBraTS数据集作为首个结合MRI影像与文本标注的多模态数据集填补了脑肿瘤分析领域的空白,提出的新型框架和跨注意力方法通过图文融合显著提升了分割精度。
English Summary: The TextBraTS dataset is introduced as the first multimodal dataset pairing MRI volumes with textual annotations to advance brain tumor segmentation, and a novel framework with cross-attention mechanisms demonstrates improved accuracy through effective text-image fusion.

Authors:Weinan Guan, Wei Wang, Bo Peng, Ziwen He, Jing Dong, Haonan Cheng
Title: Noise-Informed Diffusion-Generated Image Detection with Anomaly Attention
Abstract:
With the rapid development of image generation technologies, especially the advancement of Diffusion Models, the quality of synthesized images has significantly improved, raising concerns among researchers about information security. To mitigate the malicious abuse of diffusion models, diffusion-generated image detection has proven to be an effective countermeasure. However, a key challenge for forgery detection is generalising to diffusion models not seen during training. In this paper, we address this problem by focusing on image noise. We observe that images from different diffusion models share similar noise patterns, distinct from genuine images. Building upon this insight, we introduce a novel Noise-Aware Self-Attention (NASA) module that focuses on noise regions to capture anomalous patterns. To implement a SOTA detection model, we incorporate NASA into Swin Transformer, forming a novel detection architecture, NASA-Swin. Additionally, we employ a cross-modality fusion embedding to combine RGB and noise images, along with a channel mask strategy to enhance feature learning from both modalities. Extensive experiments demonstrate the effectiveness of our approach in enhancing detection capabilities for diffusion-generated images. When encountering unseen generation methods, our approach achieves state-of-the-art performance. Our code is available at https://github.com/WeinanGuan/NASA-Swin.
中文摘要:本文提出一种新颖的噪声感知自注意力模块,通过聚焦扩散生成图像特有的噪声模式,结合跨模态融合策略,在检测未知生成方法的伪造图像时实现了最先进的性能。
English Summary: This paper introduces a novel Noise-Aware Self-Attention (NASA) module integrated into Swin Transformer to detect diffusion-generated images by focusing on their distinct noise patterns, achieving state-of-the-art performance even with unseen generation methods.
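The noise branch and channel-mask idea can be caricatured as below: estimate a high-frequency residual as the "noise image", stack it with the RGB channels, and occasionally zero out one modality so the detector learns from both. This is a minimal, assumption-laden stand-in (a box-blur residual is not the paper's noise extractor), and the NASA attention inside Swin Transformer is not shown.

```python
import torch
import torch.nn.functional as F

def noise_residual(rgb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Crude high-frequency residual: image minus a locally smoothed version."""
    blurred = F.avg_pool2d(rgb, kernel_size=k, stride=1, padding=k // 2)
    return rgb - blurred                                # (B, 3, H, W) "noise image"

def fuse_with_channel_mask(rgb: torch.Tensor, p_drop: float = 0.2) -> torch.Tensor:
    """Stack RGB and noise channels; randomly mask one modality during training."""
    x = torch.cat([rgb, noise_residual(rgb)], dim=1)    # (B, 6, H, W) cross-modality input
    if torch.rand(()) < p_drop:                         # channel mask: drop RGB or noise
        start = 0 if torch.rand(()) < 0.5 else 3
        x[:, start:start + 3] = 0.0
    return x

print(fuse_with_channel_mask(torch.rand(2, 3, 224, 224)).shape)  # torch.Size([2, 6, 224, 224])
```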

Authors:Fang Chen, Weifeng Zhang, Xingyu Ai, BingXuan Li, An Li, Qiegen Liu
Title: A Prior-Guided Joint Diffusion Model in Projection Domain for PET Tracer Conversion
Abstract:
Positron emission tomography (PET) is widely used to assess metabolic activity, but its application is limited by the availability of radiotracers. 18F-labeled fluorodeoxyglucose (18F-FDG) is the most commonly used tracer but shows limited effectiveness for certain tumors. In contrast, 6-18F-fluoro-3,4-dihydroxy-L-phenylalanine (18F-DOPA) offers higher specificity for neuroendocrine tumors and neurological disorders. However, the complexity of its synthesis process and constraints on transportation time have limited its clinical application. Among different forms of raw data acquired by the scanner, sinogram is a commonly used representation in PET imaging. Therefore, modeling in projection domain enables more direct utilization of the original information, potentially reducing the accumulation errors during the image reconstruction process. Inspired by these factors, this study proposes a prior-guided joint diffusion model (PJDM) for transforming 18F-FDG PET sinograms into 18F-DOPA PET sinograms. During inference, an initial synthetic 18F-DOPA PET sinogram is first generated using a higher-order hybrid sampler. This sinogram is then degraded and serves as an additional condition to guide the iterative refinement process. Experimental results demonstrated that PJDM effectively improved both sinogram quality and the final synthetic outcomes. The code is available at: https://github.com/yqx7150/PJDM.
中文摘要:本研究提出了一种先验引导联合扩散模型(PJDM),通过在投影域建模将18F-FDG PET正弦图转换为18F-DOPA PET正弦图,有效提升了合成图像质量并减少了重建误差。
English Summary: This study introduces a prior-guided joint diffusion model (PJDM) that converts 18F-FDG PET sinograms into 18F-DOPA PET sinograms using projection domain modeling to enhance synthetic accuracy and reduce reconstruction errors.

Authors:Yunhan Ren, Feng Luo, Siyu Huang
Title: Few-Shot Generalized Category Discovery With Retrieval-Guided Decision Boundary Enhancement
Abstract:
While existing Generalized Category Discovery (GCD) models have achieved significant success, their performance with limited labeled samples and a small number of known categories remains largely unexplored. In this work, we introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming to achieve competitive performance in GCD tasks under conditions of known information scarcity. To tackle this challenge, we propose a decision boundary enhancement framework with affinity-based retrieval. Our framework is designed to learn the decision boundaries of known categories and transfer these boundaries to unknown categories. First, we use a decision boundary pre-training module to mitigate the overfitting of pre-trained information on known category boundaries and improve the learning of these decision boundaries using labeled samples. Second, we implement a two-stage retrieval-guided decision boundary optimization strategy. Specifically, this strategy further enhances the severely limited known boundaries by using affinity-retrieved pseudo-labeled samples. Then, these refined boundaries are applied to unknown clusters via guidance from affinity-based feature retrieval. Experimental results demonstrate that our proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. The codes are available at: https://github.com/Ryh1218/FSGCD
中文摘要:本文提出了少样本广义类别发现(FSGCD)任务,并设计了一种基于相似性检索的决策边界增强框架,通过将已知类别的边界知识迁移至未知类别,在六个公开基准上实现了最优性能。
English Summary: This paper introduces the Few-shot Generalized Category Discovery (FSGCD) task and proposes a decision boundary enhancement framework that leverages affinity-based retrieval to transfer learned boundaries from known to unknown categories, achieving state-of-the-art performance on six benchmarks.
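The affinity-based retrieval step can be sketched as pulling, for each known class, the unlabeled samples closest to that class's prototype in cosine similarity and treating them as pseudo-labeled. The top-k and threshold values below are illustrative guesses; the paper's boundary pre-training and two-stage optimization are not reproduced.

```python
import torch
import torch.nn.functional as F

def retrieve_pseudo_labeled(labeled_feats, labeled_labels, unlabeled_feats,
                            top_k: int = 16, min_sim: float = 0.6):
    """Affinity-based retrieval of confident pseudo-labeled samples per known class."""
    lf = F.normalize(labeled_feats, dim=-1)
    uf = F.normalize(unlabeled_feats, dim=-1)
    pseudo = []
    for c in labeled_labels.unique():
        proto = F.normalize(lf[labeled_labels == c].mean(0), dim=-1)   # class prototype
        sim = uf @ proto                                               # cosine affinity
        vals, idx = sim.topk(min(top_k, sim.numel()))
        pseudo.append((int(c), idx[vals > min_sim]))                   # keep confident retrievals
    return pseudo   # list of (class_id, indices of retrieved unlabeled samples)

print(retrieve_pseudo_labeled(torch.randn(40, 128), torch.randint(0, 4, (40,)),
                              torch.randn(500, 128)))
```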

Authors:Guang Yin, Yitong Li, Yixuan Wang, Dale McConachie, Paarth Shah, Kunimatsu Hashimoto, Huan Zhang, Katherine Liu, Yunzhu Li
Title: CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity
Abstract:
Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction "Hang a mug on the mug tree" may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code interfaces with the perception module to produce 3D attention maps that highlight task-relevant regions by integrating spatial and semantic information, effectively resolving ambiguities in instructions. Through extensive experiments, we identify key limitations of current imitation learning methods, such as poor adaptation to language and environmental variations. We show that our approach excels across challenging manipulation tasks involving language ambiguity, contact-rich manipulation, and multi-object interactions.

Authors:Haotian Yin, Aleksander Plocharski, Michal Jan Wlodarczyk, Mikolaj Kida, Przemyslaw Musialski
Title: FlatCAD: Fast Curvature Regularization of Neural SDFs for CAD Models
Abstract:
Neural signed-distance fields (SDFs) are a versatile backbone for neural geometry representation, but enforcing CAD-style developability usually requires Gaussian-curvature penalties with full Hessian evaluation and second-order differentiation, which are costly in memory and time. We introduce an off-diagonal Weingarten loss that regularizes only the mixed shape operator term that represents the gap between principal curvatures and flattens the surface. We present two variants: a finite-difference version using six SDF evaluations plus one gradient, and an auto-diff version using a single Hessian-vector product. Both converge to the exact mixed term and preserve the intended geometric properties without assembling the full Hessian. On the ABC benchmarks the losses match or exceed Hessian-based baselines while cutting GPU memory and training time by roughly a factor of two. The method is drop-in and framework-agnostic, enabling scalable curvature-aware SDF learning for engineering-grade shape reconstruction. Our code is available at https://flatcad.github.io/.

Authors:Changsheng Gao, Zijie Liu, Li Li, Dong Liu, Xiaoyan Sun, Weisi Lin
Title: DT-UFC: Universal Large Model Feature Coding via Peaky-to-Balanced Distribution Transformation
Abstract:
Like image coding in visual data transmission, feature coding is essential for the distributed deployment of large models by significantly reducing transmission and storage burden. However, prior studies have mostly targeted task- or model-specific scenarios, leaving the challenge of universal feature coding across diverse large models largely unexplored. In this paper, we present the first systematic study on universal feature coding for large models. The key challenge lies in the inherently diverse and distributionally incompatible nature of features extracted from different models. For example, features from DINOv2 exhibit highly peaky, concentrated distributions, while those from Stable Diffusion 3 (SD3) are more dispersed and uniform. This distributional heterogeneity severely hampers both compression efficiency and cross-model generalization. To address this, we propose a learned peaky-to-balanced distribution transformation, which reshapes highly skewed feature distributions into a common, balanced target space. This transformation is non-uniform, data-driven, and plug-and-play, enabling effective alignment of heterogeneous distributions without modifying downstream codecs. With this alignment, a universal codec trained on the balanced target distribution can effectively generalize to features from different models and tasks. We validate our approach on three representative large models (LLaMA3, DINOv2, and SD3) across multiple tasks and modalities. Extensive experiments show that our method achieves notable improvements in both compression efficiency and cross-model generalization over task-specific baselines. All source code has been made available at https://github.com/chansongoal/DT-UFC.
Chinese: 本文提出了一种针对大模型的通用特征编码方法,通过学习的分布变换将异构特征对齐到平衡空间,从而在多种任务和模态中显著提升了压缩效率和跨模型泛化能力。
English: This paper introduces a universal feature coding method for large models that employs a learned distribution transformation to align heterogeneous features into a balanced space, enhancing compression efficiency and cross-model generalization across diverse tasks and modalities.
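To see why reshaping a peaky distribution helps a shared codec, the toy below maps values through their empirical CDF so that a highly concentrated feature tensor becomes approximately uniform. The paper's transformation is learned, non-uniform, and data-driven; this rank-based mapping is only a stand-in for the distributional idea.

```python
import torch

def peaky_to_balanced(x: torch.Tensor) -> torch.Tensor:
    """Map a peaky feature distribution to an approximately uniform one on [0, 1]
    via its empirical CDF (a non-learned stand-in for the learned transformation)."""
    flat = x.flatten()
    ranks = flat.argsort().argsort().float()          # rank of every value
    return (ranks / (flat.numel() - 1)).reshape(x.shape)

# A sharply peaked (Laplace-like) tensor is spread out over [0, 1] after mapping.
peaky = torch.distributions.Laplace(0.0, 0.1).sample((4, 256))
balanced = peaky_to_balanced(peaky)
print(float(balanced.min()), float(balanced.max()))   # ~0.0 ... ~1.0
```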

Authors:Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
Title: IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Abstract:
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released at https://github.com/AI45Lab/IS-Bench.
中文:IS-Bench作为首个评估具身智能体交互安全性的多模态基准,通过对主流视觉语言模型的广泛测试,揭示了当前模型缺乏风险意识以及安全导向推理存在的任务完成度妥协问题。
English: IS-Bench is introduced as the first multimodal benchmark for evaluating interactive safety in embodied agents, revealing current models' lack of risk awareness and the trade-offs of safety-focused reasoning through extensive testing on leading VLMs.

Authors:Chunhou Ji, Qiumeng Li
Title: TrajSceneLLM: A Multimodal Perspective on Semantic GPS Trajectory Analysis
Abstract:
GPS trajectory data reveals valuable patterns of human mobility and urban dynamics, supporting a variety of spatial applications. However, traditional methods often struggle to extract deep semantic representations and incorporate contextual map information. We propose TrajSceneLLM, a multimodal perspective for enhancing semantic understanding of GPS trajectories. The framework integrates visualized map images (encoding spatial context) and textual descriptions generated through LLM reasoning (capturing temporal sequences and movement dynamics). Separate embeddings are generated for each modality and then concatenated to produce trajectory scene embeddings with rich semantic content which are further paired with a simple MLP classifier. We validate the proposed framework on Travel Mode Identification (TMI), a critical task for analyzing travel choices and understanding mobility behavior. Our experiments show that these embeddings achieve significant performance improvement, highlighting the advantage of our LLM-driven method in capturing deep spatio-temporal dependencies and reducing reliance on handcrafted features. This semantic enhancement promises significant potential for diverse downstream applications and future research in geospatial artificial intelligence. The source code and dataset are publicly available at: https://github.com/februarysea/TrajSceneLLM.
中文摘要:TrajSceneLLM提出了一种融合地图图像与大语言模型生成文本的多模态框架,通过增强轨迹语义嵌入显著提升了出行方式识别的性能,为GPS轨迹分析提供了新的解决方案。
English Summary: TrajSceneLLM is a novel multimodal framework that enhances GPS trajectory analysis by integrating map images and LLM-generated text descriptions, achieving superior performance in travel mode identification through enriched semantic embeddings.
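The fusion described in the abstract (separate embeddings per modality, concatenated and fed to a simple MLP classifier) can be written in a few lines; the embedding dimensions and the number of travel modes below are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class TravelModeClassifier(nn.Module):
    """Concatenate per-modality trajectory-scene embeddings and classify travel mode."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 768, n_modes: int = 5, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_modes))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb: embedding of the visualized map image;
        # txt_emb: embedding of the LLM-generated textual trajectory description.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1))

clf = TravelModeClassifier()
print(clf(torch.randn(8, 512), torch.randn(8, 768)).shape)   # torch.Size([8, 5])
```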

Authors:Yunhao Hou, Bochao Zou, Min Zhang, Ran Chen, Shangdong Yang, Yanmei Zhang, Junbao Zhuo, Siheng Chen, Jiansheng Chen, Huimin Ma
Title: AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios
Abstract:
By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. However, most previous work focuses on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, with limited attention to the aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 120K LiDAR frames and 440K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 19.5% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 400 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC-Drive.
中文: 协作感知能帮助自动驾驶车辆减少遮挡并提高感知精度,但以往研究多集中于车与车及车与基础设施的协作,对无人机提供的动态鸟瞰视角关注不足,为此我们推出了首个大规模真实世界空地协同3D感知数据集AGC-Drive,包含多视角数据和基准测试。
English: Collaborative perception enhances autonomous vehicles' ability to overcome occlusions and boost accuracy, yet aerial perspectives from UAVs have been underutilized due to a lack of datasets, leading to the introduction of AGC-Drive, the first large-scale real-world aerial-ground cooperative 3D perception dataset with comprehensive multi-agent data and benchmarks.

Authors:Chao He, Hongxi Wei
Title: MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval
Abstract:
Deep image hashing aims to enable effective large-scale image retrieval by mapping the input images into simple binary hash codes through deep neural networks. More recently, Vision Mamba with linear time complexity has attracted extensive attention from researchers by achieving outstanding performance on various computer vision tasks. Nevertheless, the suitability of Mamba for large-scale image retrieval tasks still needs to be explored. Towards this end, we propose a visual state space hashing model, called MambaHash. Concretely, we propose a backbone network with stage-wise architecture, in which a grouped Mamba operation is introduced to model local and global information by utilizing Mamba to perform multi-directional scanning along different channel groups. Subsequently, the proposed channel interaction attention module is used to enhance information communication across channels. Finally, we meticulously design an adaptive feature enhancement module to increase feature diversity and enhance the visual representation capability of the model. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that compared with state-of-the-art deep hashing methods, our proposed MambaHash achieves both strong efficiency and superior performance on large-scale image retrieval tasks. Source code is available at https://github.com/shuaichaochao/MambaHash.git
中文: MambaHash是一种新颖的视觉状态空间哈希模型,通过分组Mamba操作和注意力模块,在大规模图像检索任务中实现了高效且卓越的性能。
English: MambaHash is a novel visual state space hashing model that employs grouped Mamba operations and attention modules to efficiently achieve superior performance in large-scale image retrieval tasks.
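Independent of the Mamba backbone, the deep-hashing retrieval step itself is simple: relax binary codes with tanh during training, binarize with sign at query time, and rank the database by Hamming distance. The sketch below shows that generic step only, with assumed dimensions; it is not MambaHash's architecture.

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Map backbone features to K-bit codes (tanh relaxation; sign() at retrieval time)."""
    def __init__(self, feat_dim: int = 768, n_bits: int = 64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_bits)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.fc(feats))               # values in (-1, 1)

def hamming_rank(query_code: torch.Tensor, db_codes: torch.Tensor) -> torch.Tensor:
    """Rank database items by Hamming distance between +/-1 binary codes."""
    q, db = query_code.sign(), db_codes.sign()
    dist = 0.5 * (db.shape[1] - db @ q)                 # Hamming distance from the dot product
    return dist.argsort()

head = HashHead()
db_codes = head(torch.randn(1000, 768))
query = head(torch.randn(768))
print(hamming_rank(query, db_codes)[:5])                # indices of the 5 nearest items
```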

Authors:Nikola Jovanović, Ismail Labiad, Tomáš Souček, Martin Vechev, Pierre Fernandez
Title: Watermarking Autoregressive Image Generation
Abstract:
Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values.
中文: 本研究首次提出了针对自回归图像生成模型的令牌级水印方法,通过定制化分词器微调和水印同步层解决了逆向循环一致性问题,有效提升了水印对图像变换与攻击的鲁棒性。
English: This study introduces the first token-level watermarking method for autoregressive image generation models, addressing the challenge of reverse cycle-consistency and enhancing robustness against image transformations and attacks through custom tokenizer finetuning and a watermark synchronization layer.
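The language-model watermarking technique being adapted is, in its standard form, a green-list logit bias: the previous token seeds a PRNG that selects a fraction of the codebook, and those tokens get a logit boost so detection can later count green hits and compute a p-value. The sketch below shows that standard scheme applied to image-token logits; the paper's actual contributions (tokenizer-detokenizer finetuning for reverse cycle-consistency and the synchronization layer) are not reproduced, and all parameter values are illustrative.

```python
import torch

def greenlist_bias(logits: torch.Tensor, prev_token: int,
                   gamma: float = 0.25, delta: float = 2.0, key: int = 42) -> torch.Tensor:
    """Standard green-list watermark bias applied to a codebook of image-token logits."""
    vocab = logits.shape[-1]
    g = torch.Generator().manual_seed(key * 1_000_003 + prev_token)   # seeded by context
    green = torch.randperm(vocab, generator=g)[: int(gamma * vocab)]  # 'green' subset
    biased = logits.clone()
    biased[green] += delta                                            # favour green tokens
    return biased

# Sampling the next image token from the biased distribution; a detector would later
# re-derive each green list and test the green-hit count against the gamma-binomial null.
logits = torch.randn(16384)                                           # e.g. a 16384-entry VQ codebook
next_token = int(torch.distributions.Categorical(logits=greenlist_bias(logits, 123)).sample())
print(next_token)
```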

Authors:Arpit Jadon, Haoran Wang, Phillip Thomas, Michael Stanley, S. Nathaniel Cibik, Rachel Laurat, Omar Maher, Lukas Hoyer, Ozan Unal, Dengxin Dai
Title: RealDriveSim: A Realistic Multi-Modal Multi-Task Synthetic Dataset for Autonomous Driving
Abstract:
As perception models continue to develop, the need for large-scale datasets increases. However, data annotation remains far too expensive to effectively scale and meet the demand. Synthetic datasets provide a solution to boost model performance with substantially reduced costs. However, current synthetic datasets remain limited in their scope, realism, and are designed for specific tasks and applications. In this work, we present RealDriveSim, a realistic multi-modal synthetic dataset for autonomous driving that not only supports popular 2D computer vision applications but also their LiDAR counterparts, providing fine-grained annotations for up to 64 classes. We extensively evaluate our dataset for a wide range of applications and domains, demonstrating state-of-the-art results compared to existing synthetic benchmarks. The dataset is publicly available at https://realdrivesim.github.io/.

Authors:Zhaoyi Wang, Jemil Avers Butt, Shengyu Huang, Tomislav Medic, Andreas Wieser
Title: Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images
Abstract:
Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at https://github.com/zhaoyiww/fusion4landslide.
中文摘要:本文提出一种融合三维点云与RGB图像的分层由粗到精方法,实现了滑坡监测中的密集三维位移估计,在真实数据集上展现出优越的空间覆盖率和精度表现。
English Summary: This paper introduces a hierarchical coarse-to-fine method that integrates 3D point clouds and RGB images to achieve dense 3D displacement estimation for landslide monitoring, demonstrating superior spatial coverage and accuracy on real-world datasets.

Authors:Weeyoung Kwon, Jeahun Sung, Minkyu Jeon, Chanho Eom, Jihyong Oh
Title: R3eVision: A Survey on Robust Rendering, Restoration, and Enhancement for 3D Low-Level Vision
Abstract:
Neural rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved significant progress in photorealistic 3D scene reconstruction and novel view synthesis. However, most existing models assume clean and high-resolution (HR) multi-view inputs, which limits their robustness under real-world degradations such as noise, blur, low-resolution (LR), and weather-induced artifacts. To address these limitations, the emerging field of 3D Low-Level Vision (3D LLV) extends classical 2D Low-Level Vision tasks including super-resolution (SR), deblurring, weather degradation removal, restoration, and enhancement into the 3D spatial domain. This survey, referred to as R3eVision, provides a comprehensive overview of robust rendering, restoration, and enhancement for 3D LLV by formalizing the degradation-aware rendering problem and identifying key challenges related to spatio-temporal consistency and ill-posed optimization. Recent methods that integrate LLV into neural rendering frameworks are categorized to illustrate how they enable high-fidelity 3D reconstruction under adverse conditions. Application domains such as autonomous driving, AR/VR, and robotics are also discussed, where reliable 3D perception from degraded inputs is critical. By reviewing representative methods, datasets, and evaluation protocols, this work positions 3D LLV as a fundamental direction for robust 3D content generation and scene-level reconstruction in real-world environments.
中文: 神经渲染技术虽在三维场景重建中取得进展,但难以应对真实世界中的退化问题,因此三维低层视觉领域应运而生,通过超分辨率和去模糊等任务提升三维重建的鲁棒性。
English: Neural rendering techniques like NeRF and 3DGS have advanced 3D scene reconstruction but struggle with real-world degradations, prompting the emergence of 3D Low-Level Vision to enhance robustness through restoration and super-resolution in 3D space.

Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
Title: GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Abstract:
Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
中文: 本研究提出了用于评估多模态大语言模型后训练方法的基准SEED-Bench-R1,并开发了GRPO-CARE一致性感知强化学习框架,该框架在无需显式监督的情况下显著提升了答案准确性和推理连贯性,实现了明显的性能改进。
English: This study introduces SEED-Bench-R1, a benchmark for evaluating multimodal large language models' post-training methods, and proposes GRPO-CARE, a consistency-aware reinforcement learning framework that enhances both answer accuracy and reasoning coherence, achieving significant performance gains without explicit supervision.
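
The two-tiered reward can be pictured as a base term for answer correctness plus a consistency bonus granted when a sample's reasoning-to-answer likelihood under the slowly-evolving reference model exceeds its group peers. This is only a schematic reading of the abstract; the exact bonus computation, reference-model update, and normalization in GRPO-CARE may differ.

```python
import torch

def grpo_care_reward(correct: torch.Tensor, ref_likelihood: torch.Tensor,
                     bonus_scale: float = 0.5):
    """Schematic two-tiered reward for one GRPO group.
    correct:        (G,) bool, whether each sampled answer is correct
    ref_likelihood: (G,) p(answer | reasoning) from a slowly-evolving reference model
    """
    base = correct.float()                                   # tier 1: correctness
    peer_mean = ref_likelihood.mean()                        # group-relative baseline
    bonus = bonus_scale * base * (ref_likelihood > peer_mean).float()  # tier 2
    reward = base + bonus
    advantage = (reward - reward.mean()) / (reward.std() + 1e-8)       # GRPO-style
    return reward, advantage
```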

Authors:Byung Hoon Lee, Wooseok Shin, Sung Won Han
Title: TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading
Abstract:
The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository (https://github.com/Leebh-kor/TD3Net).
中文摘要:提出的TD3Net后端架构通过结合密集跳跃连接与多扩张卷积来消除感受野盲区,在词级唇读任务中以更少参数实现更高准确率,有效提升了时序特征建模能力。
English Summary: The proposed TD3Net backend architecture enhances word-level lipreading by combining dense skip connections with multi-dilated convolutions to eliminate receptive field blind spots, achieving superior accuracy with improved efficiency compared to existing temporal convolutional networks.
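
The core idea, dense skip connections where each incoming feature stream gets its own dilation factor so the combined temporal receptive field has no blind spots, can be sketched as a small PyTorch module. Channel sizes, the dilation schedule, and the surrounding frontend are illustrative, not TD3Net's exact configuration.

```python
import torch
import torch.nn as nn

class MultiDilatedTemporalBlock(nn.Module):
    """Each densely skip-connected input is convolved with its own dilation,
    so the fused output covers a wide temporal receptive field without gaps."""

    def __init__(self, in_channels_list, growth_channels, kernel_size=3):
        super().__init__()
        self.branches = nn.ModuleList()
        for i, c in enumerate(in_channels_list):
            dilation = 2 ** i                      # illustrative dilation schedule
            self.branches.append(nn.Conv1d(
                c, growth_channels, kernel_size,
                padding=dilation * (kernel_size - 1) // 2, dilation=dilation))
        self.act = nn.ReLU(inplace=True)

    def forward(self, skip_features):
        # skip_features: list of (B, C_i, T) tensors from all previous blocks
        return self.act(sum(b(f) for b, f in zip(self.branches, skip_features)))

# Dense connectivity: every block sees the outputs of all earlier blocks.
x = torch.randn(2, 64, 29)                         # (batch, channels, frames)
block1 = MultiDilatedTemporalBlock([64], 32)
block2 = MultiDilatedTemporalBlock([64, 32], 32)
f1 = block1([x])
f2 = block2([x, f1])
```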

Authors:Jiawen Yu, Jieji Ren, Yang Chang, Qiaojun Yu, Xuan Tong, Boyang Wang, Yan Song, You Li, Xinji Mai, Wenqiang Zhang
Title: Noise Fusion-based Distillation Learning for Anomaly Detection in Complex Industrial Environments
Abstract:
Anomaly detection and localization in automated industrial manufacturing can significantly enhance production efficiency and product quality. Existing methods are capable of detecting surface defects in pre-defined or controlled imaging environments. However, accurately detecting workpiece defects in complex and unstructured industrial environments with varying views, poses and illumination remains challenging. We propose a novel anomaly detection and localization method specifically designed to handle inputs with perturbative patterns. Our approach introduces a new framework based on a collaborative distillation heterogeneous teacher network (HetNet), an adaptive local-global feature fusion module, and a local multivariate Gaussian noise generation module. HetNet can learn to model the complex feature distribution of normal patterns using limited information about local disruptive changes. We conducted extensive experiments on mainstream benchmarks. HetNet demonstrates superior performance with approximately 10% improvement across all evaluation metrics on MSC-AD under industrial conditions, while achieving state-of-the-art results on other datasets, validating its resilience to environmental fluctuations and its capability to enhance the reliability of industrial anomaly detection systems across diverse scenarios. Tests in real-world environments further confirm that HetNet can be effectively integrated into production lines to achieve robust and real-time anomaly detection. Codes, images and videos are published on the project website at: https://zihuatanejoyu.github.io/HetNet/

Authors:Liangjing Shao, Linxin Bai, Chenkang Du, Xinrong Chen
Title: EndoMUST: Monocular Depth Estimation for Robotic Endoscopy via End-to-end Multi-step Self-supervised Training
Abstract:
Monocular depth estimation and ego-motion estimation are significant tasks for scene perception and navigation in stable, accurate and efficient robot-assisted endoscopy. To tackle lighting variations and sparse textures in endoscopic scenes, multiple techniques including optical flow, appearance flow and intrinsic image decomposition have been introduced into existing methods. However, an effective training strategy for the multiple modules remains critical to handle both illumination issues and information interference in self-supervised depth estimation for endoscopy. Therefore, a novel framework with multi-step efficient finetuning is proposed in this work. In each epoch of end-to-end training, the process is divided into three steps, including optical flow registration, multiscale image decomposition and multiple transformation alignments. At each step, only the related networks are trained without interference of irrelevant information. Based on parameter-efficient finetuning of the foundation model, the proposed method achieves state-of-the-art performance on self-supervised depth estimation on the SCARED dataset and zero-shot depth estimation on the Hamlyn dataset, with 4%-10% lower error. The evaluation code of this work has been published on https://github.com/BaymaxShao/EndoMUST.
中文: 本研究提出了一种新颖的多步微调框架,用于内窥镜中的自监督深度估计,通过分步训练相关网络有效应对光照变化和纹理稀疏问题,在基准数据集上实现了最先进的性能并显著降低了误差。
English: This study introduces a novel multistep fine-tuning framework for self-supervised depth estimation in endoscopy, which effectively addresses lighting variations and sparse textures by training related networks separately in each step, achieving state-of-the-art performance with significantly reduced errors on benchmark datasets.

Authors:Cong Wang, Zexuan Deng, Zhiwei Jiang, Fei Shen, Yafeng Yin, Shiwei Gan, Zifeng Cheng, Shiping Ge, Qing Gu
Title: Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization
Abstract:
Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on the single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
中文: SignViP提出了一种新颖的手语视频生成框架,通过离散化标记整合多种细粒度条件,在视频质量和语义保真度方面实现了最先进的性能。
English: SignViP introduces a novel framework for sign language video generation by integrating multiple fine-grained conditions through discrete tokenization, achieving state-of-the-art performance in video quality and semantic fidelity.

Authors:Zhongchen Zhao, Chaodong Xiao, Hui Lin, Qi Xie, Lei Zhang, Deyu Meng
Title: Polyline Path Masked Attention for Vision Transformer
Abstract:
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
Chinese: 本文提出的Polyline Path Masked Attention (PPMA)通过将Mamba2的结构化掩码与创新的二维折线路径扫描策略相结合,增强了视觉Transformer的空间邻接建模能力,在图像分类、目标检测和分割任务中取得了最先进的性能。
English: This paper introduces Polyline Path Masked Attention (PPMA), which enhances Vision Transformers by integrating Mamba2's structured mask with a novel 2D polyline path scanning strategy to better model spatial adjacency, achieving state-of-the-art results in image classification, object detection, and segmentation tasks.
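
Conceptually, the polyline path mask injects a spatial-adjacency prior into standard ViT attention by down-weighting token pairs that are far apart along a 2D scanning path. The sketch below uses a simple serpentine scan and an exponential decay as stand-ins; the paper's actual mask derivation and its efficient computation algorithm are not reproduced here.

```python
import torch

def serpentine_scan_positions(h: int, w: int) -> torch.Tensor:
    """Hypothetical 2D polyline (serpentine) scan: position of each raster-order
    token along the path, so spatially adjacent tokens stay adjacent."""
    pos = torch.zeros(h, w, dtype=torch.long)
    for r in range(h):
        cols = torch.arange(w)
        pos[r] = r * w + (cols if r % 2 == 0 else (w - 1 - cols))
    return pos.flatten()

def path_masked_attention(q, k, v, h=14, w=14, decay=0.95):
    """Self-attention scaled by a structured mask that decays with distance
    along the scan path (illustrative stand-in for the polyline path mask)."""
    pos = serpentine_scan_positions(h, w).to(q.device)
    dist = (pos[:, None] - pos[None, :]).abs().float()       # (N, N) path distance
    mask = decay ** dist                                      # structured prior
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    attn = attn * mask
    attn = attn / attn.sum(dim=-1, keepdim=True)              # renormalize rows
    return attn @ v
```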

Authors:Fatmah AlHindaassi, Mohammed Talha Alam, Fakhri Karray
Title: ADAM-Dehaze: Adaptive Density-Aware Multi-Stage Dehazing for Improved Object Detection in Foggy Conditions
Abstract:
Adverse weather conditions, particularly fog, pose a significant challenge to autonomous vehicles, surveillance systems, and other safety-critical applications by severely degrading visual information. We introduce ADAM-Dehaze, an adaptive, density-aware dehazing framework that jointly optimizes image restoration and object detection under varying fog intensities. A lightweight Haze Density Estimation Network (HDEN) classifies each input as light, medium, or heavy fog. Based on this score, the system dynamically routes the image through one of three CORUN branches: Light, Medium, or Complex, each tailored to its haze regime. A novel adaptive loss balances physical-model coherence and perceptual fidelity, ensuring both accurate defogging and preservation of fine details. On Cityscapes and the real-world RTTS benchmark, ADAM-Dehaze improves PSNR by up to 2.1 dB, reduces FADE by 30 percent, and increases object detection mAP by up to 13 points, while cutting inference time by 20 percent. These results highlight the importance of intensity-specific processing and seamless integration with downstream vision tasks. Code available at: https://github.com/talha-alam/ADAM-Dehaze.
中文: ADAM-Dehaze是一种自适应去雾框架,能根据雾浓度动态处理图像,在提升图像复原和物体检测效果的同时显著改善清晰度与检测精度,并减少处理时间。
English: ADAM-Dehaze is an adaptive dehazing framework that dynamically processes images based on fog density to enhance both image restoration and object detection, achieving significant performance improvements in clarity and detection accuracy while reducing processing time.

Authors:Zhe Wang, Yuhua Ru, Aladine Chetouani, Tina Shiang, Fang Chen, Fabian Bauer, Liping Zhang, Didier Hans, Rachid Jennane, William Ewing Palmer, Mohamed Jarraya, Yung Hsin Chen
Title: Diffusion-based Counterfactual Augmentation: Towards Robust and Interpretable Knee Osteoarthritis Grading
Abstract:
Automated grading of Knee Osteoarthritis (KOA) from radiographs is challenged by significant inter-observer variability and the limited robustness of deep learning models, particularly near critical decision boundaries. To address these limitations, this paper proposes a novel framework, Diffusion-based Counterfactual Augmentation (DCA), which enhances model robustness and interpretability by generating targeted counterfactual examples. The method navigates the latent space of a diffusion model using a Stochastic Differential Equation (SDE), governed by balancing a classifier-informed boundary drive with a manifold constraint. The resulting counterfactuals are then used within a self-corrective learning strategy to improve the classifier by focusing on its specific areas of uncertainty. Extensive experiments on the public Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) datasets demonstrate that this approach significantly improves classification accuracy across multiple model architectures. Furthermore, the method provides interpretability by visualizing minimal pathological changes and revealing that the learned latent space topology aligns with clinical knowledge of KOA progression. The DCA framework effectively converts model uncertainty into a robust training signal, offering a promising pathway to developing more accurate and trustworthy automated diagnostic systems. Our code is available at https://github.com/ZWang78/DCA.
Chinese: 本文提出了一种基于扩散的反事实增强框架,通过生成针对性反事实样本来提高膝关节骨关节炎分级的模型鲁棒性和可解释性,在临床数据集上显著提升了分类准确性。
English: This paper introduces a Diffusion-based Counterfactual Augmentation (DCA) framework that enhances knee osteoarthritis grading by generating targeted counterfactual examples to improve model robustness and interpretability, significantly boosting classification accuracy on clinical datasets.
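
The counterfactual generation step can be read as an Euler-Maruyama update whose drift combines a classifier-informed push toward the decision boundary with a score-based manifold constraint. The snippet below is a schematic of that idea; the models, weightings, and step schedule are placeholders rather than the DCA implementation.

```python
import torch

def counterfactual_sde_step(z, classifier, score_model, target_grade,
                            lam=1.0, eta=0.1, dt=0.01):
    """One schematic Euler-Maruyama step in the diffusion latent space.
    classifier(z) -> logits over KOA grades; score_model(z) -> manifold score.
    All weights (lam, eta, dt) are illustrative."""
    z = z.detach().requires_grad_(True)
    boundary_drive = torch.autograd.grad(
        classifier(z)[:, target_grade].sum(), z)[0]           # push across boundary
    with torch.no_grad():
        manifold_term = score_model(z)                        # stay near data manifold
        drift = lam * boundary_drive + eta * manifold_term
        z_next = z + drift * dt + (2 * dt) ** 0.5 * torch.randn_like(z)
    return z_next.detach()
```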

Authors:Fangzhou Lin, Zilin Dai, Rigved Sanku, Songlin Hou, Kazunori D Yamada, Haichong K. Zhang, Ziming Zhang
Title: A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion
Abstract:
The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that only takes partial point clouds as input, view-free. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.
Chinese: 本研究提出了一种无需视角图像的单视角图像引导点云补全基线方法,仅依赖部分点云输入,通过分层自融合机制超越现有技术,并对图像指导的必要性提出质疑。
English: This study introduces a view-free baseline for single-view image guided point cloud completion that relies solely on partial point clouds, using a hierarchical self-fusion mechanism to outperform existing methods and questioning the necessity of image guidance.

Authors:Anirud Aggarwal, Abhinav Shrivastava, Matthew Gwilliam
Title: Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model
Abstract:
Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at https://aniaggarwal.github.io/ecad and our code is available at https://github.com/aniaggarwal/ecad.
中文: ECAD提出一种遗传算法,可为扩散模型学习高效的缓存策略,在不修改网络参数或参考图像的情况下,显著提升推理速度并优化质量与延迟的权衡。
English: ECAD introduces a genetic algorithm that learns efficient caching schedules for diffusion models, achieving significant speedups and improved quality-latency trade-offs without modifying network parameters or requiring reference images.
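
A caching schedule here is just a binary vector over inference steps (reuse cached transformer features or recompute), and the genetic search mutates and recombines such vectors under a quality/latency fitness measured on calibration prompts. Below is a toy version with a scalarized fitness; ECAD itself maintains a Pareto frontier and uses its own evaluation metrics.

```python
import random

def evolve_caching_schedules(num_steps, quality_fn, latency_fn,
                             pop_size=32, generations=50):
    """Toy genetic search over caching schedules. schedule[t] == 1 means
    'reuse cached features at denoising step t'. quality_fn returns an
    FID-like score (lower is better) and latency_fn a speedup (higher is
    better); both are stand-ins for calibration-prompt evaluation."""
    pop = [[random.randint(0, 1) for _ in range(num_steps)] for _ in range(pop_size)]
    for _ in range(generations):
        # scalarized fitness for brevity; the real method keeps a Pareto frontier
        pop.sort(key=lambda s: quality_fn(s) - 0.1 * latency_fn(s))
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, num_steps)              # one-point crossover
            child = a[:cut] + b[cut:]
            child[random.randrange(num_steps)] ^= 1           # point mutation
            children.append(child)
        pop = parents + children
    return pop
```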

Authors:Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li
Title: Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos
Abstract:
Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splattings for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd .

Authors:Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang
Title: Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
Abstract:
AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.

Authors:Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang
Title: Sekai: A Video Dataset towards World Exploration
Abstract:
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset, and we use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.
Chinese: Sekai 数据集通过提供超过5000小时的带标注第一人称全球视频,解决了现有视频生成数据集的局限性,为训练如YUME等交互式世界探索模型奠定了基础。
English: The Sekai dataset addresses limitations in current video generation datasets by providing over 5,000 hours of annotated first-person global footage, enabling the training of interactive world exploration models like YUME.

Authors:Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao, Renjiao Yi, Yijie Wang, Kai Xu
Title: BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion
Abstract:
Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
中文: 本文提出了一种免重构的在线开放词汇3D物体检测框架,通过融合预训练视觉基础模型与多视角优化技术,在保持实时性的同时实现了大规模场景下的最优检测性能。
English: This paper introduces a reconstruction-free online framework for open-vocabulary 3D object detection that leverages pre-trained visual foundation models and multi-view fusion techniques to achieve state-of-the-art performance while enabling real-time operation in large-scale environments.
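
The association stage reduces, at its simplest, to greedy 3D non-maximum suppression over boxes predicted from different views, with IoU deciding which detections describe the same instance. The helper below uses axis-aligned boxes for clarity; the paper's oriented boxes, correspondence matching, and particle-filter fusion are not reproduced.

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU; boxes are (x_min, y_min, z_min, x_max, y_max, z_max)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    union = np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter
    return inter / (union + 1e-9)

def nms_3d(boxes, scores, iou_thresh=0.25):
    """Greedy 3D NMS used to associate multi-view detections of one instance."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([iou_3d(boxes[i], boxes[j]) for j in rest])
        order = rest[ious < iou_thresh]
    return keep
```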

Authors:Kyobin Choo, Hyunkyung Han, Jinyeong Kim, Chanyong Yoon, Seong Jae Hwang
Title: Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration
Abstract:
In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at https://github.com/MICV-yonsei/M2M-Reg.
中文: 提出的M2M-Reg框架通过仅使用单模态相似性训练模型并引入GradCyCon正则化,解决了多模态形变图像配准中的挑战,在无需真实变换数据的情况下,显著提升了PET-MRI和FA-MRI等高异质性模态的配准效果。
English: The proposed M2M-Reg framework addresses challenges in multi-modal deformable image registration by training models using mono-modal similarity and introducing GradCyCon regularization, achieving significantly improved alignment for highly heterogeneous modalities like PET-MRI and FA-MRI without requiring ground-truth data.

Authors:Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, Lei Zhang
Title: One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
Abstract:
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
中文: 提出的双LoRA学习范式通过交替优化一致性和细节增强模块,训练基于稳定扩散的模型,在视频超分辨率中同时实现逼真的空间细节和时间一致性。
English: The proposed Dual LoRA Learning (DLoRAL) paradigm trains a stable diffusion-based model to simultaneously achieve realistic spatial details and temporal consistency in video super-resolution by alternately optimizing consistency and detail enhancement modules.
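
The training schedule alternates two phases: first the Cross-Frame Retrieval module and Consistency-LoRA are updated on a temporal-consistency objective, then they are frozen while the Detail-LoRA is updated on a detail objective. The loop below is a schematic of that alternation; consistency_step and detail_step stand in for the paper's actual losses and optimizers.

```python
def set_trainable(modules, flag):
    for m in modules:
        for p in m.parameters():
            p.requires_grad_(flag)

def alternate_dual_lora(cfr, c_lora, d_lora, consistency_step, detail_step,
                        loader, rounds=4, steps_per_phase=1000):
    """Schematic DLoRAL-style alternation. consistency_step(batch) and
    detail_step(batch) are placeholder closures that compute the respective
    loss and apply one optimizer update."""
    for _ in range(rounds):
        set_trainable([cfr, c_lora], True)      # phase 1: temporal consistency
        set_trainable([d_lora], False)
        for _, batch in zip(range(steps_per_phase), loader):
            consistency_step(batch)
        set_trainable([cfr, c_lora], False)     # phase 2: spatial detail
        set_trainable([d_lora], True)
        for _, batch in zip(range(steps_per_phase), loader):
            detail_step(batch)
```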

Authors:Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
Title: Show-o2: Improved Native Unified Multimodal Models
Abstract:
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
Chinese: Show-o2模型通过自回归建模和流匹配技术,在3D因果变分自编码器空间中构建统一视觉表征,采用双路径时空融合和两阶段训练方法,实现了跨图像与视频模态的可扩展多模态理解与生成能力。
English: The Show-o2 model introduces a unified multimodal framework using autoregressive modeling and flow matching within a 3D causal variational autoencoder space, enabling scalable image and video understanding and generation through a two-stage training process.

Authors:Xingrui Qin, Wentao Zhao, Chuan Cao, Yihe Niu, Tianchen Deng, Houcheng Jiang, Rui Guo, Jingchuan Wang
Title: RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation
Abstract:
Dense depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, for guiding the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. RaCalNet is composed of two key modules. The Radar Recalibration module performs radar point screening and pixel-wise displacement refinement, producing accurate and reliable depth priors from sparse radar inputs. These priors are then used by the Metric Depth Optimization module, which learns to infer scene-level scale priors and fuses them with monocular depth predictions to achieve metrically accurate outputs. This modular design enhances structural consistency and preserves fine-grained geometric details. Despite relying solely on sparse supervision, RaCalNet produces depth maps with clear object contours and fine-grained textures, demonstrating superior visual quality compared to state-of-the-art dense-supervised methods. Quantitatively, it achieves performance comparable to existing methods on the ZJU-4DRadarCam dataset and yields a 34.89% RMSE reduction in real-world deployment scenarios. We plan to gradually release the code and models in the future at https://github.com/818slam/RaCalNet.git.
中文: RaCalNet提出了一种新颖框架,通过稀疏激光雷达监督替代密集监督,从雷达和图像中实现精确的密集深度估计,大幅降低成本的同时保持甚至超越了现有方法的性能。
English: RaCalNet introduces a novel framework that replaces dense LiDAR supervision with sparse supervision to achieve accurate dense depth estimation from radar and images, significantly reducing costs while maintaining or even surpassing the performance of existing methods.

Authors:Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Shahnaz Jamil-Copley, Richard H. Clayton, Chen Chen
Title: CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation
Abstract:
Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM (Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation), a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of the scar segmentation performance. Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks. Code is available at https://github.com/farheenjabeen/CLAIM-Scar-Synthesis.
中文:CLAIM框架通过临床引导和扩散模型生成解剖结构准确且多样化的心肌瘢痕图像,有效提升分割模型的鲁棒性和准确性,实验证明其生成的瘢痕模式更真实且与真实分布一致性更高。
English: CLAIM is a novel framework that uses clinical guidance and diffusion models to generate realistic, anatomically accurate myocardial scar images for robust segmentation, outperforming baseline methods in producing diverse and coherent scar patterns.

Authors:Alaa Anani, Tobias Lorenz, Mario Fritz, Bernt Schiele
Title: Pixel-level Certified Explanations via Randomized Smoothing
Abstract:
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at https://github.com/AlaaAnani/certified-attributions.
中文摘要:本文首次提出基于随机平滑的认证框架,为任何黑盒归因方法提供像素级鲁棒性保证,确保归因结果在对抗扰动下的可靠性和可解释性。
English Summary: This paper introduces the first certification framework using randomized smoothing to guarantee pixel-level robustness for any black-box attribution method against adversarial perturbations, ensuring reliable and interpretable explanations.
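
The certification recipe can be sketched as follows: sparsify each attribution map into a binary top-k "important pixel" mask, smooth it by a per-pixel majority vote over Gaussian perturbations, and give every certified pixel the usual randomized-smoothing l2 radius. The code uses a normal-approximation confidence bound instead of an exact one and treats attribution_fn as any black-box method, so it is a simplified illustration of the framework.

```python
import numpy as np
from scipy.stats import norm

def certify_pixel_attributions(x, attribution_fn, sigma=0.25,
                               n_samples=100, top_k=0.1, alpha=0.001):
    """Pixel-level certification sketch: per-pixel majority over noisy top-k
    attribution masks, with an l2 certified radius where the vote is confident."""
    k = max(1, int(top_k * x.size))
    votes = np.zeros(x.shape, dtype=np.int64)
    for _ in range(n_samples):
        attr = attribution_fn(x + sigma * np.random.randn(*x.shape))
        thresh = np.partition(attr.flatten(), -k)[-k]        # sparsify to top-k
        votes += (attr >= thresh).astype(np.int64)
    p_hat = votes / n_samples
    # normal-approximation lower confidence bound (a stand-in for an exact bound)
    p_low = np.clip(p_hat - norm.ppf(1 - alpha) *
                    np.sqrt(p_hat * (1 - p_hat) / n_samples), 1e-6, 1 - 1e-6)
    certified = p_low > 0.5
    radius = sigma * norm.ppf(p_low)                         # valid where certified
    return certified, radius
```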

Authors:Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, Qingxiang Lin, Zeqiang Lai, Xianghui Yang, Huiwen Shi, Zibo Zhao, Bowen Zhang, Hongyu Yan, Lifu Wang, Sicong Liu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Dongyuan Guo, Junlin Yu, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Shida Wei, Chao Zhang, Yonghao Tan, Yifu Sun, Lin Niu, Shirui Huang, Bojian Zheng, Shu Liu, Shilin Chen, Xiang Yuan, Xiaofeng Yang, Kai Liu, Jianchen Zhu, Peng Chen, Tian Liu, Di Wang, Yuhong Liu, Linus, Jie Jiang, Jingwei Huang, Chunchao Guo
Title: Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
Abstract:
3D AI-generated content (AIGC) is a rapidly advancing field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.
中文: 本教程以Hunyuan3D 2.1为例,系统介绍三维数据处理、生成模型训练与性能评估的全流程,帮助开发者掌握创建适用于游戏和设计领域的高质量三维资产的能力。
English: This tutorial presents Hunyuan3D 2.1 as a comprehensive guide for processing 3D data, training generative models, and evaluating performance to create high-resolution 3D assets for gaming and design applications.

Authors:Niki Amini-Naieni, Andrew Zisserman
Title: Open-World Object Counting in Videos
Abstract:
We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.
中文摘要:本文提出CountVid模型,通过文本或图像查询在视频中实现开放世界的物体计数,能有效追踪并统计不同物体实例,并在新构建的VideoCount数据集上验证了其优越性能。
English Summary: This paper introduces CountVid, a model for open-world object counting in videos that uses text or image queries to identify and track unique object instances across frames, and demonstrates its superior performance on the new VideoCount dataset.

Authors:Lanfeng Zhong, Xin Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
Title: OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models
Abstract:
Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low efficiency of annotation in traditional AL methods. Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. In subsequent queries, Diverse Informative ID Sampling (DIS) that includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model's performance due to its high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at https://github.com/HiLab-git/OpenPath.
中文: 本文提出OpenPath这一新型开放集主动学习方法,利用预训练视觉语言模型在病理图像分类中有效筛选信息丰富的分布内样本并排除分布外数据,实验证明其性能优于现有先进方法。
English: This paper introduces OpenPath, a novel open-set active learning method that utilizes a pre-trained vision-language model to efficiently select informative in-distribution samples while avoiding out-of-distribution data in pathological image classification, demonstrating superior performance over existing methods.
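
Entropy-Guided Stochastic Sampling can be illustrated as drawing the annotation batch from the in-distribution candidates with probability proportional to predictive entropy, so the query stays both pure and informative. The snippet assumes the VLM's class probabilities and the prototype-filtered candidate indices are already available; it is a reading of the abstract, not the released implementation.

```python
import numpy as np

def entropy_guided_stochastic_sampling(probs, id_candidates, budget, seed=0):
    """probs: (N, C) class probabilities from the vision-language model.
    id_candidates: indices already filtered as likely in-distribution
    (e.g., by prototype-based selection). Returns `budget` indices to annotate."""
    rng = np.random.default_rng(seed)
    p = probs[id_candidates]
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)           # informativeness
    weights = entropy / entropy.sum()
    return rng.choice(id_candidates, size=budget, replace=False, p=weights)
```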

Authors:Leonid Ivanov, Vasily Yuryev, Dmitry Yudin
Title: MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning
Abstract:
In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced End-to-End model named MapFM for online vectorized HD map generation. We show that incorporating a powerful foundation model for encoding camera images significantly boosts feature representation quality. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.
Chinese: 本文提出MapFM增强型端到端模型,通过融合基础模型进行图像编码并在鸟瞰图表示中集成辅助语义分割头,显著提升了在线矢量化高精地图生成的准确性和质量。
English: This paper presents MapFM, an enhanced end-to-end model that improves online vectorized HD map generation by integrating foundation models for image encoding and auxiliary semantic segmentation heads in BEV representation, resulting in higher accuracy and quality.

Authors:Han Wu, Junyao Li, Kangbo Zhao, Sen Zhang, Yukai Shi, Liang Lin
Title: One-shot Face Sketch Synthesis in the Wild via Generative Diffusion Prior and Instruction Tuning
Abstract:
Face sketch synthesis is a technique aimed at converting face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo-sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods face problems such as data scarcity and high human labor costs. Once the training data becomes scarce, their generative performance significantly degrades. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs. Then, the instructions derived through gradient-based optimization are used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo-sketch images, including sketches with different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one pair of images for training at each time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: https://github.com/HanWu3125/OS-Sketch
中文: 本文提出了一种基于扩散模型的单次人脸素描合成方法,通过优化文本指令实现仅需一对样本即可将照片转化为逼真素描,并建立了新的基准数据集验证其优越性能。
English: This paper introduces a one-shot face sketch synthesis method using diffusion models, which optimizes text instructions for converting photos into realistic sketches with minimal training data, and validates its effectiveness through a new benchmark dataset.

Authors:Mattia Nardon, Stefano Messelodi, Antonio Granata, Fabio Poiesi, Alberto Danese, Davide Boscaini
Title: AI-driven visual monitoring of industrial assembly tasks
Abstract:
Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Project page: https://tev-fbk.github.io/ViMAT
中文摘要:ViMAT是一种基于人工智能的系统,通过多视角视频流和任务知识实时监控工业装配任务,无需固定工作环境或视觉标记,已在复杂实际场景中验证其有效性。
English Summary: ViMAT is an AI-driven system that monitors industrial assembly tasks in real-time using multi-view video streams and task knowledge, eliminating the need for rigid setups or visual markers, and has been proven effective in complex scenarios.

Authors:Liangjie Meng, Danxia Li, Jinrong He, Lili Ma, Zhixin Li
Title: Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images
Abstract:
Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.
Chinese: 合成孔径雷达(SAR)能够实现全天候高分辨率成像,但在船舶检测中面临目标尺度多变和背景复杂等挑战;本文提出的C-AFBiFPN框架通过增强特征表征与融合能力,有效提升了检测精度和鲁棒性。
English: Synthetic Aperture Radar (SAR) enables high-resolution, all-weather imaging but faces challenges in ship detection due to scale variations and complex backgrounds, which the proposed C-AFBiFPN framework addresses by enhancing feature representation and fusion to improve accuracy and robustness.

Authors:Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Title: video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Abstract:
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at https://github.com/bytedance/video-SALMONN-2.
Chinese: Video-SALMONN 2 通过多轮直接偏好优化(MrDPO)和字幕质量目标,在视频描述和问答任务中实现了最先进的性能,超越了GPT-4o和Gemini-1.5 Pro等专有系统,并在多个基准测试中表现优异。
English: Video-SALMONN 2 introduces multi-round direct preference optimization (MrDPO) with a caption-quality objective, achieving state-of-the-art results in video description and question answering across multiple benchmarks while outperforming proprietary systems like GPT-4o and Gemini-1.5 Pro.

Authors:Dan He, Weisheng Li, Guofen Wang, Yuping Huang, Shiqiang Liu
Title: DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder
Abstract:
Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model's feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image's brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at https://github.com/HeDan-11/DM-FNet.
Chinese: 本研究提出了一种基于扩散模型的两阶段融合网络(DM-FNet),通过增强特征提取和跨模态交互来优化多模态医学图像融合,使融合图像在亮度、对比度和细节方面表现卓越。
English: This study introduces a two-stage diffusion model-based fusion network (DM-FNet) that enhances multimodal medical image fusion by improving feature capture and cross-modal interaction, resulting in superior brightness, contrast, and detail in fused images.

Authors:Abdur Rahman, Keerthiveena Balraj, Manojkumar Ramteke, Anurag Singh Rathore
Title: Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography
Abstract:
Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby complicating the segmentation process. To address these challenges, this paper introduces Echo-DND, a novel dual-noise diffusion model specifically designed for this task. Echo-DND leverages a unique combination of Gaussian and Bernoulli noises. It also incorporates a multi-scale fusion conditioning module to improve segmentation precision. Furthermore, it utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. The model's performance was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. Extensive evaluations demonstrate that the proposed framework outperforms existing SOTA models. It achieves high Dice scores of 0.962 and 0.939 on these datasets, respectively. The proposed Echo-DND model establishes a new standard in echocardiogram segmentation, and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Project page: https://abdur75648.github.io/Echo-DND
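English: Echo-DND is a dual-noise diffusion model that combines Gaussian and Bernoulli noises with multi-scale fusion conditioning and spatial coherence calibration for left ventricle segmentation in echocardiograms, achieving Dice scores of 0.962 on CAMUS and 0.939 on EchoNet-Dynamic.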

Authors:Jiaqi Shi, Jin Xiao, Xiaoguang Hu, Boyang Song, Hao Jiang, Tianyou Chen, Baochang Zhang
Title: Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation
Abstract:
Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, there are irrelevant point interference and feature hierarchy gap problems due to the limitation of local coordinates. Although some works address this limitation by refining the spatial description through explicit modeling of cross-stage structure, these enhancement methods based on direct geometric structure encoding have problems of high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA) that utilizes the correlation in the high-dimensional space to correct the feature distribution during aggregation, which improves the computational efficiency and robustness. PDSA distinguishes the point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key point mechanism to optimize the computational overhead. The experimental results on semantic segmentation and classification tasks with different baselines verify the generalization of the proposed method and show significant performance improvements at lower parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. The code and training weights are available at: https://github.com/AGENT9717/PointDistribution
中文摘要:提出的点分布集合抽象(PDSA)模块通过在高维空间校正特征分布,提升了点云分析的计算效率和鲁棒性,并以更少参数实现了显著性能提升。
English Summary: The proposed Point Distribution Set Abstraction (PDSA) module improves point cloud analysis by correcting feature distribution in high-dimensional space, enhancing computational efficiency and robustness while achieving significant performance gains with fewer parameters.

Authors:Yufei Liu, Haoke Xiao, Jiaxing Chai, Yongcun Zhang, Rong Wang, Zijie Meng, Zhiming Luo
Title: SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts
Abstract:
The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving the quality of negative prompts. To select point prompts in a more reliable confidence map, we design a novel Confidence Map Synergy Module by combining the strengths of DINOv2 and SAM. Based on the confidence map, we select the top-k pixels as the positive points set and choose the negative points set using a Gaussian distribution, followed by independent K-means clustering for both sets. Then, these selected points are leveraged as high-quality prompts for SAM to get the segmentation results. Extensive experiments demonstrate that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods.
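English: SynPo is a training-free few-shot medical image segmentation method built on LVMs such as SAM that improves the quality of negative prompts via a DINOv2-SAM Confidence Map Synergy Module, selecting and clustering positive and negative point prompts from the confidence map to reach performance comparable to training-based methods.

A minimal sketch of the prompt-selection step described in the abstract (top-k positives from a confidence map, Gaussian-weighted negative sampling, and independent K-means clustering of each set) is given below; the array shapes, sampling parameters, and cluster counts are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of confidence-map-based point-prompt selection
# (top-k positives, Gaussian-weighted negatives, K-means on each set).
# Shapes and hyperparameters are assumptions for demonstration only.
import numpy as np
from sklearn.cluster import KMeans

def select_point_prompts(conf_map, k_pos=100, k_neg=100, n_clusters=5, seed=0):
    """conf_map: (H, W) array in [0, 1]; returns clustered (y, x) prompt points."""
    rng = np.random.default_rng(seed)
    h, w = conf_map.shape
    flat = conf_map.ravel()

    # Positive candidates: top-k most confident pixels.
    pos_idx = np.argpartition(flat, -k_pos)[-k_pos:]

    # Negative candidates: sample pixels with weights from a Gaussian centered
    # at zero confidence (low confidence -> high sampling weight).
    sigma = 0.2
    weights = np.exp(-0.5 * (flat / sigma) ** 2)
    weights /= weights.sum()
    neg_idx = rng.choice(flat.size, size=k_neg, replace=False, p=weights)

    def cluster(idx):
        pts = np.stack(np.unravel_index(idx, (h, w)), axis=1).astype(float)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(pts)
        return km.cluster_centers_  # representative prompt points

    return cluster(pos_idx), cluster(neg_idx)

if __name__ == "__main__":
    conf = np.random.rand(64, 64)
    pos, neg = select_point_prompts(conf)
    print(pos.shape, neg.shape)  # (5, 2) (5, 2)
```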

Authors:Yijun Lin, Yao-Yi Chiang
Title: Hyper-Local Deformable Transformers for Text Spotting on Historical Maps
Abstract:
Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored to only specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance. This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps of a wide variety. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach to automatically generate synthetic map images, SynthMap+, for training text spotters for historical maps. The experiments show that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at https://github.com/kartta-foundation/mapkurator-palette-doc.
中文: 历史地图文本提取因样式多样和背景复杂而困难,但PALETTE采用超局部采样模块和SynthMap+合成训练数据,能有效检测和识别文本,性能优于现有方法,并已成功处理超过6万张地图。
English: Historical map text extraction is challenging due to diverse styles and complex backgrounds, but PALETTE introduces a hyper-local sampling module and SynthMap+ synthetic training data to effectively detect and recognize text, outperforming existing methods and successfully processing over 60,000 maps.

Authors:Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop
Title: PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
Abstract:
Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
中文:PictSure是一个强调图像嵌入关键作用的上下文学习框架,通过系统分析表明预训练策略显著影响小样本图像分类的跨域性能,在挑战性基准测试中取得了优异成果。
English: PictSure is an in-context learning framework that emphasizes the critical role of image embeddings, demonstrating through systematic analysis that pretraining strategies significantly influence out-of-domain performance in few-shot image classification, achieving superior results on challenging benchmarks.

Authors:Ziqiao Peng, Wentao Hu, Junyuan Ma, Xiangyu Zhu, Xiaomei Zhang, Hao Zhao, Hui Tian, Jun He, Hongyan Liu, Zhaoxin Fan
Title: SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting
Abstract:
Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the ''devil'' in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability. Additionally, SyncTalk++ enhances robustness to out-of-distribution (OOD) audio by incorporating an Expression Generator and a Torso Restorer, which generate speech-matched facial expressions and seamless torso regions. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Extensive experiments and user studies demonstrate that SyncTalk++ outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk++.
中文: SyncTalk++通过动态肖像渲染器和面部同步控制器解决语音驱动说话头合成的同步问题,结合头部同步稳定器优化头部姿态,显著提升了真实感和渲染速度。
English: SyncTalk++ tackles the synchronization challenge in speech-driven talking head synthesis by integrating a Dynamic Portrait Renderer and Face-Sync Controller for lip, expression, and identity alignment, along with a Head-Sync Stabilizer for natural head movements, achieving superior realism and speed.

Authors:Dahang Wan, Rongsheng Lu, Yang Fang, Xianli Lang, Shuangbao Shu, Jingjing Chen, Siyuan Shen, Ting Xu, Zecong Ye
Title: YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework
Abstract:
Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweighting, there are still challenges like the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. Experiments show our framework excels on three major open-source multispectral object detection datasets, like LLVIP and FLIR. Particularly, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved YOLOv11 models' mAP by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the framework and strategies' effectiveness. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
Chinese: YOLOv11-RGBT框架提出了六种多光谱融合模式和创新的中融合与可控微调策略,显著提升了多光谱目标检测的精度和鲁棒性,在FLIR数据集上mAP最高提升达5.65%。
English: The YOLOv11-RGBT framework introduces six multispectral fusion modes and innovative strategies like mid-fusion and controllable fine-tuning, significantly enhancing detection accuracy and robustness across datasets, with mAP improvements up to 5.65% on FLIR.

Authors:Yuke Xing, Jiarui Wang, Peizhi Niu, Wenjie Huang, Guangtao Zhai, Yiling Xu
Title: 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K.
中文: 3D高斯泼溅技术虽能实现高质量实时新视角合成,但存储需求巨大,为此我们构建了首个大规模图像质量评估数据集3DGS-IEval-15K,通过主观实验和多元指标系统评估压缩3DGS的感知质量。
English: 3D Gaussian Splatting enables real-time novel view synthesis with high visual quality but faces storage challenges, leading to the creation of 3DGS-IEval-15K, the first large-scale dataset for evaluating perceptual quality in compressed 3DGS representations through human assessments and diverse metrics.

Authors:Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda
Title: VisText-Mosquito: A Unified Multimodal Benchmark Dataset for Visual Detection, Segmentation, and Textual Reasoning on Mosquito Breeding Sites
Abstract:
Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, we tested a range of large vision-language models (LVLMs) in both zero-shot and few-shot settings. Our fine-tuned Mosquito-LLaMA3-8B model achieved the best results, with a final loss of 0.0028, a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.85. This dataset and model framework emphasize the theme "Prevention is Better than Cure", showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito
中文: 本文介绍了VisText-Mosquito多模态数据集,它整合视觉与文本数据以改进蚊虫孳生地的自动检测、分割和推理,先进模型实现了高精度并支持"预防胜于治疗"的主动疾病防控理念。
English: This paper introduces VisText-Mosquito, a multimodal dataset combining visual and textual data to enhance automated detection, segmentation, and reasoning for mosquito breeding sites, with advanced models achieving high precision and supporting proactive disease prevention.

Authors:Giacomo Meanti, Thomas Ryckeboer, Michael Arbel, Julien Mairal
Title: Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
Abstract:
This work addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches -- which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images -- the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.
中文: 本研究提出一种基于非配对数据集和条件流匹配的图像复原方法,能够同时学习退化模型和复原过程,在去模糊和点扩散函数校准任务中表现优异,且仅需少量数据即可实现。
English: This study introduces a novel image restoration method that uses unpaired datasets and conditional flow matching to learn both the degradation model and restoration process, achieving superior performance in deblurring and PSF calibration tasks with minimal data requirements.
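Since the method builds on conditional flow matching, the following sketch shows a generic (unconditional) flow-matching training step: interpolate between noise and data along a linear path and regress the velocity field toward the displacement. The tiny MLP and synthetic data are placeholders, and the paper's conditioning on a learned forward model is not reproduced here.

```python
# Generic flow matching training step (illustrative only): sample t,
# interpolate x_t between noise x0 and data x1, and regress the velocity
# field v_theta(x_t, t) toward (x1 - x0).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(model, x1):
    x0 = torch.randn_like(x1)                      # source noise
    t = torch.rand(x1.shape[0], 1)                 # random times in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                   # linear probability path
    target_v = x1 - x0                             # ground-truth velocity
    return ((model(xt, t) - target_v) ** 2).mean()

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):                                # toy training loop
    x1 = torch.randn(128, 2) * 0.5 + 2.0           # placeholder "data"
    loss = flow_matching_loss(model, x1)
    opt.zero_grad(); loss.backward(); opt.step()
```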

Authors:Ming Xu, Xu Zhang
Title: PoseGRAF: Geometric-Reinforced Adaptive Fusion for Monocular 3D Human Pose Estimation
Abstract:
Existing monocular 3D pose estimation methods primarily rely on joint positional features, while overlooking intrinsic directional and angular correlations within the skeleton. As a result, they often produce implausible poses under joint occlusions or rapid motion changes. To address these challenges, we propose the PoseGRAF framework. We first construct a dual graph convolutional structure that separately processes joint and bone graphs, effectively capturing their local dependencies. A Cross-Attention module is then introduced to model interdependencies between bone directions and joint features. Building upon this, a dynamic fusion module is designed to adaptively integrate both feature types by leveraging the relational dependencies between joints and bones. An improved Transformer encoder is further incorporated in a residual manner to generate the final output. Experimental results on the Human3.6M and MPI-INF-3DHP datasets show that our method exceeds state-of-the-art approaches. Additional evaluations on in-the-wild videos further validate its generalizability. The code is publicly available at https://github.com/iCityLab/PoseGRAF.
中文:PoseGRAF框架通过双图卷积和跨注意力机制捕捉骨骼关节间的方向关联,解决了现有单目三维姿态估计方法的局限性,在多个基准数据集上实现了最先进的性能。
English: The PoseGRAF framework addresses limitations in monocular 3D pose estimation by employing dual graph convolutions and cross-attention mechanisms to capture directional correlations between joints and bones, achieving state-of-the-art performance on benchmark datasets.
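The joint-bone cross-attention described in the summary can be sketched with a standard multi-head attention block in which joint features act as queries over bone-direction features; the feature width and joint/bone counts below are placeholders, not the paper's settings.

```python
# Sketch of cross-attention between joint features (queries) and bone
# direction features (keys/values). Sizes are illustrative placeholders.
import torch
import torch.nn as nn

class JointBoneCrossAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joint_feats, bone_feats):
        # joint_feats: (B, J, C), bone_feats: (B, J-1, C)
        fused, _ = self.attn(query=joint_feats, key=bone_feats, value=bone_feats)
        return self.norm(joint_feats + fused)        # residual fusion

joints, bones = torch.randn(2, 17, 64), torch.randn(2, 16, 64)
print(JointBoneCrossAttention()(joints, bones).shape)  # torch.Size([2, 17, 64])
```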

Authors:Zhiwen Shao, Yifan Cheng, Feiran Li, Yong Zhou, Xuequan Lu, Yuan Xie, Lizhuang Ma
Title: MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution
Abstract:
Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block composed of fully-connected convolution and channel correspondence convolution to directly extract local-global features from a sequence of raw frames, without the prior knowledge of key frames. The transformer-style fully-connected convolution is proposed to extract local features while maintaining global receptive fields, and the graph-style channel correspondence convolution is introduced to model the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The two latter tasks contribute to capturing facial subtle action information for MER, which can alleviate the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture facial subtle muscle actions in local regions associated with MEs. The code is available at https://github.com/CYF-cuber/MOL.
中文摘要:本文提出了一种端到端的微动作感知深度学习框架,结合变换器、图卷积和普通卷积的优势,直接从原始视频帧中提取局部-全局特征,无需关键帧先验知识,在面部微表情识别任务中实现了最优性能。
English Summary: The paper introduces an end-to-end micro-action-aware deep learning framework that integrates transformer, graph, and vanilla convolutions to extract local-global features directly from raw video frames, achieving state-of-the-art performance in facial micro-expression recognition without requiring key frame annotations.

Authors:Huan Kang, Hui Li, Xiao-Jun Wu, Tianyang Xu, Rui Wang, Chunyang Cheng, Josef Kittler
Title: GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image Fusion
Abstract:
In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. However, in practice, the source images are often located in a non-Euclidean space, where the Euclidean methods usually cannot encapsulate the intrinsic topological structure. Typically, the inner product performed in the Euclidean space calculates the algebraic similarity rather than the semantic similarity, which results in undesired attention output and a decrease in fusion performance. Meanwhile, the balance between low-level details and high-level semantics should be considered in the infrared and visible image fusion task. To address this issue, in this paper, we propose a novel attention mechanism based on Grassmann manifold for infrared and visible image fusion (GrFormer). Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask to maximise the complementary properties between different modalities and to suppress the features with high correlation, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The codes are available at https://github.com/Shaoyun2023.
中文摘要:本文提出基于格拉斯曼流形的新型注意力机制GrFormer,通过将特征解耦为高频细节和低频语义实现多尺度语义融合,在红外与可见光图像融合任务中定性定量均优于现有最优方法。
English Summary: This paper introduces GrFormer, a novel attention mechanism using Grassmann manifold to achieve multi-scale semantic fusion by separating image features into high-frequency details and low-frequency semantics, outperforming state-of-the-art methods in infrared and visible image fusion.

Authors:Xiaoqi Wang, Yi Wang, Lap-Pui Chau
Title: EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
Abstract:
Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) Excessive pre-training cost arising from multi-stage pre-training pipelines, 2) Ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) Imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT first efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pretraining. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at https://github.com/xqwang14/EVA02-AT .
中文摘要:EVA02-AT通过单阶段预训练实现了统一视频编码器,采用联合注意力增强时空编码,并引入对称多重相似度损失函数,以更少参数量在自我中心视频语言任务中达到最优性能。
English Summary: EVA02-AT introduces a unified video encoder through single-stage pretraining, enhanced spatial-temporal encoding with joint attention, and a Symmetric Multi-Similarity loss, achieving state-of-the-art performance in egocentric video-language tasks with fewer parameters.

Authors:Jeremy A. Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, Animesh Garg
Title: AMPLIFY: Actionless Motion Priors for Robot Learning from Videos
Abstract:
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at https://amplify-robotics.github.io/.
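English: AMPLIFY encodes visual dynamics from large-scale action-free videos into compact, discrete motion tokens derived from keypoint trajectories, training a forward dynamics model on video data and an inverse dynamics model on limited action-labeled data, which yields more accurate dynamics predictions and substantial gains in downstream policy learning.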

Authors:Kunyuan Deng, Yi Wang, Lap-Pui Chau
Title: Egocentric Human-Object Interaction Detection: A New Benchmark and Method
Abstract:
Egocentric human-object interaction (Ego-HOI) detection is crucial for intelligent agents to understand and assist human activities from a first-person perspective. However, progress has been hindered by the lack of benchmarks and methods tailored to egocentric challenges such as severe hand-object occlusion. In this paper, we introduce the real-world Ego-HOI detection task and the accompanying Ego-HOIBench, a new dataset with over 27K egocentric images and explicit, fine-grained hand-verb-object triplet annotations across 123 categories. Ego-HOIBench covers diverse daily scenarios, object types, and both single- and two-hand interactions, offering a comprehensive testbed for Ego-HOI research. Benchmarking existing third-person HOI detectors on Ego-HOIBench reveals significant performance gaps, highlighting the need for egocentric-specific solutions. To this end, we propose Hand Geometry and Interactivity Refinement (HGIR), a lightweight, plug-and-play scheme that leverages hand pose and geometric cues to enhance interaction representations. Specifically, HGIR explicitly extracts global hand geometric features from the estimated hand pose proposals, and further refines interaction features through pose-interaction attention, enabling the model to focus on subtle hand-object relationship differences even under severe occlusion. HGIR significantly improves Ego-HOI detection performance across multiple baselines, achieving new state-of-the-art results on Ego-HOIBench. Our dataset and method establish a solid foundation for future research in egocentric vision and human-object interaction understanding. Project page: https://dengkunyuan.github.io/EgoHOIBench/
中文: 本文提出了针对自我中心视角下人与物体交互检测的新数据集Ego-HOIBench,并开发了HGIR方法,通过手部几何特征增强交互表征,在严重遮挡情况下仍能实现最优性能。
English: This paper introduces Ego-HOIBench, a comprehensive dataset for egocentric human-object interaction detection addressing challenges like occlusion, and proposes HGIR, a plug-and-play method that enhances interaction features using hand geometry to achieve state-of-the-art performance.

Authors:Juho Bai, Inwook Shim
Title: SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability
Abstract:
Accurate prediction of pedestrian trajectories is essential for applications in robotics and surveillance systems. While existing approaches primarily focus on social interactions between pedestrians, they often overlook the rich environmental context that significantly shapes human movement patterns. In this paper, we propose SceneAware, a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. Our method leverages a Vision Transformer (ViT) scene encoder to process environmental context from static scene images, while Multi-modal Large Language Models (MLLMs) generate binary walkability masks that distinguish between accessible and restricted areas during training. We combine a Transformer-based trajectory encoder with the ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. The framework integrates collision penalty mechanisms that discourage predicted trajectories from violating physical boundaries, ensuring physically plausible predictions. SceneAware is implemented in both deterministic and stochastic variants. Comprehensive experiments on the ETH/UCY benchmark datasets show that our approach outperforms state-of-the-art methods, with more than 50% improvement over previous models. Our analysis based on different trajectory categories shows that the model performs consistently well across various types of pedestrian movement. This highlights the importance of using explicit scene information and shows that our scene-aware approach is both effective and reliable in generating accurate and physically plausible predictions. Code is available at: https://github.com/juho127/SceneAware.
中文摘要:SceneAware框架通过结合视觉变换器的场景理解和可通行性掩码,显著提升了行人轨迹预测的准确性,在基准数据集上性能超越现有方法50%以上,同时确保预测路径符合物理约束。
English Summary: The SceneAware framework enhances pedestrian trajectory prediction by integrating scene understanding through Vision Transformers and walkability masks, achieving over 50% improvement in accuracy on benchmark datasets while ensuring physically plausible paths.
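As a rough illustration of the collision-penalty idea mentioned above, the snippet below looks up a binary walkability mask at predicted trajectory coordinates and penalizes points falling on restricted cells; the grid indexing, penalty weight, and mask are assumptions (a differentiable variant would sample the mask bilinearly rather than by rounding).

```python
# Illustrative collision penalty: penalize predicted trajectory points that
# fall on non-walkable cells of a binary mask. Grid indexing and the penalty
# weight are assumptions for demonstration, not the paper's exact formulation.
import torch

def collision_penalty(pred_xy, walkable_mask, weight=1.0):
    """
    pred_xy:       (B, T, 2) predicted positions in pixel coordinates.
    walkable_mask: (H, W) float tensor, 1 = walkable, 0 = restricted.
    Note: rounding makes this a hard lookup; a training loss would typically
    use bilinear sampling (e.g., F.grid_sample) to stay differentiable.
    """
    h, w = walkable_mask.shape
    x = pred_xy[..., 0].round().long().clamp(0, w - 1)
    y = pred_xy[..., 1].round().long().clamp(0, h - 1)
    walkable = walkable_mask[y, x]            # (B, T) walkability at each point
    return weight * (1.0 - walkable).mean()   # larger when points are off-mask

# Usage with random placeholders:
mask = (torch.rand(128, 128) > 0.3).float()
pred = torch.rand(4, 12, 2) * 127
print(collision_penalty(pred, mask, weight=0.5))
```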

Authors:Nafiz Sadman, Farhana Zulkernine, Benjamin Kwan
Title: Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology
Abstract:
In this paper, we construct two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on IU-xray dataset, which exhibits the aforementioned criteria, and evaluate BiomedCLIP in classifying images (radiographs) in three contexts: zero-shot inference, full finetuning, and linear probing. The results show that the model under zero-shot settings over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing detects overlapping features. We demonstrate visual understanding of the model using Grad-CAM heatmaps and compare with 15 annotations by a radiologist. We highlight the need for careful adaptations of the models to foster reliability and applicability in a real-world setting. The code for the experiments in this work is available and maintained on GitHub.
中文: 本研究评估了BiomedCLIP在非平衡医疗数据集上的表现,揭示了其零样本预测在精度和类别区分上的局限性,同时通过全微调和线性探测展示了改进效果,强调了模型需针对性优化以确保实际应用的可靠性。
English: This study evaluates BiomedCLIP's performance on an imbalanced medical dataset, revealing its zero-shot limitations in precision and class separation while demonstrating improvements through fine-tuning and linear probing, emphasizing the need for model adaptations to ensure real-world reliability.
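One of the three evaluation settings is linear probing; a generic sketch of probing frozen image embeddings with a logistic-regression head is shown below. The embeddings and labels are random placeholders (a real setup would extract features from the frozen BiomedCLIP image encoder and use a one-vs-rest scheme for the multi-label case).

```python
# Generic linear-probing sketch: fit a logistic-regression head on frozen
# image embeddings. The embeddings/labels here are random placeholders;
# in practice they would come from a frozen BiomedCLIP image encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(200, 512)), rng.normal(size=(50, 512))
train_y, test_y = rng.integers(0, 2, 200), rng.integers(0, 2, 50)

probe = LogisticRegression(max_iter=1000, class_weight="balanced")  # imbalance-aware
probe.fit(train_emb, train_y)
print("F1:", f1_score(test_y, probe.predict(test_emb)))
```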

Authors:Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, Yebin Liu
Title: GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation
Abstract:
Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of init action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the init action and the Gaussian perception, both generated by the GAF, to further obtain more precise actions. Extensive experiments demonstrate significant improvements, with GAF achieving +11.5385 dB PSNR, +0.3864 SSIM and -0.5574 LPIPS improvements in reconstruction quality, while boosting the average +7.3% success rate in robotic manipulation tasks over state-of-the-art methods.
中文: 本文提出采用V-4D-A框架和Gaussian Action Field(GAF),通过运动感知的4D表示实现直接动作推理,在场景重建质量和机器人操作成功率上相比现有方法均取得显著提升。
English: This paper introduces a V-4D-A framework using Gaussian Action Field (GAF) that enables direct action reasoning from motion-aware 4D representations, significantly improving both scene reconstruction quality and robotic manipulation success rates over existing methods.

Authors:Chunyu Cao, Jintao Cheng, Zeyu Chen, Linfan Zhan, Rui Fan, Zhijian He, Xiaoyu Tang
Title: KDMOS:Knowledge Distillation for Motion Segmentation
Abstract:
Motion Object Segmentation (MOS) is crucial for autonomous driving, as it enhances localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real-time inference remains a challenge. To address this, we propose a logits-based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real-time efficiency. Specifically, we adopt a Bird's Eye View (BEV) projection-based model as the student and a non-projection model as the teacher. To handle the severe imbalance between moving and non-moving classes, we decouple them and apply tailored distillation strategies, allowing the student model to better learn key motion-related features. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling, optimize the network architecture, and achieve a 7.69% reduction in parameter count, mitigating overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and delivers competitive results on the Apollo dataset. The KDMOS implementation is available at https://github.com/SCNU-RISLAB/KDMOS.
中文摘要:本文提出了一种基于逻辑的知识蒸馏框架用于运动目标分割,通过解耦运动与非运动类别并应用针对性蒸馏策略,在保持实时效率的同时显著提升精度,在SemanticKITTI-MOS数据集上获得了78.8%的交并比。
English Summary: This paper introduces a logits-based knowledge distillation framework for Motion Object Segmentation (MOS) that enhances accuracy while maintaining real-time efficiency by decoupling moving and non-moving classes and applying tailored distillation strategies, achieving a 78.8% IoU on SemanticKITTI-MOS.
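A minimal sketch of logits-based distillation with the moving and non-moving classes decoupled is shown below: temperature-scaled KL divergence is applied to each class group separately and weighted so the rare moving class is not overwhelmed. The class groupings, temperature, and weights are illustrative assumptions, not the paper's configuration.

```python
# Illustrative decoupled logits distillation: apply temperature-scaled KL
# separately to "moving" and "non-moving" class groups so the rare moving
# class is not drowned out. Class indices and weights are assumptions.
import torch
import torch.nn.functional as F

def decoupled_distill_loss(student_logits, teacher_logits,
                           moving_ids, static_ids, T=2.0, w_move=2.0, w_static=1.0):
    """logits: (N, C) per-point class scores from student and teacher."""
    def kl(group):
        s = F.log_softmax(student_logits[:, group] / T, dim=-1)
        t = F.softmax(teacher_logits[:, group] / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * (T * T)
    return w_move * kl(moving_ids) + w_static * kl(static_ids)

# Toy usage with 3 classes (0 = unlabeled, 1 = static, 2 = moving).
s = torch.randn(1024, 3, requires_grad=True)
t = torch.randn(1024, 3)
loss = decoupled_distill_loss(s, t, moving_ids=[1, 2], static_ids=[0, 1])
loss.backward()
```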

Authors:Chelsi Jain, Yiran Wu, Yifan Zeng, Jiale Liu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang
Title: SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
Abstract:
Document Visual Question Answering (DocVQA) is a practical yet challenging task, which is to ask questions based on documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but utilize a Visual Language Model (VLM)-based embedding model to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets with much fewer pages retrieved. Our code is available at https://github.com/ag2ai/SimpleDoc.
中文:SimpleDoc是一种轻量级检索增强框架,通过双重线索过滤和迭代推理增强文档证据检索能力,在四个DocVQA数据集上平均性能提升3.2%且检索页面更少。
English: SimpleDoc is a lightweight retrieval-augmented framework for Document Visual Question Answering that enhances evidence retrieval through dual-cue filtering and iterative reasoning, achieving a 3.2% average improvement on four datasets with fewer retrieved pages.
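The dual-cue retrieve-and-refine loop can be sketched as follows: embedding similarity proposes candidate pages, page summaries re-rank them, and a reasoner repeatedly pulls pages into working memory until it is confident. The embed, rerank_by_summary, and vlm_answer functions below are placeholders standing in for the VLM components described in the abstract; only the control flow is illustrated.

```python
# Schematic dual-cue retrieval with iterative refinement. The embedding,
# summary re-ranking, and VLM answering functions are trivial placeholders.
import numpy as np

def embed(texts):                                           # placeholder embedder
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 64))

def rerank_by_summary(question, candidates, summaries):     # placeholder re-ranker
    return sorted(candidates, key=lambda p: len(summaries[p]), reverse=True)

def vlm_answer(question, memory_pages):                     # placeholder reasoner
    confident = len(memory_pages) >= 2
    return "answer based on pages %s" % sorted(memory_pages), confident

def simpledoc_qa(question, pages, summaries, top_k=4, max_rounds=3):
    q_emb = embed([question])[0]
    p_emb = embed(pages)
    sims = p_emb @ q_emb                                     # cue 1: embeddings
    candidates = list(np.argsort(-sims)[:top_k])
    candidates = rerank_by_summary(question, candidates, summaries)  # cue 2
    memory, answer = set(), ""
    for _ in range(max_rounds):
        if candidates:
            memory.add(candidates.pop(0))                    # pull a fresh page
        answer, confident = vlm_answer(question, memory)
        if confident:
            break
    return answer

pages = ["page %d text" % i for i in range(10)]
summaries = {i: "summary " * (i + 1) for i in range(10)}
print(simpledoc_qa("What is reported in the table?", pages, summaries))
```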

Authors:Runtao Liu, Jiahao Zhan, Yingqing He, Chen Wei, Alan Yuille, Qifeng Chen
Title: Fake it till You Make it: Reward Modeling as Discriminative Prediction
Abstract:
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches of reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate our GAN-RM's effectiveness across multiple key applications including test-time scaling implemented as Best-of-N sample filtering, post-training approaches like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Code and data will be released at https://github.com/Visualignment/GAN-RM.
Chinese: 本文提出GAN-RM,一种高效奖励建模框架,通过仅使用少量目标样本与模型生成输出进行判别训练,无需人工偏好标注和显式质量维度设计,并在样本筛选和微调等关键应用中验证了其有效性。
English: This paper introduces GAN-RM, an efficient reward modeling framework that eliminates the need for manual preference annotations and explicit quality engineering by training on a small set of target samples and model-generated outputs, demonstrating effectiveness across various applications like sample filtering and fine-tuning.
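The core of the described framework is a discriminator trained to separate a few hundred target samples (Preference Proxy Data) from ordinary model outputs, whose score can then serve as a reward, e.g., for Best-of-N filtering. The sketch below uses random feature vectors as placeholders and is not the released implementation.

```python
# Illustrative GAN-style reward model: a small discriminator separates a few
# hundred "target" samples from ordinary model outputs; its score is then
# reused as a reward for Best-of-N filtering. Features are random placeholders.
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

target_feats = torch.randn(256, 512)        # stand-in for Preference Proxy Data
for step in range(100):
    gen_feats = torch.randn(64, 512)        # features of model-generated samples
    real = target_feats[torch.randint(0, 256, (64,))]
    logits = reward_net(torch.cat([real, gen_feats]))
    labels = torch.cat([torch.ones(64, 1), torch.zeros(64, 1)])
    loss = bce(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

# Best-of-N filtering with the learned reward:
candidates = torch.randn(8, 512)
best = candidates[reward_net(candidates).argmax()]
```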

Authors:Florian Kofler, Marcel Rosier, Mehdi Astaraki, Ujjwal Baid, Hendrik Möller, Josef A. Buchner, Felix Steinbauer, Eva Oswald, Ezequiel de la Rosa, Ivan Ezhov, Constantin von See, Jan Kirschke, Anton Schmick, Sarthak Pati, Akis Linardos, Carla Pitarch, Sanyukta Adap, Jeffrey Rudie, Maria Correia de Verdier, Rachit Saluja, Evan Calabrese, Dominic LaBella, Mariam Aboian, Ahmed W. Moawad, Nazanin Maleki, Udunna Anazodo, Maruf Adewole, Marius George Linguraru, Anahita Fathi Kazerooni, Zhifan Jiang, Gian Marco Conte, Hongwei Li, Juan Eugenio Iglesias, Spyridon Bakas, Benedikt Wiestler, Marie Piraud, Bjoern Menze
Title: BraTS orchestrator: Democratizing and Disseminating state-of-the-art brain tumor image analysis
Abstract:
The Brain Tumor Segmentation (BraTS) cluster of challenges has significantly advanced brain tumor image analysis by providing large, curated datasets and addressing clinically relevant tasks. However, despite its success and popularity, algorithms and models developed through BraTS have seen limited adoption in both scientific and clinical communities. To accelerate their dissemination, we introduce BraTS orchestrator, an open-source Python package that provides seamless access to state-of-the-art segmentation and synthesis algorithms for diverse brain tumors from the BraTS challenge ecosystem. Available on GitHub (https://github.com/BrainLesion/BraTS), the package features intuitive tutorials designed for users with minimal programming experience, enabling both researchers and clinicians to easily deploy winning BraTS algorithms for inference. By abstracting the complexities of modern deep learning, BraTS orchestrator democratizes access to the specialized knowledge developed within the BraTS community, making these advances readily available to broader neuro-radiology and neuro-oncology audiences.
Chinese: BraTS orchestrator 是一个开源Python包,它简化了对BraTS挑战赛中先进脑肿瘤分割与合成算法的访问,使研究人员和临床医生能够轻松部署这些技术,无需深厚的编程经验。
English: The BraTS orchestrator is an open-source Python package that simplifies access to advanced brain tumor segmentation and synthesis algorithms from the BraTS challenges, enabling easy deployment for researchers and clinicians with minimal programming skills.

Authors:Yuheng Yuan, Qiuhong Shen, Shizun Wang, Xingyi Yang, Xinchao Wang
Title: Test3R: Learning to Reconstruct 3D at Test Time
Abstract:
Dense matching methods like DUSt3R regress pairwise pointmaps for 3D reconstruction. However, the reliance on pairwise prediction and the limited generalization capability inherently restrict the global geometric consistency. In this work, we introduce Test3R, a surprisingly simple test-time learning technique that significantly boosts geometric accuracy. Using image triplets (I1, I2, I3), Test3R generates reconstructions from pairs (I1, I2) and (I1, I3). The core idea is to optimize the network at test time via a self-supervised objective: maximizing the geometric consistency between these two reconstructions relative to the common image I1. This ensures the model produces cross-pair consistent outputs, regardless of the inputs. Extensive experiments demonstrate that our technique significantly outperforms previous state-of-the-art methods on the 3D reconstruction and multi-view depth estimation tasks. Moreover, it is universally applicable and nearly cost-free, making it easily applied to other models and implemented with minimal test-time training overhead and parameter footprint. Code is available at https://github.com/nopQAQ/Test3R.
中文: Test3R是一种简单的测试时学习方法,通过使用图像三元组优化网络,最大化共享图像对重建间的几何一致性,从而显著提升三维重建精度。
English: Test3R is a simple test-time learning method that enhances 3D reconstruction accuracy by optimizing networks using image triplets to maximize geometric consistency between reconstructions from shared image pairs.
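A minimal sketch of the test-time objective follows: run the pairwise model on (I1, I2) and (I1, I3) and take a few gradient steps minimizing the discrepancy between the two pointmaps predicted for the shared image I1. The toy convolutional network stands in for a DUSt3R-style model and is an assumption for illustration only.

```python
# Test-time consistency sketch: optimize the network for a few steps so that
# the pointmaps it predicts for the shared image I1 agree across the pairs
# (I1, I2) and (I1, I3). The toy conv net stands in for a DUSt3R-style model.
import torch
import torch.nn as nn

class ToyPairwiseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(6, 3, kernel_size=3, padding=1)  # pointmap of view 1

    def forward(self, img_a, img_b):
        return self.backbone(torch.cat([img_a, img_b], dim=1))

model = ToyPairwiseNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

i1, i2, i3 = (torch.rand(1, 3, 64, 64) for _ in range(3))   # the image triplet
for _ in range(20):                                          # a few test-time steps
    pm_12 = model(i1, i2)        # pointmap of I1 predicted from pair (I1, I2)
    pm_13 = model(i1, i3)        # pointmap of I1 predicted from pair (I1, I3)
    loss = (pm_12 - pm_13).pow(2).mean()   # self-supervised cross-pair consistency
    opt.zero_grad(); loss.backward(); opt.step()
```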

Authors:Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim, Yuki Mitsufuji
Title: Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry
Abstract:
We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synthesis cannot handle in-the-wild videos. Our approach consists of two steps: estimating temporally consistent geometry, and generative rendering guided by this geometry. By integrating geometric priors, the generative model focuses on synthesizing realistic details where the estimated geometry is uncertain. We eliminate the need for extensive 4D training data through a factorized fine-tuning framework that separately trains spatial and temporal components using multi-view image and video data. Our method outperforms baselines in producing plausible videos from novel camera trajectories, especially in extreme extrapolation scenarios on real-world footage.
中文: Vid-CamEdit是一种创新框架,通过结合几何估计与生成式渲染,能够沿自定义相机路径重新合成视频,无需大量4D训练数据即可有效处理极端轨迹变化。
English: Vid-CamEdit is a novel framework that enables video re-synthesis along custom camera paths by combining geometric estimation with generative rendering, effectively handling extreme trajectory changes without requiring extensive 4D training data.

Authors:Zhucun Xue, Jiangning Zhang, Teng Hu, Haoyang He, Yinan Chen, Yuxuan Cai, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
Title: UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions
Abstract:
The quality of the video dataset (image quality, resolution, and fine-grained caption) greatly influences the performance of the video generation model. The growing demand for video applications sets higher requirements for high-quality video generation models. For example, the generation of movie-level Ultra-High Definition (UHD) videos and the creation of 4K short video content. However, the existing public datasets cannot support related research and applications. In this paper, we first propose a high-quality open-sourced UHD-4K (22.4% of which are 8K) text-to-video dataset named UltraVideo, which contains a wide range of topics (more than 100 kinds), and each video has 9 structured captions with one summarized caption (average of 824 words). Specifically, we carefully design a highly automated curation process with four stages to obtain the final high-quality dataset: (i) collection of diverse and high-quality video clips. (ii) statistical data filtering. (iii) model-based data purification. (iv) generation of comprehensive, structured captions. In addition, we expand Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation. We believe that this work can make a significant contribution to future research on UHD video generation. UltraVideo dataset and UltraWan models are available at https://xzc-zju.github.io/projects/UltraVideo.
中文: UltraVideo数据集通过提供带结构化描述的超高清4K视频填补了高质量公开视频数据的空白,同时UltraWan模型展现了更强的文本到视频生成能力。
English: The UltraVideo dataset addresses the lack of high-quality public video data by providing UHD-4K videos with structured captions, while UltraWan models demonstrate enhanced text-to-video generation capabilities.

Authors:Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
Title: Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
Abstract:
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from few hours to a week.
中文: Ego-R1框架采用结构化工具思维链方法,通过强化学习智能体动态选择工具进行分步推理,有效解决了超长第一人称视频的理解难题,将处理时长从数小时扩展至数周。
English: The Ego-R1 framework introduces a structured Chain-of-Tool-Thought process, where a reinforcement learning agent dynamically selects tools for step-by-step reasoning to effectively handle ultra-long egocentric video understanding tasks, extending coverage from hours to weeks.

Authors:Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
Title: Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Abstract:
The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representation of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimensional mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
中文: 本文提出Stream-Omni大语言-视觉-语音模型,通过序列维度拼接实现视觉-文本对齐、基于CTC的层维度映射实现语音-文本对齐,以更少数据需求在多模态任务中实现优异性能。
English: This paper introduces Stream-Omni, a large language-vision-speech model that achieves efficient modality alignments through tailored methods—sequence-dimension concatenation for vision-text and CTC-based mapping for speech-text—enabling strong performance across multimodal tasks with reduced data requirements.
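As a rough illustration of the speech-text alignment idea, the sketch below applies PyTorch's built-in CTC loss to frame-level speech logits and a text token sequence; the tensor shapes and vocabulary size are made up for the example, and this is not the Stream-Omni layer-dimension mapping itself.

```python
import torch
import torch.nn as nn

# Illustrative shapes: T speech frames, batch N, vocabulary V (index 0 = CTC blank).
T, N, V, target_len = 200, 4, 5000, 30
speech_logits = torch.randn(T, N, V, requires_grad=True)   # per-frame outputs of the speech branch
log_probs = speech_logits.log_softmax(dim=-1)
targets = torch.randint(1, V, (N, target_len))              # text token ids (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), target_len, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back to the speech branch, aligning frames to text
```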

Authors:Zhiyi Shi, Binjie Wang, Chongjie Si, Yichen Wu, Junsik Kim, Hanspeter Pfister
Title: DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models
Abstract:
Model editing aims to efficiently update a pre-trained model's knowledge without the need for time-consuming full retraining. While existing pioneering editing methods achieve promising results, they primarily focus on editing single-modal language models (LLMs). However, for vision-language models (VLMs), which involve multiple modalities, the role and impact of each modality on editing performance remain largely unexplored. To address this gap, we explore the impact of textual and visual modalities on model editing and find that: (1) textual and visual representations reach peak sensitivity at different layers, reflecting their varying importance; and (2) editing both modalities can efficiently update knowledge, but this comes at the cost of compromising the model's original capabilities. Based on our findings, we propose DualEdit, an editor that modifies both textual and visual modalities at their respective key layers. Additionally, we introduce a gating module within the more sensitive textual modality, allowing DualEdit to efficiently update new knowledge while preserving the model's original information. We evaluate DualEdit across multiple VLM backbones and benchmark datasets, demonstrating its superiority over state-of-the-art VLM editing baselines as well as adapted LLM editing methods on different evaluation metrics. Codes are available at https://github.com/zhiyiscs/DualEdit
Chinese Summary: 针对视觉语言模型(VLM)的模型编辑需兼顾文本和视觉模态,DualEdit通过在各模态关键层进行修改,并采用门控机制保护原始信息,实现了高效的知识更新与性能保持。
English Summary: Model editing for vision-language models (VLMs) requires addressing both textual and visual modalities, with DualEdit effectively updating knowledge by targeting key layers in each modality while preserving original capabilities through a gating mechanism.

Authors:Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong Liu, Xiangtai Li, Dacheng Tao
Title: AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Abstract:
Multimodal Large Language Models (MLLMs) struggle with long videos due to fixed context windows and weak long-term dependency modeling. Existing Retrieval-Augmented Generation (RAG) methods for videos use static retrieval strategies, leading to inefficiencies for simple queries and information loss for complex tasks. To address this, we propose AdaVideoRAG, a novel framework that dynamically adapts retrieval granularity based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs, enabling optimal resource allocation across tasks. We also introduce the HiVU benchmark for comprehensive evaluation. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs. AdaVideoRAG establishes a new paradigm for adaptive retrieval in video analysis. Codes will be open-sourced at https://github.com/xzc-zju/AdaVideoRAG.
Chinese: AdaVideoRAG提出了一种动态检索框架,根据查询复杂度自适应调整检索粒度,通过分层索引和轻量级分类器提升多模态大语言模型在长视频理解中的效率与准确性。
English: AdaVideoRAG introduces a dynamic retrieval framework that adapts granularity based on query complexity, utilizing hierarchical indexing and a lightweight classifier to enhance efficiency and accuracy in long-video understanding for MLLMs.

Authors:Yihui Li, Chengxin Lv, Hongyu Yang, Di Huang
Title: Micro-macro Gaussian Splatting with Enhanced Scalability for Unconstrained Scene Reconstruction
Abstract:
Reconstructing 3D scenes from unconstrained image collections poses significant challenges due to variations in appearance. In this paper, we propose Scalable Micro-macro Wavelet-based Gaussian Splatting (SMW-GS), a novel method that enhances 3D reconstruction across diverse scales by decomposing scene representations into global, refined, and intrinsic components. SMW-GS incorporates the following innovations: Micro-macro Projection, which enables Gaussian points to sample multi-scale details with improved diversity; and Wavelet-based Sampling, which refines feature representations using frequency-domain information to better capture complex scene appearances. To achieve scalability, we further propose a large-scale scene promotion strategy, which optimally assigns camera views to scene partitions by maximizing their contributions to Gaussian points, achieving consistent and high-quality reconstructions even in expansive environments. Extensive experiments demonstrate that SMW-GS significantly outperforms existing methods in both reconstruction quality and scalability, particularly excelling in large-scale urban environments with challenging illumination variations. Project is available at https://github.com/Kidleyh/SMW-GS.
中文摘要:SMW-GS提出了一种创新的3D重建方法,通过微宏观投影和小波采样技术实现多尺度细节优化,在大规模城市场景中显著提升了复杂光照条件下的重建质量与扩展性。
English Summary: SMW-GS introduces a novel 3D reconstruction method using micro-macro projection and wavelet-based sampling to achieve superior multi-scale detail capture and scalability, particularly excelling in large-scale urban environments with challenging lighting conditions.

Authors:Zerui Gong, Zhonghua Wu, Qingyi Tao, Qinyue Li, Chen Change Loy
Title: SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer
Abstract:
Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUTs, which preserve structure but lack local adaptability. To bridge this gap, we propose Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability. SA-LUT features: (1) a Style-guided 4D LUT Generator that extracts multi-scale features from the style image to predict a 4D LUT, and (2) a Context Generator using content-style cross-attention to produce a context map. This context map enables spatially-adaptive adjustments, allowing our 4D LUT to apply precise color transformations while preserving structural integrity. To establish a rigorous evaluation framework for photorealistic style transfer, we introduce PST50, the first benchmark specifically designed for PST assessment. Experiments demonstrate that SA-LUT substantially outperforms state-of-the-art methods, achieving a 66.7% reduction in LPIPS score compared to 3D LUT approaches, while maintaining real-time performance at 16 FPS for video stylization. Our code and benchmark are available at https://github.com/Ry3nG/SA-LUT
中文摘要:提出的空间自适应4D查找表(SA-LUT)结合了LUT的高效性与神经网络的适应性,在保持实时性能的同时,实现了优于现有方法的超写实风格迁移效果。
English Summary: The proposed Spatial Adaptive 4D Look-Up Table (SA-LUT) combines LUT efficiency with neural network adaptability to achieve superior photorealistic style transfer, outperforming existing methods while maintaining real-time performance.
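To make the 4D-lookup idea concrete, here is a minimal sketch of querying a (context, R, G, B)-indexed table per pixel with multilinear interpolation, using scipy; the LUT and the context map would come from the paper's generators, and the function name and identity-LUT test are purely illustrative.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def apply_4d_lut(image, context, lut):
    """image: (H, W, 3) RGB in [0, 1]; context: (H, W) in [0, 1];
    lut: (C, D, D, D, 3) table indexed by (context, r, g, b)."""
    C, D = lut.shape[0], lut.shape[1]
    coords = np.stack([
        context.ravel() * (C - 1),
        image[..., 0].ravel() * (D - 1),
        image[..., 1].ravel() * (D - 1),
        image[..., 2].ravel() * (D - 1),
    ])
    out = np.stack([map_coordinates(lut[..., k], coords, order=1) for k in range(3)], axis=-1)
    return out.reshape(image.shape)

# Tiny usage example with an identity LUT along the RGB axes.
D, C = 17, 8
grid = np.linspace(0, 1, D)
r, g, b = np.meshgrid(grid, grid, grid, indexing="ij")
identity = np.stack([r, g, b], axis=-1)                    # (D, D, D, 3)
lut = np.broadcast_to(identity, (C, D, D, D, 3)).copy()
img = np.random.rand(64, 64, 3)
ctx = np.random.rand(64, 64)
assert np.allclose(apply_4d_lut(img, ctx, lut), img, atol=1e-6)
```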

Authors:Laiyan Ding, Hualie Jiang, Jiwei Chen, Rui Huang
Title: Self-Supervised Enhancement for Depth from a Lightweight ToF Sensor with Monocular Images
Abstract:
Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires groundtruth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as input, design a new depth consistency loss, propose a scale-recovery module, and finally obtain a large performance boost. Furthermore, since the ToF signal sparsity varies in real-world applications, we upgrade SelfToF to SelfToF* with submanifold convolution and guided feature fusion. Consequently, SelfToF* maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code is available at \href{https://github.com/denyingmxd/selftof}{https://github.com/denyingmxd/selftof}.
Chinese: SelfToF框架是一种自监督学习方法,通过融合高分辨率RGB图像来增强低分辨率深度图,无需真实深度数据即可实现鲁棒且尺度感知的深度估计。
English: The SelfToF framework is a self-supervised learning method that enhances low-resolution depth maps by integrating high-resolution RGB images, achieving robust and scale-aware depth estimation without requiring groundtruth depth data.
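As a generic illustration of recovering scale from a sparse ToF signal (not the SelfToF scale-recovery module itself), one common approach is to align the scale-ambiguous prediction to the metric ToF depth with a robust ratio over valid pixels:

```python
import numpy as np

def recover_scale(pred_depth, tof_depth, valid_mask):
    """pred_depth: (H, W) scale-ambiguous prediction; tof_depth: (H, W) sparse metric
    depth resampled to the same grid; valid_mask: (H, W) bool for valid ToF pixels."""
    ratios = tof_depth[valid_mask] / np.clip(pred_depth[valid_mask], 1e-6, None)
    scale = np.median(ratios)            # median ratio is robust to ToF outliers
    return scale * pred_depth, scale
```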

Authors:Kang Chen, Bin Huang, Xuebin Yang, Junyan Zhang, Yongbo Wang, Qiegen Liu
Title: PRO: Projection Domain Synthesis for CT Imaging
Abstract:
Synthetic CT projection data is crucial for advancing imaging research, yet its generation remains challenging. Current image domain methods are limited as they cannot simulate the physical acquisition process or utilize the complete statistical information present in projection data, restricting their utility and fidelity. In this work, we present PRO, a projection domain synthesis foundation model for CT imaging. To the best of our knowledge, this is the first study that performs CT synthesis in the projection domain. Unlike previous approaches that operate in the image domain, PRO learns rich structural representations from projection data and leverages anatomical text prompts for controllable synthesis. Projection data generation models can utilize complete measurement signals and simulate the physical processes of scanning, including material attenuation characteristics, beam hardening, scattering, and projection geometry, and support research on downstream imaging tasks. Moreover, PRO functions as a foundation model, capable of generalizing across diverse downstream tasks by adjusting its generative behavior via prompt inputs. Experimental results demonstrated that incorporating our synthesized data significantly improves performance across multiple downstream tasks, including low-dose and sparse-view reconstruction. These findings underscore the versatility and scalability of PRO in data generation for various CT applications. These results highlight the potential of projection domain synthesis as a powerful tool for data augmentation and robust CT imaging. Our source code is publicly available at: https://github.com/yqx7150/PRO.
中文: PRO是首个在投影域进行合成的CT成像基础模型,通过学习投影数据的结构表征并利用解剖文本提示实现可控合成,能有效模拟扫描物理过程并显著提升低剂量重建等下游任务的性能。
English: PRO is a pioneering projection domain synthesis foundation model for CT imaging that generates synthetic projection data by learning structural representations and utilizing anatomical text prompts, enabling controllable synthesis and improved performance in downstream tasks like low-dose reconstruction.

Authors:Xiang Yu, Yayan Chen, Guannan He, Qing Zeng, Yue Qin, Meiling Liang, Dandan Luo, Yimei Liao, Zeyu Ren, Cheng Kang, Delong Yang, Bocheng Liang, Bin Pu, Ying Yuan, Shengli Li
Title: Simple is what you need for efficient and accurate medical image segmentation
Abstract:
While modern segmentation models often prioritize performance over practicality, we advocate a design philosophy that prioritizes simplicity and efficiency, and we attempt to design a high-performance segmentation model accordingly. This paper presents SimpleUNet, a scalable ultra-lightweight medical image segmentation model with three key innovations: (1) A partial feature selection mechanism in skip connections for redundancy reduction while enhancing segmentation performance; (2) A fixed-width architecture that prevents exponential parameter growth across network stages; (3) An adaptive feature fusion module achieving enhanced representation with minimal computational overhead. With a record-breaking 16 KB parameter configuration, SimpleUNet outperforms LBUNet and other lightweight benchmarks across multiple public datasets. The 0.67 MB variant achieves superior efficiency (8.60 GFLOPs) and accuracy, attaining a mean DSC/IoU of 85.76%/75.60% on multi-center breast lesion datasets, surpassing both U-Net and TransUNet. Evaluations on skin lesion datasets (ISIC 2017/2018: mDice 84.86%/88.77%) and endoscopic polyp segmentation (KVASIR-SEG: 86.46%/76.48% mDice/mIoU) confirm consistent dominance over state-of-the-art models. This work demonstrates that extreme model compression need not compromise performance, providing new insights for efficient and accurate medical image segmentation. Codes can be found at https://github.com/Frankyu5666666/SimpleUNet.
中文: 本文提出SimpleUNet超轻量医学图像分割模型,通过特征选择机制、固定宽度结构和自适应融合模块三大创新,在仅16KB参数量下实现卓越性能,证明了极致模型压缩无需以精度为代价。
English: This paper introduces SimpleUNet, an ultra-lightweight medical image segmentation model that achieves state-of-the-art performance with minimal parameters through innovations in feature selection, fixed-width architecture, and adaptive fusion, demonstrating that extreme compression need not sacrifice accuracy.
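A minimal sketch of what a partial-feature-selection skip connection could look like, assuming the simplest possible selection rule (keep a fixed fraction of encoder channels before concatenation); the actual SimpleUNet mechanism may select channels differently.

```python
import torch
import torch.nn as nn

class PartialSkip(nn.Module):
    """Illustrative partial skip: forward only a fraction of encoder channels.
    Keeping the first `ratio` of channels is a placeholder selection rule."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        self.keep = max(1, int(channels * ratio))

    def forward(self, encoder_feat, decoder_feat):
        skip = encoder_feat[:, : self.keep]          # partial feature selection
        return torch.cat([decoder_feat, skip], dim=1)

x_enc = torch.randn(2, 64, 32, 32)
x_dec = torch.randn(2, 32, 32, 32)
print(PartialSkip(64)(x_enc, x_dec).shape)  # torch.Size([2, 48, 32, 32])
```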

Authors:Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
Title: TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Scale-Oriented Contrast
Abstract:
This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downstream applications. To this end, we aim to build up a framework to solve scale uncertainty and transfer relative depth to metric depth. Previous methods used language as input and estimated two factors for conducting rescaling. Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level. Features from two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning to utilize depth distribution as guidance to enforce the model learning about intrinsic knowledge aligning with the scale distribution. TR2M only exploits a small number of trainable parameters to train on datasets in various domains and experiments not only demonstrate TR2M's great performance in seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the huge potential in pixel-wise transferring relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
中文摘要:本研究提出TR2M框架,通过融合文本和图像输入,在像素级别将相对深度转换为度量深度,在多个数据集上表现出优异性能,并展现出卓越的零样本泛化能力。
English Summary: This study introduces TR2M, a novel framework that leverages text and image inputs to convert relative depth into metric depth at the pixel level, demonstrating strong performance across diverse datasets and superior zero-shot generalization capabilities.
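Read literally, "two rescale maps ... at pixel level" suggests a per-pixel affine transfer from relative to metric depth; the sketch below assumes that affine form, with the two maps predicted elsewhere from the fused image-text features, and is not the exact TR2M definition.

```python
import torch

def to_metric(relative_depth, scale_map, shift_map):
    """Pixel-wise transfer of relative depth to metric depth with two rescale maps.
    All tensors: (B, 1, H, W). The affine form (scale * d + shift) is an assumption."""
    return scale_map * relative_depth + shift_map

rel = torch.rand(1, 1, 240, 320)
scale = torch.full_like(rel, 3.0)   # would be predicted from image + text features
shift = torch.full_like(rel, 0.5)
metric = to_metric(rel, scale, shift)
```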

Authors:Yan Chen, Hanlin Shang, Ce Liu, Yuxuan Chen, Hui Li, Weihao Yuan, Hao Zhu, Zilong Dong, Siyu Zhu
Title: DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration
Abstract:
Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting intensive image priors, pretrained on high-quality images, to video restoration while addressing the critical challenge of flicker artifacts. The source code has been open-sourced and is available at https://github.com/fudan-generative-vision/DicFace.
中文: 本文提出一种新颖的视频人脸修复方法,通过变分潜在建模将VQ-VAE扩展至视频领域,采用狄利克雷连续表示和时空Transformer架构实现时序一致性和高质量细节恢复,在多项任务中展现出领先性能。
English: This paper introduces a novel video face restoration method that extends VQ-VAEs to video through variational latent modeling, using a Dirichlet-based continuous representation and spatio-temporal Transformer to achieve temporal consistency and high-quality detail recovery, demonstrating state-of-the-art performance across multiple tasks.

Authors:Mae Younes, Adnane Boukhayma
Title: TextureSplat: Per-Primitive Texture Mapping for Reflective Gaussian Splatting
Abstract:
Gaussian Splatting has demonstrated remarkable novel view synthesis performance at high rendering frame rates. Optimization-based inverse rendering within complex capture scenarios, however, remains a challenging problem. A particular case is modelling complex surface light interactions for highly reflective scenes, which results in intricate high-frequency specular radiance components. We hypothesize that such challenging settings can benefit from increased representation power. We hence propose a method that tackles this issue through a geometrically and physically grounded Gaussian Splatting borne radiance field, where normals and material properties are spatially variable in the primitive's local space. Using per-primitive texture maps for this purpose, we also propose to harness the GPU hardware to accelerate rendering at test time via a unified material texture atlas.
Chinese: 高斯泼溅技术在新视角合成中实现了高速渲染,但在处理反射场景的反向渲染时存在挑战,因此提出一种方法,通过高斯辐射场中局部空间的可变法线和材质属性来增强表示能力。
English: Gaussian Splatting achieves high-speed rendering for novel view synthesis but struggles with inverse rendering in reflective scenes, leading to a proposed method that enhances representation through localized normals and material properties in a Gaussian-based radiance field.

Authors:Bo Pan, Yixiao Fu, Ke Wang, Junyu Lu, Lunke Pan, Ziyang Qian, Yuhan Chen, Guoliang Wang, Yitao Zhou, Li Zheng, Yinghao Tang, Zhen Wen, Yuchen Wu, Junhua Lu, Biao Zhu, Minfeng Zhu, Bo Zhang, Wei Chen
Title: VIS-Shepherd: Constructing Critic for LLM-based Data Visualization Generation
Abstract:
Data visualization generation using Large Language Models (LLMs) has shown promising results but often produces suboptimal visualizations that require human intervention for improvement. In this work, we introduce VIS-Shepherd, a specialized Multimodal Large Language Model (MLLM)-based critic to evaluate and provide feedback for LLM-generated data visualizations. At the core of our approach is a framework to construct a high-quality visualization critique dataset, where we collect human-created visualization instances, synthesize corresponding LLM-generated instances, and construct high-quality critiques. We conduct both model-based automatic evaluation and human preference studies to evaluate the effectiveness of our approach. Our experiments show that even small (7B parameters) open-source MLLM models achieve substantial performance gains by leveraging our high-quality visualization critique dataset, reaching levels comparable to much larger open-source or even proprietary models. Our work demonstrates significant potential for MLLM-based automated visualization critique and indicates promising directions for enhancing LLM-based data visualization generation. Our project page: https://github.com/bopan3/VIS-Shepherd.
中文: VIS-Shepherd提出了一种基于多模态大语言模型的专门评估器,通过高质量的可视化评述数据集来改进大语言模型生成的数据可视化效果,使得较小模型也能达到与大型模型相当的性能水平。
English: VIS-Shepherd introduces a specialized MLLM-based critic that evaluates and improves LLM-generated data visualizations through high-quality critique datasets, enabling smaller models to achieve performance comparable to larger ones.

Authors:Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He
Title: Action Dubber: Timing Audible Actions via Inflectional Flow
Abstract:
We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization accuracy and simultaneously identifies sound sources within video frames. To support this task, we introduce a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 by removing non-essential vocalization subsets. Extensive experiments confirm the effectiveness of our approach on $Audible623$ and show strong generalizability to other domains, such as repetitive counting and sound source localization. Code and dataset are available at https://github.com/WenlongWan/Audible623.
中文: 本文提出可听动作时序定位任务,旨在识别可听动作的时空坐标,并开发了TA²Net架构,通过运动二阶导数无需音频输入即可检测碰撞时刻,在新基准数据集Audible623上验证了其有效性。
English: This paper introduces Audible Action Temporal Localization, a task identifying spatio-temporal coordinates of audible movements, and proposes TA²Net, a novel architecture using motion's second derivative to detect collision timings without audio input, validated on the new Audible623 benchmark dataset.
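The core timing cue, the second derivative of motion, can be illustrated with a few lines of numpy on a per-frame motion-magnitude signal; the real model estimates an inflectional flow field rather than this 1-D toy signal.

```python
import numpy as np

def collision_candidates(motion, top_k=5):
    """motion: (T,) per-frame motion magnitude (e.g. mean optical-flow norm).
    Returns frame indices with the largest |second derivative| of motion,
    i.e. the abrupt, "inflectional" changes described above."""
    accel = np.gradient(np.gradient(motion))     # discrete second derivative
    return np.argsort(-np.abs(accel))[:top_k]

motion = np.sin(np.linspace(0, 4 * np.pi, 120)) + 0.05 * np.random.randn(120)
print(collision_candidates(motion))
```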

Authors:Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe
Title: SeqPE: Transformer with Sequential Position Encoding
Abstract:
Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE, mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy--particularly under context length extrapolation--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.
中文摘要:SeqPE提出了一种统一且完全可学习的位置编码框架,通过符号序列表示位置索引并采用对比学习与知识蒸馏目标,显著提升了外推能力和多模态适应性。
English Summary: SeqPE introduces a unified, learnable position encoding framework that uses symbolic sequences and complementary objectives to enhance extrapolation and adaptability across various tasks and modalities.
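A toy sketch of the core idea, writing an n-dimensional position index as a symbolic digit sequence and encoding it with a small sequential model; the tokenization scheme, encoder choice (a GRU here), and sizes are assumptions for illustration, not the released SeqPE configuration.

```python
import torch
import torch.nn as nn

class TinySeqPE(nn.Module):
    """Illustrative sequential position encoder: an n-D index is written as a digit
    sequence (one token per digit, plus a separator per dimension) and embedded
    with a small GRU."""
    def __init__(self, dim=64, base=10):
        super().__init__()
        self.base = base
        self.tok = nn.Embedding(base + 1, dim)     # digits 0..9 plus a separator token
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def tokenize(self, position, width=4):
        tokens = []
        for coord in position:                      # e.g. (row, col) for 2-D inputs
            digits = [(coord // self.base**i) % self.base for i in reversed(range(width))]
            tokens.extend(digits + [self.base])     # separator after each coordinate
        return torch.tensor(tokens)

    def forward(self, position):
        seq = self.tok(self.tokenize(position)).unsqueeze(0)   # (1, L, dim)
        _, h = self.rnn(seq)
        return h.squeeze(0).squeeze(0)              # (dim,) position embedding

pe = TinySeqPE()
print(pe((3, 127)).shape)   # torch.Size([64])
```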

Authors:Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, Diange Yang
Title: COME: Adding Scene-Centric Forecasting Control to Occupancy World Model
Abstract:
World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolvement (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego-motion by leveraging the scene-centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene-centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego-irrelevant, spatially consistent future features through a scene-centric prediction branch, which are then converted into scene condition using a tailored ControlNet. These condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the nuScenes-Occ3D dataset show that COME achieves consistent and significant improvements over state-of-the-art (SOTA) methods across diverse configurations, including different input sources (ground-truth, camera-based, fusion-based occupancy) and prediction horizons (3s and 8s). For example, under the same settings, COME achieves 26.3% better mIoU metric than DOME and 23.7% better mIoU metric than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio-temporal prediction fidelity for world models. Code and videos will be available at https://github.com/synsin0/COME.
中文: COME框架通过场景中心化预测将自车运动与环境变化解耦,在nuScenes-Occ3D数据集上相比现有最优方法实现了显著性能提升,证明了分离表征学习对提升世界模型预测精度的有效性。
English: The COME framework enhances autonomous driving world models by disentangling ego-motion from scene dynamics through scene-centric forecasting, achieving significant performance improvements over state-of-the-art methods on the nuScenes-Occ3D dataset.

Authors:Jiashu Dai, Along Wang, Binfan Ni, Tao Cao
Title: High-Quality Facial Albedo Generation for 3D Face Reconstruction from a Single Image using a Coarse-to-Fine Approach
Abstract:
Facial texture generation is crucial for high-fidelity 3D face reconstruction from a single image. However, existing methods struggle to generate UV albedo maps with high-frequency details. To address this challenge, we propose a novel end-to-end coarse-to-fine approach for UV albedo map generation. Our method first utilizes a UV Albedo Parametric Model (UVAPM), driven by low-dimensional coefficients, to generate coarse albedo maps with skin tones and low-frequency texture details. To capture high-frequency details, we train a detail generator using a decoupled albedo map dataset, producing high-resolution albedo maps. Extensive experiments demonstrate that our method can generate high-fidelity textures from a single image, outperforming existing methods in terms of texture quality and realism. The code and pre-trained model are publicly available at https://github.com/MVIC-DAI/UVAPM, facilitating reproducibility and further research.
中文: 本文提出一种新颖的由粗到精方法,先通过参数化模型生成基础面部纹理,再借助细节生成器增强高频细节,在单图像3D人脸重建中实现了卓越的纹理质量和真实感。
English: This paper introduces a novel coarse-to-fine approach that first generates basic facial textures using a parametric model and then enhances them with high-frequency details through a detail generator, achieving superior texture quality and realism in single-image 3D face reconstruction.

Authors:Ting Qiao, Yiming Li, Jianbin Li, Yingjia Wang, Leyi Qi, Junfeng Guo, Ruili Feng, Dacheng Tao
Title: CertDW: Towards Certified Dataset Ownership Verification via Conformal Prediction
Abstract:
Deep neural networks (DNNs) rely heavily on high-quality open-source datasets (e.g., ImageNet) for their success, making dataset ownership verification (DOV) crucial for protecting public dataset copyrights. In this paper, we find existing DOV methods (implicitly) assume that the verification process is faithful, where the suspicious model will directly verify ownership by using the verification samples as input and returning their results. However, this assumption may not necessarily hold in practice and their performance may degrade sharply when subjected to intentional or unintentional perturbations. To address this limitation, we propose the first certified dataset watermark (i.e., CertDW) and CertDW-based certified dataset ownership verification method that ensures reliable verification even under malicious attacks, under certain conditions (e.g., constrained pixel-level perturbation). Specifically, inspired by conformal prediction, we introduce two statistical measures, including principal probability (PP) and watermark robustness (WR), to assess model prediction stability on benign and watermarked samples under noise perturbations. We prove there exists a provable lower bound between PP and WR, enabling ownership verification when a suspicious model's WR value significantly exceeds the PP values of multiple benign models trained on watermark-free datasets. If the number of PP values smaller than WR exceeds a threshold, the suspicious model is regarded as having been trained on the protected dataset. Extensive experiments on benchmark datasets verify the effectiveness of our CertDW method and its resistance to potential adaptive attacks. Our codes are at \href{https://github.com/NcepuQiaoTing/CertDW}{GitHub}.
中文: 本文提出CertDW认证数据集水印方法,通过统计指标评估模型预测稳定性,在恶意攻击下仍能实现可靠的数据集所有权验证,并提供可证明的保障。
English: This paper introduces CertDW, a certified dataset watermarking method that ensures reliable dataset ownership verification under malicious attacks by leveraging statistical measures to assess model prediction stability and providing provable guarantees.
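The verification decision itself, as stated in the abstract, is a simple counting rule over the suspicious model's WR and the benign models' PP values; the sketch below shows only that rule, with toy numbers, and leaves the PP/WR estimation to the paper.

```python
def certdw_verify(wr_suspicious, pp_benign, threshold):
    """Count how many benign models' principal probabilities (PP) fall below the
    suspicious model's watermark robustness (WR); flag ownership if that count
    exceeds the threshold."""
    count = sum(pp < wr_suspicious for pp in pp_benign)
    return count > threshold

# Toy numbers only, to illustrate the comparison.
print(certdw_verify(0.82, [0.55, 0.61, 0.58, 0.90, 0.64], threshold=3))  # True
```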

Authors:Jinguang Tong, Xuesong li, Fahira Afzal Maken, Sundaram Muthu, Lars Petersson, Chuong Nguyen, Hongdong Li
Title: GS-2DGS: Geometrically Supervised 2DGS for Reflective Object Reconstruction
Abstract:
3D modeling of highly reflective objects remains challenging due to strong view-dependent appearances. While previous SDF-based methods can recover high-quality meshes, they are often time-consuming and tend to produce over-smoothed surfaces. In contrast, 3D Gaussian Splatting (3DGS) offers the advantage of high speed and detailed real-time rendering, but extracting surfaces from the Gaussians can be noisy due to the lack of geometric constraints. To bridge the gap between these approaches, we propose a novel reconstruction method called GS-2DGS for reflective objects based on 2D Gaussian Splatting (2DGS). Our approach combines the rapid rendering capabilities of Gaussian Splatting with additional geometric information from foundation models. Experimental results on synthetic and real datasets demonstrate that our method significantly outperforms Gaussian-based techniques in terms of reconstruction and relighting and achieves performance comparable to SDF-based methods while being an order of magnitude faster. Code is available at https://github.com/hirotong/GS2DGS
中文摘要:本文提出GS-2DGS新方法,结合高斯泼溅的快速渲染与基础模型的几何约束,实现了对高反光物体的高质量高效三维重建,在速度和细节上均优于现有技术。
English Summary: This paper introduces GS-2DGS, a novel method that combines Gaussian Splatting's fast rendering with geometric constraints from foundation models to achieve high-quality, efficient 3D reconstruction of reflective objects, outperforming existing techniques in speed and detail.

Authors:Shahram Najam Syed, Ishir Roongta, Kavin Ravie, Gangadhar Nageswar
Title: SuperPoint-SLAM3: Augmenting ORB-SLAM3 with Deep Features, Adaptive NMS, and Learning-Based Loop Closure
Abstract:
Visual simultaneous localization and mapping (SLAM) must remain accurate under extreme viewpoint, scale and illumination variations. The widely adopted ORB-SLAM3 falters in these regimes because it relies on hand-crafted ORB keypoints. We introduce SuperPoint-SLAM3, a drop-in upgrade that (i) replaces ORB with the self-supervised SuperPoint detector--descriptor, (ii) enforces spatially uniform keypoints via adaptive non-maximal suppression (ANMS), and (iii) integrates a lightweight NetVLAD place-recognition head for learning-based loop closure. On the KITTI Odometry benchmark SuperPoint-SLAM3 reduces mean translational error from 4.15% to 0.34% and mean rotational error from 0.0027 deg/m to 0.0010 deg/m. On the EuRoC MAV dataset it roughly halves both errors across every sequence (e.g., V2\_03: 1.58% -> 0.79%). These gains confirm that fusing modern deep features with a learned loop-closure module markedly improves ORB-SLAM3 accuracy while preserving its real-time operation. Implementation, pretrained weights and reproducibility scripts are available at https://github.com/shahram95/SuperPointSLAM3.
中文:SuperPoint-SLAM3通过融合SuperPoint检测器、自适应非极大值抑制和NetVLAD闭环模块,显著降低了在KITTI和EuRoC数据集上的定位误差,同时保持了ORB-SLAM3的实时运行特性。
English: SuperPoint-SLAM3 enhances ORB-SLAM3 by integrating the SuperPoint detector, adaptive non-maximal suppression, and a NetVLAD loop-closure module, significantly reducing localization errors on KITTI and EuRoC benchmarks while maintaining real-time performance.
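For intuition about the "spatially uniform keypoints" step, below is a simple greedy radius-based non-maximal suppression over detector scores; ANMS variants differ in how the suppression radius is chosen, so treat this as a stand-in rather than the exact scheme used in SuperPoint-SLAM3.

```python
import numpy as np

def greedy_anms(points, scores, radius, max_points=500):
    """points: (N, 2) pixel coordinates; scores: (N,) detector responses.
    Keep keypoints in decreasing score order, rejecting any closer than
    `radius` to an already kept keypoint."""
    order = np.argsort(-scores)
    kept = []
    for idx in order:
        p = points[idx]
        if all(np.linalg.norm(p - points[j]) >= radius for j in kept):
            kept.append(idx)
            if len(kept) == max_points:
                break
    return np.array(kept)

pts = np.random.rand(2000, 2) * np.array([640, 480])
scr = np.random.rand(2000)
print(len(greedy_anms(pts, scr, radius=20.0)))
```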

Authors:Xuhui Zhu, Jing Xu, Bingjie Wang, Huikang Dai, Hao Lu
Title: Video Individual Counting With Implicit One-to-Many Matching
Abstract:
Video Individual Counting (VIC) is a recently introduced task that aims to estimate pedestrian flux from a video. It extends conventional Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC that only learns to count repeated pedestrian patterns across frames, the key problem of VIC is how to identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, mainly follow a one-to-one (O2O) matching strategy where the same pedestrian must be exactly matched between frames, leading to sensitivity to appearance variations or missing detections. In this work, we show that the O2O matching could be relaxed to a one-to-many (O2M) matching problem, which better fits the problem nature of VIC and can leverage the social grouping behavior of walking pedestrians. We therefore introduce OMAN, a simple but effective VIC model with implicit One-to-Many mAtchiNg, featuring an implicit context generator and a one-to-many pairwise matcher. Experiments on the SenseCrowd and CroHD benchmarks show that OMAN achieves the state-of-the-art performance. Code is available at \href{https://github.com/tiny-smart/OMAN}{OMAN}.
中文: 视频个体计数(VIC)通过解决跨帧识别行人的对应问题扩展了视频人群计数,所提出的OMAN模型采用一对多匹配策略,在基准测试中实现了最先进的性能。
English: Video Individual Counting (VIC) extends Video Crowd Counting by addressing the correspondence problem of identifying pedestrians across frames, and the proposed OMAN model with one-to-many matching achieves state-of-the-art performance on benchmarks.

Authors:Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma
Title: Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning
Abstract:
Recent advancements in large language models (LLMs) have witnessed a surge in the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods solely employing reinforcement learning (RL) can struggle with sample inefficiency and activating entirely absent reasoning capabilities, while conventional pipelines that initiate with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and face suboptimal convergence. In this work, we introduce \textbf{Metis-RISE} (\textbf{R}L \textbf{I}ncentivizes and \textbf{S}FT \textbf{E}nhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE distinctively omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. Subsequently, the targeted SFT stage addresses two key challenges identified during RL: (1) \textit{inefficient trajectory sampling} for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) \textit{fundamental capability absence}, which we address by injecting expert-augmented knowledge for prompts where the model entirely fails. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall. Please refer to our project page for open-source information.
Chinese: 本文提出Metis-RISE多模态推理模型,该方法先通过强化学习激发模型推理能力,再采用监督微调解决采样效率低和知识缺失问题,在评测中取得了同类模型的最佳性能。
English: This paper introduces Metis-RISE, a novel multimodal reasoning model that starts with reinforcement learning to activate reasoning capabilities before applying supervised fine-tuning to address inefficiencies and knowledge gaps, achieving state-of-the-art performance on benchmarks.

Authors:Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban
Title: Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning
Abstract:
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision--language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
中文摘要:本研究提出一个多尺度多晶体数据集及两项基于物理原理的评估基准,通过空间排除和成分排除协议,系统测试多模态基础模型在晶体学推理中的泛化能力、物理一致性及可靠性。
English Summary: This study introduces a multiscale multicrystal dataset with two physics-grounded benchmarks to evaluate multimodal foundation models' crystallographic reasoning by testing their generalization, physical consistency, and reliability through structured exclusion protocols.

Authors:Adhrith Vutukuri, Akash Awasthi, David Yang, Carol C. Wu, Hien Van Nguyen
Title: Beyond the First Read: AI-Assisted Perceptual Error Detection in Chest Radiography Accounting for Interobserver Variability
Abstract:
Chest radiography is widely used in diagnostic imaging. However, perceptual errors -- especially overlooked but visible abnormalities -- remain common and clinically significant. Current workflows and AI systems provide limited support for detecting such errors after interpretation and often lack meaningful human--AI collaboration. We introduce RADAR (Radiologist--AI Diagnostic Assistance and Review), a post-interpretation companion system. RADAR ingests finalized radiologist annotations and CXR images, then performs regional-level analysis to detect and refer potentially missed abnormal regions. The system supports a "second-look" workflow and offers suggested regions of interest (ROIs) rather than fixed labels to accommodate inter-observer variation. We evaluated RADAR on a simulated perceptual-error dataset derived from de-identified CXR cases, using F1 score and Intersection over Union (IoU) as primary metrics. RADAR achieved a recall of 0.78, precision of 0.44, and an F1 score of 0.56 in detecting missed abnormalities in the simulated perceptual-error dataset. Although precision is moderate, this reduces over-reliance on AI by encouraging radiologist oversight in human--AI collaboration. The median IoU was 0.78, with more than 90% of referrals exceeding 0.5 IoU, indicating accurate regional localization. RADAR effectively complements radiologist judgment, providing valuable post-read support for perceptual-error detection in CXR interpretation. Its flexible ROI suggestions and non-intrusive integration position it as a promising tool in real-world radiology workflows. To facilitate reproducibility and further evaluation, we release a fully open-source web implementation alongside a simulated error dataset. All code, data, demonstration videos, and the application are publicly available at https://github.com/avutukuri01/RADAR.
中文摘要:RADAR是一种放射科医生解读后的辅助系统,通过区域分析和兴趣区建议来检测胸部X光片中可能被忽略的异常区域,在模拟错误检测中实现了0.78的召回率和0.56的F1分数,有效促进了人机协作。
English Summary: RADAR is a post-interpretation AI system that assists radiologists by detecting potentially missed abnormalities in chest X-rays through regional analysis and ROI suggestions, achieving effective human-AI collaboration with 0.78 recall and 0.56 F1 score in simulated error detection.
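The two evaluation metrics quoted above are standard; for completeness, here is a toy re-implementation of box IoU and F1, with illustrative counts chosen so the resulting precision/recall/F1 roughly match the reported 0.44/0.78/0.56.

```python
def box_iou(a, b):
    """a, b: (x1, y1, x2, y2). Intersection-over-Union between two regions."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(box_iou((10, 10, 50, 50), (20, 20, 60, 60)))   # ~0.39
print(f1(tp=39, fp=50, fn=11))                        # ~0.56 (precision ~0.44, recall 0.78)
```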

Authors:Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Title: Continual Learning for Generative AI: From LLMs to MLLMs and Beyond
Abstract:
The rapid advancement of generative models has empowered modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models are fundamentally constrained by \emph{catastrophic forgetting}, i.e., a persistent challenge where models experience performance degradation on previously learned tasks when adapting to new tasks. To address this practical limitation, numerous approaches have been proposed to enhance the adaptability and scalability of generative AI in real-world applications. In this work, we present a comprehensive survey of continual learning methods for mainstream generative AI models, encompassing large language models, multimodal large language models, vision-language-action models, and diffusion models. Drawing inspiration from the memory mechanisms of the human brain, we systematically categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based methods, while elucidating their underlying methodologies and motivations. We further analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones, thereby providing deeper insights into the field. The project page of this paper is available at https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models.
中文: 本综述系统梳理了生成式AI模型的持续学习方法,通过借鉴人类记忆机制提出三种解决灾难性遗忘的范式,并深入分析了不同模型的训练设置与评估基准。
English: This survey systematically categorizes and analyzes continual learning methods for generative AI models to address catastrophic forgetting, drawing inspiration from human memory mechanisms and examining various training setups and benchmarks.

Authors:Christian Hilaire, Sima Didari
Title: ViewPCL: a point cloud based active learning method for multi-view segmentation
Abstract:
We propose a novel active learning framework for multi-view semantic segmentation. This framework relies on a new score that measures the discrepancy between point cloud distributions generated from the extra geometrical information derived from the model's prediction across different views. Our approach results in a data efficient and explainable active learning method. The source code is available at https://github.com/chilai235/viewpclAL.
Chinese: 本文提出了一种新颖的多视角语义分割主动学习框架,通过基于模型预测生成的点云分布差异评分,实现了数据高效且可解释的学习方法。
English: This paper introduces a novel active learning framework for multi-view semantic segmentation that utilizes a discrepancy score based on point cloud distributions from model predictions to achieve data efficiency and explainability.

Authors:Xinyi Zhao, Congjing Zhang, Pei Guo, Wei Li, Lin Chen, Chaoyue Zhao, Shuai Huang
Title: SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models
Abstract:
Video anomaly detection (VAD) is essential for enhancing safety and security by identifying unusual events across different environments. Existing VAD benchmarks, however, are primarily designed for general-purpose scenarios, neglecting the specific characteristics of smart home applications. To bridge this gap, we introduce SmartHome-Bench, the first comprehensive benchmark specially designed for evaluating VAD in smart home scenarios, focusing on the capabilities of multi-modal large language models (MLLMs). Our newly proposed benchmark consists of 1,203 videos recorded by smart home cameras, organized according to a novel anomaly taxonomy that includes seven categories, such as Wildlife, Senior Care, and Baby Monitoring. Each video is meticulously annotated with anomaly tags, detailed descriptions, and reasoning. We further investigate adaptation methods for MLLMs in VAD, assessing state-of-the-art closed-source and open-source models with various prompting techniques. Results reveal significant limitations in the current models' ability to detect video anomalies accurately. To address these limitations, we introduce the Taxonomy-Driven Reflective LLM Chain (TRLC), a new LLM chaining framework that achieves a notable 11.62% improvement in detection accuracy. The benchmark dataset and code are publicly available at https://github.com/Xinyi-0724/SmartHome-Bench-LLM.
中文摘要:本文提出了首个专为智能家居场景设计的视频异常检测基准SmartHome-Bench,并开发了新型TRLC框架,将检测准确率显著提升了11.62%。
English Summary: This paper introduces SmartHome-Bench, the first specialized benchmark for video anomaly detection in smart home environments, and proposes a novel framework called TRLC that significantly improves detection accuracy by 11.62%.

Authors:Xiaoya Tang, Bodong Zhang, Man Minh Ho, Beatrice S. Knudsen, Tolga Tasdizen
Title: DuoFormer: Leveraging Hierarchical Representations by Local and Global Attention Vision Transformer
Abstract:
Despite the widespread adoption of transformers in medical applications, the exploration of multi-scale learning through transformers remains limited, while hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process, preserving the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures intra-scale and inter-scale associations. This mechanism complements patch-wise attention by enhancing spatial understanding and preserving global perception, which we refer to as local and global attention, respectively. Our model significantly outperforms baseline models in terms of classification accuracy, demonstrating its efficiency in bridging the gap between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
中文: 本研究提出了一种新颖的分层Transformer模型,融合了CNN的特征提取能力和ViT的表征优势,通过多尺度学习和注意力机制显著提升了医学图像分类的准确性。
English: This study introduces a novel hierarchical transformer model that combines CNNs' feature extraction with ViTs' representational power, enhancing medical image classification accuracy through multi-scale learning and attention mechanisms.

Authors:Hyunjin Kim, Haebeom Jung, Jaesik Park
Title: Metropolis-Hastings Sampling for 3D Gaussian Reconstruction
Abstract:
We propose an adaptive sampling framework for 3D Gaussian Splatting (3DGS) that leverages comprehensive multi-view photometric error signals within a unified Metropolis-Hastings approach. Traditional 3DGS methods heavily rely on heuristic-based density-control mechanisms (e.g., cloning, splitting, and pruning), which can lead to redundant computations or the premature removal of beneficial Gaussians. Our framework overcomes these limitations by reformulating densification and pruning as a probabilistic sampling process, dynamically inserting and relocating Gaussians based on aggregated multi-view errors and opacity scores. Guided by Bayesian acceptance tests derived from these error-based importance scores, our method substantially reduces reliance on heuristics, offers greater flexibility, and adaptively infers Gaussian distributions without requiring predefined scene complexity. Experiments on benchmark datasets, including Mip-NeRF360, Tanks and Temples, and Deep Blending, show that our approach reduces the number of Gaussians needed, enhancing computational efficiency while matching or modestly surpassing the view-synthesis quality of state-of-the-art models.
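To ground the sampling view of densification, the sketch below shows the generic Metropolis-Hastings acceptance rule applied to non-negative importance scores, plus a 1-D toy loop; the paper's actual proposals operate on Gaussian primitives with multi-view error and opacity scores, so this is only the bare acceptance mechanics.

```python
import numpy as np

def mh_accept(score_current, score_proposed, rng):
    """Standard Metropolis-Hastings acceptance test for a symmetric proposal on
    non-negative importance scores (here a stand-in for error-based scores)."""
    ratio = score_proposed / max(score_current, 1e-12)
    return rng.random() < min(1.0, ratio)

# Toy loop: relocate a 1-D "Gaussian" toward higher-error regions.
rng = np.random.default_rng(0)
error = lambda x: np.exp(-(x - 2.0) ** 2)      # stand-in importance score
x = 0.0
for _ in range(1000):
    proposal = x + rng.normal(scale=0.5)
    if mh_accept(error(x), error(proposal), rng):
        x = proposal
print(round(x, 2))   # drifts toward the high-error region near 2.0
```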

Authors:Jie Liu, Zheng Hui
Title: EraserDiT: Fast Video Inpainting with Diffusion Transformer Model
Abstract:
Video object removal and inpainting are critical tasks in the fields of computer vision and multimedia processing, aimed at restoring missing or corrupted regions in video sequences. Traditional methods predominantly rely on flow-based propagation and spatio-temporal Transformers, but these approaches face limitations in effectively leveraging long-term temporal features and ensuring temporal consistency in the completion results, particularly when dealing with large masks. Consequently, performance on extensive masked areas remains suboptimal. To address these challenges, this paper introduces a novel video inpainting approach leveraging the Diffusion Transformer (DiT). DiT synergistically combines the advantages of diffusion models and transformer architectures to maintain long-term temporal consistency while ensuring high-quality inpainting results. We propose a Circular Position-Shift strategy to further enhance long-term temporal consistency during the inference stage. Additionally, the proposed method interactively removes specified objects, and generates corresponding prompts. In terms of processing speed, it takes only 65 seconds (testing on one NVIDIA H800 GPU) to complete a video with a resolution of $2160 \times 2100$ with 97 frames without any acceleration method. Experimental results indicate that the proposed method demonstrates superior performance in content fidelity, texture restoration, and temporal consistency. Project page:https://jieliu95.github.io/EraserDiT_demo/

Authors:Hao Xu, Lechao Cheng, Yaxiong Wang, Shengeng Tang, Zhun Zhong
Title: Towards Fine-Grained Emotion Understanding via Skeleton-Based Micro-Gesture Recognition
Abstract:
We present our solution to the MiGA Challenge at IJCAI 2025, which aims to recognize micro-gestures (MGs) from skeleton sequences for the purpose of hidden emotion understanding. MGs are characterized by their subtlety, short duration, and low motion amplitude, making them particularly challenging to model and classify. We adopt PoseC3D as the baseline framework and introduce three key enhancements: (1) a topology-aware skeleton representation specifically designed for the iMiGUE dataset to better capture fine-grained motion patterns; (2) an improved temporal processing strategy that facilitates smoother and more temporally consistent motion modeling; and (3) the incorporation of semantic label embeddings as auxiliary supervision to improve the model generalization. Our method achieves a Top-1 accuracy of 67.01\% on the iMiGUE test set. As a result of these contributions, our approach ranks third on the official MiGA Challenge leaderboard. The source code is available at \href{https://github.com/EGO-False-Sleep/Miga25_track1}{https://github.com/EGO-False-Sleep/Miga25\_track1}.
中文: 我们通过改进PoseC3D框架,引入拓扑感知骨骼表征、优化时序处理和语义标签嵌入三项关键技术,在微表情识别任务中取得67.01%的Top-1准确率,荣获MiGA挑战赛第三名。
English: We propose an enhanced PoseC3D framework with three key improvements—topology-aware skeleton representation, refined temporal processing, and semantic label embeddings—achieving 67.01% Top-1 accuracy and third place in the MiGA Challenge for micro-gesture-based emotion recognition.

Authors:Mustansar Fiaz, Mubashir Noman, Hiyam Debary, Kamran Ali, Hisham Cholakkal
Title: HyRet-Change: A hybrid retentive network for remote sensing change detection
Abstract:
Recently, convolution- and transformer-based change detection (CD) methods have provided promising performance. However, it remains unclear how the local and global dependencies interact to effectively alleviate the pseudo changes. Moreover, directly utilizing standard self-attention presents intrinsic limitations, including global feature representations that are limited in capturing subtle changes, quadratic complexity, and restricted training parallelism. To address these limitations, we propose a Siamese-based framework, called HyRet-Change, which can seamlessly integrate the merits of convolution and retention mechanisms at multi-scale features to preserve critical information and enhance adaptability in complex scenes. Specifically, we introduce a novel feature difference module to exploit both convolutions and multi-head retention mechanisms in a parallel manner to capture complementary information. Furthermore, we propose an adaptive local-global interactive context awareness mechanism that enables mutual learning and enhances discrimination capability through information exchange. We perform experiments on three challenging CD datasets and achieve state-of-the-art performance compared to existing methods. Our source code is publicly available at https://github.com/mustansarfiaz/HyRect-Change.
Chinese: HyRet-Change框架通过整合卷积和保留机制捕捉多尺度特征,并采用自适应局部-全局交互模块增强变化检测能力,在多个挑战性数据集上取得了领先性能。
English: The HyRet-Change framework integrates convolution and retention mechanisms to capture multi-scale features and uses an adaptive local-global interaction module to enhance change detection, achieving state-of-the-art results on challenging datasets.

Authors:Chenglin Wang, Yucheng Zhou, Qianning Wang, Zhe Wang, Kai Zhang
Title: ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies
Abstract:
Text-driven image editing has achieved remarkable success in following single instructions. However, real-world scenarios often involve complex, multi-step instructions, particularly ``chain'' instructions where operations are interdependent. Current models struggle with these intricate directives, and existing benchmarks inadequately evaluate such capabilities. Specifically, they often overlook multi-instruction and chain-instruction complexities, and common consistency metrics are flawed. To address this, we introduce ComplexBench-Edit, a novel benchmark designed to systematically assess model performance on complex, multi-instruction, and chain-dependent image editing tasks. ComplexBench-Edit also features a new vision consistency evaluation method that accurately assesses non-modified regions by excluding edited areas. Furthermore, we propose a simple yet powerful Chain-of-Thought (CoT)-based approach that significantly enhances the ability of existing models to follow complex instructions. Our extensive experiments demonstrate ComplexBench-Edit's efficacy in differentiating model capabilities and highlight the superior performance of our CoT-based method in handling complex edits. The data and code are released at https://github.com/llllly26/ComplexBench-Edit.
中文摘要:本文提出了ComplexBench-Edit基准测试,用于评估图像编辑模型处理复杂多步指令的能力,并开发了一种思维链方法,显著提升了模型处理复杂编辑任务的性能。
English Summary: This paper introduces ComplexBench-Edit, a benchmark for evaluating image editing models on complex multi-step instructions, along with a Chain-of-Thought method that significantly improves model performance on such tasks.

Authors:Mufan Liu, Cixiao Zhang, Qi Yang, Yujie Cao, Yiling Xu, Yin Xu, Shu Sun, Mingzeng Dai, Yunfeng Guan
Title: Rasterizing Wireless Radiance Field via Deformable 2D Gaussian Splatting
Abstract:
Modeling the wireless radiance field (WRF) is fundamental to modern communication systems, enabling key tasks such as localization, sensing, and channel estimation. Traditional approaches, which rely on empirical formulas or physical simulations, often suffer from limited accuracy or require strong scene priors. Recent neural radiance field (NeRF-based) methods improve reconstruction fidelity through differentiable volumetric rendering, but their reliance on computationally expensive multilayer perceptron (MLP) queries hinders real-time deployment. To overcome these challenges, we introduce Gaussian splatting (GS) to the wireless domain, leveraging its efficiency in modeling optical radiance fields to enable compact and accurate WRF reconstruction. Specifically, we propose SwiftWRF, a deformable 2D Gaussian splatting framework that synthesizes WRF spectra at arbitrary positions under single-sided transceiver mobility. SwiftWRF employs CUDA-accelerated rasterization to render spectra at over 100000 fps and uses a lightweight MLP to model the deformation of 2D Gaussians, effectively capturing mobility-induced WRF variations. In addition to novel spectrum synthesis, the efficacy of SwiftWRF is further underscored in its applications in angle-of-arrival (AoA) and received signal strength indicator (RSSI) prediction. Experiments conducted on both real-world and synthetic indoor scenes demonstrate that SwiftWRF can reconstruct WRF spectra up to 500x faster than existing state-of-the-art methods, while significantly enhancing its signal quality. The project page is https://evan-sudo.github.io/swiftwrf/.
中文摘要:SwiftWRF提出了一种可变形二维高斯泼溅框架,通过CUDA加速渲染和轻量级神经网络建模,在实现比现有最优方法快500倍的无线辐射场重建速度的同时,显著提升了信号质量。
English Summary: SwiftWRF introduces a deformable 2D Gaussian splatting framework that achieves 500x faster wireless radiance field reconstruction than state-of-the-art methods while significantly improving signal quality through CUDA-accelerated rendering and lightweight neural modeling.

Authors:Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, Wentao Zhang
Title: Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.
中文摘要:该摘要提出了RC-Bench这一针对极端视觉条件评估视觉语言模型性能的基准,以及NativeRes-LLaVA这一支持模型处理原生分辨率图像的开源框架,实验证明该方法能显著提升模型表现。
English Summary: The abstract introduces RC-Bench, a benchmark for evaluating Vision-Language Models under extreme visual conditions, and NativeRes-LLaVA, a framework enabling models to process images at native resolutions, demonstrating significant performance improvements.

Authors:Ruojing Li, Wei An, Xinyi Ying, Yingqian Wang, Yimian Dai, Longguang Wang, Miao Li, Yulan Guo, Li Liu
Title: Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better
Abstract:
Infrared small target (IRST) detection is challenging in simultaneously achieving precise, universal, robust and efficient performance due to extremely dim targets and strong interference. Current learning-based methods attempt to leverage ``more" information from both the spatial and the short-term temporal domains, but suffer from unreliable performance under complex conditions while incurring computational redundancy. In this paper, we explore the ``more essential" information from a more crucial domain for the detection. Through theoretical analysis, we reveal that the global temporal saliency and correlation information in the temporal profile demonstrate significant superiority in distinguishing target signals from other signals. To investigate whether such superiority is preferentially leveraged by well-trained networks, we built the first prediction attribution tool in this field and verified the importance of the temporal profile information. Inspired by the above conclusions, we remodel the IRST detection task as a one-dimensional signal anomaly detection task, and propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection. We conducted extensive experiments to fully validate the effectiveness of our method. The experimental results are exciting, as our DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency, and achieves a significant improvement on dim targets and in complex scenarios. We provide a new modeling domain, a new insight, a new method, and a new performance, which can promote the development of IRST detection. Codes are available at https://github.com/TinaLRJ/DeepPro.
Chinese: 本文提出DeepPro,一种高效的深度时间探针网络,将红外小目标检测重塑为一维信号异常检测任务,通过利用全局时间显著性和相关性信息,在基准测试中实现了卓越性能和高效率。
English: This paper introduces DeepPro, an efficient deep temporal probe network that redefines infrared small target detection as a one-dimensional signal anomaly detection task, achieving superior performance and high efficiency on benchmarks by leveraging global temporal saliency and correlation information.
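To make the "probe only the temporal profile" idea concrete, here is a minimal numpy sketch that scores each pixel's 1-D time series against its own robust temporal statistics; the synthetic data and the median/MAD scoring are assumptions for illustration, not DeepPro's learned probe network.

```python
import numpy as np

rng = np.random.default_rng(0)
video = rng.normal(100.0, 2.0, size=(64, 128, 128))   # (T, H, W) infrared frames
video[30, 60, 70] += 25.0                              # a dim target appearing at frame 30

med = np.median(video, axis=0)                         # per-pixel temporal median
mad = np.median(np.abs(video - med), axis=0) + 1e-6    # robust temporal spread
saliency = (video - med) / mad                         # (T, H, W) temporal deviation score

t, y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
print("most salient detection at frame", t, "pixel", (y, x))
```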

Authors:Rong Wu, Ziqi Chen, Liming Zhong, Heng Li, Hai Shu
Title: Unleashing Diffusion and State Space Models for Medical Image Segmentation
Abstract:
Existing segmentation models trained on a single medical imaging dataset often lack robustness when encountering unseen organs or tumors. Developing a robust model capable of identifying rare or novel tumor categories not present during training is crucial for advancing medical imaging applications. We propose DSM, a novel framework that leverages diffusion and state space models to segment unseen tumor categories beyond the training data. DSM utilizes two sets of object queries trained within modified attention decoders to enhance classification accuracy. Initially, the model learns organ queries using an object-aware feature grouping strategy to capture organ-level visual features. It then refines tumor queries by focusing on diffusion-based visual prompts, enabling precise segmentation of previously unseen tumors. Furthermore, we incorporate diffusion-guided feature fusion to improve semantic segmentation performance. By integrating CLIP text embeddings, DSM captures category-sensitive classes to improve linguistic transfer knowledge, thereby enhancing the model's robustness across diverse scenarios and multi-label tasks. Extensive experiments demonstrate the superior performance of DSM in various tumor segmentation tasks. Code is available at https://github.com/Rows21/k-Means_Mask_Mamba.
中文: 提出的DSM框架结合扩散与状态空间模型,通过对象查询和CLIP文本嵌入技术,实现了对训练数据外未知肿瘤的精准分割,在医学影像任务中展现出卓越性能。
English: The proposed DSM framework combines diffusion and state space models with object queries and CLIP text embeddings to robustly segment unseen tumors beyond training data, demonstrating superior performance in medical imaging tasks.

Authors:Hang Xu, Wei Yu, Jiangtong Tan, Zhen Zou, Feng Zhao
Title: Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution
Abstract:
Blind Super-Resolution (blind SR) aims to enhance the model's generalization ability with unknown degradation, yet it still encounters severe overfitting issues. Some previous methods inspired by dropout, which enhances generalization by regularizing features, have shown promising results in blind SR. Nevertheless, these methods focus solely on regularizing features before the final layer and overlook the need for generalization in features at intermediate layers. Without explicit regularization of features at intermediate layers, the blind SR network struggles to obtain well-generalized feature representations. However, the key challenge is that directly applying dropout to intermediate layers leads to a significant performance drop, which we attribute to the training-testing and cross-layer inconsistencies it introduces. Therefore, we propose Adaptive Dropout, a new regularization method for blind SR models, which mitigates the inconsistency and facilitates application across intermediate layers of networks. Specifically, for training-testing inconsistency, we re-design the form of dropout and integrate the features before and after dropout adaptively. For inconsistency in generalization requirements across different layers, we innovatively design an adaptive training strategy to strengthen feature propagation by layer-wise annealing. Experimental results show that our method outperforms all past regularization methods on both synthetic and real-world benchmark datasets, and is also highly effective in other image restoration tasks. Code is available at https://github.com/xuhang07/Adpative-Dropout.
Chinese: 本文提出自适应Dropout方法,通过自适应特征融合和逐层退火策略解决盲超分辨率中的训练-测试及跨层不一致性问题,在合成与真实基准数据集上超越现有正则化方法,并有效应用于其他图像复原任务。
English: This paper introduces Adaptive Dropout, a novel regularization method for blind super-resolution that addresses training-testing and cross-layer inconsistencies by adaptively integrating features and employing layer-wise annealing, achieving state-of-the-art performance on benchmark datasets and other image restoration tasks.
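The following PyTorch sketch illustrates the general flavour of blending features before and after dropout with a learnable gate; the module name, the single scalar gate, and the sigmoid parameterization are assumptions for illustration, not the paper's Adaptive Dropout or its layer-wise annealing schedule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendedDropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p
        # learnable mixing weight between raw and dropout-regularized features
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x):
        if not self.training:
            return x  # no stochasticity at test time
        dropped = F.dropout(x, p=self.p, training=True)
        a = torch.sigmoid(self.alpha)  # keep the blend weight in (0, 1)
        return a * x + (1.0 - a) * dropped

feats = torch.randn(8, 64, 32, 32)
layer = BlendedDropout(p=0.2).train()
print(layer(feats).shape)  # torch.Size([8, 64, 32, 32])
```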

Authors:Wenxiao Cai, Zongru Li, Iris Wang, Yu-Neng Wang, Thomas H. Lee
Title: OscNet v1.5: Energy Efficient Hopfield Network on CMOS Oscillators for Image Classification
Abstract:
Machine learning has achieved remarkable advancements but at the cost of significant computational resources. This has created an urgent need for a novel and energy-efficient computational fabric and corresponding algorithms. CMOS Oscillator Networks (OscNet) is a brain inspired and specially designed hardware for low energy consumption. In this paper, we propose a Hopfield Network based machine learning algorithm that can be implemented on OscNet. The network is trained using forward propagation alone to learn sparsely connected weights, yet achieves an 8% improvement in accuracy compared to conventional deep learning models on MNIST dataset. OscNet v1.5 achieves competitive accuracy on MNIST and is well-suited for implementation using CMOS-compatible ring oscillator arrays with SHIL. In oscillator-based inference, we utilize only 24% of the connections used in a fully connected Hopfield network, with merely a 0.1% drop in accuracy. OscNet v1.5 relies solely on forward propagation and employs sparse connections, making it an energy-efficient machine learning pipeline designed for oscillator computing fabric. The repository for OscNet family is: https://github.com/RussRobin/OscNet .
中文: OscNet v1.5 是一种基于CMOS振荡器硬件的高效能机器学习方案,仅通过前向传播和稀疏连接就在MNIST数据集上实现了与主流模型相当的精度,同时将连接数减少至24%且精度损失仅为0.1%。
English: OscNet v1.5 is an energy-efficient machine learning pipeline using forward propagation and sparse connections on CMOS oscillator hardware, achieving competitive accuracy on MNIST while using only 24% of the connections of a fully connected Hopfield network with minimal performance loss.
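As a toy illustration of forward-only (Hebbian) training with sparse connectivity, the numpy sketch below stores a few ±1 patterns in a Hopfield network, prunes most of the connections, and recalls a noisy probe; the pattern count, sparsity mask, and update rule are generic textbook choices, not OscNet's oscillator-based algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                              # number of binary (+/-1) units
patterns = np.sign(rng.standard_normal((5, n)))      # 5 stored patterns

# Hebbian outer-product learning (forward pass only, no backprop)
W = (patterns.T @ patterns) / n
np.fill_diagonal(W, 0.0)

# keep roughly 24% of the off-diagonal connections, symmetrically
upper = np.triu(rng.random((n, n)) < 0.24, k=1)
W *= (upper | upper.T)

state = np.sign(patterns[0] + 0.5 * rng.standard_normal(n))  # noisy probe
for _ in range(20):                                  # synchronous relaxation updates
    state = np.sign(W @ state + 1e-9)

print("overlap with stored pattern:", float(state @ patterns[0]) / n)
```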

Authors:Darryl Ho, Samuel Madden
Title: DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification
Abstract:
In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a variable-length temporal sequence of embeddings, which we call a multivariate time series (MTS). An MTS naturally preserves temporal order and accommodates variable video durations. We then learn per-timestep, per-feature weights over the encoded MTS frames, allowing us to account for variations in feature importance over time. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder, achieving leading Top-1 accuracy of 77.2% on Something-Something V2, 89.1% on Kinetics-400, and 88.6% on HMDB51, while adding fewer than 1.8% additional learnable parameters and requiring less than 3 hours of training time. Our code is available at https://github.com/darrylho/DejaVid.
中文: DejaVid是一种与编码器无关的框架,通过将视频转换为带有时序特征权重的多元时间序列来提升分类性能,无需重新训练或修改结构即可实现最优准确率。
English: DejaVid is an encoder-agnostic framework that enhances video classification by converting videos into temporal sequences with learned feature weights, achieving state-of-the-art accuracy without retraining or architectural changes.
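A toy sketch of the core idea of weighting per-timestep, per-feature clip embeddings before pooling them into a video-level representation follows; the shapes, the softmax-over-time weighting, and the linear classifier are illustrative assumptions, not DejaVid's learned time-series alignment architecture.

```python
import torch
import torch.nn as nn

class WeightedTemporalPool(nn.Module):
    def __init__(self, num_steps, feat_dim, num_classes):
        super().__init__()
        # one learnable weight per (timestep, feature)
        self.logits = nn.Parameter(torch.zeros(num_steps, feat_dim))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, mts):                      # mts: (B, T, D) frozen encoder outputs
        w = torch.softmax(self.logits, dim=0)    # normalize importance over time
        pooled = (mts * w).sum(dim=1)            # (B, D)
        return self.head(pooled)

model = WeightedTemporalPool(num_steps=16, feat_dim=768, num_classes=174)
clips = torch.randn(4, 16, 768)                  # e.g., 16 clip embeddings per video
print(model(clips).shape)                        # torch.Size([4, 174])
```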

Authors:Chunjiang Wang, Kun Zhang, Yandong Liu, Zhiyang He, Xiaodong Tao, S. Kevin Zhou
Title: MVP-CBM: Multi-layer Visual Preference-enhanced Concept Bottleneck Model for Explainable Medical Image Classification
Abstract:
The concept bottleneck model (CBM), as a technique improving interpretability via linking predictions to human-understandable concepts, makes high-risk and life-critical medical image classification credible. Typically, existing CBM methods associate the final layer of visual encoders with concepts to explain the model's predictions. However, we empirically discover the phenomenon of concept preference variation, that is, the concepts are preferably associated with the features at different layers than those only at the final layer; yet a blind last-layer-based association neglects such a preference variation and thus weakens the accurate correspondences between features and concepts, impairing model interpretability. To address this issue, we propose a novel Multi-layer Visual Preference-enhanced Concept Bottleneck Model (MVP-CBM), which comprises two key novel modules: (1) intra-layer concept preference modeling, which captures the preferred association of different concepts with features at various visual layers, and (2) multi-layer concept sparse activation fusion, which sparsely aggregates concept activations from multiple layers to enhance performance. Thus, by explicitly modeling concept preferences, MVP-CBM can comprehensively leverage multi-layer visual information to provide a more nuanced and accurate explanation of model decisions. Extensive experiments on several public medical classification benchmarks demonstrate that MVP-CBM achieves state-of-the-art accuracy and interpretability, verifying its superiority. Code is available at https://github.com/wcj6/MVP-CBM.
中文: 提出的多层视觉偏好增强概念瓶颈模型(MVP-CBM)通过捕捉概念与多层视觉特征的关联,解决了概念偏好变化问题,显著提升了医学图像分类的可解释性和准确性。
English: The proposed Multi-layer Visual Preference-enhanced Concept Bottleneck Model (MVP-CBM) addresses concept preference variation by capturing associations between concepts and features across multiple visual layers, enhancing both interpretability and accuracy in medical image classification.

Authors:Catalin E. Brita, Hieu Nguyen, Lohithsai Yadala Chanchu, Domonkos Nagy, Maksim Zhdanov
Title: BSA: Ball Sparse Attention for Large-scale Geometries
Abstract:
Self-attention scales quadratically with input size, limiting its use for large-scale physical systems. Although sparse attention mechanisms provide a viable alternative, they are primarily designed for regular structures such as text or images, making them inapplicable for irregular geometries. In this work, we present Ball Sparse Attention (BSA), which adapts Native Sparse Attention (NSA) (Yuan et al., 2025) to unordered point sets by imposing regularity using the Ball Tree structure from the Erwin Transformer (Zhdanov et al., 2025). We modify NSA's components to work with ball-based neighborhoods, yielding a global receptive field at sub-quadratic cost. On an airflow pressure prediction task, we achieve accuracy comparable to Full Attention while significantly reducing the theoretical computational complexity. Our implementation is available at https://github.com/britacatalin/bsa.
中文:球稀疏注意力通过球树结构将稀疏注意力应用于不规则几何形状,在物理系统预测任务中以次二次复杂度实现了与全注意力相当的精度。
English: Ball Sparse Attention adapts sparse attention to irregular geometries using ball tree structures, achieving full-attention accuracy with sub-quadratic complexity on physical systems.
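The sketch below shows the basic trick of imposing regularity on an unordered point set: group points into fixed-size neighborhoods (here by a crude coordinate sort rather than a ball tree) and run attention only within each group, so cost grows with the number of groups rather than quadratically in the number of points. It is an illustrative simplification, not BSA's NSA-style components.

```python
import torch
import torch.nn.functional as F

def grouped_self_attention(x, pos, group_size=64):
    # x: (N, D) point features, pos: (N, 3) point coordinates
    order = torch.argsort(pos[:, 0])          # crude spatial ordering (stand-in for a ball tree)
    x_sorted = x[order]
    n, d = x_sorted.shape
    pad = (-n) % group_size
    x_pad = F.pad(x_sorted, (0, 0, 0, pad))   # pad so N divides into equal groups
    groups = x_pad.view(-1, group_size, d)    # (G, group_size, D)
    attn = F.scaled_dot_product_attention(groups, groups, groups)
    out = attn.reshape(-1, d)[:n]
    inverse = torch.argsort(order)
    return out[inverse]                       # restore the original point order

pts = torch.randn(1000, 3)
feat = torch.randn(1000, 32)
print(grouped_self_attention(feat, pts).shape)  # torch.Size([1000, 32])
```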

Authors:Nuwan Bandara, Thivya Kandappu, Archan Misra
Title: Inference-Time Gaze Refinement for Micro-Expression Recognition: Enhancing Event-Based Eye Tracking with Motion-Aware Post-Processing
Abstract:
Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments. Our code implementations can be found at https://github.com/eye-tracking-for-physiological-sensing/EyeLoRiN.
中文: 本文提出了一种与模型无关的优化框架,通过运动感知滤波和光流局部优化来改进基于事件的视线估计,显著提升了视线信号的稳定性以支持认知状态解码,并引入了评估时间平滑度的新型抖动指标。
English: This paper presents a model-agnostic framework that enhances event-based gaze estimation through motion-aware filtering and optical flow refinement, improving signal consistency for cognitive state inference while introducing a novel jitter metric for temporal smoothness evaluation.
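A small sketch of inference-time smoothing of a gaze trajectory with a sliding median filter, plus a simple jitter score based on velocity regularity, is shown below; the window size and the exact jitter definition are assumptions for illustration, while the paper's motion-aware filtering and Jitter Metric are more involved.

```python
import numpy as np

def median_filter_trajectory(xy, window=5):
    # xy: (T, 2) gaze predictions; returns a smoothed (T, 2) trajectory
    half = window // 2
    padded = np.pad(xy, ((half, half), (0, 0)), mode="edge")
    return np.stack([np.median(padded[t:t + window], axis=0)
                     for t in range(len(xy))])

def jitter_score(xy):
    # lower is smoother: mean squared change of frame-to-frame velocity
    vel = np.diff(xy, axis=0)
    acc = np.diff(vel, axis=0)
    return float(np.mean(acc ** 2))

traj = np.cumsum(np.random.randn(100, 2), axis=0)        # noisy gaze track
smoothed = median_filter_trajectory(traj, window=7)
print(jitter_score(traj), jitter_score(smoothed))         # raw vs. smoothed jitter
```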

Authors:Runhao Zeng, Qi Deng, Ronghao Zhang, Shuaicheng Niu, Jian Chen, Xiping Hu, Victor C. M. Leung
Title: Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation
Abstract:
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
Chinese: 本研究提出了一种音频辅助的视频测试时自适应方法,利用音频生成的伪标签和灵活的自适应周期,在多个受损数据集上显著提升了模型性能。
English: This study introduces an audio-assisted test-time adaptation method for video that leverages audio-generated pseudo-labels and a flexible adaptation cycle to enhance model performance across multiple corrupted datasets.
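A bare-bones sketch of audio-assisted pseudo-labelling follows: an audio classifier's prediction is mapped into the video label space through a fixed lookup table (standing in for the LLM-built mapping in the paper), and the resulting pseudo-label drives a single test-time adaptation step. All names, the mapping table, and the single gradient step are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

AUDIO_TO_VIDEO = {"dog_bark": 3, "guitar": 17, "applause": 42}   # hypothetical mapping

def adapt_on_sample(video_model, optimizer, video_clip, audio_label):
    if audio_label not in AUDIO_TO_VIDEO:
        return                                   # no usable audio cue; skip adaptation
    pseudo = torch.tensor([AUDIO_TO_VIDEO[audio_label]])
    logits = video_model(video_clip)             # (1, num_video_classes)
    loss = F.cross_entropy(logits, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

video_model = torch.nn.Linear(512, 101)          # stand-in for a video classifier head
opt = torch.optim.SGD(video_model.parameters(), lr=1e-3)
adapt_on_sample(video_model, opt, torch.randn(1, 512), "dog_bark")
```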

Authors:Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, Pengyu Chen, Junze Chen, Hao Du, Kaiwen Shen, Shangkun Wu, Jiwei Wei, Shiyuan He, Yang Yang, Xiaohai Xu, Ke Ma, Qianqian Xu, Qingming Huang, Shi Lin, Xun Wang, Changting Lin, Meng Han, Yilei Jiang, Siqi Lai, Yaozhi Zheng, Yifei Song, Xiangyu Yue, Zonglei Jing, Tianyuan Zhang, Zhilei Zhu, Aishan Liu, Jiakai Wang, Siyuan Liang, Xianglong Kong, Hainan Li, Junjie Mu, Haotong Qin, Yue Yu, Lei Chen, Felix Juefei-Xu, Qing Guo, Xinyun Chen, Yew Soon Ong, Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao
Title: Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025
Abstract:
Multimodal Large Language Models (MLLMs) have enabled transformative advancements across diverse applications but remain susceptible to safety threats, especially jailbreak attacks that induce harmful outputs. To systematically evaluate and improve their safety, we organized the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. This technical report presents findings from the competition, which involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The competition results highlight ongoing challenges in securing MLLMs and provide valuable guidance for developing stronger defense mechanisms. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer multimodal AI systems. The code and data for this challenge are openly available at https://github.com/NY1024/ATLAS_Challenge_2025.
中文: ATLAS 2025挑战赛通过对抗性测试系统评估了多模态大语言模型的越狱攻击漏洞,在揭示持续安全挑战的同时,为开发更强防御机制建立了新基准。
English: The ATLAS 2025 challenge systematically evaluated multimodal large language models' vulnerabilities to jailbreak attacks through adversarial testing, revealing persistent safety challenges while establishing new benchmarks for developing stronger defense mechanisms.

Authors:Hyeonseo Lee, Juhyun Park, Jihyong Oh, Chanho Eom
Title: Domain Generalization for Person Re-identification: A Survey Towards Domain-Agnostic Person Matching
Abstract:
Person Re-identification (ReID) aims to retrieve images of the same individual captured across non-overlapping camera views, making it a critical component of intelligent surveillance systems. Traditional ReID methods assume that the training and test domains share similar characteristics and primarily focus on learning discriminative features within a given domain. However, they often fail to generalize to unseen domains due to domain shifts caused by variations in viewpoint, background, and lighting conditions. To address this issue, Domain-Adaptive ReID (DA-ReID) methods have been proposed. These approaches incorporate unlabeled target domain data during training and improve performance by aligning feature distributions between source and target domains. Domain-Generalizable ReID (DG-ReID) tackles a more realistic and challenging setting by aiming to learn domain-invariant features without relying on any target domain data. Recent methods have explored various strategies to enhance generalization across diverse environments, but the field remains relatively underexplored. In this paper, we present a comprehensive survey of DG-ReID. We first review the architectural components of DG-ReID including the overall setting, commonly used backbone networks and multi-source input configurations. Then, we categorize and analyze domain generalization modules that explicitly aim to learn domain-invariant and identity-discriminative representations. To examine the broader applicability of these techniques, we further conduct a case study on a related task that also involves distribution shifts. Finally, we discuss recent trends, open challenges, and promising directions for future research in DG-ReID. To the best of our knowledge, this is the first systematic survey dedicated to DG-ReID.
中文: 本文首次系统综述了无需目标域数据的域泛化行人重识别方法,重点分析了其架构组成和泛化模块,并探讨了该领域的未来研究方向。
English: This paper provides the first systematic survey of Domain-Generalizable Person Re-identification (DG-ReID), which focuses on learning domain-invariant features without target domain data, reviewing its architecture, generalization modules, and discussing future research directions.

Authors:Hongbi Zhou, Zhangkai Ni
Title: Perceptual-GS: Scene-adaptive Perceptual Densification for Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis. However, existing methods struggle to adaptively optimize the distribution of Gaussian primitives based on scene characteristics, making it challenging to balance reconstruction quality and efficiency. Inspired by human perception, we propose scene-adaptive perceptual densification for Gaussian Splatting (Perceptual-GS), a novel framework that integrates perceptual sensitivity into the 3DGS training process to address this challenge. We first introduce a perception-aware representation that models human visual sensitivity while constraining the number of Gaussian primitives. Building on this foundation, we develop a perceptual sensitivity-adaptive distribution to allocate finer Gaussian granularity to visually critical regions, enhancing reconstruction quality and robustness. Extensive evaluations on multiple datasets, including BungeeNeRF for large-scale scenes, demonstrate that Perceptual-GS achieves state-of-the-art performance in reconstruction quality, efficiency, and robustness. The code is publicly available at: https://github.com/eezkni/Perceptual-GS
中文: Perceptual-GS提出了一种感知自适应框架,通过优化高斯基元分布来提升重建质量和效率,在多个数据集上实现了领先性能。
English: Perceptual-GS introduces a perception-aware framework that adaptively optimizes Gaussian primitives for enhanced reconstruction quality and efficiency, achieving state-of-the-art results across multiple datasets.

Authors:Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, Hengxing Cai
Title: MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval
Abstract:
Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, We propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline. Our code is available at https://github.com/i2vec/MM-R5 .
中文: 本文提出MM-R5,一种基于强化学习的多模态推理增强重排器,通过生成高质量推理链提升文档检索效果,在基准测试中取得了领先性能。
English: The paper introduces MM-R5, a multimodal reasoning-enhanced reranker using reinforcement learning to improve document retrieval by generating high-quality reasoning chains and achieving state-of-the-art performance on benchmarks.

Authors:Worasit Sangjan, Piyush Pandey, Norman B. Best, Jacob D. Washburn
Title: MatchPlant: An Open-Source Pipeline for UAV-Based Single-Plant Detection and Data Extraction
Abstract:
Accurate identification of individual plants from unmanned aerial vehicle (UAV) images is essential for advancing high-throughput phenotyping and supporting data-driven decision-making in plant breeding. This study presents MatchPlant, a modular, graphical user interface-supported, open-source Python pipeline for UAV-based single-plant detection and geospatial trait extraction. MatchPlant enables end-to-end workflows by integrating UAV image processing, user-guided annotation, Convolutional Neural Network model training for object detection, forward projection of bounding boxes onto an orthomosaic, and shapefile generation for spatial phenotypic analysis. In an early-season maize case study, MatchPlant achieved reliable detection performance (validation AP: 89.6%, test AP: 85.9%) and effectively projected bounding boxes, covering 89.8% of manually annotated boxes with 87.5% of projections achieving an Intersection over Union (IoU) greater than 0.5. Trait values extracted from predicted bounding instances showed high agreement with manual annotations (r = 0.87-0.97, IoU >= 0.4). Detection outputs were reused across time points to extract plant height and Normalized Difference Vegetation Index with minimal additional annotation, facilitating efficient temporal phenotyping. By combining modular design, reproducibility, and geospatial precision, MatchPlant offers a scalable framework for UAV-based plant-level analysis with broad applicability in agricultural and environmental monitoring.
中文:MatchPlant是一个开源Python流程,通过集成无人机图像处理与深度学习模型,实现了端到端的单株植物检测和性状提取,在玉米案例中展现出高精度检测能力,为农业表型分析提供了可扩展的解决方案。
English: MatchPlant is an open-source Python pipeline that provides end-to-end workflows for accurate single-plant detection and trait extraction from UAV imagery, demonstrating high reliability in maize studies and enabling efficient temporal phenotyping for agricultural applications.

Authors:Yijiang Li, Genpei Zhang, Jiacheng Cheng, Yi Li, Xiaojun Shan, Dashan Gao, Jiancheng Lyu, Yuan Li, Ning Bi, Nuno Vasconcelos
Title: EgoPrivacy: What Your First-Person Camera Says About You?
Abstract:
While the rapid proliferation of wearable cameras has raised significant concerns about egocentric video privacy, prior work has largely overlooked the unique privacy threats posed to the camera wearer. This work investigates the core question: How much privacy information about the camera wearer can be inferred from their first-person view videos? We introduce EgoPrivacy, the first large-scale benchmark for the comprehensive evaluation of privacy risks in egocentric vision. EgoPrivacy covers three types of privacy (demographic, individual, and situational), defining seven tasks that aim to recover private information ranging from fine-grained (e.g., wearer's identity) to coarse-grained (e.g., age group). To further emphasize the privacy threats inherent to egocentric vision, we propose Retrieval-Augmented Attack, a novel attack strategy that leverages ego-to-exo retrieval from an external pool of exocentric videos to boost the effectiveness of demographic privacy attacks. An extensive comparison of the different attacks possible under all threat models is presented, showing that private information of the wearer is highly susceptible to leakage. For instance, our findings indicate that foundation models can effectively compromise wearer privacy even in zero-shot settings by recovering attributes such as identity, scene, gender, and race with 70-80% accuracy. Our code and data are available at https://github.com/williamium3000/ego-privacy.
Chinese: 本研究推出了首个用于评估第一人称视角隐私风险的大规模基准EgoPrivacy,揭示即使零样本情况下,也能从穿戴者视频中高精度推断其身份、人口统计和情境等隐私信息。
English: This study introduces EgoPrivacy, the first large-scale benchmark for assessing privacy risks in egocentric vision, revealing that wearers' private information like identity, demographics, and situations can be inferred with high accuracy from first-person videos, even in zero-shot scenarios.

Authors:Jayjun Lee, Nima Fazeli
Title: ViTaSCOPE: Visuo-tactile Implicit Representation for In-hand Pose and Extrinsic Contact Estimation
Abstract:
Mastering dexterous, contact-rich object manipulation demands precise estimation of both in-hand object poses and external contact locations, tasks that are particularly challenging due to partial and noisy observations. We present ViTaSCOPE: Visuo-Tactile Simultaneous Contact and Object Pose Estimation, an object-centric neural implicit representation that fuses vision and high-resolution tactile feedback. By representing objects as signed distance fields and distributed tactile feedback as neural shear fields, ViTaSCOPE accurately localizes objects and registers extrinsic contacts onto their 3D geometry as contact fields. Our method enables seamless reasoning over complementary visuo-tactile cues by leveraging simulation for scalable training and zero-shot transfers to the real-world by bridging the sim-to-real gap. We evaluate our method through comprehensive simulated and real-world experiments, demonstrating its capabilities in dexterous manipulation scenarios.

Authors:Ilya Ilyankou, Natchapon Jongwiriyanurak, Tao Cheng, James Haworth
Title: CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images
Abstract:
We present a CLIP-based, multi-modal, multi-label classifier for predicting geographical context tags from landscape photos in the Geograph dataset, a crowdsourced image archive spanning the British Isles, including remote regions lacking POIs and street-level imagery. Our approach addresses a Kaggle competition task (https://www.kaggle.com/competitions/predict-geographic-context-from-landscape-photos) based on a subset of Geograph's 8M images, with strict evaluation: exact match accuracy is required across 49 possible tags. We show that combining location and title embeddings with image features improves accuracy over using image embeddings alone. We release a lightweight pipeline (https://github.com/SpaceTimeLab/ClipTheLandscape) that trains on a modest laptop, using pre-trained CLIP image and text embeddings and a simple classification head. Predicted tags can support downstream tasks such as building location embedders for GeoAI applications, enriching spatial understanding in data-sparse regions.
中文: 本研究提出了一种基于CLIP的多模态分类器,通过融合位置和标题嵌入与图像特征,改进了景观照片的地理标签预测效果,其性能优于仅使用图像的方法,并为GeoAI应用提供了轻量级训练流程。
English: This study introduces a CLIP-based multi-modal classifier that enhances geographical tag prediction for landscape photos by integrating location and title embeddings with image features, outperforming image-only approaches and providing a lightweight training pipeline for GeoAI applications.
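A compact sketch of the "precomputed CLIP embeddings plus simple head" recipe is shown below: image, title, and location embeddings are concatenated and fed to a small MLP trained for multi-label tag prediction. The embedding sizes and the head design are assumptions; the released pipeline may differ.

```python
import torch
import torch.nn as nn

class TagHead(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, loc_dim=2, num_tags=49):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim + loc_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_tags),
        )

    def forward(self, img_emb, title_emb, loc):
        x = torch.cat([img_emb, title_emb, loc], dim=-1)
        return self.mlp(x)                    # raw logits; train with BCEWithLogitsLoss

head = TagHead()
logits = head(torch.randn(16, 512), torch.randn(16, 512), torch.randn(16, 2))
print(logits.shape)                           # torch.Size([16, 49])
```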

Authors:Haoxiang Chen, Wei Zhao, Rufei Zhang, Nannan Li, Dongjin Li
Title: Multiple Object Tracking in Video SAR: A Benchmark and Tracking Baseline
Abstract:
In the context of multi-object tracking using video synthetic aperture radar (Video SAR), Doppler shifts induced by target motion result in artifacts that are easily mistaken for shadows caused by static occlusions. Moreover, appearance changes of the target caused by Doppler mismatch may lead to association failures and disrupt trajectory continuity. A major limitation in this field is the lack of public benchmark datasets for standardized algorithm evaluation. To address the above challenges, we collected and annotated 45 video SAR sequences containing moving targets, and named the Video SAR MOT Benchmark (VSMB). Specifically, to mitigate the effects of trailing and defocusing in moving targets, we introduce a line feature enhancement mechanism that emphasizes the positive role of motion shadows and reduces false alarms induced by static occlusions. In addition, to mitigate the adverse effects of target appearance variations, we propose a motion-aware clue discarding mechanism that substantially improves tracking robustness in Video SAR. The proposed model achieves state-of-the-art performance on the VSMB, and the dataset and model are released at https://github.com/softwarePupil/VSMB.
中文摘要:该研究提出了视频合成孔径雷达多目标跟踪基准(VSMB)数据集及新型跟踪模型,通过线特征增强和运动感知线索丢弃机制,有效缓解多普勒效应造成的伪影和目标外观变化,实现了最先进的性能。
English Summary: The study introduces a Video SAR MOT Benchmark (VSMB) dataset and a novel tracking model that uses line feature enhancement and motion-aware clue discarding to mitigate Doppler-induced artifacts and appearance changes, achieving state-of-the-art performance.

Authors:Yewei Liu, Xiyuan Wang, Muhan Zhang
Title: Meta Pruning via Graph Metanetworks: A Meta Learning Framework for Network Pruning
Abstract:
Network pruning, aimed at reducing network size while preserving accuracy, has attracted significant research interest. Numerous pruning techniques have been proposed over time. They are becoming increasingly effective, but more complex and harder to interpret as well. Given the inherent complexity of neural networks, we argue that manually designing pruning criteria has reached a bottleneck. To address this, we propose a novel approach in which we "use a neural network to prune neural networks". More specifically, we introduce the newly developed idea of metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically which can transform a network that is hard to prune into another network that is much easier to prune. Once the metanetwork is trained, our pruning needs nothing more than a feedforward through the metanetwork and the standard finetuning to prune at state-of-the-art. Our method achieved outstanding results on many popular and representative pruning tasks (including ResNet56 on CIFAR10, VGG19 on CIFAR100, ResNet50 on ImageNet). Our code is available at https://github.com/Yewei-Liu/MetaPruning
Chinese: 我们提出了一种全新的元学习网络剪枝框架,通过元网络自动学习复杂剪枝规则,无需针对每个任务进行专门训练即可在各种网络上实现最先进的剪枝效果。
English: We introduce a novel meta-learning framework for network pruning that automatically learns complex pruning rules through a metanetwork, achieving state-of-the-art results across various networks without requiring special training for each task.

Authors:Hao Gu, Lujun Li, Zheyu Wang, Bei Liu, Qiyuan Zhu, Sirui Han, Yike Guo
Title: BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook
Abstract:
Binary quantization represents the most extreme form of large language model (LLM) compression, reducing weights to ±1 for maximal memory and computational efficiency. While recent sparsity-aware binarization methods achieve sub-1-bit compression by pruning redundant binary weights, they suffer from three critical challenges: performance deterioration, computational complexity from sparse mask management, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that leverages adaptive weight transformation and binary pattern clustering to overcome these limitations, delivering both superior accuracy and efficiency. Our approach incorporates two key innovations: (1) a Learnable Transformation that optimizes invertible scaling and rotation matrices to align binarized weights with full-precision distributions, enabling incoherence processing to enhance layer-wise representation quality; (2) a Flash and Accurate Binary Codebook that identifies recurring binary vector clusters, compressing them into compact indices with tailored distance metrics and sign-based centroid updates. This eliminates the need for sparse masks, enabling efficient inference on standard hardware. Our code is available at https://github.com/Chooovy/BTC-LLM.
中文:BTC-LLM是一种新颖的亚1比特量化框架,通过自适应权重变换和二进制模式聚类克服了性能和硬件限制,无需稀疏掩码即可实现卓越效率。
English: BTC-LLM is a novel sub-1-bit quantization framework that overcomes performance and hardware limitations through adaptive weight transformation and binary pattern clustering, achieving superior efficiency without sparse masks.
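A toy illustration of compressing binarized weight rows with a small codebook follows: rows of ±1 values are clustered by Hamming distance to a set of centroids, and only centroid indices are stored. The k-means-style refinement here is a simplification for illustration; it is not BTC-LLM's learnable transformation or its codebook update rule.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.sign(rng.standard_normal((256, 16)))          # binarized weight rows (+/-1)
k = 8                                                # codebook size

centroids = W[rng.choice(len(W), size=k, replace=False)].copy()
for _ in range(10):                                  # simple k-means-style refinement
    dists = (W[:, None, :] != centroids[None]).sum(-1)   # Hamming distances (256, k)
    assign = dists.argmin(1)
    for c in range(k):
        members = W[assign == c]
        if len(members):
            centroids[c] = np.sign(members.sum(0) + 1e-9)  # sign-based centroid update

codes = assign.astype(np.uint8)                      # store 1 byte per row instead of 16 signs
recon = centroids[codes]
print("bit agreement:", float((recon == W).mean()))
```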

Authors:Hsi-Che Lin, Yu-Chu Yu, Kai-Po Chang, Yu-Chiang Frank Wang
Title: EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction
Abstract:
Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific light-weight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning then is performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which thus can be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model on a single 24GB consumer GPU, bringing efficient and practical model adaptation to individual users.
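A very rough sketch of the overall flow is given below: build a low-rank "emulator" of a large linear layer with a truncated SVD, fine-tune only a LoRA adapter against the emulator, then merge the adapter back into the full-precision layer. The plain (non activation-aware) SVD, the random target, and the direct merge are simplifying assumptions, not EMLoC's calibration or compensation algorithm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
W = torch.randn(1024, 1024)                      # original weight (kept frozen)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r_emul = 64
W_emul = U[:, :r_emul] @ torch.diag(S[:r_emul]) @ Vh[:r_emul]   # lightweight emulator

rank = 8
A = nn.Parameter(torch.zeros(1024, rank))
B = nn.Parameter(torch.randn(rank, 1024) * 0.01)
opt = torch.optim.Adam([A, B], lr=1e-3)

x = torch.randn(32, 1024)
target = torch.randn(32, 1024)                   # stand-in for a downstream task signal
for _ in range(10):                              # fine-tune the LoRA against the emulator
    out = x @ (W_emul + A @ B).T
    loss = ((out - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

W_merged = W + (A @ B).detach()                  # merge the adapter into the original weight
print(W_merged.shape)                            # torch.Size([1024, 1024])
```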

Authors:Paul Setinek, Gianluca Galletti, Thomas Gross, Dominik Schnürer, Johannes Brandstetter, Werner Zellinger
Title: SIMSHIFT: A Benchmark for Adapting Neural Surrogates to Distribution Shifts
Abstract:
Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on unseen problem configurations, such as novel material types or structural dimensions. Meanwhile, Domain Adaptation (DA) techniques have been widely used in vision and language processing to generalize from limited information about unseen configurations. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks: hot rolling, sheet metal forming, electric motor design and heatsink design. Second, we extend established domain adaptation methods to state of the art neural surrogates and systematically evaluate them. These approaches use parametric descriptions and ground truth simulations from multiple source configurations, together with only parametric descriptions from target configurations. The goal is to accurately predict target simulations without access to ground truth simulation data. Extensive experiments on SIMSHIFT highlight the challenges of out of distribution neural surrogate modeling, demonstrate the potential of DA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at https://github.com/psetinek/simshift
中文: 神经PDE代理模型在未见配置下表现不佳,而领域自适应技术可利用目标域的参数化数据(无需真实模拟数据)提升其泛化能力。
English: Neural PDE surrogates struggle with unseen configurations, but domain adaptation techniques can improve their generalization using parametric data from target domains without ground truth simulations.

Authors:Yue Yao, Zelin Wen, Yan Tong, Xinyu Tian, Xuqing Li, Xiao Ma, Dongliang Xu, Tom Gedeon
Title: Simple Radiology VLLM Test-time Scaling with Thought Graph Traversal
Abstract:
Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to radiology report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model's inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at https://github.com/glerium/Thought-Graph-Traversal.
中文: 思维图遍历框架通过整合医学先验知识和动态推理预算,使冻结的视觉语言模型无需重新训练即可生成更准确的胸部X光报告。
English: The Thought Graph Traversal framework enhances radiology report generation by integrating medical priors and dynamic reasoning budgets, enabling frozen vision-language models to produce more accurate chest X-ray reports without retraining.
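A minimal sketch of prompting a frozen report-generation model organ by organ in a fixed, medically ordered traversal and concatenating the findings is shown below; the organ list, prompt wording, and the `generate` stub are placeholders for illustration only, not the paper's Thought Graph Traversal framework or its budget-forcing strategy.

```python
ORGAN_ORDER = ["lungs", "heart", "mediastinum", "bones", "devices"]

def generate(prompt: str) -> str:
    # stand-in for a call to a frozen radiology VLLM
    return f"No acute abnormality of the {prompt.split()[-1]}."

def traverse_and_report(image_id: str) -> str:
    findings = []
    for organ in ORGAN_ORDER:
        prompt = f"[image:{image_id}] Describe findings for the {organ}"
        findings.append(generate(prompt))
    return " ".join(findings)

print(traverse_and_report("cxr_0001"))
```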

Authors:Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim
Title: Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
Abstract:
We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.

Authors:Zhangkai Ni, Yang Zhang, Wenhan Yang, Hanli Wang, Shiqi Wang, Sam Kwong
Title: Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution
Abstract:
Major efforts in data-driven image super-resolution (SR) primarily focus on expanding the receptive field of the model to better capture contextual information. However, these methods are typically implemented by stacking deeper networks or leveraging transformer-based attention mechanisms, which consequently increases model complexity. In contrast, model-driven methods based on the unfolding paradigm show promise in improving performance while effectively maintaining model compactness through sophisticated module design. Based on these insights, we propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity, aiming to combine the strengths of both data-driven and model-driven approaches. Our model operates progressively following the unfolding paradigm. Each iteration consists of multiple Mixed-Scale Gating Modules (MSGM) and an Efficient Sparse Attention Module (ESAM). The former implements comprehensive constraints on features, including a structural similarity constraint, while the latter aims to achieve sparse activation. In addition, we design a Mixture-of-Experts-based Feature Selector (MoE-FS) that fully utilizes multi-level feature information by combining features from different steps. Extensive experiments validate the efficacy and efficiency of our unfolding-inspired network. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption. Our code will be available at: https://github.com/eezkni/SSIU
中文: 提出的结构相似性启发展开(SSIU)方法融合了数据驱动和模型驱动策略,通过专门模块设计实现了高效图像超分辨率,以更少参数和更低内存消耗达到了当前最优性能。
English: The proposed Structural Similarity-Inspired Unfolding (SSIU) method combines data-driven and model-driven approaches for efficient image super-resolution, achieving state-of-the-art performance with fewer parameters and lower memory consumption through specialized modules.

Authors:Muhammad Sarmad, Arnt-Børre Salberg, Michael Kampffmeyer
Title: DiffFuSR: Super-Resolution of all Sentinel-2 Multispectral Bands using Diffusion Models
Abstract:
This paper presents DiffFuSR, a modular pipeline for super-resolving all 12 spectral bands of Sentinel-2 Level-2A imagery to a unified ground sampling distance (GSD) of 2.5 meters. The pipeline comprises two stages: (i) a diffusion-based super-resolution (SR) model trained on high-resolution RGB imagery from the NAIP and WorldStrat datasets, harmonized to simulate Sentinel-2 characteristics; and (ii) a learned fusion network that upscales the remaining multispectral bands using the super-resolved RGB image as a spatial prior. We introduce a robust degradation model and contrastive degradation encoder to support blind SR. Extensive evaluations of the proposed SR pipeline on the OpenSR benchmark demonstrate that the proposed method outperforms current SOTA baselines in terms of reflectance fidelity, spectral consistency, spatial alignment, and hallucination suppression. Furthermore, the fusion network significantly outperforms classical pansharpening approaches, enabling accurate enhancement of Sentinel-2's 20 m and 60 m bands. This study underscores the power of harmonized learning with generative priors and fusion strategies to create a modular framework for Sentinel-2 SR. Our code and models can be found at https://github.com/NorskRegnesentral/DiffFuSR.
中文:DiffFuSR是一个模块化流程,通过扩散模型和融合网络将哨兵2号影像超分辨率提升至2.5米,在保真度和光谱一致性上优于现有方法。
English: DiffFuSR is a modular pipeline that super-resolves Sentinel-2 imagery to 2.5 meters using a diffusion-based model and fusion network, outperforming existing methods in fidelity and spectral consistency.

Authors:Dinh Viet Cuong, Hoang-Bao Le, An Pham Ngoc Nguyen, Liting Zhou, Cathal Gurrin
Title: Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model
Abstract:
This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of the LLaVA-NeXT-interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to the LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for Interleave tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.
Chinese: 该研究展示了LLaVA-NeXT-Interleave在三个任务的22个数据集上的优异表现,其中标准模型在视觉密集型任务中表现突出,而DCI增强版本在语义连贯性和结构化变化理解方面更具优势。
English: The study showcases LLaVA-NeXT-Interleave's strong performance across 22 datasets in three tasks, with the standard model excelling in vision-heavy tasks while the DCI-enhanced version performs better on semantic coherence and structured change understanding.

Authors:Libin Lan, Hongxing Li, Zunhui Xia, Yudong Zhang
Title: DMAF-Net: An Effective Modality Rebalancing Framework for Incomplete Multi-Modal Medical Image Segmentation
Abstract:
Incomplete multi-modal medical image segmentation faces critical challenges from modality imbalance, including imbalanced modality missing rates and heterogeneous modality contributions. Due to their reliance on idealized assumptions of complete modality availability, existing methods fail to dynamically balance contributions and neglect the structural relationships between modalities, resulting in suboptimal performance in real-world clinical scenarios. To address these limitations, we propose a novel model, named Dynamic Modality-Aware Fusion Network (DMAF-Net). The DMAF-Net adopts three key ideas. First, it introduces a Dynamic Modality-Aware Fusion (DMAF) module to suppress missing-modality interference by combining transformer attention with adaptive masking and weight modality contributions dynamically through attention maps. Second, it designs a synergistic Relation Distillation and Prototype Distillation framework to enforce global-local feature alignment via covariance consistency and masked graph attention, while ensuring semantic consistency through cross-modal class-specific prototype alignment. Third, it presents a Dynamic Training Monitoring (DTM) strategy to stabilize optimization under imbalanced missing rates by tracking distillation gaps in real-time, and to balance convergence speeds across modalities by adaptively reweighting losses and scaling gradients. Extensive experiments on BraTS2020 and MyoPS2020 demonstrate that DMAF-Net outperforms existing methods for incomplete multi-modal medical image segmentation. Our code is available at https://github.com/violet-42/DMAF-Net.
Chinese: 提出的动态模态感知融合网络(DMAF-Net)通过动态融合、双重蒸馏和自适应训练策略,解决了不完整多模态医学图像分割中的模态不平衡问题,在多个基准数据集上超越了现有方法。
English: The proposed Dynamic Modality-Aware Fusion Network (DMAF-Net) addresses modality imbalance in incomplete multi-modal medical image segmentation through dynamic fusion, dual distillation, and adaptive training strategies, outperforming existing methods on benchmark datasets.

Authors:Libin Lan, Hongxing Li, Zunhui Xia, Juan Zhou, Xiaofei Zhu, Yongmei Li, Yudong Zhang, Xin Luo
Title: Cross-Modal Clustering-Guided Negative Sampling for Self-Supervised Joint Learning from Medical Images and Reports
Abstract:
Learning medical visual representations directly from paired images and reports through multimodal self-supervised learning has emerged as a novel and efficient approach to digital diagnosis in recent years. However, existing models suffer from several severe limitations: 1) they neglect the selection of negative samples, resulting in the scarcity of hard negatives and the inclusion of false negatives; 2) they focus on global feature extraction, overlooking the fine-grained local details that are crucial for medical image recognition tasks; and 3) their contrastive learning primarily targets high-level features while ignoring the low-level details that are essential for accurate medical analysis. Motivated by these critical issues, this paper presents a Cross-Modal Cluster-Guided Negative Sampling (CM-CGNS) method with two key ideas. First, it extends the k-means clustering used for local text features in the single-modal domain to the multimodal domain through cross-modal attention. This improvement increases the number of negative samples and boosts the model representation capability. Second, it introduces a Cross-Modal Masked Image Reconstruction (CM-MIR) module that leverages local text-to-image features obtained via cross-modal attention to reconstruct masked local image regions. This module significantly strengthens the model's cross-modal information interaction capabilities and retains low-level image features essential for downstream tasks. By addressing the aforementioned limitations, the proposed CM-CGNS can learn effective and robust medical visual representations suitable for various recognition tasks. Extensive experimental results on classification, detection, and segmentation tasks across five downstream datasets show that our method outperforms state-of-the-art approaches on multiple metrics, verifying its superior performance.
Chinese: 本文提出跨模态聚类引导负采样方法,通过改进负样本选择和利用跨模态注意力保留细粒度细节,显著提升医学视觉表征学习效果,在多项下游任务中超越现有最佳方法。
English: This paper introduces a Cross-Modal Cluster-Guided Negative Sampling method that enhances medical visual representation learning by improving negative sample selection and preserving fine-grained details through cross-modal attention and masked image reconstruction, achieving superior performance across multiple medical tasks.

Authors:Yunhan Ren, Ruihuang Li, Lingbo Liu, Changwen Chen
Title: Prohibited Items Segmentation via Occlusion-aware Bilayer Modeling
Abstract:
Instance segmentation of prohibited items in security X-ray images is a critical yet challenging task. This is mainly caused by the significant appearance gap between prohibited items in X-ray images and natural objects, as well as the severe overlapping among objects in X-ray images. To address these issues, we propose an occlusion-aware instance segmentation pipeline designed to identify prohibited items in X-ray images. Specifically, to bridge the representation gap, we integrate the Segment Anything Model (SAM) into our pipeline, taking advantage of its rich priors and zero-shot generalization capabilities. To address the overlap between prohibited items, we design an occlusion-aware bilayer mask decoder module that explicitly models the occlusion relationships. To supervise occlusion estimation, we manually annotated occlusion areas of prohibited items in two large-scale X-ray image segmentation datasets, PIDray and PIXray. We then reorganized these additional annotations together with the original information as two occlusion-annotated datasets, PIDray-A and PIXray-A. Extensive experimental results on these occlusion-annotated datasets demonstrate the effectiveness of our proposed method. The datasets and codes are available at: https://github.com/Ryh1218/Occ
Chinese: 本文提出了一种遮挡感知的实例分割流程,结合SAM模型和双层掩码解码器,有效解决了X光图像中违禁物品识别因遮挡和外观差异带来的难题,并在新标注的数据集PIDray-A和PIXray-A上验证了其优越性能。
English: This paper introduces an occlusion-aware instance segmentation pipeline that leverages the Segment Anything Model (SAM) and a bilayer mask decoder to address the challenges of identifying prohibited items in X-ray images, validated on newly annotated datasets PIDray-A and PIXray-A.

Authors:Emre Kavak, Tom Nuno Wolf, Christian Wachinger
Title: DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation
Abstract:
Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard Anti-Causal Model (SAM), a unifying causal framework that characterizes bias mechanisms and yields a conditional independence criterion for causal stability. Building on this theory, we propose DISCO$_m$ and sDISCO, efficient and scalable estimators of conditional distance correlation that enable independence regularization in black-box models. Across five diverse datasets, our methods consistently outperform or are competitive with existing bias mitigation approaches, while requiring fewer hyperparameters and scaling seamlessly to multi-bias scenarios. This work bridges causal theory and practical deep learning, providing both a principled foundation and effective tools for robust prediction. Source Code: https://github.com/***.
Chinese: 本文提出因果框架和高效估计器来解决深度学习中的数据集偏差问题,在多种数据集上以最少超参数实现了稳健性能。
English: This paper introduces a causal framework and efficient estimators to mitigate dataset bias in deep learning, achieving robust performance across diverse datasets with minimal hyperparameters.
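
As a rough illustration of the dependence measure behind these estimators, the sketch below computes a plain (unconditional, biased) empirical distance correlation with NumPy on toy data; DISCO$_m$ and sDISCO estimate the conditional variant, which is more involved, so this is only a building block, not the paper's regularizer.
```python
# Minimal sketch of a (biased) empirical distance correlation estimator.
# The paper's DISCO_m / sDISCO estimate *conditional* distance correlation;
# this unconditional version only illustrates the underlying dependence measure.
import numpy as np

def _centered_distance_matrix(x: np.ndarray) -> np.ndarray:
    # x: (n, d) samples; returns the double-centered pairwise distance matrix.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x: np.ndarray, y: np.ndarray) -> float:
    a, b = _centered_distance_matrix(x), _centered_distance_matrix(y)
    dcov2 = (a * b).mean()                      # squared distance covariance (V-statistic)
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 1))
print(distance_correlation(z, z ** 2))                       # nonlinear dependence -> clearly > 0
print(distance_correlation(z, rng.normal(size=(256, 1))))    # independent samples -> near 0
```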

Authors:Akshay Jindal, Nabil Sadaka, Manu Mathew Thomas, Anton Sochenov, Anton Kaplanyan
Title: CGVQM+D: Computer Graphics Video Quality Metric and Dataset
Abstract:
While existing video and image quality datasets have extensively studied natural videos and traditional distortions, the perception of synthetic content and modern rendering artifacts remains underexplored. We present a novel video quality dataset focused on distortions introduced by advanced rendering techniques, including neural supersampling, novel-view synthesis, path tracing, neural denoising, frame interpolation, and variable rate shading. Our evaluations show that existing full-reference quality metrics perform sub-optimally on these distortions, with a maximum Pearson correlation of 0.78. Additionally, we find that the feature space of pre-trained 3D CNNs aligns strongly with human perception of visual quality. We propose CGVQM, a full-reference video quality metric that significantly outperforms existing metrics while generating both per-pixel error maps and global quality scores. Our dataset and metric implementation are available at https://github.com/IntelLabs/CGVQM.
Chinese: 本文针对现代渲染失真提出了一种新型视频质量数据集,并开发了CGVQM全参考质量评估方法,其性能显著优于现有指标。
English: This paper introduces a novel video quality dataset for modern rendering distortions and proposes CGVQM, a full-reference metric that surpasses existing methods in performance.

Authors:Zhaoyang Wang, Jie Li, Wen Lu, Lihuo He, Maoguo Gong, Xinbo Gao
Title: FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution
Abstract:
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. As video frame rates continue to increase, the diminishing inter-frame differences further expose the limitations of traditional frame-to-frame information exploitation methods, which are inadequate for addressing current video super-resolution (VSR) demands. To overcome these challenges, we propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames. The proposed modular architecture is designed for seamless integration with existing VSR frameworks, ensuring strong adaptability and transferability across diverse applications. Experimental results demonstrate that our method achieves performance on par with, or surpassing, the current SOTA models, while significantly reducing inference time. By addressing key bottlenecks in CVSR, our work offers a practical and efficient pathway for advancing VSR technology. Our code will be publicly available at https://github.com/handsomewzy/FCA2.
Chinese: 我们提出的压缩驱动降维方法在性能上达到或超越了当前最先进的压缩视频超分辨率模型,同时显著减少了推理时间并增强了帧间时序信息提取能力。
English: Our proposed method introduces a compression-driven dimensionality reduction strategy that achieves performance comparable to or better than state-of-the-art compressed video super-resolution models while significantly reducing inference time and enhancing temporal information extraction.

Authors:Heng Fang, Hossein Azizpour
Title: Leveraging Satellite Image Time Series for Accurate Extreme Event Detection
Abstract:
Climate change is leading to an increase in extreme weather events, causing significant environmental damage and loss of life. Early detection of such events is essential for improving disaster response. In this work, we propose SITS-Extreme, a novel framework that leverages satellite image time series to detect extreme events by incorporating multiple pre-disaster observations. This approach effectively filters out irrelevant changes while isolating disaster-relevant signals, enabling more accurate detection. Extensive experiments on both real-world and synthetic datasets validate the effectiveness of SITS-Extreme, demonstrating substantial improvements over widely used strong bi-temporal baselines. Additionally, we examine the impact of incorporating more timesteps, analyze the contribution of key components in our framework, and evaluate its performance across different disaster types, offering valuable insights into its scalability and applicability for large-scale disaster monitoring.
Chinese: SITS-Extreme框架通过整合多时相卫星影像与灾前观测数据,能有效过滤无关变化并提取灾害信号,在各类灾害检测实验中显著优于双时相基线方法,为大规模灾害监测提供了可扩展的解决方案。
English: The SITS-Extreme framework utilizes satellite image time series with multiple pre-disaster observations to accurately detect extreme weather events by filtering irrelevant changes and isolating disaster signals, demonstrating significant improvements over bi-temporal baselines in experiments across various disaster types.

Authors:Zhuguanyu Wu, Shihe Wang, Jiayi Zhang, Jiaxin Chen, Yunhong Wang
Title: FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
Abstract:
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years, as it avoids computationally intensive model retraining. Nevertheless, current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization. To address these shortcomings, we analyze the prevailing Hessian-guided quantization loss, and uncover certain limitations of conventional Hessian approximations. By following the block-wise reconstruction framework, we propose a novel PTQ method for ViTs, dubbed FIMA-Q. Specifically, we first establish the connection between KL divergence and the Fisher information matrix (FIM), which enables fast computation of the quantization loss during reconstruction. We further propose an efficient FIM approximation method, namely DPLR-FIM, by employing the diagonal plus low-rank principle, and formulate the ultimate quantization loss. Our extensive experiments, conducted across various vision tasks with representative ViT-based architectures on public datasets, demonstrate that our method substantially improves accuracy compared to state-of-the-art approaches, especially in the case of low-bit quantization. The source code is available at https://github.com/ShiheWang/FIMA-Q.
Chinese: 本文提出FIMA-Q这一新型视觉Transformer后训练量化方法,通过快速Fisher信息矩阵近似和分块重建框架解决低比特量化精度损失问题,在多项视觉任务中显著超越现有最优方法。
English: This paper introduces FIMA-Q, a novel post-training quantization method for Vision Transformers that addresses accuracy degradation in low-bit settings by leveraging a fast Fisher Information Matrix approximation and block-wise reconstruction, achieving superior performance across various tasks.
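
A minimal numerical sketch of the diagonal-plus-low-rank idea: approximate an empirical Fisher information matrix built from per-sample gradients as its diagonal plus a few dominant eigen-modes of the residual. This is a generic illustration, not the DPLR-FIM derivation used inside FIMA-Q's block-wise reconstruction loss; the shapes and rank are arbitrary.
```python
# Hedged sketch: a generic diagonal-plus-low-rank (DPLR) approximation of an
# empirical Fisher information matrix built from per-sample gradients.
import numpy as np

def dplr_fisher(per_sample_grads: np.ndarray, rank: int) -> np.ndarray:
    # per_sample_grads: (n, p) gradient of the loss w.r.t. p parameters per sample.
    n, p = per_sample_grads.shape
    fim = per_sample_grads.T @ per_sample_grads / n          # empirical FIM, (p, p)
    diag = np.diag(np.diag(fim))                             # diagonal part
    resid = fim - diag
    w, v = np.linalg.eigh(resid)                             # symmetric eigendecomposition
    top = np.argsort(np.abs(w))[::-1][:rank]                 # keep the largest-magnitude modes
    low_rank = (v[:, top] * w[top]) @ v[:, top].T
    return diag + low_rank

g = np.random.default_rng(0).normal(size=(512, 32))
approx = dplr_fisher(g, rank=4)
full = g.T @ g / 512
print(np.linalg.norm(full - approx) / np.linalg.norm(full))  # relative approximation error
```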

Authors:Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan
Title: Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Abstract:
Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.
Chinese: 提出的Manager插件通过自适应整合多层级单模态专家知识,在双塔视觉语言模型和多模态大语言模型中均实现了性能提升,并与多网格算法形成互补,增强了视觉细节的捕捉能力。
English: The proposed Manager plugin enhances Two-Tower Vision-Language Models by adaptively integrating multi-level unimodal expertise, achieving superior performance across VL tasks and MLLM architectures while complementing multi-grid algorithms for richer visual representation.

Authors:Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng, Hao Lin, Di Hu
Title: RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer
Abstract:
Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively emphasizing modalities based on the characteristics of the input data. However, through a series of carefully designed experiments, we surprisingly observe that the dynamic adaptability of widely used self-attention models diminishes: the model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating the attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method, Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ, and the restoration of cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.
Chinese: 研究发现多模态学习中的自注意力机制常无法动态适应不同模态质量,反而偏好单一模态形成自我强化偏差,但提出的滚动查询(RollingQ)方法通过轮换查询平衡注意力分配,成功恢复了模型的动态适应性。
English: The study reveals that self-attention mechanisms in multimodal learning often fail to dynamically adapt to varying modality qualities, instead favoring one modality and creating a self-reinforcing bias, but proposes Rolling Query (RollingQ) to restore adaptability by rotating queries to balance attention allocation.

Authors:Jie Zhu, Leye Wang
Title: Auditing Data Provenance in Real-world Text-to-Image Diffusion Models for Privacy and Copyright Protection
Abstract:
Since their introduction, text-to-image diffusion models have significantly influenced content creation due to their impressive generation capability. However, this capability depends on large-scale text-image datasets gathered from web platforms like social media, posing substantial challenges in copyright compliance and raising risks of personal privacy leakage. Although some efforts have been devoted to exploring approaches for auditing data provenance in text-to-image diffusion models, existing work either makes the unrealistic assumption that model internal knowledge, e.g., intermediate results, is accessible, or relies on unreliable evaluation. To fill this gap, we propose a completely black-box auditing framework called Feature Semantic Consistency-based Auditing (FSCA). It utilizes two types of semantic connections within the text-to-image diffusion model for auditing, eliminating the need for access to internal knowledge. To demonstrate the effectiveness of our FSCA framework, we perform extensive experiments on the LAION-mi dataset and the COCO dataset, and compare with eight state-of-the-art baseline approaches. The results show that FSCA surpasses previous baseline approaches across various metrics and different data distributions, showcasing its superiority. Moreover, we introduce a recall balance strategy and a threshold adjustment strategy, which collectively allow FSCA to reach a user-level accuracy of 90% in a real-world auditing scenario with only 10 samples/user, highlighting its strong auditing potential in real-world applications. Our code is made available at https://github.com/JiePKU/FSCA.
Chinese: 提出的FSCA框架通过利用语义一致性实现了文本到图像扩散模型的黑盒审计,无需访问模型内部知识,并在实际应用中展现出卓越的准确性。
English: The proposed FSCA framework enables black-box auditing of text-to-image diffusion models by leveraging semantic consistency, eliminating the need for internal model knowledge and achieving superior accuracy in real-world scenarios.

Authors:Xiaoxin Lu, Ranran Haoran Zhang, Yusen Zhang, Rui Zhang
Title: Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation
Abstract:
People are informed of daily task plans through diverse media involving both text and images. However, most prior research focuses only on LLMs' capability for textual plan generation. The potential of large-scale models in providing text-image plans remains understudied. Generating high-quality text-image plans faces two main challenges: ensuring consistent alignment between the two modalities and keeping coherence among visual steps. To address these challenges, we propose a novel framework that generates and refines text-image plans step-by-step. At each iteration, our framework (1) drafts the next textual step based on the prediction history; (2) edits the last visual step to obtain the next one; (3) extracts PDDL-like visual information; and (4) refines the draft with the extracted visual information. The textual and visual steps produced in stages (4) and (2) then serve as inputs for the next iteration. Our approach offers a plug-and-play improvement to various backbone models, such as Mistral-7B, Gemini-1.5, and GPT-4o. To evaluate the effectiveness of our approach, we collect a new benchmark consisting of 1,100 tasks and their text-image pair solutions covering 11 daily topics. We also design and validate a new set of metrics to evaluate the multimodal consistency and coherence in text-image plans. Extensive experiment results show the effectiveness of our approach on a range of backbone models against competitive baselines. Our code and data are available at https://github.com/psunlpgroup/MPlanner.
Chinese: 本文提出了一种新颖框架,通过逐步生成和优化文本-图像计划来解决多模态对齐与连贯性挑战,并借助新基准和评估指标证明了该框架在不同骨干模型中的即插即用有效性。
English: This paper introduces a novel framework that generates and refines text-image plans step-by-step, addressing multimodal alignment and coherence challenges while demonstrating plug-and-play effectiveness across various backbone models through a new benchmark and metrics.

Authors:Xianlu Li, Nicolas Nadisic, Shaoguang Huang, Nikos Deligiannis, Aleksandra Pižurica
Title: Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images
Abstract:
Subspace clustering has become widely adopted for the unsupervised analysis of hyperspectral images (HSIs). Recent model-aware deep subspace clustering methods often use a two-stage framework, involving the calculation of a self-representation matrix with O(n^2) complexity, followed by spectral clustering. However, these methods are computationally intensive, generally incorporate only local or only non-local spatial structure constraints, and their structural constraints fall short of effectively supervising the entire clustering process. We propose a scalable, context-preserving deep clustering method based on basis representation, which jointly captures local and non-local structures for efficient HSI clustering. To preserve local structure (i.e., spatial continuity within subspaces), we introduce a spatial smoothness constraint that aligns clustering predictions with their spatially filtered versions. For non-local structure (i.e., spectral continuity), we employ a mini-cluster-based scheme that refines predictions at the group level, encouraging spectrally similar pixels to belong to the same subspace. Notably, these two constraints are jointly optimized to reinforce each other. Specifically, our model is designed as a one-stage approach in which the structural constraints are applied to the entire clustering process. The time and space complexity of our method is O(n), making it applicable to large-scale HSI data. Experiments on real-world datasets show that our method outperforms state-of-the-art techniques. Our code is available at: https://github.com/lxlscut/SCDSC
Chinese: 该方法通过单阶段深度学习框架,以线性复杂度联合优化局部和非局部结构约束,在保持空间连续性和光谱相似性的同时显著提升了高光谱图像聚类性能。
English: The proposed one-stage deep clustering method efficiently captures both local and non-local structures in hyperspectral images with linear complexity, outperforming existing techniques through joint optimization of spatial and spectral constraints.
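
A hedged sketch of the local-structure idea described above: penalize disagreement between per-pixel soft cluster assignments and a spatially filtered (box-averaged) copy of them. The filter, divergence, and tensor shapes below are illustrative choices, not the paper's exact constraint.
```python
# Hedged sketch of a spatial-smoothness constraint: each pixel's soft cluster
# assignment should agree with a box-filtered version of the assignment map.
import torch
import torch.nn.functional as F

def spatial_smoothness_loss(assignments: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    # assignments: (B, K, H, W) soft cluster probabilities per pixel.
    smoothed = F.avg_pool2d(assignments, kernel_size, stride=1,
                            padding=kernel_size // 2, count_include_pad=False)
    # Per-pixel KL(assignments || smoothed), averaged over batch and space.
    kl = (assignments * (assignments.clamp_min(1e-8).log()
                         - smoothed.clamp_min(1e-8).log())).sum(dim=1)
    return kl.mean()

probs = torch.softmax(torch.randn(2, 6, 64, 64), dim=1)  # toy per-pixel cluster assignments
print(spatial_smoothness_loss(probs).item())
```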

Authors:Mae Younes, Adnane Boukhayma
Title: Anti-Aliased 2D Gaussian Splatting
Abstract:
2D Gaussian Splatting (2DGS) has recently emerged as a promising method for novel view synthesis and surface reconstruction, offering better view-consistency and geometric accuracy than volumetric 3DGS. However, 2DGS suffers from severe aliasing artifacts when rendering at different sampling rates than those used during training, limiting its practical applications in scenarios requiring camera zoom or varying fields of view. We identify that these artifacts stem from two key limitations: the lack of frequency constraints in the representation and an ineffective screen-space clamping approach. To address these issues, we present AA-2DGS, an antialiased formulation of 2D Gaussian Splatting that maintains its geometric benefits while significantly enhancing rendering quality across different scales. Our method introduces a world space flat smoothing kernel that constrains the frequency content of 2D Gaussian primitives based on the maximal sampling frequency from training views, effectively eliminating high-frequency artifacts when zooming in. Additionally, we derive a novel object space Mip filter by leveraging an affine approximation of the ray-splat intersection mapping, which allows us to efficiently apply proper anti-aliasing directly in the local space of each splat.
Chinese: AA-2DGS通过引入世界空间平滑核和新型物体空间Mip滤波器,解决了二维高斯泼溅的混叠问题,在保持几何精度的同时显著提升了多尺度渲染质量。
English: AA-2DGS addresses aliasing issues in 2D Gaussian Splatting by introducing a world space smoothing kernel and a novel object space Mip filter, enhancing rendering quality across scales while preserving geometric accuracy.

Authors:Hongyu Chen, Jiping Liu, Yong Wang, Jun Zhu, Dejun Feng, Yakun Xie
Title: Teaching in adverse scenes: a statistically feedback-driven threshold and mask adjustment teacher-student framework for object detection in UAV images under adverse scenes
Abstract:
Unsupervised Domain Adaptation (UDA) has shown promise in effectively alleviating the performance degradation caused by domain gaps between source and target domains, and it can potentially be generalized to UAV object detection in adverse scenes. However, existing UDA studies are based on natural images or clear UAV imagery, and research focused on UAV imagery in adverse conditions is still in its infancy. Moreover, due to the unique perspective of UAVs and the interference from adverse conditions, these methods often fail to accurately align features and are influenced by limited or noisy pseudo-labels. To address this, we propose the Statistical Feedback-Driven Threshold and Mask Adjustment Teacher-Student Framework (SF-TMAT), which establishes the first benchmark for UAV object detection in adverse scenes. Specifically, SF-TMAT introduces a design called Dynamic Step Feedback Mask Adjustment Autoencoder (DSFMA), which dynamically adjusts the mask ratio and reconstructs feature maps by integrating training progress and loss feedback. This approach dynamically adjusts the learning focus at different training stages to meet the model's needs for learning features at varying levels of granularity. Additionally, we propose a unique Variance Feedback Smoothing Threshold (VFST) strategy, which statistically computes the mean confidence of each class and dynamically adjusts the selection threshold by incorporating a variance penalty term. This strategy improves the quality of pseudo-labels and uncovers potentially valid labels, thus mitigating domain bias. Extensive experiments demonstrate the superiority and generalization capability of the proposed SF-TMAT in UAV object detection under adverse scene conditions. The code is released at https://github.com/ChenHuyoo.
Chinese: 提出的SF-TMAT框架通过动态掩码调整和基于方差的阈值策略,有效提升恶劣场景下无人机目标检测的特征对齐与伪标签质量,实验证明其具有优越性能。
English: The proposed SF-TMAT framework introduces dynamic mask adjustment and variance-based thresholding to improve feature alignment and pseudo-label quality for UAV object detection in adverse conditions, demonstrating superior performance in experiments.
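
A hedged sketch of a variance-penalized per-class pseudo-label threshold in the spirit of VFST: each class's bar follows its mean teacher confidence, reduced by a variance penalty so unstable classes are not over-filtered. The penalty weight, floor, and ceiling are invented for illustration; the paper's exact statistics and smoothing differ.
```python
# Hedged sketch of variance-penalized per-class pseudo-label thresholds.
import numpy as np

def vfst_thresholds(confidences, labels, num_classes, lam=0.5, floor=0.25, ceil=0.95):
    # confidences: (N,) max softmax scores of teacher predictions
    # labels:      (N,) argmax class ids of those predictions
    thresholds = np.full(num_classes, floor)
    for c in range(num_classes):
        conf_c = confidences[labels == c]
        if conf_c.size == 0:
            continue
        tau = conf_c.mean() - lam * conf_c.var()   # mean confidence with a variance penalty
        thresholds[c] = np.clip(tau, floor, ceil)
    return thresholds

rng = np.random.default_rng(0)
conf = rng.uniform(0.2, 1.0, size=1000)
lab = rng.integers(0, 5, size=1000)
print(vfst_thresholds(conf, lab, num_classes=5))   # one selection threshold per class
```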

Authors:Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li
Title: WaveFormer: A Lightweight Transformer Model for sEMG-based Gesture Recognition
Abstract:
Human-machine interaction, particularly in prosthetic and robotic control, has seen progress with gesture recognition via surface electromyographic (sEMG) signals. However, classifying similar gestures that produce nearly identical muscle signals remains a challenge, often reducing classification accuracy. Traditional deep learning models for sEMG gesture recognition are large and computationally expensive, limiting their deployment on resource-constrained embedded systems. In this work, we propose WaveFormer, a lightweight transformer-based architecture tailored for sEMG gesture recognition. Our model integrates time-domain and frequency-domain features through a novel learnable wavelet transform, enhancing feature extraction. In particular, the WaveletConv module, a multi-level wavelet decomposition layer with depthwise separable convolution, ensures both efficiency and compactness. With just 3.1 million parameters, WaveFormer achieves 95% classification accuracy on the EPN612 dataset, outperforming larger models. Furthermore, when profiled on a laptop equipped with an Intel CPU, INT8 quantization achieves real-time deployment with a 6.75 ms inference latency.
Chinese: WaveFormer提出了一种基于Transformer的轻量级模型,通过可学习的小波变换融合时频特征,仅用310万参数就实现了95%的分类准确率,并能在嵌入式系统上实时部署。
English: WaveFormer introduces a lightweight transformer-based model that integrates time-frequency features through a learnable wavelet transform, achieving 95% accuracy with only 3.1 million parameters and enabling real-time deployment on embedded systems.
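
A hedged PyTorch stand-in for the WaveletConv idea: one level of a learnable wavelet decomposition along time (filters initialized to Haar low/high pass) followed by a depthwise separable convolution. WaveFormer's actual multi-level module and its exact parameterization are not reproduced here.
```python
# Hedged stand-in: single-level learnable wavelet split + depthwise separable conv.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletConvBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        h = 1.0 / math.sqrt(2.0)
        # Learnable analysis filters, one low-pass and one high-pass pair per channel (Haar init).
        self.low = nn.Parameter(torch.full((channels, 1, 2), h))
        self.high = nn.Parameter(torch.tensor([[[h, -h]]]).repeat(channels, 1, 1))
        # Depthwise separable convolution over the concatenated sub-bands.
        self.depthwise = nn.Conv1d(2 * channels, 2 * channels, kernel_size=3,
                                   padding=1, groups=2 * channels)
        self.pointwise = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) sEMG feature sequence; output: (B, C, T // 2)
        lo = F.conv1d(x, self.low, stride=2, groups=x.shape[1])
        hi = F.conv1d(x, self.high, stride=2, groups=x.shape[1])
        bands = torch.cat([lo, hi], dim=1)
        return self.pointwise(F.gelu(self.depthwise(bands)))

block = WaveletConvBlock(channels=16)
print(block(torch.randn(4, 16, 200)).shape)   # torch.Size([4, 16, 100])
```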

Authors:Simon Ghyselincks, Valeriia Okhmak, Stefano Zampini, George Turkiyyah, David Keyes, Eldad Haber
Title: Synthetic Geology -- Structural Geology Meets Deep Learning
Abstract:
Visualizing the first few kilometers of the Earth's subsurface, a long-standing challenge gating a virtually inexhaustible list of important applications, is coming within reach through deep learning. Building on techniques of generative artificial intelligence applied to voxelated images, we demonstrate a method that extends surface geological data supplemented by boreholes to a three-dimensional subsurface region by training a neural network. The Earth's land area having been extensively mapped for geological features, the bottleneck of this or any related technique is the availability of data below the surface. We close this data gap in the development of subsurface deep learning by designing a synthetic data-generator process that mimics eons of geological activity such as sediment compaction, volcanic intrusion, and tectonic dynamics to produce a virtually limitless number of samples of the near lithosphere. A foundation model trained on such synthetic data is able to generate a 3D image of the subsurface from a previously unseen map of surface topography and geology, showing increasing fidelity with increasing access to borehole data, depicting such structures as layers, faults, folds, dikes, and sills. We illustrate the early promise of the combination of a synthetic lithospheric generator with a trained neural network model using generative flow matching. Ultimately, such models will be fine-tuned on data from applicable campaigns, such as mineral prospecting in a given region. Though useful in itself, a regionally fine-tuned model may be employed not as an end but as a means: as an AI-based regularizer in a more traditional inverse problem application, in which the objective function represents the mismatch of additional data with physical models, with applications in resource exploration, hazard assessment, and geotechnical engineering.
Chinese: 通过生成式人工智能,利用地表数据和钻孔信息训练神经网络,结合模拟地质活动的合成数据,实现了对地球近地表三维结构的可视化建模。
English: Deep learning is enabling the visualization of the Earth's subsurface by using generative AI to create 3D models from surface data and boreholes, trained on synthetic geological data that mimics natural processes.

Authors:Linhao Yu, Xinguang Ji, Yahui Liu, Fanheng Kong, Chenxi Sun, Jingyuan Zhang, Hongzhi Zhang, V. W., Fuzheng Zhang, Deyi Xiong
Title: Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search
Abstract:
Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (i.e., key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects' attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.
Chinese: AutoCaption是一个利用蒙特卡洛树搜索自动生成多样化视频描述的框架,它构建的MCTS-VCB基准能全面评估多模态大语言模型,并通过微调显著提升模型性能。
English: AutoCaption is an automatic framework using Monte Carlo Tree Search to generate diverse and detailed video captions, creating the MCTS-VCB benchmark that comprehensively evaluates Multimodal Large Language Models and enhances their performance through fine-tuning.

Authors:Sharvari Kamble
Title: SLRNet: A Real-Time LSTM-Based Sign Language Recognition System
Abstract:
Sign Language Recognition (SLR) plays a crucial role in bridging the communication gap between the hearing-impaired community and society. This paper introduces SLRNet, a real-time webcam-based ASL recognition system using MediaPipe Holistic and Long Short-Term Memory (LSTM) networks. The model processes video streams to recognize both ASL alphabet letters and functional words. With a validation accuracy of 86.7%, SLRNet demonstrates the feasibility of inclusive, hardware-independent gesture recognition.
Chinese: SLRNet是一种基于网络摄像头的实时系统,结合MediaPipe Holistic和LSTM网络,以86.7%的准确率识别美国手语字母和功能词,实现了包容性的手势识别。
English: SLRNet is a real-time webcam-based system using MediaPipe Holistic and LSTM networks to recognize ASL alphabet letters and functional words with 86.7% accuracy, enabling inclusive gesture recognition.
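
A minimal sketch of the kind of LSTM classifier described above, operating on per-frame keypoint vectors. The 1662-dimensional input (a common flattening of MediaPipe Holistic pose, face, and hand landmarks), the hidden size, and the class count are assumptions, not values taken from the paper.
```python
# Minimal sketch of an LSTM gesture classifier over per-frame keypoint vectors.
import torch
import torch.nn as nn

class SLRNetSketch(nn.Module):
    def __init__(self, input_dim: int = 1662, hidden_dim: int = 128, num_classes: int = 30):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                            batch_first=True, dropout=0.3)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        # keypoints: (B, T, input_dim) sequence of flattened landmarks per frame.
        _, (h_n, _) = self.lstm(keypoints)
        return self.head(h_n[-1])            # logits from the last layer's final hidden state

model = SLRNetSketch()
logits = model(torch.randn(8, 30, 1662))     # 8 clips of 30 frames each
print(logits.shape)                          # torch.Size([8, 30])
```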

Authors:Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, Zuozhu Liu
Title: 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks
Abstract:
Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs, exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available at https://github.com/Tang-xiaoxiao/M3D-RAD.
Chinese: 本文提出了3D-RAD这一基于CT扫描的大规模三维医学视觉问答数据集,包含六项多样化任务,揭示了现有视觉语言模型在复杂诊断推理中的局限性,并通过专家标注数据的微调实现了显著性能提升。
English: This paper introduces 3D-RAD, a large-scale dataset for 3D medical visual question answering using CT scans, featuring six diverse tasks that reveal the limitations of current vision-language models in complex diagnostic reasoning and demonstrate significant performance improvements through fine-tuning on expert-aligned data.

Authors:Chao Liang, Jianwen Jiang, Wang Liao, Jiaqi Yang, Zerong Zheng, Weihong Zeng, Han Liang
Title: AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation
Abstract:
Recent advancements in human video generation and animation tasks, driven by diffusion models, have achieved significant progress. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose AlignHuman, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy to jointly optimize these competing objectives. Our key insight stems from an analysis of the denoising process across timesteps: (1) early denoising timesteps primarily control motion dynamics, while (2) fidelity and human structure can be effectively managed by later timesteps, even if early steps are skipped. Building on this observation, we propose timestep-segment preference optimization (TPO) and introduce two specialized LoRAs as expert alignment modules, each targeting a specific dimension in its corresponding timestep interval. The LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Extensive experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3x speedup (from 100 NFEs to 30 NFEs) with minimal impact on generation quality. Homepage: https://alignhuman.github.io/
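
A speculative toy illustration of the timestep-segment routing described above: a "motion" expert is active during early (high-noise) denoising steps and a "fidelity" expert during later ones, each applied as a low-rank weight update. The split point, the LoRA mechanics, and all names are placeholders rather than AlignHuman's implementation.
```python
# Toy illustration of timestep-segment expert routing; all values are placeholders.
import numpy as np

def active_expert(t: int, t_max: int, split: float = 0.6) -> str:
    # Early timesteps (t close to t_max) mostly shape motion dynamics;
    # later timesteps refine fidelity and human structure.
    return "motion" if t > split * t_max else "fidelity"

def apply_lora(base_weight: np.ndarray, loras: dict, name: str) -> np.ndarray:
    a, b = loras[name]                      # low-rank factors of shape (d, r) and (r, d)
    return base_weight + a @ b              # W + delta_W for the active expert

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))
loras = {"motion": (rng.normal(size=(16, 2)), rng.normal(size=(2, 16))),
         "fidelity": (rng.normal(size=(16, 2)), rng.normal(size=(2, 16)))}

t_max = 1000
for t in [999, 700, 400, 100]:              # a few denoising timesteps, high noise first
    name = active_expert(t, t_max)
    w_t = apply_lora(w, loras, name)
    print(t, name, w_t.shape)
```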

Authors:Namhoon Kim, Sara Fridovich-Keil
Title: Grids Often Outperform Implicit Neural Representations
Abstract:
Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for most tasks and signals, a simple regularized grid with interpolation trains faster and to higher quality than any INR with the same number of parameters. We also find limited settings where INRs outperform grids -- namely fitting signals with underlying lower-dimensional structure such as shape contours -- to guide future use of INRs towards the most advantageous applications. Code and synthetic signals used in our analysis are available at https://github.com/voilalab/INR-benchmark.
Chinese: 本研究评估了多种隐式神经表示和网格方法在不同任务中的表现,发现正则化网格在训练速度和质量上通常优于INRs,但在处理如形状轮廓等低维结构时,INRs展现出独特优势,为未来应用提供了指导。
English: This study evaluates various Implicit Neural Representations (INRs) and grid-based methods across multiple tasks, revealing that regularized grids generally outperform INRs in training speed and quality, except in cases involving lower-dimensional structures like shape contours, which highlight specific advantages for INRs.
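
A hedged sketch of the grid baseline discussed above: a learnable 2D grid read out with bilinear interpolation, fit to a target signal with an MSE term plus a total-variation regularizer. Grid resolution, TV weight, and optimizer settings are illustrative choices, not the paper's configuration.
```python
# Hedged sketch: fit a regularized, bilinearly-interpolated 2D grid to a signal.
import torch
import torch.nn.functional as F

target = torch.rand(1, 1, 128, 128)                      # stand-in for a real 2D signal
grid = torch.zeros(1, 1, 64, 64, requires_grad=True)     # learnable grid, coarser than target
opt = torch.optim.Adam([grid], lr=1e-2)

# Query coordinates covering the target image, in grid_sample's [-1, 1] convention.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 128), torch.linspace(-1, 1, 128), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).unsqueeze(0)      # (1, 128, 128, 2), (x, y) order

for step in range(500):
    pred = F.grid_sample(grid, coords, mode="bilinear", align_corners=True)
    tv = (grid[..., 1:, :] - grid[..., :-1, :]).abs().mean() + \
         (grid[..., :, 1:] - grid[..., :, :-1]).abs().mean()
    loss = F.mse_loss(pred, target) + 1e-3 * tv
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())                                       # reconstruction error after fitting
```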

Authors:Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome
Title: JAFAR: Jack up Any Feature at Any Resolution
Abstract:
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
Chinese: 基础视觉编码器需要特征上采样以实现高分辨率任务,JAFAR被提出作为一种轻量级、灵活的上采样器,无需高分辨率监督即可提升空间分辨率,并在多种任务中优于现有方法。
English: Foundation Vision Encoders require feature upsampling for high-resolution tasks, and JAFAR is introduced as a lightweight, flexible upsampler that enhances spatial resolution without high-resolution supervision, outperforming existing methods across various tasks.

Authors:Christos Pantazopoulos, Spyridon Thermos, Gerasimos Potamianos
Title: Monocular 3D Hand Pose Estimation with Implicit Camera Alignment
Abstract:
Estimating the 3D hand articulation from a single color image is an important problem with applications in Augmented Reality (AR), Virtual Reality (VR), Human-Computer Interaction (HCI), and robotics. Apart from the absence of depth information, occlusions, articulation complexity, and the need for knowledge of camera parameters pose additional challenges. In this work, we propose an optimization pipeline for estimating the 3D hand articulation from 2D keypoint input, which includes a keypoint alignment step and a fingertip loss to overcome the need to know or estimate the camera parameters. We evaluate our approach on the EgoDexter and Dexter+Object benchmarks to showcase that it performs competitively with the state-of-the-art, while also demonstrating its robustness when processing "in-the-wild" images without any prior camera knowledge. Our quantitative analysis highlights the sensitivity of the pipeline to the accuracy of the 2D keypoint input, despite the use of hand priors. Code is available at the project page https://cpantazop.github.io/HandRepo/

Authors:Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan
Title: SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis
Abstract:
Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: https://chen-wl20.github.io/SceneCompleter

Authors:Junqi You, Chieh Hubert Lin, Weijie Lyu, Zhengbo Zhang, Ming-Hsuan Yang
Title: InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model
Abstract:
Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on the large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000x speed-up over prior methods while maintaining state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting. More video results are available at our project page: https://dhmbb2.github.io/InstaInpaint_page/.

Authors:Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Sangwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Title: Fine-Grained Perturbation Guidance via Attention Head Selection
Abstract:
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
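
A minimal sketch of the SoftPAG interpolation described above: blend a selected head's attention map toward the identity matrix, with alpha acting as a continuous perturbation-strength knob. HeadHunter's iterative head selection is not shown, and the tensor shapes are illustrative.
```python
# Minimal sketch of SoftPAG-style attention perturbation on selected heads.
import torch

def softpag(attn: torch.Tensor, alpha: float) -> torch.Tensor:
    # attn: (B, heads, N, N) row-stochastic attention maps of the chosen heads.
    eye = torch.eye(attn.shape[-1], device=attn.device).expand_as(attn)
    return (1.0 - alpha) * attn + alpha * eye   # alpha=0 keeps attn, alpha=1 is identity

scores = torch.randn(2, 4, 16, 16)
attn = scores.softmax(dim=-1)
print(softpag(attn, alpha=0.3).sum(-1)[0, 0, :3])   # rows still sum to 1
```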

Authors:Sicheng Zuo, Wenzhao Zheng, Xiaoyong Han, Longchao Yang, Yong Pan, Jiwen Lu
Title: QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction
Abstract:
3D occupancy prediction is crucial for robust autonomous driving systems as it enables comprehensive perception of environmental structures and semantics. Most existing methods employ dense voxel-based scene representations, ignoring the sparsity of driving scenes and resulting in inefficiency. Recent works explore object-centric representations based on sparse Gaussians, but their ellipsoidal shape prior limits the modeling of diverse structures. In real-world driving scenes, objects exhibit rich geometries (e.g., cuboids, cylinders, and irregular shapes), necessitating excessive ellipsoidal Gaussians densely packed for accurate modeling, which leads to inefficient representations. To address this, we propose to use geometrically expressive superquadrics as scene primitives, enabling efficient representation of complex structures with fewer primitives through their inherent shape diversity. We develop a probabilistic superquadric mixture model, which interprets each superquadric as an occupancy probability distribution with a corresponding geometry prior, and calculates semantics through probabilistic mixture. Building on this, we present QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction, and introduce a pruning-and-splitting module to further enhance modeling efficiency by concentrating superquadrics in occupied regions. Extensive experiments on the nuScenes dataset demonstrate that QuadricFormer achieves state-of-the-art performance while maintaining superior efficiency.
Chinese: 本文提出QuadricFormer模型,通过采用几何表达力强的超二次曲面作为场景基元,以更少的基元有效表示复杂结构,克服了密集体素和椭球高斯方法的局限性,在nuScenes数据集上实现了最优性能并保持了卓越效率。
English: This paper introduces QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction that overcomes the limitations of dense voxel and ellipsoidal Gaussian methods by using geometrically expressive superquadrics to represent complex structures with fewer primitives, achieving state-of-the-art performance on the nuScenes dataset.
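
A hedged sketch of the superquadric primitive underlying this representation: the standard inside-outside function with a soft occupancy readout. The axis-aligned single-primitive setup and the sigmoid sharpness are simplifications; QuadricFormer's probabilistic mixture model and semantic heads are not reproduced.
```python
# Hedged sketch: superquadric inside-outside function with a soft occupancy readout.
import numpy as np

def superquadric_occupancy(points, scale, eps1, eps2, k=10.0):
    # points: (N, 3) coordinates in the primitive's local frame
    # scale:  (3,) half-extents a, b, c; eps1/eps2 control shape
    #         (1.0 -> ellipsoid, values near 0.1 -> box-like, small eps1 with eps2 ~ 1 -> cylinder-like)
    x, y, z = (np.abs(points) / scale).T
    f = ((x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
         + z ** (2.0 / eps1))                     # F <= 1 means inside the surface
    return 1.0 / (1.0 + np.exp(k * (f - 1.0)))    # soft occupancy probability

pts = np.array([[0.0, 0.0, 0.0], [0.5, 0.2, 0.1], [1.2, 0.0, 0.0]])
print(superquadric_occupancy(pts, scale=np.array([1.0, 1.0, 1.0]), eps1=0.5, eps2=0.5))
```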

Authors:Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, Yueqi Duan
Title: GenWorld: Towards Detecting AI-generated Real-world Simulation Videos
Abstract:
The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld
Chinese: 视频生成技术的飞速发展加大了对AI生成视频检测器的需求,为此我们提出了高质量真实世界模拟数据集GenWorld和利用多视角一致性的SpannDetector检测模型,显著提升了检测性能。
English: The rapid advancement of video generation technology has increased the need for effective AI-generated video detectors, leading to the creation of GenWorld, a high-quality real-world simulation dataset, and SpannDetector, a model that uses multi-view consistency to improve detection accuracy.

Authors:Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Title: Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Abstract:
In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
Chinese: CDPruner提出了一种无需训练、与模型无关的视觉令牌剪枝方法,通过行列式点过程最大化条件多样性,在显著降低多模态大语言模型计算成本的同时保持了性能表现。
English: CDPruner introduces a training-free, model-agnostic visual token pruning method that maximizes conditional diversity using determinantal point processes, significantly reducing computational costs while preserving performance across multimodal large language models.
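
A hedged sketch of diversity-aware token selection with a DPP: couple pairwise visual-token similarity with per-token instruction relevance in a kernel, then greedily pick the subset whose submatrix has maximal log-determinant. The kernel construction and the naive greedy loop are for illustration only; CDPruner's conditional similarity definition and its efficient implementation differ.
```python
# Hedged sketch: greedy log-det (DPP MAP) selection over an instruction-weighted kernel.
import numpy as np

def greedy_dpp_select(tokens, relevance, k):
    # tokens: (n, d) visual token features; relevance: (n,) instruction relevance in (0, 1].
    feats = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = 0.5 * (feats @ feats.T + 1.0)               # cosine similarity mapped to [0, 1]
    kernel = relevance[:, None] * sim * relevance[None, :]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(tokens)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # sign <= 0 can only arise from numerical issues; such candidates are skipped.
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
toks = rng.normal(size=(50, 32))
rel = rng.uniform(0.2, 1.0, size=50)
print(greedy_dpp_select(toks, rel, k=8))              # indices of retained tokens
```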

Authors:Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Abstract:
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
Chinese: 本文提出知识图像生成作为新任务及MMMG基准,通过评估16个领先模型揭示了当前AI在多模态推理方面的显著不足。
English: This paper introduces knowledge image generation as a new task and the MMMG benchmark to evaluate multimodal reasoning in AI models, revealing significant gaps in current systems through comprehensive testing of 16 leading models.

Authors:Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu
Title: SpectralAR: Spectral Autoregressive Visual Generation
Abstract:
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
中文摘要:SpectralAR框架通过谱视角实现视觉序列的因果性,采用嵌套谱标记化将图像转化为有序频谱标记进行自回归生成,以少量标记和参数取得了优异性能。
English Summary: The SpectralAR framework introduces a novel autoregressive visual generation method using spectral tokens to ensure causality and efficiency, achieving state-of-the-art results with minimal tokens and parameters.
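Illustrative sketch: SpectralAR's Nested Spectral Tokenization is a learned tokenizer, but the underlying coarse-to-fine idea, ordering image content from low to high frequency, can be previewed with a plain 2D FFT. The radial band count and grouping below are arbitrary choices made for illustration, not the paper's design.

```python
import numpy as np

def spectral_bands(image, num_bands=8):
    """Split an image's 2D FFT into radial frequency bands, low to high.
    Returns per-band spectra that can be accumulated coarse-to-fine."""
    h, w = image.shape
    spec = np.fft.fftshift(np.fft.fft2(image))
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    edges = np.linspace(0, radius.max() + 1e-8, num_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (radius >= lo) & (radius < hi)
        bands.append(spec * mask)             # keep only this frequency ring
    return bands

# coarse-to-fine reconstruction: each added band refines the previous image
img = np.random.rand(64, 64)
partial = np.zeros_like(img, dtype=complex)
for band in spectral_bands(img):
    partial += band
    recon = np.fft.ifft2(np.fft.ifftshift(partial)).real
print(np.allclose(recon, img))  # True once all bands are included
```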

Authors:Jiancheng Huang, Gengwei Zhang, Zequn Jie, Siyu Jiao, Yinlong Qian, Ling Chen, Yunchao Wei, Lin Ma
Title: M4V: Multi-Modal Mamba for Text-to-Video Generation
Abstract:
Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768×1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.
中文: M4V框架采用多模态Mamba架构,通过创新的令牌重组设计和奖励学习策略,在保持高质量视频生成的同时,将计算成本显著降低45%。
English: The M4V framework introduces a multi-modal Mamba architecture that significantly reduces computational costs by 45% while maintaining high-quality video generation through innovative token re-composition and reward learning strategies.

Authors:Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi, Lei Ma, Hui Zhang, Jie Shao, Xinglong Wu
Title: CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
Abstract:
Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on the rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: https://github.com/graphic-design-ai/creatiposter
中文摘要:CreatiPoster是一个通过AI生成可编辑多层图形设计的框架,它基于用户输入创建专业级排版,在自动布局和视觉质量上超越现有工具,并支持响应式调整与动画等多样化应用。
English Summary: CreatiPoster is an AI framework that generates editable, multi-layer graphic designs from user inputs, outperforming existing tools by producing professional-quality compositions while enabling diverse applications like responsive resizing and animation.
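Illustrative sketch: the abstract describes a protocol model that emits a JSON specification per layer plus a concise background prompt. The example below shows what such a multi-layer specification might look like; all field names and values are hypothetical, since the paper's schema is not reproduced here.

```python
import json

# Hypothetical layer specification in the spirit of CreatiPoster's protocol
# model output: every layer carries layout, hierarchy, content and style,
# plus a prompt for the conditional background model.
spec = {
    "canvas": {"width": 1080, "height": 1440},
    "background_prompt": "soft gradient studio backdrop, warm morning light",
    "layers": [
        {"type": "asset", "z": 0, "bbox": [140, 300, 800, 800],
         "content": "user_product.png", "style": {"shadow": True}},
        {"type": "text", "z": 1, "bbox": [90, 80, 900, 160],
         "content": "Summer Sale",
         "style": {"font": "Serif-Bold", "size": 96,
                   "color": "#1A1A1A", "align": "center"}},
    ],
}
print(json.dumps(spec, indent=2))
```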

Authors:Hang Zhang, Xiang Chen, Renjiu Hu, Rongguang Wang, Jinwei Zhang, Min Liu, Yaonan Wang, Gaolei Li, Xinxing Cheng, Jinming Duan
Title: Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing
Abstract:
Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network's forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.
中文摘要:本文提出SmoothProper,一种即插即用的神经模块,通过强制平滑性和促进网络前向传播中的信息传递,有效解决了视网膜血管等特征稀疏图像中的孔径和大位移挑战,提升了无监督形变图像配准的性能。
English Summary: The paper introduces SmoothProper, a plug-and-play neural module that enhances unsupervised deformable image registration by enforcing smoothness and enabling effective message passing to address aperture and large-displacement challenges in feature-sparse images like retinal vessels.

Authors:Ignacio Bugueno-Cordova, Javier Ruiz-del-Solar, Rodrigo Verschae
Title: Human-Robot Navigation using Event-based Cameras and Reinforcement Learning
Abstract:
This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.

Authors:SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, Yeying Jin, Junfeng Luo, Xiaoming Wei, Lei Zhu
Title: PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
Abstract:
Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: https://ephemeral182.github.io/PosterCraft
中文: PosterCraft框架通过级联工作流程突破传统固定布局限制,能生成文本精准、艺术元素融合自然的高质量海报,在渲染精度和视觉吸引力上显著优于现有方法。
English: PosterCraft is a unified framework that overcomes the limitations of rigid layout designs by employing a cascaded workflow to generate aesthetically pleasing posters with precise text rendering and seamless artistic integration, significantly outperforming existing methods in accuracy and visual appeal.

Authors:Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu, Xian Wu, Yefeng Zheng
Title: IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
Abstract:
Recently, rapid advancements in vision-language models, such as CLIP, have led to significant progress in zero-/few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based ZFSAD methods commonly assume prior knowledge of categories and rely on carefully crafted prompts tailored to specific scenarios. While such meticulously designed text prompts effectively capture semantic information in the textual space, they fall short of distinguishing normal and anomalous instances within the joint embedding space. Moreover, these ZFSAD methods are predominantly explored in industrial scenarios, with few efforts devoted to medical tasks. To this end, we propose an innovative framework for ZFSAD tasks in the medical domain, denoted as IQE-CLIP. We reveal that query embeddings, which incorporate both textual and instance-aware visual information, are better indicators for abnormalities. Specifically, we first introduce class-based prompting tokens and learnable prompting tokens for better adaptation of CLIP to the medical domain. Then, we design an instance-aware query module (IQM) to extract region-level contextual information from both text prompts and visual features, enabling the generation of query embeddings that are more sensitive to anomalies. Extensive experiments conducted on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance on both zero-shot and few-shot tasks. We release our code and data at https://github.com/hongh0/IQE-CLIP/.
Chinese: 本文提出IQE-CLIP框架,通过融合文本和视觉信息生成查询嵌入,显著提升了医学影像中的零样本/少样本异常检测性能,在六个医学数据集上实现了最优表现。
English: This paper introduces IQE-CLIP, a novel framework that enhances zero-/few-shot anomaly detection in medical imaging by generating query embeddings that integrate textual and visual information, achieving state-of-the-art results across six medical datasets.

Authors:Xi Chen, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane
Title: ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation
Abstract:
Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations: 1) constrained efficiency of domain randomization due to their exclusive dependence on image style perturbation, and 2) neglect of the adverse effects of over-augmented images on model training. To address these issues, we propose a novel domain randomization-based DG method, called content style augmentation (ConStyX), for generalizable medical image segmentation. Specifically, ConStyX 1) augments the content and style of training data, allowing the augmented training data to better cover a wider range of data domains, and 2) leverages well-augmented features while mitigating the negative effects of over-augmented features during model training. Extensive experiments across multiple domains demonstrate that our ConStyX achieves superior generalization performance. The code is available at https://github.com/jwxsp1/ConStyX.
中文:提出的ConStyX方法通过同时增强训练数据的内容和风格特征,并减轻过度增强的负面影响,在跨域医学图像分割中实现了优越的泛化性能。
English: The proposed ConStyX method enhances domain generalization for medical image segmentation by augmenting both content and style of training data while mitigating negative effects from over-augmentation, achieving superior performance across multiple domains.

Authors:Marzieh Oghbaie, Teresa Araújo, Hrvoje Bogunović
Title: PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis
Abstract:
Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT
中文摘要:本文提出PiPViT模型,通过视觉转换器学习与临床相关的原型来近似病变范围,在视网膜OCT分类中实现优异性能的同时,提供可解释的诊断依据。
English Summary: The paper introduces PiPViT, an interpretable prototypical model using vision transformers to learn clinically relevant prototypes that approximate lesion extent from image-level labels, achieving competitive performance in retinal OCT classification while providing transparent diagnostic explanations.

Authors:Alexander Lobashev, Dmitry Guskov, Maria Larchenko, Mikhail Tamm
Title: Hessian Geometry of Latent Space in Generative Models
Abstract:
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at https://github.com/alobashev/hessian-geometry-of-diffusion-models.
Chinese: 本文提出了一种通过重构Fisher信息度量来分析生成模型潜在空间几何结构的新方法,揭示了扩散模型中相变的分形结构,并在热力学量重构方面优于现有基准。
English: This paper introduces a method to reconstruct the Fisher information metric for analyzing latent space geometry in generative models, revealing phase transitions and fractal structures in diffusion models while outperforming baselines in thermodynamic quantity reconstruction.

Authors:Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi
Title: Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code
Abstract:
This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps -- a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline's 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: https://github.com/masoudrahimi39/ECG-code.
本文提出了一种采用U-Net分割和自适应数字化技术的两阶段流程,能精准将纸质心电图转换为数字信号,相比基线方法在处理信号重叠情况时表现出显著优势。
This paper introduces a two-stage pipeline using U-Net segmentation and adaptive digitization to accurately convert paper ECG recordings into digital signals, significantly improving performance especially for overlapping signals compared to baseline methods.
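Illustrative sketch: the pipeline's second stage converts the segmented binary trace into a time series. A minimal version of that conversion is shown below, taking the per-column centroid of foreground pixels and scaling by calibration constants; the constants and the gap interpolation are assumptions, and the paper's adaptive grid detection is not reproduced.

```python
import numpy as np

def mask_to_signal(mask, mm_per_px=0.2, mm_per_mv=10.0, baseline_row=None):
    """Convert a binary ECG-trace mask (rows x cols) into a 1D voltage signal.
    For each time step (column), take the mean row of foreground pixels and
    map pixel offsets to millivolts using hypothetical grid calibration values."""
    h, w = mask.shape
    baseline = h / 2 if baseline_row is None else baseline_row
    signal = np.full(w, np.nan)
    for col in range(w):
        rows = np.flatnonzero(mask[:, col])
        if rows.size:                       # columns with no trace stay NaN
            signal[col] = (baseline - rows.mean()) * mm_per_px / mm_per_mv
    # fill short gaps by linear interpolation
    idx = np.arange(w)
    good = ~np.isnan(signal)
    return np.interp(idx, idx[good], signal[good])

# toy mask: a sinusoidal trace drawn on a 100x200 grid
toy = np.zeros((100, 200), dtype=bool)
toy[50 - (10 * np.sin(np.linspace(0, 6.28, 200))).astype(int), np.arange(200)] = True
print(mask_to_signal(toy)[:5])
```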

Authors:Suin Lee, Dae-Shik Kim
Title: TexTailor: Customized Text-aligned Texturing via Effective Resampling
Abstract:
We present TexTailor, a novel method for generating consistent object textures from textual descriptions. Existing text-to-texture synthesis approaches utilize depth-aware diffusion models to progressively generate images and synthesize textures across predefined multiple viewpoints. However, these approaches lead to a gradual shift in texture properties across viewpoints due to (1) insufficient integration of previously synthesized textures at each viewpoint during the diffusion process and (2) the autoregressive nature of the texture synthesis process. Moreover, the predefined selection of camera positions, which does not account for the object's geometry, limits the effective use of texture information synthesized from different viewpoints, ultimately degrading overall texture consistency. In TexTailor, we address these issues by (1) applying a resampling scheme that repeatedly integrates information from previously synthesized textures within the diffusion process, and (2) fine-tuning a depth-aware diffusion model on these resampled textures. During this process, we observed that using only a few training images restricts the model's original ability to generate high-fidelity images aligned with the conditioning, and therefore propose a performance preservation loss to mitigate this issue. Additionally, we improve the synthesis of view-consistent textures by adaptively adjusting camera positions based on the object's geometry. Experiments on a subset of the Objaverse dataset and the ShapeNet car dataset demonstrate that TexTailor outperforms state-of-the-art methods in synthesizing view-consistent textures. The source code for TexTailor is available at https://github.com/Adios42/Textailor
中文: TexTailor提出了一种新方法,通过引入重采样方案和自适应相机调整,从文本描述生成一致的对象纹理,在视角一致性方面优于现有技术。
English: TexTailor introduces a novel method to generate consistent object textures from text by integrating a resampling scheme and adaptive camera adjustments, outperforming existing approaches in view-consistent texture synthesis.

Authors:Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu
Title: MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Abstract:
Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
Chinese Summary: MSTAR提出了一种无需边界框的场景文本检索方法,通过动态捕捉多粒度文本表征并协调多样化查询,在消除标注成本的同时实现了卓越的检索性能。
English Summary: MSTAR introduces a box-free scene text retrieval method that dynamically captures multi-grained text representations and harmonizes diverse queries, achieving superior performance without costly bounding box annotations.

Authors:Xinyuan Liu, Hang Xu, Yike Ma, Yucheng Zhang, Feng Dai
Title: Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection
Abstract:
Recent advances in remote sensing technology have driven rapid growth in imagery and in oriented object detection, yet progress is hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming the SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.
Chinese: 提出的SSP框架通过结合规则驱动和数据驱动的方法,改进了密集遥感场景中的定向目标检测,实现了卓越的样本分配和边界框提取,以45.78% mAP达到最先进性能。
English: The proposed SSP framework enhances oriented object detection in dense remote sensing scenes by combining rule-driven and data-driven approaches for superior sample assignment and box extraction, achieving state-of-the-art performance with 45.78% mAP.

Authors:Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, Zhizhong Su
Title: EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence
Abstract:
Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low-cost accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the generalization and evaluation challenges of embodied intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.

Authors:Chengxu Zuo, Jiawei Huang, Xiao Jiang, Yuan Yao, Xiangren Shi, Rui Cao, Xinyu Yi, Feng Xu, Shihui Guo, Yipeng Qin
Title: Transformer IMU Calibrator: Dynamic On-body IMU Calibration for Inertial Motion Capture
Abstract:
In this paper, we propose a novel dynamic calibration method for sparse inertial motion capture systems, which is the first to break the restrictive absolute static assumption in IMU calibration, i.e., the coordinate drift R_G'G and measurement offset R_BS remain constant during the entire motion, thereby significantly expanding their application scenarios. Specifically, we achieve real-time estimation of R_G'G and R_BS under two relaxed assumptions: i) the matrices change negligibly in a short time window; ii) the human movements/IMU readings are diverse in such a time window. Intuitively, the first assumption reduces the number of candidate matrices, and the second assumption provides diverse constraints, which greatly reduces the solution space and allows for accurate estimation of R_G'G and R_BS from a short history of IMU readings in real time. To achieve this, we created synthetic datasets of paired R_G'G, R_BS matrices and IMU readings, and learned their mappings using a Transformer-based model. We also designed a calibration trigger based on the diversity of IMU readings to ensure that assumption ii) is met before applying our method. To our knowledge, we are the first to achieve implicit IMU calibration (i.e., seamlessly putting IMUs into use without the need for an explicit calibration process), as well as the first to enable long-term and accurate motion capture using sparse IMUs. The code and dataset are available at https://github.com/ZuoCX1996/TIC.
中文摘要:本文提出了一种创新的稀疏惯性动作捕捉系统动态校准方法,首次突破了IMU校准中绝对静态假设的限制,通过两个宽松假设实时估计坐标漂移和测量偏移,实现了无需显式校准流程的隐式校准及稀疏IMU的长期精准动作捕捉。
English Summary: This paper introduces a novel dynamic calibration method for sparse inertial motion capture systems, enabling real-time estimation of coordinate drift and measurement offset without requiring static assumptions, thereby achieving the first implicit calibration and long-term accurate motion capture with sparse IMUs.

Authors:Muskan Dosi, Chiranjeev Chiranjeev, Kartik Thakral, Mayank Vatsa, Richa Singh
Title: Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres
Abstract:
Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: https://github.com/IAB-IITJ/Harmonizing-Geometry-and-Uncertainty-Diffusion-with-Hyperspheres/
中文摘要:HyperSphereDiff 通过引入定向噪声使扩散模型与超球面数据几何对齐,有效保留角度特征并提升多个数据集的生成精度。
English Summary: HyperSphereDiff introduces directional noise to align diffusion models with hyperspherical data geometry, preserving angular patterns and improving generative accuracy across multiple datasets.
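Illustrative sketch: the paper aligns diffusion with hyperspherical geometry via directional noise. The snippet shows one simple way to perturb unit-norm features by angle only (tangent-space noise followed by re-projection onto the sphere); it is a plausible toy forward step under these assumptions, not the paper's actual formulation.

```python
import torch

def directional_noise_step(x, sigma):
    """One toy forward-noising step for unit-norm data: add Gaussian noise in
    the tangent space at x, then re-project onto the unit hypersphere so the
    perturbation acts on direction (angle) rather than magnitude."""
    x = torch.nn.functional.normalize(x, dim=-1)
    eps = torch.randn_like(x)
    # remove the radial component so the noise is tangent to the sphere at x
    eps = eps - (eps * x).sum(dim=-1, keepdim=True) * x
    noisy = x + sigma * eps
    return torch.nn.functional.normalize(noisy, dim=-1)

feats = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
out = directional_noise_step(feats, sigma=0.3)
print(out.norm(dim=-1))          # all ones: samples stay on the hypersphere
print((feats * out).sum(-1))     # cosine similarity to the clean directions
```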

Authors:Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wei, Zerong Zheng, Ming Zhou, Yuan Zhang, Mingyuan Gao
Title: DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Abstract:
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://lizhenwangt.github.io/DreamActor-H1/.

Authors:Yutong Zhou, Masahiro Ryo
Title: From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations
Abstract:
Explaining why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of a multimodal AI assistant, backed by recommended ecological modeling practice, for describing species habitat in human-understandable language. Our code is available at: https://github.com/Yutong-Zhou-cv/BioX.
中文: 本研究提出一个端到端的视觉因果框架,将物种图像转化为可解释的栖息地偏好因果分析,通过多模态人工智能与生态建模相结合,生成人类可读的解释。
English: This study introduces an end-to-end visual-to-causal framework that transforms species images into interpretable causal insights about habitat preferences, integrating multimodal AI with ecological modeling to generate human-readable explanations.

Authors:Jing He, Yiqing Wang, Lingling Li, Kexin Zhang, Puhua Chen
Title: ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025
Abstract:
This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval tasks. The approach is based on the dual-encoder AVION, on which we introduce a cross-modal attention flow module to achieve bidirectional dynamic interaction and refinement between visual and textual features to generate more context-aware joint representations. For soft-label relevance matrices provided in tasks such as EPIC-KITCHENS-100, CR-CLIP can work with Symmetric Multi-Similarity Loss to achieve more accurate semantic alignment and optimization using the refined features. Without using ensemble learning, the CR-CLIP model achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, which significantly outperforms the baseline model and fully validates its effectiveness in cross-modal retrieval. The code will be released open-source on https://github.com/delCayr/ContextRefine-Clip
Chinese: 本报告提出CR-CLIP模型,通过引入跨模态注意力流实现视觉与文本特征的双向动态交互优化,在EPIC-KITCHENS-100等任务中无需集成学习即显著超越基线,验证了其在跨模态检索中的高效性。
English: This report introduces CR-CLIP, an efficient model that enhances cross-modal retrieval by integrating a cross-modal attention flow for dynamic interaction between visual and textual features, achieving superior performance on benchmarks like EPIC-KITCHENS-100 without ensemble learning.

Authors:Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu
Title: CogStream: Context-guided Streaming Video Question Answering
Abstract:
Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. The project is released on https://github.com/LiamZhao326/CogStream.
中文摘要:本文提出了CogStream这一具有挑战性的流媒体视频推理任务,要求模型通过筛选相关历史上下文来高效回答问题,并开发了CogReasoner基准模型,利用视觉压缩和对话检索技术,在降低计算负担的同时提升了推理准确性。
English Summary: This paper introduces CogStream, a challenging task for streaming video reasoning that requires models to identify relevant historical context to answer questions efficiently, and presents a baseline model, CogReasoner, which uses visual compression and dialogue retrieval to reduce computational load while improving accuracy.

Authors:Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen
Title: Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
Abstract:
Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
中文摘要:提出的因果推理蒸馏器(CIDer)框架通过自蒸馏和因果推理模块,解决了多模态情感识别中模态缺失和分布外数据的双重挑战,以更少参数和更快训练实现了鲁棒性能。
English Summary: The proposed Causal Inference Distiller (CIDer) framework addresses simultaneous modality missing and Out-Of-Distribution challenges in Multimodal Emotion Recognition through self-distillation and causal inference modules, achieving robust performance with fewer parameters and faster training.

Authors:Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
Title: Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Abstract:
This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied with a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
中文: TempVS基准测试评估多模态大语言模型在图像序列中的时序推理能力,揭示了与人类能力间的显著差距,并为未来研究提供了方向性见解。
English: The TempVS benchmark evaluates Multimodal Large Language Models' temporal reasoning in image sequences, revealing significant performance gaps compared to humans while offering insights for future research.

Authors:Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang, Lei Bai
Title: ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion
Abstract:
Accurate reconstruction of the ocean is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the increasing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based SST reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) reaches 0.049 on guidance points, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at https://github.com/norsheep/ReconMOST.
中文: 本文提出ReconMOST,一种数据驱动的引导扩散模型,利用历史模拟数据和稀疏观测数据重建全球多层海水温度,能有效处理超过92.5%的数据缺失,同时保持高精度和鲁棒性。
English: This paper introduces ReconMOST, a data-driven guided diffusion model that reconstructs global, multi-layer sea temperatures by leveraging historical simulation data and sparse observational inputs, effectively handling over 92.5% missing data with high accuracy and robustness.
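Illustrative sketch: the abstract describes guiding the reverse diffusion with sparse in-situ observations. Below is a generic form of such guidance, blending the denoised estimate toward observed values at known grid points after each step; the denoiser here is a placeholder, and the blending rule is an assumption rather than ReconMOST's exact scheme.

```python
import torch

def guided_reverse_step(x_t, denoise_fn, t, obs, obs_mask, strength=1.0):
    """One reverse-diffusion step with observation guidance: run the ordinary
    denoiser, then pull the estimate toward sparse observations where they
    exist (obs_mask == 1). `denoise_fn` stands in for a trained model."""
    x_prev = denoise_fn(x_t, t)                       # unconditional update
    return torch.where(obs_mask.bool(),
                       (1 - strength) * x_prev + strength * obs,
                       x_prev)

# toy run on a 2D temperature field with ~92% of grid points unobserved
field = torch.randn(1, 1, 32, 32)
obs_mask = (torch.rand_like(field) < 0.08).float()
obs = torch.zeros_like(field)                         # pretend observed values
dummy_denoiser = lambda x, t: 0.9 * x                 # stand-in for the model
for t in reversed(range(10)):
    field = guided_reverse_step(field, dummy_denoiser, t, obs, obs_mask)
print(field.shape)
```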

Authors:Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin
Title: DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models
Abstract:
The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation. Code is available at https://github.com/HCPLab-SYSU/DART.
中文: DART提出了一种动态自适应区域分词器,通过生成内容感知的不同尺寸图像块,智能地将更高密度的令牌分配给信息丰富区域,从而在提升性能的同时显著加快推理速度,为下一代高效视觉模型奠定基础。
English: DART introduces a dynamic adaptive region tokenizer that creates content-aware patches of varying sizes, enabling more efficient and capable vision models by intelligently allocating token density to information-rich regions while improving both performance and inference speed.
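Illustrative sketch: DART combines learnable region scores with quantile-based partitioning. The 1D toy below shows the core idea, cutting at quantiles of the cumulative score so that each region carries an equal share of score mass and high-score areas receive more, narrower regions; the real tokenizer is 2D and fully differentiable, which this sketch does not attempt.

```python
import numpy as np

def quantile_partition(scores, num_regions):
    """Cut a 1D axis into variable-width regions so each region holds roughly
    an equal share of the total score mass: informative (high-score) areas get
    narrower regions, i.e. a higher token density."""
    cdf = np.cumsum(scores) / scores.sum()
    cuts = [0]
    for q in np.linspace(0, 1, num_regions + 1)[1:-1]:
        cuts.append(int(np.searchsorted(cdf, q)) + 1)
    cuts.append(len(scores))
    return list(zip(cuts[:-1], cuts[1:]))          # (start, end) per region

# toy importance profile: a flat background with one information-rich burst
scores = np.ones(64)
scores[20:30] = 10.0
for start, end in quantile_partition(scores, num_regions=8):
    print(f"region [{start:2d}, {end:2d})  width={end - start}")
```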

Authors:Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu
Title: FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion
Abstract:
Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications, to verify the excellent generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.
中文: 提出的FSATFusion网络通过引入频空注意力Transformer模块,克服了卷积神经网络在全局上下文捕捉上的局限性,在红外与可见光图像融合任务中展现出卓越的融合性能和泛化能力。
English: The proposed FSATFusion network overcomes the limitations of convolutional neural networks in capturing global context by incorporating a frequency-spatial attention Transformer module, demonstrating superior fusion performance and generalization capability in infrared and visible image fusion tasks.

Authors:Zhanwei Zhang, Kaiyuan Liu, Junjie Liu, Wenxiao Wang, Binbin Lin, Liang Xie, Chen Shen, Deng Cai
Title: GeoCAD: Local Geometry-Controllable CAD Generation
Abstract:
Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption ~221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at https://github.com/Zhanwei-Z/GeoCAD.
Chinese: GeoCAD提出了一种用户友好的局部几何可控CAD生成方法,通过互补标注策略和大语言模型,依据几何指令修改CAD模型的特定局部部件,有效解决了以往方法在文本引导和局部聚焦方面的不足。
English: GeoCAD introduces a user-friendly method for local geometry-controllable CAD generation that uses complementary captioning and LLMs to modify specific parts of CAD models according to geometric instructions, overcoming previous limitations in text guidance and local focus.

Authors:Cheng Wang, Siqi Chen, Donghua Mi, Yang Chen, Yudong Zhang, Yinsheng Li
Title: SWDL: Stratum-Wise Difference Learning with Deep Laplacian Pyramid for Semi-Supervised 3D Intracranial Hemorrhage Segmentation
Abstract:
Recent advances in medical imaging have established deep learning-based segmentation as the predominant approach, though it typically requires large amounts of manually annotated data. However, obtaining annotations for intracranial hemorrhage (ICH) remains particularly challenging due to the tedious and costly labeling process. Semi-supervised learning (SSL) has emerged as a promising solution to address the scarcity of labeled data, especially in volumetric medical image segmentation. Unlike conventional SSL methods that primarily focus on high-confidence pseudo-labels or consistency regularization, we propose SWDL-Net, a novel SSL framework that exploits the complementary advantages of Laplacian pyramid and deep convolutional upsampling. The Laplacian pyramid excels at edge sharpening, while deep convolutions enhance detail precision through flexible feature mapping. Our framework achieves superior segmentation of lesion details and boundaries through a difference learning mechanism that effectively integrates these complementary approaches. Extensive experiments on a 271-case ICH dataset and public benchmarks demonstrate that SWDL-Net outperforms current state-of-the-art methods in scenarios with only 2% labeled data. Additional evaluations on the publicly available Brain Hemorrhage Segmentation Dataset (BHSD) with 5% labeled data further confirm the superiority of our approach. Code and data have been released at https://github.com/SIAT-CT-LAB/SWDL.
中文摘要:SWDL-Net是一种新型半监督学习框架,通过结合拉普拉斯金字塔边缘锐化和深度卷积上采样,在仅使用少量标注数据的情况下实现了优越的颅内出血分割效果。
English Summary: SWDL-Net is a novel semi-supervised learning framework that combines Laplacian pyramid edge sharpening with deep convolutional upsampling to achieve superior intracranial hemorrhage segmentation using minimal labeled data.
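Illustrative sketch: SWDL-Net pairs a Laplacian pyramid branch (edge sharpening) with deep convolutional upsampling. The snippet builds a basic Laplacian pyramid in PyTorch to show what that branch decomposes, band-pass detail levels plus a low-frequency residual; this is standard pyramid construction, not the paper's network or its difference-learning mechanism.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels=3):
    """Decompose a (B, C, H, W) tensor into band-pass (edge/detail) levels plus
    a low-resolution residual; summing upsampled levels reconstructs the input."""
    pyramid, current = [], x
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)   # high-frequency detail at this scale
        current = down
    pyramid.append(current)            # coarsest low-frequency residual
    return pyramid

def reconstruct(pyramid):
    out = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        out = F.interpolate(out, size=detail.shape[-2:], mode="bilinear",
                            align_corners=False) + detail
    return out

vol_slice = torch.randn(1, 1, 64, 64)
pyr = laplacian_pyramid(vol_slice)
print(torch.allclose(reconstruct(pyr), vol_slice, atol=1e-5))  # True
```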

Authors:Eunkyu Park, Minyeong Kim, Gunhee Kim
Title: HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Abstract:
Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.
Chinese: 为解决视觉语言模型中的幻觉问题,研究者提出了HalLoc数据集和基线模型,通过概率性检测和低开销集成,有效提升模型在实际应用中的可靠性。
English: Hallucinations in vision-language models undermine reliability, so the authors introduce HalLoc, a dataset and baseline model for efficient, probabilistic detection that integrates seamlessly to enhance trustworthiness in real-world applications.

Authors:Emerson P. Grabke, Masoom A. Haider, Babak Taati
Title: Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation
Abstract:
Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, non-medical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM framework centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.071. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74%. Training a classifier solely on our method's synthetic images achieved comparable performance to training on real images alone. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric framework enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility. Code from this study will be available at https://github.com/grabkeem/CCELLA
中文摘要:本研究提出CCELLA这一新型双头调节方法,通过结合自由文本临床报告和放射学分类来增强医学影像的潜在扩散模型,在有限数据和最少人工标注条件下显著提升了合成图像质量及下游分类器性能。
English Summary: The study introduces CCELLA, a novel dual-head conditioning approach that enhances latent diffusion models for medical imaging by utilizing free-text clinical reports and radiology classifications, significantly improving synthetic image quality and downstream classifier performance with limited data and minimal human annotation.

Authors:Parsa Rahimi, Sebastien Marcel
Title: ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators
Abstract:
In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator's embedding space, rather than close in the generator's condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator's learned condition space and the discriminator's embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding the collection of large datasets. Paper website: https://parsa-ra.github.io/scoremix
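Illustrative sketch: the core operation is a convex combination of class-conditioned scores inside the sampling loop. The toy below mixes two scores in a Langevin-style step, with a dummy score function standing in for a trained diffusion model; the step size and schedule are arbitrary choices, not ScoreMix's training setup.

```python
import torch

def scoremix_step(x, t, score_fn, class_a, class_b, lam=0.5, step_size=0.05):
    """One sampling step where the driving score is a convex combination of
    two class-conditioned scores, producing samples 'between' the classes."""
    s_mix = lam * score_fn(x, t, class_a) + (1 - lam) * score_fn(x, t, class_b)
    noise = torch.randn_like(x)
    return x + step_size * s_mix + (2 * step_size) ** 0.5 * noise

# dummy class-conditioned score: pulls samples toward a per-class mean
means = {0: torch.full((2,), -2.0), 1: torch.full((2,), 2.0)}
score_fn = lambda x, t, c: means[c] - x

x = torch.randn(16, 2)
for t in reversed(range(50)):
    x = scoremix_step(x, t, score_fn, class_a=0, class_b=1, lam=0.7)
print(x.mean(0))   # drifts toward a point between the two class means
```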

Authors:Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M. Rehg, Bryan Russell
Title: Improving Personalized Search with Regularized Low-Rank Parameter Updates
Abstract:
Personalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaption of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.
Chinese: 本文提出了一种个性化视觉语言检索方法,通过对语言编码器进行正则化低秩参数调整来适应双编码器模型,有效识别个人概念同时保留通用知识,并在基准测试中取得了最先进的性能。
English: This paper introduces a method for personalized vision-language retrieval by adapting a dual encoder model through regularized low-rank parameter adjustments in the language encoder, effectively recognizing personal concepts while preserving general knowledge and achieving state-of-the-art results on benchmarks.
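The core mechanism lends itself to a short sketch: a rank-r delta B @ A on a frozen final-layer weight, with an L2 penalty on the delta standing in as an assumed regularizer (the paper's exact regularization and layer choice may differ). Combining learned concepts then amounts to adding their deltas to the same base weight.

```python
# Sketch of a regularized low-rank update on a frozen weight matrix; the
# drift penalty is an assumed stand-in for the paper's regularizer.
import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    def __init__(self, w0: torch.Tensor, rank: int = 4):
        super().__init__()
        self.register_buffer("w0", w0)                    # frozen pretrained weight
        d_out, d_in = w0.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))   # delta starts at zero

    def weight(self) -> torch.Tensor:
        return self.w0 + self.B @ self.A

    def reg(self) -> torch.Tensor:
        # Penalize drift from w0 to help preserve general knowledge.
        return (self.B @ self.A).pow(2).sum()

delta = LowRankDelta(torch.randn(512, 512))
# Parameter addition for multiple personal concepts, which the abstract
# reports is effective: w_combined = w0 + delta_concept1 + delta_concept2.
```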

Authors:Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias
Title: Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
Abstract:
As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
中文摘要:本研究提出了高效探测(EP)方法,通过减少冗余参数实现参数高效的注意力探测,在多个基准测试中优于线性探测及现有方法,并揭示了其新兴特性以拓展应用前景。
English Summary: This study introduces Efficient Probing (EP), a parameter-efficient attentive probing method that outperforms linear probing and existing approaches across multiple benchmarks while offering insights into its emergent properties for broader applications.
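A stripped-down reading of multi-query cross-attention probing: a handful of learned queries attend over frozen patch tokens and the pooled result feeds a linear head. The deliberately minimal projection layout below is illustrative; EP's exact parameterization is in the linked repository.

```python
# Minimal multi-query cross-attention probe over frozen patch tokens.
# The absence of key/value projections here is illustrative, not EP's design.
import torch
import torch.nn as nn

class MultiQueryProbe(nn.Module):
    def __init__(self, dim: int, n_queries: int, n_classes: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * dim ** -0.5)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, tokens):                            # tokens: (B, N, dim)
        attn = self.queries @ tokens.transpose(1, 2)      # (B, Q, N) logits
        attn = (attn / tokens.shape[-1] ** 0.5).softmax(dim=-1)
        pooled = (attn @ tokens).mean(dim=1)              # pool patches, then queries
        return self.head(pooled)

probe = MultiQueryProbe(dim=768, n_queries=4, n_classes=1000)
logits = probe(torch.randn(8, 196, 768))                  # e.g. ViT-B/16 tokens
```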

Authors:Yael Frischholz, Devis Tuia, Michael Lehning
Title: Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context
Abstract:
Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont's SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided with a sufficiently long temporal context, the model matches the performance of albedo-informed models, highlighting the model's ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at https://github.com/frischwood/HeMu-dev.git
中文: 该研究提出的基于注意力的模拟器通过从卫星图像序列中学习晴空反射率动态,有效反演地表太阳辐射,无需人工特征且在复杂山区表现尤为突出。
English: The proposed attention-based emulator effectively retrieves surface solar radiation by learning clear-sky reflectance dynamics from satellite imagery, eliminating manual features and excelling particularly in complex mountainous terrain.

Authors:Mohammad Jalali, Haoyu Lei, Amin Gohari, Farzan Farnia
Title: SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score
Abstract:
Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: https://mjalali.github.io/SPARKE
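A toy illustration of why kernel-entropy diversity scores can sidestep cubic cost: with random Fourier features, the nonzero eigenvalues of the n x n kernel matrix K/n equal those of a small D x D feature covariance, so an order-2 kernel entropy needs no eigendecomposition of K. This is a proxy sketch only, not SPARKE's conditional latent RKE guidance.

```python
# Order-2 kernel-entropy diversity proxy via random Fourier features:
# -log ||K/n||_F^2 computed from the D x D feature covariance instead of
# the n x n kernel matrix. Bandwidth and feature count are illustrative.
import numpy as np

def rff(x, W, b):                                  # Gaussian-kernel features
    return np.sqrt(2.0 / W.shape[1]) * np.cos(x @ W + b)

def order2_kernel_entropy(x, D=256, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(x.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    phi = rff(x, W, b)                             # (n, D)
    C = phi.T @ phi / len(x)                       # D x D covariance
    return -np.log((C ** 2).sum())                 # higher = more diverse

tight = np.random.randn(500, 8) * 0.05             # near-duplicates: low entropy
spread = np.random.randn(500, 8)                   # diverse set: high entropy
assert order2_kernel_entropy(tight) < order2_kernel_entropy(spread)
```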

Authors:Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai, Shizheng Wang, Yang Long, Yefeng Zheng
Title: Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective
Abstract:
Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48\% (ranging from 2.39\% to 7.72\%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: https://github.com/VinyehShaw/HFF.
中文: 提出的谐波频率融合网络(HFF-Net)通过频域分析磁共振图像来解决脑肿瘤分割难题,在保持计算效率的同时显著提升了对比增强区域的分割精度。
English: The proposed Harmonized Frequency Fusion Network (HFF-Net) addresses brain tumor segmentation challenges by analyzing MRI images in the frequency domain, achieving significant improvements in accuracy for contrast-enhancing regions while maintaining computational efficiency.
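A minimal frequency-domain decomposition in the spirit of the FDD module: an FFT radial mask splits a slice into a low-frequency component (smooth tumor contours) and a complementary high-frequency residual (textures and edges). The cutoff radius is an arbitrary illustrative choice, not HFF-Net's setting.

```python
# FFT-based low/high frequency split of a 2D slice; the cutoff is illustrative.
import numpy as np

def fdd(image: np.ndarray, cutoff: float = 0.1):
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    low_mask = (yy / h) ** 2 + (xx / w) ** 2 <= cutoff ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real  # smooth contours
    return low, image - low                                  # residual: edges, texture

low, high = fdd(np.random.rand(240, 240))
```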

Authors:Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue
Title: LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Abstract:
Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our key innovation is using a spatiotemporal mask to strategically guide the LoRA fine-tuning process. This teaches the model two distinct skills: first, to interpret the mask as a command to either preserve content from the source video or generate new content in designated regions. Second, for these generated regions, LoRA learns to synthesize either temporally consistent motion inherited from the video or novel appearances guided by user-provided reference frames. This dual-capability LoRA grants users control over the edit's entire temporal evolution, allowing complex transformations like an object rotating or a flower blooming. Experimental results show our method achieves superior video editing performance compared to baseline methods. Project Page: https://cjeen.github.io/LoRAEdit

Authors:Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Title: Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs
Abstract:
Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
中文摘要:Omni-DPO是一种新颖的双视角优化框架,通过基于数据固有质量和模型学习动态的自适应加权机制改进直接偏好优化方法,在多个基准测试中实现了卓越性能。
English Summary: Omni-DPO is a novel dual-perspective optimization framework that enhances Direct Preference Optimization by adaptively weighting preference pairs based on both inherent data quality and the model's learning dynamics, achieving superior performance across various benchmarks.
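A hedged sketch of dual-perspective weighting on top of the standard DPO objective: a fixed per-pair data-quality score multiplied by a focal-style dynamic weight that down-weights pairs the model already fits. Both weighting functions below are illustrative stand-ins; Omni-DPO's exact forms live in the repository above.

```python
# Weighted DPO loss: quality (data perspective) x dynamic (model perspective)
# x the vanilla DPO term. The weighting functions are assumptions.
import torch
import torch.nn.functional as F

def weighted_dpo_loss(pi_w, pi_l, ref_w, ref_l, quality, beta=0.1, gamma=2.0):
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))        # implicit reward margin
    dynamic = (1.0 - torch.sigmoid(margin)).detach() ** gamma  # focus on hard pairs
    return (quality * dynamic * (-F.logsigmoid(margin))).mean()

loss = weighted_dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4),
                         torch.randn(4), quality=torch.rand(4))
```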

Authors:Kangwei Liu, Junwu Liu, Xiaowei Yi, Jinlin Guo, Yun Cao
Title: Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space
Abstract:
Audio-driven emotional 3D facial animation encounters two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mapping that constrains the stochastic nature of emotional expressions and non-verbal behaviors, limiting the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6\% improvement in emotion similarity while preserving physiologically plausible facial dynamics. Project Page: https://kangweiiliu.github.io/Control_3D_Animation.

Authors:Chieh Hubert Lin, Zhaoyang Lv, Songyin Wu, Zhen Xu, Thu Nguyen-Phuoc, Hung-Yu Tseng, Julian Straub, Numair Khan, Lei Xiao, Ming-Hsuan Yang, Yuheng Ren, Richard Newcombe, Zhao Dong, Zhengqin Li
Title: DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos
Abstract:
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.

Authors:Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
Title: Text-Aware Image Restoration with Diffusion Models
Abstract:
Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/

Authors:Yiming Dou, Wonseok Oh, Yuqing Luo, Antonio Loquercio, Andrew Owens
Title: Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes
Abstract:
We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: https://www.yimingdou.com/hearing_hands/
中文摘要:本研究通过训练校正流模型学习动作-声音配对数据,开发出能够根据手部运动轨迹生成逼真物体交互声音的系统,其合成效果在材质传达和听觉真实性方面接近真实录音。
English Summary: This research develops a method to generate realistic sounds of hand-object interactions in 3D scenes by training a rectified flow model on action-sound pairs, enabling users to predict audio for various hand movements with high perceptual accuracy.
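The rectified-flow objective used here reduces to a compact training step, shown below under the assumption of a network `v_model(x_t, t, traj)` that predicts velocity from a noisy audio representation and a hand-trajectory embedding (both hypothetical names).

```python
# Minimal rectified-flow training loss: interpolate noise -> audio on a
# straight line and regress the constant velocity (audio - noise).
import torch

def rectified_flow_loss(v_model, audio, traj):
    noise = torch.randn_like(audio)                     # x0 ~ N(0, I)
    t = torch.rand(audio.shape[0], device=audio.device)
    tt = t.view(-1, *([1] * (audio.dim() - 1)))
    x_t = (1 - tt) * noise + tt * audio                 # straight-line path
    target_v = audio - noise                            # constant velocity target
    return (v_model(x_t, t, traj) - target_v).pow(2).mean()
```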

Authors:Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin
Title: InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
Abstract:
End-to-end human animation with rich multi-modal conditions, e.g., text, image and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios where multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject the local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.

Authors:Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai
Title: AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
Abstract:
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.
中文: AnimateAnyMesh提出了首个前馈式框架,通过创新的DyMeshVAE架构和Rectified Flow训练策略,能够根据文本输入高效生成任意三维网格的高质量动画,在数秒内实现语义准确且时序连贯的动态效果。
English: AnimateAnyMesh introduces the first feed-forward framework for efficiently animating any 3D mesh from text input, using a novel DyMeshVAE architecture and Rectified Flow training to generate high-quality, semantically accurate animations in seconds.

Authors:Jiaxiang Tang, Ruijie Lu, Zhaoshuo Li, Zekun Hao, Xuan Li, Fangyin Wei, Shuran Song, Gang Zeng, Ming-Yu Liu, Tsung-Yi Lin
Title: Efficient Part-level 3D Object Generation via Dual Volume Packing
Abstract:
Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.
中文: 本文提出了一种端到端的部件级三维物体生成框架,通过双体积打包策略从单张图像生成具有语义意义且可交错组合的完整部件,在质量和多样性上超越了现有方法。
English: This paper introduces an end-to-end framework for part-level 3D object generation that produces high-quality objects with semantically meaningful, interleaved parts from a single image using a dual volume packing strategy, outperforming previous methods in quality and flexibility.

Authors:Sushant Gautam, Michael A. Riegler, Pål Halvorsen
Title: Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy
Abstract:
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: https://github.com/Simula/Kvasir-VQA-x1 and https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1
中文: Kvasir-VQA-x1是一个大规模胃肠道内窥镜数据集,通过15.9万个问题-答案对和模拟影像伪影的视觉增强,提升了临床复杂性和视觉多样性,旨在推动临床多模态AI系统的发展。
English: Kvasir-VQA-x1 is a large-scale gastrointestinal endoscopy dataset that expands clinical complexity and visual diversity through 159,549 question-answer pairs and augmented imaging artifacts, aiming to advance multimodal AI for clinical decision support.

Authors:Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
Title: Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
Abstract:
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.
中文摘要:本研究提出了一个视觉对话数据集,要求模型识别相关视频片段并利用外部知识回答视觉信息中未包含的问题,同时通过基线评估揭示了该任务未来的挑战。
English Summary: This study introduces a dataset for visually grounded dialogue tasks requiring models to identify relevant video segments and incorporate external knowledge to answer questions not present in the visual content, with baseline evaluations highlighting future challenges.

Authors:Ziyi Wang, Yanran Zhang, Jie Zhou, Jiwen Lu
Title: UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting
Abstract:
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at https://github.com/wangzy22/UniPre3D.
中文: UniPre3D首次提出适用于任意尺度和架构点云数据的统一预训练方法,通过高斯基元预测、可微分渲染结合二维特征融合,在物体和场景级任务中均展现出普适有效性。
English: UniPre3D introduces the first unified pre-training method for point clouds of any scale and architecture, using Gaussian primitives and differentiable rendering with 2D feature integration to achieve universal effectiveness across object- and scene-level tasks.

Authors:Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng
Title: From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
Abstract:
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/

Authors:Jianhan Qi, Yuheng Jia, Hui Liu, Junhui Hou
Title: Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering
Abstract:
Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best competing methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.
中文: 本研究提出结构-光谱图卷积算子和证据引导的自适应边学习模块,通过优化光谱信息利用和图结构,显著提升了高光谱图像聚类的准确性,在多个数据集上实现了明显的精度提升。
English: This study introduces a structural-spectral graph convolutional operator and an evidence-guided adaptive edge learning module to enhance hyperspectral image clustering by better utilizing spectral information and refining graph structures, achieving significant accuracy improvements across multiple datasets.

Authors:Haoru Wang, Kai Ye, Yangyan Li, Wenzheng Chen, Baoquan Chen
Title: The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge
Abstract:
We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations, such as NeRF or 3DGS, into network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses of input views, critical questions regarding the role of 3D knowledge and the necessity of circumventing its use remain under-explored. In this work, we conduct a systematic analysis of 3D knowledge and uncover a critical trend: the performance of methods that require less 3D knowledge improves faster as data scales, eventually reaching parity with their 3D knowledge-driven counterparts, which highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving performance comparable to methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: https://pku-vcl-geometry.github.io/Less3Depend/

Authors:Siyu Chen, Ting Han, Chengzheng Fu, Changshe Zhang, Chaolei Wang, Jinhe Su, Guorong Cai, Meiliu Wu
Title: Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation
Abstract:
Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation of these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo.
中文: Vireo是一个创新的单阶段框架,首次将开放词汇语义分割与领域泛化相结合,通过冻结的视觉基础模型和深度几何特征实现了在未知类别和领域中领先的鲁棒性表现。
English: Vireo is a novel single-stage framework that unifies open-vocabulary semantic segmentation with domain generalization, leveraging frozen visual foundation models and depth-based geometry to achieve state-of-the-art robustness across unseen categories and domains.

Authors:Panagiotis Kaliosis, John Pavlopoulos
Title: Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition
Abstract:
Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at https://github.com/pkaliosis/fada.
中文摘要:本文提出了一种利用Wasserstein距离对齐字符频率分布的新型损失函数,无需重新训练模型即可提升手写文本识别的准确性和鲁棒性,有效应对时间和上下文变化。
English Summary: This paper introduces a novel loss function using Wasserstein distance to align character frequency distributions, improving handwritten text recognition accuracy and robustness against temporal and contextual shifts without requiring model retraining.
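The alignment penalty admits a compact sketch: a 1-D Wasserstein distance between the predicted text's character histogram and a target histogram. Treating the fixed character vocabulary as ordered 1-D bins, W1 reduces to the L1 distance between cumulative distributions; the paper's exact formulation may differ in detail.

```python
# 1-D Wasserstein penalty between character frequency distributions,
# computed as the L1 distance between CDFs over a fixed, ordered vocabulary.
import numpy as np

def char_freq(text: str, vocab: str) -> np.ndarray:
    counts = np.array([text.count(c) for c in vocab], dtype=float)
    return counts / max(counts.sum(), 1.0)

def wasserstein_1d(p: np.ndarray, q: np.ndarray) -> float:
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

vocab = "abcdefghijklmnopqrstuvwxyz "
target = char_freq("the quick brown fox jumps over the lazy dog", vocab)
penalty = wasserstein_1d(char_freq("teh quikc brwn fx", vocab), target)
```

At inference time the same quantity can serve as the scoring function in the guided decoding scheme the abstract describes, with no retraining.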

Authors:Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Abstract:
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97\% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
中文:ComfyUI-R1是首个通过两阶段训练框架实现自动化工作流生成的大型推理模型,在格式有效性和结构准确性方面显著超越了当前最先进的闭源模型。
English: ComfyUI-R1 is the first large reasoning model that automates AI workflow generation through a two-stage training framework, achieving superior performance in format validity and structural accuracy compared to leading closed-source models.

Authors:Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong, Lichao Sun, Weiran Huang
Title: Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning
Abstract:
Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, which can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.
中文摘要:当前多模态大语言模型在推理过程中常未能有效整合视觉信息,为此我们提出了一个简单的视觉扰动框架,无需算法修改即可增强感知鲁棒性,显著提升数学推理性能。
English Summary: Current multimodal large language models often fail to effectively integrate visual information during reasoning, prompting the development of a simple visual perturbation framework that enhances perceptual robustness and improves mathematical reasoning performance without requiring algorithmic modifications.
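Minimal versions of the three named perturbations are easy to sketch on (H, W, C) float images; the parameter values below (rotation range, mixup weight) are illustrative, not the paper's settings.

```python
# Toy implementations of the three perturbations named in the abstract.
import numpy as np
from scipy.ndimage import rotate

def distractor_concat(img, distractor):
    return np.concatenate([img, distractor], axis=1)     # paste a distractor beside

def dominance_preserving_mixup(img, other, lam=0.8):
    assert lam > 0.5, "the original image stays dominant"
    return lam * img + (1.0 - lam) * other

def random_rotation(img, max_deg=15, rng=np.random.default_rng()):
    return rotate(img, rng.uniform(-max_deg, max_deg), reshape=False, mode="nearest")
```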

Authors:Ye Zhang, Yu Zhou, Yifeng Wang, Jun Xiao, Ziyue Wang, Yongbing Zhang, Jianxu Chen
Title: The Four Color Theorem for Cell Instance Segmentation
Abstract:
Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To address the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and an encoding transformation method. Extensive experiments on various modalities demonstrate that our approach achieves state-of-the-art performance. The code is available at https://github.com/zhangye-zoe/FCIS.
中文: 本文提出了一种基于四色定理的新型细胞实例分割方法,将任务转化为简化的四类语义分割问题,并通过渐进式训练策略实现了最先进的性能。
English: This paper introduces a novel cell instance segmentation method based on the four-color theorem, transforming the task into a simplified four-class semantic segmentation problem and achieving state-of-the-art performance through an asymptotic training strategy.
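A toy version of the encoding step follows: given a cell adjacency graph, assign each cell one of four labels so that touching cells differ, here via simple backtracking (the four color theorem guarantees a solution for planar adjacency). In the full method, background tissue is its own fixed class, and the non-uniqueness of the coloring is what the asymptotic training strategy addresses.

```python
# Four-color encoding of a cell adjacency graph via backtracking.
def four_color(adj):
    order = sorted(adj, key=lambda n: -len(adj[n]))       # most constrained first
    colors = {}

    def assign(i):
        if i == len(order):
            return True
        node = order[i]
        for c in range(4):
            if all(colors.get(nb) != c for nb in adj[node]):
                colors[node] = c
                if assign(i + 1):
                    return True
                del colors[node]
        return False

    assert assign(0), "planar cell adjacency is always 4-colorable"
    return colors

# Three mutually touching cells need three distinct labels:
print(four_color({0: {1, 2}, 1: {0, 2}, 2: {0, 1}}))
```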

Authors:Xulin Ma, Jiankai Tang, Zhang Jiang, Songqin Cheng, Yuanchun Shi, Dong LI, Xin Liu, Daniel McDuff, Xiaojing Liu, Yuntao Wang
Title: Non-Contact Health Monitoring During Daily Personal Care Routines
Abstract:
Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.
中文摘要:LADH数据集通过融合RGB与红外面部视频,结合多任务学习,在长期高原日常护理场景中提升了非接触式生理监测的鲁棒性,实现了心率估计4.99 BPM的平均绝对误差。
English Summary: The LADH dataset, featuring synchronized RGB and infrared facial videos from daily care scenarios, enhances remote physiological monitoring by combining multi-modal inputs and multi-task learning to achieve robust heart rate estimation with a 4.99 BPM MAE.

Authors:Changwei Wu, Yifei Chen, Yuxin Du, Jinying Zong, Jie Dong, Mingxuan Liu, Yong Peng, Jin Fan, Feiwei Qin, Changmiao Wang
Title: Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model
Abstract:
Early diagnosis of Alzheimer's Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at https://github.com/wuchangw/FasterSNN.
中文:提出的FasterSNN架构整合了LIF神经元、区域自适应卷积和多尺度脉冲注意力机制,能够利用3D核磁共振数据实现高效稳定的阿尔茨海默症筛查,在保持诊断准确性的同时显著提升了计算效率。
English: The proposed FasterSNN architecture combines LIF neurons with region-adaptive convolution and multi-scale attention to enable efficient and stable Alzheimer's Disease screening using 3D MRI data, achieving competitive performance with improved computational efficiency.
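The building block the abstract refers to reduces to a few lines: the membrane potential leaks, integrates input current, and emits a binary spike with a hard reset at threshold. (Training SNNs additionally requires surrogate gradients for the non-differentiable spike; FasterSNN's region-adaptive convolution and spiking attention sit on top of this primitive.)

```python
# Minimal leaky integrate-and-fire (LIF) neuron update.
import torch

def lif_step(v, x, decay=0.9, v_th=1.0):
    v = decay * v + x                 # leak + integrate input current
    spike = (v >= v_th).float()       # fire when crossing the threshold
    v = v * (1.0 - spike)             # hard reset for spiking neurons
    return spike, v

v = torch.zeros(8)
for _ in range(5):                    # constant drive makes neurons fire periodically
    spike, v = lif_step(v, torch.full((8,), 0.4))
```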

Authors:Maik Dannecker, Vasiliki Sideri-Lampretsa, Sophie Starck, Angeline Mihailov, Mathieu Milh, Nadine Girard, Guillaume Auzias, Daniel Rueckert
Title: CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain
Abstract:
Magnetic resonance imaging of fetal and neonatal brains reveals rapid neurodevelopment marked by substantial anatomical changes unfolding within days. Studying this critical stage of the developing human brain, therefore, requires accurate brain models, referred to as atlases, of high spatial and temporal resolution. To meet these demands, established traditional atlases and recently proposed deep learning-based methods rely on large and comprehensive datasets. This poses a major challenge for studying brains in the presence of pathologies for which data remains scarce. We address this limitation with CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for creating high-resolution, spatio-temporal, multimodal brain atlases, suitable for low-data settings. Unlike established methods, CINeMA operates in latent space, avoiding compute-intensive image registration and reducing atlas construction times from days to minutes. Furthermore, it enables flexible conditioning on anatomical features including gestational age (GA), birth age, and pathologies like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA supports downstream tasks such as tissue segmentation and age prediction, while its generative properties enable synthetic data creation and anatomically informed data augmentation. Surpassing state-of-the-art methods in accuracy, efficiency, and versatility, CINeMA represents a powerful tool for advancing brain research. We release the code and atlases at https://github.com/m-dannecker/CINeMA.
Chinese: CINeMA是一种新型框架,通过在潜在空间操作,能在低数据环境下快速构建高分辨率多模态脑图谱,支持对解剖特征的灵活条件设置,并实现分割和数据增强等下游任务。
English: CINeMA is a novel framework that rapidly constructs high-resolution, multimodal brain atlases in low-data settings by operating in latent space, enabling flexible conditioning on anatomical features and supporting downstream tasks like segmentation and data augmentation.

Authors:Jacob Munkberg, Zian Wang, Ruofan Liang, Tianchang Shen, Jon Hasselgren
Title: VideoMat: Extracting PBR Materials from Video Diffusion Models
Abstract:
We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high quality materials for 3D models given a text prompt or a single image. We first condition a video diffusion model to respect the input geometry and lighting conditions, producing multiple views of a given 3D model with coherent material properties. Next, we use a recent model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use the intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools.

Authors:Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen
Title: HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios
Abstract:
Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions across 33 hours of video, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The code is available at https://github.com/KPeng9510/HopaDIFF.git.
中文: 本文开创性地提出文本参照引导的多人物动作分割方法,构建了首个RHAS133数据集,并设计了HopaDIFF框架——通过交叉输入门控注意力机制和傅里叶条件调控实现整体-局部感知的扩散模型,在多样化评估设置中取得了最优性能。
English: This paper introduces a novel textual reference-guided approach for multi-person action segmentation, presenting the RHAS133 dataset and proposing HopaDIFF—a holistic-partial aware diffusion framework that achieves state-of-the-art performance by enhancing long-range reasoning and fine-grained control through cross-input gate attention and Fourier conditioning.

Authors:Yanzhao Shi, Xiaodan Zhang, Junzhong Ji, Haoning Jiang, Chengxin Zheng, Yinong Wang, Liangqiong Qu
Title: HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding
Abstract:
Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to LLM's semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (R@100 of 39.85%, a +5.96% gain), 3D medical report generation (BLEU-4 of 24.01%, a +8.01% gain), and 3D visual question answering (Major Class Accuracy of 73.60%, a +1.99% gain), confirming its effectiveness. Our code is available at https://github.com/YanzhaoShi/HSENet.
中文: HSENet提出了一种混合空间编码框架,通过双3D视觉编码器和基于质心的压缩技术提升对三维CT影像的多模态理解能力,在医学视觉语言任务中实现了最优性能。
English: HSENet introduces a hybrid spatial encoding framework that leverages dual-3D vision encoders and centroid-based compression to enhance multimodal understanding of 3D CT scans, achieving state-of-the-art performance in medical vision-language tasks.
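A rough illustration of the centroid-based compression idea described in the abstract: the sketch below condenses a large set of 3D region feature vectors into a small number of tokens by k-means-style clustering followed by per-cluster averaging. It is a generic stand-in rather than the actual Spatial Packer, and the feature shape, token count, and iteration budget are made-up assumptions.

```python
import torch

def centroid_pack(features, num_tokens=32, iters=10):
    """Condense (N, D) region features into (num_tokens, D) compact tokens
    via simple k-means clustering and per-cluster averaging."""
    idx = torch.randperm(features.shape[0])[:num_tokens]
    centroids = features[idx].clone()
    for _ in range(iters):
        assign = torch.cdist(features, centroids).argmin(dim=1)  # (N,)
        for k in range(num_tokens):
            members = features[assign == k]
            if members.numel():
                centroids[k] = members.mean(dim=0)
    return centroids

# Toy usage: 24^3 voxel features of dimension 256 packed into 32 visual tokens.
tokens = centroid_pack(torch.randn(24 * 24 * 24, 256), num_tokens=32)
print(tokens.shape)  # torch.Size([32, 256])
```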

Authors:Giacomo Rosin, Muhammad Rameez Ur Rahman, Sebastiano Vascon
Title: ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting
Abstract:
Human trajectory forecasting is crucial in applications such as autonomous driving, robotics and surveillance. Accurate forecasting requires models to consider various factors, including social interactions, multi-modal predictions, pedestrian intention and environmental context. While existing methods account for these factors, they often overlook the impact of the environment, which leads to collisions with obstacles. This paper introduces ECAM (Environmental Collision Avoidance Module), a contrastive learning-based module to enhance collision avoidance ability with the environment. The proposed module can be integrated into existing trajectory forecasting models, improving their ability to generate collision-free predictions. We evaluate our method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate its collision avoidance capabilities. Our experiments show that state-of-the-art methods significantly reduce the collision rate (by 40-50%) when integrated with the proposed module. The code is available at https://github.com/CVML-CFU/ECAM.
Chinese Summary: 本文提出的ECAM模块通过对比学习增强轨迹预测模型的环境避障能力,在ETH/UCY数据集上实现碰撞率降低40-50%。
English Summary: This paper introduces ECAM, a contrastive learning module that enhances trajectory forecasting models' environmental collision avoidance, reducing collision rates by 40-50% on the ETH/UCY dataset.
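For readers unfamiliar with the environment-collision metric such a module targets, here is a minimal sketch of a collision-rate computation over a binary occupancy grid. It is not taken from the ECAM repository; the grid layout, resolution, and toy trajectories are illustrative assumptions.

```python
import numpy as np

def collision_rate(trajectories, occupancy, origin, resolution):
    """Fraction of predicted trajectories that touch an obstacle cell.

    trajectories: (N, T, 2) array of (x, y) world coordinates.
    occupancy:    (H, W) binary grid, 1 = obstacle.
    origin:       (x0, y0) world coordinate of grid cell (0, 0).
    resolution:   metres per cell.
    """
    cells = np.floor((trajectories - np.asarray(origin)) / resolution).astype(int)
    rows, cols = cells[..., 1], cells[..., 0]
    inside = (rows >= 0) & (rows < occupancy.shape[0]) & (cols >= 0) & (cols < occupancy.shape[1])
    hits = np.zeros(trajectories.shape[:2], dtype=bool)
    hits[inside] = occupancy[rows[inside], cols[inside]] > 0
    return hits.any(axis=1).mean()

# Toy usage: a 10 m x 10 m map with one obstacle block and two trajectories.
occ = np.zeros((100, 100), dtype=np.uint8)
occ[40:60, 40:60] = 1
trajs = np.stack([
    np.linspace([0.5, 0.5], [9.5, 9.5], 20),  # cuts through the obstacle
    np.linspace([0.5, 9.5], [9.5, 9.5], 20),  # skirts around it
])
print(collision_rate(trajs, occ, origin=(0.0, 0.0), resolution=0.1))  # 0.5
```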

Authors:Mingxiao Li, Mang Ning, Marie-Francine Moens
Title: Consistent Story Generation with Asymmetry Zigzag Sampling
Abstract:
Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates asymmetric prompts to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.
中文: 本文提出了一种无需训练、结合非对称提示和视觉共享的Z字形采样方法,显著提升了视觉故事生成中的主题一致性,效果优于现有技术。
English: This paper introduces a training-free Zigzag Sampling method with asymmetric prompts and visual sharing to significantly improve subject consistency in visual story generation, outperforming previous approaches.

Authors:Ligao Deng, Yupeng Deng, Yu Meng, Jingbo Chen, Zhihao Xi, Diyou Liu, Qifeng Chu
Title: GLD-Road: A global-local decoding road network extraction model for remote sensing images
Abstract:
Road networks are crucial for mapping, autonomous driving, and disaster response. While manual annotation is costly, deep learning offers efficient extraction. Current methods include postprocessing (prone to errors), global parallel (fast but misses nodes), and local iterative (accurate but slow). We propose GLD-Road, a two-stage model combining global efficiency and local precision. First, it detects road nodes and connects them via a Connect Module. Then, it iteratively refines broken roads using local searches, drastically reducing computation. Experiments show GLD-Road outperforms state-of-the-art methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++ (local). The experimental results are available at https://github.com/ucas-dlg/GLD-Road.
中文: GLD-Road是一个两阶段模型,通过全局检测节点并连接,再局部迭代修复断裂道路,高效提取道路网络,在精度和计算效率上均显著优于现有方法。
English: GLD-Road is a two-stage model that efficiently extracts road networks by globally detecting nodes and connecting them, then locally refining broken roads, achieving superior accuracy and significantly reducing computation time compared to existing methods.

Authors:Yi Zhang, Yi Wang, Yawen Cui, Lap-Pui Chau
Title: 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection
Abstract:
This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that effectively handles single- and multi-view RGB images in indoor and outdoor environments, showcasing its general-purpose applicability. The key challenge for image-based 3D object detection tasks is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we utilize the predicted depth to learn voxel occupancy and optimize the voxelized 3D feature volume explicitly through the proposed voxel occupancy attention. To further enhance 3D awareness, the feature volume is integrated with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model's comprehension of 3D geometry by leveraging intermediate 3D representations and achieve end-to-end training. Our approach surpasses the performance of state-of-the-art image-based methods on both single- and multi-view benchmark datasets across diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19 AP3D@0.7 improvement on the KITTI dataset. The project page is available at: https://cindy0725.github.io/3DGeoDet/.
中文: 本文提出3DGeoDet,一种几何感知的3D物体检测方法,通过预测深度生成显式和隐式3D表征来增强几何理解,在多个室内外数据集上实现了最优性能。
English: This paper introduces 3DGeoDet, a geometry-aware 3D object detection method that enhances 3D understanding through explicit and implicit representations using predicted depth, achieving state-of-the-art results across diverse indoor and outdoor datasets.

Authors:Tao Wang, Mengyu Li, Geduo Zeng, Cheng Meng, Qiong Zhang
Title: Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance field rendering, but it typically requires millions of redundant Gaussian primitives, overwhelming memory and rendering budgets. Existing compaction approaches address this by pruning Gaussians based on heuristic importance scores, without global fidelity guarantee. To bridge this gap, we propose a novel optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction. Specifically, we first minimize the composite transport divergence over a KD-tree partition to produce a compact geometric representation, and then decouple appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Experiments on benchmark datasets show that our method (i) yields negligible loss in rendering quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% Gaussians; and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques. Notably, our method is applicable to any stage of vanilla or accelerated 3DGS pipelines, providing an efficient and agnostic pathway to lightweight neural rendering. The code is publicly available at https://github.com/DrunkenPoet/GHAP

Authors:Beomsik Cho, Jaehyung Kim
Title: Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to $2\times$.
中文: ReVisiT是一种新颖的解码方法,通过在文本生成过程中动态参考视觉标记来增强大型视觉语言模型的视觉基础能力,无需额外训练即可显著减少幻觉并降低计算成本。
English: ReVisiT is a novel decoding method that enhances visual grounding in Large Vision-Language Models by dynamically referencing vision tokens during text generation, significantly reducing hallucinations and computational costs without requiring additional training.
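A hedged sketch of the decoding idea described above: vision-token hidden states are projected into the vocabulary space through the language-model head, the vision token whose induced distribution is closest to the current text distribution is selected, and the output distribution is refined with it. The mixing rule and the alpha weight are illustrative assumptions, not the paper's exact constrained divergence-minimization step.

```python
import torch
import torch.nn.functional as F

def vision_guided_refine(text_logits, vision_hidden, lm_head, alpha=0.5):
    """One decoding step: refine next-token probabilities with the most
    relevant vision token.

    text_logits:   (V,) next-token logits from the LVLM.
    vision_hidden: (K, D) hidden states of the K vision tokens.
    lm_head:       nn.Linear mapping hidden size D to vocabulary size V.
    """
    text_logp = F.log_softmax(text_logits, dim=-1)                # (V,)
    vision_logp = F.log_softmax(lm_head(vision_hidden), dim=-1)   # (K, V)
    # Pick the vision token whose induced distribution is closest to the text one.
    kl = (vision_logp.exp() * (vision_logp - text_logp)).sum(-1)  # KL(vision || text), (K,)
    best = kl.argmin()
    refined = (1 - alpha) * text_logp.exp() + alpha * vision_logp[best].exp()
    return refined                                                # (V,) probabilities

# Toy usage with random tensors standing in for a real LVLM.
V, D, K = 32000, 64, 16
lm_head = torch.nn.Linear(D, V, bias=False)
probs = vision_guided_refine(torch.randn(V), torch.randn(K, D), lm_head)
next_token = probs.argmax()
```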

Authors:Tianxiang Hao, Lixian Zhang, Yingjia Zhang, Mengxuan Chen, Jinxiao Zhang, Haohuan Fu
Title: Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imageries
Abstract:
Historical satellite imagery, such as mid-20$^{th}$ century Keyhole data, offers rare insights into understanding early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and annotation absence have long hindered semantic segmentation on such historical RS imagery. To bridge this gap and enhance understanding of urban development, we introduce $\textbf{Urban1960SatBench}$, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing segmentation datasets, along with a benchmark framework for unsupervised segmentation tasks, $\textbf{Urban1960SatUSM}$. First, $\textbf{Urban1960SatBench}$ serves as a novel, expertly annotated semantic segmentation dataset built on mid-20$^{th}$ century Keyhole imagery, covering 1,240 km$^2$ and key urban classes (buildings, roads, farmland, water). As the earliest segmentation dataset of its kind, it provides a pioneering benchmark for historical urban understanding. Second, $\textbf{Urban1960SatUSM}$(Unsupervised Segmentation Model) is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Experiments show Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg for segmenting historical urban scenes, promising in paving the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and supplementary material are available at https://github.com/Tianxiang-Hao/Urban1960SatSeg.
中文:该研究提出了Urban1960SatBench——目前最早的带标注历史卫星图像分割数据集,以及Urban1960SatUSM无监督框架,通过置信度感知机制有效提升了退化城市场景的分割效果。
English: The study introduces Urban1960SatBench, the earliest annotated historical satellite imagery segmentation dataset, and Urban1960SatUSM, an unsupervised framework that enhances segmentation of degraded urban scenes through confidence-aware mechanisms.

Authors:Amirreza Khoshbakht, Erchan Aptoula
Title: Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization
Abstract:
Open-set domain generalization (OSDG) for hyperspectral image classification presents significant challenges due to the presence of unknown classes in target domains and the need for models to generalize across multiple unseen domains without target-specific adaptation. Existing domain adaptation methods assume access to target domain data during training and fail to address the fundamental issue of domain shift when unknown classes are present, leading to negative transfer and reduced classification performance. To address these limitations, we propose a novel open-set domain generalization framework that combines four key components: Spectrum-Invariant Frequency Disentanglement (SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network (DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning (EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty Disentanglement (SSUD) for reliable open-set classification. The SIFD module extracts domain-invariant spectral features in the frequency domain through attention-weighted frequency analysis and domain-agnostic regularization, while DCRN captures complementary spectral and spatial information via parallel pathways with adaptive fusion. EDL provides principled uncertainty estimation using Dirichlet distributions, enabling the SSUD module to make reliable open-set decisions through uncertainty-aware pathway weighting and adaptive rejection thresholding. Experimental results on three cross-scene hyperspectral classification tasks show that our approach achieves performance comparable to state-of-the-art domain adaptation methods while requiring no access to the target domain during training. The implementation will be made available at https://github.com/amir-khb/SSUDOSDG upon acceptance.
中文摘要:本文提出了一种新的高光谱图像开集域泛化框架,通过结合频谱不变频率解耦、双通道残差网络、证据深度学习和谱空不确定性解耦四大核心组件,在无需目标域训练数据的情况下实现了优异的跨域分类性能。
English Summary: This paper introduces a novel open-set domain generalization framework for hyperspectral image classification, integrating frequency disentanglement, dual-channel networks, evidential learning, and uncertainty disentanglement to achieve robust cross-domain performance without requiring target domain data during training.
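The evidential component in frameworks like this typically follows the standard Evidential Deep Learning recipe: non-negative class evidence parameterizes a Dirichlet distribution, and total uncertainty is the number of classes divided by the Dirichlet strength. A minimal, generic sketch (not the paper's specific SSUD pathway weighting or rejection thresholding) follows.

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(logits):
    """Standard EDL quantities from per-class logits of shape (..., K)."""
    evidence = F.softplus(logits)            # non-negative class evidence
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    belief = evidence / strength             # per-class belief masses
    uncertainty = logits.shape[-1] / strength.squeeze(-1)  # u = K / S
    prob = alpha / strength                  # expected class probabilities
    return prob, belief, uncertainty

# Example: a confident sample vs. an ambiguous one for K = 5 classes.
logits = torch.tensor([[8.0, 0.0, 0.0, 0.0, 0.0],
                       [0.1, 0.1, 0.1, 0.1, 0.1]])
_, _, u = dirichlet_uncertainty(logits)
print(u)  # low uncertainty for the first row, high for the second
```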

Authors:Xinya Liu, Jianghao Wu, Tao Lu, Shaoting Zhang, Guotai Wang
Title: SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation
Abstract:
Domain Adaptation (DA) is crucial for robust deployment of medical image segmentation models when applied to new clinical centers with significant domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal with privacy concerns and access constraints on source-domain data during adaptation to target-domain data. However, SFDA faces challenges such as insufficient supervision in the target domain with unlabeled images. In this work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch Intensity Enhancement (T3IE) that not only improves quality of raw pseudo-labels in the target domain, but also leads to SAM-compatible inputs with three channels to better leverage SAM's zero-shot inference ability for refining the pseudo-labels; 2) A reliable pseudo-label selection module that rejects low-quality pseudo-labels based on Consistency of Multiple SAM Outputs (CMSO) under input perturbations with T3IE; and 3) A reliability-aware training procedure in the unlabeled target domain where reliable pseudo-labels are used for supervision and unreliable parts are regularized by entropy minimization. Experiments conducted on two multi-domain medical image segmentation datasets for fetal brain and the prostate respectively demonstrate that: 1) SRPL-SFDA effectively enhances pseudo-label quality in the unlabeled target domain, and improves SFDA performance by leveraging the reliability-aware training; 2) SRPL-SFDA outperformed state-of-the-art SFDA methods, and its performance is close to that of supervised training in the target domain. The code of this work is available online: https://github.com/HiLab-git/SRPL-SFDA.
中文: 本文提出SRPL-SFDA方法,通过SAM引导的可靠伪标签系统,结合测试时增强和可靠性感知训练,在无源域适应的医学图像分割中取得了接近监督学习的性能。
English: This paper introduces SRPL-SFDA, a source-free domain adaptation method that enhances medical image segmentation by using SAM-guided reliable pseudo-labels through test-time enhancement and reliability-aware training, achieving performance close to supervised training.
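To make the reliability-aware training idea concrete, the sketch below combines cross-entropy on pixels whose pseudo-labels are kept as reliable with entropy minimization on the remaining pixels. It is a simplified stand-in for the paper's procedure; the loss weighting and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def reliability_aware_loss(logits, pseudo_labels, reliable_mask, lambda_ent=0.1):
    """Target-domain objective: supervise reliable pixels, regularize the rest.

    logits:        (B, C, H, W) segmentation logits on unlabeled target images.
    pseudo_labels: (B, H, W) refined pseudo-labels (e.g., SAM-refined).
    reliable_mask: (B, H, W) bool, True where the pseudo-label is kept.
    """
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")   # (B, H, W)
    sup = (ce * reliable_mask).sum() / reliable_mask.sum().clamp(min=1)
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp(min=1e-8).log()).sum(dim=1)     # (B, H, W)
    unreliable = ~reliable_mask
    reg = (entropy * unreliable).sum() / unreliable.sum().clamp(min=1)
    return sup + lambda_ent * reg

# Toy usage.
logits = torch.randn(2, 4, 64, 64, requires_grad=True)
pl = torch.randint(0, 4, (2, 64, 64))
mask = torch.rand(2, 64, 64) > 0.3
reliability_aware_loss(logits, pl, mask).backward()
```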

Authors:Kaiyu Guo, Zijian Wang, Tan Pan, Brian C. Lovell, Mahsa Baktashmotlagh
Title: Improving Out-of-Distribution Detection via Dynamic Covariance Calibration
Abstract:
Out-of-Distribution (OOD) detection is essential for the trustworthiness of AI systems. Methods using prior information (i.e., subspace-based methods) have shown effective performance by extracting information geometry to detect OOD data with a more appropriate distance metric. However, these methods fail to address the geometry distorted by ill-distributed samples, due to the limitation of statically extracting information geometry from the training distribution. In this paper, we argue that the influence of ill-distributed samples can be corrected by dynamically adjusting the prior geometry in response to new data. Based on this insight, we propose a novel approach that dynamically updates the prior covariance matrix using real-time input features, refining its information. Specifically, we reduce the covariance along the direction of real-time input features and constrain adjustments to the residual space, thus preserving essential data characteristics and avoiding effects on unintended directions in the principal space. We evaluate our method on two pre-trained models for the CIFAR dataset and five pre-trained models for ImageNet-1k, including the self-supervised DINO model. Extensive experiments demonstrate that our approach significantly enhances OOD detection across various models. The code is released at https://github.com/workerbcd/ooddcc.
Chinese: 本文提出了一种动态方法,通过实时调整先验几何以纠正分布不良样本导致的失真,显著提升了多种模型的分布外检测性能,其核心在于对协方差矩阵进行实时优化。
English: This paper introduces a dynamic method that adjusts the prior geometry to correct distortions from ill-distributed samples, significantly improving Out-of-Distribution detection across multiple models by refining the covariance matrix in real-time.
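The sketch below conveys the general flavor with a Mahalanobis-style score and a rank-one downdate of the prior covariance along the current input direction. The paper's residual-space constraint and exact update rule differ; the shrink factor and toy data here are assumptions.

```python
import numpy as np

def mahalanobis_score(x, mean, cov_inv):
    """Squared Mahalanobis distance; larger values look more OOD-like."""
    d = x - mean
    return float(d @ cov_inv @ d)

def downdate_covariance(cov, x, mean, shrink=0.5, eps=1e-8):
    """Rank-one reduction of the prior covariance along the input direction;
    stays positive definite for 0 <= shrink < 1."""
    v = x - mean
    v = v / (np.linalg.norm(v) + eps)
    cv = cov @ v
    return cov - shrink * np.outer(cv, cv) / (v @ cv + eps)

# Toy usage: in-distribution features vs. one shifted (OOD-like) feature.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 8))
mean, cov = feats.mean(axis=0), np.cov(feats, rowvar=False)
for x in (feats[0], feats[1] + 4.0):
    cov_dyn = downdate_covariance(cov, x, mean)
    print(round(mahalanobis_score(x, mean, np.linalg.inv(cov_dyn)), 2))
```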

Authors:Jialong Zuo, Yongtai Deng, Mengdan Tan, Rui Jin, Dongyue Wu, Nong Sang, Liang Pan, Changxin Gao
Title: ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model
Abstract:
In real-world scenarios, person re-identification (ReID) expects to identify a person-of-interest via the descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. This dataset also offers substantial diversity, such as in painting perspectives and textual information. It could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism proposed. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.
中文摘要:本研究提出了全模态多模态行人重识别(OM-ReID)这一新任务,并构建ORBench数据集和ReID5o框架,在单一模型中实现任意模态组合的协同融合与跨模态对齐,为多模态检索提供了有效解决方案。
English Summary: This study formulates OM-ReID, a new task of person re-identification with varying multi-modal queries, supported by the newly created ORBench dataset and the ReID5o framework, which enables effective cross-modal fusion and alignment in a single model.

Authors:Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen
Title: LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
Abstract:
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. Besides, it further introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at https://github.com/AIDC-AI/LPO.
Chinese: 该研究提出位置偏好优化(LPO)方法,通过信息熵和动态奖励机制利用位置数据提升自主代理与图形用户界面的交互精度,在离线和在线评估中均取得了最先进的性能表现。
English: The study introduces Location Preference Optimization (LPO), a novel method that enhances autonomous agents' interaction precision with Graphical User Interfaces by utilizing locational data through information entropy and dynamic rewards, achieving state-of-the-art results in both offline and online evaluations.
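As a toy illustration of a distance-based location reward, the function below returns 1.0 when a predicted click lands inside the target GUI element and decays smoothly with the normalized distance outside it. The functional form and scale are hypothetical; the paper's entropy-based position prediction and GRPO training are not reproduced here.

```python
import math

def location_reward(pred_xy, target_xy, target_wh, scale=1.0):
    """Hypothetical reward: 1.0 inside the target element, exponential decay outside."""
    dx = max(abs(pred_xy[0] - target_xy[0]) - target_wh[0] / 2, 0.0)
    dy = max(abs(pred_xy[1] - target_xy[1]) - target_wh[1] / 2, 0.0)
    dist = math.hypot(dx, dy) / max(target_wh)  # distance outside the box, in element units
    return math.exp(-scale * dist)

# A click on a 120x40 px button centered at (512, 300):
print(location_reward((512, 300), (512, 300), (120, 40)))  # 1.0 (inside the element)
print(location_reward((512, 420), (512, 300), (120, 40)))  # ~0.43 (100 px below the box)
```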

Authors:Zeran Ke, Bin Tan, Xianwei Zheng, Yujun Shen, Tianfu Wu, Nan Xue
Title: ScaleLSD: Scalable Deep Line Segment Detection Streamlined
Abstract:
This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic robust LSD model that works well for any natural images. With a focus on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. Our ScaleLSD detects substantially more line segments from natural images than even the pioneering non-deep LSD approach, yielding a more complete and accurate geometric characterization of images using line segments. Experimentally, our proposed ScaleLSD is comprehensively evaluated under zero-shot protocols in detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, all with excellent performance obtained. Based on the thorough evaluation, our ScaleLSD is observed to be the first deep approach that outperforms the pioneering non-deep LSD in all aspects we have tested, significantly expanding and reinforcing the versatility of the line geometry of images. Code and Models are available at https://github.com/ant-research/scalelsd.
中文: 本文提出了ScaleLSD模型,通过自监督学习从超过1000万张未标注图像中提取线段几何特征,在线段检测、三维几何估计等任务中全面超越了传统非深度方法。
English: This paper introduces ScaleLSD, a scalable self-supervised learning model for line segment detection that outperforms traditional non-deep methods across various tasks, including 3D geometry estimation and line matching, using over 10 million unlabeled images.

Authors:Hongguang Zhu, Yunchao Wei, Mengyu Wang, Siyu Jiao, Yan Fang, Jiannan Huang, Yao Zhao
Title: SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing
Abstract:
Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing finetunes weights to unlearn undesirable concepts, and has emerged as a promising solution. However, existing methods treat unsafe concept as a fixed word and repeatedly erase it, trapping DMs in a "word concept abyss", which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing which transforms concept word erasure into concept domain erasure by the cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of concept domain through semantic spatial relationships between original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose the global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at https://github.com/KevinLight831/SAGE.
中文: SAGE通过语义增强擦除将单词概念消除转化为领域概念消除,并结合全局-局部协同保留机制,在不影响无关概念的前提下,显著提升了扩散模型的安全生成能力。
English: SAGE introduces semantic-augment erasing to transform word-level concept removal into domain-level erasure and employs a global-local retention mechanism, achieving superior safety in diffusion model generation without compromising unrelated concepts.

Authors:Yitong Zhang, Jia Li, Liyi Cai, Ge Li
Title: DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt
Abstract:
Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while preserving utility on benign ones effectively. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits great cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at https://github.com/zhangyitonggg/DAVSP.
中文摘要:提出的深度对齐视觉安全提示(DAVSP)通过可训练的视觉边界扩展和激活空间对齐训练,有效增强大视觉语言模型抵御恶意查询的能力,同时保持正常输入的实用性。
English Summary: The proposed Deep Aligned Visual Safety Prompt (DAVSP) enhances Large Vision-Language Models' resistance to malicious visual queries while maintaining performance on benign inputs through a trainable visual padding mechanism and activation-space alignment training.
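A minimal sketch of the trainable padding idea: a learnable frame of pixels is placed around each input image, so the original visual content is preserved while optimization acts on the border. Image size, padding width, and initialization are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class VisualSafetyPrompt(nn.Module):
    """Wraps each image in a trainable pixel border (the padding region)."""

    def __init__(self, image_size=336, pad=16):
        super().__init__()
        self.pad = pad
        full = image_size + 2 * pad
        self.border = nn.Parameter(torch.zeros(3, full, full))  # learnable frame

    def forward(self, images):                                   # images: (B, 3, H, W)
        b = images.shape[0]
        framed = self.border.unsqueeze(0).expand(b, -1, -1, -1).clone()
        framed[:, :, self.pad:-self.pad, self.pad:-self.pad] = images
        return framed                                            # (B, 3, H + 2p, W + 2p)

prompt = VisualSafetyPrompt()
out = prompt(torch.rand(2, 3, 336, 336))
print(out.shape)  # torch.Size([2, 3, 368, 368])
```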

Authors:Tong Wang, Guanzhou Chen, Xiaodong Zhang, Chenxi Liu, Jiaqi Wang, Xiaoliang Tan, Wenchao Guo, Qingyuan Yang, Kaiqi Zhang
Title: MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning
Abstract:
Remote sensing image interpretation plays a critical role in environmental monitoring, urban planning, and disaster assessment. However, acquiring high-quality labeled data is often costly and time-consuming. To address this challenge, we propose a multi-modal self-supervised learning framework that leverages high-resolution RGB images, multi-spectral data, and digital surface models (DSM) for pre-training. By designing an information-aware adaptive masking strategy, cross-modal masking mechanism, and multi-task self-supervised objectives, the framework effectively captures both the correlations across different modalities and the unique feature structures within each modality. We evaluated the proposed method on multiple downstream tasks, covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation. Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks. The results demonstrate that the proposed method outperforms existing pretraining approaches in most tasks. Specifically, on the Potsdam and Vaihingen semantic segmentation tasks, our method achieved mIoU scores of 78.30\% and 76.50\%, using only 50\% of the training set. For the US3D depth estimation task, the RMSE error is reduced to 0.182, and for the binary change detection task in the SECOND dataset, our method achieved an mIoU score of 47.51\%, surpassing the second-best method, CS-MAE, by 3 percentage points. Our pretrain code, checkpoints, and HR-Pairs dataset can be found at https://github.com/CVEO/MSSDF.
中文摘要:本研究提出了一种多模态自监督学习框架,通过创新的掩码策略有效利用RGB、多光谱和DSM数据,在26个遥感任务中超越现有预训练方法,展现出卓越性能。
English Summary: This study introduces a multi-modal self-supervised learning framework that effectively utilizes RGB, multi-spectral, and DSM data through innovative masking strategies, achieving superior performance across 26 remote sensing tasks compared to existing methods.

Authors:Mojtaba Nafez, Amirhossein Koochakian, Arad Maleki, Jafar Habibi, Mohammad Hossein Rohban
Title: PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies
Abstract:
Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow it by providing theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of $53.2\%$ in AD and $68.5\%$ in AL, while also maintaining competitive accuracy in non-adversarial settings. The code repository is available at https://github.com/rohban-lab/PatchGuard .
Chinese: 本研究提出了PatchGuard,一种利用伪异常和视觉Transformer框架的鲁棒异常检测与定位方法,显著提升了对抗攻击的防御能力,在对抗性和标准环境下均实现了显著性能提升。
English: This study introduces PatchGuard, a robust method for anomaly detection and localization that uses pseudo anomalies and a Vision Transformer framework to significantly enhance resistance against adversarial attacks, achieving notable performance improvements in both adversarial and standard settings.

Authors:Boyu Jiang, Liang Shi, Zhengzhi Lin, Loren Stowe, Feng Guo
Title: Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
Abstract:
The performance of perception systems in autonomous driving systems (ADS) is strongly influenced by object distance, scene dynamics, and environmental conditions such as weather. AI-based perception outputs are inherently stochastic, with variability driven by these external factors, while traditional evaluation metrics remain static and event-independent, failing to capture fluctuations in confidence over time. In this work, we introduce the Perception Characteristics Distance (PCD) -- a novel evaluation metric that quantifies the farthest distance at which an object can be reliably detected, incorporating uncertainty in model outputs. To support this, we present the SensorRainFall dataset, collected on the Virginia Smart Road using a sensor-equipped vehicle (cameras, radar, LiDAR) under controlled daylight-clear and daylight-rain scenarios, with precise ground-truth distances to the target objects. Statistical analysis reveals the presence of change points in the variance of detection confidence score with distance. By averaging the PCD values across a range of detection quality thresholds and probabilistic thresholds, we compute the mean PCD (mPCD), which captures the overall perception characteristics of a system with respect to detection distance. Applying state-of-the-art perception models shows that mPCD captures meaningful reliability differences under varying weather conditions -- differences that static metrics overlook. PCD provides a principled, distribution-aware measure of perception performance, supporting safer and more robust ADS operation, while the SensorRainFall dataset offers a valuable benchmark for evaluation. The SensorRainFall dataset is publicly available at https://www.kaggle.com/datasets/datadrivenwheels/sensorrainfall, and the evaluation code is open-sourced at https://github.com/datadrivenwheels/PCD_Python.
中文摘要:本文提出感知特征距离(PCD)这一新型评估指标,通过量化可靠检测距离并整合模型不确定性,结合SensorRainFall数据集验证了PCD能有效捕捉天气条件引起的可靠性差异,而传统静态指标无法实现这种动态评估。
English Summary: This paper introduces the Perception Characteristics Distance (PCD), a novel metric that quantifies reliable object detection range while incorporating model uncertainty, and presents the SensorRainFall dataset to demonstrate how PCD effectively captures weather-induced reliability variations that static metrics miss.
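A simplified reading of the metric can be sketched as follows: bin detections by ground-truth distance, find the farthest contiguous range in which the detector clears a confidence threshold with sufficient probability, and average that range over a grid of thresholds to obtain an mPCD-style number. The bin width, threshold grids, and toy data are assumptions; the paper's change-point and uncertainty modeling are not reproduced.

```python
import numpy as np

def pcd(distances, confidences, conf_thresh, prob_thresh, bin_width=10.0):
    """Farthest distance (in metres) up to which detection stays reliable."""
    bins = np.arange(0.0, distances.max() + bin_width, bin_width)
    idx = np.digitize(distances, bins) - 1
    reliable_up_to = 0.0
    for b in range(len(bins) - 1):
        in_bin = idx == b
        if not in_bin.any():
            continue
        p_detect = np.mean(confidences[in_bin] >= conf_thresh)
        if p_detect >= prob_thresh:
            reliable_up_to = bins[b + 1]
        else:
            break  # the first unreliable bin ends the reliable range
    return reliable_up_to

def mean_pcd(distances, confidences, conf_grid=(0.3, 0.5, 0.7), prob_grid=(0.5, 0.8, 0.9)):
    return float(np.mean([pcd(distances, confidences, c, p)
                          for c in conf_grid for p in prob_grid]))

# Toy usage: detection confidence that decays with distance plus noise.
rng = np.random.default_rng(1)
d = rng.uniform(5, 120, size=2000)
conf = np.clip(1.0 - d / 100.0 + rng.normal(0, 0.05, size=d.shape), 0, 1)
print(mean_pcd(d, conf))
```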

Authors:Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang
Title: FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
Abstract:
We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.
中文: FlagEvalMM是一个开源框架,通过将模型推理与评估解耦并采用加速工具,能高效评估多模态模型在各种视觉语言任务上的表现,为研究提供准确性能分析。
English: FlagEvalMM is an open-source framework that efficiently evaluates multimodal models across diverse vision-language tasks by decoupling inference from evaluation and utilizing acceleration tools for enhanced performance insights.

Authors:Shayan Shekarforoush, David B. Lindell, Marcus A. Brubaker, David J. Fleet
Title: Reconstructing Heterogeneous Biomolecules via Hierarchical Gaussian Mixtures and Part Discovery
Abstract:
Cryo-EM is a transformational paradigm in molecular biology where computational methods are used to infer 3D molecular structure at atomic resolution from extremely noisy 2D electron microscope images. At the forefront of research is how to model the structure when the imaged particles exhibit non-rigid conformational flexibility and compositional variation where parts are sometimes missing. We introduce a novel 3D reconstruction framework with a hierarchical Gaussian mixture model, inspired in part by Gaussian Splatting for 4D scene reconstruction. In particular, the structure of the model is grounded in an initial process that infers a part-based segmentation of the particle, providing essential inductive bias in order to handle both conformational and compositional variability. The framework, called CryoSPIRE, is shown to reveal biologically meaningful structures on complex experimental datasets, and establishes a new state-of-the-art on CryoBench, a benchmark for cryo-EM heterogeneity methods.

Authors:Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Title: VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Abstract:
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
中文摘要:VIKI-Bench是首个面向具身多智能体协作的分层基准测试平台,而VIKI-R作为两阶段框架,通过结合微调的视觉语言模型与强化学习,在异构智能体间实现了组合式协作模式,显著优于现有基线方法。
English Summary: VIKI-Bench is the first hierarchical benchmark for embodied multi-agent cooperation, while VIKI-R is a two-stage framework that significantly outperforms baselines by combining fine-tuned vision-language models with reinforcement learning to enable compositional cooperation among heterogeneous agents.

Authors:Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, Qi Tian
Title: MagCache: Fast Video Generation with Magnitude-Aware Cache
Abstract:
Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features. These approaches typically require extensive calibration with curated prompts and risk inconsistent outputs due to prompt-specific overfitting. In this paper, we introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts. Specifically, the magnitude ratio of successive residual outputs decreases monotonically and steadily in most timesteps while rapidly in the last several steps. Leveraging this insight, we introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy. Unlike existing methods requiring dozens of curated samples for calibration, MagCache only requires a single sample for calibration. Experimental results show that MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively, while preserving superior visual fidelity. It significantly outperforms existing methods in LPIPS, SSIM, and PSNR, under comparable computational budgets.
中文: 现有视频扩散模型加速技术常依赖统一启发式方法或时间嵌入变体,易因提示特定过拟合导致输出不一致,但本文提出统一幅度定律和幅度感知缓存(MagCache),通过误差建模和自适应缓存策略跳过不重要时间步,仅需单样本校准即可在Open-Sora和Wan 2.1上实现2.1倍和2.68倍加速,同时保持卓越视觉保真度。
English: Current video diffusion acceleration methods often depend on uniform heuristics or time-embedding variations, risking inconsistent outputs due to prompt-specific overfitting, but this paper introduces a unified magnitude law and a Magnitude-aware Cache (MagCache) that adaptively skips timesteps with minimal calibration, achieving significant speedups and superior visual fidelity on benchmarks.
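The caching logic can be pictured with the sketch below: the relative change in residual magnitude between consecutive computed steps is tracked, and a step is skipped (reusing the cached residual) only while the accumulated estimated error stays under a budget and a cap on consecutive skips is respected. The thresholds, the error model, and the toy denoiser are illustrative assumptions rather than MagCache's calibrated procedure.

```python
import torch

class MagnitudeAwareCache:
    """Skip-or-compute decision based on the magnitude ratio of successive residuals."""

    def __init__(self, max_accum_error=0.12, max_consecutive_skips=2):
        self.max_accum_error = max_accum_error
        self.max_skips = max_consecutive_skips
        self.prev_residual = None
        self.last_ratio_gap = float("inf")  # no ratio observed yet, so never skip
        self.accum_error = 0.0
        self.skips = 0

    def step(self, compute_residual):
        """compute_residual() runs the expensive model call and returns its residual."""
        if self.prev_residual is not None and self.skips < self.max_skips:
            estimated = self.accum_error + self.last_ratio_gap
            if estimated < self.max_accum_error:
                self.accum_error = estimated
                self.skips += 1
                return self.prev_residual       # reuse cache, skip the model call
        residual = compute_residual()
        if self.prev_residual is not None:
            ratio = residual.norm() / (self.prev_residual.norm() + 1e-8)
            self.last_ratio_gap = float(abs(1.0 - ratio))
        self.prev_residual = residual
        self.accum_error = 0.0
        self.skips = 0
        return residual

# Toy usage with a fake denoiser whose residual magnitude decays smoothly.
cache = MagnitudeAwareCache()
base = torch.randn(4, 16)
outputs = [cache.step(lambda t=t: base * (0.9 ** t)) for t in range(10)]
```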

Authors:Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Abstract:
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
中文: 现有的大规模视觉语言模型常因仅关注文本而忽略细粒度视觉信息,但提出的自回归语义视觉重建(ASVR)方法通过联合学习视觉与文本模态,显著提升了多模态理解能力,在多个基准测试中取得明显性能增益。
English: Current large vision-language models often neglect fine-grained visual details by focusing only on text, but the proposed Autoregressive Semantic Visual Reconstruction (ASVR) method enhances multimodal understanding by jointly learning visual and textual modalities, achieving significant performance improvements across benchmarks.
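At its core, this kind of training objective pairs ordinary next-token prediction on text with autoregressive prediction of discrete semantic visual tokens. A hedged sketch of such a joint loss follows; the vocabulary sizes, sequence lengths, and weighting are assumptions, and the semantic tokenizer itself is not shown.

```python
import torch
import torch.nn.functional as F

def joint_autoregressive_loss(text_logits, text_targets,
                              visual_logits, visual_token_targets, visual_weight=1.0):
    """Next-token loss on text plus autoregressive loss on semantic visual tokens.

    text_logits:   (B, Lt, V_text)   text_targets:          (B, Lt)
    visual_logits: (B, Lv, V_vis)    visual_token_targets:  (B, Lv)
    """
    text_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    visual_loss = F.cross_entropy(visual_logits.flatten(0, 1), visual_token_targets.flatten())
    return text_loss + visual_weight * visual_loss

# Toy usage with random tensors standing in for model outputs and tokenized targets.
loss = joint_autoregressive_loss(
    torch.randn(2, 12, 32000), torch.randint(0, 32000, (2, 12)),
    torch.randn(2, 9, 8192), torch.randint(0, 8192, (2, 9)),
)
```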

Authors:Daniel Shao, Richard J. Chen, Andrew H. Song, Joel Runevic, Ming Y. Lu, Tong Ding, Faisal Mahmood
Title: Do Multiple Instance Learning Models Transfer?
Abstract:
Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource which standardizes the implementation of MIL models and collection of pretrained model weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab
Chinese: 本研究表明,在计算病理学中,经过预训练的多示例学习模型即使应用于不同器官也始终优于从头训练的模型,且全癌种预训练能以比基础模型更少的数据实现强大的泛化能力。
English: This study demonstrates that pretrained Multiple Instance Learning (MIL) models in computational pathology consistently outperform models trained from scratch, even across different organs, and that pancancer pretraining enables strong generalization with less data than foundation models.

Authors:Fabian Immel, Jan-Hendrik Pauls, Richard Fehler, Frank Bieder, Jonas Merkert, Christoph Stiller
Title: SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction
Abstract:
Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as prior, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet
中文: SDTagNet是一种在线高精地图构建方法,通过充分利用广泛可用的标准地图,结合折线数据和文本注释以丰富语义信息,并采用点级编码器和正交标识符统一整合所有地图元素,从而显著提升远距离检测精度。
English: SDTagNet is an online HD map construction method that enhances far-range detection accuracy by fully utilizing widely available SD maps, incorporating both polyline data and textual annotations for enriched semantic information, and integrating all map elements uniformly through a point-level encoder and orthogonal identifiers.

Authors:Chenyu Lian, Hong-Yu Zhou, Dongyun Liang, Jing Qin, Liansheng Wang
Title: Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models
Abstract:
Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.
中文: ALTA方法通过适配预训练的视觉模型,以极低的计算成本实现了医学视觉与语言的高效对齐,在检索和零样本分类任务中表现卓越。
English: The proposed ALTA method efficiently aligns medical vision and language by adapting a pretrained vision model from masked record modeling, achieving superior performance in retrieval and zero-shot classification with minimal computational resources.
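Parameter-efficient adaptation of a frozen, masked-modeling-pretrained vision encoder is commonly done with small residual bottleneck adapters. The sketch below shows that generic pattern only; it is not ALTA's specific adapter design, and the dimensions and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project, add."""

    def __init__(self, dim, reduction=8):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Freeze a pretrained block and train only the adapter parameters.
dim = 768
frozen_block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
for p in frozen_block.parameters():
    p.requires_grad_(False)
adapter = BottleneckAdapter(dim)
tokens = torch.randn(2, 197, dim)
out = adapter(frozen_block(tokens))
print(sum(p.numel() for p in adapter.parameters()))  # a small fraction of the backbone
```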

Authors:Hongjie Zhu, Xiwei Liu, Rundong Xue, Zeyu Zhang, Yong Xu, Daji Ergu, Ying Cai, Yang Zhao
Title: SSS: Semi-Supervised SAM-2 with Efficient Prompting for Medical Imaging Segmentation
Abstract:
In the era of information explosion, efficiently leveraging large-scale unlabeled data while minimizing the reliance on high-quality pixel-level annotations remains a critical challenge in the field of medical imaging. Semi-supervised learning (SSL) enhances the utilization of unlabeled data by facilitating knowledge transfer, significantly improving the performance of fully supervised models and emerging as a highly promising research direction in medical image analysis. Inspired by the ability of Vision Foundation Models (e.g., SAM-2) to provide rich prior knowledge, we propose SSS (Semi-Supervised SAM-2), a novel approach that leverages SAM-2's robust feature extraction capabilities to uncover latent knowledge in unlabeled medical images, thus effectively enhancing feature support for fully supervised medical image segmentation. Specifically, building upon the single-stream "weak-to-strong" consistency regularization framework, this paper introduces a Discriminative Feature Enhancement (DFE) mechanism to further explore the feature discrepancies introduced by various data augmentation strategies across multiple views. By leveraging feature similarity and dissimilarity across multi-scale augmentation techniques, the method reconstructs and models the features, thereby effectively optimizing the salient regions. Furthermore, a prompt generator is developed that integrates Physical Constraints with a Sliding Window (PCSW) mechanism to generate input prompts for unlabeled data, fulfilling SAM-2's requirement for additional prompts. Extensive experiments demonstrate the superiority of the proposed method for semi-supervised medical image segmentation on two multi-label datasets, i.e., ACDC and BHSD. Notably, SSS achieves an average Dice score of 53.15 on BHSD, surpassing the previous state-of-the-art method by +3.65 Dice. Code will be available at https://github.com/AIGeeksGroup/SSS.
中文: 提出的SSS方法利用SAM-2的特征提取能力,通过引入判别性特征增强机制优化半监督医学图像分割,在BHSD数据集上以53.15的Dice分数实现了最先进的性能。
English: The proposed SSS method leverages SAM-2's feature extraction capabilities and introduces a Discriminative Feature Enhancement mechanism to optimize semi-supervised medical image segmentation, achieving state-of-the-art performance with a 53.15 Dice score on the BHSD dataset.

Authors:Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang
Title: What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
Abstract:
As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenarios, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91\% human acceptance rate. Training on our graph-structured data shows that it can more efficiently guide agents compared to manually annotated data. We conduct multidimensional evaluations for various open-source and closed-source models, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io/.
中文:OmniBench提出了一种自生成、跨平台的图结构基准,通过子任务组合自动合成可控复杂度的任务,而OmniEval则提供多维度评估框架,其图结构数据在高效指导智能体和人类高接受度方面表现卓越。
English: OmniBench introduces a self-generating, cross-platform benchmark with automated task synthesis for controllable complexity, while OmniEval provides a multidimensional evaluation framework, achieving high human acceptance and efficient agent training through graph-structured data.

Authors:Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
Title: Inherently Faithful Attention Maps for Vision Transformers
Abstract:
We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: https://github.com/ananthu-aniraj/ifam
Chinese: 本文提出了一种基于注意力的两阶段框架,首先识别任务相关的图像区域,然后使用二进制注意力掩码聚焦分析这些区域,显著提高了对虚假关联和分布外背景的鲁棒性。
English: This paper presents a two-stage attention-based framework that first identifies task-relevant image regions and then uses binary attention masks to focus analysis on these areas, significantly enhancing robustness against spurious correlations and out-of-distribution backgrounds.
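For readers who want a concrete picture of how a binary input attention mask can restrict a model's receptive field (the mechanism the second stage builds on), here is a minimal, self-contained PyTorch sketch. The module, dimensions, and random mask are illustrative assumptions, not the authors' implementation, which lives in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    """Toy self-attention whose keys/values are restricted to a binary region
    mask (1 = task-relevant token), so masked-out patches cannot influence the
    output. Requires PyTorch 2.x for scaled_dot_product_attention."""

    def __init__(self, dim=64, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, region_mask):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))
        allow = region_mask[:, None, None, :].bool()       # True = key may be attended to
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=allow)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

attn = MaskedSelfAttention()
x = torch.randn(2, 16, 64)                                  # 16 patch tokens per image
mask = (torch.rand(2, 16) > 0.5).float()
mask[:, 0] = 1.0                                            # keep at least one token
print(attn(x, mask).shape)                                  # torch.Size([2, 16, 64])
```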

Authors:Jiajun Li, Yue Ma, Xinyu Zhang, Qingyan Wei, Songhua Liu, Linfeng Zhang
Title: SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping
Abstract:
Recent studies on Visual Autoregressive (VAR) models have highlighted that high-frequency components, or later steps, in the generation process contribute disproportionately to inference latency. However, the underlying computational redundancy involved in these steps has yet to be thoroughly investigated. In this paper, we conduct an in-depth analysis of the VAR inference process and identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy. To address step redundancy, we propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency. For unconditional branch redundancy, we observe that the information gap between the conditional and unconditional branches is minimal. Leveraging this insight, we introduce unconditional branch replacement, a technique that bypasses the unconditional branch to reduce computational cost. Notably, we observe that the effectiveness of acceleration strategies varies significantly across different samples. Motivated by this, we propose SkipVAR, a sample-adaptive framework that leverages frequency information to dynamically select the most suitable acceleration strategy for each instance. To evaluate the role of high-frequency information, we introduce high-variation benchmark datasets that test model sensitivity to fine details. Extensive experiments show SkipVAR achieves over 0.88 average SSIM with up to 1.81x overall acceleration and 2.62x speedup on the GenEval benchmark, maintaining model quality. These results confirm the effectiveness of frequency-aware, training-free adaptive acceleration for scalable autoregressive image generation. Our code is available at https://github.com/fakerone-li/SkipVAR and has been publicly released.
中文: 针对视觉自回归模型中的步骤冗余和无条件分支冗余问题,研究者提出了SkipVAR框架,通过动态选择加速策略,在保持图像质量的同时实现了显著的速度提升。
English: Recent research on Visual Autoregressive models identifies step and unconditional branch redundancies as key inefficiencies, leading to the development of SkipVAR, a sample-adaptive framework that dynamically applies acceleration strategies to achieve significant speed improvements while preserving image quality.
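As a rough illustration of one of the two redundancies above, unconditional branch replacement, the hedged sketch below shows classifier-free guidance where the unconditional forward pass is skipped at later steps and a cached output is reused. The toy model, step threshold, and update rule are invented for illustration and are not the SkipVAR implementation.

```python
import torch

def cfg_step(model, x, cond, uncond, scale, reuse_uncond=None):
    """One classifier-free-guidance evaluation. If `reuse_uncond` is given,
    skip the unconditional forward pass and reuse that cached output
    (a sketch of 'unconditional branch replacement')."""
    eps_c = model(x, cond)
    if reuse_uncond is None:
        eps_u = model(x, uncond)           # full two-branch CFG
    else:
        eps_u = reuse_uncond               # cheap replacement at late steps
    return eps_u + scale * (eps_c - eps_u), eps_u

# toy "model": pretend conditioning just shifts the prediction
model = lambda x, c: x * 0.9 + c
x = torch.randn(1, 4)
cond, uncond = torch.ones(1, 4), torch.zeros(1, 4)

cached = None
num_steps = 8
for step in range(num_steps):
    skip = step >= num_steps // 2 and cached is not None   # only replace in late steps
    pred, cached = cfg_step(model, x, cond, uncond, scale=3.0,
                            reuse_uncond=cached if skip else None)
    x = x - 0.1 * pred                                      # dummy update standing in for a sampler
print(x)
```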

Authors:José Morano, Botond Fazekas, Emese Sükei, Ronald Fecso, Taha Emre, Markus Gumpinger, Georg Faustmann, Marzieh Oghbaie, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Title: MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis
Abstract:
Artificial intelligence (AI) has become a fundamental tool for assisting clinicians in analyzing ophthalmic images, such as optical coherence tomography (OCT). However, developing AI models often requires extensive annotation, and existing models tend to underperform on independent, unseen data. Foundation models (FMs), large AI models trained on vast unlabeled datasets, have shown promise in overcoming these challenges. Nonetheless, available FMs for ophthalmology lack extensive validation, especially for segmentation tasks, and focus on a single imaging modality. In this context, we propose MIRAGE, a novel multimodal FM for the analysis of OCT and scanning laser ophthalmoscopy (SLO) images. Additionally, we propose a new evaluation benchmark with OCT/SLO classification and segmentation tasks. The comparison with general and specialized FMs and segmentation methods shows the superiority of MIRAGE in both types of tasks, highlighting its suitability as a basis for the development of robust AI systems for retinal OCT image analysis. Both MIRAGE and the evaluation benchmark are publicly available: https://github.com/j-morano/MIRAGE.
Chinese: MIRAGE是一种新型多模态基础模型,用于分析OCT和SLO图像,在分类和分割任务中相比现有方法展现出卓越性能。
English: MIRAGE is a novel multimodal foundation model designed for analyzing OCT and SLO images, demonstrating superior performance in classification and segmentation tasks compared to existing methods.

Authors:Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Jungong Han, Guiguang Ding
Title: DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
Abstract:
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
Chinese: 本文提出DiscoVLA方法,通过融合图像-视频特征和利用图像级对齐知识,同时解决CLIP模型在视频-文本检索中的视觉、语言和对齐三大差异,在MSRVTT数据集上以50.5%的R@1分数实现了最优性能。
English: This paper introduces DiscoVLA, a method that simultaneously addresses vision, language, and alignment discrepancies in adapting CLIP for video-text retrieval by integrating image-video features and leveraging image-level alignment knowledge, achieving state-of-the-art performance with a 50.5% R@1 score on MSRVTT.

Authors:Zike Wu, Qi Yan, Xuanyu Yi, Lele Wang, Renjie Liao
Title: StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams
Abstract:
Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at https://github.com/nickwzk/StreamSplat.
中文: StreamSplat是一种开创性的前馈框架,能将未校准视频流在线转换为动态3D高斯溅射表示,在实时重建、动态建模和计算效率方面表现卓越。
English: StreamSplat is a pioneering feed-forward framework that converts uncalibrated video streams into dynamic 3D Gaussian Splatting representations online, excelling in real-time reconstruction, dynamic modeling, and computational efficiency.

Authors:Jingguo Qu, Xinyang Han, Tonghuan Xiao, Jia Ai, Juan Wu, Tong Zhao, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying
Title: Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Abstract:
Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet, their performance is hindered by the significant differences between natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore the fine-tuning pipeline for vision-language foundation models by utilizing large language model as text refiner with special-designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method can effectively improve the performance of vision-language foundation models for ultrasound image analysis, and outperform the existing state-of-the-art vision-language and pure foundation models. The source code of this study is available at https://github.com/jinggqu/NextGen-UIA.
中文: 本研究开发了视觉语言基础模型的领域自适应方法,通过使用大语言模型作为文本优化器和任务驱动头来提升超声图像分析性能,在多个数据集的分割和分类任务中均超越了现有最优模型。
English: This study develops domain adaptation methods for vision-language foundation models, utilizing large language models as text refiners and task-driven heads to enhance ultrasound image analysis, achieving superior performance in segmentation and classification tasks across multiple datasets.

Authors:Zhiyuan Ma, Ruixun Liu, Sixian Liu, Jianjun Li, Bowen Zhou
Title: Flow Diverse and Efficient: Learning Momentum Flow Matching via Stochastic Velocity Field Sampling
Abstract:
Recently, the rectified flow (RF) has emerged as the new state-of-the-art among flow-based diffusion models due to its high efficiency advantage in straight path sampling, especially with the amazing images generated by a series of RF models such as Flux 1.0 and SD 3.0. Although a straight-line connection between the noisy and natural data distributions is intuitive, fast, and easy to optimize, it still inevitably leads to: 1) Diversity concerns, which arise since straight-line paths only cover a fairly restricted sampling space. 2) Multi-scale noise modeling concerns, since the straight line flow only needs to optimize the constant velocity field $\bm v$ between the two distributions $\bm{\pi}_0$ and $\bm{\pi}_1$. In this work, we present Discretized-RF, a new family of rectified flow (also called momentum flow models since they refer to the previous velocity component and the random velocity component in each diffusion step), which discretizes the straight path into a series of variable velocity field sub-paths (namely ``momentum fields'') to expand the search space, especially when close to the distribution $p_\text{noise}$. Different from the previous case where noise is directly superimposed on $\bm x$, we introduce noise on the velocity $\bm v$ of the sub-path to change its direction in order to improve the diversity and multi-scale noise modeling abilities. Experimental results on several representative datasets demonstrate that learning momentum flow matching by sampling random velocity fields will produce trajectories that are both diverse and efficient, and can consistently generate high-quality and diverse results. Code is available at https://github.com/liuruixun/momentum-fm.
中文:Discretized-RF通过将直线采样路径分解为可变速度子路径并在速度分量上引入噪声,提升了整流流模型的多样性和多尺度噪声建模能力,同时保持了高效性。
English: Discretized-RF enhances rectified flow models by breaking the straight sampling path into variable-velocity sub-paths with noise introduced on velocity components, improving both diversity and multi-scale noise modeling while maintaining efficiency.
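The sketch below shows a plain rectified-flow training step plus one possible reading of the core idea: perturbing the straight-line velocity target rather than the sample itself. The toy network, noise scale, and data are assumptions for illustration only; consult the released code for the actual objective.

```python
import torch
import torch.nn as nn

# Toy velocity-prediction network on 2-D data; all sizes and values are illustrative.
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def training_step(x1, sigma=0.1):
    """One rectified-flow step where the straight-line velocity target is
    perturbed (one reading of 'stochastic velocity field sampling')."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    v = x1 - x0                                    # constant straight-line velocity
    v_noisy = v + sigma * torch.randn_like(v)      # sampled (perturbed) velocity field
    x_t = x0 + t * v_noisy                         # point on the perturbed sub-path
    pred = net(torch.cat([x_t, t], dim=-1))        # predict velocity at (x_t, t)
    return ((pred - v_noisy) ** 2).mean()

x1 = torch.randn(16, 2)                            # stand-in for real data samples
loss = training_step(x1)
loss.backward()
opt.step()
print(float(loss))
```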

Authors:Yuhang Wang, Jun Li, Zhijian Wu, Jifeng Shen, Jianhua Xu, Wankou Yang
Title: InceptionMamba: An Efficient Hybrid Network with Large Band Convolution and Bottleneck Mamba
Abstract:
Within the family of convolutional neural networks, InceptionNeXt has shown excellent competitiveness in image classification and a number of downstream tasks. Built on parallel one-dimensional strip convolutions, however, it suffers from a limited ability to capture spatial dependencies along different dimensions and fails to fully exploit spatial modeling in the local neighborhood. Moreover, the inherent locality constraints of convolution operations are detrimental to effective global context modeling. To overcome these limitations, we propose a novel backbone architecture termed InceptionMamba in this study. More specifically, the traditional one-dimensional strip convolutions are replaced by orthogonal band convolutions in our InceptionMamba to achieve cohesive spatial modeling. Furthermore, global contextual modeling can be achieved via a bottleneck Mamba module, facilitating enhanced cross-channel information fusion and an enlarged receptive field. Extensive evaluations on classification and various downstream tasks demonstrate that the proposed InceptionMamba achieves state-of-the-art performance with superior parameter and computational efficiency. The source code will be available at https://github.com/Wake1021/InceptionMamba.
中文: InceptionMamba架构采用正交带状卷积替代传统一维条带卷积,并结合瓶颈Mamba模块强化空间与全局上下文建模,在图像分类及下游任务中以卓越的参数量与计算效率实现最优性能。
English: The proposed InceptionMamba architecture replaces traditional one-dimensional strip convolutions with orthogonal band convolutions and integrates a bottleneck Mamba module to enhance spatial and global context modeling, achieving state-of-the-art performance in image classification and downstream tasks with superior efficiency.
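A small sketch of the orthogonal band convolution idea, replacing 1xk / kx1 strips with bxk / kxb bands so that each branch sees a thin 2-D neighborhood, is given below. The kernel and band sizes are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

class OrthogonalBandConv(nn.Module):
    """Split channels into two branches and apply orthogonal depthwise band
    convolutions (band x k and k x band); a conceptual sketch only."""

    def __init__(self, channels, k=11, band=3):
        super().__init__()
        half = channels // 2
        self.horizontal = nn.Conv2d(half, half, (band, k),
                                    padding=(band // 2, k // 2), groups=half)
        self.vertical = nn.Conv2d(channels - half, channels - half, (k, band),
                                  padding=(k // 2, band // 2), groups=channels - half)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        return torch.cat([self.horizontal(a), self.vertical(b)], dim=1)

x = torch.randn(1, 64, 56, 56)
print(OrthogonalBandConv(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```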

Authors:Mohammadreza Salehi, Shashanka Venkataramanan, Ioana Simion, Efstratios Gavves, Cees G. M. Snoek, Yuki M Asano
Title: MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
Abstract:
Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve state-of-the-art by 1% to 6% on six image and video datasets and four evaluation benchmarks. The implementation is publicly available at our GitHub repository: https://github.com/SMSD75/MoSiC/tree/main
中文: 本文提出了一种运动引导的自监督学习框架,通过聚类密集点轨迹来学习时空一致的视频表征,在动态场景中增强了鲁棒性,并在多个基准测试中以1%至6%的优势超越了现有最优方法。
English: This paper introduces a motion-guided self-supervised learning framework that clusters dense point tracks to achieve spatiotemporally consistent video representations, improving robustness in dynamic scenes and outperforming state-of-the-art methods by 1% to 6% across multiple benchmarks.
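To make the optimal-transport ingredient concrete, here is a generic Sinkhorn-Knopp routine that produces balanced soft cluster assignments for track features, in the spirit of SwAV-style clustering. It is a sketch under that assumption, not the authors' exact procedure.

```python
import torch

@torch.no_grad()
def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    """Balanced soft cluster assignments via Sinkhorn-Knopp normalization.
    scores: (N, K) similarities between N track features and K prototypes."""
    q = torch.exp(scores / eps)
    q /= q.sum()
    N, K = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=0, keepdim=True); q /= K   # equalize cluster marginals
        q /= q.sum(dim=1, keepdim=True); q /= N   # normalize per sample
    return q / q.sum(dim=1, keepdim=True)         # rows sum to 1

feats = torch.nn.functional.normalize(torch.randn(128, 64), dim=1)   # track features
protos = torch.nn.functional.normalize(torch.randn(16, 64), dim=1)   # cluster prototypes
q = sinkhorn_assign(feats @ protos.T)
print(q.shape, q.sum(dim=1)[:3])                  # (128, 16), each row sums to ~1
```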

Authors:Milica Škipina, Nikola Jovišić, Nicola Dall'Asen, Vanja Švenda, Anil Osman Tur, Slobodan Ilić, Elisa Ricci, Dubravko Ćulibrk
Title: MAMBO: High-Resolution Generative Approach for Mammography Images
Abstract:
Mammography is the gold standard for the detection and diagnosis of breast cancer. This procedure can be significantly enhanced with Artificial Intelligence (AI)-based software, which assists radiologists in identifying abnormalities. However, training AI systems requires large and diverse datasets, which are often difficult to obtain due to privacy and ethical constraints. To address this issue, the paper introduces MAMmography ensemBle mOdel (MAMBO), a novel patch-based diffusion approach designed to generate full-resolution mammograms. Diffusion models have shown breakthrough results in realistic image generation, yet few studies have focused on mammograms, and none have successfully generated high-resolution outputs required to capture fine-grained features of small lesions. To achieve this, MAMBO integrates separate diffusion models to capture both local and global (image-level) contexts. The contextual information is then fed into the final model, significantly aiding the noise removal process. This design enables MAMBO to generate highly realistic mammograms of up to 3840x3840 pixels. Importantly, this approach can be used to enhance the training of classification models and extended to anomaly segmentation. Experiments, both numerical and radiologist validation, assess MAMBO's capabilities in image generation, super-resolution, and anomaly segmentation, highlighting its potential to enhance mammography analysis for more accurate diagnoses and earlier lesion detection. The source code used in this study is publicly available at: https://github.com/iai-rs/mambo.
中文: 本文提出MAMBO模型,通过基于分块的扩散方法生成高分辨率乳腺X光图像,以解决人工智能训练数据短缺问题,实验验证其在提升诊断精度和早期病灶检测方面的潜力。
English: The paper introduces MAMBO, a patch-based diffusion model that generates high-resolution mammograms to overcome data limitations in AI training, with experiments demonstrating its effectiveness in enhancing diagnostic accuracy and early lesion detection.

Authors:Simon Roschmann, Quentin Bouniot, Vasilii Feofanov, Ievgen Redko, Zeynep Akata
Title: Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
Abstract:
Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain. Code is available at https://github.com/ExplainableML/TiViT.
中文: TiViT框架将时间序列转换为图像以利用预训练的视觉变换器,实现了最先进的分类性能,并揭示了在非视觉领域重用视觉表示的新方向。
English: The TiViT framework transforms time series into images to utilize pretrained Vision Transformers, achieving state-of-the-art classification performance and revealing a novel approach for reusing visual representations in non-visual domains.
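A minimal sketch of the first step described above, turning a 1-D series into a ViT-friendly image, is shown below; the folding and resizing choices are illustrative assumptions rather than the paper's exact transform.

```python
import torch
import torch.nn.functional as F

def series_to_image(x, rows=32, size=224):
    """Fold a univariate series into a 2-D grid and resize it to a 3-channel image.
    x: (B, T) -> (B, 3, size, size)"""
    B, T = x.shape
    cols = -(-T // rows)                              # ceil division
    x = F.pad(x, (0, rows * cols - T))                # zero-pad to fill the grid
    img = x.view(B, 1, rows, cols)
    lo = img.amin(dim=(2, 3), keepdim=True)
    hi = img.amax(dim=(2, 3), keepdim=True)
    img = (img - lo) / (hi - lo + 1e-8)               # min-max normalize per sample
    img = F.interpolate(img, size=(size, size), mode="bilinear", align_corners=False)
    return img.repeat(1, 3, 1, 1)                     # grey -> 3 channels

series = torch.randn(4, 500)                          # e.g. a small batch of series
images = series_to_image(series)
print(images.shape)                                   # torch.Size([4, 3, 224, 224])
# `images` could then be fed to any frozen, image-pretrained ViT to extract features.
```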

Authors:Yichong Lu, Yuzhuo Tian, Zijin Jiang, Yikun Zhao, Yuanbo Yang, Hao Ouyang, Haoji Hu, Huimin Yu, Yujun Shen, Yiyi Liao
Title: Orientation Matters: Making 3D Generative Models Orientation-Aligned
Abstract:
Humans intuitively perceive object shape and orientation from a single image, guided by strong priors about canonical poses. However, existing 3D generative models often produce misaligned results due to inconsistent training data, limiting their usability in downstream tasks. To address this gap, we introduce the task of orientation-aligned 3D object generation: producing 3D objects from single images with consistent orientations across categories. To facilitate this, we construct Objaverse-OA, a dataset of 14,832 orientation-aligned 3D models spanning 1,008 categories. Leveraging Objaverse-OA, we fine-tune two representative 3D generative models based on multi-view diffusion and 3D variational autoencoder frameworks to produce aligned objects that generalize well to unseen objects across various categories. Experimental results demonstrate the superiority of our method over post-hoc alignment approaches. Furthermore, we showcase downstream applications enabled by our aligned object generation, including zero-shot object orientation estimation via analysis-by-synthesis and efficient arrow-based object rotation manipulation.

Authors:Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee
Title: HSG-12M: A Large-Scale Spatial Multigraph Dataset
Abstract:
Existing graph benchmarks assume non-spatial, simple edges, collapsing physically distinct paths into a single link. We introduce HSG-12M, the first large-scale dataset of $\textbf{spatial multigraphs}$ -- graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. HSG-12M contains 11.6 million static and 5.1 million dynamic $\textit{Hamiltonian spectral graphs}$ across 1401 characteristic-polynomial classes, derived from 177 TB of spectral potential data. Each graph encodes the full geometry of a 1-D crystal's energy spectrum on the complex plane, producing diverse, physics-grounded topologies that transcend conventional node-coordinate datasets. To enable future extensions, we release $\texttt{Poly2Graph}$: a high-performance, open-source pipeline that maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with popular GNNs expose new challenges in learning from multi-edge geometry at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for geometry-aware graph learning and new opportunities for data-driven scientific discovery in condensed matter physics and beyond.
中文摘要:本文提出了HSG-12M,这是首个大规模空间多重图数据集,通过保留节点间多条几何路径并基于光谱势数据构建,为几何感知的图学习及科学发现奠定了基础。
English Summary: This paper introduces HSG-12M, the first large-scale spatial multigraph dataset that preserves multiple geometrically distinct paths between nodes, derived from spectral potential data to enable geometry-aware graph learning and scientific discovery.

Authors:Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, Xinchao Wang
Title: Diversity-Guided MLP Reduction for Efficient Large Vision Transformers
Abstract:
Transformer models exhibit an excellent scaling property, where performance improves as model capacity increases. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we apply a Gram-Schmidt weight pruning strategy to eliminate redundant neurons in the MLP hidden layer, while preserving weight diversity for better performance recovery during distillation. Compared to the model trained from scratch, our pruned model only requires 0.06\% of the LAION-2B data (used for the training of large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0\% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5\% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.
中文: 提出的多样性引导多层感知机缩减方法通过剪枝冗余神经元并保持权重多样性,以近乎无损的方式将大型视觉变换器的参数和计算量减少超过57%。
English: The proposed Diversity-Guided MLP Reduction (DGMR) method significantly compresses large vision transformers by pruning redundant MLP neurons while preserving weight diversity, achieving over 57% parameter and FLOPs reduction with negligible performance loss.
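The following is a simplified sketch of Gram-Schmidt-style neuron selection: greedily keep hidden neurons whose incoming weight rows add the most new direction, so the retained set stays diverse. It is a rough stand-in for the paper's pruning strategy, with toy sizes.

```python
import torch

def select_diverse_neurons(W, keep):
    """Greedy Gram-Schmidt selection of `keep` hidden neurons with mutually
    non-redundant incoming weight rows. For this sketch, `keep` should not
    exceed the input width so the residual stays well-conditioned.
    W: (hidden, in_features); returns indices of kept neurons."""
    residual = W.clone().float()
    kept = []
    for _ in range(keep):
        norms = residual.norm(dim=1)
        idx = int(norms.argmax())                       # most "unexplained" neuron
        kept.append(idx)
        q = residual[idx] / (norms[idx] + 1e-8)         # unit direction of that neuron
        residual -= residual @ q[:, None] * q[None, :]  # remove that direction everywhere
    return torch.tensor(kept)

W = torch.randn(2048, 768)                              # toy MLP up-projection weight
idx = select_diverse_neurons(W, keep=512)
print(idx.shape)                                        # kept-neuron indices, (512,)
```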

Authors:Qi Yan, Brian Zhang, Yutong Zhang, Daniel Yang, Joshua White, Di Chen, Jiachao Liu, Langechuan Liu, Binnan Zhuang, Shaoshuai Shi, Renjie Liao
Title: TrajFlow: Multi-modal Motion Prediction via Flow Matching
Abstract:
Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model's own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website https://traj-flow.github.io/.

Authors:Xijun Wang, Xin Li, Bingchen Li, Zhibo Chen
Title: LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4$\times$RTX 4090s
Abstract:
Diffusion models have significantly advanced video super-resolution (VSR) by enhancing perceptual quality, largely through elaborately designed temporal modeling to ensure inter-frame consistency. However, existing methods usually suffer from limited temporal coherence and prohibitively high computational costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for long videos. In this work, we propose LiftVSR, an efficient VSR framework that leverages and elevates the image-wise diffusion prior from PixArt-$\alpha$, achieving state-of-the-art results using only 4$\times$RTX 4090 GPUs. To balance long-term consistency and efficiency, we introduce a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within short frame segments (i.e., low complexity), and (ii) Attention Memory Cache (AMC) for long-term temporal modeling across segments (i.e., consistency). Specifically, DTA identifies multiple token flows across frames within multi-head query and key tokens to warp inter-frame contexts in the value tokens. AMC adaptively aggregates historical segment information via a cache unit, ensuring long-term coherence with minimal overhead. To further stabilize the cache interaction during inference, we introduce an asymmetric sampling strategy that mitigates feature mismatches arising from different diffusion sampling steps. Extensive experiments on several typical VSR benchmarks demonstrate that LiftVSR achieves impressive performance with significantly lower computational costs.
中文: LiftVSR是一种高效的视频超分辨率框架,通过动态时序注意力实现短时帧间建模和注意力记忆缓存确保长时一致性,以显著降低的计算成本取得了领先性能。
English: LiftVSR is an efficient video super-resolution framework that combines Dynamic Temporal Attention for short-term frame consistency and an Attention Memory Cache for long-term coherence, achieving state-of-the-art results with reduced computational costs.

Authors:Sunny Gupta, Nikita Jangid, Shounak Das, Amit Sethi
Title: FEDTAIL: Federated Long-Tailed Domain Generalization with Sharpness-Guided Gradient Matching
Abstract:
Domain Generalization (DG) seeks to train models that perform reliably on unseen target domains without access to target data during training. While recent progress in smoothing the loss landscape has improved generalization, existing methods often falter under long-tailed class distributions and conflicting optimization objectives. We introduce FedTAIL, a federated domain generalization framework that explicitly addresses these challenges through sharpness-guided, gradient-aligned optimization. Our method incorporates a gradient coherence regularizer to mitigate conflicts between classification and adversarial objectives, leading to more stable convergence. To combat class imbalance, we perform class-wise sharpness minimization and propose a curvature-aware dynamic weighting scheme that adaptively emphasizes underrepresented tail classes. Furthermore, we enhance conditional distribution alignment by integrating sharpness-aware perturbations into entropy regularization, improving robustness under domain shift. FedTAIL unifies optimization harmonization, class-aware regularization, and conditional alignment into a scalable, federated-compatible framework. Extensive evaluations across standard domain generalization benchmarks demonstrate that FedTAIL achieves state-of-the-art performance, particularly in the presence of domain shifts and label imbalance, validating its effectiveness in both centralized and federated settings. Code: https://github.com/sunnyinAI/FedTail
中文摘要:FedTAIL作为一种联邦领域泛化框架,通过锐度感知优化解决了类别不平衡与目标冲突问题,在领域偏移和标签不平衡场景下实现了最先进的性能表现。
English Summary: FedTAIL is a federated domain generalization framework that tackles class imbalance and conflicting objectives through sharpness-aware optimization, achieving state-of-the-art performance under domain shifts and label imbalance.
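To illustrate the sharpness-guided ingredient, here is a generic sharpness-aware minimization (SAM) step; FedTAIL builds class-aware weighting and gradient-alignment terms on top of this idea, none of which are shown here.

```python
import torch
import torch.nn as nn

def sam_step(model, loss_fn, x, y, opt, rho=0.05):
    """One generic sharpness-aware update: perturb weights toward the gradient,
    compute the gradient at the perturbed point, then apply it at the original
    weights. Illustrative only, not the authors' full method."""
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e); eps.append(e)               # climb toward the locally sharpest point
    opt.zero_grad()
    loss_fn(model(x), y).backward()                # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                              # restore original weights
    opt.step(); opt.zero_grad()
    return float(loss)

model = nn.Linear(10, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
print(sam_step(model, nn.functional.cross_entropy, x, y, opt))
```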

Authors:June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee
Title: Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance
Abstract:
Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos, while preserving image fidelity and text alignment. Especially, under VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
中文: 近期图像转视频模型常因过早接触高频细节而产生动态不足的视频,而提出的自适应低通引导(ALG)方法在保持图像质量和文本对齐的同时,显著提升了生成视频的动态效果。
English: Recent image-to-video models often produce overly static videos due to premature exposure to high-frequency details, but the proposed adaptive low-pass guidance (ALG) effectively enhances motion dynamics while maintaining image quality and text alignment.
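A minimal sketch of the low-pass idea is given below: filter the conditioning image in the frequency domain and relax the cutoff over the denoising schedule. The FFT filter and the linear schedule are assumptions for illustration, not the paper's exact guidance rule.

```python
import torch

def low_pass(img, keep_frac):
    """Keep only the lowest `keep_frac` of spatial frequencies of `img` (B, C, H, W)."""
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    H, W = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() <= keep_frac).to(img.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real

# Hypothetical schedule: blur the conditioning image heavily at early denoising
# steps and progressively restore high frequencies.
cond_img = torch.rand(1, 3, 64, 64)
num_steps = 10
for step in range(num_steps):
    keep = min(1.0, 0.2 + 0.8 * step / (num_steps - 1))   # cutoff grows from 0.2 to 1.0
    cond_t = low_pass(cond_img, keep)
    # cond_t would be fed as the image condition to the I2V model at this step.
print(cond_t.shape)
```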

Authors:Yanting Mei, Zhilu Zhang, Xiaohe Wu, Wangmeng Zuo
Title: Image Demoiréing Using Dual Camera Fusion on Mobile Phones
Abstract:
When shooting electronic screens, moiré patterns usually appear in captured images, which seriously affects the image quality. Existing image demoiréing methods face great challenges in removing large and heavy moiré. To address the issue, we propose to utilize Dual Camera fusion for Image Demoiréing (DCID), i.e., using the ultra-wide-angle (UW) image to assist the moiré removal of the wide-angle (W) image. This is inspired by two motivations: (1) the two lenses are commonly equipped on modern smartphones, and (2) the UW image can generally provide normal colors and textures when moiré exists in the W image, mainly due to their different focal lengths. In particular, we propose an efficient DCID method, where a lightweight UW image encoder is integrated into an existing demoiréing network and a fast two-stage image alignment scheme is presented. Moreover, we construct a large-scale real-world dataset with diverse mobile phones and monitors, containing about 9,000 samples. Experiments on the dataset show our method performs better than state-of-the-art methods. Code and dataset are available at https://github.com/Mrduckk/DCID.
中文摘要:本研究提出的DCID方法通过双摄像头融合技术,利用超广角图像辅助去除广角图像中的摩尔纹,采用轻量级编码器和快速两阶段对齐方式,在包含9000个样本的新数据集上实现了优于现有方法的性能。
English Summary: The proposed DCID method uses ultra-wide-angle images to assist in removing moiré patterns from wide-angle images through dual camera fusion, achieving superior performance over existing methods with a lightweight encoder and two-stage alignment on a new 9,000-sample dataset.

Authors:Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Title: An Adaptive Method Stabilizing Activations for Enhanced Generalization
Abstract:
We introduce AdaAct, a novel optimization algorithm that adjusts learning rates according to activation variance. Our method enhances the stability of neuron outputs by incorporating neuron-wise adaptivity during the training process, which subsequently leads to better generalization -- a complementary approach to conventional activation regularization methods. Experimental results demonstrate AdaAct's competitive performance across standard image classification benchmarks. We evaluate AdaAct on CIFAR and ImageNet, comparing it with other state-of-the-art methods. Importantly, AdaAct effectively bridges the gap between the convergence speed of Adam and the strong generalization capabilities of SGD, all while maintaining competitive execution times. Code is available at https://github.com/hseung88/adaact.
中文: AdaAct是一种新颖的优化算法,通过根据激活方差调整学习率来提高训练稳定性和泛化能力,有效弥合了Adam与SGD之间的性能差距,同时保持高效运行。
English: AdaAct is a novel optimization algorithm that enhances training stability and generalization by adapting learning rates based on activation variance, effectively bridging the performance gap between Adam and SGD while maintaining competitive efficiency.
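As a rough, hedged sketch of neuron-wise adaptivity, the snippet below tracks each Linear layer's output variance with a forward hook and damps the updates of high-variance neurons. This is a simplification to convey the idea, not the AdaAct algorithm itself.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
act_var = {}                                         # running per-neuron activation variance

def make_hook(name):
    def hook(module, inp, out):
        v = out.detach().var(dim=0)                  # per-neuron variance over the batch
        act_var[name] = 0.9 * act_var.get(name, v) + 0.1 * v
    return hook

linears = {n: m for n, m in model.named_modules() if isinstance(m, nn.Linear)}
for name, m in linears.items():
    m.register_forward_hook(make_hook(name))

opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    for name, m in linears.items():                  # neuron-wise step-size scaling
        scale = 1.0 / (act_var[name].sqrt() + 1.0)   # shape: (out_features,)
        m.weight.grad *= scale[:, None]
        if m.bias is not None:
            m.bias.grad *= scale
    opt.step()
print(float(loss))
```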

Authors:Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva
Title: Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
Abstract:
Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20$\times$ larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.

Authors:Kangning Yang, Ling Ouyang, Huiming Sun, Jie Cai, Lan Fu, Jiaming Ding, Chiu Man Ho, Zibo Meng
Title: OpenRR-1k: A Scalable Dataset for Real-World Reflection Removal
Abstract:
Reflection removal technology plays a crucial role in photography and computer vision applications. However, existing techniques are hindered by the lack of high-quality in-the-wild datasets. In this paper, we propose a novel paradigm for collecting reflection datasets from a fresh perspective. Our approach is convenient, cost-effective, and scalable, while ensuring that the collected data pairs are of high quality, perfectly aligned, and represent natural and diverse scenarios. Following this paradigm, we collect a Real-world, Diverse, and Pixel-aligned dataset (named OpenRR-1k dataset), which contains 1,000 high-quality transmission-reflection image pairs collected in the wild. Through the analysis of several reflection removal methods and benchmark evaluation experiments on our dataset, we demonstrate its effectiveness in improving robustness in challenging real-world environments. Our dataset is available at https://github.com/caijie0620/OpenRR-1k.
中文: 本文提出了一种新颖、经济高效的高质量反射数据集采集方法,并发布了OpenRR-1k数据集,有效提升了反射消除算法在真实场景中的鲁棒性。
English: This paper introduces a novel, cost-effective method for creating high-quality reflection datasets and presents the OpenRR-1k dataset, which enhances reflection removal algorithms' robustness in real-world settings.

Authors:Daniel H. Pak, Shubh Thaker, Kyle Baylous, Xiaoran Zhang, Danny Bluestein, James S. Duncan
Title: Snap-and-tune: combining deep learning and test-time optimization for high-fidelity cardiovascular volumetric meshing
Abstract:
High-quality volumetric meshing from medical images is a key bottleneck for physics-based simulations in personalized medicine. For volumetric meshing of complex medical structures, recent studies have often utilized deep learning (DL)-based template deformation approaches to enable fast test-time generation with high spatial accuracy. However, these approaches still exhibit limitations, such as limited flexibility at high-curvature areas and unrealistic inter-part distances. In this study, we introduce a simple yet effective snap-and-tune strategy that sequentially applies DL and test-time optimization, which combines fast initial shape fitting with more detailed sample-specific mesh corrections. Our method provides significant improvements in both spatial accuracy and mesh quality, while being fully automated and requiring no additional training labels. Finally, we demonstrate the versatility and usefulness of our newly generated meshes via solid mechanics simulations in two different software platforms. Our code is available at https://github.com/danpak94/Deep-Cardiac-Volumetric-Mesh.
Chinese: 本研究提出一种结合深度学习与测试时优化的快速调整策略,通过自动化体网格生成显著提升了医学模拟中的空间精度和网格质量,并在固体力学应用中验证了其有效性。
English: This study introduces a snap-and-tune strategy combining deep learning with test-time optimization to enhance spatial accuracy and mesh quality in automated volumetric meshing for medical simulations, demonstrating improved performance in solid mechanics applications.

Authors:Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta
Title: Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain
Abstract:
Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, the degree to which representations derived from MLLMs predict neural activity recorded while participants watched naturalistic movies (video along with audio). We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].
中文: 指令调优的多模态大语言模型在自然观影场景中显著优于非指令调优模型,其任务特定表征能精确对应大脑层级处理机制,为大脑与计算系统的映射研究开辟了新途径。
English: Instruction-tuned multimodal large language models significantly outperform non-instruction-tuned models in aligning with brain activity during naturalistic movie viewing, showing hierarchical layer correspondence and task-specific representation disentanglement that advances brain-computation mapping.
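A common way to measure "brain alignment" in such studies is voxel-wise ridge regression from stimulus embeddings to recorded responses; the sketch below uses synthetic data and hypothetical shapes purely to show the shape of that analysis, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Voxel-wise encoding sketch: predict each voxel's response from (hypothetical)
# instruction-conditioned embeddings of movie segments, then report the mean
# Pearson r over voxels as the alignment score.
rng = np.random.default_rng(0)
X = rng.standard_normal((600, 256))           # 600 movie segments x 256-d embeddings
W = rng.standard_normal((256, 1000)) * 0.1
Y = X @ W + rng.standard_normal((600, 1000))  # 1000 synthetic "voxels"

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
enc = Ridge(alpha=10.0).fit(X_tr, Y_tr)
pred = enc.predict(X_te)

r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y.shape[1])]
print(f"mean voxel-wise correlation: {np.mean(r):.3f}")
```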

Authors:Huixin Zhan, Jason H. Moore
Title: Agentic Surgical AI: Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion in a Vision-Language-Action Framework
Abstract:
Surgeons exhibit distinct operating styles shaped by training, experience, and motor behavior, yet most surgical AI systems overlook this personalization signal. We propose a novel agentic modeling approach for surgeon-specific behavior prediction in robotic surgery, combining a discrete diffusion framework with a vision-language-action (VLA) pipeline. Gesture prediction is framed as a structured sequence denoising task, conditioned on multimodal inputs including surgical video, intent language, and personalized embeddings of surgeon identity and skill. These embeddings are encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings demonstrate that while personalized embeddings improve performance, they also increase vulnerability to identity leakage, revealing the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.
Chinese: 本研究提出了一种用于机器人手术的个性化AI模型,通过多模态方法预测外科医生的特定手势,在实现高精度的同时揭示了性能提升与身份泄露隐私风险之间的权衡关系。
English: This study introduces a personalized AI model for robotic surgery that predicts surgeon-specific gestures using a multimodal approach, achieving high accuracy while revealing a trade-off between enhanced performance and increased privacy risks from identity leakage.

Authors:Sunny Gupta, Nikita Jangid, Amit Sethi
Title: UniVarFL: Uniformity and Variance Regularized Federated Learning for Heterogeneous Data
Abstract:
Federated Learning (FL) often suffers from severe performance degradation when faced with non-IID data, largely due to local classifier bias. Traditional remedies such as global model regularization or layer freezing either incur high computational costs or struggle to adapt to feature shifts. In this work, we propose UniVarFL, a novel FL framework that emulates IID-like training dynamics directly at the client level, eliminating the need for global model dependency. UniVarFL leverages two complementary regularization strategies during local training: Classifier Variance Regularization, which aligns class-wise probability distributions with those expected under IID conditions, effectively mitigating local classifier bias; and Hyperspherical Uniformity Regularization, which encourages a uniform distribution of feature representations across the hypersphere, thereby enhancing the model's ability to generalize under diverse data distributions. Extensive experiments on multiple benchmark datasets demonstrate that UniVarFL outperforms existing methods in accuracy, highlighting its potential as a highly scalable and efficient solution for real-world FL deployments, especially in resource-constrained settings. Code: https://github.com/sunnyinAI/UniVarFL
Chinese: UniVarFL提出了一种新颖的联邦学习框架,通过双重正则化策略有效缓解本地分类器偏差并提升特征泛化能力,在非独立同分布数据场景下实现了更优的准确性与可扩展性。
English: UniVarFL introduces a novel federated learning framework that employs dual regularization strategies to mitigate local classifier bias and enhance feature generalization, achieving superior accuracy and scalability in non-IID settings.
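For intuition about the hyperspherical uniformity term, here is a generic Wang-and-Isola-style uniformity regularizer on L2-normalized features that a client could add to its local loss; it is a sketch of the kind of term described, not UniVarFL's exact formulation.

```python
import torch
import torch.nn.functional as F

def uniformity_loss(z, t=2.0):
    """Encourage L2-normalized features to spread evenly over the unit sphere:
    log E[exp(-t * ||z_i - z_j||^2)] over all pairs in the batch."""
    z = F.normalize(z, dim=1)
    sq_dists = torch.pdist(z, p=2).pow(2)          # pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()

feats = torch.randn(64, 128, requires_grad=True)   # stand-in for a client's batch features
loss = uniformity_loss(feats)
loss.backward()                                    # would be added to the local training loss
print(float(loss))
```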

Authors:Aniket Rege, Zinnia Nie, Mahesh Ramesh, Unmesh Raskar, Zhuoran Yu, Aditya Kusupati, Yong Jae Lee, Ramya Korlakai Vinayak
Title: CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems
Abstract:
Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero- and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy built from the crowdsourced Wikimedia knowledge graph, with 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes (food, art, fashion, architecture, celebrations, and people). Our dataset's categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing their response to increasing the informativeness of text conditioning, enabling fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0, and DALL-E 3. The code and dataset are open-sourced and available at https://aniketrege.github.io/cure/.

Authors:Livio Tenze, Enrique Canessa
Title: A Real-time 3D Desktop Display
Abstract:
A new extended version of the altiro3D C++ Library -- initially developed to get glass-free holographic displays starting from 2D images -- is introduced here, aiming to handle 3D video streams from either 2D webcam images or flat video files. These streams are processed in real-time to synthesize light-fields (in Native format) and feed realistic 3D experiences. The core function needed to recreate multiviews relies on the MiDaS Convolutional Neural Network (CNN), which extracts a depth map from a single 2D image. Artificial Intelligence (AI) computing techniques are applied to improve the overall performance of the extended altiro3D Library. Thus, altiro3D can now treat standard images, video streams or screen portions of a Desktop where other apps may also be running (like web browsers, video chats, etc) and render them into 3D. To achieve the latter, a screen region needs to be selected in order to feed the output directly into a light-field 3D device such as Looking Glass (LG) Portrait. In order to simplify the acquisition of a Desktop screen area by the user, a multi-platform Graphical User Interface has also been implemented. Sources available at: https://github.com/canessae/altiro3D/releases/tag/2.0.0
中文:扩展版 altiro3D C++ 库现可利用人工智能深度提取技术,实时处理二维图像、视频流或桌面屏幕区域,生成光场以驱动如 Looking Glass Portrait 等 3D 显示设备,并配备了跨平台图形界面。
English: The extended altiro3D C++ Library now processes 2D images, video streams, or desktop screen areas in real-time using AI-driven depth extraction to generate light-fields for 3D displays like Looking Glass Portrait, supported by a multi-platform GUI.

Authors:Anh-Quan Cao, Ivan Lopes, Raoul de Charette
Title: StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
Abstract:
Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning, and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.
中文摘要:StableMTL提出了一种基于扩散模型的零样本多任务学习方法,利用部分标注的合成数据集进行训练,通过统一潜在损失和任务注意力机制促进任务间协同,在多个基准测试中超越基线模型。
English Summary: StableMTL introduces a zero-shot multi-task learning method using diffusion models that trains on synthetic datasets with partial labels and employs a unified latent loss with task-attention for cross-task synergy, outperforming baselines across multiple benchmarks.

Authors:Penghao Wu, Shengnan Ma, Bo Wang, Jiaheng Yu, Lewei Lu, Ziwei Liu
Title: GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
Abstract:
Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models throughout dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables the emergence of self-reflection behavior through fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding ability, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm leveraging the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation, with all data, models, environments, and tools to be released publicly.

Authors:Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei
Title: Play to Generalize: Learning to Reason Through Game Play
Abstract:
Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g., Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that unlock generalizable multimodal reasoning abilities in MLLMs.

Authors:Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
Title: Vision Transformers Don't Need Trained Registers
Abstract:
We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers -- the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.
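Conceptually, the test-time fix amounts to appending one extra token and routing the outlier channels into it. The sketch below is a loose, tensor-level illustration of that idea with made-up channel indices; the actual method operates on specific register neurons identified inside the model, typically via forward hooks.

```python
import torch

def apply_test_time_register(tokens, register_neurons):
    """Move the activations of known high-norm ('register') channels out of the
    patch tokens and into one extra, untrained register token appended to the
    sequence. A conceptual sketch only.
    tokens: (B, N, D); register_neurons: 1-D LongTensor of channel indices."""
    B, N, D = tokens.shape
    register = torch.zeros(B, 1, D, dtype=tokens.dtype)
    # the register token absorbs the outlier channels (max over patches as a stand-in)
    register[:, 0, register_neurons] = tokens[:, :, register_neurons].amax(dim=1)
    cleaned = tokens.clone()
    cleaned[:, :, register_neurons] = 0.0            # patch tokens keep only "clean" channels
    return torch.cat([cleaned, register], dim=1)     # (B, N + 1, D)

tokens = torch.randn(2, 196, 768)                    # 14x14 patch tokens from a ViT layer
out = apply_test_time_register(tokens, torch.tensor([12, 405, 701]))
print(out.shape)                                     # torch.Size([2, 197, 768])
```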

Authors:Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, Bolei Zhou
Title: Dreamland: Controllable World Creation with Simulator and Generative Models
Abstract:
Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, hindering their use in editing scenes and training embodied AI agents. We propose Dreamland, a hybrid world generation framework combining the granular control of a physics-based simulator and the photorealistic content output of large-scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation to bridge the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models. We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality, 17.9% stronger controllability, and has great potential to enhance embodied agent training. Code and data will be made available.

Authors:Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari
Title: Aligning Text, Images, and 3D Structure Token-by-Token
Abstract:
Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/

Authors:Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Title: Generative Modeling of Weights: Generalization or Memorization?
Abstract:
Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Surprisingly, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. We further show that this memorization cannot be effectively mitigated by modifying modeling factors commonly associated with memorization in image diffusion models, or applying data augmentations. Our findings provide a realistic assessment of what types of data current generative models can model, and highlight the need for more careful evaluation of generative models in new domains. Our code is available at https://github.com/boyazeng/weight_memorization.
中文: 本研究发现,现有生成模型在合成神经网络权重时主要依赖记忆或简单插值训练检查点,未能超越基础方法,这源于数据限制及结构先验利用不足。
English: This study finds that current generative models for synthesizing neural network weights primarily produce memorized or interpolated versions of training checkpoints, failing to outperform simple baselines due to data limitations and inadequate use of structural priors.
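The memorization probe described above can be approximated by a nearest-checkpoint distance test; the sketch below compares generated weights against training checkpoints and a simple noise-added baseline on synthetic data. It is illustrative only and does not reproduce the paper's exact protocol.

# Hedged sketch of a memorization probe: nearest-training-checkpoint distance
# for "generated" weights versus a checkpoint-plus-noise baseline.
import numpy as np

rng = np.random.default_rng(0)
train_ckpts = rng.normal(size=(50, 1000))                             # flattened training checkpoints
generated   = train_ckpts[:10] + 1e-3 * rng.normal(size=(10, 1000))   # near-replicas (memorized)
noisy_base  = train_ckpts[:10] + 1e-1 * rng.normal(size=(10, 1000))   # simple noise baseline

def min_dist_to_train(samples, train):
    d = np.linalg.norm(samples[:, None, :] - train[None, :, :], axis=-1)
    return d.min(axis=1)

print("generated :", min_dist_to_train(generated, train_ckpts).mean())
print("noise base:", min_dist_to_train(noisy_base, train_ckpts).mean())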

Authors:Ming-Feng Li, Xin Yang, Fu-En Wang, Hritam Basak, Yuyin Sun, Shreekant Gayaka, Min Sun, Cheng-Hao Kuo
Title: UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References
Abstract:
6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object's appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references. We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image. For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model. Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions. This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing robustness in pose estimation accuracy and improving object completeness. We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. Experimental results demonstrate significant performance improvements over existing methods, particularly when object observations are incomplete or partially captured. Project page: https://minfenli.github.io/UA-Pose/
中文: UA-Pose提出了一种基于不确定性感知的6D物体姿态估计和在线补全方法,专门针对部分参考数据设计,在物体观测不完整时显著提升了性能表现。
English: UA-Pose introduces an uncertainty-aware method for 6D object pose estimation and online completion using partial references, significantly enhancing accuracy and robustness even with incomplete object data.

Authors:Haoguang Lu, Jiacheng Chen, Zhenguo Yang, Aurele Tohokantche Gnanha, Fu Lee Wang, Li Qing, Xudong Mao
Title: PairEdit: Learning Semantic Variations for Exemplar-based Image Editing
Abstract:
Recent advancements in text-guided image editing have achieved notable success by leveraging natural language prompts for fine-grained semantic control. However, certain editing semantics are challenging to specify precisely using textual descriptions alone. A practical alternative involves learning editing semantics from paired source-target examples. Existing exemplar-based editing methods still rely on text prompts describing the change within paired examples or learning implicit text-based editing instructions. In this paper, we introduce PairEdit, a novel visual editing method designed to effectively learn complex editing semantics from a limited number of image pairs or even a single image pair, without using any textual guidance. We propose a target noise prediction that explicitly models semantic variations within paired images through a guidance direction term. Moreover, we introduce a content-preserving noise schedule to facilitate more effective semantic learning. We also propose optimizing distinct LoRAs to disentangle the learning of semantic variations from content. Extensive qualitative and quantitative evaluations demonstrate that PairEdit successfully learns intricate semantics while significantly improving content consistency compared to baseline methods. Code will be available at https://github.com/xudonmao/PairEdit.
Chinese: PairEdit是一种新颖的视觉编辑方法,无需文本指导即可从少量图像对中学习复杂的编辑语义,通过目标噪声预测和内容保留技术有效提升语义学习能力和内容一致性。
English: PairEdit is a novel visual editing method that learns complex editing semantics from a limited number of image pairs without textual guidance, utilizing target noise prediction and content-preserving techniques to enhance semantic learning and content consistency.

Authors:Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
Title: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Abstract:
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose \textbf{Temperature-Adjusted Cross-modal Attention (TACA)}, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
中文摘要:本文提出的温度调节跨模态注意力方法(TACA)有效解决了多模态扩散变换器中的注意力失衡问题,以微小计算成本显著提升了FLUX和SD3.5等模型的图文对齐效果。
English Summary: The proposed Temperature-Adjusted Cross-modal Attention (TACA) method effectively addresses attention imbalance issues in multimodal diffusion transformers, significantly improving text-image alignment in models like FLUX and SD3.5 with minimal computational cost.
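A minimal sketch of what temperature-adjusted cross-modal attention could look like: text-token attention logits are rescaled by a temperature that depends on the diffusion timestep. The specific schedule (gamma and the linear decay in t) is an assumption for illustration rather than the paper's formula.

# Hedged sketch: upweight text-token logits with a timestep-dependent temperature.
import torch
import torch.nn.functional as F

def taca_attention(q, k, v, n_text, t, T=1000, gamma=1.5):
    # q, k, v: (B, heads, N, d); the last n_text keys/values are text tokens
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    temp = 1.0 + (gamma - 1.0) * (1.0 - t / T)        # assumed schedule: stronger boost at low-noise steps
    logits[..., -n_text:] = logits[..., -n_text:] * temp
    return F.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(1, 8, 1024 + 77, 64)
out = taca_attention(q, k, v, n_text=77, t=200)
print(out.shape)  # torch.Size([1, 8, 1101, 64])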

Authors:Jiahao Meng, Shuyang Sun, Yue Tan, Lu Qi, Yunhai Tong, Xiangtai Li, Longyin Wen
Title: CyberV: Cybernetics for Test-time Scaling in Video Understanding
Abstract:
Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance even comparable to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capabilities in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
Chinese: 针对当前多模态大语言模型在处理长视频或复杂视频时存在的不足,我们提出了CyberV框架,通过引入控制论循环实现推理过程中的自适应监控、自我纠正和动态资源分配,无需重新训练即可显著提升模型性能。
English: To overcome the limitations of current Multimodal Large Language Models (MLLMs) in processing long or complex videos, we propose CyberV, a cybernetic framework that enables adaptive self-monitoring, self-correction, and dynamic resource allocation during inference, significantly improving performance without retraining.
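The cybernetic loop described above can be pictured as a simple sensor-controller wrapper around a frozen MLLM; the skeleton below is a hedged sketch in which run_mllm and attention_drift are hypothetical placeholders rather than the released API.

# Hedged skeleton of a sensor/controller loop around a frozen video MLLM.
def cybernetic_loop(video, question, run_mllm, attention_drift, max_rounds=3, drift_thresh=0.5):
    feedback = None
    for round_idx in range(max_rounds):
        answer, internals = run_mllm(video, question, feedback)   # frozen MLLM inference
        drift = attention_drift(internals)                        # sensor: monitor the forward pass
        if drift < drift_thresh:                                  # controller: accept or trigger correction
            return answer
        feedback = f"Re-examine the frames you skimmed (drift={drift:.2f}) and answer again."
    return answer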

Authors:Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, Rongrong Ji
Title: SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple atomic spatial capabilities to handle complex and dynamic tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs. The evaluation code and benchmark datasets are available at https://github.com/Cuzyoung/SpaCE-10.
中文: 针对现有基准难以全面评估多模态大模型空间智能的问题,SpaCE-10基准通过定义10项原子空间能力和8项组合能力,构建了包含5,000余对问答数据的评估体系,发现现有模型在计数能力等关键方面仍远逊于人类表现。
English: To address the limitations of existing benchmarks in evaluating multimodal large language models' spatial intelligence, the SpaCE-10 benchmark introduces 10 atomic and 8 compositional spatial capabilities with over 5,000 QA pairs, revealing that current models significantly trail human performance, particularly in counting skills.

Authors:Wenxin Tang, Jingyu Xiao, Wenxuan Jiang, Xi Xiao, Yuhang Wang, Xuxin Tang, Qing Li, Yuehe Ma, Junliang Liu, Shisong Tang, Michael R. Lyu
Title: SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design
Abstract:
Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We introduce SlideCoder, a layout-aware, retrieval-augmented framework for generating editable slides from reference images. SlideCoder integrates a Color Gradient-based Segmentation algorithm and a Hierarchical Retrieval-Augmented Generation method to decompose complex tasks and enhance code generation. We also release SlideMaster, a 7B open-source model fine-tuned with improved reverse-engineered data. Experiments show that SlideCoder outperforms state-of-the-art baselines by up to 40.5 points, demonstrating strong performance across layout fidelity, execution accuracy, and visual consistency. Our code is available at https://github.com/vinsontang1/SlideCoder.
中文摘要:本文提出SlideCoder框架,通过结合布局感知算法和分层检索方法,实现了从参考图像生成可编辑幻灯片的功能,在多个评估维度上显著优于现有方法。
English Summary: This paper introduces SlideCoder, a novel framework that generates editable slides from reference images by integrating layout-aware algorithms and hierarchical retrieval methods, achieving significant improvements over existing methods.

Authors:Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X. -F. Ye, Molei Tao
Title: Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces
Abstract:
Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. In contrast, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches heavily rely on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process places heavy demands on the accuracy of the encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Chinese: 该框架通过解耦噪声调度,使多模态扩散模型能够原生生成跨模态的耦合数据,无需依赖外部预处理,并在单一模型中同时支持无条件生成和条件生成。
English: The proposed framework enables multimodal diffusion models to generate coupled data natively across different modalities without relying on external preprocessing, using a decoupled noise schedule to support both unconditional and conditioned generation within a single model.
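One way to picture the decoupled noise schedule is that each modality receives its own timestep, so keeping one modality at t=0 yields modality-conditioned generation while noising both yields joint generation. The sketch below illustrates this sampling logic under that assumption.

# Hedged sketch: independent per-modality timesteps for a two-modality model.
import torch

def sample_timesteps(batch, T=1000, mode="joint"):
    t_img = torch.randint(0, T, (batch,))
    t_txt = torch.randint(0, T, (batch,))
    if mode == "text_to_image":                       # text kept clean, image denoised
        t_txt = torch.zeros(batch, dtype=torch.long)
    elif mode == "image_to_text":                     # image kept clean, text denoised
        t_img = torch.zeros(batch, dtype=torch.long)
    return t_img, t_txt

print(sample_timesteps(4, mode="text_to_image"))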

Authors:Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang
Title: EgoM2P: Egocentric Multimodal Multitask Pretraining
Abstract:
Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research. Project page: https://egom2p.github.io/.

Authors:Muhammad Ahmed Humais, Xiaoqian Huang, Hussain Sajwani, Sajid Javed, Yahya Zweiri
Title: Spatio-Temporal State Space Model For Efficient Event-Based Optical Flow
Abstract:
Event cameras unlock new frontiers that were previously unthinkable with standard frame-based cameras. One notable example is low-latency motion estimation (optical flow), which is critical for many real-time applications. In such applications, the computational efficiency of algorithms is paramount. Although recent deep learning paradigms such as CNNs, RNNs, and ViTs have shown remarkable performance, they often lack the desired computational efficiency. Conversely, asynchronous event-based methods including SNNs and GNNs are computationally efficient; however, these approaches fail to capture sufficient spatio-temporal information, a powerful feature required to achieve better performance for optical flow estimation. In this work, we introduce the Spatio-Temporal State Space Model (STSSM) module along with a novel network architecture to develop an extremely efficient solution with competitive performance. Our STSSM module leverages state-space models to effectively capture spatio-temporal correlations in event data, offering higher performance with lower complexity compared to ViT- and CNN-based architectures in similar settings. Our model achieves 4.5x faster inference and 8x lower computations compared to TMA and 2x lower computations compared to EV-FlowNet with competitive performance on the DSEC benchmark. Our code will be available at https://github.com/AhmedHumais/E-STMFlow
中文: 本研究提出的时空状态空间模型(STSSM)能高效捕捉事件数据中的时空关联,在保持光流估计性能竞争力的同时,实现了显著提升的推理速度和大幅降低的计算开销。
English: This study introduces a Spatio-Temporal State Space Model (STSSM) that efficiently captures spatio-temporal correlations in event data, achieving significantly faster inference and lower computational costs while maintaining competitive optical flow performance on benchmarks.
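At its core, an STSSM-style block scans a linear state-space recurrence over the temporal axis of the event representation. The sketch below shows a simplified diagonal SSM scan; the parameterization and channel layout are illustrative assumptions, not the paper's architecture.

# Hedged sketch: a diagonal linear state-space scan over temporal event features.
import torch

def diagonal_ssm_scan(x, a, b, c):
    """x: (B, T, D) sequence; a, b, c: (D,) diagonal SSM parameters."""
    B, T, D = x.shape
    h = torch.zeros(B, D)
    ys = []
    for t in range(T):
        h = a * h + b * x[:, t]      # state update (linear recurrence)
        ys.append(c * h)             # readout
    return torch.stack(ys, dim=1)

x = torch.randn(2, 16, 32)           # e.g. 16 temporal bins of event features
a = torch.sigmoid(torch.randn(32))   # keep |a| < 1 for stability
y = diagonal_ssm_scan(x, a, torch.randn(32), torch.randn(32))
print(y.shape)  # torch.Size([2, 16, 32])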

Authors:Jinxi Li, Ziyang Song, Siyuan Zhou, Bo Yang
Title: FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity
Abstract:
In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. In this paper, we propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on the inefficient PINN losses. Extensive experiments on three public datasets and a newly collected challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels in training.
中文: 本文提出FreeGave方法,无需任何物体先验知识,仅通过多视角视频学习复杂动态3D场景的物理特性,采用物理编码和无散度模块估计速度场,在帧预测和运动分割任务中展现出优越性能。
English: This paper introduces FreeGave, a method that learns the physics of complex dynamic 3D scenes from multi-view videos without requiring object priors, using a physics code and divergence-free module to estimate velocity fields and demonstrating superior performance in future frame extrapolation and motion segmentation.

Authors:Zihui Zhang, Weisheng Dai, Hongtao Wen, Bo Yang
Title: LogoSP: Local-global Grouping of Superpoints for Unsupervised Semantic Segmentation of 3D Point Clouds
Abstract:
We study the problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem as learning per-point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo-labels for training a segmentation network. Extensive experiments on two indoor datasets and one outdoor dataset show that our LogoSP surpasses all existing unsupervised methods by large margins, achieving state-of-the-art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.
中文: 本文提出LogoSP方法,通过结合局部和全局点特征,在频域中依据全局模式对超点进行分组以生成精确的语义伪标签,在无需人工标注的情况下实现了无监督3D语义分割的最优性能。
English: This paper introduces LogoSP, an unsupervised 3D semantic segmentation method that learns from both local and global point features by grouping superpoints based on frequency-domain patterns to generate accurate pseudo-labels, achieving state-of-the-art performance on multiple datasets without human annotations.

Authors:Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Gui Zou, Weiyao Lin
Title: Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning
Abstract:
Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, $\textbf{I}$mplicit $\textbf{V}$ideo $\textbf{Q}$uestion $\textbf{A}$nswering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by $0.76\%$, $1.37\%$, and $4.87\%$, respectively. Additionally, IRM performs SOTA on similar implicit advertisement understanding and future prediction in traffic-VQA. Datasets and codes are available for double-blind review in anonymous repo: https://github.com/tychen-SJTU/Implicit-VideoQA.
Chinese: 本文提出了隐式视频问答(I-VQA)新任务及数据集,针对无法直接获取显式视觉证据的场景,并设计了结合动作与意图双线索推理的隐式推理模型(IRM),在多项任务中性能超越现有先进模型。
English: The paper introduces a new task and dataset called Implicit Video Question Answering (I-VQA), which addresses questions requiring symbolic or deeper understanding beyond explicit visual evidence, and proposes a dual-stream Implicit Reasoning Model (IRM) that outperforms existing models in performance.

Authors:Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
Title: Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
Abstract:
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs. Our code is available at https://github.com/yannqi/RCTS-RAG.
Chinese: 本研究提出名为RCTS的多模态RAG框架,通过构建富含推理背景的知识库和树搜索重排序方法增强大视觉语言模型,在多项VQA任务中实现了最先进的性能表现。
English: This study introduces RCTS, a multimodal RAG framework that enhances Large Vision Language Models by enriching the knowledge base with reasoning contexts and employing a tree search re-ranking method, achieving state-of-the-art performance on VQA tasks.

Authors:Mohamed Djilani, Nassim Ali Ousalah, Nidhal Eddine Chenni
Title: Trend-Aware Fashion Recommendation with Visual Segmentation and Semantic Similarity
Abstract:
We introduce a trend-aware and visually-grounded fashion recommendation system that integrates deep visual representations, garment-aware segmentation, semantic category similarity and user behavior simulation. Our pipeline extracts focused visual embeddings by masking non-garment regions via semantic segmentation followed by feature extraction using pretrained CNN backbones (ResNet-50, DenseNet-121, VGG16). To simulate realistic shopping behavior, we generate synthetic purchase histories influenced by user-specific trendiness and item popularity. Recommendations are computed using a weighted scoring function that fuses visual similarity, semantic coherence and popularity alignment. Experiments on the DeepFashion dataset demonstrate consistent gender alignment and improved category relevance, with ResNet-50 achieving 64.95% category similarity and lowest popularity MAE. An ablation study confirms the complementary roles of visual and popularity cues. Our method provides a scalable framework for personalized fashion recommendations that balances individual style with emerging trends. Our implementation is available at https://github.com/meddjilani/FashionRecommender
中文: 该系统通过融合服装视觉分析与用户行为模拟,实现了兼顾个人风格与流行趋势的个性化时尚推荐。
English: This system combines visual garment analysis with user behavior simulation to deliver personalized fashion recommendations that balance individual style with current trends.
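The weighted scoring function described above can be sketched as a convex combination of visual similarity, category coherence, and popularity alignment. The weights and the exact popularity term below are illustrative assumptions, not the paper's tuned values.

# Hedged sketch: fuse visual, semantic, and popularity cues into one score.
import numpy as np

def recommend_scores(query_emb, item_embs, same_category, item_pop, user_trendiness,
                     w_vis=0.6, w_sem=0.25, w_pop=0.15):
    vis = item_embs @ query_emb / (
        np.linalg.norm(item_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    sem = same_category.astype(float)                 # 1 if same garment category as the query
    pop = 1.0 - np.abs(item_pop - user_trendiness)    # alignment between item popularity and user trendiness
    return w_vis * vis + w_sem * sem + w_pop * pop

scores = recommend_scores(
    query_emb=np.random.rand(128), item_embs=np.random.rand(5, 128),
    same_category=np.array([1, 0, 1, 0, 1]), item_pop=np.random.rand(5),
    user_trendiness=0.7)
print(np.argsort(-scores))   # items ranked best-first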

Authors:Shadi Hamdan, Chonghao Sima, Zetong Yang, Hongyang Li, Fatma Güney
Title: ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
Abstract:
How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms.
中文: ETA采用异步双系统架构,将密集计算前移至历史时间步,使大模型能及时响应自动驾驶的每一帧,在保持近实时推理速度的同时将性能提升8%。
English: ETA introduces an asynchronous dual-system that shifts intensive computations to previous time steps, enabling large models to respond promptly in self-driving systems while maintaining near-real-time inference speeds and improving performance by 8%.

Authors:Yuchong Long, Wen Sun, Ningxiao Sun, Wenxiao Wang, Chao Li, Shan Yin
Title: HieraEdgeNet: A Multi-Scale Edge-Enhanced Framework for Automated Pollen Recognition
Abstract:
Automated pollen recognition is vital to paleoclimatology, biodiversity monitoring, and public health, yet conventional methods are hampered by inefficiency and subjectivity. Existing deep learning models often struggle to achieve the requisite localization accuracy for microscopic targets like pollen, which are characterized by their minute size, indistinct edges, and complex backgrounds. To overcome this limitation, we introduce HieraEdgeNet, a multi-scale edge-enhancement framework. The framework's core innovation is the introduction of three synergistic modules: the Hierarchical Edge Module (HEM), which explicitly extracts a multi-scale pyramid of edge features that corresponds to the semantic hierarchy at early network stages; the Synergistic Edge Fusion (SEF) module, for deeply fusing these edge priors with semantic information at each respective scale; and the Cross Stage Partial Omni-Kernel Module (CSPOKM), which maximally refines the most detail-rich feature layers using an Omni-Kernel operator - comprising anisotropic large-kernel convolutions and mixed-domain attention - all within a computationally efficient Cross-Stage Partial (CSP) framework. On a large-scale dataset comprising 120 pollen classes, HieraEdgeNet achieves a mean Average Precision (mAP@.5) of 0.9501, significantly outperforming state-of-the-art baseline models such as YOLOv12n and RT-DETR. Furthermore, qualitative analysis confirms that our approach generates feature representations that are more precisely focused on object boundaries. By systematically integrating edge information, HieraEdgeNet provides a robust and powerful solution for high-precision, high-efficiency automated detection of microscopic objects.
中文: HieraEdgeNet提出了一种多尺度边缘增强框架,通过三个协同模块解决深度学习在定位微观花粉时的局限性,在大规模数据集上实现了卓越的准确性和效率。
English: HieraEdgeNet introduces a multi-scale edge-enhancement framework with three synergistic modules to overcome deep learning challenges in localizing microscopic pollen, achieving superior accuracy and efficiency on a large-scale dataset.

Authors:Brian Gordon, Yonatan Bitton, Andreea Marzoca, Yasumasa Onoe, Xiao Wang, Daniel Cohen-Or, Idan Szpektor
Title: Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Abstract:
Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique-driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption

Authors:Hongyu Wang, Chuyan Xiong, Ruiping Wang, Xilin Chen
Title: BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
Abstract:
Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for enhancing the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., {-1, 0, 1}. To further reduce the memory footprint of the vision encoder, we propose the distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights. During this process, a full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA's promise for deployment on memory-constrained edge devices. We release the code and model weights in https://github.com/ustcwhy/BitVLA.
中文:BitVLA推出了首个用于机器人技术的1比特视觉语言动作模型,通过三值参数和蒸馏感知压缩技术实现了顶尖效率,同时在操作任务上保持优异性能。
English: BitVLA introduces the first 1-bit Vision-Language-Action model for robotics, achieving state-of-the-art efficiency with ternary parameters and distillation-aware compression while maintaining competitive performance on manipulation tasks.
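A hedged sketch of ternary weight quantization in the BitNet b1.58 style that 1-bit VLA models build on: weights are scaled by their mean absolute value, rounded, and clipped to {-1, 0, 1}. Whether BitVLA uses exactly this absmean scheme is an assumption.

# Hedged sketch: absmean ternary quantization of a weight matrix.
import torch

def ternary_quantize(w, eps=1e-5):
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)    # ternary values in {-1, 0, 1}
    return w_q, scale                         # dequantize as w_q * scale

w = torch.randn(256, 256)
w_q, s = ternary_quantize(w)
print(w_q.unique(), float(s))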

Authors:Dasol Hong, Wooju Lee, Hyun Myung
Title: CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization
Abstract:
Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptations. The core challenge in prompt tuning is improving specialization for a specific task and generalization for unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weights of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at https://github.com/url-kaist/CoCoA-Mix.
Chinese: 提示调优通过仅优化提示来有效适配视觉语言模型,但面临特征错位和泛化难题;提出的CoCoA-Mix方法结合混淆感知损失和权重,在提升专业化和泛化能力方面均优于现有最优方法。
English: Prompt tuning effectively adapts vision-language models by optimizing only the prompt, but faces challenges with misaligned features and generalization; the proposed CoCoA-Mix method, incorporating confusion-aware loss and weights, enhances both specialization and generalization, outperforming state-of-the-art approaches.

Authors:Wei Yao, Yunlian Sun, Chang Liu, Hongwen Zhang, Jinhui Tang
Title: PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation
Abstract:
Driven by advancements in motion capture and generative artificial intelligence, leveraging large-scale MoCap datasets to train generative models for synthesizing diverse, realistic human motions has become a promising research direction. However, existing motion-capture techniques and generative models often neglect physical constraints, leading to artifacts such as interpenetration, sliding, and floating. These issues are exacerbated in multi-person motion generation, where complex interactions are involved. To address these limitations, we introduce physical mapping, integrated throughout the human interaction generation pipeline. Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space. The resulting motions are adjusted to adhere to real-world physics constraints while retaining their original semantic meaning. This mapping not only improves MoCap data quality but also directly informs post-processing of generated motions. Given the unique interactivity of multi-person scenarios, we propose a tailored motion representation framework. Motion Consistency (MC) and Marker-based Interaction (MI) loss functions are introduced to improve model performance. Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity. Project page http://yw0208.github.io/physiinter

Authors:Changsheng Gao, Wei Zhou, Guosheng Lin, Weisi Lin
Title: Compressed Feature Quality Assessment: Dataset and Baselines
Abstract:
The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process inevitably introduces semantic degradation that is difficult to quantify with traditional metrics. To address this, we formalize the research problem of Compressed Feature Quality Assessment (CFQA), aiming to evaluate the semantic fidelity of compressed features. To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12000 compressed features derived from three vision tasks and four feature codecs. Task-specific performance degradation is provided as true semantic distortion for evaluating CFQA metrics. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. Our findings demonstrate the representativeness of the proposed dataset while underscoring the need for more sophisticated metrics capable of measuring semantic distortion in compressed features. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA. To foster further research, we release the dataset and all associated source code at https://github.com/chansongoal/Compressed-Feature-Quality-Assessment.
中文: 本研究提出了压缩特征质量评估(CFQA)基准,旨在量化压缩特征的语义保真度,通过构建首个包含多任务多编码器的数据集,揭示了现有评估指标的不足,并为该领域研究提供了基础资源。
English: This study introduces the Compressed Feature Quality Assessment (CFQA) benchmark to evaluate the semantic fidelity of compressed features, highlighting the limitations of current metrics and providing a dataset and tools for advancing research in this area.
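The three baseline metrics assessed for CFQA are straightforward to compute between an original feature matrix and its compressed reconstruction; the sketch below shows illustrative implementations of MSE, cosine similarity, and linear CKA (rows are samples).

# Hedged sketch: the three CFQA baseline metrics on a toy feature pair.
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def linear_cka(x, y):
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(x.T @ y, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(num / (den + 1e-12))

orig = np.random.rand(100, 256)
comp = orig + 0.05 * np.random.rand(100, 256)       # stand-in for a decoded (compressed) feature
print(mse(orig, comp), cosine(orig, comp), linear_cka(orig, comp))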

Authors:Philip R. Liu, Sparsh Bansal, Jimmy Dinh, Aditya Pawar, Ramani Satishkumar, Shail Desai, Neeraj Gupta, Xin Wang, Shu Hu
Title: MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models
Abstract:
The integration of deep learning-based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at https://github.com/Purdue-M2/MedChat.
中文总结:MedChat是一个多智能体诊断框架,通过结合专业视觉模型与角色定制的大语言模型,提高了青光眼检测的可靠性,减少幻觉现象,并提供交互式临床报告功能。
English Summary: MedChat is a multi-agent diagnostic framework that integrates specialized vision models with role-specific LLMs to enhance glaucoma detection reliability, reduce hallucinations, and provide interactive clinical reporting.

Authors:Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, Tom Goldstein
Title: ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Abstract:
Video large language models have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely simply on multiple-choice questions. Unfortunately, Video-LLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple-choice verification tasks. To address this weakness, we propose ARGUS, a Video-LLM benchmark that measures freeform video captioning performance. By comparing Video-LLM outputs to human ground truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
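The two caption metrics reduce to simple rates once per-sentence and per-detail judgments are available; the sketch below assumes those judgments (from humans or an entailment model) are given and only shows the bookkeeping.

# Hedged sketch: hallucination and omission rates from boolean judgments.
def hallucination_rate(sentence_supported):
    """Fraction of generated caption sentences that are NOT supported by the video."""
    return 1.0 - sum(sentence_supported) / len(sentence_supported)

def omission_rate(gt_detail_covered):
    """Fraction of ground-truth details the caption never mentions."""
    return 1.0 - sum(gt_detail_covered) / len(gt_detail_covered)

print(hallucination_rate([True, True, False, True]))     # 0.25 of sentences hallucinated
print(omission_rate([True, False, False, True, True]))   # 0.4 of details omitted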

Authors:Jiaying He, Yitong Lin, Jiahe Chen, Honghui Xu, Jianwei Zheng
Title: C3S3: Complementary Competition and Contrastive Selection for Semi-Supervised Medical Image Segmentation
Abstract:
To address the inherent challenge of insufficiently annotated samples in the medical field, semi-supervised medical image segmentation (SSMIS) offers a promising solution. Despite achieving impressive results in delineating primary target areas, most current methodologies struggle to precisely capture the subtle details of boundaries. This deficiency often leads to significant diagnostic inaccuracies. To tackle this issue, we introduce C3S3, a novel semi-supervised segmentation model that synergistically integrates complementary competition and contrastive selection. This design significantly sharpens boundary delineation and enhances overall precision. Specifically, we develop an Outcome-Driven Contrastive Learning module dedicated to refining boundary localization. Additionally, we incorporate a Dynamic Complementary Competition module that leverages two high-performing sub-networks to generate pseudo-labels, thereby further improving segmentation quality. The proposed C3S3 undergoes rigorous validation on two publicly accessible datasets, covering both MRI and CT scans. The results demonstrate that our method achieves superior performance compared to previous cutting-edge competitors. In particular, on the 95HD and ASD metrics, our approach achieves a notable improvement of at least 6%, highlighting the significant advancements. The code is available at https://github.com/Y-TARL/C3S3.
中文: C3S3模型通过融合结果驱动的对比学习和动态互补竞争机制,有效解决了半监督医学图像分割中的边界模糊问题,在MRI和CT数据集的关键指标上实现了超过6%的性能提升。
English: The C3S3 model addresses boundary imprecision in semi-supervised medical image segmentation by integrating outcome-driven contrastive learning and dynamic complementary competition, achieving over 6% improvement on key metrics across MRI and CT datasets.

Authors:Chengchao Shen, Dawei Liu, Jianxin Wang
Title: Multiple Object Stitching for Unsupervised Representation Learning
Abstract:
Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves leading unsupervised representation performance on both single object centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.
中文: 本文提出的多对象拼接(MOS)方法通过将单对象图像合成为多对象图像,无需人工标注即可提供额外对象对应关系,从而在多对象图像的无监督表征学习上取得领先性能。
English: The proposed Multiple Object Stitching (MOS) method enhances unsupervised representation learning for multi-object images by synthesizing them from single-object images, providing additional object correspondences without human annotation and achieving superior performance on various datasets.
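A minimal sketch of the stitching idea: tile four single-object crops into one 2x2 multi-object image, so the object layout, and hence the object correspondence across stitched views, is known without annotation. The grid size and resizing choices are illustrative assumptions.

# Hedged sketch: tile four single-object images into a 2x2 multi-object image.
import torch
import torch.nn.functional as F

def stitch_2x2(imgs):
    """imgs: (4, C, H, W) single-object images -> (C, H, W) stitched multi-object image."""
    C, H, W = imgs.shape[1:]
    small = F.interpolate(imgs, size=(H // 2, W // 2), mode="bilinear", align_corners=False)
    top = torch.cat([small[0], small[1]], dim=2)      # concatenate along width
    bot = torch.cat([small[2], small[3]], dim=2)
    return torch.cat([top, bot], dim=1)               # concatenate along height

batch = torch.rand(4, 3, 224, 224)
print(stitch_2x2(batch).shape)   # torch.Size([3, 224, 224])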

Authors:Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
Title: From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models
Abstract:
Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally pushes them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.

Authors:Nada Aboudeshish, Dmitry Ignatov, Radu Timofte
Title: AugmentGest: Can Random Data Cropping Augmentation Boost Gesture Recognition Performance?
Abstract:
Data augmentation is a crucial technique in deep learning, particularly for tasks with limited dataset diversity, such as skeleton-based datasets. This paper proposes a comprehensive data augmentation framework that integrates geometric transformations (random cropping, rotation, and zooming) and intensity-based transformations (brightness and contrast adjustments) to simulate real-world variations. Random cropping ensures the preservation of spatio-temporal integrity while addressing challenges such as viewpoint bias and occlusions. The augmentation pipeline generates three augmented versions for each sample in addition to the original dataset sample, thus quadrupling the dataset size and enriching the diversity of gesture representations. The proposed augmentation strategy is evaluated on three models: multi-stream e2eET, FPPR point cloud-based hand gesture recognition (HGR), and DD-Network. Experiments are conducted on benchmark datasets including DHG14/28, SHREC'17, and JHMDB. The e2eET model is recognized as the state-of-the-art for hand gesture recognition on DHG14/28 and SHREC'17. The FPPR-PCD model, the second-best performing model on SHREC'17, excels in point cloud-based gesture recognition. DD-Net, a lightweight and efficient architecture for skeleton-based action recognition, is evaluated on SHREC'17 and the Human Motion Data Base (JHMDB). The results underline the effectiveness and versatility of the proposed augmentation strategy, significantly improving model generalization and robustness across diverse datasets and architectures. This framework not only establishes state-of-the-art results on all three evaluated models but also offers a scalable solution to advance HGR and action recognition applications in real-world scenarios. The framework is available at https://github.com/NadaAbodeshish/Random-Cropping-augmentation-HGR
中文: 本文提出了一种全面的数据增强框架,通过几何和强度变换使骨架手势识别数据集多样性翻两番,在多个模型和基准数据集上均取得了最优性能。
English: This paper introduces a comprehensive data augmentation framework that enhances skeleton-based gesture recognition by quadrupling dataset diversity through geometric and intensity transformations, achieving state-of-the-art results across multiple models and datasets.
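A hedged sketch of one geometric augmentation pass on a skeleton sequence (frames x joints x 3): random rotation, random zoom, and a random temporal crop, producing three extra views per sample as the pipeline above does. The parameter ranges are illustrative assumptions, not the framework's settings.

# Hedged sketch: geometric augmentation of a skeleton sequence.
import numpy as np

def augment_skeleton(seq, rng):
    theta = rng.uniform(-np.pi / 12, np.pi / 12)               # rotate about the vertical axis
    rot = np.array([[np.cos(theta), 0, np.sin(theta)],
                    [0, 1, 0],
                    [-np.sin(theta), 0, np.cos(theta)]])
    scale = rng.uniform(0.9, 1.1)                              # zoom
    start = rng.integers(0, max(1, seq.shape[0] // 8))         # random temporal crop start
    cropped = seq[start:start + int(seq.shape[0] * 0.9)]
    return (cropped @ rot.T) * scale

rng = np.random.default_rng(0)
sample = np.random.rand(64, 22, 3)                             # 64 frames, 22 joints
views = [augment_skeleton(sample, rng) for _ in range(3)]      # 3 extra views -> 4x data
print([v.shape for v in views])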

Authors:Min-Jung Kim, Dongjin Kim, Seokju Yun, Jaegul Choo
Title: TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation
Abstract:
Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and nonrigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model, guided by the layer vitality. For object addition, we further identify prominent layers to extract the mask regions corresponding to the newly added target prompt. We found that the extracted masks from the prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: https://emjay73.github.io/TV_LiVE/

Authors:Van Nguyen Nguyen, Christian Forster, Sindi Shkodrani, Vincent Lepetit, Bugra Tekin, Cem Keskin, Tomas Hodan
Title: GoTrack: Generic 6DoF Object Pose Refinement and Tracking
Abstract:
We introduce GoTrack, an efficient and accurate CAD-based method for 6DoF object pose refinement and tracking, which can handle diverse objects without any object-specific training. Unlike existing tracking methods that rely solely on an analysis-by-synthesis approach for model-to-frame registration, GoTrack additionally integrates frame-to-frame registration, which saves compute and stabilizes tracking. Both types of registration are realized by optical flow estimation. The model-to-frame registration is noticeably simpler than in existing methods, relying only on standard neural network blocks (a transformer is trained on top of DINOv2) and producing reliable pose confidence scores without a scoring network. For the frame-to-frame registration, which is an easier problem as consecutive video frames are typically nearly identical, we employ a light off-the-shelf optical flow model. We demonstrate that GoTrack can be seamlessly combined with existing coarse pose estimation methods to create a minimal pipeline that reaches state-of-the-art RGB-only results on standard benchmarks for 6DoF object pose estimation and tracking. Our source code and trained models are publicly available at https://github.com/facebookresearch/gotrack
GoTrack是一种高效的基于CAD的六自由度物体姿态优化与跟踪方法,通过光流估计结合模型到帧和帧到帧的配准,无需物体特定训练即可达到最先进的性能。
GoTrack is an efficient CAD-based method for 6DoF object pose refinement and tracking that integrates both model-to-frame and frame-to-frame registration through optical flow estimation, achieving state-of-the-art results without object-specific training.

Authors:Hao Tang, Chengchao Shen
Title: Learning Compact Vision Tokens for Efficient Large Multimodal Models
Abstract:
Large multimodal models (LMMs) face significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten the length of vision token sequences for inference acceleration. Specifically, we propose a Spatial Token Fusion (STF) method to learn compact vision tokens for short vision token sequences, where spatially adjacent tokens are fused into one. Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of extensive downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only 25% of the baseline's vision tokens. The source code and trained weights are available at https://github.com/visresearch/LLaVA-STF.
中文摘要:本文提出空间令牌融合方法,通过压缩视觉令牌并补充多粒度特征来降低大型多模态模型的计算成本,仅用基线25%的令牌即可实现相当甚至更优的性能。
English Summary: The paper introduces a Spatial Token Fusion method to reduce computational costs in large multimodal models by compressing vision tokens and enhancing them with multi-granularity features, achieving comparable performance with only 25% of baseline tokens.
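The exact fusion operator of STF is not spelled out above; as a rough illustration of the idea, spatially adjacent vision tokens can be merged in 2x2 windows with a small learned projection, as sketched below. The grid size, window, and hidden dimension are placeholder assumptions.

```python
import torch
import torch.nn as nn

class SpatialTokenFusion(nn.Module):
    """Fuse each 2x2 block of spatially adjacent vision tokens into one token."""
    def __init__(self, dim, window=2):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim) laid out row-major on an h x w grid
        b, _, d = tokens.shape
        x = tokens.view(b, h, w, d)
        k = self.window
        # group k x k neighborhoods, concatenate their features, project back to dim
        x = x.view(b, h // k, k, w // k, k, d).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // k) * (w // k), k * k * d)
        return self.proj(x)   # (B, h*w/4, dim) for window=2

if __name__ == "__main__":
    fusion = SpatialTokenFusion(dim=1024)
    vis = torch.randn(1, 24 * 24, 1024)        # e.g., 576 ViT patch tokens
    out = fusion(vis, 24, 24)
    print(out.shape)                           # -> torch.Size([1, 144, 1024])
```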

Authors:Changhong Fu, Hua Lin, Haobo Zuo, Liangliang Yao, Liguo Zhang
Title: EdgeSpotter: Multi-Scale Dense Text Spotting for Industrial Panel Monitoring
Abstract:
Text spotting for industrial panels is a key task for intelligent monitoring. However, achieving efficient and accurate text spotting for complex industrial panels remains challenging due to issues such as cross-scale localization and ambiguous boundaries in dense text regions. Moreover, most existing methods primarily focus on representing a single text shape, neglecting a comprehensive exploration of multi-scale feature information across different texts. To address these issues, this work proposes a novel multi-scale dense text spotter for edge AI-based vision systems (EdgeSpotter) to achieve accurate and robust industrial panel monitoring. Specifically, a novel Transformer with an efficient mixer is developed to learn the interdependencies among multi-level features, integrating multi-layer spatial and semantic cues. In addition, a new feature sampling method with Catmull-Rom splines is designed, which explicitly encodes the shape, position, and semantic information of text, thereby alleviating missed detections and reducing recognition errors caused by multi-scale or dense text regions. Furthermore, a new benchmark dataset for industrial panel monitoring (IPM) is constructed. Extensive qualitative and quantitative evaluations on this challenging benchmark dataset validate the superior performance of the proposed method in different challenging panel monitoring tasks. Finally, practical tests based on the self-designed edge AI-based vision system demonstrate the practicality of the method. The code and demo will be available at https://github.com/vision4robotics/EdgeSpotter.
中文: 本文提出EdgeSpotter多尺度密集文本识别器,通过新型Transformer和特征采样方法解决工业仪表复杂文本检测难题,并基于新建数据集和边缘AI系统验证了其优越性能。
English: This paper introduces EdgeSpotter, a multi-scale dense text spotter using a novel Transformer and feature sampling method to overcome challenges in industrial panel text recognition, validated by a new benchmark dataset and edge AI system tests.
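The Catmull-Rom spline underlying the feature sampling has a standard closed form, so a small sketch of sampling points along such a curve is given below; how EdgeSpotter actually places control points and pools features is not reproduced, and the control points here are made up.

```python
import numpy as np

def catmull_rom_segment(p0, p1, p2, p3, n=8):
    """Sample n points on the uniform Catmull-Rom segment between p1 and p2."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return 0.5 * ((2.0 * p1)
                  + (-p0 + p2) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t ** 2
                  + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t ** 3)

def sample_curve(ctrl_pts, pts_per_seg=8):
    """Chain segments over a polyline of control points (duplicating endpoints)."""
    pts = np.vstack([ctrl_pts[:1], ctrl_pts, ctrl_pts[-1:]])
    segs = [catmull_rom_segment(pts[i], pts[i + 1], pts[i + 2], pts[i + 3], pts_per_seg)
            for i in range(len(pts) - 3)]
    return np.vstack(segs)

if __name__ == "__main__":
    # hypothetical control points along a curved text instance (x, y in pixels)
    ctrl = np.array([[10, 40], [60, 25], [120, 30], [180, 55]], dtype=float)
    curve = sample_curve(ctrl)
    print(curve.shape)   # dense points that could be used to index image features
```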

Authors:Yangkai Lin, Jiabao Lei, Kui Jia
Title: SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model
Abstract:
Our project page: https://scutyklin.github.io/SceneLCM/. Automated generation of complex, interactive indoor scenes tailored to user prompts remains a formidable challenge. While existing methods achieve indoor scene synthesis, they struggle with rigid editing constraints, physical incoherence, excessive human effort, single-room limitations, and suboptimal material quality. To address these limitations, we propose SceneLCM, an end-to-end framework that synergizes a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular pipelines: (1) Layout Generation. We employ LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (3D layouts), and an iterative programmatic validation mechanism refines layout parameters through LLM-mediated dialogue loops; (2) Furniture Generation. SceneLCM employs Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss guided by the LCM, to form fast, semantically rich, and high-quality representations. We also offer two theoretical justifications to demonstrate that our CTS loss is equivalent to the consistency loss and that its distillation error is bounded by the truncation error of the Euler solver; (3) Environment Optimization. We use a multiresolution texture field to encode the appearance of the scene and optimize it via the CTS loss. To maintain cross-geometric texture coherence, we introduce a normal-aware cross-attention decoder that predicts RGB by cross-attending to anchor locations in geometrically heterogeneous instances. (4) Physical Editing. SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism. Extensive experiments validate SceneLCM's superiority over state-of-the-art techniques, showing its wide-ranging potential for diverse applications.

Authors:Anatol Garioud, Sébastien Giordano, Nicolas David, Nicolas Gonthier
Title: FLAIR-HUB: Large-scale Multimodal Dataset for Land Cover and Crop Mapping
Abstract:
The growing availability of high-quality Earth Observation (EO) data enables accurate global land cover and crop type monitoring. However, the volume and heterogeneity of these datasets pose major processing and annotation challenges. To address this, the French National Institute of Geographical and Forest Information (IGN) is actively exploring innovative strategies to exploit diverse EO data, which require large annotated datasets. IGN introduces FLAIR-HUB, the largest multi-sensor land cover dataset with very-high-resolution (20 cm) annotations, covering 2528 km2 of France. It combines six aligned modalities: aerial imagery, Sentinel-1/2 time series, SPOT imagery, topographic data, and historical aerial images. Extensive benchmarks evaluate multimodal fusion and deep learning models (CNNs, transformers) for land cover or crop mapping and also explore multi-task learning. Results underscore the complexity of multimodal fusion and fine-grained classification, with best land cover performance (78.2% accuracy, 65.8% mIoU) achieved using nearly all modalities. FLAIR-HUB supports supervised and multimodal pretraining, with data and code available at https://ignf.github.io/FLAIR/flairhub.

Authors:Zheng Wang, Kai Ying, Bin Xu, Chunjiao Wang, Cong Bai
Title: From Swath to Full-Disc: Advancing Precipitation Retrieval with Multimodal Knowledge Expansion
Abstract:
Accurate near-real-time precipitation retrieval has been enhanced by satellite-based technologies. However, infrared-based algorithms have low accuracy due to weak relations with surface precipitation, whereas passive microwave and radar-based methods are more accurate but limited in range. This challenge motivates the Precipitation Retrieval Expansion (PRE) task, which aims to enable accurate, infrared-based full-disc precipitation retrievals beyond the scanning swath. We introduce Multimodal Knowledge Expansion, a two-stage pipeline with the proposed PRE-Net model. In the Swath-Distilling stage, PRE-Net transfers knowledge from a multimodal data integration model to an infrared-based model within the scanning swath via Coordinated Masking and Wavelet Enhancement (CoMWE). In the Full-Disc Adaptation stage, Self-MaskTune refines predictions across the full disc by balancing multimodal and full-disc infrared knowledge. Experiments on the introduced PRE benchmark demonstrate that PRE-Net significantly advanced precipitation retrieval performance, outperforming leading products like PERSIANN-CCS, PDIR, and IMERG. The code will be available at https://github.com/Zjut-MultimediaPlus/PRE-Net.
中文摘要:PRE-Net模型通过多模态知识扩展的两阶段流程,在扫描范围内通过协同掩蔽与小波增强实现知识迁移,并在全圆盘范围内通过自掩蔽调优平衡多模态与红外数据,显著提升了基于红外的全圆盘降水反演精度。
English Summary: The PRE-Net model introduces a two-stage multimodal knowledge expansion approach that significantly improves infrared-based full-disc precipitation retrieval accuracy by transferring knowledge from multimodal data and refining predictions across the entire observation area.

Authors:LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
Title: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...
中文: 多模态大语言模型在通用视觉任务中表现出色,但在医疗应用中因知识覆盖不足和数据差异而受限,为此研发了基于丰富医疗数据训练的专用模型Lingshu,其在多项核心医疗任务中优于现有模型。
English: Multimodal Large Language Models (MLLMs) excel in general visual tasks but underperform in medical applications due to knowledge gaps and data limitations, prompting the development of Lingshu, a specialized model trained on enriched medical data that surpasses existing models in key medical tasks.

Authors:Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin'ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan
Title: A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Abstract:
Large multimodal models (LMMs) have recently gained attention due to their effectiveness to understand and generate descriptions of visual content. Most existing LMMs are in English language. While few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high-and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM along with a large-scale multilingual video training set will help ease future research in developing cultural and linguistic inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
中文:本文提出了ViMUL-Bench多语言视频基准测试,用于评估涵盖14种语言和多元文化类别的大型多模态模型,同时开发的新型多语言视频LMM显著提升了语言包容性。
English: This paper introduces ViMUL-Bench, a multilingual video benchmark evaluating large multimodal models across 14 languages and diverse cultural categories, along with a new multilingual video LMM that improves language inclusivity.

Authors:Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
Title: MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Abstract:
Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding task. Additionally, we propose a model-agnostic, multi-agent framework MAGNET to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics, STEM, which captures alignment errors between a ground truth and a predicted step sequence and MTGS, to facilitate balanced and interpretable evaluation of segment-level grounding performance. Project: https://schowdhury671.github.io/magnet_project/

Authors:Arun Sharma, Mingzhou Yang, Majid Farhadloo, Subhankar Ghosh, Bharat Jayaprakash, Shashi Shekhar
Title: Towards Physics-informed Diffusion for Anomaly Detection in Trajectories
Abstract:
Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectories). The problem is societally important for curbing illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI-generated deep fakes (e.g., additive noise, fake trajectories) and the lack of an adequate amount of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, these approaches do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework results in higher prediction accuracy and lower estimation error rates for anomaly detection and trajectory generation, respectively. Our implementation is available at https://github.com/arunshar/Physics-Informed-Diffusion-Probabilistic-Model.
中文摘要:本研究提出了一种融合运动学约束的物理信息扩散模型,用于检测GPS欺骗等异常轨迹,在海上和城市数据集中实现了更高的检测精度和更低的误差率。
English Summary: This study introduces a physics-informed diffusion model that incorporates kinematic constraints to detect anomalous trajectories, such as GPS spoofing, achieving higher accuracy and lower error rates in maritime and urban datasets.
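The diffusion model is not sketched here; the snippet below only illustrates the kinematic-constraint intuition, flagging trajectories whose implied speed or acceleration exceeds plausible bounds. The thresholds and sampling interval are illustrative assumptions, not values from the paper.

```python
import numpy as np

def kinematic_violation_score(xy, dt=1.0, v_max=15.0, a_max=3.0):
    """Fraction of steps whose implied speed/acceleration break physical bounds.

    xy: (T, 2) positions in meters, sampled every dt seconds.
    v_max: maximum plausible speed (m/s); a_max: maximum acceleration (m/s^2).
    """
    vel = np.diff(xy, axis=0) / dt                       # (T-1, 2) velocities
    speed = np.linalg.norm(vel, axis=1)
    acc = np.linalg.norm(np.diff(vel, axis=0), axis=1) / dt
    violations = np.concatenate([speed > v_max, acc > a_max])
    return violations.mean()

if __name__ == "__main__":
    t = np.linspace(0, 60, 61)
    smooth = np.stack([5.0 * t, 0.1 * t ** 1.2], axis=1)   # plausible vessel track
    spoofed = smooth.copy()
    spoofed[30] += np.array([400.0, 0.0])                  # teleport-like jump
    print(kinematic_violation_score(smooth), kinematic_violation_score(spoofed))
```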

Authors:Mellon M. Zhang, Glen Chou, Saibal Mukhopadhyay
Title: Polar Hierarchical Mamba: Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences
Abstract:
Accurate and efficient object detection is essential for autonomous vehicles, where real-time perception requires low latency and high throughput. LiDAR sensors provide robust depth information, but conventional methods process full 360° scans in a single pass, introducing significant delay. Streaming approaches address this by sequentially processing partial scans in the native polar coordinate system, yet they rely on translation-invariant convolutions that are misaligned with polar geometry -- resulting in degraded performance or requiring complex distortion mitigation. Recent Mamba-based state space models (SSMs) have shown promise for LiDAR perception, but only in the full-scan setting, relying on geometric serialization and positional embeddings that are memory-intensive and ill-suited to streaming. We propose Polar Hierarchical Mamba (PHiM), a novel SSM architecture designed for polar-coordinate streaming LiDAR. PHiM uses local bidirectional Mamba blocks for intra-sector spatial encoding and a global forward Mamba for inter-sector temporal modeling, replacing convolutions and positional encodings with distortion-aware, dimensionally-decomposed operations. PHiM sets a new state-of-the-art among streaming detectors on the Waymo Open Dataset, outperforming the previous best by 10\% and matching full-scan baselines at twice the throughput. Code will be available at https://github.com/meilongzhang/Polar-Hierarchical-Mamba .
中文: 提出的极坐标分层Mamba(PHiM)架构通过用感知畸变的Mamba模块替代传统卷积,解决了流式LiDAR检测的低效问题,在Waymo开放数据集中以超越先前最佳方法10%的性能刷新了最高水平,并实现了与全扫描相当的吞吐量。
English: The proposed Polar Hierarchical Mamba (PHiM) architecture addresses inefficiencies in streaming LiDAR detection by replacing conventional convolutions with distortion-aware Mamba blocks, achieving state-of-the-art performance on the Waymo Open Dataset with a 10% improvement over prior methods and matching full-scan throughput.

Authors:George Lydakis, Alexander Hermans, Ali Athar, Daan de Geus, Bastian Leibe
Title: How Important are Videos for Training Video LLMs?
Abstract:
Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and occasionally higher than, what is achieved by video-trained LLMs. This suggests suboptimal utilization of rich temporal features found in real video by current models. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
中文摘要:视频大语言模型在仅图像训练后展现出超预期的时序推理能力,而视频专项训练带来的提升微乎其微,表明现有模型未能充分利用视频中的时序特征。
English Summary: Video LLMs demonstrate unexpected temporal reasoning capabilities after image-only training, with minimal gains from video-specific training, suggesting current models underutilize temporal features in videos.

Authors:Nima Jamali, Matina Mahdizadeh Sani, Hanieh Naderi, Shohreh Kasaei
Title: KNN-Defense: Defense against 3D Adversarial Point Clouds using Nearest-Neighbor Search
Abstract:
Deep neural networks (DNNs) have demonstrated remarkable performance in analyzing 3D point cloud data. However, their vulnerability to adversarial attacks-such as point dropping, shifting, and adding-poses a critical challenge to the reliability of 3D vision systems. These attacks can compromise the semantic and structural integrity of point clouds, rendering many existing defense mechanisms ineffective. To address this issue, a defense strategy named KNN-Defense is proposed, grounded in the manifold assumption and nearest-neighbor search in feature space. Instead of reconstructing surface geometry or enforcing uniform point distributions, the method restores perturbed inputs by leveraging the semantic similarity of neighboring samples from the training set. KNN-Defense is lightweight and computationally efficient, enabling fast inference and making it suitable for real-time and practical applications. Empirical results on the ModelNet40 dataset demonstrated that KNN-Defense significantly improves robustness across various attack types. In particular, under point-dropping attacks-where many existing methods underperform due to the targeted removal of critical points-the proposed method achieves accuracy gains of 20.1%, 3.6%, 3.44%, and 7.74% on PointNet, PointNet++, DGCNN, and PCT, respectively. These findings suggest that KNN-Defense offers a scalable and effective solution for enhancing the adversarial resilience of 3D point cloud classifiers. (An open-source implementation of the method, including code and data, is available at https://github.com/nimajam41/3d-knn-defense).
中文: KNN-Defense是一种轻量高效的防御方法,通过利用训练数据的语义相似性恢复受扰动的输入,显著提升了三维点云分类器对抗多种攻击的鲁棒性,并在不同模型上实现了明显的准确率提升。
English: KNN-Defense is a lightweight and efficient method that enhances the robustness of 3D point cloud classifiers against adversarial attacks by restoring perturbed inputs using semantic similarity from training data, achieving significant accuracy improvements across various models.
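As a rough sketch of the nearest-neighbor idea (not the authors' exact pipeline), a perturbed sample embedded by a frozen encoder can be assigned the majority label of its k nearest training-set features; the feature dimension, k, and the random data below are placeholders.

```python
import numpy as np
from collections import Counter

def knn_defense_predict(query_feat, train_feats, train_labels, k=5):
    """Classify a (possibly adversarial) sample by its k nearest training features.

    query_feat: (D,) feature of the input point cloud from a frozen encoder.
    train_feats: (N, D) features of clean training samples; train_labels: (N,).
    """
    dists = np.linalg.norm(train_feats - query_feat[None, :], axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_feats = rng.normal(size=(1000, 256))
    train_labels = rng.integers(0, 40, size=1000)          # e.g., ModelNet40 classes
    query = train_feats[10] + 0.05 * rng.normal(size=256)  # mildly perturbed sample
    print(knn_defense_predict(query, train_feats, train_labels))
```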

Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Title: FREE: Fast and Robust Vision Language Models with Early Exits
Abstract:
In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the phenomenon of mid-crisis that we highlight. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance. The source code is available at https://github.com/Div290/FREE.
Chinese: 本文提出FREE方法,通过基于GAN框架的对抗性训练在视觉语言模型中实现早期退出策略,在保持性能的同时将推理速度提升超过1.51倍。
English: The paper introduces FREE, an adversarial training method using a GAN-based framework to implement Early Exit strategies in Vision-Language Models, significantly accelerating inference by over 1.51x while maintaining performance.
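The GAN-based training of the exit classifiers is not reproduced; the sketch below only illustrates the inference side of early exiting, where computation stops at the first exit whose softmax confidence clears a threshold. The layer count, pooling, and threshold are assumptions.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy encoder where each block is followed by an exit classifier."""
    def __init__(self, dim=256, num_layers=6, num_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(num_layers)])
        self.exits = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_layers)])

    @torch.no_grad()
    def adaptive_forward(self, x, threshold=0.9):
        """Return (logits, exit_index); stop once max softmax prob >= threshold."""
        for i, (block, exit_head) in enumerate(zip(self.blocks, self.exits)):
            x = block(x)
            logits = exit_head(x.mean(dim=1))       # pool tokens, then classify
            conf = logits.softmax(-1).max(-1).values
            if bool((conf >= threshold).all()):     # batch-wise early exit
                return logits, i
        return logits, len(self.blocks) - 1

if __name__ == "__main__":
    model = EarlyExitEncoder().eval()
    tokens = torch.randn(2, 50, 256)                # e.g., fused vision-language tokens
    logits, used_layer = model.adaptive_forward(tokens)
    print(logits.shape, "exited after layer", used_layer)
```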

Authors:Chao Yin, Hao Li, Kequan Yang, Jide Li, Pinpin Zhu, Xiaoqiang Li
Title: Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation
Abstract:
While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via a Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves state-of-the-art segmentation results on multiple COS benchmarks and delivers a faster inference speed than previous methods, demonstrating significantly improved accuracy and efficiency. The codes will be available at https://github.com/ycyinchao/RDVP-MSD
中文:提出的RDVP-MSD框架通过结合多模态逐步分解和区域约束双流视觉提示,解决了伪装物体分割中的语义模糊和空间分离问题,无需训练即可实现最先进的性能。
English: The proposed RDVP-MSD framework addresses semantic ambiguity and spatial separation in camouflaged object segmentation by combining multimodal stepwise decomposition and region-constrained dual-stream visual prompting, achieving state-of-the-art results without requiring training.

Authors:Mohammad-Maher Nakshbandi, Ziad Sharawy, Dorian Cojocaru, Sorin Grigorescu
Title: LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping
Abstract:
In this study, we introduce LoopDB, a challenging loop closure dataset comprising over 1000 images captured across diverse environments, including parks, indoor scenes, parking spaces, and scenes centered around individual objects. Each scene is represented by a sequence of five consecutive images. The dataset was collected using a high-resolution camera, providing suitable imagery for benchmarking the accuracy of loop closure algorithms typically used in simultaneous localization and mapping. As ground truth information, we provide the computed rotations and translations between consecutive images. In addition to its benchmarking goal, the dataset can be used to train and fine-tune loop closure methods based on deep neural networks. LoopDB is publicly available at https://github.com/RovisLab/LoopDB.
中文: 本研究推出LoopDB数据集,包含1000多张多样化环境图像,专为SLAM中的闭环检测算法基准测试和深度神经网络训练而设计,提供真实位姿数据并公开共享。
English: This study presents LoopDB, a challenging dataset of over 1000 images across varied environments designed for benchmarking and training loop closure algorithms in SLAM applications, with ground truth data and public availability.

Authors:Nidheesh Gorthi, Kartik Thakral, Rishabh Ranjan, Richa Singh, Mayank Vatsa
Title: LitMAS: A Lightweight and Generalized Multi-Modal Anti-Spoofing Framework for Biometric Security
Abstract:
Biometric authentication systems are increasingly being deployed in critical applications, but they remain susceptible to spoofing. Since most of the research efforts focus on modality-specific anti-spoofing techniques, building a unified, resource-efficient solution across multiple biometric modalities remains a challenge. To address this, we propose LitMAS, a Lightweight and generalizable Multi-modal Anti-Spoofing framework designed to detect spoofing attacks in speech, face, iris, and fingerprint-based biometric systems. At the core of LitMAS is a Modality-Aligned Concentration Loss, which enhances inter-class separability while preserving cross-modal consistency and enabling robust spoof detection across diverse biometric traits. With just 6M parameters, LitMAS surpasses state-of-the-art methods by 1.36% in average EER across seven datasets, demonstrating high efficiency, strong generalizability, and suitability for edge deployment. Code and trained models are available at https://github.com/IAB-IITJ/LitMAS.
Chinese: LitMAS是一种轻量级、可泛化的多模态防伪框架,通过仅600万参数即可有效检测语音、人脸、虹膜和指纹生物识别系统中的欺骗攻击,在七个数据集上的平均等错误率比现有最优方法提升1.36%。
English: LitMAS is a lightweight, generalizable multi-modal anti-spoofing framework that effectively detects spoofing attacks across speech, face, iris, and fingerprint biometric systems using only 6M parameters, outperforming existing methods by 1.36% in average EER.

Authors:Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl
Title: Improving Wildlife Out-of-Distribution Detection: Africa's Big Five
Abstract:
Mitigating human-wildlife conflict seeks to resolve unwanted encounters between these parties. Computer Vision provides a solution to identifying individuals that might escalate into conflict, such as members of the Big Five African animals. However, environments often contain several varied species. The current state-of-the-art animal classification models are trained under a closed-world assumption. They almost always remain overconfident in their predictions even when presented with unknown classes. This study investigates out-of-distribution (OOD) detection of wildlife, specifically the Big Five. To this end, we select a parametric Nearest Class Mean (NCM) and a non-parametric contrastive learning approach as baselines to take advantage of pretrained and projected features from popular classification encoders. Moreover, we compare our baselines to various common OOD methods in the literature. The results show feature-based methods reflect stronger generalisation capability across varying classification thresholds. Specifically, NCM with ImageNet pre-trained features achieves a 2%, 4% and 22% improvement on AUPR-IN, AUPR-OUT and AUTC over the best OOD methods, respectively. The code can be found here https://github.com/pxpana/BIG5OOD
中文摘要:本研究采用基于特征的方法探索非洲五大动物的分布外检测,相比现有方法实现了AUTC指标22%的性能提升。
English Summary: This study explores out-of-distribution detection for African Big Five wildlife using feature-based methods, demonstrating superior generalization with a 22% AUTC improvement over existing approaches.
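The nearest class mean (NCM) baseline mentioned above is simple enough to sketch: class means are fitted on in-distribution features, and a test sample's OOD score is its distance to the closest mean. The feature dimension and the synthetic data below are illustrative, not the study's setup.

```python
import numpy as np

def fit_class_means(feats, labels):
    """Compute one mean feature vector per in-distribution class."""
    classes = np.unique(labels)
    return classes, np.stack([feats[labels == c].mean(axis=0) for c in classes])

def ncm_ood_score(x, means):
    """OOD score = distance to the nearest class mean (larger = more OOD)."""
    return np.linalg.norm(means - x[None, :], axis=1).min()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    labels = rng.integers(0, 5, size=500)                 # five in-distribution classes
    feats = rng.normal(size=(500, 128)) + labels[:, None] * 3.0
    classes, means = fit_class_means(feats, labels)

    in_dist = feats[0]
    out_dist = rng.normal(size=128) + 50.0                # far from every class mean
    print(ncm_ood_score(in_dist, means), ncm_ood_score(out_dist, means))
```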

Authors:Qianqian Zhao, Chunle Guo, Tianyi Zhang, Junpei Zhang, Peiyang Jia, Tan Su, Wenjie Jiang, Chongyi Li
Title: A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution
Abstract:
Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous innovative and effective approaches have been proposed, predominantly based on deep learning techniques, involving diverse network architectures, loss functions, projection strategies, and training datasets. This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution, focusing on deep learning-based methods. Given that existing datasets predominantly rely on synthetic degradation and fall short in capturing real-world distortions, we introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos collected under diverse conditions, including varying lighting, motion, and exposure settings. This dataset addresses a critical gap in current omnidirectional benchmarks and enables more robust evaluation of the generalization capabilities of omnidirectional super-resolution methods. We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset. Furthermore, we provide a systematic overview of the current status of research and discuss promising directions for future exploration. All datasets, methods, and evaluation metrics introduced in this work are publicly available and will be regularly updated. Project page: https://github.com/nqian1/Survey-on-ODISR-and-ODVSR.
中文: 本文系统综述了基于深度学习的全景图像与视频超分辨率方法,并引入包含真实场景退化的新数据集360Insta以弥补现有基准不足,同时提供了全面评估和未来研究方向。
English: This paper systematically reviews deep learning-based omnidirectional image and video super-resolution methods and introduces a new real-world dataset, 360Insta, to address limitations in existing benchmarks while providing comprehensive evaluations and future research directions.

Authors:Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, Jose M. Alvarez
Title: Generalized Trajectory Scoring for End-to-end Multimodal Planning
Abstract:
End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approaches face significant limitations in generalization. Static vocabularies provide effective coarse discretization but struggle to make fine-grained adaptation, while dynamic proposals offer detailed precision but fail to capture broader trajectory distributions. To overcome these challenges, we propose GTRS (Generalized Trajectory Scoring), a unified framework for end-to-end multi-modal planning that combines coarse and fine-grained trajectory evaluation. GTRS consists of three complementary innovations: (1) a diffusion-based trajectory generator that produces diverse fine-grained proposals; (2) a vocabulary generalization technique that trains a scorer on super-dense trajectory sets with dropout regularization, enabling its robust inference on smaller subsets; and (3) a sensor augmentation strategy that enhances out-of-domain generalization while incorporating refinement training for critical trajectory discrimination. As the winning solution of the Navsim v2 Challenge, GTRS demonstrates superior performance even with sub-optimal sensor inputs, approaching privileged methods that rely on ground-truth perception. Code will be available at https://github.com/NVlabs/GTRS.
中文摘要:GTRS是一个用于自动驾驶的统一框架,通过基于扩散的轨迹生成、词汇泛化技术和传感器增强策略,结合粗粒度与细粒度轨迹评估,有效解决了泛化能力不足的问题。
English Summary: GTRS is a unified framework for autonomous driving that combines coarse and fine-grained trajectory evaluation through diffusion-based generation, vocabulary generalization, and sensor augmentation to overcome limitations in generalization.

Authors:Cheng Peng, Jingxiang Sun, Yushuo Chen, Zhaoqi Su, Zhuo Su, Yebin Liu
Title: Parametric Gaussian Human Model: Generalizable Prior for Efficient and Realistic Human Avatar Modeling
Abstract:
Photorealistic and animatable human avatars are a key enabler for virtual/augmented reality, telepresence, and digital entertainment. While recent advances in 3D Gaussian Splatting (3DGS) have greatly improved rendering quality and efficiency, existing methods still face fundamental challenges, including time-consuming per-subject optimization and poor generalization under sparse monocular inputs. In this work, we present the Parametric Gaussian Human Model (PGHM), a generalizable and efficient framework that integrates human priors into 3DGS for fast and high-fidelity avatar reconstruction from monocular videos. PGHM introduces two core components: (1) a UV-aligned latent identity map that compactly encodes subject-specific geometry and appearance into a learnable feature tensor; and (2) a disentangled Multi-Head U-Net that predicts Gaussian attributes by decomposing static, pose-dependent, and view-dependent components via conditioned decoders. This design enables robust rendering quality under challenging poses and viewpoints, while allowing efficient subject adaptation without requiring multi-view capture or long optimization time. Experiments show that PGHM is significantly more efficient than optimization-from-scratch methods, requiring only approximately 20 minutes per subject to produce avatars with comparable visual quality, thereby demonstrating its practical applicability for real-world monocular avatar creation.

Authors:Minghao Zou, Qingtian Zeng, Yongping Miao, Shangkun Liu, Zilong Wang, Hantao Liu, Wei Zhou
Title: PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments
Abstract:
Visual parsing of images and videos is critical for a wide range of real-world applications. However, progress in this field is constrained by limitations of existing datasets: (1) insufficient annotation granularity, which impedes fine-grained scene understanding and high-level reasoning; (2) limited coverage of domains, particularly a lack of datasets tailored for educational scenarios; and (3) lack of explicit procedural guidance, with minimal logical rules and insufficient representation of structured task process. To address these gaps, we introduce PhysLab, the first video dataset that captures students conducting complex physics experiments. The dataset includes four representative experiments that feature diverse scientific instruments and rich human-object interaction (HOI) patterns. PhysLab comprises 620 long-form videos and provides multilevel annotations that support a variety of vision tasks, including action recognition, object detection, HOI analysis, etc. We establish strong baselines and perform extensive evaluations to highlight key challenges in the parsing of procedural educational videos. We expect PhysLab to serve as a valuable resource for advancing fine-grained visual parsing, facilitating intelligent classroom systems, and fostering closer integration between computer vision and educational technologies. The dataset and the evaluation toolkit are publicly available at https://github.com/ZMH-SDUST/PhysLab.
中文摘要:PhysLab数据集通过提供620个包含多层次标注的物理实验长视频,解决了现有数据集在标注粒度、领域覆盖和过程指导方面的不足,推动了精细视觉解析与教育技术的融合发展。
English Summary: The PhysLab dataset addresses limitations in visual parsing by providing 620 long-form videos of physics experiments with multilevel annotations to advance fine-grained analysis and educational technology integration.

Authors:Ho Yin 'Sam' Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao 'Kenneth' Huang
Title: LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
Abstract:
Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
Chinese: 本文介绍了用于个性化图表标题生成的多模态数据集LaMP-Cap,实验表明,结合多模态背景信息(特别是图像)能显著提升AI生成标题与作者原创标题的契合度,优于纯文本方法。
English: The paper introduces LaMP-Cap, a multimodal dataset for personalized figure caption generation, demonstrating through experiments that incorporating multimodal profile information, especially images, significantly improves the alignment of AI-generated captions with author-written ones compared to text-only approaches.

Authors:Masoud Rahimi, Reza Karbasi, Abdol-Hossein Vahabie
Title: An Open-Source Python Framework and Synthetic ECG Image Datasets for Digitization, Lead and Lead Name Detection, and Overlapping Signal Segmentation
Abstract:
We introduce an open-source Python framework for generating synthetic ECG image datasets to advance critical deep learning-based tasks in ECG analysis, including ECG digitization, lead region and lead name detection, and pixel-level waveform segmentation. Using the PTB-XL signal dataset, our proposed framework produces four open-access datasets: (1) ECG images in various lead configurations paired with time-series signals for ECG digitization, (2) ECG images annotated with YOLO-format bounding boxes for detection of lead region and lead name, (3)-(4) cropped single-lead images with segmentation masks compatible with U-Net-based models in normal and overlapping versions. In the overlapping case, waveforms from neighboring leads are superimposed onto the target lead image, while the segmentation masks remain clean. The open-source Python framework and datasets are publicly available at https://github.com/rezakarbasi/ecg-image-and-signal-dataset and https://doi.org/10.5281/zenodo.15484519, respectively.
中文: 该开源Python框架基于PTB-XL信号生成合成心电图图像数据集,支持心电图数字化、导联区域识别和波形分割等深度学习任务,四类开放数据集可通过GitHub和Zenodo公开获取。
English: This open-source Python framework generates synthetic ECG image datasets from PTB-XL signals to support deep learning tasks like ECG digitization, lead detection, and waveform segmentation, with four publicly available datasets accessible via GitHub and Zenodo.
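The full framework (lead layouts, YOLO boxes, overlapping-lead versions) is not reproduced here; the sketch below only shows the basic step of rasterizing a 1-D signal into an image plus a pixel-level waveform mask, which is the kind of image/mask pair the datasets described above provide. The toy signal and rendering parameters are placeholders, not PTB-XL data.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def render_lead(signal, out_png="lead.png", dpi=100):
    """Rasterize a single-lead signal to an image file and a binary waveform mask."""
    fig, ax = plt.subplots(figsize=(6, 2), dpi=dpi)
    ax.plot(signal, color="black", linewidth=1.0)
    ax.axis("off")
    fig.savefig(out_png, bbox_inches="tight", pad_inches=0)

    # rasterize the same figure in memory and threshold dark pixels to get a mask
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())
    plt.close(fig)
    mask = (rgba[..., :3].mean(axis=-1) < 128).astype(np.uint8)   # 1 where ink is
    return mask

if __name__ == "__main__":
    t = np.linspace(0, 2 * np.pi, 1000)
    fake_ecg = 0.1 * np.sin(8 * t) + np.exp(-((t - np.pi) ** 2) * 40)  # toy beat
    mask = render_lead(fake_ecg)
    print(mask.shape, mask.sum(), "waveform pixels")
```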

Authors:Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, Salman Khan
Title: TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
Abstract:
Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at: https://github.com/mbzuai-oryx/TerraFM.
中文摘要:TerraFM是一种自监督基础模型,通过整合全球哨兵卫星影像和创新的训练策略,在地球观测任务中实现了卓越性能。
English Summary: TerraFM is a self-supervised foundation model that integrates global Sentinel-1 and Sentinel-2 imagery with innovative training strategies to achieve superior performance in Earth observation tasks.

Authors:Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
Title: CoMemo: LVLMs Need Image Context with Image Memory
Abstract:
Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo - a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/.
中文摘要:CoMemo模型通过双路径架构和新型位置编码机制RoPE-DHR,有效解决了现有大视觉语言模型中的视觉信息忽略和空间关系保持问题,在多项基准测试中表现优异。
English Summary: The CoMemo model introduces a dual-path architecture and RoPE-DHR positional encoding to overcome limitations in current Large Vision-Language Models, such as visual neglect and poor spatial awareness, achieving superior performance across multiple benchmarks.

Authors:Han Yi, Yulu Pan, Feihong He, Xinyu Liu, Benjamin Zhang, Oluwatumininu Oguntola, Gedas Bertasius
Title: ExAct: A Video-Language Benchmark for Expert Action Analysis
Abstract:
We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/

Authors:Jonathan Schmidt, Simon Giebenhain, Matthias Niessner
Title: BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading
Abstract:
We introduce BecomingLit, a novel method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. Therefore, we propose a new low-cost light stage capture setup, tailored specifically towards capturing faces. Using this setup, we collect a novel dataset consisting of diverse multi-view sequences of numerous subjects under varying illumination conditions and facial expressions. By leveraging our new dataset, we introduce a new relightable avatar representation based on 3D Gaussian primitives that we animate with a parametric head model and an expression-dependent dynamics module. We propose a new hybrid neural shading approach, combining a neural diffuse BRDF with an analytical specular term. Our method reconstructs disentangled materials from our dynamic light stage recordings and enables all-frequency relighting of our avatars with both point lights and environment maps. In addition, our avatars can easily be animated and controlled from monocular videos. We validate our approach in extensive experiments on our dataset, where we consistently outperform existing state-of-the-art methods in relighting and reenactment by a significant margin.

Authors:Yuping He, Yifei Huang, Guo Chen, Lidong Lu, Baoqi Pei, Jilan Xu, Tong Lu, Yoichi Sato
Title: Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
Abstract:
Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, allowing the machines to leverage the synergistic potential of these dual perspectives has emerged as a compelling research direction in video understanding. In this survey, we provide a comprehensive review of video understanding from both exocentric and egocentric viewpoints. We begin by highlighting the practical applications of integrating egocentric and exocentric techniques, envisioning their potential collaboration across domains. We then identify key research tasks to realize these applications. Next, we systematically organize and review recent advancements into three main research directions: (1) leveraging egocentric data to enhance exocentric understanding, (2) utilizing exocentric data to improve egocentric analysis, and (3) joint learning frameworks that unify both perspectives. For each direction, we analyze a diverse set of tasks and relevant works. Additionally, we discuss benchmark datasets that support research in both perspectives, evaluating their scope, diversity, and applicability. Finally, we discuss limitations in current works and propose promising future research directions. By synthesizing insights from both perspectives, our goal is to inspire advancements in video understanding and artificial intelligence, bringing machines closer to perceiving the world in a human-like manner. A GitHub repo of related works can be found at https://github.com/ayiyayi/Awesome-Egocentric-and-Exocentric-Vision.
中文摘要:本综述全面探讨了自我中心与异我中心视角在视频理解中的协同应用,系统梳理了双向增强方法与联合学习框架的研究进展,旨在推动人工智能实现更接近人类的感知能力。
English Summary: This survey comprehensively reviews video understanding by integrating egocentric and exocentric perspectives, exploring their synergistic applications, current advancements, and future directions to enhance AI's human-like perception.

Authors:Akram Zaytar, Caleb Robinson, Girmaw Abebe Tadesse, Tammy Glazer, Gilles Hacheme, Anthony Ortiz, Rahul M Dodhia, Juan M Lavista Ferres
Title: Optimizing Cloud-to-GPU Throughput for Deep Learning With Earth Observation Data
Abstract:
Training deep learning models on petabyte-scale Earth observation (EO) data requires separating compute resources from data storage. However, standard PyTorch data loaders cannot keep modern GPUs utilized when streaming GeoTIFF files directly from cloud storage. In this work, we benchmark GeoTIFF loading throughput from both cloud object storage and local SSD, systematically testing different loader configurations and data parameters. We focus on tile-aligned reads and worker thread pools, using Bayesian optimization to find optimal settings for each storage type. Our optimized configurations increase remote data loading throughput by 20x and local throughput by 4x compared to default settings. On three public EO benchmarks, models trained with optimized remote loading achieve the same accuracy as local training within identical time budgets. We improve validation IoU by 6-15% and maintain 85-95% GPU utilization versus 0-30% with standard configurations. Code is publicly available at https://github.com/microsoft/pytorch-cloud-geotiff-optimization
中文摘要:优化的GeoTIFF加载配置大幅提升了地球观测深度学习模型的数据吞吐量和GPU利用率,在保持与本地训练同等精度的同时显著提高了效率。
English Summary: Optimized GeoTIFF loading configurations significantly enhance data throughput and GPU utilization for Earth observation deep learning models, achieving comparable accuracy to local training while improving efficiency.
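The tuned loader settings live in the linked repository; the following is only a hedged illustration of the two levers named in the abstract, tile-aligned window reads and a worker pool, using rasterio and a standard PyTorch DataLoader. The file path, tile size, and worker count are placeholders rather than the optimized values.

```python
import numpy as np
import rasterio
from rasterio.windows import Window
import torch
from torch.utils.data import Dataset, DataLoader

class CloudGeoTiffTiles(Dataset):
    """Read fixed-size, tile-aligned windows from a (possibly remote) GeoTIFF."""
    def __init__(self, url, tile=512):
        self.url, self.tile = url, tile
        with rasterio.open(url) as src:
            self.rows = src.height // tile
            self.cols = src.width // tile

    def __len__(self):
        return self.rows * self.cols

    def __getitem__(self, idx):
        r, c = divmod(idx, self.cols)
        window = Window(c * self.tile, r * self.tile, self.tile, self.tile)
        # each worker opens its own dataset handle; reads stay aligned to tiles
        with rasterio.open(self.url) as src:
            patch = src.read(window=window).astype(np.float32)
        return torch.from_numpy(patch)

if __name__ == "__main__":
    # placeholder path; in practice this would be an https:// or s3:// COG URL
    ds = CloudGeoTiffTiles("example_scene.tif", tile=512)
    loader = DataLoader(ds, batch_size=8, num_workers=8, pin_memory=True)
    for batch in loader:
        print(batch.shape)   # (8, bands, 512, 512)
        break
```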

Authors:Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia
Title: Towards an Explainable Comparison and Alignment of Feature Embeddings
Abstract:
While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the \emph{Spectral Pairwise Embedding Comparison (SPEC)} framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the SPEC's application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The project page is available at https://mjalali.github.io/SPEC/.
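The scalable implementation is in the linked project; a small dense sketch of the core idea, eigendecomposition of the difference between two embeddings' kernel matrices, is given below with an RBF kernel and random features standing in for real embeddings. The normalization and kernel choice are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(x, gamma=None):
    """Normalized RBF kernel matrix for one embedding of the reference set."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    k = np.exp(-gamma * sq)
    return k / np.trace(k)          # normalize so the two kernels are comparable

def spec_compare(emb_a, emb_b, top=3):
    """Top eigenvectors of K_A - K_B highlight samples clustered differently."""
    diff = rbf_kernel(emb_a) - rbf_kernel(emb_b)
    vals, vecs = np.linalg.eigh(diff)        # symmetric matrix, so eigh applies
    order = np.argsort(-np.abs(vals))[:top]
    return vals[order], vecs[:, order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(200, 32))                   # reference samples
    emb_a = ref @ rng.normal(size=(32, 64))            # two stand-in embedding maps
    emb_b = np.tanh(ref @ rng.normal(size=(32, 64)))
    vals, vecs = spec_compare(emb_a, emb_b)
    print(vals, vecs.shape)                             # top "mismatch" directions
```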

Authors:Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, Horst Possegger
Title: STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving
Abstract:
We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.
中文: STSBench提出了一种基于场景的框架来评估自动驾驶中的视觉语言模型,通过NuScenes数据集创建了STSnu基准测试,用于检验模型在多视角和多帧下的时空推理能力,揭示了现有模型在理解复杂交通动态方面存在严重不足。
English: STSBench introduces a scenario-based framework for evaluating vision-language models in autonomous driving, creating the STSnu benchmark from the NuScenes dataset to test spatio-temporal reasoning across multiple views and frames, revealing critical gaps in current models' understanding of traffic dynamics.

Authors:Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang
Title: PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
Abstract:
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
中文: PuzzleWorld是一个包含667个解谜式问题的新基准,用于评估开放式多模态推理,当前最优模型仅解决14%的谜题,并暴露出短视推理和缺乏草图能力等局限。
English: PuzzleWorld is a new benchmark of 667 puzzlehunt problems that tests open-ended multimodal reasoning, where current top models struggle with only 14% solved and reveal limitations like myopic reasoning and lack of sketching skills.

Authors:Wenyuan Li, Shunlin Liang, Yuxiang Zhang, Liqin Liu, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Zhenwei Shi
Title: Fine-grained Hierarchical Crop Type Classification from Integrated Hyperspectral EnMAP Data and Multispectral Sentinel-2 Time Series: A Large-scale Dataset and Dual-stream Transformer Method
Abstract:
Fine-grained crop type classification serves as the fundamental basis for large-scale crop mapping and plays a vital role in ensuring food security. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities remains scarce due to challenges in hyperspectral data acquisition and crop type annotation costs. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchical classification head with hierarchical fusion then simultaneously delivers multi-level crop type classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields a 4.2% average F1-score improvement (peaking at 6.3%). Extensive comparisons also confirm our method's higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Codes and dataset are available at https://github.com/flyakon/H2Crop.
中文: 本研究提出了H2Crop分层高光谱作物数据集及双流Transformer模型,通过协同处理高光谱与多时相卫星数据,显著提升了细粒度作物分类精度,相关代码与数据已开源。
English: This study introduces H2Crop, a hierarchical hyperspectral crop dataset, and a dual-stream Transformer model that synergistically combines hyperspectral and multi-temporal satellite data to significantly improve fine-grained crop classification accuracy, with codes and data publicly available.

Authors:David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Title: CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval
Abstract:
Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR's downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
中文: 本研究提出CLaMR多模态视频检索系统,通过动态选择相关模态并联合编码四种内容类型,结合合成数据集和模态感知训练方法,显著超越了现有检索模型的性能表现。
English: The study introduces CLaMR, a multimodal video retriever that dynamically selects relevant modalities and jointly encodes four types of content, achieving significant performance improvements over existing methods by using a synthetic dataset and a modality-aware training approach.
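A schematic sketch of the modality-aware training signal described in the abstract: each query-video pair receives one relevance score per modality, a standard contrastive loss is applied to the aggregated scores, and an auxiliary cross-entropy encourages the model to rely on the modality the query targets. Max aggregation, the loss weighting `beta`, and all shapes are assumptions for illustration, not the CLaMR implementation.

```python
# Sketch of a contrastive retrieval loss plus an auxiliary modality-usage loss.
import torch
import torch.nn.functional as F

def modality_aware_loss(scores, pos_idx, target_modality, beta=0.5):
    """scores: (B, N, M) query x candidate x modality similarity scores."""
    doc_scores, _ = scores.max(dim=-1)                       # (B, N) aggregate over modalities
    contrastive = F.cross_entropy(doc_scores, pos_idx)       # retrieve the right video
    pos_modal = scores[torch.arange(len(pos_idx)), pos_idx]  # (B, M) scores of the positive
    modality = F.cross_entropy(pos_modal, target_modality)   # prefer the targeted modality
    return contrastive + beta * modality

scores = torch.randn(4, 16, 4, requires_grad=True)           # 4 modalities indexed 0..3
loss = modality_aware_loss(scores, torch.tensor([0, 3, 5, 7]), torch.tensor([1, 0, 2, 3]))
loss.backward()
print(loss.item())
```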

Authors:Felix Koulischer, Florian Handke, Johannes Deleu, Thomas Demeester, Luca Ambrogioni
Title: Feedback Guidance of Diffusion Models
Abstract:
While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG's implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive with Limited Interval Guidance (LIG) while benefitting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
中文摘要:提出的反馈引导(FBG)方法根据条件信号的信息性在推理过程中自适应调节引导强度,在ImageNet512x512上显著优于分类器自由引导,同时保持数学严谨性并能与现有引导方案兼容。
English Summary: The proposed FeedBack Guidance (FBG) adaptively regulates guidance during inference based on conditional signal informativeness, outperforming Classifier-Free Guidance on ImageNet512x512 while maintaining mathematical rigor and compatibility with existing methods.
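A schematic contrast between constant-scale CFG and a state-dependent guidance coefficient in the spirit of FBG. The feedback proxy below (a sigmoid of the per-sample gap between conditional and unconditional predictions) is only an assumption for illustration; the paper derives its coefficient from a linear-corruption model that is not reproduced here.

```python
# Constant-scale CFG versus a per-sample, state-dependent guidance weight (sketch).
import torch

def guided_eps(eps_cond, eps_uncond, w):
    # Standard guided prediction: move from the unconditional toward the conditional.
    return eps_uncond + w * (eps_cond - eps_uncond)

def feedback_weight(eps_cond, eps_uncond, w_min=1.0, w_max=8.0, tau=1.0):
    # Crude informativeness proxy (assumption): a small conditional/unconditional gap
    # suggests the conditional signal already dominates, so less guidance is applied.
    gap = (eps_cond - eps_uncond).flatten(1).norm(dim=1)
    alpha = torch.sigmoid(gap / tau - 1.0)             # in (0, 1), per sample
    return (w_min + (w_max - w_min) * alpha).view(-1, 1, 1, 1)

eps_c = torch.randn(4, 3, 64, 64)                      # stand-ins for model outputs
eps_u = torch.randn(4, 3, 64, 64)
eps_cfg = guided_eps(eps_c, eps_u, w=4.0)              # constant guidance (CFG)
eps_fbg = guided_eps(eps_c, eps_u, feedback_weight(eps_c, eps_u))  # state-dependent
print(eps_cfg.shape, eps_fbg.shape)
```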

Authors:Julio Silva-Rodríguez, Leo Fillioux, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Ismail Ben Ayed, Jose Dolz
Title: Full Conformal Adaptation of Medical Vision-Language Models
Abstract:
Vision-language models (VLMs) pre-trained at large scale have shown unprecedented transferability capabilities and are being progressively integrated into medical image analysis. Although their discriminative potential has been widely explored, their reliability remains overlooked. This work investigates their behavior under the increasingly popular split conformal prediction (SCP) framework, which theoretically guarantees a given error level on output sets by leveraging a labeled calibration set. However, the zero-shot performance of VLMs is inherently limited, and common practice involves few-shot transfer learning pipelines, which are incompatible with the rigid exchangeability assumptions of SCP. To alleviate this issue, we propose full conformal adaptation, a novel setting for jointly adapting and conformalizing pre-trained foundation models, which operates transductively over each test data point using a few-shot adaptation set. Moreover, we complement this framework with SS-Text, a novel training-free linear probe solver for VLMs that alleviates the computational cost of such a transductive approach. We provide comprehensive experiments using 3 different modality-specialized medical VLMs and 9 adaptation tasks. Our framework requires exactly the same data as SCP, and provides consistent relative improvements of up to 27% on set efficiency while maintaining the same coverage guarantees.
Chinese: 本研究提出了全适应性共形预测的新框架,用于联合调整和共形化预训练的视觉语言模型在医学图像分析中的应用,在保持覆盖保证的同时将集合效率提升高达27%,并辅以无需训练的SS-Text求解器来降低计算成本。
English: This study introduces full conformal adaptation, a novel framework that jointly adapts and conformalizes pre-trained vision-language models for medical image analysis, enhancing set efficiency by up to 27% while maintaining coverage guarantees, and complements it with SS-Text, a training-free solver to reduce computational costs.
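For context, here is a minimal sketch of the split conformal prediction (SCP) baseline that the paper builds on, not the proposed full conformal adaptation: calibrate a nonconformity threshold on held-out labeled data, then form prediction sets that cover the true label with probability at least 1 - alpha.

```python
# Split conformal prediction for classification (baseline sketch, not SS-Text).
import numpy as np

def scp_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: (n, C) softmax outputs; cal_labels: (n,) true classes."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity of the true class
    q = np.ceil((n + 1) * (1 - alpha)) / n                # finite-sample correction
    return np.quantile(scores, min(q, 1.0))

def scp_sets(test_probs, threshold):
    # Keep every class whose nonconformity score falls below the calibrated threshold.
    return [np.where(1.0 - p <= threshold)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=100)
cal_labels = rng.integers(0, 5, size=100)
tau = scp_threshold(cal_probs, cal_labels, alpha=0.1)
sets = scp_sets(rng.dirichlet(np.ones(5), size=3), tau)
print(tau, [s.tolist() for s in sets])
```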

Authors:Taoran Yue, Xiaojin Lu, Jiaxi Cai, Yuanping Chen, Shibing Chu
Title: SDS-Net: Shallow-Deep Synergism-detection Network for infrared small target detection
Abstract:
Current CNN-based infrared small target detection (IRSTD) methods generally overlook the heterogeneity between shallow and deep features, leading to inefficient collaboration between shallow fine-grained structural information and deep high-level semantic representations. Additionally, the dependency relationships and fusion mechanisms across different feature hierarchies lack systematic modeling, which fails to fully exploit the complementarity of multilevel features. These limitations hinder IRSTD performance while incurring substantial computational costs. To address these challenges, this paper proposes a shallow-deep synergistic detection network (SDS-Net) that efficiently models multilevel feature representations to increase both the detection accuracy and computational efficiency in IRSTD tasks. SDS-Net introduces a dual-branch architecture that separately models the structural characteristics and semantic properties of features, effectively preserving shallow spatial details while capturing deep semantic representations, thereby achieving high-precision detection with significantly improved inference speed. Furthermore, the network incorporates an adaptive feature fusion module to dynamically model cross-layer feature correlations, enhancing overall feature collaboration and representation capability. Comprehensive experiments on three public datasets (NUAA-SIRST, NUDT-SIRST, and IRSTD-1K) demonstrate that SDS-Net outperforms state-of-the-art IRSTD methods while maintaining low computational complexity and high inference efficiency, showing superior detection performance and broad application prospects. Our code will be made public at https://github.com/PhysiLearn/SDS-Net.
中文: 本文提出的SDS-Net通过双分支结构和自适应特征融合模块,有效协同浅层与深层特征,在提升红外小目标检测精度的同时显著提高了计算效率。
English: This paper introduces SDS-Net, a dual-branch network that effectively models shallow and deep features through adaptive fusion, enhancing infrared small target detection accuracy and efficiency while reducing computational costs.

Authors:Joscha Diehl, Rasheed Ibraheem, Leonard Schmitz, Yue Wu
Title: Tensor-to-Tensor Models with Fast Iterated Sum Features
Abstract:
Data in the form of images or higher-order tensors is ubiquitous in modern deep learning applications. Owing to their inherent high dimensionality, the need for subquadratic layers processing such data is even more pressing than for sequence data. We propose a novel tensor-to-tensor layer with linear cost in the input size, utilizing the mathematical gadget of "corner trees" from the field of permutation counting. In particular, for order-two tensors, we provide an image-to-image layer that can be plugged into image processing pipelines. On the one hand, our method can be seen as a higher-order generalization of state-space models. On the other hand, it is based on a multiparameter generalization of the signature of iterated integrals (or sums). The proposed tensor-to-tensor concept is used to build a neural network layer called the Fast Iterated Sums (FIS) layer which integrates seamlessly with other layer types. We demonstrate the usability of the FIS layer with both classification and anomaly detection tasks. By replacing some layers of a smaller ResNet architecture with FIS, a similar accuracy (with a difference of only 0.1%) was achieved in comparison to a larger ResNet while reducing the number of trainable parameters and multi-add operations. The FIS layer was also used to build an anomaly detection model that achieved an average AUROC of 97.3% on the texture images of the popular MVTec AD dataset. The processing and modelling codes are publicly available at https://github.com/diehlj/fast-iterated-sums.
中文摘要:本文提出了一种基于排列计数中“角树”数学工具的新型张量到张量线性计算层,该层既是状态空间模型的高阶推广,也是迭代积分的多参数扩展,在保持与大型网络相近精度的同时显著减少了参数数量和计算量。
English Summary: This paper introduces a novel tensor-to-tensor layer with linear computational cost using "corner trees" from permutation counting, which serves as both a higher-order generalization of state-space models and a multiparameter extension of iterated integrals, achieving comparable accuracy to larger networks while reducing parameters and computational operations.
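A toy illustration of the linear-cost idea behind iterated sums on order-two tensors (not the authors' FIS layer): with two-dimensional cumulative sums, every position can accumulate the input over its strictly north-west quadrant, giving a degree-two "corner" feature in a single pass.

```python
# Degree-two iterated-sum feature map computed in linear time via 2-D prefix sums.
import torch
import torch.nn.functional as F

def iterated_sum_feature(x):
    """x: (B, H, W) -> (B, H, W) feature map x[i,j] * sum_{i'<i, j'<j} x[i',j']."""
    prefix = torch.cumsum(torch.cumsum(x, dim=-1), dim=-2)  # inclusive 2-D prefix sums
    padded = F.pad(prefix, (1, 0, 1, 0))                    # shift to make the sums strict
    strict = padded[:, :-1, :-1]                            # sum over the north-west quadrant
    return strict * x

x = torch.randn(2, 8, 8)
feat = iterated_sum_feature(x)
# The global iterated sum sum_{i1<i2, j1<j2} x[i1,j1] * x[i2,j2] is then simply:
print(feat.sum(dim=(-2, -1)))
```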

Authors:Yuhao Sun, Jiacheng Zhang, Zesheng Ye, Chaowei Xiao, Feng Liu
Title: Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification
Abstract:
Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods, which is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that the optimal $t^*$ can indeed differ across samples. Intuitively, the cleaner a sample is, the less noise should be injected, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: https://github.com/tmlr-group/SSNI.
中文: 本文提出SSNI框架,通过预训练评分网络评估样本偏离干净数据分布的程度,自适应调整每个样本的噪声注入水平,在CIFAR-10和ImageNet-1K数据集上显著提升了基于扩散的净化方法的准确性和鲁棒性。
English: The paper introduces SSNI, a framework that adaptively adjusts noise injection levels in diffusion-based purification methods based on each sample's deviation from the clean data distribution, improving accuracy and robustness on benchmark datasets.
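A minimal sketch of the sample-specific idea behind SSNI, assuming a pre-trained score network is available; the linear reweighting from normalized score norms to noise levels is an illustrative choice, not necessarily the paper's exact function.

```python
# Per-sample noise levels from score norms (sketch with a stand-in score network).
import torch

@torch.no_grad()
def sample_specific_t(x, score_net, t_ref=0.3, t_min=0.05, t_max=0.5):
    """x: (B, C, H, W) inputs -> (B,) per-sample diffusion noise levels."""
    s = score_net(x, t_ref)                               # score estimate at a probe level
    norms = s.flatten(1).norm(dim=1)                      # larger norm ~ further from clean data
    u = (norms - norms.min()) / (norms.max() - norms.min() + 1e-8)
    return t_min + u * (t_max - t_min)                    # cleaner samples get smaller t*

score_net = lambda x, t: torch.randn_like(x)              # stand-in for a pre-trained score network
x_adv = torch.randn(8, 3, 32, 32)
t_star = sample_specific_t(x_adv, score_net)
# Each sample then runs the usual DBP forward/reverse diffusion with its own t_star.
print(t_star)
```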

Authors:Lorenzo Mur-Labadia, Maria Santos-Villafranca, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Ruben Martinez-Cantin, Jose J. Guerrero
Title: O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
Abstract:
Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$\leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art on the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +22% and +76% in Ego2Exo and Exo2Ego IoU against the official challenge baselines, and +13% and +6% over the SOTA while using only 1% of the training parameters.
Chinese: 本文提出O-MaMa方法,将跨图像分割重新定义为掩码匹配任务,通过在Ego-Exo4D基准测试中取得最先进性能,以极少的训练参数实现了显著性能提升。
English: The paper presents O-MaMa, a novel cross-image segmentation method that redefines the task as mask matching and achieves state-of-the-art results on the Ego-Exo4D benchmark with significant performance gains using minimal training parameters.

Authors:Yuchao Zheng, Jianing Zhang, Guochen Ning, Hongen Liao
Title: SurGSplat: Progressive Geometry-Constrained Gaussian Splatting for Surgical Scene Reconstruction
Abstract:
Intraoperative navigation relies heavily on precise 3D reconstruction to ensure accuracy and safety during surgical procedures. However, endoscopic scenarios present unique challenges, including sparse features and inconsistent lighting, which render many existing Structure-from-Motion (SfM)-based methods inadequate and prone to reconstruction failure. To mitigate these constraints, we propose SurGSplat, a novel paradigm designed to progressively refine 3D Gaussian Splatting (3DGS) through the integration of geometric constraints. By enabling the detailed reconstruction of vascular structures and other critical features, SurGSplat provides surgeons with enhanced visual clarity, facilitating precise intraoperative decision-making. Experimental evaluations demonstrate that SurGSplat achieves superior performance in both novel view synthesis (NVS) and pose estimation accuracy, establishing it as a high-fidelity and efficient solution for surgical scene reconstruction. More information and results can be found on the page https://surgsplat.github.io/.

Authors:Yixuan Zhu, Haolin Wang, Shilin Ma, Wenliang Zhao, Yansong Tang, Lei Chen, Jie Zhou
Title: FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
Abstract:
Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component's specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively. Code is available at https://github.com/EternalEvan/FADE .
中文摘要:FADE是一种无需训练的视频编辑方法,通过频域感知分解和频谱引导调制技术,充分利用预训练视频扩散模型的内在先验,在保持时空连贯性的同时实现高效高质量的视频编辑。
English Summary: FADE is a training-free video editing method that leverages pre-trained video diffusion models through frequency-aware factorization and spectrum-guided modulation to achieve high-quality, temporally coherent edits without heavy computational costs.

Authors:Xin Zhang, Dongdong Meng, Sheng Li
Title: Query Nearby: Offset-Adjusted Mask2Former enhances small-organ segmentation
Abstract:
Medical segmentation plays an important role in clinical applications like radiation therapy and surgical guidance, but acquiring clinically acceptable results is difficult. In recent years, progress has been witnessed with the success of utilizing transformer-like models, such as combining the attention mechanism with CNN. In particular, transformer-based segmentation models can extract global information more effectively, compensating for the drawbacks of CNN modules that focus on local features. However, utilizing transformer architecture is not easy, because training transformer-based models can be resource-demanding. Moreover, due to the distinct characteristics in the medical field, especially when encountering mid-sized and small organs with compact regions, their results often seem unsatisfactory. For example, using ViT to segment medical images directly only gives a DSC of less than 50%, which is far lower than the clinically acceptable score of 80%. In this paper, we used Mask2Former with deformable attention to reduce computation and proposed offset adjustment strategies to encourage sampling points within the same organs during attention weights computation, thereby integrating compact foreground information better. Additionally, we utilized the 4th feature map in Mask2Former to provide a coarse location of organs, and employed an FCN-based auxiliary head to help train Mask2Former more quickly using Dice loss. We show that our model achieves SOTA (State-of-the-Art) performance on the HaNSeg and SegRap2023 datasets, especially on mid-sized and small organs. Our code is available at https://github.com/earis/Offsetadjustment_Background-location_Decoder_Mask2former.
Chinese: 本文提出了一种基于Mask2Former的医学图像分割模型,采用可变形注意力和偏移调整策略来更好地捕捉紧凑器官结构,在降低计算需求的同时,在基准数据集上实现了最先进的性能。
English: This paper introduces a Mask2Former-based medical image segmentation model with deformable attention and offset adjustment strategies to better capture compact organ structures, achieving state-of-the-art performance on benchmark datasets while reducing computational demands.

Authors:Yiheng Li, Yang Yang, Zichang Tan, Huan Liu, Weihua Chen, Xu Zhou, Zhen Lei
Title: Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation
Abstract:
To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained perception ability of forgery for DGM4. Two branches for image and text modalities are established, each of which contains two cascaded decoders, i.e., Contextual Consistency Decoder (CCD) and Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, the forgery-aware reasoning or aggregating is adopted to deeply seek forgery cues based on the consistency features. Extensive experiments on DGM4 datasets prove that CSCL achieves new state-of-the-art performance, especially for the results of grounding manipulated content. Codes and weights are available at https://github.com/liyih/CSCL.
Chinese Summary: 针对现有方法在检测多模态虚假新闻时难以感知细粒度伪造内容的问题,本文提出上下文语义一致性学习(CSCL)方法,通过双分支解码器分别捕捉模态内上下文一致性和跨模态语义一致性,在内容篡改检测与定位任务中取得了最优性能。
English Summary: To address the limitations of existing methods in detecting fine-grained inconsistencies in multimodal fake news, this paper introduces a Contextual-Semantic Consistency Learning (CSCL) approach that employs dual decoders to capture both within-modality contextual and cross-modality semantic consistencies, achieving state-of-the-art performance in manipulation detection and localization.

Authors:Yuqian Fu, Runze Wang, Yanwei Fu, Danda Pani Paudel, Luc Van Gool
Title: Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025
Abstract:
In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhances object localization by leveraging both visual masks and textual descriptions as segmentation conditions. Furthermore, to address the visual domain gap between ego and exo views, we introduce a cross-view object alignment module that enforces object-level consistency across perspectives, thereby improving the model's robustness to viewpoint changes. Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark. Code will be made available at https://github.com/lovelyqian/ObjectRelator.
中文: 本报告针对2025年Ego-Exo4D对应挑战赛提出了一种跨视角多模态物体分割方法,通过多模态条件融合模块和跨视角物体对齐模块提升物体定位与一致性,在基准测试中排名第二。
English: This report introduces a cross-view multi-modal object segmentation method for the Ego-Exo4D Correspondence Challenges 2025, featuring a multimodal condition fusion module and a cross-view object alignment module to enhance object localization and consistency, achieving second place on the benchmark leaderboard.

Authors:Quansong He, Xiangde Min, Kaishen Wang, Tao He
Title: FuseUNet: A Multi-Scale Feature Fusion Method for U-like Networks
Abstract:
Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. A typical component of the UNet family is the skip connection; however, skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multistep method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of the encoder and decoder architectures, making it adaptable to various U-Net-like networks. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code is available at https://github.com/nayutayuki/FuseUNet.
中文摘要:本研究提出了一种新颖的多尺度特征融合方法,将UNet解码过程重新定义为求解初值问题,通过自适应常微分方程克服跳跃连接的局限性,同时保持与多种架构的兼容性。
English Summary: The study introduces a novel multi-scale feature fusion method that redefines UNet's decoding process as solving an initial value problem, overcoming limitations in skip connections through adaptive ordinary differential equations while maintaining compatibility with various architectures.
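A toy sketch of the initial-value-problem view described above: decoder states are advanced with a two-step Adams-Bashforth update, where the "derivative" at each stage is a learned function of the matching skip feature. Shapes are kept identical for clarity, so the real multi-scale alignment that FuseUNet handles is omitted here.

```python
# Decoder stages as steps of a linear multistep ODE solver (illustrative only).
import torch
import torch.nn as nn

class ToyMultistepDecoder(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.deriv = nn.Conv2d(2 * channels, channels, 3, padding=1)  # f(x, skip)

    def f(self, x, skip):
        return self.deriv(torch.cat([x, skip], dim=1))

    def forward(self, x0, skips, h=1.0):
        # skips: list of same-shaped skip features, coarsest first
        f_prev = self.f(x0, skips[0])
        x = x0 + h * f_prev                               # Euler start-up step
        for skip in skips[1:]:
            f_curr = self.f(x, skip)
            x = x + h * (1.5 * f_curr - 0.5 * f_prev)     # two-step Adams-Bashforth
            f_prev = f_curr
        return x

dec = ToyMultistepDecoder()
x0 = torch.randn(1, 32, 16, 16)
skips = [torch.randn(1, 32, 16, 16) for _ in range(3)]
print(dec(x0, skips).shape)
```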

Authors:Ziwei Zhao, Zhixing Zhang, Yuhang Liu, Zhao Zhang, Haojun Yu, Dong Wang, Liwei Wang
Title: DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image
Abstract:
In the field of 3D medical imaging, accurately extracting and representing the blood vessels with curvilinear structures holds paramount importance for clinical diagnosis. Previous methods have commonly relied on discrete representations such as masks, often resulting in local fractures or scattered fragments due to the inherent limitations of the per-pixel classification paradigm. In this work, we introduce DeformCL, a new continuous representation based on Deformable Centerlines, where centerline points act as nodes connected by edges that capture spatial relationships. Compared with previous representations, DeformCL offers three key advantages: natural connectivity, noise robustness, and interaction facility. We present a comprehensive training pipeline structured in a cascaded manner to fully exploit these favorable properties of DeformCL. Extensive experiments on four 3D vessel segmentation datasets demonstrate the effectiveness and superiority of our method. Furthermore, the visualization of curved planar reformation images validates the clinical significance of the proposed framework. We release the code at https://github.com/barry664/DeformCL
Chinese: DeformCL 提出了一种基于可变形中心线的连续表示方法,用于三维血管提取,具有更好的连通性、抗噪性和交互性,在多个数据集上验证了其优越性能和临床意义。
English: DeformCL introduces a continuous representation using deformable centerlines for 3D blood vessel extraction, offering enhanced connectivity, noise resistance, and interaction capabilities, validated by superior performance on multiple datasets and clinical relevance.

Authors:Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, Giovanni Maria Farinella
Title: EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs
Abstract:
We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.
中文: EASG-Bench是一个基于动态场景图构建问答对的以自我为中心视频基准测试,通过系统评估发现现有模型在时序理解方面存在明显差距,为长上下文视频理解领域指明了研究方向。
English: EASG-Bench is a new question-answering benchmark for egocentric videos that uses dynamic scene graphs to generate questions, revealing a performance gap in temporal understanding among current models and highlighting the need for improved long-context video comprehension.

Authors:Wei-Cheng Lin, Chih-Ming Lien, Chen Lo, Chia-Hung Yeh
Title: GazeNLQ @ Ego4D Natural Language Queries Challenge 2025
Abstract:
This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offers insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for gaze estimation directly from video. The estimated gaze is used to augment video representations within the proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at https://github.com/stevenlin510/GazeNLQ.
中文: 本报告提出GazeNLQ方法,通过利用视线数据增强视频表征,有效提升了自然语言查询对应视频片段的定位精度,在R1@IoU0.3和R1@IoU0.5指标上分别达到27.82和18.68的得分。
English: This report introduces GazeNLQ, a novel method that utilizes gaze data to enhance video representation and improve the accuracy of retrieving video segments corresponding to natural language queries, achieving R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively.

Authors:Taiga Shinozaki, Tomoki Doi, Amane Watahiki, Satoshi Nishida, Hitomi Yanaka
Title: Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?
Abstract:
Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, studies often have used non-abstract images and have not distinguished actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have the same actual and apparent features even though they look illusory due to the similar geometric configuration. We evaluate the performance of LVLMs for genuine and fake illusion VQA tasks and investigate whether the models discern actual and apparent features. Our findings indicate that although LVLMs may appear to recognize illusions by correctly answering questions about both feature types, they predict the same answers for both Genuine Illusion and Fake Illusion VQA questions. This suggests that their responses might be based on prior knowledge of illusions rather than genuine visual understanding. The dataset is available at https://github.com/ynklab/FILM
Chinese: 本研究通过引入新型数据集评估大型视觉语言模型对视觉错觉的真实感知能力,发现尽管模型能正确回答问题,但其反应可能基于先验知识而非真正的视觉理解。
English: This study introduces a dataset to assess whether large vision language models genuinely perceive visual illusions or rely on prior knowledge, revealing that their responses may not reflect true visual understanding despite correct answers.

Authors:Dumindu Tissera, Omar Awadallah, Muhammad Umair Danish, Ayan Sadhu, Katarina Grolinger
Title: Any-Class Presence Likelihood for Robust Multi-Label Classification with Abundant Negative Data
Abstract:
Multi-label Classification (MLC) assigns an instance to one or more non-exclusive classes. A challenge arises when the dataset contains a large proportion of instances with no assigned class, referred to as negative data, which can overwhelm the learning process and hinder the accurate identification and classification of positive instances. Nevertheless, it is common in MLC applications such as industrial defect detection, agricultural disease identification, and healthcare diagnosis to encounter large amounts of negative data. Assigning a separate negative class to these instances further complicates the learning objective and introduces unnecessary redundancies. To address this challenge, we redesign standard MLC loss functions by deriving a likelihood of any class being present, formulated by a normalized weighted geometric mean of the predicted class probabilities. We introduce a regularization parameter that controls the relative contribution of the absent class probabilities to the any-class presence likelihood in positive instances. The any-class presence likelihood complements the multi-label learning by encouraging the network to become more aware of implicit positive instances and improve the label classification within those positive instances. Experiments on large-scale datasets with negative data: SewerML, modified COCO, and ChestX-ray14, across various networks and base loss functions show that our loss functions consistently improve MLC performance of their standard loss counterparts, achieving gains of up to 6.01 percentage points in F1, 8.06 in F2, and 3.11 in mean average precision, all without additional parameters or computational complexity. Code available at: https://github.com/ML-for-Sensor-Data-Western/gmean-mlc
中文摘要:本研究通过归一化加权几何均值重构多标签分类损失函数,有效处理负数据,在不增加参数或计算复杂度的情况下提升了分类性能。
English Summary: This study redesigns standard multi-label classification loss functions using a normalized weighted geometric mean to handle negative data, improving performance without extra parameters or computational cost.
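A toy sketch of the any-class presence likelihood described above: a normalized weighted geometric mean of the predicted class probabilities in which a parameter lambda down-weights the absent classes of a positive instance. How this term is combined with the base loss is an assumption for illustration; see the paper and repository for the exact formulation.

```python
# Normalized weighted geometric mean as an "any class present" score (sketch).
import torch

def any_class_presence(probs, labels, lam=0.1, eps=1e-8):
    """probs, labels: (B, C); returns (B,) weighted geometric mean of probs."""
    w = labels + lam * (1.0 - labels)                  # present classes weight 1, absent lam
    w = w / w.sum(dim=1, keepdim=True)                 # normalize the weights
    return torch.exp((w * torch.log(probs.clamp_min(eps))).sum(dim=1))

probs = torch.sigmoid(torch.randn(4, 6))
labels = torch.randint(0, 2, (4, 6)).float()
g = any_class_presence(probs, labels)
is_positive = labels.sum(dim=1) > 0                    # negative data carries no labels at all
# Illustrative auxiliary term: push g up for positive instances, down for negative ones.
aux = torch.where(is_positive, -torch.log(g + 1e-8), -torch.log(1 - g + 1e-8))
print(g, aux.mean())
```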

Authors:Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, Katerina Fragkiadaki
Title: PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
Abstract:
We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.

Authors:Ziyue Zhu, Shenlong Wang, Jin Xie, Jiang-jiang Liu, Jingdong Wang, Jian Yang
Title: VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction
Abstract:
Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. To address them, we propose a novel regularization framework called VoxelSplat. This framework leverages recent developments in 3D Gaussian Splatting to enhance model performance in two key ways: (i) Enhanced Semantics Supervision through 2D Projection: During training, our method decodes sparse semantic 3D Gaussians from 3D representations and projects them onto the 2D camera view. This provides additional supervision signals in the camera-visible space, allowing 2D labels to improve the learning of 3D semantics. (ii) Scene Flow Learning: Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner using the labels of adjacent frames. Our method can be seamlessly integrated into various existing occupancy models, enhancing performance without increasing inference time. Extensive experiments on benchmark datasets demonstrate the effectiveness of VoxelSplat in improving the accuracy of both semantic occupancy and scene flow estimation. The project page and codes are available at https://zzy816.github.io/VoxelSplat-Demo/.
中文摘要:VoxelSplat框架通过3D高斯泼溅技术,利用2D投影监督增强3D语义学习,并以自监督方式实现场景流估计,在不增加推理时间的情况下显著提升了占据预测的准确性。
English Summary: The VoxelSplat framework enhances camera-based occupancy prediction by leveraging 3D Gaussian Splatting to improve 3D semantic learning through 2D projection supervision and enable self-supervised scene flow estimation, boosting performance without added inference cost.

Authors:Jie Cai, Kangning Yang, Ling Ouyang, Lan Fu, Jiaming Ding, Jinglin Shen, Zibo Meng
Title: OpenRR-5k: A Large-Scale Benchmark for Reflection Removal in the Wild
Abstract:
Removing reflections is a crucial task in computer vision, with significant applications in photography and image enhancement. Nevertheless, existing methods are constrained by the absence of large-scale, high-quality, and diverse datasets. In this paper, we present a novel benchmark for Single Image Reflection Removal (SIRR). We have developed a large-scale dataset containing 5,300 high-quality, pixel-aligned image pairs, each consisting of a reflection image and its corresponding clean version. Specifically, the dataset is divided into two parts: 5,000 images are used for training, and 300 images are used for validation. Additionally, we have included 100 real-world testing images without ground truth (GT) to further evaluate the practical performance of reflection removal methods. All image pairs are precisely aligned at the pixel level to guarantee accurate supervision. The dataset encompasses a broad spectrum of real-world scenarios, featuring various lighting conditions, object types, and reflection patterns, and is segmented into training, validation, and test sets to facilitate thorough evaluation. To validate the usefulness of our dataset, we train a U-Net-based model and evaluate it using five widely-used metrics, including PSNR, SSIM, LPIPS, DISTS, and NIQE. We will release both the dataset and the code on https://github.com/caijie0620/OpenRR-5k to facilitate future research in this field.
中文: 本文提出了一个包含5,300组像素级对齐图像对的大规模单图像反射去除基准数据集,解决了现有方法因缺乏高质量数据而受限的问题,并通过多种评估指标验证了其有效性。
English: This paper introduces a large-scale benchmark dataset of 5,300 pixel-aligned image pairs for Single Image Reflection Removal, addressing the limitations of existing methods and enabling comprehensive evaluation across diverse real-world scenarios.

Authors:Alex Costanzino, Woody Bayliss, Juil Sock, Marc Gorriz Blanch, Danijela Horak, Ivan Laptev, Philip Torr, Fabio Pizzati
Title: Towards Reliable Identification of Diffusion-based Image Manipulations
Abstract:
Changing facial expressions, gestures, or background details may dramatically alter the meaning conveyed by an image. Notably, recent advances in diffusion models greatly improve the quality of image manipulation while also opening the door to misuse. Identifying changes made to authentic images, thus, becomes an important task, constantly challenged by new diffusion-based editing tools. To this end, we propose a novel approach for ReliAble iDentification of inpainted AReas (RADAR). RADAR builds on existing foundation models and combines features from different image modalities. It also incorporates an auxiliary contrastive loss that helps to isolate manipulated image patches. We demonstrate these techniques to significantly improve both the accuracy of our method and its generalisation to a large number of diffusion models. To support realistic evaluation, we further introduce BBC-PAIR, a new comprehensive benchmark, with images tampered by 28 diffusion models. Our experiments show that RADAR achieves excellent results, outperforming the state-of-the-art in detecting and localising image edits made by both seen and unseen diffusion models. Our code, data and models will be publicly available at https://alex-costanzino.github.io/radar/.

Authors:Kimberley M. Bird, Xujiong Ye, Alan M. Race, James M. Brown
Title: Deep histological synthesis from mass spectrometry imaging for multimodal registration
Abstract:
Registration of histological and mass spectrometry imaging (MSI) allows for more precise identification of structural changes and chemical interactions in tissue. With histology and MSI having entirely different image formation processes and dimensionalities, registration of the two modalities remains an ongoing challenge. This work proposes a solution that synthesises histological images from MSI, using a pix2pix model, to effectively enable unimodal registration. Preliminary results show promising synthetic histology images with limited artifacts, achieving increases in mutual information (MI) and structural similarity index measures (SSIM) of +0.924 and +0.419, respectively, compared to a baseline U-Net model. Our source code is available on GitHub: https://github.com/kimberley/MIUA2025.
中文: 本研究提出了一种基于pix2pix的方法,可从质谱成像数据合成组织学图像,实现有效的单模态配准,并在互信息和结构相似性指标上展现出优于基准模型的性能提升。
English: This study introduces a pix2pix-based method to synthesize histological images from mass spectrometry imaging, enabling effective unimodal registration and demonstrating improved mutual information and structural similarity over baseline models.

Authors:Ludovic Arnould, Salim Khazem, Hugues Ali Mehenni
Title: BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models
Abstract:
Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at https://github.com/byoeval/BYO-EVAL.
Chinese: 该摘要提出了一种新的视觉语言模型评估方法,通过程序生成合成图像来系统性地测试并精确识别感知缺陷,超越了依赖标注真实图像和总体准确率的传统基准测试。
English: The abstract proposes a new diagnostic evaluation method for Visual Language Models (VLMs) using procedurally generated synthetic images to systematically test and precisely identify perception failures, moving beyond traditional benchmarks that rely on annotated real images and aggregate accuracy scores.
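A minimal sketch of the procedural-generation idea: render images that vary only in the attribute under test (here, the number of circles in a counting task) while other visual parameters stay fixed, so each image comes with an automatically known ground-truth answer. The layout and styling below are arbitrary choices, not those of BYO-Eval.

```python
# Procedurally generated counting images with known ground truth (sketch).
from PIL import Image, ImageDraw
import random

def counting_sample(n_objects, size=256, radius=12, seed=0):
    random.seed(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(n_objects):
        x = random.randint(radius, size - radius)
        y = random.randint(radius, size - radius)
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill="black")
    question = "How many circles are in the image?"
    return img, question, n_objects                    # image, prompt, ground truth

# Gradually harder variants of the same diagnostic: only the count changes.
dataset = [counting_sample(n, seed=n) for n in range(1, 11)]
img, question, answer = dataset[4]
print(question, answer, img.size)
```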

Authors:Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng
Title: SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
Abstract:
The rich and multifaceted nature of human social interaction, encompassing multimodal cues, unobservable relations and mental states, and dynamical behavior, presents a formidable challenge for artificial intelligence. To advance research in this area, we introduce SIV-Bench, a novel video benchmark for rigorously evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It is originally collected from TikTok and YouTube, covering a wide range of video genres, presentation styles, and linguistic and cultural backgrounds. It also includes a dedicated setup for analyzing the impact of different textual cues-original on-screen text, added dialogue, or no text. Our comprehensive experiments on leading MLLMs reveal that while models adeptly handle SSU, they significantly struggle with SSR and SDP, where Relation Inference (RI) is an acute bottleneck, as further examined in our analysis. Our study also confirms the critical role of transcribed dialogue in aiding comprehension of complex social interactions. By systematically identifying current MLLMs' strengths and limitations, SIV-Bench offers crucial insights to steer the development of more socially intelligent AI. The dataset and code are available at https://kfq20.github.io/sivbench/.

Authors:Seunghwan Shin, Yusung Kim
Title: Self-supervised One-Stage Learning for RF-based Multi-Person Pose Estimation
Abstract:
In the field of Multi-Person Pose Estimation (MPPE), Radio Frequency (RF)-based methods can operate effectively regardless of lighting conditions and obscured line-of-sight situations. Existing RF-based MPPE methods typically involve either 1) converting RF signals into heatmap images through complex preprocessing, or 2) applying a deep embedding network directly to raw RF signals. The first approach, while delivering decent performance, is computationally intensive and time-consuming. The second method, though simpler in preprocessing, results in lower MPPE accuracy and generalization performance. This paper proposes an efficient and lightweight one-stage MPPE model based on raw RF signals. By sub-grouping RF signals and embedding them using a shared single-layer CNN followed by multi-head attention, this model outperforms previous methods that embed all signals at once through a large and deep CNN. Additionally, we propose a new self-supervised learning (SSL) method that takes inputs from both one unmasked subgroup and the remaining masked subgroups to predict the latent representations of the masked data. Empirical results demonstrate that our model improves MPPE accuracy by up to 15 points in PCKh@0.5 compared to previous methods using raw RF signals. In particular, the proposed SSL method significantly enhances performance when the RF antennas are placed in new locations or in front of obstacles, with greater gains as the number of people increases. Our code and dataset are available on GitHub: https://github.com/sshnan7/SOSPE.
中文: 本文提出了一种基于原始射频信号的轻量级单阶段多人姿态估计模型,通过分组信号并采用共享CNN与多头注意力机制,在精度和泛化性能上优于现有方法,同时结合自监督学习有效提升了在陌生环境和存在障碍物时的表现。
English: This paper introduces a lightweight one-stage model for multi-person pose estimation using raw RF signals, which groups signals and employs a shared CNN with multi-head attention to achieve higher accuracy and better generalization than previous methods, alongside a self-supervised learning approach that enhances performance in new environments and with obstacles.
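A schematic sketch of the sub-grouping idea: raw RF signals are split into subgroups, each embedded with a shared single-layer CNN, and the resulting subgroup tokens interact through multi-head attention. Shapes and the pooling are illustrative assumptions, and the masked-subgroup SSL objective is only indicated in a comment, not implemented as in the released SOSPE code.

```python
# Shared single-layer CNN embedding of RF subgroups + multi-head attention (sketch).
import torch
import torch.nn as nn

class SubgroupEncoder(nn.Module):
    def __init__(self, in_ch=4, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Conv1d(in_ch, dim, kernel_size=7, padding=3)   # shared 1-layer CNN
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rf):                      # rf: (B, G, C, T) raw RF subgroups
        B, G, C, T = rf.shape
        tok = self.embed(rf.flatten(0, 1)).mean(dim=-1)    # (B*G, dim), pooled over time
        tok = tok.view(B, G, -1)
        out, _ = self.attn(tok, tok, tok)                  # subgroup interaction
        return out                                         # (B, G, dim) tokens for pose heads

enc = SubgroupEncoder()
rf = torch.randn(2, 8, 4, 128)
tokens = enc(rf)
# SSL idea: feed one unmasked subgroup plus masked placeholders and train the model
# to predict the latent tokens of the masked subgroups (not shown here).
print(tokens.shape)
```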

Authors:Jeongsoo Ha, Kyungsoo Kim, Yusung Kim
Title: Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions
Abstract:
Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks with high-dimensional image observations. Although recent MBRL algorithms perform well on observations seen during training, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at https://github.com/JeongsooHa/DrG.git
中文摘要:基于模型的强化学习在视觉干扰下表现不佳,而本研究提出的自监督方法Dr. G通过双对比学习和循环状态逆动力学模型捕捉任务相关特征,在零样本泛化测试中显著提升了抗干扰能力和性能。
English Summary: Model-based reinforcement learning (MBRL) often struggles with visual distractions, but the proposed self-supervised method Dr. G uses dual contrastive learning and a recurrent state inverse dynamics model to enhance robustness, achieving significant performance improvements in zero-shot generalization tests.

Authors:Kyungsoo Kim, Jeongsoo Ha, Yusung Kim
Title: Self-Predictive Dynamics for Generalization of Vision-based Reinforcement Learning
Abstract:
Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, even in unseen observations after training. SPD uses weak and strong augmentations in parallel, and learns representations by predicting inverse and forward transitions across the two-way augmented versions. In a set of MuJoCo visual control tasks and an autonomous driving task (CARLA), SPD outperforms previous studies in complex observations, and significantly improves the generalization performance for unseen observations. Our code is available at https://github.com/unigary/SPD.
中文:自预测动态(SPD)方法通过双向增强预测状态转换,能有效提取含干扰图像中的任务相关特征,在视觉控制任务中表现优异,并显著提升对未见观察的泛化能力。
English: The Self-Predictive Dynamics (SPD) method effectively extracts task-relevant features from images with distractions by using dual augmentations to predict transitions, demonstrating superior performance in visual control tasks and enhanced generalization for unseen observations.
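A compact sketch of the SPD training signal as described above: encode weakly and strongly augmented views, predict the next latent from the current latent and action (forward dynamics) and the action from consecutive latents (inverse dynamics), tying the two augmented streams together through these predictions. The encoder, augmentations, and loss weighting are placeholders, not the released implementation.

```python
# Forward/inverse dynamics prediction across weak and strong augmentations (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))    # toy encoder
fwd = nn.Linear(64 + 2, 64)                                      # forward dynamics
inv = nn.Linear(64 + 64, 2)                                      # inverse dynamics

def spd_loss(obs, next_obs, action, weak, strong):
    z_w, z_s = enc(weak(obs)), enc(strong(obs))
    zn_w = enc(weak(next_obs))
    pred_next = fwd(torch.cat([z_s, action], dim=1))    # strong view predicts weak target
    pred_act = inv(torch.cat([z_w, zn_w], dim=1))       # recover the action between latents
    return F.mse_loss(pred_next, zn_w.detach()) + F.mse_loss(pred_act, action)

obs = torch.randn(8, 3, 32, 32)
next_obs = torch.randn(8, 3, 32, 32)
action = torch.randn(8, 2)
weak = lambda x: x + 0.01 * torch.randn_like(x)                  # stand-in augmentations
strong = lambda x: x + 0.10 * torch.randn_like(x)
print(spd_loss(obs, next_obs, action, weak, strong))
```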

Authors:Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman
Title: SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
Abstract:
3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench is comprised of thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.

Authors:Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo
Title: Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
Abstract:
Gaze-referential inference--the ability to infer what others are looking at--is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, comparing performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by considering them as random guessers. Instead, they likely use a combination of heuristics and guessing such that their performance is subject to the task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.

Authors:Luka Vetoshkin, Dmitry Yudin
Title: Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation
Abstract:
Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based on textual input. We evaluate our approach on three benchmarks: BIG, ThinObject5K, and DIS5K. Talk2SAM consistently outperforms SAM-HQ, achieving up to +5.9% IoU and +8.3% boundary IoU improvements. Our results demonstrate that incorporating natural language guidance provides a flexible and effective means for precise object segmentation, particularly in cases where traditional prompt-based methods fail. The source code is available on GitHub: https://github.com/richlukich/Talk2SAM
中文摘要:Talk2SAM通过引入文本引导改进了复杂形状物体的分割精度,相比SAM-HQ模型实现了显著的性能提升。
English Summary: Talk2SAM enhances object segmentation by integrating textual guidance to improve accuracy on complex shapes, achieving significant performance gains over existing models like SAM-HQ.

Authors:Abu Sufian, Marco Leo, Cosimo Distante, Anirudha Ghosh, Debaditya Barman
Title: Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?
Abstract:
Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a significant challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate different demographic faces while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features, as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. Then, we designed a novel few-shot prototype network with backbone feature embeddings. We also developed new demographic face image support and query datasets for this empirical study. The network's testing was conducted on this dataset in one-shot, three-shot, and five-shot scenarios to assess how performance improves as the size of the support set increases. We observed results across datasets with varying races/ethnicities, genders, and age groups. The Microsoft Swin Transformer backbone performed best among the three SOTA ViTs for this task. The code and data are available at: https://github.com/Sufianlab/FairVitBio.
中文摘要:本研究通过结合预训练模型的全局特征与局部特征,采用新型小样本原型网络评估了Vision Transformer和ResNet在不同人口群体间的公平人脸认证效果,其中微软Swin Transformer表现最佳。
English Summary: This study evaluates Vision Transformers and ResNet for fair biometric face authentication across demographic groups by combining global features from pre-trained models with local features through a novel few-shot prototype network, with the Microsoft Swin Transformer showing superior performance.
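
The few-shot prototype classification step can be sketched as follows: support embeddings (e.g., the concatenated ViT + ResNet features described above) are averaged per class into prototypes, and query embeddings are classified by their nearest prototype. The feature dimension, class count, and distance choice here are illustrative assumptions:

```python
# Minimal prototype-network sketch over fused backbone embeddings.
import torch

def prototype_logits(support: torch.Tensor, support_labels: torch.Tensor,
                     query: torch.Tensor, n_classes: int) -> torch.Tensor:
    # Average the support embeddings of each class into one prototype vector.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_classes)])  # (C, D)
    # Negative squared Euclidean distance serves as the classification logit.
    return -torch.cdist(query, prototypes).pow(2)

# Toy one-shot episode with 4 demographic classes and 512-d fused features.
support = torch.randn(4, 512)
labels = torch.arange(4)
query = torch.randn(10, 512)
pred = prototype_logits(support, labels, query, n_classes=4).argmax(dim=-1)
```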

Authors:George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, Judy Hoffman
Title: Contrastive Flow Matching
Abstract:
Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed--flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: https://github.com/gstoica27/DeltaFM.git.
Chinese: 对比流匹配通过强制不同条件流之间的独特性,显著提升了条件扩散模型的训练速度、减少了去噪步骤,并降低了FID分数。
English: Contrastive Flow Matching enhances conditional diffusion models by enforcing uniqueness across flows, significantly improving training speed, reducing denoising steps, and lowering FID scores compared to standard flow matching.
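
A hedged sketch of the idea: regress the model's predicted velocity toward the true flow of each sample pair while pushing it away from the flow of a mismatched (shuffled) pair. The loss weighting `lam` and the exact form of the dissimilarity term are my assumptions, not necessarily the paper's objective:

```python
# Sketch of a flow-matching loss with an added contrastive (repulsive) term.
import torch
import torch.nn.functional as F

def contrastive_flow_matching_loss(model, x0, x1, cond, lam: float = 0.05):
    t = torch.rand(x0.size(0), 1, device=x0.device)
    xt = (1 - t) * x0 + t * x1              # point on the straight path from source to target
    target_flow = x1 - x0                   # constant velocity of that path
    pred = model(xt, t, cond)
    fm = F.mse_loss(pred, target_flow)      # standard flow-matching regression
    # Contrastive term: discourage matching the flow of a shuffled (negative) pair.
    neg_flow = target_flow[torch.randperm(x0.size(0))]
    contrast = -F.mse_loss(pred, neg_flow)
    return fm + lam * contrast

model = lambda xt, t, cond: xt              # stand-in network, for illustration only
x0, x1 = torch.randn(8, 16), torch.randn(8, 16)
loss = contrastive_flow_matching_loss(model, x0, x1, cond=None)
```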

Authors:Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Shahbaz Khan
Title: VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
Abstract:
Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA
中文摘要:VideoMathQA是一个评估模型跨模态数学推理能力的基准,通过整合视频中的视觉、音频和文本信息,涵盖10个领域,采用专家标注的多步骤问题来应对现实场景中的数学推理挑战。
English Summary: VideoMathQA is a benchmark designed to evaluate models' ability to perform cross-modal mathematical reasoning by integrating visual, audio, and textual information from videos across 10 domains, addressing real-world challenges through expert-annotated questions that require multi-step problem-solving.

Authors:Yifan Wang, Peishan Yang, Zhen Xu, Jiaming Sun, Zhanhua Zhang, Yong Chen, Hujun Bao, Sida Peng, Xiaowei Zhou
Title: FreeTimeGS: Free Gaussian Primitives at Anytime and Anywhere for Dynamic Scene Reconstruction
Abstract:
This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in the canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation possesses strong flexibility, thus improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces the temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin. Project page: https://zju3dv.github.io/freetimegs/.
中文: 本文提出FreeTimeGS,一种创新的4D表示方法,允许高斯基元在任意时间和位置出现,增强了复杂运动动态3D场景的建模能力并减少时间冗余,在渲染质量上大幅超越现有方法。
English: This paper introduces FreeTimeGS, a novel 4D representation that enables Gaussian primitives to appear at any time and location, enhancing the modeling of dynamic 3D scenes with complex motions and reducing temporal redundancy, significantly outperforming recent methods in rendering quality.
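
A minimal sketch of the representation described above: each Gaussian primitive stores its own reference time, 3D center, and a simple motion function (here, constant velocity), with a temporal window gating its contribution at any query time. The parameter layout and Gaussian count are my assumptions, not the authors' implementation:

```python
# Sketch of time-dependent Gaussian primitives with a per-primitive motion function.
import torch

class FreeTimeGaussians:
    def __init__(self, n: int):
        self.center = torch.randn(n, 3)        # position at each primitive's own time t0
        self.velocity = torch.zeros(n, 3)      # simple linear motion function
        self.t0 = torch.rand(n)                # time at which the primitive "appears"
        self.sigma_t = torch.full((n,), 0.1)   # temporal extent of the primitive

    def at_time(self, t: float):
        pos = self.center + self.velocity * (t - self.t0).unsqueeze(-1)
        temporal_weight = torch.exp(-0.5 * ((t - self.t0) / self.sigma_t) ** 2)
        return pos, temporal_weight            # centers and per-primitive opacities at time t

g = FreeTimeGaussians(n=1024)
positions, weights = g.at_time(0.5)
```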

Authors:Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
Title: SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Abstract:
Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual information, SparseMM prioritizes and retains visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity in the efficiency test. Our project is open sourced at https://github.com/CR400AF-A/SparseMM.
Chinese: 本研究揭示了多模态大语言模型中仅少量注意力头对视觉处理至关重要,并提出了SparseMM这一无需训练的优化策略,通过集中计算资源于关键视觉头,在保持性能的同时显著提升推理速度。
English: This study reveals that only a small fraction of attention heads in multimodal large language models are crucial for visual processing and introduces SparseMM, a training-free optimization strategy that accelerates inference while preserving performance by focusing computation on these key visual heads.
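
The asymmetric budget idea can be sketched as follows: given a per-head visual score (already measured, as described above), allocate more retained KV-cache tokens to heads with higher scores. The proportional allocation rule below is an assumption, not necessarily SparseMM's exact policy:

```python
# Sketch of score-proportional KV-cache budget allocation per attention head.
import torch

def allocate_kv_budget(visual_scores: torch.Tensor, total_budget: int, min_per_head: int = 4):
    scores = visual_scores.clamp(min=0)
    weights = scores / scores.sum()
    # Every head keeps a small floor budget; the rest is split by visual score.
    budget = (weights * (total_budget - min_per_head * len(scores))).floor().long() + min_per_head
    return budget  # number of cached tokens kept for each head

scores = torch.tensor([0.01, 0.02, 0.90, 0.03, 0.85, 0.02])  # two "visual heads" stand out
per_head = allocate_kv_budget(scores, total_budget=512)
```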

Authors:Alan Dolhasz, Chen Ma, Dave Gausebeck, Kevin Chen, Gregor Miller, Lucas Hayne, Gunnar Hovden, Azwad Sabik, Olaf Brandt, Mira Slavcheva
Title: Defurnishing with X-Ray Vision: Joint Removal of Furniture from Panoramas and Mesh
Abstract:
We present a pipeline for generating defurnished replicas of indoor spaces represented as textured meshes and corresponding multi-view panoramic images. To achieve this, we first segment and remove furniture from the mesh representation, extend planes, and fill holes, obtaining a simplified defurnished mesh (SDM). This SDM acts as an "X-ray" of the scene's underlying structure, guiding the defurnishing process. We extract Canny edges from depth and normal images rendered from the SDM. We then use these as a guide to remove the furniture from panorama images via ControlNet inpainting. This control signal ensures the availability of global geometric information that may be hidden from a particular panoramic view by the furniture being removed. The inpainted panoramas are used to texture the mesh. We show that our approach produces higher quality assets than methods that rely on neural radiance fields, which tend to produce blurry low-resolution images, or RGB-D inpainting, which is highly susceptible to hallucinations.

Authors:Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Fahad Shahbaz Khan, Salman Khan
Title: VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Abstract:
Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo.
中文: VideoMolmo是一种新型多模态模型,通过结合时序注意力和掩码融合技术,显著提升了视频分析中的时空定位精度与推理能力。
English: VideoMolmo is a novel multimodal model that enhances spatio-temporal localization by integrating temporal attention and mask fusion for improved accuracy and reasoning in video analysis.

Authors:Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
Title: MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Abstract:
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shapes within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
中文: MINT-CoT通过自适应地将视觉标记嵌入文本推理步骤,显著提升了多模态数学推理能力,并在多个基准测试中取得了卓越的性能提升。
English: MINT-CoT enhances multimodal mathematical reasoning by adaptively interleaving visual tokens into textual reasoning steps, achieving significant performance improvements across benchmarks.

Authors:Duochao Shi, Weijie Wang, Donny Y. Chen, Zeyu Zhang, Jia-Wang Bian, Bohan Zhuang, Chunhua Shen
Title: Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting
Abstract:
Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering quality -- a well-known limitation of depth-based representations. To tackle this issue, we introduce PM-Loss, a novel regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap itself may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries. With the improved depth map, our method significantly improves the feed-forward 3DGS across various architectures and scenes, delivering consistently better rendering results. Our project page: https://aim-uofa.github.io/PMLoss
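
The regularizer can be pictured as follows: unproject the predicted depth into a per-pixel point cloud and penalize its distance to the pointmap predicted by a pre-trained model. Using a plain L1 distance and assuming both live in the same camera frame are simplifying assumptions on my part, not the paper's exact PM-Loss:

```python
# Sketch of a pointmap-based regularization on a predicted depth map.
import torch

def unproject(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """depth: (H, W), K: (3, 3) intrinsics -> (H, W, 3) camera-space points."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u.float(), v.float(), torch.ones(H, W)], dim=-1)   # homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T
    return rays * depth.unsqueeze(-1)

def pm_loss(pred_depth, pointmap, K):
    pts = unproject(pred_depth, K)
    return (pts - pointmap).abs().mean()       # L1 distance to the predicted pointmap

K = torch.tensor([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
depth = torch.rand(240, 320) * 5
pointmap = unproject(depth, K) + 0.01 * torch.randn(240, 320, 3)   # stand-in pointmap prediction
loss = pm_loss(depth, pointmap, K)
```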

Authors:Daniel Rho, Jun Myeong Choi, Biswadip Dey, Roni Sengupta
Title: ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation
Abstract:
Neural rendering has made significant strides in 3D reconstruction and novel view synthesis. With the integration of physics, it opens up new applications. The inverse problem of estimating physics from visual data, however, remains challenging, limiting its effectiveness for applications like physically accurate digital twin creation in robotics and XR. Existing methods that incorporate physics into neural rendering frameworks typically require dense multi-view videos as input, making them impractical for scalable, real-world use. When presented with sparse multi-view videos, the sequential optimization strategy used by existing approaches introduces significant error accumulation, e.g., poor initial 3D reconstruction leads to bad material parameter estimation in subsequent stages. Instead of sequential optimization, directly optimizing all parameters at the same time also fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually increases the set of jointly optimized parameters guided by their sensitivity, leading to fully joint optimization over geometry, appearance, physical state, and material property. Evaluations on PAC-NeRF and Spring-Gaus datasets show that ProJo4D outperforms prior work in 4D future state prediction, novel view rendering of future state, and material parameter estimation, demonstrating its effectiveness in physically grounded 4D scene understanding. For demos, please visit the project webpage: https://daniel03c1.github.io/ProJo4D/

Authors:Olaf Dünkel, Thomas Wimmer, Christian Theobalt, Christian Rupprecht, Adam Kortylewski
Title: Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels
Abstract:
Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose improving semantic correspondence estimation through 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we establish a new state-of-the-art on SPair-71k, achieving an absolute gain of over 4% and of over 7% compared to methods with similar supervision requirements. The generality of our proposed approach simplifies the extension of training to other data sources, which we demonstrate in our experiments.
Chinese: 本研究提出一种3D感知的伪标签方法,通过利用经过筛选的3D衍生标签优化预训练视觉特征来改进语义对应关系,在SPair-71k数据集上以更少标注需求实现了最先进的性能。
English: This study introduces a 3D-aware pseudo-labeling method that enhances semantic correspondence by refining pre-trained vision features with filtered 3D-derived labels, achieving state-of-the-art performance on SPair-71k with reduced annotation needs.

Authors:Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang
Title: SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
Abstract:
Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed as SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.
Chinese: SeedVR2是一种基于扩散的单步视频修复模型,通过自适应窗口注意力机制和优化的训练损失函数,能够高效处理高分辨率视频并实现优越性能。
English: SeedVR2 is a one-step diffusion-based video restoration model that introduces adaptive window attention and enhanced training losses to efficiently handle high-resolution videos while achieving competitive performance.

Authors:Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Title: AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Abstract:
Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders the effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
Chinese: 本文提出对齐分词器(AliTok),通过因果解码器建立图像令牌间的单向依赖关系,与自回归模型保持一致,在实现优异重建效果的同时大幅提升了生成速度。
English: This paper introduces Aligned Tokenizer (AliTok), which uses a causal decoder to create unidirectional dependencies in image tokens, aligning with autoregressive models and achieving superior reconstruction and generation performance with faster sampling speeds.

Authors:Tao Sun, Liyuan Zhu, Shengyu Huang, Shuran Song, Iro Armeni
Title: Rectified Point Flow: Generic Point Cloud Pose Estimation
Abstract:
We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: https://rectified-pointflow.github.io/.
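
A minimal sketch of a point-wise velocity-field objective in this spirit: sample a time, interpolate between noise points and the target (posed) points, and regress the straight-path velocity. Conditioning on the unposed input cloud is abstracted into `cond`, and all names are mine rather than the released code:

```python
# Sketch of training a point-wise velocity field that transports noisy points
# toward their target positions (rectified-flow-style straight-path target).
import torch
import torch.nn.functional as F

def point_flow_loss(velocity_net, noise_pts, target_pts, cond):
    B = noise_pts.size(0)
    t = torch.rand(B, 1, 1, device=noise_pts.device)
    xt = (1 - t) * noise_pts + t * target_pts
    v_target = target_pts - noise_pts            # velocity of the straight transport path
    v_pred = velocity_net(xt, t, cond)
    return F.mse_loss(v_pred, v_target)

velocity_net = lambda xt, t, cond: torch.zeros_like(xt)   # placeholder network
noise = torch.randn(2, 1024, 3)                            # noisy points, batch of 2 clouds
target = torch.randn(2, 1024, 3)                           # target (posed) point positions
loss = point_flow_loss(velocity_net, noise, target, cond=None)
```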

Authors:Nan Wang, Yuantao Chen, Lixing Xiao, Weiqing Xiao, Bohan Li, Zhaoxi Chen, Chongjie Ye, Shaocong Xu, Saining Zhang, Ziyang Yan, Pierre Merriaux, Lei Lei, Tianfan Xue, Hao Zhao
Title: Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting
Abstract:
Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.
中文摘要:本文提出了一种多尺度双边网格,将外观编码与双边网格相结合,显著提升了动态自动驾驶场景重建的几何精度,通过有效减少光度不一致导致的漂浮物,在多个数据集上优于现有方法。
English Summary: This paper introduces a multi-scale bilateral grid that combines appearance codes and bilateral grids to enhance geometric accuracy in dynamic autonomous driving scene reconstruction, outperforming existing methods across multiple datasets by reducing floaters from photometric inconsistencies.

Authors:Xiaodong Wang, Jinfa Huang, Li Yuan, Peixi Peng
Title: LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs
Abstract:
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO, to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose Lean Preference Optimization (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
中文摘要:本文提出LeanPO方法,通过重新定义隐式奖励并采用自生成偏好数据和动态标签平滑策略,有效缓解视频大语言模型中非目标响应概率意外下降的问题,以最小训练开销显著提升模型性能。
English Summary: This paper introduces LeanPO, a reference-free preference optimization method that addresses the unintended probability drop in Video-LLMs by redefining implicit rewards and using self-generated preference data with dynamic label smoothing, significantly improving model performance with minimal training overhead.
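
The reference-free reward described above can be sketched as the length-normalized (average) token log-likelihood of a response under the policy, compared between winning and losing responses with a logistic preference loss. The margin scaling and the omission of label smoothing are my simplifications:

```python
# Sketch of an average-likelihood implicit reward plugged into a preference loss.
import torch
import torch.nn.functional as F

def avg_loglik(logits: torch.Tensor, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V), tokens: (B, T), mask: (B, T) with 1 on response tokens."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)     # (B, T) per-token log-probs
    return (tok_logp * mask).sum(dim=-1) / mask.sum(dim=-1)          # length-normalized reward

def preference_loss(reward_w: torch.Tensor, reward_l: torch.Tensor, beta: float = 1.0):
    return -F.logsigmoid(beta * (reward_w - reward_l)).mean()

B, T, V = 4, 12, 100
logits_w, logits_l = torch.randn(B, T, V), torch.randn(B, T, V)
tokens = torch.randint(0, V, (B, T))
mask = torch.ones(B, T)
loss = preference_loss(avg_loglik(logits_w, tokens, mask), avg_loglik(logits_l, tokens, mask))
```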

Authors:Jianghao Wu, Yicheng Wu, Yutong Xie, Wenjia Bai, You Zhang, Feilong Tang, Yulong Li, Yasmeen George, Imran Razzak
Title: SAM-aware Test-time Adaptation for Universal Medical Image Segmentation
Abstract:
Universal medical image segmentation using the Segment Anything Model (SAM) remains challenging due to its limited adaptability to medical domains. Existing adaptations, such as MedSAM, enhance SAM's performance in medical imaging but at the cost of reduced generalization to unseen data. Therefore, in this paper, we propose SAM-aware Test-Time Adaptation (SAM-TTA), a fundamentally different pipeline that preserves the generalization of SAM while improving its segmentation performance in medical imaging via a test-time framework. SAM-TTA tackles two key challenges: (1) input-level discrepancies caused by differences in image acquisition between natural and medical images and (2) semantic-level discrepancies due to fundamental differences in object definition between natural and medical domains (e.g., clear boundaries vs. ambiguous structures). Specifically, our SAM-TTA framework comprises (1) Self-adaptive Bezier Curve-based Transformation (SBCT), which adaptively converts single-channel medical images into three-channel SAM-compatible inputs while maintaining structural integrity, to mitigate the input gap between medical and natural images, and (2) Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which employs consistency learning to align SAM's internal representations to medical semantics, enabling efficient adaptation without auxiliary supervision or expensive retraining. Extensive experiments on five public datasets demonstrate that our SAM-TTA outperforms existing TTA approaches and even surpasses fully fine-tuned models such as MedSAM in certain scenarios, establishing a new paradigm for universal medical image segmentation. Code can be found at https://github.com/JianghaoWu/SAM-TTA.
Chinese Summary: 针对通用医学图像分割中SAM模型适应性不足的问题,本文提出SAM-TTA测试时适应框架,通过自适应图像转换和双尺度一致性学习,在保持模型泛化能力的同时显著提升医学图像分割性能。
English Summary: The Segment Anything Model (SAM) struggles with medical image segmentation due to domain gaps, but the proposed SAM-TTA framework overcomes this by adaptively transforming inputs and aligning representations at test time, achieving superior performance without compromising generalization.
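
The input-level transformation can be pictured as mapping single-channel intensities through monotonic Bezier-style curves to build a three-channel, SAM-compatible input. SBCT adapts such curves per image; the fixed control points and channel scheme below are purely illustrative assumptions:

```python
# Sketch of a Bezier-curve intensity remapping from one channel to three channels.
import numpy as np

def bezier_curve(t: np.ndarray, p1: float, p2: float) -> np.ndarray:
    """Cubic Bezier value curve on [0, 1] with endpoints 0 and 1 and control values p1, p2."""
    return 3 * (1 - t) ** 2 * t * p1 + 3 * (1 - t) * t ** 2 * p2 + t ** 3

def to_three_channels(img: np.ndarray) -> np.ndarray:
    x = (img - img.min()) / (img.max() - img.min() + 1e-8)   # normalize intensities to [0, 1]
    curves = [(0.2, 0.8), (0.4, 0.6), (0.6, 0.9)]            # assumed per-channel control points
    channels = [bezier_curve(x, p1, p2) for p1, p2 in curves]
    return np.stack(channels, axis=-1)                        # (H, W, 3) SAM-style input

ct_slice = np.random.rand(512, 512).astype(np.float32)
rgb_like = to_three_channels(ct_slice)
```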

Authors:Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai
Title: MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Abstract:
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.
Chinese: MonkeyOCR采用创新的结构-识别-关系三元范式,将文档解析分解为布局分析、内容识别和逻辑排序,以仅30亿参数的紧凑模型在精度和速度上超越更大模型,实现了最先进的性能。
English: MonkeyOCR introduces a novel Structure-Recognition-Relation triplet paradigm that simplifies document parsing by decomposing it into layout analysis, content identification, and logical ordering, achieving state-of-the-art performance with a compact 3B-parameter model that surpasses larger models in both accuracy and speed.

Authors:Yani Zhang, Dongming Wu, Hao Shi, Yingfei Liu, Tiancai Wang, Haoqiang Fan, Xingping Dong
Title: Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding
Abstract:
Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods typically follow a two-stage paradigm where a trained 3D detector's optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models without any instruction-specific training outperform the grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding may not be well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as object representations for both DEtection and Grounding and enables the grounding to benefit from basic category classification and box detection. Based on this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening the context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at https://github.com/zyn213/DEGround.
中文: 本研究发现在具身三维物体定位中,检测模型无需专门训练即可超越语言指导的定位模型,据此提出DEGround框架,通过共享检测与定位的查询表示及语言增强模块,在验证集上以7.52%的准确率优势刷新了最佳性能。
English: This study reveals that detection models can outperform specialized grounding models in embodied 3D object localization, leading to the development of DEGround—a unified framework that enhances grounding through shared object queries and language-aware modules, achieving a 7.52% accuracy improvement over state-of-the-art methods.

Authors:Yake Wei, Yu Miao, Dongzhan Zhou, Di Hu
Title: MokA: Multimodal Low-Rank Adaptation for MLLMs
Abstract:
In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluations are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.
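
The unimodal-plus-cross-modal structure can be sketched as modality-specific low-rank adapters followed by an explicit cross-modal interaction step. The layer layout below (one A/B pair per modality, text attending over the other adapted modalities) is an illustrative assumption consistent with the abstract, not the released MokA code:

```python
# Sketch of modality-specific low-rank adaptation plus a cross-modal attention step.
import torch
import torch.nn as nn

class MokALayer(nn.Module):
    def __init__(self, dim: int, rank: int, modalities=("audio", "visual", "text")):
        super().__init__()
        self.down = nn.ModuleDict({m: nn.Linear(dim, rank, bias=False) for m in modalities})
        self.up = nn.ModuleDict({m: nn.Linear(rank, dim, bias=False) for m in modalities})
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens: dict) -> dict:
        # Unimodal adaptation: each modality gets its own low-rank update.
        adapted = {m: x + self.up[m](self.down[m](x)) for m, x in tokens.items()}
        # Cross-modal adaptation: text tokens attend to the other adapted modalities.
        others = torch.cat([v for m, v in adapted.items() if m != "text"], dim=1)
        mixed, _ = self.cross(adapted["text"], others, others)
        adapted["text"] = adapted["text"] + mixed
        return adapted

layer = MokALayer(dim=256, rank=8)
out = layer({"audio": torch.randn(2, 10, 256),
             "visual": torch.randn(2, 20, 256),
             "text": torch.randn(2, 16, 256)})
```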

Authors:Weicheng Gao
Title: Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks
Abstract:
After a few years of research in the field of through-the-wall radar (TWR) human activity recognition (HAR), I found that we seem to be stuck in the mindset of training on radar image data through neural network models. The earliest related works in this field, based on template matching, did not require a training process, and I believe they have never died, because these methods possess strong physical interpretability and are closer to the foundations of theoretical signal processing research. In this paper, I return to this original path, eschewing neural networks for the TWR HAR task while aiming to match the intelligent recognition of neural network models. In detail, the range-time map and Doppler-time map of TWR are first generated. Then, the initial regions of the human target foreground and noise background on the maps are determined using a corner detection method, and the micro-Doppler signature is segmented using a multiphase active contour model. The micro-Doppler segmentation feature is discretized into a two-dimensional point cloud. Finally, the topological similarity between the resulting point cloud and the point clouds of the template data is calculated using the Mapper algorithm to obtain the recognition results. The effectiveness of the proposed method is demonstrated by numerically simulated and measured experiments. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Through-the-Wall-Radar-Human-Activity-Recognition-Without-Using-Neural-Networks.
中文摘要:本文提出一种不使用神经网络的穿墙雷达人体活动识别方法,通过信号处理技术提取微多普勒特征,并利用拓扑相似性与模板数据进行比对实现智能识别。
English Summary: This paper proposes a neural network-free approach for through-the-wall radar human activity recognition by extracting micro-Doppler signatures through signal processing techniques and comparing them with template data using topological similarity analysis.

Authors:Benedikt Hopf, Radu Timofte
Title: Practical Manipulation Model for Robust Deepfake Detection
Abstract:
Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors' generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model's robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of 3.51% and 6.21% AUC on the DFDC and DFDCP datasets, respectively, over the state-of-the-art LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at https://github.com/BenediktHopf/PMM
中文摘要:本研究提出了一种实用篡改模型(PMM),通过扩展伪造技术并在训练中加入图像退化处理,显著提升了深度伪造检测的鲁棒性及在基准数据集上的性能表现。
English Summary: The study introduces a Practical Manipulation Model (PMM) that expands forgery techniques and incorporates image degradations during training, significantly boosting deepfake detection robustness and performance on benchmark datasets.
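
One of the pseudo-fake ingredients mentioned above, Poisson blending followed by a strong degradation, can be sketched with OpenCV. The elliptical mask, the choice of seamlessClone mode, and JPEG compression as the degradation are illustrative assumptions, not the authors' pipeline:

```python
# Sketch of building a pseudo-fake via Poisson blending, then degrading it.
import cv2
import numpy as np

def pseudo_fake(target: np.ndarray, donor: np.ndarray) -> np.ndarray:
    h, w = target.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.ellipse(mask, (w // 2, h // 2), (w // 4, h // 3), 0, 0, 360, 255, -1)  # rough blend mask
    blended = cv2.seamlessClone(donor, target, mask, (w // 2, h // 2), cv2.NORMAL_CLONE)
    ok, enc = cv2.imencode(".jpg", blended, [cv2.IMWRITE_JPEG_QUALITY, 35])    # heavy compression
    return cv2.imdecode(enc, cv2.IMREAD_COLOR)

target = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
donor = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
fake = pseudo_fake(target, donor)
```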

Authors:Zeming Wei, Yiwen Guo, Yisen Wang
Title: Identifying and Understanding Cross-Class Features in Adversarial Training
Abstract:
Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https://github.com/PKU-ML/Cross-Class-Features-AT.
中文: 本文通过分析跨类别特征提出了对抗训练的新视角,揭示了这类特征在实现鲁棒性中的关键作用,并通过类别特征归因解释了鲁棒过拟合等现象。
English: This paper introduces a novel perspective on adversarial training by analyzing cross-class features, revealing their crucial role in achieving robustness and explaining phenomena like robust overfitting through class-wise feature attribution.

Authors:Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
Abstract:
We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.
中文: ComfyUI-Copilot是一款基于大语言模型的插件,通过智能节点推荐和自动化工作流构建提升ComfyUI的易用性,经实践验证能有效降低新手门槛并提高工作效率。
English: ComfyUI-Copilot is an AI-powered plugin that enhances ComfyUI's usability by providing intelligent recommendations and automated workflow construction, validated through evaluations to lower entry barriers and boost efficiency.

Authors:Fuyi Zhang, Zhu Yu, Chunhao Li, Runmin Zhang, Xiaokai Bai, Zili Zhou, Si-Yuan Cao, Fang Wang, Hui-Liang Shen
Title: Structure-Aware Radar-Camera Depth Estimation
Abstract:
Radar has gained much attention in autonomous driving due to its accessibility and robustness. However, its standalone application for depth perception is constrained by issues of sparsity and noise. Radar-camera depth estimation offers a more promising complementary solution. Despite significant progress, current approaches fail to produce satisfactory dense depth maps, due to the unsatisfactory processing of the sparse and noisy radar data. They constrain the regions of interest for radar points to rigid rectangular regions, which may introduce unexpected errors and confusion. To address these issues, we develop a structure-aware strategy for radar depth enhancement, which provides more targeted regions of interest by leveraging the structural priors of RGB images. Furthermore, we design a Multi-Scale Structure Guided Network to enhance radar features and preserve detailed structures, achieving accurate and structure-detailed dense metric depth estimation. Building on these, we propose a structure-aware radar-camera depth estimation framework, named SA-RCD. Extensive experiments demonstrate that our SA-RCD achieves state-of-the-art performance on the nuScenes dataset. Our code will be available at https://github.com/FreyZhangYeh/SA-RCD.
Chinese: 提出的SA-RCD框架通过结构感知策略和多尺度网络改进雷达-相机深度估计,有效处理雷达稀疏性和噪声问题,在nuScenes数据集上取得了最优性能。
English: The proposed SA-RCD framework enhances radar-camera depth estimation by using structure-aware strategies and a multi-scale network to address radar sparsity and noise, achieving state-of-the-art results on the nuScenes dataset.

Authors:Edoardo Bianchi, Antonio Liotta
Title: PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment
Abstract:
Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications. Visit our project page at https://edowhite.github.io/PATS
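
A simplified sketch of the sampling idea: split the frame index range into contiguous segments and take a continuous run of frames inside each, preserving temporal order within every analyzed portion. PATS derives segment boundaries from complete movement executions; the uniform boundaries and centered runs below are my simplifications:

```python
# Sketch of continuity-preserving temporal sampling across multiple segments.
import numpy as np

def continuous_segment_sampling(n_frames: int, n_segments: int, frames_per_segment: int):
    boundaries = np.linspace(0, n_frames, n_segments + 1, dtype=int)
    clips = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        length = min(frames_per_segment, end - start)
        offset = start + (end - start - length) // 2        # center the run within its segment
        clips.append(np.arange(offset, offset + length))    # contiguous frame indices
    return clips

clips = continuous_segment_sampling(n_frames=900, n_segments=4, frames_per_segment=32)
```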

Authors:Enrique Sanchez, Isma Hadji, Adrian Bulat, Christos Tzelepis, Brais Martinez, Georgios Tzimiropoulos
Title: Multi-scale Image Super Resolution with a Single Auto-Regressive Model
Abstract:
In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e., it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve state-of-the-art results on ISR, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.
中文: 本文提出了一种新颖的图像超分辨率方法,通过结合多尺度处理的分层图像标记化和直接偏好优化来提升变换器性能,使用更小的模型且无需外部数据即实现了最先进的结果。
English: This paper introduces a novel approach to Image Super Resolution by combining a Hierarchical Image Tokenization method for multi-scale processing and Direct Preference Optimization to enhance transformer performance, achieving state-of-the-art results with a smaller model and no external data.

Authors:Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
Title: FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation
Abstract:
Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
中文摘要:提出的FEAT模型通过引入高效注意力机制和细粒度引导,克服了现有基于Transformer的动态医学视频合成方法的局限性,以更少的参数实现了更优的性能。
English Summary: The proposed FEAT model overcomes limitations in existing Transformer-based methods for dynamic medical video synthesis by introducing efficient attention mechanisms and fine-grained guidance, achieving superior performance with fewer parameters.

Authors:Andrew Hamara, Greg Hamerly, Pablo Rivas, Andrew C. Freeman
Title: Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study
Abstract:
Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, while human players rely on intuition to select candidate moves followed by a shallow search to validate them. To model this intuition-driven planning process, we train a transformer encoder using supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and visualized trajectories display interpretable transitions between game states. We demonstrate that move selection can occur entirely within this embedding space by advancing toward favorable regions, without relying on deep search. Despite using only a 6-ply beam search, our model achieves an estimated Elo rating of 2593. Performance improves with both model size and embedding dimensionality, suggesting that latent planning may offer a viable alternative to traditional search. Although we focus on chess, the proposed embedding-based planning method can be generalized to other perfect-information games where state evaluations are learnable. All source code is available at https://github.com/andrewhamara/SOLIS.
中文摘要:本研究开发了一种基于Transformer的模型,通过监督对比学习将棋盘状态嵌入结构化潜空间,无需深度搜索即可通过轨迹规划实现直觉式走子选择,并达到大师级棋力水平。
English Summary: This research develops a transformer-based model that uses supervised contrastive learning to embed chess positions into a structured latent space, enabling intuitive move selection through trajectory planning without deep search while achieving master-level performance.
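As a rough illustration of how move selection can happen entirely in the embedding space, the sketch below scores each candidate move by how far its successor position advances along a "favorable" latent direction; `embed_position`, `apply_move`, and the direction vector are hypothetical stand-ins for the paper's trained encoder and latent structure.

```python
import numpy as np

def embed_position(board_state):
    """Placeholder for the trained contrastive encoder (returns a latent vector)."""
    rng = np.random.default_rng(abs(hash(board_state)) % (2**32))
    return rng.standard_normal(128)

def pick_move(board_state, legal_moves, apply_move, favorable_direction):
    """Choose the move whose successor embedding advances most along the
    favorable direction, i.e. toward regions of the latent space associated
    with good evaluations -- no deep search required."""
    direction = favorable_direction / np.linalg.norm(favorable_direction)
    scores = []
    for move in legal_moves:
        successor = apply_move(board_state, move)
        scores.append(np.dot(embed_position(successor), direction))
    return legal_moves[int(np.argmax(scores))]

if __name__ == "__main__":
    moves = ["e2e4", "d2d4", "g1f3"]
    best = pick_move("startpos", moves, lambda s, m: s + " " + m,
                     favorable_direction=np.ones(128))
    print(best)
```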

Authors:Yu-Feng Chen, Tzuhsuan Huang, Pin-Yen Chiu, Jun-Cheng Chen
Title: Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking
Abstract:
Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis of watermark characteristics in terms of backdoor attacks further supports the effectiveness of our approach. The code is available at: https://github.com/aiiu-lab/BackdoorImageEditing
中文: 本文提出了一种针对扩散模型图像编辑的新型后门攻击框架,通过投毒训练数据嵌入隐形触发器,利用不可见水印操控输出,同时保持对干净图像的正常编辑,实验验证了该方法的高攻击成功率。
English: This paper introduces a novel backdoor attack framework for diffusion models in image editing by embedding invisible triggers through poisoned training data, utilizing imperceptible watermarks to manipulate outputs while maintaining normal editing for clean inputs, with experiments confirming high attack success rates.

Authors:Mario Malizia, Charles Hamesse, Ken Hasselmann, Geert De Cubber, Nikolaos Tsiogkas, Eric Demeester, Rob Haelterman
Title: MineInsight: A Multi-sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments
Abstract:
The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset comes with an estimation of the location of the targets, offering a benchmark for evaluating detection algorithms. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at https://github.com/mariomlz99/MineInsight.
中文: 本文介绍了MineInsight,这是首个公开的多传感器数据集,集成了双视角机器人扫描技术,用于多样化和真实的地雷检测算法验证,包含多光谱范围和目标位置信息。
English: This paper introduces MineInsight, the first publicly available multi-sensor dataset integrating dual-view robotic scans for diverse and realistic landmine detection algorithm validation, featuring multiple spectral ranges and target locations.

Authors:Kunshen Zhang
Title: OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model
Abstract:
Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, an LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large-scale ScanNet datasets validate the effectiveness of our OpenMaskDINO3D across various tasks.
Chinese: 当前二维推理分割系统依赖显式人工指令或预定义类别,而三维推理分割尚缺乏类似框架,因此本文提出OpenMaskDINO3D大语言模型,通过处理点云数据和文本提示生成精确的三维实例分割掩码,并在ScanNet数据集上验证了其有效性。
English: Current 2D reasoning segmentation systems require explicit human instructions or predefined categories, but a similar framework for 3D reasoning segmentation is lacking, leading to the introduction of OpenMaskDINO3D, an LLM that processes point clouds and text prompts to generate precise 3D instance segmentation masks validated on ScanNet datasets.

Authors:Svetlana Pavlitska, Jamie Robb, Nikolai Polley, Melih Yazgan, J. Marius Zöllner
Title: Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors
Abstract:
Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in the lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at https://github.com/KASTEL-MobilityLab/attacks-on-traffic-light-detection.
中文摘要:本研究展示了使用印刷补丁对交通信号灯检测系统实施现实对抗攻击的方法,在模拟和真实环境中均成功实现了针对性标签翻转和分类攻击。
English Summary: This study demonstrates realistic adversarial attacks on traffic light detection systems using printed patches, successfully achieving targeted label-flipping and classification attacks in both simulated and real-world environments.

Authors:Yuyi Zhang, Yongxin Shi, Peirong Zhang, Yixin Zhao, Zhenhua Yang, Lianwen Jin
Title: MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories
Abstract:
Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but also in the broader domain of pattern recognition. The dataset is available at https://github.com/SCUT-DLVCLab/MegaHan97K.
中文摘要:MegaHan97K数据集填补了超大类中文字符识别的关键空白,提供97,455个字符类别并涵盖手写、古籍和合成三个子集,完整支持最新GB18030-2022标准,在揭示新研究挑战的同时为文化遗产保护与数字应用开辟了新前景。
English Summary: The MegaHan97K dataset addresses the critical gap in mega-category Chinese character recognition by providing 97,455 character categories with balanced samples across handwritten, historical, and synthetic subsets, fully supporting the latest GB18030-2022 standard while revealing new research challenges and opportunities.

Authors:Daniel Barath
Title: SupeRANSAC: One RANSAC to Rule Them All
Abstract:
Robust estimation is a cornerstone in computer vision, particularly for tasks like Structure-from-Motion and Simultaneous Localization and Mapping. RANSAC and its variants are the gold standard for estimating geometric models (e.g., homographies, relative/absolute poses) from outlier-contaminated data. Despite RANSAC's apparent simplicity, achieving consistently high performance across different problems is challenging. While recent research often focuses on improving specific RANSAC components (e.g., sampling, scoring), overall performance is frequently more influenced by the "bells and whistles" (i.e., the implementation details and problem-specific optimizations) within a given library. Popular frameworks like OpenCV and PoseLib demonstrate varying performance, excelling in some tasks but lagging in others. We introduce SupeRANSAC, a novel unified RANSAC pipeline, and provide a detailed analysis of the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. SupeRANSAC is designed for consistent accuracy across these tasks, improving upon the best existing methods by, for example, 6 AUC points on average for fundamental matrix estimation. We demonstrate significant performance improvements over the state-of-the-art on multiple problems and datasets. Code: https://github.com/danini/superansac
中文: SupeRANSAC是一种新颖的统一RANSAC流程,在多种计算机视觉任务中实现了一致的精度,通过将基础矩阵估计的平均AUC值提高6个点,显著超越了现有方法。
English: SupeRANSAC is a novel unified RANSAC pipeline that achieves consistent accuracy across various computer vision tasks, significantly outperforming existing methods by improving fundamental matrix estimation by an average of 6 AUC points.
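For reference, the vanilla hypothesize-and-verify loop that SupeRANSAC builds on can be written in a few lines; this generic sketch deliberately omits the implementation details ("bells and whistles") the abstract identifies as the main driver of practical performance.

```python
import random

def ransac(data, fit_model, residual, sample_size, threshold, iterations=1000, seed=0):
    """Generic RANSAC: repeatedly fit a model to a minimal sample and keep the
    hypothesis with the largest inlier set."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = rng.sample(data, sample_size)
        model = fit_model(sample)
        if model is None:
            continue  # degenerate minimal sample
        inliers = [d for d in data if residual(model, d) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

if __name__ == "__main__":
    # Toy example: robust estimation of a line y = a*x + b with gross outliers.
    pts = [(x, 2.0 * x + 1.0) for x in range(20)] + [(x, 50.0) for x in range(6)]

    def fit(sample):
        (x1, y1), (x2, y2) = sample
        if x1 == x2:
            return None
        a = (y2 - y1) / (x2 - x1)
        return a, y1 - a * x1

    def res(model, p):
        a, b = model
        return abs(p[1] - (a * p[0] + b))

    print(ransac(pts, fit, res, sample_size=2, threshold=0.5))
```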

Authors:Yusuke Matsui
Title: LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table
Abstract:
Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, search results should be similar to the query and yet diverse. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During the filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrate that LotusFilter operates fast (0.02 ms/query) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings. Our code is publicly available at https://github.com/matsui528/lotf.
Chinese: LotusFilter是一个后处理模块,通过预计算邻近向量表并贪婪地剔除候选结果中的冗余向量,有效提升近似最近邻搜索结果的多样性,在真实RAG应用中展现出高速处理性能。
English: LotusFilter is a post-processing module that efficiently diversifies approximate nearest neighbor search results by greedily removing redundant vectors using a precomputed cutoff table, operating at high speed in real-world RAG applications.
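The filtering step admits a compact sketch: given a precomputed cutoff table mapping each database vector to the ids that lie too close to it, the filter walks the similarity-ranked candidates and drops any candidate already covered by a kept one. Table construction and thresholds follow the paper and are only stubbed here.

```python
def lotus_filter(ranked_candidates, cutoff_table, k):
    """Greedy diversification of ANNS results.

    ranked_candidates: candidate ids ordered by similarity to the query.
    cutoff_table: dict mapping id -> set of ids that are too close to it
                  (precomputed offline from the database vectors).
    k: number of diverse results to return.
    """
    kept, blocked = [], set()
    for cand in ranked_candidates:
        if cand in blocked:
            continue                       # too similar to something already kept
        kept.append(cand)
        blocked |= cutoff_table.get(cand, set())
        if len(kept) == k:
            break
    return kept

if __name__ == "__main__":
    table = {0: {1, 2}, 3: {4}}            # hypothetical near-duplicate groups
    print(lotus_filter([0, 1, 2, 3, 4, 5], table, k=3))   # -> [0, 3, 5]
```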

Authors:Suhan Woo, Seongwon Lee, Jinwoo Jang, Euntai Kim
Title: HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
Abstract:
When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy to enable flexible control over accuracy-efficiency trade-offs and ensure robust matching even between descriptors from different image types. This approach allows HypeVPR to outperform existing methods while significantly accelerating retrieval and reducing database storage requirements. The code and models will be released at https://github.com/suhan-woo/HypeVPR.git.
中文: HypeVPR提出了一种双曲空间中的分层嵌入框架,通过利用全景视图固有的层次结构解决透视到等距柱面投影的视觉位置识别难题,实现了高效粗到精搜索和更低存储需求的优越性能。
English: HypeVPR introduces a hierarchical embedding framework in hyperbolic space to address perspective-to-equirectangular visual place recognition challenges by leveraging inherent hierarchical structures in panoramic views, achieving superior performance with efficient coarse-to-fine search and reduced storage.
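The hyperbolic geometry the abstract relies on is easiest to see through the distance function of the common Poincare-ball model, shown below; this is the standard formula and is not specific to HypeVPR's feature aggregation or coarse-to-fine search.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * diff / max(denom, eps))

if __name__ == "__main__":
    a = np.array([0.1, 0.0])
    b = np.array([0.0, 0.8])   # closer to the boundary, so distances grow quickly
    print(poincare_distance(a, b))
```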

Authors:Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu
Title: Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning
Abstract:
While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning by two complementary estimators: 1) Causal Discrepancy Estimator (CDE) based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at https://github.com/Leo-ssl/RAP.
Chinese: 本研究提出推理激活潜力(RAP)方法,通过筛选关键"认知样本"来提升多模态大语言模型的推理能力,仅用9.3%的训练数据即可实现更优性能,同时降低超过43%的计算成本。
English: This study introduces Reasoning Activation Potential (RAP), a novel data selection method that identifies sparse "cognitive samples" to enhance multi-modal reasoning in MLLMs, achieving superior performance with only 9.3% of training data while cutting computational costs by over 43%.
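Under simplifying assumptions, the Causal Discrepancy Estimator can be illustrated as scoring each sample by how much the answer likelihood drops when the image is removed: samples whose answers are almost as likely from text alone lean on language priors and are pruned. `answer_logprob` is a hypothetical helper, not the released RAP interface.

```python
def causal_discrepancy_scores(samples, answer_logprob):
    """Score samples by the gap between multi-modal and text-only likelihoods.

    samples: list of dicts with keys 'image', 'question', 'answer'.
    answer_logprob(image, question, answer): log-likelihood of the answer under
        the MLLM; pass image=None for the text-only counterfactual.
    Higher score -> the image genuinely matters for the answer.
    """
    scores = []
    for s in samples:
        with_image = answer_logprob(s["image"], s["question"], s["answer"])
        text_only = answer_logprob(None, s["question"], s["answer"])
        scores.append(with_image - text_only)
    return scores

def select_cognitive_samples(samples, answer_logprob, keep_ratio=0.093):
    """Keep only the highest-scoring fraction of samples (9.3% in the paper)."""
    scores = causal_discrepancy_scores(samples, answer_logprob)
    ranked = sorted(zip(scores, range(len(samples))), reverse=True)
    n_keep = max(1, int(len(samples) * keep_ratio))
    return [samples[i] for _, i in ranked[:n_keep]]
```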

Authors:Niki Martinel, Rita Pucci
Title: Physics Informed Capsule Enhanced Variational AutoEncoder for Underwater Image Enhancement
Abstract:
We present a novel dual-stream architecture that achieves state-of-the-art underwater image enhancement by explicitly integrating the Jaffe-McGlamery physical model with capsule clustering-based feature representation learning. Our method simultaneously estimates transmission maps and spatially-varying background light through a dedicated physics estimator while extracting entity-level features via capsule clustering in a parallel stream. This physics-guided approach enables parameter-free enhancement that respects underwater formation constraints while preserving semantic structures and fine-grained details. Our approach also features a novel optimization objective ensuring both physical adherence and perceptual quality across multiple spatial frequencies. To validate our approach, we conducted extensive experiments across six challenging benchmarks. Results demonstrate consistent improvements of $+0.5$dB PSNR over the best existing methods while requiring only one-third of their computational complexity (FLOPs), or alternatively, more than $+1$dB PSNR improvement when compared to methods with similar computational budgets. Code and data \textit{will} be available at https://github.com/iN1k1/.
中文: 该新颖双流架构通过融合物理成像模型与胶囊聚类,实现了无需参数的水下图像增强,在仅需现有方法三分之一计算量的情况下将PSNR提升0.5分贝。
English: This novel dual-stream architecture integrates a physical underwater imaging model with capsule clustering to achieve state-of-the-art enhancement, improving PSNR by +0.5dB with one-third the computational cost of existing methods.
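For context, the Jaffe-McGlamery formation model that the physics stream estimates can be written, in its simplified per-pixel form, as observed = scene radiance x transmission + background light x (1 - transmission); the sketch below inverts that relation given a transmission map and background light, which is only the physics prior and not the full dual-stream network.

```python
import numpy as np

def invert_jaffe_mcglamery(image, transmission, background_light, t_min=0.05):
    """Recover scene radiance J from the simplified underwater model
        I = J * t + B * (1 - t).

    image:            HxWx3 observed image in [0, 1]
    transmission:     HxWx1 (or HxWx3) per-pixel transmission map t
    background_light: length-3 (or HxWx3) background light B
    """
    t = np.clip(transmission, t_min, 1.0)            # avoid division blow-up
    restored = (image - background_light * (1.0 - t)) / t
    return np.clip(restored, 0.0, 1.0)

if __name__ == "__main__":
    img = np.random.rand(4, 4, 3)
    t = np.full((4, 4, 1), 0.6)
    B = np.array([0.1, 0.4, 0.5])                    # bluish-green water light
    print(invert_jaffe_mcglamery(img, t, B).shape)
```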

Authors:Zelu Qi, Ping Shi, Chaoyang Zhang, Shuqi Wang, Fei Zhao, Da Pan, Zefeng Ying
Title: Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model
Abstract:
The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user's viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design a corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce an LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved \textbf{second place} in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at https://github.com/QiZelu/AIGVEval.
Chinese: 针对AI生成视频的视觉质量问题,本研究提出了一种结合多维度特征和大语言模型的质量评估方法,通过专门设计的提示框架,在NTIRE 2025比赛中荣获第二名。
English: AI-generated videos often suffer from visual defects, so this study introduces a multi-dimensional quality assessment method using large language models and specialized prompts, which achieved second place in the NTIRE 2025 challenge.

Authors:Hanxin Wang, Tian Liu, Shu Kong
Title: Robust Few-Shot Vision-Language Model Adaptation
Abstract:
Pretrained VLMs achieve strong performance on downstream tasks when adapted with just a few labeled examples. As the adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data, enhancing OOD generalization in few-shot adaptation is critically important. We study robust few-shot VLM adaptation, aiming to increase both ID and OOD accuracy. By comparing different adaptation methods (e.g., prompt tuning, linear probing, contrastive finetuning, and full finetuning), we uncover three key findings: (1) finetuning with proper hyperparameters significantly outperforms the popular VLM adaptation methods prompt tuning and linear probing; (2) visual encoder-only finetuning achieves better efficiency and accuracy than contrastively finetuning both visual and textual encoders; (3) finetuning the top layers of the visual encoder provides the best balance between ID and OOD accuracy. Building on these findings, we propose partial finetuning of the visual encoder empowered with two simple augmentation techniques: (1) retrieval augmentation which retrieves task-relevant data from the VLM's pretraining dataset to enhance adaptation, and (2) adversarial perturbation which promotes robustness during finetuning. Results show that the former/latter boosts OOD/ID accuracy while slightly sacrificing the ID/OOD accuracy. Yet, perhaps understandably, naively combining the two does not maintain their best OOD/ID accuracy. We address this dilemma with the developed SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning. SRAPF consists of two stages: (1) partial finetuning the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data. Extensive experiments demonstrate that SRAPF achieves the state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks.
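The finding that finetuning only the top layers of the visual encoder gives the best ID/OOD balance translates into a small setup step; the sketch below freezes everything except the last few transformer blocks of a ViT-style visual encoder. The `blocks` attribute is a common convention (e.g., in timm ViTs) and is an assumption, not necessarily the authors' code layout.

```python
import torch.nn as nn

def freeze_all_but_top_visual_blocks(visual_encoder: nn.Module, num_trainable_blocks: int = 2):
    """Freeze a ViT-style visual encoder except for its last few transformer blocks.

    Assumes the encoder exposes its blocks as `visual_encoder.blocks`
    (a ModuleList or Sequential); adapt the attribute name to your model.
    """
    for p in visual_encoder.parameters():
        p.requires_grad = False
    for block in list(visual_encoder.blocks)[-num_trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    trainable = sum(p.numel() for p in visual_encoder.parameters() if p.requires_grad)
    total = sum(p.numel() for p in visual_encoder.parameters())
    print(f"trainable params: {trainable}/{total}")
```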

Authors:Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
Title: HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model
Abstract:
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

Authors:Gio Paik, Geewook Kim, Jinbae Im
Title: MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models
Abstract:
This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.
中文: 本文介绍MMRefine基准,用于评估多模态大语言模型在六种场景和错误类型中的检测与修正能力,揭示了当前性能瓶颈和改进方向。
English: This paper presents MMRefine, a benchmark for evaluating multimodal large language models' ability to detect and correct errors across six scenarios and error types, identifying current performance bottlenecks and improvement areas.

Authors:Qiming Hu, Linlong Fan, Yiyan Luo, Yuhang Yu, Xiaojie Guo, Qingnan Fan
Title: Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders
Abstract:
The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at \href{https://github.com/mingcv/TADiSR}{here}.
中文:提出的TADiSR框架通过文本感知注意力和联合分割解码器,有效恢复了超分辨率图像中的自然细节和文本结构,在真实场景中实现了卓越的性能和泛化能力。
English: The proposed TADiSR framework uses text-aware attention and joint segmentation decoders to effectively restore both natural details and text structures in super-resolved images, achieving superior performance and generalization in real-world scenarios.

Authors:Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, Ranjay Krishna
Title: Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
Abstract:
Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, and speed up significantly (by 7.5 seconds on average) when given intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.
中文: STARE基准测试通过多步骤视觉模拟任务评估AI模型的空间推理能力,结果表明人类借助视觉辅助表现优异,而模型在复杂3D和综合拼图任务中表现不佳,仅在简单2D变换上表现良好。
English: The STARE benchmark evaluates AI models on spatial reasoning tasks requiring multi-step visual simulations, revealing that while humans excel with visual aids, models struggle with complex 3D and integrated puzzles despite performing well on simpler 2D transformations.

Authors:Wuyang Li, Zhu Yu, Alexandre Alahi
Title: VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection
Abstract:
3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides insight at the instance level, which is overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates the voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address this task via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and object borders in the voxel space. The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware prediction. Experiments show that VoxDet can be deployed on both camera and LiDAR input, jointly achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient, but also achieves 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.
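A hedged sketch of the training-free VoxNT idea: given only a voxel class-label volume, compute, for each occupied voxel, the distance along the six axis directions to the first voxel with a different label, yielding instance-like offset labels for free. This naive per-voxel scan treats touching same-class regions as one object, which is the approximation such a label conversion can offer; the released implementation is presumably vectorized.

```python
import numpy as np

def voxel_to_offsets(labels):
    """Convert voxel class labels into 6-direction border offsets.

    labels: (X, Y, Z) integer class volume (0 = empty).
    Returns offsets of shape (X, Y, Z, 6): distance along +x, -x, +y, -y, +z, -z
    from each occupied voxel to the first voxel with a different label.
    """
    X, Y, Z = labels.shape
    offsets = np.zeros((X, Y, Z, 6), dtype=np.int32)
    dirs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for x in range(X):
        for y in range(Y):
            for z in range(Z):
                c = labels[x, y, z]
                if c == 0:
                    continue
                for k, (dx, dy, dz) in enumerate(dirs):
                    d, cx, cy, cz = 0, x, y, z
                    while True:
                        cx, cy, cz = cx + dx, cy + dy, cz + dz
                        d += 1
                        inside = 0 <= cx < X and 0 <= cy < Y and 0 <= cz < Z
                        if not inside or labels[cx, cy, cz] != c:
                            break
                    offsets[x, y, z, k] = d
    return offsets

if __name__ == "__main__":
    vol = np.zeros((4, 4, 4), dtype=np.int64)
    vol[1:3, 1:3, 1:3] = 5                  # a 2x2x2 "object" of class 5
    print(voxel_to_offsets(vol)[1, 1, 1])   # -> [2 1 2 1 2 1]
```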

Authors:Marianna Nezhurina, Tomer Porian, Giovanni Pucceti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev
Title: Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets
Abstract:
In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either contrastive only or contrastive and captioning text generative loss. Ensuring sufficient prediction accuracy for held-out points, we use derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides a means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the way for systematic comparison and improvement of open foundation models and datasets for their creation. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves $80.3\%$ zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.
中文: 本研究展示了如何利用扩展定律比较模型与数据集,发现MaMMUT在多任务和数据集上比CLIP具有更优的扩展性和样本效率,同时开源了模型与代码以支持复现。
English: This study demonstrates how scaling laws can be used to compare models and datasets, revealing MaMMUT's superior scalability and sample efficiency over CLIP across multiple tasks and datasets, while providing open-source models and code for reproducibility.
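The comparison methodology rests on fitting scaling laws to measured losses. As a generic illustration (not the paper's exact functional form, units, or fitting protocol), the sketch below fits a saturating power law L(x) = E + a * x^(-b) to (samples seen, loss) pairs and extrapolates it.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(samples, irreducible, coeff, exponent):
    """Saturating power law L(x) = E + a * x^(-b), a common scaling-law form."""
    return irreducible + coeff * np.power(samples, -exponent)

def fit_scaling_law(samples, loss):
    """Fit the power law to measured (samples seen, loss) points."""
    params, _ = curve_fit(power_law, samples, loss,
                          p0=[min(loss) * 0.9, 1.0, 0.2], maxfev=20000)
    return params

if __name__ == "__main__":
    # Synthetic measurements (samples seen, in billions) drawn from a known law,
    # used only to demonstrate the fitting interface.
    seen = np.logspace(0, 3, 8)                        # 1B .. 1000B samples
    loss = 1.8 + 1.2 * seen ** -0.2
    loss += np.random.default_rng(0).normal(0, 0.005, seen.size)
    E, a, b = fit_scaling_law(seen, loss)
    print(f"irreducible={E:.3f}, exponent={b:.3f}, "
          f"predicted loss at 10^4 B samples: {power_law(1e4, E, a, b):.3f}")
```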

Authors:Hasin Us Sami, Swapneel Sen, Amit K. Roy-Chowdhury, Srikanth V. Krishnamurthy, Basak Guler
Title: Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning
Abstract:
Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at https://github.com/info-ucr/PEFTLeak.
Chinese: 联邦学习中的参数高效微调(PEFT)虽能在保护数据隐私的同时实现协同训练,但本研究发现恶意设计的预训练模型和适配器模块可通过梯度反演攻击,利用共享梯度重构用户的本地数据样本。
English: Federated learning with parameter-efficient fine-tuning (PEFT) enables collaborative model training while preserving data privacy, but this study reveals that maliciously designed pre-trained models and adapters can exploit shared gradients to reconstruct users' local data through gradient inversion attacks.

Authors:Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang
Title: RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Abstract:
Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.

Authors:Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
Title: LayerFlow: A Unified Model for Layer-aware Video Generation
Abstract:
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. For the lack of high-quality layer-wise training videos, we design a multi-stage training strategy to accommodate static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on the mixture of image data with high-quality layered images along with copy-pasted video data. During inference, we remove the motion LoRA thus generating smooth videos with desired layers.
中文摘要:LayerFlow是一个统一的层感知视频生成框架,通过分层嵌入和多阶段训练策略,能够根据分层提示生成透明前景、干净背景及融合场景的视频内容。
English Summary: LayerFlow is a unified framework for layer-aware video generation that uses layer embeddings and a multi-stage training strategy to produce transparent foregrounds, clean backgrounds, and blended scenes from per-layer prompts.

Authors:Zhao-Heng Yin, Sherry Yang, Pieter Abbeel
Title: Object-centric 3D Motion Field for Robot Learning from Human Videos
Abstract:
Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixelflow, and pointcloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for training a "denoising" 3D motion field estimator to extract fine object 3D motions from human videos with noisy depth robustly. Second, a dense object-centric 3D motion field prediction architecture that favors both cross-embodiment transfer and policy generalization to background. We evaluate the system in real world setups. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method, achieves a 55% average success rate in diverse tasks where prior approaches fail ($\lesssim 10\%$), and can even acquire fine-grained manipulation skills like insertion.

Authors:Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang
Title: Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models
Abstract:
Unlocking spatial reasoning in Large Multimodal Models (LMMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can LMMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source LMMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source LMM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in LMMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
Chinese: Struct2D框架通过结构化二维输入使大型多模态模型展现出强大的空间推理能力,无需显式三维表征即可在多种3D任务中取得优异表现。
English: The Struct2D framework enables large multimodal models to perform strong spatial reasoning using only structured 2D inputs, achieving competitive results on 3D tasks without requiring explicit 3D representations.

Authors:Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Title: Pseudo-Simulation for Autonomous Driving
Abstract:
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations ($R^2=0.8$) than the best existing open-loop approach ($R^2=0.7$). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.
中文: 本文提出伪仿真这一自动驾驶汽车评估新范式,通过3D高斯溅射技术融合真实数据与合成观测,无需完整仿真即可评估错误恢复能力,且与闭环基准的相关性(R²=0.8)优于现有方法。
English: This paper introduces pseudo-simulation, a novel evaluation paradigm for autonomous vehicles that combines real-world data efficiency with synthetic observations generated via 3D Gaussian Splatting, enabling error recovery assessment without full simulation while achieving higher correlation (R²=0.8) with closed-loop benchmarks than existing methods.
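The proximity-based weighting scheme can be illustrated with a simple kernel: each synthetic observation receives a weight proportional to exp(-d/tau), where d is its distance to the AV's likely state, so observations near plausible futures dominate the aggregated score. The actual distance measure, kernel, and normalization in the paper may differ from this sketch.

```python
import numpy as np

def proximity_weights(synthetic_states, reference_state, tau=1.0):
    """Weight synthetic observations by proximity to the AV's likely state.

    synthetic_states: (N, D) array, e.g. columns = (x, y, heading, speed).
    reference_state:  (D,) array describing the behaviour the AV is likely to exhibit.
    Returns weights that sum to 1.
    """
    dists = np.linalg.norm(synthetic_states - reference_state, axis=1)
    w = np.exp(-dists / tau)
    return w / w.sum()

def pseudo_sim_score(per_state_metric, weights):
    """Aggregate a per-observation metric (e.g. a driving score) with the weights."""
    return float(np.dot(per_state_metric, weights))

if __name__ == "__main__":
    states = np.array([[0.0, 0.0, 0.00, 5.0],
                       [0.5, 0.1, 0.05, 5.2],
                       [4.0, 3.0, 0.80, 9.0]])       # implausible future state
    ref = np.array([0.2, 0.0, 0.0, 5.0])
    w = proximity_weights(states, ref)
    print(w, pseudo_sim_score(np.array([0.9, 0.8, 0.3]), w))
```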

Authors:Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, Wenhan Luo
Title: UNIC: Unified In-Context Video Editing
Abstract:
Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
中文:UNIC框架通过将源视频、噪声潜变量和多模态条件标记整合为单一序列,利用DiT的注意力机制进行联合建模,并引入任务感知RoPE和条件偏置解决标记冲突,从而统一多种视频编辑任务并支持灵活的任务组合。
English: The UNIC framework unifies diverse video editing tasks by integrating source video, noisy latent, and multi-modal condition tokens into a single sequence modeled with DiT's attention, enhanced by task-aware RoPE and condition bias to resolve token collisions and support flexible task composition.

Authors:Xuanhua He, Quande Liu, Zixuan Ye, Weicai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai
Title: FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
Abstract:
Fine-grained and efficient controllability of video diffusion transformers is increasingly desirable for practical applications. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full-attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with systematic analysis to identify two key sources of the computation inefficiencies: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. Firstly, to address the token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full-attention. Additionally, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at \href{https://fulldit2.github.io/}{https://fulldit2.github.io/}.
中文: FullDiT2通过动态令牌选择和选择性上下文缓存机制,有效解决了视频生成中上下文条件化的计算效率瓶颈,在保持高质量生成效果的同时实现了显著加速。
English: FullDiT2 addresses the computational inefficiency of in-context conditioning video generation by introducing dynamic token selection and selective context caching, achieving significant speed improvements while maintaining high-quality output.
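Assuming a per-token importance score is available (e.g., from attention statistics), dynamic token selection reduces to a top-k gather that shortens the condition sequence before full attention; how the scores are produced and the selective caching mechanism are paper-specific and omitted in this sketch.

```python
import torch

def select_context_tokens(context_tokens, importance, keep_ratio=0.5):
    """Keep only the most important context tokens before joint full attention.

    context_tokens: (B, N, D) condition token sequence.
    importance:     (B, N) per-token importance scores (assumed given here).
    Returns the shortened (B, K, D) sequence and the kept indices.
    """
    b, n, d = context_tokens.shape
    k = max(1, int(n * keep_ratio))
    top = importance.topk(k, dim=1).indices                    # (B, K)
    kept = torch.gather(context_tokens, 1, top.unsqueeze(-1).expand(b, k, d))
    return kept, top

if __name__ == "__main__":
    tokens = torch.randn(2, 8, 16)
    scores = torch.rand(2, 8)
    kept, idx = select_context_tokens(tokens, scores, keep_ratio=0.25)
    print(kept.shape, idx.shape)     # torch.Size([2, 2, 16]) torch.Size([2, 2])
```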

Authors:Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Title: Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector
Abstract:
Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with a frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT. The code is available at https://github.com/heboyong/Diffusion-Domain-Teacher.
中文: 本文提出扩散域教师(DDT)方法,通过冻结权重的扩散模型作为教师模型生成目标域伪标签,在不影响推理速度的情况下显著提升了跨域目标检测性能。
English: This paper introduces Diffusion Domain Teacher (DDT), a method that uses a frozen-weight diffusion model as a teacher to generate pseudo labels for unlabeled target domains, significantly enhancing cross-domain object detection performance without affecting inference speed.
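The teacher-student mechanism is standard self-training; the sketch below shows one hedged training step in which a frozen teacher detector (built on diffusion features) labels target-domain images above a confidence threshold and the student is supervised on those pseudo labels. The callables are placeholders for the detector interfaces in the released code.

```python
import torch

def ddt_training_step(teacher, student, detection_loss, target_images,
                      optimizer, score_threshold=0.8):
    """One self-training step in the Diffusion Domain Teacher style.

    teacher(images) -> list of dicts with 'boxes', 'labels', 'scores' per image.
    student(images) -> raw predictions consumed by `detection_loss`.
    Both detectors and the loss are stand-ins for the actual interfaces.
    """
    teacher.eval()
    with torch.no_grad():
        raw = teacher(target_images)
    # Keep only confident teacher detections as pseudo ground truth.
    pseudo_targets = []
    for det in raw:
        keep = det["scores"] >= score_threshold
        pseudo_targets.append({"boxes": det["boxes"][keep],
                               "labels": det["labels"][keep]})
    student.train()
    optimizer.zero_grad()
    loss = detection_loss(student(target_images), pseudo_targets)
    loss.backward()
    optimizer.step()
    return float(loss)
```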

Authors:Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang
Title: Image Editing As Programs with Diffusion Models
Abstract:
While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Codes are available at https://github.com/YujiaHu1109/IEAP.
中文: 本研究提出IEAP框架,通过将复杂编辑指令分解为原子操作并共享扩散Transformer骨干网络,有效解决了结构不一致的图像编辑难题,在各项评估中显著优于现有方法。
English: This research introduces the Image Editing As Programs (IEAP) framework, which decomposes complex editing instructions into atomic operations using a shared Diffusion Transformer backbone to effectively handle structurally inconsistent image edits and outperform existing methods.

Authors:Robin Bruneau, Baptiste Brument, Yvain Quéau, Jean Mélou, François Bernard Lauze, Jean-Denis Durou, Lilian Calvet
Title: Multi-view Surface Reconstruction Using Normal and Reflectance Cues
Abstract:
Achieving high-fidelity 3D surface reconstruction while preserving fine details remains challenging, especially in the presence of materials with complex reflectance properties and without a dense-view setup. In this paper, we introduce a versatile framework that incorporates multi-view normal and optionally reflectance maps into radiance-based surface reconstruction. Our approach employs a pixel-wise joint re-parametrization of reflectance and surface normals, representing them as a vector of radiances under simulated, varying illumination. This formulation enables seamless incorporation into standard surface reconstruction pipelines, such as traditional multi-view stereo (MVS) frameworks or modern neural volume rendering (NVR) ones. Combined with the latter, our approach achieves state-of-the-art performance on multi-view photometric stereo (MVPS) benchmark datasets, including DiLiGenT-MV, LUCES-MV and Skoltech3D. In particular, our method excels in reconstructing fine-grained details and handling challenging visibility conditions. The present paper is an extended version of the earlier conference paper by Brument et al. (in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024), featuring an accelerated and more robust algorithm as well as a broader empirical evaluation. The code and data relative to this article is available at https://github.com/RobinBruneau/RNb-NeuS2.
中文: 本文提出了一种通用框架,将多视角法线和反射率图融入基于辐射度的表面重建,在复杂可见性条件下实现了捕捉细微结构的最先进性能。
English: This paper presents a versatile framework that integrates multi-view normal and reflectance maps into radiance-based surface reconstruction, achieving state-of-the-art performance in capturing fine details under challenging visibility conditions.
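The core of the approach is the pixel-wise re-parametrization of reflectance and surface normals as a vector of radiances under simulated, varying illumination. A minimal sketch of that idea under a plain Lambertian shading assumption (the light directions and the purely diffuse model are illustrative, not the paper's exact setup):

```python
import numpy as np

def radiance_vector(normal, albedo, light_dirs):
    """Re-parametrize (albedo, normal) at one pixel as radiances under
    simulated directional lights, assuming a simple Lambertian model.

    normal:     (3,) unit surface normal
    albedo:     scalar (or (3,) RGB) reflectance
    light_dirs: (L, 3) unit light directions
    returns:    (L,) or (L, 3) simulated radiances
    """
    shading = np.clip(light_dirs @ normal, 0.0, None)   # cosine foreshortening
    return np.outer(shading, np.atleast_1d(albedo)).squeeze()

# toy usage: one pixel, three simulated lights
n = np.array([0.0, 0.0, 1.0])
lights = np.array([[0.0, 0.0, 1.0],
                   [0.7071, 0.0, 0.7071],
                   [0.0, 0.7071, 0.7071]])
print(radiance_vector(n, 0.8, lights))
```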

Authors:Maxime Zanella, Clément Fuchs, Ismail Ben Ayed, Christophe De Vleeschouwer
Title: Vocabulary-free few-shot learning for Vision-Language Models
Abstract:
Recent advances in few-shot adaptation for Vision-Language Models (VLMs) have greatly expanded their ability to generalize across tasks using only a few labeled examples. However, existing approaches primarily build upon the strong zero-shot priors of these models by leveraging carefully designed, task-specific prompts. This dependence on predefined class names can restrict their applicability, especially in scenarios where exact class names are unavailable or difficult to specify. To address this limitation, we introduce vocabulary-free few-shot learning for VLMs, a setting where target class instances - that is, images - are available but their corresponding names are not. We propose Similarity Mapping (SiM), a simple yet effective baseline that classifies target instances solely based on similarity scores with a set of generic prompts (textual or visual), eliminating the need for carefully handcrafted prompts. Although conceptually straightforward, SiM demonstrates strong performance, operates with high computational efficiency (learning the mapping typically takes less than one second), and provides interpretability by linking target classes to generic prompts. We believe that our approach could serve as an important baseline for future research in vocabulary-free few-shot learning. Code is available at https://github.com/MaxZanella/vocabulary-free-FSL.
中文: 本文提出针对视觉语言模型的词汇表无关小样本学习方法,通过相似性映射(SiM)这一高效基线方案,仅使用通用提示对图像进行分类而无需预定义类别名称,在保持优异性能的同时具备高计算效率。
English: This paper introduces vocabulary-free few-shot learning for Vision-Language Models, proposing Similarity Mapping (SiM) as an efficient baseline that classifies images using generic prompts without requiring predefined class names, achieving strong performance and high computational efficiency.
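Since SiM classifies target instances purely from their similarity scores to a set of generic prompts, the learned component reduces to a small mapping from similarity vectors to classes. A minimal sketch of that idea, assuming CLIP-style similarity scores are already computed and using a ridge-regression closed form (the paper's actual estimator may differ):

```python
import numpy as np

def learn_similarity_mapping(support_sims, support_labels, num_classes, l2=1e-3):
    """Fit a linear map W from generic-prompt similarity scores to target classes.

    support_sims:   (N, P) similarities of the few labeled images to P generic prompts
    support_labels: (N,) integer class ids in [0, num_classes)
    returns:        (P, num_classes) mapping matrix
    """
    Y = np.eye(num_classes)[support_labels]                  # one-hot targets (N, C)
    A = support_sims.T @ support_sims + l2 * np.eye(support_sims.shape[1])
    W = np.linalg.solve(A, support_sims.T @ Y)               # ridge-regression closed form
    return W

def predict(query_sims, W):
    """Classify queries from their similarity scores alone."""
    return (query_sims @ W).argmax(axis=1)

# toy usage: 4 shots, 8 generic prompts, 3 classes
rng = np.random.default_rng(0)
S, y = rng.normal(size=(4, 8)), np.array([0, 1, 2, 0])
W = learn_similarity_mapping(S, y, num_classes=3)
print(predict(rng.normal(size=(2, 8)), W))
```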

Authors:Hicham Eddoubi, Jonas Ricker, Federico Cocchi, Lorenzo Baraldi, Angelo Sotgiu, Maura Pintor, Marcella Cornia, Lorenzo Baraldi, Asja Fischer, Rita Cucchiara, Battista Biggio
Title: RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors
Abstract:
AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealized conditions. In particular, the adversarial robustness is often neglected, potentially due to a lack of awareness or the substantial effort required to conduct a comprehensive robustness analysis. In this work, we tackle this problem by providing a simpler means to assess the robustness of AI-generated image detectors. We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples. The dataset is created by running attacks against an ensemble of seven state-of-the-art detectors and images generated by four different text-to-image models. Extensive experiments show that our methodology generates adversarial images that transfer with a high success rate to unseen detectors, which can be used to quickly provide an approximate yet still reliable estimate of a detector's adversarial robustness. Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples, highlighting the critical need for the development of more robust methods. We release our dataset at https://huggingface.co/datasets/aimagelab/RAID and evaluation code at https://github.com/pralab/RAID.
中文摘要:当前最先进的人工智能生成图像检测器极易受到对抗性样本的攻击,RAID数据集揭示了现有检测方法的严重脆弱性,凸显了开发更强健检测技术的迫切需求。
English Summary: AI-generated image detectors are highly vulnerable to adversarial attacks, as demonstrated by the RAID dataset, which exposes significant weaknesses in current state-of-the-art detection methods and underscores the urgent need for more robust solutions.

Authors:Aojun Lu, Tao Feng, Hangjie Yuan, Chunhui Ding, Yanan Sun
Title: Adapt before Continual Learning
Abstract:
Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). Although pre-trained models (PTMs) have provided a strong foundation for CL, existing approaches face a fundamental challenge in balancing these two competing objectives. Current methods typically address stability by freezing the PTM backbone, which severely limits the model's plasticity, particularly when the incoming data distribution diverges largely from the pre-training data. Alternatively, sequentially fine-tuning the entire PTM can adapt to new knowledge but often leads to catastrophic forgetting, highlighting the critical stability-plasticity trade-off in PTM-based CL. To address this limitation, we propose Adapting PTMs before the core CL process (ACL), a novel framework that introduces a plug-and-play adaptation phase prior to learning each new task. During this phase, ACL refines the PTM backbone by aligning embeddings with their original class prototypes while distancing them from irrelevant classes. This mechanism theoretically and empirically demonstrates a desirable balance between stability and plasticity, significantly improving CL performance across benchmarks and integrated methods. Code is available at https://github.com/byyx666/ACL_code.
中文: 提出的ACL框架在持续学习任务前引入即插即用的适应阶段,通过将嵌入向量与原始类别原型对齐来优化预训练模型,有效平衡稳定性与可塑性以提升性能。
English: The proposed ACL framework introduces a plug-and-play adaptation phase before continual learning tasks to refine pre-trained models by aligning embeddings with original class prototypes, effectively balancing stability and plasticity to enhance performance.
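The adaptation phase described above pulls embeddings toward their original class prototypes and pushes them away from irrelevant classes. A minimal sketch of such an objective, written as cross-entropy over prototype similarities (an illustrative stand-in, not necessarily the exact loss used in ACL):

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(embeddings, labels, prototypes, temperature=0.1):
    """Pull each embedding toward its own class prototype and push it away from
    the others via cross-entropy over cosine similarities.

    embeddings: (B, D) backbone features for the current task's images
    labels:     (B,)   class ids
    prototypes: (C, D) class prototypes (e.g., mean embeddings per class)
    """
    z = F.normalize(embeddings, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.t() / temperature        # (B, C) similarity to every prototype
    return F.cross_entropy(logits, labels)

# toy usage
emb = torch.randn(16, 128, requires_grad=True)
protos = torch.randn(5, 128)
loss = prototype_alignment_loss(emb, torch.randint(0, 5, (16,)), protos)
loss.backward()
```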

Authors:Aojun Lu, Hangjie Yuan, Tao Feng, Yanan Sun
Title: Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective
Abstract:
The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. While numerous CL methods aim to achieve this trade-off, they often overlook the impact of network architecture on stability and plasticity, restricting the trade-off to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework denoted Dual-Arch, which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments demonstrate that Dual-Arch enhances the performance of existing CL methods while being up to 87% more compact in terms of parameters. Code: https://github.com/byyx666/Dual-Arch.
Chinese: 本文提出Dual-Arch框架,通过分别针对可塑性和稳定性设计的两个独立网络,在架构层面解决持续学习中的稳定性-可塑性平衡问题,在提升性能的同时将参数量减少高达87%。
English: The paper introduces Dual-Arch, a plug-in framework for continual learning that utilizes two specialized networks—one for plasticity and one for stability—to address the stability-plasticity dilemma at the architectural level, improving performance while reducing parameters by up to 87%.

Authors:Theodore Barfoot, Luis C. Garcia-Peraza-Herrera, Samet Akcay, Ben Glocker, Tom Vercauteren
Title: Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation
Abstract:
Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses
Chinese: 本研究提出了一种可微分的边际L1平均校准误差(mL1-ACE)损失函数,通过硬分箱和软分箱方法有效提升医学图像分割网络的校准性能,在保持分割精度的同时显著降低了校准误差,增强了深度学习模型在临床应用中的可靠性。
English: This study introduces a differentiable marginal L1 Average Calibration Error (mL1-ACE) loss function to improve the calibration of medical image segmentation networks, significantly reducing calibration errors while largely preserving segmentation accuracy across multiple datasets.
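The auxiliary loss compares per-bin confidence against per-bin accuracy on each image, with either hard or soft bin assignments. A minimal soft-binned sketch for the binary (foreground/background) case, using an illustrative Gaussian-kernel binning that may differ from the paper's exact formulation:

```python
import torch

def soft_binned_l1_ace(probs, targets, num_bins=15, temperature=0.01):
    """Soft-binned L1 average calibration error for one image (binary case):
    the mean absolute gap between per-bin confidence and per-bin accuracy.

    probs:   (N,) predicted foreground probabilities for all pixels
    targets: (N,) binary ground-truth labels
    """
    centers = torch.linspace(0.5 / num_bins, 1 - 0.5 / num_bins, num_bins,
                             device=probs.device)
    # soft assignment of each pixel to each bin (N, B)
    weights = torch.softmax(-(probs.unsqueeze(1) - centers) ** 2 / temperature, dim=1)
    bin_mass = weights.sum(dim=0) + 1e-8
    bin_conf = (weights * probs.unsqueeze(1)).sum(dim=0) / bin_mass
    bin_acc = (weights * targets.unsqueeze(1).float()).sum(dim=0) / bin_mass
    return (bin_conf - bin_acc).abs().mean()

# toy usage: add as an auxiliary term next to Dice + cross-entropy
p = torch.rand(10_000, requires_grad=True)
t = (torch.rand(10_000) > 0.5).long()
loss = soft_binned_l1_ace(p, t)
loss.backward()
```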

Authors:Marcin Kowalczyk, Kamil Jeziorek, Tomasz Kryjak
Title: Learning from Noise: Enhancing DNNs for Event-Based Vision through Controlled Noise Injection
Abstract:
Event-based sensors offer significant advantages over traditional frame-based cameras, especially in scenarios involving rapid motion or challenging lighting conditions. However, event data frequently suffers from considerable noise, negatively impacting the performance and robustness of deep learning models. Traditionally, this problem has been addressed by applying filtering algorithms to the event stream, but this may also remove some of the relevant data. In this paper, we propose a novel noise-injection training methodology designed to enhance the robustness of neural networks against varying levels of event noise. Our approach introduces controlled noise directly into the training data, enabling models to learn noise-resilient representations. We have conducted extensive evaluations of the proposed method using multiple benchmark datasets (N-Caltech101, N-Cars, and Mini N-ImageNet) and various network architectures, including Convolutional Neural Networks, Vision Transformers, Spiking Neural Networks, and Graph Convolutional Networks. Experimental results show that our noise-injection training strategy achieves stable performance over a range of noise intensities, consistently outperforms event-filtering techniques, and achieves the highest average classification accuracy, making it a viable alternative to traditional event-data filtering methods in an object classification system. Code: https://github.com/vision-agh/DVS_Filtering
中文摘要:本文提出一种噪声注入训练方法,通过在训练过程中注入受控噪声来增强神经网络对事件数据噪声的鲁棒性,在多个数据集和网络架构上的实验表明,该方法比传统滤波技术获得了更高的分类准确率。
English Summary: This paper introduces a noise-injection training method that enhances neural network robustness against event data noise by injecting controlled noise during training, achieving superior classification accuracy across multiple datasets and architectures compared to traditional filtering techniques.
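Controlled noise injection amounts to adding synthetic noise events to each training sample at a chosen rate before the events are converted into the network's input representation. A minimal sketch, assuming events are stored as a structured (x, y, t, p) array and using a simple uniform noise model (the paper may use a different noise distribution):

```python
import numpy as np

def inject_event_noise(events, noise_ratio, sensor_size, t_range):
    """Augment an event stream with uniformly distributed noise events.

    events:      structured array with fields x, y, t, p
    noise_ratio: number of injected events as a fraction of real events
    sensor_size: (width, height)
    t_range:     (t_min, t_max) in the same units as events['t']
    """
    rng = np.random.default_rng()
    n = int(len(events) * noise_ratio)
    noise = np.zeros(n, dtype=events.dtype)
    noise['x'] = rng.integers(0, sensor_size[0], n)
    noise['y'] = rng.integers(0, sensor_size[1], n)
    noise['t'] = rng.uniform(t_range[0], t_range[1], n)
    noise['p'] = rng.integers(0, 2, n)
    mixed = np.concatenate([events, noise])
    return mixed[np.argsort(mixed['t'])]          # keep temporal ordering

# toy usage
dtype = [('x', 'i4'), ('y', 'i4'), ('t', 'f8'), ('p', 'i1')]
ev = np.zeros(100, dtype=dtype)
ev['t'] = np.sort(np.random.uniform(0, 1.0, 100))
noisy = inject_event_noise(ev, noise_ratio=0.3, sensor_size=(240, 180), t_range=(0, 1.0))
```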

Authors:Sam Pollard, Michael Wray
Title: Video, How Do Your Tokens Merge?
Abstract:
Video transformer models require huge amounts of compute resources due to the spatio-temporal scaling of the input. Tackling this, recent methods have proposed to drop or merge tokens for image models, whether randomly or via learned methods. Merging tokens has many benefits: it can be plugged into any vision transformer, does not require model re-training, and it propagates information that would otherwise be dropped through the model. Before now, video token merging has not been evaluated on temporally complex datasets for video understanding. In this work, we explore training-free token merging for video to provide comprehensive experiments and find best practices across four video transformers on three datasets that exhibit coarse and fine-grained action recognition. Our results showcase the benefits of video token merging with a speedup of around 2.5x while maintaining accuracy (avg. -0.55% for ViViT). Code available at https://github.com/sjpollard/video-how-do-your-tokens-merge.
Chinese: 本研究探索了视频变换器的免训练令牌合并方法,在多个模型和数据集上实现了约2.5倍加速,同时保持了接近原始的准确率。
English: This study explores training-free token merging for video transformers, achieving a 2.5X speedup while maintaining near-original accuracy across multiple models and datasets.
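Training-free token merging replaces the r most redundant tokens per block with averages of their nearest neighbours, so no weights change and any pretrained video transformer can be used as-is. A simplified sketch in the spirit of bipartite soft matching (ToMe); the schedules and weighting evaluated in the paper differ:

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Merge the r most similar cross-set token pairs by averaging.

    x: (B, N, D) token features;  r: tokens removed per call
    returns: (B, N - r, D)
    """
    metric = F.normalize(x, dim=-1)
    a, b = metric[:, ::2], metric[:, 1::2]            # alternate tokens into two sets
    scores = a @ b.transpose(-1, -2)                  # (B, Na, Nb) cosine similarities
    best_sim, best_dst = scores.max(dim=-1)           # best partner in B for each A token
    src_order = best_sim.argsort(dim=-1, descending=True)

    out = []
    for i in range(x.shape[0]):                       # per-sample loop kept simple
        xa, xb = x[i, ::2].clone(), x[i, 1::2].clone()
        src = src_order[i, :r]                        # A tokens merged away
        dst = best_dst[i, src]                        # their B partners
        counts = torch.ones(xb.shape[0], device=x.device)
        counts.index_add_(0, dst, torch.ones_like(dst, dtype=counts.dtype))
        xb.index_add_(0, dst, x[i, ::2][src])
        xb = xb / counts.unsqueeze(-1)                # average each merged group
        keep = torch.ones(xa.shape[0], dtype=torch.bool, device=x.device)
        keep[src] = False
        out.append(torch.cat([xa[keep], xb], dim=0))
    return torch.stack(out)

# toy usage: drop 16 of 196 tokens
tokens = torch.randn(2, 196, 768)
print(merge_tokens(tokens, r=16).shape)               # torch.Size([2, 180, 768])
```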

Authors:Zhuoyang Pan, Boxiao Pan, Guandao Yang, Adam W. Harley, Leonidas Guibas
Title: Animal Pose Labeling Using General-Purpose Point Trackers
Abstract:
Automatically estimating animal poses from videos is important for studying animal behaviors. Existing methods do not perform reliably since they are trained on datasets that are not comprehensive enough to capture all necessary animal behaviors. However, it is very challenging to collect such datasets due to the large variations in animal morphology. In this paper, we propose an animal pose labeling pipeline that follows a different strategy, i.e. test time optimization. Given a video, we fine-tune a lightweight appearance embedding inside a pre-trained general-purpose point tracker on a sparse set of annotated frames. These annotations can be obtained from human labelers or off-the-shelf pose detectors. The fine-tuned model is then applied to the rest of the frames for automatic labeling. Our method achieves state-of-the-art performance at a reasonable annotation cost. We believe our pipeline offers a valuable tool for the automatic quantification of animal behavior. Visit our project webpage at https://zhuoyang-pan.github.io/animal-labeling.

Authors:Joonkyung Kim, Joonyeol Sim, Woojun Kim, Katia Sycara, Changjoo Nam
Title: CARE: Enhancing Safety of Visual Navigation through Collision Avoidance via Repulsive Estimation
Abstract:
We propose CARE (Collision Avoidance via Repulsive Estimation) to improve the robustness of learning-based visual navigation methods. Recently, visual navigation models, particularly foundation models, have demonstrated promising performance by generating viable trajectories using only RGB images. However, these policies can generalize poorly to environments containing out-of-distribution (OOD) scenes characterized by unseen objects or different camera setups (e.g., variations in field of view, camera pose, or focal length). Without fine-tuning, such models could produce trajectories that lead to collisions, necessitating substantial efforts in data collection and additional training. To address this limitation, we introduce CARE, an attachable module that enhances the safety of visual navigation without requiring additional range sensors or fine-tuning of pretrained models. CARE can be integrated seamlessly into any RGB-based navigation model that generates local robot trajectories. It dynamically adjusts trajectories produced by a pretrained model using repulsive force vectors computed from depth images estimated directly from RGB inputs. We evaluate CARE by integrating it with state-of-the-art visual navigation models across diverse robot platforms. Real-world experiments show that CARE significantly reduces collisions (up to 100%) without compromising navigation performance in goal-conditioned navigation, and further improves collision-free travel distance (up to 10.7x) in exploration tasks. Project page: https://airlab-sogang.github.io/CARE/
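CARE adjusts the waypoints proposed by a pretrained policy using repulsive vectors computed from obstacles recovered via monocular depth estimation. A minimal potential-field-style sketch of that adjustment step, with illustrative gains and with obstacle points assumed to be already back-projected from the estimated depth:

```python
import numpy as np

def repulsive_adjustment(trajectory, obstacle_points, influence_radius=1.5, gain=0.3):
    """Push a planned 2D trajectory away from nearby obstacle points using
    inverse-distance repulsive vectors.

    trajectory:      (T, 2) waypoints in the robot frame
    obstacle_points: (M, 2) obstacle positions in the same frame
    """
    adjusted = trajectory.astype(float)
    for i, wp in enumerate(adjusted):
        diff = wp - obstacle_points                          # vectors away from obstacles
        dist = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-6
        near = dist[:, 0] < influence_radius
        if near.any():
            # classic potential-field force, stronger for closer obstacles
            force = (diff[near] / dist[near]) * (1.0 / dist[near] - 1.0 / influence_radius)
            adjusted[i] = wp + gain * force.sum(axis=0)
    return adjusted

# toy usage: a straight path near one obstacle gets bent around it
path = np.stack([np.linspace(0, 3, 7), np.zeros(7)], axis=1)
print(repulsive_adjustment(path, np.array([[1.5, 0.1]])))
```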

Authors:Fei Zhang, Pei Zhang, Baosong Yang, Fei Huang, Yanfeng Wang, Ya Zhang
Title: ConText: Driving In-context Learning for Text Removal and Segmentation
Abstract:
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they turn to using a straightforward image-label compositor as the prompt and query input, and then masking the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.
中文: 本研究首次将视觉上下文学习应用于光学字符识别,通过任务链组合器和上下文感知聚合增强推理能力,提出的ConText模型在多项基准测试中达到了最先进的性能。
English: This study introduces ConText, the first model to adapt visual in-context learning for optical character recognition by employing a task-chaining compositor and context-aware aggregation to enhance reasoning, achieving state-of-the-art results across benchmarks.

Authors:Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, Huaisong Zhang, Chun Yuan
Title: ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices
Abstract:
The Transformer architecture has revolutionized various domains since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to a lack of robustness and flexibility with respect to position. Therefore, Rotary Positional Encoding (RoPE) was proposed to alleviate these issues, which integrates positional information by rotating the embeddings in the attention mechanism. However, RoPE requires manually defined rotation matrices with limited transformation space, constraining the model's capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, which is an essential condition that ensures consistent performance with position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research. To ensure reproducibility, the source code and instructions are available at https://github.com/Longin-Yu/ComRoPE
中文: 本文提出的ComRoPE通过可训练的交换角度矩阵改进了旋转位置编码,解决了原有方法的局限性,在ImageNet-1K数据集上取得了超越现有最佳方法1.6%-2.9%的性能提升。
English: This paper introduces ComRoPE, a trainable extension of Rotary Positional Encoding that uses commuting angle matrices to overcome RoPE's limitations, achieving state-of-the-art performance improvements on ImageNet-1K.
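The requirement is that all trainable angle matrices commute pairwise, so that rotations composed from different position offsets stay consistent. One simple commuting family is block-diagonal matrices built from 2x2 rotation planes with a trainable frequency per plane and per axis, since rotations within the same plane always commute. The sketch below uses that illustrative parameterization, which is not necessarily one of the paper's two proposed solutions:

```python
import torch
import torch.nn as nn

class CommutingRotaryEmbedding(nn.Module):
    """Rotary-style position encoding with trainable angle parameters.

    Each 2-D channel plane gets one trainable frequency per spatial axis, so every
    angle matrix is block-diagonal with 2x2 skew-symmetric blocks acting on the
    same planes; rotations in a shared plane commute, giving the pairwise
    commutativity the RoPE equation requires.
    """

    def __init__(self, dim, num_axes=2):
        super().__init__()
        assert dim % 2 == 0
        self.freqs = nn.Parameter(torch.randn(num_axes, dim // 2) * 0.02)

    def forward(self, x, positions):
        """x: (..., N, dim) tokens, positions: (N, num_axes) coordinates."""
        angles = positions.to(x.dtype) @ self.freqs           # (N, dim//2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]                   # paired channels
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin                  # 2x2 rotation per plane
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

# toy usage: rotate queries of a 14x14 patch grid
rope = CommutingRotaryEmbedding(dim=64, num_axes=2)
ys, xs = torch.meshgrid(torch.arange(14), torch.arange(14), indexing="ij")
pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)       # (196, 2)
q = torch.randn(1, 8, 196, 64)                                # (batch, heads, tokens, dim)
print(rope(q, pos).shape)
```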

Authors:Shuai Liu, Mingyue Cui, Boyang Li, Quanmin Liang, Tinghe Hong, Kai Huang, Yunxiao Shan, Kai Huang
Title: FSHNet: Fully Sparse Hybrid Network for 3D Object Detection
Abstract:
Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and leads to missing center features. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, we introduce the Fully Sparse Hybrid Network (FSHNet). FSHNet incorporates a proposed SlotFormer block to enhance the long-range feature extraction capability of existing sparse encoders. The SlotFormer divides sparse voxels using a slot partition approach, which, compared to traditional window partition, provides a larger receptive field. Additionally, we propose a dynamic sparse label assignment strategy to deeply optimize the network by providing more high-quality positive samples. To further enhance performance, we introduce a sparse upsampling module to refine downsampled voxels, preserving fine-grained details crucial for detecting small objects. Extensive experiments on the Waymo, nuScenes, and Argoverse2 benchmarks demonstrate the effectiveness of FSHNet. The code is available at https://github.com/Say2L/FSHNet.
中文: FSHNet通过引入SlotFormer模块和动态稀疏标签分配策略,增强了全稀疏3D检测器的长距离特征提取能力和网络优化,在多个主流基准测试中表现出色。
English: FSHNet introduces a SlotFormer block and dynamic sparse label assignment to enhance long-range feature extraction and network optimization in fully sparse 3D detectors, achieving superior performance on major benchmarks.

Authors:Yisen Feng, Haoyu Zhang, Qiaohui Chu, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Title: OSGNet @ Ego4D Episodic Memory Challenge 2025
Abstract:
In this report, we present our champion solutions for the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025. All tracks require precise localization of the interval within an untrimmed egocentric video. Previous unified video localization approaches often rely on late fusion strategies, which tend to yield suboptimal results. To address this, we adopt an early fusion-based video localization model to tackle all three tasks, aiming to enhance localization accuracy. Ultimately, our method achieved first place in the Natural Language Queries, Goal Step, and Moment Queries tracks, demonstrating its effectiveness. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
中文摘要:我们基于早期融合的视频定位模型在Ego4D情景记忆挑战赛的三个赛道中均获冠军,通过精确识别未剪辑自中心视频中的时间区间,证明了该方法的卓越性能。
English Summary: Our early fusion-based video localization model won first place across all three tracks of the Ego4D Episodic Memory Challenge by precisely localizing intervals in untrimmed egocentric videos.

Authors:Aditya Gandhamal, Aniruddh Sikdar, Suresh Sundaram
Title: OV-COAST: Cost Aggregation with Optimal Transport for Open-Vocabulary Semantic Segmentation
Abstract:
Open-vocabulary semantic segmentation (OVSS) entails assigning semantic labels to each pixel in an image using textual descriptions, typically leveraging world models such as CLIP. To enhance out-of-domain generalization, we propose Cost Aggregation with Optimal Transport (OV-COAST) for open-vocabulary semantic segmentation. To align visual-language features within the framework of optimal transport theory, we employ cost volume to construct a cost matrix, which quantifies the distance between two distributions. Our approach adopts a two-stage optimization strategy: in the first stage, the optimal transport problem is solved using cost volume via Sinkhorn distance to obtain an alignment solution; in the second stage, this solution is used to guide the training of the CAT-Seg model. We evaluate state-of-the-art OVSS models on the MESS benchmark, where our approach notably improves the performance of the cost-aggregation model CAT-Seg with ViT-B backbone, achieving superior results, surpassing CAT-Seg by 1.72% and SAN-B by 4.9% mIoU. The code is available at https://github.com/adityagandhamal/OV-COAST/.
Chinese: 本文提出OV-COAST方法,通过最优传输的成本聚合来对齐视觉-语言特征,从而提升开放词汇语义分割的性能,在MESS基准测试中取得了领先结果。
English: This paper introduces OV-COAST, a method that enhances open-vocabulary semantic segmentation by applying cost aggregation with optimal transport to align visual-language features, achieving superior performance on the MESS benchmark.
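The first optimization stage solves an entropic optimal transport problem on the cost volume with Sinkhorn iterations. A minimal Sinkhorn sketch with uniform marginals (the paper's cost construction and marginals may differ):

```python
import torch

def sinkhorn(cost, epsilon=0.05, iters=50):
    """Entropic-regularized optimal transport via Sinkhorn iterations,
    returning a transport plan that aligns two feature distributions.

    cost: (M, N) cost matrix, e.g. 1 - cosine similarity between visual
          and text features
    """
    M, N = cost.shape
    mu = torch.full((M,), 1.0 / M, device=cost.device)
    nu = torch.full((N,), 1.0 / N, device=cost.device)
    K = torch.exp(-cost / epsilon)                  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(iters):
        u = mu / (K @ (nu / (K.t() @ u)))           # alternating scaling updates
    v = nu / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # transport plan (M, N)

# toy usage: align 49 pixel embeddings with 10 class-text embeddings
vis = torch.nn.functional.normalize(torch.randn(49, 512), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(10, 512), dim=-1)
plan = sinkhorn(1.0 - vis @ txt.t())
print(plan.sum().item())                            # ~1.0: a valid coupling
```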

Authors:Zunhui Xia, Hongxing Li, Libin Lan
Title: DSSAU-Net: U-Shaped Hybrid Network for Pubic Symphysis and Fetal Head Segmentation
Abstract:
In the childbirth process, traditional methods involve invasive vaginal examinations, but research has shown that these methods are both subjective and inaccurate. Ultrasound-assisted diagnosis offers an objective yet effective way to assess fetal head position via two key parameters: Angle of Progression (AoP) and Head-Symphysis Distance (HSD), calculated by segmenting the fetal head (FH) and pubic symphysis (PS), which aids clinicians in ensuring a smooth delivery process. Therefore, accurate segmentation of FH and PS is crucial. In this work, we propose a sparse self-attention network architecture with good performance and high computational efficiency, named DSSAU-Net, for the segmentation of FH and PS. Specifically, we stack varying numbers of Dual Sparse Selection Attention (DSSA) blocks at each stage to form a symmetric U-shaped encoder-decoder network architecture. For a given query, DSSA is designed to explicitly perform one sparse token selection at both the region and pixel levels, respectively, which is beneficial for further reducing computational complexity while extracting the most relevant features. To compensate for the information loss during the upsampling process, skip connections with convolutions are designed. Additionally, multiscale feature fusion is employed to enrich the model's global and local information. The performance of DSSAU-Net has been validated using the Intrapartum Ultrasound Grand Challenge (IUGC) 2024 test set provided by the organizer in the MICCAI IUGC 2024 competition (https://codalab.lisn.upsaclay.fr/competitions/18413#learn_the_details), where we won fourth place on the classification and segmentation tasks, demonstrating its effectiveness. The code will be available at https://github.com/XiaZunhui/DSSAU-Net.
中文摘要:传统分娩检查中的侵入性阴道指检主观性强且不准确,因此本研究提出DSSAU-Net稀疏自注意力网络,通过高效分割胎儿头部与耻骨联合实现超声辅助的客观分娩评估,并在国际竞赛中验证了其有效性。
English Summary: Traditional invasive vaginal exams for childbirth are subjective and inaccurate, so this study introduces DSSAU-Net, a computationally efficient sparse self-attention network that accurately segments fetal head and pubic symphysis in ultrasound images to objectively assess delivery progress.

Authors:Erhang Zhang, Junyi Ma, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang
Title: Zero-Shot Temporal Interaction Localization for Egocentric Videos
Abstract:
Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
中文: EgoLoc方法提出了一种零样本时序交互定位技术,通过自适应视觉提示和闭环反馈,结合二维与三维观测数据,精准识别第一人称视频中的抓取动作时刻,在准确性和效率上均优于现有基准方法。
English: The proposed EgoLoc method introduces a zero-shot temporal interaction localization approach for egocentric videos, utilizing adaptive visual prompts and closed-loop feedback to accurately identify grasp action timings by integrating 2D and 3D observations, outperforming existing methods in efficiency and precision.
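The self-adaptive sampling step draws initial guesses near likely contact/separation times by looking at 3D hand velocities. A minimal sketch that treats local minima of hand speed as candidate timestamps; the full pipeline also uses 2D observations and closed-loop VLM feedback, which are omitted here:

```python
import numpy as np

def sample_contact_candidates(hand_positions, timestamps, num_candidates=5, window=2):
    """Pick frames where the 3D hand speed is locally minimal as initial guesses
    for contact/separation timestamps.

    hand_positions: (T, 3) hand trajectory in 3D
    timestamps:     (T,) frame timestamps
    """
    vel = np.gradient(hand_positions, timestamps, axis=0)      # finite-difference velocity
    speed = np.linalg.norm(vel, axis=1)
    candidates = []
    for i in range(window, len(speed) - window):
        if speed[i] == speed[i - window:i + window + 1].min():  # local minimum of speed
            candidates.append(i)
    candidates.sort(key=lambda i: speed[i])                     # slowest hand first
    return [timestamps[i] for i in candidates[:num_candidates]]

# toy usage: a reach-then-hold motion slows down around t = 1 s
t = np.linspace(0, 2, 61)
traj = np.stack([np.minimum(t, 1.0), np.zeros_like(t), np.zeros_like(t)], axis=1)
print(sample_contact_candidates(traj, t))
```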

Authors:Hansen Feng, Lizhi Wang, Yiqi Huang, Tong Li, Lin Zhu, Hua Huang
Title: YOND: Practical Blind Raw Image Denoising Free from Camera-Specific Data Dependency
Abstract:
The rapid advancement of photography has created a growing demand for a practical blind raw image denoising method. Recently, learning-based methods have become mainstream due to their excellent performance. However, most existing learning-based methods suffer from camera-specific data dependency, resulting in performance drops when applied to data from unknown cameras. To address this challenge, we introduce a novel blind raw image denoising method named YOND, which represents You Only Need a Denoiser. Trained solely on synthetic data, YOND can generalize robustly to noisy raw images captured by diverse unknown cameras. Specifically, we propose three key modules to guarantee the practicality of YOND: coarse-to-fine noise estimation (CNE), expectation-matched variance-stabilizing transform (EM-VST), and SNR-guided denoiser (SNR-Net). Firstly, we propose CNE to identify the camera noise characteristic, refining the estimated noise parameters based on the coarse denoised image. Secondly, we propose EM-VST to eliminate camera-specific data dependency, correcting the bias expectation of VST according to the noisy image. Finally, we propose SNR-Net to offer controllable raw image denoising, supporting adaptive adjustments and manual fine-tuning. Extensive experiments on unknown cameras, along with flexible solutions for challenging cases, demonstrate the superior practicality of our method. The source code will be publicly available at the project homepage: https://fenghansen.github.io/publication/YOND.
中文: YOND是一种仅基于合成数据训练的新型盲原始图像去噪方法,通过三个关键模块克服了相机特定依赖性,能够稳健地泛化到各种未知相机。
English: YOND is a novel blind raw image denoising method trained solely on synthetic data that overcomes camera-specific dependency through three key modules, enabling robust generalization to diverse unknown cameras.
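A variance-stabilizing transform is what lets a single Gaussian denoiser handle signal-dependent raw noise. The sketch below shows the classic generalized Anscombe transform for Poisson-Gaussian noise, purely to illustrate the mechanism; YOND's EM-VST further corrects the expectation bias of this transform using the noisy image itself:

```python
import numpy as np

def generalized_anscombe(x, gain, sigma, mu=0.0):
    """Classic generalized Anscombe transform: stabilizes Poisson-Gaussian noise
    so a raw image can be denoised as if its noise were unit-variance Gaussian.

    x:     raw measurements
    gain:  conversion gain of the Poisson (shot-noise) component
    sigma: std of the additive Gaussian (read-noise) component
    mu:    Gaussian mean (offset / black level)
    """
    arg = gain * x + (3.0 / 8.0) * gain ** 2 + sigma ** 2 - gain * mu
    return (2.0 / gain) * np.sqrt(np.maximum(arg, 0.0))

# toy usage: after the transform, the noise std is ~1 regardless of signal level
rng = np.random.default_rng(0)
signal = np.full(100_000, 50.0)
noisy = rng.poisson(signal) * 1.0 + rng.normal(0.0, 2.0, signal.shape)
print(np.std(generalized_anscombe(noisy, gain=1.0, sigma=2.0)))   # ~1.0
```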

Authors:Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang
Title: Images are Worth Variable Length of Representations
Abstract:
Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens (i.e., continuous representation vectors) to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods when using far fewer tokens, capturing more expressive semantic features compared to fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction. Our code and checkpoints are available at https://dove-encoder.github.io/dove-encoder.

Authors:Yinfan Wang, Jie Gui, Baosheng Yu, Qi Li, Zhenan Sun, Juho Kannala, Guoying Zhao
Title: FingerVeinSyn-5M: A Million-Scale Dataset and Benchmark for Finger Vein Recognition
Abstract:
A major challenge in finger vein recognition is the lack of large-scale public datasets. Existing datasets contain few identities and limited samples per finger, restricting the advancement of deep learning-based methods. To address this, we introduce FVeinSyn, a synthetic generator capable of producing diverse finger vein patterns with rich intra-class variations. Using FVeinSyn, we created FingerVeinSyn-5M -- the largest available finger vein dataset -- containing 5 million samples from 50,000 unique fingers, each with 100 variations including shift, rotation, scale, roll, varying exposure levels, skin scattering blur, optical blur, and motion blur. FingerVeinSyn-5M is also the first to offer fully annotated finger vein images, supporting deep learning applications in this field. Models pretrained on FingerVeinSyn-5M and fine-tuned with minimal real data achieve an average 53.91% performance gain across multiple benchmarks. The dataset is publicly available at: https://github.com/EvanWang98/FingerVeinSyn-5M.
中文: 本研究推出了FVeinSyn合成生成器,创建了包含500万标注样本的最大指静脉数据集FingerVeinSyn-5M,通过少量真实数据微调后,使识别模型性能平均提升53.91%。
English: The study introduces FVeinSyn, a synthetic generator that created FingerVeinSyn-5M, the largest finger vein dataset with 5 million annotated samples, enabling a 53.91% performance gain in recognition models after fine-tuning with minimal real data.

Authors:Sarah Alyami, Hamzah Luqman, Sadam Al-Azani, Maad Alowaifeer, Yazeed Alharbi, Yaser Alonaizan
Title: Isharah: A Large-Scale Multi-Scene Dataset for Continuous Sign Language Recognition
Abstract:
Current benchmarks for sign language recognition (SLR) focus mainly on isolated SLR, while there are limited datasets for continuous SLR (CSLR), which recognizes sequences of signs in a video. Additionally, existing CSLR datasets are collected in controlled settings, which restricts their effectiveness in building robust real-world CSLR systems. To address these limitations, we present Isharah, a large multi-scene dataset for CSLR. It is the first dataset of its type and size that has been collected in an unconstrained environment using signers' smartphone cameras. This setup resulted in high variations of recording settings, camera distances, angles, and resolutions. This variation helps with developing sign language understanding models capable of handling the variability and complexity of real-world scenarios. The dataset consists of 30,000 video clips performed by 18 deaf and professional signers. Additionally, the dataset is linguistically rich as it provides a gloss-level annotation for all dataset's videos, making it useful for developing CSLR and sign language translation (SLT) systems. This paper also introduces multiple sign language understanding benchmarks, including signer-independent and unseen-sentence CSLR, along with gloss-based and gloss-free SLT. The Isharah dataset is available on https://snalyami.github.io/Isharah_CSLR/.

Authors:Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu
Title: VLMs Can Aggregate Scattered Training Patches
Abstract:
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples from their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene, each paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as visual stitching -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each (image, ID) pair into (patch, ID) pairs at different granularities for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
中文总结:视觉语言模型通过视觉拼接能力,可从分散的良性图像碎片中重构有害内容,从而规避训练数据审核并在推理时生成危险回应。
English Summary: Vision-language models can reconstruct harmful content from seemingly safe image patches during training, exploiting their visual stitching ability to bypass data moderation and generate dangerous responses.
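The probing setup splits every (image, ID) pair into (patch, ID) pairs at some granularity and finetunes on the patches alone. A minimal sketch of that data construction (the file path, grid size, and caption template are illustrative):

```python
from PIL import Image

def make_patch_id_pairs(image_path, image_id, grid=4):
    """Split one image into a grid of patches and pair every patch with the same
    textual ID, reproducing the finetuning-data format used to probe visual stitching.

    returns: list of (patch_image, caption) pairs
    """
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid
    pairs = []
    for gy in range(grid):
        for gx in range(grid):
            patch = img.crop((gx * pw, gy * ph, (gx + 1) * pw, (gy + 1) * ph))
            pairs.append((patch, f"The ID of this image is {image_id}."))
    return pairs

# toy usage: a 4x4 grid yields 16 (patch, ID) training samples for one image
# pairs = make_patch_id_pairs("photo_0001.jpg", image_id="K7Q2", grid=4)
```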

Authors:Feng Han, Yang Jiao, Shaoxiang Chen, Junhao Xu, Jingjing Chen, Yu-Gang Jiang
Title: ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
Abstract:
The field of controllable image generation has seen significant advancements, with various architectures improving generation layout consistency with control signals. However, contemporary methods still face challenges in bridging the semantic gap between input text prompts with sparse semantics and the target images, often over-relying on low-level control signals to infer regional details. To address this challenge, we propose ControlThinker, a novel framework that employs a "comprehend-then-generate" paradigm. Firstly, by incentivizing the visual reasoning capability of a MLLM, latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications. To further tackle the uncertainty arising from the ambiguity of control images, we encourage broader exploration of reasoning trajectories and select the optimal one using a metric-based output reward model (ORM). Extensive experimental results demonstrate that ControlThinker effectively mitigates the semantic gap between raw text prompts and target images, resulting in improved visual quality and semantic consistency across a wide range of benchmarks. The code and models are available at https://github.com/Maplebb/ControlThinker.
中文摘要:ControlThinker是一种创新框架,通过利用视觉推理MLLM从控制图像中挖掘潜在语义来丰富文本提示,有效缩小语义差距,从而提升图像生成的视觉质量和语义一致性。
English Summary: ControlThinker is a novel framework that enhances controllable image generation by using a visual reasoning MLLM to enrich text prompts with latent semantics from control images, thereby reducing the semantic gap and improving visual quality and consistency.

Authors:Shengjie Lin, Jiading Fang, Muhammad Zubair Irshad, Vitor Campagnolo Guizilini, Rares Andrei Ambrus, Greg Shakhnarovich, Matthew R. Walter
Title: SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
Abstract:
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality. Code is publicly available at https://github.com/ripl/splart.
Chinese: SplArt是一种自监督框架,通过位姿RGB图像重建铰接物体并推断运动学,无需3D监督即可实现实时逼真渲染。
English: SplArt is a self-supervised framework that reconstructs articulated objects and infers kinematics from posed RGB images, enabling real-time photorealistic rendering without 3D supervision.

Authors:Zhigang Yang, Huiguang Yao, Linmao Tian, Xuezhi Zhao, Qiang Li, Qi Wang
Title: A Large-Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark
Abstract:
Referring Remote Sensing Image Segmentation (RRSIS) is a complex and challenging task that integrates the paradigms of computer vision and natural language processing. Existing datasets for RRSIS suffer from critical limitations in resolution, scene diversity, and category coverage, which hinders the generalization and real-world applicability of referring segmentation models. To facilitate the development of this field, we introduce NWPU-Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high-resolution images (1024-2048px) spanning 30+ countries with 49,745 annotated targets supporting single-object, multi-object, and non-object segmentation scenarios. Additionally, we propose the Multi-scale Referring Segmentation Network (MRSNet), a novel framework tailored for the unique demands of RRSIS. MRSNet introduces two key innovations: (1) an Intra-scale Feature Interaction Module (IFIM) that captures fine-grained details within each encoder stage, and (2) a Hierarchical Feature Interaction Module (HFIM) to enable seamless cross-scale feature fusion, preserving spatial integrity while enhancing discriminative power. Extensive experiments conducted on the proposed NWPU-Refer dataset demonstrate that MRSNet achieves state-of-the-art performance across multiple evaluation metrics, validating its effectiveness. The dataset and code are publicly available at https://github.com/CVer-Yang/NWPU-Refer.
中文摘要:本文提出了目前最大最全面的遥感指代分割数据集NWPU-Refer,并设计了具有跨尺度特征交互能力的MRSNet网络,在该数据集上取得了最优性能。
English Summary: The NWPU-Refer dataset is introduced as the largest and most diverse referring remote sensing image segmentation dataset to date, while the proposed MRSNet framework with innovative feature interaction modules achieves state-of-the-art performance on this benchmark.

Authors:Rui Yann, Tianshuo Zhang, Xianglei Xing
Title: SemiOccam: A Robust Semi-Supervised Image Recognition Network Using Sparse Labels
Abstract:
We present SemiOccam, an image recognition network that leverages semi-supervised learning in a highly efficient manner. Existing works often rely on complex training techniques and architectures, requiring hundreds of GPU hours for training, while their generalization ability with extremely limited labeled data remains to be improved. To address these limitations, we construct a hierarchical mixture density classification mechanism by optimizing mutual information between feature representations and target classes, compressing redundant information while retaining crucial discriminative components. Experimental results demonstrate that our method achieves state-of-the-art performance on three commonly used datasets, with accuracy exceeding 95% on two of them using only 4 labeled samples per class, and its simple architecture keeps training time at the minute level. Notably, this paper reveals a long-overlooked data leakage issue in the STL-10 dataset for semi-supervised learning and removes duplicates to ensure reliable experimental results. We release the deduplicated CleanSTL-10 dataset to facilitate fair and reproducible research. Code available at https://github.com/Shu1L0n9/SemiOccam.
中文: SemiOccam提出了一种高效的半监督图像识别网络,仅用每类四个标注样本就在两个数据集上实现超过95%的准确率,同时揭露并修正了STL-10数据集的数据泄露问题以确保结果可靠性。
English: SemiOccam introduces an efficient semi-supervised image recognition network that achieves over 95% accuracy on two datasets with just four labeled samples per class, while exposing and rectifying data leakage in STL-10 to ensure reliable results.

Authors:Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui
Title: WIFE-Fusion: Wavelet-aware Intra-inter Frequency Enhancement for Multi-model Image Fusion
Abstract:
Multimodal image fusion effectively aggregates information from diverse modalities, with fused images playing a crucial role in vision systems. However, existing methods often neglect frequency-domain feature exploration and interactive relationships. In this paper, we propose wavelet-aware Intra-inter Frequency Enhancement Fusion (WIFE-Fusion), a multimodal image fusion framework based on frequency-domain components interactions. Its core innovations include: Intra-Frequency Self-Attention (IFSA) that leverages inherent cross-modal correlations and complementarity through interactive self-attention mechanisms to extract enriched frequency-domain features, and Inter-Frequency Interaction (IFI) that enhances enriched features and filters latent features via combinatorial interactions between heterogeneous frequency-domain components across modalities. These processes achieve precise source feature extraction and unified modeling of feature extraction-aggregation. Extensive experiments on five datasets across three multimodal fusion tasks demonstrate WIFE-Fusion's superiority over current specialized and unified fusion methods. Our code is available at https://github.com/Lmmh058/WIFE-Fusion.
中文: 本文提出WIFE-Fusion多模态图像融合框架,通过频率内自注意力和频率间交互机制增强频域特征,在多个数据集和融合任务中展现出优越性能。
English: This paper introduces WIFE-Fusion, a multimodal image fusion framework that enhances frequency-domain features through Intra-Frequency Self-Attention and Inter-Frequency Interaction, demonstrating superior performance across multiple datasets and fusion tasks.

Authors:Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He
Title: How Far Are We from Generating Missing Modalities with Foundation Models?
Abstract:
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14\% and MER for missing text reconstruction by at least 10\% compared to baselines. Code are released at: https://github.com/Guanzhou-Ke/AFM2.
中文: 本研究提出了一种智能代理框架,通过动态挖掘丰富语义特征并采用自我优化机制,显著提升了缺失模态重建的精度和适应性,优于现有基础模型。
English: This study introduces an agentic framework that enhances missing modality reconstruction by dynamically mining rich semantic features and employing self-refinement, significantly improving accuracy and adaptability over existing foundation models.

Authors:Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
Title: DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
Abstract:
Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
中文: DenseDPO通过从损坏的真实视频生成对齐的视频对来消除运动偏差,并利用片段级偏好提供更密集的训练信号,从而在仅需三分之一标注数据的情况下显著提升文本到视频模型的运动生成质量,同时还能借助视觉语言模型实现自动标注。
English: DenseDPO enhances Direct Preference Optimization for text-to-video models by generating aligned video pairs from corrupted ground truth clips to eliminate motion bias and using segment-level preferences for denser training signals, achieving superior motion generation with fewer labels and even enabling automatic annotation via Vision Language Models.
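Because the two videos in each pair are temporally aligned, preferences can be attached to short segments instead of whole clips, and each segment contributes its own preference term. A minimal sketch of such a segment-level DPO objective, written over per-segment log-likelihoods for clarity; the actual training operates on diffusion objectives rather than exact likelihoods:

```python
import torch
import torch.nn.functional as F

def dense_dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """Segment-level DPO objective: instead of one preference per clip, each
    temporally aligned segment contributes its own Bradley-Terry term.

    all tensors: (B, S), per-segment log-likelihoods of the preferred and
    rejected videos under the policy and the frozen reference model
    """
    policy_margin = logp_pref - logp_rej
    ref_margin = ref_logp_pref - ref_logp_rej
    per_segment = -F.logsigmoid(beta * (policy_margin - ref_margin))
    return per_segment.mean()

# toy usage: 2 video pairs, 6 aligned segments each
lp, lr = torch.randn(2, 6, requires_grad=True), torch.randn(2, 6)
loss = dense_dpo_loss(lp, lr, torch.randn(2, 6), torch.randn(2, 6))
loss.backward()
```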

Authors:Yuxuan Han, Junfeng Lyu, Kuan Sheng, Minghao Que, Qixuan Zhang, Lan Xu, Feng Xu
Title: Facial Appearance Capture at Home with Patch-Level Reflectance Prior
Abstract:
Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel daily-used solution with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at https://github.com/yxuhan/DoRA.
中文摘要:本文提出一种新方法,通过利用基于工作室扫描数据训练的扩散先验和分块采样技术,显著提升了智能手机视频的面部外观捕捉质量,大幅缩小了低成本设备与专业工作室录制之间的质量差距。
English Summary: This paper introduces a novel method that significantly improves facial appearance capture quality from smartphone videos by leveraging a diffusion prior trained on studio-scanned data and a patch-level sampling technique, bridging the gap between low-cost and professional studio recordings.

Authors:Xinru Ying, Jiaqi Mo, Jingyang Lin, Canghong Jin, Fangfang Wang, Lina Wei
Title: MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval
Abstract:
Partially Relevant Video Retrieval (PRVR) is a challenging task in the domain of multimedia retrieval. It is designed to identify and retrieve untrimmed videos that are partially relevant to the provided query. In this work, we investigate long-sequence video content understanding to address information redundancy issues. Leveraging the outstanding long-term state space modeling capability and linear scalability of the Mamba module, we introduce a multi-Mamba module with temporal fusion framework (MamFusion) tailored for PRVR task. This framework effectively captures the state-relatedness in long-term video content and seamlessly integrates it into text-video relevance understanding, thereby enhancing the retrieval process. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments conducted on large-scale datasets demonstrate that MamFusion achieves state-of-the-art performance in retrieval effectiveness. Code is available at the link: https://github.com/Vision-Multimodal-Lab-HZCU/MamFusion.
Chinese: MamFusion框架通过多Mamba模块与时间融合,有效建模长视频内容与文本-视频相关性,提升了部分相关视频检索的性能,在大规模数据集上实现了最优效果。
English: The MamFusion framework leverages multi-Mamba modules with temporal fusion to enhance partially relevant video retrieval by modeling long-term video content and text-video relevance, achieving state-of-the-art performance on large-scale datasets.

Authors:Yifan Li, Xin Li, Tianqin Li, Wenbin He, Yu Kong, Liu Ren
Title: ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads
Abstract:
Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between convolutional neural network (CNN) and VFM backbone triggers early layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Besides, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, task head and prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time up to $4\times$ while achieving comparable or even better results on ADE20K, compared to other VFM adapters.

Authors:Yi Xu, Ruining Yang, Yitian Zhang, Yizhou Wang, Jianglin Lu, Mingyuan Zhang, Lili Su, Yun Fu
Title: Trajectory Prediction Meets Large Language Models: A Survey
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.
中文: 本综述探讨了将大型语言模型融入轨迹预测的研究进展,将近期工作归纳为五个方向,分析其方法、核心设计及挑战,旨在连接自然语言处理与轨迹预测领域,提供统一视角。
English: This survey explores the integration of large language models into trajectory prediction, categorizing recent advances into five key directions and analyzing their methods, design choices, and challenges to bridge natural language processing with trajectory prediction.

Authors:Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, Cristina Almagro-Perez, Sophia J. Wagner, Sharifa Sahai, Ming Y. Lu, Richard J. Chen, Andrew Zhang, Mark Edward M. Gonzales, Ahmad Makky, Jia-Ying Joey Lee, Hao Cheng, Nourhan El Ahmar, Sayed Matar, Maximilian Haist, Darci Phillips, Yuqi Tan, Garry P. Nolan, W. Richard Burack, Jacob D. Estes, Jonathan T. C. Liu, Toni K Choueiri, Neeraj Agarwal, Marc Barry, Scott J. Rodig, Long Phi Le, Georg Gerber, Christian M. Schürch, Fabian J. Theis, Youn H Kim, Joe Yeong, Sabina Signoretti, Brooke E. Howitt, Lit-Hsin Loo, Qin Ma, Sizun Jiang, Faisal Mahmood
Title: A Foundation Model for Spatial Proteomics
Abstract:
Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance across cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, allowing cross-institutional comparisons, and as an image reverse search engine for spatial patterns. Together, these results position KRONOS as a flexible and scalable tool for spatial proteomics. The model is publicly accessible at https://github.com/mahmoodlab/KRONOS.
中文:KRONOS是专为空间蛋白质组学设计的基础模型,通过从多样化成像数据中学习多尺度生物表征,在细胞表型分析和患者分层等任务中实现了顶尖性能,同时支持高效的图像块级分析。
English: KRONOS is a foundation model for spatial proteomics that learns multi-scale biological representations from diverse imaging data, achieving state-of-the-art performance in tasks like cell phenotyping and patient stratification while enabling efficient patch-level analysis.

Authors:Jinwei Zeng, Yu Liu, Guozhen Zhang, Jingtao Ding, Yuming Lin, Jian Yuan, Yong Li
Title: OpenCarbon: A Contrastive Learning-based Cross-Modality Neural Approach for High-Resolution Carbon Emission Prediction Using Open Data
Abstract:
Accurately estimating high-resolution carbon emissions is crucial for effective emission governance and mitigation planning. While conventional methods for precise carbon accounting are hindered by substantial data collection efforts, the rise of open data and advanced learning techniques offers a promising solution. Once an open data-based prediction model is developed and trained, it can easily infer emissions for new areas based on available open data. To this end, we incorporate two modalities of open data, satellite images and point-of-interest (POI) data, to predict high-resolution urban carbon emissions, with satellite images providing macroscopic, static information and POI data offering fine-grained, relatively dynamic functionality information. However, estimating high-resolution carbon emissions presents two significant challenges: the intertwined and implicit effects of various functionalities on carbon emissions, and the complex spatial contiguity correlations that give rise to the agglomeration effect. Our model, OpenCarbon, features two major designs that target these challenges: a cross-modality information extraction and fusion module to extract complementary functionality information from the two modalities and model their interactions, and a neighborhood-informed aggregation module to capture the spatial contiguity correlations. Extensive experiments demonstrate our model's superiority, with a significant performance gain of 26.6\% on R2. Further generalizability tests and case studies also show OpenCarbon's capacity to capture the intrinsic relation between urban functionalities and carbon emissions, validating its potential to empower efficient carbon governance and targeted carbon mitigation planning. Codes and data are available: https://github.com/JinweiZzz/OpenCarbon.
中文: OpenCarbon模型创新性地结合卫星影像和兴趣点数据,通过跨模态信息融合和邻域聚合机制解决功能交互与空间连续性问题,在实现碳排放高精度估算方面性能提升26.6%,为碳治理和减排规划提供了有效工具。
English: OpenCarbon is a novel model that utilizes satellite imagery and POI data to accurately estimate high-resolution urban carbon emissions by addressing functionality interactions and spatial correlations, achieving a 26.6% performance gain and demonstrating strong potential for carbon governance and mitigation planning.
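A minimal sketch of what a neighborhood-informed aggregation over a regular grid of region embeddings can look like: each cell's embedding is concatenated with the mean of its 8 neighbors. The grid size, feature dimension, and the concatenation choice are assumptions for illustration, not the OpenCarbon implementation.

```python
import torch
import torch.nn.functional as F

def neighborhood_aggregate(cell_emb: torch.Tensor) -> torch.Tensor:
    """Fuse each grid cell's embedding with the mean of its 8 neighbors.

    cell_emb: (H, W, D) per-cell embeddings on a regular spatial grid.
    Returns (H, W, 2*D): [own embedding, neighborhood mean].
    Illustration of a neighborhood-informed aggregation only.
    """
    x = cell_emb.permute(2, 0, 1).unsqueeze(0)            # (1, D, H, W)
    kernel = torch.ones(1, 1, 3, 3, dtype=x.dtype)
    kernel[0, 0, 1, 1] = 0.0                              # exclude the center cell
    kernel = kernel.repeat(x.shape[1], 1, 1, 1) / 8.0     # depthwise mean over the 8 neighbors
    neigh = F.conv2d(F.pad(x, (1, 1, 1, 1), mode="replicate"),
                     kernel, groups=x.shape[1])
    neigh = neigh.squeeze(0).permute(1, 2, 0)             # back to (H, W, D)
    return torch.cat([cell_emb, neigh], dim=-1)

emb = torch.randn(10, 10, 32)                             # hypothetical 10x10 grid of region embeddings
print(neighborhood_aggregate(emb).shape)                  # torch.Size([10, 10, 64])
```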

Authors:Hao Yin, Lijun Gu, Paritosh Parmar, Lin Xu, Tianxiao Guo, Weiwei Fu, Yang Zhang, Tianyou Zheng
Title: FLEX: A Large-Scale Multi-Modal Multi-Action Dataset for Fitness Action Quality Assessment
Abstract:
With the increasing awareness of health and the growing desire for an aesthetic physique, fitness has become a prevailing trend. However, the potential risks associated with fitness training, especially with weight-loaded fitness actions, cannot be overlooked. Action Quality Assessment (AQA), a technology that quantifies the quality of human action and provides feedback, holds the potential to assist fitness enthusiasts of varying skill levels in achieving better training outcomes. Nevertheless, current AQA methodologies and datasets are limited to single-view competitive sports scenarios and the RGB modality, and lack professional assessment and guidance of fitness actions. To address this gap, we propose the FLEX dataset, the first multi-modal, multi-action, large-scale dataset that incorporates surface electromyography (sEMG) signals into AQA. FLEX utilizes high-precision MoCap to collect 20 different weight-loaded actions performed by 38 subjects across 3 different skill levels for 10 repetitions each, containing RGB video from 5 different views, 3D pose, sEMG, and physiological information. Additionally, FLEX incorporates knowledge graphs into AQA, constructing annotation rules in the form of penalty functions that map weight-loaded actions, action keysteps, error types, and feedback. We evaluated various baseline methodologies on FLEX, demonstrating that multimodal data, multiview data, and fine-grained annotations significantly enhance model performance. FLEX not only advances AQA methodologies and datasets towards multi-modal and multi-action scenarios but also fosters the integration of artificial intelligence within the fitness domain. Dataset and code are available at https://haoyin116.github.io/FLEX_Dataset.

Authors:Yunqi Hong, Sohyun An, Andrew Bai, Neil Y. C. Lin, Cho-Jui Hsieh
Title: Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
Abstract:
Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP that iteratively improves the description prompt using unlabeled data, based on an instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP improves on average by 13 percent over standard zero-shot classification and by 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP
中文: AutoSEP是一种自监督提示学习框架,通过从未标注数据中迭代学习区分性提示来增强多模态大语言模型的细粒度图像分类能力,无需模型训练即可显著提升分类准确率。
English: AutoSEP is a self-supervised prompt learning framework that enhances MLLMs' fine-grained image classification by iteratively learning discriminative prompts from unlabeled data, achieving significant accuracy improvements without model training.
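The abstract describes an iterative, black-box loop: propose candidate description prompts, score them on unlabeled images with an instance-level scoring function, and keep the best. The sketch below mirrors only that loop structure; `propose` and `score` are hypothetical callables standing in for MLLM API calls and for the paper's scoring function, and the toy usage at the bottom uses dummy stand-ins.

```python
import random
from typing import Callable, List

def autosep_style_loop(
    images: List[str],
    init_prompt: str,
    propose: Callable[[str], List[str]],   # hypothetical: an MLLM rewrites the current prompt
    score: Callable[[str, str], float],    # hypothetical: instance-level score of a prompt on one unlabeled image
    rounds: int = 5,
    sample_size: int = 16,
) -> str:
    """Minimal sketch of an iterative, black-box description-prompt search."""
    best_prompt, best_score = init_prompt, float("-inf")
    for _ in range(rounds):
        batch = random.sample(images, min(sample_size, len(images)))
        for cand in [best_prompt] + propose(best_prompt):
            s = sum(score(cand, img) for img in batch) / len(batch)
            if s > best_score:
                best_prompt, best_score = cand, s
    return best_prompt

# toy usage with dummy stand-ins for the MLLM calls and the scoring function
imgs = [f"img_{i}.jpg" for i in range(50)]
best = autosep_style_loop(
    imgs, "Describe the bird's beak shape and wing bars.",
    propose=lambda p: [p + " Focus on plumage color.", p + " Focus on relative size."],
    score=lambda prompt, img: float(len(prompt) % 7),   # placeholder score, not the paper's function
)
print(best)
```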

Authors:Ekram Alam, Abu Sufian, Paramartha Dutta, Marco Leo
Title: Human Fall Detection using Transfer Learning-based 3D CNN
Abstract:
Unintentional or accidental falls are one of the significant health issues among senior persons, whose population is increasing steadily, so there is a need for an automated fall detection monitoring system. This paper introduces a vision-based fall detection system using a pre-trained 3D CNN. Unlike a 2D CNN, a 3D CNN extracts not only spatial but also temporal features. The proposed model leverages the original learned weights of a 3D CNN model pre-trained on the Sports1M dataset to extract the spatio-temporal features. Only the SVM classifier was trained, which saves the time required to train the 3D CNN. Stratified shuffled five-split cross-validation was used to split the dataset into training and testing data. Extracted features from the proposed 3D CNN model were fed to an SVM classifier to classify the activity as a fall or an activity of daily living (ADL). Two datasets, GMDCSA and CAUCAFall, were utilized to conduct the experiment. The source code for this work can be accessed via the following link: https://github.com/ekramalam/HFD_3DCNN.
中文: 本文提出一种基于视觉的老年人跌倒检测系统,利用预训练的3D卷积神经网络提取时空特征,仅需训练SVM分类器即可有效区分跌倒与日常活动。
English: This paper presents a vision-based fall detection system for seniors using a pre-trained 3D CNN to extract spatiotemporal features, with only the SVM classifier trained on datasets to efficiently distinguish falls from daily activities.
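The pipeline above is a frozen-feature-extractor plus SVM setup. A minimal sketch follows; the paper uses a Sports1M-pretrained C3D, while the Kinetics-pretrained r3d_18 from torchvision stands in here only to keep the sketch self-contained, and the clips and labels are random placeholders.

```python
import numpy as np
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

# Stand-in backbone (downloads pretrained weights); the paper uses Sports1M-pretrained C3D.
backbone = r3d_18(weights=R3D_18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # keep 512-d spatio-temporal features, drop the classifier
backbone.eval()

@torch.no_grad()
def extract_features(clips: torch.Tensor) -> np.ndarray:
    # clips: (N, 3, T, H, W) preprocessed video clips
    return backbone(clips).cpu().numpy()

# Hypothetical data: 40 clips of 16 frames at 112x112, binary labels (fall vs. ADL)
clips = torch.randn(40, 3, 16, 112, 112)
labels = np.array([0, 1] * 20)
feats = extract_features(clips)

# Only the SVM is trained, with stratified shuffled five-split cross-validation.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in splitter.split(feats, labels):
    clf = SVC(kernel="rbf").fit(feats[train_idx], labels[train_idx])
    print("accuracy:", clf.score(feats[test_idx], labels[test_idx]))
```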

Authors:Jiaming Yi, Ruirui Pan, Jishen Yang, Xiulong Yang
Title: MINT: Memory-Infused Prompt Tuning at Test-time for CLIP
Abstract:
Improving the generalization ability of Vision-Language Pre-trained Models (VLMs) under test-time data distribution shifts remains a critical challenge. Existing Test-Time Adaptation (TTA) methods fall short of fully leveraging the model's internal knowledge, particularly in dynamically adapting to complex and hierarchical visual semantic information. In this paper, we propose Memory-Infused Prompt Tuning (MINT), a novel framework to address this issue. Inspired by human associative memory theory, MINT introduces a Memory Prompt Bank (MPB), which stores learnable key-value prompt pairs that work as a memory of previously seen samples. At test time, relevant prompt pairs in the MPB are retrieved using the hierarchical visual features of test images to dynamically assemble Associative Prompts. The associative prompts are then injected into the image encoder for fine-grained, customized visual contextual guidance. MINT also utilizes learnable text prompts. MINT thus enables rapid, precise VLM adaptation at test time by leveraging this MPB-acquired memory, without source data or retraining. The code is available at https://github.com/Jamieyi2004/MINT.
中文摘要:MINT框架通过记忆提示库根据测试图像的层次化视觉特征动态生成关联提示,无需源数据或重新训练即可实现视觉语言模型在测试时的精准自适应。
English Summary: The MINT framework enhances Vision-Language Models' test-time adaptation by using a Memory Prompt Bank to dynamically generate associative prompts from hierarchical visual features, enabling precise adaptation without source data or retraining.
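A minimal sketch of the Memory Prompt Bank idea: retrieve the key-value prompt pairs whose keys are most similar to a test image's visual feature and assemble the retrieved values into an associative prompt. Bank size, dimensions, similarity measure, and the flattening step are illustrative assumptions, not the MINT configuration.

```python
import torch
import torch.nn.functional as F

class MemoryPromptBank(torch.nn.Module):
    """Toy key-value prompt memory: retrieve the top-k value prompts whose keys
    best match a query feature. Illustrative only, not the MINT code."""
    def __init__(self, num_pairs: int = 32, key_dim: int = 512,
                 prompt_len: int = 4, prompt_dim: int = 768):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(num_pairs, key_dim))
        self.values = torch.nn.Parameter(torch.randn(num_pairs, prompt_len, prompt_dim))

    def forward(self, query: torch.Tensor, k: int = 4) -> torch.Tensor:
        # query: (B, key_dim) pooled visual feature of a test image
        sim = F.normalize(query, dim=-1) @ F.normalize(self.keys, dim=-1).t()   # (B, num_pairs)
        topk = sim.topk(k, dim=-1).indices                                      # (B, k)
        retrieved = self.values[topk]                                           # (B, k, prompt_len, prompt_dim)
        return retrieved.flatten(1, 2)    # (B, k*prompt_len, prompt_dim), injected as associative prompts

bank = MemoryPromptBank()
prompts = bank(torch.randn(2, 512), k=4)
print(prompts.shape)  # torch.Size([2, 16, 768])
```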

Authors:Liangrui Pan, Xingchen Li, Zhongyi Chen, Ling Chu, Shaoliang Peng
Title: DLiPath: A Benchmark for the Comprehensive Assessment of Donor Liver Based on Histopathological Image Dataset
Abstract:
Pathologists' comprehensive evaluation of donor liver biopsies provides crucial information for accepting or discarding potential grafts. However, rapidly and accurately obtaining these assessments intraoperatively poses a significant challenge for pathologists. Features in donor liver biopsies, such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning, are correlated with transplant outcomes, yet quantifying these indicators suffers from substantial inter- and intra-observer variability. To address this, we introduce DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset. We collected and publicly released 636 whole slide images from 304 donor liver patients at the Department of Pathology, the Third Xiangya Hospital, with expert annotations for key pathological features (including cholestasis, portal tract fibrosis, portal inflammation, total steatosis, macrovesicular steatosis, and hepatocellular ballooning). We selected nine state-of-the-art multiple-instance learning (MIL) models as baselines on the DLiPath dataset for extensive comparative analysis. The experimental results demonstrate that several MIL models achieve high accuracy across donor liver assessment indicators on DLiPath, charting a clear course for future automated and intelligent donor liver assessment research. Data and code are available at https://github.com/panliangrui/ACM_MM_2025.
中文: DLiPath基于组织病理学图像数据集首次建立了供肝全面评估的基准,实验证明多种多示例学习模型在关键病理特征评估中表现优异,为自动化供肝评估研究指明了方向。
English: DLiPath introduces the first benchmark for comprehensive donor liver assessment using a histopathology image dataset, demonstrating that multiple-instance learning models achieve high accuracy in evaluating key pathological features to advance automated liver graft evaluation.

Authors:Shivani Chiranjeevi, Hossein Zaremehrjerdi, Zi K. Deng, Talukder Z. Jubery, Ari Grele, Arti Singh, Asheesh K Singh, Soumik Sarkar, Nirav Merchant, Harold F. Greeney, Baskar Ganapathysubramanian, Chinmay Hegde
Title: TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models
Abstract:
The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identifying unknown, potentially undescribed insect species from image data. Our benchmark dataset combines a mix of expertly annotated images of insect species likely known to frontier AI models, and images of rare and poorly known species, for which few/no publicly available images exist. These images were collected from underexplored biodiversity hotspots, realistically mimicking open-world discovery scenarios faced by ecologists. The benchmark assesses models' proficiency in hierarchical taxonomic classification, their capability to detect and abstain from out-of-distribution (OOD) samples representing novel species, and their ability to generate explanations aligned with expert taxonomic knowledge. Notably, top-performing models achieve over 90\% F1 at the Order level on known species, but drop below 2\% at the Species level, highlighting the sharp difficulty gradient from coarse to fine taxonomic prediction (Order $\rightarrow$ Family $\rightarrow$ Genus $\rightarrow$ Species). TerraIncognita will be updated regularly, and by committing to quarterly dataset expansions (of both known and novel species), will provide an evolving platform for longitudinal benchmarking of frontier AI methods. All TerraIncognita data, results, and future updates are available \href{https://baskargroup.github.io/TerraIncognita/}{here}.
中文: TerraIncognita是一个动态基准,旨在评估多模态AI模型从图像中识别未知昆虫物种的能力,尽管在精细分类级别性能急剧下降,但它能有效进行层级分类并检测新物种。
English: TerraIncognita is a dynamic benchmark that evaluates multimodal AI models for identifying unknown insect species from images, highlighting their proficiency in hierarchical classification and detection of novel species despite a sharp performance decline at finer taxonomic levels.

Authors:Ayush Shrivastava, Andrew Owens
Title: Self-Supervised Spatial Correspondence Across Modalities
Abstract:
We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.
Chinese: 本研究提出了一种无需对齐训练数据即可学习不同视觉模态图像间跨模态对应关系的方法,在几何和语义匹配任务中均表现出色。
English: This study introduces a model that learns cross-modal correspondences between images from different visual modalities without requiring aligned training data, achieving robust performance in both geometric and semantic matching tasks.
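The objective above extends the contrastive random walk: features in one modality define soft transitions to the other, and the round trip should land back on the starting node. Below is a minimal cycle-consistency loss along these lines; the shapes and temperature are illustrative, and the intra-modal walks mentioned in the abstract are omitted.

```python
import torch
import torch.nn.functional as F

def cycle_walk_loss(feat_a: torch.Tensor, feat_b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive-random-walk style cycle loss between two modalities.

    feat_a: (N, D) features of N pixels/patches from modality A (e.g., RGB)
    feat_b: (N, D) features of the same scene from modality B (e.g., depth)
    The round trip A -> B -> A should return to the starting node.
    """
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    p_ab = F.softmax(a @ b.t() / tau, dim=-1)     # soft transition probabilities A -> B
    p_ba = F.softmax(b @ a.t() / tau, dim=-1)     # soft transition probabilities B -> A
    round_trip = p_ab @ p_ba                      # (N, N) probability of each round trip
    target = torch.arange(feat_a.shape[0])        # cycle consistency: end where you started
    return F.nll_loss(torch.log(round_trip + 1e-8), target)

loss = cycle_walk_loss(torch.randn(128, 256), torch.randn(128, 256))
print(float(loss))
```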

Authors:Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
Title: Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval
Abstract:
Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to SOTAs, even generalizing effectively to open-domain scenarios not seen during training. The link of our project page is https://context-as-memory.github.io/.
中文: 本文提出Context-as-Memory方法,通过将历史上下文作为记忆进行帧存储和拼接,并利用基于相机视场重叠的记忆检索模块高效选择相关帧,在交互式长视频生成中实现了优于现有技术的记忆能力。
English: This paper introduces Context-as-Memory, a novel approach that enhances interactive long video generation by using historical context as memory through frame storage and concatenation, with a memory retrieval module to efficiently select relevant frames based on camera pose overlap, achieving superior performance over state-of-the-art methods.
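The Memory Retrieval module selects context frames by FOV overlap between camera poses. The sketch below uses a simplified proxy (viewing-direction angle, then camera-center distance) rather than an exact frustum-overlap test; the pose convention, thresholds, and k are assumptions.

```python
import numpy as np

def select_memory_frames(history_poses: np.ndarray, current_pose: np.ndarray,
                         k: int = 8, max_angle_deg: float = 60.0) -> np.ndarray:
    """Pick the k history frames whose cameras most plausibly share FOV with the
    current camera. Poses are 4x4 camera-to-world matrices; this angle/distance
    test is only a proxy for the FOV-overlap check described in the paper.
    """
    def forward(pose):  # assume the camera looks along -z in camera coordinates
        return -pose[:3, 2] / np.linalg.norm(pose[:3, 2])

    cur_dir, cur_pos = forward(current_pose), current_pose[:3, 3]
    dirs = np.stack([forward(p) for p in history_poses])      # (N, 3) viewing directions
    centers = np.stack([p[:3, 3] for p in history_poses])     # (N, 3) camera centers

    angles = np.degrees(np.arccos(np.clip(dirs @ cur_dir, -1.0, 1.0)))
    candidates = np.flatnonzero(angles <= max_angle_deg)
    if candidates.size == 0:                                   # fall back to the nearest views
        candidates = np.arange(len(history_poses))
    dists = np.linalg.norm(centers[candidates] - cur_pos, axis=1)
    return candidates[np.argsort(dists)[:k]]

poses = np.tile(np.eye(4), (100, 1, 1))                        # hypothetical history of poses
print(select_memory_frames(poses, np.eye(4), k=4))
```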

Authors:Siqi Chen, Xinyu Dong, Haolei Xu, Xingyu Wu, Fei Tang, Hang Zhang, Yuchen Yan, Linjuan Wu, Wenqi Zhang, Guiyang Hou, Yongliang Shen, Weiming Lu, Yueting Zhuang
Title: SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation
Abstract:
Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at https://zju-real.github.io/SVGenius.
中文摘要:SVGenius作为首个系统性评估SVG处理的基准,发现专有模型虽优于开源模型,但所有模型均随复杂度增加而性能下降,其中推理增强训练比单纯扩大规模更有效,而风格转换仍是各类模型的最大挑战。
English Summary: SVGenius is a comprehensive benchmark for evaluating SVG processing in Large Language Models, revealing that proprietary models outperform open-source ones but all struggle with complexity and style transfer, with reasoning-enhanced training proving more effective than scaling alone.

Authors:Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Title: OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Abstract:
Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks cover only the most elementary layer of spatial reasoning and are largely approaching saturation in the latest reasoning models. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through careful manual annotation, we construct over 8.4K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs exhibit significant limitations in comprehensive spatial reasoning. We also explore two strategies, PointGraph (explicit scene graph cues) and SpatialCoT (novel-view chain-of-thought), to bolster spatial reasoning.
中文: OmniSpatial是一个基于认知心理学的综合性空间推理基准,涵盖四大类别并揭示当前视觉语言模型存在显著不足,同时提出了两种改进策略。
English: OmniSpatial is a comprehensive benchmark for spatial reasoning in vision-language models, covering four major categories and revealing significant limitations in current models despite proposed enhancement strategies.

Authors:Weiqing Xiao, Hao Huang, Chonghao Zhong, Yujie Lin, Nan Wang, Xiaoxue Chen, Zhaoxi Chen, Saining Zhang, Shuocheng Yang, Pierre Merriaux, Lei Lei, Hao Zhao
Title: Simulate Any Radar: Attribute-Controllable Radar Simulation via Waveform Parameter Embedding
Abstract:
We present SA-Radar (Simulate Any Radar), a radar simulation approach that enables controllable and efficient generation of radar cubes conditioned on customizable radar attributes. Unlike prior generative or physics-based simulators, SA-Radar integrates both paradigms through a waveform-parameterized attribute embedding. We design ICFAR-Net, a 3D U-Net conditioned on radar attributes encoded via waveform parameters, which captures signal variations induced by different radar configurations. This formulation bypasses the need for detailed radar hardware specifications and allows efficient simulation of range-azimuth-Doppler (RAD) tensors across diverse sensor settings. We further construct a mixed real-simulated dataset with attribute annotations to robustly train the network. Extensive evaluations on multiple downstream tasks, including 2D/3D object detection and radar semantic segmentation, demonstrate that SA-Radar's simulated data is both realistic and effective, consistently improving model performance when used standalone or in combination with real data. Our framework also supports simulation in novel sensor viewpoints and edited scenes, showcasing its potential as a general-purpose radar data engine for autonomous driving applications. Code and additional materials are available at https://zhuxing0.github.io/projects/SA-Radar.
中文: SA-Radar是一种创新的雷达模拟方法,通过波形参数化嵌入融合生成式与物理建模方法,无需详细硬件规格即可高效生成可定制的雷达数据。
English: SA-Radar is a novel radar simulation method that combines generative and physics-based approaches through waveform-parameterized embeddings, enabling efficient generation of customizable radar data without requiring detailed hardware specifications.

Authors:Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
Title: Native-Resolution Image Synthesis
Abstract:
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves the state-of-the-art performance on both ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.
Chinese: 我们提出了原生分辨率图像合成,这是一种利用原生分辨率扩散变换器(NiT)生成任意分辨率和宽高比图像的新范式,实现了最先进的性能及超越训练数据的零样本泛化能力。
English: We introduce native-resolution image synthesis, a novel paradigm using the Native-resolution diffusion Transformer (NiT) to generate images at any resolution and aspect ratio, achieving state-of-the-art performance and zero-shot generalization beyond training data.

Authors:Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu
Title: AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
Abstract:
Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, which highlight the value of our dataset for coherent animated video generation.
Chinese: 近期人工智能生成内容的进展显著加快了动画制作速度,为此我们推出了AnimeShooter数据集,该数据集通过分层标注和自动化流程实现了参考引导的多镜头动画生成,而基于此开发的AnimeShooterGen模型在跨镜头视觉一致性和参考遵循方面表现出色。
English: Recent AI advancements have accelerated animation production, leading to the creation of AnimeShooter, a dataset that provides reference-guided multi-shot animations with hierarchical annotations and strong visual consistency, while AnimeShooterGen establishes a baseline model demonstrating superior coherence and adherence to references.

Authors:Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, Ziwei Liu
Title: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
Abstract:
Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamic during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient \textbf{Dual-Expert Consistency Model~(DCM)}, where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce a Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Losses to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at \href{https://github.com/Vchitect/DCM}{https://github.com/Vchitect/DCM}.
Chinese: 本文提出双专家一致性模型(DCM),通过分别优化语义布局与细节增强的专家模块,结合时序一致性损失与特征匹配损失,在显著减少采样步数的同时实现了视频生成中时间连贯性与画面细节的同步提升。
English: This paper introduces the Dual-Expert Consistency Model (DCM), which addresses temporal inconsistency and detail degradation in video diffusion models by employing specialized semantic and detail experts with tailored loss functions, achieving state-of-the-art visual quality with fewer sampling steps.

Authors:Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy
Title: Controllable Human-centric Keyframe Interpolation with Generative Prior
Abstract:
Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

Authors:Zhiyuan Yu, Zhe Li, Hujun Bao, Can Yang, Xiaowei Zhou
Title: HumanRAM: Feed-forward Human Reconstruction and Animation Model using Transformers
Abstract:
3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets. Video results are available at https://zju3dv.github.io/humanram/.
中文摘要:HumanRAM是一种新颖的前馈方法,通过显式姿态条件和可扩展变换器将人体重建与动画统一起来,仅需单目或稀疏图像输入即可实现优于现有方法的重建精度与动画保真度。
English Summary: HumanRAM is a novel feed-forward approach that unifies human reconstruction and animation using explicit pose conditions and scalable transformers, achieving superior accuracy and fidelity over previous methods from monocular or sparse inputs.

Authors:Zeliang Zhang, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Chenliang Xu
Title: Targeted Forgetting of Image Subgroups in CLIP Models
Abstract:
Foundation models (FMs) such as CLIP have demonstrated impressive zero-shot performance across various tasks by leveraging large-scale, unsupervised pre-training. However, they often inherit harmful or unwanted knowledge from noisy internet-sourced datasets, compromising their reliability in real-world applications. Existing model unlearning methods either rely on access to pre-training datasets or focus on coarse-grained unlearning (e.g., entire classes), leaving a critical gap for fine-grained unlearning. In this paper, we address the challenging scenario of selectively forgetting specific portions of knowledge within a class, without access to pre-training data, while preserving the model's overall performance. We propose a novel three-stage approach that progressively unlearns targeted knowledge while mitigating over-forgetting. It consists of (1) a forgetting stage to fine-tune CLIP on the samples to be forgotten, (2) a reminding stage to restore performance on retained samples, and (3) a restoring stage to recover zero-shot capabilities using model souping. Additionally, we introduce knowledge distillation to handle the distribution disparity between forgotten samples, retained samples, and unseen pre-training data. Extensive experiments on CIFAR-10, ImageNet-1K, and style datasets demonstrate that our approach effectively unlearns specific subgroups while maintaining strong zero-shot performance on semantically similar subgroups and other categories, significantly outperforming baseline unlearning methods, which lose effectiveness under the CLIP unlearning setting.
中文:基础模型如CLIP存在从噪声数据集中继承有害知识的问题,为此我们提出了一种新颖的三阶段遗忘方法,无需预训练数据即可选择性遗忘特定知识,同时保持整体性能。
English: Foundation models like CLIP face challenges with unwanted knowledge from noisy datasets, prompting the development of a novel three-stage unlearning method that selectively forgets specific knowledge without access to pre-training data while preserving overall performance.
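The restoring stage relies on model souping, i.e., averaging the weights of checkpoints that share an architecture. A minimal sketch follows; the mixing coefficient and which checkpoints enter the soup are illustrative choices, and the toy models stand in for the unlearned and original CLIP weights.

```python
import copy
import torch

def soup(model_a: torch.nn.Module, model_b: torch.nn.Module, alpha: float = 0.5) -> torch.nn.Module:
    """Weight-space interpolation ("model souping") of two models with identical
    architectures: new_w = alpha * w_a + (1 - alpha) * w_b.
    """
    souped = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    souped.load_state_dict({k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a})
    return souped

# toy usage with small identical networks standing in for the unlearned and original models
a = torch.nn.Linear(8, 2)
b = torch.nn.Linear(8, 2)
print(soup(a, b, alpha=0.7)(torch.randn(1, 8)).shape)  # torch.Size([1, 2])
```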

Authors:Michelle Chen, David Russell, Amritha Pallavoor, Derek Young, Jane Wu
Title: Zero-Shot Tree Detection and Segmentation from Aerial Forest Imagery
Abstract:
Large-scale delineation of individual trees from remote sensing imagery is crucial to the advancement of ecological research, particularly as climate change and other environmental factors rapidly transform forest landscapes across the world. Current RGB tree segmentation methods rely on training specialized machine learning models with labeled tree datasets. While these learning-based approaches can outperform manual data collection when accurate, the existing models still depend on training data that's hard to scale. In this paper, we investigate the efficacy of using a state-of-the-art image segmentation model, Segment Anything Model 2 (SAM2), in a zero-shot manner for individual tree detection and segmentation. We evaluate a pretrained SAM2 model on two tasks in this domain: (1) zero-shot segmentation and (2) zero-shot transfer by using predictions from an existing tree detection model as prompts. Our results suggest that SAM2 not only has impressive generalization capabilities, but also can form a natural synergy with specialized methods trained on in-domain labeled data. We find that applying large pretrained models to problems in remote sensing is a promising avenue for future progress. We make our code available at: https://github.com/open-forest-observatory/tree-detection-framework.
中文摘要:本研究证明Segment Anything Model 2 (SAM2)能够通过零样本学习有效实现遥感影像中的单木分割,展现出卓越的泛化能力与专业方法的协同效应,为突破训练数据限制提供了可行方案。
English Summary: This study demonstrates that the Segment Anything Model 2 (SAM2) effectively performs zero-shot individual tree segmentation from remote sensing imagery, showing strong generalization and synergy with specialized methods while offering a scalable alternative to data-dependent approaches.
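Zero-shot transfer in the abstract means prompting a promptable segmenter with boxes from an existing tree detector. The sketch below keeps both the detector and the predictor as placeholders; the `set_image` / `predict(box=...)` interface is assumed to resemble SAM-style image predictors and should be checked against the actual SAM2 release before use.

```python
import numpy as np

def zero_shot_tree_masks(image: np.ndarray,
                         detect_trees,   # hypothetical: returns (N, 4) xyxy boxes from an existing tree detector
                         predictor) -> list:
    """Zero-shot transfer sketch: reuse detector boxes as prompts for a promptable
    segmenter. `detect_trees` and `predictor` are placeholders; the predictor is
    assumed to expose SAM-style set_image / predict(box=...) methods.
    """
    predictor.set_image(image)
    masks = []
    for box in detect_trees(image):
        m, scores, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        masks.append(m[np.argmax(scores)])   # keep the highest-scoring mask per box
    return masks
```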

Authors:Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning
Abstract:
Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applied to distant downstream domains with only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token order, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels no longer transition smoothly across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image-token continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon to interpret it. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch can be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Codes and models are available at https://github.com/shuaiyi308/ReCIT.
中文: 通过破坏图像标记的连续性,Vision Transformer在跨域小样本学习中减少了对难以迁移的大空间模式的依赖,转而更有效地利用小模式,从而显著缩小了域间差距。
English: Vision Transformer's performance in cross-domain few-shot learning is improved by disrupting image token continuity, which reduces reliance on hard-to-transfer large patterns and enhances smaller pattern utilization, effectively narrowing domain gaps.
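One simple way to disrupt image-token continuity is to permute the patch grid before tokenization: within-patch content stays intact while pixels no longer transition smoothly across patch boundaries. The sketch below implements that permutation; whether it matches the paper's exact disruption scheme is an assumption.

```python
import torch

def shuffle_patches(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Randomly permute the patch grid of each image. Within-patch content is kept,
    but continuity across patch boundaries is broken, which is the kind of
    token-continuity disruption the paper studies (the exact scheme in ReCIT may differ).
    images: (B, C, H, W) with H and W divisible by `patch`.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    x = images.reshape(b, c, gh, patch, gw, patch).permute(0, 2, 4, 1, 3, 5)  # (B, gh, gw, C, p, p)
    x = x.reshape(b, gh * gw, c, patch, patch)
    perm = torch.argsort(torch.rand(b, gh * gw), dim=1)                       # one permutation per image
    x = x[torch.arange(b).unsqueeze(1), perm]
    x = x.reshape(b, gh, gw, c, patch, patch).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, c, h, w)

print(shuffle_patches(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 3, 224, 224])
```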

Authors:Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
Title: ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions
Abstract:
Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.
Chinese Summary: 本文提出了ByteMorph框架,通过包含600万图像对的大规模数据集和基于扩散变换器的基线模型,解决了基于指令的图像编辑中非刚性运动处理这一研究不足的挑战。
English Summary: The paper introduces ByteMorph, a comprehensive framework for instruction-based image editing that addresses the underexplored challenge of handling non-rigid motions through a large-scale dataset and a Diffusion Transformer-based model.

Authors:Xiaoyan Cong, Angela Xing, Chandradeep Pokhariya, Rao Fu, Srinath Sridhar
Title: DyTact: Capturing Dynamic Contacts in Hand-Object Manipulation
Abstract:
Reconstructing dynamic hand-object contacts is essential for realistic manipulation in AI character animation, XR, and robotics, yet it remains challenging due to heavy occlusions, complex surface details, and limitations in existing capture techniques. In this paper, we introduce DyTact, a markerless capture method for accurately capturing dynamic contact in hand-object manipulations in a non-intrusive manner. Our approach leverages a dynamic, articulated representation based on 2D Gaussian surfels to model complex manipulations. By binding these surfels to MANO meshes, DyTact harnesses the inductive bias of template models to stabilize and accelerate optimization. A refinement module addresses time-dependent high-frequency deformations, while a contact-guided adaptive sampling strategy selectively increases surfel density in contact regions to handle heavy occlusion. Extensive experiments demonstrate that DyTact not only achieves state-of-the-art dynamic contact estimation accuracy but also significantly improves novel view synthesis quality, all while operating with fast optimization and efficient memory usage. Project Page: https://oliver-cong02.github.io/DyTact.github.io/ .

Authors:Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu
Title: EgoVLM: Policy Optimization for Egocentric Video Understanding
Abstract:
Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.
中文:EgoVLM是一种专为第一人称视频设计的视觉语言模型,通过强化学习微调融合视觉理解与时空推理,在基准测试中显著优于通用模型且无需监督训练。
English: EgoVLM is a specialized vision-language model that integrates visual understanding and spatiotemporal reasoning for egocentric videos, achieving superior performance on benchmarks through reinforcement learning fine-tuning without supervised training.
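At the core of GRPO is a group-relative advantage: several responses are sampled per prompt, each is scored by the reward (for EgoVLM this would combine answer accuracy with the keyframe-selection reward), and the rewards are normalized within the group. A minimal sketch of that normalization, with a toy reward tensor standing in for the actual reward model outputs:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO: for each prompt, normalize the
    rewards of its sampled responses by the group mean and standard deviation.
    rewards: (num_prompts, group_size) scalar rewards per sampled response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# toy example: 2 prompts, 4 sampled responses each
r = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                  [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(r))
```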

Authors:Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
Title: FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Abstract:
Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
中文摘要:FuseLIP采用早期融合架构,通过单一Transformer处理文本与图像标记的组合,在多模态任务中表现卓越,同时在单模态任务上保持基准竞争力。
English Summary: FuseLIP introduces an early fusion architecture using a single transformer to process combined text and image tokens, achieving superior multimodal task performance while maintaining competitive unimodal capabilities.
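Early fusion here amounts to shifting discrete image tokens into an extended vocabulary behind the text tokens and running one transformer over the concatenated sequence. Below is a minimal sketch; vocabulary sizes, depth, and the mean-pooled embedding head are illustrative, not the FuseLIP configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-transformer early fusion over a joint text+image token vocabulary."""
    def __init__(self, text_vocab: int = 32000, image_vocab: int = 8192,
                 dim: int = 512, depth: int = 6, heads: int = 8):
        super().__init__()
        self.image_offset = text_vocab                       # image ids live after the text ids
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, text_ids: torch.Tensor, image_ids: torch.Tensor) -> torch.Tensor:
        # text_ids: (B, Lt) tokenizer ids; image_ids: (B, Li) ids from a discrete image tokenizer
        tokens = torch.cat([text_ids, image_ids + self.image_offset], dim=1)
        hidden = self.encoder(self.embed(tokens))            # modalities interact at every layer
        return hidden.mean(dim=1)                            # one multimodal embedding per pair

model = EarlyFusionEncoder()
emb = model(torch.randint(0, 32000, (2, 16)), torch.randint(0, 8192, (2, 64)))
print(emb.shape)  # torch.Size([2, 512])
```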

Authors:Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
Title: ORV: 4D Occupancy-centric Robot Video Generation
Abstract:
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. Recently, action-driven generative models have gained widespread adoption in robot learning and simulation, as they eliminate safety concerns and reduce maintenance efforts. However, the action sequences used in these methods often result in limited control precision and poor generalization due to their globally coarse alignment. To address these limitations, we propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability. Furthermore, our framework supports the simultaneous generation of multi-view videos of robot gripping operations - an important capability for downstream robotic learning tasks. Extensive experimental results demonstrate that ORV consistently outperforms existing baseline methods across various datasets and sub-tasks. Demo, Code and Model: https://orangesodahub.github.io/ORV
中文摘要:提出的ORV框架采用4D语义占据序列生成具有更高精度和时间一致性的逼真机器人视频,在机器人仿真任务中优于现有方法。
English Summary: The proposed ORV framework uses 4D semantic occupancy sequences to generate photorealistic robot videos with enhanced precision and temporal consistency, outperforming existing methods in robotic simulation.

Authors:Roman Titkov, Egor Zubkov, Dmitry Yudin, Jaafar Mahmoud, Malik Mohrat, Gennady Sidorov
Title: LEG-SLAM: Real-Time Language-Enhanced Gaussian Splatting for SLAM
Abstract:
Modern Gaussian Splatting methods have proven highly effective for real-time photorealistic rendering of 3D scenes. However, integrating semantic information into this representation remains a significant challenge, especially in maintaining real-time performance for SLAM (Simultaneous Localization and Mapping) applications. In this work, we introduce LEG-SLAM -- a novel approach that fuses an optimized Gaussian Splatting implementation with visual-language feature extraction using DINOv2 followed by a learnable feature compressor based on Principal Component Analysis, while enabling an online dense SLAM. Our method simultaneously generates high-quality photorealistic images and semantically labeled scene maps, achieving real-time scene reconstruction with more than 10 fps on the Replica dataset and 18 fps on ScanNet. Experimental results show that our approach significantly outperforms state-of-the-art methods in reconstruction speed while achieving competitive rendering quality. The proposed system eliminates the need for prior data preparation such as camera's ego motion or pre-computed static semantic maps. With its potential applications in autonomous robotics, augmented reality, and other interactive domains, LEG-SLAM represents a significant step forward in real-time semantic 3D Gaussian-based SLAM. Project page: https://titrom025.github.io/LEG-SLAM/
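The feature compressor above reduces high-dimensional visual-language descriptors (e.g., from DINOv2) to a compact code that can be stored with the map. The sketch below uses plain sklearn PCA as a stand-in for the learnable PCA-based compressor; the feature dimensionality, sample count, and number of components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stack of dense features: N pixels x 768-d DINOv2-like descriptors.
features = np.random.randn(20_000, 768).astype(np.float32)

# Fit a PCA basis once, then compress/decompress features for storage in the map.
pca = PCA(n_components=64).fit(features)
codes = pca.transform(features)            # (N, 64) compact codes kept with the Gaussians
recon = pca.inverse_transform(codes)       # approximate 768-d features for language queries

err = np.mean(np.linalg.norm(features - recon, axis=1) / np.linalg.norm(features, axis=1))
print(codes.shape, f"mean relative reconstruction error: {err:.3f}")
```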

Authors:Jiarui Wang, Huiyu Duan, Juntong Wang, Ziheng Jia, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Title: DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models
Abstract:
With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present \textbf{DFBench}, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content; (ii) the latest models, with fake images generated by 12 state-of-the-art generation models; and (iii) bidirectional benchmarking, evaluating both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose \textbf{MoA-DF}, Mixture of Agents for DeepFake detection, which leverages a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at https://github.com/IntMeGroup/DFBench.
Chinese: 本研究提出了DFBench这一涵盖多种先进生成模型内容的深度伪造检测基准,并开发了MoA-DF方法,通过整合多个大型多模态模型实现了最先进的AI生成图像检测性能。
English: The study introduces DFBench, a comprehensive deepfake detection benchmark featuring diverse content from advanced generative models, and proposes MoA-DF, a novel method using multiple large multimodal models that achieves state-of-the-art performance in detecting AI-generated images.
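The combined probability strategy can be read as a mixture of agents: each LMM returns a probability that an image is fake, and the probabilities are combined, here by simple averaging. The agents in the sketch are dummy placeholders for actual LMM calls, and averaging is only one plausible combination rule, not necessarily the paper's exact one.

```python
from typing import Callable, List

def moa_fake_probability(image_path: str,
                         agents: List[Callable[[str], float]]) -> float:
    """Mixture-of-agents style decision: each agent is a callable that returns the
    probability that the image is AI-generated (e.g., parsed from an LMM's answer
    or derived from its yes/no token likelihoods)."""
    probs = [agent(image_path) for agent in agents]
    return sum(probs) / len(probs)

# toy usage with dummy agents standing in for LMM calls
agents = [lambda p: 0.9, lambda p: 0.7, lambda p: 0.8]
score = moa_fake_probability("example.jpg", agents)
print("fake" if score >= 0.5 else "real", score)
```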

Authors:Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan
Title: HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
Abstract:
With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy utilizing prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed technologies, we present the HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks over advanced unified models. All codes will be made public at https://github.com/Tencent/HaploVLM.
中文摘要:本文提出HaploOmni单一多模态变换器,通过多模态预热策略和跨模态兼容技术实现高效训练,在图像视频理解与生成任务中取得优异性能。
English Summary: This paper introduces HaploOmni, a single multimodal transformer trained efficiently with a multimodal warmup strategy and cross-modal compatibility techniques to achieve competitive performance in image and video understanding and generation tasks.

Authors:Yechi Ma, Wei Hua, Shu Kong
Title: Towards Auto-Annotation from Annotation Guidelines: A Benchmark through 3D LiDAR Detection
Abstract:
A crucial yet under-appreciated prerequisite in machine learning solutions for real-world applications is data annotation: human annotators are hired to manually label data according to detailed, expert-crafted guidelines. This is often a laborious, tedious, and costly process. To study methods for facilitating data annotation, we introduce a new benchmark, AnnoGuide: Auto-Annotation from Annotation Guidelines. It aims to evaluate automated methods for data annotation directly from expert-defined annotation guidelines, eliminating the need for manual labeling. As a case study, we repurpose the well-established nuScenes dataset, commonly used in autonomous driving research, which provides comprehensive annotation guidelines for labeling LiDAR point clouds with 3D cuboids across 18 object classes. These guidelines include a few visual examples and textual descriptions, but no labeled 3D cuboids in LiDAR data, making this a novel task of multi-modal few-shot 3D detection without 3D annotations. The advances of powerful foundation models (FMs) make AnnoGuide especially timely, as FMs offer promising tools to tackle its challenges. We employ a conceptually straightforward pipeline that (1) utilizes open-source FMs for object detection and segmentation in RGB images, (2) projects 2D detections into 3D using known camera poses, and (3) clusters LiDAR points within the frustum of each 2D detection to generate a 3D cuboid. Starting with a non-learned solution that leverages off-the-shelf FMs, we progressively refine key components and achieve significant performance improvements, boosting 3D detection mAP from 12.1 to 21.9! Nevertheless, our results highlight that AnnoGuide remains an open and challenging problem, underscoring the urgent need for developing LiDAR-based FMs. We release our code and models at GitHub: https://annoguide.github.io/annoguide3Dbenchmark
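Step (3) of the pipeline, keeping LiDAR points that fall inside the frustum of a 2D detection, clustering them, and fitting a cuboid, can be sketched as below. The intrinsics, extrinsics, DBSCAN parameters, and the axis-aligned box fit are simplifying assumptions, not the benchmark's reference implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def box_to_cuboid(points_lidar: np.ndarray, box_xyxy: np.ndarray,
                  T_cam_from_lidar: np.ndarray, K: np.ndarray):
    """Frustum-and-cluster step of an auto-annotation pipeline (simplified sketch).

    points_lidar: (N, 3) LiDAR points; box_xyxy: 2D detection [x1, y1, x2, y2];
    T_cam_from_lidar: 4x4 extrinsics; K: 3x3 camera intrinsics.
    Returns (center, size) of an axis-aligned cuboid, or None if the frustum is empty.
    """
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.5                                     # drop points behind the camera
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]                                        # perspective projection
    x1, y1, x2, y2 = box_xyxy
    in_box = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    frustum_pts = points_lidar[in_front][in_box]
    if len(frustum_pts) < 5:
        return None
    labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(frustum_pts)   # separate object from background
    valid = labels[labels >= 0]
    if valid.size == 0:
        return None
    main = frustum_pts[labels == np.bincount(valid).argmax()]          # keep the largest cluster
    center = (main.min(axis=0) + main.max(axis=0)) / 2
    size = main.max(axis=0) - main.min(axis=0)
    return center, size

# toy usage: a synthetic point blob 10 m in front of an identity camera
pts = np.random.normal(loc=[0.0, 0.0, 10.0], scale=0.5, size=(500, 3))
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
print(box_to_cuboid(pts, np.array([200.0, 150.0, 440.0, 330.0]), np.eye(4), K))
```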

Authors:Ahmad AlMughrabi, Umair Haroon, Ricardo Marques, Petia Radeva
Title: VolTex: Food Volume Estimation using Text-Guided Segmentation and Neural Surface Reconstruction
Abstract:
Accurate food volume estimation is crucial for dietary monitoring, medical nutrition management, and food intake analysis. Existing 3D food volume estimation methods accurately compute the food volume but lack food portion selection. We present VolTex, a framework that improves the food object selection in food volume estimation. By allowing users to specify a target food item to be segmented via text input, our method enables the precise selection of specific food objects in real-world scenes. The segmented object is then reconstructed using the Neural Surface Reconstruction method to generate high-fidelity 3D meshes for volume computation. Extensive evaluations on the MetaFood3D dataset demonstrate the effectiveness of our approach in isolating and reconstructing food items for accurate volume estimation. The source code is accessible at https://github.com/GCVCG/VolTex.
Chinese: VolTex框架通过文本输入实现精准食物分割,并利用神经表面重建方法生成高保真三维网格,在MetaFood3D数据集上的广泛评估验证了其在食物体积估算中的有效性。
English: VolTex is a framework that enables precise food item selection through text input for segmentation and reconstructs them into high-fidelity 3D meshes using Neural Surface Reconstruction, achieving accurate volume estimation as validated on the MetaFood3D dataset.

Authors:Ahsan Baidar Bakht, Muhayy Ud Din, Sajid Javed, Irfan Hussain
Title: MVTD: A Benchmark Dataset for Maritime Visual Object Tracking
Abstract:
Visual Object Tracking (VOT) is a fundamental task with widespread applications in autonomous navigation, surveillance, and maritime robotics. Despite significant advances in generic object tracking, maritime environments continue to present unique challenges, including specular water reflections, low-contrast targets, dynamically changing backgrounds, and frequent occlusions. These complexities significantly degrade the performance of state-of-the-art tracking algorithms, highlighting the need for domain-specific datasets. To address this gap, we introduce the Maritime Visual Tracking Dataset (MVTD), a comprehensive and publicly available benchmark specifically designed for maritime VOT. MVTD comprises 182 high-resolution video sequences, totaling approximately 150,000 frames, and includes four representative object classes: boat, ship, sailboat, and unmanned surface vehicle (USV). The dataset captures a diverse range of operational conditions and maritime scenarios, reflecting the real-world complexities of maritime environments. We evaluated 14 recent SOTA tracking algorithms on the MVTD benchmark and observed substantial performance degradation compared to their performance on general-purpose datasets. However, when fine-tuned on MVTD, these models demonstrate significant performance gains, underscoring the effectiveness of domain adaptation and the importance of transfer learning in specialized tracking contexts. The MVTD dataset fills a critical gap in the visual tracking community by providing a realistic and challenging benchmark for maritime scenarios. Dataset and Source Code can be accessed here "https://github.com/AhsanBaidar/MVTD".
中文摘要:针对海上环境中视觉目标跟踪面临的特殊挑战,如水面反光、低对比度和动态背景等,我们推出了海上视觉跟踪数据集(MVTD),该数据集不仅能有效评估现有算法的性能,还能通过微调显著提升它们在专业场景中的跟踪效果。
English Summary: The Maritime Visual Tracking Dataset (MVTD) is introduced to address the unique challenges of visual object tracking in maritime environments, where existing algorithms struggle with reflections, low contrast, and dynamic backgrounds, but show significant improvement when fine-tuned on this specialized dataset.

Authors:Mingjie Wei, Xuemei Xie, Yutong Zhong, Guangming Shi
Title: Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation
Abstract:
Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets. The code is available at https://github.com/MingjieWe/PGFormer.
中文摘要:本文提出的金字塔图变换器(PGFormer)通过新型金字塔图注意力模块有效捕捉跨关节和身体部位的长程依赖关系,在3D人体姿态估计任务中以更低的误差和更小的模型尺寸优于现有方法。
English Summary: The paper introduces a Pyramid Graph Transformer (PGFormer) that uses a novel Pyramid Graph Attention module to efficiently capture long-range dependencies across body joints and parts, achieving superior 3D pose estimation with lower error and smaller model size than existing methods.
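A rough PyTorch sketch of the cross-scale attention idea: joint-, part-, and body-level tokens are concatenated into one compact sequence and attended jointly. The shapes, the number of parts, and the module name are illustrative assumptions rather than the released PGFormer code.

```python
import torch
import torch.nn as nn

class PyramidGraphAttention(nn.Module):
    """Sketch of pyramid-structured attention across joint/part/body scales."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joints, parts, body):
        # joints: (B, 17, C), parts: (B, 5, C), body: (B, 1, C) -- pooled scales
        seq = torch.cat([joints, parts, body], dim=1)   # compact multi-scale sequence
        out, _ = self.attn(seq, seq, seq)               # correlation across scales in parallel
        seq = self.norm(seq + out)
        # Return only the refined joint tokens for the pose head.
        return seq[:, : joints.shape[1]]

if __name__ == "__main__":
    pga = PyramidGraphAttention(dim=64)
    j, p, b = torch.randn(2, 17, 64), torch.randn(2, 5, 64), torch.randn(2, 1, 64)
    print(pga(j, p, b).shape)  # torch.Size([2, 17, 64])
```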

Authors:Yujin Chen, Yinyu Nie, Benjamin Ummenhofer, Reiner Birkl, Michael Paulitsch, Matthias Nießner
Title: PBR-SR: Mesh PBR Texture Super Resolution from 2D Image Priors
Abstract:
We present PBR-SR, a novel method for physically based rendering (PBR) texture super resolution (SR). It outputs high-resolution, high-quality PBR textures from low-resolution (LR) PBR input in a zero-shot manner. PBR-SR leverages an off-the-shelf super-resolution model trained on natural images, and iteratively minimizes the deviations between super-resolution priors and differentiable renderings. These enhancements are then back-projected into the PBR map space in a differentiable manner to produce refined, high-resolution textures. To mitigate view inconsistencies and lighting sensitivity, which is common in view-based super-resolution, our method applies 2D prior constraints across multi-view renderings, iteratively refining the shared, upscaled textures. In parallel, we incorporate identity constraints directly in the PBR texture domain to ensure the upscaled textures remain faithful to the LR input. PBR-SR operates without any additional training or data requirements, relying entirely on pretrained image priors. We demonstrate that our approach produces high-fidelity PBR textures for both artist-designed and AI-generated meshes, outperforming both direct SR models application and prior texture optimization methods. Our results show high-quality outputs in both PBR and rendering evaluations, supporting advanced applications such as relighting.

Authors:Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen
Title: Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
Abstract:
Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.
中文摘要:本文提出了首个微重力环境视频理解基准数据集MicroG-4M,通过4,759个标注视频片段填补了地球重力条件数据集的空白,支持动作识别、视频描述和视觉问答三大核心任务。
English Summary: This paper introduces MicroG-4M, the first benchmark dataset for video understanding in microgravity environments, addressing the gap in existing Earth-centric datasets by providing 4,759 annotated clips for action recognition, video captioning, and visual question answering tasks.

Authors:Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: Random Registers for Cross-Domain Few-Shot Learning
Abstract:
Cross-domain few-shot learning (CDFSL) aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. Although Vision Transformer (ViT) has shown superior capability in many vision tasks, its transferability against huge domain gaps in CDFSL is still under-explored. In this paper, we find an intriguing phenomenon: during the source-domain training, prompt tuning, as a common way to train ViT, could be harmful for the generalization of ViT in target domains, but setting them to random noises (i.e., random registers) could consistently improve target-domain performance. We then delve into this phenomenon for an interpretation. We find that learnable prompts capture domain information during the training on the source dataset, which views irrelevant visual patterns as vital cues for recognition. This can be viewed as a kind of overfitting and increases the sharpness of the loss landscapes. In contrast, random registers are essentially a novel way of perturbing attention for the sharpness-aware minimization, which helps the model find a flattened minimum in loss landscapes, increasing the transferability. Based on this phenomenon and interpretation, we further propose a simple but effective approach for CDFSL to enhance the perturbation on attention maps by adding random registers on the semantic regions of image tokens, improving the effectiveness and efficiency of random registers. Extensive experiments on four benchmarks validate our rationale and state-of-the-art performance. Codes and models are available at https://github.com/shuaiyi308/REAP.
Chinese: 本研究发现,在跨领域小样本学习中,使用随机寄存器替代可学习提示能通过平滑损失景观提升Vision Transformer的泛化能力,并提出在图像令牌语义区域添加随机扰动的新方法,实现了最优性能。
English: This study reveals that using random registers instead of learnable prompts in Vision Transformers enhances cross-domain few-shot learning by promoting flatter loss landscapes, and proposes an improved method that applies random perturbations to image token regions for state-of-the-art performance.
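The core mechanism is small enough to show directly: instead of learnable prompt tokens, tokens drawn from fresh random noise are prepended at every forward pass. The register count and scale below are hypothetical choices, not the paper's exact settings.

```python
import torch

def add_random_registers(tokens: torch.Tensor, num_registers: int = 4,
                         scale: float = 1.0) -> torch.Tensor:
    """Prepend non-learnable random-noise registers to a ViT token sequence.

    tokens: (B, N, C) patch (+CLS) embeddings. The registers are resampled on
    every call, so they act as an attention perturbation rather than storing
    source-domain information the way learnable prompts do.
    """
    b, _, c = tokens.shape
    registers = scale * torch.randn(b, num_registers, c,
                                    device=tokens.device, dtype=tokens.dtype)
    return torch.cat([registers, tokens], dim=1)

# usage: x = add_random_registers(patch_embeddings) before the transformer blocks
```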

Authors:Chunwei Tian, Kai Liu, Bob Zhang, Zhixiang Huang, Chia-Wen Lin, David Zhang
Title: A Dynamic Transformer Network for Vehicle Detection
Abstract:
Stable consumer electronic systems can assist traffic better. Good traffic consumer electronic systems require collaborative work between traffic algorithms and hardware. However, the performance of popular traffic algorithms, including vehicle detection methods based on deep networks, is limited because they learn data relations rather than the differences across lighting conditions and occlusions. In this paper, we present a dynamic Transformer network for vehicle detection (DTNet). DTNet utilizes dynamic convolution to guide a deep network in dynamically generating weights, enhancing the adaptability of the obtained detector. Taking relations of different information into account, a mixed attention mechanism based on channel attention and a Transformer is exploited to strengthen relations of channels and pixels and extract more salient information for vehicle detection. To account for differences within an image, a translation-variant convolution relies on spatial location information to refine the obtained structural information for vehicle detection. Experimental results illustrate that our DTNet is competitive for vehicle detection. Code of the proposed DTNet can be obtained at https://github.com/hellloxiaotian/DTNet.
中文摘要:本文提出了一种用于车辆检测的动态Transformer网络DTNet,通过动态卷积和混合注意力机制提升检测器的适应性并提取显著特征,实验结果表明该方法具有竞争优势。
English Summary: This paper introduces DTNet, a dynamic Transformer network for vehicle detection that uses dynamic convolution and a mixed attention mechanism to enhance adaptability and extract salient features, demonstrating competitive performance in experiments.
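A minimal sketch of the dynamic-convolution idea, in which a global descriptor of each input predicts that sample's convolution kernel; the channel sizes and the kernel generator below are illustrative assumptions, not the DTNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Sketch of a dynamic convolution: kernel weights are generated per sample.

    A pooled descriptor of the input predicts the conv kernel, so the detector
    can adapt its filters to lighting/occlusion conditions.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.gen = nn.Linear(in_ch, out_ch * in_ch * k * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Predict one kernel per sample from a global average-pooled descriptor.
        desc = F.adaptive_avg_pool2d(x, 1).flatten(1)                  # (B, C)
        weight = self.gen(desc).view(b * self.out_ch, c, self.k, self.k)
        # Grouped-conv trick: fold the batch into channels so each sample
        # is convolved with its own generated kernel.
        out = F.conv2d(x.reshape(1, b * c, h, w), weight,
                       padding=self.k // 2, groups=b)
        return out.view(b, self.out_ch, h, w)

if __name__ == "__main__":
    layer = DynamicConv2d(16, 32)
    print(layer(torch.randn(4, 16, 20, 20)).shape)  # torch.Size([4, 32, 20, 20])
```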

Authors:Renyang Liu, Wenjie Feng, Tianwei Zhang, Wei Zhou, Xueqi Cheng, See-Kiong Ng
Title: Rethinking Machine Unlearning in Image Generation Models
Abstract:
With the surge and widespread application of image generation models, data privacy and content safety have become major concerns and attracted great attention from users, service providers, and policymakers. Machine unlearning (MU) is recognized as a cost-effective and promising means to address these challenges. Despite some advancements, image generation model unlearning (IGMU) still faces remarkable gaps in practice, e.g., unclear task discrimination and unlearning guidelines, lack of an effective evaluation framework, and unreliable evaluation metrics. These can hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. We perform exhaustive assessments over existing state-of-the-art unlearning algorithms and evaluation standards, and discover several critical flaws and challenges in IGMU tasks. Driven by these limitations, we make several core contributions, to facilitate the comprehensive understanding, standardized categorization, and reliable evaluation of IGMU. Specifically, (1) We design CatIGMU, a novel hierarchical task categorization framework. It provides detailed implementation guidance for IGMU, assisting in the design of unlearning algorithms and the construction of testbeds. (2) We introduce EvalIGMU, a comprehensive evaluation framework. It includes reliable quantitative metrics across five critical aspects. (3) We construct DataIGM, a high-quality unlearning dataset, which can be used for extensive evaluations of IGMU, training content detectors for judgment, and benchmarking the state-of-the-art unlearning algorithms. With EvalIGMU and DataIGM, we discover that most existing IGMU algorithms cannot handle the unlearning well across different evaluation dimensions, especially for preservation and robustness. Code and models are available at https://github.com/ryliu68/IGMU.
中文: 本研究针对图像生成模型遗忘中的关键问题,提出了CatIGMU任务分类框架、EvalIGMU评估体系和DataIGM数据集,发现现有算法在保持性和鲁棒性方面存在明显不足。
English: The study addresses critical gaps in image generation model unlearning by introducing CatIGMU for task categorization, EvalIGMU for evaluation, and DataIGM for benchmarking, revealing that current algorithms struggle with preservation and robustness.

Authors:Chuanyu Fu, Yuqi Zhang, Kunbin Yao, Guanying Chen, Yuan Xiong, Chuan Huang, Shuguang Cui, Xiaochun Cao
Title: RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS
Abstract:
3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method. Our project page is https://fcyycf.github.io/RobustSplat/.

Authors:Pengchong Hu, Zhizhong Han
Title: VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians
Abstract:
Jointly estimating camera poses and mapping scenes from RGBD images is a fundamental task in simultaneous localization and mapping (SLAM). State-of-the-art methods employ 3D Gaussians to represent a scene, and render these Gaussians through splatting for higher efficiency and better rendering. However, these methods cannot scale up to extremely large scenes, due to the inefficient tracking and mapping strategies that need to optimize all 3D Gaussians in the limited GPU memories throughout the training to maintain the geometry and color consistency to previous RGBD observations. To resolve this issue, we propose novel tracking and mapping strategies to work with a novel 3D representation, dubbed view-tied 3D Gaussians, for RGBD SLAM systems. View-tied 3D Gaussians are a kind of simplified Gaussian that is tied to depth pixels, without needing to learn locations, rotations, and multi-dimensional variances. Tying Gaussians to views not only significantly saves storage but also allows us to employ many more Gaussians to represent local details in the limited GPU memory. Moreover, our strategies remove the need to keep all Gaussians learnable throughout training, while improving rendering quality and tracking accuracy. We justify the effectiveness of these designs, and report better performance over the latest methods on the widely used benchmarks in terms of rendering and tracking accuracy and scalability. Please see our project page for code and videos at https://machineperceptionlab.github.io/VTGaussian-SLAM-Project.
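Because view-tied Gaussians take their centers directly from depth pixels, seeding them amounts to unprojecting a keyframe's depth map. The sketch below shows that unprojection under assumed names and a hypothetical stride; it is not the released system.

```python
import numpy as np

def view_tied_centers(depth, K, pose_c2w, stride=4):
    """Back-project depth pixels into world-space points to seed view-tied Gaussians.

    depth:    (H, W) depth map in meters
    K:        (3, 3) intrinsics
    pose_c2w: (4, 4) camera-to-world transform
    Each returned point can serve as the fixed center of a Gaussian tied to this
    view, so no per-Gaussian location needs to be learned.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h:stride, 0:w:stride]
    z = depth[v, u]
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]

    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts_cam = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=-1)
    pts_world = pts_cam @ pose_c2w[:3, :3].T + pose_c2w[:3, 3]
    return pts_world  # (M, 3) candidate Gaussian centers for this view
```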

Authors:Shufan Qing, Anzhen Li, Qiandi Wang, Yuefeng Niu, Mingchen Feng, Guoliang Hu, Jinqiao Wu, Fengtao Nan, Yingchun Fan
Title: GeneA-SLAM2: Dynamic SLAM with AutoEncoder-Preprocessed Genetic Keypoints Resampling and Depth Variance-Guided Dynamic Region Removal
Abstract:
Existing semantic SLAM methods for dynamic environments mainly identify dynamic regions through object detection or semantic segmentation. However, in certain highly dynamic scenarios, the detection boxes or segmentation masks cannot fully cover dynamic regions. Therefore, this paper proposes a robust and efficient GeneA-SLAM2 system that leverages depth variance constraints to handle dynamic scenes. Our method extracts dynamic pixels via depth variance and creates precise depth masks to guide the removal of dynamic objects. Simultaneously, an autoencoder is used to reconstruct keypoints, improving the genetic resampling keypoint algorithm to obtain more uniformly distributed keypoints and enhance the accuracy of pose estimation. Our system was evaluated on multiple highly dynamic sequences. The results demonstrate that GeneA-SLAM2 maintains high accuracy in dynamic scenes compared to current methods. Code is available at: https://github.com/qingshufan/GeneA-SLAM2.
中文: 本文提出GeneA-SLAM2系统,通过深度方差约束消除动态物体,并采用改进的关键点算法,在高度动态环境中实现更精确的位姿估计。
English: This paper introduces GeneA-SLAM2, a system that uses depth variance constraints to remove dynamic objects and an improved keypoint algorithm for more accurate pose estimation in highly dynamic environments.
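One plausible way to picture a depth-variance constraint is a windowed variance test on the depth map, masking pixels whose depth is locally unstable; the window size, threshold, and the spatial (rather than temporal) formulation below are assumptions for illustration only, not the GeneA-SLAM2 implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def depth_variance_mask(depth: np.ndarray, win: int = 7, thresh: float = 0.05):
    """Flag likely-dynamic pixels via local depth variance (illustrative sketch).

    Pixels whose windowed depth variance exceeds a threshold are masked out, so
    keypoints falling on them can be dropped before pose estimation.
    """
    mean = uniform_filter(depth, size=win)
    mean_sq = uniform_filter(depth * depth, size=win)
    variance = np.maximum(mean_sq - mean * mean, 0.0)
    return variance > thresh  # True = treat as dynamic, drop keypoints here
```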

Authors:Tibor Kubík, François Guibault, Michal Španěl, Hervé Lombaert
Title: ToothForge: Automatic Dental Shape Generation using Synchronized Spectral Embeddings
Abstract:
We introduce ToothForge, a spectral approach for automatically generating novel 3D teeth, effectively addressing the sparsity of dental shape datasets. By operating in the spectral domain, our method enables compact machine learning modeling, allowing the generation of high-resolution tooth meshes in milliseconds. However, generating shape spectra comes with the instability of the decomposed harmonics. To address this, we propose modeling the latent manifold on synchronized frequential embeddings. Spectra of all data samples are aligned to a common basis prior to the training procedure, effectively eliminating biases introduced by the decomposition instability. Furthermore, synchronized modeling removes the limiting factor imposed by previous methods, which require all shapes to share a common fixed connectivity. Using a private dataset of real dental crowns, we observe greater reconstruction quality for the synthesized shapes, exceeding that of models trained on unaligned embeddings. We also explore additional applications of spectral analysis in digital dentistry, such as shape compression and interpolation. ToothForge facilitates a range of approaches at the intersection of spectral analysis and machine learning, with fewer restrictions on mesh structure. This makes it applicable for shape analysis not only in dentistry, but also in broader medical applications, where guaranteeing consistent connectivity across shapes from various clinics is unrealistic. The code is available at https://github.com/tiborkubik/toothForge.
中文摘要:ToothForge提出了一种光谱方法,通过同步频域嵌入有效解决牙齿形状数据集稀缺问题,无需固定网格连接即可快速生成高精度3D牙齿模型,在保持优异重建质量的同时拓展了在数字牙科及其他医疗领域的应用潜力。
English Summary: ToothForge introduces a spectral method for rapid 3D tooth generation that overcomes dataset limitations and mesh connectivity constraints by aligning spectral embeddings, achieving superior reconstruction quality and enabling broader medical applications.
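The spectral representation underlying this approach can be sketched with a graph-Laplacian eigenbasis: project vertex coordinates onto the lowest-frequency eigenvectors and reconstruct from the truncated coefficients. The snippet below shows only that generic step; the basis synchronization across samples, which is ToothForge's key contribution, is not reproduced.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_embedding(vertices, faces, k=64):
    """Project a mesh into a truncated graph-Laplacian eigenbasis (sketch only).

    vertices: (V, 3) float array, faces: (F, 3) integer indices.
    Returns the low-frequency basis (V, k), the spectral coefficients (k, 3),
    and a smoothed reconstruction basis @ coeffs.
    """
    v = len(vertices)
    # Undirected adjacency from the three edges of every triangle.
    i = np.concatenate([faces[:, 0], faces[:, 1], faces[:, 2]])
    j = np.concatenate([faces[:, 1], faces[:, 2], faces[:, 0]])
    adj = sp.coo_matrix((np.ones(len(i)), (i, j)), shape=(v, v)).tocsr()
    adj = adj + adj.T
    adj.data[:] = 1.0                                  # unweighted adjacency
    lap = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) - adj

    # Smallest-eigenvalue eigenvectors form the low-frequency shape basis.
    _, basis = eigsh(lap, k=k, which="SM")
    coeffs = basis.T @ vertices                        # compact spectral representation
    return basis, coeffs, basis @ coeffs
```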

Authors:Yunhong Lu, Qichao Wang, Hengyuan Cao, Xiaoyin Xu, Min Zhang
Title: Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences
Abstract:
Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are expended in collecting and labeling datasets, a critical aspect is often neglected: \textit{preferences vary across individuals and should be represented with more granularity.} To address this, we propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective, along with a numerical upper bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution. We employ a reward model to simulate human preferences and apply preference likelihood averaging to improve the DPO loss, such that the loss function approaches zero when preferences are similar. Furthermore, we utilize an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods through straightforward modifications. Our SmPO-Diffusion achieves state-of-the-art performance in preference evaluation, outperforming baselines across metrics with lower training costs. The project page is https://jaydenlyh.github.io/SmPO-project-page/.
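The "smoothed" part can be pictured as replacing DPO's hard 0/1 preference label with a soft probability supplied by a reward model. The sketch below is a generic soft-label DPO-style loss under assumed names; the actual SmPO-Diffusion objective additionally involves preference likelihood averaging over diffusion trajectories and ReNoise inversion, which are not shown.

```python
import torch
import torch.nn.functional as F

def smoothed_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      pref_prob, beta: float = 0.1):
    """Soft-label DPO-style loss (illustrative sketch).

    logp_*     : model log-likelihoods of the 'winner' / 'loser' samples, shape (B,)
    ref_logp_* : the same quantities under the frozen reference model
    pref_prob  : (B,) smoothed probability that the 'winner' really is preferred
                 (e.g., estimated by a reward model) instead of a hard 0/1 label.
    With pref_prob near 0.5 (ambiguous pairs) the optimum sits at a zero reward
    margin, damping over-optimization on pairs where preferences are similar.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return F.binary_cross_entropy_with_logits(margin, pref_prob)
```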

Authors:Chunwei Tian, Mingjian Song, Xiaopeng Fan, Xiangtao Zheng, Bob Zhang, David Zhang
Title: A Tree-guided CNN for image super-resolution
Abstract:
Deep convolutional neural networks can extract more accurate structural information via deep architectures to obtain good performance in image super-resolution. However, it is not easy to identify the effect of important layers within a single network architecture, which can limit super-resolution performance. In this paper, we design a tree-guided CNN for image super-resolution (TSRNet). It uses a tree architecture to guide a deep network, enhancing the effect of key nodes and amplifying the relation of hierarchical information to improve the ability to recover images. To prevent insufficiency of the obtained structural information, cosine transform techniques in TSRNet are used to extract cross-domain information and improve the performance of image super-resolution. The adaptive Nesterov momentum optimizer (Adan) is applied to optimize parameters and boost the effectiveness of training the super-resolution model. Extensive experiments verify the superiority of the proposed TSRNet in restoring high-quality images. Its code can be obtained at https://github.com/hellloxiaotian/TSRNet.
中文: 提出的TSRNet采用树引导的CNN架构,结合余弦变换技术和自适应优化器,通过强化关键节点和跨域信息提取来提升图像超分辨率效果。
English: The proposed TSRNet employs a tree-guided CNN architecture with cosine transform techniques and an adaptive optimizer to enhance image super-resolution by strengthening key nodes and cross-domain information extraction.
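The cross-domain branch can be illustrated with a separable 2D cosine transform applied per channel to a feature map; the helper below is a generic orthonormal DCT-II, not the exact TSRNet module.

```python
import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II transform matrix of shape (n, n)."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # spatial index
    mat = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
    mat[0] /= math.sqrt(2.0)
    return mat

def cosine_domain(features: torch.Tensor) -> torch.Tensor:
    """Map (B, C, H, W) feature maps into the 2D cosine-transform domain.

    A generic separable DCT applied per channel, meant only to illustrate the
    'cross-domain information' idea described above.
    """
    _, _, h, w = features.shape
    dh, dw = dct_matrix(h).to(features), dct_matrix(w).to(features)
    return dh @ features @ dw.T

if __name__ == "__main__":
    x = torch.randn(2, 8, 16, 16)
    print(cosine_domain(x).shape)  # torch.Size([2, 8, 16, 16])
```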

Authors:Timo Osterburg, Franz Albers, Christopher Diehl, Rajesh Pushparaj, Torsten Bertram
Title: HiLO: High-Level Object Fusion for Autonomous Driving using Transformers
Abstract:
The fusion of sensor data is essential for a robust perception of the environment in autonomous driving. Learning-based fusion approaches mainly use feature-level fusion to achieve high performance, but their complexity and hardware requirements limit their applicability in near-production vehicles. High-level fusion methods offer robustness with lower computational requirements. Traditional methods, such as the Kalman filter, dominate this area. This paper modifies the Adapted Kalman Filter (AKF) and proposes a novel transformer-based high-level object fusion method called HiLO. Experimental results demonstrate improvements of $25.9$ percentage points in $\textrm{F}_1$ score and $6.1$ percentage points in mean IoU. Evaluation on a new large-scale real-world dataset demonstrates the effectiveness of the proposed approaches. Their generalizability is further validated by cross-domain evaluation between urban and highway scenarios. Code, data, and models are available at https://github.com/rst-tu-dortmund/HiLO .
中文: 本文提出了一种名为HiLO的基于Transformer的高层目标融合方法,通过将F1分数提升25.9个百分点、平均交并比提升6.1个百分点,显著增强了自动驾驶的环境感知能力,并在城市与高速公路场景中验证了其通用性。
English: This paper introduces HiLO, a transformer-based high-level object fusion method that enhances autonomous driving perception by improving the F1 score by 25.9 percentage points and mean IoU by 6.1 percentage points, validated across urban and highway scenarios.

Authors:Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Title: Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025
Abstract:
In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at https://github.com/CorrineQiu/Ego4D-LTA-Challenge-2025.
Chinese Summary: 本报告提出了一种新颖的三阶段框架,用于Ego4D长期动作预测任务,通过特征提取、动作识别和长期预测三个步骤,结合基础模型和大语言模型,在CVPR 2025挑战赛中取得第一名并创下最新技术水准。
English Summary: This report introduces a novel three-stage framework for the Ego4D Long-Term Action Anticipation task, which leverages foundation models for feature extraction, action recognition, and anticipation, achieving state-of-the-art results and winning first place at CVPR 2025.
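The verb-noun co-occurrence step can be sketched as re-scoring noun logits with a prior row selected by the predicted verb; the function name, the additive log-prior form, and the weight `alpha` are assumptions for illustration, not the challenge entry's exact rule.

```python
import torch

def cooccurrence_rescore(verb_logits, noun_logits, cooc, alpha: float = 1.0,
                         eps: float = 1e-6):
    """Bias noun predictions toward nouns that co-occur with the predicted verb.

    verb_logits: (B, V), noun_logits: (B, N)
    cooc:        (V, N) float tensor of co-occurrence counts from the annotations
    Returns the predicted verb ids and the re-scored noun ids.
    """
    verb_pred = verb_logits.argmax(dim=-1)                        # (B,)
    prior = cooc[verb_pred]                                       # (B, N) counts
    prior = prior / prior.sum(dim=-1, keepdim=True).clamp_min(eps)
    rescored = noun_logits.log_softmax(-1) + alpha * torch.log(prior + eps)
    return verb_pred, rescored.argmax(dim=-1)
```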

Authors:Niklas Kormann, Masoud Ramuz, Zeeshan Nisar, Nadine S. Schaadt, Hendrik Annuth, Benjamin Doerr, Friedrich Feuerhake, Thomas Lampert, Johannes F. Lutzeyer
Title: HIEGNet: A Heterogenous Graph Neural Network Including the Immune Environment in Glomeruli Classification
Abstract:
Graph Neural Networks (GNNs) have recently been found to excel in histopathology. However, an important histopathological task, where GNNs have not been extensively explored, is the classification of glomeruli health as an important indicator in nephropathology. This task presents unique difficulties, particularly for the graph construction, i.e., the identification of nodes, edges, and informative features. In this work, we propose a pipeline composed of different traditional and machine learning-based computer vision techniques to identify nodes, edges, and their corresponding features to form a heterogeneous graph. We then proceed to propose a novel heterogeneous GNN architecture for glomeruli classification, called HIEGNet, that integrates both glomeruli and their surrounding immune cells. Hence, HIEGNet is able to consider the immune environment of each glomerulus in its classification. Our HIEGNet was trained and tested on a dataset of Whole Slide Images from kidney transplant patients. Experimental results demonstrate that HIEGNet outperforms several baseline models and generalises best between patients among all baseline models. Our implementation is publicly available at https://github.com/nklsKrmnn/HIEGNet.git.
中文: 图神经网络在组织病理学中表现优异,提出的HIEGNet模型通过整合免疫细胞数据有效分类肾小球健康状况,在肾移植患者数据分析中优于基线模型。
English: Graph Neural Networks (GNNs) show strong performance in histopathology, and the proposed HIEGNet model effectively classifies glomeruli health by integrating immune cell data, outperforming baseline models in kidney transplant patient analysis.

Authors:Hao Yan, Handong Zheng, Hao Wang, Liang Yin, Xingchen Liu, Zhenbiao Cao, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, Chao Weng, Wei Chen, Yuliang Liu, Xiang Bai
Title: VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Abstract:
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles
Chinese: 多模态大语言模型在推理任务上取得进展,但抽象视觉推理仍面临感知局限的挑战,为此我们开发了VisuRiddles基准和感知谜题合成器,通过细粒度感知训练数据显著提升了模型性能。
English: Recent advances in multimodal large language models have improved reasoning tasks, but abstract visual reasoning remains challenging due to perception limitations, prompting the development of the VisuRiddles benchmark and the Perceptual Riddle Synthesizer to enhance model performance through fine-grained perceptual training data.

Authors:Juntong Li, Lingwei Dang, Yukun Su, Yun Hao, Qingxin Xiao, Yongwei Nie, Qingyao Wu
Title: MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection
Abstract:
Video Anomaly Detection (VAD) methods based on reconstruction or prediction face two critical challenges: (1) strong generalization capability often results in accurate reconstruction or prediction of abnormal events, making it difficult to distinguish normal from abnormal patterns; (2) reliance only on low-level appearance and motion cues limits their ability to identify high-level semantic in abnormal events from complex scenes. To address these limitations, we propose a novel VAD framework with two key innovations. First, to suppress excessive generalization, we introduce the Sparse Feature Filtering Module (SFFM) that employs bottleneck filters to dynamically and adaptively remove abnormal information from features. Unlike traditional memory modules, it does not need to memorize the normal prototypes across the training dataset. Further, we design the Mixture of Experts (MoE) architecture for SFFM. Each expert is responsible for extracting specialized principal features during running time, and different experts are selectively activated to ensure the diversity of the learned principal features. Second, to overcome the neglect of semantics in existing methods, we integrate a Vision-Language Model (VLM) to generate textual descriptions for video clips, enabling comprehensive joint modeling of semantic, appearance, and motion cues. Additionally, we enforce modality consistency through semantic similarity constraints and motion frame-difference contrastive loss. Extensive experiments on multiple public datasets validate the effectiveness of our multimodal joint modeling framework and sparse feature filtering paradigm. Project page at https://qzfm.github.io/sfn_vad_project_page/.

Authors:Sining Chen, Yilei Shi, Xiao Xiang Zhu
Title: Enhancing Monocular Height Estimation via Weak Supervision from Imperfect Labels
Abstract:
Monocular height estimation is considered the most efficient and cost-effective means of 3D perception in remote sensing, and it has attracted much attention since the emergence of deep learning. While training neural networks requires a large amount of data, data with perfect labels are scarce and only available within developed regions. The trained models therefore lack generalizability, which limits the potential for large-scale application of existing methods. We tackle this problem for the first time, by introducing data with imperfect labels into training pixel-wise height estimation networks, including labels that are incomplete, inexact, and inaccurate compared to high-quality labels. We propose an ensemble-based pipeline compatible with any monocular height estimation network. Taking the challenges of noisy labels, domain shift, and long-tailed distribution of height values into consideration, we carefully design the architecture and loss functions to leverage the information concealed in imperfect labels using weak supervision through balanced soft losses and ordinal constraints. We conduct extensive experiments on two datasets with different resolutions, DFC23 (0.5 to 1 m) and GBH (3 m). The results indicate that the proposed pipeline outperforms baselines by achieving more balanced performance across various domains, leading to improvements in average root mean square error of up to 22.94% and 18.62% on DFC23 and GBH, respectively. The efficacy of each design component is validated through ablation studies. Code is available at https://github.com/zhu-xlab/weakim2h.
Chinese: 本研究提出了一种利用不完美标签训练单目高度估计网络的新方法,通过平衡软损失和有序约束解决数据稀缺问题并提升模型泛化能力,在多个数据集上实现了显著的误差降低。
English: This study introduces a novel pipeline that leverages imperfect labels to train monocular height estimation networks, addressing data scarcity and enhancing model generalizability through balanced soft losses and ordinal constraints, achieving significant error reductions on diverse datasets.
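One common way to realize ordinal constraints is to decompose height regression into K binary "is the height above threshold t_k?" tasks with per-threshold weights against the long tail. The sketch below follows that generic recipe; the paper's balanced soft losses differ in detail, and all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def ordinal_height_loss(logits, height, bin_edges, class_weights=None):
    """Ordinal decomposition of height regression into K 'h > t_k' tasks.

    logits:    (B, K) per-pixel scores for the K ordinal thresholds
    height:    (B,) ground-truth heights (possibly imperfect labels)
    bin_edges: (K,) increasing thresholds in meters
    class_weights: optional (K,) weights to counteract the long-tailed
    height distribution.
    """
    targets = (height.unsqueeze(1) > bin_edges.unsqueeze(0)).float()   # (B, K)
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    if class_weights is not None:
        loss = loss * class_weights.unsqueeze(0)
    return loss.mean()

def decode_height(logits, bin_edges):
    """Approximate height as the sum of bin widths weighted by P(h > t_k)."""
    probs = torch.sigmoid(logits)                                       # (B, K)
    widths = torch.diff(bin_edges, prepend=bin_edges.new_zeros(1))      # (K,)
    return (probs * widths).sum(dim=-1)
```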

Authors:Jiahao Chen, Hangjie Yuan, Yichen Qian, Jingyun Liang, Jiazheng Xing, Pengwei Liu, Weihua Chen, Fan Wang, Bing Su
Title: LumosFlow: Motion-Guided Long Video Generation
Abstract:
Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that introduces motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: https://jiahaochen1.github.io/LumosFlow/
中文: 本文提出LumosFlow框架,通过采用大间隔运动的关键帧生成技术和分解式中间帧插值方法,在保持内容多样性的同时显著提升了长视频生成的运动连贯性与视觉质量。
English: The paper introduces LumosFlow, a hierarchical framework that enhances long video generation by explicitly incorporating motion guidance through key frame generation with large motion intervals and advanced intermediate frame interpolation to ensure temporal coherence and visual quality.

Authors:Jiachen Liu, Rui Yu, Sili Chen, Sharon X. Huang, Hengkai Guo
Title: Towards In-the-wild 3D Plane Reconstruction from a Single Image
Abstract:
3D plane reconstruction from a single image is a crucial yet challenging topic in 3D computer vision. Previous state-of-the-art (SOTA) methods have focused on training their system on a single dataset from either indoor or outdoor domain, limiting their generalizability across diverse testing data. In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. To enable data-driven models across multiple domains, we have curated a large-scale planar benchmark, comprising over 14 datasets and 560,000 high-resolution, dense planar annotations for diverse indoor and outdoor scenes. To address the challenge of achieving desirable planar geometry on multi-dataset training, we propose to disentangle the representation of plane normal and offset, and employ an exemplar-guided, classification-then-regression paradigm to learn plane and offset respectively. Additionally, we employ advanced backbones as image encoder, and present an effective pixel-geometry-enhanced plane embedding module to further facilitate planar reconstruction. Extensive experiments across multiple zero-shot evaluation datasets have demonstrated that our approach significantly outperforms previous methods on both reconstruction accuracy and generalizability, especially over in-the-wild data. Our code and data are available at: https://github.com/jcliu0428/ZeroPlane.
中文摘要:本文提出ZeroPlane这一基于Transformer的零样本三维平面重建框架,通过多数据集训练和创新的几何解耦方法,在跨领域场景中实现了卓越的重建性能。
English Summary: This paper introduces ZeroPlane, a Transformer-based framework for zero-shot 3D plane reconstruction from single images, which achieves superior performance across diverse domains through multi-dataset training and novel geometric disentanglement techniques.

Authors:Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu
Title: SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
Abstract:
Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature aligning, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://github.com/Droliven/SViMo_project.
中文: 本文提出了一种创新框架,通过同步扩散过程融合视觉先验与动态约束,无需依赖预定义模型即可同步生成手物交互视频与运动,显著提升了物理合理性与泛化能力。
English: This paper introduces a novel framework that integrates visual priors and dynamic constraints through synchronized diffusion to simultaneously generate hand-object interaction videos and motions, enhancing physical plausibility and generalization without relying on predefined models.

Authors:Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Xinbo Gao
Title: Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification
Abstract:
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces dedicated identity-level loss and specialized multihead attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at https://github.com/Visuang/VLD.
中文摘要:VLD框架通过模态不变语言提示和时空提示模块,生成共享文本提示并整合时空特征,在跨模态视频行人重识别中弥合模态差异,实现了最先进的性能。
English Summary: The VLD framework introduces invariant-modality language prompting and spatial-temporal prompting modules to generate shared text prompts and integrate spatiotemporal features, achieving state-of-the-art performance in cross-modality video person re-identification by bridging modality gaps.

Authors:Nurislam Tursynbek, Hastings Greer, Basar Demir, Marc Niethammer
Title: Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models
Abstract:
Diffusion models, while trained for image generation, have emerged as powerful foundational feature extractors for downstream tasks. We find that off-the-shelf diffusion models, trained exclusively to generate natural RGB images, can identify semantically meaningful correspondences in medical images. Building on this observation, we propose to leverage diffusion model features as a similarity measure to guide deformable image registration networks. We show that common intensity-based similarity losses often fail in challenging scenarios, such as when certain anatomies are visible in one image but absent in another, leading to anatomically inaccurate alignments. In contrast, our method identifies true semantic correspondences, aligning meaningful structures while disregarding those not present across images. We demonstrate superior performance of our approach on two tasks: multimodal 2D registration (DXA to X-Ray) and monomodal 3D registration (brain-extracted to non-brain-extracted MRI). Code: https://github.com/uncbiag/dgir
中文摘要:原本用于图像生成的扩散模型能够有效识别医学图像中的语义对应关系,通过关注有意义的解剖结构并忽略无关特征,实现了更精准的可变形图像配准。
English Summary: Diffusion models trained for image generation can effectively identify semantic correspondences in medical images, enabling more accurate deformable image registration by focusing on meaningful anatomical structures while ignoring irrelevant features.
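The similarity measure itself reduces to comparing frozen diffusion features of the fixed image and the warped moving image, for example with a cosine distance. The sketch below assumes the feature extraction step is done elsewhere; the loss form and names are illustrative, not necessarily the released dgir code.

```python
import torch
import torch.nn.functional as F

def diffusion_feature_similarity(feat_fixed, feat_warped):
    """Negative cosine similarity between diffusion features as a registration loss.

    feat_fixed, feat_warped: (B, C, H, W) feature maps extracted from a frozen,
    off-the-shelf diffusion model (extraction not shown). Unlike raw intensity
    losses, matching in this feature space rewards aligning the same anatomy
    and tolerates structures present in only one image.
    """
    f = F.normalize(feat_fixed, dim=1)
    m = F.normalize(feat_warped, dim=1)
    return 1.0 - (f * m).sum(dim=1).mean()

# usage inside a registration loop (hypothetical names):
#   warped = spatial_transform(moving, displacement_field)
#   loss = diffusion_feature_similarity(extract_features(fixed),
#                                       extract_features(warped))
```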

Authors:Wenhao Tang, Rong Qin, Heng Fang, Fengtao Zhou, Hao Chen, Xiang Li, Ming-Ming Cheng
Title: Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology
Abstract:
Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at https://github.com/DearCaat/E2E-WSI-ABMILX.
中文: 本文提出ABMILX这一新型多示例学习方法,解决了计算病理学中端到端学习的优化难题,在多个基准测试中高效超越现有最优模型。
English: This paper introduces ABMILX, a novel multiple instance learning method that addresses optimization challenges in end-to-end learning for computational pathology, outperforming state-of-the-art models efficiently across multiple benchmarks.
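For readers unfamiliar with attention-based MIL, the gated-attention aggregator that ABMILX builds on is sketched below; ABMILX's global correlation-based refinement and multi-head mechanism are not reproduced, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention-based MIL pooling (the ABMIL-style baseline).

    Aggregates N patch embeddings of a slide into one bag embedding via learned
    attention weights, then classifies the bag.
    """

    def __init__(self, dim: int = 512, hidden: int = 128, num_classes: int = 2):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.u = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.w = nn.Linear(hidden, 1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches: torch.Tensor):
        # patches: (N, dim) instance features for one whole-slide image
        attn = torch.softmax(self.w(self.v(patches) * self.u(patches)), dim=0)  # (N, 1)
        bag = (attn * patches).sum(dim=0)                                       # (dim,)
        return self.head(bag), attn
```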

Authors:Yongxian Liu, Boyang Li, Ting Liu, Zaiping Lin, Wei An
Title: RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection
Abstract:
Infrared small target detection is a challenging task due to its unique characteristics (e.g., small, dim, shapeless and changeable). Recently published CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. To achieve efficient and effective detection, we propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection. Specifically, RRCA-Net incorporates reusable-convolution block (RuCB) in a recurrent manner without introducing extra parameters. With the help of the repetitive iteration in RuCB, the high-level information of small targets in the deep layers can be well maintained and further refined. Then, a dual interactive attention aggregation module (DIAAM) is proposed to promote the mutual enhancement and fusion of refined information. In this way, RRCA-Net can both achieve high-level feature refinement and enhance the correlation of contextual information between adjacent layers. Moreover, to achieve steady convergence, we design a target characteristic inspired loss function (DpT-k loss) by integrating physical and mathematical constraints. Experimental results on three benchmark datasets (e.g. NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate that our RRCA-Net can achieve comparable performance to the state-of-the-art methods while maintaining a small number of parameters, and act as a plug and play module to introduce consistent performance improvement for several popular IRSTD methods. Our code will be available at https://github.com/yongxianLiu/ soon.
中文: 提出的RRCA-Net通过循环使用卷积块和双交互注意力模块,在保持参数极少的同时有效检测红外小目标,并达到先进性能水平。
English: The proposed RRCA-Net utilizes reusable convolution blocks and a dual interactive attention module to efficiently detect infrared small targets while maintaining minimal parameters and achieving state-of-the-art performance.
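The parameter-free recurrence is the easiest part to sketch: a single convolution block is applied repeatedly, so deep-layer target cues are refined without adding weights per iteration. The block layout and iteration count below are illustrative assumptions, not the exact RRCA-Net design.

```python
import torch
import torch.nn as nn

class ReusableConvBlock(nn.Module):
    """Sketch of recurrent weight reuse: one conv block applied T times."""

    def __init__(self, channels: int = 32, iterations: int = 3):
        super().__init__()
        self.iterations = iterations
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.iterations):
            x = x + self.block(x)   # reuse the same weights on every pass
        return x
```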

Authors:Seulgi Kim, Ghazal Kaviani, Mohit Prabhushankar, Ghassan AlRegib
Title: Multi-level and Multi-modal Action Anticipation
Abstract:
Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. Unlike action recognition, which operates on fully observed videos, action anticipation must handle incomplete information. Hence, it requires temporal reasoning, and inherent uncertainty handling. While recent advances have been made, traditional methods often focus solely on visual modalities, neglecting the potential of integrating multiple sources of information. Drawing inspiration from human behavior, we introduce \textit{Multi-level and Multi-modal Action Anticipation (m\&m-Ant)}, a novel multi-modal action anticipation approach that combines both visual and textual cues, while explicitly modeling hierarchical semantic information for more accurate predictions. To address the challenge of inaccurate coarse action labels, we propose a fine-grained label generator paired with a specialized temporal consistency loss function to optimize performance. Extensive experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach, achieving state-of-the-art results with an average anticipation accuracy improvement of 3.08\% over existing methods. This work underscores the potential of multi-modal and hierarchical modeling in advancing action anticipation and establishes a new benchmark for future research in the field. Our code is available at: https://github.com/olivesgatech/mM-ant.
中文摘要:本文提出m&m-Ant多模态方法,通过融合视觉文本线索与分层语义建模,将动作预测准确率较现有方法平均提升3.08%。
English Summary: The paper introduces m&m-Ant, a novel multi-modal approach that integrates visual and textual cues with hierarchical modeling to improve action anticipation accuracy by 3.08% over existing methods.

Authors:Zihao Ding, Cheng-Tse Lee, Mufeng Zhu, Tao Guan, Yuan-Chun Sun, Cheng-Hsin Hsu, Yao Liu
Title: EyeNavGS: A 6-DoF Navigation Dataset and Record-n-Replay Software for Real-World 3DGS Scenes in VR
Abstract:
3D Gaussian Splatting (3DGS) is an emerging media representation that reconstructs real-world 3D scenes in high fidelity, enabling 6-degrees-of-freedom (6-DoF) navigation in virtual reality (VR). However, developing and evaluating 3DGS-enabled applications and optimizing their rendering performance require realistic user navigation data. Such data is currently unavailable for photorealistic 3DGS reconstructions of real-world scenes. This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset featuring traces from 46 participants exploring twelve diverse, real-world 3DGS scenes. The dataset was collected at two sites, using the Meta Quest Pro headsets, recording the head pose and eye gaze data for each rendered frame during free world standing 6-DoF navigation. For each of the twelve scenes, we performed careful scene initialization to correct for scene tilt and scale, ensuring a perceptually-comfortable VR experience. We also release our open-source SIBR viewer software fork with record-and-replay functionalities and a suite of utility tools for data processing, conversion, and visualization. The EyeNavGS dataset and its accompanying software tools provide valuable resources for advancing research in 6-DoF viewport prediction, adaptive streaming, 3D saliency, and foveated rendering for 3DGS scenes. The EyeNavGS dataset is available at: https://symmru.github.io/EyeNavGS/.

Authors:Qin Xie, Qinghua Zhang, Shuyin Xia
Title: Approximate Borderline Sampling using Granular-Ball for Classification Tasks
Abstract:
Data sampling enhances classifier efficiency and robustness through data compression and quality improvement. Recently, the sampling method based on granular-ball (GB) has shown promising performance in generality and noisy classification tasks. However, some limitations remain, including the absence of borderline sampling strategies and issues with class boundary blurring or shrinking due to overlap between GBs. In this paper, an approximate borderline sampling method using GBs is proposed for classification tasks. First, a restricted diffusion-based GB generation (RD-GBG) method is proposed, which prevents GB overlaps by constrained expansion, preserving precise geometric representation of GBs via redefined ones. Second, based on the concept of heterogeneous nearest neighbor, a GB-based approximate borderline sampling (GBABS) method is proposed, which is the first general sampling method capable of both borderline sampling and improving the quality of class noise datasets. Additionally, since RD-GBG incorporates noise detection and GBABS focuses on borderline samples, GBABS performs outstandingly on class noise datasets without the need for an optimal purity threshold. Experimental results demonstrate that the proposed methods outperform the GB-based sampling method and several representative sampling methods. Our source code is publicly available at https://github.com/CherylTse/GBABS.
中文:本文提出一种基于粒球的近似边界采样方法,通过受限扩散避免粒球重叠,无需纯度阈值即可提升噪声数据集的分类性能。
English: This paper introduces a granular-ball-based approximate borderline sampling method that prevents overlaps through restricted diffusion and enhances classification performance on noisy datasets without requiring purity thresholds.

Authors:Liang Li, Jianli Zhao, Sheng Fang, Siyu Chen, Hui Sun
Title: A TRPCA-Inspired Deep Unfolding Network for Hyperspectral Image Denoising via Thresholded t-SVD and Top-K Sparse Transformer
Abstract:
Hyperspectral images (HSIs) are often degraded by complex mixed noise during acquisition and transmission, making effective denoising essential for subsequent analysis. Recent hybrid approaches that bridge model-driven and data-driven paradigms have shown great promise. However, most of these approaches lack effective alternation between different priors or modules, resulting in loosely coupled regularization and insufficient exploitation of their complementary strengths. Inspired by tensor robust principal component analysis (TRPCA), we propose a novel deep unfolding network (DU-TRPCA) that enforces stage-wise alternation between two tightly integrated modules: low-rank and sparse. The low-rank module employs thresholded tensor singular value decomposition (t-SVD), providing a widely adopted convex surrogate for tensor low-rankness and has been demonstrated to effectively capture the global spatial-spectral structure of HSIs. The Top-K sparse transformer module adaptively imposes sparse constraints, directly matching the sparse regularization in TRPCA and enabling effective removal of localized outliers and complex noise. This tightly coupled architecture preserves the stage-wise alternation between low-rank approximation and sparse refinement inherent in TRPCA, while enhancing representational capacity through attention mechanisms. Extensive experiments on synthetic and real-world HSIs demonstrate that DU-TRPCA surpasses state-of-the-art methods under severe mixed noise, while offering interpretability benefits and stable denoising dynamics inspired by iterative optimization. Code is available at https://github.com/liangli97/TRPCA-Deep-Unfolding-HSI-Denoising.
中文: 提出的DU-TRPCA深度展开网络通过阶段性交替紧密整合低秩与稀疏模块,在有效去除高光谱图像混合噪声的同时超越了现有最优方法,并保持了可解释性优势。
English: The proposed DU-TRPCA deep unfolding network tightly integrates low-rank and sparse modules through stage-wise alternation, effectively removing mixed noise in hyperspectral images while surpassing state-of-the-art methods and maintaining interpretability.
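The Top-K sparse transformer module enforces sparsity by keeping only the strongest attention scores per query. A minimal, generic sketch of that mechanism is below; the `topk_sparse_attention` function, its shapes, and the single-head formulation are illustrative assumptions rather than the paper's exact module.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=8):
    """Attention that keeps only the top-K scores per query and masks
    the rest, a common way to impose sparsity on the attention map."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5            # (..., Nq, Nk)
    kth = torch.topk(scores, k=min(topk, scores.shape[-1]), dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))          # drop weak links
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(2, 64, 32)   # (batch, tokens, dim)
k = torch.randn(2, 64, 32)
v = torch.randn(2, 64, 32)
print(topk_sparse_attention(q, k, v, topk=8).shape)   # torch.Size([2, 64, 32])
```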

Authors:Woojeong Jin, Seongchan Kim, Jaeho Lee, Seungryong Kim
Title: InterRVOS: Interaction-aware Referring Video Object Segmentation
Abstract:
Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine-grained understanding of object relationships, as many video events are defined by such relationships rather than individual objects. To support this task, we propose a new evaluation protocol that separately evaluates actor and target segmentation, enabling more accurate assessment of the model's ability to distinguish and segment actor and target roles. We also present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects. Furthermore, we develop ReVIOSa, an MLLM-based architecture that introduces interaction-aware special tokens and leverages an attention mask loss to enhance role-specific segmentation. Extensive experiments show that ReVIOSa not only outperforms existing baselines on our proposed InterRVOS-127K evaluation set, but also achieves strong performance on standard RVOS benchmarks. Our project page is available at: https://cvlab-kaist.github.io/InterRVOS.

Authors:Duo Liu, Zhiquan Tan, Linglan Zhao, Zhongqiang Zhang, Xiangzhong Fang, Weiran Huang
Title: Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization
Abstract:
Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the learning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our proposed method, RLCD, achieves superior performance in all classes with negligible extra computation. Comprehensive experiments across seven GCD datasets validate its superiority. Our codes are available at https://github.com/APORduo/RLCD.
中文摘要:提出的互学习框架(RLF)结合类间分布正则化(CDR),通过主分支与辅助分支的协同优化,有效提升广义类别发现中基类与新类别的识别能力,在七个数据集上以可忽略的额外计算成本实现了最优性能。
English Summary: The proposed Reciprocal Learning Framework (RLF) with Class-wise Distribution Regularization (CDR) enhances base and novel class recognition in Generalized Category Discovery by creating a synergistic loop between main and auxiliary branches, achieving superior performance across seven datasets with minimal computational overhead.

Authors:Cristian-Ioan Blaga, Paul Suganthan, Sahil Dua, Krishna Srinivasan, Enrique Alfonseca, Peter Dornbach, Tom Duerig, Imed Zitouni, Zhe Dong
Title: Entity Image and Mixed-Modal Image Retrieval Datasets
Abstract:
Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: https://github.com/google-research-datasets/wit-retrieval.
中文摘要:本文提出了一个包含两个数据集的新型基准,用于评估需要深度跨模态理解的混合模态图像检索,并通过实证测试和人工标注验证了其有效性。
English Summary: This paper introduces a novel benchmark with two datasets for evaluating mixed-modal image retrieval that requires deep cross-modal understanding, validated through empirical testing and human annotations.

Authors:Johannes Schusterbauer, Ming Gui, Frank Fundel, Björn Ommer
Title: Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment
Abstract:
Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM models are computationally prohibitive for finetuning, while diffusion models like Stable Diffusion benefit from efficient architectures and ecosystem support. This work addresses the critical challenge of efficiently transferring knowledge from pre-trained diffusion models to flow matching. We propose Diff2Flow, a novel framework that systematically bridges diffusion and FM paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions. This alignment enables direct and efficient FM finetuning of diffusion priors with no extra computation overhead. Our experiments demonstrate that Diff2Flow outperforms naïve FM and diffusion finetuning particularly under parameter-efficient constraints, while achieving superior or competitive performance across diverse downstream tasks compared to state-of-the-art methods. We will release our code at https://github.com/CompVis/diff2flow.
中文: Diff2Flow通过重新调整时间步长和对齐插值,将预训练扩散模型的知识高效迁移至流匹配,实现了无需额外计算的卓越微调性能。
English: Diff2Flow efficiently transfers knowledge from pre-trained diffusion models to flow matching by aligning their paradigms, enabling superior finetuning performance without extra computation.
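The core conversion, deriving an FM-compatible velocity from a diffusion model's prediction, can be illustrated with a small sketch. It assumes a variance-preserving forward process with noise prediction and a linear flow-matching interpolant between data and noise; the timestep rescaling and interpolant alignment that Diff2Flow also performs are omitted, so the function below is an illustration under those assumptions rather than the authors' implementation.

```python
import torch

def fm_velocity_from_eps(x_t, eps_pred, alpha_bar_t):
    """Convert a diffusion model's noise prediction into a flow-matching style
    velocity, assuming x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps and a linear
    FM path between data x0 and noise x1, whose velocity is x1 - x0."""
    sqrt_ab = alpha_bar_t.sqrt()
    sqrt_1mab = (1.0 - alpha_bar_t).sqrt()
    x0_hat = (x_t - sqrt_1mab * eps_pred) / sqrt_ab   # predicted clean sample
    x1_hat = eps_pred                                 # noise endpoint of the FM path
    return x1_hat - x0_hat                            # straight-line FM velocity

x_t = torch.randn(4, 3, 32, 32)
eps = torch.randn(4, 3, 32, 32)
print(fm_velocity_from_eps(x_t, eps, torch.tensor(0.7)).shape)
```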

Authors:Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, Lei Zhang
Title: TIIF-Bench: How Does Your T2I Model Follow Your Instructions?
Abstract:
The rapid advancements of Text-to-Image (T2I) models have ushered in a new phase of AI-generated content, marked by their growing ability to interpret and follow user instructions. However, existing T2I model evaluation benchmarks fall short due to limited prompt diversity and complexity, as well as coarse evaluation metrics, making it difficult to evaluate the fine-grained alignment performance between textual instructions and generated images. In this paper, we present TIIF-Bench (Text-to-Image Instruction Following Benchmark), aiming to systematically assess T2I models' ability in interpreting and following intricate textual instructions. TIIF-Bench comprises a set of 5000 prompts organized along multiple dimensions, which are categorized into three levels of difficulties and complexities. To rigorously evaluate model robustness to varying prompt lengths, we provide a short and a long version for each prompt with identical core semantics. Two critical attributes, i.e., text rendering and style control, are introduced to evaluate the precision of text synthesis and the aesthetic coherence of T2I models. In addition, we collect 100 high-quality designer-level prompts that encompass various scenarios to comprehensively assess model performance. Leveraging the world knowledge encoded in large vision language models, we propose a novel computable framework to discern subtle variations in T2I model outputs. Through meticulous benchmarking of mainstream T2I models on TIIF-Bench, we analyze the pros and cons of current T2I models and reveal the limitations of current T2I benchmarks. Project Page: https://a113n-w3i.github.io/TIIF_Bench/.

Authors:Xuweiyi Chen, Tian Xia, Sihan Xu, Jianing Yang, Joyce Chai, Zezhou Cheng
Title: SAB3R: Semantic-Augmented Backbone in 3D Reconstruction
Abstract:
We introduce a new task, Map and Locate, which unifies the traditionally distinct objectives of open-vocabulary segmentation - detecting and segmenting object instances based on natural language queries - and 3D reconstruction, the process of estimating a scene's 3D structure from visual inputs. Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open-vocabulary queries. This task serves as a critical step toward real-world embodied AI applications and introduces a practical task that bridges reconstruction, recognition and reorganization. To tackle this task, we introduce a simple yet effective baseline, which we denote as SAB3R. Our approach builds upon MASt3R, a recent breakthrough in 3D computer vision, and incorporates a lightweight distillation strategy. This method transfers dense, per-pixel semantic features from 2D vision backbones (e.g., CLIP and DINOv2) to enhance MASt3R's capabilities. Without introducing any auxiliary frozen networks, our model generates per-pixel semantic features and constructs cohesive point maps in a single forward pass. Compared to separately deploying MASt3R and CLIP, our unified model, SAB3R, achieves superior performance on the Map and Locate benchmark. Furthermore, we evaluate SAB3R on both 2D semantic segmentation and 3D tasks to comprehensively validate its effectiveness.

Authors:Xuefeng Jiang, Tian Wen, Zhiqin Yang, Lvhua Wu, Yufeng Chen, Sheng Sun, Yuwei Wang, Min Liu
Title: Robust Federated Learning against Noisy Clients via Masked Optimization
Abstract:
In recent years, federated learning (FL) has made significant advances in privacy-sensitive applications. However, it can be hard to ensure that FL participants provide well-annotated data for training. The corresponding annotations from different clients often contain complex label noise at varying levels. This label noise issue has a substantial impact on the performance of the trained models, and clients with higher noise levels account for much of this degradation. To this end, it is necessary to develop an effective optimization strategy to alleviate the adverse effects of these noisy clients. In this study, we present a two-stage optimization framework, MaskedOptim, to address this intricate label noise problem. The first stage is designed to facilitate the detection of noisy clients with higher label noise rates. The second stage focuses on rectifying the labels of the noisy clients' data through an end-to-end label correction mechanism, aiming to mitigate the negative impacts caused by misinformation within datasets. This is achieved by learning the potential ground-truth labels of the noisy clients' datasets via backpropagation. To further enhance the training robustness, we apply the geometric median based model aggregation instead of the commonly-used vanilla averaged model aggregation. We implement sixteen related methods and conduct evaluations on three image datasets and one text dataset with diverse label noise patterns for a comprehensive comparison. Extensive experimental results indicate that our proposed framework shows its robustness in different scenarios. Additionally, our label correction framework effectively enhances the data quality of the detected noisy clients' local datasets. Our codes are available via https://github.com/Sprinter1999/MaskedOptim.
Chinese: 联邦学习面临来自不同客户端的复杂标签噪声问题,这会降低模型性能,本研究提出了MaskedOptim框架,通过检测噪声客户端并利用反向传播纠正其标签,有效提升了训练鲁棒性和数据质量。
English: Federated learning faces challenges from complex label noise across clients, which degrades model performance, and this study introduces MaskedOptim, a two-stage framework that detects noisy clients and corrects their labels via backpropagation to improve robustness and data quality.
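The abstract mentions geometric-median model aggregation in place of vanilla averaging. A standard way to compute it is the Weiszfeld iteration, sketched below on flattened client parameter vectors; handling of real model state dicts and any weighting MaskedOptim may use are not shown, and the function name is ours.

```python
import torch

def geometric_median(weights, iters=50, eps=1e-8):
    """Weiszfeld iteration for the geometric median of client parameter
    vectors (one row per client); a robust alternative to plain averaging."""
    z = weights.mean(dim=0)
    for _ in range(iters):
        d = torch.norm(weights - z, dim=1).clamp_min(eps)   # distances to current estimate
        w = 1.0 / d
        z = (w[:, None] * weights).sum(dim=0) / w.sum()
    return z

clients = torch.randn(10, 1000)   # 10 clients, flattened parameters
clients[0] += 50.0                # one heavily corrupted client
# the median stays close to the mean of the clean clients despite the outlier
print(torch.norm(geometric_median(clients) - clients[1:].mean(dim=0)))
```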

Authors:Aditya Kanade, Tanuja Ganu
Title: Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.
中文: 多模态大语言模型存在严重视觉感知缺陷,在人类准确率达96.49%的新基准测试中其平均准确率不足50%,表明即使能给出正确答案,模型在处理精细视觉细节时仍存在根本性缺陷。
English: Multimodal Large Language Models exhibit significant visual perception deficiencies, as demonstrated by their under 50% accuracy on a new benchmark where humans achieve 96.49%, revealing critical failures in processing fine-grained visual details despite sometimes producing correct answers.

Authors:Youze Xue, Dian Li, Gang Liu
Title: Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying
Abstract:
With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone. Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released on https://github.com/QQ-MM/QQMM-embed.
Chinese: 本研究提出显式梯度放大器来增强对比学习中困难负样本的梯度,基于LLaVA-OneVision-7B架构和自研QQMM多模态大模型,在MMEB基准测试中实现了最优性能。
English: This study introduces an Explicit Gradient Amplifier to enhance hard negative sample gradients in contrastive learning, achieving state-of-the-art performance on the MMEB benchmark with both the LLaVA-OneVision-7B architecture and a proprietary MLLM called QQMM.
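One simple way to realize explicit gradient amplification for hard negatives is to up-weight the hardest negative terms in the InfoNCE denominator, which scales their gradient contribution accordingly. The sketch below does exactly that; the `top_m`/`beta` reweighting scheme is an assumption for illustration and may differ from the paper's Explicit Gradient Amplifier.

```python
import torch
import torch.nn.functional as F

def infonce_amplified(query, pos, negs, tau=0.07, top_m=5, beta=2.0):
    """InfoNCE-style loss where the top-m hardest negatives (highest
    similarity to the query) receive an amplified weight in the
    denominator, scaling their gradient contribution."""
    q = F.normalize(query, dim=-1)                          # (B, D)
    p = F.normalize(pos, dim=-1)                            # (B, D)
    n = F.normalize(negs, dim=-1)                           # (B, K, D)
    s_pos = (q * p).sum(-1, keepdim=True) / tau             # (B, 1)
    s_neg = torch.einsum("bd,bkd->bk", q, n) / tau          # (B, K)
    w = torch.ones_like(s_neg)
    idx = s_neg.topk(min(top_m, s_neg.shape[1]), dim=1).indices
    w.scatter_(1, idx, beta)                                 # amplify hardest negatives
    denom = s_pos.exp() + (w * s_neg.exp()).sum(1, keepdim=True)
    return -(s_pos - denom.log()).mean()

loss = infonce_amplified(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 64, 256))
print(loss.item())
```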

Authors:Jiajun Jiang, Yiming Zhu, Zirui Wu, Jie Song
Title: DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes
Abstract:
We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation. Project page: https://eku127.github.io/DualMap/
中文: DualMap 是一种在线开放词汇映射系统,通过自然语言查询使机器人能在动态环境中导航,采用混合分割前端和双地图表示法,实现高效的实时语义建图与导航。
English: DualMap is an online open-vocabulary mapping system that enables robots to navigate dynamic environments using natural language queries, featuring a hybrid segmentation frontend and dual-map representation for efficient, real-time semantic mapping and navigation.

Authors:Fei Shen, Xiaoyu Du, Yutong Gao, Jian Yu, Yushe Cao, Xing Lei, Jinhui Tang
Title: IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout
Abstract:
Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at https://github.com/muzishen/IMAGHarmony.
中文摘要:IMAGHarmony框架通过融合感知语义的协调模块和偏好引导的噪声选择策略,有效解决了多物体场景中数量与布局控制的难题,仅用少量训练数据就实现了卓越的结构一致性和语义准确性。
English Summary: The IMAGHarmony framework addresses multi-object image editing challenges by integrating a harmony-aware module and preference-guided noise selection, achieving superior structural and semantic accuracy with minimal training data.

Authors:Xiaohu Huang, Jingjing Wu, Qunyi Xie, Kai Han
Title: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Abstract:
Recent advances in scene understanding have leveraged multimodal large language models (MLLMs) for 3D reasoning by capitalizing on their strong 2D pretraining. However, the lack of explicit 3D data during MLLM pretraining limits 3D representation capability. In this paper, we investigate the 3D-awareness of MLLMs by evaluating multi-view correspondence and reveal a strong positive correlation between the quality of 3D-aware representation and downstream task performance. Motivated by this, we propose 3DRS, a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding. Extensive experiments across multiple benchmarks and MLLMs -- including visual grounding, captioning, and question answering -- demonstrate consistent performance gains. Project page: https://visual-ai.github.io/3drs
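A plausible reading of the supervision in 3DRS is a distillation-style alignment loss that pulls projected MLLM visual tokens toward frozen features from a pretrained 3D foundation model. The sketch below uses a linear projection head and a cosine objective; both choices, and the tensor shapes, are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(mllm_feats, feats_3d, proj):
    """Align MLLM visual tokens with features distilled from a 3D foundation
    model: project the MLLM features and pull them toward the frozen 3D
    targets with a cosine objective."""
    pred = proj(mllm_feats)                    # (N, D) projected MLLM tokens
    target = feats_3d.detach()                 # frozen 3D teacher features
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

proj = torch.nn.Linear(1024, 384)              # illustrative dimensions
loss = alignment_loss(torch.randn(200, 1024), torch.randn(200, 384), proj)
print(loss.item())
```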

Authors:Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin
Title: Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
Abstract:
Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture multi-object interaction crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
中文摘要:RoboMaster框架通过将多物体交互分解为三个子阶段并利用主导物体特征,有效解决了现有视频扩散模型在复杂机器人操作中难以捕捉多物体交互的问题,显著提升了轨迹控制视频生成的性能。
English Summary: The RoboMaster framework addresses limitations in existing video diffusion models by decomposing multi-object interactions into three distinct stages and using dominant object features to enhance visual fidelity and control in robotic manipulation tasks.

Authors:Salwa K. Al Khatib, Ahmed ElHagry, Shitong Shao, Zhiqiang Shen
Title: OD3: Optimization-free Dataset Distillation for Object Detection
Abstract:
Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD3, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We perform our data synthesis framework on MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the only prior dataset distillation method for detection and to conventional coreset selection methods, OD3 delivers superior accuracy and establishes new state-of-the-art results, surpassing the prior best method by more than 14% on COCO mAP50 at a compression ratio of 1.0%. Code and condensed datasets are available at: https://github.com/VILA-Lab/OD3.
中文: 本文提出OD3这一专为物体检测设计的无优化数据集蒸馏框架,通过候选选择和筛选流程,在1%压缩率下以超过14%的优势刷新了COCO mAP50的最优性能。
English: This paper introduces OD3, an optimization-free dataset distillation framework for object detection that outperforms existing methods by over 14% on COCO mAP50 at 1% compression through candidate selection and screening processes.
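The second-stage candidate screening can be pictured as filtering pasted objects by the confidence of a pretrained observer model. In the sketch below, the `observer(image, boxes)` interface returning per-box class scores is an assumed, hypothetical API; OD3's actual observer and thresholding may differ.

```python
import torch

def screen_candidates(observer, image, boxes, labels, conf_thresh=0.5):
    """Keep only pasted object candidates that a pretrained observer model
    still recognizes with enough confidence for their intended class."""
    with torch.no_grad():
        scores = observer(image, boxes)        # (N, num_classes), assumed API
    conf = scores.softmax(dim=-1).gather(1, labels.view(-1, 1)).squeeze(1)
    keep = conf >= conf_thresh
    return boxes[keep], labels[keep]

# toy usage with a stand-in observer that returns random class scores
dummy = lambda img, boxes: torch.randn(len(boxes), 80)
boxes = torch.tensor([[0.0, 0.0, 32.0, 32.0], [10.0, 10.0, 50.0, 50.0]])
labels = torch.tensor([3, 7])
print(screen_candidates(dummy, torch.zeros(3, 224, 224), boxes, labels, 0.01))
```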

Authors:Yaroslava Lochman, Carl Olsson, Christopher Zach
Title: Making Rotation Averaging Fast and Robust with Anisotropic Coordinate Descent
Abstract:
Anisotropic rotation averaging has recently been explored as a natural extension of respective isotropic methods. In the anisotropic formulation, uncertainties of the estimated relative rotations -- obtained via standard two-view optimization -- are propagated to the optimization of absolute rotations. The resulting semidefinite relaxations are able to recover global minima but scale poorly with the problem size. Local methods are fast and also admit robust estimation but are sensitive to initialization. They usually employ minimum spanning trees and therefore suffer from drift accumulation and can get trapped in poor local minima. In this paper, we attempt to bridge the gap between optimality, robustness and efficiency of anisotropic rotation averaging. We analyze a family of block coordinate descent methods initially proposed to optimize the standard chordal distances, and derive a much simpler formulation and an anisotropic extension obtaining a fast general solver. We integrate this solver into the extended anisotropic large-scale robust rotation averaging pipeline. The resulting algorithm achieves state-of-the-art performance on public structure-from-motion datasets. Project page: https://ylochman.github.io/acd

Authors:Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Md Moniruzzaman, Chen-Ping Yu, Yi-Hsuan Tsai, Dimitris Samaras
Title: Low-Rank Head Avatar Personalization with Registers
Abstract:
We introduce a novel method for low-rank personalization of a generic model for head avatar generation. Prior work proposes generic models that achieve high-quality face animation by leveraging large-scale datasets of multiple identities. However, such generic models usually fail to synthesize unique identity-specific details, since they learn a general domain prior. To adapt to specific subjects, we find that it is still challenging to capture high-frequency facial details via popular solutions like low-rank adaptation (LoRA). This motivates us to propose a specific architecture, a Register Module, that enhances the performance of LoRA, while requiring only a small number of parameters to adapt to an unseen identity. Our module is applied to intermediate features of a pre-trained model, storing and re-purposing information in a learnable 3D feature space. To demonstrate the efficacy of our personalization method, we collect a dataset of talking videos of individuals with distinctive facial details, such as wrinkles and tattoos. Our approach faithfully captures unseen faces, outperforming existing methods quantitatively and qualitatively. We will release the code, models, and dataset to the public.

Authors:Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, Daniel Cohen-Or
Title: Image Generation from Contextually-Contradictory Prompts
Abstract:
Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.
Chinese: 本文提出了一种阶段感知提示分解框架,利用大型语言模型解决文本到图像扩散模型中的上下文矛盾问题,通过将代理提示与去噪过程对齐来提升语义准确性。
English: This paper introduces a stage-aware prompt decomposition framework that uses large language models to resolve contextual contradictions in text-to-image diffusion models, improving semantic accuracy by aligning proxy prompts with the denoising process.

Authors:Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, Cheng Zhang
Title: TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation
Abstract:
We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels -- starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that TaxaDiffusion outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Project page: https://amink8.github.io/TaxaDiffusion/

Authors:Minghao Liu, Zhitao He, Zhiyuan Fan, Qingyun Wang, Yi R. Fung
Title: MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing
Abstract:
Text-guided image editing has seen significant progress in natural image domains, but its application in medical imaging remains limited and lacks standardized evaluation frameworks. Such editing could revolutionize clinical practices by enabling personalized surgical planning, enhancing medical education, and improving patient communication. To bridge this gap, we introduce MedEBench, a robust benchmark designed to diagnose reliability in text-guided medical image editing. MedEBench consists of 1,182 clinically curated image-prompt pairs covering 70 distinct editing tasks and 13 anatomical regions. It contributes in three key areas: (1) a clinically grounded evaluation framework that measures Editing Accuracy, Context Preservation, and Visual Quality, complemented by detailed descriptions of intended edits and corresponding Region-of-Interest (ROI) masks; (2) a comprehensive comparison of seven state-of-the-art models, revealing consistent patterns of failure; and (3) a diagnostic error analysis technique that leverages attention alignment, using Intersection-over-Union (IoU) between model attention maps and ROI masks to identify mislocalization issues, where models erroneously focus on incorrect anatomical regions. MedEBench sets the stage for developing more reliable and clinically effective text-guided medical image editing tools.
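The attention-alignment diagnostic boils down to an IoU between a binarized model attention map and the ROI mask. A minimal sketch is below; binarizing the attention map by keeping its top fraction of pixels (`keep_frac`) is our assumption, since the exact attention extraction used by the benchmark is not specified here.

```python
import numpy as np

def attention_roi_iou(attn_map, roi_mask, keep_frac=0.1):
    """Binarize an attention map by keeping its top fraction of pixels, then
    compute IoU with the ground-truth ROI mask to check whether the model
    attends to the intended anatomical region."""
    flat = attn_map.ravel()
    k = max(1, int(keep_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]        # k-th largest attention value
    attn_bin = attn_map >= thresh
    roi = roi_mask.astype(bool)
    inter = np.logical_and(attn_bin, roi).sum()
    union = np.logical_or(attn_bin, roi).sum()
    return inter / union if union else 0.0

attn = np.random.rand(64, 64)
roi = np.zeros((64, 64), dtype=bool); roi[20:40, 20:40] = True
print(attention_roi_iou(attn, roi))
```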

Authors:Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, Si Liu
Title: Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
Abstract:
Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) has demonstrated strong capabilities in vision-language tasks, while reinforcement learning tuning (RLT) has further improved their reasoning abilities. In this work, we explore RLT as a post-training strategy to enhance the video-specific reasoning capabilities of MLLMs. Built upon the Group Relative Policy Optimization (GRPO) framework, we propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. To facilitate effective preference-based optimization, we introduce a variance-aware data selection strategy based on repeated inference to identify samples that provide informative learning signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Our method consistently outperforms supervised fine-tuning and existing RLT baselines, achieving superior performance with significantly less training data. These results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. Notably, the initial code release (two months ago) has now been expanded with updates, including optimized reward mechanisms and additional datasets. The latest version is available at https://github.com/appletea233/Temporal-R1.
中文摘要:本研究通过强化学习调优,采用双奖励机制和方差感知数据选择策略,有效提升了多模态大语言模型在视频理解中的推理能力,在八个视频任务中以更少训练数据实现了更优性能。
English Summary: This study enhances video reasoning in multimodal large language models through reinforcement learning tuning with a dual-reward system and variance-aware data selection, achieving superior performance across eight video understanding tasks with less training data.
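The dual-reward formulation combines a discrete signal for semantic correctness with a continuous signal for temporal localization. The sketch below pairs an exact-match answer reward with a temporal IoU reward; the specific matching rule and weights are illustrative assumptions, not the paper's exact reward design.

```python
def temporal_iou(pred, gold):
    """Continuous reward: IoU between predicted and reference time spans."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0

def dual_reward(pred_answer, gold_answer, pred_span, gold_span, w_sem=1.0, w_temp=1.0):
    """Discrete component for answer correctness plus a continuous
    component for temporal grounding quality."""
    r_sem = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    r_temp = temporal_iou(pred_span, gold_span)
    return w_sem * r_sem + w_temp * r_temp

print(dual_reward("a dog", "A dog", (2.0, 6.0), (3.0, 7.0)))   # 1.0 + 0.6 = 1.6
```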

Authors:Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu
Title: ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding
Abstract:
Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni
中文: 针对当前多模态模型在3D理解和生成方面的局限,本研究提出了ShapeLLM-Omni这一原生3D大语言模型,通过3D VQVAE和3D-Alpaca数据集训练,实现了3D资源与文本的任意序列处理,将AI能力拓展至三维领域。
English: To address the limitations of current multimodal models in 3D understanding and generation, this study introduces ShapeLLM-Omni, a native 3D large language model that processes 3D assets and text in any sequence, trained using a 3D VQVAE and the 3D-Alpaca dataset to extend AI capabilities into the 3D domain.
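The 3D VQVAE maps shapes into discrete tokens via nearest-codebook lookup, the same quantization step used in standard VQ-VAEs. Below is a generic sketch of that step with a straight-through gradient estimator; the encoder, decoder, and 3D-specific architecture are omitted, and the shapes are illustrative.

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-codebook lookup used in VQ-VAEs: map each continuous latent
    to the index (token) of its closest codebook entry."""
    # z: (N, D) latents, codebook: (K, D) entries
    dists = torch.cdist(z, codebook)           # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                  # discrete tokens
    z_q = codebook[idx]                        # quantized latents
    z_q = z + (z_q - z).detach()               # straight-through estimator for the encoder
    return idx, z_q

z = torch.randn(16, 8, requires_grad=True)
codebook = torch.randn(512, 8)
tokens, z_q = vector_quantize(z, codebook)
print(tokens.shape, z_q.shape)                 # torch.Size([16]) torch.Size([16, 8])
```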

Authors:Sicheng Li, Chengzhen Wu, Hao Li, Xiang Gao, Yiyi Liao, Lu Yu
Title: GSCodec Studio: A Modular Framework for Gaussian Splat Compression
Abstract:
3D Gaussian Splatting and its extension to 4D dynamic scenes enable photorealistic, real-time rendering from real-world captures, positioning Gaussian Splats (GS) as a promising format for next-generation immersive media. However, their high storage requirements pose significant challenges for practical use in sharing, transmission, and storage. Despite various studies exploring GS compression from different perspectives, these efforts remain scattered across separate repositories, complicating benchmarking and the integration of best practices. To address this gap, we present GSCodec Studio, a unified and modular framework for GS reconstruction, compression, and rendering. The framework incorporates a diverse set of 3D/4D GS reconstruction methods and GS compression techniques as modular components, facilitating flexible combinations and comprehensive comparisons. By integrating best practices from community research and our own explorations, GSCodec Studio supports the development of compact representation and compression solutions for static and dynamic Gaussian Splats, namely our Static and Dynamic GSCodec, achieving competitive rate-distortion performance in static and dynamic GS compression. The code for our framework is publicly available at https://github.com/JasonLSC/GSCodec_Studio, to advance the research on Gaussian Splats compression.
中文:GSCodec Studio是一个集成了多种3D/4D高斯泼溅重建与压缩技术的统一框架,能够高效开发静态和动态场景的紧凑表示,并实现卓越的压缩性能。
English: GSCodec Studio is a unified framework that integrates various 3D/4D Gaussian Splat reconstruction and compression techniques, enabling efficient development of compact representations for static and dynamic scenes with competitive performance.

Authors:Manuel-Andreas Schneider, Lukas Höllein, Matthias Nießner
Title: WorldExplorer: Towards Generating Fully Navigable 3D Scenes
Abstract:
Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside of a scene, i.e., produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360 degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories, that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
中文摘要:WorldExplorer提出了一种基于自回归视频轨迹生成的新方法,通过迭代场景生成流程和三维高斯溅射优化,首次实现了在大型相机运动下保持稳定的高质量可导航三维场景,突破了传统方法在场景探索中的视角限制。
English Summary: WorldExplorer introduces an autoregressive video trajectory method that creates navigable 3D scenes with consistent visual quality across viewpoints, overcoming previous limitations in scene exploration by integrating multi-view generation with 3D Gaussian Splatting.

Authors:Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
Title: Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
Abstract:
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
中文: 本立场文件提出DataRubrics框架,通过基于量规的指标和LLM驱动的评估,系统性地评判数据集质量,以解决数据研究中原创性、透明度和可复现性不足的问题。
English: This position paper proposes DataRubrics, a structured framework using rubric-based metrics and LLM-powered evaluation to systematically assess dataset quality, addressing gaps in originality, transparency, and reproducibility in data-centric research.

Authors:Yafei Yang, Zihui Zhang, Bo Yang
Title: unMORE: Unsupervised Multi-Object Segmentation via Center-Boundary Reasoning
Abstract:
We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce unMORE, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.
中文: 本文提出unMORE这一新型两阶段无监督方法,通过习得物体中心化表征并采用无需网络结构的推理模块,在复杂真实图像上实现了最先进的多目标分割效果,显著超越了现有方法。
English: This paper introduces unMORE, a novel two-stage unsupervised method that learns object-centric representations and uses a network-free reasoning module to achieve state-of-the-art multi-object segmentation on complex real-world images, significantly outperforming existing approaches.

Authors:Tao Yang, Ruibin Li, Yangming Shi, Yuqi Zhang, Qide Dong, Haoran Cheng, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang
Title: Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
Abstract:
Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or several tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source codes are available at https://github.com/leeruibin/MfM.git.
中文: 本文提出了一种多对多的统一框架,通过联合图像-视频学习和深度图条件,训练单一模型处理多种视觉生成与编辑任务,实现了可执行十余种任务的高竞争力性能。
English: This paper introduces a unified many-for-many framework that trains a single model for multiple visual generation and manipulation tasks using joint image-video learning and depth maps, achieving competitive performance with models capable of over 10 different tasks.

Authors:Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu
Title: STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset
Abstract:
Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) under-perform in such visual rating ability while also suffering from a lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, giving MLLMs a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, and pre-trained models are available on the following webpage to support further research in this area. Datasets and codes are released on the project page: https://storm-bench.github.io/.
中文: 本文提出了STORM数据集与基准,旨在提升多模态大语言模型在五个领域的视觉排序能力,包含65.5万图像-文本对及一种新颖的从粗到细处理流程,以增强模型在序数回归任务中的可信表现。
English: This paper introduces STORM, a comprehensive dataset and benchmark designed to enhance the visual rating capabilities of multi-modal large language models (MLLMs) in ordinal regression tasks across five domains, featuring 655K image-text pairs and a novel coarse-to-fine processing pipeline for trustworthy performance.

Authors:Tong Wang, Jiaqi Wang, Shu Kong
Title: Active Learning via Vision-Language Model Adaptation with Open Data
Abstract:
Pretrained on web-scale open data, VLMs offer powerful capabilities for solving downstream tasks after being adapted to task-specific labeled data. Yet, data labeling can be expensive and may demand domain expertise. Active Learning (AL) aims to reduce this expense by strategically selecting the most informative data for labeling and model training. Recent AL methods have explored VLMs but have not leveraged publicly available open data, such as VLM's pretraining data. In this work, we leverage such data by retrieving task-relevant examples to augment the task-specific examples. As expected, incorporating them significantly improves AL. Given that our method exploits open-source VLM and open data, we refer to it as Active Learning with Open Resources (ALOR). Additionally, most VLM-based AL methods use prompt tuning (PT) for model adaptation, likely due to its ability to directly utilize pretrained parameters and the assumption that doing so reduces the risk of overfitting to limited labeled data. We rigorously compare popular adaptation approaches, including linear probing (LP), finetuning (FT), and contrastive tuning (CT). We reveal two key findings: (1) All adaptation approaches benefit from incorporating retrieved data, and (2) CT resoundingly outperforms other approaches across AL methods. Further analysis of retrieved data reveals a naturally imbalanced distribution of task-relevant classes, exposing inherent biases within the VLM. This motivates our novel Tail First Sampling (TFS) strategy for AL, an embarrassingly simple yet effective method that prioritizes sampling data from underrepresented classes to label. Extensive experiments demonstrate that our final method, contrastively finetuning VLM on both retrieved and TFS-selected labeled data, significantly outperforms existing methods.
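Tail First Sampling prioritizes labeling examples from classes that are currently under-represented. A minimal sketch of that selection loop is below; it assumes per-example class predictions (`pool_labels`) are available for the unlabeled pool, and the exact criterion used in ALOR may differ.

```python
from collections import Counter

def tail_first_sampling(pool_labels, budget):
    """Repeatedly pick an unlabeled example whose (predicted) class is
    currently least represented among the selections, so under-represented
    classes are sent for labeling first."""
    selected, counts = [], Counter()
    remaining = list(range(len(pool_labels)))
    for _ in range(min(budget, len(remaining))):
        # choose the index whose class has the fewest selections so far
        i = min(remaining, key=lambda j: counts[pool_labels[j]])
        selected.append(i)
        counts[pool_labels[i]] += 1
        remaining.remove(i)
    return selected

# pseudo-labels predicted for the unlabeled pool (assumed input)
pool = ["cat", "cat", "cat", "dog", "dog", "bird"]
print(tail_first_sampling(pool, budget=3))   # picks one example of each class first
```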

Authors:Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi
Title: Data Pruning by Information Maximization
Abstract:
In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models. Code is available at https://github.com/hrtan/InfoMax.
中文:InfoMax是一种新颖的数据剪枝方法,通过将核心集选择构建为离散二次规划问题来最大化样本信息量并最小化冗余,实验证明其在多项任务中均表现出卓越性能。
English: InfoMax is a novel data pruning method that maximizes information content and minimizes redundancy in selected samples by formulating coreset selection as a discrete quadratic programming problem, with experiments showing its superior performance across multiple tasks.
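The InfoMax objective, total importance of the selected samples minus a penalty on pairwise similarity within the selection, can be illustrated with a simple greedy loop. The paper instead solves the discrete quadratic program with a scalable gradient-based solver plus sparsification and partitioning; the `infomax_greedy` sketch below only shows the objective's structure, and `lam` is an assumed trade-off weight.

```python
import numpy as np

def infomax_greedy(scores, sim, k, lam=1.0):
    """Greedy illustration of the InfoMax objective: repeatedly add the sample
    whose importance score minus redundancy (summed similarity to already
    selected samples) is largest."""
    n = len(scores)
    selected, mask = [], np.zeros(n, dtype=bool)
    for _ in range(k):
        redundancy = sim[:, mask].sum(axis=1)                 # similarity to the coreset so far
        gain = np.where(mask, -np.inf, scores - lam * redundancy)
        i = int(gain.argmax())
        selected.append(i)
        mask[i] = True
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
sim = feats @ feats.T / 16.0              # pairwise similarity
scores = rng.random(100)                  # importance scores (e.g., sample difficulty)
print(infomax_greedy(scores, sim, k=10))
```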

Authors:Sang-Eun Lee, Ko Nishino, Shohei Nobuhara
Title: SteerPose: Simultaneous Extrinsic Camera Calibration and Matching from Articulation
Abstract:
Can freely moving humans or animals themselves serve as calibration targets for multi-camera systems while simultaneously estimating their correspondences across views? We humans can solve this problem by mentally rotating the observed 2D poses and aligning them with those in the target views. Inspired by this cognitive ability, we propose SteerPose, a neural network that performs this rotation of 2D poses into another view. By integrating differentiable matching, SteerPose simultaneously performs extrinsic camera calibration and correspondence search within a single unified framework. We also introduce a novel geometric consistency loss that explicitly ensures that the estimated rotation and correspondences result in a valid translation estimation. Experimental results on diverse in-the-wild datasets of humans and animals validate the effectiveness and robustness of the proposed method. Furthermore, we demonstrate that our method can reconstruct the 3D poses of novel animals in multi-camera setups by leveraging off-the-shelf 2D pose estimators and our class-agnostic model.

Authors:Xuan Yu, Dayan Guan, Yanfeng Gu
Title: Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement
Abstract:
Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of Localized Zoom and Self-Refinement. In the Localized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the Self-Refinement step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by Localized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at https://github.com/xavier-yu114/Zoom-Refine
中文: Zoom-Refine是一种无需训练的方法,通过初步响应、定位关键区域和利用细节信息自我优化的协同过程,提升多模态大语言模型对高分辨率图像的精确理解能力。
English: Zoom-Refine is a training-free method that enhances multimodal large language models' ability to interpret high-resolution images by first providing a preliminary response, identifying key regions, and then refining the answer using detailed visual information.
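Because Zoom-Refine is training-free, its control flow is easy to sketch: one pass to get a draft answer and a relevant bounding box, a crop, and a second pass that refines the draft with the crop. In the sketch below, the `mllm(image, prompt)` callable, the prompt wording, and the PIL-style `image.crop` are assumed stand-ins, not the authors' interface.

```python
import re
from PIL import Image

def parse_box(text):
    """Pull the first four numbers out of the model's reply as a box."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return tuple(float(v) for v in nums[:4]) if len(nums) >= 4 else None

def zoom_refine(mllm, image, question):
    # Pass 1 (Localized Zoom): preliminary answer plus the most relevant region.
    draft = mllm(image, question + "\nAlso give the bounding box [x0, y0, x1, y1] "
                                   "of the most task-relevant region.")
    box = parse_box(draft)
    crop = image.crop(box) if box else image     # high-resolution crop of that region
    # Pass 2 (Self-Refinement): re-evaluate the draft using the fine-grained crop.
    return mllm([image, crop],
                "Question: " + question + "\nEarlier answer: " + draft +
                "\nUsing the zoomed-in crop, verify or correct the answer.")

# stub "MLLM" for illustration only; a real model would return free-form text
stub = lambda img, prompt: "Draft answer. Relevant box: [10, 10, 60, 60]"
print(zoom_refine(stub, Image.new("RGB", (224, 224)), "What is in the corner?"))
```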

Authors:Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, Alexander Mathis
Title: EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models
Abstract:
Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at https://github.com/amathislab/EPFL-Smart-Kitchen
中文: EPFL-Smart-Kitchen-30数据集通过同步多模态传感器捕捉烹饪时的人类行为,并设立四项基准测试以推动生态效度行为研究的发展。
English: The EPFL-Smart-Kitchen-30 dataset captures multi-modal human movements during cooking tasks using synchronized sensors to advance behavior understanding through four proposed benchmarks.

Authors:Yuan Gan, Jiaxu Miao, Yunze Wang, Yi Yang
Title: Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation
Abstract:
Advances in talking-head animation based on Latent Diffusion Models (LDM) enable the creation of highly realistic, synchronized videos. These fabricated videos are indistinguishable from real ones, increasing the risk of potential misuse for scams, political manipulation, and misinformation. Hence, addressing these ethical concerns has become a pressing issue in AI security. Recent proactive defense studies focused on countering LDM-based models by adding perturbations to portraits. However, these methods are ineffective at protecting reference portraits from advanced image-to-video animation. The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. First, a nullifying loss is proposed to ignore audio control in talking-head generation. Second, we apply anti-purification loss in LDM to optimize the inverted latent feature to generate robust perturbations. Extensive experiments demonstrate the effectiveness of Silencer in proactively protecting portrait privacy. We hope this work will raise awareness among the AI security community regarding critical ethical issues related to talking-head generation techniques. Code: https://github.com/yuangan/Silencer.
中文: 基于潜在扩散模型的说话头部动画技术进展带来了高度逼真的视频,引发了伦理担忧,为此提出了Silencer这一两阶段方法,通过忽略音频控制和抵抗净化技术来主动保护肖像隐私。
English: Advances in talking-head animation using Latent Diffusion Models create realistic videos that raise ethical concerns, leading to the development of Silencer, a two-stage method that protects portrait privacy by ignoring audio control and resisting purification techniques.

Authors:Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang
Title: HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception
Abstract:
Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: http://yw0208.github.io/hosig
中文: 提出的HOSIG框架通过分层整合场景感知抓取生成、启发式导航和运动扩散,克服现有方法在协调场景导航与精细操作方面的不足,实现了无碰撞、高精度全身运动合成,在实验中展现出优越性能且需极少人工干预。
English: The proposed HOSIG framework synthesizes full-body human interactions by hierarchically integrating scene-aware grasp generation, heuristic navigation, and motion diffusion to overcome limitations in existing methods, achieving superior performance in generating collision-free, dexterous motions with minimal manual intervention.

Authors:Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang
Title: EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation
Abstract:
Building Vision-Language Navigation (VLN) agents that can navigate by following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, making the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.
中文: 本文提出EvolveNav自演进框架,通过链式思维监督微调和自反思后训练两阶段方法,利用大语言模型的推理能力提升视觉语言导航任务的准确性与可解释性。
English: This paper introduces EvolveNav, a self-improving framework that enhances vision-language navigation by training LLMs with formalized chain-of-thought reasoning and iterative self-reflective post-training to boost both accuracy and interpretability.

Authors:Xiaodong Wang, Zhirong Wu, Peixi Peng
Title: LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model
Abstract:
Driving world models are used to simulate futures by video generation based on the condition of the current state and actions. However, current models often suffer from serious error accumulation when predicting the long-term future, which limits their practical application. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are always trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and reasonable long videos due to the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving scenes, we propose a simple distillation method where fine-grained video flows are self-supervised signals for coarse-grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term and temporally coherent videos. On the public benchmark NuScenes, compared with the state-of-the-art front-view model, our model improves FVD by 27% and reduces inference time by 85% for the video task of generating 110+ frames. More videos (including 90s duration) are available at https://Wang-Xiaodong1899.github.io/longdwm/.
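中文: 本文构建了一种长时驾驶世界模型,将世界模型学习分层解耦为大运动学习与双向连续运动学习,并通过以细粒度视频流作为粗粒度流自监督信号的蒸馏方法提升长视频生成的连贯性,在NuScenes基准上将FVD提升27%并将推理时间缩短85%。
English: This paper builds a long-term driving world model by hierarchically decoupling learning into large-motion and bidirectional continuous-motion components and distilling fine-grained video flows into coarse-grained ones as self-supervised signals, improving FVD by 27% and reducing inference time by 85% on NuScenes.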

Authors:Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang
Title: Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment
Abstract:
Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at https://github.com/deep-kaixun/APA.
中文摘要:本文提出APA框架,通过解耦视觉一致性与攻击效果的冲突性偏好,实现扩散模型与对抗性偏好的对齐,在保持图像质量的同时显著提升攻击迁移性。
English Summary: This paper introduces APA, a two-stage framework that aligns diffusion models with adversary preferences by decoupling conflicting visual and attack objectives, achieving superior transferability while preserving image quality.

Authors:Shuyu Yang, Yilun Wang, Yaxiong Wang, Li Zhu, Zhedong Zheng
Title: Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
Abstract:
Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address the aforementioned issues in one go, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via the off-the-shelf LLM (Large Language Model) covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We adopt these texts to guide the video generative model to produce diverse and high-quality videos. Finally, our SVTA involves 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely-used video-text retrieval baselines to comprehensively test our SVTA, revealing SVTA's challenging nature and its effectiveness in evaluating a robust cross-modal retrieval method. SVTA eliminates privacy risks associated with real-world anomaly collection while maintaining realistic scenarios. The dataset demo is available at: [https://svta-mm.github.io/SVTA.github.io/].
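中文: SVTA是首个面向跨模态视频异常检索的大规模合成视频-文本基准,利用大语言模型和视频生成模型构建了涵盖68类异常事件与30类正常活动的41,315个视频及配对描述,在规避真实数据隐私风险的同时为鲁棒的跨模态检索方法提供了具有挑战性的评测。
English: SVTA is the first large-scale synthetic video-text benchmark for cross-modal anomaly retrieval, using LLMs and video generative models to build 41,315 videos with paired captions covering 68 anomaly categories and 30 normal activities, providing a challenging evaluation while avoiding the privacy risks of real-world anomaly collection.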

Authors:Jakob Schmid, Azin Jahedi, Noah Berenguel Senn, Andrés Bruhn
Title: MS-RAFT-3D: A Multi-Scale Architecture for Recurrent Image-Based Scene Flow
Abstract:
Although multi-scale concepts have recently proven useful for recurrent network architectures in the field of optical flow and stereo, they have not been considered for image-based scene flow so far. Hence, based on a single-scale recurrent scene flow backbone, we develop a multi-scale approach that generalizes successful hierarchical ideas from optical flow to image-based scene flow. By considering suitable concepts for the feature and the context encoder, the overall coarse-to-fine framework and the training loss, we succeed in designing a scene flow approach that outperforms the current state of the art on KITTI and Spring by 8.7% (3.89 vs. 4.26) and 65.8% (9.13 vs. 26.71), respectively. Our code is available at https://github.com/cv-stuttgart/MS-RAFT-3D.
中文: 我们提出了一种多尺度循环场景流方法,将光流中的分层理念成功应用于图像场景流,在KITTI和Spring基准测试中以显著降低误差的表现超越了现有最优方法。
English: We introduce a multi-scale recurrent scene flow method that adapts hierarchical concepts from optical flow, achieving state-of-the-art performance on KITTI and Spring benchmarks with significant error reductions.

Authors:Chenxi Xie, Minghan Li, Shuai Li, Yuhui Wu, Qiaosi Yi, Lei Zhang
Title: DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing
Abstract:
Leveraging the powerful generation capability of large-scale pretrained text-to-image models, training-free methods have demonstrated impressive image editing results. Conventional diffusion-based methods, as well as recent rectified flow (RF)-based methods, typically reverse synthesis trajectories by gradually adding noise to clean images, during which the noisy latent at the current timestep is used to approximate that at the next timesteps, introducing accumulated drift and degrading reconstruction accuracy. Considering the fact that in RF the noisy latent is estimated through direct interpolation between Gaussian noises and clean images at each timestep, we propose Direct Noise Alignment (DNA), which directly refines the desired Gaussian noise in the noise domain, significantly reducing the error accumulation in previous methods. Specifically, DNA estimates the velocity field of the interpolated noised latent at each timestep and adjusts the Gaussian noise by computing the difference between the predicted and expected velocity field. We validate the effectiveness of DNA and reveal its relationship with existing RF-based inversion methods. Additionally, we introduce a Mobile Velocity Guidance (MVG) to control the target prompt-guided generation process, balancing image background preservation and target object editability. DNA and MVG collectively constitute our proposed method, namely DNAEdit. Finally, we introduce DNA-Bench, a long-prompt benchmark, to evaluate the performance of advanced image editing models. Experimental results demonstrate that our DNAEdit achieves superior performance to state-of-the-art text-guided editing methods. Codes and benchmark will be available at https://xiechenxi99.github.io/DNAEdit/.
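中文: DNAEdit将直接噪声对齐(DNA)与移动速度引导(MVG)相结合:前者在噪声域中直接优化高斯噪声以减少整流流反演中的误差累积,后者用于平衡背景保持与目标对象的可编辑性,并在新提出的长提示基准DNA-Bench上取得了优于现有文本引导编辑方法的表现。
English: DNAEdit combines Direct Noise Alignment, which refines the Gaussian noise directly in the noise domain to reduce error accumulation in rectified-flow inversion, with Mobile Velocity Guidance to balance background preservation and target-object editability, outperforming state-of-the-art text-guided editing methods on the proposed long-prompt benchmark DNA-Bench.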

Authors:Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre
Title: SEMNAV: A Semantic Segmentation-Driven Approach to Visual Semantic Navigation
Abstract:
Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce a newly curated dataset, i.e. the SEMNAV dataset, designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. We release SEMNAV dataset, code and trained models at https://github.com/gramuah/semnav
Chinese: SEMNAV是一种新颖的视觉语义导航方法,利用语义分割作为视觉输入来增强感知和决策能力,通过提升泛化性和缩小仿真与现实差距,在模拟和真实环境中均实现了卓越性能。
English: SEMNAV is a novel visual semantic navigation approach that uses semantic segmentation as visual input to enhance perception and decision-making, achieving superior performance in both simulation and real-world settings by improving generalization and bridging the sim-to-real gap.

Authors:Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
Title: Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
Abstract:
Existing large language models (LLMs) face challenges in following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF. Codes and data are available at https://github.com/yuleiqin/RAIF. Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
中文: 现有大语言模型在处理包含多重约束的复杂指令时存在困难,而提出的RAIF方法通过基于可验证奖励的强化学习来增强推理能力,显著提升了小规模模型的性能表现。
English: Current large language models struggle with complex instructions containing multiple constraints, but the proposed RAIF method uses reinforcement learning with verifiable rewards to enhance reasoning capabilities, significantly improving performance even in smaller models.

Authors:Minjeong Park, Hongbeen Park, Jinkyu Kim
Title: ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition
Abstract:
The Pedestrian Attribute Recognition (PAR) task aims to identify various detailed attributes of an individual, such as clothing, accessories, and gender. To enhance PAR performance, a model must capture features ranging from coarse-grained global attributes (e.g., for identifying gender) to fine-grained local details (e.g., for recognizing accessories) that may appear in diverse regions. Recent research suggests that body part representation can enhance the model's robustness and accuracy, but these methods are often restricted to attribute classes within fixed horizontal regions, leading to degraded performance when attributes appear in varying or unexpected body locations. In this paper, we propose Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition, dubbed as ViTA-PAR, to enhance attribute recognition through specialized multimodal prompting and vision-language alignment. We introduce visual attribute prompts that capture global-to-local semantics, enabling diverse attribute representations. To enrich textual embeddings, we design a learnable prompt template, termed person and attribute context prompting, to learn person and attributes context. Finally, we align visual and textual attribute features for effective fusion. ViTA-PAR is validated on four PAR benchmarks, achieving competitive performance with efficient inference. We release our code and model at https://github.com/mlnjeongpark/ViTA-PAR.
中文:ViTA-PAR模型通过多模态提示和视觉-文本属性对齐技术,结合全局到局部的语义捕捉与上下文感知特征融合,在四个基准测试中实现了优越的行人属性识别性能。
English: The proposed ViTA-PAR model enhances pedestrian attribute recognition by integrating visual and textual attribute alignment with multimodal prompting, achieving competitive performance across four benchmarks through global-to-local semantic capture and context-aware feature fusion.

Authors:Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun
Title: AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning
Abstract:
The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lacks semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize in unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooking the growing diversity of non-English applications such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching 96.9% Type-Match and 91.3% Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoint, and evaluation data.
中文: 大型语言模型代理在自动化图形界面任务方面展现出潜力,但面临训练数据噪声大、泛化能力差等挑战;AgentCPM-GUI通过增强训练流程和精简动作空间,在多个基准测试中取得了领先性能。
English: Large language model agents show promise for automating GUI tasks, yet face challenges like noisy training data and limited generalization, which AgentCPM-GUI addresses through a robust training pipeline and compact action space to achieve top performance on benchmarks.

Authors:Pou-Chun Kung, Skanda Harisha, Ram Vasudevan, Aline Eid, Katherine A. Skinner
Title: RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes
Abstract:
High-Fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.4 PSNR / 2.6x SSIM) and improved geometric reconstruction (-40% RMSE / 1.5x Accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction. A project page is available at https://umautobots.github.io/radarsplat.
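中文: RadarSplat将高斯泼溅与新颖的雷达噪声建模相结合,实现逼真的雷达数据合成与更优的三维重建,在雷达图像合成(+3.4 PSNR / 2.6倍SSIM)和几何重建(RMSE降低40% / 精度提升1.5倍)方面均显著超越现有最优方法。
English: RadarSplat integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and improved 3D reconstruction, outperforming the state of the art in both radar image synthesis (+3.4 PSNR / 2.6x SSIM) and geometric reconstruction (-40% RMSE / 1.5x accuracy).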

Authors:Tomasz Stanczyk, Seongro Yoon, Francois Bremond
Title: No Train Yet Gain: Towards Generic Multi-Object Tracking in Sports and Beyond
Abstract:
Multi-object tracking (MOT) is essential for sports analytics, enabling performance evaluation and tactical insights. However, tracking in sports is challenging due to fast movements, occlusions, and camera shifts. Traditional tracking-by-detection methods require extensive tuning, while segmentation-based approaches struggle with track processing. We propose McByte, a tracking-by-detection framework that integrates a temporally propagated segmentation mask as an association cue to improve robustness without per-video tuning. Unlike many existing methods, McByte does not require training, relying solely on pre-trained models and object detectors commonly used in the community. Evaluated on SportsMOT, DanceTrack, SoccerNet-tracking 2022 and MOT17, McByte demonstrates strong performance across sports and general pedestrian tracking. Our results highlight the benefits of mask propagation for a more adaptable and generalizable MOT approach. Code will be made available at https://github.com/tstanczyk95/McByte.
中文:McByte是一种无需训练的多目标跟踪框架,通过整合传播的分割掩码作为关联线索来增强鲁棒性,在多种体育和行人跟踪数据集上表现出色,且无需针对每个视频进行调整。
English: McByte is a training-free multi-object tracking framework that enhances robustness by integrating propagated segmentation masks for association, demonstrating strong performance across various sports and pedestrian tracking datasets without requiring per-video tuning.
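To make the mask-as-association-cue idea concrete, here is a minimal sketch that blends box IoU with the overlap between each track's propagated mask and a detection box, then solves the assignment with the Hungarian algorithm; the weighting w_mask and the mask-propagation model that would produce track_masks are assumptions, not McByte's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def mask_in_box(mask, box):
    """Fraction of a track's propagated binary mask that falls inside a detection box."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    inside = mask[y1:y2, x1:x2].sum()
    return inside / (mask.sum() + 1e-9)

def associate(track_boxes, track_masks, det_boxes, w_mask=0.5):
    """Combine box IoU with the propagated-mask cue into one assignment cost matrix."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, tm) in enumerate(zip(track_boxes, track_masks)):
        for j, db in enumerate(det_boxes):
            score = (1 - w_mask) * box_iou(tb, db) + w_mask * mask_in_box(tm, db)
            cost[i, j] = 1.0 - score
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```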

Authors:Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
Title: Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation
Abstract:
Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (e.g., Flux-12B). The source code is publicly available at https://github.com/zhang0jhon/diffusion-4k.
中文: 本文提出了包含GPT-4o生成描述的Aesthetic-4K数据集和采用SC-VAE与WLF技术的Diffusion-4K框架,能够直接生成超高清图像,并通过新指标验证了其在大规模模型支持下取得的优异合成效果。
English: This paper introduces Aesthetic-4K, a curated dataset with GPT-4o captions, and Diffusion-4K, a framework using SC-VAE and WLF for direct ultra-high-resolution image synthesis, validated by novel metrics and achieving impressive results with large-scale models.
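The abstract introduces a GLCM Score for local texture richness without spelling out its formula; as a hedged illustration, the sketch below computes a generic GLCM-based contrast statistic over local patches with scikit-image, which conveys the flavor of such a metric but is not the paper's exact definition.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def patch_texture_score(gray_u8: np.ndarray, patch: int = 256) -> float:
    """Average GLCM contrast over non-overlapping patches of a grayscale uint8 image.

    A generic texture-richness proxy; the paper's GLCM Score may weight
    properties, distances, angles, or patch sizes differently.
    """
    h, w = gray_u8.shape
    scores = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            p = gray_u8[y:y + patch, x:x + patch]
            glcm = graycomatrix(p, distances=[1], angles=[0, np.pi / 2],
                                levels=256, symmetric=True, normed=True)
            scores.append(graycoprops(glcm, "contrast").mean())
    return float(np.mean(scores))

# Example on random data (a real 4K image would be loaded and converted to grayscale):
img = (np.random.rand(512, 512) * 255).astype(np.uint8)
print(patch_texture_score(img))
```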

Authors:Haiyang Mei, Pengyu Zhang, Mike Zheng Shou
Title: SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
Abstract:
Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce three key innovations: (i) an image-to-video feature extraction upgrader built upon SAM's static image encoder to enable spatiotemporal video perception, (ii) a memory filtering strategy that selects the most relevant past frames for more effective utilization of historical information, and (iii) a memory-as-prompt mechanism leveraging object memory to ensure temporally consistent mask propagation in dynamic scenes. Comprehensive experiments demonstrate that our method achieves over 90% of SAM 2's performance while using only 0.2% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader applications and advancements in the field. Code and model are available at: https://github.com/showlab/SAM-I2V.
中文: SAM-I2V 提出了一种资源高效的图像到视频升级方法,可将预训练的 SAM 模型转化为可提示视频分割模型,仅用 0.2% 的训练成本即可实现接近顶尖水平的性能。
English: SAM-I2V offers a resource-efficient method to upgrade the Segment Anything Model for promptable video segmentation, achieving near state-of-the-art performance with drastically reduced training costs.
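As an illustration of the memory filtering idea (keeping only the most relevant past frames), the sketch below ranks memory entries by cosine similarity to the current frame embedding and keeps the top-k; the similarity measure and memory format are assumptions, not SAM-I2V's actual design.

```python
import torch
import torch.nn.functional as F

def filter_memory(current_feat: torch.Tensor,
                  memory_feats: torch.Tensor,
                  keep: int = 4) -> torch.Tensor:
    """Return indices of the `keep` past frames most similar to the current frame.

    current_feat: (D,) embedding of the current frame.
    memory_feats: (T, D) embeddings of past frames.
    Cosine similarity as the relevance measure is an illustrative assumption.
    """
    sims = F.cosine_similarity(memory_feats, current_feat.unsqueeze(0), dim=-1)
    keep = min(keep, memory_feats.shape[0])
    return torch.topk(sims, k=keep).indices

# Toy usage:
mem = torch.randn(10, 256)
cur = torch.randn(256)
print(filter_memory(cur, mem))
```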

Authors:Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Min Zhang, Wen Zhang, Huajun Chen
Title: Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation
Abstract:
Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. This benchmark leverages multi-modal knowledge graphs to synthesize images encapsulating subgraph architectures enriched with multi-modal entities. M3STR necessitates that MLLMs not only recognize the multi-modal entities within the visual inputs but also decipher intricate relational topologies among them. We delineate the benchmark's statistical profiles and automated construction pipeline, accompanied by an extensive empirical analysis of 26 state-of-the-art MLLMs. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities. Our code and data are released at https://github.com/zjukg/M3STR
中文: M3STR是一种新型基准测试,旨在评估多模态大语言模型理解视觉形式结构化知识的能力,尽管对26种先进模型进行了广泛测试,结果仍显示其推理能力存在显著不足。
English: M3STR is a novel benchmark designed to evaluate multi-modal large language models' ability to comprehend structured knowledge in visual form, revealing significant gaps in their reasoning capacities despite extensive testing of 26 advanced models.

Authors:Yufei Zhan, Ziheng Wu, Yousong Zhu, Rongkun Xue, Ruipu Luo, Zhenghao Chen, Can Zhang, Yifan Li, Zhentao He, Zheming Yang, Ming Tang, Minghui Qiu, Jinqiao Wang
Title: GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
Abstract:
Despite notable advancements in multimodal reasoning, leading Multimodal Large Language Models (MLLMs) still underperform on vision-centric multimodal reasoning tasks in general scenarios. This shortfall stems from their predominant reliance on logic- and knowledge-based slow thinking strategies, which, while effective for domains like math and science, fail to integrate visual information effectively during reasoning. Consequently, these models often fail to adequately ground visual cues, resulting in suboptimal performance in tasks that require multiple plausible visual interpretations and inferences. To address this, we present GThinker (General Thinker), a novel reasoning MLLM excelling in multimodal reasoning across general scenarios, mathematics, and science. GThinker introduces Cue-Rethinking, a flexible reasoning pattern that grounds inferences in visual cues and iteratively reinterprets these cues to resolve inconsistencies. Building on this pattern, we further propose a two-stage training pipeline, including pattern-guided cold start and incentive reinforcement learning, designed to enable multimodal reasoning capabilities across domains. Furthermore, to support the training, we construct GThinker-11K, comprising 7K high-quality, iteratively-annotated reasoning paths and 4K curated reinforcement learning samples, filling the data gap toward general multimodal reasoning. Extensive experiments demonstrate that GThinker achieves 81.5% on the challenging comprehensive multimodal reasoning benchmark M3CoT, surpassing the latest O4-mini model. It also shows an average improvement of 2.1% on general scenario multimodal reasoning benchmarks, while maintaining on-par performance in mathematical reasoning compared to counterpart advanced reasoning models. The code, model, and data will be released soon at https://github.com/jefferyZhan/GThinker.
中文: GThinker是一种新型多模态推理模型,通过引入线索重思机制有效整合视觉信息,在通用场景中表现卓越,同时在数学和科学领域保持强劲性能。
English: GThinker is a novel multimodal reasoning model that introduces Cue-Rethinking to effectively integrate visual cues, achieving superior performance in general scenarios and maintaining strong results in mathematics and science.

Authors:Yudong Zhang, Ruobing Xie, Yiqing Huang, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Di Wang, Yu Wang
Title: Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs
Abstract:
Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive "fighting fire with fire" strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversarial examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free and straightforward to implement, and exhibits significant computational efficiency improvements compared to existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code is available at https://github.com/btzyd/F3.
中文: F3框架采用“以火攻火”的创新策略,通过有意引入噪声扰动来净化大型视觉语言模型中的对抗样本,提供了一种无需训练、高效且适用于工业应用的解决方案。
English: The F3 framework introduces a novel "fighting fire with fire" strategy that uses intentional noise perturbations to purify adversarial examples in large vision-language models, offering a training-free, efficient solution suitable for industrial applications.

Authors:Dahyeon Kye, Changhyun Roh, Sukhun Ko, Chanho Eom, Jihyong Oh
Title: AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation
Abstract:
Video Frame Interpolation (VFI) is a fundamental Low-Level Vision (LLV) task that synthesizes intermediate frames between existing ones while maintaining spatial and temporal coherence. VFI techniques have evolved from classical motion compensation-based approaches to deep learning-based approaches, including kernel-, flow-, hybrid-, phase-, GAN-, Transformer-, Mamba-, and more recently diffusion model-based approaches. We introduce AceVFI, the most comprehensive survey on VFI to date, covering more than 250 papers across these approaches. We systematically organize and describe VFI methodologies, detailing the core principles, design assumptions, and technical characteristics of each approach. We categorize the learning paradigms of VFI methods, namely Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI). We analyze key challenges of VFI such as large motion, occlusion, lighting variation, and non-linear motion. In addition, we review standard datasets, loss functions, and evaluation metrics. We examine applications of VFI including event-based, cartoon, and medical image VFI, as well as joint VFI with other LLV tasks. We conclude by outlining promising future research directions to support continued progress in the field. This survey aims to serve as a unified reference for both newcomers and experts seeking a deep understanding of modern VFI landscapes.
Chinese: AceVFI 是迄今为止最全面的视频帧插值综述,系统梳理了各类方法、关键挑战、数据集及应用,并展望了未来研究方向,旨在为该领域的新手和专家提供统一的参考指南。
English: AceVFI is the most comprehensive survey on Video Frame Interpolation to date, systematically organizing methodologies, challenges, datasets, and applications while outlining future research directions to serve as a unified reference for the field.

Authors:Zuzheng Kuang, Haixia Bi, Chen Xu, Jian Sun
Title: ECP-Mamba: An Efficient Multi-scale Self-supervised Contrastive Learning Method with State Space Model for PolSAR Image Classification
Abstract:
Recently, polarimetric synthetic aperture radar (PolSAR) image classification has been greatly promoted by deep neural networks. However, current deep learning-based PolSAR classification methods encounter difficulties due to their dependence on extensive labeled data and the computational inefficiency of architectures like Transformers. This paper presents ECP-Mamba, an efficient framework integrating multi-scale self-supervised contrastive learning with a state space model (SSM) backbone. Specifically, ECP-Mamba addresses annotation scarcity through a multi-scale predictive pretext task based on local-to-global feature correspondences, which uses a simplified self-distillation paradigm without negative sample pairs. To enhance computational efficiency, the Mamba architecture (a selective SSM) is first tailored to the pixel-wise PolSAR classification task by designing a spiral scan strategy. This strategy prioritizes causally relevant features near the central pixel, leveraging the localized nature of pixel-wise classification tasks. Additionally, the lightweight Cross Mamba module is proposed to facilitate complementary multi-scale feature interaction with minimal overhead. Extensive experiments across four benchmark datasets demonstrate ECP-Mamba's effectiveness in balancing high accuracy with resource efficiency. On the Flevoland 1989 dataset, ECP-Mamba achieves state-of-the-art performance with an overall accuracy of 99.70%, an average accuracy of 99.64%, and a Kappa coefficient of 0.9962. Our code will be available at https://github.com/HaixiaBi1982/ECP_Mamba.
Chinese: 本文提出ECP-Mamba框架,通过多尺度自监督对比学习和状态空间模型,有效解决了极化SAR图像分类中标注数据稀缺和计算效率低的问题,实现了最先进的分类精度。
English: This paper introduces ECP-Mamba, an efficient framework for PolSAR image classification that combines multi-scale self-supervised contrastive learning with a state space model backbone to overcome data annotation scarcity and computational inefficiency, achieving state-of-the-art accuracy.
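The spiral scan strategy orders tokens so that pixels near the patch center come first; the sketch below generates one plausible outward spiral ordering for a square patch, purely as an illustration of the idea (the handedness and exact traversal used in ECP-Mamba may differ).

```python
import numpy as np

def spiral_order(size: int):
    """Return (row, col) coordinates of a square patch in an outward spiral
    starting at the center pixel, so tokens near the center come first."""
    c = size // 2
    coords = [(c, c)]
    r, col = c, c
    step = 1
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    d = 0
    while len(coords) < size * size:
        for _ in range(2):            # two legs per step length (1,1,2,2,3,3,...)
            dr, dc = directions[d % 4]
            for _ in range(step):
                r, col = r + dr, col + dc
                if 0 <= r < size and 0 <= col < size:
                    coords.append((r, col))
            d += 1
        step += 1
    return coords[: size * size]

# Flatten a (size x size x C) patch into a center-first token sequence:
patch = np.random.rand(5, 5, 8)
order = spiral_order(5)
tokens = np.stack([patch[r, c] for r, c in order])   # shape (25, 8)
print(tokens.shape)
```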

Authors:Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro
Title: AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
Abstract:
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2's decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, the audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to also mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over the previous methods in the field. Code is available at https://github.com/yyliu01/AuralSAM2.
Chinese: AuralSAM2 提出外部连接的 AuralFuser 模块,通过特征金字塔和对比学习融合多模态特征,有效提升 SAM2 对发声物体的分割能力,在公开基准测试中显著优于现有方法。
English: AuralSAM2 introduces an external AuralFuser module that integrates multimodal features through a feature pyramid and contrastive learning to enhance SAM2's segmentation of sounding objects, achieving significant performance gains over existing methods.
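The audio-guided contrastive learning component aligns audio and visual representations; as a hedged sketch, the function below implements a generic symmetric InfoNCE loss over paired audio/visual embeddings, which illustrates the alignment objective but not the paper's exact loss or its bias-mitigation details.

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb: torch.Tensor,
                         visual_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired audio/visual embeddings (B, D).

    A generic audio-visual alignment loss; the exact loss in AuralSAM2 may differ.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # (B, B), diagonal = positive pairs
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage:
loss = audio_visual_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```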

Authors:Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Title: Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts
Abstract:
Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming. Our code is available at https://github.com/tmlr-group/DecoupledVP.
Chinese: 本文提出了一种解耦与重加权框架,通过分组描述并使用概率重加权矩阵整合解耦视觉提示(DVP)的输出,解决了现有视觉重编程方法的局限性,在11个数据集上提升了性能并增强了可解释性。
English: This paper introduces a decoupling-and-reweighting framework with decoupled visual prompts (DVP) that address limitations in existing visual reprogramming methods by grouping descriptions and integrating outputs via a probabilistic reweighting matrix, improving performance and interpretability across 11 datasets.

Authors:Kinam Kim, Junha Hyung, Jaegul Choo
Title: Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models
Abstract:
Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/
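中文: TIC-FT通过在时间轴上拼接条件帧与目标帧,并插入噪声逐渐增大的缓冲帧以对齐预训练模型的时间动态,使视频扩散模型无需修改结构即可适配多种条件生成任务,仅需10-30个训练样本即可在条件保真度和视觉质量上优于现有基线。
English: TIC-FT adapts pretrained video diffusion models to diverse conditional generation tasks by concatenating condition and target frames along the temporal axis and inserting buffer frames with progressively increasing noise, requiring no architectural changes and as few as 10-30 training samples while outperforming existing baselines in condition fidelity and visual quality.

As a rough illustration of the temporal in-context layout, the sketch below concatenates condition frames, buffer frames with a linearly increasing noise level, and target frames along the time axis; the linear noise ramp and the use of the last condition frame as the buffer base are assumptions, not the paper's exact schedule.

```python
import torch

def build_tic_sequence(cond: torch.Tensor, target: torch.Tensor,
                       n_buffer: int = 4) -> torch.Tensor:
    """Concatenate condition and target frames along time, with buffer frames
    whose noise level increases progressively from condition toward target.

    cond, target: (T, C, H, W) frame tensors. The linear noise ramp and the use
    of the last condition frame as the buffer base are illustrative assumptions.
    """
    base = cond[-1:]                              # start buffers from the last condition frame
    buffers = []
    for i in range(1, n_buffer + 1):
        sigma = i / (n_buffer + 1)                # progressively increasing noise level
        noisy = (1 - sigma) * base + sigma * torch.randn_like(base)
        buffers.append(noisy)
    return torch.cat([cond, *buffers, target], dim=0)

# Toy usage: 8 condition frames + 4 buffers + 16 target frames -> 28 frames
seq = build_tic_sequence(torch.randn(8, 3, 64, 64), torch.randn(16, 3, 64, 64))
print(seq.shape)   # torch.Size([28, 3, 64, 64])
```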

Authors:Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu
Title: FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
Abstract:
Long-form video understanding poses a significant challenge for video large language models (VideoLLMs) due to prohibitively high computational and memory demands. In this paper, we propose FlexSelect, a flexible and efficient token selection strategy for processing long videos. FlexSelect identifies and retains the most semantically relevant content by leveraging cross-modal attention patterns from a reference transformer layer. It comprises two key components: (1) a training-free token ranking pipeline that leverages faithful cross-modal attention weights to estimate each video token's importance, and (2) a rank-supervised lightweight selector that is trained to replicate these rankings and filter redundant tokens. This generic approach can be seamlessly integrated into various VideoLLM architectures, such as LLaVA-Video, InternVL and Qwen-VL, serving as a plug-and-play module to extend their temporal context length. Empirically, FlexSelect delivers strong gains across multiple long-video benchmarks including VideoMME, MLVU, LongVB, and LVBench. Moreover, it achieves significant speed-ups (for example, up to 9 times on a LLaVA-Video-7B model), highlighting FlexSelect's promise for efficient long-form video understanding. Project page available at: https://yunzhuzhang0918.github.io/flex_select
中文摘要:FlexSelect是一种高效的令牌选择策略,通过跨模态注意力识别并保留语义相关内容,显著提升长视频理解能力,在多种VideoLLM架构中实现计算加速和性能提升。
English Summary: FlexSelect is an efficient token selection strategy that enhances long-form video understanding by identifying and retaining semantically relevant content through cross-modal attention, enabling significant computational speed-ups and improved performance across various VideoLLM architectures.
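As a hedged sketch of the training-free token ranking pipeline, the function below scores video tokens by cross-modal attention from text queries taken from a reference layer and keeps the top fraction; averaging attention over text tokens and the keep ratio are illustrative choices, not FlexSelect's exact procedure.

```python
import torch

def select_video_tokens(text_q: torch.Tensor, video_k: torch.Tensor,
                        video_tokens: torch.Tensor, keep_ratio: float = 0.25):
    """Rank video tokens by cross-modal attention from text queries and keep the top share.

    text_q:  (Nt, D) query projections of text tokens from a reference layer.
    video_k: (Nv, D) key projections of video tokens from the same layer.
    video_tokens: (Nv, D) the tokens to be filtered.
    Averaging attention over text queries is an illustrative choice.
    """
    d = text_q.shape[-1]
    attn = torch.softmax(text_q @ video_k.t() / d ** 0.5, dim=-1)  # (Nt, Nv)
    importance = attn.mean(dim=0)                                  # (Nv,)
    k = max(1, int(keep_ratio * video_tokens.shape[0]))
    idx = torch.topk(importance, k).indices.sort().values          # keep temporal order
    return video_tokens[idx], idx

# Toy usage:
kept, idx = select_video_tokens(torch.randn(16, 64), torch.randn(1024, 64),
                                torch.randn(1024, 64))
print(kept.shape, idx.shape)
```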

Authors:Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, Guangtao Zhai
Title: GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs
Abstract:
The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities concerning fine-grained physical principles, especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curate high-quality prompts of geometric optics scenarios and use MLLMs to construct the GOBench-Gen-1k dataset. We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test the optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM, Gemini-2.5Pro, attains a mere 37.35% accuracy in optical understanding. Database and codes are publicly available at https://github.com/aiben-ch/GOBench.
中文: 该研究提出了GOBench基准测试,发现当前多模态大语言模型在生成光学真实图像和理解几何光学方面存在明显不足,顶尖模型无法完美完成所有生成任务,且光学理解准确率仅为37.35%。
English: The study introduces GOBench, a benchmark revealing that current Multi-modality Large Language Models struggle with generating optically accurate images and understanding geometric optics, with top models achieving imperfect generation and only 37.35% accuracy in comprehension.

Authors:Zhan Li, Mingyu Zhao, Xin Dong, Haibin Ling, Bingyao Huang
Title: CAPAA: Classifier-Agnostic Projector-Based Adversarial Attack
Abstract:
Projector-based adversarial attack aims to project carefully designed light patterns (i.e., adversarial projections) onto scenes to deceive deep image classifiers. It has potential applications in privacy protection and the development of more robust classifiers. However, existing approaches primarily focus on individual classifiers and fixed camera poses, often neglecting the complexities of multi-classifier systems and scenarios with varying camera poses. This limitation reduces their effectiveness when introducing new classifiers or camera poses. In this paper, we introduce Classifier-Agnostic Projector-Based Adversarial Attack (CAPAA) to address these issues. First, we develop a novel classifier-agnostic adversarial loss and optimization framework that aggregates adversarial and stealthiness loss gradients from multiple classifiers. Then, we propose an attention-based gradient weighting mechanism that concentrates perturbations on regions of high classification activation, thereby improving the robustness of adversarial projections when applied to scenes with varying camera poses. Our extensive experimental evaluations demonstrate that CAPAA achieves both a higher attack success rate and greater stealthiness compared to existing baselines. Codes are available at: https://github.com/ZhanLiQxQ/CAPAA.
中文:本文提出CAPAA,一种与分类器无关的投影式对抗攻击方法,通过新颖的优化框架和基于注意力的梯度加权机制,提升了对多分类器和不同相机位姿的鲁棒性,实现了更高的攻击成功率与隐蔽性。
English: This paper introduces CAPAA, a classifier-agnostic projector-based adversarial attack that enhances robustness across multiple classifiers and varying camera poses through a novel optimization framework and attention-based gradient weighting, achieving superior attack success and stealthiness.
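To illustrate the classifier-agnostic aggregation, the sketch below sums an adversarial loss over several classifiers plus a simple stealthiness term and takes one signed-gradient step; the render stand-in for projector-camera image formation, the loss weights, and the omission of the attention-based gradient weighting are all simplifying assumptions rather than CAPAA's actual pipeline.

```python
import torch
import torch.nn.functional as F

def capaa_style_step(patch: torch.Tensor, render, classifiers, target_class: int,
                     x_clean: torch.Tensor, lr: float = 0.01, w_stealth: float = 1.0):
    """One illustrative optimization step aggregating losses over several classifiers.

    `render(patch)` is a hypothetical differentiable stand-in for projector-camera
    image formation; the stealthiness term and weights are simplifying assumptions.
    """
    patch = patch.clone().requires_grad_(True)
    x_adv = render(patch)
    loss = 0.0
    for clf in classifiers:                     # classifier-agnostic: sum over all models
        logits = clf(x_adv)
        # push every classifier's prediction toward a chosen target class
        loss = loss + F.cross_entropy(logits, torch.tensor([target_class]))
    loss = loss + w_stealth * F.mse_loss(x_adv, x_clean)   # keep the projection inconspicuous
    loss.backward()
    with torch.no_grad():
        return patch - lr * patch.grad.sign()

# Toy usage with identity "rendering" and two linear classifiers on flattened images:
render = lambda p: p
clfs = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
        for _ in range(2)]
x = torch.rand(1, 3, 32, 32)
print(capaa_style_step(x.clone(), render, clfs, target_class=0, x_clean=x).shape)
```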

Authors:Youngmin Kim, Jiwan Chung, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, Youngjae Yu
Title: Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
Abstract:
Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS dataset, we validate its substantial scale and high effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal cues corresponding to conversational input.
中文: MARS是一种多模态语言模型,通过结合文本与表情、肢体语言等非语言线索,并利用VENUS数据集训练,实现了更沉浸式的对话AI体验。
English: MARS is a multimodal language model that integrates nonverbal cues like facial expressions and body language with text, using the VENUS dataset to enable more immersive conversational AI experiences.

Authors:Geonu Lee, Yujeong Oh, Geonhui Jang, Soyoung Lee, Jeonghyo Song, Sungmin Cha, YoungJoon Yoo
Title: Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection
Abstract:
In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning with expanded quantity, we propose a novel scenario that measures zero-shot generalization to unseen classes, i.e., those not observed during continual adaptation. This poses a new problem setting in which continual adaptation is also expected to enhance zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection and maintains strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code at https://github.com/Continual-Mega/Continual-Mega.
中文: 本文提出了持续学习异常检测的新基准Continual-MEGA,包含多样化数据集和零样本泛化新场景,评估表明现有方法仍需改进且所提方法优于先前方案。
English: This paper introduces Continual-MEGA, a new benchmark for continual learning in anomaly detection that includes a diverse dataset and a novel scenario for zero-shot generalization, with evaluations showing existing methods need improvement and the proposed approach outperforms prior ones.

Authors:Sa Zhu, Huashan Chen, Wanqian Zhang, Jinchao Zhang, Zexian Yang, Xiaoshuai Hao, Bo Li
Title: Uneven Event Modeling for Partially Relevant Video Retrieval
Abstract:
Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments, wherein event modeling is crucial for partitioning the video into smaller temporal events that partially correspond to the text. Previous methods typically segment videos into a fixed number of equal-length clips, resulting in ambiguous event boundaries. Additionally, they rely on mean pooling to compute event representations, inevitably introducing undesired misalignment. To address these, we propose an Uneven Event Modeling (UEM) framework for PRVR. We first introduce the Progressive-Grouped Video Segmentation (PGVS) module to iteratively formulate events in light of both temporal dependencies and semantic similarity between consecutive frames, enabling clear event boundaries. Furthermore, we propose the Context-Aware Event Refinement (CAER) module to refine the event representation conditioned on cross-attention with the text. This enables event representations to focus on the most relevant frames for a given text, facilitating more precise text-video alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two PRVR benchmarks. Code is available at https://github.com/Sasa77777779/UEM.git.
中文摘要:本研究提出的非均匀事件建模(UEM)框架通过渐进式视频分割明确事件边界,并利用上下文感知优化实现精准的文本-视频对齐,在部分相关视频检索任务中取得了最优性能。
English Summary: The proposed Uneven Event Modeling (UEM) framework addresses limitations in partially relevant video retrieval by introducing progressive video segmentation for clear event boundaries and context-aware refinement for precise text-video alignment, achieving state-of-the-art results.
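A minimal sketch of the progressive grouping idea: consecutive frames are merged into an event while they remain similar to the running event centroid, and a new event starts when similarity drops. The threshold `tau` and the centroid update rule are illustrative assumptions rather than the exact PGVS procedure.

```python
import torch
import torch.nn.functional as F

def progressive_group_frames(frame_feats: torch.Tensor, tau: float = 0.8):
    """Group consecutive frames into events by feature similarity.

    frame_feats: (T, D) per-frame features.
    Returns a list of (start, end) index ranges, one per event.
    """
    feats = F.normalize(frame_feats, dim=-1)
    events, start = [], 0
    centroid = feats[0]
    for t in range(1, feats.shape[0]):
        sim = torch.dot(centroid, feats[t])
        if sim < tau:                                   # similarity drop -> event boundary
            events.append((start, t - 1))
            start, centroid = t, feats[t]
        else:                                           # extend the event, update its centroid
            centroid = F.normalize(centroid + feats[t], dim=-1)
    events.append((start, feats.shape[0] - 1))
    return events
```

Because boundaries follow the actual similarity structure, events can have uneven lengths instead of being fixed equal-length clips.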

Authors:Ryo Fujii, Hideo Saito, Ryo Hachiuma
Title: Towards Predicting Any Human Trajectory In Context
Abstract:
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, this process is often impractical on edge devices due to constrained computational resources. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables rapid adaptation without fine-tuning on the scenario-specific data. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. The code will be released at https://fujiry0.github.io/TrajICL-project-page.
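An illustrative sketch of spatio-temporal example selection: candidate trajectories from the same scene are scored by how closely their motion patterns and locations match the query, and the top-k are used as in-context examples. The distance choices and the `loc_weight` factor are assumptions, not the exact STES formulation.

```python
import numpy as np

def select_icl_examples(query_past, bank_pasts, bank_positions, query_position,
                        k=4, loc_weight=0.5):
    """Pick k in-context examples by spatio-temporal similarity (illustrative).

    query_past:     (T, 2) observed query trajectory.
    bank_pasts:     (N, T, 2) previously observed trajectories in the scene.
    bank_positions: (N, 2) last observed location of each bank trajectory.
    query_position: (2,)  last observed location of the query.
    """
    # motion-pattern similarity: compare per-step displacement vectors
    q_motion = np.diff(query_past, axis=0)                     # (T-1, 2)
    b_motion = np.diff(bank_pasts, axis=1)                     # (N, T-1, 2)
    motion_dist = np.linalg.norm(b_motion - q_motion, axis=-1).mean(axis=-1)

    # spatial proximity: distance between last observed locations
    loc_dist = np.linalg.norm(bank_positions - query_position, axis=-1)

    score = motion_dist + loc_weight * loc_dist                # lower is better
    return np.argsort(score)[:k]
```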

Authors:Parul Gupta, Shreya Ghosh, Tom Gedeon, Thanh-Toan Do, Abhinav Dhall
Title: Multiverse Through Deepfakes: The MultiFakeVerse Dataset of Person-Centric Visual and Conceptual Manipulations
Abstract:
The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale and reasoning capability driven deepfake benchmark dataset specifically tailored for person-centric object, context and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large scale person-centric deepfake dataset, comprising 845,286 images generated through manipulation suggestions and image manipulations both derived from vision-language models (VLM). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions rather than synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available at https://github.com/Parul-Gupta/MultiFakeVerse.
中文: 本文提出了MultiFakeVerse这一大规模人物中心深度伪造数据集,通过视觉语言模型生成具有语义意义的篡改内容,现有检测方法和人类观察者均难以识别这些篡改。
English: This paper introduces MultiFakeVerse, a large-scale person-centric deepfake dataset created using vision-language models to generate semantically meaningful manipulations that current detection methods and humans find challenging to identify.

Authors:Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu
Title: SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning
Abstract:
Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model's language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving a 20% improvement in training efficiency. Code is available at https://github.com/longmalongma/SynPO.
中文摘要:本文提出协同偏好优化(SynPO)新方法,通过构建高质量偏好对和优化训练过程,在细粒度视频描述任务中有效提升性能并实现20%的训练效率提升。
English Summary: This paper introduces Synergistic Preference Optimization (SynPO), a novel method that enhances fine-grained video captioning by improving training efficiency and preventing optimization issues found in direct preference optimization, achieving a 20% efficiency gain.

Authors:Jiatong Li, Libo Zhu, Haotong Qin, Jingkai Wang, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Title: QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration
Abstract:
Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations of diffusion models make it difficult to deploy them on devices like smartphones. In this work, we propose QuantFace, a novel low-bit quantization for one-step diffusion face restoration models, where the full-precision (i.e., 32-bit) weights and activations are quantized to 4 to 6 bits. We first analyze the data distribution within activations and find that they are highly variant. To preserve the original data information, we employ rotation-scaling channel balancing. Furthermore, we propose Quantization-Distillation Low-Rank Adaptation (QD-LoRA) that jointly optimizes for quantization and distillation performance. Finally, we propose an adaptive bit-width allocation strategy. We formulate such a strategy as an integer programming problem, which combines quantization error and perceptual metrics to find a satisfactory resource allocation. Extensive experiments on the synthetic and real-world datasets demonstrate the effectiveness of QuantFace under 6-bit and 4-bit. QuantFace achieves significant advantages over recent leading low-bit quantization methods for face restoration. The code is available at https://github.com/jiatongli2024/QuantFace.
中文摘要:QuantFace提出了一种针对一步扩散人脸修复模型的低比特量化方法,通过旋转缩放通道平衡和QD-LoRA等技术在降低计算量的同时保持性能,使其更适合在智能手机等设备上部署。
English Summary: QuantFace introduces a low-bit quantization method for one-step diffusion face restoration models, employing techniques like rotation-scaling channel balancing and QD-LoRA to optimize performance while reducing computations for deployment on devices like smartphones.
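The "scaling" half of channel balancing can be sketched with SmoothQuant-style per-channel scale migration, which moves activation outliers into the following layer's weights before quantization; the rotation component and QuantFace's exact scaling rule are not shown, and `alpha` is an illustrative hyperparameter.

```python
import torch

def scale_balance(act_absmax: torch.Tensor, w_next: torch.Tensor, alpha: float = 0.5):
    """Per-channel scale migration (illustrative sketch, SmoothQuant-style).

    act_absmax: (C,) per-channel max |activation| observed on calibration data.
    w_next:     (C_out, C) weight of the layer consuming these activations.
    Returns (s, w_scaled): activations are divided by s at runtime and the
    rescaled weights absorb the factor, so x/s @ (w*s)^T == x @ w^T.
    """
    w_absmax = w_next.abs().amax(dim=0)                       # (C,)
    s = (act_absmax ** alpha) / (w_absmax ** (1 - alpha) + 1e-8)
    s = s.clamp_min(1e-5)
    return s, w_next * s.unsqueeze(0)
```

Flattening the activation range in this way makes the subsequent low-bit quantization of both activations and weights less lossy.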

Authors:Xiang Zhang, Run He, Jiao Chen, Di Fang, Ming Li, Ziqian Zeng, Cen Chen, Huiping Zhuang
Title: L3A: Label-Augmented Analytic Adaptation for Multi-Label Class Incremental Learning
Abstract:
Class-incremental learning (CIL) enables models to learn new classes continually without forgetting previously acquired knowledge. Multi-label CIL (MLCIL) extends CIL to a real-world scenario where each sample may belong to multiple classes, introducing several challenges: label absence, which leads to incomplete historical information due to missing labels, and class imbalance, which results in the model bias toward majority classes. To address these challenges, we propose Label-Augmented Analytic Adaptation (L3A), an exemplar-free approach without storing past samples. L3A integrates two key modules. The pseudo-label (PL) module implements label augmentation by generating pseudo-labels for current phase samples, addressing the label absence problem. The weighted analytic classifier (WAC) derives a closed-form solution for neural networks. It introduces sample-specific weights to adaptively balance the class contribution and mitigate class imbalance. Experiments on MS-COCO and PASCAL VOC datasets demonstrate that L3A outperforms existing methods in MLCIL tasks. Our code is available at https://github.com/scut-zx/L3A.
中文:提出的标签增强解析适应(L3A)框架通过生成伪标签弥补历史标签缺失,并采用加权解析分类器缓解类别不平衡,在无需存储历史样本的情况下,于多标签类增量学习任务中实现了优于现有方法的性能。
English: The proposed Label-Augmented Analytic Adaptation (L3A) framework addresses multi-label class-incremental learning challenges by generating pseudo-labels to compensate for missing historical labels and employing a weighted analytic classifier to mitigate class imbalance, demonstrating superior performance on benchmark datasets without storing past exemplars.
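The weighted analytic classifier admits a standard closed form: a sample-weighted ridge regression over frozen features, which is one way to realize the description above. The regularization constant and the weighting scheme below are illustrative.

```python
import torch

def weighted_analytic_classifier(features, labels, sample_weights, reg=1e-3):
    """Closed-form multi-label classifier via weighted ridge regression (sketch).

    features:       (N, D) frozen-backbone embeddings.
    labels:         (N, C) multi-hot label matrix (may include pseudo-labels).
    sample_weights: (N,)   per-sample weights that re-balance class contributions.
    Returns W of shape (D, C); scores are computed as features @ W.
    """
    X, Y, w = features, labels.float(), sample_weights
    d = X.shape[1]
    XtDX = X.T @ (X * w.unsqueeze(1))                   # X^T diag(w) X
    XtDY = X.T @ (Y * w.unsqueeze(1))                   # X^T diag(w) Y
    eye = torch.eye(d, device=X.device, dtype=X.dtype)
    return torch.linalg.solve(XtDX + reg * eye, XtDY)   # (X^T D X + reg I)^{-1} X^T D Y
```

Because the solution is closed-form, it can be updated phase by phase from accumulated statistics without revisiting stored exemplars.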

Authors:Milad Khanchi, Maria Amer, Charalambos Poullis
Title: Depth-Aware Scoring and Hierarchical Alignment for Multiple Object Tracking
Abstract:
Current motion-based multiple object tracking (MOT) approaches rely heavily on Intersection-over-Union (IoU) for object association. Without using 3D features, they are ineffective in scenarios with occlusions or visually similar objects. To address this, our paper presents a novel depth-aware framework for MOT. We estimate depth using a zero-shot approach and incorporate it as an independent feature in the association process. Additionally, we introduce a Hierarchical Alignment Score that refines IoU by integrating both coarse bounding box overlap and fine-grained (pixel-level) alignment to improve association accuracy without requiring additional learnable parameters. To our knowledge, this is the first MOT framework to incorporate 3D features (monocular depth) as an independent decision matrix in the association step. Our framework achieves state-of-the-art results on challenging benchmarks without any training or fine-tuning. The code is available at https://github.com/Milad-Khanchi/DepthMOT.
Chinese Summary: 本文提出一种深度感知多目标跟踪框架,通过将估计深度作为独立特征并结合分层对齐评分来改进目标关联,无需训练即实现最先进的性能。
English Summary: This paper introduces a depth-aware multiple object tracking framework that enhances object association by incorporating estimated depth as an independent feature and a hierarchical alignment score, achieving state-of-the-art performance without training.
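A minimal sketch of using monocular depth as an independent cue in association: the usual IoU cost is combined with a normalized depth-difference term before matching. The weighting and normalization are assumptions, and the pixel-level alignment refinement of the Hierarchical Alignment Score is not shown.

```python
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """Pairwise IoU between (N, 4) and (M, 4) boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def association_cost(track_boxes, det_boxes, track_depths, det_depths, w_depth=0.3):
    """Combine IoU with a depth-agreement term as an extra, independent cue."""
    iou_cost = 1.0 - pairwise_iou(track_boxes, det_boxes)
    depth_diff = np.abs(track_depths[:, None] - det_depths[None, :])
    depth_cost = depth_diff / (depth_diff.max() + 1e-9)        # normalize to [0, 1]
    return iou_cost + w_depth * depth_cost
```

The resulting matrix can be passed directly to scipy.optimize.linear_sum_assignment to obtain track-detection matches.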

Authors:Tianze Yang, Tyson Jordan, Ninghao Liu, Jin Sun
Title: Common Inpainted Objects In-N-Out of Context
Abstract:
We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through a multimodal large language model assessment. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) training context classifiers that effectively determine whether existing objects belong in their context; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics. Our code and data are at: https://github.com/YangTianze009/COinCO.
中文摘要:COinCO数据集通过系统替换图像中的对象创建了97,722张包含上下文一致与不一致场景的图像,为推进计算机视觉中的上下文感知理解和图像取证研究提供了重要基础。
English Summary: The COinCO dataset introduces 97,722 images with systematically replaced objects to study contextual coherence, enabling advancements in context-aware computer vision tasks and image forensics.

Authors:Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang
Title: QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training
Abstract:
Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
中文: QoQ-Med-7B/32B 是首个开放通用的临床基础模型,能够联合推理医学图像、时间序列信号和文本报告,并通过创新的领域感知相对策略优化(DRPO)显著提升诊断性能,有效应对临床数据分布不均的问题。
English: QoQ-Med-7B/32B is the first open generalist clinical foundation model that integrates reasoning across medical images, time-series signals, and text reports, utilizing a novel Domain-aware Relative Policy Optimization (DRPO) to enhance diagnostic performance and address data imbalance across clinical specialties.

Authors:Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu
Title: Concept-Centric Token Interpretation for Vector-Quantized Generative Models
Abstract:
Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing the transparency of VQGMs, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at https://github.com/YangTianze009/CORTEX.
Chinese: 本文提出CORTEX框架,通过样本级和码本级的解释方法识别概念相关的令牌组合来解读矢量量化生成模型,有效提升模型透明度并支持图像编辑和特征检测等应用。
English: This paper introduces CORTEX, a novel framework that interprets Vector-Quantized Generative Models by identifying concept-specific token combinations through sample-level and codebook-level methods, enhancing model transparency and enabling applications like image editing and feature detection.

Authors:Yunguan Fu, Wenjia Bai, Weixi Yi, Charlotte Manisty, Anish N Bhuva, Thomas A Treibel, James C Moon, Matthew J Clarkson, Rhodri Huw Davies, Yipeng Hu
Title: A versatile foundation model for cine cardiac magnetic resonance image analysis tasks
Abstract:
Here we present a versatile foundation model that can perform a range of clinically-relevant image analysis tasks, including segmentation, landmark localisation, diagnosis, and prognostication. A multi-view convolution-transformer masked autoencoder, named as CineMA, was trained on 15 million cine images from 74,916 subjects. The model was validated on multiple image analysis tasks and compared to existing models on >4,500 images from eight independent datasets with diverse population characteristics, representing the largest benchmark study for cine CMR so far. CineMA consistently outperformed conventional convolutional neural networks (CNNs) in delineating ventricular boundaries and estimating ejection fraction, a key measure of cardiac function. The improved performance was preserved, even when the model only used half of fine-tuning data. CineMA also surpassed CNNs in disease detection and matched their performance in long-axis function measurement. Interestingly, we found that CineMA can also detect cardiac changes in systemic diseases, such as diabetes, hypertension and cancer, and can also predict mortality. Finally, we assessed model fairness and demonstrated consistent model performance across demographic subgroups. These findings highlight CineMA's accuracy, learning efficiency, adaptability, and fairness, underscoring its potential as a foundation model for automated cardiac image analysis to support clinical workflow and cardiovascular research. All training and inference code and models are made publicly available at https://github.com/mathpluscode/CineMA.
中文: CineMA作为一种基于1500万电影图像训练的多视角卷积-变换器掩码自编码器,在心脏图像分割、疾病检测和死亡率预测等任务中持续优于传统卷积神经网络,同时展现出卓越的学习效率和在人口统计学群体间的公平性。
English: CineMA, a multi-view convolution-transformer masked autoencoder trained on 15 million cine images, consistently outperforms conventional CNNs in cardiac image analysis tasks including segmentation, disease detection, and mortality prediction while demonstrating high learning efficiency and fairness across demographic groups.

Authors:Yu Huang, Junhao Chen, Shuliang Liu, Hanqian Li, Qi Zheng, Yi R. Fung, Xuming Hu
Title: Video Signature: In-generation Watermarking for Latent Video Diffusion Models
Abstract:
The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios. Our code is available at https://github.com/hardenyu21/Video-Signature.
Chinese: VIDSIG提出了一种针对潜在视频扩散模型的生成内水印方法,在视频生成过程中自适应地嵌入水印,在提取精度、视觉质量和效率方面表现优异,同时具备强大的抗篡改鲁棒性。
English: VIDSIG introduces an in-generation watermarking method for latent video diffusion models that integrates watermarks adaptively during video creation, achieving superior performance in extraction accuracy, visual quality, and efficiency while maintaining robustness against tampering.

Authors:Ruiming Min, Minghao Liu
Title: ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education
Abstract:
With the advancement of modern medicine and the development of technologies such as MRI, CT, and cellular analysis, it has become increasingly critical for clinicians to accurately interpret various diagnostic images. However, modern medical education often faces challenges due to limited access to high-quality teaching materials, stemming from privacy concerns and a shortage of educational resources (Balogh et al., 2015). In this context, image data generated by machine learning models, particularly generative models, presents a promising solution. These models can create diverse and comparable imaging datasets without compromising patient privacy, thereby supporting modern medical education. In this study, we explore the use of convolutional neural networks (CNNs) and CycleGAN (Zhu et al., 2017) for generating synthetic medical images. The source code is available at https://github.com/mliuby/COMP4211-Project.
中文: 机器学习的卷积神经网络和CycleGAN等模型为生成合成医学图像提供了可行方案,既解决了医学教育资源匮乏的问题,又有效保护了患者隐私。
English: Machine learning models like CNNs and CycleGAN offer a promising solution to generate synthetic medical images, addressing the shortage of educational resources while preserving patient privacy in modern medical education.

Authors:Junwen Huang, Jizhong Liang, Jiaqi Hu, Martin Sundermeyer, Peter KT Yu, Nassir Navab, Benjamin Busam
Title: XYZ-IBD: A High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity
Abstract:
We introduce XYZ-IBD, a bin-picking dataset for 6D pose estimation that captures real-world industrial complexity, including challenging object geometries, reflective materials, severe occlusions, and dense clutter. The dataset reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. Unlike existing datasets that primarily focus on household objects, which approach saturation, XYZ-IBD represents unsolved, realistic industrial conditions. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes. These objects are heavily occluded and randomly arranged in bins with high density, replicating the challenges of real-world bin-picking. XYZ-IBD was collected using two high-precision industrial cameras and one commercially available camera, providing RGB, grayscale, and depth images. It contains 75 multi-view real-world scenes, along with a large-scale synthetic dataset rendered under simulated bin-picking conditions. We employ a meticulous annotation pipeline that includes anti-reflection spray, multi-view depth fusion, and semi-automatic annotation, achieving millimeter-level pose labeling accuracy required for industrial manipulation. Quantification in simulated environments confirms the reliability of the ground-truth annotations. We benchmark state-of-the-art methods on 2D detection, 6D pose estimation, and depth estimation tasks on our dataset, revealing significant performance degradation in our setups compared to current academic household benchmarks. By capturing the complexity of real-world bin-picking scenarios, XYZ-IBD introduces more realistic and challenging problems for future research. The dataset and benchmark are publicly available at https://xyz-ibd.github.io/XYZ-IBD/.
中文:XYZ-IBD是一个针对真实工业场景的抓取数据集,通过毫米级精确标注呈现了复杂物体和遮挡环境,相比现有家居数据集更具工业挑战性。
English: XYZ-IBD is a bin-picking dataset that captures real-world industrial complexity with challenging objects and millimeter-accurate annotations, providing a more realistic benchmark than existing household-focused datasets.

Authors:Yule Zhu, Ping Liu, Zhedong Zheng, Wei Liu
Title: SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models
Abstract:
Diffusion models have recently enabled precise and photorealistic facial editing across a wide range of semantic attributes. Beyond single-step modifications, a growing class of applications now demands the ability to analyze and track sequences of progressive edits, such as stepwise changes to hair, makeup, or accessories. However, sequential editing introduces significant challenges in edit attribution and detection robustness, further complicated by the lack of large-scale, finely annotated benchmarks tailored explicitly for this task. We introduce SEED, a large-scale Sequentially Edited facE Dataset constructed via state-of-the-art diffusion models. SEED contains over 90,000 facial images with one to four sequential attribute modifications, generated using diverse diffusion-based editing pipelines (LEdits, SDXL, SD3). Each image is annotated with detailed edit sequences, attribute masks, and prompts, facilitating research on sequential edit tracking, visual provenance analysis, and manipulation robustness assessment. To benchmark this task, we propose FAITH, a frequency-aware transformer-based model that incorporates high-frequency cues to enhance sensitivity to subtle sequential changes. Comprehensive experiments, including systematic comparisons of multiple frequency-domain methods, demonstrate the effectiveness of FAITH and the unique challenges posed by SEED. SEED offers a challenging and flexible resource for studying progressive diffusion-based edits at scale. Dataset and code will be publicly released at: https://github.com/Zeus1037/SEED.
中文: 扩散模型实现了精确的面部编辑,但追踪连续编辑仍具挑战,为此开发了SEED数据集和FAITH模型,以推动该领域研究。
English: Diffusion models enable precise facial editing, but tracking sequential edits remains challenging, leading to the creation of the SEED dataset and FAITH model to advance research in this area.

Authors:Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, Jun Zhang
Title: SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation
Abstract:
The Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX. In this paper, we first analyze the issues when applying vanilla DMD on large-scale models. Then, to overcome the scalability challenge, we propose implicit distribution alignment (IDA) to regularize the distance between the generator and fake distribution. Furthermore, we propose intra-segment guidance (ISG) to relocate the timestep importance distribution from the teacher model. With IDA alone, DMD converges for SD 3.5; employing both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Along with other improvements such as scaled up discriminator models, our final model, dubbed SenseFlow, achieves superior performance in distillation for both diffusion based text-to-image models such as SDXL, and flow-matching models such as SD 3.5 Large and FLUX. The source code will be available at https://github.com/XingtongGe/SenseFlow.
中文: 本文提出SenseFlow,通过引入隐式分布对齐(IDA)和段内引导(ISG)来改进分布匹配蒸馏(DMD),解决了在SD 3.5和FLUX等大规模文本到图像模型中的收敛难题,实现了卓越的蒸馏性能。
English: The paper introduces SenseFlow, which enhances Distribution Matching Distillation (DMD) with implicit distribution alignment (IDA) and intra-segment guidance (ISG) to overcome convergence issues in large-scale text-to-image models like SD 3.5 and FLUX, achieving superior performance.

Authors:Leila Mahmoodi, Peyman Moghadam, Munawar Hayat, Christian Simon, Mehrtash Harandi
Title: Flashbacks to Harmonize Stability and Plasticity in Continual Learning
Abstract:
We introduce Flashback Learning (FL), a novel method designed to harmonize the stability and plasticity of models in Continual Learning (CL). Unlike prior approaches that primarily focus on regularizing model updates to preserve old information while learning new concepts, FL explicitly balances this trade-off through a bidirectional form of regularization. This approach effectively guides the model to swiftly incorporate new knowledge while actively retaining its old knowledge. FL operates through a two-phase training process and can be seamlessly integrated into various CL methods, including replay, parameter regularization, distillation, and dynamic architecture techniques. In designing FL, we use two distinct knowledge bases: one to enhance plasticity and another to improve stability. FL ensures a more balanced model by utilizing both knowledge bases to regularize model updates. Theoretically, we analyze how the FL mechanism enhances the stability-plasticity balance. Empirically, FL demonstrates tangible improvements over baseline methods within the same training budget. By integrating FL into at least one representative baseline from each CL category, we observed an average accuracy improvement of up to 4.91% in Class-Incremental and 3.51% in Task-Incremental settings on standard image classification benchmarks. Additionally, measurements of the stability-to-plasticity ratio confirm that FL effectively enhances this balance. FL also outperforms state-of-the-art CL methods on more challenging datasets like ImageNet.
中文摘要:Flashback Learning(FL)是一种新颖的持续学习方法,通过双向正则化有效平衡模型的稳定性与可塑性,在多个基准测试中显著提升准确率并优化了稳定性与可塑性的平衡。
English Summary: Flashback Learning (FL) is a novel continual learning method that balances stability and plasticity through bidirectional regularization, achieving significant accuracy improvements and better stability-plasticity balance across various benchmarks.

Authors:Mehedi Ahamed, Radib Bin Kabir, Tawsif Tashwar Dipto, Mueeze Al Mushabbir, Sabbir Ahmed, Md. Hasanul Kabir
Title: Performance Analysis of Few-Shot Learning Approaches for Bangla Handwritten Character and Digit Recognition
Abstract:
This study investigates the performance of few-shot learning (FSL) approaches in recognizing Bangla handwritten characters and numerals using limited labeled data. It demonstrates the applicability of these methods to scripts with intricate and complex structures, where dataset scarcity is a common challenge. Given the complexity of Bangla script, we hypothesize that models performing well on these characters can generalize effectively to languages of similar or lower structural complexity. To this end, we introduce SynergiProtoNet, a hybrid network designed to improve the recognition accuracy of handwritten characters and digits. The model integrates advanced clustering techniques with a robust embedding framework to capture fine-grained details and contextual nuances. It leverages multi-level (both high- and low-level) feature extraction within a prototypical learning framework. We rigorously benchmark SynergiProtoNet against several state-of-the-art few-shot learning models: BD-CSPN, Prototypical Network, Relation Network, Matching Network, and SimpleShot, across diverse evaluation settings including Monolingual Intra-Dataset Evaluation, Monolingual Inter-Dataset Evaluation, Cross-Lingual Transfer, and Split Digit Testing. Experimental results show that SynergiProtoNet consistently outperforms existing methods, establishing a new benchmark in few-shot learning for handwritten character and digit recognition. The code is available on GitHub: https://github.com/MehediAhamed/SynergiProtoNet.
中文摘要:本研究提出SynergiProtoNet混合小样本学习模型,通过结合聚类技术和多层次特征提取,显著提升了孟加拉语手写字符与数字的识别准确率,在多种评估场景下均优于现有先进方法。
English Summary: This study introduces SynergiProtoNet, a hybrid few-shot learning model that enhances recognition accuracy for Bangla handwritten characters and numerals by integrating clustering techniques and multi-level feature extraction, outperforming existing methods across diverse evaluation settings.
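For reference, the prototypical learning framework that SynergiProtoNet builds on classifies a query by its distance to class prototypes, i.e., the mean support embedding per class. The sketch below shows only this generic baseline step; the embedding network and SynergiProtoNet's clustering refinements are omitted.

```python
import torch

def prototypical_predict(support_emb, support_labels, query_emb, n_classes):
    """Nearest-prototype classification used in prototypical learning (sketch).

    support_emb:    (S, D) embeddings of the labelled support set.
    support_labels: (S,)   class indices in [0, n_classes).
    query_emb:      (Q, D) embeddings of the query images.
    """
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                                  # (C, D) per-class means
    dists = torch.cdist(query_emb, prototypes)          # (Q, C) Euclidean distances
    return dists.argmin(dim=1)                          # predicted class per query
```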

Authors:Tuan-Luc Huynh, Thanh-Danh Le, Tam V. Nguyen, Trung-Nghia Le, Minh-Triet Tran
Title: Efficient 3D Brain Tumor Segmentation with Axial-Coronal-Sagittal Embedding
Abstract:
In this paper, we address the crucial task of brain tumor segmentation in medical imaging and propose innovative approaches to enhance its performance. The current state-of-the-art nnU-Net has shown promising results but suffers from extensive training requirements and underutilization of pre-trained weights. To overcome these limitations, we integrate Axial-Coronal-Sagittal convolutions and pre-trained weights from ImageNet into the nnU-Net framework, resulting in reduced training epochs, reduced trainable parameters, and improved efficiency. Two strategies for transferring 2D pre-trained weights to the 3D domain are presented, ensuring the preservation of learned relationships and feature representations critical for effective information propagation. Furthermore, we explore a joint classification and segmentation model that leverages pre-trained encoders from a brain glioma grade classification proxy task, leading to enhanced segmentation performance, especially for challenging tumor labels. Experimental results demonstrate that our proposed methods in the fast training settings achieve comparable or even outperform the ensemble of cross-validation models, a common practice in the brain tumor segmentation literature.
中文: 本文通过将轴状-冠状-矢状卷积和预训练ImageNet权重融入nnU-Net框架,在减少训练需求的同时,实现了与现有最优方法相当甚至更优的脑肿瘤分割效果。
English: This paper enhances brain tumor segmentation by integrating Axial-Coronal-Sagittal convolutions and pre-trained ImageNet weights into nnU-Net, reducing training requirements while maintaining or exceeding state-of-the-art performance.
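One common way to transfer 2D pretrained weights to 3D, in the spirit of the strategies described above, is I3D-style kernel inflation: the 2D kernel is replicated along the new depth axis and rescaled so activations are preserved for depth-constant inputs. This is a generic sketch, not necessarily the paper's exact transfer scheme.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, depth: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution into a 3D convolution (sketch).

    The 2D kernel is repeated along the new depth axis and divided by `depth`,
    so that a depth-constant input reproduces the 2D layer's activations.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(depth, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w2d = conv2d.weight.data                                   # (out, in, kH, kW)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias.data)
    return conv3d
```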

Authors:Muhammad Adnan, Nithesh Kurella, Akhil Arunkumar, Prashant J. Nair
Title: Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation
Abstract:
Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves substantial end-to-end speedups while maintaining video quality. The source code of Foresight is available at https://github.com/STAR-Laboratory/foresight.
Chinese: Foresight 提出了一种自适应层重用技术,在保持视频生成质量的同时,有效降低了扩散变换器的计算成本,显著提升了 OpenSora 等模型的运行速度。
English: Foresight introduces an adaptive layer-reuse technique that reduces computational costs in Diffusion Transformers for video generation while maintaining quality, achieving significant speed improvements in models like OpenSora.
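The layer-reuse idea can be sketched as a per-block cache that skips recomputation when the block's input has changed little since the previous denoising step; the relative-change criterion and threshold below are illustrative stand-ins for Foresight's adaptive policy.

```python
import torch

class BlockCache:
    """Reuse a transformer block's output across denoising steps when its
    input has barely changed (illustrative sketch of adaptive layer reuse)."""

    def __init__(self, block, rel_tol: float = 0.05):
        self.block = block
        self.rel_tol = rel_tol
        self.prev_input = None
        self.prev_output = None

    @torch.no_grad()
    def __call__(self, x, *args, **kwargs):
        if self.prev_input is not None:
            change = (x - self.prev_input).norm() / (self.prev_input.norm() + 1e-9)
            if change < self.rel_tol:              # input nearly unchanged: reuse cached output
                return self.prev_output
        out = self.block(x, *args, **kwargs)       # otherwise recompute and refresh the cache
        self.prev_input, self.prev_output = x.detach(), out.detach()
        return out
```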

Authors:Long Xu, Peng Gao, Wen-Jia Tang, Fei Wang, Ru-Yue Yuan
Title: Towards Effective and Efficient Adversarial Defense with Diffusion Models for Robust Visual Tracking
Abstract:
Although deep learning-based visual tracking methods have made significant progress, they exhibit vulnerabilities when facing carefully designed adversarial attacks, which can lead to a sharp decline in tracking performance. To address this issue, this paper proposes for the first time a novel adversarial defense method based on denoising diffusion probabilistic models, termed DiffDf, aimed at effectively improving the robustness of existing visual tracking methods against adversarial attacks. DiffDf establishes a multi-scale defense mechanism by combining pixel-level reconstruction loss, semantic consistency loss, and structural similarity loss, effectively suppressing adversarial perturbations through a gradual denoising process. Extensive experimental results on several mainstream datasets show that the DiffDf method demonstrates excellent generalization performance for trackers with different architectures, significantly improving various evaluation metrics while achieving real-time inference speeds of over 30 FPS, showcasing outstanding defense performance and efficiency. Codes are available at https://github.com/pgao-lab/DiffDf.
中文摘要:本文首次提出基于去噪扩散概率模型的DiffDf防御方法,通过多尺度防御机制有效提升视觉跟踪器对抗攻击的鲁棒性,并在保持实时性能的同时显著改善各项评估指标。
English Summary: This paper introduces DiffDf, a novel adversarial defense method using denoising diffusion models to significantly enhance the robustness of visual trackers against attacks while maintaining real-time performance.

Authors:Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg
Title: Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
Abstract:
Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user's request can significantly improve their performance across tasks. This approach has been extended to multimodal LLMs, where the models can produce chain-of-thoughts (CoT) about the content of input images and videos. In this work, we propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. For this, we first create CoF-Data, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces about both natural and synthetic videos, spanning various topics and tasks. Then, we fine-tune existing video LLMs on this chain-of-frames (CoF) data. Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. We show that our models based on CoF are able to generate chain-of-thoughts that accurately refer to the key frames to answer the given question. This, in turn, leads to improved performance across multiple video understanding benchmarks, for example, surpassing leading video LLMs on Video-MME, MVBench, and VSI-Bench, and notably reducing the hallucination rate. Code is available at https://github.com/SaraGhazanfari/CoF.
Chinese Summary: 近期研究通过训练视频大语言模型生成明确基于相关视频帧的推理轨迹,无需辅助网络即可在多项基准测试中显著提升性能并减少幻觉现象。
English Summary: Recent research enhances video large language models by training them to generate reasoning traces explicitly grounded in relevant video frames, leading to improved performance on multiple benchmarks without needing auxiliary networks.

Authors:Anoop Kini, Andreas Jansche, Timo Bernthaler, Gerhard Schneider
Title: FastCAR: Fast Classification And Regression for Task Consolidation in Multi-Task Learning to Model a Continuous Property Variable of Detected Object Class
Abstract:
FastCAR is a novel task consolidation approach in Multi-Task Learning (MTL) for a classification task and a regression task, despite the non-trivial task heterogeneity and only a subtle correlation between them. The approach addresses the classification of a detected object (occupying the entire image frame) and regression for modeling a continuous property variable (for instances of an object class), a crucial use case in science and engineering. FastCAR involves a label transformation approach that is amenable for use with only a single-task regression network architecture. FastCAR outperforms traditional MTL model families, parametrized in the landscape of architectures and loss weighting schemes, when both tasks are considered jointly (classification accuracy of 99.54%, regression mean absolute percentage error of 2.4%). The experiments use the "Advanced Steel Property Dataset" that we contribute (https://github.com/fastcandr/AdvancedSteel-Property-Dataset). The dataset comprises 4536 images of 224x224 pixels, annotated with discrete object classes and a hardness property that takes continuous values. Our proposed FastCAR approach for task consolidation achieves greater training-time efficiency (2.52x quicker) and lower inference latency (55% faster) than benchmark MTL networks.
Chinese: FastCAR是一种新颖的多任务学习方法,通过标签转换技术有效整合分类与回归任务,在达到99.54%分类精度和2.4%回归误差的同时,相比传统方法训练速度提升2.52倍、推理延迟降低55%。
English: FastCAR is a novel multi-task learning approach that efficiently consolidates classification and regression tasks using a label transformation method, achieving superior performance with 99.54% classification accuracy and 2.4% regression error while being 2.52x faster in training and 55% quicker in inference than traditional methods.

Authors:Edward Fish, Richard Bowden
Title: Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation
Abstract:
Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincaré ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fréchet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
中文: 本研究提出Geo-Sign方法,通过将骨骼特征投影到双曲空间来增强手语翻译中的骨架表示,从而更好地捕捉层次化运动结构,在提升精度的同时提高了计算效率。
English: This work introduces Geo-Sign, a method that enhances skeletal representations for sign language translation by projecting features into hyperbolic space to better capture hierarchical kinematic structures, improving both accuracy and computational efficiency over existing approaches.
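The hyperbolic projection and the distance used by a contrastive loss in the Poincaré ball follow standard formulas: the exponential map at the origin sends Euclidean ST-GCN features into the ball, and geodesic distance replaces Euclidean distance. Below is a minimal sketch with unit curvature; the weighted Fréchet mean aggregation and the full contrastive objective are omitted.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball with curvature -c:
    projects Euclidean (tangent-space) features into the ball."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance in the unit Poincare ball (c = 1)."""
    sq = ((x - y) ** 2).sum(dim=-1)
    denom = (1 - (x ** 2).sum(dim=-1)).clamp_min(eps) * (1 - (y ** 2).sum(dim=-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom)
```

Because distances near the ball's boundary grow rapidly, hierarchically fine-grained motions such as finger articulations can be spread out more than in flat Euclidean space.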

Authors:Shuai Liu, Quanmin Liang, Zefeng Li, Boyang Li, Kai Huang
Title: GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving
Abstract:
Multi-sensor fusion is crucial for improving the performance and robustness of end-to-end autonomous driving systems. Existing methods predominantly adopt either attention-based flatten fusion or bird's eye view fusion through geometric transformations. However, these approaches often suffer from limited interpretability or dense computational overhead. In this paper, we introduce GaussianFusion, a Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Our method employs intuitive and compact Gaussian representations as intermediate carriers to aggregate information from diverse sensors. Specifically, we initialize a set of 2D Gaussians uniformly across the driving scene, where each Gaussian is parameterized by physical attributes and equipped with explicit and implicit features. These Gaussians are progressively refined by integrating multi-modal features. The explicit features capture rich semantic and spatial information about the traffic scene, while the implicit features provide complementary cues beneficial for trajectory planning. To fully exploit rich spatial and semantic information in Gaussians, we design a cascade planning head that iteratively refines trajectory predictions through interactions with Gaussians. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate the effectiveness and robustness of the proposed GaussianFusion framework. The source code will be released at https://github.com/Say2L/GaussianFusion.
中文: GaussianFusion提出了一种基于高斯表示的多传感器融合框架,通过整合显式和隐式特征来增强自动驾驶系统的轨迹规划能力和鲁棒性。
English: GaussianFusion introduces a multi-sensor fusion framework using Gaussian representations to enhance autonomous driving by integrating explicit and implicit features for improved trajectory planning and robustness.

Authors:Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
Title: Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Abstract:
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates at most 40.0% by Browser-Use Openai-o3, far below human-level performance, 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.
Chinese Summary: Open CaptchaWorld 是首个基于网络的基准测试平台,用于评估多模态大语言模型代理解决各类验证码的能力,实验显示人类表现接近完美,而顶尖模型成功率远低于人类水平,揭示了当前交互推理能力的重大不足。
English Summary: Open CaptchaWorld is introduced as the first web-based benchmark to test multimodal LLM agents' capabilities in solving diverse CAPTCHA puzzles, revealing that while humans achieve near-perfect scores, current state-of-the-art agents significantly underperform, highlighting critical gaps in interactive reasoning.

Authors:Yangyi Huang, Ye Yuan, Xueting Li, Jan Kautz, Umar Iqbal
Title: AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
Abstract:
Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) A pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses alongside corresponding 3D Gaussian Splats (3DGS) reconstruction at each diffusion step; (2) A compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive detailed 3D avatar. These components allow AdaHuman to generate highly realistic standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be publicly available for research purposes.

Authors:Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan
Title: Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
Abstract:
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step, deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X
中文摘要:我们推出了Agent-X基准测试,用于评估视觉中心智能体在真实多模态环境中的多步推理和工具使用能力,结果表明即使GPT、Gemini等顶尖模型在复杂任务中表现不佳,完整任务成功率不足50%。
English Summary: The Agent-X benchmark is introduced to evaluate vision-centric agents' multi-step reasoning and tool-use capabilities in real-world multimodal settings, revealing that top models like GPT and Gemini struggle with complex tasks, achieving under 50% success rates.

Authors:Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, Vicente Ordonez
Title: ProxyThinker: Test-Time Guidance through Small Visual Reasoners
Abstract:
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as demonstrated by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38x faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.
中文: ProxyThinker是一种无需训练的推理时技术,通过调整解码动态使大型视觉语言模型能够继承小型慢思考推理器的视觉推理能力,在多项复杂视觉基准测试中显著提升性能,同时实现高达38倍的推理加速。
English: ProxyThinker is an inference-time technique that enables large vision-language models to acquire enhanced visual reasoning capabilities from smaller, specialized reasoners without additional training, significantly improving performance on complex visual benchmarks while accelerating inference speeds by up to 38 times.
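The decoding-time adjustment can be sketched as proxy-tuning-style logit arithmetic: the difference between a small RFT reasoner and its own base model is added to the large model's next-token logits. The scaling factor `alpha` is an illustrative knob, and the models are assumed to share a vocabulary.

```python
import torch

@torch.no_grad()
def proxy_guided_logits(logits_large, logits_small_rft, logits_small_base, alpha=1.0):
    """Steer a large model's next-token distribution with the 'reasoning delta'
    of a small RFT-tuned reasoner (illustrative logit arithmetic).

    All inputs are (batch, vocab) next-token logits over a shared vocabulary.
    """
    return logits_large + alpha * (logits_small_rft - logits_small_base)

# Sampling then proceeds from softmax(proxy_guided_logits(...)) at every decoding step.
```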

Authors:Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius
Title: SiLVR: A Simple Language-based Video Reasoning Framework
Abstract:
Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
中文: SiLVR 是一个简单、模块化且无需训练的视频推理框架,它将复杂视频理解分解为两个阶段:先将原始视频转化为基于语言的表征,再借助强大的推理大语言模型完成任务,并在多个基准测试中取得了最优性能。
English: SiLVR is a simple, modular, and training-free video reasoning framework that decomposes complex video understanding into two stages—first converting raw video into language-based representations and then using a powerful reasoning LLM to solve tasks—achieving state-of-the-art results on multiple benchmarks.

Authors:Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny
Title: Time Blindness: Why Video-Language Models Can't See What Humans Can?
Abstract:
Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/.
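A rough sense of how purely temporal encoding can work: each frame looks like spatial noise, but pixels inside a hidden mask flip coherently over time while the background flips independently. This construction is only an assumed illustration in the spirit of the benchmark, not SpookyBench's actual generation recipe.

import numpy as np

def temporal_only_sequence(mask, num_frames=32, rng=None):
    # Every frame is binary noise; shape pixels share one random bit per frame, so the
    # shape is invisible in any single frame but recoverable from temporal correlation.
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = mask.shape
    frames = rng.integers(0, 2, size=(num_frames, h, w))
    coherent = rng.integers(0, 2, size=num_frames)
    frames[:, mask.astype(bool)] = coherent[:, None]
    return (frames * 255).astype(np.uint8)

mask = np.zeros((32, 32)); mask[8:24, 8:24] = 1   # hidden square
frames = temporal_only_sequence(mask)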

Authors:Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
Title: ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Abstract:
Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, lacking character references, or covering only single-image cases, and fall short of real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt adherence, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a high-fidelity, multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
中文:ViStoryBench是一个全面的基准测试,旨在评估故事可视化模型在多样化叙事、风格和角色设定下的表现,它结合了人工验证的脚本和自动化指标来评估一致性、质量及生成瑕疵,从而推动视觉叙事领域的系统性分析与进步。
English: ViStoryBench is a comprehensive benchmark designed to evaluate story visualization models across diverse narratives, styles, and character settings, featuring human-verified scripts and automated metrics to assess consistency, quality, and artifacts, thereby enabling systematic analysis and advancement in visual storytelling.

Authors:Li Yunhan, Wu Gengshen
Title: LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Abstract:
As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7\%$ noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.
中文: 本研究通过开发回归模型、构建专业法律问题集并分析49个大型语言模型,填补了法律大模型语言质量评估的空白,发现模型性能在140亿参数时趋于稳定,推理模型优于基础架构,且Qwen3系列在成本效益方面表现最佳。
English: This study addresses the gap in evaluating linguistic quality of legal LLMs by developing a regression model, specialized legal questions, and analyzing 49 models, revealing that performance plateaus at 14B parameters and reasoning models outperform base architectures, with the Qwen3 series identified as optimal for cost-performance balance.

Authors:Marta López-Rauhut, Hongyu Zhou, Mathieu Aubry, Loic Landrieu
Title: Segmenting France Across Four Centuries
Abstract:
Historical maps offer an invaluable perspective into territory evolution across past centuries--long before satellite or remote sensing technologies existed. Deep learning methods have shown promising results in segmenting historical maps, but publicly available datasets typically focus on a single map type or period, require extensive and costly annotations, and are not suited for nationwide, long-term analyses. In this paper, we introduce a new dataset of historical maps tailored for analyzing large-scale, long-term land use and land cover evolution with limited annotations. Spanning metropolitan France (548,305 km^2), our dataset contains three map collections from the 18th, 19th, and 20th centuries. We provide both comprehensive modern labels and 22,878 km^2 of manually annotated historical labels for the 18th and 19th century maps. Our dataset illustrates the complexity of the segmentation task, featuring stylistic inconsistencies, interpretive ambiguities, and significant landscape changes (e.g., marshlands disappearing in favor of forests). We assess the difficulty of these challenges by benchmarking three approaches: a fully-supervised model trained with historical labels, and two weakly-supervised models that rely only on modern annotations. The latter either use the modern labels directly or first perform image-to-image translation to address the stylistic gap between historical and contemporary maps. Finally, we discuss how these methods can support long-term environment monitoring, offering insights into centuries of landscape transformation. Our official project repository is publicly available at https://github.com/Archiel19/FRAx4.git.
Chinese: 本文提出一个涵盖法国三个世纪历史地图的新数据集,旨在通过有限标注分析长期土地利用变化,并评估了多种分割方法以应对风格不一致和景观变迁等挑战。
English: This paper introduces a new dataset of historical maps for France spanning three centuries, designed to analyze long-term land use changes with limited annotations, and benchmarks segmentation methods to address challenges like stylistic inconsistencies and landscape transformations.

Authors:Zimu Liao, Jifeng Ding, Rong Fu, Siwei Cui, Ruixuan Gong, Li Wang, Boni Hu, Yi Wang, Hengjie Li, Xingcheng Zhang, Hui Wang
Title: TC-GS: A Faster Gaussian Splatting Module Utilizing Tensor Cores
Abstract:
3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where conditional alpha-blending dominates the time cost in the rendering pipeline. This paper proposes TC-GS, an algorithm-independent universal module that expands Tensor Core (TCU) applicability for 3DGS, leading to substantial speedups and seamless integration into existing 3DGS optimization frameworks. The key innovation lies in mapping alpha computation to matrix multiplication, fully utilizing otherwise idle TCUs in existing 3DGS implementations. TC-GS provides plug-and-play acceleration for existing top-tier acceleration algorithms tightly coupled with rendering pipeline designs, like Gaussian compression and redundancy elimination algorithms. Additionally, we introduce a global-to-local coordinate transformation to mitigate rounding errors from quadratic terms of pixel coordinates caused by Tensor Core half-precision computation. Extensive experiments demonstrate that our method maintains rendering quality while providing an additional 2.18x speedup over existing Gaussian acceleration algorithms, thus reaching up to a total 5.6x acceleration. The code is currently available at the anonymous repository https://github.com/TensorCore3DGS/3DGSTensorCore.
中文: 本文提出TC-GS通用模块,通过将alpha计算映射为矩阵乘法来利用张量核心加速3D高斯泼溅,在保持渲染质量的同时实现了2.18倍的加速效果。
English: This paper introduces TC-GS, a universal module that leverages Tensor Cores to accelerate 3D Gaussian Splatting by mapping alpha computation to matrix multiplication, achieving a 2.18x speedup while maintaining rendering quality.
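The central trick of mapping alpha computation onto matrix multiplication can be sketched as follows: each 2D Gaussian's quadratic form is expanded into coefficients of the pixel monomials [x^2, y^2, xy, x, y, 1], so a single (pixels x 6) @ (6 x Gaussians) product evaluates every exponent on Tensor-Core-friendly hardware. The variable names, the absence of tiling, and the omission of the paper's global-to-local coordinate shift (which keeps the quadratic terms small in half precision) are simplifications of this sketch, not the actual kernel.

import torch

def alpha_via_matmul(pix_xy, means, conics, opacities):
    # pix_xy: (Npix, 2), means: (Ngauss, 2), conics: (Ngauss, 3) as [a, b, c], opacities: (Ngauss,)
    x, y = pix_xy[:, 0], pix_xy[:, 1]
    P = torch.stack([x * x, y * y, x * y, x, y, torch.ones_like(x)], dim=1)      # (Npix, 6)

    a, b, c = conics[:, 0], conics[:, 1], conics[:, 2]
    mx, my = means[:, 0], means[:, 1]
    G = torch.stack([
        a, c, 2 * b,
        -2 * a * mx - 2 * b * my,
        -2 * c * my - 2 * b * mx,
        a * mx * mx + 2 * b * mx * my + c * my * my,
    ], dim=0)                                                                     # (6, Ngauss)

    power = P @ G                                  # one matmul evaluates all pixel/Gaussian exponents
    return opacities[None, :] * torch.exp(-0.5 * power.clamp(min=0.0))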

Authors:Yucheng Zhou, Jiahao Yuan, Qianning Wang
Title: Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation
Abstract:
Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.
中文: 针对文本到图像模型处理复杂指令的不足,我们推出了LongBench-T2I基准测试和Plan2Gen代理框架,无需额外训练即可提升生成与评估能力。
English: Recent text-to-image models face challenges with complex prompts, prompting the introduction of LongBench-T2I, a comprehensive benchmark and Plan2Gen agent framework to enhance generation and evaluation without additional training.

Authors:Jiazhong Cen, Xudong Zhou, Jiemin Fang, Changsong Wen, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian
Title: Tackling View-Dependent Semantics in 3D Language Gaussian Splatting
Abstract:
Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints--a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
中文: LaGa通过将3D场景分解为对象并构建聚合语义表征,有效解决了3D高斯溅射中视角依赖语义的挑战,在LERF-OVS数据集上相比之前最优方法实现了18.7% mIoU的显著提升。
English: LaGa addresses the challenge of view-dependent semantics in 3D Gaussian Splatting by decomposing scenes into objects and constructing aggregated semantic representations, achieving a notable +18.7% mIoU improvement over previous methods on the LERF-OVS dataset.
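A minimal sketch of the view-aggregation step for one decomposed object, assuming its per-view semantic descriptors are already extracted: descriptors are clustered, cluster centers are reweighted by how many views support them, and the result is matched against a text query. The frequency-based weighting and scoring below are assumptions, not LaGa's exact formulation.

import torch
from sklearn.cluster import KMeans

def view_aggregated_semantics(descriptors, n_clusters=4):
    # descriptors: (num_views, dim) CLIP-style features of one object across viewpoints
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(descriptors.numpy())
    centers = torch.from_numpy(km.cluster_centers_).float()
    counts = torch.bincount(torch.from_numpy(km.labels_).long(), minlength=n_clusters).float()
    return centers, counts / counts.sum()          # semantic modes and their multi-view support

def relevance(text_emb, centers, weights):
    sims = torch.nn.functional.cosine_similarity(centers, text_emb[None, :], dim=-1)
    return (weights * sims).sum()                  # support-weighted match to the query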

Authors:Jisheng Dang, Jingze Wu, Teng Wang, Xuanhui Lin, Nannan Zhu, Hongbo Chen, Wei-Shi Zheng, Meng Wang, Tat-Seng Chua
Title: Reinforcing Video Reasoning with Focused Thinking
Abstract:
Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group information entropy), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4\% accuracy on CLEVRER (18.8\% improvement over Video-R1) and 65.8\% on MMVU. Our code is available at https://github.com/longmalongma/TW-GRPO.
中文:TW-GRPO通过引入令牌加权机制聚焦关键信息和采用密集奖励评估部分正确性,显著提升了多模态推理能力,在CLEVRER和MMVU等基准测试中达到最优性能。
English: TW-GRPO enhances multimodal reasoning by introducing token weighting to focus on key information and dense rewards for partial correctness, achieving state-of-the-art results on benchmarks like CLEVRER and MMVU.
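Two of the ingredients are simple enough to sketch: weighting token positions by the information entropy observed across the completions of one GRPO group (so shared generic prefixes are down-weighted), and a soft reward that credits partially correct multi-choice answers. Both functions are toy renditions under assumed shapes, not the paper's exact estimators.

import torch

def token_weights_from_group_entropy(group_token_ids, vocab_size):
    # group_token_ids: (G, T) long tensor of one group's sampled completions, padded to equal length
    T = group_token_ids.shape[1]
    weights = torch.zeros(T)
    for t in range(T):
        counts = torch.bincount(group_token_ids[:, t], minlength=vocab_size).float()
        p = counts / counts.sum()
        weights[t] = -(p[p > 0] * p[p > 0].log()).sum()   # intra-group entropy at position t
    return weights / (weights.sum() + 1e-8)

def soft_multichoice_reward(pred_set, gold_set):
    # partial credit: intersection over union of the selected options
    union = len(pred_set | gold_set)
    return len(pred_set & gold_set) / union if union else 0.0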

Authors:Raman Jha, Adithya Lenka, Mani Ramanagopal, Aswin Sankaranarayanan, Kaushik Mitra
Title: RT-X Net: RGB-Thermal cross attention network for Low-Light Image Enhancement
Abstract:
In nighttime conditions, high noise levels and bright illumination sources degrade image quality, making low-light image enhancement challenging. Thermal images provide complementary information, offering richer textures and structural details. We propose RT-X Net, a cross-attention network that fuses RGB and thermal images for nighttime image enhancement. We leverage self-attention networks for feature extraction and a cross-attention mechanism for fusion to effectively integrate information from both modalities. To support research in this domain, we introduce the Visible-Thermal Image Enhancement Evaluation (V-TIEE) dataset, comprising 50 co-located visible and thermal images captured under diverse nighttime conditions. Extensive evaluations on the publicly available LLVIP dataset and our V-TIEE dataset demonstrate that RT-X Net outperforms state-of-the-art methods in low-light image enhancement. The code and the V-TIEE can be found here https://github.com/jhakrraman/rt-xnet.
中文摘要:RT-X Net通过交叉注意力网络融合RGB与热成像数据来增强夜间图像质量,在公开LLVIP数据集及新构建的V-TIEE数据集上均优于现有先进方法。
English Summary: RT-X Net is a cross-attention network that enhances nighttime images by fusing RGB and thermal data, outperforming existing methods on both the LLVIP and newly introduced V-TIEE datasets.
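The fusion step can be pictured as a standard cross-attention block in which RGB tokens query thermal tokens, so structural detail from the thermal stream is injected into the low-light RGB features. The layer sizes, residual layout, and omission of the quality-guided weighting are illustrative simplifications rather than the exact RT-X Net module.

import torch
import torch.nn as nn

class RGBThermalCrossAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, thermal_tokens):
        # RGB features attend to thermal features and keep a residual path
        fused, _ = self.attn(query=rgb_tokens, key=thermal_tokens, value=thermal_tokens)
        return self.norm(rgb_tokens + fused)

rgb = torch.randn(2, 256, 64)        # flattened feature maps: (batch, tokens, channels)
thermal = torch.randn(2, 256, 64)
out = RGBThermalCrossAttention()(rgb, thermal)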

Authors:Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz
Title: Conformal Prediction for Zero-Shot Models
Abstract:
Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although their discriminative potential has been widely explored, their reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets. Solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits but still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15 times faster than popular transductive approaches.
Chinese: 本研究针对大规模视觉语言模型在可靠性方面的不足,提出Conf-OT方法,通过最优传输理论桥接领域差异,在15个数据集上实现最高20%的集合效率提升,运算速度比现有方法快15倍,同时保持理论覆盖保证。
English: Large-scale vision-language models exhibit strong adaptability but face reliability challenges, which this study addresses by proposing Conf-OT, a conformal prediction method that enhances set efficiency by up to 20% and speeds up processing 15-fold while maintaining theoretical guarantees across diverse datasets.
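For reference, the split conformal baseline that Conf-OT builds on works as below: non-conformity scores on a small labeled calibration set give a quantile threshold, and every class whose score passes the threshold enters the prediction set, guaranteeing 1 - alpha coverage. The optimal transport correction itself is not reproduced here, and the simple 1 - p(true class) score is only one of several possible choices.

import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]          # non-conformity on calibration data
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)      # finite-sample corrected quantile level
    qhat = np.quantile(scores, q_level, method="higher")
    return test_probs >= 1.0 - qhat                             # boolean mask over classes = prediction sets

# toy usage with zero-shot softmax probabilities over 3 classes
cal_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.6, 0.3, 0.1]])
cal_labels = np.array([0, 1, 1, 0])
print(split_conformal_sets(cal_probs, cal_labels, np.array([[0.5, 0.3, 0.2]])))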

Authors:Jiahe Chen, Jiahe Ying, Shen Wang, Jianwei Zheng
Title: Decoupled Competitive Framework for Semi-supervised Medical Image Segmentation
Abstract:
Confronting the critical challenge of insufficiently annotated samples in the medical domain, semi-supervised medical image segmentation (SSMIS) emerges as a promising solution. Specifically, most methodologies following the Mean Teacher (MT) or Dual Students (DS) architecture have achieved commendable results. However, to date, these approaches face a performance bottleneck due to two inherent limitations, e.g., the over-coupling problem within the MT structure owing to the employment of the exponential moving average (EMA) mechanism, as well as the severe cognitive bias between the two students of the DS structure, both of which potentially lead to reduced efficacy or even eventual model collapse. To mitigate these issues, a Decoupled Competitive Framework (DCF) is elaborated in this work, which utilizes a straightforward competition mechanism for the update of EMA, effectively decoupling students and teachers in a dynamical manner. In addition, the seamless exchange of invaluable and precise insights is facilitated among students, guaranteeing a better learning paradigm. The introduced DCF undergoes rigorous validation on three publicly accessible datasets, which encompass both 2D and 3D data. The results demonstrate the superiority of our method over previous cutting-edge competitors. Code will be available at https://github.com/JiaheChen2002/DCF.
中文: 针对半监督医学图像分割中因均值教师结构过耦合和双学生结构认知偏差导致的性能瓶颈,本研究提出解耦竞争框架(DCF),通过竞争机制动态解耦师生关系并促进知识无缝交换,在多个数据集上取得优越性能。
English: To overcome performance bottlenecks in semi-supervised medical image segmentation caused by over-coupling in Mean Teacher and cognitive bias in Dual Students, this study proposes a Decoupled Competitive Framework (DCF) that dynamically decouples students and teachers through a competition mechanism and enables seamless knowledge exchange, achieving superior results on multiple datasets.
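The competitive replacement for the usual EMA update can be sketched in a few lines: instead of always distilling one fixed student into the teacher, the teacher is updated from whichever student currently wins a competition (for example, higher Dice on the labeled set). The scoring rule and momentum below are assumptions, not DCF's exact schedule.

import torch

@torch.no_grad()
def competitive_ema_update(teacher, student_a, student_b, score_a, score_b, momentum=0.99):
    # pick the currently stronger student, then apply a standard EMA step toward it
    winner = student_a if score_a >= score_b else student_b
    for t_param, s_param in zip(teacher.parameters(), winner.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)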

Authors:Xuzhi Wang, Wei Feng, Lingdong Kong, Liang Wan
Title: NUC-Net: Non-uniform Cylindrical Partition Network for Efficient LiDAR Semantic Segmentation
Abstract:
LiDAR semantic segmentation plays a vital role in autonomous driving. Existing voxel-based methods for LiDAR semantic segmentation apply uniform partition to the 3D LiDAR point cloud to form a structured representation based on cartesian/cylindrical coordinates. Although these methods show impressive performance, the drawbacks of existing voxel-based methods remain in two aspects: (1) they require a sufficiently large input voxel resolution, which incurs substantial computation cost and memory consumption; (2) they do not handle the unbalanced point distribution of the LiDAR point cloud well. In this paper, we propose a non-uniform cylindrical partition network named NUC-Net to tackle the above challenges. Specifically, we propose the Arithmetic Progression of Interval (API) method to non-uniformly partition the radial axis and generate a voxel representation which is representative and efficient. Moreover, we propose a non-uniform multi-scale aggregation method to improve contextual information. Our method achieves state-of-the-art performance on the SemanticKITTI and nuScenes datasets with much faster speed and much less training time. Our method can also serve as a general component for LiDAR semantic segmentation, significantly improving both the accuracy and efficiency of its uniform counterpart with $4\times$ faster training, $2\times$ lower GPU memory usage, and $3\times$ faster inference. We further provide theoretical analysis towards understanding why NUC is effective and how point distribution affects performance. Code is available at https://github.com/alanWXZ/NUC-Net.
中文: 提出的NUC-Net采用非均匀柱面分区和多尺度聚合方法,有效解决了现有激光雷达语义分割方法计算效率低和点云分布不均的问题,在保持顶尖性能的同时大幅提升了速度和内存效率。
English: The proposed NUC-Net introduces non-uniform cylindrical partitioning and multi-scale aggregation to overcome the computational inefficiency and unbalanced point distribution issues in existing LiDAR semantic segmentation methods, achieving state-of-the-art performance with significant speed and memory improvements.
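One plausible reading of the Arithmetic Progression of Interval idea is sketched below: radial bin widths grow as an arithmetic progression, so bins are fine near the sensor where points are dense and coarse far away where they are sparse. The direction of growth and the parameterization by the first width are assumptions for illustration, not the paper's definition.

import numpy as np

def api_radial_edges(r_max, n_bins, first_width):
    # solve the common difference d from n*w1 + d*n*(n-1)/2 = r_max
    d = (r_max - n_bins * first_width) / (n_bins * (n_bins - 1) / 2)
    widths = first_width + d * np.arange(n_bins)
    return np.concatenate([[0.0], np.cumsum(widths)])

edges = api_radial_edges(r_max=50.0, n_bins=10, first_width=2.0)
radii = np.array([1.5, 12.0, 49.0])
bin_idx = np.searchsorted(edges, radii, side="right") - 1     # non-uniform radial voxel index per point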

Authors:Ivan Pereira-Sánchez, Julia Navarro, Ana Belén Petro, Joan Duran
Title: Model-Guided Network with Cluster-Based Operators for Spatio-Spectral Super-Resolution
Abstract:
This paper addresses the problem of reconstructing a high-resolution hyperspectral image from a low-resolution multispectral observation. While spatial super-resolution and spectral super-resolution have been extensively studied, joint spatio-spectral super-resolution remains relatively underexplored. We propose an end-to-end model-driven framework that explicitly decomposes the joint spatio-spectral super-resolution problem into spatial super-resolution, spectral super-resolution and fusion tasks. Each sub-task is addressed by unfolding a variational-based approach, where the operators involved in the proximal gradient iterative scheme are replaced with tailored learnable modules. In particular, we design an upsampling operator for spatial super-resolution based on classical back-projection algorithms, adapted to handle arbitrary scaling factors. Spectral reconstruction is performed using learnable cluster-based upsampling and downsampling operators. For image fusion, we integrate low-frequency estimation and high-frequency injection modules to combine the spatial and spectral information from spatial super-resolution and spectral super-resolution outputs. Additionally, we introduce an efficient nonlocal post-processing step that leverages image self-similarity by combining a multi-head attention mechanism with residual connections. Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of our approach. The source code will be available at https://github.com/TAMI-UIB/JSSUNet.
中文: 本文提出了一种模型驱动的联合空谱超分辨率框架,通过将问题分解为空间超分辨率、光谱超分辨率和融合子任务,采用可学习模块和非局部后处理步骤,在多个数据集和缩放因子下实现了优越性能。
English: This paper proposes a model-driven framework for joint spatio-spectral super-resolution that decomposes the problem into spatial super-resolution, spectral super-resolution, and fusion tasks, utilizing learnable modules and a nonlocal post-processing step to achieve superior performance across various datasets and scaling factors.
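The unfolding idea shared by the sub-tasks can be condensed to one unrolled iteration: a data-fidelity gradient step with respect to an assumed degradation operator A, followed by a small learnable network standing in for the proximal operator. The two-convolution proximal module and the learnable step size are placeholders, not the paper's tailored operators.

import torch
import torch.nn as nn

class UnfoldedProxStep(nn.Module):
    def __init__(self, channels, step=0.1):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(step))
        self.prox = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x, y, A, At):
        # gradient of 0.5 * ||A(x) - y||^2, then a learned proximal correction
        grad = At(A(x) - y)
        return self.prox(x - self.step * grad)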

Authors:Qinghe Ma, Jian Zhang, Lei Qi, Qian Yu, Yinghuan Shi, Yang Gao
Title: Unleashing the Power of Intermediate Domains for Mixed Domain Semi-Supervised Medical Image Segmentation
Abstract:
Both limited annotation and domain shift are prevalent challenges in medical image segmentation. Traditional semi-supervised segmentation and unsupervised domain adaptation methods address one of these issues separately. However, the coexistence of limited annotation and domain shift is quite common, which motivates us to introduce a novel and challenging scenario: Mixed Domain Semi-supervised medical image Segmentation (MiDSS), where limited labeled data from a single domain is available alongside a large amount of unlabeled data from multiple domains. To tackle this issue, we propose the UST-RUN framework, which fully leverages intermediate domain information to facilitate knowledge transfer. We employ Unified Copy-paste (UCP) to construct intermediate domains, and propose a Symmetric GuiDance training strategy (SymGD) to supervise unlabeled data by merging pseudo-labels from intermediate samples. Subsequently, we introduce a Training Process aware Random Amplitude MixUp (TP-RAM) to progressively incorporate style-transition components into intermediate samples. To generate more diverse intermediate samples, we further select reliable samples with high-quality pseudo-labels, which are then mixed with other unlabeled data. Additionally, we generate sophisticated intermediate samples with high-quality pseudo-labels for unreliable samples, ensuring effective knowledge transfer for them. Extensive experiments on four public datasets demonstrate the superiority of UST-RUN. Notably, UST-RUN achieves a 12.94% improvement in Dice score on the Prostate dataset. Our code is available at https://github.com/MQinghe/UST-RUN.
中文: 本研究针对医学图像分割中标注有限和领域偏移并存的问题,提出了MiDSS场景和UST-RUN框架,通过统一复制粘贴构建中间域、对称引导训练和训练过程感知的随机幅度混合等创新方法,有效促进知识迁移,在前列腺数据集上实现了12.94%的Dice指标提升。
English: The study introduces the MiDSS scenario to address both limited annotations and domain shifts in medical image segmentation, proposing the UST-RUN framework that utilizes intermediate domains and innovative strategies like UCP, SymGD, and TP-RAM to enhance knowledge transfer and achieve significant performance gains, such as a 12.94% Dice improvement on the Prostate dataset.
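The Unified Copy-paste construction of intermediate domains amounts to pasting a region of a labeled-domain image onto an unlabeled image from another domain, so the mixed sample lies between the two distributions; the corresponding labels and pseudo-labels are mixed with the same box. The minimal sketch below assumes a rectangular box and omits the label mixing.

import torch

def unified_copy_paste(labeled_img, unlabeled_img, box):
    # box = (y0, y1, x0, x1); the pasted region carries labeled-domain appearance
    y0, y1, x0, x1 = box
    mixed = unlabeled_img.clone()
    mixed[..., y0:y1, x0:x1] = labeled_img[..., y0:y1, x0:x1]
    return mixed

mixed = unified_copy_paste(torch.rand(1, 256, 256), torch.rand(1, 256, 256), (64, 192, 64, 192))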

Authors:Simone Cammarasana, Giuseppe Patanè
Title: Optimal Weighted Convolution for Classification and Denoising
Abstract:
We introduce a novel weighted convolution operator that enhances traditional convolutional neural networks (CNNs) by integrating a spatial density function into the convolution operator. This extension enables the network to differentially weight neighbouring pixels based on their relative position to the reference pixel, improving spatial characterisation and feature extraction. The proposed operator maintains the same number of trainable parameters and is fully compatible with existing CNN architectures. Although developed for 2D image data, the framework is generalisable to signals on regular grids of arbitrary dimensions, such as 3D volumetric data or 1D time series. We propose an efficient implementation of the weighted convolution by pre-computing the density function and achieving execution times comparable to standard convolution layers. We evaluate our method on two deep learning tasks: image classification using the CIFAR-100 dataset [KH+09] and image denoising using the DIV2K dataset [AT17]. Experimental results with state-of-the-art classification (e.g., VGG [SZ15], ResNet [HZRS16]) and denoising (e.g., DnCNN [ZZC+17], NAFNet [CCZS22]) methods show that the weighted convolution improves performance with respect to standard convolution across different quantitative metrics. For example, VGG achieves an accuracy of 66.94% with weighted convolution versus 56.89% with standard convolution on the classification problem, while DnCNN improves the PSNR value from 20.17 to 22.63 on the denoising problem. All models were trained on the CINECA Leonardo cluster to reduce the execution time and improve the tuning of the density function values. The PyTorch implementation of the weighted convolution is publicly available at: https://github.com/cammarasana123/weightedConvolution2.0.
中文: 本文提出了一种新型加权卷积算子,通过引入空间密度函数对邻域像素进行差异化加权,在保持参数数量和架构兼容性的同时,有效提升了图像分类与去噪任务的性能表现。
English: This paper introduces a novel weighted convolution operator that enhances CNNs by incorporating a spatial density function to differentially weight pixels, improving performance in image classification and denoising tasks while maintaining parameter efficiency and architectural compatibility.
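Since the density function is pre-computed and not trainable, the operator can be realized as a standard convolution whose kernel is modulated by a fixed map of the offset from the center pixel, keeping the parameter count unchanged. The Gaussian-of-distance density below is an illustrative choice; the paper tunes the density values separately.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedConv2d(nn.Conv2d):
    def __init__(self, in_ch, out_ch, kernel_size=3, sigma=1.0, **kw):
        super().__init__(in_ch, out_ch, kernel_size, **kw)
        coords = torch.arange(kernel_size) - (kernel_size - 1) / 2
        yy, xx = torch.meshgrid(coords, coords, indexing="ij")
        density = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        self.register_buffer("density", density / density.sum())   # fixed, non-trainable weighting

    def forward(self, x):
        # same trainable parameters as Conv2d; the kernel is modulated by the spatial density
        return F.conv2d(x, self.weight * self.density, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

out = WeightedConv2d(3, 16, padding=1)(torch.randn(1, 3, 32, 32))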

Authors:Yang-Tian Sun, Xin Yu, Zehuan Huang, Yi-Hua Huang, Yuan-Chen Guo, Ziyi Yang, Yan-Pei Cao, Xiaojuan Qi
Title: UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation
Abstract:
Recently, methods leveraging diffusion model priors to assist monocular geometric estimation (e.g., depth and normal) have gained significant attention due to their strong generalization ability. However, most existing works focus on estimating geometric properties within the camera coordinate system of individual video frames, neglecting the inherent ability of diffusion models to determine inter-frame correspondence. In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Specifically, we 1) select geometric attributes in the global coordinate system that share the same correspondence with video frames as the prediction targets, 2) introduce a novel and efficient conditioning method by reusing positional encodings, and 3) enhance performance through joint training on multiple geometric attributes that share the same correspondence. Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks. Even when trained solely on static video data, our approach exhibits the potential to generalize to dynamic video scenes.

Authors:Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
Title: un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Abstract:
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative models, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un$^2$CLIP) to improve the CLIP model. In this way, the improved image encoder can gain unCLIP's visual detail capturing ability while preserving its alignment with the original text encoder simultaneously. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un$^2$CLIP significantly improves the original CLIP and previous CLIP improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.
中文: 本研究提出un$^2$CLIP改进模型,通过反演unCLIP生成框架增强视觉细节捕捉能力,同时保持图文对齐特性,在多项视觉与多模态任务中显著提升原CLIP模型性能。
English: This work proposes un$^2$CLIP, an improved CLIP model that leverages the inversion of unCLIP's generative framework to enhance visual detail capture while maintaining text-image alignment, achieving superior performance across multiple vision and multimodal tasks.

Authors:Jing Huang, Yongkang Zhao, Yuhan Li, Zhitao Dai, Cheng Chen, Qiying Lai
Title: ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation
Abstract:
The U-shaped encoder-decoder architecture with skip connections has become a prevailing paradigm in medical image segmentation due to its simplicity and effectiveness. While many recent works aim to improve this framework by designing more powerful encoders and decoders, employing advanced convolutional neural networks (CNNs) for local feature extraction, Transformers or state space models (SSMs) such as Mamba for global context modeling, or hybrid combinations of both, these methods often struggle to fully utilize pretrained vision backbones (e.g., ResNet, ViT, VMamba) due to structural mismatches. To bridge this gap, we introduce ACM-UNet, a general-purpose segmentation framework that retains a simple UNet-like design while effectively incorporating pretrained CNNs and Mamba models through a lightweight adapter mechanism. This adapter resolves architectural incompatibilities and enables the model to harness the complementary strengths of CNNs and SSMs-namely, fine-grained local detail extraction and long-range dependency modeling. Additionally, we propose a hierarchical multi-scale wavelet transform module in the decoder to enhance feature fusion and reconstruction fidelity. Extensive experiments on the Synapse and ACDC benchmarks demonstrate that ACM-UNet achieves state-of-the-art performance while remaining computationally efficient. Notably, it reaches 85.12% Dice Score and 13.89mm HD95 on the Synapse dataset with 17.93G FLOPs, showcasing its effectiveness and scalability. Code is available at: https://github.com/zyklcode/ACM-UNet.
中文: ACM-UNet通过轻量级适配器有效整合预训练的CNN和Mamba模型,在保持计算效率的同时实现了最先进的医学图像分割性能。
English: ACM-UNet introduces a lightweight adapter to effectively integrate pretrained CNNs and Mamba models, achieving state-of-the-art medical image segmentation performance while maintaining computational efficiency.

Authors:Yuting Zhang, Hao Lu, Qingyong Hu, Yin Wang, Kaishen Yuan, Xin Liu, Kaishun Wu
Title: Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Abstract:
Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to limitations in: 1) lack of temporal modelling and 2) conflict between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of varying difficulty for evaluating the cross-modal periodic capabilities of large models. Specifically, we adopt an "Easy to Hard Generalization" paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a "Resisting Logical Oblivion" optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code is available at https://github.com/keke-nice/Period-LLM.
中文摘要:本文提出Period-LLM多模态大语言模型,通过采用“由易到难泛化”范式和“抵抗逻辑遗忘”优化策略,有效解决了现有模型在周期性任务中的时序建模与长短周期冲突问题,在多模态周期性任务中展现出卓越性能。
English Summary: This paper introduces Period-LLM, a multimodal large language model designed to overcome current limitations in handling periodic tasks by employing an "Easy to Hard Generalization" paradigm and a "Resisting Logical Oblivion" strategy, demonstrating superior performance across various modalities.

Authors:Heejo Kong, Sung-Jin Kim, Gunho Jung, Seong-Whan Lee
Title: Diversify and Conquer: Open-set Disagreement for Robust Semi-supervised Learning with Outliers
Abstract:
Conventional semi-supervised learning (SSL) ideally assumes that labeled and unlabeled data share an identical class distribution; however, in practice, this assumption is easily violated, as unlabeled data often includes unknown class data, i.e., outliers. The outliers are treated as noise, considerably degrading the performance of SSL models. To address this drawback, we propose a novel framework, Diversify and Conquer (DAC), to enhance SSL robustness in the context of open-set semi-supervised learning. In particular, we note that existing open-set SSL methods rely on prediction discrepancies between inliers and outliers from a single model trained on labeled data. This approach can easily fail when the labeled data is insufficient, leading to performance degradation that is worse than naive SSL methods that do not account for outliers. In contrast, our approach exploits prediction disagreements among multiple models that are differently biased towards the unlabeled distribution. By leveraging the discrepancies arising from training on unlabeled data, our method enables robust outlier detection even when the labeled data is underspecified. Our key contribution is constructing a collection of differently biased models through a single training process. By encouraging divergent heads to be differently biased towards outliers while making consistent predictions for inliers, we exploit the disagreement among these heads as a measure to identify unknown concepts. Our code is available at https://github.com/heejokong/DivCon.
中文: 提出的“多样化与征服”(DAC)框架通过训练多个具有不同偏差的模型,利用其预测差异来检测异常数据,有效提升了半监督学习在开放集场景下的鲁棒性,克服了现有方法在标注数据不足时性能下降的问题。
English: The proposed Diversify and Conquer (DAC) framework enhances semi-supervised learning robustness by training multiple divergent models that detect outliers through their prediction disagreements, overcoming limitations of existing methods that fail with insufficient labeled data.
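The detection signal itself is easy to illustrate: for each unlabeled sample, the differently biased heads produce their own class distributions, and the spread across heads serves as an outlier score since the heads are trained to agree on inliers. Using the total across-head variance of softmax probabilities is a simple stand-in for DAC's actual disagreement measure.

import torch

def disagreement_score(head_logits):
    # head_logits: (num_heads, num_classes) for one unlabeled sample
    probs = torch.softmax(head_logits, dim=-1)
    return probs.var(dim=0).sum().item()          # high when heads disagree -> likely unknown class

inlier = torch.tensor([[3.0, 0.1, 0.2], [2.8, 0.0, 0.3]])    # heads agree
outlier = torch.tensor([[3.0, 0.1, 0.2], [0.1, 2.9, 0.4]])   # heads disagree
print(disagreement_score(inlier), disagreement_score(outlier))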

Authors:Chunxu Liu, Chi Xie, Xiaxu Chen, Wei Li, Feng Zhu, Rui Zhao, Limin Wang
Title: SORCE: Small Object Retrieval in Complex Environments
Abstract:
Text-to-Image Retrieval (T2IR) is a highly valuable task that aims to match a given textual query to images in a gallery. Existing benchmarks primarily focus on textual queries describing overall image semantics or foreground salient objects, possibly overlooking inconspicuous small objects, especially in complex environments. Such small object retrieval is crucial, as in real-world applications, the targets of interest are not always prominent in the image. Thus, we introduce SORCE (Small Object Retrieval in Complex Environments), a new subfield of T2IR, focusing on retrieving small objects in complex images with textual queries. We propose a new benchmark, SORCE-1K, consisting of images with complex environments and textual queries describing less conspicuous small objects with minimal contextual cues from other salient objects. Preliminary analysis on SORCE-1K finds that existing T2IR methods struggle to capture small objects and encode all the semantics into a single embedding, leading to poor retrieval performance on SORCE-1K. Therefore, we propose to represent each image with multiple distinctive embeddings. We leverage Multimodal Large Language Models (MLLMs) to extract multiple embeddings for each image instructed by a set of Regional Prompts (ReP). Experimental results show that our multi-embedding approach through MLLM and ReP significantly outperforms existing T2IR methods on SORCE-1K. Our experiments validate the effectiveness of SORCE-1K for benchmarking SORCE performances, highlighting the potential of multi-embedding representation and text-customized MLLM features for addressing this task.
中文: 本研究提出了专注于复杂环境中小物体检索的SORCE新子领域,并采用多模态大语言模型和区域提示的多嵌入方法,在SORCE-1K基准测试中显著优于现有技术。
English: The study introduces SORCE, a new subfield of Text-to-Image Retrieval focusing on small objects in complex environments, and proposes a multi-embedding approach using MLLMs and Regional Prompts that significantly outperforms existing methods on the SORCE-1K benchmark.

Authors:Bozhong Zheng, Jinye Gan, Xiaohao Xu, Wenqiao Li, Xiaonan Huang, Na Ni, Yingna Wu
Title: Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
Abstract:
3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and a SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. The code is available at https://github.com/ZZZBBBZZZ/PASDF to support further research.
中文摘要:PASDF框架通过引入姿态不变的连续有向距离场表示,实现了精确的三维异常检测与原位修复,在基准数据集上取得了领先性能。
English Summary: The PASDF framework introduces a pose-invariant continuous shape representation using signed distance fields to achieve precise 3D anomaly detection and in-situ repair, demonstrating state-of-the-art performance on benchmark datasets.

Authors:Wenlong Jiao, Binglong Li, Wei Shang, Ping Wang, Dongwei Ren
Title: Efficient RAW Image Deblurring with Adaptive Frequency Modulation
Abstract:
Image deblurring plays a crucial role in enhancing visual clarity across various applications. Although most deep learning approaches primarily focus on sRGB images, which inherently lose critical information during the image signal processing pipeline, RAW images, being unprocessed and linear, possess superior restoration potential but remain underexplored. Deblurring RAW images presents unique challenges, particularly in handling frequency-dependent blur while maintaining computational efficiency. To address these issues, we propose Frequency Enhanced Network (FrENet), a framework specifically designed for RAW-to-RAW deblurring that operates directly in the frequency domain. We introduce a novel Adaptive Frequency Positional Modulation module, which dynamically adjusts frequency components according to their spectral positions, thereby enabling precise control over the deblurring process. Additionally, frequency domain skip connections are adopted to further preserve high-frequency details. Experimental results demonstrate that FrENet surpasses state-of-the-art deblurring methods in RAW image deblurring, achieving significantly better restoration quality while maintaining high efficiency in terms of reduced MACs. Furthermore, FrENet's adaptability enables it to be extended to sRGB images, where it delivers comparable or superior performance compared to methods specifically designed for sRGB data. The code will be available at https://github.com/WenlongJiao/FrENet .
中文: 提出的频率增强网络(FrENet)通过在频域直接处理RAW图像去模糊问题,以高效计算实现了卓越的恢复质量,并展现出对sRGB图像的适配能力。
English: The proposed Frequency Enhanced Network (FrENet) effectively addresses RAW image deblurring by operating directly in the frequency domain, achieving superior restoration quality with high computational efficiency and demonstrating adaptability to sRGB images.
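Operating directly in the frequency domain can be pictured with a small module: the spectrum of the packed RAW tensor is reweighted by a learnable map indexed by spectral position and then transformed back. The real module also conditions the modulation on the input; the position-only parameterization and tensor shapes below are assumptions.

import torch
import torch.nn as nn

class FrequencyPositionalModulation(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        # rfft2 keeps width // 2 + 1 frequency columns
        self.scale = nn.Parameter(torch.ones(channels, height, width // 2 + 1))

    def forward(self, x):
        spec = torch.fft.rfft2(x, norm="ortho")              # complex spectrum (B, C, H, W//2 + 1)
        spec = spec * self.scale                              # per-frequency-position reweighting
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

blurred = torch.randn(1, 4, 64, 64)                           # RAW packed into 4 channels (e.g. RGGB)
restored = FrequencyPositionalModulation(4, 64, 64)(blurred)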

Authors:Kanokphan Lertniphonphan, Feng Chen, Junda Xu, Fengbu Lan, Jun Xie, Tao Zhang, Zhepeng Wang
Title: PCIE_Interaction Solution for Ego4D Social Interaction Challenge
Abstract:
This report presents our team's PCIE_Interaction solution for the Ego4D Social Interaction Challenge at CVPR 2025, addressing both Looking At Me (LAM) and Talking To Me (TTM) tasks. The challenge requires accurate detection of social interactions between subjects and the camera wearer, with LAM relying exclusively on face crop sequences and TTM combining speaker face crops with synchronized audio segments. In the LAM track, we employ face quality enhancement and ensemble methods. For the TTM task, we extend visual interaction analysis by fusing audio and visual cues, weighted by a visual quality score. Our approach achieved 0.81 and 0.71 mean average precision (mAP) on the LAM and TTM challenge leaderboards, respectively. Code is available at https://github.com/KanokphanL/PCIE_Ego4D_Social_Interaction.
中文: 我们针对CVPR 2025 Ego4D社交互动挑战赛提出的PCIE_Interaction方案,通过增强面部分析处理"注视检测"任务,融合视听信息解决"对话检测"任务,分别取得了0.81和0.71的平均精度值。
English: Our PCIE_Interaction solution for CVPR 2025's Ego4D Social Interaction Challenge uses enhanced face analysis for the Looking At Me task and audio-visual fusion for Talking To Me, achieving 0.81 and 0.71 mAP respectively.

Authors:Mika Feng, Koichi Ito, Takafumi Aoki, Tetsushi Ohki, Masakatsu Nishigaki
Title: Leveraging Intermediate Features of Vision Transformer for Face Anti-Spoofing
Abstract:
Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may bypass the authentication process illegally. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which have a good balance between local and global features that are important for spoofing attack detection, for calculating loss in training and score in inference. The proposed method also introduces two data augmentation methods: face anti-spoofing data augmentation and patch-wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU-NPU and SiW datasets. The project page is available at: https://gsisaoki.github.io/FAS-ViT-CVPRW/ .
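Pulling intermediate ViT features for both the training loss and the inference score can be done with plain forward hooks; the timm backbone and the chosen block indices below are illustrative assumptions, not the paper's configuration.

import torch
import timm

def vit_intermediate_features(model_name="vit_base_patch16_224", layers=(5, 8)):
    # mid-depth blocks balance local texture cues and global face structure
    model = timm.create_model(model_name, pretrained=False).eval()
    feats = {}
    for i in layers:
        model.blocks[i].register_forward_hook(
            lambda _m, _in, out, i=i: feats.__setitem__(i, out)
        )
    with torch.no_grad():
        model(torch.randn(1, 3, 224, 224))
    return feats                                   # {block index: (1, tokens, dim)}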

Authors:Xianheng Ma, Hongchen Tan, Xiuping Liu, Yi Zhang, Huasheng Wang, Jiang Liu, Ying Chen, Hantao Liu
Title: S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification
Abstract:
In this paper, we leverage the advantages of event cameras to resist harsh lighting conditions, reduce background interference, achieve high time resolution, and protect facial information to study the long-sequence event-based person re-identification (Re-ID) task. To this end, we propose a simple and efficient long-sequence event Re-ID model, namely the Spike-guided Spatiotemporal Semantic Coupling and Expansion Network (S3CE-Net). To better handle asynchronous event data, we build S3CE-Net based on spiking neural networks (SNNs). The S3CE-Net incorporates the Spike-guided Spatial-temporal Attention Mechanism (SSAM) and the Spatiotemporal Feature Sampling Strategy (STFS). The SSAM is designed to carry out semantic interaction and association in both spatial and temporal dimensions, leveraging the capabilities of SNNs. The STFS involves sampling spatial feature subsequences and temporal feature subsequences from the spatiotemporal dimensions, driving the Re-ID model to perceive broader and more robust effective semantics. Notably, the STFS introduces no additional parameters and is only utilized during the training stage. Therefore, S3CE-Net is a low-parameter and high-efficiency model for long-sequence event-based person Re-ID. Extensive experiments have verified that our S3CE-Net achieves outstanding performance on many mainstream long-sequence event-based person Re-ID datasets. Code is available at:https://github.com/Mhsunshine/SC3E_Net.
中文: 本文提出S3CE-Net模型,基于脉冲神经网络处理事件相机数据,通过脉冲引导的时空注意力机制和特征采样策略,以低参数量实现高效的长序列行人重识别,并在多个主流数据集上表现出色。
English: This paper introduces S3CE-Net, a spiking neural network-based model that uses event cameras for efficient long-sequence person re-identification by incorporating spike-guided attention and spatiotemporal feature sampling to achieve robust performance with minimal parameters.

Authors:Gilles Quentin Hacheme, Girmaw Abebe Tadesse, Caleb Robinson, Akram Zaytar, Rahul Dodhia, Juan M. Lavista Ferres
Title: GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models
Abstract:
Classifying geospatial imagery remains a major bottleneck for applications such as disaster response and land-use monitoring, particularly in regions where annotated data is scarce or unavailable. Existing tools (e.g., RS-CLIP) that claim zero-shot classification capabilities for satellite imagery nonetheless rely on task-specific pretraining and adaptation to reach competitive performance. We introduce GeoVision Labeler (GVL), a strictly zero-shot classification framework: a vision Large Language Model (vLLM) generates rich, human-readable image descriptions, which are then mapped to user-defined classes by a conventional Large Language Model (LLM). This modular and interpretable pipeline enables flexible image classification for a large range of use cases. We evaluated GVL across three benchmarks: SpaceNet v7, UC Merced, and RESISC45. It achieves up to 93.2% zero-shot accuracy on the binary Buildings vs. No Buildings task on SpaceNet v7. For complex multi-class classification tasks (UC Merced, RESISC45), we implemented a recursive LLM-driven clustering to form meta-classes at successive depths, followed by hierarchical classification (first resolving coarse groups, then finer distinctions) to deliver competitive zero-shot performance. GVL is open-sourced at https://github.com/microsoft/geo-vision-labeler to catalyze adoption in real-world geospatial workflows.
Chinese: GeoVision标注器(GVL)是一种严格零样本分类框架,通过视觉大语言模型生成图像描述并由常规大语言模型将其映射至用户定义类别,无需任务特定训练即可实现高精度地理空间影像分类。
English: The GeoVision Labeler (GVL) is a strictly zero-shot classification framework that uses a vision Large Language Model to generate image descriptions and a conventional LLM to map them to user-defined classes, achieving high accuracy in geospatial imagery classification without task-specific training.

Authors:Uzair Khan, Franco Fummi, Luigi Capogrosso
Title: KairosAD: A SAM-Based Model for Industrial Anomaly Detection on Embedded Devices
Abstract:
In the era of intelligent manufacturing, anomaly detection has become essential for maintaining quality control on modern production lines. However, while many existing models show promising performance, they are often too large, computationally demanding, and impractical to deploy on resource-constrained embedded devices that can be easily installed on the production lines of Small and Medium Enterprises (SMEs). To bridge this gap, we present KairosAD, a novel supervised approach that uses the power of the Mobile Segment Anything Model (MobileSAM) for image-based anomaly detection. KairosAD has been evaluated on the two well-known industrial anomaly detection datasets, i.e., MVTec-AD and ViSA. The results show that KairosAD requires 78% fewer parameters and boasts a 4x faster inference time compared to the leading state-of-the-art model, while maintaining comparable AUROC performance. We deployed KairosAD on two embedded devices, the NVIDIA Jetson NX, and the NVIDIA Jetson AGX. Finally, KairosAD was successfully installed and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at https://github.com/intelligolabs/KairosAD.
中文摘要:KairosAD是一种基于MobileSAM的轻量级监督异常检测模型,在保持与顶尖模型相当的AUROC性能的同时,参数减少78%、推理速度提升4倍,可部署于中小企业生产线的嵌入式设备。
English Summary: KairosAD is a lightweight supervised anomaly detection model that leverages MobileSAM to achieve comparable performance with 78% fewer parameters and 4x faster inference than leading models, making it suitable for deployment on resource-constrained embedded devices in SMEs.

Authors:Yingsen Zeng, Zepeng Huang, Yujie Zhong, Chengjian Feng, Jie Hu, Lin Ma, Yang Liu
Title: DisTime: Distribution-based Time Representation for Video Large Language Models
Abstract:
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at https://github.com/josephzpng/DisTime.
中文: DisTime通过引入连续时间嵌入和基于分布的编解码器增强视频大语言模型的时间定位能力,并借助大规模InternVid-TG数据集,在时间敏感任务中取得了最优性能。
English: DisTime enhances Video-LLMs' temporal localization by introducing continuous temporal embeddings and a distribution-based decoder, supported by the large-scale InternVid-TG dataset, achieving state-of-the-art results in time-sensitive tasks.
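The distribution-based readout of a timestamp can be sketched as a linear head over discretized normalized times followed by an expectation, which keeps the prediction continuous while the softmax spreads probability over neighboring bins to soften boundary ambiguity. The bin count and single-head layout are assumptions; DisTime predicts distributions for event start and end times.

import torch
import torch.nn as nn

class DistributionTimeDecoder(nn.Module):
    def __init__(self, hidden_dim, n_bins=100):
        super().__init__()
        self.head = nn.Linear(hidden_dim, n_bins)
        self.register_buffer("bin_centers", (torch.arange(n_bins) + 0.5) / n_bins)

    def forward(self, time_token_state):
        probs = torch.softmax(self.head(time_token_state), dim=-1)   # (B, n_bins)
        return probs @ self.bin_centers                               # expected normalized timestamp

t = DistributionTimeDecoder(hidden_dim=512)(torch.randn(2, 512))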

Authors:Enshang Zhang, Zhicheng Zhang, Takashi Hanakawa
Title: Category-aware EEG image generation based on wavelet transform and contrast semantic loss
Abstract:
Reconstructing visual stimuli from EEG signals is a crucial step in realizing brain-computer interfaces. In this paper, we propose a transformer-based EEG signal encoder integrating the Discrete Wavelet Transform (DWT) and the gating mechanism. Guided by the feature alignment and category-aware fusion losses, this encoder is used to extract features related to visual stimuli from EEG signals. Subsequently, with the aid of a pre-trained diffusion model, these features are reconstructed into visual stimuli. To verify the effectiveness of the model, we conducted EEG-to-image generation and classification tasks using the THINGS-EEG dataset. To address the limitations of quantitative analysis at the semantic level, we combined WordNet-based classification and semantic similarity metrics to propose a novel semantic-based score, emphasizing the ability of our model to transfer neural activities into visual representations. Experimental results show that our model significantly improves semantic alignment and classification accuracy, achieving a maximum single-subject accuracy of 43\%, outperforming other state-of-the-art methods. The source code and supplementary material are available at https://github.com/zes0v0inn/DWT_EEG_Reconstruction/tree/main.
中文摘要:本文提出一种结合离散小波变换和门控机制的基于Transformer的脑电信号编码器,通过预训练扩散模型从脑电信号重建视觉刺激,在THINGS-EEG数据集上实现了最佳的语义对齐效果和高达43%的分类准确率。
English Summary: This paper introduces a transformer-based EEG encoder that integrates Discrete Wavelet Transform and gating mechanisms to reconstruct visual stimuli from brain signals using a diffusion model, achieving state-of-the-art performance in semantic alignment and classification accuracy on the THINGS-EEG dataset.

Authors:Prasanna Reddy Pulakurthi, Majid Rabbani, Jamison Heard, Sohail Dianat, Celso M. de Melo, Raghuveer Rao
Title: Shuffle PatchMix Augmentation with Confidence-Margin Weighted Pseudo-Labels for Enhanced Source-Free Domain Adaptation
Abstract:
This work investigates Source-Free Domain Adaptation (SFDA), where a model adapts to a target domain without access to source data. A new augmentation technique, Shuffle PatchMix (SPM), and a novel reweighting strategy are introduced to enhance performance. SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable pseudo-labels to mitigate label noise. These techniques are particularly effective on smaller datasets like PACS, where overfitting and pseudo-label noise pose greater risks. State-of-the-art results are achieved on three major benchmarks: PACS, VisDA-C, and DomainNet-126. Notably, on PACS, improvements of 7.3% (79.4% to 86.7%) and 7.2% are observed in single-target and multi-target settings, respectively, while gains of 2.8% and 0.7% are attained on DomainNet-126 and VisDA-C. This combination of advanced augmentation and robust pseudo-label reweighting establishes a new benchmark for SFDA. The code is available at: https://github.com/PrasannaPulakurthi/SPM
中文: 本研究提出了用于无源域自适应的Shuffle PatchMix增强技术和新型重加权策略,通过提升数据多样性和减少伪标签噪声,在多个基准测试中实现了最先进的性能。
English: This study introduces Shuffle PatchMix augmentation and a novel reweighting strategy for Source-Free Domain Adaptation, achieving state-of-the-art performance across multiple benchmarks by enhancing data diversity and reducing pseudo-label noise.
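A minimal sketch of the patch shuffle-and-blend idea behind Shuffle PatchMix, assuming square non-overlapping patches and a Beta-sampled blend ratio; the paper's exact augmentation schedule and the confidence-margin pseudo-label reweighting are not reproduced here.

```python
import torch

def shuffle_patchmix(img: torch.Tensor, patch_size: int = 32, alpha: float = 1.0) -> torch.Tensor:
    """Shuffle image patches and blend the shuffled image with the original (sketch).

    img: (C, H, W) tensor with H and W divisible by patch_size.
    alpha: Beta-distribution parameter controlling the blend ratio (our assumption).
    """
    c, h, w = img.shape
    gh, gw = h // patch_size, w // patch_size
    # Split into a grid of patches: (gh*gw, C, patch_size, patch_size)
    patches = img.reshape(c, gh, patch_size, gw, patch_size).permute(1, 3, 0, 2, 4)
    patches = patches.reshape(gh * gw, c, patch_size, patch_size)
    # Randomly permute patch positions
    perm = torch.randperm(gh * gw)
    shuffled = patches[perm]
    # Reassemble the shuffled image
    shuffled = shuffled.reshape(gh, gw, c, patch_size, patch_size).permute(2, 0, 3, 1, 4)
    shuffled = shuffled.reshape(c, h, w)
    # Blend the shuffled and original images
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * shuffled + (1.0 - lam) * img
```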

Authors:Zheng Tan, Weizhen Wang, Andrea L. Bertozzi, Ernest K. Ryu
Title: STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models
Abstract:
Diffusion models (DMs) have demonstrated remarkable performance in high-fidelity image and video generation. Because high-quality generations with DMs typically require a large number of function evaluations (NFEs), resulting in slow sampling, there has been extensive research successfully reducing the NFE to a small range (<10) while maintaining acceptable image quality. However, many practical applications, such as those involving Stable Diffusion 3.5, FLUX, and SANA, commonly operate in the mid-NFE regime (20-50 NFE) to achieve superior results, and, despite the practical relevance, effective sampling within this mid-NFE regime remains underexplored. In this work, we propose a novel, training-free, and structure-independent DM ODE solver called the Stabilized Taylor Orthogonal Runge-Kutta (STORK) method, based on a class of stiff ODE solvers with a Taylor expansion adaptation. Unlike prior work such as DPM-Solver, which is dependent on the semi-linear structure of the DM ODE, STORK is applicable to any DM sampling, including noise-based and flow matching-based models. Within the 20-50 NFE range, STORK achieves improved generation quality, as measured by FID scores, across unconditional pixel-level generation and conditional latent-space generation tasks using models like Stable Diffusion 3.5 and SANA. Code is available at https://github.com/ZT220501/STORK.
中文: STORK方法通过同时解决ODE刚性和结构限制问题,为扩散模型和流匹配模型实现了保持质量的快速采样,持续提升了图像和视频生成效果。
English: The STORK method is introduced to enable quality-preserving fast sampling for diffusion and flow-matching models by addressing both ODE stiffness and structural limitations, consistently improving image and video generation.

Authors:Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu
Title: Boosting All-in-One Image Restoration via Self-Improved Privilege Learning
Abstract:
Unified image restoration models for diverse and mixed degradations often suffer from unstable optimization dynamics and inter-task conflicts. This paper introduces Self-Improved Privilege Learning (SIPL), a novel paradigm that overcomes these limitations by innovatively extending the utility of privileged information (PI) beyond training into the inference stage. Unlike conventional Privilege Learning, where ground-truth-derived guidance is typically discarded after training, SIPL empowers the model to leverage its own preliminary outputs as pseudo-privileged signals for iterative self-refinement at test time. Central to SIPL is Proxy Fusion, a lightweight module incorporating a learnable Privileged Dictionary. During training, this dictionary distills essential high-frequency and structural priors from privileged feature representations. Critically, at inference, the same learned dictionary then interacts with features derived from the model's initial restoration, facilitating a self-correction loop. SIPL can be seamlessly integrated into various backbone architectures, offering substantial performance improvements with minimal computational overhead. Extensive experiments demonstrate that SIPL significantly advances the state-of-the-art on diverse all-in-one image restoration benchmarks. For instance, when integrated with the PromptIR model, SIPL achieves remarkable PSNR improvements of +4.58 dB on composite degradation tasks and +1.28 dB on diverse five-task benchmarks, underscoring its effectiveness and broad applicability. Codes are available at our project page https://github.com/Aitical/SIPL.
中文: 本文提出的自改进特权学习(SIPL)创新地将特权信息应用扩展至推理阶段,通过自我修正机制实现迭代优化,在多种图像修复任务中以最小计算成本取得了最先进的性能突破。
English: This paper introduces Self-Improved Privilege Learning (SIPL), a novel paradigm that extends privileged information utilization into the inference stage, enabling iterative self-refinement and achieving state-of-the-art performance across diverse image restoration tasks with minimal computational overhead.
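The sketch below illustrates the Proxy Fusion idea of a learnable privileged dictionary queried via cross-attention: during training it attends to privileged (ground-truth-derived) features, and at inference the same dictionary attends to features from the model's own preliminary restoration. Dimensions, head count, and the residual layout are assumptions, not SIPL's exact module.

```python
import torch
import torch.nn as nn

class ProxyFusionSketch(nn.Module):
    """Learnable privileged dictionary queried by restoration features (illustrative)."""
    def __init__(self, dim: int = 64, dict_size: int = 128, heads: int = 4):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(dict_size, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) tokens from privileged features (training) or from the
        # model's initial restoration (inference)
        b = feats.shape[0]
        dict_kv = self.dictionary.unsqueeze(0).expand(b, -1, -1)
        refined, _ = self.attn(query=feats, key=dict_kv, value=dict_kv)
        return feats + refined   # residual refinement using dictionary priors
```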

Authors:Lan-Cuong Nguyen, Quan Nguyen-Tri, Bang Tran Khanh, Dung D. Le, Long Tran-Thanh, Khoat Than
Title: Provably Improving Generalization of Few-Shot Models with Synthetic Data
Abstract:
Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach demonstrates superior performance compared to state-of-the-art methods, outperforming them across multiple datasets.
Chinese Summary: 本研究提出了一种理论框架和算法,通过优化的原型学习有效弥合了少样本图像分类中真实与合成数据间的分布差距,在多个数据集上超越了现有最优方法的性能表现。
English Summary: This study introduces a theoretical framework and algorithm that effectively bridge the distribution gap between real and synthetic data in few-shot image classification, achieving superior performance over state-of-the-art methods through optimized prototype learning.

Authors:Junyu Chen, Shuwen Wei, Yihao Liu, Aaron Carass, Yong Du
Title: Pretraining Deformable Image Registration Networks with Random Images
Abstract:
Recent advances in deep learning-based medical image registration have shown that training deep neural networks (DNNs) does not necessarily require medical images. Previous work showed that DNNs trained on randomly generated images with carefully designed noise and contrast properties can still generalize well to unseen medical data. Building on this insight, we propose using registration between random images as a proxy task for pretraining a foundation model for image registration. Empirical results show that our pretraining strategy improves registration accuracy, reduces the amount of domain-specific data needed to achieve competitive performance, and accelerates convergence during downstream training, thereby enhancing computational efficiency.
中文: 最新研究表明,利用随机生成的图像对医学图像配准的深度神经网络进行预训练,可有效提升配准精度、数据利用效率和计算速度。
English: Recent research demonstrates that deep neural networks for medical image registration can be effectively pretrained using randomly generated images, improving accuracy, data efficiency, and computational speed.
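As an illustration of the proxy-task idea, the snippet below generates random multi-scale noise images that could serve as synthetic inputs for pretraining a registration network; the paper's actual noise and contrast recipe may differ, so treat every parameter here as an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_training_image(shape=(128, 128), num_scales=3, seed=None):
    """Generate a random multi-scale noise image as a registration pretraining input.

    Smoothing white noise at several spatial scales yields images with controllable
    structure; this is an illustrative recipe, not the paper's exact generator.
    """
    rng = np.random.default_rng(seed)
    img = np.zeros(shape, dtype=np.float32)
    for s in range(num_scales):
        noise = rng.standard_normal(shape).astype(np.float32)
        img += gaussian_filter(noise, sigma=2 ** (s + 1))   # coarser structure per scale
    img -= img.min()
    return img / (img.max() + 1e-8)   # normalize intensities to [0, 1]
```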

Authors:Shilin Xu, Yanwei Li, Rui Yang, Tao Zhang, Yueyi Sun, Wei Chow, Linfeng Li, Hang Song, Qi Xu, Yunhai Tong, Xiangtai Li, Hao Fei
Title: Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
Abstract:
Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLM post-training, they typically explore only one specific aspect, such as grounding tasks, math problems, or chart analysis. There are no works that can leverage multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL at various sizes. Our dataset and model are available at https://github.com/xushilin1/mixed-r1.
Chinese: 本文提出Mixed-R1框架,通过混合奖励函数设计和混合后训练数据集,解决了多模态大语言模型在多源任务中稳定强化学习的难题,并在Qwen2.5-VL和Intern-VL等模型上验证了其有效性。
English: This paper introduces Mixed-R1, a unified framework that combines a mixed reward function design and a mixed post-training dataset to enable stable reinforcement learning across diverse multimodal large language model tasks, demonstrating effectiveness on models like Qwen2.5-VL and Intern-VL.
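The open-ended BMAS reward described above can be sketched as a bidirectional max-average cosine similarity between token-embedding sequences; the exact embedding source and normalization used in Mixed-R1 may differ from this minimal version.

```python
import torch
import torch.nn.functional as F

def bmas_reward(resp_emb: torch.Tensor, gt_emb: torch.Tensor) -> torch.Tensor:
    """Bidirectional Max-Average Similarity between two token-embedding sequences (sketch).

    resp_emb: (Lr, D) embeddings of the generated response tokens.
    gt_emb:   (Lg, D) embeddings of the ground-truth tokens.
    """
    r = F.normalize(resp_emb, dim=-1)
    g = F.normalize(gt_emb, dim=-1)
    sim = r @ g.T                                  # (Lr, Lg) cosine similarities
    resp_to_gt = sim.max(dim=1).values.mean()      # each response token -> best gt token
    gt_to_resp = sim.max(dim=0).values.mean()      # each gt token -> best response token
    return 0.5 * (resp_to_gt + gt_to_resp)
```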

Authors:Jiashuai Liu, Yingjia Shang, Yingkang Zhan, Di Zhang, Yi Niu, Dong Wei, Xian Wu, Zeyu Gao, Chen Li, Yefeng Zheng
Title: The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models
Abstract:
With the widespread adoption of pathology foundation models in both research and clinical decision support systems, exploring their security has become a critical concern. However, despite their growing impact, the vulnerability of these models to adversarial attacks remains largely unexplored. In this work, we present the first systematic investigation into the security of pathology foundation models for whole slide image (WSI) analysis against adversarial attacks. Specifically, we introduce the principle of "local perturbation with global impact" and propose a label-free attack framework that operates without requiring access to downstream task labels. Under this attack framework, we revise four classical white-box attack methods and redefine the perturbation budget based on the characteristics of WSI. We conduct comprehensive experiments on three representative pathology foundation models across five datasets and six downstream tasks. Despite modifying only 0.1% of patches per slide with imperceptible noise, our attack leads to downstream accuracy degradation that can reach up to 20% in the worst cases. Furthermore, we analyze key factors that influence attack success, explore the relationship between patch-level vulnerability and semantic content, and conduct a preliminary investigation into potential defence strategies. These findings lay the groundwork for future research on the adversarial robustness and reliable deployment of pathology foundation models. Our code is publicly available at: https://github.com/Jiashuai-Liu-hmos/Attack-WSI-pathology-foundation-models.
Chinese: 本研究首次系统探究了病理学基础模型在全切片图像分析中的安全漏洞,提出无需下游任务标签的对抗攻击框架,仅修改0.1%图像区块即可导致下游任务准确率最高下降20%,为模型可靠性研究奠定基础。
English: This study conducts the first systematic investigation into the security vulnerabilities of pathology foundation models for whole slide image analysis, introducing a label-free adversarial attack framework that degrades downstream task accuracy by up to 20% while modifying only 0.1% of patches with imperceptible noise.
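A hedged sketch of the "local perturbation with global impact" idea: perturb only a small fraction of a slide's patches with a single FGSM-style step. The gradient-norm patch selection, epsilon budget, and loss function here are illustrative assumptions rather than the paper's revised white-box attacks.

```python
import torch

def attack_topk_patches(patches, model, loss_fn, frac=0.001, eps=4 / 255):
    """Perturb only a small fraction of WSI patches with one FGSM-style step (sketch).

    patches: (N, C, H, W) tensor of patch images; model maps patches to the slide-level
    output consumed by loss_fn. Selection by gradient magnitude is our assumption.
    """
    patches = patches.clone().requires_grad_(True)
    loss = loss_fn(model(patches))
    loss.backward()
    grad = patches.grad
    # Rank patches by gradient magnitude and keep only the top fraction
    per_patch = grad.abs().flatten(1).sum(dim=1)
    k = max(1, int(frac * patches.shape[0]))
    topk = per_patch.topk(k).indices
    adv = patches.detach().clone()
    adv[topk] = (adv[topk] + eps * grad[topk].sign()).clamp(0, 1)
    return adv
```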

Authors:Peiran Xu, Yadong Mu
Title: Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors
Abstract:
In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated from an off-the-shelf part segmentation model, guided by a mapping from affordance to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off-the-shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the performance of the proposed model has achieved a breakthrough improvement over existing methods. Our codes are available at https://github.com/woyut/WSAG-PLSP .
Chinese: 本研究提出了一种弱监督功能定位模型,利用基础模型和伪标签来弥合物体与动作之间的鸿沟,通过标签精炼、特征对齐和推理模块实现了突破性的性能提升。
English: This study introduces a weakly supervised affordance grounding model that leverages foundation models and pseudo labels to bridge the gap between objects and actions, achieving breakthrough performance improvements through label refinement, feature alignment, and a reasoning module.

Authors:Xuweiyi Chen, Wentao Zhou, Aruni RoyChowdhury, Zezhou Cheng
Title: Point-MoE: Towards Cross-Domain Generalization in 3D Semantic Segmentation via Mixture-of-Experts
Abstract:
While scaling laws have transformed natural language processing and computer vision, 3D point cloud understanding has yet to reach that stage. This can be attributed to both the comparatively smaller scale of 3D datasets, as well as the disparate sources of the data itself. Point clouds are captured by diverse sensors (e.g., depth cameras, LiDAR) across varied domains (e.g., indoor, outdoor), each introducing unique scanning patterns, sampling densities, and semantic biases. Such domain heterogeneity poses a major barrier towards training unified models at scale, especially under the realistic constraint that domain labels are typically inaccessible at inference time. In this work, we propose Point-MoE, a Mixture-of-Experts architecture designed to enable large-scale, cross-domain generalization in 3D perception. We show that standard point cloud backbones degrade significantly in performance when trained on mixed-domain data, whereas Point-MoE with a simple top-k routing strategy can automatically specialize experts, even without access to domain labels. Our experiments demonstrate that Point-MoE not only outperforms strong multi-domain baselines but also generalizes better to unseen domains. This work highlights a scalable path forward for 3D understanding: letting the model discover structure in diverse 3D data, rather than imposing it via manual curation or domain supervision.
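English: Point-MoE is a Mixture-of-Experts architecture for 3D semantic segmentation that uses simple top-k routing to let experts specialize on mixed-domain point cloud data without domain labels, outperforming strong multi-domain baselines and generalizing better to unseen domains.

The routing idea can be sketched as a standard top-k Mixture-of-Experts layer over per-point features; the expert design, load-balancing details, and backbone integration used by Point-MoE are not reproduced here, so everything below is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer over per-point features (illustrative)."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.k = k

    def forward(self, x):                               # x: (N, dim) point features
        logits = self.router(x)                         # (N, num_experts)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_val, dim=-1)             # (N, k) gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        return out
```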

Authors:David Ma, Huaqing Yuan, Xingjian Wang, Qianbo Zang, Tianci Liu, Xinyang He, Yanbin Wei, Jiawei Guo, Ni Jiahui, Zhenzhu Yang, Meng Cao, Shanghaoran Quan, Yizhi Li, Wangchunshu Zhou, Jiaheng Liu, Wenhao Huang, Ge Zhang, Shiwen Ni, Xiaojie Jin
Title: ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
Abstract:
Although long-video understanding demands that models capture hierarchical temporal information -- from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) -- existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales -- clip (seconds), shot (tens of seconds), event (minutes), and story (hours) -- all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4-8 carefully designed questions, including at least one question for each timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong.
中文:ScaleLong基准通过在相同视频内容中嵌入多时间尺度问题,实现了模型在分层时间级别上的直接性能比较,揭示了U形准确率曲线并证明增加视觉标记容量可提升推理能力。
English: The ScaleLong benchmark introduces multi-timescale questions within the same video content to enable direct comparison of model performance across hierarchical temporal levels, revealing a U-shaped accuracy curve and demonstrating that increased visual token capacity improves reasoning.

Authors:Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Title: BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning
Abstract:
Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
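English: BioCLIP 2, trained on the newly curated TreeOfLife-200M dataset of 214 million organism images to distinguish species, achieves strong accuracy on diverse biological visual tasks and exhibits emergent properties: its embeddings align with ecological and functional structure across species while preserving intra-species variation in orthogonal subspaces, effects that strengthen with larger-scale training.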

Authors:Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem
Title: TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
Abstract:
Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
中文: TextRegion框架巧妙融合图文模型与SAM2,生成文本对齐的区域标记,无需训练即可在开放世界分割等任务中实现卓越性能,且兼容多种模型,便于扩展应用。
English: The proposed TextRegion framework effectively combines image-text models with SAM2 to produce text-aligned region tokens, enabling superior performance in tasks like open-world segmentation without requiring training, while remaining compatible with various models for easy extension.
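One simple way to realize text-aligned region tokens, under our assumptions, is to average-pool dense patch features from a frozen image-text model inside each SAM2 mask and then score the pooled tokens against class text embeddings; TextRegion's actual token construction may differ from this sketch.

```python
import torch
import torch.nn.functional as F

def region_tokens_from_masks(patch_feats, masks):
    """Average-pool patch features inside each segmentation mask to form region tokens.

    patch_feats: (Hp, Wp, D) dense features from a frozen image-text model.
    masks:       (M, Hp, Wp) boolean masks (e.g., from SAM2) resized to the patch grid.
    """
    hp, wp, d = patch_feats.shape
    flat = patch_feats.reshape(-1, d)                     # (Hp*Wp, D)
    tokens = []
    for m in masks.reshape(masks.shape[0], -1):           # one flattened mask at a time
        tokens.append(flat[m].mean(dim=0) if m.any() else torch.zeros(d))
    return F.normalize(torch.stack(tokens), dim=-1)       # (M, D) region tokens

# Usage (open-vocabulary scoring): scores = region_tokens @ F.normalize(text_embs, dim=-1).T
```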

Authors:Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
Title: Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/
Chinese: 本文提出Argus模型,通过以物体为中心的视觉定位机制增强多模态推理能力,在需要精确视觉关注的视觉语言任务中表现卓越。
English: The paper introduces Argus, a multimodal model that enhances visual reasoning through object-centric grounding, effectively improving performance in vision-language tasks requiring precise visual focus.

Authors:Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
Title: ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Abstract:
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
中文:ZeroGUI是一种无需人工干预的在线学习框架,通过基于视觉语言模型的任务生成与奖励评估,自动训练图形界面代理,显著提升了在动态环境中的性能表现。
English: ZeroGUI is an online learning framework that automates GUI agent training without human intervention by using VLM-based task generation and reward estimation, significantly enhancing performance in dynamic environments.

Authors:Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, Hao Zhao
Title: Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
Abstract:
Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks--improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.
Chinese Summary: Impromptu VLA数据集通过提供超过80,000个精心筛选的视频片段及丰富注释,解决了自动驾驶视觉语言动作模型缺乏针对性基准的问题,显著提升了模型在各项指标上的表现。
English Summary: The Impromptu VLA dataset addresses the lack of specialized benchmarks for autonomous driving Vision-Language-Action models by providing over 80,000 curated video clips with comprehensive annotations, significantly enhancing model performance across multiple metrics.

Authors:Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan
Title: Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
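English: Spatial-MLLM enables visual-based spatial reasoning from purely 2D inputs through a dual-encoder design that pairs a pretrained 2D semantic encoder with a spatial encoder initialized from a feed-forward visual geometry model, adds space-aware frame sampling at inference, and achieves state-of-the-art results after training on the Spatial-MLLM-120k dataset.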

Authors:Hao Dong, Moru Liu, Jian Liang, Eleni Chatzi, Olga Fink
Title: To Trust Or Not To Trust Your Vision-Language Model's Prediction
Abstract:
Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code is available at https://github.com/EPFL-IMOS/TrustVLM.
Chinese: TrustVLM是一种无需训练的框架,通过基于图像嵌入的新型置信度评分函数提升视觉语言模型的可靠性,在多个数据集和架构上实现了误分类检测的最先进性能。
English: TrustVLM is a training-free framework that enhances the reliability of Vision-Language Models by introducing a novel confidence-scoring function based on image embeddings, achieving state-of-the-art performance in misclassification detection across multiple datasets and architectures.
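The idea of consulting the image embedding space can be sketched as follows: combine the zero-shot text-space probability with the similarity between the image embedding and an image-space prototype of the predicted class. The prototype construction and the exact combination rule are our assumptions, not necessarily TrustVLM's scoring function.

```python
import torch
import torch.nn.functional as F

def trust_score(image_emb, text_embs, image_protos, temperature=100.0):
    """Illustrative confidence score combining text-space and image-space evidence.

    image_emb:    (D,) normalized embedding of the test image.
    text_embs:    (K, D) normalized class text embeddings (standard zero-shot head).
    image_protos: (K, D) normalized per-class prototypes built from image embeddings.
    """
    text_logits = temperature * image_emb @ text_embs.T     # zero-shot logits
    probs = F.softmax(text_logits, dim=-1)
    pred = probs.argmax()
    # Image-space agreement: similarity to the predicted class prototype
    image_sim = image_emb @ image_protos[pred]
    # Higher score -> prediction more likely to be trustworthy
    return probs[pred] * image_sim
```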

Authors:Qiang Wang, Xiang Song, Yuhang He, Jizhou Han, Chenhao Ding, Xinyuan Gao, Yihong Gong
Title: Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need
Abstract:
Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains and corresponding classes grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO's consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The codes will be released in https://github.com/qwangcv/SOYO.
中文摘要:SOYO是一个轻量级框架,通过引入高斯混合压缩器、领域特征重采样器和多级领域特征融合网络,改进了参数隔离领域增量学习中的领域选择,在多个基准测试中展现出卓越性能。
English Summary: SOYO is a lightweight framework that enhances domain selection in Parameter-Isolation Domain Incremental Learning by incorporating a Gaussian Mixture Compressor, Domain Feature Resampler, and Multi-level Domain Feature Fusion Network, demonstrating superior performance across multiple benchmarks.

Authors:Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
Title: MAGREF: Masked Guidance for Any-Reference Video Generation
Abstract:
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
中文: 我们提出MAGREF这一统一框架,通过掩码引导和主体解耦机制解决任意参考视频生成中的身份不一致、多主体纠缠和复制粘贴伪影问题,实现了基于多样化参考图像与文本提示的高质量视频合成,性能超越现有最优方法。
English: We introduce MAGREF, a unified framework for any-reference video generation that addresses identity inconsistency, subject entanglement, and copy-paste artifacts through masked guidance and subject disentanglement mechanisms, achieving state-of-the-art performance in synthesizing videos from diverse references and text prompts.

Authors:Weijie Wang, Donny Y. Chen, Zeyu Zhang, Duochao Shi, Akide Liu, Bohan Zhuang
Title: ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS
Abstract:
Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
中文摘要:ZPressor是一种轻量级模块,能够将多视角输入高效压缩为紧凑的潜在表示,使前馈式3D高斯溅射模型在处理密集视角输入时显著提升扩展性和性能表现。
English Summary: ZPressor is a lightweight module that enables feed-forward 3D Gaussian Splatting models to efficiently compress multi-view inputs into a compact latent representation, significantly improving scalability and performance with dense view inputs.
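A minimal sketch of the anchor/support compression described above: support-view tokens are folded into anchor-view tokens with one cross-attention block to form the compact latent state Z. The number of anchor views, token layout, and normalization below are assumptions rather than ZPressor's exact design.

```python
import torch
import torch.nn as nn

class ZPressorSketch(nn.Module):
    """Compress support-view tokens into anchor-view tokens via cross-attention (sketch)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens: torch.Tensor, num_anchor_views: int):
        # view_tokens: (V, T, D) tokens per view; the first num_anchor_views are anchors
        d = view_tokens.shape[-1]
        anchors = view_tokens[:num_anchor_views].reshape(1, -1, d)
        supports = view_tokens[num_anchor_views:].reshape(1, -1, d)
        fused, _ = self.attn(query=anchors, key=supports, value=supports)
        z = self.norm(anchors + fused)   # compact latent state Z carried by anchor tokens
        return z
```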

Authors:Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, Bo Dai
Title: AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views
Abstract:
We introduce AnySplat, a feed-forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per-scene optimization, or recent feed-forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, and the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi-view datasets without any pose annotations. In extensive zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse- and dense-view scenarios while surpassing existing pose-free approaches. Moreover, it greatly reduces rendering latency compared to optimization-based neural fields, bringing real-time novel view synthesis within reach for unconstrained capture settings. Project page: https://city-super.github.io/anysplat/
中文: AnySplat是一种前馈网络模型,能够通过单次前向传播从无标定图像中预测3D高斯基元与相机参数,实现实时新视角合成,在无需相机位姿标注的情况下达到了与已知位姿方法相当的质量水平。
English: AnySplat is a feed-forward model that enables real-time novel view synthesis from uncalibrated images by predicting 3D Gaussian primitives and camera parameters in a single pass, achieving quality comparable to pose-aware methods without requiring camera pose annotations.

Authors:Mohamad Alansari, Sajid Javed, Iyyakutti Iyappan Ganapathi, Sara Alansari, Muzammal Naseer
Title: CLDTracker: A Comprehensive Language Description for Visual Tracking
Abstract:
Visual object tracking (VOT) remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in vision-language models (VLMs) have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object's nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target's evolving appearance in the language domain, leading to a disconnect between the initial description and the object's subsequent visual changes. To bridge these gaps and unlock the full potential of VLMs for VOT, we propose CLDTracker, a novel Comprehensive Language Description framework for robust visual Tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. In the textual branch, we construct a rich bag of textual descriptions derived by harnessing powerful VLMs such as CLIP and GPT-4V, enriched with semantic and contextual cues to address the lack of rich textual representation. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance, validating the effectiveness of leveraging robust and temporally-adaptive vision-language representations for tracking. Code and models are publicly available at: https://github.com/HamadYA/CLDTracker
中文: CLDTracker提出了一种双分支框架,通过融合视觉语言模型生成的丰富自适应文本描述来增强视觉目标跟踪性能,在多个基准测试中实现了最先进的结果。
English: CLDTracker introduces a dual-branch framework that enhances visual object tracking by integrating rich, adaptive textual descriptions from vision-language models, achieving state-of-the-art performance across multiple benchmarks.

Authors:Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy
Title: OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
Abstract:
In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at https://github.com/wusize/OpenUni.
中文: 本报告介绍了OpenUni,一个简单、轻量级且完全开源的基准模型,通过高效训练策略连接现有多模态大语言模型和扩散模型,以最少参数实现高质量多模态理解与生成。
English: This report introduces OpenUni, a simple, lightweight, and fully open-source baseline that unifies multimodal understanding and generation using an efficient training strategy to connect off-the-shelf models and achieves high-quality results with minimal parameters.
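The bridging strategy can be sketched as a set of learnable queries that cross-attend to the MLLM's hidden states through a small transformer and are projected into the diffusion model's conditioning space; the query count, depth, and dimensions below are illustrative assumptions, not OpenUni's published configuration.

```python
import torch
import torch.nn as nn

class OpenUniConnectorSketch(nn.Module):
    """Learnable queries plus a small transformer bridging an MLLM and a diffusion model."""
    def __init__(self, llm_dim: int, cond_dim: int, num_queries: int = 64, depth: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, L, llm_dim) hidden states from the frozen MLLM
        b = llm_hidden.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(tgt=q, memory=llm_hidden)   # queries cross-attend to the MLLM
        return self.proj(out)                          # (B, num_queries, cond_dim)
```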

Authors:Ziteng Gao, Mike Zheng Shou
Title: D-AR: Diffusion via Autoregressive Models
Abstract:
This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures for visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR
中文: 本文提出D-AR方法,将图像扩散过程重构为标准自回归的下一令牌预测任务,通过离散令牌序列实现由粗到精的图像生成,在ImageNet基准上使用7.75亿参数Llama模型取得了2.09的FID分数。
English: This paper introduces D-AR, a novel approach that reformulates image diffusion as an autoregressive next-token prediction process, enabling coarse-to-fine image generation with properties like consistent previews and achieving 2.09 FID on ImageNet using a 775M Llama model.

Authors:Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
Title: VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Abstract:
Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvements on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
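English: VideoREPA distills physics understanding from video foundation models into text-to-video diffusion models by aligning token-level relations through a Token Relation Distillation (TRD) loss, substantially improving the physical plausibility of videos generated by the baseline CogVideoX.

Relation alignment of this kind can be sketched as matching pairwise token-similarity matrices between student and teacher features; the projection layers, spatio-temporal pairing, and loss weighting used by VideoREPA are not shown, and the choice of a smooth-L1 penalty is our assumption.

```python
import torch
import torch.nn.functional as F

def token_relation_distillation(student_tokens, teacher_tokens):
    """Align pairwise token-to-token relations between student and teacher features.

    student_tokens: (N, Ds) features from the T2V model (after any projection).
    teacher_tokens: (N, Dt) features from the video foundation model, same token grid.
    """
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    rel_s = s @ s.T     # (N, N) student token relations
    rel_t = t @ t.T     # (N, N) teacher token relations
    return F.smooth_l1_loss(rel_s, rel_t)
```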

Authors:Juncheol Shin, Minsang Seok, Seonggon Kim, Eunhyeok Park
Title: Merge-Friendly Post-Training Quantization for Multi-Target Domain Adaptation
Abstract:
Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ - Hessian and distant regularizing quantization - that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.
Chinese: 本研究提出了HDRQ,一种新颖的训练后量化方法,通过最小化与源模型的偏差并平滑损失表面,解决了量化带来的挑战,实现了多目标领域自适应中的有效模型融合。
English: This study introduces HDRQ, a novel post-training quantization method that minimizes deviation from the source model and flattens the loss surface to enable effective model merging for multi-target domain adaptation, addressing challenges posed by quantization.

Authors:Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
Title: ZeroSep: Separate Anything in Audio with Zero Training
Abstract:
Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
中文摘要:ZeroSep是一种创新的零样本音频源分离方法,通过将预训练的文本引导音频扩散模型重新用于分离任务,无需特定训练即可利用文本条件引导去噪过程,在多个基准测试中展现出卓越性能。
English Summary: ZeroSep is a novel zero-shot audio source separation method that repurposes pre-trained text-guided audio diffusion models without task-specific training, achieving strong performance by inverting mixed audio into latent space and using text conditioning to guide denoising.

Authors:Jianbo Zhao, Taiyu Ban, Xiyang Wang, Qibin Zhou, Hangning Zhou, Zhihao Liu, Mu Yang, Lei Liu, Bin Li
Title: Autoregressive Meta-Actions for Unified Controllable Trajectory Generation
Abstract:
Controllable trajectory generation guided by high-level semantic decisions, termed meta-actions, is crucial for autonomous driving systems. A significant limitation of existing frameworks is their reliance on invariant meta-actions assigned over fixed future time intervals, causing temporal misalignment with the actual behavior trajectories. This misalignment leads to irrelevant associations between the prescribed meta-actions and the resulting trajectories, disrupting task coherence and limiting model performance. To address this challenge, we introduce Autoregressive Meta-Actions, an approach integrated into autoregressive trajectory generation frameworks that provides a unified and precise definition for meta-action-conditioned trajectory prediction. Specifically, we decompose traditional long-interval meta-actions into frame-level meta-actions, enabling a sequential interplay between autoregressive meta-action prediction and meta-action-conditioned trajectory generation. This decomposition ensures strict alignment between each trajectory segment and its corresponding meta-action, achieving a consistent and unified task formulation across the entire trajectory span and significantly reducing complexity. Moreover, we propose a staged pre-training process to decouple the learning of basic motion dynamics from the integration of high-level decision control, which offers flexibility, stability, and modularity. Experimental results validate our framework's effectiveness, demonstrating improved trajectory adaptivity and responsiveness to dynamic decision-making scenarios. We provide the video document and dataset, which are available at https://arma-traj.github.io/.
中文摘要:本文提出的自回归元动作框架通过将长间隔元动作分解为帧级动作,解决了自动驾驶中轨迹与语义决策的时间错位问题,实现了轨迹段与元动作的精确对齐,并通过分阶段预训练显著提升了动态决策场景下的适应性和响应能力。
English Summary: The proposed Autoregressive Meta-Actions framework addresses temporal misalignment in autonomous driving by decomposing long-interval meta-actions into frame-level actions, ensuring precise alignment between trajectory segments and semantic decisions while improving adaptability through staged pre-training.

Authors:Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
Title: Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Abstract:
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
中文: Muddit作为一种统一的离散扩散变换器,通过将预训练骨干网络的强大视觉先验与轻量级文本解码器相结合,实现了快速、并行且高质量的多模态生成,在效率和性能上均优于更大的自回归模型。
English: Muddit is a unified discrete diffusion transformer that enables fast, parallel, and high-quality multimodal generation by integrating strong visual priors from a pretrained backbone with a lightweight text decoder, outperforming larger autoregressive models in efficiency and quality.

Authors:Youssef Mohamed, Noran Mohamed, Khaled Abouhashad, Feilong Tang, Sara Atito, Shoaib Jameel, Imran Razzak, Ahmed B. Zaky
Title: DeepChest: Dynamic Gradient-Free Task Weighting for Effective Multi-Task Learning in Chest X-ray Classification
Abstract:
While Multi-Task Learning (MTL) offers inherent advantages in complex domains such as medical imaging by enabling shared representation learning, effectively balancing task contributions remains a significant challenge. This paper addresses this critical issue by introducing DeepChest, a novel, computationally efficient and effective dynamic task-weighting framework specifically designed for multi-label chest X-ray (CXR) classification. Unlike existing heuristic or gradient-based methods that often incur substantial overhead, DeepChest leverages a performance-driven weighting mechanism based on effective analysis of task-specific loss trends. Given a network architecture (e.g., ResNet18), our model-agnostic approach adaptively adjusts task importance without requiring gradient access, thereby significantly reducing memory usage and achieving a threefold increase in training speed. It can be easily applied to improve various state-of-the-art methods. Extensive experiments on a large-scale CXR dataset demonstrate that DeepChest not only outperforms state-of-the-art MTL methods by 7% in overall accuracy but also yields substantial reductions in individual task losses, indicating improved generalization and effective mitigation of negative transfer. The efficiency and performance gains of DeepChest pave the way for more practical and robust deployment of deep learning in critical medical diagnostic applications. The code is publicly available at https://github.com/youssefkhalil320/DeepChest-MTL
中文: 本文提出的DeepChest框架通过动态任务加权机制,无需梯度访问即可自适应调整多标签胸片分类任务权重,在提升训练效率三倍的同时,整体准确率较现有最优方法提高7%。
English: This paper introduces DeepChest, a dynamic task-weighting framework for multi-label chest X-ray classification that enhances training efficiency and accuracy by adaptively balancing task contributions without gradient access, achieving a 7% improvement over state-of-the-art methods.
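A gradient-free, performance-driven weighting rule of the kind described can be sketched from per-task loss trends alone: tasks whose losses have stopped improving receive larger weights. The window size and exponential mapping below are our assumptions, not DeepChest's exact formula.

```python
import numpy as np

class LossTrendWeighter:
    """Gradient-free task weighting from per-task loss trends (illustrative sketch)."""
    def __init__(self, num_tasks: int, window: int = 5):
        self.history = [[] for _ in range(num_tasks)]
        self.window = window

    def update(self, task_losses):
        """Record the latest per-task losses and return normalized task weights."""
        for hist, loss in zip(self.history, task_losses):
            hist.append(float(loss))
        weights = []
        for hist in self.history:
            if len(hist) < 2:
                weights.append(1.0)
                continue
            recent = hist[-self.window:]
            # Relative improvement over the window; small improvement -> larger weight
            improvement = (recent[0] - recent[-1]) / (abs(recent[0]) + 1e-8)
            weights.append(np.exp(-improvement))
        weights = np.array(weights)
        return weights * len(weights) / weights.sum()   # weights sum to num_tasks
```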

Authors:Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
Title: Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Abstract:
The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: Firstly, we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Secondly, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Thirdly, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourthly, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
中文摘要:本研究以拼图游戏为实验框架探索基于规则的多模态大语言模型强化学习,发现微调能使模型从接近随机猜测提升至近乎完美的准确率并泛化至复杂配置,且强化学习比监督微调具有更好的泛化效果。
English Summary: This study explores rule-based reinforcement learning in multimodal large language models using jigsaw puzzles, revealing that fine-tuning enables models to progress from random guessing to near-perfect accuracy and generalize to complex configurations, with RL outperforming supervised fine-tuning in generalization capability.
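Because jigsaw puzzles carry inherent ground truth, the rule-based reward needs no learned judge. Below is a small illustrative reward function under the assumption that the model outputs a patch-to-slot permutation; the exact reward shaping used in Jigsaw-R1 may differ.

```python
def jigsaw_reward(predicted_order, true_order, partial_credit=True):
    """Rule-based reward for a predicted jigsaw arrangement (sketch)."""
    if len(predicted_order) != len(true_order):
        return 0.0                       # malformed answer gets no reward
    if predicted_order == true_order:
        return 1.0
    if not partial_credit:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)

# A 2x2 puzzle whose ground-truth patch order is [2, 0, 3, 1].
print(jigsaw_reward([2, 0, 1, 3], [2, 0, 3, 1]))  # 0.5 with partial credit
```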

Authors:Yu Li, Jin Jiang, Jianhua Zhu, Shuai Peng, Baole Wei, Yuxuan Zhou, Liangcai Gao
Title: Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Abstract:
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
中文: Uni-MuMER通过全微调预训练视觉语言模型,结合树感知思维链、错误驱动学习和符号计数三项任务,在手写数学表达式识别任务中实现了最先进的性能表现。
English: Uni-MuMER introduces a novel approach by fully fine-tuning a pretrained vision-language model for handwritten mathematical expression recognition, integrating three data-driven tasks to achieve state-of-the-art performance on benchmark datasets.

Authors:Xu Chu, Xinrong Chen, Guanyu Wang, Zhijie Tan, Kui Huang, Wenyu Lv, Tong Mo, Weiping Li
Title: Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information
Abstract:
Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, causing visual information to receive less attention, which may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attend to visual information during reasoning. We first propose a reinforcement learning method, Balanced Reflective Policy Optimization (BRPO), which guides the model to decide when to generate vision-text reflection on its own and balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attend to visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy performance while reducing hallucinations. Our code is available at: https://github.com/Liar406/Look_Again
中文摘要:Qwen-LookAgain是一种新型视觉语言推理模型,通过引入视觉-文本反思过程和强化学习机制,引导模型在推理过程中重新关注视觉信息,从而有效减少幻觉现象,在多项视觉问答基准测试中取得领先准确率。
English Summary: Qwen-LookAgain is a novel vision-language reasoning model that mitigates hallucinations by incorporating a vision-text reflection process and reinforcement learning to guide the model in re-attending to visual information during reasoning, achieving leading accuracy on multiple benchmarks.

Authors:Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, Siyu Zhu
Title: Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization
Abstract:
Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.
中文: 本文提出了一种基于人类偏好的扩散框架,通过优化感知对齐和解决时空分辨率不匹配问题,显著提升了肖像动画的唇音同步精度和运动连贯性。
English: This paper introduces a human-preference-aligned diffusion framework that enhances portrait animation by optimizing perceptual alignment and resolving spatiotemporal mismatches, achieving superior lip synchronization and motion fidelity.
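Direct preference optimization has a standard pairwise form, shown below on generic log-likelihoods of a preferred and a rejected sample; Hallo4 adapts this idea to diffusion-based portrait animation, and its per-timestep formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta: float = 0.1):
    """Pairwise DPO loss on (batch,) log-likelihoods under the trained policy
    and a frozen reference model: the preferred sample should gain probability
    relative to the reference faster than the rejected one."""
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss.item())
```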

Authors:Liyun Zhu, Qixiang Chen, Xi Shen, Xiaodong Cun
Title: VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
Abstract:
Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). Besides, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.
中文摘要:VAU-R1框架基于多模态大语言模型和强化微调技术,显著提升了视频异常推理的准确性和可解释性,同时VAU-Bench基准测试为该领域提供了首个思维链评估标准。
English Summary: The VAU-R1 framework, leveraging Multimodal Large Language Models and Reinforcement Fine-Tuning, significantly enhances video anomaly reasoning accuracy and interpretability, while the VAU-Bench benchmark provides comprehensive evaluation tools for this domain.

Authors:Shi-Xue Zhang, Hongfa Wang, Duojun Huang, Xin Li, Xiaobin Zhu, Xu-Cheng Yin
Title: VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
Abstract:
Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA-pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement, and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics (Accuracy (AR), Inconsistency Rate (IR), Coverage Rate (CR)), and an automated evaluation pipeline leveraging large language model (LLM) to verify caption quality via contrastive QA-pairs analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and codes are available at website: https://github.com/GXYM/VCapsBench.
中文摘要:VCapsBench作为首个大规模细粒度视频描述评估基准,包含5,677个视频和109,796组问答对,通过21个维度的系统标注和自动化评估指标,为文本生成视频任务提供可操作的优化指导。
English Summary: VCapsBench is introduced as the first large-scale fine-grained benchmark for evaluating video captions, featuring 5,677 videos and 109,796 QA-pairs across 21 dimensions to enhance text-to-video generation through actionable insights and automated metrics.
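The abstract names three metrics (AR, IR, CR) without giving formulas, so the snippet below only illustrates one plausible reading in which each contrastive QA pair is judged correct, inconsistent, or not covered by the caption; the benchmark's actual definitions may differ.

```python
def caption_qa_metrics(verdicts):
    """Hypothetical AR/IR/CR computation from per-QA-pair verdicts (sketch)."""
    n = len(verdicts)
    ar = sum(v == "correct" for v in verdicts) / n        # Accuracy
    ir = sum(v == "inconsistent" for v in verdicts) / n   # Inconsistency Rate
    cr = sum(v != "missing" for v in verdicts) / n        # Coverage Rate
    return {"AR": ar, "IR": ir, "CR": cr}

print(caption_qa_metrics(["correct", "missing", "inconsistent", "correct"]))
# {'AR': 0.5, 'IR': 0.25, 'CR': 0.75}
```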

Authors:Ron Shapira Weber, Shahar Ben Ishay, Andrey Lavrinenko, Shahaf E. Finder, Oren Freifeld
Title: TimePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning
Abstract:
Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This approach, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. TimePoint demonstrates strong generalization to real-world time series when trained solely on synthetic data, and further improves with fine-tuning on real data. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at https://github.com/BGU-CS-VIL/TimePoint
Chinese: TimePoint是一种自监督方法,通过从合成数据中学习关键点来加速动态时间规整(DTW)对齐,相比标准DTW实现了更快速度和更高精度。
English: TimePoint is a self-supervised method that accelerates Dynamic Time Warping (DTW) alignment by learning keypoints from synthetic data, achieving faster and more accurate results than standard DTW.
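The speed argument is easy to see with plain DTW: aligning only at detected keypoints shrinks the dynamic-programming table quadratically. The sketch below uses evenly spaced indices as a stand-in for TimePoint's learned keypoints and a textbook O(nm) DTW, not the paper's implementation.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook O(len(a) * len(b)) dynamic time warping on 1D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))
y = np.cumsum(rng.normal(size=500))

full_cost = dtw_distance(x, y)                          # 500 x 500 DP table
keypoints = np.arange(0, 500, 25)                       # stand-in for learned keypoints
sparse_cost = dtw_distance(x[keypoints], y[keypoints])  # only a 20 x 20 table
```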

Authors:Zhuodong Li, Fei Hou, Wencheng Wang, Xuequan Lu, Ying He
Title: A Divide-and-Conquer Approach for Global Orientation of Non-Watertight Scene-Level Point Clouds Using 0-1 Integer Optimization
Abstract:
Orienting point clouds is a fundamental problem in computer graphics and 3D vision, with applications in reconstruction, segmentation, and analysis. While significant progress has been made, existing approaches mainly focus on watertight, object-level 3D models. The orientation of large-scale, non-watertight 3D scenes remains an underexplored challenge. To address this gap, we propose DACPO (Divide-And-Conquer Point Orientation), a novel framework that leverages a divide-and-conquer strategy for scalable and robust point cloud orientation. Rather than attempting to orient an unbounded scene at once, DACPO segments the input point cloud into smaller, manageable blocks, processes each block independently, and integrates the results through a global optimization stage. For each block, we introduce a two-step process: estimating initial normal orientations by a randomized greedy method and refining them by an adapted iterative Poisson surface reconstruction. To achieve consistency across blocks, we model inter-block relationships using an undirected graph, where nodes represent blocks and edges connect spatially adjacent blocks. To reliably evaluate orientation consistency between adjacent blocks, we introduce the concept of the visible connected region, which defines the region over which visibility-based assessments are performed. The global integration is then formulated as a 0-1 integer-constrained optimization problem, with block flip states as binary variables. Despite the combinatorial nature of the problem, DACPO remains scalable by limiting the number of blocks (typically a few hundred for 3D scenes) involved in the optimization. Experiments on benchmark datasets demonstrate DACPO's strong performance, particularly in challenging large-scale, non-watertight scenarios where existing methods often fail. The source code is available at https://github.com/zd-lee/DACPO.
中文: DACPO是一种新颖的分治框架,通过将大规模非封闭点云分割为可处理块进行局部定向,再采用基于图的全局优化确保一致性,有效解决了复杂三维场景定向的难题。
English: DACPO is a novel divide-and-conquer framework that segments large-scale non-watertight point clouds into manageable blocks for local orientation processing, then achieves global consistency through graph-based optimization to address the challenging orientation of complex 3D scenes.
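The global stage reduces to choosing a binary flip state per block so that adjacent blocks agree over their visible connected regions. The toy below solves that 0-1 problem by exhaustive search on three blocks purely to make the objective concrete; DACPO handles hundreds of blocks with a proper integer-optimization solver.

```python
import itertools

def orient_blocks(consistency, edges, num_blocks):
    """Brute-force 0-1 flip assignment maximizing pairwise agreement (toy)."""
    best_flips, best_score = None, float("-inf")
    for flips in itertools.product([0, 1], repeat=num_blocks):
        # Flipping both or neither endpoint of an edge preserves its agreement.
        score = sum(consistency[(i, j)] * (1 if flips[i] == flips[j] else -1)
                    for i, j in edges)
        if score > best_score:
            best_flips, best_score = flips, score
    return best_flips

edges = [(0, 1), (1, 2), (0, 2)]
# +1: the two blocks' normals currently agree over the shared visible region.
consistency = {(0, 1): 1, (1, 2): -1, (0, 2): -1}
print(orient_blocks(consistency, edges, num_blocks=3))  # (0, 0, 1): flip block 2
```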

Authors:Weijia Mao, Zhenheng Yang, Mike Zheng Shou
Title: UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Abstract:
Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.
中文: UniRL是一种自我改进的后训练方法,它让多模态模型无需外部数据即可生成图像进行训练,不仅提升生成与理解任务的表现,还缩小了二者之间的性能差距。
English: UniRL is a self-improving post-training method that enables multimodal models to generate images for training without external data, improving both generation and understanding tasks while reducing their performance gap.
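Of the two optimizers mentioned, GRPO is the more distinctive: it replaces a learned value model with group-relative advantages, standardizing the rewards of several sampled outputs for the same prompt. A minimal version of that advantage computation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: `rewards` has shape (num_prompts, group_size),
    one row per prompt; each sample is standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, three sampled outputs each, scored by some reward function.
adv = grpo_advantages(torch.tensor([[1.0, 0.0, 0.5],
                                    [0.2, 0.8, 0.5]]))
print(adv)
```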

Authors:Jeonghyeok Do, Sungpyo Kim, Geunhyuk Youk, Jaehyup Lee, Munchurl Kim
Title: PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening
Abstract:
PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment -- caused by sensor placement, acquisition timing, and resolution disparity -- induces a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN's high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11$\times$ faster inference time and 0.63$\times$ the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.

Authors:Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun
Title: VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
Abstract:
Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under such task setting, models have to precisely recall multiple operations in the video, and perform step-by-step reasoning to get correct final answers for these questions. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that extended thinking budget, while offering none or minimal benefits on existing video benchmarks, is essential for improving the performance on VideoReasonBench.

Authors:Alexandra G. Roberts, Ha M. Luu, Mert Şişman, Alexey V. Dimov, Ceren Tozlu, Ilhami Kovanlikaya, Susan A. Gauthier, Thanh D. Nguyen, Yi Wang
Title: Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis
Abstract:
Quantitative susceptibility maps from magnetic resonance images can provide both prognostic and diagnostic information in multiple sclerosis, a neurodegenerative disease characterized by the formation of lesions in white matter brain tissue. In particular, susceptibility maps provide adequate contrast to distinguish between "rim" lesions, surrounded by deposited paramagnetic iron, and "non-rim" lesion types. These paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis. Much effort has been devoted to both detection and segmentation of such lesions to monitor longitudinal change. As paramagnetic rim lesions are rare, addressing this problem requires confronting the class imbalance between rim and non-rim lesions. We produce synthetic quantitative susceptibility maps of paramagnetic rim lesions and show that inclusion of such synthetic data improves classifier performance and provide a multi-channel extension to generate accompanying contrasts and probabilistic segmentation maps. We exploit the projection capability of our trained generative network to demonstrate a novel denoising approach that allows us to train on ambiguous rim cases and substantially increase the minority class. We show that both synthetic lesion synthesis and our proposed rim lesion label denoising method best approximate the unseen rim lesion distribution and improve detection in a clinically interpretable manner. We release our code and generated data at https://github.com/agr78/PRLx-GAN upon publication.
中文摘要:定量磁化率图通过生成合成数据和采用新型去噪方法,改进了多发性硬化症中顺磁性边缘病变的检测与分割,从而提升了分类器性能和临床可解释性。
English Summary: Quantitative susceptibility maps enhance the detection and segmentation of paramagnetic rim lesions in multiple sclerosis by generating synthetic data and employing a novel denoising method, which improves classifier performance and clinical interpretability.

Authors:Yexiong Lin, Yu Yao, Tongliang Liu
Title: Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching
Abstract:
Flow Matching (FM) is an effective framework for training a model to learn a vector field that transports samples from a source distribution to a target distribution. To train the model, early FM methods use random couplings, which often result in crossing paths and lead the model to learn non-straight trajectories that require many integration steps to generate high-quality samples. To address this, recent methods adopt Optimal Transport (OT) to construct couplings by minimizing geometric distances, which helps reduce path crossings. However, we observe that such geometry-based couplings do not necessarily align with the model's preferred trajectories, making it difficult to learn the vector field induced by these couplings, which prevents the model from learning straight trajectories. Motivated by this, we propose Model-Aligned Coupling (MAC), an effective method that matches training couplings based not only on geometric distance but also on alignment with the model's preferred transport directions, as measured by its prediction error. To avoid a time-costly matching process, MAC selects the top-$k$ fraction of couplings with the lowest error for training. Extensive experiments show that MAC significantly improves generation quality and efficiency in few-step settings compared to existing methods. Project page: https://yexionglin.github.io/mac
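A compact way to see the selection step: pair source and target samples, score how well the current velocity model already explains each straight-line path, and train only on the lowest-error fraction. The sketch below makes those assumptions explicit and omits the paper's matching and scheduling details.

```python
import torch

def mac_select(model, x0: torch.Tensor, x1: torch.Tensor, keep_frac: float = 0.5):
    """Keep the couplings the current model finds easiest to fit (sketch)."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1          # point on the straight interpolation path
    target_v = x1 - x0                  # straight-path velocity target
    with torch.no_grad():
        err = ((model(xt, t) - target_v) ** 2).mean(dim=1)
    k = max(1, int(keep_frac * x0.shape[0]))
    keep = torch.topk(-err, k).indices  # indices of the lowest-error couplings
    return x0[keep], x1[keep]

# Toy usage with a placeholder velocity model that always predicts zero.
model = lambda x, t: torch.zeros_like(x)
x0, x1 = torch.randn(16, 2), torch.randn(16, 2)
src, dst = mac_select(model, x0, x1)
```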

Authors:Hengyuan Cao, Yutong Feng, Biao Gong, Yijing Tian, Yunhong Lu, Chuang Liu, Bin Wang
Title: Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis
Abstract:
Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed \textit{Dimension-Reduction Attack} (\texttt{DRA-Ctrl}), which utilizes the strengths of video models, including long-range context modeling and flatten full-attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. \texttt{DRA-Ctrl} provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.

Authors:Weizhe Kong, Xiao Wang, Ruichong Gao, Chenglong Li, Yu Zhang, Xing Yang, Yaowei Wang, Jin Tang
Title: Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition
Abstract:
Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, the potential vulnerability and anti-interference ability of PAR models have still not been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on the pedestrian images, based on the pre-trained CLIP-based PAR framework. It first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer is adopted to fuse the obtained vision and text tokens, and a feed-forward network is utilized for attribute recognition. Based on the aforementioned PAR framework, we adopt adversarial semantic and label perturbation to generate the adversarial noise, termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments conducted on both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains fully validate the effectiveness of our proposed adversarial attack and defense strategies for pedestrian attribute recognition. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR.
中文摘要:本文针对行人属性识别首次提出对抗攻击与防御框架,通过多模态Transformer和CLIP嵌入技术开发攻击策略及语义偏移防御方法,在数字与物理场景下的实验验证了其有效性。
English Summary: This paper introduces the first adversarial attack and defense framework for pedestrian attribute recognition, utilizing multimodal Transformers and CLIP-based embeddings to develop both attack strategies and a semantic offset defense method validated across digital and physical domains.

Authors:Hao Li, Ju Dai, Xin Zhao, Feng Zhou, Junjun Pan, Lei Li
Title: Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
Abstract:
In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module-Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple Speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. Our source code is available at https://github.com/wslh852/Wav2Sem.git.
中文: 本文提出Wav2Sem即插即用模块,通过语义信息解耦音频特征,有效缓解语音驱动面部动画中近音节的耦合效应,从而提升唇形生成的精确度与自然度。
English: This paper introduces Wav2Sem, a plug-and-play module that enhances 3D speech-driven facial animation by decorrelating audio features with semantic information, effectively reducing the averaging effect of phonetically similar syllables for more precise and natural lip movements.

Authors:Chuandong Liu, Huijiao Wang, Lei Yu, Gui-Song Xia
Title: Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting
Abstract:
Recent advances in 3D Gaussian Splatting have shown remarkable potential for novel view synthesis. However, most existing large-scale scene reconstruction methods rely on the divide-and-conquer paradigm, which often leads to the loss of global scene information and requires complex parameter tuning due to scene partitioning and local optimization. To address these limitations, we propose MixGS, a novel holistic optimization framework for large-scale 3D scene reconstruction. MixGS models the entire scene holistically by integrating camera pose and Gaussian attributes into a view-aware representation, which is decoded into fine-detailed Gaussians. Furthermore, a novel mixing operation combines decoded and original Gaussians to jointly preserve global coherence and local fidelity. Extensive experiments on large-scale scenes demonstrate that MixGS achieves state-of-the-art rendering quality and competitive speed, while significantly reducing computational requirements, enabling large-scale scene reconstruction training on a single 24GB VRAM GPU. The code will be released at https://github.com/azhuantou/MixGS.
Chinese: 提出的MixGS框架通过视图感知表示和混合操作,整体优化相机位姿与高斯属性,解决了大规模3D场景重建中的全局信息丢失问题,在单GPU上以更低计算成本实现了最优渲染质量。
English: The proposed MixGS framework overcomes limitations in large-scale 3D scene reconstruction by holistically optimizing camera poses and Gaussian attributes through a view-aware representation and mixing operation, achieving state-of-the-art rendering quality with reduced computational demands on a single GPU.

Authors:Mao-Lin Luo, Zi-Hao Zhou, Tong Wei, Min-Ling Zhang
Title: LADA: Scalable Label-Specific CLIP Adapter for Continual Learning
Abstract:
Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at https://github.com/MaolinLuo/LADA.
中文: LADA通过向冻结的CLIP图像编码器添加标签特定记忆单元,实现判别性特征生成,并利用特征蒸馏防止灾难性遗忘,在持续学习中达到最优性能。
English: LADA introduces label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation and preventing catastrophic forgetting through feature distillation, achieving state-of-the-art continual learning performance.
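The anti-forgetting ingredient is a feature-distillation term that pins the features of already-seen classes to a frozen snapshot while leaving new classes unconstrained. A minimal sketch, with the loss form assumed to be an MSE purely for illustration:

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(new_feats: torch.Tensor,
                              old_feats: torch.Tensor,
                              seen_mask: torch.Tensor) -> torch.Tensor:
    """Constrain features of seen-class samples to match a frozen snapshot."""
    if seen_mask.sum() == 0:
        return new_feats.new_zeros(())
    return F.mse_loss(new_feats[seen_mask], old_feats[seen_mask])

new_feats = torch.randn(8, 512)                   # current adapter features
old_feats = torch.randn(8, 512)                   # snapshot taken before the new task
seen = torch.tensor([1, 1, 0, 0, 1, 0, 0, 1], dtype=torch.bool)
loss = feature_distillation_loss(new_feats, old_feats, seen)
```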

Authors:Yixun Liang, Kunming Luo, Xiao Chen, Rui Chen, Hongyu Yan, Weiyu Li, Jiarui Liu, Ping Tan
Title: UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
Abstract:
We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first lift texture generation into 3D space via Texture Functions (TFs)--a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: https://github.com/YixunLiang/UniTEX.
中文: UniTEX提出了一种新颖的两阶段框架,通过在统一的三维功能空间中直接操作,利用纹理函数和基于Transformer的大型纹理模型绕过传统UV映射的限制,相比现有方法实现了更优越的纹理生成质量。
English: UniTEX introduces a novel two-stage framework for generating high-quality 3D textures by operating directly in a unified 3D functional space, bypassing traditional UV mapping limitations through Texture Functions and a transformer-based Large Texturing Model, achieving superior results compared to existing methods.
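The core idea of a Texture Function is a topology-free, volumetric lookup: any 3D point gets a texture value from nearby surface content. The toy query below uses inverse-distance weighting over sampled surface points purely to illustrate that interface; UniTEX predicts the function with a transformer rather than nearest neighbours.

```python
import numpy as np

def query_texture_function(points, surface_samples, surface_colors, k: int = 4):
    """Color arbitrary 3D points from their k closest surface samples (toy)."""
    dists = np.linalg.norm(points[:, None, :] - surface_samples[None, :, :], axis=-1)
    idx = np.argsort(dists, axis=1)[:, :k]
    w = 1.0 / (np.take_along_axis(dists, idx, axis=1) + 1e-8)
    w = w / w.sum(axis=1, keepdims=True)
    return (w[..., None] * surface_colors[idx]).sum(axis=1)

points = np.random.rand(10, 3)             # query points anywhere in space
surface = np.random.rand(1000, 3)          # sampled surface positions
colors = np.random.rand(1000, 3)           # their texture values
print(query_texture_function(points, surface, colors).shape)  # (10, 3)
```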

Authors:Wenhao Xu, Shuchen Zheng, Changwei Wang, Zherui Zhang, Chuan Ren, Rongtao Xu, Shibiao Xu
Title: SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.
Chinese: SAMamba是一种创新框架,融合了分层特征学习与选择性序列建模,通过弥合领域差异、保留细节并高效建模长程依赖,解决了红外小目标检测的核心难题,在基准数据集上实现了卓越性能。
English: SAMamba is a novel framework that combines hierarchical feature learning with selective sequence modeling to address infrared small target detection challenges by bridging domain gaps, preserving details, and efficiently modeling long-range dependencies, achieving superior performance on benchmark datasets.

Authors:Aldino Rizaldy, Richard Gloaguen, Fabian Ewald Fassnacht, Pedram Ghamisi
Title: HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers
Abstract:
Multimodal remote sensing data, including spectral and lidar or photogrammetry, is crucial for achieving satisfactory land-use / land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this method yields satisfactory classification results, it falls short in fully exploiting the potential of 3D data by restricting the model's ability to learn 3D spatial features directly from raw point clouds. Additionally, it limits the generation of 3D predictions, as the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn geometric and spectral features. To enhance the fusion process, we introduce a cross-attention-based mechanism that fully operates on 3D points, effectively integrating features from various modalities across multiple scales. The purpose of cross-attention is to allow one modality to assess the importance of another by weighing the relevant features. We evaluated our method by comparing it against both 3D and 2D methods using the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3D fusion delivers competitive results compared to 2D methods and offers more flexibility by providing 3D predictions. These predictions can be projected onto 2D maps, a capability that is not feasible in reverse. Additionally, we evaluated our method on different datasets, specifically the ISPRS Vaihingen 3D and the IEEE 2019 Data Fusion Contest. Our code will be published here: https://github.com/aldinorizaldy/hyperpointformer.
中文摘要:本研究提出了一种全三维方法,通过多模态遥感数据和双分支Transformer直接从点云学习几何与光谱特征,实现了具有竞争力的分类结果并能灵活生成三维预测。
English Summary: This study introduces a fully 3D-based method using multimodal remote sensing data and a dual-branch Transformer to learn geometric and spectral features directly from point clouds, achieving competitive classification results and enabling flexible 3D predictions.
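The dual-branch fusion can be summarized as two cross-attention calls per block, one per direction, so each modality weighs the relevance of the other for every point. A minimal PyTorch sketch of that block (not the released architecture):

```python
import torch
import torch.nn as nn

class DualBranchCrossAttention(nn.Module):
    """One geometric branch and one spectral branch attending to each other."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.geo_from_spec = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spec_from_geo = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, geo, spec):
        geo_out, _ = self.geo_from_spec(query=geo, key=spec, value=spec)
        spec_out, _ = self.spec_from_geo(query=spec, key=geo, value=geo)
        return geo + geo_out, spec + spec_out        # residual fusion

geo = torch.randn(2, 1024, 64)    # (batch, points, dim) geometric tokens
spec = torch.randn(2, 1024, 64)   # spectral tokens for the same points
fused_geo, fused_spec = DualBranchCrossAttention()(geo, spec)
```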

Authors:Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, He Wang
Title: TrackVLA: Embodied Visual Tracking in the Wild
Abstract:
Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect diverse difficulty levels of recognition samples, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.

Authors:Junyi Guo, Jingxuan Zhang, Fangyu Wu, Huanda Lu, Qiufeng Wang, Wenmian Yang, Eng Gee Lim, Dongming Lu
Title: HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
Abstract:
Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset are available at https://github.com/Maple498/HiGarment.
中文摘要:本文提出FS2RG任务,通过结合平面草图和文本指导生成逼真服装图像,并针对织物细节表现与多模态冲突问题,开发了包含跨模态语义增强与协调注意力机制的HiGarment框架。
English Summary: This paper introduces the FS2RG task for generating realistic garment images from flat sketches and text, addressing challenges in fabric detail representation and conflicting modality guidance through the HiGarment framework with multi-modal enhancement and harmonized attention mechanisms.

Authors:Ping Wang, Lishun Wang, Gang Qu, Xiaodong Wang, Yulun Zhang, Xin Yuan
Title: Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging
Abstract:
Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for the single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that the learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at https://github.com/pwangcs/ProxUnroll.
中文摘要:本文提出了一种新颖的近端展开网络,将即插即用方法的灵活性与展开方法的高精度和快速性相结合,实现了单一模型即可有效处理不同压缩比的单像素成像重建任务。
English Summary: This paper introduces a novel proximal unrolling network that combines the flexibility of plug-and-play methods with the superior accuracy and speed of unrolling approaches for single-pixel imaging, enabling a single model to handle varying compression ratios effectively.
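For orientation, a half-quadratic-splitting unrolling alternates a closed-form data-fidelity step with a learned restorer standing in for the proximal operator of the image prior. The linear-algebra sketch below shows that skeleton for a generic y = Ax problem; the paper's deep image restorer and proximal-trajectory loss are not reproduced.

```python
import torch

def hqs_unrolled_recon(y, A, restorer, num_stages: int = 5, rho: float = 1.0):
    """Unrolled HQS for y = A x: prior step via `restorer`, then a data step."""
    n = A.shape[1]
    x = A.t() @ y                                 # simple back-projection init
    eye = torch.eye(n)
    for _ in range(num_stages):
        z = restorer(x)                           # prior / proximal step
        # Data step: argmin_x ||y - A x||^2 + rho * ||x - z||^2 (closed form).
        x = torch.linalg.solve(A.t() @ A + rho * eye, A.t() @ y + rho * z)
    return x

# Toy usage: random sensing matrix, identity "restorer" in place of a network.
A = torch.randn(32, 128)
x_true = torch.randn(128)
y = A @ x_true
x_hat = hqs_unrolled_recon(y, A, restorer=lambda v: v)
```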

Authors:Shohei Enomoto
Title: Pseudo Multi-Source Domain Generalization: Bridging the Gap Between Single and Multi-Source Domain Generalization
Abstract:
Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at https://github.com/s-enmt/PseudoDomainBed.
中文摘要:本文提出的伪多源域泛化(PMDG)框架通过从单源数据生成合成多域数据集,解决了多源域泛化的实际应用限制,使MDG算法能在更现实的单源场景中有效运用。
English Summary: The proposed Pseudo Multi-source Domain Generalization (PMDG) framework addresses the limitations of multi-source domain generalization by generating synthetic multi-domain datasets from single-source data, enabling effective application of MDG algorithms in practical settings.
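One lightweight way to read the pseudo-domain construction is as a set of fixed, mutually distinct transforms applied to the single source, each treated as its own domain by the downstream MDG algorithm. The torchvision recipes below are illustrative stand-ins, not the style-transfer pipeline used in the paper.

```python
from torchvision import transforms

def make_pseudo_domains(num_domains: int = 3):
    """Distinct photometric recipes acting as stand-in pseudo-domains (sketch)."""
    recipes = [
        transforms.ColorJitter(brightness=0.6, contrast=0.6),
        transforms.Compose([transforms.Grayscale(num_output_channels=3),
                            transforms.GaussianBlur(kernel_size=5)]),
        transforms.ColorJitter(hue=0.4, saturation=0.9),
    ]
    return recipes[:num_domains]

# Each (image, label) pair is replicated once per recipe, tagged with a domain
# index, and the resulting synthetic multi-domain set is fed to an MDG trainer.
pseudo_domains = make_pseudo_domains()
```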

Authors:Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiaxiong Qiu, Zheng Zhu, Guan Huang, Zhizhong Su
Title: RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer
Abstract:
Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: https://horizonrobotics.github.io/robot_lab/robotransfer
中文摘要:RoboTransfer是一种基于扩散的视频生成框架,通过多视角几何集成和场景组件控制,提升了机器人数据合成的几何一致性与视觉保真度,并在策略训练中实现了显著性能提升。
English Summary: RoboTransfer is a diffusion-based video generation framework that enhances geometric consistency and visual fidelity in robotic data synthesis, achieving significant performance improvements in policy training.

Authors:Xiao Yu, Yan Fang, Xiaojie Jin, Yao Zhao, Yunchao Wei
Title: PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Abstract:
Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework, featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.
中文: 本文提出On-AVEP实时音视频事件解析方法,采用PreFM框架通过预测性未来建模增强上下文理解,以更少参数实现显著优于现有方法的性能。
English: This paper introduces On-AVEP, a real-time audio-visual event parsing method using the PreFM framework that enhances context through predictive future modeling and achieves superior performance with fewer parameters.

Authors:Jinquan Guan, Qi Chen, Lizhou Liang, Yuhang Liu, Vu Minh Hieu Phan, Minh-Son To, Jian Chen, Yutong Xie
Title: Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning
Abstract:
Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).
中文: 该摘要介绍了CXRTrek数据集,它模拟放射科医生在胸片诊断中的多阶段推理过程,以及基于此构建的CXRTrekNet视觉语言模型,该模型通过融入临床推理流程,在多项任务中展现出优于现有模型的性能。
English: The abstract introduces CXRTrek, a multi-stage visual question answering dataset designed to simulate radiologists' diagnostic reasoning in chest X-ray interpretation, and CXRTrekNet, a vision-language model that incorporates clinical reasoning flow, demonstrating superior performance over existing models.

Authors:Tongtong Su, Chengyu Wang, Jun Huang, Dongming Lu
Title: Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing
Abstract:
Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at https://github.com/Tonniia/Zero2Hero.
中文摘要:本文提出Zero-to-Hero方法,通过基于参考的视频编辑先将锚帧编辑为参考图像并利用对应关系引导注意力机制在帧间传播外观,再通过条件生成修复模型解决成像退化问题,相比最佳基线方法PSNR指标提升2.6分贝。
English Summary: This paper introduces Zero-to-Hero, a reference-based video editing method that first edits an anchor frame as reference and propagates its appearance across frames using correspondence-guided attention, then addresses imaging degradation through a conditional generative restoration model, achieving significant performance improvements over baselines.

Authors:Siyuan Wang, Jiawei Liu, Wei Wang, Yeying Jin, Jinsong Du, Zhi Han
Title: MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation
Abstract:
Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of different parts of the body in terms of amplitude of motion, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. Specifically, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, as well as motion masks and motion features generated from the audio signal to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture. The model and code are available at https://github.com/SIA-IDE/MMGT.
中文:MMGT模型通过两阶段处理,利用音频驱动的运动遮罩和特征生成同步的伴随语音手势视频,克服了仅依赖音频的局限,实现了高质量的上半身运动细节与精准同步。
English: The MMGT model uses audio-driven motion masks and features in a two-stage process to generate synchronized co-speech gesture videos, overcoming limitations of audio-only methods by producing high-quality upper-body movements with enhanced detail and synchronization.

Authors:Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
Title: Multi-Sourced Compositional Generalization in Visual Question Answering
Abstract:
Compositional generalization is the ability of generalizing novel compositions from seen primitives, and has received much attention in vision-and-language (V\&L) recently. Due to the multi-modal nature of V\&L tasks, the primitives composing compositions source from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, \textit{i.e.}, multi-sourced compositional generalization (MSCG) remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at https://github.com/NeverMoreLCH/MSCG.
中文: 本文提出了一种检索增强训练框架,通过为跨模态基元学习统一表征来增强视觉问答模型的多源组合泛化能力,并基于新构建的GQA-MSCG数据集验证了其有效性。
English: This paper introduces a retrieval-augmented training framework to enhance multi-sourced compositional generalization in visual question answering models by learning unified representations for cross-modal primitives, validated through a newly constructed GQA-MSCG dataset.

Authors:Yihang Wu, Muhammad Owais, Reem Kateb, Ahmad Chaddad
Title: Deep Modeling and Optimization of Medical Image Classification
Abstract:
Deep models, such as convolutional neural networks (CNNs) and vision transformer (ViT), demonstrate remarkable performance in image classification. However, those deep models require large amounts of data for fine-tuning, which is impractical in the medical domain due to data privacy issues. Furthermore, despite the feasible performance of contrastive language image pre-training (CLIP) in the natural domain, the potential of CLIP has not been fully investigated in the medical field. To address these challenges, we considered three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning techniques to protect data privacy, and 3) we involve traditional machine learning (ML) methods to improve the generalization ability of those deep models on unseen domain data. The experimental results indicate that maxvit shows the highest averaged (AVG) test metrics (AVG = 87.03\%) on the HAM10000 dataset with multimodal learning, while convnext\_l demonstrates a remarkable test F1-score of 83.98\% compared to swin\_b with 81.33\% in the FL model. Furthermore, the use of a support vector machine (SVM) can improve the overall test metrics by an AVG of $\sim 2\%$ for the swin transformer series on ISIC2018. Our codes are available at https://github.com/AIPMLab/SkinCancerSimulation.
中文摘要:深度学习模型在图像分类中表现优异,但在医学领域面临数据稀缺和隐私问题,为此研究提出了一种新型CLIP变体,结合联邦学习和传统机器学习方法,有效提升了医学图像分析的性能与泛化能力。
English Summary: Deep models like CNNs and ViT excel in image classification but face data scarcity and privacy issues in medicine, prompting the development of a novel CLIP variant combined with federated learning and traditional ML to enhance performance and generalization in medical image analysis.

Authors:Minh Nguyen Nhat To, Paul F. R. Wilson, Viet Nguyen, Mohamed Harmanani, Michael Cooper, Fahimeh Fooladgar, Purang Abolmaesumi, Parvin Mousavi, Rahul G. Krishnan
Title: Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift
Abstract:
Subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at https://github.com/minhto2802/dpe4subpop.
中文: 针对机器学习中的子群体分布偏移问题,本文提出的多样化原型集成方法通过用混合原型分类器替代标准分类器,自适应捕捉子群体风险,在多种数据集的最差组准确率上优于现有最优方法。
English: Subpopulation shift in machine learning models is addressed by the proposed Diverse Prototypical Ensembles method, which replaces standard classifiers with a mixture of prototypical classifiers to adaptively capture subpopulation risks and outperforms prior methods in worst-group accuracy across diverse datasets.
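To make the classifier replacement concrete, below is a minimal PyTorch sketch of a mixture of prototypical classifiers placed on top of a frozen feature extractor. The class name `DiversePrototypicalHead`, the member count, and the plain averaging of member logits are illustrative assumptions, not the released DPE code, and the diversity-promoting training objective described in the abstract is omitted.

```python
import torch
import torch.nn as nn


class DiversePrototypicalHead(nn.Module):
    """Toy ensemble of prototypical classifiers replacing a linear layer.

    Each member holds one learnable prototype per class; a member's logits are
    negative squared distances to its prototypes, and the ensemble averages
    the member logits.
    """

    def __init__(self, feat_dim: int, num_classes: int, num_members: int = 5):
        super().__init__()
        self.prototypes = nn.Parameter(
            torch.randn(num_members, num_classes, feat_dim) * 0.02
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, D), broadcast against every member's (C, D) prototype bank
        diff = feats.unsqueeze(0).unsqueeze(2) - self.prototypes.unsqueeze(1)  # (M, B, C, D)
        member_logits = -(diff ** 2).sum(dim=-1)   # closer prototype -> higher logit
        return member_logits.mean(dim=0)           # average the ensemble members


head = DiversePrototypicalHead(feat_dim=512, num_classes=10)
logits = head(torch.randn(4, 512))  # features from any frozen backbone
print(logits.shape)  # torch.Size([4, 10])
```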

Authors:Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi
Title: SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model
Abstract:
High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at https://github.com/Mr-Bamboo/SeG-SR.
中文:提出的SeG-SR框架通过融合视觉语言模型的语义指导来提升遥感图像超分辨率性能,其设计的语义提取与调制模块能有效利用高层场景理解消除重建伪影,在多个数据集上实现了最优效果。
English: The proposed SeG-SR framework enhances remote sensing image super-resolution by integrating semantic guidance from vision-language models, achieving state-of-the-art performance through modules that extract and utilize high-level scene understanding to prevent artifacts.
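The abstract describes the Learnable Modulation Module as using semantic guidance to modulate the SR network's features; a FiLM-style per-channel scale-and-shift is one plausible realization. The sketch below assumes a pooled VLM embedding as the guidance vector; the module name and shapes are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class LearnableModulation(nn.Module):
    """FiLM-style modulation: a semantic guidance vector predicts per-channel
    scale and shift applied to the SR network's feature maps."""

    def __init__(self, sem_dim: int, feat_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(sem_dim, 2 * feat_channels)

    def forward(self, feats: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); sem: (B, sem_dim), e.g. a pooled VLM embedding
        scale, shift = self.to_scale_shift(sem).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feats * (1 + scale) + shift


mod = LearnableModulation(sem_dim=512, feat_channels=64)
out = mod(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```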

Authors:Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
Title: NegVQA: Can Vision Language Models Understand Negation?
Abstract:
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
中文: NegVQA基准测试表明视觉语言模型在理解否定语义方面存在显著困难,不仅表现大幅下降,还呈现出模型规模与性能间的U型缩放规律。
English: The NegVQA benchmark reveals that vision language models struggle significantly with understanding negation, showing performance drops and a U-shaped scaling trend with model size increases.

Authors:Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, Chongyang Ma
Title: ATI: Any Trajectory Instruction for Controllable Video Generation
Abstract:
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: https://anytraj.github.io/.
中文: 我们提出了一种统一的视频生成运动控制框架,通过轨迹输入无缝整合相机运动、物体平移和局部精细运动,在多种任务中展现出卓越的可控性和视觉质量。
English: We introduce a unified motion control framework for video generation that integrates camera movement, object translation, and local motion through trajectory inputs, achieving superior controllability and visual quality across various tasks.

Authors:Ben Weiss
Title: Fast Isotropic Median Filtering
Abstract:
Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
中文: 该方法首次突破了以往中值滤波算法的局限,能高效处理任意位深、核尺寸及凸形核(包括圆形),避免了方形核产生的条纹伪影。
English: The proposed method overcomes the limitations of previous median filtering algorithms by efficiently handling arbitrary bit depths, kernel sizes, and convex shapes, including circular ones, without the artifacts associated with square kernels.

Authors:Ruichen Chen
Title: Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
Abstract:
Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1\% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45\% end-to-end and over 92\% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code is available at https://github.com/cccrrrccc/Re-ttention.
中文: Re-ttention通过利用历史softmax分布重塑注意力分数,在极高稀疏度下保持完整注意力的视觉质量,同时实现超过45%的端到端延迟降低。
English: Re-ttention enables high-quality visual generation with extremely sparse attention by reshaping scores using prior softmax distributions, achieving over 45% latency reduction while maintaining full-attention quality.
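One way to read "reshaping attention scores based on the prior softmax distribution history" is to cache the per-query softmax normalizer from an earlier, denser denoising step and reuse it when only a few tokens are attended at later steps, so the sparse weights keep a scale closer to full attention. The toy function below illustrates that reading only; it is an assumption, not the paper's actual reshaping rule.

```python
import torch


def sparse_attention_with_cached_stats(q, k, v, keep_idx, cached_denom):
    """Toy sparse attention that reuses a cached softmax denominator.

    q, k, v: (B, T, D); keep_idx: (S,) indices of the tokens kept at this step;
    cached_denom: (B, T) per-query softmax normalizer saved from an earlier,
    denser step. The weights deliberately use the cached normalizer instead of
    a sparse-only softmax, so they do not sum to one over the kept tokens.
    """
    d = q.size(-1)
    scores = q @ k[:, keep_idx].transpose(1, 2) / d ** 0.5   # (B, T, S)
    weights = scores.exp() / cached_denom.unsqueeze(-1)      # reuse full-attention stats
    return weights @ v[:, keep_idx]                          # (B, T, D)


# usage: cache the denominator at a full-attention step, reuse it when sparse
B, T, D, S = 1, 16, 8, 4
q, k, v = torch.randn(3, B, T, D).unbind(0)
full_scores = q @ k.transpose(1, 2) / D ** 0.5
cached_denom = full_scores.exp().sum(dim=-1)                 # (B, T)
out = sparse_attention_with_cached_stats(q, k, v, torch.arange(S), cached_denom)
print(out.shape)  # torch.Size([1, 16, 8])
```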

Authors:Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
Title: CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models
Abstract:
Existing protein language models (PLMs) generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at https://github.com/yinjunbo/cfpgen.
中文: CFP-Gen是一种扩散语言模型,通过整合多模态功能、序列和结构约束,能够生成具有天然蛋白质功能的新型蛋白质,并在多功能设计上实现高成功率。
English: CFP-Gen is a diffusion language model that integrates multimodal functional, sequence, and structural constraints to generate novel proteins with natural-like functionality and a high success rate in multifunctional designs.

Authors:Hidenobu Matsuki, Gwangbin Bae, Andrew J. Davison
Title: 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians
Abstract:
We propose the first 4D tracking and mapping method that jointly performs camera localization and non-rigid surface reconstruction via differentiable rendering. Our approach captures 4D scenes from an online stream of color images with depth measurements or predictions by jointly optimizing scene geometry, appearance, dynamics, and camera ego-motion. Although natural environments exhibit complex non-rigid motions, 4D-SLAM remains relatively underexplored due to its inherent challenges; even with 2.5D signals, the problem is ill-posed because of the high dimensionality of the optimization space. To overcome these challenges, we first introduce a SLAM method based on Gaussian surface primitives that leverages depth signals more effectively than 3D Gaussians, thereby achieving accurate surface reconstruction. To further model non-rigid deformations, we employ a warp-field represented by a multi-layer perceptron (MLP) and introduce a novel camera pose estimation technique along with surface regularization terms that facilitate spatio-temporal reconstruction. In addition to these algorithmic challenges, a significant hurdle in 4D SLAM research is the lack of reliable ground truth and evaluation protocols, primarily due to the difficulty of 4D capture using commodity sensors. To address this, we present a novel open synthetic dataset of everyday objects with diverse motions, leveraging large-scale object models and animation modeling. In summary, we open up the modern 4D-SLAM research by introducing a novel method and evaluation protocols grounded in modern vision and rendering techniques.

Authors:Kostas Triaridis, Panagiotis Kaliosis, E-Ro Nguyen, Jingyi Xu, Hieu Le, Dimitris Samaras
Title: Improving Contrastive Learning for Referring Expression Counting
Abstract:
Object counting has progressed from class-specific models, which count only known categories, to class-agnostic models that generalize to unseen categories. The next challenge is Referring Expression Counting (REC), where the goal is to count objects based on fine-grained attributes and contextual differences. Existing methods struggle with distinguishing visually similar objects that belong to the same category but correspond to different referring expressions. To address this, we propose C-REX, a novel contrastive learning framework, based on supervised contrastive learning, designed to enhance discriminative representation learning. Unlike prior works, C-REX operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning, thus providing a more stable contrastive signal. It also guarantees a significantly larger pool of negative samples, leading to improved robustness in the learned representations. Moreover, we showcase that our framework is versatile and generic enough to be applied to other similar tasks like class-agnostic counting. To support our approach, we analyze the key components of state-of-the-art detection-based models and identify that detecting object centroids instead of bounding boxes is the key common factor behind their success in counting tasks. We use this insight to design a simple yet effective detection-based baseline to build upon. Our experiments show that C-REX achieves state-of-the-art results in REC, outperforming previous methods by more than 22\% in MAE and more than 10\% in RMSE, while also demonstrating strong performance in class-agnostic counting. Code is available at https://github.com/cvlab-stonybrook/c-rex.
中文: 物体计数从类别特定模型发展到类别无关模型,当前面临指代表达式计数(REC)的挑战,我们提出的C-REX框架通过完全在图像空间运行的对比学习方法,增强了判别性表征学习和鲁棒性,从而实现了最先进的性能。
English: Object counting has evolved from class-specific to class-agnostic models, and now faces the challenge of Referring Expression Counting (REC), which our proposed C-REX framework addresses with a contrastive learning approach that operates entirely in the image space, achieving state-of-the-art results by improving discriminative representation learning and robustness.

Authors:Zhangyi Hu, Jiemin Wu, Hua Xu, Mingqian Liao, Ninghui Feng, Bo Gao, Songning Lai, Yutao Yue
Title: IMTS is Worth Time $\times$ Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction
Abstract:
Irregular Multivariate Time Series (IMTS) forecasting is challenging due to the unaligned nature of multi-channel signals and the prevalence of extensive missing data. Existing methods struggle to capture reliable temporal patterns from such data due to significant missing values. While pre-trained foundation models show potential for addressing these challenges, they are typically designed for Regularly Sampled Time Series (RTS). Motivated by the visual Mask AutoEncoder's (MAE) powerful capability for modeling sparse multi-channel information and its success in RTS forecasting, we propose VIMTS, a framework adapting Visual MAE for IMTS forecasting. To mitigate the effect of missing values, VIMTS first processes IMTS along the timeline into feature patches at equal intervals. These patches are then complemented using learned cross-channel dependencies. Then it leverages visual MAE's capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique to generate precise predictions from focused contexts. In addition, we integrate self-supervised learning for improved IMTS modeling by adapting the visual MAE to IMTS data. Extensive experiments demonstrate VIMTS's superior performance and few-shot capability, advancing the application of visual foundation models in more general time series tasks. Our code is available at https://github.com/WHU-HZY/VIMTS.
中文摘要:VIMTS框架通过将不规则多元时间序列处理为等间隔特征补丁,利用跨通道依赖关系和自监督学习,有效解决了多通道信号不对齐和缺失数据带来的预测挑战,显著提升了预测性能。
English Summary: The VIMTS framework adapts the Visual Mask AutoEncoder to effectively forecast Irregular Multivariate Time Series by processing data into patches, leveraging cross-channel dependencies, and employing self-supervised learning to overcome challenges posed by missing values and unaligned signals.
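The first stage, bucketing an irregular series into equal-interval time x channel patches with an observation mask, can be sketched in plain NumPy as below. The mean-per-bucket aggregation and the helper name `patchify_imts` are illustrative assumptions; in the paper the patch features are learned and then completed with cross-channel dependencies rather than simply averaged.

```python
import numpy as np


def patchify_imts(timestamps, values, channels, num_channels, t_start, t_end, patch_len):
    """Bucket an irregular multivariate time series into equal-interval
    time x channel patches, plus an observation mask for missing entries.

    timestamps: (N,) observation times; values: (N,) observed values;
    channels: (N,) channel index of each observation. Observations that fall
    into the same (patch, channel) bucket are averaged.
    """
    num_patches = int(np.ceil((t_end - t_start) / patch_len))
    patch_sum = np.zeros((num_patches, num_channels))
    patch_cnt = np.zeros((num_patches, num_channels))
    p = np.clip(((timestamps - t_start) // patch_len).astype(int), 0, num_patches - 1)
    np.add.at(patch_sum, (p, channels), values)
    np.add.at(patch_cnt, (p, channels), 1)
    mask = patch_cnt > 0
    patches = np.divide(patch_sum, patch_cnt, out=np.zeros_like(patch_sum), where=mask)
    return patches, mask  # each (num_patches, num_channels)


# usage: 3 channels observed at irregular times over [0, 10) with patch length 2
ts = np.array([0.3, 0.9, 4.2, 7.7, 9.5])
vals = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
chs = np.array([0, 1, 0, 2, 1])
patches, mask = patchify_imts(ts, vals, chs, num_channels=3, t_start=0.0, t_end=10.0, patch_len=2.0)
print(patches.shape, int(mask.sum()))  # (5, 3) 5
```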

Authors:Siddharth Ancha, Sunshine Jiang, Travis Manderson, Laura Brandt, Yilun Du, Philip R. Osteen, Nicholas Roy
Title: Anomalies by Synthesis: Anomaly Detection using Generative Diffusion Models for Off-Road Navigation
Abstract:
In order to navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique is purely test-time that can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation. Project website: https://siddancha.github.io/anomalies-by-diffusion-synthesis/

Authors:Marco Colussi, Dragan Ahmetovic, Sergio Mascetti
Title: MIAS-SAM: Medical Image Anomaly Segmentation without thresholding
Abstract:
This paper presents MIAS-SAM, a novel approach for the segmentation of anomalous regions in medical images. MIAS-SAM uses a patch-based memory bank to store relevant image features, which are extracted from normal data using the SAM encoder. At inference time, the embedding patches extracted from the SAM encoder are compared with those in the memory bank to obtain the anomaly map. Finally, MIAS-SAM computes the center of gravity of the anomaly map to prompt the SAM decoder, obtaining an accurate segmentation from the previously extracted features. Unlike prior works, MIAS-SAM does not require defining a threshold value to obtain the segmentation from the anomaly map. Experimental results conducted on three publicly available datasets, each with a different imaging modality (Brain MRI, Liver CT, and Retina OCT), show accurate anomaly segmentation capabilities measured using the DICE score. The code is available at: https://github.com/warpcut/MIAS-SAM
中文: MIAS-SAM提出了一种基于图像块记忆库的新方法,利用SAM编码器从正常数据提取特征进行医学图像异常区域分割,无需设定阈值即可在多种成像模态中实现精准分割。
English: MIAS-SAM introduces a patch-based memory bank approach using SAM encoder features from normal data to segment anomalies in medical images without thresholding, achieving high accuracy across multiple imaging modalities.
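A compact sketch of the inference path described above, under the assumption that the anomaly score of each patch is its nearest-neighbor distance to the memory bank and that the point prompt is the anomaly map's center of gravity. The SAM encoder and decoder calls themselves are omitted, and the function name is hypothetical.

```python
import numpy as np


def anomaly_map_and_prompt(patch_feats, memory_bank, grid_hw):
    """Score test patches against a memory bank of normal features and derive
    a point prompt from the anomaly map, without any threshold.

    patch_feats: (P, D) test-image patch embeddings; memory_bank: (M, D)
    embeddings collected from normal images; grid_hw: (H, W) with P == H * W.
    """
    # nearest-neighbor distance to the normal-feature memory bank
    d2 = ((patch_feats[:, None, :] - memory_bank[None, :, :]) ** 2).sum(-1)
    scores = np.sqrt(d2.min(axis=1)).reshape(grid_hw)            # (H, W) anomaly map

    # center of gravity of the anomaly map, in patch-grid coordinates
    ys, xs = np.indices(grid_hw)
    w = scores / (scores.sum() + 1e-8)
    prompt_yx = (float((ys * w).sum()), float((xs * w).sum()))
    return scores, prompt_yx


feats = np.random.randn(64, 256).astype(np.float32)
bank = np.random.randn(500, 256).astype(np.float32)
amap, prompt = anomaly_map_and_prompt(feats, bank, (8, 8))
print(amap.shape, prompt)  # (8, 8) and a (y, x) point to prompt the decoder
```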

Authors:Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, Yimeng Wang, Kai Yu, Wenxuan Chen, Ziwei Feng, Zijian Gong, Jianzhuang Pan, Yi Peng, Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, Tao Mei
Title: HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
Abstract:
Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multimodal interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model, namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the codes and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, HiDream-E1 through our project websites: https://github.com/HiDream-ai/HiDream-I1 and https://github.com/HiDream-ai/HiDream-E1. All features can be directly experienced via https://vivago.ai/studio.
中文: HiDream-I1是一个拥有170亿参数的开源图像生成模型,采用新型稀疏扩散变换器和动态专家混合架构,可在数秒内实现顶尖图像生成质量,并能扩展为支持文本生成图像和指令编辑的完整图像代理系统。
English: HiDream-I1 is a 17B-parameter open-source image generation model that achieves state-of-the-art quality within seconds using a novel sparse Diffusion Transformer structure with dynamic Mixture-of-Experts, and it evolves into a comprehensive image agent supporting both text-to-image generation and instruction-based editing.

Authors:Kaiyu Yue, Vasu Singla, Menglin Jia, John Kirchenbauer, Rifaa Qadri, Zikui Cai, Abhinav Bhatele, Furong Huang, Tom Goldstein
Title: Zero-Shot Vision Encoder Grafting via LLM Surrogates
Abstract:
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a potentially promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting -- when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder. The code is at https://github.com/facebookresearch/zero.
中文: 视觉语言模型通过先用共享目标大语言模型嵌入空间的小型替代模型训练视觉编码器,实现直接迁移,不仅性能相当,还能将训练成本降低约45%。
English: Vision language models can reduce training costs by first training the vision encoder with a small surrogate model that shares the target large language model's embedding space, enabling direct transfer and achieving comparable performance while cutting costs by about 45%.
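The surrogate construction, inheriting the target LLM's shallow layers so the surrogate shares its embedding space, can be illustrated on a toy decoder as below. `ToyDecoder` and its attribute names are stand-ins; real checkpoints expose different module layouts, so this is a sketch of the idea rather than a recipe for any particular model.

```python
import copy
import torch
import torch.nn as nn


class ToyDecoder(nn.Module):
    """Stand-in for a decoder-only LLM: token embeddings, a block stack, a head."""

    def __init__(self, vocab: int = 1000, dim: int = 64, depth: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(ids)
        for layer in self.layers:
            x = layer(x)
        return self.head(x)


def build_surrogate(target: ToyDecoder, num_shallow_layers: int) -> ToyDecoder:
    """Surrogate = the target's embeddings, its first few blocks, and its head,
    so an encoder trained against the surrogate lives in the target's space."""
    surrogate = copy.deepcopy(target)
    surrogate.layers = surrogate.layers[:num_shallow_layers]  # keep only the shallow blocks
    return surrogate


target = ToyDecoder(depth=8)
surrogate = build_surrogate(target, num_shallow_layers=2)
out = surrogate(torch.randint(0, 1000, (1, 16)))
print(out.shape)  # torch.Size([1, 16, 1000])
```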

Authors:Aimon Rahman, Kartik Narayan, Vishal M. Patel
Title: Training Free Stylized Abstraction
Abstract:
Stylized abstraction synthesizes visually exaggerated yet semantically faithful representations of subjects, balancing recognizability with perceptual distortion. Unlike image-to-image translation, which prioritizes structural fidelity, stylized abstraction demands selective retention of identity cues while embracing stylistic divergence, especially challenging for out-of-distribution individuals. We propose a training-free framework that generates stylized abstractions from a single image using inference-time scaling in vision-language models (VLLMs) to extract identity-relevant features, and a novel cross-domain rectified flow inversion strategy that reconstructs structure based on style-dependent priors. Our method adapts structural restoration dynamically through style-aware temporal scheduling, enabling high-fidelity reconstructions that honor both subject and style. It supports multi-round abstraction-aware generation without fine-tuning. To evaluate this task, we introduce StyleBench, a GPT-based human-aligned metric suited for abstract styles where pixel-level similarity fails. Experiments across diverse abstraction (e.g., LEGO, knitted dolls, South Park) show strong generalization to unseen identities and styles in a fully open-source setup.

Authors:Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Abstract:
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-arts on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4\% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.
中文: 近期大型视觉语言模型因冗长视觉标记导致计算效率低下,为此提出VScan框架,通过整合全局-局部扫描与中间层剪枝的两阶段方法减少标记冗余,在加速推理的同时保持高性能。
English: Recent Large Vision-Language Models face computational inefficiency from lengthy visual tokens, prompting the development of VScan, a two-stage framework that reduces token redundancy through integrated global-local scans and intermediate layer pruning to accelerate inference while maintaining high performance.
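As a rough illustration of the second stage, the sketch below prunes visual tokens at an intermediate LLM layer according to the attention mass they receive from text queries. The importance criterion, the keep ratio, and the token layout (visual tokens first, text tokens after) are assumptions made for the example; they are not necessarily VScan's scoring rule.

```python
import torch


def prune_visual_tokens(hidden, attn, visual_slice, keep_ratio=0.25):
    """Keep only the most-attended visual tokens at an intermediate layer.

    hidden: (B, T, D) hidden states; attn: (B, H, T, T) attention weights from
    the layer just computed; visual_slice: positions of the visual tokens.
    """
    B, H, T, _ = attn.shape
    vis_idx = torch.arange(visual_slice.start, visual_slice.stop, device=hidden.device)
    text_mask = torch.ones(T, dtype=torch.bool, device=hidden.device)
    text_mask[vis_idx] = False

    # importance of each visual token = attention it receives from text queries
    importance = attn[:, :, text_mask][:, :, :, vis_idx].mean(dim=(1, 2))   # (B, V)
    num_keep = max(1, int(keep_ratio * vis_idx.numel()))
    keep = importance.topk(num_keep, dim=-1).indices + visual_slice.start   # (B, K)
    keep, _ = keep.sort(dim=-1)                      # restore positional order

    kept_visual = torch.gather(hidden, 1, keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
    return torch.cat([kept_visual, hidden[:, text_mask]], dim=1)


hidden = torch.randn(2, 80, 64)                      # 64 visual + 16 text tokens
attn = torch.rand(2, 8, 80, 80).softmax(dim=-1)
out = prune_visual_tokens(hidden, attn, slice(0, 64))
print(out.shape)  # torch.Size([2, 32, 64]): 16 kept visual + 16 text tokens
```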

Authors:Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, Wenhan Luo
Title: Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Abstract:
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
中文摘要:本文提出MultiTalk框架,通过标签旋转位置嵌入解决多人对话视频生成中的音画绑定问题,并采用部分参数训练和多任务训练保持基础模型的指令跟随能力,在多个数据集上展现出优于现有方法的生成性能。
English Summary: The paper introduces MultiTalk, a novel framework for multi-person conversational video generation that resolves audio-person binding issues with Label Rotary Position Embedding and maintains instruction-following ability through specialized training techniques, outperforming existing methods across multiple datasets.

Authors:Jixin Zhao, Shangchen Zhou, Zhouxia Wang, Peiqing Yang, Chen Change Loy
Title: ObjectClear: Complete Object Removal via Object-Effect Attention
Abstract:
Object removal requires eliminating not only the target object but also its effects, such as shadows and reflections. However, diffusion-based inpainting methods often produce artifacts, hallucinate content, alter background, and struggle to remove object effects accurately. To address this challenge, we introduce a new dataset for OBject-Effect Removal, named OBER, which provides paired images with and without object effects, along with precise masks for both objects and their associated visual artifacts. The dataset comprises high-quality captured and simulated data, covering diverse object categories and complex multi-object scenes. Building on OBER, we propose a novel framework, ObjectClear, which incorporates an object-effect attention mechanism to guide the model toward the foreground removal regions by learning attention masks, effectively decoupling foreground removal from background reconstruction. Furthermore, the predicted attention map enables an attention-guided fusion strategy during inference, greatly preserving background details. Extensive experiments demonstrate that ObjectClear outperforms existing methods, achieving improved object-effect removal quality and background fidelity, especially in complex scenarios.
中文摘要:本文提出OBER数据集和ObjectClear框架,通过物体效应注意力机制有效分离前景移除与背景重建,在复杂场景中显著提升了物体效应消除质量与背景保真度。
English Summary: This paper introduces the OBER dataset and ObjectClear framework to address limitations in diffusion-based inpainting methods, specifically targeting the removal of object effects like shadows and reflections while preserving background details through an innovative attention mechanism.

Authors:Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
Title: RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Abstract:
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, further progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforms most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
中文: RICO框架通过将描述重构为参考图像并识别差异来迭代优化图像描述,提升准确性和完整性,同时RICO-Flash利用DPO提高效率,在多个基准测试中显著优于现有方法。
English: The proposed RICO framework iteratively refines image captions by reconstructing them into reference images and identifying discrepancies to enhance accuracy and completeness, with RICO-Flash optimizing efficiency through DPO, achieving significant improvements over existing methods.
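The iterative refine-by-reconstruction loop reduces to a few lines once the text-to-image model and the comparing MLLM are abstracted as callables; both callables below are hypothetical placeholders, not part of the released code.

```python
def refine_caption(image, caption, t2i_generate, mllm_compare, num_rounds=3):
    """Minimal sketch of a RICO-style loop.

    t2i_generate(caption) -> image is any text-to-image model, and
    mllm_compare(original, reconstruction, caption) -> str is any MLLM prompt
    that describes the discrepancies between the two images and returns a
    revised caption. Both are placeholders supplied by the caller.
    """
    for _ in range(num_rounds):
        reconstruction = t2i_generate(caption)                   # caption -> reference image
        revised = mllm_compare(image, reconstruction, caption)   # fix what differs
        if revised == caption:                                   # no discrepancies left
            break
        caption = revised
    return caption
```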

Authors:Dmitrii Sorokin, Maksim Nakhodnov, Andrey Kuznetsov, Aibek Alanov
Title: ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models
Abstract:
Recent advances in diffusion models have led to impressive image generation capabilities, but aligning these models with human preferences remains challenging. Reward-based fine-tuning using models trained on human feedback improves alignment but often harms diversity, producing less varied outputs. In this work, we address this trade-off with two contributions. First, we introduce \textit{combined generation}, a novel sampling strategy that applies a reward-tuned diffusion model only in the later stages of the generation process, while preserving the base model for earlier steps. This approach mitigates early-stage overfitting and helps retain global structure and diversity. Second, we propose \textit{ImageReFL}, a fine-tuning method that improves image diversity with minimal loss in quality by training on real images and incorporating multiple regularizers, including diffusion and ReFL losses. Our approach outperforms conventional reward tuning methods on standard quality and diversity metrics. A user study further confirms that our method better balances human preference alignment and visual diversity. The source code can be found at https://github.com/ControlGenAI/ImageReFL .
中文摘要:本文提出了一种结合新型采样策略和微调方法的两部分方案,有效提升了扩散模型在人类偏好对齐与图像多样性之间的平衡能力,其表现优于传统的奖励调优方法。
English Summary: This paper introduces a two-part approach combining a novel sampling strategy and a fine-tuning method to enhance diffusion models by better balancing human preference alignment with image diversity, outperforming traditional reward tuning methods.
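Combined generation amounts to switching denoisers partway through sampling; a schematic sampler under that reading is shown below. `base_step`, `tuned_step`, and the 0.7 switch point are illustrative assumptions supplied by the caller, not the released implementation.

```python
def combined_generation(base_step, tuned_step, x_t, timesteps, switch_frac=0.7):
    """Run the base diffusion model for the early (high-noise) steps and hand
    over to the reward-tuned model for the final fraction of the trajectory.

    base_step and tuned_step are callables (x, t) -> x_prev wrapping each
    model's denoising update; timesteps is ordered from high to low noise.
    """
    switch_at = int(len(timesteps) * switch_frac)
    for i, t in enumerate(timesteps):
        step = base_step if i < switch_at else tuned_step
        x_t = step(x_t, t)
    return x_t
```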

Authors:Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu
Title: Thinking with Generated Images
Abstract:
We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
中文摘要:本文提出“生成图像思维”新范式,通过让大型多模态模型自主生成并优化中间视觉思维步骤,从根本上改变了视觉推理方式,在复杂场景中实现高达50%的相对性能提升。
English Summary: This paper introduces "Thinking with Generated Images," a paradigm that enhances large multimodal models' visual reasoning by enabling them to spontaneously generate and refine intermediate visual thoughts, achieving up to 50% relative improvement in complex scenarios.

Authors:Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu
Title: A Closer Look at Multimodal Representation Collapse
Abstract:
We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.

Authors:Zobia Batool, Huseyin Ozkan, Erchan Aptoula
Title: Single Domain Generalization for Alzheimer's Detection from 3D MRIs with Pseudo-Morphological Augmentations and Contrastive Learning
Abstract:
Although Alzheimer's disease detection via MRIs has advanced significantly thanks to contemporary deep learning models, challenges such as class imbalance, protocol variations, and limited dataset diversity often hinder their generalization capacity. To address this issue, this article focuses on the single domain generalization setting, where given the data of one domain, a model is designed and developed with maximal performance w.r.t. an unseen domain of distinct distribution. Since brain morphology is known to play a crucial role in Alzheimer's diagnosis, we propose the use of learnable pseudo-morphological modules aimed at producing shape-aware, anatomically meaningful class-specific augmentations in combination with a supervised contrastive learning module to extract robust class-specific representations. Experiments conducted across three datasets show improved performance and generalization capacity, especially under class imbalance and imaging protocol variations. The source code will be made available upon acceptance at https://github.com/zobia111/SDG-Alzheimer.
中文: 本研究通过引入可学习的伪形态学模块和监督对比学习,解决了阿尔茨海默病MRI检测中的泛化难题,在类别不平衡和成像协议差异下显著提升了模型性能与跨域适应能力。
English: This study tackles Alzheimer's disease detection challenges in MRI analysis by introducing learnable pseudo-morphological modules and supervised contrastive learning to enhance generalization across diverse domains, demonstrating improved performance despite class imbalance and protocol variations.

Authors:Qiucheng Yu, Yuan Xie, Xin Tan
Title: SHTOcc: Effective 3D Occupancy Prediction with Sparse Head and Tail Voxels
Abstract:
3D occupancy prediction has attracted much attention in the field of autonomous driving due to its powerful geometric perception and object recognition capabilities. However, existing methods have not explored the most essential distribution patterns of voxels, resulting in unsatisfactory results. This paper first explores the inter-class distribution and geometric distribution of voxels, thereby solving the long-tail problem caused by the inter-class distribution and the poor performance caused by the geometric distribution. Specifically, this paper proposes SHTOcc (Sparse Head-Tail Occupancy), which uses sparse head-tail voxel construction to accurately identify and balance key voxels in the head and tail classes, while using decoupled learning to reduce the model's bias towards the dominant (head) category and enhance the focus on the tail class. Experiments show that significant improvements have been made on multiple baselines: SHTOcc reduces GPU memory usage by 42.2%, increases inference speed by 58.6%, and improves accuracy by about 7%, verifying its effectiveness and efficiency. The code is available at https://github.com/ge95net/SHTOcc
中文摘要:本文提出SHTOcc方法,通过稀疏头尾体素构建和解耦学习解决体素分布不平衡问题,在内存效率、推理速度和精度方面均实现显著提升。
English Summary: This paper introduces SHTOcc, a 3D occupancy prediction method that addresses voxel distribution imbalances through sparse head-tail voxel construction and decoupled learning, achieving significant improvements in memory efficiency, inference speed, and accuracy.

Authors:Seun-An Choe, Keon-Hee Park, Jinwoo Choi, Gyeong-Moon Park
Title: Universal Domain Adaptation for Semantic Segmentation
Abstract:
Unsupervised domain adaptation for semantic segmentation (UDA-SS) aims to transfer knowledge from labeled source data to unlabeled target data. However, traditional UDA-SS methods assume that category settings between source and target domains are known, which is unrealistic in real-world scenarios. This leads to performance degradation if private classes exist. To address this limitation, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA-SS), achieving robust adaptation even without prior knowledge of category settings. We define the problem in the UniDA-SS scenario as low confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose UniMAP: UniDA-SS with Image Matching and Prototype-based Distinction, a novel framework composed of two key components. First, Domain-Specific Prototype-based Distinction (DSPD) divides each class into two domain-specific prototypes, enabling finer separation of domain-specific features and enhancing the identification of common classes across domains. Second, Target-based Image Matching (TIM) selects a source image containing the most common-class pixels based on the target pseudo-label and pairs it in a batch to promote effective learning of common classes. We also introduce a new UniDA-SS benchmark and demonstrate through various experiments that UniMAP significantly outperforms baselines. The code is available at https://github.com/KU-VGI/UniMAP.
中文摘要:提出的UniMAP框架通过引入领域特定原型和目标图像匹配方法,有效解决了语义分割中未知类别设置的通用领域自适应问题,显著超越了现有基准方法。
English Summary: The proposed UniMAP framework addresses universal domain adaptation for semantic segmentation by introducing domain-specific prototypes and target-based image matching to effectively handle unknown category settings between domains, significantly outperforming existing methods.

Authors:Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Title: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Abstract:
Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3 %$\rightarrow$72.9 % on MathVista, 62.9 %$\rightarrow$68.7 % on We-Math), using standard datasets without ground-truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
中文: MM-UPT提出了一种基于GRPO和无监督自奖励机制的后训练框架,无需外部监督即可显著提升多模态大语言模型的推理能力,其效果甚至接近有监督方法。
English: MM-UPT introduces an unsupervised post-training framework using GRPO with a self-rewarding mechanism, significantly enhancing MLLMs' reasoning without external supervision and even approaching supervised method performance.
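The self-rewarding signal, majority voting over several sampled responses in place of a ground-truth reward, is simple to sketch; the per-response rewards produced below would then feed GRPO's group-relative advantage computation. The helper is an illustration, not the released implementation.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward each sampled response 1.0 if its extracted answer matches the
    majority answer for the question, else 0.0. `answers` is a list of
    already-extracted answer strings for one question."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


# usage: six sampled responses to one math question
print(majority_vote_rewards(["12", "12", "15", "12", "10", "12"]))
# [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```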

Authors:Van-Tin Luu, Yon-Lin Cai, Vu-Hoang Tran, Wei-Chen Chiu, Yi-Ting Chen, Ching-Chun Huang
Title: RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network
Abstract:
This paper presents a groundbreaking approach - the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird's-eye views. The frontal view contains rich but sensitive height information, whereas the bird's-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method significantly outperforms previous radar-camera auto-calibration methods, as well as existing state-of-the-art LiDAR-camera calibration techniques, establishing a new benchmark for future research. The code is available at https://github.com/nycu-acm/RC-AutoCalib.
中文摘要:本文提出了首个雷达-相机系统的在线自动标定方法,通过双视角特征融合和跨模态注意力机制有效解决了数据稀疏性和高度不确定性问题,在nuScenes数据集上实现了超越现有最佳方法的性能。
English Summary: This paper introduces the first online automatic calibration method for radar-camera systems, employing dual-perspective feature fusion and cross-modal attention to overcome data sparsity and height uncertainty, achieving state-of-the-art performance on the nuScenes dataset.

Authors:Anthony Chen, Wenzhao Zheng, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Kurt Keutzer, Shanghang Zhang
Title: GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control
Abstract:
Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches exhibit deficiencies in maintaining robust 3D geometric consistency or accumulating artifacts during occlusion handling, both critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.
中文: GeoDrive通过将稳健的3D几何条件融入驾驶世界模型,提升了自动驾驶的空间感知和行动可控性,从而实现更真实、适应性更强的仿真场景建模。
English: GeoDrive enhances autonomous driving safety by integrating robust 3D geometry into world models, improving spatial awareness and action controllability for more realistic and adaptable simulations.

Authors:Jiadong Pan, Zhiyuan Ma, Kaiyan Zhang, Ning Ding, Bowen Zhou
Title: Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation
Abstract:
Diffusion models have recently demonstrated exceptional performance in image generation tasks. However, existing image generation methods still significantly suffer from the dilemma of image reasoning, especially in logic-centered image generation tasks. Inspired by the success of Chain of Thought (CoT) and Reinforcement Learning (RL) in LLMs, we propose SRRL, a self-reflective RL algorithm for diffusion models to achieve reasoning generation of logical images by performing reflection and iteration across generation trajectories. The intermediate samples in the denoising process carry noise, making accurate reward evaluation difficult. To address this challenge, SRRL treats the entire denoising trajectory as a CoT step with a multi-round reflective denoising process and introduces a condition-guided forward process, which allows for reflective iteration between CoT steps. Through SRRL-based iterative diffusion training, we introduce image reasoning through CoT into generation tasks adhering to physical laws and unconventional physical phenomena for the first time. Notably, experimental results of a case study exhibit the superior performance of our SRRL algorithm even when compared with GPT-4o. The project page is https://jadenpan0.github.io/srrl.github.io/.
中文摘要:SRRL算法通过结合自反思强化学习和思维链推理,显著提升了扩散模型在逻辑图像生成中的表现,实验证明其效果甚至优于GPT-4o。
English Summary: The SRRL algorithm enhances diffusion models by integrating self-reflective reinforcement learning and Chain of Thought reasoning, enabling superior logical image generation that outperforms even GPT-4o in experimental evaluations.

Authors:Shriram M S, Xinyue Hao, Shihao Hou, Yang Lu, Laura Sevilla-Lara, Anurag Arnab, Shreyank N Gowda
Title: Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training
Abstract:
The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, simple enough to implement and use that they could become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. These savings do not come at any cost in accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: https://github.com/bazyagami/LearningWithRevision
Chinese: 本文提出的渐进式数据丢弃训练方法,在无需改变模型结构或优化器的前提下,显著减少了所需训练轮次,同时保持甚至提升了模型精度。
English: The paper introduces Progressive Data Dropout, a training method that significantly reduces the number of epochs needed while maintaining or even improving accuracy, without requiring changes to model architecture or optimizer.
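The abstract does not spell out the exact dropout schedule, so the following Python sketch is only a rough illustration of the general idea: after each epoch, a fraction of the lowest-loss (presumably already-learned) samples is dropped, so later epochs iterate over progressively fewer examples. The drop criterion, schedule, and all names here are our assumptions, not the authors' algorithm.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset

class IndexedDataset(Dataset):
    """Wraps an (x, y) dataset so each item also returns its original index."""
    def __init__(self, base):
        self.base = base
    def __len__(self):
        return len(self.base)
    def __getitem__(self, i):
        x, y = self.base[i]
        return x, y, i

def progressive_data_dropout(model, dataset, epochs=10, drop_frac=0.1, lr=1e-2):
    """Toy loop: after every epoch, drop the easiest `drop_frac` of the remaining
    samples (lowest per-sample loss), shrinking the effective epoch size over time."""
    wrapped = IndexedDataset(dataset)
    criterion = torch.nn.CrossEntropyLoss(reduction="none")
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    active = list(range(len(wrapped)))

    for _ in range(epochs):
        loader = DataLoader(Subset(wrapped, active), batch_size=64, shuffle=True)
        sample_loss = {}
        for x, y, idx in loader:
            optimizer.zero_grad()
            losses = criterion(model(x), y)
            losses.mean().backward()
            optimizer.step()
            sample_loss.update(zip(idx.tolist(), losses.detach().tolist()))
        keep = int(len(active) * (1.0 - drop_frac))
        # Keep only the hardest (highest-loss) samples for the next epoch.
        active = sorted(sample_loss, key=sample_loss.get, reverse=True)[:keep]
    return model
```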

Authors:Wancai Zheng, Linlin Ou, Jiajie He, Libo Zhou, Xinyi Yu, Yan Wei
Title: UP-SLAM: Adaptively Structured Gaussian SLAM with Uncertainty Prediction in Dynamic Environments
Abstract:
Recent 3D Gaussian Splatting (3DGS) techniques for Visual Simultaneous Localization and Mapping (SLAM) have significantly progressed in tracking and high-fidelity mapping. However, their sequential optimization framework and sensitivity to dynamic objects limit real-time performance and robustness in real-world scenarios. We present UP-SLAM, a real-time RGB-D SLAM system for dynamic environments that decouples tracking and mapping through a parallelized framework. A probabilistic octree is employed to manage Gaussian primitives adaptively, enabling efficient initialization and pruning without hand-crafted thresholds. To robustly filter dynamic regions during tracking, we propose a training-free uncertainty estimator that fuses multi-modal residuals to estimate per-pixel motion uncertainty, achieving open-set dynamic object handling without reliance on semantic labels. Furthermore, a temporal encoder is designed to enhance rendering quality. Concurrently, low-dimensional features are efficiently transformed via a shallow multilayer perceptron to construct DINO features, which are then employed to enrich the Gaussian field and improve the robustness of uncertainty prediction. Extensive experiments on multiple challenging datasets suggest that UP-SLAM outperforms state-of-the-art methods in both localization accuracy (by 59.8%) and rendering quality (by 4.57 dB PSNR), while maintaining real-time performance and producing reusable, artifact-free static maps in dynamic environments. The project: https://aczheng-cai.github.io/up_slam.github.io/
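The abstract describes a training-free estimator that fuses multi-modal residuals into per-pixel motion uncertainty, but not the fusion rule itself. The sketch below is a loose, hypothetical illustration: photometric and depth residuals are robustly normalized and combined into a [0, 1] uncertainty map, which is then thresholded to mask likely dynamic pixels. The weights, median/MAD normalization, and threshold are our assumptions.

```python
import numpy as np

def fuse_residuals_to_uncertainty(photo_res, depth_res, w_photo=0.5, w_depth=0.5):
    """Combine per-pixel photometric and depth residuals (H x W arrays) into a
    [0, 1] uncertainty map. Illustrative only; the normalization and weights are
    assumptions, not the UP-SLAM formulation."""
    def robust_norm(r):
        med = np.median(r)
        mad = np.median(np.abs(r - med)) + 1e-6
        return np.clip((np.abs(r) - med) / (3.0 * mad), 0.0, None)

    u = w_photo * robust_norm(photo_res) + w_depth * robust_norm(depth_res)
    return np.clip(u, 0.0, 1.0)

def dynamic_mask(uncertainty, thresh=0.5):
    """Pixels above the threshold are treated as dynamic and excluded from tracking."""
    return uncertainty > thresh
```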

Authors:Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Title: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Abstract:
Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3 %$\rightarrow$73.4 % on MathVista, 62.9 %$\rightarrow$70.4 % on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
中文: 本研究表明多模态大语言模型中的自我修正模式在强化学习训练前就已存在,并提出结合监督微调与强化学习的两阶段方法,在推理基准测试中实现了最优性能。
English: This study demonstrates that self-correction patterns in multimodal LLMs exist before RL training and proposes a two-stage approach combining supervised fine-tuning and reinforcement learning, achieving state-of-the-art performance on reasoning benchmarks.

Authors:Haibin He, Jing Zhang, Maoyuan Ye, Juhua Liu, Bo Du, Dacheng Tao
Title: GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
Abstract:
Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
中文: 本文提出GoMatching++方法,通过冻结图像文本识别器并引入轻量级跟踪器,有效将现成的图像文本识别器转化为视频专用系统,在多个基准测试中创下性能记录并显著降低训练成本。
English: This paper introduces GoMatching++, a method that efficiently converts an image text spotter into a specialized video text spotter by adding a lightweight tracker and addressing domain gaps, achieving state-of-the-art results on benchmarks while reducing training costs.

Authors:Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno
Title: Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers
Abstract:
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
中文: 扩散变换器在视频生成中表现卓越,但高计算需求限制了其在边缘设备上的部署;为此开发的Q-VDiT量化框架通过令牌感知误差补偿和时间蒸馏技术,显著提升了性能并设定了新基准。
English: Diffusion transformers excel in video generation but face deployment challenges on edge devices due to high computational demands, prompting the development of Q-VDiT, a specialized quantization framework that enhances performance through token-aware error compensation and temporal distillation.

Authors:Guanwen Feng, Zhiyuan Ma, Yunan Li, Jiahao Yang, Junwei Jing, Qiguang Miao
Title: FaceEditTalker: Controllable Talking Head Generation with Facial Attribute Editing
Abstract:
Recent advances in audio-driven talking head generation have achieved impressive results in lip synchronization and emotional expression. However, they largely overlook the crucial task of facial attribute editing. This capability is indispensable for achieving deep personalization and expanding the range of practical applications, including user-tailored digital avatars, engaging online education content, and brand-specific digital customer service. In these key domains, flexible adjustment of visual attributes, such as hairstyle, accessories, and subtle facial features, is essential for aligning with user preferences, reflecting diverse brand identities and adapting to varying contextual demands. In this paper, we present FaceEditTalker, a unified framework that enables controllable facial attribute manipulation while generating high-quality, audio-synchronized talking head videos. Our method consists of two key components: an image feature space editing module, which extracts semantic and detail features and allows flexible control over attributes like expression, hairstyle, and accessories; and an audio-driven video generation module, which fuses these edited features with audio-guided facial landmarks to drive a diffusion-based generator. This design ensures temporal coherence, visual fidelity, and identity preservation across frames. Extensive experiments on public datasets demonstrate that our method achieves comparable or superior performance to representative baseline methods in lip-sync accuracy, video quality, and attribute controllability. Project page: https://peterfanfan.github.io/FaceEditTalker/

Authors:Hang Chen, Maoyuan Ye, Peng Yang, Haibin He, Juhua Liu, Bo Du
Title: Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation
Abstract:
Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex background, which is of great significance for maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenarios, especially those with fine structure. In this paper, we propose ELE-SAM, adapting SAM for the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter to achieve better prompt tokens via incorporating global-local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex background, we design a High-Fidelity Mask Decoder by leveraging multi-granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale and real-world dataset for PTCHS, including 44,094 image-mask pairs. Experimental results on ELE-40K demonstrate the superior performance of ELE-SAM, which outperforms the baseline model by an average of 16.8% mIoU and 20.6% mBIoU. Moreover, average absolute improvements of 2.9% mIoU and 3.8% mBIoU over the state-of-the-art method on HQSeg-44K further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE-SAM.
中文: 本文提出了ELE-SAM模型,基于Segment Anything Model改进用于电力传输走廊危险目标分割,通过上下文感知提示适配器和高保真掩码解码器增强复杂背景下精细结构目标的识别能力,并在新构建的ELE-40K数据集上验证了其显著性能提升。
English: This paper introduces ELE-SAM, an enhanced model based on the Segment Anything Model (SAM) for power transmission corridor hazard segmentation, featuring a Context-Aware Prompt Adapter and a High-Fidelity Mask Decoder to improve segmentation of fine-structured objects in complex backgrounds, validated by the newly constructed ELE-40K dataset with significant performance gains.

Authors:Junhuan Liu, San Jiang, Wei Ge, Wei Huang, Bingxuan Guo, Qingquan Li
Title: UAVPairs: A Challenging Benchmark for Match Pair Retrieval of Large-scale UAV Images
Abstract:
The primary contribution of this paper is a challenging benchmark dataset, UAVPairs, and a training pipeline designed for match pair retrieval of large-scale UAV images. First, the UAVPairs dataset, comprising 21,622 high-resolution images across 30 diverse scenes, is constructed; the 3D points and tracks generated by SfM-based 3D reconstruction are employed to define the geometric similarity of image pairs, ensuring genuinely matchable image pairs are used for training. Second, to solve the problem of expensive mining cost for global hard negative mining, a batched nontrivial sample mining strategy is proposed, leveraging the geometric similarity and multi-scene structure of UAVPairs to generate training samples so as to accelerate training. Third, recognizing the limitation of pair-based losses, the ranked list loss is designed to improve the discrimination of image retrieval models, which optimizes the global similarity structure constructed from the positive set and negative set. Finally, the effectiveness of the UAVPairs dataset and training pipeline is validated through comprehensive experiments on three distinct large-scale UAV datasets. The experiment results demonstrate that models trained with the UAVPairs dataset and the ranked list loss achieve significantly improved retrieval accuracy compared to models trained on existing datasets or with conventional losses. Furthermore, these improvements translate to enhanced view graph connectivity and higher quality of reconstructed 3D models. The models trained by the proposed approach perform more robustly compared with hand-crafted global features, particularly in challenging repetitively textured scenes and weakly textured scenes. For match pair retrieval of large-scale UAV images, the trained image retrieval models offer an effective solution. The dataset will be made publicly available at https://github.com/json87/UAVPairs.
中文: 本文提出了UAVPairs基准数据集,包含21,622张高分辨率无人机图像,以及配套的训练流程,通过批量非平凡样本挖掘策略和排序列表损失函数,显著提升了图像检索精度和三维重建质量。
English: This paper introduces UAVPairs, a challenging benchmark dataset with 21,622 high-resolution UAV images, and a training pipeline that includes a batched nontrivial sample mining strategy and ranked list loss to significantly improve retrieval accuracy and 3D reconstruction quality.
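The abstract does not reproduce the loss itself; as a rough sketch, a ranked-list-style objective as commonly used in the retrieval literature pulls positives inside a boundary (alpha - margin) and pushes negatives beyond alpha, counting only boundary-violating pairs. The hyperparameters and exact form below are illustrative and may differ from the paper's version.

```python
import torch
import torch.nn.functional as F

def ranked_list_style_loss(anchor, positives, negatives, alpha=1.2, margin=0.4):
    """Simplified ranked-list loss on L2-normalized embeddings.
    anchor: (D,), positives: (P, D), negatives: (N, D).
    Positives are pulled inside the boundary (alpha - margin); negatives are
    pushed beyond alpha. Only violating ("non-trivial") pairs contribute."""
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    d_pos = torch.cdist(anchor.unsqueeze(0), positives).squeeze(0)  # (P,)
    d_neg = torch.cdist(anchor.unsqueeze(0), negatives).squeeze(0)  # (N,)

    pos_loss = F.relu(d_pos - (alpha - margin)).mean()
    neg_loss = F.relu(alpha - d_neg).mean()
    return pos_loss + neg_loss
```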

Authors:Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, Feng Zhao
Title: VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Abstract:
Effectively retrieving, reasoning and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior Multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
中文摘要:VRAG-RL框架通过让视觉语言模型自主采样并优化推理轨迹,结合交互式搜索引擎查询和视觉感知标记,解决了当前视觉RAG方法在复杂视觉信息推理中的固有限制。
English Summary: The VRAG-RL framework addresses limitations in visual reasoning for RAG systems by enabling VLMs to autonomously sample and optimize reasoning trajectories through interactive search engine queries and visual perception tokens.
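The abstract mentions an action space with cropping and scaling so the model can inspect visually rich pages from coarse to fine. The toy action definitions below (using PIL, with our own parameterization) only illustrate the kind of operations involved, not VRAG-RL's actual action interface.

```python
from PIL import Image

def crop_action(img: Image.Image, box_frac=(0.25, 0.25, 0.75, 0.75)) -> Image.Image:
    """Crop a fractional region (left, top, right, bottom in [0, 1]) of the page image."""
    w, h = img.size
    l, t, r, b = box_frac
    return img.crop((int(l * w), int(t * h), int(r * w), int(b * h)))

def scale_action(img: Image.Image, factor=2.0) -> Image.Image:
    """Upscale a (cropped) region so fine-grained content becomes legible to the VLM."""
    w, h = img.size
    return img.resize((int(w * factor), int(h * factor)), Image.BICUBIC)
```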

Authors:Wataru Ikeda, Masashi Hatano, Ryosei Hara, Mariko Isogawa
Title: Event-based Egocentric Human Pose Estimation in Dynamic Environment
Abstract:
Estimating human pose using a front-facing egocentric camera is essential for applications such as sports motion analysis, VR/AR, and AI for wearable devices. However, many existing methods rely on RGB cameras and do not account for low-light environments or motion blur. Event-based cameras have the potential to address these challenges. In this work, we introduce a novel task of human pose estimation using a front-facing event-based camera mounted on the head and propose D-EventEgo, the first framework for this task. The proposed method first estimates the head poses, and then these are used as conditions to generate body poses. However, when estimating head poses, the presence of dynamic objects mixed with background events may reduce head pose estimation accuracy. Therefore, we introduce the Motion Segmentation Module to remove dynamic objects and extract background information. Extensive experiments on our synthetic event-based dataset derived from EgoBody, demonstrate that our approach outperforms our baseline in four out of five evaluation metrics in dynamic environments.

Authors:Mengjingcheng Mo, Xinyang Tong, Jiaxu Leng, Mingpi Tan, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, Xinbo Gao
Title: A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding
Abstract:
While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of "Where" anomalies occur and "Why" they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model's latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel "seeking" mechanism that simulates UAV flight behavior by directing the model's attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code will be released at https://hayneyday.github.io/A2Seek/.

Authors:Senmao Li, Lei Wang, Kai Wang, Tao Liu, Jiehang Xie, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang
Title: One-Way Ticket:Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models
Abstract:
Text-to-Image (T2I) diffusion models have made remarkable advancements in generative modeling; however, they face a trade-off between inference speed and image quality, posing challenges for efficient deployment. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder TiUE for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining the computational efficiency.
中文: 本研究提出TiUE时间无关统一编码器,通过在多时间步解码器间共享编码器特征实现并行采样,大幅降低文生图扩散模型推理耗时,并借助KL散度正则化提升生成图像的多样性和真实感。
English: The study introduces TiUE, a time-independent unified encoder that enables parallel sampling in text-to-image diffusion models by sharing encoder features across decoders, significantly reducing inference time while enhancing image diversity and realism through KL divergence regularization.
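To make the one-pass feature-sharing idea concrete, here is a minimal structural sketch with stand-in modules: the encoder runs once and its features are reused by the decoder at every timestep, so per-step predictions can be computed in parallel. The module shapes and crude time conditioning are our assumptions, not the actual TiUE architecture.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a UNet encoder; returns shared features."""
    def __init__(self, c=4, hidden=32):
        super().__init__()
        self.conv = nn.Conv2d(c, hidden, 3, padding=1)
    def forward(self, x):
        return torch.relu(self.conv(x))

class TinyDecoder(nn.Module):
    """Stand-in for a time-conditioned UNet decoder."""
    def __init__(self, hidden=32, c=4):
        super().__init__()
        self.out = nn.Conv2d(hidden + 1, c, 3, padding=1)
    def forward(self, feats, t):
        t_map = torch.full_like(feats[:, :1], float(t))  # crude time conditioning
        return self.out(torch.cat([feats, t_map], dim=1))

encoder, decoder = TinyEncoder(), TinyDecoder()
x = torch.randn(2, 4, 16, 16)           # latent input
feats = encoder(x)                       # encoder runs only once
timesteps = [999, 666, 333, 0]
# The decoder is applied at every timestep while reusing the same encoder
# features, so the per-step predictions are independent and parallelizable.
preds = [decoder(feats, t) for t in timesteps]
```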

Authors:Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Soochahn Lee, Yong Jae Lee
Title: UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Abstract:
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD Code: https://github.com/plnguyen2908/UniTalk-ASD-code
中文: UniTalk是一个专为主动说话人检测设计的新数据集,专注于多样化的现实场景,如多语言和拥挤环境,以提升模型泛化能力,它作为一个新基准揭示了当前先进模型在真实条件下的不足。
English: UniTalk is a novel dataset designed for active speaker detection, focusing on challenging real-world scenarios like diverse languages and crowded scenes to improve model generalization, and it serves as a new benchmark that reveals the limitations of current state-of-the-art models under realistic conditions.

Authors:Wei Lin, Chenyang Zhao, Antoni B. Chan
Title: Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting
Abstract:
Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample capturing a dense crowd. In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. However, during implementation, the training encounters issues as the confidence for pseudo-labels fails to be propagated to background pixels via the P2P. To tackle this challenge, we devise a point-specific activation map (PSAM) to visually interpret the phenomena occurring during the ill-posed training. Observations from the PSAM suggest that the feature map is excessively activated by the loss for unlabeled data, causing the decoder to misinterpret these over-activations as pedestrians. To mitigate this issue, we propose a point-to-region (P2R) scheme to substitute P2P, which segments out local regions rather than detects a point corresponding to a pedestrian for supervision. Consequently, pixels in the local region can share the same confidence with the corresponding pseudo points. Experimental results in both semi-supervised counting and unsupervised domain adaptation highlight the advantages of our method, illustrating P2R can resolve issues identified in PSAM. The code is available at https://github.com/Elin24/P2RLoss.
中文: 本文提出了一种点对区域(P2R)方案替代点对点(P2P)方法,通过局部区域共享置信度解决半监督行人计数中的过度激活问题,有效提升了伪标签数据的训练效果。
English: This paper introduces a point-to-region (P2R) scheme to replace the point-to-point (P2P) method, addressing over-activation issues in semi-supervised pedestrian counting by enabling confidence sharing in local regions and improving training with pseudo-labels.
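The core change from P2P to P2R, as described, is that pixels in a local region around each pseudo point share that point's confidence rather than a single pixel carrying it. The sketch below illustrates this with a square region and a confidence-weighted BCE; the region shape, radius, and background weighting are our simplifications, not the paper's exact loss.

```python
import torch

def point_to_region_targets(points, confidences, hw, radius=4):
    """Build per-pixel target and confidence maps from pseudo points.
    points: (N, 2) tensor of (row, col); confidences: (N,); hw: (H, W).
    Every pixel inside a square of half-size `radius` around a point shares
    that point's confidence (square region and radius are our simplification)."""
    H, W = hw
    target = torch.zeros(H, W)
    conf = torch.zeros(H, W)
    for (r, c), s in zip(points.long().tolist(), confidences.tolist()):
        r0, r1 = max(r - radius, 0), min(r + radius + 1, H)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, W)
        target[r0:r1, c0:c1] = 1.0
        conf[r0:r1, c0:c1] = torch.maximum(conf[r0:r1, c0:c1], torch.tensor(float(s)))
    return target, conf

def p2r_style_loss(pred_map, target, conf, bg_weight=1.0):
    """Confidence-weighted BCE over the region map (pred_map values in (0, 1))."""
    bce = -(target * torch.log(pred_map + 1e-6) + (1 - target) * torch.log(1 - pred_map + 1e-6))
    weight = torch.where(target > 0, conf, torch.full_like(conf, bg_weight))
    return (weight * bce).sum() / (weight.sum() + 1e-6)
```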

Authors:Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong
Title: RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
Abstract:
We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.

Authors:Chenfeng Wei, Qi Wu, Si Zuo, Jiahua Xu, Boyang Zhao, Zeyu Yang, Guotao Xie, Shenhong Wang
Title: LiDARDustX: A LiDAR Dataset for Dusty Unstructured Road Environments
Abstract:
Autonomous driving datasets are essential for validating the progress of intelligent vehicle algorithms, which include localization, perception, and prediction. However, existing datasets are predominantly focused on structured urban environments, which limits the exploration of unstructured and specialized scenarios, particularly those characterized by significant dust levels. This paper introduces the LiDARDustX dataset, which is specifically designed for perception tasks under high-dust conditions, such as those encountered in mining areas. The LiDARDustX dataset consists of 30,000 LiDAR frames captured by six different LiDAR sensors, each accompanied by 3D bounding box annotations and point cloud semantic segmentation. Notably, over 80% of the dataset comprises dust-affected scenes. By utilizing this dataset, we have established a benchmark for evaluating the performance of state-of-the-art 3D detection and segmentation algorithms. Additionally, we have analyzed the impact of dust on perception accuracy and delved into the causes of these effects. The data and further information can be accessed at: https://github.com/vincentweikey/LiDARDustX.
Chinese: LiDARDustX数据集专门针对高粉尘环境下的感知任务,提供了30,000帧带标注的LiDAR数据,建立了三维检测与分割算法的性能基准,并深入分析了粉尘对感知精度的影响机制。
English: The LiDARDustX dataset addresses the lack of autonomous driving data in dusty environments by providing 30,000 LiDAR frames with annotations, enabling benchmark evaluation of 3D detection and segmentation algorithms while analyzing dust's impact on perception accuracy.

Authors:Jianchao Jiang, Haofeng Zhang
Title: Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation
Abstract:
Few-Shot Medical Image Segmentation (FSMIS) has been widely used to train a model that can perform segmentation from only a few annotated images. However, most existing prototype-based FSMIS methods generate multiple prototypes from the support image solely by random sampling or local averaging, which can cause particularly severe boundary blurring because normal features tend to account for the majority of features of a specific category. Consequently, we propose to focus more attention on those weaker features that are crucial for a clear segmentation boundary. Specifically, we design a Support Self-Prediction (SSP) module to identify such weak features by comparing the true support mask with one predicted by the global support prototype. Then, a Hard Prototypes Generation (HPG) module is employed to generate multiple hard prototypes based on these weak features. Subsequently, a Multiple Similarity Maps Fusion (MSMF) module is devised to generate the final segmentation mask in a dual-path fashion to mitigate the imbalance between foreground and background in medical images. Furthermore, we introduce a boundary loss to further constrain the edge of the segmentation. Extensive experiments on three publicly available medical image datasets demonstrate that our method achieves state-of-the-art performance. Code is available at https://github.com/jcjiang99/CoW.
中文: 本文提出了一种新颖的少样本医学图像分割方法,通过支持自预测模块和硬原型生成聚焦于弱特征以提升边界清晰度,在三个公开数据集上实现了最优性能。
English: This paper introduces a novel few-shot medical image segmentation method that enhances boundary clarity by focusing on weak features through a Support Self-Prediction module and Hard Prototypes Generation, achieving state-of-the-art results on three datasets.

Authors:Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal
Title: EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
Abstract:
Recent approaches on 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions to pretrained VDMs, with less than 1% of backbone model parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, without requiring modifications to the diffusion model backbone typically needed to mitigate rendering misalignments. Although being trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
Chinese: EPiC提出了一种无需昂贵相机轨迹标注的高效框架,通过基于首帧可见性对源视频进行掩码来生成高质量锚点视频,并结合轻量级Anchor-ControlNet模块,在视频扩散模型中实现精确的3D相机控制,以最少的资源达到了最先进的性能。
English: EPiC introduces an efficient framework that generates high-quality anchor videos without costly camera annotations by masking source videos and integrates a lightweight Anchor-ControlNet module to enable precise 3D camera control in video diffusion models, achieving state-of-the-art performance with minimal resources.

Authors:Guiping Cao, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
Title: Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection
Abstract:
Small Object Detection (SOD) poses significant challenges due to limited information and the model's low class prediction score. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone network, specialized in aggregating local information, struggles to capture the necessary contextual information for SOD. The multiple attention layers in the Transformer Encoder face difficulties in effectively attending to small objects and can also lead to blurring of features. Furthermore, the model's lower class prediction score of small objects compared to large objects further increases the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. This approach incorporates the deep MLP network to aggregate initial feature representations with both short and long range information for SOD. Then, a new Cross Coding Twice Module (CCTM) is applied to integrate these initial representations to the Transformer Encoder feature, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), integrating the Category and Size of objects. By treating CS as new ground truth, we propose a new loss function called Boost Loss to improve the class prediction score of the model. Extensive experimental results on COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% APs on COCO for SOD with only 45M parameters, outperforming the DINO by +4.4% APS (36.4% vs. 32.0%) with fewer parameters and FLOPs, under 12 epochs training setting. The source codes will be available at https://github.com/Med-Process/Cross-DINO.
中文摘要:Cross-DINO通过结合深度MLP网络和交叉二次编码模块增强小物体特征细节,同时引入类别-尺寸软标签及提升损失函数来提高分类预测分数,以更少参数实现了优于现有模型的检测性能。
English Summary: Cross-DINO enhances small object detection by integrating deep MLP networks and a Cross Coding Twice Module to improve feature details, while introducing Category-Size soft labels with Boost Loss to boost class prediction scores, achieving superior performance over existing models with fewer parameters.

Authors:Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, Todd Hollon
Title: Towards Scalable Language-Image Pre-training for 3D Medical Imaging
Abstract:
Language-image pre-training has demonstrated strong performance in 2D medical imaging, but its success in 3D modalities such as CT and MRI remains limited due to the high computational demands of volumetric data, which pose a significant barrier to training on large-scale, uncurated clinical studies. In this study, we introduce Hierarchical attention for Language-Image Pre-training (HLIP), a scalable pre-training framework for 3D medical imaging. HLIP adopts a lightweight hierarchical attention mechanism inspired by the natural hierarchy of radiology data: slice, scan, and study. This mechanism exhibits strong generalizability, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. Moreover, the computational efficiency of HLIP enables direct training on uncurated datasets. Trained on 220K patients with 3.13 million scans for brain MRI and 240K patients with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +32.4% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +1.4% and +6.9% macro AUC on head CT benchmarks RSNA and CQ500, respectively. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip
中文: HLIP框架采用分层注意力机制,实现了对大规模未筛选3D医学影像的高效语言-图像预训练,在多个CT和MRI基准测试中均取得了最先进的性能表现。
English: The HLIP framework introduces a hierarchical attention mechanism for 3D medical imaging that enables efficient language-image pre-training on large-scale uncurated datasets, achieving state-of-the-art performance across multiple CT and MRI benchmarks.
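The abstract names the slice-scan-study hierarchy but not the exact attention layout, so the following is a toy sketch: attention is applied within each slice, then across slices of a scan, then across scans of a study, with mean pooling between levels. The shapes, pooling scheme, and module sizes are our assumptions, not the HLIP design.

```python
import torch
import torch.nn as nn

class HierarchicalAttentionSketch(nn.Module):
    """Toy slice -> scan -> study attention, mirroring the radiology hierarchy."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.slice_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scan_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.study_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (studies, scans, slices, tokens, dim)
        S, C, L, T, D = x.shape
        x = x.reshape(S * C * L, T, D)
        x, _ = self.slice_attn(x, x, x)          # attention within each slice
        x = x.mean(dim=1).reshape(S * C, L, D)   # pool tokens -> slice embeddings
        x, _ = self.scan_attn(x, x, x)           # attention across slices of a scan
        x = x.mean(dim=1).reshape(S, C, D)       # pool slices -> scan embeddings
        x, _ = self.study_attn(x, x, x)          # attention across scans of a study
        return x.mean(dim=1)                     # study-level embedding

feats = torch.randn(2, 3, 8, 16, 64)  # 2 studies, 3 scans, 8 slices, 16 tokens each
print(HierarchicalAttentionSketch()(feats).shape)  # torch.Size([2, 64])
```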

Authors:Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang
Title: RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers
Abstract:
We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact growing as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePaViT-Large and RePaViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents an auspicious direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.
中文: 本研究揭示前馈网络层是视觉Transformer推理延迟的主要瓶颈,并提出通道空闲机制实现结构重参数化,由此开发的RePaViT模型在保持精度的同时大幅提升推理速度。
English: This study identifies feedforward network layers as the main bottleneck in Vision Transformer inference latency and introduces a channel idle mechanism enabling structural reparameterization to create efficient RePaViT models that achieve significant speed gains with minimal accuracy loss.
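A schematic sketch of the channel idle idea as described in the abstract: a subset of hidden channels bypasses the activation, and since that sub-path is purely linear it can be collapsed into a single matrix after training. The specific layout, ratio, and folding below are our illustration, not the released RePaViT code.

```python
import torch
import torch.nn as nn

class ChannelIdleFFN(nn.Module):
    """FFN where the last `idle_ratio` of hidden channels skip the nonlinearity.
    After training, that linear sub-path can be folded into one dim x dim matrix."""
    def __init__(self, dim=384, hidden=1536, idle_ratio=0.5):
        super().__init__()
        self.fc1, self.fc2, self.act = nn.Linear(dim, hidden), nn.Linear(hidden, dim), nn.GELU()
        self.n_active = int(hidden * (1 - idle_ratio))
        self.register_buffer("w_idle", None)

    def forward(self, x):
        if self.w_idle is not None:  # reparameterized inference path
            h_act = self.act(x @ self.fc1.weight[: self.n_active].T + self.fc1.bias[: self.n_active])
            return h_act @ self.fc2.weight[:, : self.n_active].T + self.fc2.bias + x @ self.w_idle.T
        h = self.fc1(x)
        h = torch.cat([self.act(h[..., : self.n_active]), h[..., self.n_active:]], dim=-1)
        return self.fc2(h)

    @torch.no_grad()
    def reparameterize(self):
        # Idle path fc2_idle(fc1_idle(x)) is linear, so collapse it into one matrix
        # and fold its bias contribution into fc2's bias.
        w1_idle, b1_idle = self.fc1.weight[self.n_active:], self.fc1.bias[self.n_active:]
        w2_idle = self.fc2.weight[:, self.n_active:]
        self.w_idle = w2_idle @ w1_idle
        self.fc2.bias.add_(w2_idle @ b1_idle)

ffn, x = ChannelIdleFFN(), torch.randn(2, 384)
before = ffn(x)
ffn.reparameterize()
print(torch.allclose(before, ffn(x), atol=1e-4))  # True: same function, cheaper path
```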

Authors:Claudia Cuttano, Gabriele Trivigno, Giuseppe Averta, Carlo Masone
Title: SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation
Abstract:
Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher-level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports flexible interaction via various prompts such as points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at https://github.com/ClaudiaCuttano/SANSA.
中文: SANSA框架通过显式对齐Segment Anything 2(SAM2)的潜在语义特征,以最小改动实现了最先进的少样本分割性能,在保持高效性的同时支持多种交互提示方式。
English: The SANSA framework enhances the Segment Anything 2 (SAM2) model by aligning its latent semantic features, enabling state-of-the-art few-shot segmentation with minimal modifications while maintaining efficiency and flexibility across various prompts.

Authors:Yanbo Wang, Justin Dauwels, Yilun Du
Title: Compositional Scene Understanding through Inverse Generative Modeling
Abstract:
Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model to best fit a given natural image. To enable this procedure to infer scene structure from images substantially different than those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at https://energy-based-model.github.io/compositional-inference.

Authors:Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira
Title: FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Abstract:
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA .
中文: 本文提出了FRAMES-VQA新基准,用于评估视觉问答任务中针对多模态分布变化的鲁棒微调方法,通过将数据集分类为分布内和分布外类型并分析模态交互作用,为开发更鲁棒的微调方法提供指导。
English: This paper introduces FRAMES-VQA, a new benchmark for evaluating robust fine-tuning in visual question answering across multi-modal distribution shifts, categorizing datasets into in-distribution and out-of-distribution types while analyzing modality interactions to guide future method development.
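The benchmark quantifies shifts with the Mahalanobis distance between embeddings; the snippet below shows one straightforward way such a score can be computed, measuring how far OOD embeddings lie from the ID embedding distribution. Whether the paper pools uni-modal and multi-modal embeddings before or after this step is not specified in the abstract, so that choice is left to the caller here.

```python
import numpy as np

def mahalanobis_shift(id_embeddings, ood_embeddings, eps=1e-6):
    """Mean Mahalanobis distance of OOD embeddings from the ID distribution.
    id_embeddings: (N, D), ood_embeddings: (M, D) numpy arrays."""
    mu = id_embeddings.mean(axis=0)
    cov = np.cov(id_embeddings, rowvar=False) + eps * np.eye(id_embeddings.shape[1])
    cov_inv = np.linalg.inv(cov)
    diff = ood_embeddings - mu
    d2 = np.einsum("nd,dk,nk->n", diff, cov_inv, diff)  # per-sample squared distance
    return float(np.sqrt(np.maximum(d2, 0.0)).mean())
```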

Authors:Yitong Li, Morteza Ghahremani, Christian Wachinger
Title: MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis
Abstract:
Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to pronounced domain shifts. At the same time, training a medical foundation model requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge comprises three key components. First, a Focal Sampling module that extracts high-resolution local regions to capture subtle pathological features and compensate for the limited input resolution of general-purpose VLMs. Second, a Query Encoder (QEncoder) injects a small set of learnable queries that attend to the frozen feature maps of VLM, aligning them with medical semantics without retraining the entire backbone. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of diverse VLMs to maximize diagnostic performance. We evaluate MedBridge on five medical imaging benchmarks across three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings, even under varying levels of training data availability. Notably, MedBridge achieved over 6-15% improvement in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging foundation models for accurate and data-efficient medical diagnosis. Our code is available at https://github.com/ai-med/MedBridge.
中文摘要:MedBridge是一种轻量级多模态适配框架,通过聚焦采样、查询编码和专家混合机制,将预训练的视觉语言模型高效应用于医学图像诊断,在多项基准测试中显著提升性能且资源消耗极低。
English Summary: MedBridge is a lightweight multimodal framework that adapts pretrained vision-language models for medical image diagnosis by incorporating focal sampling, query encoding, and mixture of experts to achieve superior performance with minimal computational overhead.
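The Focal Sampling module extracts high-resolution local regions to compensate for the limited input resolution of general-purpose VLMs. As a rough illustration only, the sketch below cuts a grid of overlapping crops from a large medical image and resizes each to the VLM input size; the grid layout, crop fraction, and output size are our assumptions.

```python
import torch
import torch.nn.functional as F

def focal_sampling(image, grid=3, crop_frac=0.5, out_size=224):
    """Cut overlapping high-resolution local crops from a (C, H, W) image and
    resize each to the VLM input size. Layout and sizes are illustrative."""
    _, H, W = image.shape
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    crops = []
    for i in range(grid):
        for j in range(grid):
            top = int(i * (H - ch) / max(grid - 1, 1))
            left = int(j * (W - cw) / max(grid - 1, 1))
            patch = image[:, top:top + ch, left:left + cw]
            crops.append(F.interpolate(patch.unsqueeze(0), size=(out_size, out_size),
                                       mode="bilinear", align_corners=False))
    return torch.cat(crops, dim=0)  # (grid * grid, C, out_size, out_size)

x = torch.randn(3, 1024, 1024)
print(focal_sampling(x).shape)  # torch.Size([9, 3, 224, 224])
```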

Authors:Ke Zhang, Cihan Xiao, Jiacong Xu, Yiqun Mei, Vishal M. Patel
Title: Think Before You Diffuse: Infusing Physical Rules into Video Diffusion
Abstract:
Recent video diffusion models have demonstrated their great capability in generating visually-pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to infer rich physical context from the text prompt. To incorporate this context into the video diffusion model, we use a multimodal large language model (MLLM) to verify intermediate latent variables against the inferred physical rules, guiding the gradient updates of the model accordingly. The textual output of the LLM is transformed into continuous signals. We then formulate a set of training objectives that jointly ensure physical accuracy and semantic alignment with the input text. Additionally, failure cases of physical phenomena are corrected via attention injection. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective finetuning. Extensive experiments on public benchmarks demonstrate that DiffPhy is able to produce state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/.

Authors:Xiaole Tang, Xiaoyi He, Xiang Gu, Jian Sun
Title: BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration
Abstract:
Despite remarkable advances made in all-in-one image restoration (AIR) for handling different types of degradations simultaneously, existing methods remain vulnerable to out-of-distribution degradations and images, limiting their real-world applicability. In this paper, we propose a multi-source representation learning framework BaryIR, which decomposes the latent space of multi-source degraded images into a continuous barycenter space for unified feature encoding and source-specific subspaces for specific semantic encoding. Specifically, we seek the multi-source unified representation by introducing a multi-source latent optimal transport barycenter problem, in which a continuous barycenter map is learned to transport the latent representations to the barycenter space. The transport cost is designed such that the representations from source-specific subspaces are contrasted with each other while maintaining orthogonality to those from the barycenter space. This enables BaryIR to learn compact representations with unified degradation-agnostic information from the barycenter space, as well as degradation-specific semantics from source-specific subspaces, capturing the inherent geometry of multi-source data manifold for generalizable AIR. Extensive experiments demonstrate that BaryIR achieves competitive performance compared to state-of-the-art all-in-one methods. Particularly, BaryIR exhibits superior generalization ability to real-world data and unseen degradations. The code will be publicly available at https://github.com/xl-tang3/BaryIR.
中文: BaryIR提出了一种多源表示学习框架,通过将图像退化分解为统一的重心空间和特定源子空间,显著提升了全功能图像恢复方法对现实场景和未知退化的泛化能力。
English: BaryIR introduces a multi-source representation learning framework that decomposes image degradations into a unified barycenter space and source-specific subspaces, enhancing generalization for all-in-one image restoration across diverse real-world scenarios.

Authors:Chengyu Yang, Chengjun Liu
Title: Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter
Abstract:
Laparoscopic surgeries often suffer from reduced visual clarity due to the presence of surgical smoke generated by surgical instruments, which poses significant challenges for both surgeons and vision-based computer-assisted technologies. In order to remove the surgical smoke, a novel method is presented that combines a U-Net deep learning model with a new loss function and an integrated differentiable Wiener filter (ULW). Specifically, the new loss function integrates pixel, structural, and perceptual properties: by combining the structural similarity index measure loss, the perceptual loss, and the mean squared error loss, it is able to enhance the quality and realism of the reconstructed images. Furthermore, the learnable Wiener filter is capable of effectively modelling the degradation process caused by the surgical smoke. The effectiveness of the proposed ULW method is evaluated using the publicly available paired laparoscopic smoke and smoke-free image dataset, which provides reliable benchmarking and quantitative comparisons. Experimental results show that the proposed ULW method excels in both visual clarity and metric-based evaluation. As a result, the proposed ULW method offers a promising solution for real-time enhancement of laparoscopic imagery. The code is available at https://github.com/chengyuyang-njit/ImageDesmoke.
中文摘要:ULW方法结合新型U-Net网络、综合损失函数和可微维纳滤波器,能有效消除腹腔镜图像中的手术烟雾,显著提升图像清晰度,为实时手术应用提供可靠解决方案。
English Summary: The ULW method, combining a novel U-Net with an integrated loss function and differentiable Wiener filter, effectively removes surgical smoke from laparoscopic images to enhance visual clarity and support real-time surgical applications.
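The loss combines SSIM, perceptual, and MSE terms. Below is a minimal sketch of such a weighted combination, assuming images in [0, 1] with 3 channels; the global-statistics SSIM (no sliding window), the VGG16 feature layer used for the perceptual term, the omission of ImageNet normalization, and the weights are all our simplifications rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG16 feature extractor for the perceptual term (layer choice is an assumption).
_vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def simple_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global-statistics SSIM per image (a simplification of windowed SSIM)."""
    mx, my = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    vx, vy = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
    cov = ((x - mx.view(-1, 1, 1, 1)) * (y - my.view(-1, 1, 1, 1))).mean(dim=(1, 2, 3))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def combined_loss(pred, target, w_mse=1.0, w_ssim=0.5, w_perc=0.1):
    """MSE + (1 - SSIM) + VGG perceptual loss; the weights are illustrative."""
    mse = F.mse_loss(pred, target)
    ssim = 1.0 - simple_ssim(pred, target).mean()
    perc = F.mse_loss(_vgg(pred), _vgg(target))
    return w_mse * mse + w_ssim * ssim + w_perc * perc
```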

Authors:Zhengyuan Jiang, Moyang Guo, Kecen Li, Yuepeng Hu, Yupu Wang, Zhicong Huang, Cheng Hong, Neil Zhenqiang Gong
Title: VideoMarkBench: Benchmarking Robustness of Video Watermarking
Abstract:
The rapid development of video generative models has led to a surge in highly realistic synthetic videos, raising ethical concerns related to disinformation and copyright infringement. Recently, video watermarking has been proposed as a mitigation strategy by embedding invisible marks into AI-generated videos to enable subsequent detection. However, the robustness of existing video watermarking methods against both common and adversarial perturbations remains underexplored. In this work, we introduce VideoMarkBench, the first systematic benchmark designed to evaluate the robustness of video watermarks under watermark removal and watermark forgery attacks. Our study encompasses a unified dataset generated by three state-of-the-art video generative models, across three video styles, incorporating four watermarking methods and seven aggregation strategies used during detection. We comprehensively evaluate 12 types of perturbations under white-box, black-box, and no-box threat models. Our findings reveal significant vulnerabilities in current watermarking approaches and highlight the urgent need for more robust solutions. Our code is available at https://github.com/zhengyuan-jiang/VideoMarkBench.
Chinese: 本文提出了首个系统性评估视频水印在去除和伪造攻击下鲁棒性的基准VideoMarkBench,揭示了现有方法的显著脆弱性,并强调了开发更强健解决方案的迫切性。
English: This paper introduces VideoMarkBench, the first benchmark to systematically evaluate the vulnerability of video watermarks against removal and forgery attacks, revealing significant weaknesses in current methods and underscoring the need for more robust solutions.

Authors:Yang Yang, Siming Zheng, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang
Title: Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion
Abstract:
Recent advances in diffusion based editing models have enabled realistic camera simulation and image-based bokeh, but video bokeh remains largely unexplored. Existing video editing models cannot explicitly control focus planes or adjust bokeh intensity, limiting their applicability for controllable optical effects. Moreover, naively extending image-based bokeh methods to video often results in temporal flickering and unsatisfactory edge blur transitions due to the lack of temporal modeling and generalization capability. To address these challenges, we propose a novel one-step video bokeh framework that converts arbitrary input videos into temporally coherent, depth-aware bokeh effects. Our method leverages a multi-plane image (MPI) representation constructed through a progressively widening depth sampling function, providing explicit geometric guidance for depth-dependent blur synthesis. By conditioning a single-step video diffusion model on MPI layers and utilizing the strong 3D priors from pre-trained models such as Stable Video Diffusion, our approach achieves realistic and consistent bokeh effects across diverse scenes. Additionally, we introduce a progressive training strategy to enhance temporal consistency, depth robustness, and detail preservation. Extensive experiments demonstrate that our method produces high-quality, controllable bokeh effects and achieves state-of-the-art performance on multiple evaluation benchmarks.
中文摘要:本文提出了一种单步扩散框架,通过采用多平面图像表示和渐进式训练策略,实现了时间连贯且深度感知的视频背景虚化渲染,在合成与真实场景基准测试中均优于现有方法。
English Summary: This paper introduces a one-step diffusion framework that achieves temporally coherent and depth-aware video bokeh rendering by leveraging multi-plane image representations and progressive training, outperforming existing methods in both synthetic and real-world benchmarks.

Authors:Carina Newen, Luca Hinkamp, Maria Ntonti, Emmanuel Müller
Title: Do you see what I see? An Ambiguous Optical Illusion Dataset exposing limitations of Explainable AI
Abstract:
Machine learning algorithms are increasingly important across tasks ranging from uncertainty quantification to real-world object detection, particularly in safety-critical domains such as autonomous driving or medical diagnostics, and ambiguous data plays an important role in many of these domains. Optical illusions present a compelling area of study in this context, as they offer insight into the limitations of both human and machine perception. Despite this relevance, optical illusion datasets remain scarce. In this work, we introduce a novel dataset of optical illusions featuring intermingled animal pairs designed to evoke perceptual ambiguity. We identify generalizable visual concepts, particularly gaze direction and eye cues, as subtle yet impactful features that significantly influence model accuracy. By confronting models with perceptual ambiguity, our findings underscore the importance of concepts in visual learning and provide a foundation for studying bias and alignment between human and machine vision. To make this dataset useful for general purposes, we generate optical illusions systematically with different concepts discussed in our bias mitigation section. The dataset is accessible on Kaggle via https://kaggle.com/datasets/693bf7c6dd2cb45c8a863f9177350c8f9849a9508e9d50526e2ffcc5559a8333. Our source code can be found at https://github.com/KDD-OpenSource/Ambivision.git.
中文: 本研究引入了一个新颖的视觉错觉数据集,旨在探究机器学习中的感知模糊性,揭示了如注视方向等细微视觉线索对模型准确性和偏差缓解的重要影响。
English: This study introduces a novel dataset of optical illusions to explore perceptual ambiguity in machine learning, highlighting how subtle visual cues like gaze direction impact model accuracy and bias mitigation.

Authors:Weixing Wang, Zifeng Ding, Jindong Gu, Rui Cao, Christoph Meinel, Gerard de Melo, Haojin Yang
Title: Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing
Abstract:
Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at https://github.com/weixingW/CGC-VTD/tree/main
中文: 大型视觉语言模型因图像标记共现关联产生幻觉,本文通过抑制生成过程中视觉缺失标记的影响提出缓解方法,在保持表达力的同时有效减少幻觉现象。
English: Large Vision-Language Models often hallucinate objects due to strong associations between co-occurring image tokens, prompting the development of a mitigation method that suppresses visually absent tokens during generation to reduce errors while maintaining expressivity.

Authors:Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang
Title: Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation
Abstract:
Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency. BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: https://github.com/chikap421/catlvdm
中文: CAT-LVDM提出首个针对隐视频扩散模型的抗干扰训练框架,通过结构化的数据对齐噪声注入增强模型鲁棒性,在多个数据集上有效提升时间一致性并减少语义偏移。
English: CAT-LVDM introduces a corruption-aware training framework that enhances the robustness of Latent Video Diffusion Models against imperfect conditioning through structured noise injection, significantly improving temporal consistency and reducing semantic drift across multiple datasets.
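
A minimal sketch of how an intra-batch, data-aligned perturbation such as BCNI might look; treating the offset from the batch centroid as the semantic direction and the `scale` knob are assumptions made for illustration, not the paper's exact formulation.

```python
import torch

def batch_centered_noise(embeddings: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    """Perturb each embedding along an intra-batch semantic direction.

    Illustrative reading of BCNI: the perturbation direction for sample i is
    its offset from the batch mean, so the injected noise stays in the span of
    the batch's own semantic variation. embeddings: (B, D) conditioning vectors.
    """
    center = embeddings.mean(dim=0, keepdim=True)                        # (1, D) batch centroid
    direction = embeddings - center                                      # (B, D) intra-batch directions
    eps = torch.rand(embeddings.size(0), 1, device=embeddings.device)    # per-sample magnitude
    return embeddings + scale * eps * direction

# Usage: cond = batch_centered_noise(cond) before feeding the conditioning
# embeddings to the denoiser during corruption-aware training.
```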

Authors:Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, Yiren Song
Title: DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
Abstract:
Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance. Our code will be available at: https://github.com/Wangzt1121/DiffDecompose.
中文: 本文提出DiffDecompose框架,基于扩散Transformer和新型AlphaBlend数据集,通过上下文分解方法解决半透明/透明图层分解中的层次模糊和数据稀缺问题。
English: This paper introduces DiffDecompose, a diffusion Transformer framework that addresses layer decomposition in alpha-composited images using a novel AlphaBlend dataset and in-context learning to overcome challenges like layer ambiguity and data scarcity.

Authors:Thalles Silva, Helio Pedrini, Adín Ramírez Rivera
Title: Self-Organizing Visual Prototypes for Non-Parametric Representation Learning
Abstract:
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.
中文: 本文提出自组织视觉原型(SOP)方法,通过多个互补的支持嵌入替代单一原型来更好表征数据簇,在多项基准测试中实现了最先进的性能。
English: This paper introduces Self-Organizing Visual Prototypes (SOP), an unsupervised visual feature learning method that uses multiple complementary support embeddings instead of single prototypes to better characterize data clusters and achieve state-of-the-art performance on various benchmarks.

Authors:Mokai Pan, Kaizhen Zhu, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Title: UniDB++: Fast Sampling of Unified Diffusion Bridge
Abstract:
Diffusion Bridges enable transitions between arbitrary distributions, with the Unified Diffusion Bridge (UniDB) framework achieving high-fidelity image generation via a Stochastic Optimal Control (SOC) formulation. However, UniDB's reliance on iterative Euler sampling methods results in slow, computationally expensive inference, while existing acceleration techniques for diffusion or diffusion bridge models fail to address its unique challenges: missing terminal mean constraints and SOC-specific penalty coefficients in its SDEs. We present UniDB++, a training-free sampling algorithm that significantly improves upon these limitations. The method's key advancement comes from deriving exact closed-form solutions for UniDB's reverse-time SDEs, effectively reducing the error accumulation inherent in Euler approximations and enabling high-quality generation with up to 20$\times$ fewer sampling steps. This method is further complemented by replacing conventional noise prediction with a more stable data prediction model, along with an SDE-Corrector mechanism that maintains perceptual quality for low-step regimes (5-10 steps). Additionally, we demonstrate that UniDB++ aligns with existing diffusion bridge acceleration methods by evaluating their update rules, and UniDB++ can recover DBIMs as special cases under some theoretical conditions. Experiments demonstrate UniDB++'s state-of-the-art performance in image restoration tasks, outperforming Euler-based methods in fidelity and speed while reducing inference time significantly. This work bridges the gap between theoretical generality and practical efficiency in SOC-driven diffusion bridge models. Our code is available at https://github.com/2769433owo/UniDB-plusplus.
中文摘要:UniDB++ 作为一种免训练采样算法,通过推导 UniDB 随机微分方程的精确闭式解,解决了原有模型推理缓慢的问题,在保持感知质量的同时实现了20倍加速的高质量图像生成,并在图像修复任务中达到领先性能。
English Summary: UniDB++ is a training-free sampling algorithm that overcomes UniDB's slow inference by deriving exact closed-form solutions for its SDEs, enabling 20× faster high-quality image generation while maintaining perceptual quality through data prediction and SDE-corrector mechanisms.

Authors:Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham
Title: Learning Shared Representations from Unpaired Data
Abstract:
Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetic, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code is publicly available at https://github.com/shaham-lab/SUE.
中文: 本研究证明,通过单模态表示的光谱嵌入,可以主要从未配对数据中有效学习共享的多模态表示,在各种跨模态任务中表现出色,并建立了一种可能通用的嵌入方法。
English: This research demonstrates that shared multimodal representations can be effectively learned primarily from unpaired data using spectral embeddings from unimodal representations, achieving strong performance across various cross-modal tasks and establishing a potentially universal embedding method.
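
The per-modality spectral construction mentioned in the abstract can be sketched as follows; the Gaussian kernel, its bandwidth `sigma`, and the number of components are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eig

def random_walk_spectral_embedding(X: np.ndarray, n_components: int = 8, sigma: float = 1.0):
    """Spectral embedding of the random-walk matrix of one modality.

    Minimal sketch: Gaussian affinity -> row-stochastic random-walk matrix ->
    leading non-trivial eigenvectors. X: (N, D) unimodal features.
    """
    d2 = cdist(X, X, "sqeuclidean")
    W = np.exp(-d2 / (2.0 * sigma ** 2))              # affinity matrix
    P = W / W.sum(axis=1, keepdims=True)              # random-walk (row-stochastic) matrix
    vals, vecs = eig(P)
    order = np.argsort(-vals.real)                    # sort by decreasing eigenvalue
    # Drop the trivial constant eigenvector (eigenvalue 1).
    return vecs[:, order[1:n_components + 1]].real
```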

Authors:Yipengjing Sun, Chenyang Wang, Shunyuan Zheng, Zonglin Li, Shengping Zhang, Xiangyang Ji
Title: Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis
Abstract:
We propose GRGS, a generalizable and relightable 3D Gaussian framework for high-fidelity human novel view synthesis under diverse lighting conditions. Unlike existing methods that rely on per-character optimization or ignore physical constraints, GRGS adopts a feed-forward, fully supervised strategy that projects geometry, material, and illumination cues from multi-view 2D observations into 3D Gaussian representations. Specifically, to reconstruct lighting-invariant geometry, we introduce a Lighting-aware Geometry Refinement (LGR) module trained on synthetically relit data to predict accurate depth and surface normals. Based on the high-quality geometry, a Physically Grounded Neural Rendering (PGNR) module is further proposed to integrate neural prediction with physics-based shading, supporting editable relighting with shadows and indirect illumination. Besides, we design a 2D-to-3D projection training scheme that leverages differentiable supervision from ambient occlusion, direct, and indirect lighting maps, which alleviates the computational cost of explicit ray tracing. Extensive experiments demonstrate that GRGS achieves superior visual quality, geometric consistency, and generalization across characters and lighting conditions.

Authors:Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Title: ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Abstract:
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
中文摘要:视觉语言模型在全中心空间推理方面存在不足,但通过多视角数据集微调后性能显著提升,ViewSpatial-Bench基准测试验证了这一突破。
English Summary: Vision-language models struggle with allocentric spatial reasoning but show significant improvement when fine-tuned on multi-perspective datasets, as demonstrated by the new ViewSpatial-Bench benchmark.

Authors:Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Abstract:
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
中文总结:本研究提出了首个学术海报生成基准和PosterAgent流程,在视觉文本连贯性和成本效益上超越现有方法,同时揭示了自动化设计中的关键瓶颈。
English Summary: This research introduces the first benchmark and PosterAgent pipeline for academic poster generation, which outperforms existing methods in visual-textual coherence and cost-efficiency while identifying key bottlenecks in automated design.

Authors:Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
Title: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
Abstract:
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research at https://github.com/Euphoria16/UI-Genie.
中文: 本文提出UI-Genie自改进框架,通过专用奖励模型和自动化数据生成解决GUI智能体的两大难题,在无需人工标注的情况下实现了最优性能。
English: This paper presents UI-Genie, a self-improving framework that tackles GUI agent challenges through a specialized reward model and automated data generation, achieving state-of-the-art results without manual annotation.

Authors:Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, Yang Liu
Title: Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Abstract:
Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. While existing methods typically achieve targeted attacks by aligning global features-such as CLIP's [CLS] token-between adversarial and target samples, they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly for closed-source models. To address this limitation, we propose a targeted transferable adversarial attack method based on feature optimal alignment, called FOA-Attack, to improve adversarial transfer capability. Specifically, at the global level, we introduce a global feature loss based on cosine similarity to align the coarse-grained features of adversarial samples with those of target samples. At the local level, given the rich local representations within Transformers, we leverage clustering techniques to extract compact local patterns to alleviate redundant local features. We then formulate local feature alignment between adversarial and target samples as an optimal transport (OT) problem and propose a local clustering optimal transport loss to refine fine-grained feature alignment. Additionally, we propose a dynamic ensemble model weighting strategy to adaptively balance the influence of multiple models during adversarial example generation, thereby further improving transferability. Extensive experiments across various models demonstrate the superiority of the proposed method, outperforming state-of-the-art methods, especially in transferring to closed-source MLLMs. The code is released at https://github.com/jiaxiaojunQAQ/FOA-Attack.
Chinese: 提出的FOA-Attack方法通过结合全局余弦相似度对齐与基于聚类的局部最优传输,增强了多模态大语言模型的可迁移对抗攻击能力,并利用动态集成加权策略显著提升了对闭源模型的攻击效果。
English: The proposed FOA-Attack method enhances transferable adversarial attacks on multimodal large language models by combining global cosine similarity alignment with local clustering-based optimal transport, significantly improving performance against closed-source models through dynamic ensemble weighting.
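
Only the global, coarse-grained part of FOA-Attack's objective is simple enough to sketch from the abstract; the loss below aligns global features by cosine similarity, while the local clustering optimal-transport term and the dynamic ensemble weighting are omitted.

```python
import torch
import torch.nn.functional as F

def global_alignment_loss(adv_feat: torch.Tensor, tgt_feat: torch.Tensor) -> torch.Tensor:
    """Global feature loss for a targeted transfer attack: push the adversarial
    image's global feature (e.g., a CLIP [CLS] embedding) toward the target's.

    Minimal sketch of the global term only; adv_feat, tgt_feat: (B, D).
    """
    return 1.0 - F.cosine_similarity(adv_feat, tgt_feat, dim=-1).mean()

# Usage: the adversarial image is updated by gradient descent on this loss
# (plus the local OT term), with perturbations projected into an L_inf ball.
```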

Authors:Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Title: Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
Abstract:
Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.
中文: 本文提出一种方法,通过在去噪过程中预测并优化与初始噪声对齐的空间布局,有效防止文本到图像扩散模型中的主体混淆,提升多主体生成的准确性和稳定性,同时保持模型原有的多样性。
English: This paper introduces a method that predicts and refines a spatial layout aligned with the initial noise during denoising to prevent subject leakage and enhance multi-subject generation in text-to-image diffusion models, improving alignment and stability while preserving diversity.

Authors:Bozhou Li, Wentao Zhang
Title: ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Abstract:
Currently, a prevalent approach for enhancing Vision-Language Models (VLMs) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail token while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
Chinese: 针对视觉语言模型中同时编码高分辨率和缩略图图像导致的效率低下及交互限制问题,ID-Align方法通过重新排列位置标识来增强标记间交互,在多个基准测试中实现了显著性能提升。
English: To address the inefficiency and interaction limitations caused by encoding both high-resolution and thumbnail images in Vision-Language Models, the proposed ID-Align method reorders position IDs to enhance token interaction and achieves significant performance improvements across multiple benchmarks.
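
A minimal sketch of the ID remapping idea: every high-resolution token reuses the RoPE position ID of the thumbnail token covering the same spatial region, so positional indices do not over-expand. The exact ID layout around text tokens is an assumption made for illustration.

```python
def id_align_position_ids(thumb_hw, highres_hw, text_len):
    """Assign position IDs so each high-res image token inherits the ID of the
    thumbnail token covering the same region.

    Returns thumbnail IDs, high-res IDs, and the first ID available for
    subsequent text tokens. thumb_hw / highres_hw are (rows, cols) token grids.
    """
    th, tw = thumb_hw
    hh, hw = highres_hw
    thumb_ids = [text_len + r * tw + c for r in range(th) for c in range(tw)]
    highres_ids = []
    for r in range(hh):
        for c in range(hw):
            tr = r * th // hh            # thumbnail row covering this high-res row
            tc = c * tw // hw            # thumbnail column covering this high-res column
            highres_ids.append(text_len + tr * tw + tc)   # inherit the thumbnail ID
    next_text_id = text_len + th * tw    # IDs do not grow with the number of high-res tokens
    return thumb_ids, highres_ids, next_text_id

# Example: a 24x24 thumbnail grid and a 48x48 high-res grid share the same 576
# IDs, so the positional span seen by RoPE stays as small as the thumbnail alone.
```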

Authors:Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen
Title: Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
Abstract:
Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
中文: 本文提出ACTIVE-O3强化学习框架,旨在赋予多模态大语言模型主动感知能力以提升搜索效率和区域选择精度,并通过综合基准测试验证其性能。
English: This paper introduces ACTIVE-O3, a reinforcement learning framework that equips Multimodal Large Language Models with active perception capabilities to improve search efficiency and accuracy, supported by a comprehensive benchmark for evaluation.

Authors:Huaijin Pi, Zhi Cen, Zhiyang Dou, Taku Komura
Title: CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
Abstract:
Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
中文摘要:本文提出一种协调扩散噪声优化框架,通过梯度流协调身体与手部运动,并采用基于基点的统一表示实现精确的手物交互,从而合成逼真的全身操控动作。
English Summary: This paper introduces a coordinated diffusion noise optimization framework that synthesizes realistic whole-body manipulation by coordinating body and hand movements through gradient flow and a unified basis point set representation for precise hand-object interactions.
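
The basis point set (BPS) encoding used to unify hand and object geometry can be sketched as distances to one fixed, shared point set; the basis size and sampling scheme below are illustrative assumptions.

```python
import numpy as np

def bps_encode(points: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Encode 3D points as distances to a shared basis point set (BPS).

    For each basis point, take the distance to its nearest input point. Using
    one fixed basis for both the object surface and the hand end-effectors
    places them in a common representation. points: (N, 3); basis: (K, 3).
    Returns a (K,) distance feature vector.
    """
    d = np.linalg.norm(basis[:, None, :] - points[None, :, :], axis=-1)  # (K, N)
    return d.min(axis=1)

# Example: one random-but-fixed basis shared across frames and entities.
rng = np.random.default_rng(0)
basis = rng.normal(size=(1024, 3))
basis /= np.maximum(np.linalg.norm(basis, axis=1, keepdims=True), 1e-8)  # on the unit sphere
object_feat = bps_encode(rng.normal(size=(2048, 3)), basis)   # object surface samples
hand_feat = bps_encode(rng.normal(size=(2, 3)), basis)        # e.g., two fingertip positions
```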

Authors:Xihong Yang, Siwei Wang, Fangdi Wang, Jiaqi Jin, Suyuan Liu, Yue Liu, En Zhu, Xinwang Liu, Yueming Jin
Title: Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios
Abstract:
Leveraging the powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance by effectively integrating multi-source information from diverse views in recent years. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noisy-data identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios. The code of AIRMVC is available at https://github.com/xihongyang1999/AIRMVC on GitHub.
中文: AIRMVC框架通过高斯混合模型识别异常并采用混合校正策略,有效处理多视图聚类中的噪声问题,在嘈杂环境下展现出优于现有方法的鲁棒性。
English: The AIRMVC framework addresses noise in multi-view clustering by identifying anomalies with GMM and applying a hybrid rectification strategy, enhancing robustness and outperforming existing methods in noisy environments.
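
A minimal sketch of reformulating noisy-sample identification as GMM-based anomaly identification, assuming a scalar per-sample score (for example, a cross-view disagreement error); AIRMVC's actual inputs and hybrid rectification strategy are richer than this.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def identify_noisy_samples(scores: np.ndarray):
    """Split samples into clean / noisy groups with a 2-component GMM.

    The component with the larger mean score is treated as the noisy one.
    scores: (N,) per-sample anomaly scores.
    Returns a boolean noisy mask and the per-sample noisy probability.
    """
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
    noisy_comp = int(np.argmax(gmm.means_.ravel()))
    posteriors = gmm.predict_proba(scores.reshape(-1, 1))[:, noisy_comp]
    return posteriors > 0.5, posteriors
```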

Authors:Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, Ying Shan
Title: Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Abstract:
Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes, designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, which spans seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We aim that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. The benchmark is released in https://github.com/TencentARC/Video-Holmes.
中文: 近期在思维链推理和强化学习后训练方面的进展提升了多模态大模型的视频推理能力,但现有基准无法充分评估复杂的人类式推理,因此我们开发了Video-Holmes基准,发现模型在整合分散线索方面存在困难,最高准确率仅达45%。
English: Recent progress in CoT reasoning and RL post-training has improved MLLMs' video reasoning, but existing benchmarks fall short in evaluating complex human-like reasoning, leading to the creation of Video-Holmes, a benchmark that reveals models struggle with integrating scattered clues, achieving only up to 45% accuracy.

Authors:Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi
Title: AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping
Abstract:
Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM's superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Codes will be available at https://github.com/flyakon/AgriFM.
中文: 该摘要介绍了AgriFM,这是一个专为农作物制图设计的模型,通过改进的Video Swin Transformer架构整合多源卫星数据的多尺度时空特征,在各项任务中均优于现有先进方法。
English: The abstract introduces AgriFM, a specialized foundation model that enhances crop mapping by integrating multi-scale spatiotemporal features from satellite data using a modified Video Swin Transformer, achieving superior performance over existing methods.

Authors:Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
Title: HoliTom: Holistic Token Merging for Fast Video Large Language Models
Abstract:
Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not leverage video compressibility fully. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLMs inference.
中文: HoliTom是一种无需训练的整体令牌合并框架,通过外层LLM剪枝与内层令牌相似度合并相结合,将计算成本降低90%以上同时保持99.1%的原始性能,显著加速了视频大语言模型的推理效率。
English: HoliTom is a training-free holistic token merging framework that combines outer-LLM pruning with inner-LLM token similarity-based merging to reduce computational costs by over 90% while maintaining 99.1% of original performance, significantly accelerating video LLM inference.
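
A minimal sketch of the inner-LLM, similarity-based merging component: each dropped visual token is absorbed into its most similar kept token. The keep ratio and the greedy matching rule are illustrative simplifications; HoliTom's outer-LLM temporal segmentation and spatial-temporal merging are not shown.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Reduce visual tokens by merging redundant ones on cosine similarity.

    tokens: (N, D) visual tokens; returns roughly keep_ratio * N merged tokens.
    """
    n = tokens.size(0)
    n_keep = max(1, int(n * keep_ratio))
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.t()
    sim.fill_diagonal_(-1.0)
    best_sim, best_idx = sim.max(dim=1)                    # each token's closest partner
    drop = best_sim.argsort(descending=True)[: n - n_keep]  # most redundant tokens first
    keep_mask = torch.ones(n, dtype=torch.bool)
    keep_mask[drop] = False
    merged = tokens.clone()
    counts = torch.ones(n, 1, device=tokens.device)
    for i in drop.tolist():
        j = best_idx[i].item()
        if not keep_mask[j]:
            continue                        # partner was dropped too; simply discard (simplification)
        merged[j] = merged[j] + tokens[i]   # accumulate, then average below
        counts[j] += 1
    merged = merged / counts
    return merged[keep_mask]
```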

Authors:Biao Zhang, Peter Wonka
Title: efunc: An Efficient Function Representation without Neural Networks
Abstract:
Function fitting/approximation plays a fundamental role in computer graphics and other engineering applications. While recent advances have explored neural networks to address this task, these methods often rely on architectures with many parameters, limiting their practical applicability. In contrast, we pursue high-quality function approximation using parameter-efficient representations that eliminate the dependency on neural networks entirely. We first propose a novel framework for continuous function modeling. Most existing works can be formulated using this framework. We then introduce a compact function representation, which is based on polynomials interpolated using radial basis functions, bypassing both neural networks and complex/hierarchical data structures. We also develop memory-efficient CUDA-optimized algorithms that reduce computational time and memory consumption to less than 10% compared to conventional automatic differentiation frameworks. Finally, we validate our representation and optimization pipeline through extensive experiments on 3D signed distance functions (SDFs). The proposed representation achieves comparable or superior performance to state-of-the-art techniques (e.g., octree/hash-grid techniques) with significantly fewer parameters.
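
A minimal sketch of the representation described above, assuming first-order local polynomials blended by normalized Gaussian RBF weights; efunc's exact basis, polynomial order, and CUDA-optimized evaluation are not reproduced here.

```python
import numpy as np

class RBFBlendedPolynomials:
    """Neural-network-free function representation: each control point carries a
    local first-order polynomial, and queries blend the local predictions with
    normalized Gaussian RBF weights."""
    def __init__(self, centers: np.ndarray, sigma: float = 0.1):
        self.centers = centers                       # (K, 3) control point positions
        self.sigma = sigma
        k = centers.shape[0]
        self.values = np.zeros(k)                    # polynomial constant terms
        self.grads = np.zeros((k, 3))                # polynomial linear terms

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (N, 3) query points -> (N,) function values (e.g., an SDF).
        diff = x[:, None, :] - self.centers[None, :, :]                   # (N, K, 3)
        d2 = (diff ** 2).sum(-1)                                          # (N, K)
        w = np.exp(-d2 / (2 * self.sigma ** 2))
        w /= np.maximum(w.sum(-1, keepdims=True), 1e-12)                  # normalized RBF weights
        local = self.values[None, :] + (diff * self.grads[None, :, :]).sum(-1)  # local polynomials
        return (w * local).sum(-1)

# The parameters (values, grads, and optionally centers/sigma) would be fit to
# SDF samples by gradient descent, much as one would fit a small network.
```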

Authors:M. Akin Yilmaz, Ahmet Bilican, A. Murat Tekalp
Title: DiMoSR: Feature Modulation via Multi-Branch Dilated Convolutions for Efficient Image Super-Resolution
Abstract:
Balancing reconstruction quality versus model efficiency remains a critical challenge in lightweight single image super-resolution (SISR). Despite the prevalence of attention mechanisms in recent state-of-the-art SISR approaches that primarily emphasize or suppress feature maps, alternative architectural paradigms warrant further exploration. This paper introduces DiMoSR (Dilated Modulation Super-Resolution), a novel architecture that enhances feature representation through modulation to complement attention in lightweight SISR networks. The proposed approach leverages multi-branch dilated convolutions to capture rich contextual information over a wider receptive field while maintaining computational efficiency. Experimental results demonstrate that DiMoSR outperforms state-of-the-art lightweight methods across diverse benchmark datasets, achieving superior PSNR and SSIM metrics with comparable or reduced computational complexity. Through comprehensive ablation studies, this work not only validates the effectiveness of DiMoSR but also provides critical insights into the interplay between attention mechanisms and feature modulation to guide future research in efficient network design. The code and model weights to reproduce our results are available at: https://github.com/makinyilmaz/DiMoSR
中文: DiMoSR提出了一种新颖的轻量级超分辨率架构,利用空洞卷积和特征调制技术,在保持高效计算的同时实现了领先的性能表现。
English: DiMoSR introduces a novel lightweight super-resolution architecture using dilated convolutions and feature modulation to achieve state-of-the-art performance with improved efficiency.
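
A minimal PyTorch sketch of feature modulation via multi-branch dilated convolutions in the spirit of DiMoSR; the branch count, dilation rates, depthwise convolutions, and fusion layer are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DilatedModulationBlock(nn.Module):
    """Parallel dilated branches gather multi-scale context; their fused output
    modulates (element-wise scales) the input features."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, groups=channels)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        self.act = nn.GELU()

    def forward(self, x):
        ctx = torch.cat([b(x) for b in self.branches], dim=1)   # multi-scale context
        mod = torch.sigmoid(self.fuse(self.act(ctx)))            # modulation map in (0, 1)
        return x * mod + x                                        # modulated residual output

# Usage: stack several blocks between a shallow feature extractor and a
# pixel-shuffle upsampler to form a lightweight SISR network.
```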

Authors:Changguanng Wu, Jiangxin Dong, Chengjian Li, Jinhui Tang
Title: Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation
Abstract:
We present Plenodium (plenoptic medium), an effective and efficient 3D representation framework capable of jointly modeling both objects and participating media. In contrast to existing medium representations that rely solely on view-dependent modeling, our novel plenoptic medium representation incorporates both directional and positional information through spherical harmonics encoding, enabling highly accurate underwater scene reconstruction. To address the initialization challenge in degraded underwater environments, we propose the pseudo-depth Gaussian complementation to augment COLMAP-derived point clouds with robust depth priors. In addition, a depth ranking regularized loss is developed to optimize the geometry of the scene and improve the ordinal consistency of the depth maps. Extensive experiments on real-world underwater datasets demonstrate that our method achieves significant improvements in 3D reconstruction. Furthermore, we conduct a simulated dataset with ground truth and the controllable scattering medium to demonstrate the restoration capability of our method in underwater scenarios. Our code and dataset are available at https://plenodium.github.io/.

Authors:Jieyu Yuan, Yujun Li, Yuanlin Zhang, Chunle Guo, Xiongxin Tang, Ruixing Wang, Chongyi Li
Title: 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics Based Appearance-Medium Decoupling
Abstract:
Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in the water body introduce inhomogeneous medium attenuation that disrupts the conventional volume rendering assumption of a uniform propagation medium. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles with underwater inhomogeneous environments where scattering media introduce artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a distance-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at https://bilityniu.github.io/3D-UIR.
中文摘要:本研究提出了一种基于物理的框架,通过定制高斯建模和距离引导优化策略,将物体外观与水体介质效应分离,显著提升了水下新视角合成的渲染质量和场景复原精度。
English Summary: This study introduces a physics-based framework that enhances underwater novel view synthesis by separating object appearance from water medium effects using tailored Gaussian modeling and a distance-guided optimization strategy, achieving superior rendering quality and scene restoration.

Authors:Hesam Araghi, Jan van Gemert, Nergis Tomen
Title: Making Every Event Count: Balancing Data Efficiency and Accuracy in Event Camera Subsampling
Abstract:
Event cameras offer high temporal resolution and power efficiency, making them well-suited for edge AI applications. However, their high event rates present challenges for data transmission and processing. Subsampling methods provide a practical solution, but their effect on downstream visual tasks remains underexplored. In this work, we systematically evaluate six hardware-friendly subsampling methods using convolutional neural networks for event video classification on various benchmark datasets. We hypothesize that events from high-density regions carry more task-relevant information and are therefore better suited for subsampling. To test this, we introduce a simple causal density-based subsampling method, demonstrating improved classification accuracy in sparse regimes. Our analysis further highlights key factors affecting subsampling performance, including sensitivity to hyperparameters and failure cases in scenarios with large event count variance. These findings provide insights for utilization of hardware-efficient subsampling strategies that balance data efficiency and task accuracy. The code for this paper will be released at: https://github.com/hesamaraghi/event-camera-subsampling-methods.
Chinese: 本研究系统评估了事件相机的六种硬件友好型子采样方法,提出了一种基于事件密度的子采样策略,在稀疏场景下提升分类精度,同时分析了影响性能的关键因素,以平衡数据效率与任务准确性。
English: This study systematically evaluates six hardware-friendly subsampling methods for event cameras, introducing a density-based approach that improves classification accuracy in sparse regimes while analyzing key factors affecting performance to balance data efficiency and task accuracy.
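
A minimal sketch of causal density-based subsampling as hypothesized in the abstract: events landing in recently active spatial cells are kept preferentially. Cell size, window length, and the keep rule are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def density_subsample(events: np.ndarray, sensor_hw=(180, 240), cell=8,
                      window_us=10_000, budget_ratio=0.1, seed=0):
    """Causally subsample an event stream, favoring high-density regions.

    events: (N, 4) array of (t_us, x, y, polarity), sorted by time.
    Returns the kept subset of events as an array.
    """
    rng = np.random.default_rng(seed)
    gh, gw = sensor_hw[0] // cell + 1, sensor_hw[1] // cell + 1
    count = np.zeros((gh, gw))          # events per cell in the current window
    window_start = 0.0
    kept = []
    for t, x, y, p in events:
        if t - window_start > window_us:                 # causal reset of the density map
            count[:] = 0
            window_start = t
        cy, cx = int(y) // cell, int(x) // cell
        count[cy, cx] += 1
        # Keep probability grows with the cell's recent density, capped at 1.
        keep_p = min(1.0, budget_ratio * count[cy, cx])
        if rng.random() < keep_p:
            kept.append((t, x, y, p))
    return np.asarray(kept)
```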

Authors:Xurui Li, Zhonesheng Jiang, Tingxuan Ai, Yu Zhou
Title: RoBiS: Robust Binary Segmentation for High-Resolution Industrial Images
Abstract:
Robust unsupervised anomaly detection (AD) in real-world scenarios is an important task. Current methods exhibit severe performance degradation on the MVTec AD 2 benchmark due to its complex real-world challenges. To solve this problem, we propose a robust framework RoBiS, which consists of three core modules: (1) Swin-Cropping, a high-resolution image pre-processing strategy to preserve the information of small anomalies through overlapping window cropping. (2) The data augmentation of noise addition and lighting simulation is carried out on the training data to improve the robustness of AD model. We use INP-Former as our baseline, which could generate better results on the various sub-images. (3) The traditional statistical-based binarization strategy (mean+3std) is combined with our previous work, MEBin (published in CVPR2025), for joint adaptive binarization. Then, SAM is further employed to refine the segmentation results. Compared with some methods reported by the MVTec AD 2, our RoBiS achieves a 29.2% SegF1 improvement (from 21.8% to 51.00%) on Test_private and 29.82% SegF1 gains (from 16.7% to 46.52%) on Test_private_mixed. Code is available at https://github.com/xrli-U/RoBiS.
Chinese: 提出的RoBiS框架通过集成Swin-Cropping保留小异常信息、数据增强提升鲁棒性以及自适应二值化结合SAM优化,在MVTec AD 2基准测试中实现了SegF1指标的显著提升。
English: The proposed RoBiS framework enhances unsupervised anomaly detection by integrating Swin-Cropping for small anomaly preservation, data augmentation for robustness, and adaptive binarization with SAM refinement, achieving significant SegF1 improvements on the MVTec AD 2 benchmark.
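
The statistical binarization step named in the abstract is straightforward to sketch; the joint adaptive binarization with MEBin and the SAM-based refinement are not shown.

```python
import numpy as np

def binarize_anomaly_map(amap: np.ndarray) -> np.ndarray:
    """Binarize a per-image anomaly map with the mean + 3*std rule.

    amap: (H, W) float anomaly scores. Returns a uint8 mask where
    1 marks anomalous pixels and 0 marks normal pixels.
    """
    thr = amap.mean() + 3.0 * amap.std()
    return (amap > thr).astype(np.uint8)
```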

Authors:Sergey Karpukhin, Vadim Titov, Andrey Kuznetsov, Aibek Alanov
Title: FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention
Abstract:
In recent years, a plethora of identity-preserving adapters for personalized generation with diffusion models have been released. Their main disadvantage is that they are predominantly trained jointly with base diffusion models, which suffer from slow multi-step inference. This work aims to tackle the challenge of training-free adaptation of pretrained ID-adapters to diffusion models accelerated via distillation: through careful re-design of classifier-free guidance for few-step stylistic generation and attention manipulation mechanisms in decoupled blocks to improve identity similarity and fidelity, we propose the universal FastFace framework. Additionally, we develop a disentangled public evaluation protocol for id-preserving adapters.
中文摘要:本文提出FastFace框架,通过重新设计无分类器引导机制和解耦注意力操作,实现了预训练身份适配器在加速扩散模型中的免训练适配,并开发了新的分离式公共评估方案。
English Summary: The FastFace framework is proposed to enable training-free adaptation of identity-preserving adapters to accelerated diffusion models through redesigned classifier-free guidance and attention mechanisms, while also introducing a new evaluation protocol.

Authors:Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen
Title: SageAttention2++: A More Efficient Implementation of SageAttention2
Abstract:
The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.
中文: SageAttention2++ 通过采用更快的FP8矩阵乘法指令,在保持多种模型精度的同时,将注意力机制速度提升至FlashAttention的3.9倍。
English: SageAttention2++ significantly accelerates attention mechanisms by using faster FP8 matrix multiplication instructions, achieving a 3.9x speedup over FlashAttention while maintaining accuracy across multiple model types.

Authors:Tianhao Peng, Ho Man Kwan, Yuxuan Jiang, Ge Gao, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull
Title: Instance Data Condensation for Image Super-Resolution
Abstract:
Deep learning based image Super-Resolution (ISR) relies on large training datasets to optimize model generalization; this requires substantial computational and storage resources during training. While dataset condensation has shown potential in improving data efficiency and privacy for high-level computer vision tasks, it has not yet been fully exploited for ISR. In this paper, we propose a novel Instance Data Condensation (IDC) framework specifically for ISR, which achieves instance-level data condensation through Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching. This aims to optimize feature distributions at both global and local levels and obtain high-quality synthesized training content with fine detail. This framework has been utilized to condense the most commonly used training dataset for ISR, DIV2K, with a 10% condensation rate. The resulting synthetic dataset offers comparable or (in certain cases) even better performance compared to the original full dataset and excellent training stability when used to train various popular ISR models. To the best of our knowledge, this is the first time that a condensed/synthetic dataset (with a 10% data volume) has demonstrated such performance. The source code and the synthetic dataset have been made available at https://github.com/.
中文: 本文提出了一种新颖的实例数据压缩框架,通过随机局部傅里叶特征提取和多级特征分布匹配,将DIV2K数据集压缩至10%体积,在保持与原数据集相当甚至更优性能的同时显著提升了训练效率和稳定性。
English: This paper introduces a novel Instance Data Condensation framework for image super-resolution that synthesizes a 10% condensed dataset from DIV2K, achieving comparable or superior performance to the full dataset while enhancing training efficiency and stability.
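As a rough illustration of distribution matching for condensation, the sketch below aligns per-channel feature statistics between synthetic and real batches at several network depths. It omits the paper's Random Local Fourier Feature Extraction and uses plain mean/std moment matching; the feature shapes and the two-level setup are assumptions.

```python
# Minimal sketch of multi-level feature distribution matching: align first-
# and second-order statistics of synthetic and real features per level.
import torch
import torch.nn.functional as F

def moment_matching_loss(feats_syn, feats_real):
    """feats_*: lists of feature maps (B, C, H, W), one entry per network level."""
    loss = 0.0
    for fs, fr in zip(feats_syn, feats_real):
        loss = loss + F.mse_loss(fs.mean(dim=(0, 2, 3)), fr.mean(dim=(0, 2, 3)))
        loss = loss + F.mse_loss(fs.std(dim=(0, 2, 3)), fr.std(dim=(0, 2, 3)))
    return loss

if __name__ == "__main__":
    syn = [torch.randn(4, 64, 32, 32, requires_grad=True),
           torch.randn(4, 128, 16, 16, requires_grad=True)]
    real = [torch.randn(16, 64, 32, 32), torch.randn(16, 128, 16, 16)]
    print(moment_matching_loss(syn, real).item())
```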

Authors:Zeqing Wang, Bowen Zheng, Xingyi Yang, Zhenxiong Tan, Yuecong Xu, Xinchao Wang
Title: Minute-Long Videos with Dual Parallelisms
Abstract:
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.
中文: 提出的DualParal策略通过分块去噪、特征缓存和协调噪声初始化,将视频帧和模型层处理分布到多个GPU上,实现了长视频的高效生成,并显著降低了延迟和内存开销。
English: The proposed DualParal strategy distributes video frame and model layer processing across GPUs using block-wise denoising with feature caching and coordinated noise initialization, enabling efficient generation of long videos with significantly reduced latency and memory costs.
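The toy loop below only illustrates the block-wise pipeline schedule: frame blocks enter a pipeline of devices that each hold a subset of layers, and blocks earlier in the sequence sit further along the denoising trajectory. It is a print-only simulation; the device/block counts are placeholders, and real implementations overlap computation with inter-GPU communication and use the feature cache and noise initialization described in the abstract.

```python
# Print-only toy of a block-wise denoising pipeline across devices.
NUM_DEVICES = 4   # pipeline stages, each owning a layer subset
NUM_BLOCKS = 6    # frame blocks flowing through the pipeline

for tick in range(NUM_BLOCKS + NUM_DEVICES - 1):
    active = [(d, tick - d) for d in range(NUM_DEVICES) if 0 <= tick - d < NUM_BLOCKS]
    # Lower block indices are further along denoising (less noisy) than later ones.
    print(f"t={tick}: " + ", ".join(f"GPU{d} -> block {b}" for d, b in active))
```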

Authors:Fatemeh Pesaran Zadeh, Yoojin Oh, Gunhee Kim
Title: LPOI: Listwise Preference Optimization for Vision Language Models
Abstract:
Aligning large VLMs with human preferences is a challenging task, as methods like RLHF and DPO often overfit to textual information or exacerbate hallucinations. Although augmenting negative image samples partially addresses these pitfalls, no prior work has employed listwise preference optimization for VLMs, due to the complexity and cost of constructing listwise image samples. In this work, we propose LPOI, the first object-aware listwise preference optimization developed for reducing hallucinations in VLMs. LPOI identifies and masks a critical object in the image, and then interpolates the masked region between the positive and negative images to form a sequence of incrementally more complete images. The model is trained to rank these images in ascending order of object visibility, effectively reducing hallucinations while retaining visual fidelity. LPOI requires no extra annotations beyond standard pairwise preference data, as it automatically constructs the ranked lists through object masking and interpolation. Comprehensive experiments on MMHalBench, AMBER, and Object HalBench confirm that LPOI outperforms existing preference optimization methods in reducing hallucinations and enhancing VLM performance. We make the code available at https://github.com/fatemehpesaran310/lpoi.
中文: 本文提出LPOI方法,通过对象感知的列表偏好优化技术,对逐步显示物体的图像序列进行排序来减少视觉语言模型的幻觉现象,在无需额外标注的情况下超越了现有方法。
English: This paper introduces LPOI, an object-aware listwise preference optimization method that reduces hallucinations in vision-language models by ranking images with incrementally revealed objects, outperforming existing approaches without requiring additional annotations.
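A small sketch of the listwise idea, under the assumptions that the masked object region can be blended linearly and that the VLM yields a scalar score per image: it builds the incrementally revealed sequence and applies a Plackett-Luce-style listwise loss so that scores ascend with object visibility. The mask construction, VLM scoring, and loss weighting are simplified stand-ins for LPOI's actual pipeline.

```python
# Hedged sketch: reveal a masked object progressively and rank the images.
import torch

def interpolate_masked(img_neg, img_pos, mask, levels=4):
    """Blend the masked region from the negative toward the positive image."""
    alphas = torch.linspace(0.0, 1.0, levels)
    return [img_neg * (1 - a * mask) + img_pos * (a * mask) for a in alphas]

def listwise_rank_loss(scores):
    """Plackett-Luce negative log-likelihood; scores should ascend with visibility."""
    ordered = torch.flip(scores, dims=[0])          # most visible image first
    loss = 0.0
    for i in range(len(ordered)):
        loss = loss - (ordered[i] - torch.logsumexp(ordered[i:], dim=0))
    return loss

if __name__ == "__main__":
    h = w = 32
    mask = torch.zeros(1, h, w); mask[:, 8:24, 8:24] = 1.0
    neg, pos = torch.rand(3, h, w), torch.rand(3, h, w)
    imgs = interpolate_masked(neg, pos, mask)
    scores = torch.randn(len(imgs), requires_grad=True)  # stand-in for VLM scores
    listwise_rank_loss(scores).backward()
    print("grad norm:", scores.grad.norm().item())
```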

Authors:Peng Wang, Xiang Liu, Peidong Liu
Title: Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles
Abstract:
Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.

Authors:Nils Neukirch, Johanna Vielhaben, Nils Strodthoff
Title: FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models
Abstract:
Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.
中文摘要:本文提出了一种条件扩散模型,以概率方式将特征空间映射到输入空间,有效提升了多种分类器中深度神经网络的可解释性,并展示了卓越的重建能力和应用前景。
English Summary: This paper introduces a conditional diffusion model to probabilistically map feature spaces to input spaces, enhancing the interpretability of deep neural networks across various classifiers with strong reconstruction and application potential.

Authors:Cainan Davidson, Deva Ramanan, Neehar Peri
Title: RefAV: Towards Planning-Centric Scenario Mining
Abstract:
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recent CVPR 2025 competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
中文: 本研究提出RefAV方法,利用视觉语言模型从自动驾驶车辆数据中高效挖掘安全关键场景,通过新数据集和实证分析解决了传统方法的局限性。
English: This study introduces RefAV, a novel approach using vision-language models to efficiently mine safety-critical driving scenarios from autonomous vehicle data, addressing the limitations of traditional methods with a new dataset and empirical analysis.

Authors:Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov
Title: DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization
Abstract:
Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at https://github.com/ControlGenAI/DreamBoothDPO.
中文: 本文提出一种基于强化学习的方法,通过自动生成训练数据对来提升个性化文生图的概念保真度和提示对齐效果,无需人工标注即可灵活调节图像质量与文本匹配的平衡。
English: This paper introduces a reinforcement learning-based method that enhances personalized text-to-image generation by automatically creating training pairs to improve concept fidelity and prompt alignment without human annotation.
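For orientation, here is a generic DPO-style objective over better/worse pairs; in the diffusion setting the per-image "score" would typically be derived from denoising losses under the trained and frozen reference models, which is abstracted away here. The beta value and the scalar-score interface are assumptions, not the paper's exact formulation.

```python
# Simplified sketch of a DPO-style objective on synthetic better/worse pairs.
import torch
import torch.nn.functional as F

def dpo_loss(score_w, score_l, ref_score_w, ref_score_l, beta=0.1):
    """score_*: current-model scores for winning/losing images; ref_*: frozen reference."""
    logits = beta * ((score_w - ref_score_w) - (score_l - ref_score_l))
    return -F.logsigmoid(logits).mean()

if __name__ == "__main__":
    sw = torch.randn(8, requires_grad=True)   # winners
    sl = torch.randn(8, requires_grad=True)   # losers
    loss = dpo_loss(sw, sl, torch.randn(8), torch.randn(8))
    loss.backward()
    print(loss.item())
```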

Authors:Yaohua Zha, Yanzi Wang, Hang Guo, Jinpeng Wang, Tao Dai, Bin Chen, Zhihao Ouyang, Xue Yuerong, Ke Chen, Shu-Tao Xia
Title: PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter
Abstract:
Applying pre-trained models to assist point cloud understanding has recently become a mainstream paradigm in 3D perception. However, existing application strategies are straightforward, utilizing only the final output of the pre-trained model for various task heads. It neglects the rich complementary information in the intermediate layer, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. Constructing this ordered sequence is non-trivial due to the inherent isotropy of 3D space. Therefore, we further propose a geometry-constrained gate prompt generator (G2PG) shared across different layers, which applies shared geometric constraints to the output gates of the Mamba and dynamically optimizes the spatial order, thus enabling more effective integration of multi-layer information. Extensive experiments conducted on challenging point cloud datasets across various tasks demonstrate that our PMA elevates the capability for point cloud understanding to a new level by fusing diverse complementary intermediate features. Code is available at https://github.com/zyh16143998882/PMA.
中文摘要:提出的点云曼巴适配器(PMA)通过利用曼巴模型融合预训练模型各层特征,并结合几何约束门控提示生成器,有效整合中间层互补信息,将点云理解能力提升至新水平。
English Summary: The Point Mamba Adapter (PMA) is introduced to enhance point cloud understanding by integrating multi-layer features from pre-trained models using a Mamba-based fusion approach and a geometry-constrained gate prompt generator, overcoming limitations of existing methods that only utilize final outputs.
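Below is a rough sketch of fusing pooled features from every layer of a frozen encoder with a sequence model, in the spirit of PMA. A GRU stands in for the Mamba block, and the ordering is a fixed layer order rather than the paper's geometry-constrained gate prompt generator; the dimensions and class count are placeholders.

```python
# Rough sketch: fuse intermediate features from all encoder layers with a
# sequence model (GRU used as a stand-in for Mamba) before a task head.
import torch
import torch.nn as nn

class LayerFusionAdapter(nn.Module):
    def __init__(self, dim, hidden=256, num_classes=40):
        super().__init__()
        self.seq = nn.GRU(dim, hidden, batch_first=True)  # stand-in for Mamba
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, layer_feats):
        """layer_feats: list of (B, dim) pooled features, one per encoder layer."""
        x = torch.stack(layer_feats, dim=1)   # (B, L, dim) ordered feature sequence
        _, h = self.seq(x)                     # fuse complementary per-layer semantics
        return self.head(h[-1])

if __name__ == "__main__":
    feats = [torch.randn(2, 384) for _ in range(12)]   # 12 frozen encoder layers
    print(LayerFusionAdapter(384)(feats).shape)        # -> (2, 40)
```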

Authors:Pingrui Zhang, Yifei Su, Pengyuan Wu, Dong An, Li Zhang, Zhigang Wang, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li
Title: Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Abstract:
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
中文: 提出的自适应文本想象器(ATD)采用双分支大语言模型架构,通过语言形式想象关键环境语义,以更低计算成本和参数量实现了领先的导航性能。
English: The proposed Adaptive Text Dreamer (ATD) uses a dual-branch LLM architecture to imagine future environmental semantics through language, reducing computational costs while achieving state-of-the-art navigation performance with fewer parameters.

Authors:Weichao Pan, Bohan Xu, Xu Wang, Chengze Lv, Shuoyang Wang, Zhenke Duan, Zhen Tian
Title: YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation
Abstract:
Fire detection in dynamic environments faces continuous challenges, including interference from illumination changes, frequent false or missed detections, and the difficulty of achieving both efficiency and accuracy. To address the feature extraction limitations and information loss of existing YOLO-based models, this study proposes You Only Look Once for Fire Detection with Attention-guided Inverted Residual and Dual-pooling Downscale Fusion (YOLO-FireAD), with two core innovations: (1) an Attention-guided Inverted Residual Block (AIR) that integrates hybrid channel-spatial attention with inverted residuals to adaptively enhance fire features and suppress environmental noise; (2) a Dual Pool Downscale Fusion Block (DPDF) that preserves multi-scale fire patterns through learnable fusion of max- and average-pooling outputs, mitigating small-fire detection failures. Extensive evaluation on two public datasets shows the efficiency of our model. The proposed model keeps both the parameter count (1.45M, 51.8% lower than YOLOv8n) and computation (4.6G, 43.2% lower than YOLOv8n) low, and its mAP75 exceeds that of the mainstream real-time object detection models YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLOv12n and other YOLOv8 variants by 1.3-5.5%. For more details, please visit our repository: https://github.com/JEFfersusu/YOLO-FireAD
Chinese: 本研究提出YOLO-FireAD模型,通过注意力引导的倒残差结构和双池化下采样融合技术,在降低参数量的同时显著提升了动态环境中火灾检测的准确率和效率。
English: The study introduces YOLO-FireAD, which enhances fire detection by integrating attention mechanisms and dual-pooling fusion to improve accuracy and efficiency while reducing parameters and computational costs.
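A minimal sketch of the dual-pooling downscale idea, assuming a simple design: downsample with both max and average pooling, then fuse the two branches with a learnable 1x1 convolution. The actual DPDF block in YOLO-FireAD may be structured differently; the channel sizes here are arbitrary.

```python
# Hedged sketch of a dual-pooling downscale fusion block.
import torch
import torch.nn as nn

class DualPoolDownscaleFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        # Learnable fusion of the two pooled branches.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        pooled = torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)
        return self.fuse(pooled)

if __name__ == "__main__":
    block = DualPoolDownscaleFusion(64)
    print(block(torch.randn(1, 64, 80, 80)).shape)  # -> (1, 64, 40, 40)
```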

Authors:Tatsuya Sasayama, Shintaro Ito, Koichi Ito, Takafumi Aoki
Title: Stereo Radargrammetry Using Deep Learning from Airborne SAR Images
Abstract:
In this paper, we propose a stereo radargrammetry method using deep learning from airborne Synthetic Aperture Radar (SAR) images. Deep learning-based methods are considered to suffer less from geometric image modulation, while there is no public SAR image dataset used to train such methods. We create a SAR image dataset and perform fine-tuning of a deep learning-based image correspondence method. The proposed method suppresses the degradation of image quality by pixel interpolation without ground projection of the SAR image and divides the SAR image into patches for processing, which makes it possible to apply deep learning. Through a set of experiments, we demonstrate that the proposed method exhibits a wider range and more accurate elevation measurements compared to conventional methods. The project web page is available at: https://gsisaoki.github.io/IGARSS2025_sasayama/

Authors:Yifeng Ma, Jinwei Qi, Chaonan Ji, Peng Zhang, Bang Zhang, Zhidong Deng, Liefeng Bo
Title: Exploring Timeline Control for Facial Motion Generation
Abstract:
This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, we first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy, and produces natural facial motions accurately aligned with timelines.

Authors:Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, Nan Duan
Title: Frame-Level Captions for Long Video Generation with Complex Multi Scenes
Abstract:
Generating long videos that can show complex stories, like movie scenes from scripts, has great promise and offers much more than short clips. However, current methods that use autoregression with diffusion models often struggle because their step-by-step process naturally leads to serious error accumulation (drift). Also, many existing approaches to long video generation focus on single, continuous scenes, making them less useful for stories with many events and changes. This paper introduces a new approach to solve these problems. First, we propose a novel way to annotate datasets at the frame level, providing the detailed text guidance needed for making complex, multi-scene long videos. This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely. A key feature is that each part (frame) within these attention windows can be guided by its own distinct text prompt. Our training uses Diffusion Forcing to provide the model with the ability to handle time flexibly. We tested our approach on the difficult VBench 2.0 benchmarks ("Complex Plots" and "Complex Landscapes") based on the WanX2.1-T2V-1.3B model. The results show our method is better at following instructions in complex, changing scenes and creates high-quality long videos. We plan to share our dataset annotation methods and trained models with the research community. Project page: https://zgctroy.github.io/frame-level-captions.

Authors:Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs
Title: ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
Abstract:
Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at https://github.com/mvrl/ConText-CIR.
中文:提出的ConText-CIR框架通过文本概念一致性损失和合成数据生成技术,在多个基准测试中实现了组合图像检索的最新性能突破。
English: The proposed ConText-CIR framework improves composed image retrieval by using a Text Concept-Consistency loss and synthetic data generation, achieving state-of-the-art performance on multiple benchmarks.

Authors:Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang
Title: Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g. grounding and visual understanding capabilities). Different from the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without the need for multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general scenes and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, is capable of end-to-end automatic understanding, self-thinking, and reasoned answering. Comprehensive experiments show that Griffon-R not only achieves advanced performance on complex visual reasoning benchmarks including VSR and CLEVR, but also enhances multimodal capabilities across various benchmarks like MMBench and ScienceQA. Data, models, and codes will be released at https://github.com/jefferyZhan/Griffon/tree/master/Griffon-R soon.
中文摘要:该研究提出了一种统一视觉推理机制,通过类人认知流程实现大型多模态模型的单次推理组合问题求解,在复杂推理基准测试中展现出卓越性能。
English Summary: The study introduces a unified visual reasoning mechanism for Large Multimodal Models that enables single-pass compositional problem-solving through human-like cognitive processes, achieving superior performance on complex reasoning benchmarks.

Authors:Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Title: MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Abstract:
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.
中文:MUSEG是一种基于强化学习的新方法,通过引入时间戳感知的多片段定位和分阶段奖励训练,显著提升了多模态大语言模型的时序理解能力,在时序推理任务中明显优于现有方法。
English: MUSEG is a novel reinforcement learning method that enhances multimodal large language models' temporal understanding by enabling timestamp-aware multi-segment grounding and phased reward training, significantly outperforming existing approaches in temporal reasoning tasks.

Authors:Mingxuan Sun, Juntao Jiang, Zhiqiang Yang, Shenao Kong, Jiamin Qi, Jianru Shang, Shuangling Luo, Wanfa Sun, Tianyi Wang, Yanqi Wang, Qixuan Wang, Tingjian Dai, Tianxiang Chen, Jinming Zhang, Xuerui Zhang, Yuepeng He, Pengcheng Fu, Qiu Guan, Shizheng Zhou, Yanbo Yu, Qigui Jiang, Teng Zhou, Liuyong Shi, Hong Yan
Title: VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images
Abstract:
Microalgae, vital for ecological balance and economic sectors, present challenges in detection due to their diverse sizes and conditions. This paper summarizes the second "Vision Meets Algae" (VisAlgae 2023) Challenge, aiming to enhance high-throughput microalgae cell detection. The challenge, which attracted 369 participating teams, includes a dataset of 1000 images across six classes, featuring microalgae of varying sizes and distinct features. Participants faced tasks such as detecting small targets, handling motion blur, and complex backgrounds. The top 10 methods, outlined here, offer insights into overcoming these challenges and maximizing detection accuracy. This intersection of algae research and computer vision offers promise for ecological understanding and technological advancement. The dataset can be accessed at: https://github.com/juntaoJianggavin/Visalgae2023/.
中文:VisAlgae 2023挑战赛通过369支参赛团队利用多样化数据集应对小目标识别和复杂背景等难题,推动了高通量微藻检测技术的发展,其优胜方案为生态研究和技术创新提供了重要参考。
English: The VisAlgae 2023 Challenge advanced high-throughput microalgae detection by engaging 369 teams to address challenges like small target identification and complex backgrounds using a diverse dataset, with top methods providing valuable insights for ecological and technological progress.

Authors:Lingyu Qiu, Ke Jiang, Xiaoyang Tan
Title: RoGA: Towards Generalizable Deepfake Detection through Robust Gradient Alignment
Abstract:
Recent advancements in domain generalization for deepfake detection have attracted significant attention, with previous methods often incorporating additional modules to prevent overfitting to domain-specific patterns. However, such regularization can hinder the optimization of the empirical risk minimization (ERM) objective, ultimately degrading model performance. In this paper, we propose a novel learning objective that aligns generalization gradient updates with ERM gradient updates. The key innovation is the application of perturbations to model parameters, aligning the ascending points across domains, which specifically enhances the robustness of deepfake detection models to domain shifts. This approach effectively preserves domain-invariant features while managing domain-specific characteristics, without introducing additional regularization. Experimental results on multiple challenging deepfake detection datasets demonstrate that our gradient alignment strategy outperforms state-of-the-art domain generalization techniques, confirming the efficacy of our method. The code is available at https://github.com/Lynn0925/RoGA.
中文: 本文提出一种新颖的梯度对齐策略,通过协调泛化与经验风险最小化的更新来增强深度伪造检测能力,无需额外正则化即可超越现有领域泛化方法。
English: This paper introduces a novel gradient alignment strategy that enhances deepfake detection by aligning generalization and empirical risk minimization updates, outperforming existing domain generalization methods without extra regularization.
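The snippet below sketches one perturb-then-update step in the same spirit: ascend along the current gradient to an "ascending point", take the ERM gradient there, then update the original weights (the familiar SAM-style recipe). RoGA's per-domain alignment is more involved than this; the rho value, loss interface, and toy model are assumptions.

```python
# SAM-style ascend-then-descend step as a stand-in for gradient-aligned training.
import torch
import torch.nn as nn

def perturbed_erm_step(model, loss_fn, batch, optimizer, rho=0.05):
    """Perturb weights along the current gradient, then apply the ERM gradient
    computed at the perturbed point to the original weights."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss_fn(model, batch).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12

    with torch.no_grad():                       # step to the ascending point
        for p, g in zip(params, grads):
            p.add_(rho * g / norm)
    optimizer.zero_grad()
    loss_fn(model, batch).backward()            # ERM gradient at perturbed weights
    with torch.no_grad():                       # restore the original weights
        for p, g in zip(params, grads):
            p.sub_(rho * g / norm)
    optimizer.step()                            # apply the update
    optimizer.zero_grad()

if __name__ == "__main__":
    model = nn.Linear(8, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
    loss_fn = lambda m, b: nn.functional.cross_entropy(m(b[0]), b[1])
    perturbed_erm_step(model, loss_fn, (x, y), opt)
    print("one perturbed ERM step done")
```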

Authors:HsiaoYuan Hsu, Yuxin Peng
Title: Scan-and-Print: Patch-level Data Summarization and Augmentation for Content-aware Layout Generation in Poster Design
Abstract:
In AI-empowered poster design, content-aware layout generation is crucial for the on-image arrangement of visual-textual elements, e.g., logo, text, and underlay. To perceive the background images, existing work demands a high parameter count that far exceeds the size of available training data, which impedes the model's real-time performance and generalization ability. To address these challenges, we propose a patch-level data summarization and augmentation approach, vividly named Scan-and-Print. Specifically, the scan procedure selects only the patches suitable for placing element vertices to perform fine-grained perception efficiently. Then, the print procedure mixes up the patches and vertices across two image-layout pairs to synthesize over 100% new samples in each epoch while preserving their plausibility. Besides, to facilitate the vertex-level operations, a vertex-based layout representation is introduced. Extensive experimental results on widely used benchmarks demonstrate that Scan-and-Print can generate visually appealing layouts with state-of-the-art quality while dramatically reducing the computational bottleneck by 95.2%.

Authors:Haoyu Zhang, Yisen Feng, Qiaohui Chu, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Title: HCQA-1.5 @ Ego4D EgoSchema Challenge 2025
Abstract:
In this report, we present the method that achieves third place for Ego4D EgoSchema Challenge in CVPR 2025. To improve the reliability of answer prediction in egocentric video question answering, we propose an effective extension to the previously proposed HCQA framework. Our approach introduces a multi-source aggregation strategy to generate diverse predictions, followed by a confidence-based filtering mechanism that selects high-confidence answers directly. For low-confidence cases, we incorporate a fine-grained reasoning module that performs additional visual and contextual analysis to refine the predictions. Evaluated on the EgoSchema blind test set, our method achieves 77% accuracy on over 5,000 human-curated multiple-choice questions, outperforming last year's winning solution and the majority of participating teams. Our code will be added at https://github.com/Hyu-Zhang/HCQA.
中文: 我们通过改进HCQA框架,采用多源预测聚合与置信度筛选机制,在CVPR 2025 Ego4D EgoSchema挑战赛中荣获第三名,在以自我为中心的视频问答任务上达到77%的准确率。
English: Our method, an enhanced HCQA framework with multi-source prediction aggregation and confidence-based filtering, secured third place in the CVPR 2025 Ego4D EgoSchema Challenge by achieving 77% accuracy on egocentric video QA.
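A small sketch of multi-source aggregation with confidence-based filtering, under the assumption that each source returns an (answer, confidence) pair: take the majority answer when its average confidence clears a threshold, and otherwise defer to a fine-grained reasoning routine (a placeholder callback here). The threshold and interfaces are illustrative, not the challenge submission's exact logic.

```python
# Hedged sketch: aggregate candidate answers, filter by confidence, fall back
# to a slower fine-grained reasoning routine for low-confidence cases.
from collections import Counter

def aggregate(predictions, threshold=0.8, fallback=None):
    """predictions: list of (answer, confidence) pairs from different sources."""
    votes = Counter(ans for ans, _ in predictions)
    best_answer, _ = votes.most_common(1)[0]
    confs = [c for ans, c in predictions if ans == best_answer]
    if sum(confs) / len(confs) >= threshold:
        return best_answer
    return fallback(predictions) if fallback else best_answer

if __name__ == "__main__":
    print(aggregate([("B", 0.9), ("B", 0.85), ("C", 0.4)]))  # confident -> "B"
    print(aggregate([("A", 0.5), ("B", 0.45), ("C", 0.4)],
                    fallback=lambda p: "route to fine-grained reasoning"))
```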

Authors:Yuan Wu, Zhiqiang Yan, Yigong Zhang, Xiang Li, Jian Yang
Title: See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction
Abstract:
Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available at https://github.com/yanzq95/LIAR.
中文: LIAR是一种创新框架,通过学习光照适应表征,利用选择性低光图像增强和光照感知组件解决夜间场景中的欠曝光和过曝光问题,在挑战性条件下表现出卓越性能。
English: LIAR is a novel framework that enhances nighttime occupancy prediction by learning illumination-affined representations, using selective low-light image enhancement and illumination-aware components to address underexposure and overexposure, achieving superior performance in challenging conditions.

Authors:Guiping Cao, Tao Wang, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang
Title: Open-Det: An Efficient Learning Framework for Open-Ended Detection
Abstract:
Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, incorporating the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes are available at: https://github.com/Med-Process/Open-Det.
中文: Open-Det框架通过加速训练和弥合视觉与语言间的语义差距,有效解决了现有开放目标检测模型的不足,仅用少量数据和计算资源便实现了更优性能。
English: The Open-Det framework efficiently addresses the limitations of existing Open-Ended object Detection models by accelerating training and bridging the vision-language gap, achieving superior performance with significantly reduced data and computational resources.

Authors:Wenhao You, Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Zhongyu Ouyang, Chiyu Ma, Tingxuan Wu, Noah Wei, Zong Ke, Ming Cheng, Soroush Vosoughi, Jiang Gui
Title: Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs
Abstract:
While recent Multimodal Large Language Models exhibit impressive capabilities for general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) particularly underscores this, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and the critical need for domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this position paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study provides valuable insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors, and aiming to establish a robust foundation for advancing multimodal musical understanding. This work is intended to inspire broader attention and further research, supported by a continuously updated anonymous GitHub repository of relevant papers: https://github.com/xid32/Survey4MusicAVQA.
中文摘要:本文强调音乐视听问答领域需要专门的多模态方法,指出必须采用定制化输入处理、时空架构设计和音乐特定建模策略,以应对该领域连续密集的视听内容及复杂时序动态等独特挑战。
English Summary: This position paper emphasizes the need for specialized multimodal approaches in music audio-visual question answering, highlighting the importance of tailored input processing, spatiotemporal architectures, and music-specific modeling strategies to address the domain's unique challenges.

Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Title: Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Abstract:
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 16.8 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/
中文: 视觉语言模型在常见物体检测上表现出色,但难以泛化到分布外概念,为此我们提出Roboflow100-VL基准,通过多模态指令实现少样本概念对齐以解决这一局限。
English: Vision-language models excel at detecting common objects but struggle with out-of-distribution concepts, prompting the introduction of Roboflow100-VL for few-shot alignment using multimodal instructions to address this limitation.

Authors:Xiaobao Wei, Xiaoan Zhang, Hao Wang, Qingpo Wuwu, Ming Lu, Wenzhao Zheng, Shanghang Zhang
Title: OmniIndoor3D: Comprehensive Indoor 3D Reconstruction
Abstract:
We propose a novel framework for comprehensive indoor 3D reconstruction using Gaussian representations, called OmniIndoor3D. This framework enables accurate appearance, geometry, and panoptic reconstruction of diverse indoor scenes captured by a consumer-level RGB-D camera. Since 3DGS is primarily optimized for photorealistic rendering, it lacks the precise geometry critical for high-quality panoptic reconstruction. Therefore, OmniIndoor3D first combines multiple RGB-D images to create a coarse 3D reconstruction, which is then used to initialize the 3D Gaussians and guide the 3DGS training. To decouple the optimization conflict between appearance and geometry, we introduce a lightweight MLP that adjusts the geometric properties of 3D Gaussians. The introduced lightweight MLP serves as a low-pass filter for geometry reconstruction and significantly reduces noise in indoor scenes. To improve the distribution of Gaussian primitives, we propose a densification strategy guided by panoptic priors to encourage smoothness on planar surfaces. Through the joint optimization of appearance, geometry, and panoptic reconstruction, OmniIndoor3D provides comprehensive 3D indoor scene understanding, which facilitates accurate and robust robotic navigation. We perform thorough evaluations across multiple datasets, and OmniIndoor3D achieves state-of-the-art results in appearance, geometry, and panoptic reconstruction. We believe our work bridges a critical gap in indoor 3D reconstruction. The code will be released at: https://ucwxb.github.io/OmniIndoor3D/
中文:OmniIndoor3D是一种创新框架,利用高斯表示实现全面的室内三维重建,通过联合优化外观、几何和全景分割,为机器人导航提供精确可靠的场景理解。
English: OmniIndoor3D is a novel framework that uses Gaussian representations to achieve comprehensive indoor 3D reconstruction, integrating appearance, geometry, and panoptic reconstruction through joint optimization for superior robotic navigation.

Authors:Jihoon Lee, Min Song
Title: Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models
Abstract:
Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
中文: RVCD是一种先进的对比解码方法,通过在逻辑层面同时利用负面和正面的AI生成图像,有效抑制视觉语言模型中的物体幻觉问题,相比现有方法展现出显著改进。
English: RVCD is an advanced contrastive decoding method that suppresses object hallucination in vision-language models by leveraging both negative and positive AI-generated images at the logit level, showing significant improvements over existing approaches.
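As an illustration of logit-level contrastive decoding, the sketch below pushes the next-token distribution toward logits conditioned on a positive reference image and away from logits conditioned on a negative one. The additive alpha/beta weighting is an assumption made for exposition; RVCD's actual combination rule and its retrieval of concept images are not reproduced here.

```python
# Illustrative logit-level contrastive decoding with positive/negative references.
import torch

def contrastive_logits(logits_query, logits_pos, logits_neg, alpha=0.5, beta=0.5):
    """Combine next-token logits from the query image and two reference images."""
    return logits_query + alpha * logits_pos - beta * logits_neg

if __name__ == "__main__":
    vocab = 10
    lq, lp, ln = torch.randn(vocab), torch.randn(vocab), torch.randn(vocab)
    probs = torch.softmax(contrastive_logits(lq, lp, ln), dim=-1)
    print("next token:", int(probs.argmax()))
```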

Authors:Elias Arbash, Ahmed Jamal Afifi, Ymane Belahsen, Margret Fuchs, Pedram Ghamisi, Paul Scheunders, Richard Gloaguen
Title: Electrolyzers-HSI: Close-Range Multi-Scene Hyperspectral Imaging Benchmark Dataset
Abstract:
The global challenge of sustainable recycling demands automated, fast, and accurate state-of-the-art (SOTA) material detection systems that act as a bedrock for a circular economy. Democratizing access to these cutting-edge solutions that enable real-time waste analysis is essential for scaling up recycling efforts and fostering the Green Deal. In response, we introduce Electrolyzers-HSI, a novel multimodal benchmark dataset designed to accelerate the recovery of critical raw materials through accurate electrolyzer materials classification. The dataset comprises 55 co-registered high-resolution RGB images and hyperspectral imaging (HSI) data cubes spanning the 400-2500 nm spectral range, yielding over 4.2 million pixel vectors and 424,169 labeled ones. This enables non-invasive spectral analysis of shredded electrolyzer samples, supporting quantitative and qualitative material classification and spectral properties investigation. We evaluate a suite of baseline machine learning (ML) methods alongside SOTA transformer-based deep learning (DL) architectures, including Vision Transformer, SpectralFormer, and the Multimodal Fusion Transformer, to investigate architectural bottlenecks for further efficiency optimisation when deploying transformers in material identification. We implement zero-shot detection techniques and majority voting across pixel-level predictions to establish object-level classification robustness. In adherence to the FAIR data principles, the Electrolyzers-HSI dataset and accompanying codebase are openly available at https://github.com/hifexplo/Electrolyzers-HSI and https://rodare.hzdr.de/record/3668, supporting reproducible research and facilitating the broader adoption of smart and sustainable e-waste recycling solutions.
中文: 该研究推出了Electrolyzers-HSI数据集,这一多模态资源结合了RGB和高光谱数据,旨在提升关键原材料的回收效率,并通过评估机器学习模型优化材料分类技术,推动可持续电子废物回收的发展。
English: The study introduces the Electrolyzers-HSI dataset, a multimodal resource with RGB and hyperspectral data to enhance critical raw material recovery, and evaluates machine learning models for efficient material classification to advance sustainable e-waste recycling.
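A short sketch of object-level classification via majority voting over pixel-level predictions, as described for robustifying per-pixel classifiers; the pixel classifier and the object segmentation mask are assumed to exist already, and the class count is a placeholder.

```python
# Object-level label from per-pixel class predictions by majority vote.
import numpy as np

def object_label_by_vote(pixel_preds, object_mask):
    """pixel_preds: (H, W) integer class map; object_mask: (H, W) boolean mask."""
    votes = pixel_preds[object_mask]
    return np.bincount(votes).argmax()

if __name__ == "__main__":
    preds = np.random.randint(0, 6, size=(64, 64))   # 6 material classes (placeholder)
    mask = np.zeros((64, 64), dtype=bool); mask[10:30, 10:30] = True
    print("object class:", object_label_by_vote(preds, mask))
```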

Authors:Dongyu Luo, Kelin Yu, Amir-Hossein Shahidzadeh, Cornelia Fermüller, Yiannis Aloimonos, Ruohan Gao
Title: ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image
Abstract:
Vision-based tactile sensing has been widely used in perception, reconstruction, and robotic manipulation. However, collecting large-scale tactile data remains costly due to the localized nature of sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, such as simulation and free-form tactile generation, often suffer from unrealistic output and poor transferability to downstream tasks. To address this, we propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. With those physical priors as control input, ControlTac generates physically plausible and varied tactile images that can be used for effective data augmentation. Through experiments on three downstream tasks, we demonstrate that ControlTac can effectively augment tactile datasets and lead to consistent gains. Our three real-world experiments further validate the practical utility of our approach. Project page: https://dongyuluo.github.io/controltac.

Authors:Chenghao Qian, Wenjing Li, Yuhu Guo, Gustav Markkula
Title: WeatherEdit: Controllable Weather Editing with 4D Gaussian Field
Abstract:
In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pretrained diffusion model, enabling the generation of diverse weather effects in 2D image backgrounds. During inference, we design a Temporal-View (TV-) attention mechanism that follows a specific order to aggregate temporal and spatial information, ensuring consistent editing across multi-frame and multi-view images. To construct the weather particles, we first reconstruct a 3D scene using the edited images and then introduce a dynamic 4D Gaussian field to generate snowflakes, raindrops and fog in the scene. The attributes and dynamics of these particles are precisely controlled through physics-based modelling and simulation, ensuring realistic weather representation and flexible severity adjustments. Finally, we integrate the 4D Gaussian field with the 3D scene to render consistent and highly realistic weather effects. Experiments on multiple driving datasets demonstrate that WeatherEdit can generate diverse weather effects with controllable condition severity, highlighting its potential for autonomous driving simulation in adverse weather. See project page: https://jumponthemoon.github.io/w-edit

Authors:Lei Tian, Xiaomin Li, Liqian Ma, Hao Yin, Zirui Zheng, Hefei Huang, Taiqing Li, Huchuan Lu, Xu Jia
Title: CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting
Abstract:
Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding, a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of SAM-generated 2D masks and reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate that CCL-LGS outperforms previous state-of-the-art methods. Our project page is available at https://epsilontl.github.io/CCL-LGS/.
Chinese: 3D语义理解的最新进展面临跨视图不一致性的挑战,但提出的CCL-LGS框架通过整合多视图语义线索和对比学习,有效解决了这些问题,提高了准确性并减少了伪影。
English: Recent progress in 3D semantic understanding faces challenges from cross-view inconsistencies, but the proposed CCL-LGS framework effectively addresses these by integrating multi-view semantic cues and contrastive learning to enhance accuracy and reduce artifacts.

Authors:Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu
Title: HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
中文: 提出的HoPE方法通过混合频率分配和动态时间缩放机制增强视觉语言模型的长上下文处理能力,在长视频理解和检索任务中相比现有方法展现出更优性能。
English: The proposed HoPE method enhances Vision-Language Models' long-context capabilities through hybrid frequency allocation and dynamic temporal scaling, demonstrating superior performance in long video understanding and retrieval tasks compared to existing approaches.
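For context, the sketch below shows the common pattern of allocating RoPE frequency channels across temporal and spatial axes for video tokens. The even three-way split and the base value are illustrative defaults only; HoPE's hybrid frequency allocation and dynamic temporal scaling follow the paper's own analysis and are not reproduced here.

```python
# Illustrative multi-axis RoPE frequency allocation for (t, h, w) token indices.
import torch

def rope_angles(positions, dim, base=10000.0):
    """positions: (N,) indices; returns (N, dim/2) rotation angles."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def multiaxis_rope_angles(t, h, w, dim):
    """Split channels equally across (t, h, w); real allocations may be uneven."""
    d = dim // 3
    return torch.cat([rope_angles(t, d), rope_angles(h, d), rope_angles(w, d)], dim=-1)

if __name__ == "__main__":
    n = 8
    t = torch.arange(n); h = torch.zeros(n); w = torch.arange(n) % 4
    print(multiaxis_rope_angles(t, h, w, dim=96).shape)   # -> (8, 48)
```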

Authors:Yunlong Tang, Pinxin Liu, Mingqian Feng, Zhangyun Tan, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu
Title: MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Abstract:
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
中文: 本文提出首个系统评估多模态大语言模型视角几何理解能力的基准MMPerspective,通过三个维度的10项任务揭示了模型虽具备表层感知能力,但在组合推理和空间一致性方面存在显著局限。
English: This paper introduces MMPerspective, the first benchmark to systematically evaluate multimodal large language models' understanding of perspective geometry through 10 tasks across three dimensions, revealing significant limitations in compositional reasoning and spatial consistency despite surface-level perceptual competence.

Authors:Royden Wagner, Omer Sahin Tas, Felix Hauser, Marlon Steiner, Dominik Strutz, Abhishek Vivekanandan, Carlos Fernandez, Christoph Stiller
Title: RetroMotion: Retrocausal Motion Forecasting Models are Instructable
Abstract:
Motion forecasts of road users (i.e., agents) vary in complexity as a function of scene constraints and interactive behavior. We address this with a multi-task learning method for motion forecasting that includes a retrocausal flow of information. The corresponding tasks are to forecast (1) marginal trajectory distributions for all modeled agents and (2) joint trajectory distributions for interacting agents. Using a transformer model, we generate the joint distributions by re-encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. Per trajectory point, we model positional uncertainty using compressed exponential power distributions. Notably, our method achieves state-of-the-art results in the Waymo Interaction Prediction dataset and generalizes well to the Argoverse 2 dataset. Additionally, our method provides an interface for issuing instructions through trajectory modifications. Our experiments show that regular training of motion forecasting leads to the ability to follow goal-based instructions and to adapt basic directional instructions to the scene context. Code: https://github.com/kit-mrt/future-motion
中文摘要:本文提出一种多任务学习的运动预测方法,通过逆向因果信息流和Transformer模型生成个体与交互轨迹分布,在基准数据集上实现最优性能,同时支持基于目标的轨迹指令修改。
English Summary: This paper introduces a multi-task learning method for motion forecasting that uses retrocausal information flow and transformer models to generate both marginal and joint trajectory distributions, achieving state-of-the-art results on benchmark datasets while enabling goal-based trajectory instructions.

Authors:Dong Liu, Yanxuan Yu, Jiayi Zhang, Yifan Li, Ben Lengerich, Ying Nian Wu
Title: FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation
Abstract:
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with the best generation quality among caching methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at https://github.com/NoakLiu/FastCache-xDiT.
中文摘要:FastCache是一种创新的缓存压缩框架,通过自适应过滤冗余令牌和复用潜在激活来加速扩散Transformer推理,在保持生成质量的同时显著降低计算成本。
English Summary: FastCache is a novel caching and compression framework that accelerates Diffusion Transformer inference by adaptively filtering redundant tokens and reusing latent activations, significantly reducing computational costs while maintaining generation quality.
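To make the cache-reuse idea concrete (reusing transformer activations across timesteps when they have barely changed), here is a minimal sketch using a simple relative-change threshold; the paper's actual rule is hypothesis-testing based, so the norm test and `tol` below are stand-in assumptions.

```python
import torch

def reuse_or_refresh(cached: torch.Tensor, current: torch.Tensor, tol: float = 0.05):
    """Return the cached hidden state when the relative change since the cached
    timestep is negligible (approximated here by a norm ratio), signalling that
    downstream blocks may skip recomputation."""
    rel_change = (current - cached).norm() / (cached.norm() + 1e-8)
    if rel_change.item() < tol:
        return cached, True    # reuse cached activation
    return current, False      # refresh the cache
```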

Authors:Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
Title: DiSA: Diffusion Step Annealing in Autoregressive Image Generation
Abstract:
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To intuitively explain, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and, albeit simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining the generation quality.
Chinese: 本文提出扩散步长退火(DiSA)方法,通过随生成令牌增多而逐步减少扩散步数,在保持图像生成质量的同时,将自回归模型的推理速度最高提升10倍。
English: This paper introduces diffusion step annealing (DiSA), a training-free method that accelerates autoregressive image generation by progressively reducing diffusion steps as more tokens are generated, achieving up to 10x faster inference while preserving quality.
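A minimal sketch of the step-annealing schedule described above (50 steps early, 5 late); the linear interpolation is an assumption, since the abstract does not fix the exact schedule shape.

```python
def disa_num_steps(token_idx: int, total_tokens: int,
                   max_steps: int = 50, min_steps: int = 5) -> int:
    """Return the number of diffusion sampling steps for the token at position
    token_idx: many steps for early tokens, few for late ones."""
    frac = token_idx / max(total_tokens - 1, 1)         # 0.0 first token, 1.0 last
    steps = max_steps - frac * (max_steps - min_steps)  # linear annealing (assumed)
    return max(min_steps, round(steps))

# Example: schedule for a 256-token autoregressive generation
schedule = [disa_num_steps(i, 256) for i in range(256)]
assert schedule[0] == 50 and schedule[-1] == 5
```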

Authors:Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan
Title: OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
Abstract:
Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 18 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
中文: 本研究提出了OpenS2V-Nexus这一完整基础设施,通过细粒度基准和大规模数据集来评估和提升视频生成中的主体一致性与自然度。
English: The study introduces OpenS2V-Nexus, a comprehensive infrastructure for Subject-to-Video generation featuring a fine-grained benchmark and a large-scale dataset to evaluate and enhance subject consistency and naturalness in videos.

Authors:Di Wu, Yixin Wan, Kai-Wei Chang
Title: Visualized Text-to-Image Retrieval
Abstract:
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
Chinese: VisRet提出了一种新的文本到图像检索范式,通过先将文本查询转换为图像,然后在图像模态内进行检索,显著提升了检索性能24.5%至32.7%,并有效增强了下游任务的准确性。
English: VisRet introduces a novel Text-to-Image retrieval approach by first converting text queries into images and then retrieving within the image modality, significantly improving performance by 24.5% to 32.7% across benchmarks and benefiting downstream tasks.
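A hedged sketch of the visualize-then-retrieve pipeline: the text query is first rendered as an image by an off-the-shelf T2I model, then ranked against the corpus purely by image-to-image similarity. `t2i_generate` and `image_encode` are placeholder callables, not APIs from the released code.

```python
import numpy as np

def visret_retrieve(query_text, corpus_embs, t2i_generate, image_encode, top_k=10):
    """corpus_embs: (N, D) L2-normalized image embeddings of the corpus."""
    query_image = t2i_generate(query_text)        # project the query into the image modality
    q = image_encode(query_image)                 # same encoder used to embed the corpus
    q = q / (np.linalg.norm(q) + 1e-8)
    scores = corpus_embs @ q                      # cosine similarity, image-to-image
    return np.argsort(-scores)[:top_k]
```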

Authors:Guangting Zheng, Yehao Li, Yingwei Pan, Jiajun Deng, Ting Yao, Yanyong Zhang, Tao Mei
Title: Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
Abstract:
Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de-facto standard of next token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context, especially for early token prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present Hierarchical Masked Autoregressive models (Hi-MAR), which pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. Hi-MAR learns to predict a few image tokens in low resolution, functioning as intermediary pivots to reflect global structure, in the first phase. Such pivots act as additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring fewer computational costs. Code is available at https://github.com/HiDream-ai/himar.
中文: 本文提出分层掩码自回归模型(Hi-MAR),通过先预测低分辨率图像标记来捕捉全局结构,再将其作为密集标记生成的引导,以更高效率超越了标准自回归方法的视觉生成性能。
English: This paper introduces Hierarchical Masked Autoregressive models (Hi-MAR), which enhance visual generation by predicting low-resolution image tokens first to capture global structure, then using them as guidance for dense token generation, outperforming standard autoregressive methods with greater efficiency.

Authors:Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, Tao Mei
Title: MotionPro: A Precise Motion Controller for Image-to-Video Generation
Abstract:
Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories as condition without explicitly defining movement region, leading to coarse motion control and failing to disentangle object and camera moving. To alleviate these, we present MotionPro, a precise motion controller that novelly leverages region-wise trajectory and motion mask to regulate fine-grained motion synthesis and identify target motion category (i.e., object or camera moving), respectively. Technically, MotionPro first estimates the flow maps on each training video via a tracking model, and then samples the region-wise trajectories to simulate inference scenario. Instead of extending flow through large Gaussian kernels, our region-wise trajectory approach enables more precise control by directly utilizing trajectories within local regions, thereby effectively characterizing fine-grained movements. A motion mask is simultaneously derived from the predicted flow maps to capture the holistic motion dynamics of the movement regions. To pursue natural motion control, MotionPro further strengthens video denoising by incorporating both region-wise trajectories and motion mask through feature modulation. More remarkably, we meticulously construct a benchmark, i.e., MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro. Please refer to our project page for more results: https://zhw-zhang.github.io/MotionPro-page/.
中文: MotionPro通过引入区域轨迹和运动掩码的精确运动控制方法,实现了细粒度运动合成并能区分物体与相机运动,相比依赖粗糙高斯核的传统方法具有显著优势。
English: MotionPro introduces a precise motion control method for image-to-video generation by utilizing region-wise trajectories and motion masks to achieve fine-grained motion synthesis and distinguish between object and camera movement, outperforming previous approaches that relied on coarse Gaussian kernels.

Authors:Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li
Title: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Abstract:
Role-Playing Agents (RPAs), which benefit from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role's voice traits (e.g., voice style and emotions), which play a crucial role in interaction and make the experience more immersive in realistic scenarios. Towards this goal, we propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
中文:OmniCharacter是一种首创的语音-语言个性交互模型,通过低延迟使角色扮演智能体持续展现角色特定特质和声音特征,借助丰富数据集创造沉浸式体验,在内容和风格上均优于现有方法。
English: OmniCharacter is a pioneering speech-language personality interaction model that enables Role-Playing Agents to consistently exhibit role-specific traits and vocal characteristics with low latency, creating immersive experiences through a comprehensive dataset and outperforming existing methods in both content and style.

Authors:Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, Li Yuan
Title: ImgEdit: A Unified Image Editing Dataset and Benchmark
Abstract:
Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model that uses a vision-language model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and our model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic test suite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on https://github.com/PKU-YuanGroup/ImgEdit.
中文: 摘要介绍了ImgEdit,这是一个大规模高质量的图像编辑数据集和基准,旨在通过提供精心筛选的编辑对和全面评估工具来弥补开源模型的不足,其训练的ImgEdit-E1模型展现出卓越性能。
English: The abstract introduces ImgEdit, a large-scale, high-quality image-editing dataset and benchmark designed to address limitations in open-source models by providing curated edit pairs and comprehensive evaluation tools, with the trained model ImgEdit-E1 demonstrating superior performance.

Authors:Jinsheng Quan, Chunshi Wang, Yawei Luo
Title: ParticleGS: Particle-Based Dynamics Modeling of 3D Gaussians for Prior-free Motion Extrapolation
Abstract:
This paper aims to model the dynamics of 3D Gaussians from visual observations to support temporal extrapolation. Existing dynamic 3D reconstruction methods often struggle to effectively learn underlying dynamics or rely heavily on manually defined physical priors, which limits their extrapolation capabilities. To address this issue, we propose a novel dynamic 3D Gaussian Splatting prior-free motion extrapolation framework based on particle dynamics systems. The core advantage of our method lies in its ability to learn differential equations that describe the dynamics of 3D Gaussians, and follow them during future frame extrapolation. Instead of simply fitting to the observed visual frame sequence, we aim to more effectively model the Gaussian particle dynamics system. To this end, we introduce a dynamics latent state vector into the standard Gaussian kernel and design a dynamics latent space encoder to extract the initial state. Subsequently, we introduce a Neural ODEs-based dynamics module that models the temporal evolution of Gaussians in the dynamics latent space. Finally, a Gaussian kernel space decoder is used to decode the latent state at a specific time step into the deformation. Experimental results demonstrate that the proposed method achieves comparable rendering quality with existing approaches in reconstruction tasks, and significantly outperforms them in future frame extrapolation. Our code is available at https://github.com/QuanJinSheng/ParticleGS.
中文: 本文提出了一种新颖的动态3D高斯泼溅框架,通过神经微分方程学习粒子动力学系统,在保持重建质量的同时实现了显著优于现有方法的时序外推能力。
English: This paper introduces a novel dynamic 3D Gaussian Splatting framework that learns underlying particle dynamics through Neural ODEs, enabling superior temporal extrapolation while maintaining competitive reconstruction quality.
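A toy stand-in for the Neural ODE dynamics idea: a learned vector field over per-Gaussian latent states, integrated with explicit Euler steps to extrapolate past the observed frames. The dimensions and the Euler integrator are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Learned vector field f(z) over per-Gaussian latent states."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, z0: torch.Tensor, t1: float, n_steps: int = 20) -> torch.Tensor:
        dt = t1 / n_steps
        z = z0
        for _ in range(n_steps):       # dz/dt = f(z), integrated with explicit Euler
            z = z + dt * self.f(z)
        return z

# z0: latent states of 1000 Gaussians at the last observed frame
z_future = LatentDynamics()(torch.randn(1000, 16), t1=1.0)
```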

Authors:Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, Chunhua Shen
Title: Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Abstract:
Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
中文摘要:该研究提出Omni-R1框架,通过强化学习协调全局推理与细节理解双系统,在保持高分辨率精准定位的同时实现长序列高效处理,在多项挑战性任务中超越现有方法并显著提升泛化能力。
English Summary: The study introduces Omni-R1, a two-system framework using reinforcement learning to balance global video-audio reasoning and fine-grained pixel understanding, achieving superior performance on challenging benchmarks while enhancing generalization and reducing hallucinations.
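For readers unfamiliar with Group Relative Policy Optimization, the core advantage computation is sketched below: each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is the standard GRPO recipe, not code from the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, num_rollouts); each row is one group of rollouts."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 4 prompts x 8 rollouts scored by hierarchical rewards
advantages = group_relative_advantages(torch.rand(4, 8))
```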

Authors:Muyao Niu, Mingdeng Cao, Yifan Zhan, Qingtian Zhu, Mingze Ma, Jiancheng Zhao, Yanhong Zeng, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
Title: AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models
Abstract:
Recent advances in video diffusion models have significantly improved character animation techniques. However, current approaches rely on basic structural conditions such as DWPose or SMPL-X to animate character images, limiting their effectiveness in open-domain scenarios with dynamic backgrounds or challenging human poses. In this paper, we introduce AniCrafter, a diffusion-based human-centric animation model that can seamlessly integrate and animate a given character into open-domain dynamic backgrounds while following given human motion sequences. Built on cutting-edge Image-to-Video (I2V) diffusion architectures, our model incorporates an innovative "avatar-background" conditioning mechanism that reframes open-domain human-centric animation as a restoration task, enabling more stable and versatile animation outputs. Experimental results demonstrate the superior performance of our method. Codes are available at https://github.com/MyNiuuu/AniCrafter.
Chinese: AniCrafter提出了一种基于扩散的人物动画模型,通过创新的“角色-背景”调节机制,能在动态开放域场景中无缝生成人物动画,其性能优于现有方法。
English: AniCrafter introduces a diffusion-based human animation model that uses an innovative avatar-background conditioning mechanism to seamlessly animate characters in dynamic open-domain scenes, outperforming existing methods.

Authors:Pranav Poudel, Aavash Chhetri, Prashnna Gyawali, Georgios Leontidis, Binod Bhattarai
Title: Multimodal Federated Learning With Missing Modalities through Feature Imputation Network
Abstract:
Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns, two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. Our experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings, show consistent improvements over competitive baselines. The code and implementation details are available at: https://github.com/bhattarailab/FedFeatGen
中文摘要:本文提出了一种轻量级特征转换器,用于在医疗多模态联邦学习中重构缺失模态,在三个医疗数据集上均持续优于基线方法。
English Summary: The paper introduces a lightweight feature translator for reconstructing missing modalities in multimodal federated learning for healthcare, which consistently outperforms baseline methods across three medical datasets.
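A minimal sketch of what a lightweight bottleneck-feature translator could look like: a small MLP mapping the observed modality's features to an estimate of the missing modality's features, trained (on clients with both modalities) with a regression loss. The dimensions and the MSE objective are assumptions.

```python
import torch
import torch.nn as nn

class FeatureTranslator(nn.Module):
    """Maps bottleneck features of an observed modality (e.g. the image branch)
    to an estimate of the missing modality's bottleneck features."""
    def __init__(self, in_dim: int = 512, out_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, observed: torch.Tensor) -> torch.Tensor:
        return self.net(observed)

# Training sketch on a client where both modalities are present
translator = FeatureTranslator()
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
loss = nn.functional.mse_loss(translator(img_feat), txt_feat)
```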

Authors:Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Title: Agentic 3D Scene Generation with Spatially Contextualized VLMs
Abstract:
Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications. Project page: https://spatctxvlm.github.io/project_page/.
中文: 本研究提出了一种新框架,通过整合动态空间上下文来增强视觉语言模型的三维空间推理能力,实现了面向具身AI应用的高级场景生成与交互式编辑功能。
English: This work introduces a novel framework that enhances vision-language models' capacity for 3D spatial reasoning by integrating a dynamic spatial context, enabling advanced scene generation and interactive editing for embodied AI applications.

Authors:Fanheng Kong, Jingyuan Zhang, Hongzhi Zhang, Shi Feng, Daling Wang, Linhao Yu, Xingguang Ji, Yu Tian, Victoria W., Fuzheng Zhang
Title: TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
Abstract:
Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models. The data and code are available at https://friedrichor.github.io/projects/TUNA.
Chinese: TUNA基准通过引入描述和问答任务,解决了视频中时间元素的整体整合问题,揭示了现有模型在动作描述有限和多主体理解不足等方面的关键挑战。
English: The TUNA benchmark addresses the holistic integration of temporal elements in videos by introducing captioning and QA tasks, revealing key challenges like limited action description and inadequate multi-subject understanding in current models.

Authors:Hongsong Wang, Yin Zhu, Qiuxia Lai, Yang Zhang, Guo-Sen Xie, Xin Geng
Title: PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation
Abstract:
Computational dance generation is crucial in many areas, such as art, human-computer interaction, virtual reality, and digital entertainment, particularly for generating coherent and expressive long dance sequences. Diffusion-based music-to-dance generation has made significant progress, yet existing methods still struggle to produce physically plausible motions. To address this, we propose Plausibility-Aware Motion Diffusion (PAMD), a framework for generating dances that are both musically aligned and physically realistic. The core of PAMD lies in the Plausible Motion Constraint (PMC), which leverages Neural Distance Fields (NDFs) to model the actual pose manifold and guide generated motions toward a physically valid pose manifold. To provide more effective guidance during generation, we incorporate Prior Motion Guidance (PMG), which uses standing poses as auxiliary conditions alongside music features. To further enhance realism for complex movements, we introduce the Motion Refinement with Foot-ground Contact (MRFC) module, which addresses foot-skating artifacts by bridging the gap between the optimization objective in linear joint position space and the data representation in nonlinear rotation space. Extensive experiments show that PAMD significantly improves musical alignment and enhances the physical plausibility of generated motions. This project page is available at: https://mucunzhuzhu.github.io/PAMD-page/.

Authors:Hongsong Wang, Ao Sun, Jie Gui, Liang Wang
Title: Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay
Abstract:
Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on close-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier's weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8\% and 12.8\% in terms of mean global accuracy, respectively. The code is available on https://github.com/sunao-101/PGPFR-3/.
中文: 本研究提出了一种原型引导的伪特征回放框架,用于解决类别增量手势识别问题,有效防止灾难性遗忘,并在基准数据集上以超过11%的准确率优势超越了现有最优方法。
English: This study introduces a Prototype-Guided Pseudo Feature Replay framework to address class-incremental gesture recognition, effectively preventing catastrophic forgetting and outperforming state-of-the-art methods by over 11% in accuracy on benchmark datasets.
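A simplified sketch of prototype-guided pseudo-feature replay for old classes: features are drawn from a Gaussian defined by each stored class prototype and covariance matrix, then replayed to the classifier alongside real features of new classes. The full PFGBP module also mixes in batch prototypes of new classes, which is omitted here.

```python
import numpy as np

def sample_pseudo_features(prototype: np.ndarray, cov: np.ndarray, n: int = 64) -> np.ndarray:
    """Draw n pseudo features for one old gesture class from N(prototype, cov)."""
    return np.random.multivariate_normal(prototype, cov, size=n)

# Example with an assumed 128-d feature space
proto, cov = np.zeros(128), 0.1 * np.eye(128)
old_class_features = sample_pseudo_features(proto, cov)   # replayed to the classifier
```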

Authors:Chang Liu, Haomin Zhang, Shiyu Xia, Zihao Chen, Chaofan Ding, Xin Yue, Huizhe Chen, Xinhan Di
Title: Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Abstract:
Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal alignment. However, existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation. A comprehensive benchmark is essential for two primary reasons: (1) existing metrics fail to reflect the complexity of video-to-piano music interactions, and (2) a dedicated benchmark dataset can provide valuable insights to accelerate progress in high-quality piano music generation. To address these challenges, we introduce the CoP Benchmark Dataset, a fully open-sourced, multimodal benchmark designed specifically for video-guided piano music generation. The proposed Chain-of-Perform (CoP) benchmark offers several compelling features: (1) detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio via step-by-step Chain-of-Perform guidance; (2) a versatile evaluation framework for rigorous assessment of both general-purpose and specialized video-to-piano generation tasks; and (3) full open-sourcing of the dataset, annotations, and evaluation protocols. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano, with a continuously updated leaderboard to promote ongoing research in this domain.
中文摘要:CoP基准数据集作为一个完全开源的多模态基准被提出,旨在解决视频生成钢琴音乐中精确同步评估的缺失,其具备详细的多模态注释和灵活评估框架。
English Summary: The CoP Benchmark Dataset is introduced as a fully open-sourced, multimodal benchmark designed to address the lack of precise synchronization evaluation in video-to-piano music generation, featuring detailed annotations and a versatile evaluation framework.

Authors:Xueyi Liu, Zuodong Zhong, Yuxin Guo, Yun-Fu Liu, Zhiguo Su, Qichao Zhang, Junli Wang, Yinfeng Gao, Yupeng Zheng, Qiao Lin, Huiyong Chen, Dongbin Zhao
Title: ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving
Abstract:
Due to the powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority to mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset will be found in https://github.com/Liuxueyi/ReasonPlan.
中文:ReasonPlan框架通过自监督下一场景预测和监督决策思维链的双重机制,增强了多模态大语言模型在闭环自动驾驶中的应用,显著超越主流方法并展现出强大的零样本泛化能力。
English: The ReasonPlan framework enhances multimodal large language models for closed-loop autonomous driving by integrating self-supervised next scene prediction and supervised decision chain-of-thought, achieving significant performance improvements over mainstream methods and demonstrating strong zero-shot generalization.

Authors:Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang
Title: UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space
Abstract:
Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. Previous methods have attempted to mitigate this issue by incorporating motion information and temporal layers. However, unreliable motion estimation from low-resolution videos and costly multiple sampling steps with deep temporal layers limit them to short sequences. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporally-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Reconstruction Scheduling (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. To ensure temporal consistency, we propose a lightweight Recurrent Temporal Shift (RTS) module, including an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, it enables effective propagation, fusion, and alignment across frames without explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step. Code is available at https://github.com/yongliuy/UltraVSR.
中文: UltraVSR通过退化感知重建调度和轻量级循环时序偏移模块,在单步扩散框架中实现了超逼真且时序一致的视频超分辨率。
English: UltraVSR introduces a one-step diffusion framework with Degradation-aware Reconstruction Scheduling and a lightweight Recurrent Temporal Shift module to achieve ultra-realistic, temporally-coherent video super-resolution efficiently.
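The temporal-shift idea behind the RTS units can be sketched as follows: a fraction of the channels is shifted one frame forward and another fraction one frame backward, so that the following 2D convolution or attention mixes information across time without explicit temporal layers. The shift ratio and exact placement are assumptions, and the real RTS-convolution/attention units are more involved.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.125) -> torch.Tensor:
    """x: (B, T, C, H, W) video features; shift a slice of channels across frames."""
    b, t, c, h, w = x.shape
    fold = max(int(c * shift_ratio), 1)
    out = x.clone()
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    return out

mixed = temporal_shift(torch.randn(1, 8, 64, 32, 32))
```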

Authors:Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
Title: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Abstract:
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

Authors:David Schneider, Zdravko Marinov, Rafael Baur, Zeyun Zhong, Rodi Düger, Rainer Stiefelhagen
Title: OmniFall: A Unified Staged-to-Wild Benchmark for Human Fall Detection
Abstract:
Current video-based fall detection research mostly relies on small, staged datasets with significant domain biases concerning background, lighting, and camera setup resulting in unknown real-world performance. We introduce OmniFall, unifying eight public fall detection datasets (roughly 14 h of recordings, roughly 42 h of multiview data, 101 subjects, 29 camera views) under a consistent ten-class taxonomy with standardized evaluation protocols. Our benchmark provides complete video segmentation labels and enables fair cross-dataset comparison previously impossible with incompatible annotation schemes. For real-world evaluation we curate OOPS-Fall from genuine accident videos and establish a staged-to-wild protocol measuring generalization from controlled to uncontrolled environments. Experiments with frozen pre-trained backbones such as I3D or VideoMAE reveal significant performance gaps between in-distribution and in-the-wild scenarios, highlighting critical challenges in developing robust fall detection systems. OmniFall Dataset at https://huggingface.co/datasets/simplexsigil2/omnifall , Code at https://github.com/simplexsigil/omnifall-experiments
中文: 当前跌倒检测研究因数据集有限和领域偏差而受限,OmniFall通过整合多个数据集并建立标准化评估框架,揭示了受控与真实场景间的显著性能差距。
English: Current fall detection research suffers from limited datasets and domain biases, so OmniFall unifies multiple datasets under a standardized framework and introduces a real-world evaluation protocol, revealing significant performance gaps between controlled and uncontrolled environments.

Authors:Shintaro Ito, Natsuki Takama, Koichi Ito, Hwann-Tzong Chen, Takafumi Aoki
Title: ErpGS: Equirectangular Image Rendering enhanced with 3D Gaussian Regularization
Abstract:
Multi-view images acquired by a 360-degree camera can be used to reconstruct a 3D space covering a wide area. There are 3D reconstruction methods from equirectangular images based on NeRF and 3DGS, as well as Novel View Synthesis (NVS) methods. On the other hand, it is necessary to overcome the large distortion caused by the projection model of a 360-degree camera when equirectangular images are used. In 3DGS-based methods, the large distortion of the 360-degree camera model generates extremely large 3D Gaussians, resulting in poor rendering accuracy. We propose ErpGS, an omnidirectional GS method based on 3DGS, to realize NVS while addressing these problems. ErpGS introduces several techniques to improve rendering accuracy: geometric regularization, scale regularization, distortion-aware weights, and a mask to suppress the effects of obstacles in equirectangular images. Through experiments on public datasets, we demonstrate that ErpGS can render novel view images more accurately than conventional methods.

Authors:Chao Huang, Benfeng Wang, Jie Wen, Chengliang Liu, Wei Wang, Li Shen, Xiaochun Cao
Title: Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought
Abstract:
Recent advancements in reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate its effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLM to reason anomaly step-by-step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks. Codes and datasets will be released at https://github.com/wbfwonderful/Vad-R1.
中文: 本文提出了视频异常推理新任务及Vad-R1框架,通过感知-认知思维链和强化学习算法,增强多模态大语言模型对视频异常的深度推理能力,在多项基准测试中表现优异。
English: This paper introduces a novel Video Anomaly Reasoning (VAR) task and proposes Vad-R1, an end-to-end framework that enhances Multimodal Large Language Models' deep reasoning for video anomalies through a Perception-to-Cognition Chain-of-Thought and improved reinforcement learning, achieving superior performance on VAD and VAR benchmarks.

Authors:Lukas Meyer, Andrei-Timotei Ardelean, Tim Weyrich, Marc Stamminger
Title: FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields
Abstract:
We introduce FruitNeRF++, a novel fruit-counting approach that combines contrastive learning with neural radiance fields to count fruits from unstructured input photographs of orchards. Our work is based on FruitNeRF, which employs a neural semantic field combined with a fruit-specific clustering approach. The requirement for adaptation for each fruit type limits the applicability of the method, and makes it difficult to use in practice. To lift this limitation, we design a shape-agnostic multi-fruit counting framework, that complements the RGB and semantic data with instance masks predicted by a vision foundation model. The masks are used to encode the identity of each fruit as instance embeddings into a neural instance field. By volumetrically sampling the neural fields, we extract a point cloud embedded with the instance features, which can be clustered in a fruit-agnostic manner to obtain the fruit count. We evaluate our approach using a synthetic dataset containing apples, plums, lemons, pears, peaches, and mangoes, as well as a real-world benchmark apple dataset. Our results demonstrate that FruitNeRF++ is easier to control and compares favorably to other state-of-the-art methods.
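The fruit-agnostic counting step can be illustrated with a small clustering sketch: points sampled from the neural fields, each carrying its instance embedding, are clustered and the clusters counted. DBSCAN and its parameters are stand-ins here, not necessarily the clustering used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def count_fruits(points: np.ndarray, embeddings: np.ndarray,
                 eps: float = 0.5, min_samples: int = 20) -> int:
    """points: (N, 3) sampled positions; embeddings: (N, D) instance features."""
    feats = np.concatenate([points, embeddings], axis=1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    return len(set(labels)) - (1 if -1 in labels else 0)    # -1 marks noise points
```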

Authors:Natsuki Takama, Shintaro Ito, Koichi Ito, Hwann-Tzong Chen, Takafumi Aoki
Title: Sparse2DGS: Sparse-View Surface Reconstruction using 2D Gaussian Splatting with Dense Point Cloud
Abstract:
Gaussian Splatting (GS) has gained attention as a fast and effective method for novel view synthesis. It has also been applied to 3D reconstruction using multi-view images and can achieve fast and accurate 3D reconstruction. However, GS assumes that the input contains a large number of multi-view images, and therefore, the reconstruction accuracy significantly decreases when only a limited number of input images are available. One of the main reasons is the insufficient number of 3D points in the sparse point cloud obtained through Structure from Motion (SfM), which results in a poor initialization for optimizing the Gaussian primitives. We propose a new 3D reconstruction method, called Sparse2DGS, to enhance 2DGS in reconstructing objects using only three images. Sparse2DGS employs DUSt3R, a fundamental model for stereo images, along with COLMAP MVS to generate highly accurate and dense 3D point clouds, which are then used to initialize 2D Gaussians. Through experiments on the DTU dataset, we show that Sparse2DGS can accurately reconstruct the 3D shapes of objects using just three images. The project page is available at https://gsisaoki.github.io/SPARSE2DGS/

Authors:Nagito Saito, Shintaro Ito, Koichi Ito, Takafumi Aoki
Title: Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation
Abstract:
Semantic segmentation is a fundamental task in medical image analysis and autonomous driving, but it suffers from the high cost of annotating the labels required for training. To address this problem, semantic segmentation methods based on semi-supervised learning with a small number of labeled data have been proposed. For example, one approach is to train a semantic segmentation model using images with annotated labels and pseudo labels. In this approach, the accuracy of the semantic segmentation model depends on the quality of the pseudo labels, and the quality of the pseudo labels depends on the performance of the model to be trained and the amount of data with annotated labels. In this paper, we generate pseudo labels using zero-shot annotation with the Segment Anything Model (SAM) and Contrastive Language-Image Pretraining (CLIP), improve the accuracy of the pseudo labels using the Unified Dual-Stream Perturbations Approach (UniMatch), and use them as enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through experiments on the public PASCAL and MS COCO datasets. The project web page is available at: https://gsisaoki.github.io/ZERO-SHOT-PLG/
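A hedged sketch of the zero-shot labeling pipeline: SAM proposes class-agnostic masks, CLIP scores each masked crop against the class names, and the winning class is written into a pseudo-label map (which UniMatch then refines). `sam_generate_masks` and `clip_scores` are placeholder wrappers, not the actual SAM/CLIP APIs.

```python
import numpy as np

def zero_shot_pseudo_labels(image, class_names, sam_generate_masks, clip_scores):
    """image: (H, W, 3); returns an (H, W) pseudo-label map with 0 as background."""
    h, w = image.shape[:2]
    pseudo = np.zeros((h, w), dtype=np.int64)
    for mask in sam_generate_masks(image):           # boolean (H, W) masks from SAM
        crop = image * mask[..., None]               # keep only the masked region
        scores = clip_scores(crop, class_names)      # CLIP similarity per class name
        pseudo[mask] = int(np.argmax(scores)) + 1    # class ids start at 1
    return pseudo
```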

Authors:You Wang, Li Fang, Hao Zhu, Fei Hu, Long Ye, Zhan Ma
Title: GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis
Abstract:
Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. Code is available at https://github.com/KLMAV-CUC/GoLF-NRT.
中文:GoLF-NRT是一种新颖的神经渲染变换器,通过融合全局场景上下文与局部几何特征,仅需1-3张输入视图即可实现高质量新视角合成,其自适应采样策略使该方法在性能上超越现有技术。
English: GoLF-NRT is a novel neural rendering transformer that combines global scene context and local geometric features to achieve high-quality novel view synthesis from as few as 1-3 input views, outperforming existing methods through its adaptive sampling strategy.

Authors:Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
Title: Efficient Multi-modal Long Context Learning for Training-free Adaptation
Abstract:
Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Codes are publicly available at https://github.com/Zehong-Ma/EMLoC.
中文: 本文提出EMLoC方法,通过将示范示例嵌入模型输入并采用分块压缩与自适应剪枝技术,无需训练即可高效处理多模态长上下文学习,且保持性能无损。
English: This paper introduces EMLoC, a training-free method that embeds demonstration examples into model inputs and uses chunk-wise compression with adaptive pruning to efficiently handle multi-modal long-context learning without performance loss.
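To give a feel for layer-wise adaptive pruning under a Jensen-Shannon divergence constraint, the sketch below drops the least-salient tokens one at a time and stops once the JS divergence between the layer's output distribution with and without the dropped tokens exceeds a budget. The saliency scores and `pruned_dist_fn` are placeholders for quantities the real method computes internally.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + 1e-12) / (b + 1e-12))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def prune_tokens(saliency: np.ndarray, full_dist: np.ndarray, pruned_dist_fn,
                 budget: float = 0.05) -> np.ndarray:
    """Greedily drop the least-salient tokens while the JS divergence between the
    full and pruned output distributions stays under the budget."""
    keep = np.ones_like(saliency, dtype=bool)
    for idx in np.argsort(saliency):                 # least salient first
        keep[idx] = False
        if js_divergence(full_dist, pruned_dist_fn(keep)) > budget:
            keep[idx] = True                         # undo the violating drop
            break
    return keep
```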

Authors:Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Title: The Missing Point in Vision Transformers for Universal Image Segmentation
Abstract:
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.
中文: 本文提出ViT-P,一种两阶段分割框架,将掩码生成与分类解耦,利用基于视觉变换器的模型提高精度和适应性,无需预训练,在多个数据集上实现了最先进的性能。
English: This paper introduces ViT-P, a two-stage segmentation framework that decouples mask generation from classification, using a Vision Transformer-based model to enhance accuracy and adaptability without pre-training, achieving state-of-the-art results across multiple datasets.

Authors:Li Fang, Hao Zhu, Longlong Chen, Fei Hu, Long Ye, Zhan Ma
Title: Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction
Abstract:
Recent advancements in generalizable novel view synthesis have achieved impressive quality through interpolation between nearby views. However, rendering high-resolution images remains computationally intensive due to the need for dense sampling of all rays. Recognizing that natural scenes are typically piecewise smooth and sampling all rays is often redundant, we propose a novel depth-guided bundle sampling strategy to accelerate rendering. By grouping adjacent rays into a bundle and sampling them collectively, a shared representation is generated for decoding all rays within the bundle. To further optimize efficiency, our adaptive sampling strategy dynamically allocates samples based on depth confidence, concentrating more samples in complex regions while reducing them in smoother areas. When applied to ENeRF, our method achieves up to a 1.27 dB PSNR improvement and a 47% increase in FPS on the DTU dataset. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art rendering quality and up to 2x faster rendering compared to existing generalizable methods. Code is available at https://github.com/KLMAV-CUC/GDB-NeRF.
Chinese: 本文提出了一种深度引导的束采样策略,通过自适应地分组光线和动态分配样本,在多个数据集上实现了更优的渲染质量和最高达两倍的加速效果。
English: This paper introduces a depth-guided bundle sampling strategy that accelerates novel view synthesis by adaptively grouping rays and dynamically allocating samples, achieving superior rendering quality and up to twice the speed on multiple datasets.

Authors:Mobina Mansoori, Sajjad Shahabodini, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
Title: Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models
Abstract:
Foundation models are large-scale models, pre-trained on massive datasets, that perform a wide range of tasks. These models have shown consistently improved results with the introduction of new methods. It is crucial to analyze how these trends impact the medical field and determine whether these advancements can drive meaningful change. This study investigates the application of recent state-of-the-art foundation models, DINOv2, MAE, VMamba, CoCa, SAM2, and AIMv2, for medical image classification. We explore their effectiveness on datasets including CBIS-DDSM for mammography, ISIC2019 for skin lesions, APTOS2019 for diabetic retinopathy, and CHEXPERT for chest radiographs. By fine-tuning these models and evaluating their configurations, we aim to understand the potential of these advancements in medical image classification. The results indicate that these advanced models significantly enhance classification outcomes, demonstrating robust performance despite limited labeled data. Based on our results, AIMv2, DINOv2, and SAM2 models outperformed others, demonstrating that progress in natural domain training has positively impacted the medical domain and improved classification outcomes. Our code is publicly available at: https://github.com/sajjad-sh33/Medical-Transfer-Learning.
中文: 本研究评估了DINOv2和AIMv2等先进基础模型在医学图像数据集上的表现,结果表明这些模型显著提升了分类准确性和鲁棒性,即使在标注数据有限的情况下也表现优异。
English: This study evaluates advanced foundation models like DINOv2 and AIMv2 on medical image datasets, showing they significantly improve classification accuracy and robustness, even with limited labeled data.
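The fine-tuning recipe the study relies on is conceptually simple. The sketch below is a hedged illustration, assuming the public DINOv2 torch.hub entry point, a 768-dimensional embedding, and AdamW hyperparameters that are not taken from the paper; it attaches a fresh linear head (here sized for the 5 APTOS2019 grades) and updates the whole network.

```python
import torch
import torch.nn as nn

# Load a natural-domain backbone; the DINOv2 hub entry point and the 768-d
# embedding size are assumptions about one of the evaluated models.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
head = nn.Linear(768, 5)  # e.g., the 5 diabetic-retinopathy grades of APTOS2019

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-5, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    features = backbone(images)                # (B, 768) image embedding
    loss = criterion(head(features), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```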

Authors:Hala Djeghim, Nathan Piasco, Luis Roldão, Moussab Bennehar, Dzmitry Tsishkou, Céline Loscos, Désiré Sidibé
Title: SAIL: Self-supervised Albedo Estimation from Real Images with a Latent Diffusion Model
Abstract:
Intrinsic image decomposition aims at separating an image into its underlying albedo and shading components, isolating the base color from lighting effects to enable downstream applications such as virtual relighting and scene editing. Despite the rise and success of learning-based approaches, intrinsic image decomposition from real-world images remains a significantly challenging task due to the scarcity of labeled ground-truth data. Most existing solutions rely on synthetic data in supervised setups, limiting their ability to generalize to real-world scenes. Self-supervised methods, on the other hand, often produce albedo maps that contain reflections and lack consistency under different lighting conditions. To address this, we propose SAIL, an approach designed to estimate albedo-like representations from single-view real-world images. We repurpose the prior knowledge of a latent diffusion model for unconditioned scene relighting as a surrogate objective for albedo estimation. To extract the albedo, we introduce a novel intrinsic image decomposition fully formulated in the latent space. To guide the training of our latent diffusion model, we introduce regularization terms that constrain both the lighting-dependent and independent components of our latent image decomposition. SAIL predicts stable albedo under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data available online.
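English: SAIL estimates albedo-like representations from single real-world images by repurposing a latent diffusion model trained for scene relighting, using a latent-space intrinsic decomposition with lighting-aware regularization to produce albedo that remains stable across illumination changes without labeled data.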

Authors:Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
Title: HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance
Abstract:
Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research lacks sufficient focus on these issues, as both problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive-negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: https://github.com/gobunu/HAODiff.
中文摘要:本研究提出HAODiff模型,通过三重分支双提示引导技术,在单步扩散中有效修复同时存在运动模糊和通用噪声的人像图像,在合成与真实数据集上均超越现有最优方法。
English Summary: The study introduces HAODiff, a human-aware one-step diffusion model enhanced with triple-branch dual-prompt guidance, which effectively restores human-centered images degraded by both motion blur and generic noise, outperforming existing methods on synthetic and real-world benchmarks.
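The dual-prompt guidance ultimately drives a standard classifier-free guidance update with an explicit negative prompt in a single denoising step. The sketch below shows that update only; the denoiser callable, embedding shapes, and guidance scale are placeholders (assumptions), not the HAODiff implementation.

```python
import torch

def cfg_step(denoiser, x_t, t, pos_emb, neg_emb, guidance_scale=4.5):
    # Classifier-free guidance with an explicit negative prompt: the positive
    # prompt pulls toward the restoration target, the negative prompt pushes
    # away from degradations such as motion blur and generic noise.
    eps_pos = denoiser(x_t, t, pos_emb)
    eps_neg = denoiser(x_t, t, neg_emb)
    return eps_neg + guidance_scale * (eps_pos - eps_neg)

# Toy stand-in denoiser just to make the sketch runnable.
denoiser = lambda x, t, emb: 0.1 * x
guided = cfg_step(denoiser, torch.randn(1, 4, 64, 64), torch.tensor([999]),
                  torch.randn(1, 77, 768), torch.randn(1, 77, 768))
print(guided.shape)
```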

Authors:Wen Yin, Yong Wang, Guiduo Duan, Dongyang Zhang, Xin Hu, Yuan-Fang Li, Tao He
Title: Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition
Abstract:
Visual Emotion Recognition (VER) is a critical yet challenging task aimed at inferring emotional states of individuals based on visual cues. However, existing works focus on single domains, e.g., realistic images or stickers, limiting VER models' cross-domain generalizability. To fill this gap, we introduce an Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from the source domain (e.g., realistic images) to the low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to the conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: a significant emotional expression variability and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework. Specifically, KCDP leverages a VLM to align emotional representations in a shared knowledge space and guides diffusion models for improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our model surpasses SOTA models in both perceptibility and generalization, e.g., gaining 12% improvements over the SOTA VER model TGCA-PVT. The project page is at https://yinwen2019.github.io/ucdver.
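English: The paper formulates unsupervised cross-domain visual emotion recognition and proposes the KCDP framework, which aligns emotional representations in a shared VLM knowledge space to guide diffusion-based affective perception and uses counterfactual-enhanced language-image alignment to generate pseudo-labels for the target domain, yielding roughly 12% gains over the prior best VER model.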

Authors:Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Yining Shi, Chuang Zhang, Sifa Zheng
Title: DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving
Abstract:
Camera sensor simulation plays a critical role for autonomous driving (AD), e.g. evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates presented in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines, proposing an information-preserving control mechanism. This control mechanism not only improves conditional controllability, but also can be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With the above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability across spatial-level (camera parameter variations) and temporal-level (video frame rate variations), enabling flexible user-customizable camera simulation tailored to diverse application scenarios. Code will be available at https://github.com/swc-17/DriveCamSim for facilitating future research.
Chinese: DriveCamSim提出了一种创新的相机仿真框架,通过显式相机建模和信息保留控制机制,实现了可适应不同相机参数与帧率的高质量灵活视频生成,突破了现有方法的局限性。
English: DriveCamSim introduces a novel camera simulation framework with Explicit Camera Modeling and an information-preserving control mechanism to enable flexible, high-quality video generation adaptable to various camera parameters and frame rates, overcoming limitations of existing methods.

Authors:Piyush Tiwary, Kinjawl Bhattacharyya, Prathosh A. P
Title: LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation
Abstract:
Medical image segmentation models often struggle to generalize across different domains due to various reasons. Domain Generalization (DG) methods overcome this either through representation learning or data augmentation (DAug). While representation learning methods seek domain-invariant features, they often rely on ad-hoc techniques and lack formal guarantees. DAug methods, which enrich model representations through synthetic samples, have shown comparable or superior performance to representation learning approaches. We propose LangDAug, a novel Langevin Data Augmentation method for multi-source domain generalization in 2D medical image segmentation. LangDAug leverages Energy-Based Models (EBMs) trained via contrastive divergence to traverse between source domains, generating intermediate samples through Langevin dynamics. Theoretical analysis shows that LangDAug induces a regularization effect, and for GLMs, it upper-bounds the Rademacher complexity by the intrinsic dimensionality of the data manifold. Through extensive experiments on Fundus segmentation and 2D MRI prostate segmentation benchmarks, we show that LangDAug outperforms state-of-the-art domain generalization methods and effectively complements existing domain-randomization approaches. The codebase for our method is available at https://github.com/backpropagator/LangDAug.
中文:LangDAug是一种新颖的医学图像分割领域泛化方法,通过朗之万动力学和基于能量的模型在源域间生成中间样本,在实验中表现出优越性能并具有理论正则化保证。
English: LangDAug is a novel domain generalization method for medical image segmentation that uses Langevin dynamics and energy-based models to generate intermediate samples between source domains, demonstrating superior performance and providing theoretical regularization guarantees.
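The core augmentation step, Langevin dynamics under a trained energy-based model, has a standard form. The sketch below is a schematic version under assumed settings (step count, step size, and the toy quadratic energy are not from the paper); a real run would plug in an EBM trained via contrastive divergence between two source domains.

```python
import torch

def langevin_augment(energy_fn, x, steps=20, step_size=0.01):
    """Noisy gradient steps on an EBM energy, starting from a source-domain batch."""
    x = x.clone().requires_grad_(True)
    for _ in range(steps):
        grad, = torch.autograd.grad(energy_fn(x).sum(), x)
        with torch.no_grad():
            # Langevin update: descend the energy plus injected Gaussian noise.
            x = x - 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(x)
        x.requires_grad_(True)
    return x.detach()

# Toy quadratic "energy" standing in for an EBM trained between two source domains.
energy_fn = lambda x: (x ** 2).flatten(1).sum(dim=1)
augmented = langevin_augment(energy_fn, torch.randn(4, 3, 64, 64))
print(augmented.shape)
```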

Authors:Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou
Title: Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval
Abstract:
Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins. Through extensive experiments, we demonstrate that strategic modality curation and tailored training protocols are pivotal for robust cross-modal representation learning. This work not only advances MIR performance but also provides a foundational blueprint for future research in multimodal systems. Our project is available at https://friedrichor.github.io/projects/UNITE.
中文: 本文提出UNITE通用框架,通过数据筛选和模态感知训练配置解决多模态信息检索中的模态差异问题,采用模态感知掩码对比学习等创新方法,在多个基准测试中取得了领先性能。
English: This paper introduces UNITE, a universal framework addressing multimodal information retrieval challenges through data curation and modality-aware training, achieving state-of-the-art results by mitigating modal gaps with innovative strategies like Modal-Aware Masked Contrastive Learning.
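One way to picture the masked contrastive objective is shown below. This is an interpretation, not the UNITE code: for each query, candidates from modalities other than the positive's modality are masked out of the softmax, so modalities do not compete against each other as negatives. All tensor names and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def mamcl_loss(queries, candidates, cand_modalities, pos_idx, temperature=0.07):
    """queries: (B, D); candidates: (M, D); cand_modalities: (M,) int labels;
    pos_idx: (B,) index of the positive candidate for each query."""
    q = F.normalize(queries, dim=-1)
    c = F.normalize(candidates, dim=-1)
    logits = q @ c.T / temperature                          # (B, M) similarities
    pos_mod = cand_modalities[pos_idx]                      # modality of each positive
    same_mod = cand_modalities[None, :] == pos_mod[:, None]
    logits = logits.masked_fill(~same_mod, float("-inf"))   # drop cross-modality negatives
    return F.cross_entropy(logits, pos_idx)

q = torch.randn(4, 128); c = torch.randn(10, 128)
mods = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])         # e.g., 0 = image, 1 = text
print(mamcl_loss(q, c, mods, pos_idx=torch.tensor([0, 1, 5, 6])))
```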

Authors:Ruolin Shen, Xiaozhong Ji, Kai WU, Jiangning Zhang, Yijun He, HaiHua Yang, Xiaobin Hu, Xiaoyu Sun
Title: Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning
Abstract:
Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes which effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimics human visual camouflaged perception to progressively and iteratively 'refocus' visual concealed content. The refocus is a progressive guidance mechanism enabling models to logically localize objects in visual images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior cognitive knowledge. In this paper, we propose a visual refocus reinforcement framework via the policy optimization algorithm to encourage multi-modal models to think and refocus more before answering, and achieve excellent reasoning abilities to align and even surpass human camouflaged perception systems. Our extensive experiments on camouflaged perception successfully demonstrate the emergence of refocus visual phenomena, characterized by multiple reasoning tokens and dynamic adjustment of the detection box. Besides, experimental results on both camouflaged object classification and detection tasks exhibit significantly superior performance compared to Supervised Fine-Tuning (SFT) baselines.
中文: 当前多模态模型在识别与背景融合的伪装物体时与人类视觉系统存在显著差异,为此提出的视觉重聚焦强化框架通过逐步推理机制提升了模型性能,在伪装感知任务中甚至超越了人类表现。
English: Current multi-modal models struggle to detect camouflaged objects due to misalignment with human visual cognition, prompting the development of a refocus reinforcement framework that enhances reasoning and surpasses human performance in camouflaged perception tasks.

Authors:Ho Hin Lee, Quan Liu, Shunxing Bao, Yuankai Huo, Bennett A. Landman
Title: Rep3D: Re-parameterize Large 3D Kernels with Low-Rank Receptive Modeling for Medical Imaging
Abstract:
In contrast to vision transformers, which model long-range dependencies through global self-attention, large kernel convolutions provide a more efficient and scalable alternative, particularly in high-resolution 3D volumetric settings. However, naively increasing kernel size often leads to optimization instability and degradation in performance. Motivated by the spatial bias observed in effective receptive fields (ERFs), we hypothesize that different kernel elements converge at variable rates during training. To support this, we derive a theoretical connection between element-wise gradients and first-order optimization, showing that structurally re-parameterized convolution blocks inherently induce spatially varying learning rates. Building on this insight, we introduce Rep3D, a 3D convolutional framework that incorporates a learnable spatial prior into large kernel training. A lightweight two-stage modulation network generates a receptive-biased scaling mask, adaptively re-weighting kernel updates and enabling local-to-global convergence behavior. Rep3D adopts a plain encoder design with large depthwise convolutions, avoiding the architectural complexity of multi-branch compositions. We evaluate Rep3D on five challenging 3D segmentation benchmarks and demonstrate consistent improvements over state-of-the-art baselines, including transformer-based and fixed-prior re-parameterization methods. By unifying spatial inductive bias with optimization-aware learning, Rep3D offers an interpretable, and scalable solution for 3D medical image analysis. The source code is publicly available at https://github.com/leeh43/Rep3D.
Chinese: Rep3D提出了一种带有可学习空间先验的3D卷积框架,通过自适应重加权大核训练中的内核更新,在3D医学图像分割任务中实现了优于现有方法的性能与可扩展性。
English: Rep3D introduces a 3D convolutional framework with a learnable spatial prior that adaptively re-weights kernel updates during large kernel training, achieving improved performance and scalability over existing methods in 3D medical image segmentation.

Authors:Hu Xiaobin, Liang Yujie, Luo Donghao, Peng Xu, Zhang Jiangning, Zhu Junwei, Wang Chengjie, Fu Yanwei
Title: VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
Abstract:
While virtual try-on has achieved significant progress, evaluating these models towards real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) Current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) Most existing test sets are limited to indoor scenarios, lacking complexity for real-world evaluation; and (3) An ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes virtual image try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: (1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions for virtual try-on generation (e.g., overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics of corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. (2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark's alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To foster the field of virtual try-on towards challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
中文: VTBench作为一个分层基准套件被提出,通过弥补现有指标的不足、测试集的局限性,并在多样化现实场景中实现与人类感知一致的评价,系统性地评估虚拟试穿模型。
English: VTBench is introduced as a hierarchical benchmark suite to systematically evaluate virtual try-on models by addressing gaps in current metrics, test set limitations, and the need for human-aligned assessments across diverse real-world scenarios.

Authors:Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Mehrdad Noori, Gustavo Adolfo Vargas Hakim, David Osowiechi, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers
Title: SMART-PC: Skeletal Model Adaptation for Robust Test-Time Training in Point Clouds
Abstract:
Test-Time Training (TTT) has emerged as a promising solution to address distribution shifts in 3D point cloud classification. However, existing methods often rely on computationally expensive backpropagation during adaptation, limiting their applicability in real-world, time-sensitive scenarios. In this paper, we introduce SMART-PC, a skeleton-based framework that enhances resilience to corruptions by leveraging the geometric structure of 3D point clouds. During pre-training, our method predicts skeletal representations, enabling the model to extract robust and meaningful geometric features that are less sensitive to corruptions, thereby improving adaptability to test-time distribution shifts. Unlike prior approaches, SMART-PC achieves real-time adaptation by eliminating backpropagation and updating only BatchNorm statistics, resulting in a lightweight and efficient framework capable of achieving high frame-per-second rates while maintaining superior classification performance. Extensive experiments on benchmark datasets, including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, demonstrate that SMART-PC achieves state-of-the-art results, outperforming existing methods such as MATE in terms of both accuracy and computational efficiency. The implementation is available at: https://github.com/AliBahri94/SMART-PC.
Chinese: SMART-PC提出了一种基于骨架的三维点云分类框架,通过仅更新BatchNorm统计量实现无需反向传播的实时自适应,在保持高帧率的同时获得了最优的分类精度。
English: SMART-PC introduces a skeleton-based framework for 3D point cloud classification that eliminates backpropagation during test-time adaptation, achieving real-time performance and superior accuracy by updating only BatchNorm statistics.
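The backpropagation-free adaptation step can be sketched in a few lines: only BatchNorm running statistics are refreshed by forward passes over the corrupted test stream, with all weights frozen. The helper below is an assumed minimal version, not the released SMART-PC code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_batchnorm_stats(model: nn.Module, test_loader, momentum=0.1):
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()              # re-enable running-stat updates for BN only
            m.momentum = momentum
    for points, _ in test_loader:  # forward passes only; no loss, no backward
        model(points)
    model.eval()
    return model

# Toy usage with a small point-cloud classifier and random test batches.
model = nn.Sequential(nn.Conv1d(3, 16, 1), nn.BatchNorm1d(16), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 10))
loader = [(torch.randn(8, 3, 1024), None) for _ in range(4)]
adapt_batchnorm_stats(model, loader)
```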

Authors:Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li
Title: FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
Abstract:
Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
中文: 大型视觉语言模型因冗余视觉令牌导致计算效率低下,现有剪枝方法仅依赖单层注意力评分存在不足,因此提出FlowCut这一信息流感知框架,通过更贴合模型内在行为来提升性能与速度。
English: Large vision-language models face computational inefficiency from redundant visual tokens, which existing pruning methods inadequately address by relying on single-layer attention scores, prompting the development of FlowCut, an information-flow-aware framework that enhances performance and speed by better aligning with the model's inherent behaviors.
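A rough sketch of the general mechanism (an assumption about how such pruning can be wired up, not FlowCut's exact criterion) is to accumulate CLS-to-token attention over several layers as a proxy for information flow and keep only the top-scoring visual tokens.

```python
import torch

def prune_visual_tokens(attn_per_layer, keep_ratio=0.111):
    """attn_per_layer: list of (num_tokens, num_tokens) attention maps,
    with index 0 assumed to be the CLS token acting as the information relay."""
    num_tokens = attn_per_layer[0].shape[-1]
    flow = torch.zeros(num_tokens - 1)
    for attn in attn_per_layer:                 # aggregate CLS attention across layers
        flow += attn[0, 1:]
    k = max(1, int((num_tokens - 1) * keep_ratio))
    keep = torch.topk(flow, k).indices + 1      # shift past the CLS index
    return torch.cat([torch.tensor([0]), keep.sort().values])

attn_maps = [torch.softmax(torch.randn(577, 577), dim=-1) for _ in range(4)]
print(prune_visual_tokens(attn_maps).shape)     # indices of retained tokens
```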

Authors:Sanghyun Kim, Deunsol Jung, Minsu Cho
Title: Locality-Aware Zero-Shot Human-Object Interaction Detection
Abstract:
Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of a large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results on various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for distinguishing interactions. To address this issue, we devise LAIN, a novel zero-shot HOI detection framework enhancing the locality and interaction awareness of CLIP representations. The locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. The interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representation, LAIN captures detailed information about the human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods on various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.
Chinese: LAIN是一种新颖的零样本人物-物体交互检测框架,通过增强CLIP表征的局部感知能力以捕捉细粒度细节和空间结构,以及交互感知能力以识别人与物体的互动模式,在多种基准测试中表现卓越。
English: LAIN is a novel zero-shot HOI detection framework that enhances CLIP's representations by incorporating locality awareness to capture fine-grained details and spatial structures, and interaction awareness to identify human-object interaction patterns, achieving superior performance on various benchmarks.

Authors:X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang
Title: CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features
Abstract:
Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.
中文摘要:CSTrack通过空间紧凑模块和时间紧凑模块整合RGB-X双模态特征流,实现了简化的计算结构和最优的跟踪性能,在主流基准测试中创下新纪录。
English Summary: CSTrack introduces compact spatiotemporal modeling through spatial and temporal modules that unify RGB-X feature streams, achieving state-of-the-art tracking performance with simplified computation.

Authors:Qiang Hu, Qimei Wang, Jia Chen, Xuantao Ji, Mei Liu, Qiang Li, Zhiwei Wang
Title: Holistic White-light Polyp Classification via Alignment-free Dense Distillation of Auxiliary Optical Chromoendoscopy
Abstract:
White Light Imaging (WLI) and Narrow Band Imaging (NBI) are the two main colonoscopic modalities for polyp classification. While NBI, as optical chromoendoscopy, offers valuable vascular details, WLI remains the most common and often the only available modality in resource-limited settings. However, WLI-based methods typically underperform, limiting their clinical applicability. Existing approaches transfer knowledge from NBI to WLI through global feature alignment but often rely on cropped lesion regions, which are susceptible to detection errors and neglect contextual and subtle diagnostic cues. To address this, this paper proposes a novel holistic classification framework that leverages full-image diagnosis without requiring polyp localization. The key innovation lies in the Alignment-free Dense Distillation (ADD) module, which enables fine-grained cross-domain knowledge distillation regardless of misalignment between WLI and NBI images. Without resorting to explicit image alignment, ADD learns pixel-wise cross-domain affinities to establish correspondences between feature maps, guiding the distillation along the most relevant pixel connections. To further enhance distillation reliability, ADD incorporates Class Activation Mapping (CAM) to filter cross-domain affinities, ensuring the distillation path connects only those semantically consistent regions with equal contributions to polyp diagnosis. Extensive results on public and in-house datasets show that our method achieves state-of-the-art performance, relatively outperforming the other approaches by at least 2.5% and 16.2% in AUC, respectively. Code is available at: https://github.com/Huster-Hq/ADD.
Chinese: 本文提出了一种新颖的整体分类框架,通过无对齐密集蒸馏模块实现从窄带成像到白光成像的细粒度知识迁移,无需息肉定位即可达到最先进的息肉分类性能。
English: This paper introduces a novel holistic classification framework with an Alignment-free Dense Distillation module that enables fine-grained knowledge transfer from Narrow Band Imaging to White Light Imaging for polyp classification, achieving state-of-the-art performance without requiring polyp localization.

Authors:Vivek Gopalakrishnan, Neel Dey, Polina Golland
Title: PolyPose: Localizing Deformable Anatomy in 3D from Sparse 2D X-ray Images using Polyrigid Transforms
Abstract:
Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient's preoperative volume to as few as two X-ray images, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail.
中文摘要:PolyPose 提出了一种通过将三维变形建模为刚性变换组合的鲁棒二维/三维配准方法,仅需两张X光片即可实现精确的患者姿态估计,无需复杂正则化。
English Summary: PolyPose introduces a robust method for 2D/3D registration by modeling 3D deformations as compositions of rigid transforms, enabling accurate patient pose estimation from just two X-rays without complex regularization.
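The polyrigid formulation itself is compact: each point moves by a weighted blend of per-bone rigid transforms. The sketch below uses an assumed Gaussian distance weighting and toy inputs; it is not the released PolyPose code.

```python
import torch

def polyrigid_warp(points, rotations, translations, centers, sigma=30.0):
    """points: (N, 3); rotations: (B, 3, 3); translations, centers: (B, 3)."""
    d2 = ((points[:, None, :] - centers[None]) ** 2).sum(-1)        # (N, B) squared distances
    weights = torch.softmax(-d2 / (2 * sigma ** 2), dim=1)          # soft bone assignment
    moved = torch.einsum("bij,nj->nbi", rotations, points) + translations[None]  # (N, B, 3)
    return (weights.unsqueeze(-1) * moved).sum(dim=1)               # blended displacement

pts = torch.randn(100, 3) * 50
R = torch.eye(3).repeat(2, 1, 1)                                    # two "bones", identity rotations
t = torch.tensor([[0., 0., 5.], [0., 0., -5.]])
centers = torch.tensor([[0., 0., 40.], [0., 0., -40.]])
print(polyrigid_warp(pts, R, t, centers).shape)
```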

Authors:Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
Title: CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
Abstract:
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5x FLOPs reduction and a 10x overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.
中文: 视觉语言模型存在高推理成本问题,本研究揭示了令牌与神经元稀疏性之间的相互增强关系,提出的CoreMatching框架通过协同优化显著提升了效率,在多项任务和硬件上实现了计算量削减和加速效果。
English: Vision-Language Models face high inference costs, and this study reveals a mutual reinforcement between token and neuron sparsity, leading to the CoreMatching framework that significantly boosts efficiency with demonstrated speedups and computational reductions.

Authors:Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning, Wei Li, Lihao Liu, Qiushan Guo, Tianbin Li, Junjun He, Hongming Shan
Title: MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation
Abstract:
Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on the meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.
Chinese: MedITok是首个专为医学影像设计的统一标记器,通过两阶段训练框架融合了视觉细节与临床语义,在多种诊断和生成任务中实现了顶尖性能。
English: MedITok introduces the first unified tokenizer for medical imaging that captures both fine-grained visual details and clinical semantics, enabling state-of-the-art performance across diverse diagnostic and generative tasks through a novel two-stage training framework.

Authors:Jingwei Wu, Zhewei Huang, Chang Liu
Title: Advancing Video Self-Supervised Learning via Image Foundation Models
Abstract:
In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by $3.4\times$ and GPU memory usage by $8.2\times$. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.
Chinese: 提出的AdViSe方法利用预训练图像基础模型并添加时序模块,在实现先进视频自监督学习性能的同时,将训练时间减少3.4倍、GPU内存使用降低8.2倍。
English: The proposed AdViSe method leverages pre-trained image foundation models with added temporal modules to achieve state-of-the-art video self-supervised learning performance while drastically cutting training time by 3.4 times and GPU memory usage by 8.2 times.
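The training recipe, a frozen image foundation model with a small trainable temporal module, can be sketched as follows. Feature shapes, layer sizes, and the playback-rate class count are assumptions rather than the paper's exact configuration; the frozen per-frame features would come from the IFM.

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, feat_dim=768, num_rates=4):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv3d(feat_dim, 256, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(256, num_rates)   # predict the playback rate

    def forward(self, frame_feats):                   # (B, T, C, H, W) frozen per-frame features
        x = frame_feats.permute(0, 2, 1, 3, 4)        # -> (B, C, T, H, W) for 3D convolution
        x = self.temporal(x).flatten(1)
        return self.classifier(x)

frozen_feats = torch.randn(2, 8, 768, 14, 14)         # produced by a frozen IFM, no gradients
logits = TemporalHead()(frozen_feats)
print(logits.shape)                                   # (2, 4)
```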

Authors:Tyler Ward, Aaron Moseley, Abdullah-Al-Zubaer Imran
Title: Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation
Abstract:
Segmentation is one of the most important tasks in the medical imaging pipeline as it influences a number of image-based decisions. To be effective, fully supervised segmentation approaches require large amounts of manually annotated training data. However, the pixel-level annotation process is expensive, time-consuming, and error-prone, hindering progress and making it challenging to perform effective segmentations. Therefore, models must learn efficiently from limited labeled data. Self-supervised learning (SSL), particularly contrastive learning via pre-training on unlabeled data and fine-tuning on limited annotations, can facilitate such limited labeled image segmentation. To this end, we propose a novel self-supervised contrastive learning framework for medical image segmentation, leveraging inherent relationships of different images, dubbed PolyCL. Without requiring any pixel-level annotations or unreasonable data augmentations, our PolyCL learns and transfers context-aware discriminant features useful for segmentation from an innovative surrogate, in a task-related manner. Additionally, we integrate the Segment Anything Model (SAM) into our framework in two novel ways: as a post-processing refinement module that improves the accuracy of predicted masks using bounding box prompts derived from coarse outputs, and as a propagation mechanism via SAM 2 that generates volumetric segmentations from a single annotated 2D slice. Experimental evaluations on three public computed tomography (CT) datasets demonstrate that PolyCL outperforms fully-supervised and self-supervised baselines in both low-data and cross-domain scenarios. Our code is available at https://github.com/tbwa233/PolyCL.
中文: 提出的PolyCL框架通过自监督对比学习在有限标注数据下实现有效的医学图像分割,结合SAM进行掩码优化和体积传播,并在CT数据集上展现出优越性能。
English: The proposed PolyCL framework uses self-supervised contrastive learning to achieve effective medical image segmentation with limited labeled data, integrating SAM for mask refinement and volumetric propagation while demonstrating superior performance on CT datasets.

Authors:Jiayi Xin, Sukwon Yun, Jie Peng, Inyoung Choi, Jenna L. Ballard, Tianlong Chen, Qi Long
Title: I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Abstract:
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
中文:提出的I2MoE框架通过专门设计的交互专家显式建模多样化多模态交互,并在局部和全局层面提供可解释性,从而提升多模态融合效果,在多个数据集上验证了其性能优势。
English: The proposed I2MoE framework enhances multimodal fusion by explicitly modeling diverse interactions through specialized experts and providing interpretability at both local and global levels, demonstrating improved performance across various datasets.
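A schematic version of the expert-plus-reweighting design is sketched below; layer sizes, the number of experts, and the names are assumptions. The returned weights are what provides the sample-level interpretation described in the abstract.

```python
import torch
import torch.nn as nn

class InteractionMoE(nn.Module):
    def __init__(self, dim=128, num_experts=4, num_classes=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))
             for _ in range(num_experts)])
        self.reweight = nn.Linear(dim, num_experts)    # sample-level importance scores

    def forward(self, fused):
        weights = torch.softmax(self.reweight(fused), dim=-1)         # (B, E)
        outs = torch.stack([e(fused) for e in self.experts], dim=1)   # (B, E, C)
        logits = (weights.unsqueeze(-1) * outs).sum(dim=1)
        return logits, weights   # weights expose which interaction expert mattered

logits, weights = InteractionMoE()(torch.randn(8, 128))
print(logits.shape, weights.shape)
```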

Authors:Ryosei Hara, Wataru Ikeda, Masashi Hatano, Mariko Isogawa
Title: EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction
Abstract:
Reconstructing a 3D hand mesh is a challenging but important task for human-computer interaction and AR/VR applications. In particular, RGB and/or depth cameras have been widely used in this task. However, methods using these conventional cameras face challenges in low-light environments and during motion blur. Thus, to address these limitations, event cameras have been attracting attention in recent years for their high dynamic range and high temporal resolution. Despite their advantages, event cameras are sensitive to background noise or camera motion, which has limited existing studies to static backgrounds and fixed cameras. In this study, we propose EventEgoHands, a novel method for event-based 3D hand mesh reconstruction in an egocentric view. Our approach introduces a Hand Segmentation Module that extracts hand regions, effectively mitigating the influence of dynamic background events. We evaluated our approach and demonstrated its effectiveness on the N-HOT3D dataset, improving MPJPE by more than 4.5 cm (approximately 43%).
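English: EventEgoHands reconstructs 3D hand meshes from egocentric event-camera streams, using a Hand Segmentation Module to suppress dynamic background events, and improves MPJPE by more than 4.5 cm (about 43%) on the N-HOT3D dataset.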

Authors:Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
Title: JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models
Abstract:
We introduce JEDI, a test-time adaptation method that enhances subject separation and compositional alignment in diffusion models without requiring retraining or external supervision. JEDI operates by minimizing semantic entanglement in attention maps using a novel Jensen-Shannon divergence based objective. To improve efficiency, we leverage adversarial optimization, reducing the number of updating steps required. JEDI is model-agnostic and applicable to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. Additionally, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions. Code and results are available at https://ericbill21.github.io/JEDI/.
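English: JEDI is a test-time adaptation method that disentangles subjects in diffusion models by minimizing semantic entanglement in attention maps with a Jensen-Shannon divergence objective, improving prompt alignment on Stable Diffusion 1.5 and 3.5 without retraining and providing a CLIP-free disentanglement score.

The objective itself is easy to state: push apart the attention distributions of different subject tokens under a Jensen-Shannon divergence. The sketch below is an illustrative version with assumed map sizes, not the JEDI implementation.

```python
import torch

def jensen_shannon(p, q, eps=1e-8):
    # Treat each attention map as a distribution over spatial locations.
    p = p.flatten() + eps; q = q.flatten() + eps
    p = p / p.sum(); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a / b).log()).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def disentanglement_loss(attn_subject_a, attn_subject_b):
    # Negative JSD: minimizing this loss pushes the two subjects' attention apart.
    return -jensen_shannon(attn_subject_a, attn_subject_b)

a = torch.softmax(torch.randn(16 * 16), dim=0).view(16, 16)
b = torch.softmax(torch.randn(16 * 16), dim=0).view(16, 16)
print(disentanglement_loss(a, b))
```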

Authors:Shengdong Han, Shangdong Yang, Xin Zhang, Yuxuan Li, Xiang Li, Jian Yang, Ming-Ming Cheng, Yimian Dai
Title: DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
Abstract:
Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models. Our code and dataset are available at https://github.com/GrokCV/GrokCSO.
中文: 本研究提出了首个专为解混密集红外小目标设计的深度学习模型DISTA-Net,通过动态生成参数实现实时重构,并建立了包含基准数据集、评估指标和工具包的开源生态系统以推动该领域研究。
English: This study introduces DISTA-Net, the first deep learning model tailored for resolving closely-spaced infrared small targets by dynamically generating parameters for real-time reconstruction, and establishes an open-source ecosystem including a benchmark dataset, evaluation metric, and toolkit to advance research in this field.
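DISTA-Net builds on iterative shrinkage-thresholding; the classic ISTA iteration is sketched below, with a threshold_fn hook standing in (as an assumption) for the dynamically generated per-iteration weights and thresholds described in the abstract.

```python
import torch

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(torch.abs(x) - tau, min=0.0)

def ista(A, y, num_iters=50, step=None, threshold_fn=lambda k, x: 0.05):
    """Solve min_x 0.5 * ||Ax - y||^2 + lambda * ||x||_1 with a (possibly dynamic) threshold."""
    if step is None:
        step = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2   # 1 / L with L = ||A||_2^2
    x = torch.zeros(A.shape[1])
    for k in range(num_iters):
        grad = A.T @ (A @ x - y)                               # gradient of the data term
        x = soft_threshold(x - step * grad, step * threshold_fn(k, x))
    return x

A = torch.randn(64, 256); x_true = torch.zeros(256); x_true[:5] = 1.0
y = A @ x_true                                                 # toy sparse recovery problem
print(ista(A, y)[:8])
```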

Authors:Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Abstract:
The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.
中文: AI效率研究的重点正从模型为中心的扩展转向数据为中心的令牌压缩,以解决长序列在大语言模型中引起的计算瓶颈问题。
English: The focus of AI efficiency research is shifting from model-centric scaling to data-centric token compression to overcome computational bottlenecks caused by long sequences in large language models.

Authors:Xiaoyang Liu, Bolin Qiu, Jiezhang Cao, Zheng Chen, Yulun Zhang, Xiaokang Yang
Title: Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition
Abstract:
Image demoiréing remains a challenging task due to the complex interplay between texture corruption and color distortions caused by moiré patterns. Existing methods, especially those relying on direct image-to-image restoration, often fail to disentangle these intertwined artifacts effectively. While wavelet-based frequency-aware approaches offer a promising direction, their potential remains underexplored. In this paper, we present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency spatially-localized textures and low-frequency scale-robust color distortions, which are then handled by a dual-branch architecture tailored to their distinct characteristics. We further propose a learnable Frequency Composition Transform (FCT) module to adaptively fuse the frequency-specific outputs, enabling consistent and high-fidelity reconstruction. To better aggregate the spatial dependencies and the inter-channel complementary information, we introduce a Spatial-Aware Channel Attention (SA-CA) module that refines moiré-sensitive regions without incurring high computational cost. Extensive experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size. The code is publicly available at https://github.com/xyLiu339/Freqformer.
Chinese: Freqformer提出了一种基于Transformer的图像去摩尔纹框架,通过将摩尔纹有效分解为高频纹理和低频颜色失真,并采用双分支结构和专用模块,以紧凑的模型实现了最先进的性能。
English: Freqformer introduces a Transformer-based framework for image demoiréing by effectively separating moiré patterns into high-frequency textures and low-frequency color distortions, utilizing a dual-branch architecture and specialized modules to achieve state-of-the-art performance with a compact model.
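The frequency split can be illustrated with a single-level Haar decomposition, separating a low-frequency band (global color distortions) from high-frequency bands (localized moiré-like textures). This is an assumed stand-in, not Freqformer's actual decomposition or its learnable composition transform.

```python
import torch

def haar_decompose(img):
    """img: (C, H, W) with even H, W. Returns the low band and stacked high bands."""
    a = img[:, 0::2, 0::2]; b = img[:, 0::2, 1::2]
    c = img[:, 1::2, 0::2]; d = img[:, 1::2, 1::2]
    low  = (a + b + c + d) / 4.0                      # LL: low-frequency content
    high = torch.stack([(a - b + c - d) / 4.0,        # horizontal detail
                        (a + b - c - d) / 4.0,        # vertical detail
                        (a - b - c + d) / 4.0], 0)    # diagonal detail
    return low, high

img = torch.rand(3, 256, 256)
low, high = haar_decompose(img)
print(low.shape, high.shape)   # (3, 128, 128) and (3, 3, 128, 128)
```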

Authors:Chuming Shen, Wei Wei, Xiaoye Qu, Yu Cheng
Title: SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards
Abstract:
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, in the multimodal domain, works have begun to directly apply RL to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks share an intrinsically different nature from textual tasks, which heavily rely on the understanding of the input image to solve the problem. Therefore, such free-form reasoning faces two critical limitations in the VQA task: (1) Extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy. (2) Unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, in this paper, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement Learning), which decomposes VQA into three verifiable stages, including global image captioning, region localization, and answer prediction, each supplying explicit reward signals. Furthermore, we also introduce VQA-Verify, a 12k dataset annotated with answer-aligned captions and bounding-boxes to facilitate training. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to 15.7% improvement in accuracy compared to the R1-like baseline. Our analysis of the attention map confirms enhanced focus on critical regions, which brings improvements in accuracy. Our code is available at https://github.com/justairr/SATORI-R1.
中文: 本文提出的SATORI方法将视觉问答分解为可验证的阶段,通过明确的奖励信号增强关键区域关注,实现了最高15.7%的准确率提升。
English: The paper introduces SATORI, a method that enhances Visual Question Answering by decomposing it into verifiable stages with explicit rewards, improving accuracy by up to 15.7% and focusing on critical regions.

Authors:Benjamin Clavié, Florian Brand
Title: ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models
Abstract:
Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks...), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short, text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench.
中文摘要:ReadBench是一个专门评估大型视觉语言模型在文本密集图像上阅读理解能力的新基准,发现模型在处理长篇内容时表现显著下降,而文本分辨率影响甚微。
English Summary: ReadBench is a new benchmark designed to assess how well Large Vision-Language Models read and reason about text-rich images, revealing significant performance drops with longer textual content despite minimal impact from text resolution.
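The benchmark construction step, rendering a text-only context into an image while leaving the question textual, is straightforward to sketch. The rendering settings below (page size, default font, line height) are assumptions, not the ReadBench generation pipeline.

```python
from PIL import Image, ImageDraw

def render_text_page(text, width=800, height=1000, margin=20, line_height=18):
    # Draw each line of the benchmark context onto a white page image.
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    y = margin
    for line in text.splitlines():
        draw.text((margin, y), line, fill="black")
        y += line_height
    return page

page = render_text_page("Context paragraph 1...\nContext paragraph 2...")
page.save("readbench_page.png")   # the question itself stays as a text prompt
```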

Authors:Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen
Title: Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Abstract:
Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at https://github.com/VIPL-GENUN/Jodi.
中文: 本文提出Jodi框架,通过联合建模图像与多标签域的统一扩散方法,在生成与理解任务中表现优异,并展现出对广泛视觉领域的强大扩展能力。
English: This paper introduces Jodi, a diffusion framework that unifies visual generation and understanding through joint modeling of images and multiple labels, demonstrating strong performance across three task types and extensibility to diverse visual domains.

Authors:Jiashuo Chang, Zhengyi Li, Jianxun Lou, Zhen Qiu, Hanhe Lin
Title: MMP-2K: A Benchmark Multi-Labeled Macro Photography Image Quality Assessment Database
Abstract:
Macro photography (MP) is a specialized field of photography that captures objects at an extremely close range, revealing tiny details. Although an accurate macro photography image quality assessment (MPIQA) metric can benefit macro photograph capturing, which is vital in some domains such as scientific research and medical applications, the lack of MPIQA data limits the development of MPIQA metrics. To address this limitation, we conducted a large-scale MPIQA study. Specifically, to ensure diversity both in content and quality, we sampled 2,000 MP images from 15,700 MP images, collected from three public image websites. For each MP image, 17 (out of 21 after outlier removal) quality ratings and a detailed quality report of distortion magnitudes, types, and positions are gathered by a lab study. The images, quality ratings, and quality reports form our novel multi-labeled MPIQA database, MMP-2k. Experimental results showed that the state-of-the-art generic IQA metrics underperform on MP images. The database and supplementary materials are available at https://github.com/Future-IQA/MMP-2k.
中文: 本研究通过大规模实验建立了MMP-2k多标签宏观摄影图像质量评估数据库,发现现有通用图像质量评估指标在宏观摄影图像上表现不佳。
English: This study introduces the MMP-2k database, a multi-labeled macro photography image quality assessment resource developed through a large-scale lab study, revealing that current generic IQA metrics perform poorly on macro images.

Authors:Xin Ma, Yaohui Wang, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Title: Training-free Stylized Text-to-Image Generation with Fast Inference
Abstract:
Although diffusion models exhibit impressive generative capabilities, existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images, which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a novel stylized image generation method leveraging a pre-trained large-scale diffusion model without requiring fine-tuning or any additional optimization, termed as OmniPainter. Specifically, we exploit the self-consistency property of latent consistency models to extract the representative style statistics from reference style images to guide the stylization process. Additionally, we then introduce the norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate output content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images. Our qualitative and quantitative experimental results demonstrate that the proposed method outperforms state-of-the-art approaches.
中文摘要:OmniPainter方法无需微调预训练扩散模型,通过提取参考图像风格特征并采用新型自注意力机制,实现了高效且高质量的图像风格化生成。
English Summary: The proposed OmniPainter method enables efficient stylized image generation using pre-trained diffusion models without fine-tuning, by extracting style statistics from reference images and employing a novel self-attention mechanism to produce high-quality stylized outputs.

Authors:Minzhi Lin, Tianchi Xie, Mengchen Liu, Yilin Ye, Changjian Chen, Shixia Liu
Title: InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Abstract:
Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.
中文摘要:InfoChartQA是一个评估多模态大语言模型理解信息图表能力的新基准,通过对比分析配对的信息图表和普通图表,揭示了模型在设计元素特别是视觉隐喻问题上的显著性能差距。
English Summary: InfoChartQA is a new benchmark designed to evaluate multimodal large language models' ability to understand infographic charts with design elements, revealing significant performance gaps especially on visual metaphor questions through comparative analysis of paired infographic and plain charts.

Authors:Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, Tong Ruan
Title: Can Multimodal Large Language Models Understand Spatial Relations?
Abstract:
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting the future research directions. The benchmark and codes are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.
Chinese: 该研究提出了SpatialMQA,这是一个基于COCO2017的人工标注空间关系推理基准,旨在解决现有基准依赖边界框和忽视图像理解的问题,实验表明当前最先进的多模态大语言模型准确率仅为48.14%,远低于人类98.40%的水平。
English: The study introduces SpatialMQA, a human-annotated benchmark to improve spatial relation reasoning in multimodal large language models by addressing issues like reliance on bounding boxes and lack of image understanding, revealing that current models perform poorly with only 48.14% accuracy compared to human-level performance.

Authors:Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
Title: VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
Abstract:
Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
Chinese Summary: 强化学习应用于视频大语言模型虽前景广阔,但存在数据瓶颈和不稳定性问题,提出的VerIPO方法通过验证器引导的迭代优化,有效提升推理链质量并实现高效训练。
English Summary: Reinforcement Learning applied to Video Large Language Models shows promise but faces data bottlenecks and instability, which the proposed VerIPO method overcomes by using a verifier-guided iterative optimization to enhance reasoning chain quality efficiently.
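As an illustration of the GRPO-Verifier-DPO loop described above, the following is a minimal, hypothetical sketch in which the callables (sample_rollouts, grpo_step, verifier, dpo_step) are assumed interfaces rather than the authors' released API.

```python
# Hypothetical sketch of the GRPO-Verifier-DPO alternation; all callables are
# assumed interfaces, not the paper's actual code.
from typing import Callable, List

def veripo_loop(
    prompts: List[str],
    sample_rollouts: Callable[[str, int], List[str]],   # policy sampling
    grpo_step: Callable[[str, List[str]], None],         # outcome-reward GRPO update
    verifier: Callable[[str, str], float],               # small-LLM judge of reasoning logic
    dpo_step: Callable[[str, str, str], None],           # DPO update on (prompt, chosen, rejected)
    group_size: int = 8,
    iterations: int = 3,
) -> None:
    for _ in range(iterations):
        # Phase 1: GRPO explores with grouped rollouts and outcome rewards.
        for p in prompts:
            rollouts = sample_rollouts(p, group_size)
            grpo_step(p, rollouts)
        # Phase 2: the rollout-aware verifier scores reasoning quality and
        # builds contrastive pairs for the fast, targeted DPO stage.
        for p in prompts:
            rollouts = sample_rollouts(p, group_size)
            scored = sorted(rollouts, key=lambda r: verifier(p, r), reverse=True)
            chosen, rejected = scored[0], scored[-1]
            dpo_step(p, chosen, rejected)
```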

Authors:Tianchen Deng, Wenhua Wu, Junjie He, Yue Pan, Xirui Jiang, Shenghai Yuan, Danwei Wang, Hesheng Wang, Weidong Chen
Title: VPGS-SLAM: Voxel-based Progressive 3D Gaussian SLAM in Large-Scale Scenes
Abstract:
3D Gaussian Splatting has recently shown promising results in dense visual SLAM. However, existing 3DGS-based SLAM methods are all constrained to small-room scenarios and struggle with memory explosion in large-scale scenes and long sequences. To this end, we propose VPGS-SLAM, the first 3DGS-based large-scale RGBD SLAM framework for both indoor and outdoor scenarios. We design a novel voxel-based progressive 3D Gaussian mapping method with multiple submaps for compact and accurate scene representation in large-scale and long-sequence scenes. This allows us to scale up to arbitrary scenes and improves robustness (even under pose drifts). In addition, we propose a 2D-3D fusion camera tracking method to achieve robust and accurate camera tracking in both indoor and outdoor large-scale scenes. Furthermore, we design a 2D-3D Gaussian loop closure method to eliminate pose drift. We further propose a submap fusion method with online distillation to achieve global consistency in large-scale scenes when detecting a loop. Experiments on various indoor and outdoor datasets demonstrate the superiority and generalizability of the proposed framework. The code will be open source on https://github.com/dtc111111/vpgs-slam.
中文: VPGS-SLAM是首个基于3D高斯泼溅的大规模RGBD SLAM框架,通过体素渐进式建图和2D-3D融合跟踪技术,有效解决了先前方法在大型场景中的内存爆炸问题,实现了室内外场景的鲁棒定位与建图。
English: VPGS-SLAM is the first 3D Gaussian Splatting-based RGBD SLAM framework that scales to large indoor and outdoor scenes through voxel-progressive mapping and 2D-3D fusion tracking, overcoming prior limitations of memory explosion and small-scale constraints.

Authors:Tianyu Zhang, Xinyu Wang, Lu Li, Zhenghan Tai, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
Title: STRICT: Stress Test of Rendering Images Containing Text
Abstract:
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce $\textbf{STRICT}$, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text, and (3) the ratio of not following instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.
中文:扩散模型在生成逼真图像方面表现出色,但在图像中生成一致且清晰文本方面存在不足,因此我们开发了STRICT基准来系统评估文本长度、正确性和指令遵循能力,揭示了持续存在的局限并推动了未来研究方向。
English: Diffusion models excel in realistic image generation but fail to produce consistent and legible text, leading to the creation of the STRICT benchmark for systematic evaluation across text length, correctness, and instruction adherence, revealing persistent limitations and motivating future research.

Authors:Jeffrey A. Chan-Santiago, Praveen Tirupattur, Gaurav Kumar Nayak, Gaowen Liu, Mubarak Shah
Title: MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models
Abstract:
Dataset distillation has emerged as an effective strategy, significantly reducing training costs and facilitating more efficient model deployment. Recent advances have leveraged generative models to distill datasets by capturing the underlying data distribution. Unfortunately, existing methods require model fine-tuning with distillation losses to encourage diversity and representativeness. However, these methods do not guarantee sample diversity, limiting their performance. We propose a mode-guided diffusion model leveraging a pre-trained diffusion model without the need to fine-tune with distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance. Our approach outperforms state-of-the-art methods, achieving accuracy gains of 4.4%, 2.9%, 1.6%, and 1.6% on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, respectively. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs. Our code is available on the project webpage: https://jachansantiago.github.io/mode-guided-distillation/
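The three stages named in the abstract (Mode Discovery, Mode Guidance, Stop Guidance) can be pictured with the following sketch; the embedding model, clustering choice, and guidance schedule are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of mode discovery and a stop-guidance schedule; the
# encoder, diffusion sampler, and guidance form are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def discover_modes(class_features: np.ndarray, n_modes: int = 8) -> np.ndarray:
    """Mode Discovery: cluster per-class image embeddings; centroids act as modes."""
    km = KMeans(n_clusters=n_modes, n_init=10, random_state=0).fit(class_features)
    return km.cluster_centers_

def guidance_weight(step: int, total_steps: int, scale: float = 2.0,
                    stop_frac: float = 0.8) -> float:
    """Mode Guidance up to a cutoff step, then Stop Guidance to avoid artifacts."""
    return scale if step < stop_frac * total_steps else 0.0
```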

Authors:Jiong Wu, Yang Xing, Boxiao Yu, Wei Shao, Kuang Gong
Title: CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation
Abstract:
Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model's ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: https://github.com/wujiong-hub/CDPDNet.git.
中文: 本研究提出CDPDNet新型分割网络,通过融合视觉特征与文本嵌入解决医学影像中的部分标注问题,并在多数据集实验中展现出优于现有方法的性能。
English: The study introduces CDPDNet, a novel segmentation network that integrates vision and text embeddings to address incomplete labeling and anatomical relationship modeling in medical imaging, demonstrating superior performance over existing methods.

Authors:Yining Pan, Qiongjie Cui, Xulei Yang, Na Zhao
Title: How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation
Abstract:
LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder's ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available at .
中文: 提出的图像辅助激光雷达(IAL)框架通过同步数据增强和创新融合模块,有效结合激光雷达与相机数据,解决了多模态错位问题并优化特征整合,从而在三维全景分割中实现了最先进的性能。
English: The proposed Image-Assists-LiDAR (IAL) framework introduces synchronized data augmentation and novel fusion modules to effectively combine LiDAR and camera data, achieving state-of-the-art 3D panoptic segmentation by addressing misalignment and enhancing feature integration.

Authors:Hyunho Ha, Lei Xiao, Christian Richardt, Thu Nguyen-Phuoc, Changil Kim, Min H. Kim, Douglas Lanman, Numair Khan
Title: Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency
Abstract:
We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields in the synthesized view's image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.

Authors:Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille
Title: Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
Abstract:
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and are still not meeting clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in DeepTumorVQA testing performance, making RadFM stand out among all VLMs. (2) Our dataset exposes critical differences in VLM components, where proper image preprocessing and design of vision modules significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.
中文: DeepTumorVQA作为诊断基准,旨在系统评估视觉语言模型在三维临床诊断中的表现,发现现有模型在测量任务中表现良好,但在病灶识别和推理方面存在不足,且大规模多模态预训练对性能提升至关重要。
English: DeepTumorVQA is introduced as a diagnostic benchmark to evaluate Vision-Language Models' capabilities in 3D clinical diagnosis, revealing that current models excel in measurement tasks but fall short in lesion recognition and reasoning, with large-scale multimodal pretraining being crucial for performance.

Authors:Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica
Title: Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Abstract:
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at \href{https://github.com/svg-project/Sparse-VideoGen}{https://github.com/svg-project/Sparse-VideoGen}.
中文:SVG2提出了一种无需训练的框架,通过语义感知的令牌聚类和重排序,提升关键令牌识别精度并优化计算效率,在保持高质量视频生成的同时实现了显著加速。
English: SVG2 introduces a training-free framework with semantic-aware token clustering and reordering to enhance critical token identification and computational efficiency, achieving significant speedups while maintaining high video generation quality.
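The semantic-aware permutation can be sketched as follows: tokens are clustered by feature similarity with k-means and reordered so that same-cluster tokens become contiguous, giving block-sparse attention a dense layout. The shapes and the plain k-means loop are assumptions for illustration; the released kernels are not reproduced here.

```python
# Illustrative token clustering and permutation under assumed shapes.
import torch

def semantic_permutation(tokens: torch.Tensor, k: int = 16, iters: int = 10):
    """tokens: (N, D). Returns (perm, inverse_perm, labels)."""
    N, D = tokens.shape
    # Simple k-means on token features.
    centers = tokens[torch.randperm(N)[:k]].clone()
    for _ in range(iters):
        d = torch.cdist(tokens, centers)           # (N, k) distances
        labels = d.argmin(dim=1)                   # nearest-center assignment
        for c in range(k):
            mask = labels == c
            if mask.any():
                centers[c] = tokens[mask].mean(dim=0)
    perm = torch.argsort(labels, stable=True)      # group same-cluster tokens together
    inverse_perm = torch.argsort(perm)
    return perm, inverse_perm, labels

# Usage: x_perm = x[perm]; run block-sparse attention on x_perm; x = x_perm[inverse_perm].
```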

Authors:Libin Lan, Yanxin Li, Xiaojuan Liu, Juan Zhou, Jianxun Zhang, Nannan Huang, Yudong Zhang
Title: MSLAU-Net: A Hybrid CNN-Transformer Network for Medical Image Segmentation
Abstract:
Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at https://github.com/Monsoon49/MSLAU-Net.
中文: 提出的MSLAU-Net混合架构通过整合多尺度线性注意力机制和轻量级自上而下特征聚合策略,有效克服了CNN与Transformer模型的各自局限,在多种医学影像基准测试中均实现了最优性能。
English: The proposed MSLAU-Net hybrid architecture effectively overcomes the limitations of CNN and Transformer models by integrating multi-scale linear attention for efficient global context modeling and a lightweight top-down aggregation mechanism, achieving superior performance across multiple medical imaging benchmarks.

Authors:Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao
Title: VORTA: Efficient Video Diffusion via Routing Sparse Attention
Abstract:
Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose \textbf{VORTA}, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a $1.76\times$ end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.
Chinese: VORTA通过引入稀疏注意力机制和自适应路由策略,在保持视频生成质量的同时,将视频扩散变压器的处理速度最高提升14.41倍。
English: VORTA accelerates Video Diffusion Transformers by introducing a sparse attention mechanism and adaptive routing strategy, achieving up to 14.41× speedup without compromising video generation quality.
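A hypothetical sketch of the routing idea is shown below; the set of attention variants and the gating network are illustrative assumptions, not the paper's routing strategy or kernels.

```python
# Hypothetical router that replaces full 3D attention with a chosen sparse variant.
import torch
import torch.nn as nn

class AttentionRouter(nn.Module):
    def __init__(self, dim: int, variants):
        super().__init__()
        self.variants = nn.ModuleList(variants)          # e.g. [spatial, temporal, window]
        self.gate = nn.Linear(dim, len(variants))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        scores = self.gate(x.mean(dim=1))                 # (B, V): route on pooled tokens
        idx = scores.argmax(dim=-1)                       # hard routing per sample
        out = torch.empty_like(x)
        for v, variant in enumerate(self.variants):
            mask = idx == v
            if mask.any():
                out[mask] = variant(x[mask])              # run the chosen sparse variant
        return out
```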

Authors:Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Title: Think Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation
Abstract:
Deepfake (DF) detectors face significant challenges when deployed in real-world environments, particularly when encountering test samples deviated from training data through either postprocessing manipulations or distribution shifts. We demonstrate postprocessing techniques can completely obscure generation artifacts presented in DF samples, leading to performance degradation of DF detectors. To address these challenges, we propose Think Twice before Adaptation (\texttt{T$^2$A}), a novel online test-time adaptation method that enhances the adaptability of detectors during inference without requiring access to source training data or labels. Our key idea is to enable the model to explore alternative options through an Uncertainty-aware Negative Learning objective rather than solely relying on its initial predictions as commonly seen in entropy minimization (EM)-based approaches. We also introduce an Uncertain Sample Prioritization strategy and Gradients Masking technique to improve the adaptation by focusing on important samples and model parameters. Our theoretical analysis demonstrates that the proposed negative learning objective exhibits complementary behavior to EM, facilitating better adaptation capability. Empirically, our method achieves state-of-the-art results compared to existing test-time adaptation (TTA) approaches and significantly enhances the resilience and generalization of DF detectors during inference. Code is available \href{https://github.com/HongHanh2104/T2A-Think-Twice-Before-Adaptation}{here}.
中文摘要:提出的“三思而后适应”(T²A)方法通过不确定性感知的负向学习和优先样本适配策略,有效提升了深度伪造检测器在推理过程中对后处理操作和分布偏移的抵御能力。
English Summary: The proposed Think Twice before Adaptation (T²A) method enhances deepfake detector resilience against postprocessing manipulations and distribution shifts through uncertainty-aware negative learning and prioritized sample adaptation during inference.
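One assumed form of an uncertainty-aware negative-learning objective is sketched below: instead of reinforcing its initial prediction, as entropy minimization does, the model is discouraged from over-committing to that prediction on uncertain samples. The exact objective, the sample-prioritization rule, and the gradient masking from the paper are not reproduced.

```python
# Assumed form of an uncertainty-weighted negative-learning loss for test-time adaptation.
import math
import torch
import torch.nn.functional as F

def uncertainty_aware_negative_loss(logits: torch.Tensor) -> torch.Tensor:
    """logits: (B, C) predictions on unlabeled test samples."""
    probs = F.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1)
    p_pred = probs.gather(1, pred.unsqueeze(1)).squeeze(1)
    # Negative learning on the initial prediction: "maybe NOT this class",
    # which lets the model explore alternatives instead of reinforcing itself.
    neg_nll = -torch.log1p(-p_pred.clamp(max=1 - 1e-6))
    # Uncertainty weight: normalized entropy, so only ambiguous samples are pushed.
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    weight = ent / math.log(probs.shape[-1])
    return (weight * neg_nll).mean()
```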

Authors:Jiayu Wang, Yang Jiao, Yue Yu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang
Title: OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks
Abstract:
Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks and a powerful LLM-based judger for cognition-centric tasks to assess the alignment between generated images and user instructions. Using OmniGenBench, we evaluate mainstream generative models, including prevalent models like GPT-4o, Gemini-2.0-Flash, and Seedream, and provide in-depth comparisons and analyses of their performance. Code and data are available at https://github.com/emilia113/OmniGenBench.
中文: 摘要介绍了OmniGenBench,这是一个全面的基准测试,旨在通过多样化的现实任务和双模式评估协议,评估大型多模态模型在感知和认知维度上的指令遵循能力。
English: The abstract introduces OmniGenBench, a comprehensive benchmark designed to evaluate the instruction-following capabilities of large multimodal models across perception and cognition dimensions, using diverse real-world tasks and a dual-mode assessment protocol.

Authors:Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang
Title: Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation
Abstract:
Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available in https://github.com/smile365317/ABP.
中文: 近期文本生成图像模型进展显著,但现有基准主要关注图像与提示的显式对齐,忽略了超越提示的现实世界知识对齐,因此开发了ABP基准和ABPScore指标,揭示了当前模型的局限性,并提出无需训练的策略使性能提升约43%。
English: Recent text-to-image models have advanced significantly, but existing benchmarks overlook alignment with real-world knowledge beyond prompts, leading to the development of the ABP benchmark and ABPScore metric, which reveal limitations in current models and propose a training-free strategy that improves performance by 43%.

Authors:Chun Wang, Xiaoran Pan, Zihao Pan, Haofan Wang, Yiren Song
Title: GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains
Abstract:
Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. However, geo-localization presents unique challenges, requiring the extraction of multigranular visual cues from images and their integration with external world knowledge for systematic reasoning. Current approaches to geo-localization tasks often lack robust reasoning mechanisms and explainability, limiting their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.
中文: GRE套件通过结构化推理链增强视觉语言模型,实现精准可解释的地理定位,其数据集、模型和基准测试在所有粒度上均优于现有方法。
English: The GRE Suite enhances Visual Language Models with structured reasoning chains for precise and interpretable geo-localization, introducing a dataset, model, and benchmark that outperform existing methods across all granularities.

Authors:Yang Liu, Silin Cheng, Xinwei He, Sebastien Ourselin, Lei Tan, Gen Luo
Title: WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation
Abstract:
Weakly supervised referring expression comprehension(WREC) and segmentation(WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement(DVFE) and Collaborative Consistency Module(CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO. The code is publicly available at https://github.com/MRUIL/WeakMCN.
中文: 本文提出WeakMCN多任务协同网络,通过双分支架构结合动态视觉特征增强和协同一致性模块,联合学习弱监督指代表达式理解与分割任务,在多个基准测试中显著超越单任务方法并展现出优异的半监督泛化能力。
English: The paper introduces WeakMCN, a multi-task collaborative network that jointly learns weakly supervised referring expression comprehension and segmentation through a dual-branch architecture with innovative modules, achieving superior performance on benchmarks and strong generalization in semi-supervised settings.

Authors:Bin Ren, Yawei Li, Xu Zheng, Yuqian Fu, Danda Pani Paudel, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe
Title: Manifold-aware Representation Learning for Degradation-agnostic Image Restoration
Abstract:
Image Restoration (IR) aims to recover high quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a unified and lightweight framework for all in one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches, each processed by a specialized module attention for global context, convolution for local textures, and MLP for channel-wise statistics. This modular decomposition significantly improves generalization and efficiency across diverse degradations. Furthermore, we introduce a cross layer contrastive learning scheme that aligns shallow and latent features to enhance the discriminability of shared representations. To better capture the underlying geometry of feature representations, we perform contrastive learning in a Symmetric Positive Definite (SPD) manifold space rather than the conventional Euclidean space. Extensive experiments show that MIRAGE not only achieves new state of the art performance across a variety of degradation types but also offers a scalable solution for challenging all-in-one IR scenarios. Our code and models will be publicly available at https://amazingren.github.io/MIRAGE/.
中文摘要:MIRAGE提出了一种统一的图像恢复框架,通过将特征分解为全局、局部和通道三个并行分支进行专门处理,并在SPD流形空间进行对比学习,实现了针对多种退化类型的最优性能。
English Summary: MIRAGE is a unified image restoration framework that decomposes features into specialized parallel branches for global, local, and channel processing, achieving state-of-the-art performance across diverse degradations through contrastive learning in SPD manifold space.
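The three-branch decomposition can be pictured with the sketch below, where the input feature is split channel-wise and processed by attention (global context), a depth-wise convolution (local texture), and an MLP (channel statistics). The branch designs and the SPD-manifold contrastive loss are assumed simplifications, not the paper's modules.

```python
# Assumed simplification of a three-branch feature block (attention / conv / MLP).
import torch
import torch.nn as nn

class TriBranchBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % 3 == 0
        d = dim // 3
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv = nn.Conv2d(d, d, kernel_size=3, padding=1, groups=d)  # depth-wise, local
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        b, c, h, w = x.shape
        xa, xc, xm = x.chunk(3, dim=1)
        # Global-context branch: attention over flattened spatial tokens.
        t = xa.flatten(2).transpose(1, 2)                  # (B, HW, C/3)
        t, _ = self.attn(t, t, t)
        xa = t.transpose(1, 2).reshape(b, c // 3, h, w)
        # Local-texture branch: depth-wise convolution.
        xc = self.conv(xc)
        # Channel-statistics branch: per-pixel MLP over channels.
        xm = self.mlp(xm.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return torch.cat([xa, xc, xm], dim=1) + x          # residual fuse
```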

Authors:Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Title: DVD-Quant: Data-free Video Diffusion Transformers Quantization
Abstract:
Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on lengthy, computation-heavy calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $δ$-Guided Bit Switching ($δ$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on HunyuanVideo while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at https://github.com/lhxcs/DVD-Quant.
中文: DVD-Quant提出了一种无数据量化框架,通过PBQ、ARQ和δ-GBS三项创新技术,实现了视频DiT模型的W4A4训练后量化,在保持视觉质量的同时提速约2倍,且无需校准数据。
English: DVD-Quant introduces a data-free quantization framework with three innovations (PBQ, ARQ, and δ-GBS) that enable efficient W4A4 post-training quantization for Video DiTs, achieving near 2× speedup while preserving visual quality without calibration data.
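The abstract does not detail PBQ or ARQ, so the sketch below is only a generic, calibration-free 4-bit weight quantizer that searches a per-channel clipping bound by minimizing reconstruction error; it illustrates what bounded, data-free quantization can look like but is not the paper's algorithm.

```python
# Generic data-free 4-bit weight quantization with per-channel bound search (assumption).
import torch

def quantize_weight_4bit(w: torch.Tensor, n_candidates: int = 40) -> torch.Tensor:
    """w: (out_channels, in_features). Returns a dequantized 4-bit approximation of w."""
    qmax = 7                                    # signed 4-bit integer range [-8, 7]
    out = torch.empty_like(w)
    for i in range(w.shape[0]):                 # per-output-channel bound search
        row = w[i]
        max_abs = row.abs().max().clamp_min(1e-8)
        best_err, best = float("inf"), row
        for frac in torch.linspace(0.5, 1.0, n_candidates):
            scale = (frac * max_abs) / qmax     # candidate clipping bound -> step size
            q = torch.clamp((row / scale).round(), -8, 7)
            deq = q * scale
            err = (deq - row).pow(2).sum().item()
            if err < best_err:
                best_err, best = err, deq
        out[i] = best
    return out
```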

Authors:Yicheng Lin, Yunlong Jiang, Xujia Jiao, Bin Han
Title: Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU
Abstract:
Robust long-term visual localization in complex industrial environments is critical for mobile robotic systems. Existing approaches face limitations: handcrafted features are illumination-sensitive, learned features are computationally intensive, and semantic- or marker-based methods are environmentally constrained. Handcrafted and learned features share similar representations but differ functionally. Handcrafted features are optimized for continuous tracking, while learned features excel in wide-baseline matching. Their complementarity calls for integration rather than replacement. Building on this, we propose a hierarchical localization framework. It leverages real-time handcrafted feature extraction for relative pose estimation. In parallel, it employs selective learned keypoint detection on optimized keyframes for absolute positioning. This design enables CPU-efficient, long-term visual localization. Experiments systematically progress through three validation phases: Initially establishing feature complementarity through comparative analysis, followed by computational latency profiling across algorithm stages on CPU platforms. Final evaluation under photometric variations (including seasonal transitions and diurnal cycles) demonstrates 47% average error reduction with significantly improved localization consistency. The code implementation is publicly available at https://github.com/linyicheng1/ORB_SLAM3_localization.
中文摘要:该研究提出一种分层视觉定位框架,通过实时手工特征进行相对位姿跟踪,结合选择性学习特征实现绝对定位,在光照变化下平均误差降低47%,显著提升定位一致性并保持CPU高效运行。
English Summary: The proposed hierarchical visual localization framework integrates real-time handcrafted features for relative pose tracking with selective learned features for absolute positioning, achieving 47% error reduction and enhanced consistency under photometric variations while maintaining CPU efficiency.

Authors:Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li
Title: Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter
Abstract:
Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pretrained Diffusion Transformers (DiTs) model, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation direction for the modulation process of concept-related text tokens. It incorporates vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pretraining strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.

Authors:Yuetong Fang, Deming Zhou, Ziqing Wang, Hongwei Ren, ZeCui Zeng, Lusong Li, Shibo Zhou, Renjing Xu
Title: Spiking Transformers Need High Frequency Information
Abstract:
Spiking Transformers offer an energy-efficient alternative to conventional deep learning by transmitting information solely through binary (0/1) spikes. However, there remains a substantial performance gap compared to artificial neural networks. A common belief is that their binary and sparse activation transmission leads to information loss, thus degrading feature representation and accuracy. In this work, however, we reveal for the first time that spiking neurons preferentially propagate low-frequency information. We hypothesize that the rapid dissipation of high-frequency components is the primary cause of performance degradation. For example, on Cifar-100, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73%; interestingly, replacing it with Max-Pooling (high-pass) pushes the top-1 accuracy to 79.12%, surpassing the well-tuned Spikformer baseline by 0.97%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: extra Max-Pooling in patch embedding and Depth-Wise Convolution in place of self-attention. Notably, our Max-Former (63.99 M) hits the top-1 accuracy of 82.39% on ImageNet, showing a +7.58% improvement over Spikformer with comparable model size (74.81%, 66.34 M). We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks, beyond the established practice in standard deep learning. \href{https://github.com/bic-L/Spiking-Transformers-Need-High-Frequency-Information}{Code} is available.
中文: 脉冲变换器因偏好传播低频信息而导致高频信号损失和性能下降;提出的Max-Former模型通过引入最大池化和深度卷积增强高频成分,在ImageNet等基准测试中显著提升了准确率。
English: Spiking Transformers are energy-efficient but underperform due to their tendency to propagate low-frequency information, which causes high-frequency signal loss and reduced accuracy; the proposed Max-Former model enhances performance by incorporating high-frequency components through Max-Pooling and Depth-Wise Convolution, achieving significant accuracy improvements on benchmarks like ImageNet.
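The frequency argument above can be illustrated with a PoolFormer-style token mixer in which average pooling (low-pass) is swapped for max pooling (high-pass); the shapes and the surrounding spiking blocks are assumptions for illustration.

```python
# Pooling-based token mixer illustrating the low-pass vs. high-pass swap.
import torch
import torch.nn as nn

class MaxPoolTokenMixer(nn.Module):
    """Mixes tokens on the 2D grid with max pooling; subtracting the input keeps
    only the mixed component, as in pooling-based token mixers."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.MaxPool2d(pool_size, stride=1, padding=pool_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        return self.pool(x) - x    # high-frequency-preserving mixing

# Swapping nn.MaxPool2d for nn.AvgPool2d here recovers the low-pass mixer that
# the abstract reports as the weaker configuration.
```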

Authors:Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Title: Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
Abstract:
Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/ .
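The scale-autoregressive loop can be sketched as follows; the callables (sr_model, prompt_from_vlm) are assumed interfaces standing in for the prompt-conditioned 4x SR backbone and the fine-tuned VLM prompt extractor, not the released code.

```python
# Schematic scale-autoregressive super-resolution loop with assumed interfaces.
from typing import Callable, List
from PIL import Image

def chain_of_zoom(
    image: Image.Image,
    sr_model: Callable[[Image.Image, str], Image.Image],   # prompt-conditioned 4x SR
    prompt_from_vlm: Callable[[Image.Image, int], str],    # multi-scale-aware prompt
    steps: int = 4,                                         # 4 steps of 4x gives 256x total
) -> List[Image.Image]:
    outputs = []
    current = image
    for step in range(steps):
        prompt = prompt_from_vlm(current, step)    # text cues compensate for vanishing detail
        current = sr_model(current, prompt)        # decompose extreme SR into 4x sub-problems
        outputs.append(current)
    return outputs
```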

Authors:Dongyang Jin, Chao Fan, Jingzhe Ma, Jingkai Zhou, Weihua Chen, Shiqi Yu
Title: On Denoising Walking Videos for Gait Recognition
Abstract:
To capture individual gait patterns, excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models, uncovering how they partially filter out irrelevant factors for gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation termed Gait Feature Field, which further reduces residual noise in diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves a new SoTA performance in most cases for both within- and cross-domain evaluations. Code is available at https://github.com/ShiqiYu/OpenGait.
中文摘要:DenoisingGait提出了一种新颖的步态识别方法,利用生成扩散模型和几何驱动的特征匹配模块来过滤身份无关特征并创建流式步态特征场,在多个数据集上实现了最先进的性能。
English Summary: DenoisingGait introduces a novel gait recognition method using generative diffusion models and a geometry-driven Feature Matching module to filter out identity-irrelevant cues and create a flow-like Gait Feature Field, achieving state-of-the-art performance on multiple datasets.
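A simplified picture of turning per-pixel diffusion features into a two-channel direction field by local matching is sketched below: each foreground pixel points toward its best-matching neighbor. The neighborhood, similarity measure, and border handling are assumptions; the paper's within- and cross-frame matching is richer than this.

```python
# Simplified within-frame matching producing a two-channel direction field (assumption).
import torch
import torch.nn.functional as F

def direction_field(feat: torch.Tensor, mask: torch.Tensor, radius: int = 2) -> torch.Tensor:
    """feat: (C, H, W) per-pixel features; mask: (H, W) silhouette. Returns (2, H, W)."""
    C, H, W = feat.shape
    f = F.normalize(feat, dim=0)
    best_sim = torch.full((H, W), -1.0, device=feat.device)
    direction = torch.zeros(2, H, W, device=feat.device)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            # torch.roll wraps at borders; edge handling is ignored in this sketch.
            shifted = torch.roll(f, shifts=(dy, dx), dims=(1, 2))
            sim = (f * shifted).sum(dim=0)               # cosine similarity with the neighbor
            better = sim > best_sim
            best_sim = torch.where(better, sim, best_sim)
            norm = (dy * dy + dx * dx) ** 0.5
            direction[0] = torch.where(better, torch.full_like(best_sim, dy / norm), direction[0])
            direction[1] = torch.where(better, torch.full_like(best_sim, dx / norm), direction[1])
    return direction * mask.float().unsqueeze(0)          # keep only foreground pixels
```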

Authors:Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Title: ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts
Abstract:
Reasoning Video Object Segmentation is a challenging task, which generates a mask sequence from an input video and an implicit, complex text query. Existing works probe into the problem by finetuning Multimodal Large Language Models (MLLM) for segmentation-based output, while still falling short in difficult cases on videos given temporally-sensitive queries, primarily due to the failure to integrate temporal and spatial information. In this paper, we propose ThinkVideo, a novel framework which leverages the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these challenges. Specifically, ThinkVideo utilizes the CoT prompts to extract object selectivities associated with particular keyframes, then bridging the reasoning image segmentation model and SAM2 video processor to output mask sequences. The ThinkVideo framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. We further extend the framework for online video streams, where the CoT is used to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that ThinkVideo significantly outperforms previous works in both cases, qualitatively and quantitatively.
中文: ThinkVideo是一种创新的免训练框架,利用多模态大语言模型的零样本思维链能力,通过时间语义推理提升推理视频对象分割,在处理复杂查询时显著优于现有方法。
English: ThinkVideo is a novel, training-free framework that leverages the zero-shot Chain-of-Thought capability of Multimodal Large Language Models to enhance Reasoning Video Object Segmentation by integrating temporal-semantic reasoning, significantly outperforming previous methods in handling complex queries.

Authors:Min Cheng, Fatemeh Doudi, Dileep Kalathil, Mohammad Ghavamzadeh, Panganamala R. Kumar
Title: Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models
Abstract:
Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, it can generate images aligned with any user-specified linear combination of rewards and regularization, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with two algorithms: DB-MPA for multi-reward alignment and DB-KLA for KL regularization control. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference-time. The code is available at https://github.com/bluewoods127/DB-2025.
中文: 强化学习虽能使扩散模型对齐单一目标,但难以兼顾用户和场景各异的多重冲突偏好;Diffusion Blend通过在推理时混合微调模型的反向扩散过程,无需额外微调即可按用户指定的奖励组合与正则强度进行对齐。
English: Reinforcement learning has been used to align diffusion models with single objectives, but this approach is limited when balancing multiple conflicting preferences that vary by user and context; the proposed Diffusion Blend addresses this by blending the backward diffusion processes of fine-tuned models at inference time, supporting any user-specified combination of rewards and regularization without additional fine-tuning.
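The blending idea can be sketched as follows: at each denoising step, the noise predictions of models fine-tuned for different basis rewards are combined with user-chosen weights, steering sampling without further fine-tuning. The exact blending rule and the KL-control variant (DB-KLA) are not reproduced; the predictor interface is assumed.

```python
# Assumed-interface sketch of blending noise predictions from reward-specific experts.
from typing import Callable, List
import torch

def blended_eps(
    x_t: torch.Tensor,
    t: torch.Tensor,
    experts: List[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]],  # fine-tuned eps-predictors
    weights: List[float],                                                  # user preference over rewards
) -> torch.Tensor:
    assert len(experts) == len(weights)
    w = torch.tensor(weights, dtype=x_t.dtype, device=x_t.device)
    w = w / w.sum()                              # normalize the preference vector
    eps = torch.zeros_like(x_t)
    for wi, model in zip(w, experts):
        eps = eps + wi * model(x_t, t)           # convex combination of backward processes
    return eps
```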

Authors:Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Title: Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
Abstract:
Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.
中文摘要:强化微调技术显著提升了多模态大语言模型的推理能力,本文通过五大改进方向与未来研究路径,为通用人工智能发展的关键阶段提供了重要见解。
English Summary: Reinforcement fine-tuning significantly enhances the reasoning capabilities of multimodal large language models, as detailed through five key improvements and future research directions in this pivotal AGI development stage.

Authors:Haoyu Yang, Yuxiang Cai, Jintao Chen, Xuhong Zhang, Wenhui Lei, Xiaoming Shi, Jianwei Yin, Yankai Jiang
Title: TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation
Abstract:
3D medical image segmentation is vital for clinical diagnosis and treatment but is challenged by high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. We introduce a novel multimodal framework that leverages Mamba and Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence modeling. Our approach features three key innovations: First, an EGSC (Enhanced Gated Spatial Convolution) module captures spatial information when unfolding 3D images into 1D sequences. Second, we extend Group-Rational KAN (GR-KAN), a Kolmogorov-Arnold Networks variant with rational basis functions, into 3D-Group-Rational KAN (3D-GR-KAN) for 3D medical imaging - its first application in this domain - enabling superior feature representation tailored to volumetric data. Third, a dual-branch text-driven strategy leverages CLIP's text embeddings: one branch swaps one-hot labels for semantic vectors to preserve inter-organ semantic relationships, while the other aligns images with detailed organ descriptions to enhance semantic alignment. Experiments on the Medical Segmentation Decathlon (MSD) and KiTS23 datasets show our method achieving state-of-the-art performance, surpassing existing approaches in accuracy and efficiency. This work highlights the power of combining advanced sequence modeling, extended network architectures, and vision-language synergy to push forward 3D medical image segmentation, delivering a scalable solution for clinical use. The source code is openly available at https://github.com/yhy-whu/TK-Mamba.
中文: 本研究提出了一种结合Mamba和3D群有理KAN的新型多模态框架,通过增强空间建模和文本驱动语义对齐技术,有效解决了三维医学图像分割中的计算效率与空间依赖难题,实现了最先进的性能表现。
English: This study introduces a novel multimodal framework integrating Mamba and 3D-Group-Rational KAN to address computational inefficiency and spatial dependency challenges in 3D medical image segmentation, achieving state-of-the-art performance through enhanced spatial modeling and text-driven semantic alignment.
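The label-embedding idea in the text-driven branch can be sketched as below, where frozen CLIP text embeddings of organ names act as class prototypes and per-voxel logits are similarities to them; the projection head and the feature extractor are assumed components, not the paper's modules.

```python
# Assumed text-prototype segmentation head built on frozen CLIP text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextPrototypeHead(nn.Module):
    def __init__(self, feat_dim: int, text_embeds: torch.Tensor):
        """text_embeds: (num_classes, text_dim) frozen CLIP embeddings of organ names."""
        super().__init__()
        self.register_buffer("prototypes", F.normalize(text_embeds, dim=-1))
        self.proj = nn.Linear(feat_dim, text_embeds.shape[1])    # map voxel features to text space
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, feat_dim, D, H, W) -> logits: (B, num_classes, D, H, W)."""
        v = self.proj(feats.permute(0, 2, 3, 4, 1))              # (B, D, H, W, text_dim)
        v = F.normalize(v, dim=-1)
        # Per-voxel similarity to every class prototype.
        logits = self.logit_scale * torch.einsum("bdhwc,kc->bkdhw", v, self.prototypes)
        return logits
```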

Authors:Yiheng Li, Feng Liang, Dan Kondratyuk, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu
Title: Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility
Abstract:
The substantial training cost of diffusion models hinders their deployment. Immiscible Diffusion recently showed that reducing diffusion trajectory mixing in the noise space via linear assignment accelerates training by simplifying denoising. To extend immiscible diffusion beyond the inefficient linear assignment under high batch sizes and high dimensions, we refine this concept to a broader miscibility reduction at any layer and by any implementation. Specifically, we empirically demonstrate the bijective nature of the denoising process with respect to immiscible diffusion, ensuring its preservation of generative diversity. Moreover, we provide thorough analysis and show step-by-step how immiscibility eases denoising and improves efficiency. Extending beyond linear assignment, we propose a family of implementations including K-nearest neighbor (KNN) noise selection and image scaling to reduce miscibility, achieving up to >4x faster training across diverse models and tasks including unconditional/conditional generation, image editing, and robotics planning. Furthermore, our analysis of immiscibility offers a novel perspective on how optimal transport (OT) enhances diffusion training. By identifying trajectory miscibility as a fundamental bottleneck, we believe this work establishes a potentially new direction for future research into high-efficiency diffusion training. The code is available at https://github.com/yhli123/Immiscible-Diffusion.
中文摘要:本研究提出了一种广义的互溶性降低方法,通过K近邻噪声选择和图像缩放等实现方式,将扩散模型训练速度提升高达4倍以上,同时保持生成多样性,并为最优传输在扩散训练中的作用提供了新视角。
English Summary: This work introduces a generalized miscibility reduction approach that extends beyond linear assignment to accelerate diffusion model training by up to 4x through implementations like KNN noise selection and image scaling, while maintaining generative diversity and offering new insights into optimal transport's role in diffusion training.
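As a rough sketch of the KNN noise selection described above (one of several proposed implementations, not the authors' exact code), the snippet below draws k noise candidates per sample and keeps the nearest one, which is then used in the standard forward-diffusion step. The function name and value of k are assumptions.

```python
import torch

def knn_noise_selection(x0, k=8):
    """For each sample, draw k Gaussian noise candidates and keep the one
    closest to the (flattened) clean sample, reducing trajectory mixing.

    x0: (B, ...) batch of clean images or latents.
    Returns noise of the same shape as x0.
    """
    B = x0.shape[0]
    flat = x0.reshape(B, -1)                                   # (B, N)
    cand = torch.randn(B, k, flat.shape[1], device=x0.device)  # (B, k, N)
    dist = (cand - flat.unsqueeze(1)).pow(2).sum(-1)           # (B, k)
    idx = dist.argmin(dim=1)                                   # nearest candidate per sample
    noise = cand[torch.arange(B), idx]
    return noise.reshape_as(x0)

# The selected noise then replaces torch.randn in the usual forward step:
# x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise.
```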

Authors:Junlin Wang, Zhiyun Lin
Title: Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning
Abstract:
Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present $\textbf{I}$nter-token $\textbf{Con}$trast ($\textbf{ICon}$), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. The project website: https://github.com/HenryWJL/icon
中文: 本文提出ICon方法,通过对视觉变换器的令牌级表征进行对比学习,分离智能体与环境相关令牌,形成具身化视觉表征,从而提升机器人操作策略的学习效果与跨机器人迁移能力。
English: The paper introduces ICon, a contrastive learning method for Vision Transformers that separates agent-specific and environment-specific tokens to create body-relevant visual representations, enhancing robotic manipulation policy learning and transferability.
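The token-level separation that ICon enforces can be illustrated with a small auxiliary loss like the one below; this is a plausible instantiation of the stated idea (separating agent-specific from environment-specific ViT tokens), not necessarily the paper's exact objective. The agent mask, temperature, and prototype-based formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def inter_token_contrast(tokens, agent_mask, temperature=0.1):
    """Auxiliary contrastive loss separating agent and environment tokens.

    tokens:     (B, N, C) ViT token features.
    agent_mask: (B, N) bool, True for tokens covering the robot body.
    Each token is pulled toward the prototype of its own group and pushed
    away from the other group's prototype.
    """
    t = F.normalize(tokens, dim=-1)
    m = agent_mask.unsqueeze(-1).float()
    agent_proto = F.normalize((t * m).sum(1) / m.sum(1).clamp(min=1), dim=-1)          # (B, C)
    env_proto = F.normalize((t * (1 - m)).sum(1) / (1 - m).sum(1).clamp(min=1), dim=-1)
    sim_a = (t * agent_proto.unsqueeze(1)).sum(-1) / temperature   # (B, N)
    sim_e = (t * env_proto.unsqueeze(1)).sum(-1) / temperature
    logits = torch.stack([sim_a, sim_e], dim=-1)                   # (B, N, 2)
    target = (~agent_mask).long()                                  # 0 -> agent, 1 -> environment
    return F.cross_entropy(logits.reshape(-1, 2), target.reshape(-1))
```

In practice this term would be added to the policy loss as the auxiliary objective the abstract mentions.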

Authors:Fukun Liu, Adam T. Greer, Gengchen Mai, Jin Sun
Title: ZooplanktonBench: A Geo-Aware Zooplankton Recognition and Classification Dataset from Marine Observations
Abstract:
Plankton are small drifting organisms found throughout the world's oceans and can be indicators of ocean health. One component of this plankton community is the zooplankton, which includes gelatinous animals and crustaceans (e.g. shrimp), as well as the early life stages (i.e., eggs and larvae) of many commercially important fishes. Being able to monitor zooplankton abundances accurately and understand how populations change in relation to ocean conditions is invaluable to marine science research, with important implications for future marine seafood productivity. While new imaging technologies generate massive amounts of video data of zooplankton, analyzing them using general-purpose computer vision tools turns out to be highly challenging due to the high similarity in appearance between the zooplankton and its background (e.g., marine snow). In this work, we present the ZooplanktonBench, a benchmark dataset containing images and videos of zooplankton associated with rich geospatial metadata (e.g., geographic coordinates, depth, etc.) in various water ecosystems. ZooplanktonBench defines a collection of tasks to detect, classify, and track zooplankton in challenging settings, including highly cluttered environments, living vs non-living classification, objects with similar shapes, and relatively small objects. Our dataset presents unique challenges and opportunities for state-of-the-art computer vision systems to evolve and improve visual understanding in dynamic environments characterized by significant variation and the need for geo-awareness. The code and settings described in this paper can be found on our website: https://lfk118.github.io/ZooplanktonBench_Webpage.

Authors:Jingkai Wang, Wu Miao, Jue Gong, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang
Title: HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model
Abstract:
Face restoration has achieved remarkable advancements through the years of development. However, ensuring that restored facial images exhibit high fidelity, preserve authentic features, and avoid introducing artifacts or biases remains a significant challenge. This highlights the need for models that are more "honest" in their reconstruction from low-quality inputs, accurately reflecting original characteristics. In this work, we propose HonestFace, a novel approach designed to restore faces with a strong emphasis on such honesty, particularly concerning identity consistency and texture realism. To achieve this, HonestFace incorporates several key components. First, we propose an identity embedder to effectively capture and preserve crucial identity features from both the low-quality input and multiple reference faces. Second, a masked face alignment method is presented to enhance fine-grained details and textural authenticity, thereby preventing the generation of patterned or overly synthetic textures and improving overall clarity. Furthermore, we present a new landmark-based evaluation metric. Based on affine transformation principles, this metric improves the accuracy compared to conventional L2 distance calculations for facial feature alignment. Leveraging these contributions within a one-step diffusion model framework, HonestFace delivers exceptional restoration results in terms of facial fidelity and realism. Extensive experiments demonstrate that our approach surpasses existing state-of-the-art methods, achieving superior performance in both visual quality and quantitative assessments. The code and pre-trained models will be made publicly available at https://github.com/jkwang28/HonestFace .
中文摘要:HonestFace是一种新颖的人脸复原方法,通过身份嵌入器和掩码对齐技术强调身份一致性与纹理真实性,实现了卓越的保真度并超越了现有最优方法。
English Summary: HonestFace is a novel face restoration method that emphasizes identity consistency and texture realism through an identity embedder and masked alignment technique, achieving superior fidelity and outperforming state-of-the-art methods.
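A minimal sketch of the landmark metric described above, assuming it removes the best-fitting 2D affine transform before measuring residual error (the exact formulation in the paper may differ):

```python
import numpy as np

def affine_landmark_error(pred, gt):
    """Landmark error after removing the best-fitting 2D affine transform.

    pred, gt: (L, 2) arrays of facial landmarks.
    Fits A, b minimizing ||pred @ A.T + b - gt||^2 via least squares, then
    reports the mean residual distance, so global shifts and scale changes
    do not dominate the score the way a raw L2 distance would.
    """
    X = np.hstack([pred, np.ones((pred.shape[0], 1))])   # (L, 3) homogeneous coords
    M, *_ = np.linalg.lstsq(X, gt, rcond=None)            # (3, 2) affine parameters
    aligned = X @ M
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```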

Authors:Afshin Bozorgpour, Sina Ghorbani Kolahi, Reza Azad, Ilker Hacihaliloglu, Dorit Merhof
Title: CENet: Context Enhancement Network for Medical Image Segmentation
Abstract:
Medical image segmentation, particularly in multi-domain scenarios, requires precise preservation of anatomical structures across diverse representations. While deep learning has advanced this field, existing models often struggle with accurate boundary representation, variability in organ morphology, and information loss during downsampling, limiting their accuracy and robustness. To address these challenges, we propose the Context Enhancement Network (CENet), a novel segmentation framework featuring two key innovations. First, the Dual Selective Enhancement Block (DSEB) integrated into skip connections enhances boundary details and improves the detection of smaller organs in a context-aware manner. Second, the Context Feature Attention Module (CFAM) in the decoder employs a multi-scale design to maintain spatial integrity, reduce feature redundancy, and mitigate overly enhanced representations. Extensive evaluations on both radiology and dermoscopic datasets demonstrate that CENet outperforms state-of-the-art (SOTA) methods in multi-organ segmentation and boundary detail preservation, offering a robust and accurate solution for complex medical image analysis tasks. The code is publicly available at https://github.com/xmindflow/cenet.
中文:提出的上下文增强网络(CENet)通过创新模块提升医学图像分割中的边界精度和空间完整性,在多个数据集上展现出优于现有方法的性能。
English: The proposed Context Enhancement Network (CENet) introduces innovative modules to improve boundary precision and spatial integrity in medical image segmentation, demonstrating superior performance over existing methods across diverse datasets.

Authors:Lin Zhao, Yushu Wu, Xinru Jiang, Jianyang Gu, Yanzhi Wang, Xiaolin Xu, Pu Zhao, Xue Lin
Title: Taming Diffusion for Dataset Distillation with High Representativeness
Abstract:
Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Owing to the powerful image generation capability of diffusion models, they have been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D^3HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation. Source code: https://github.com/lin-zhao-resoLve/D3HR.
中文: 本文提出D^3HR框架,通过DDIM反演和高效采样方案解决当前基于扩散的数据集蒸馏方法中的关键问题,生成高代表性数据集,在不同模型架构下均比现有方法获得更高准确率。
English: This paper introduces D^3HR, a diffusion-based framework that addresses key issues in dataset distillation by using DDIM inversion and an efficient sampling scheme to generate highly representative datasets, achieving superior accuracy across various model architectures compared to existing methods.
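For readers unfamiliar with DDIM inversion, the core update that maps a real-image latent toward the Gaussian domain looks roughly like this; the step schedule and function signature are illustrative assumptions, not the D^3HR code.

```python
import torch

@torch.no_grad()
def ddim_invert(latent, eps_model, alphas_cumprod, num_steps):
    """Deterministic DDIM inversion: map a data latent toward the Gaussian
    prior by running the DDIM update with increasing timesteps.

    latent:         (B, C, H, W) VAE latent of a real image.
    eps_model:      callable (x, t) -> predicted noise, same shape as x.
    alphas_cumprod: (T,) cumulative alpha-bar schedule of the diffusion model.
    """
    T = alphas_cumprod.shape[0]
    ts = torch.linspace(0, T - 1, num_steps).long()
    x = latent
    for i in range(num_steps - 1):
        t, t_next = ts[i], ts[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # predicted clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps    # step toward higher noise
    return x  # high-normality latent from which distilled samples are drawn
```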

Authors:Jianyang Gu, Haonan Wang, Ruoxi Jia, Saeed Vahidian, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
Title: CONCORD: Concept-Informed Diffusion for Dataset Distillation
Abstract:
Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, while overlooking concept completeness at the instance level. The missing or incorrectly represented object details cannot be efficiently compensated due to the constrained sample amount typical in DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (CONCORD) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is released at https://github.com/vimar-gu/CONCORD.
中文: 提出的CONCORD方法通过整合大型语言模型的概念理解,增强了数据集蒸馏的可控性和图像细节准确性,在ImageNet基准测试中取得了领先性能。
English: The proposed CONCORD method enhances dataset distillation by integrating large language models' concept understanding to improve controllability and detail accuracy in generated images, achieving state-of-the-art results on ImageNet benchmarks.

Authors:Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, Katia Sycara
Title: InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
Abstract:
Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
中文摘要:本文提出InstructPart基准,用于评估视觉语言模型在部件级物体理解方面的能力,揭示了现有模型的不足,并通过微调方法实现了性能翻倍的基线改进。
English Summary: This paper introduces InstructPart, a benchmark for evaluating part-level object understanding in vision-language models, revealing current models' limitations and proposing a fine-tuned baseline that doubles performance.

Authors:Natia Kukhilava, Tatia Tsmindashvili, Rapael Kalandadze, Anchit Gupta, Sofio Katamadze, François Brémond, Laura M. Ferrari, Philipp Müller, Benedikt Emanuel Wirth
Title: Evaluation in EEG Emotion Recognition: State-of-the-Art Review and Unified Framework
Abstract:
Electroencephalography-based Emotion Recognition (EEG-ER) has become a growing research area in recent years. Analyzing 216 papers published between 2018 and 2023, we uncover that the field lacks a unified evaluation protocol, which is essential to fairly define the state of the art, compare new approaches and to track the field's progress. We report the main inconsistencies between the used evaluation protocols, which are related to ground truth definition, evaluation metric selection, data splitting types (e.g., subject-dependent or subject-independent) and the use of different datasets. Capitalizing on this state-of-the-art research, we propose a unified evaluation protocol, EEGain (https://github.com/EmotionLab/EEGain), which enables an easy and efficient evaluation of new methods and datasets. EEGain is a novel open source software framework, offering the capability to compare - and thus define - state-of-the-art results. EEGain includes standardized methods for data pre-processing, data splitting, evaluation metrics, and the ability to load the six most relevant datasets (i.e., AMIGOS, DEAP, DREAMER, MAHNOB-HCI, SEED, SEED-IV) in EEG-ER with only a single line of code. In addition, we have assessed and validated EEGain using these six datasets on the four most common publicly available methods (EEGNet, DeepConvNet, ShallowConvNet, TSception). This is a significant step to make research on EEG-ER more reproducible and comparable, thereby accelerating the overall progress of the field.
中文: EEG-ER研究领域缺乏统一的评估标准,为此开发了EEGain开源框架,通过标准化数据处理和评估方法,确保研究结果的可复现性与可比性。
English: EEG-ER research lacks standardized evaluation protocols, prompting the development of EEGain, an open-source framework that ensures reproducible and comparable results by unifying data processing and assessment methods.

Authors:Savya Khosla, Sethuraman TV, Barnett Lee, Alexander Schwing, Derek Hoiem
Title: REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
Abstract:
We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders (DINO, DINOv2, and OpenCLIP) and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. Code and models are available at: https://github.com/savya08/REN.
中文: 区域编码器网络(REN)是一种快速高效的模型,通过点提示直接生成基于区域的图像表示,实现了60倍的速度提升和35倍的内存节省,同时提高了标记质量,并在多项基准测试中超越了现有方法。
English: The Region Encoder Network (REN) is a fast and efficient model that generates region-based image representations directly from point prompts, achieving up to 60x speed and 35x memory reduction while improving token quality and outperforming existing methods in various benchmarks.
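A minimal sketch of the cross-attention design described above (point prompts as queries, frozen patch features as keys and values), written as a single PyTorch block; the dimensions and layer choices are assumptions rather than the released architecture.

```python
import torch.nn as nn

class RegionTokenBlock(nn.Module):
    """One cross-attention block: point-prompt embeddings attend over
    frozen patch features to produce region tokens."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, prompt_tokens, patch_feats):
        # prompt_tokens: (B, P, C) embeddings of the point prompts
        # patch_feats:   (B, N, C) features from DINO / DINOv2 / OpenCLIP
        q = self.norm_q(prompt_tokens)
        kv = self.norm_kv(patch_feats)
        attended, _ = self.attn(q, kv, kv)
        x = prompt_tokens + attended
        return x + self.mlp(x)   # (B, P, C) region tokens, one per prompt
```

Stacking a few such blocks avoids running a full segmenter such as SAM, which is where the claimed speed and memory savings come from.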

Authors:Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, Jiajun Wu
Title: WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions
Abstract:
WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies -- all using a single image input. Code will be made public. Project website: https://kyleleey.github.io/WonderPlay/
中文: WonderPlay是一种创新框架,通过整合物理模拟与视频生成技术,能够从单张图像创建动态3D场景,实现直观的用户控制并处理包括布料、液体在内的多种材质。
English: WonderPlay is an innovative framework that combines physics simulation and video generation to create dynamic 3D scenes from a single image, allowing intuitive user control and handling diverse materials like cloth and liquids.

Authors:Junfeng Wu, Dongliang Luo, Weizhi Zhao, Zhihao Xie, Yuanhao Wang, Junyi Li, Xudong Xie, Yuliang Liu, Xiang Bai
Title: TokBench: Evaluating Your Visual Tokenizer before Visual Generation
Abstract:
In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging visual contents: text and face. Visual tokenizers and VAEs have significantly advanced visual generation and multimodal modeling by providing more efficient compressed or quantized image representations. However, while helping production models reduce computational burdens, the information loss from image compression fundamentally limits the upper bound of visual generation quality. To evaluate this upper bound, we focus on assessing reconstructed text and facial features since they typically: 1) exist at smaller scales, 2) contain dense and rich textures, 3) are prone to collapse, and 4) are highly sensitive to human vision. We first collect and curate a diverse set of clear text and face images from existing datasets. Unlike approaches using VLM models, we employ established OCR and face recognition models for evaluation, ensuring accuracy while maintaining an exceptionally lightweight assessment process requiring just 2GB memory and 4 minutes to complete. Using our benchmark, we analyze text and face reconstruction quality across various scales for different image tokenizers and VAEs. Our results show modern visual tokenizers still struggle to preserve fine-grained features, especially at smaller scales. We further extend this evaluation framework to video, conducting comprehensive analysis of video tokenizers. Additionally, we demonstrate that traditional metrics fail to accurately reflect reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.
中文: 本研究揭示了视觉分词器和VAE在保留细粒度特征方面的局限性,提出了一个轻量级基准来评估文本和人脸在不同尺度下的重建质量,并发现它们在处理小尺度特征时仍存在显著不足。
English: This study exposes the limitations of visual tokenizers and VAEs in retaining fine-grained details, proposing a lightweight benchmark to evaluate text and face reconstruction quality across various scales and revealing their persistent challenges with small-scale features.

Authors:Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan
Title: One RL to See Them All: Visual Triple Unified Reinforcement Learning
Abstract:
Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
中文: V-Triune提出了一种统一强化学习系统,通过样本级数据格式化、验证器级奖励计算和源级指标监控三大组件,使视觉语言模型能够同时掌握推理与感知任务,并在各类基准测试中实现显著性能提升。
English: V-Triune introduces a unified reinforcement learning system that enables vision-language models to jointly master both reasoning and perception tasks, achieving significant performance gains across diverse benchmarks through its triple-component architecture and novel Dynamic IoU reward mechanism.
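The Dynamic IoU reward is described as adaptive, progressive, and definite; one simple way to realize that is sketched below with an assumed linear threshold schedule (the paper's actual schedule and verifier interface may differ).

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def dynamic_iou_reward(pred_box, gt_box, step, warmup_steps=1000,
                       start_thr=0.5, end_thr=0.95):
    """Progressive IoU reward: the acceptance threshold tightens from
    start_thr to end_thr over training, giving easy positive signal early
    and a stricter, more definite target later."""
    frac = min(step / warmup_steps, 1.0)
    thr = start_thr + frac * (end_thr - start_thr)
    iou = box_iou(pred_box, gt_box)
    return iou if iou >= thr else 0.0
```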

Authors:Jacob Hansen, Wei Lin, Junmo Kang, Muhammad Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James Glass, Leonid Karlinsky
Title: Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Abstract:
Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage approach features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ~3% on average and up to 12% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. Additionally, our approach enables effective performance scaling - both in quantity and quality - by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilitates future metadata-to-VisIT data conversion for niche domains, is released at https://github.com/jacob-hansen/Instructify.
Chinese: 本研究提出了Instructify这一开源框架,利用开放大语言模型将图像元数据高效转化为高质量视觉指令调优数据,在超越GPT-4生成数据性能的同时,实现了可扩展、可复现的VisIT数据集构建。
English: The study introduces Instructify, an open-source framework that uses open LLMs to efficiently convert image metadata into high-quality Visual Instruction Tuning data, achieving performance improvements over GPT-4 generated data while enabling scalable and reproducible VisIT dataset creation.

Authors:Ziqiao Peng, Yanbo Fan, Haoyu Wu, Xuan Wang, Hongyan Liu, Jun He, Zhaoxin Fan
Title: DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations
Abstract:
In face-to-face conversations, individuals need to switch between speaking and listening roles seamlessly. Existing 3D talking head generation models focus solely on speaking or listening, neglecting the natural dynamics of interactive conversation, which leads to unnatural interactions and awkward transitions. To address this issue, we propose a new task -- multi-round dual-speaker interaction for 3D talking head generation -- which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. This framework not only synthesizes lifelike talking heads when speaking but also generates continuous and vivid non-verbal feedback when listening, effectively capturing the interplay between the roles. We also create a new dataset featuring 50 hours of multi-round conversations with over 1,000 characters, where participants continuously switch between speaking and listening roles. Extensive experiments demonstrate that our method significantly enhances the naturalness and expressiveness of 3D talking heads in dual-speaker conversations. We recommend watching the supplementary video: https://ziqiaopeng.github.io/dualtalk.

Authors:Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi
Title: CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Abstract:
Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench
中文: 该研究提出了CheXStruct和CXReasonBench,这是一个用于评估大型视觉语言模型在胸部X光分析中临床推理能力的结构化流程和基准,揭示了这些模型在抽象知识与视觉解释结合方面的不足。
English: The study introduces CheXStruct and CXReasonBench, a structured pipeline and benchmark for evaluating the clinical reasoning of Large Vision-Language Models in chest X-ray analysis, revealing their limitations in linking abstract knowledge with visual interpretation despite extensive testing.

Authors:Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Title: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Abstract:
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code has been released in https://github.com/microsoft/DeepVideoDiscovery.
Chinese: Deep Video Discovery 代理通过采用基于工具的自主搜索策略处理分段视频,克服了大型语言模型在处理长视频时的限制,在 LVBench 等基准测试中取得了领先性能。
English: The Deep Video Discovery agent overcomes LLM limitations in processing long videos by employing an autonomous, tool-based search strategy across segmented clips, achieving state-of-the-art results on benchmarks like LVBench.

Authors:Junhao Chen, Mingjin Chen, Jianjin Xu, Xiang Li, Junting Dong, Mingze Sun, Puhua Jiang, Hongxiang Li, Yuhang Yang, Hao Zhao, Xiaoxiao Long, Ruqi Huang
Title: DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation
Abstract:
Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at https://DanceTog.github.io/.

Authors:Kaiyan Zhang, Xinghui Li, Jingyi Lu, Kai Han
Title: Semantic Correspondence: Unified Benchmarking and a Strong Baseline
Abstract:
Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding on existing methods for semantic matching, we thoroughly conduct controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: https://github.com/Visual-AI/Semantic-Correspondence.
Chinese: 本文首次对计算机视觉中的语义对应方法进行了全面综述,提出了分类体系、对比分析,并开发了一个在多个基准测试中达到最优性能的新基线模型。
English: This paper presents the first comprehensive survey of semantic correspondence methods in computer vision, offering a taxonomy, comparative analysis, and a new baseline model that achieves state-of-the-art performance across multiple benchmarks.

Authors:Sudarshan Rajagopalan, Kartik Narayan, Vishal M. Patel
Title: RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration
Abstract:
The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. To address this, we propose RestoreVAR, a novel generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $\mathbf{10\times}$ faster inference. RestoreVAR leverages visual autoregressive modeling (VAR), a recently introduced approach which performs scale-space autoregression for image generation. VAR achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. To optimally exploit these advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities.

Authors:Zizhao Chen, Yoav Artzi
Title: Knot So Simple: A Minimalistic Environment for Spatial Reasoning
Abstract:
We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at https://github.com/lil-lab/knotgym.
中文: KnotGym是一个用于空间推理和操作的交互式环境,通过基于绳结交叉数量的可量化复杂度任务,为评估不同人工智能方法提供了测试平台。
English: KnotGym is an interactive environment for spatial reasoning and manipulation that features goal-oriented rope tasks with scalable complexity based on knot crossings, enabling evaluation of various AI methods.

Authors:Xiaobao Wei, Jiawei Liu, Dongbo Yang, Junda Cheng, Changyong Shu, Wei Wang
Title: A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency
Abstract:
We find that the EPE evaluation metrics of RAFT-stereo converge inconsistently in the low- and high-frequency regions, resulting in high-frequency degradation (e.g., edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that they optimize all frequency components together without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high- and low-frequency components using the discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter to provide adaptive refined high-frequency features at different iteration steps by fine-tuning the initial high-frequency features. By processing high- and low-frequency components separately, our framework can simultaneously refine high-frequency information in edges and low-frequency information in smooth regions, which is especially suitable for challenging scenes with fine details and textures in the distance. Extensive experiments demonstrate that our Wavelet-Stereo outperforms the state-of-the-art methods and ranks 1st on both the KITTI 2015 and KITTI 2012 leaderboards for almost all metrics. We will provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (https://github.com/SIA-IDE/Wavelet-Stereo).
Chinese: 该研究提出了Wavelet-Stereo立体匹配框架,通过小波变换分别处理高低频分量并采用专用特征提取器,解决了频率收敛不一致问题,在KITTI基准测试中取得最优性能。
English: The study introduces Wavelet-Stereo, a stereo matching framework that addresses frequency convergence inconsistency by separately processing high and low-frequency components using wavelet transforms and specialized feature extractors, achieving top performance on KITTI benchmarks.
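The explicit frequency decomposition is standard wavelet analysis; below is a minimal example with PyWavelets, assuming a single-level Haar transform (the framework's actual wavelet choice and level count are not specified here).

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_split(image):
    """Split a grayscale image into low- and high-frequency components with
    a single-level 2D discrete wavelet transform.

    Returns the approximation band (smooth regions) and the three detail
    bands (horizontal / vertical / diagonal edges), which would feed the two
    frequency-specific feature extractors separately.
    """
    cA, (cH, cV, cD) = pywt.dwt2(image.astype(np.float32), "haar")
    low = cA                                # coarse structure
    high = np.stack([cH, cV, cD], axis=0)   # edges and thin structures
    return low, high
```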

Authors:Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, Yuhui Zheng
Title: RemoteSAM: Towards Segment Anything for Earth Observation
Abstract:
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architecture trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc, using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.
Chinese: 本研究提出了RemoteSAM,一个用于地球观测的鲁棒视觉基础模型,通过自动数据引擎构建了同类最大数据集,并采用任务统一范式处理多种视觉任务,以高效性实现了最先进的性能。
English: This study introduces RemoteSAM, a robust visual foundation model for Earth observation that uses an automatic data engine to create the largest dataset of its kind and a task unification paradigm for handling multiple vision tasks, achieving state-of-the-art performance with high efficiency.

Authors:Yao Sun, Sining Chen, Yifan Tian, Xiao Xiang Zhu
Title: Building Floor Number Estimation from Crowdsourced Street-Level Images: Munich Dataset and Baseline Method
Abstract:
Accurate information on the number of building floors, or above-ground storeys, is essential for household estimation, utility provision, risk assessment, evacuation planning, and energy modeling. Yet large-scale floor-count data are rarely available in cadastral and 3D city databases. This study proposes an end-to-end deep learning framework that infers floor numbers directly from unrestricted, crowdsourced street-level imagery, avoiding hand-crafted features and generalizing across diverse facade styles. To enable benchmarking, we release the Munich Building Floor Dataset, a public set of over 6800 geo-tagged images collected from Mapillary and targeted field photography, each paired with a verified storey label. On this dataset, the proposed classification-regression network attains 81.2% exact accuracy and predicts 97.9% of buildings within +/-1 floor. The method and dataset together offer a scalable route to enrich 3D city models with vertical information and lay a foundation for future work in urban informatics, remote sensing, and geographic information science. Source code and data will be released under an open license at https://github.com/ya0-sun/Munich-SVI-Floor-Benchmark.
中文: 本研究提出一种端到端深度学习框架,可直接从街景图像推断建筑楼层数,在新建公开数据集上达到81.2%的精确准确率,为丰富三维城市模型提供了可扩展的解决方案。
English: This study introduces an end-to-end deep learning framework that accurately infers building floor counts from street-level images, achieving 81.2% exact accuracy on a new public dataset and offering a scalable solution to enrich 3D city models.
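A minimal sketch of a joint classification-regression head of the kind described, assuming the storey count is predicted both as a class and as a scalar and the two losses are summed; the feature dimension, class range, and loss weighting are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FloorCountHead(nn.Module):
    """Joint classification-regression head on top of an image backbone:
    a softmax over discrete storey counts plus a scalar regression output."""

    def __init__(self, feat_dim=2048, max_floors=30):
        super().__init__()
        self.cls = nn.Linear(feat_dim, max_floors)
        self.reg = nn.Linear(feat_dim, 1)

    def forward(self, feats):
        logits = self.cls(feats)             # (B, max_floors) class scores
        count = self.reg(feats).squeeze(-1)  # (B,) continuous floor estimate
        return logits, count

def floor_loss(logits, count, target, w_reg=0.5):
    # target: (B,) long tensor; here the class index is assumed to encode
    # the storey count directly.
    ce = F.cross_entropy(logits, target)
    l1 = F.l1_loss(count, target.float())
    return ce + w_reg * l1
```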

Authors:Shashank Agnihotri, David Schader, Jonas Jakubassa, Nico Sharei, Simon Kral, Mehmet Ege Kaçar, Ruben Weber, Margret Keuper
Title: SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification
Abstract:
Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.
中文摘要:本研究提出SEMSEGBENCH和DETECBENCH基准测试工具,通过大规模评估揭示了当前最优语义分割和物体检测模型在分布偏移和对抗攻击下存在的系统性缺陷,旨在推动超越分类任务的模型可靠性研究。
English Summary: This research introduces SEMSEGBENCH and DETECBENCH benchmarking tools to evaluate the robustness of semantic segmentation and object detection models against distribution shifts and adversarial attacks, revealing systematic weaknesses in state-of-the-art models through extensive testing.

Authors:Bryan Wong, Jong Woo Kim, Huazhu Fu, Mun Yong Yi
Title: Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling
Abstract:
Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL
Chinese: HiVE-MIL提出了一种分层视觉语言框架,通过增强多尺度交互和跨模态对齐来改进全切片图像分类,在16样本设置下宏观F1分数最高提升4.1%,显著优于现有方法。
English: HiVE-MIL introduces a hierarchical vision-language framework that enhances multi-scale interactions and cross-modal alignment in whole slide image classification, achieving superior performance over existing methods with up to 4.1% improvement in macro F1 scores.

Authors:Simone Gaisbauer, Prabin Gyawali, Qilin Zhang, Olaf Wysocki, Boris Jutzi
Title: To Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models
Abstract:
Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This submission systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets with zero to 12 RANSAC-inliers and zero to 0.16 area under the curve. We believe that this work will foster the development of model-based visual localization methods. Link to the code: https://github.com/simBauer/To_Glue_or_not_to_Glue
中文: 本研究系统比较了基于三维建筑模型的视觉定位中传统与可学习特征匹配方法,发现在具有挑战性的数据集上,可学习方法在精度和鲁棒性方面显著优于传统技术。
English: This study systematically compares classical and learnable feature matching methods for visual localization using 3D building models, finding that learnable approaches significantly outperform traditional techniques in accuracy and robustness on challenging datasets.

Authors:Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach
Title: Diffusion Classifiers Understand Compositionality, but Conditions Apply
Abstract:
Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
中文摘要:本研究通过涵盖多个模型和任务的系统性评估,揭示了扩散分类器在组合性理解方面的条件有效性,并深入分析了数据集领域和时间步敏感性等关键影响因素。
English Summary: This study comprehensively evaluates the discriminative capabilities of diffusion classifiers across diverse compositional tasks, revealing their conditional effectiveness while analyzing factors like dataset domains and timestep sensitivity.
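For context, a zero-shot diffusion classifier of the kind studied here scores each class prompt by the denoising error it induces, optionally with per-timestep weights; the sketch below is a generic version under assumed model and schedule interfaces, not the paper's evaluation code.

```python
import torch

@torch.no_grad()
def diffusion_classify(x0, class_embs, eps_model, alphas_cumprod,
                       timesteps, weights=None):
    """Zero-shot diffusion classifier: pick the class whose prompt yields
    the lowest (weighted) noise-prediction error on the input latent.

    x0:             (1, C, H, W) image latent.
    class_embs:     list of text-conditioning tensors, one per class.
    eps_model:      callable (x_t, t, cond) -> predicted noise.
    alphas_cumprod: (T,) alpha-bar schedule.
    timesteps:      iterable of integer timesteps to evaluate.
    weights:        optional per-timestep weights (e.g. to downweight very
                    noisy steps); uniform if None.
    """
    if weights is None:
        weights = [1.0] * len(timesteps)
    scores = []
    for cond in class_embs:
        err = 0.0
        for t, w in zip(timesteps, weights):
            a = alphas_cumprod[t]
            noise = torch.randn_like(x0)
            x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
            pred = eps_model(x_t, torch.tensor([t]), cond)
            err += w * (pred - noise).pow(2).mean().item()
        scores.append(err)
    return int(torch.tensor(scores).argmin())  # predicted class index
```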

Authors:Haihong Xiao, Jianan Zou, Yuxin Zhou, Ying He, Wenxiong Kang
Title: SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes
Abstract:
We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments. SplatCo builds upon two novel components: (1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features that represent fine surface details. This fusion is achieved through a novel hierarchical compensation strategy, ensuring both global consistency and local detail preservation; and (2) a cross-view assisted training strategy that enhances multi-view consistency by synchronizing gradient updates across viewpoints, applying visibility-aware densification, and pruning overfitted or inaccurate Gaussians based on structural consistency. Through joint optimization of structural representation and multi-view coherence, SplatCo effectively reconstructs fine-grained geometric structures and complex textures in large-scale scenes. Comprehensive evaluations on 13 diverse large-scale scenes, including Mill19, MatrixCity, Tanks & Temples, WHU, and custom aerial captures, demonstrate that SplatCo consistently achieves higher reconstruction quality than state-of-the-art methods, with PSNR improvements of 1-2 dB and SSIM gains of 0.1 to 0.2. These results establish a new benchmark for high-fidelity rendering of large-scale unbounded scenes. Code and additional information are available at https://github.com/SCUT-BIP-Lab/SplatCo.
中文: SplatCo是一种结合全局三平面与局部网格特征的结构-视图协作高斯泼溅框架,通过结构和多视角优化,实现了复杂户外场景的高保真渲染和卓越重建质量。
English: SplatCo is a collaborative Gaussian splatting framework that integrates global tri-plane and local grid features for high-fidelity rendering of complex outdoor scenes, achieving superior reconstruction quality through structural and multi-view optimization.

Authors:Zheyang Huang, Jagannath Aryal, Saeid Nahavandi, Xuequan Lu, Chee Peng Lim, Lei Wei, Hailing Zhou
Title: Object-level Cross-view Geo-localization with Location Enhancement and Multi-Head Cross Attention
Abstract:
Cross-view geo-localization determines the location of a query image, captured by a drone or ground-based camera, by matching it to a geo-referenced satellite image. While traditional approaches focus on image-level localization, many applications, such as search-and-rescue, infrastructure inspection, and precision delivery, demand object-level accuracy. This enables users to prompt a specific object with a single click on a drone image to retrieve precise geo-tagged information of the object. However, variations in viewpoints, timing, and imaging conditions pose significant challenges, especially when identifying visually similar objects in extensive satellite imagery. To address these challenges, we propose an Object-level Cross-view Geo-localization Network (OCGNet). It integrates user-specified click locations using Gaussian Kernel Transfer (GKT) to preserve location information throughout the network. This cue is dually embedded into the feature encoder and feature matching blocks, ensuring robust object-specific localization. Additionally, OCGNet incorporates a Location Enhancement (LE) module and a Multi-Head Cross Attention (MHCA) module to adaptively emphasize object-specific features or expand focus to relevant contextual regions when necessary. OCGNet achieves state-of-the-art performance on a public dataset, CVOGL. It also demonstrates few-shot learning capabilities, effectively generalizing from limited examples, making it suitable for diverse applications (https://github.com/ZheyangH/OCGNet).
中文: 跨视角地理定位方法OCGNet通过整合用户指定位置和自适应模块,提升了对象级定位精度,在公开数据集上实现最优性能并具备小样本学习能力,适用于多种应用场景。
English: Cross-view geo-localization using OCGNet enhances object-level accuracy by integrating user-specified locations and adaptive modules, achieving state-of-the-art performance and few-shot learning capabilities for diverse applications.
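
A minimal sketch of the location-cue idea behind Gaussian Kernel Transfer: the user click is converted into a Gaussian heatmap and concatenated with the image features so that location information persists through the network. The sigma value and the concatenation-based fusion are illustrative assumptions, not OCGNet's exact design.

    import torch

    def gaussian_click_map(h, w, click_xy, sigma=8.0):
        """Encode a user click as a 2D Gaussian heatmap of shape (1, 1, h, w)."""
        ys = torch.arange(h).view(h, 1).float()
        xs = torch.arange(w).view(1, w).float()
        cx, cy = click_xy
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        heat = torch.exp(-d2 / (2 * sigma ** 2))
        return heat.view(1, 1, h, w)

    # Example: inject the location cue by concatenating it with drone-image features.
    feat = torch.randn(1, 64, 128, 128)            # hypothetical encoder features
    loc = gaussian_click_map(128, 128, (40, 90))   # click at (x=40, y=90)
    feat_with_loc = torch.cat([feat, loc], dim=1)  # (1, 65, 128, 128)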

Authors:Bin Wu, Wei Wang, Yahui Liu, Zixiang Li, Yao Zhao
Title: DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning
Abstract:
Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial details and exhibit poor identity consistency. The core of our framework is the Face Reward Model (FRM), which is trained using carefully annotated data. It provides feedback signals that play a pivotal role in steering the optimization process of the restoration network. In particular, our ReFL framework incorporates a gradient flow into the denoising process of off-the-shelf face restoration methods to guide the update of model parameters. The guiding gradient is collaboratively determined by three aspects: (i) the FRM to ensure the perceptual quality of the restored faces; (ii) a regularization term that functions as a safeguard to preserve generative diversity; and (iii) a structural consistency constraint to maintain facial fidelity. Furthermore, the FRM undergoes dynamic optimization throughout the process. It not only ensures that the restoration network stays precisely aligned with the real face manifold, but also effectively prevents reward hacking. Experiments on synthetic and wild datasets demonstrate that our method outperforms state-of-the-art methods, significantly improving identity consistency and facial details. The source codes, data, and models are available at: https://github.com/01NeuralNinja/DiffusionReward.
中文摘要:本文提出DiffusionReward,一种用于盲人脸复原的奖励反馈学习框架,通过面部奖励模型和引导梯度优化显著提升人脸细节与身份一致性,实验证明其性能优于现有最先进方法。
English Summary: This paper introduces DiffusionReward, a novel Reward Feedback Learning framework for Blind Face Restoration that enhances facial details and identity consistency through a Face Reward Model and guided gradient optimization, outperforming existing methods.
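
A rough sketch of how the three guidance signals described above could be combined into a single differentiable objective whose gradient steers the restoration network. The Face Reward Model interface, the use of the frozen base restorer's output as the diversity safeguard, and the L1 structural term are all assumptions standing in for the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def refl_guidance_loss(restored, base_restored, structure_ref, frm,
                           w_reward=1.0, w_reg=0.1, w_struct=0.5):
        """Combine reward feedback, a diversity safeguard, and structural consistency.

        restored:      output of the restoration network (B, 3, H, W), requires grad
        base_restored: output of a frozen off-the-shelf restorer (proxy safeguard)
        structure_ref: reference image for structural/identity fidelity
        frm:           Face Reward Model returning one score per image (assumed)
        """
        reward = frm(restored).mean()                        # perceptual-quality feedback
        reg = F.mse_loss(restored, base_restored.detach())   # stay close to the base output
        struct = F.l1_loss(restored, structure_ref)          # structural consistency proxy
        return -w_reward * reward + w_reg * reg + w_struct * struct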

Authors:Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, Ying-Cong Chen
Title: ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback
Abstract:
With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform. ComfyMind introduces two core innovations: Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems. Project page: https://github.com/LitaoGuo/ComfyMind
Chinese: ComfyMind 是一个协作式人工智能系统,通过语义工作流接口和搜索树规划机制提升通用生成能力,在多项基准测试中表现优异,性能媲美 GPT-Image-1。
English: ComfyMind is a collaborative AI system that enhances general-purpose generation through its Semantic Workflow Interface and Search Tree Planning mechanism, achieving superior performance on benchmarks and rivaling GPT-Image-1.

Authors:Nikita Ivanov, Mark Klimov, Dmitry Glukhikh, Tatiana Chernysheva, Igor Glukhikh
Title: Track Anything Annotate: Video annotation and dataset generation of computer vision models
Abstract:
Modern machine learning methods require significant amounts of labelled data, making the preparation process time-consuming and resource-intensive. In this paper, we describe the process of prototyping a tool for annotating and generating training datasets based on video tracking and segmentation. We examine different approaches to solving this problem, from technology selection through to final implementation. The developed prototype significantly accelerates dataset generation compared to manual annotation. All resources are available at https://github.com/lnikioffic/track-anything-annotate
中文摘要:本文提出了一种基于视频跟踪与分割的原型工具,用于自动标注和生成训练数据集,相比人工标注显著加快了处理速度。
English Summary: The paper introduces a prototype tool that uses video tracking and segmentation to automate the annotation and generation of training datasets, significantly speeding up the process compared to manual methods.

Authors:Wenning Xu, Shiyu Fan, Paul Henderson, Edmond S. L. Ho
Title: Multi-Person Interaction Generation from Two-Person Motion Priors
Abstract:
Generating realistic human motion with high-level controls is a crucial task for social understanding, robotics, and animation. With high-quality MOCAP data becoming more available recently, a wide range of data-driven approaches have been presented. However, modelling multi-person interactions still remains a less explored area. In this paper, we present Graph-driven Interaction Sampling, a method that can generate realistic and diverse multi-person interactions by leveraging existing two-person motion diffusion models as motion priors. Instead of training a new model specific to multi-person interaction synthesis, our key insight is to spatially and temporally separate complex multi-person interactions into a graph structure of two-person interactions, which we name the Pairwise Interaction Graph. We thus decompose the generation task into simultaneous single-person motion generation conditioned on one another's motion. In addition, to reduce artifacts such as interpenetrations of body parts in generated multi-person interactions, we introduce two graph-dependent guidance terms into the diffusion sampling scheme. Unlike previous work, our method can produce various high-quality multi-person interactions without having repetitive individual motions. Extensive experiments demonstrate that our approach consistently outperforms existing methods in reducing artifacts when generating a wide range of two-person and multi-person interactions.

Authors:Boxu Chen, Ziwei Zheng, Le Yang, Zeyu Geng, Zhengyu Zhao, Chenhao Lin, Chao Shen
Title: Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations
Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model's vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: https://github.com/Ziwei-Zheng/VaLSe.
中文: 大型视觉语言模型存在物体幻觉问题,VaLSe框架通过解释视觉贡献并引导潜在表征来减少幻觉输出,从而提升模型的鲁棒性和可解释性。
English: Large Vision-Language Models suffer from object hallucination, and the proposed VaLSe framework addresses this by interpreting visual contributions and steering latent representations to enhance robustness and interpretability.

Authors:Ping Li, Jianan Ni, Bo Pang
Title: Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
Abstract:
Action recognition models using deep learning are vulnerable to adversarial examples, which are transferable across other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they heavily rely on the assumption that the decision boundaries of the surrogate (a.k.a., source) model and the target model are similar, which limits the adversarial transferability; and 2) their decision boundary difference makes the attack direction uncertain, which may result in gradient oscillation, weakening the adversarial attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and extract its background frame, then use reinforcement learning to select the background frame with the strongest attack ability for mixup with the clean frame. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of the adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models. Our code is available at https://github.com/mlvccn/BMTC_TransferAttackVid.
中文: 提出的背景混合诱导时序一致性(BMTC)攻击方法通过背景混合和时序梯度一致性,减少了对替代模型的依赖并稳定了攻击方向,从而显著提升了动作识别中对抗样本的迁移性。
English: The proposed Background Mixup-induced Temporal Consistency (BMTC) attack method enhances adversarial transferability for action recognition by reducing dependency on surrogate models and stabilizing attack directions through background mixup and temporal gradient consistency.
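
A minimal sketch of a temporal gradient consistency term in the spirit of BMTC: penalize direction changes between adversarial gradients of consecutive frames so the attack direction stays stable over time. The tensor layout is an assumption for illustration.

    import torch
    import torch.nn.functional as F

    def temporal_gradient_consistency(per_frame_grads):
        """Penalize direction changes between gradients of consecutive frames.

        per_frame_grads: tensor (T, C, H, W) of adversarial gradients per frame.
        Returns a scalar; smaller values mean more consistent attack directions.
        """
        flat = per_frame_grads.flatten(start_dim=1)          # (T, C*H*W)
        cos = F.cosine_similarity(flat[:-1], flat[1:], dim=1)
        return (1.0 - cos).sum()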

Authors:Yanping Fu, Xinyuan Liu, Tianyu Li, Yike Ma, Yucheng Zhang, Feng Dai
Title: TopoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving
Abstract:
Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoint deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane queries, and propose Point-Lane Merge Self-Attention to enhance global context sharing by incorporating geometric distances between points and lanes as an attention mask. We further design a Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane queries. During inference, we introduce a Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET$_p$ to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 vs. 45.2 on DET$_p$). The code is released at https://github.com/Franpin/TopoPoint.
中文: TopoPoint是一种新颖的框架,通过显式检测车道端点并与车道联合推理来增强自动驾驶中的拓扑推理能力,在基准测试中实现了最先进的性能。
English: TopoPoint is a novel framework that explicitly detects lane endpoints and jointly reasons over them with lanes to enhance topology reasoning in autonomous driving, achieving state-of-the-art performance on benchmarks.
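
A small sketch of the inference-time geometry-matching idea: snap each predicted lane endpoint to the nearest detected endpoint when it lies within a distance threshold. The threshold and tensor shapes are illustrative assumptions, not the paper's exact algorithm.

    import torch

    def refine_lane_endpoints(lane_endpoints, detected_points, max_dist=2.0):
        """Snap lane endpoints to nearby detected endpoints to reduce deviation.

        lane_endpoints:  (N, 2) endpoints taken from predicted lane polylines
        detected_points: (M, 2) endpoints from the dedicated point head
        """
        if detected_points.numel() == 0:
            return lane_endpoints
        d = torch.cdist(lane_endpoints, detected_points)   # (N, M) pairwise distances
        min_d, idx = d.min(dim=1)
        refined = lane_endpoints.clone()
        snap = min_d < max_dist
        refined[snap] = detected_points[idx[snap]]
        return refined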

Authors:Dong Zhang, Lingfeng He, Rui Yan, Fei Shen, Jinhui Tang
Title: R-Genie: Reasoning-Guided Generative Image Editing
Abstract:
While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, lacking deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries that involve world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie: a reasoning-guided generative image editor, which synergizes the generation power of diffusion models with advanced reasoning capabilities of multimodal large language models. R-Genie incorporates a reasoning-attention mechanism to bridge linguistic understanding with visual synthesis, enabling it to handle intricate editing requests involving abstract user intentions and contextual reasoning relations. Extensive experimental results validate that R-Genie can equip diffusion models with advanced reasoning-based editing capabilities, unlocking new potential for intelligent image synthesis.

Authors:Jiawei Zhou, Linye Lyu, Zhuotao Tian, Cheng Zhuo, Yu Li
Title: SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain
Abstract:
Safety-critical scenarios are rare yet pivotal for evaluating and enhancing the robustness of autonomous driving systems. While existing methods generate safety-critical driving trajectories, simulations, or single-view videos, they fall short of meeting the demands of advanced end-to-end autonomous systems (E2E AD), which require real-world, multi-view video data. To bridge this gap, we introduce SafeMVDrive, the first framework designed to generate high-quality, safety-critical, multi-view driving videos grounded in real-world domains. SafeMVDrive strategically integrates a safety-critical trajectory generator with an advanced multi-view video generator. To tackle the challenges inherent in this integration, we first enhance scene understanding ability of the trajectory generator by incorporating visual context -- which is previously unavailable to such generator -- and leveraging a GRPO-finetuned vision-language model to achieve more realistic and context-aware trajectory generation. Second, recognizing that existing multi-view video generators struggle to render realistic collision events, we introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories, ensuring both video quality and safety-critical fidelity. Finally, we employ a diffusion-based multi-view video generator to synthesize high-quality safety-critical driving videos from the generated trajectories. Experiments conducted on an E2E AD planner demonstrate a significant increase in collision rate when tested with our generated data, validating the effectiveness of SafeMVDrive in stress-testing planning modules. Our code, examples, and datasets are publicly available at: https://zhoujiawei3.github.io/SafeMVDrive/.

Authors:Dong-Hee Kim, Hyunjee Song, Donghyun Kim
Title: SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data
Abstract:
Despite the advances in Referring Expression Segmentation (RES) benchmarks, their evaluation protocols remain constrained, primarily focusing on either single targets with short queries (containing minimal attributes) or multiple targets from distinctly different queries on a single domain. This limitation significantly hinders the assessment of more complex reasoning capabilities in RES models. We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. This benchmark spans diverse application domains, including autonomous driving environments and robotic manipulation scenarios, thus enabling more rigorous evaluation of complex reasoning capabilities in real-world settings. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data through three innovations: (1) a dense caption-driven synthesis for attribute-rich image-mask-expression triplets, (2) reliable semantic alignment mechanisms rectifying caption-pseudo mask inconsistencies via Image-Text Aligned Grouping, and (3) domain-aware augmentations incorporating mosaic composition and superclass replacement to emphasize generalization ability and distinguishing attributes over object categories. Experimental results demonstrate that models trained with SynRES achieve state-of-the-art performance, improving gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Code and datasets are available at https://github.com/UTLLab/SynRES.
Chinese: WildRES作为一个新基准,通过引入复杂查询和多领域场景严格评估指代表达分割模型,揭示了现有模型的性能局限,并推动了SynRES自动化合成训练数据流程的开发,从而显著提升了模型的精确度。
English: WildRES is a new benchmark that introduces complex queries and diverse domains to rigorously evaluate referring expression segmentation models, revealing their performance limitations and prompting the development of SynRES, an automated pipeline that generates synthetic training data to significantly enhance model accuracy.

Authors:Yunyao Lu, Yihang Wu, Reem Kateb, Ahmad Chaddad
Title: Semi-Supervised Medical Image Segmentation via Dual Networks
Abstract:
Traditional supervised medical image segmentation models require large amounts of labeled data for training; however, obtaining such large-scale labeled datasets in the real world is extremely challenging. Recent semi-supervised segmentation models also suffer from noisy pseudo-label issue and limited supervision in feature space. To solve these challenges, we propose an innovative semi-supervised 3D medical image segmentation method to reduce the dependency on large, expert-labeled datasets. Furthermore, we introduce a dual-network architecture to address the limitations of existing methods in using contextual information and generating reliable pseudo-labels. In addition, a self-supervised contrastive learning strategy is used to enhance the representation of the network and reduce prediction uncertainty by distinguishing between reliable and unreliable predictions. Experiments on clinical magnetic resonance imaging demonstrate that our approach outperforms state-of-the-art techniques. Our code is available at https://github.com/AIPMLab/Semi-supervised-Segmentation.
Chinese: 本研究提出了一种创新的半监督三维医学图像分割方法,采用双网络架构和对比学习策略,减少对标注数据的依赖并提高分割精度,在临床磁共振成像测试中优于现有技术。
English: This study introduces a novel semi-supervised 3D medical image segmentation method with a dual-network architecture and contrastive learning to reduce reliance on labeled data and improve segmentation accuracy, outperforming existing techniques in clinical MRI tests.

Authors:Dan Yuan, Yi Feng, Ziyun Tang
Title: Dual Attention Residual U-Net for Accurate Brain Ultrasound Segmentation in IVH Detection
Abstract:
Intraventricular hemorrhage (IVH) is a severe neurological complication among premature infants, necessitating early and accurate detection from brain ultrasound (US) images to improve clinical outcomes. While recent deep learning methods offer promise for computer-aided diagnosis, challenges remain in capturing both local spatial details and global contextual dependencies critical for segmenting brain anatomies. In this work, we propose an enhanced Residual U-Net architecture incorporating two complementary attention mechanisms: the Convolutional Block Attention Module (CBAM) and a Sparse Attention Layer (SAL). The CBAM improves the model's ability to refine spatial and channel-wise features, while the SAL introduces a dual-branch design in which sparse attention filters out low-confidence query-key pairs to suppress noise and dense attention ensures comprehensive information propagation. Extensive experiments on the Brain US dataset demonstrate that our method achieves state-of-the-art segmentation performance, with a Dice score of 89.04% and IoU of 81.84% for ventricle region segmentation. These results highlight the effectiveness of integrating spatial refinement and attention sparsity for robust brain anatomy detection. Code is available at: https://github.com/DanYuan001/BrainImgSegment.
中文摘要:本研究提出一种结合CBAM和SAL双重注意力机制的增强型残差U-Net,在早产儿脑部超声图像心室分割任务中以89.04%的Dice分数达到最优性能。
English Summary: The study introduces an enhanced Residual U-Net with dual attention mechanisms, CBAM and SAL, achieving state-of-the-art ventricle segmentation in premature infant brain ultrasound images with 89.04% Dice score.
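
A minimal sketch of the dual-branch attention idea described above: a dense branch applies ordinary softmax attention, while a sparse branch discards low-confidence query-key scores (here via a per-query top-k cutoff) before normalization. The top-k rule and the additive fusion are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def dual_branch_attention(q, k, v, topk=16):
        """Dense + sparse attention over flattened feature tokens.

        q, k, v: (B, N, D). The sparse branch keeps only the top-k scores per query.
        """
        scale = q.shape[-1] ** -0.5
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale      # (B, N, N)

        dense = torch.matmul(F.softmax(scores, dim=-1), v)

        kth = scores.topk(topk, dim=-1).values[..., -1:]           # k-th largest per query
        sparse_scores = scores.masked_fill(scores < kth, float("-inf"))
        sparse = torch.matmul(F.softmax(sparse_scores, dim=-1), v)

        return dense + sparse                                       # simple fusion (assumption)

    # toy usage with 64 tokens of a flattened feature map
    q = k = v = torch.randn(1, 64, 32)
    out = dual_branch_attention(q, k, v, topk=8)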

Authors:Xuerui Qiu, Peixi Wu, Yaozhi Wen, Shaowei Gu, Yuqi Pan, Xinhao Luo, Bo XU, Guoqi Li
Title: SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding
Abstract:
Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available https://github.com/bollossom/SVL.
中文摘要:提出的脉冲视觉语言(SVL)框架通过多模态预训练解决了脉冲神经网络性能不足的问题,在保持能效优势的同时实现了卓越的零样本三维分类和开放世界理解能力。
English Summary: The proposed Spike-based Vision-Language (SVL) framework overcomes SNNs' performance limitations through multimodal pretraining, achieving superior zero-shot 3D classification and open-world understanding while maintaining energy efficiency.

Authors:Junhang Li, Yu Guo, Chuhua Xian, Shengfeng He
Title: Instruct2See: Learning to Remove Any Obstructions Across Distributions
Abstract:
Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase. Code and dataset are available at https://jhscut.github.io/Instruct2See.

Authors:Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan
Title: Scaling Image and Video Generation via Test-Time Evolutionary Search
Abstract:
As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.
中文: 测试时缩放(TTS)通过在推理阶段增加计算来经济地提升生成模型性能,而提出的EvoSearch方法无需额外训练即可有效增强扩散和流模型的图像与视频生成效果。
English: Test-time scaling (TTS) offers a cost-effective way to boost generative model performance by adding computation during inference, and the proposed EvoSearch method effectively enhances image and video generation for diffusion and flow models without extra training.
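
A compact sketch of the evolutionary test-time search pattern: maintain a population of initial noises, score the resulting samples with a reward, keep elites, and mutate them to form the next generation. The `generate` and `reward` callables are hypothetical, and the Gaussian mutation is a simplification of the paper's SDE-aware operators.

    import torch

    def evo_search_noise(generate, reward, shape, pop_size=8, elites=2,
                         generations=4, mutation_std=0.3):
        """Evolve initial noise at test time to maximize a reward (generic sketch).

        generate(z) -> sample and reward(sample) -> float are assumed callables.
        """
        pop = [torch.randn(shape) for _ in range(pop_size)]
        for _ in range(generations):
            scores = [float(reward(generate(z))) for z in pop]
            order = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
            parents = [pop[i] for i in order[:elites]]                    # selection
            children = []
            while len(children) < pop_size - elites:
                p = parents[torch.randint(len(parents), (1,)).item()]
                children.append(p + mutation_std * torch.randn_like(p))  # mutation
            pop = parents + children
        return max(pop, key=lambda z: float(reward(generate(z))))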

Authors:Florian Barthel, Wieland Morgenstern, Paul Hinzer, Anna Hilsmann, Peter Eisert
Title: CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head Synthesis
Abstract:
Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check out our project page here: https://fraunhoferhhi.github.io/cgs-gan/
中文:CGS-GAN提出了一种新颖的3D高斯泼溅生成对抗网络框架,无需视角条件即可实现稳定训练和高质量3D一致的人头合成,通过多视角正则化和优化的生成器架构,获得了卓越的渲染质量和具有竞争力的FID分数。
English: CGS-GAN introduces a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent human head synthesis without view-conditioning, achieving superior rendering quality and competitive FID scores through multi-view regularization and an optimized generator architecture.

Authors:Hainuo Wang, Qiming Hu, Xiaojie Guo
Title: MODEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Recovery
Abstract:
Restoring images degraded by adverse weather remains a significant challenge due to the highly non-uniform and spatially heterogeneous nature of weather-induced artifacts, e.g., fine-grained rain streaks versus widespread haze. Accurately estimating the underlying degradation can intuitively provide restoration models with more targeted and effective guidance, enabling adaptive processing strategies. To this end, we propose a Morton-Order Degradation Estimation Mechanism (MODEM) for adverse weather image restoration. Central to MODEM is the Morton-Order 2D-Selective-Scan Module (MOS2D), which integrates Morton-coded spatial ordering with selective state-space models to capture long-range dependencies while preserving local structural coherence. Complementing MOS2D, we introduce a Dual Degradation Estimation Module (DDEM) that disentangles and estimates both global and local degradation priors. These priors dynamically condition the MOS2D modules, facilitating adaptive and context-aware restoration. Extensive experiments and ablation studies demonstrate that MODEM achieves state-of-the-art results across multiple benchmarks and weather types, highlighting its effectiveness in modeling complex degradation dynamics. Our code will be released at https://github.com/hainuo-wang/MODEM.git.
中文: 本研究提出的MODEM框架通过莫顿序降解估计机制,结合空间扫描与双路径估计模块,自适应地建模全局和局部天气退化特征,在多种恶劣天气图像恢复任务中取得了最先进的性能表现。
English: The proposed MODEM framework introduces a Morton-Order Degradation Estimation Mechanism that adaptively captures both global and local weather-induced degradations through spatial ordering and dual estimation modules, achieving state-of-the-art restoration performance across diverse adverse weather conditions.
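
Because the Morton (Z-order) scan is central to MOS2D, here is the standard bit-interleaving construction that maps 2D pixel coordinates to a 1D ordering in which spatially close pixels tend to stay close. This is the textbook encoding, not the authors' implementation.

    def part1by1(n: int) -> int:
        """Spread the bits of a 16-bit integer so they occupy even positions."""
        n &= 0xFFFF
        n = (n | (n << 8)) & 0x00FF00FF
        n = (n | (n << 4)) & 0x0F0F0F0F
        n = (n | (n << 2)) & 0x33333333
        n = (n | (n << 1)) & 0x55555555
        return n

    def morton_index(row: int, col: int) -> int:
        """Interleave row/col bits to get the Z-order curve index."""
        return (part1by1(row) << 1) | part1by1(col)

    def morton_scan_order(h: int, w: int):
        """Return pixel coordinates sorted by their Morton code."""
        coords = [(r, c) for r in range(h) for c in range(w)]
        return sorted(coords, key=lambda rc: morton_index(*rc))

    # Example: flatten a 4x4 feature map in Morton order instead of raster order.
    print(morton_scan_order(4, 4)[:8])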

Authors:Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, Guo-jun Qi
Title: InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO
Abstract:
Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model's ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks and selects the top-$K$ most contextually relevant tokens, allowing the model to maintain a fixed computational budget while preserving content consistency and prompt alignment. To optimize the policy, we design a hybrid reward function that jointly captures semantic alignment, cross-scene consistency, and artifact reduction. To benchmark performance, we introduce the Cross-scene Video Benchmark (CsVBench) along with an Event Prompt Set (EPS) that simulates complex multi-scene transitions involving shared subjects and varied actions/backgrounds. Experimental results show that InfLVG can extend video length by up to 9$\times$, achieving strong consistency and semantic fidelity across scenes. Our code is available at https://github.com/MAPLE-AIGC/InfLVG.
中文:InfLVG是一种推理时框架,通过动态选择相关上下文标记实现连贯的长视频生成,可将视频长度扩展至9倍,同时保持跨场景的一致性和语义保真度。
English: InfLVG is an inference-time framework that enables coherent long video generation by dynamically selecting relevant context tokens, extending video length up to 9 times while maintaining consistency and semantic fidelity across scenes.
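
A minimal sketch of fixed-budget context selection: score each history token against the current condition with a small policy head and keep only the top-K tokens (in temporal order) for the next decoding step. The linear scorer is a placeholder for the GRPO-trained policy.

    import torch
    import torch.nn as nn

    class TopKContextSelector(nn.Module):
        """Keep only the K most relevant history tokens for the current condition."""

        def __init__(self, dim, k):
            super().__init__()
            self.k = k
            self.scorer = nn.Linear(2 * dim, 1)   # placeholder for the learned policy

        def forward(self, history, cond):
            # history: (T, D) past tokens; cond: (D,) current prompt/scene embedding
            T = history.shape[0]
            if T <= self.k:
                return history
            pair = torch.cat([history, cond.unsqueeze(0).expand(T, -1)], dim=-1)
            scores = self.scorer(pair).squeeze(-1)              # (T,) relevance per token
            keep = scores.topk(self.k).indices.sort().values    # preserve temporal order
            return history[keep]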

Authors:Kwanyoung Kim, Sanghyun Kim
Title: Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
Abstract:
The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: https://anse-project.github.io/anse-project/
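
A short sketch of a BANSA-style acquisition score: measure disagreement as the entropy of the averaged attention distribution minus the average entropy of the individual stochastic attention samples (a BALD-style gap). Shapes and the sampling procedure are assumptions; a lower disagreement would indicate a preferable noise seed.

    import torch

    def bansa_score(attn_samples, eps=1e-8):
        """BALD-style disagreement over stochastic attention maps.

        attn_samples: (S, N) -- S stochastic samples of an attention distribution
                      over N keys (each row sums to 1). Higher score = more disagreement.
        """
        mean_attn = attn_samples.mean(dim=0)
        h_mean = -(mean_attn * (mean_attn + eps).log()).sum()            # entropy of the mean
        h_each = -(attn_samples * (attn_samples + eps).log()).sum(dim=1).mean()
        return h_mean - h_each                                           # mutual-information-style gap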

Authors:Qiyu Chen, Huiyuan Luo, Haiming Yao, Wei Luo, Zhen Qu, Chengkan Lv, Zhengtao Zhang
Title: Center-aware Residual Anomaly Synthesis for Multi-class Industrial Anomaly Detection
Abstract:
Anomaly detection plays a vital role in the inspection of industrial images. Most existing methods require separate models for each category, resulting in multiplied deployment costs. This highlights the challenge of developing a unified model for multi-class anomaly detection. However, the significant increase in inter-class interference leads to severe missed detections. Furthermore, the intra-class overlap between normal and abnormal samples, particularly in synthesis-based methods, cannot be ignored and may lead to over-detection. To tackle these issues, we propose a novel Center-aware Residual Anomaly Synthesis (CRAS) method for multi-class anomaly detection. CRAS leverages center-aware residual learning to couple samples from different categories into a unified center, mitigating the effects of inter-class interference. To further reduce intra-class overlap, CRAS introduces distance-guided anomaly synthesis that adaptively adjusts noise variance based on normal data distribution. Experimental results on diverse datasets and real-world industrial applications demonstrate the superior detection accuracy and competitive inference speed of CRAS. The source code and the newly constructed dataset are publicly available at https://github.com/cqylunlun/CRAS.
中文摘要:提出的中心感知残差异常合成(CRAS)方法通过统一中心耦合缓解类间干扰,并采用自适应异常合成减少类内重叠,有效解决了多类别异常检测中的关键问题,在多个数据集上展现出卓越的检测精度与高效性能。
English Summary: The proposed Center-aware Residual Anomaly Synthesis (CRAS) method addresses multi-class anomaly detection challenges by mitigating inter-class interference through unified center coupling and reducing intra-class overlap via adaptive anomaly synthesis, demonstrating superior accuracy and efficiency across various datasets.
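
One way to picture distance-guided anomaly synthesis is to scale the synthetic perturbation by how far each normal feature lies from the unified center, so that synthesized anomalies stay separated from the normal cluster. The specific scaling rule below is an illustrative assumption, not CRAS's exact formula.

    import torch

    def distance_guided_noise(features, center, base_std=0.1):
        """Synthesize pseudo-anomalous features with distance-adaptive variance.

        features: (N, D) normal features; center: (D,) unified class center.
        Features close to the center get smaller perturbations and far ones larger,
        reducing overlap between synthetic anomalies and normal samples.
        """
        dist = (features - center).norm(dim=1, keepdim=True)      # (N, 1)
        scale = base_std * dist / (dist.mean() + 1e-8)            # normalize to the batch mean
        return features + scale * torch.randn_like(features)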

Authors:Xiaoyu Ye, Songjie Cheng, Yongtao Wang, Yajiao Xiong, Yishen Li
Title: T2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models
Abstract:
Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their ability to produce explicit or harmful content raises concerns about misuse and potential rights violations. Inspired by the success of unlearning techniques in erasing undesirable concepts from text-to-image (T2I) models, we extend unlearning to T2V models and propose a robust and precise unlearning method. Specifically, we adopt negatively-guided velocity prediction fine-tuning and enhance it with prompt augmentation to ensure robustness against LLM-refined prompts. To achieve precise unlearning, we incorporate a localization and a preservation regularization to preserve the model's ability to generate non-target concepts. Extensive experiments demonstrate that our method effectively erases a specific concept while preserving the model's generation capability for all other concepts, outperforming existing methods. We provide the unlearned models in \href{https://github.com/VDIGPKU/T2VUnlearning.git}{https://github.com/VDIGPKU/T2VUnlearning.git}.
中文: 本研究提出了一种基于遗忘学习的方法,通过增强的微调和正则化技术从文本到视频扩散模型中消除有害概念,在保持非目标内容生成质量的同时,其效果优于现有方法。
English: This study introduces an unlearning-based method to erase harmful concepts from text-to-video diffusion models, using enhanced fine-tuning and regularization techniques to maintain generation quality for non-target content while outperforming existing approaches.

Authors:Mingrui Wu, Lu Wang, Pu Zhao, Fangkai Yang, Jianjin Zhang, Jianfeng Liu, Yuefeng Zhan, Weihao Han, Hao Sun, Jiayi Ji, Xiaoshuai Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang, Rongrong Ji
Title: RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning
Abstract:
Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.
中文: RePrompt通过强化学习框架,在提示增强过程中引入结构化推理和图像级优化,无需人工标注即可显著提升文本到图像生成的空间布局忠实度和组合泛化能力。
English: RePrompt introduces a reinforcement learning framework that enhances text-to-image prompts through structured reasoning and image-level optimization, achieving superior spatial and compositional accuracy without human annotations.

Authors:Jingjing Jiang, Chongjie Si, Jun Luo, Hanwang Zhang, Chao Ma
Title: Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
Abstract:
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
中文摘要:本文提出CoRL协同强化学习框架,通过统一策略优化同时增强多模态大语言模型的生成与理解能力,在多项基准测试中取得显著性能提升。
English Summary: This paper introduces CoRL, a co-reinforcement learning framework that enhances both generation and understanding in multimodal large language models, achieving significant performance gains across multiple benchmarks.
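
For context, a minimal sketch of the group relative policy optimization machinery referenced above: rewards for a group of responses sampled from the same prompt are normalized into group-relative advantages and plugged into a clipped policy-gradient surrogate. This is the generic GRPO form, not the paper's exact loss.

    import torch

    def group_relative_advantages(rewards, eps=1e-8):
        """rewards: (G,) rewards for G responses sampled from the same prompt."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
        """Clipped policy-gradient surrogate averaged over the group."""
        ratio = (logp_new - logp_old).exp()
        unclipped = ratio * advantages
        clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()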

Authors:Uyoung Jeong, Jonathan Freer, Seungryul Baek, Hyung Jin Chang, Kwang In Kim
Title: PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
Abstract:
We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, e.g., regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. First, we propose nonparametric keypoint prototypes that learn within a unified embedding space, enabling seamless integration across skeleton types. Second, we develop a cross-type self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, providing supervision without relying on teacher-student models or additional augmentations. PoseBH substantially improves generalization across whole-body and animal pose datasets, including COCO-WholeBody, AP-10K, and APT-36K, while preserving performance on standard human pose benchmarks (COCO, MPII, and AIC). Furthermore, our learned keypoint embeddings transfer effectively to hand shape estimation (InterHand2.6M) and human body shape estimation (3DPW). The code for PoseBH is available at: https://github.com/uyoung-jeong/PoseBH.
中文: 我们提出PoseBH这一新型多数据集训练框架,通过统一关键点嵌入和跨类型自监督机制解决姿态估计中的骨骼异构问题,在人类和动物数据集上显著提升泛化能力,同时保持标准基准的性能表现。
English: We introduce PoseBH, a novel multi-dataset training framework that addresses skeletal heterogeneity in pose estimation through unified keypoint embeddings and cross-type self-supervision, significantly improving generalization across human and animal datasets while maintaining performance on standard benchmarks.

Authors:Hefei Mei, Zirui Wang, Shen You, Minjing Dong, Chang Xu
Title: VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation, yet their vulnerability to adversarial attacks raises significant robustness concerns. While existing effective attacks always focus on task-specific white-box settings, these approaches are limited in the context of LVLMs, which are designed for diverse downstream tasks and require expensive full-model gradient computations. Motivated by the pivotal role and wide adoption of the vision encoder in LVLMs, we propose a simple yet effective Vision Encoder Attack (VEAttack), which targets the vision encoder of LVLMs only. Specifically, we propose to generate adversarial examples by minimizing the cosine similarity between the clean and perturbed visual features, without accessing the following large language models, task information, and labels. It significantly reduces the computational overhead while eliminating the task and label dependence of traditional white-box attacks in LVLMs. To make this simple attack effective, we propose to perturb images by optimizing image tokens instead of the classification token. We provide both empirical and theoretical evidence that VEAttack can easily generalize to various tasks. VEAttack has achieved a performance degradation of 94.5% on image caption task and 75.7% on visual question answering task. We also reveal some key observations to provide insights into LVLM attack/defense: 1) hidden layer variations of LLM, 2) token attention differential, 3) Möbius band in transfer attack, 4) low sensitivity to attack steps. The code is available at https://github.com/hfmei/VEAttack-LVLM
中文摘要:本文提出的视觉编码器攻击(VEAttack)通过仅针对大型视觉语言模型的视觉编码器进行攻击,在显著降低多模态任务性能的同时有效减少了计算开销。
English Summary: The proposed Vision Encoder Attack (VEAttack) effectively compromises Large Vision-Language Models by targeting only the vision encoder, achieving significant performance degradation in multimodal tasks while reducing computational costs.
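
A minimal sketch of the attack recipe described above: run PGD on the image to minimize the cosine similarity between clean and perturbed image-token features from the vision encoder alone, with no access to the LLM, task, or labels. The step sizes, budget, and encoder interface are assumptions.

    import torch
    import torch.nn.functional as F

    def ve_attack(image, vision_encoder, eps=8 / 255, alpha=1 / 255, steps=20):
        """Untargeted attack on the vision encoder only.

        vision_encoder(x) is assumed to return image-token features of shape (B, N, D).
        """
        with torch.no_grad():
            clean_feat = vision_encoder(image)
        delta = torch.zeros_like(image, requires_grad=True)
        for _ in range(steps):
            feat = vision_encoder((image + delta).clamp(0, 1))
            # minimize cosine similarity between clean and perturbed image tokens
            loss = F.cosine_similarity(feat, clean_feat, dim=-1).mean()
            loss.backward()
            with torch.no_grad():
                delta -= alpha * delta.grad.sign()
                delta.clamp_(-eps, eps)
                delta.grad.zero_()
        return (image + delta).detach().clamp(0, 1)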

Authors:Wei Jie Yeo, Rui Mao, Moloud Abdar, Erik Cambria, Ranjan Satapathy
Title: Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads
Abstract:
Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce Locate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving more than a 50% gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at https://github.com/wj210/CLIP_LTC.
Chinese: 提出的“定位后修正”框架通过针对特定注意力头识别并缓解CLIP模型中的伪关联,在存在偏见的基准测试中显著提升了最差组准确率。
English: The proposed Locate-Then-Correct framework identifies and mitigates spurious associations in CLIP models by targeting specific attention heads, significantly improving worst-group accuracy on biased benchmarks.

Authors:Zhongpai Gao, Meng Zheng, Benjamin Planche, Anwesa Choudhuri, Terrence Chen, Ziyan Wu
Title: Render-FM: A Foundation Model for Real-time Photorealistic Volumetric Rendering
Abstract:
Volumetric rendering of Computed Tomography (CT) scans is crucial for visualizing complex 3D anatomical structures in medical imaging. Current high-fidelity approaches, especially neural rendering techniques, require time-consuming per-scene optimization, limiting clinical applicability due to computational demands and poor generalizability. We propose Render-FM, a novel foundation model for direct, real-time volumetric rendering of CT scans. Render-FM employs an encoder-decoder architecture that directly regresses 6D Gaussian Splatting (6DGS) parameters from CT volumes, eliminating per-scan optimization through large-scale pre-training on diverse medical data. By integrating robust feature extraction with the expressive power of 6DGS, our approach efficiently generates high-quality, real-time interactive 3D visualizations across diverse clinical CT data. Experiments demonstrate that Render-FM achieves visual fidelity comparable or superior to specialized per-scan methods while drastically reducing preparation time from nearly an hour to seconds for a single inference step. This advancement enables seamless integration into real-time surgical planning and diagnostic workflows. The project page is: https://gaozhongpai.github.io/renderfm/.
中文: Render-FM是一种基础模型,通过直接从CT体积回归6D高斯泼溅参数,实现实时高保真体积渲染,无需逐场景优化,并将准备时间从数小时缩短至数秒。
English: Render-FM is a foundation model that enables real-time, high-fidelity volumetric rendering of CT scans by directly regressing 6D Gaussian Splatting parameters, eliminating per-scene optimization and reducing preparation time from hours to seconds.

Authors:Amit Agarwal, Srikant Panda, Kulbhushan Pachauri
Title: FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding
Abstract:
In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : https://github.com/oracle-samples/fs-dag
中文: FS-DAG是一种针对少样本场景下视觉丰富文档理解的可扩展高效模型,通过模块化领域专用架构以不足9000万参数处理OCR错误和领域偏移,在信息抽取任务中展现出卓越性能和更快收敛速度。
English: FS-DAG is a scalable and efficient model for visually rich document understanding in few-shot settings, leveraging modular domain-specific backbones to handle OCR errors and domain shifts with under 90M parameters, demonstrating superior performance and faster convergence in information extraction tasks.

Authors:Harim Kim, Yuhan Wang, Minkyu Ahn, Heeyoul Choi, Yuyin Zhou, Charmgil Hong
Title: Harnessing EHRs for Diffusion-based Anomaly Detection on Chest X-rays
Abstract:
Unsupervised anomaly detection (UAD) in medical imaging is crucial for identifying pathological abnormalities without requiring extensive labeled data. However, existing diffusion-based UAD models rely solely on imaging features, limiting their ability to distinguish between normal anatomical variations and pathological anomalies. To address this, we propose Diff3M, a multi-modal diffusion-based framework that integrates chest X-rays and structured Electronic Health Records (EHRs) for enhanced anomaly detection. Specifically, we introduce a novel image-EHR cross-attention module to incorporate structured clinical context into the image generation process, improving the model's ability to differentiate normal from abnormal features. Additionally, we develop a static masking strategy to enhance the reconstruction of normal-like images from anomalies. Extensive evaluations on CheXpert and MIMIC-CXR/IV demonstrate that Diff3M achieves state-of-the-art performance, outperforming existing UAD methods in medical imaging. Our code is available at https://github.com/nth221/Diff3M
中文摘要:Diff3M是一种创新的多模态框架,通过将胸部X光与电子健康记录相结合,并利用交叉注意力模块增强医学影像中的无监督异常检测,在基准数据集上实现了最先进的性能。
English Summary: Diff3M is a novel multi-modal framework that enhances unsupervised anomaly detection in medical imaging by integrating chest X-rays with Electronic Health Records through a cross-attention module, achieving state-of-the-art performance on benchmark datasets.
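
A small sketch of an image-EHR cross-attention block in the spirit of Diff3M: image tokens act as queries over embedded EHR fields, and the attended clinical context is fused back with a residual connection. The dimensions and how the output conditions the diffusion backbone are assumptions.

    import torch
    import torch.nn as nn

    class ImageEHRCrossAttention(nn.Module):
        """Image tokens attend to structured EHR embeddings."""

        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, img_tokens, ehr_tokens):
            # img_tokens: (B, N, D) image features; ehr_tokens: (B, M, D) embedded EHR fields
            attended, _ = self.attn(query=img_tokens, key=ehr_tokens, value=ehr_tokens)
            return self.norm(img_tokens + attended)    # residual fusion of clinical context

    # Example usage with hypothetical shapes
    block = ImageEHRCrossAttention(dim=256)
    fused = block(torch.randn(2, 196, 256), torch.randn(2, 12, 256))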

Authors:Siyang Song, Micol Spitale, Xiangyu Kong, Hengde Zhu, Cheng Luo, Cristina Palmero, German Barquero, Sergio Escalera, Michel Valstar, Mohamed Daoudi, Tobias Baur, Fabien Ringeval, Andrew Howes, Elisabeth Andre, Hatice Gunes
Title: REACT 2025: the Third Multiple Appropriate Facial Reaction Generation Challenge
Abstract:
In dyadic interactions, a broad spectrum of human facial reactions might be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023 and REACT 2024 challenges, we are proposing the REACT 2025 challenge encouraging the development and benchmarking of Machine Learning (ML) models that can be used to generate multiple appropriate, diverse, realistic and synchronised human-style facial reactions expressed by human listeners in response to an input stimulus (i.e., audio-visual behaviours expressed by their corresponding speakers). As a key component of the challenge, we provide challenge participants with the first natural and large-scale multi-modal Multiple Appropriate Facial Reaction Generation (MAFRG) dataset (called MARS), recording 137 human-human dyadic interactions containing a total of 2856 interaction sessions covering five different topics. In addition, this paper also presents the challenge guidelines and the performance of our baselines on the two proposed sub-challenges: Offline MAFRG and Online MAFRG, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2025
中文摘要:REACT 2025挑战赛旨在开发能生成多样化同步面部反应的机器学习模型,并提供了大规模人类对话数据集MARS作为关键支持。
English Summary: The REACT 2025 challenge promotes developing ML models to generate diverse and synchronized facial reactions in dyadic interactions, supported by the large-scale MARS dataset of human-human conversations.

Authors:Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin
Title: OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Abstract:
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% on OCR-Reasoning, indicating that text-rich image reasoning remains an urgent challenge to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.
中文: OCR-Reasoning基准测试旨在系统评估多模态大语言模型在文本丰富图像推理任务中的表现,结果显示即使最先进的模型也面临巨大挑战,准确率均未超过50%,凸显了解决这一问题的紧迫性。
English: The OCR-Reasoning benchmark is introduced to systematically evaluate Multimodal Large Language Models on text-rich image reasoning tasks, revealing that even state-of-the-art models struggle significantly with accuracy below 50%, highlighting an urgent need for improvement in this area.

Authors:Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Title: RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Abstract:
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
中文: RAVEN通过其核心组件QuART实现了跨模态查询条件门控机制,能动态评估各模态标记的相关性以增强有效信号并抑制干扰,在多模态问答基准测试中显著提升准确率,并在模态受损情况下保持优异鲁棒性。
English: RAVEN introduces QuART, a query-conditioned cross-modal gating module that dynamically weights tokens across modalities to enhance relevant signals and suppress distractors, achieving significant accuracy improvements on multimodal QA benchmarks while maintaining robustness against modality corruption.
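The token-level gating described in the RAVEN abstract can be pictured with a short PyTorch sketch. This is only an illustrative stand-in for the QuART idea under assumed layer choices and tensor shapes, not the released implementation.

```python
# Minimal sketch of query-conditioned cross-modal gating in the spirit of QuART.
# All names and dimensions are illustrative assumptions, not the released RAVEN code.
import torch
import torch.nn as nn


class QueryConditionedGate(nn.Module):
    """Assigns a scalar relevance score to each modality token given a query embedding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 1),
        )

    def forward(self, tokens: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) concatenated audio/video/sensor tokens
        # query:  (B, D)    pooled question embedding
        q = query.unsqueeze(1).expand(-1, tokens.size(1), -1)
        gate = torch.sigmoid(self.score(torch.cat([tokens, q], dim=-1)))  # (B, N, 1)
        return tokens * gate  # amplify relevant tokens, suppress distractors


if __name__ == "__main__":
    gate = QueryConditionedGate(dim=256)
    fused = gate(torch.randn(2, 300, 256), torch.randn(2, 256))
    print(fused.shape)  # torch.Size([2, 300, 256])
```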

Authors:Xinlong Chen, Yuanxing Zhang, Qiang Liu, Junfei Wu, Fuzheng Zhang, Tieniu Tan
Title: Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, to distinguish the correctness aforementioned. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at https://github.com/xlchen0205/MoD.
中文: 提出的混合解码方法通过评估原始图像标记与模型关注标记输出的一致性,动态调整解码策略,有效缓解大型视觉语言模型的幻觉问题,并在多个基准测试中显著优于现有方法。
English: The proposed Mixture of Decoding (MoD) approach dynamically adjusts decoding strategies by assessing the consistency between outputs from original and attended image tokens, effectively mitigating hallucinations in Large Vision-Language Models and outperforming existing methods across multiple benchmarks.
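The control flow behind MoD's consistency check can be sketched as a single decoding step. The fusion formulas below are a toy reading of "complementary" and "contrastive" logit combinations under an assumed weight alpha; the paper's exact strategies may differ.

```python
# Toy sketch of the Mixture-of-Decoding control flow: compare outputs conditioned on the
# original image tokens with outputs conditioned on the attended subset, then apply a
# complementary or contrastive fusion of the logits. Function names are assumptions.
import torch


def mod_step(logits_full: torch.Tensor, logits_attended: torch.Tensor,
             alpha: float = 0.5) -> torch.Tensor:
    """Return next-token logits for one decoding step; both inputs are (B, V)."""
    consistent = logits_full.argmax(-1) == logits_attended.argmax(-1)  # (B,)
    # Complementary strategy: amplify shared evidence when attention looks correct.
    amplified = logits_full + alpha * logits_attended
    # Contrastive strategy: push away from the attended branch when attention looks wrong.
    contrasted = (1 + alpha) * logits_full - alpha * logits_attended
    return torch.where(consistent.unsqueeze(-1), amplified, contrasted)


if __name__ == "__main__":
    out = mod_step(torch.randn(2, 32000), torch.randn(2, 32000))
    print(out.shape)  # torch.Size([2, 32000])
```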

Authors:Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Title: GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
Abstract:
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
中文摘要:GoT-R1是一个强化学习框架,通过让模型自主开发复杂文本提示的推理策略,并采用统一奖励机制评估语义对齐与空间精度,显著提升了多对象空间关系和属性绑定的图像生成能力。
English Summary: GoT-R1 is a reinforcement learning framework that enhances visual generation by enabling models to autonomously develop reasoning strategies for complex text prompts, achieving superior performance in spatial relationships and attribute binding through a unified reward system.

Authors:Sara Ghaboura, Ketan More, Wafa Alghallabi, Omkar Thawakar, Jorma Laaksonen, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Title: ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
Abstract:
As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: https://github.com/mbzuai-oryx/ARB
中文摘要:综合阿拉伯语多模态推理基准(ARB)作为首个评估阿拉伯语多模态逐步推理的框架,揭示了现有模型在连贯性和文化认知方面的持续不足,同时推动了包容性人工智能的发展。
English Summary: The Comprehensive Arabic Multimodal Reasoning Benchmark (ARB) is introduced as the first evaluation framework for step-by-step multimodal reasoning in Arabic, revealing persistent challenges in existing models' coherence and cultural awareness while promoting inclusive AI development.

Authors:Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang
Title: CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Abstract:
The advent of Large Multimodal Models (LMMs) has significantly enhanced the ability of Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratic computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
中文:CrossLMM通过双交叉注意力机制,在减少视觉令牌的同时保持性能完整性,有效解决了大型多模态模型处理长视频序列时的计算效率问题。
English: CrossLMM addresses the computational inefficiency of processing long video sequences in Large Multimodal Models by introducing a dual cross-attention mechanism that significantly reduces visual tokens while maintaining performance through enhanced text-visual interactions.
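The pooled-query cross-attention at the heart of CrossLMM can be sketched in a few lines of PyTorch. Pooling stride, dimensions, and module structure below are assumptions chosen for illustration, not the paper's configuration.

```python
# Rough sketch of the visual-to-visual cross-attention idea in CrossLMM: heavily pooled
# visual tokens act as queries over the full token set. Layer names and dims are assumptions.
import torch
import torch.nn as nn


class PooledVisualCrossAttention(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8, pool_stride: int = 16):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D) tokens from the pretrained visual encoder
        pooled = self.pool(visual_tokens.transpose(1, 2)).transpose(1, 2)  # (B, N/stride, D)
        # Pooled tokens query the original set to recover fine-grained detail.
        refined, _ = self.attn(pooled, visual_tokens, visual_tokens)
        return refined  # compact token set passed on to the LLM


if __name__ == "__main__":
    module = PooledVisualCrossAttention()
    print(module(torch.randn(1, 4096, 1024)).shape)  # torch.Size([1, 256, 1024])
```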

Authors:Chenhao Zhang, Yazhe Niu
Title: Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework
Abstract:
Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they struggle with a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a huge improvement on the Chinese benchmark, performing comparably to the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at https://github.com/MING-ZCH/Let-Androids-Dream-of-Electric-Sheep.
中文: LAD框架通过感知、知识搜索和推理三阶段方法,解决了AI系统理解图像隐喻含义的难题,在图像隐含意义基准测试中取得了最先进的性能表现。
English: The LAD framework addresses AI's challenge in understanding metaphorical image implications by employing a three-stage process of perception, knowledge search, and reasoning, achieving state-of-the-art performance on image implication benchmarks.

Authors:Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue
Title: SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Abstract:
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1 as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
Chinese: 最新研究提出SophiaVL-R1方法,通过在多模态大语言模型中引入思维过程奖励与结果奖励相结合的方式,有效提升了模型的推理策略和泛化能力,在多个基准测试中实现了优于更大规模模型的性能表现。
English: Recent research introduces SophiaVL-R1, a method that enhances multimodal large language models by incorporating process-based rewards alongside outcome rewards to improve reasoning strategies and generalization, achieving superior performance on benchmarks despite smaller model sizes.
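One way to picture the trustworthiness weighting in Trust-GRPO is to compare the thinking rewards of rollouts that reached correct versus incorrect answers. The weighting function below is a hedged guess at the mechanism, not the paper's formula.

```python
# Rough sketch of a Trust-GRPO-style trustworthiness weight: the thinking reward is
# down-weighted when wrong-answer responses receive thinking rewards comparable to (or higher
# than) correct-answer responses, which signals possible reward hacking. The exact weighting
# function used in the paper is not reproduced here; this is an assumption.
import torch


def trust_weight(think_rewards: torch.Tensor, correct: torch.Tensor) -> torch.Tensor:
    """think_rewards: (G,) thinking reward per rollout; correct: (G,) boolean outcome flags."""
    if correct.all() or (~correct).all():
        return torch.tensor(1.0)  # no contrast available, fall back to full trust
    gap = think_rewards[correct].mean() - think_rewards[~correct].mean()
    return torch.sigmoid(gap)  # small or negative gap -> low trust in the thinking reward


if __name__ == "__main__":
    rewards = torch.tensor([0.9, 0.8, 0.3, 0.2])
    outcomes = torch.tensor([True, True, False, False])
    print(trust_weight(rewards, outcomes))  # high trust: correct answers scored clearly better
```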

Authors:Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Title: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO
Abstract:
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
中文: 本研究首次对自回归图像生成中的GRPO和DPO强化学习算法进行全面分析,揭示了它们各自的优势,并证明具有更强泛化能力的奖励模型能够同时提升域内性能和跨域泛化能力。
English: This study provides the first comprehensive analysis of GRPO and DPO reinforcement learning algorithms in autoregressive image generation, revealing their distinct advantages and demonstrating how reward models with stronger generalization capabilities can enhance both in-domain performance and out-of-domain generalization.

Authors:Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Krähenbühl
Title: Interactive Post-Training for Vision-Language-Action Models
Abstract:
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, improving the lightweight QueST model by 21.2% and bringing the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
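Leave-one-out advantage estimation from sparse binary rewards, as named in the RIPT-VLA abstract, has a simple standard form: each rollout is baselined against the mean reward of its sibling rollouts. The sketch below shows that computation; variable names are assumptions.

```python
# Small sketch of leave-one-out advantage estimation from sparse binary success rewards.
# Each rollout's baseline is the mean reward of the other rollouts for the same task/context.
import torch


def leave_one_out_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (K,) binary success indicators for K rollouts of one task/context."""
    k = rewards.numel()
    baseline = (rewards.sum() - rewards) / (k - 1)  # mean of the other K-1 rollouts
    return rewards - baseline


if __name__ == "__main__":
    print(leave_one_out_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
    # tensor([ 0.6667, -0.6667, -0.6667,  0.6667])
```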

Authors:Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen
Title: When Are Concepts Erased From Diffusion Models?
Abstract:
Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.

Authors:Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
Abstract:
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
中文: 本文提出VGBench和SpatialScore基准来评估多模态大语言模型的三维空间理解能力,并开发了SpatialAgent系统,该系统在解决空间推理挑战方面展现出显著效果。
English: This paper introduces VGBench and SpatialScore benchmarks to evaluate multimodal large language models' 3D spatial understanding, and proposes SpatialAgent system which demonstrates effectiveness in addressing spatial reasoning challenges.

Authors:Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang
Title: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
Abstract:
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
中文: AdapTok是一种自适应视频分词器,通过内容感知策略动态分配每帧的令牌,在固定预算下无需额外数据即可提升视频重建与生成效率。
English: AdapTok is an adaptive video tokenizer that dynamically allocates tokens per frame using a content-aware strategy, enhancing reconstruction and generation efficiency under controlled budgets without extra data.
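AdapTok's inference-time allocation solves an integer linear program over predicted per-block quality scores; the greedy sketch below is a simplified stand-in that just spends a global token budget where the predicted marginal gain is largest. All names and the greedy substitution are assumptions for illustration.

```python
# Simplified stand-in for AdapTok's budgeted token allocation (the paper uses integer linear
# programming; this greedy version only illustrates content-aware allocation under a budget).
from typing import List


def allocate_tokens(gains: List[List[float]], budget: int, min_tokens: int = 1) -> List[int]:
    """gains[b][k] = predicted quality when block b uses k+1 tokens; returns tokens per block."""
    alloc = [min_tokens] * len(gains)
    spent = sum(alloc)
    while spent < budget:
        best_block, best_gain = None, 0.0
        for b, curve in enumerate(gains):
            if alloc[b] < len(curve):
                gain = curve[alloc[b]] - curve[alloc[b] - 1]  # marginal gain of one more token
                if gain > best_gain:
                    best_block, best_gain = b, gain
        if best_block is None:
            break
        alloc[best_block] += 1
        spent += 1
    return alloc


if __name__ == "__main__":
    # Two blocks: the first saturates quickly, the second keeps improving with more tokens.
    print(allocate_tokens([[0.5, 0.55, 0.56], [0.3, 0.5, 0.7]], budget=4))  # [1, 3]
```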

Authors:Ming Qian, Bin Tan, Qiuyu Wang, Xianwei Zheng, Hanjiang Xiong, Gui-Song Xia, Yujun Shen, Nan Xue
Title: Seeing through Satellite Images at Street Views
Abstract:
This paper studies the task of SatStreet-view synthesis, which aims to render photorealistic street-view panorama images and videos given any satellite image and specified camera positions or trajectories. We formulate the problem as learning a neural radiance field from paired images captured from satellite and street viewpoints, which is a challenging learning problem due to the sparse-view nature and the extremely large viewpoint changes between satellite and street-view images. We tackle the challenges based on a task-specific observation that street-view specific elements, including the sky and illumination effects, are only visible in street-view panoramas, and present a novel approach, Sat2Density++, to accomplish the goal of photo-realistic street-view panorama rendering by modeling these street-view specific elements in neural networks. In the experiments, our method is evaluated on both urban and suburban scene datasets, demonstrating that Sat2Density++ is capable of rendering photorealistic street-view panoramas that are consistent across multiple views and faithful to the satellite image.

Authors:Runpeng Yu, Xinyin Ma, Xinchao Wang
Title: Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
Abstract:
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple can be as low as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.
中文摘要:本文提出了首个离散扩散多模态大语言模型Dimple,通过结合自回归与扩散训练的新范式解决了训练不稳定和长度偏差问题,不仅性能超越同类模型,还通过置信解码显著提升推理效率,并利用结构先验实现精细化的响应控制。
English Summary: This paper introduces Dimple, the first discrete diffusion multimodal large language model, which overcomes training instability and length bias through a hybrid autoregressive-diffusion approach, achieving superior performance and enhanced inference efficiency via confident decoding while enabling precise response control through structure priors.
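Confident decoding, as described in the abstract, commits multiple masked positions per step once their predicted probability is high enough, so the number of forward passes can drop well below the response length. The sketch below is a hedged reading of one such step; threshold, mask handling, and names are assumptions.

```python
# Hedged sketch of a "confident decoding" step: commit every masked position whose top
# probability clears a threshold (at least one per step so decoding always progresses).
# Tensor names, the threshold, and the fallback rule are assumptions.
import torch


def confident_decode_step(probs: torch.Tensor, tokens: torch.Tensor,
                          mask_id: int, tau: float = 0.9) -> torch.Tensor:
    """probs: (L, V) predicted distributions; tokens: (L,) current sequence with mask_id holes."""
    conf, pred = probs.max(dim=-1)                      # (L,), (L,)
    masked = tokens == mask_id
    commit = masked & (conf >= tau)
    if masked.any() and not commit.any():
        # Always make progress: commit the single most confident masked position.
        idx = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
        commit[idx] = True
    return torch.where(commit, pred, tokens)


if __name__ == "__main__":
    probs = torch.softmax(torch.randn(8, 100), dim=-1)
    tokens = torch.full((8,), 0)  # 0 acts as the mask id in this toy example
    print(confident_decode_step(probs, tokens, mask_id=0, tau=0.05))
```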

Authors:Moru Liu, Hao Dong, Jessica Kelly, Olga Fink, Mario Trapp
Title: Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation
Abstract:
Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at https://github.com/mona4399/FeatureMixing.
Chinese: 本文提出特征混合法,一种快速通用的多模态异常合成方法,有效提升多数据集上的OOD检测性能,在实现最优表现的同时大幅提升运算速度。
English: This paper introduces Feature Mixing, a fast and versatile method for multimodal outlier synthesis that enhances OOD detection across various datasets, achieving top performance with significant speed improvements.
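The abstract's claim that Feature Mixing is "extremely simple" suggests a small sketch is enough to convey the idea: mix in-distribution features across samples and modalities to synthesize outlier embeddings. The pairing scheme and coefficient below are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a Feature-Mixing-style outlier synthesizer: combine in-distribution
# features across shuffled samples so the joint combination is unlikely under the ID data,
# then train the OOD head to assign such embeddings low ID confidence. Names are assumptions.
import torch


def feature_mixing(feat_a: torch.Tensor, feat_b: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """feat_a, feat_b: (B, D_a) and (B, D_b) features from two modalities (e.g. image, LiDAR)."""
    perm = torch.randperm(feat_a.size(0))
    # Mix each sample's features with a shuffled partner in both modalities.
    mixed_a = alpha * feat_a + (1 - alpha) * feat_a[perm]
    mixed_b = alpha * feat_b + (1 - alpha) * feat_b[perm]
    return torch.cat([mixed_a, mixed_b], dim=-1)  # synthetic outlier embeddings


if __name__ == "__main__":
    outliers = feature_mixing(torch.randn(16, 128), torch.randn(16, 64))
    print(outliers.shape)  # torch.Size([16, 192])
```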

Authors:Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
Title: Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On
Abstract:
Diffusion models have shown preliminary success in virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for VTON task. Nevertheless, the problem remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion model. To alleviate this issue, we novelly propose to explicitly capitalize on visual correspondence as the prior to tame diffusion process instead of simply feeding the whole garment into UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in garment to the ones over target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with depth/normal map of target person. The correspondence mimics the way of putting clothing on human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take the advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets. Code is publicly available at: https://github.com/HiDream-ai/SPM-Diff.
中文摘要:本文提出了一种新颖的基于扩散模型的虚拟试穿方法,通过利用显式视觉对应关系和三维感知语义点匹配来引导扩散过程,显著提升了服装细节保持能力,在多个基准数据集上实现了最先进的性能表现。
English Summary: This paper introduces a novel diffusion-based virtual try-on method that enhances garment detail preservation by using explicit visual correspondence and 3D-aware semantic point matching to guide the diffusion process, achieving state-of-the-art performance on benchmark datasets.

Authors:Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer
Title: OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning
Abstract:
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual rationale for objects in a coarse-to-fine manner. Based on these reasoning steps, we can compose detailed description prompts, and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at https://github.com/Hanzy1996/OpenSeg-R.
Chinese: OpenSeg-R通过引入基于大型多模态模型的逐步视觉推理框架,生成分层推理提示以增强开放词汇分割,在多个基准测试中显著提升了准确性和可解释性。
English: OpenSeg-R introduces a step-by-step visual reasoning framework using Large Multimodal Models to enhance open-vocabulary segmentation by generating hierarchical reasoning prompts, significantly improving accuracy and interpretability across multiple benchmarks.

Authors:InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
Title: InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
Abstract:
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
中文: InternAgent作为一种统一的多智能体框架,通过其可扩展性、交互性和高效性,在多个科学领域推动自主研究,实现快速创新并与领域专家无缝协作。
English: InternAgent is a unified multi-agent framework that accelerates autonomous scientific research across various fields, offering scalability, interactivity, and efficiency by enabling rapid innovation and seamless human-expert collaboration.

Authors:Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Title: LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Abstract:
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.
中文: LLaDA-V是一种基于扩散的多模态模型,通过将视觉指令调整与掩码扩散模型相结合,在文本任务能力较弱的情况下仍展现出竞争力的多模态性能,并在同类扩散模型中实现了最先进的效果。
English: LLaDA-V is a diffusion-based multimodal model that integrates visual instruction tuning with masked diffusion models, demonstrating competitive performance in multimodal tasks despite weaker text-only capabilities and achieving state-of-the-art results among diffusion-based MLLMs.

Authors:Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
Title: Training-Free Efficient Video Generation via Dynamic Token Carving
Abstract:
Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83× speedup with 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
中文:Jenga是一种创新的推理流程,通过动态优化注意力机制和逐步提升分辨率,显著加速视频扩散模型的生成速度,同时保持生成质量基本不变。
English: Jenga is an innovative inference pipeline that accelerates video diffusion models by dynamically optimizing attention mechanisms and progressively increasing resolution, achieving significant speed improvements without compromising generation quality.
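The progressive-resolution half of Jenga can be pictured as a simple step-dependent schedule over the latent size. The stage boundaries, scale factors, and latent dimensions below are illustrative assumptions only; the attention-carving half of the method is not shown.

```python
# Toy sketch of a progressive-resolution schedule in the spirit of Jenga: early denoising
# steps run on a downsampled latent and the resolution is raised as denoising progresses.
def latent_resolution(step: int, total_steps: int, full_hw=(60, 104)) -> tuple:
    """Return the (H, W) of the latent to use at the given denoising step."""
    progress = step / total_steps
    if progress < 0.4:
        scale = 4       # coarse structure only
    elif progress < 0.8:
        scale = 2
    else:
        scale = 1       # full resolution for fine detail
    return full_hw[0] // scale, full_hw[1] // scale


if __name__ == "__main__":
    for s in (0, 20, 45):
        print(s, latent_resolution(s, total_steps=50))
```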

Authors:Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou
Title: Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Abstract:
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
中文: TON方法采用两阶段训练策略,使视觉语言模型能够选择性跳过不必要的推理步骤,在多种任务中保持或提升性能的同时,将完成长度减少高达90%。
English: The TON method introduces a two-stage training strategy that enables vision-language models to selectively skip unnecessary reasoning steps, reducing completion length by up to 90% while maintaining or improving performance across various tasks.
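The 'thought dropout' used in TON's SFT stage is easy to illustrate as a data-preparation step: with some probability the reasoning trace is replaced by an empty thought, establishing the think-or-not format. The tag strings and dropout rate below are assumptions.

```python
# Toy sketch of thought dropout for supervised fine-tuning data: reasoning traces are randomly
# replaced with empty thoughts so the model learns that skipping reasoning is a valid option.
import random


def thought_dropout(example: dict, p_drop: float = 0.5) -> dict:
    reasoning = example["reasoning"] if random.random() > p_drop else ""
    target = f"<think>{reasoning}</think><answer>{example['answer']}</answer>"
    return {"prompt": example["prompt"], "target": target}


if __name__ == "__main__":
    random.seed(0)
    ex = {"prompt": "How many cats are in the image?",
          "reasoning": "Count the cats: there are 3.", "answer": "3"}
    for _ in range(3):
        print(thought_dropout(ex)["target"])
```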

Authors:Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao
Title: From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
Abstract:
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming-lab/EduVisBench and https://github.com/aiming-lab/EduVisAgent.
Chinese: 本文提出了用于评估教育领域基础模型视觉推理能力的基准EduVisBench,并开发了多智能体框架EduVisAgent,该框架通过协同合作显著提升了符合教学需求的可视化内容生成效果。
English: This paper introduces EduVisBench, a benchmark for evaluating the visual reasoning capabilities of foundation models in education, and proposes EduVisAgent, a multi-agent framework that significantly enhances the generation of pedagogically effective visualizations.

Authors:Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai
Title: Perceptual Quality Assessment for Embodied AI
Abstract:
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) based on the Mertonian system and meta-cognitive theory, constructed a perception-cognition-decision-execution pipeline and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36K reference/distorted image pairs, with more than 5M fine-grained annotations provided by Vision Language Models/Vision Language Action-models/Real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex distortions in the real world. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA
中文: 本文提出了面向具身AI的图像质量评估框架Embodied-IQA,通过构建包含数万图像对与精细标注的数据库,建立了感知-认知-决策-执行的完整评估体系,旨在解决现有方法无法衡量机器人任务中图像可用性的关键问题。
English: This paper introduces Embodied-IQA, a novel framework for assessing image quality in embodied AI tasks by establishing a comprehensive database and evaluation pipeline to address the gap in current methods that fail to measure usability for robotic perception.

Authors:Junze Wang, Lei Fan, Weipeng Jing, Donglin Di, Yang Song, Sidong Liu, Cong Cong
Title: Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities
Abstract:
Existing methods for multimodal MRI segmentation with missing modalities typically assume that all MRI modalities are available during training. However, in clinical practice, some modalities may be missing due to the sequential nature of MRI acquisition, leading to performance degradation. Furthermore, retraining models to accommodate newly available modalities can be inefficient and may cause overfitting, potentially compromising previously learned knowledge. To address these challenges, we propose Replay-based Hypergraph Domain Incremental Learning (ReHyDIL) for brain tumor segmentation with missing modalities. ReHyDIL leverages Domain Incremental Learning (DIL) to enable the segmentation model to learn from newly acquired MRI modalities without forgetting previously learned information. To enhance segmentation performance across diverse patient scenarios, we introduce the Cross-Patient Hypergraph Segmentation Network (CHSNet), which utilizes hypergraphs to capture high-order associations between patients. Additionally, we incorporate Tversky-Aware Contrastive (TAC) loss to effectively mitigate information imbalance both across and within different modalities. Extensive experiments on the BraTS2019 dataset demonstrate that ReHyDIL outperforms state-of-the-art methods, achieving an improvement of over 2% in the Dice Similarity Coefficient across various tumor regions. Our code is available at https://github.com/reeive/ReHyDIL.
Chinese: 本文提出ReHyDIL方法,通过领域增量学习和超图网络处理多模态MRI中缺失模态的脑肿瘤分割问题,在BraTS2019数据集上各项肿瘤区域的Dice系数提升超过2%。
English: This paper introduces ReHyDIL, a method that uses domain incremental learning and hypergraph networks to improve brain tumor segmentation in multimodal MRI when modalities are missing, showing over 2% Dice improvement on BraTS2019.
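The TAC loss in ReHyDIL builds on the Tversky index, which generalizes Dice by weighting false positives and false negatives asymmetrically; that asymmetry is how imbalance across and within modalities can be countered. The sketch below shows only the Tversky component with assumed alpha/beta values; the full contrastive formulation is omitted.

```python
# Sketch of a Tversky loss: the alpha/beta weights trade off false positives against false
# negatives, generalizing the Dice loss. Hyperparameter values here are assumptions.
import torch


def tversky_loss(pred: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.3, beta: float = 0.7, eps: float = 1e-6) -> torch.Tensor:
    """pred: (B, ...) probabilities in [0, 1]; target: (B, ...) binary mask."""
    pred, target = pred.flatten(1), target.flatten(1)
    tp = (pred * target).sum(dim=1)
    fp = (pred * (1 - target)).sum(dim=1)
    fn = ((1 - pred) * target).sum(dim=1)
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1 - tversky).mean()


if __name__ == "__main__":
    pred = torch.sigmoid(torch.randn(2, 1, 64, 64))
    target = (torch.rand(2, 1, 64, 64) > 0.5).float()
    print(tversky_loss(pred, target).item())
```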

Authors:Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, Yang You
Title: REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training
Abstract:
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss, once a simple trigger such as a fixed iteration is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at https://github.com/NUS-HPC-AI-Lab/HASTE.
中文:HASTE提出了一种两阶段训练方案,先通过师生模型对齐实现快速收敛,随后终止对齐以释放生成潜力,无需改变架构即可大幅加速扩散模型的训练。
English: HASTE introduces a two-phase training schedule that initially aligns a diffusion transformer with a teacher model for rapid convergence and then terminates alignment to unleash its full generative capacity, achieving significant speedups without architectural changes.
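The two-phase schedule in HASTE reduces to a loss that includes the alignment terms only before a trigger step. The schematic below treats the losses as scalars and uses an assumed trigger step and weight; it is a sketch of the schedule, not the released training loop.

```python
# Schematic of a HASTE-style two-phase loss: a holistic alignment term (features plus
# attention maps distilled from a frozen teacher) is added to the diffusion loss until a
# fixed trigger step, then dropped in one shot. All numeric values are placeholders.
def haste_loss(step: int, denoise_loss: float, feat_align: float, attn_align: float,
               stop_step: int = 50_000, lam: float = 0.5) -> float:
    if step < stop_step:
        # Phase I: holistic alignment (semantic anchors + relational priors) speeds convergence.
        return denoise_loss + lam * (feat_align + attn_align)
    # Phase II: one-shot termination -- the student is freed to use its full generative capacity.
    return denoise_loss


if __name__ == "__main__":
    print(haste_loss(step=1_000, denoise_loss=1.0, feat_align=0.2, attn_align=0.1))   # 1.15
    print(haste_loss(step=80_000, denoise_loss=1.0, feat_align=0.2, attn_align=0.1))  # 1.0
```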

Authors:Jun Xie, Xiongjun Guan, Yingjian Zhu, Zhaoran Zhao, Xinming Wang, Hongzhu Yi, Feng Chen, Zhepeng Wang
Title: Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles
Abstract:
In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (Confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, diversified prompt styles and process paradigms are systematically explored and evaluated to effectively guide the attention of large models, fully unleashing their powerful generalization and adaptability abilities. Experimental results demonstrate that, with our carefully designed approach, directly utilizing an individual multimodal model already outperforms the previous state-of-the-art (SOTA) method which includes several additional processes. Besides, an additional stage is further introduced that facilitates the cooperation and ensemble of periodic results, which achieves impressive performance improvements. We hope this work serves as a valuable reference for the practical application of large models and inspires future research in the field. Our Code is available at https://github.com/XiongjunGuan/EgoSchema-CVPR25.
中文摘要:本文提出了CVPR 2025 Ego4D EgoSchema挑战赛的亚军解决方案,通过小样本学习和模型集成策略将多模态大模型应用于视频理解任务,实现了超越现有最优方法的性能表现。
English Summary: This paper introduces the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025, which adapts multimodal large models to video understanding through few-shot learning and ensemble strategies, achieving state-of-the-art performance.

Authors:Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, Kiyohiro Nakayama, Zhengyang Geng, Houwen Peng, Han Hu, Shi-Min Hu
Title: RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
Abstract:
The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at https://evalmodels.github.io/rbenchv

Authors:Jian Liu, Jing Xu, Song Guo, Jing Li, Jingfeng Guo, Jiaao Yu, Haohan Weng, Biwen Lei, Xianghui Yang, Zhuo Chen, Fangqi Zhu, Tao Han, Chunchao Guo
Title: Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning
Abstract:
Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present Mesh-RFT, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6% and improves Topology Score (TS) by 3.8% over pre-trained models, while outperforming global DPO methods with a 17.4% HD reduction and 4.9% TS gain. These results demonstrate Mesh-RFT's ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: https://hitcslj.github.io/mesh-rft/
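A boundary-edge statistic of the kind the abstract names (Boundary Edge Ratio) can be computed directly from a face list. The reading below, the fraction of mesh edges used by exactly one face, is a plausible interpretation only; the paper's precise definition may differ.

```python
# Hedged sketch of a boundary-edge statistic in the spirit of Mesh-RFT's Boundary Edge Ratio:
# here it is read as the fraction of edges used by exactly one face, which flags open,
# non-watertight regions of a triangle mesh. This definition is an assumption.
from collections import Counter
from typing import List, Tuple


def boundary_edge_ratio(faces: List[Tuple[int, int, int]]) -> float:
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted((u, v)))] += 1
    boundary = sum(1 for count in edges.values() if count == 1)
    return boundary / max(len(edges), 1)


if __name__ == "__main__":
    # A single triangle: all three edges are boundary edges.
    print(boundary_edge_ratio([(0, 1, 2)]))             # 1.0
    # Two triangles sharing one edge: 4 of 5 edges are boundary edges.
    print(boundary_edge_ratio([(0, 1, 2), (1, 2, 3)]))  # 0.8
```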

Authors:Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
Title: R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Abstract:
In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, and then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
中文摘要:本研究提出Share-GRPO方法,通过扩展问题空间并共享多样化推理路径和奖励信息,有效解决强化学习中的稀疏奖励和优势消失问题,从而提升多模态大语言模型的推理能力。
English Summary: This study introduces Share-GRPO, a reinforcement learning approach that enhances multimodal large language models' reasoning by expanding question spaces and sharing diverse reasoning trajectories and reward information to overcome sparse rewards and advantage vanishing issues.
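The hierarchical advantage idea (normalizing rewards both within and across question variants) can be sketched as below; the blending weight `w` and the exact normalization are illustrative assumptions, not the released Share-GRPO implementation.

```python
# Schematic sketch of hierarchical advantage estimation across question variants.
import numpy as np


def hierarchical_advantages(rewards_per_variant: list[np.ndarray], w: float = 0.5) -> list[np.ndarray]:
    variant_means = np.array([r.mean() for r in rewards_per_variant])
    global_mean, global_std = variant_means.mean(), variant_means.std() + 1e-8
    advantages = []
    for r, v_mean in zip(rewards_per_variant, variant_means):
        within = (r - v_mean) / (r.std() + 1e-8)       # within-variant advantage
        across = (v_mean - global_mean) / global_std    # across-variant advantage
        advantages.append(w * within + (1.0 - w) * across)
    return advantages


if __name__ == "__main__":
    # Two variants of the same question, each with 4 sampled rollouts.
    rewards = [np.array([1.0, 0.0, 1.0, 1.0]), np.array([0.0, 0.0, 1.0, 0.0])]
    for adv in hierarchical_advantages(rewards):
        print(np.round(adv, 3))
```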

Authors:Haihong Hao, Mingfei Han, Changlin Li, Zhihui Li, Xiaojun Chang
Title: CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation
Abstract:
Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified fusion approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face challenges due to the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework where a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text model to the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at comparable navigation Success Rates, CoNav often generates shorter paths than other methods (as measured by SPL), showcasing the potential and challenges of fusing data from different modalities in embodied navigation. Project Page: https://oceanhao.github.io/CoNav/

Authors:Giuseppe Guarino, Matteo Ciotola, Gemine Vivone, Giovanni Poggi, Giuseppe Scarpa
Title: Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control
Abstract:
Hyperspectral pansharpening has received much attention in recent years due to technological and methodological advances that open the door to new application scenarios. However, research on this topic is only now gaining momentum. The most popular methods are still borrowed from the more mature field of multispectral pansharpening and often overlook the unique challenges posed by hyperspectral data fusion, such as i) the very large number of bands, ii) the overwhelming noise in selected spectral ranges, iii) the significant spectral mismatch between panchromatic and hyperspectral components, iv) a typically high resolution ratio. Imprecise data modeling especially affects spectral fidelity. Even state-of-the-art methods perform well in certain spectral ranges and much worse in others, failing to ensure consistent quality across all bands, with the risk of generating unreliable results. Here, we propose a hyperspectral pansharpening method that explicitly addresses this problem and ensures uniform spectral quality. To this end, a single lightweight neural network is used, with weights that adapt on the fly to each band. During fine-tuning, the spatial loss is turned on and off to ensure a fast convergence of the spectral loss to the desired level, according to a hysteresis-like dynamic. Furthermore, the spatial loss itself is appropriately redefined to account for nonlinear dependencies between panchromatic and spectral bands. Overall, the proposed method is fully unsupervised, with no prior training on external data, flexible, and low-complexity. Experiments on a recently published benchmarking toolbox show that it ensures excellent sharpening quality, competitive with the state-of-the-art, consistently across all bands. The software code and the full set of results are shared online on https://github.com/giu-guarino/rho-PNN.
中文: 本文提出一种轻量级无监督高光谱全色锐化方法,通过逐波段自适应网络权重并重新定义空间损失函数,确保所有波段光谱质量一致,无需外部训练数据即可获得与先进方法相媲美的效果。
English: This paper introduces a lightweight, unsupervised hyperspectral pansharpening method that adapts network weights per band and redefines spatial loss to ensure uniform spectral quality across all bands, achieving competitive results without external training data.
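The hysteresis-like schedule can be sketched as a training loop that switches the spatial loss off when the spectral loss rises above an upper threshold and back on once it drops below a lower one; the thresholds, losses, and toy model below are placeholders, not the paper's configuration.

```python
# Minimal sketch of a hysteresis-gated spatial loss during per-band fine-tuning.
import torch


def fine_tune_band(model, optimizer, batches, spectral_loss_fn, spatial_loss_fn,
                   upper=0.02, lower=0.01, steps=200):
    spatial_on = True
    for step, (pan, hs_band) in zip(range(steps), batches):
        pred = model(pan, hs_band)
        l_spec = spectral_loss_fn(pred, hs_band)
        # Hysteresis: two thresholds prevent rapid on/off oscillation of the spatial term.
        if spatial_on and l_spec.item() > upper:
            spatial_on = False
        elif not spatial_on and l_spec.item() < lower:
            spatial_on = True
        loss = l_spec + (spatial_loss_fn(pred, pan) if spatial_on else 0.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    class Toy(torch.nn.Module):                      # stand-in lightweight network
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Conv2d(2, 1, 3, padding=1)

        def forward(self, pan, band):
            return self.net(torch.cat([pan, band], dim=1))

    model = Toy()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    data = [(torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)) for _ in range(8)]
    fine_tune_band(model, opt, iter(data), torch.nn.functional.mse_loss,
                   torch.nn.functional.l1_loss, steps=8)
```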

Authors:Michael Neri, Sara Baldoni
Title: Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images
Abstract:
Due to the recent increase in the number of connected devices, the need to promptly detect security issues is emerging. Moreover, the high number of communication flows creates the necessity of processing huge amounts of data. Furthermore, the connected devices are heterogeneous in nature, having different computational capacities. For this reason, in this work we propose an image-based representation of network traffic that provides a compact summary of the current network conditions over 1-second time windows. The proposed representation highlights the presence of anomalies, thus reducing the need for complex processing architectures. Finally, we present an unsupervised learning approach which effectively detects the presence of anomalies. The code and the dataset are available at https://github.com/michaelneri/image-based-network-traffic-anomaly-detection.
中文: 本研究提出了一种基于图像的网络流量表示方法和无监督学习技术,能有效检测异构连接设备中的安全异常,同时降低对复杂处理架构的需求。
English: This work proposes an image-based network traffic representation and unsupervised learning approach to efficiently detect security anomalies in heterogeneous connected devices, reducing the need for complex processing architectures.
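A minimal sketch of the general recipe, assuming a small convolutional autoencoder trained on benign traffic images and a reconstruction-error threshold; the image size and threshold rule are illustrative, not the paper's exact setup.

```python
# Autoencoder anomaly detection sketch: train on benign traffic images, flag windows
# whose reconstruction error exceeds a validation-derived threshold.
import torch
import torch.nn as nn


class TrafficAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU(),
                                 nn.ConvTranspose2d(8, 1, 2, stride=2), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))


def anomaly_scores(model, images):
    with torch.no_grad():
        recon = model(images)
    return ((images - recon) ** 2).mean(dim=(1, 2, 3))  # per-image reconstruction MSE


if __name__ == "__main__":
    model = TrafficAE()
    normal = torch.rand(16, 1, 32, 32)   # stand-in for benign 1-second traffic images
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(20):
        loss = ((model(normal) - normal) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    threshold = anomaly_scores(model, normal).quantile(0.99)
    test = torch.rand(4, 1, 32, 32)
    print(anomaly_scores(model, test) > threshold)   # True marks a suspected anomaly
```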

Authors:Sushant Gautam, Michael A. Riegler, Pål Halvorsen
Title: Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models
Abstract:
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at https://github.com/simula/PointDetectCount.
中文摘要:本研究通过指令微调视觉语言模型实现医学图像的多任务分析,在提升检测与计数精度的同时揭示了边缘案例可靠性的权衡,展示了适配临床诊断流程的复合推理能力。
English Summary: This study fine-tunes vision-language models for multi-task medical image analysis, demonstrating improved accuracy in detection and counting tasks while noting trade-offs in edge-case reliability through prompt-based adaptation of clinical workflows.

Authors:Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen, Mubarak Shah
Title: SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding
Abstract:
The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. https://github.com/simula/SoccerChat
中文: SoccerChat提出了一种多模态AI框架,通过整合视觉与文本数据提升足球视频理解能力,基于SoccerNet数据集在事件分类和裁判决策任务中展现出优越性能。
English: SoccerChat introduces a multimodal AI framework that integrates visual and textual data to enhance soccer video understanding, demonstrating improved performance in event classification and referee decision-making through the SoccerNet dataset.

Authors:Luyang Cao, Jianwei Li, Yinghuan Shi
Title: Background Matters: A Cross-view Bidirectional Modeling Framework for Semi-supervised Medical Image Segmentation
Abstract:
Semi-supervised medical image segmentation (SSMIS) leverages unlabeled data to reduce reliance on manually annotated images. However, current SOTA approaches predominantly focus on foreground-oriented modeling (i.e., segmenting only the foreground region) and have largely overlooked the potential benefits of explicitly modeling the background region. Our study theoretically and empirically demonstrates that highly certain predictions in background modeling enhance the confidence of corresponding foreground modeling. Building on this insight, we propose the Cross-view Bidirectional Modeling (CVBM) framework, which introduces a novel perspective by incorporating background modeling to improve foreground modeling performance. Within CVBM, background modeling serves as an auxiliary perspective, providing complementary supervisory signals to enhance the confidence of the foreground model. Additionally, CVBM introduces an innovative bidirectional consistency mechanism, which ensures mutual alignment between foreground predictions and background-guided predictions. Extensive experiments demonstrate that our approach achieves SOTA performance on the LA, Pancreas, ACDC, and HRF datasets. Notably, on the Pancreas dataset, CVBM outperforms fully supervised methods (i.e., DSC: 84.57% vs. 83.89%) while utilizing only 20% of the labeled data. Our code is publicly available at https://github.com/caoluyang0830/CVBM.git.
中文: 提出的跨视角双向建模框架通过结合背景建模提升前景分割置信度,并引入双向一致性机制,仅用少量标注数据即在多个数据集上实现最优性能。
English: The proposed Cross-view Bidirectional Modeling (CVBM) framework enhances semi-supervised medical image segmentation by incorporating background modeling to boost foreground confidence and introducing bidirectional consistency, achieving state-of-the-art performance with minimal labeled data.
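One way to picture the bidirectional consistency mechanism is as a loss that asks the foreground probability map to agree with one minus the background probability map; the sketch below is an assumption-level illustration, not the full CVBM framework.

```python
# Illustrative bidirectional consistency loss between foreground and background branches.
import torch
import torch.nn.functional as F


def bidirectional_consistency_loss(fg_logits: torch.Tensor, bg_logits: torch.Tensor) -> torch.Tensor:
    fg_prob = torch.sigmoid(fg_logits)            # P(pixel is foreground)
    bg_prob = torch.sigmoid(bg_logits)            # P(pixel is background)
    # Foreground prediction should match the background-guided prediction, and vice versa.
    loss_fg = F.mse_loss(fg_prob, (1.0 - bg_prob).detach())
    loss_bg = F.mse_loss(bg_prob, (1.0 - fg_prob).detach())
    return 0.5 * (loss_fg + loss_bg)


if __name__ == "__main__":
    fg = torch.randn(2, 1, 64, 64, requires_grad=True)
    bg = torch.randn(2, 1, 64, 64, requires_grad=True)
    bidirectional_consistency_loss(fg, bg).backward()
```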

Authors:Siqu Ou, Hongcheng Liu, Pingjie Wang, Yusheng Liao, Chuan Xuan, Yanfeng Wang, Yu Wang
Title: Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning
Abstract:
While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at https://github.com/Cratileo/D2R.
中文摘要:GRASSLAND基准测试和D2R框架通过将视觉草稿与文本推理链相结合,显著提升了多模态模型在动态空间推理任务中的表现,且无需模型微调。
English Summary: The GRASSLAND benchmark and D2R framework enhance dynamic spatial reasoning in multimodal models by integrating visual drafts with textual reasoning chains, achieving superior performance without fine-tuning.

Authors:Jannis Becktepe, Leona Hennig, Steffen Oeltze-Jafra, Marius Lindauer
Title: Auto-nnU-Net: Towards Automated Medical Image Segmentation
Abstract:
Medical Image Segmentation (MIS) includes diverse tasks, from bone to organ segmentation, each with its own challenges in finding the best segmentation model. The state-of-the-art AutoML-related MIS-framework nnU-Net automates many aspects of model configuration but remains constrained by fixed hyperparameters and heuristic design choices. As a full-AutoML framework for MIS, we propose Auto-nnU-Net, a novel nnU-Net variant enabling hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS). Additionally, we propose Regularized PriorBand to balance model accuracy with the computational resources required for training, addressing the resource constraints often faced in real-world medical settings that limit the feasibility of extensive training procedures. We evaluate our approach across diverse MIS datasets from the well-established Medical Segmentation Decathlon, analyzing the impact of AutoML techniques on segmentation performance, computational efficiency, and model design choices. The results demonstrate that our AutoML approach substantially improves the segmentation performance of nnU-Net on 6 out of 10 datasets and is on par on the other datasets while maintaining practical resource requirements. Our code is available at https://github.com/automl/AutoNNUnet.
Chinese: Auto-nnU-Net 通过引入超参数优化和神经架构搜索,显著提升了 nnU-Net 在多个医学影像数据集上的分割性能,同时保持了实际应用中的计算资源可行性。
English: Auto-nnU-Net enhances the nnU-Net framework by incorporating hyperparameter optimization and neural architecture search, significantly improving segmentation performance on multiple medical imaging datasets while maintaining practical computational resource usage.

Authors:Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao
Title: ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation
Abstract:
While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-off between mask quality and efficiency are implemented via group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.
中文: ALTo模型通过创新的长度预测器和优化策略,实现了自适应分词,在保持高效的同时,在主流分割基准上达到了最优性能。
English: The proposed ALTo model introduces an adaptive tokenizer with a novel length predictor and optimization strategy, enabling state-of-the-art segmentation performance while dynamically adjusting token usage for efficiency.

Authors:Yuetong Liu, Yunqiu Xu, Yang Wei, Xiuli Bi, Bin Xiao
Title: Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration
Abstract:
Restoring nighttime images affected by multiple adverse weather conditions is a practical yet under-explored research problem, as multiple weather conditions often coexist in the real world alongside various lighting effects at night. This paper first explores the challenging multi-weather nighttime image restoration task, where various types of weather degradations are intertwined with flare effects. To support the research, we contribute the AllWeatherNight dataset, featuring large-scale high-quality nighttime images with diverse compositional degradations, synthesized using our introduced illumination-aware degradation generation. Moreover, we present ClearNight, a unified nighttime image restoration framework, which effectively removes complex degradations in one go. Specifically, ClearNight extracts Retinex-based dual priors and explicitly guides the network to focus on uneven illumination regions and intrinsic texture contents respectively, thereby enhancing restoration effectiveness in nighttime scenarios. In order to better represent the common and unique characters of multiple weather degradations, we introduce a weather-aware dynamic specific-commonality collaboration method, which identifies weather degradations and adaptively selects optimal candidate units associated with specific weather types. Our ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Comprehensive ablation experiments validate the necessity of AllWeatherNight dataset as well as the effectiveness of ClearNight. Project page: https://henlyta.github.io/ClearNight/mainpage.html

Authors:Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, Yong Liu
Title: Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
Abstract:
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs outperform open-source alternatives and gain moderately from multimodal inputs over text-only inputs, whereas open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.
Chinese: MMDocRAG提出了包含4,055个专家标注问答对的综合基准及创新评估指标,通过大规模实验发现专有视觉语言模型在跨模态证据处理上显著优于开源模型,为多模态文档问答系统提供了重要改进方向。
English: MMDocRAG introduces a comprehensive benchmark with 4,055 QA pairs and novel metrics to address DocVQA's limitations in multimodal evidence handling, revealing through extensive testing that proprietary models outperform open-source alternatives and benefit more from visual inputs.

Authors:Jisu Han, Jaemin Na, Wonjun Hwang
Title: Ranked Entropy Minimization for Continual Test-Time Adaptation
Abstract:
Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continuous scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model's probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results. Our code is available at https://github.com/pilsHan/rem
中文: 本研究提出排序熵最小化方法,通过渐进式结构化预测难度并保持熵的排序,有效防止持续测试时适应中的模型崩溃,在多个基准测试中验证了其有效性。
English: The study introduces ranked entropy minimization to prevent model collapse in continual test-time adaptation by progressively structuring prediction difficulty while maintaining entropy rank order, demonstrating effectiveness across benchmarks.
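A loose sketch of the two named ingredients, entropy minimization plus a ranking term that keeps prediction entropy monotone across progressively masked (harder) views; the random masking scheme and margin are invented for illustration and are not the paper's exact procedure.

```python
# Entropy minimization with an entropy rank-order penalty over increasingly masked views.
import torch
import torch.nn.functional as F


def entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1)


def ranked_entropy_loss(model, image, mask_ratios=(0.0, 0.25, 0.5), margin=0.0):
    entropies = []
    for ratio in mask_ratios:                       # easier -> harder views
        mask = (torch.rand_like(image) > ratio).float()
        entropies.append(entropy(model(image * mask)).mean())
    loss = entropies[0]                             # minimize entropy on the clean view
    for easy, hard in zip(entropies[:-1], entropies[1:]):
        loss = loss + F.relu(easy - hard + margin)  # keep entropy rank: easy <= hard
    return loss


if __name__ == "__main__":
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(4, 3, 32, 32)
    print(ranked_entropy_loss(model, x).item())
```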

Authors:Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
Title: Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Abstract:
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at [https://github.com/lose4578/CircleRoPE](https://github.com/lose4578/CircleRoPE).
中文: 该摘要提出Circle-RoPE方法,通过将图像令牌投影到与文本令牌线性轴正交的环形结构上,形成锥形编码空间来消除视觉语言模型中的跨模态位置偏差,同时采用交错策略在不同层应用不同RoPE变体以保持图像空间信息。
English: The abstract introduces Circle-RoPE, a novel positional encoding method that projects image tokens onto a ring orthogonal to text tokens' linear axis to eliminate cross-modal biases in vision-language models, while preserving spatial information through a cone-like structure and staggered RoPE application across layers.
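The geometric claim, that placing image token indices on a circle orthogonal to the text token line makes every text token equidistant from all image tokens, can be checked with a few lines of numpy; this is a toy illustration of the layout, not the actual RoPE implementation.

```python
# Toy geometry of the cone-like layout: text tokens on a line, image tokens on a circle.
import numpy as np

n_text, n_img, radius = 5, 8, 2.0

# Text positions along the z-axis.
text_pos = np.stack([np.zeros(n_text), np.zeros(n_text), np.arange(n_text)], axis=1)

# Image positions on a circle in the z = 0 plane, orthogonal to the text line.
theta = 2 * np.pi * np.arange(n_img) / n_img
img_pos = np.stack([radius * np.cos(theta), radius * np.sin(theta), np.zeros(n_img)], axis=1)

# Distance from each text token to each image token: constant within every row.
dists = np.linalg.norm(text_pos[:, None, :] - img_pos[None, :, :], axis=-1)
print(np.round(dists, 4))
print(np.allclose(dists, dists[:, :1]))  # True: each text token is equidistant to all image tokens
```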

Authors:Sreetama Sarkar, Yue Che, Alex Gavin, Peter A. Beerel, Souvik Kundu
Title: Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression
Abstract:
Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
Chinese: SPIN是一种新颖的注意力引导头抑制方法,通过在推理过程中选择性抑制低注意力头来减少大型视觉语言模型的幻觉现象,在实现幻觉分数降低2.7倍的同时保持F1分数,并将吞吐量提升1.8倍且不增加延迟。
English: SPIN is a novel attention-guided head suppression method that reduces hallucinations in large vision-language models by selectively suppressing low-attention heads during inference, achieving up to 2.7x lower hallucination scores with 1.8x higher throughput and no latency overhead.
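The head-selection rule can be sketched as scoring each head by its attention mass on image tokens for a given text query token and keeping only the top-K; how suppression is wired into the model is simplified here to a binary mask over head outputs.

```python
# Sketch of selecting which attention heads to keep based on their attention to image tokens.
import torch


def image_attention_head_mask(attn: torch.Tensor, image_token_idx: torch.Tensor,
                              query_idx: int, top_k: int) -> torch.Tensor:
    """attn: (num_heads, seq_len, seq_len) attention weights for one layer."""
    img_mass = attn[:, query_idx, :][:, image_token_idx].sum(dim=-1)   # (num_heads,)
    keep = torch.zeros_like(img_mass)
    keep[img_mass.topk(top_k).indices] = 1.0
    return keep                                                        # 1 = keep, 0 = suppress


if __name__ == "__main__":
    torch.manual_seed(0)
    num_heads, seq_len = 8, 16
    attn = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)
    image_tokens = torch.arange(0, 10)      # assume the first 10 tokens are image tokens
    mask = image_attention_head_mask(attn, image_tokens, query_idx=12, top_k=3)
    print(mask)   # head outputs with mask == 0 would be zeroed before the output projection
```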

Authors:Yuanhao Huang, Yilong Ren, Jinlei Wang, Lujia Huo, Xuesong Bai, Jinchuan Zhang, Haiyan Yu
Title: AdvReal: Physical Adversarial Patch Generation Framework for Security Evaluation of Object Detection Systems
Abstract:
Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, resulting in security accidents. How to generate effective adversarial examples in the physical world and evaluate object detection systems is a huge challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D domains, which simultaneously optimizes texture maps in 2D image and 3D mesh spaces to better address intra-class diversity and real-world environmental variations. The framework includes a novel realistic enhanced adversarial module, with time-space and relighting mapping pipeline that adjusts illumination consistency between adversarial patches and target garments under varied viewpoints. Building upon this, we develop a realism enhancement mechanism that incorporates non-rigid deformation modeling and texture remapping to ensure alignment with the human body's non-rigid surfaces in 3D scenes. Extensive experiment results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Specifically, our method achieves an average attack success rate (ASR) of 70.13% on YOLOv12 in physical scenarios, significantly outperforming existing methods such as T-SEA (21.65%) and AdvTexture (19.70%). Moreover, the proposed method maintains stable ASR across multiple viewpoints and distances, with an average attack success rate exceeding 90% under both frontal and oblique views at a distance of 4 meters. This confirms the method's strong robustness and transferability under multi-angle attacks, varying lighting conditions, and real-world distances. The demo video and code can be obtained at https://github.com/Huangyh98/AdvReal.git.
中文: 本研究提出了一种统一的联合对抗训练框架,通过优化2D和3D空间中的纹理映射,结合光照一致性调整和非刚性变形建模,在物理环境中有效误导目标检测模型,并在多视角和距离下保持稳定的高攻击成功率。
English: This study introduces a unified adversarial training framework that generates robust 2D and 3D adversarial textures, achieving high attack success rates against object detection models in real-world scenarios through illumination-consistent patches and non-rigid deformation modeling.

Authors:Qian Deng, Le Hui, Jin Xie, Jian Yang
Title: Sketchy Bounding-box Supervision for 3D Instance Segmentation
Abstract:
Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore the inaccurate bounding box, named the sketchy bounding box, which is imitated by perturbing the ground-truth bounding box with scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework, which jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding-box supervision. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapped parts between two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the regions of the coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods using sketchy bounding boxes. Code is available at https://github.com/dengq7/Sketchy-3DIS.
Chinese: 提出的Sketchy-3DIS框架通过联合训练伪标签生成器和分割器来处理3D实例分割中不精确的边界框监督,在主要基准测试中实现了最先进的性能。
English: The proposed Sketchy-3DIS framework addresses inaccurate bounding box supervision in 3D instance segmentation by jointly training a pseudo labeler and segmentator to generate high-quality instances, achieving state-of-the-art performance on major benchmarks.
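Since the abstract states that sketchy boxes are imitated by adding scaling, translation, and rotation to ground-truth boxes, a simple perturbation routine might look like the following; the perturbation ranges are made-up values rather than the paper's settings.

```python
# Sketchy-box simulation: jitter a ground-truth 3D box with scale, shift, and yaw noise.
import numpy as np


def perturb_box(center, size, max_scale=0.2, max_shift=0.1, max_rot_deg=10.0, rng=None):
    rng = rng or np.random.default_rng()
    size = np.asarray(size, dtype=float)
    center = np.asarray(center, dtype=float)
    new_size = size * (1.0 + rng.uniform(-max_scale, max_scale, 3))
    new_center = center + rng.uniform(-max_shift, max_shift, 3) * size
    yaw = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))   # rotation about the z-axis
    return new_center, new_size, yaw


if __name__ == "__main__":
    c, s, yaw = perturb_box(center=[0.0, 0.0, 1.0], size=[2.0, 1.0, 1.5],
                            rng=np.random.default_rng(0))
    print(np.round(c, 3), np.round(s, 3), round(float(yaw), 4))
```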

Authors:Zijia Lu, A S M Iftekhar, Gaurav Mittal, Tianjian Meng, Xiawei Wang, Cheng Zhao, Rohith Kukkala, Ehsan Elhamifar, Mei Chen
Title: DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
Abstract:
Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to prohibitive computational costs of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a ``delegate-and-conquer'' strategy to achieve computation efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from the sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47\% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at https://github.com/ZijiaLewisLu/CVPR2025-DeCafNet.
中文: DeCafNet采用“委托与攻克”策略,通过辅助编码器高效处理视频并筛选关键片段,再由专家编码器精炼,在长视频时序定位中实现高达47%的计算量削减,同时创下性能新纪录。
English: DeCafNet introduces a "delegate-and-conquer" strategy using a sidekick encoder for efficient video processing and an expert encoder for refining key clips, achieving up to 47% computation reduction while setting new state-of-the-art performance in long video temporal grounding.
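The delegate-and-conquer routing can be sketched as a cheap sidekick pass that scores every clip against the query, with only the top-scoring clips forwarded to the expensive expert encoder; both encoders below are stand-in modules, not the DeCafNet architecture.

```python
# Sketch of saliency-based clip routing between a cheap and an expensive encoder.
import torch


def delegate_and_conquer(clips, query_emb, sidekick, expert, keep_ratio=0.25):
    """clips: (num_clips, feat_dim) raw clip features; query_emb: (feat_dim,)."""
    cheap_feats = sidekick(clips)                                  # dense, low-cost pass
    saliency = torch.matmul(cheap_feats, query_emb)                # (num_clips,)
    k = max(1, int(keep_ratio * clips.shape[0]))
    top_idx = saliency.topk(k).indices
    expert_feats = expert(clips[top_idx])                          # costly pass on few clips
    return cheap_feats, expert_feats, top_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 64
    sidekick = torch.nn.Linear(dim, dim)
    expert = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                 torch.nn.Linear(dim, dim))
    clips, query = torch.randn(40, dim), torch.randn(dim)
    _, expert_feats, idx = delegate_and_conquer(clips, query, sidekick, expert)
    print(expert_feats.shape, idx.tolist())
```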

Authors:Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin
Title: Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation
Abstract:
Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as the source and Cityscapes/ACDC as the target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.
中文: 本文提出CACTI和CACTIF两种基于扩散模型的语义一致性风格迁移方法,通过类别自适应归一化和注意力过滤机制,在保持语义边界和结构连贯性的同时有效缩小合成数据与真实数据的领域差距,仅需少量真实数据即可显著提升视觉模型性能。
English: This paper introduces CACTI and CACTIF, two diffusion-based techniques that perform semantically consistent style transfer to bridge the synthetic-to-real domain gap, improving vision model performance with minimal real data by preserving semantic boundaries and structural coherence.
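Class-wise Adaptive Instance Normalization can be sketched as matching feature statistics per semantic class using the segmentation masks, rather than globally; the sketch below omits the diffusion-model integration in CACTI/CACTIF and uses synthetic features.

```python
# Class-wise AdaIN sketch: match per-class feature statistics between content and style.
import torch


def classwise_adain(content, style, content_mask, style_mask, num_classes, eps=1e-5):
    """content, style: (C, H, W) feature maps; masks: (H, W) integer class maps."""
    out = content.clone()
    for cls in range(num_classes):
        c_sel, s_sel = content_mask == cls, style_mask == cls
        if c_sel.sum() < 2 or s_sel.sum() < 2:
            continue                                        # skip classes with too few pixels
        c_feats = content[:, c_sel]                         # (C, Nc)
        s_feats = style[:, s_sel]                           # (C, Ns)
        c_mu, c_std = c_feats.mean(1, keepdim=True), c_feats.std(1, keepdim=True) + eps
        s_mu, s_std = s_feats.mean(1, keepdim=True), s_feats.std(1, keepdim=True) + eps
        out[:, c_sel] = (c_feats - c_mu) / c_std * s_std + s_mu
    return out


if __name__ == "__main__":
    torch.manual_seed(0)
    content, style = torch.randn(8, 16, 16), torch.randn(8, 16, 16)
    c_mask = torch.randint(0, 3, (16, 16))
    s_mask = torch.randint(0, 3, (16, 16))
    print(classwise_adain(content, style, c_mask, s_mask, num_classes=3).shape)
```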

Authors:Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han
Title: Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text
Abstract:
This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/

Authors:Jie Zhao, Xin Chen, Yongsheng Yuan, Michael Felsberg, Dong Wang, Huchuan Lu
Title: Efficient Motion Prompt Learning for Robust Visual Tracking
Abstract:
Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at https://github.com/zj5559/Motion-Prompt-Tracking.
中文: 本文提出了一种轻量级即插即用的运动提示跟踪方法,通过高效的提示学习将运动线索融入现有视觉跟踪器,以极低的训练成本和速度代价显著提升了多个挑战性基准上的跟踪鲁棒性。
English: This paper introduces a lightweight, plug-and-play motion prompt tracking method that enhances vision-based trackers by integrating motion cues through efficient prompt learning, significantly improving robustness across multiple benchmarks with minimal training cost and speed impact.

Authors:Arjhun Swaminathan, Mete Akgün
Title: Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings
Abstract:
Deep neural networks for image classification remain vulnerable to adversarial examples -- small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70\% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.
Chinese: 针对图像分类的深度神经网络易受对抗样本攻击,本文提出的目标边缘信息攻击(TEA)通过利用目标图像边缘信息,在黑盒设置中高效生成对抗样本,显著减少查询次数并提升误分类效果。
English: Deep neural networks for image classification are vulnerable to adversarial examples, and the proposed Targeted Edge-informed Attack (TEA) effectively crafts these in black-box settings by using target image edges to reduce queries and enhance misclassification.
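A loose, assumption-heavy sketch of the edge-informed idea: blend the target image toward the source while keeping target content in high-edge regions; the actual query-efficient attack loop against a black-box model is omitted entirely.

```python
# Edge-informed initialization sketch: preserve target-image structure on strong edges
# while moving everything else toward the source image.
import numpy as np


def edge_map(img: np.ndarray) -> np.ndarray:
    """img: (H, W) grayscale in [0, 1]; simple gradient-magnitude edge map."""
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)


def edge_informed_init(source: np.ndarray, target: np.ndarray, strength=0.7) -> np.ndarray:
    keep = strength * edge_map(target)           # 1 -> keep target pixel, 0 -> use source pixel
    return keep * target + (1.0 - keep) * source


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src, tgt = rng.random((32, 32)), rng.random((32, 32))
    adv0 = edge_informed_init(src, tgt)
    # The blended image is closer to the source than the raw target image is.
    print(float(np.abs(adv0 - src).mean()) < float(np.abs(tgt - src).mean()))
```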

Authors:Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia
Title: ARPO:End-to-End Policy Optimization for GUI Agents with Experience Replay
Abstract:
Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge to optimize long-horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi-turn reinforcement learning (RL) for reasoning and tool-using capabilities in LLMs, their application to GUI-based agents remains relatively underexplored due to the difficulty of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse successful experience across training iterations. To further stabilize the training process, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Code and models: https://github.com/dvlab-research/ARPO.git.
中文摘要:本文提出Agentic Replay Policy Optimization (ARPO)方法,通过复用成功经验和任务筛选策略增强GUI智能体训练,在OSWorld基准测试中取得了领先性能,为基于强化学习的图形界面智能体建立了新基准。
English Summary: This paper introduces Agentic Replay Policy Optimization (ARPO), an end-to-end reinforcement learning method that enhances GUI agent training by reusing successful experiences and implementing task selection strategies, achieving competitive results on the OSWorld benchmark.
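The experience-replay ingredient can be sketched as a per-task buffer that caches only successful rollouts and is sampled to pad later GRPO-style groups; the buffer size, the success threshold, and the Rollout structure are illustrative choices, not the ARPO implementation.

```python
# Per-task replay buffer that keeps only successful GUI rollouts.
import random
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Rollout:
    task_id: str
    trajectory: list        # sequence of (observation, action) pairs
    reward: float


class SuccessReplayBuffer:
    def __init__(self, per_task_capacity: int = 8):
        self.buffers = defaultdict(lambda: deque(maxlen=per_task_capacity))

    def add(self, rollout: Rollout, success_threshold: float = 1.0):
        if rollout.reward >= success_threshold:          # only keep successful experience
            self.buffers[rollout.task_id].append(rollout)

    def sample(self, task_id: str, k: int) -> list:
        pool = list(self.buffers[task_id])
        return random.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    buf = SuccessReplayBuffer()
    buf.add(Rollout("open_browser", [("obs0", "click")], reward=1.0))
    buf.add(Rollout("open_browser", [("obs0", "scroll")], reward=0.0))  # ignored
    # A fresh policy-gradient group could be padded with replayed successes:
    print(len(buf.sample("open_browser", k=4)))
```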

Authors:Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
Title: DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Abstract:
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add the Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add the Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.
中文: DriveMoE提出了一种新颖的专家混合框架,通过动态适配驾驶场景的视觉模块和行为专项化的动作模块,在复杂驾驶环境中实现了最先进的端到端自动驾驶性能。
English: DriveMoE introduces a novel Mixture-of-Experts framework for end-to-end autonomous driving, featuring specialized vision and action modules that dynamically adapt to driving contexts and behaviors, achieving state-of-the-art performance in complex scenarios.

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection
Abstract:
Interpreting figurative language such as sarcasm across multi-modal inputs presents unique challenges, often requiring task-specific fine-tuning and extensive reasoning steps. However, current Chain-of-Thought approaches do not efficiently leverage the same cognitive processes that enable humans to identify sarcasm. We present IRONIC, an in-context learning framework that leverages Multi-modal Coherence Relations to analyze referential, analogical and pragmatic image-text linkages. Our experiments show that IRONIC achieves state-of-the-art performance on zero-shot Multi-modal Sarcasm Detection across different baselines. This demonstrates the need for incorporating linguistic and cognitive insights into the design of multi-modal reasoning strategies. Our code is available at: https://github.com/aashish2000/IRONIC
中文摘要:IRONIC是一种新颖的情境学习框架,通过利用多模态连贯关系实现了零样本多模态讽刺检测的最优性能,证明了将语言学和认知原理融入多模态推理设计的重要性。
English Summary: IRONIC is a novel in-context learning framework that uses multi-modal coherence relations to achieve state-of-the-art zero-shot sarcasm detection, demonstrating the importance of integrating linguistic and cognitive principles into multi-modal reasoning.

Authors:Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang
Title: DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution
Abstract:
Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step sampling, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28$\times$** speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.
Chinese: 作者提出DOVE,一种高效的单步扩散模型用于视频超分辨率,在实现与多步方法相当或更优性能的同时,推理速度提升高达28倍。
English: The authors propose DOVE, an efficient one-step diffusion model for video super-resolution that achieves comparable or better performance than multi-step methods while offering up to 28× faster inference speed.

Authors:Liyan Wang, Weixiang Zhou, Cong Wang, Kin-Man Lam, Zhixun Su, Jinshan Pan
Title: Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey
Abstract:
Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at https://github.com/wlydlut/UHD-Image-Restoration-Survey.
中文: 本文系统综述了超高清图像复原领域的最新进展,涵盖退化模型、基准数据集、深度学习技术及未来研究方向。
English: This paper provides a systematic review of recent advances in ultra-high-definition image restoration, covering degradation models, benchmark datasets, deep learning techniques, and future research directions.

Authors:Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
Title: Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
Abstract:
Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with faithfulness or hallucination, extracting more precise and disentangled hallucination-related representations. Our analysis demonstrates that interventions along the identified faithful direction can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a plug-and-play method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead. The code is available at https://github.com/huazhenglin2003/SSL.
中文摘要:本文提出SSL方法,通过稀疏自编码器识别并调控大视觉语言模型中的潜在语义方向,在可忽略的额外时间成本下有效缓解幻觉现象,同时保持语义完整性和跨模型架构的迁移能力。
English Summary: This paper introduces SSL, a plug-and-play method using sparse autoencoders to identify and steer latent directions in large vision-language models, effectively reducing hallucinations while maintaining semantic integrity and transferability across architectures with minimal computational overhead.
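Steering along an SAE-derived direction reduces, in the simplest reading, to adding a scaled decoder vector to the hidden state; the sketch below uses a random stand-in direction and an invented scale, and is not the released SSL code.

```python
# Sketch of activation steering along a (stand-in) SAE latent direction.
import torch


def steer_hidden_state(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """hidden: (seq_len, d_model); direction: (d_model,), e.g. one SAE decoder column."""
    direction = direction / direction.norm().clamp_min(1e-8)
    return hidden + alpha * direction            # push activations along the faithful direction


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model = 512
    hidden = torch.randn(10, d_model)
    faithful_dir = torch.randn(d_model)          # stand-in for an SAE latent's decoder vector
    steered = steer_hidden_state(hidden, faithful_dir, alpha=4.0)
    print((steered - hidden).norm(dim=-1)[:3])   # constant shift of size alpha per token
```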

Authors:Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, Yulun Zhang
Title: OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates
Abstract:
Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at https://github.com/jp-guo/OSCAR.
中文: OSCAR是一种新颖的一步扩散编解码器,通过将压缩潜在表示建模为扩散轨迹上的噪声变体,实现了跨多比特率的高效图像压缩,在显著降低计算开销的同时获得了优越的性能。
English: OSCAR is a novel one-step diffusion codec that enables efficient image compression across multiple bit-rates by modeling compressed latents as noisy variants along a diffusion trajectory, achieving superior performance with significantly reduced computational overhead.
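The core idea, mapping bit-rate to a pseudo diffusion timestep and reconstructing with a single denoising pass, can be sketched as below; the mapping function and identity denoiser are placeholders rather than OSCAR's trained components.

```python
# Sketch of bit-rate -> pseudo-timestep mapping followed by a single denoising pass.
import torch


def bitrate_to_timestep(bpp: float, min_bpp=0.05, max_bpp=1.0, num_timesteps=1000) -> int:
    # Lower bit-rate -> more distortion -> later (noisier) pseudo timestep.
    frac = (max_bpp - min(max(bpp, min_bpp), max_bpp)) / (max_bpp - min_bpp)
    return int(frac * (num_timesteps - 1))


def one_step_reconstruct(denoiser, compressed_latent: torch.Tensor, bpp: float) -> torch.Tensor:
    t = torch.tensor([bitrate_to_timestep(bpp)])
    return denoiser(compressed_latent, t)        # single denoising pass, no iterative sampling


if __name__ == "__main__":
    dummy_denoiser = lambda z, t: z              # identity stand-in for the finetuned model
    z = torch.randn(1, 4, 32, 32)
    print(bitrate_to_timestep(0.1), one_step_reconstruct(dummy_denoiser, z, bpp=0.1).shape)
```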

Authors:Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen
Title: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Abstract:
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84\% on V* bench, 74\% on TallyQA-Complex, and 84\% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
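One speculative reading of the curiosity-driven reward is a small bonus for rollouts that actually invoke pixel-space operations, decaying as their usage rate approaches a target; all coefficients below are invented for illustration.

```python
# Hypothetical reward shaping: task correctness plus a decaying bonus for pixel-space ops.
def shaped_reward(correct: bool, used_pixel_ops: bool, usage_rate: float,
                  target_rate: float = 0.5, bonus: float = 0.2) -> float:
    r = 1.0 if correct else 0.0
    if used_pixel_ops and usage_rate < target_rate:
        # Bonus shrinks as the fraction of rollouts using visual operations approaches the target.
        r += bonus * (target_rate - usage_rate) / target_rate
    return r


if __name__ == "__main__":
    print(shaped_reward(correct=False, used_pixel_ops=True, usage_rate=0.1))
    print(shaped_reward(correct=True, used_pixel_ops=False, usage_rate=0.1))
```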

Authors:Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer
Title: VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Abstract:
With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: https://asgaardlab.github.io/videogameqa-bench/

Authors:Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Title: MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding
Abstract:
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.
中文: MoRE-Brain提出了一种基于脑网络原理的混合专家框架,通过分层处理和双阶段路由机制,实现了从fMRI信号到图像的高保真、可适应且可解释的视觉重建,显著提升了重建质量与神经机制的可解释性。
English: MoRE-Brain introduces a neuro-inspired Mixture-of-Experts framework that achieves high-fidelity, adaptable, and interpretable visual reconstruction from fMRI through hierarchical processing and dual-stage routing, advancing both reconstruction quality and mechanistic insight.
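To make the routed mixture-of-experts idea more concrete, here is a minimal, hypothetical PyTorch sketch in which each expert encodes one functionally grouped block of fMRI voxels into a shared embedding space and a learned router weighs expert outputs per sample; the single-stage router, dimensions, and layer choices are illustrative assumptions rather than the MoRE-Brain architecture.

import torch
import torch.nn as nn

# Minimal sketch (not the authors' code) of routing expert encodings of voxel groups.
class RoutedMoEEncoder(nn.Module):
    def __init__(self, voxels_per_group: int, num_experts: int, embed_dim: int = 512):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(voxels_per_group, 1024), nn.GELU(),
                          nn.Linear(1024, embed_dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(num_experts * voxels_per_group, num_experts)

    def forward(self, fmri_groups: torch.Tensor) -> torch.Tensor:
        # fmri_groups: (batch, num_experts, voxels_per_group)
        b, e, v = fmri_groups.shape
        expert_out = torch.stack(
            [exp(fmri_groups[:, i]) for i, exp in enumerate(self.experts)], dim=1
        )  # (batch, num_experts, embed_dim)
        weights = torch.softmax(self.router(fmri_groups.reshape(b, e * v)), dim=-1)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)  # (batch, embed_dim)

encoder = RoutedMoEEncoder(voxels_per_group=256, num_experts=8)
print(encoder(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 512])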

Authors:Tony Montes, Fernando Lozano
Title: ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation
Abstract:
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making to better align object references with language model outputs, especially as newer models improve at both. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.
中文摘要:本研究提出了一种基于大语言模型的零样本视频问答智能体,通过结合思维链推理与YOLO-World目标追踪技术,在多个基准测试中实现最优性能,同时支持时间定位交叉验证以提升输出可靠性。
English Summary: This work introduces an LLM-brained agent for zero-shot VideoQA that integrates Chain-of-Thought reasoning with YOLO-World object tracking, achieving state-of-the-art performance across multiple benchmarks while enabling cross-verification of grounding timeframes for enhanced reliability.

Authors:Zhiyuan Xu, Bohan Li, Huan-ang Gao, Mingju Gao, Yong Chen, Ming Liu, Chenxu Yan, Hang Zhao, Shuo Feng, Hao Zhao
Title: Challenger: Affordable Adversarial Driving Video Generation
Abstract:
Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios, including cut-ins, sudden lane changes, tailgating, and blind spot intrusions, and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.
中文:Challenger框架通过轨迹优化和评分技术生成逼真的对抗性驾驶视频,有效提升了自动驾驶模型的碰撞率,并展现出跨模型的攻击迁移性。
English: The Challenger framework generates photorealistic adversarial driving videos by optimizing traffic interactions and sensor data through trajectory refinement and scoring, significantly increasing collision rates in autonomous driving models.

Authors:Can Rong, Xin Zhang, Yanxin Xi, Hongjie Sui, Jingtao Ding, Yong Li
Title: Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities
Abstract:
Commuting Origin-destination~(OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain the data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, with over 98\% expressiveness of traditional multisource hard-to-collect urban sociodemographic, economics, land use, and point of interest data. This inspires us to design a novel data generator, GlODGen, which can generate OD flow data for any cities of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen has great generalizability across diverse urban environments on different continents and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool, seamlessly integrating data acquisition and curation, urban semantic feature extraction, and OD flow generation together. It has been released at https://github.com/tsinghua-fib-lab/generate-od-pubtools.
中文: GlODGen是一种创新工具,通过从卫星图像中提取城市语义特征并结合人口数据,利用图扩散模型为全球城市生成通勤起讫点流量数据,在不同大陆的多样化城市环境中展现出卓越的准确性和泛化能力。
English: GlODGen is a novel tool that generates commuting origin-destination flow data for global cities by extracting urban semantic features from satellite imagery and combining them with population data through graph diffusion models, demonstrating high accuracy and generalizability across diverse urban environments.

Authors:Penghao Wu, Lewei Lu, Ziwei Liu
Title: Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM
Abstract:
Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.
中文: 大型多模态模型在处理视觉标记时存在计算冗余,ProxyV通过引入代理视觉标记来减轻计算负担,不仅保持性能,还能在与标记削减方法结合时进一步提升效率。
English: Large multimodal models face computational inefficiency with visual tokens, so ProxyV introduces proxy tokens to reduce processing load without losing performance and can even enhance it when combined with token reduction methods.

Authors:Satoshi Kosugi
Title: Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization
Abstract:
Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image are then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference). Our source code is available at https://github.com/satoshi-kosugi/powerful-attention.
Chinese: 本文提出了一种无需微调的样例图像着色方法,利用预训练扩散模型的自注意力模块实现精确语义匹配和双重注意力引导的颜色迁移,在图像质量和参考保真度上均优于现有技术。
English: This paper introduces a fine-tuning-free method for exemplar-based image colorization that utilizes the self-attention module of a pre-trained diffusion model to achieve accurate semantic matching and dual attention-guided color transfer, resulting in superior image quality and fidelity compared to existing techniques.
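The classifier-free colorization guidance described above combines color-transferred and non-color-transferred outputs; below is a minimal sketch of that combination step, assuming a standard guidance-style extrapolation with an arbitrary scale rather than the paper's exact formulation.

import numpy as np

# Hedged sketch: push the output away from the non-color-transferred prediction and
# toward the color-transferred one. The scale and feature shapes are assumptions.
def colorization_guidance(transferred: np.ndarray,
                          non_transferred: np.ndarray,
                          scale: float = 1.5) -> np.ndarray:
    """Combine the two outputs, amplifying the color-transfer direction."""
    return non_transferred + scale * (transferred - non_transferred)

# Example on dummy latent features (batch, channels, H, W).
a = np.random.rand(1, 4, 64, 64).astype(np.float32)
b = np.random.rand(1, 4, 64, 64).astype(np.float32)
print(colorization_guidance(a, b).shape)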

Authors:Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, Jun Xu
Title: GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
Abstract:
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update, each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI agent grounding. The project repository is available at https://github.com/Yuqi-Zhou/GUI-G1.
中文摘要:本文分析了图形用户界面智能体训练中的三大挑战,并提出快速思考模板、边界框约束和优化强化学习目标三项针对性解决方案,使GUI-G1-3B模型在界面定位任务中达到最新最优性能。
English Summary: This paper analyzes challenges in GUI agent training and proposes three targeted solutions—a Fast Thinking Template, box size constraints, and a revised RL objective—enabling their GUI-G1-3B model to achieve state-of-the-art grounding performance.
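As a hedged illustration of a grounding reward with a box-size constraint in the spirit of the fix described above, the sketch below returns IoU-based credit but zeroes the reward when the predicted box is disproportionately large; the threshold and gating are our assumptions, not the released reward.

# Illustrative grounding reward: oversized boxes get no credit, blocking the
# box-size exploit; otherwise the reward is plain IoU. Boxes are (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_reward(pred_box, gt_box, max_size_ratio: float = 3.0) -> float:
    pred_area = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    if pred_area > max_size_ratio * gt_area:
        return 0.0  # box-size constraint: refuse credit for inflated boxes
    return iou(pred_box, gt_box)

print(grounding_reward((10, 10, 60, 60), (20, 20, 50, 50)))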

Authors:Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
Title: MMaDA: Multimodal Large Diffusion Language Models
Abstract:
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
中文摘要:MMaDA是一种统一的多模态扩散基础模型,通过模态无关架构、混合思维链微调和统一强化学习算法,在文本推理、多模态理解和文生图任务中均实现了领先性能。
English Summary: MMaDA is a unified multimodal diffusion foundation model that integrates a modality-agnostic architecture, mixed chain-of-thought fine-tuning, and a unified RL algorithm to achieve state-of-the-art performance across textual reasoning, multimodal understanding, and text-to-image generation tasks.

Authors:Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, Wenbing Huang
Title: STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1's anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights in advancing the research of MLLMs and reasoning models. The codes, model weights, and data will be publicly available at https://github.com/zongzhao23/STAR-R1.
Chinese: 本研究提出STAR-R1创新强化学习框架,通过奖励部分正确性和惩罚低效探索,解决了多模态模型在空间推理中的不足,在11项指标上实现最优性能,跨视角场景性能提升23%。
English: The study introduces STAR-R1, a novel reinforcement learning framework that overcomes limitations in multimodal models' spatial reasoning by rewarding partial correctness and penalizing inefficiencies, achieving state-of-the-art performance across 11 metrics with a 23% improvement in cross-view scenarios.
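Below is a hypothetical sketch of a fine-grained TVR-style reward along the lines described above: partial credit per correctly identified transformation, a penalty per spurious prediction to discourage excessive enumeration, and a penalty for predicting nothing; the exact weights are assumptions.

# Illustrative reward (not the released one) over sets of predicted transformations.
def tvr_reward(predicted: set, ground_truth: set,
               enum_penalty: float = 0.1, inaction_penalty: float = 0.5) -> float:
    if not predicted:
        return -inaction_penalty          # discourage passive "no change" answers
    correct = len(predicted & ground_truth)
    wrong = len(predicted - ground_truth) # penalize blind enumeration of candidates
    partial_credit = correct / max(len(ground_truth), 1)
    return partial_credit - enum_penalty * wrong

print(tvr_reward({"color:red->blue", "size:small->large"},
                 {"color:red->blue", "position:left->right"}))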

Authors:Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein
Title: Interspatial Attention for Efficient 4D Human Video Generation
Abstract:
Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)--based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variation autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at https://dsaurus.github.io/isa4d/.
中文: 本文提出了一种空间交叉注意力机制,结合扩散变换器模型,实现了高质量、可控的4D人物视频生成,在动作连贯性和身份保持方面表现卓越。
English: This paper introduces an interspatial attention mechanism integrated into diffusion transformer models to generate high-quality, controllable 4D human videos with superior motion consistency and identity preservation.

Authors:Weihao Xia, Cengiz Oztireli
Title: Exploring The Visual Feature Space for Multimodal Neural Decoding
Abstract:
The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at https://github.com/weihaox/VINDEX.
中文: 本研究提出了一种零样本多模态脑解码方法,通过整合多模态大语言模型中的视觉特征空间,解决了现有方法在细节描述上的局限性,从而提升了神经解码在不同粒度上的精确性。
English: This study introduces a zero-shot multimodal brain decoding method that integrates vision feature spaces from Multimodal Large Language Models to enhance neural decoding precision across multiple granularities, addressing the limitations of coarse interpretations in existing approaches.

Authors:Zhuodong Jiang, Haoran Wang, Guoxi Huang, Brett Seymour, Nantheera Anantrasirichai
Title: RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction
Abstract:
Reconstructing high-fidelity underwater scenes remains a challenging task due to light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and geometric accuracy of deep underwater rendering. We propose decoupled learning for RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. Additionally, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods with PSNR gains up to 1.90dB, delivering superior perceptual quality and robustness, and offering promising directions for marine robotics and underwater visual analytics. The code of RUSplatting is available at https://github.com/theflash987/RUSplatting and the dataset Submerged3D can be downloaded at https://zenodo.org/records/15482420.
中文: 本文提出一种增强型高斯泼溅框架,通过解耦RGB学习、帧插值和新型损失函数解决水下场景重建中的颜色失真、稀疏视角和噪声问题,在新深海数据集上实现了最优性能。
English: This paper introduces an enhanced Gaussian Splatting framework that improves underwater scene reconstruction by addressing color distortion, sparse views, and noise through decoupled RGB learning, frame interpolation, and a novel loss function, achieving state-of-the-art results on a new deep-sea dataset.
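The decoupled RGB learning above is motivated by wavelength-dependent underwater attenuation; the sketch below shows the standard Beer-Lambert-style forward model that intuition rests on, with illustrative (not fitted) per-channel coefficients.

import numpy as np

# Minimal physical sketch: underwater light is attenuated differently per wavelength,
# so red decays fastest with distance. Coefficients are illustrative assumptions.
def underwater_attenuation(rgb: np.ndarray, depth_m: np.ndarray,
                           beta=(0.45, 0.12, 0.08)) -> np.ndarray:
    """Apply Beer-Lambert style attenuation per channel: I_c = J_c * exp(-beta_c * d)."""
    beta = np.asarray(beta).reshape(1, 1, 3)
    return rgb * np.exp(-beta * depth_m[..., None])

img = np.ones((4, 4, 3), dtype=np.float32)       # ideal white surface
depth = np.full((4, 4), 5.0, dtype=np.float32)   # 5 meters of water along the path
print(underwater_attenuation(img, depth)[0, 0])  # the red channel decays the most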

Authors:Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang
Title: Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
Abstract:
The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.

Authors:Tianjiao Cao, Jiahao Lyu, Weichao Zeng, Weimin Mu, Yu Zhou
Title: The Devil is in Fine-tuning and Long-tailed Problems:A New Benchmark for Scene Text Detection
Abstract:
Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a \textit{Fine-tuning Gap}, where models leverage the \textit{Dataset-Specific Optimization} (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performances on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a \textit{Joint-Dataset Learning} (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of the ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at https://github.com/pd162/LTB.
中文摘要:场景文本检测器在现实应用中表现不佳,主要因为针对学术基准的微调差距和处理长尾文本分布的困难,为此提出了联合数据集学习协议和新基准,并引入了自监督基线方法。
English Summary: Scene text detectors often underperform in real-world applications due to a fine-tuning gap that prioritizes academic benchmarks and struggles with long-tailed text distributions, prompting the proposal of joint-dataset learning and a new benchmark with a self-supervised baseline method.

Authors:Pujun Xue, Junyi Ge, Xiaotong Jiang, Siyang Song, Zijian Wu, Yupeng Huo, Weicheng Xie, Linlin Shen, Xiaoqin Zhou, Xiaofeng Liu, Min Gu
Title: Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking
Abstract:
Malocclusion is a major challenge in orthodontics, and its complex presentation and diverse clinical manifestations make accurate localization and diagnosis particularly important. Currently, one of the major shortcomings facing the field of dental image analysis is the lack of large-scale, accurately labeled datasets dedicated to malocclusion issues, which limits the development of automated diagnostics in the field of dentistry and leads to a lack of diagnostic accuracy and efficiency in clinical practice. Therefore, in this study, we propose the Oral and Maxillofacial Natural Images (OMNI) dataset, a novel and comprehensive dental image dataset aimed at advancing the study of analyzing dental images for issues of malocclusion. Specifically, the dataset contains 4166 multi-view images with 384 participants in data collection and annotated by professional dentists. In addition, we performed a comprehensive validation of the created OMNI dataset, including three CNN-based methods, two Transformer-based methods, and one GNN-based method, and conducted automated diagnostic experiments for malocclusion issues. The experimental results show that the OMNI dataset can facilitate the automated diagnosis research of malocclusion issues and provide a new benchmark for the research in this field. Our OMNI dataset and baseline code are publicly available at https://github.com/RoundFaceJ/OMNI.
中文摘要:本研究提出了OMNI数据集,这是一个包含专业标注的全面牙科图像集合,旨在解决错颌畸形诊断中缺乏大规模数据的问题,并通过多种验证方法证明了其在推动自动化诊断研究方面的有效性。
English Summary: The study introduces the OMNI dataset, a comprehensive collection of dental images with professional annotations, to address the lack of large-scale data in malocclusion diagnosis and demonstrates its effectiveness in advancing automated diagnostic research through various validation methods.

Authors:Iuliia Kotseruba, John K. Tsotsos
Title: SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks
Abstract:
Generalization of deep-learning-based (DL) computer vision algorithms to various image perturbations is hard to establish and remains an active area of research. The majority of past analyses focused on the images already captured, whereas effects of the image formation pipeline and environment are less studied. In this paper, we address this issue by analyzing the impact of capture conditions, such as camera parameters and lighting, on DL model performance on 3 vision tasks -- image classification, object detection, and visual question answering (VQA). To this end, we assess capture bias in common vision datasets and create a new benchmark, SNAP (for $\textbf{S}$hutter speed, ISO se$\textbf{N}$sitivity, and $\textbf{AP}$erture), consisting of images of objects taken under controlled lighting conditions and with densely sampled camera settings. We then evaluate a large number of DL vision models and show the effects of capture conditions on each selected vision task. Lastly, we conduct an experiment to establish a human baseline for the VQA task. Our results show that computer vision datasets are significantly biased, the models trained on this data do not reach human accuracy even on the well-exposed images, and are susceptible to both major exposure changes and minute variations of camera settings. Code and data can be found at https://github.com/ykotseruba/SNAP
中文: 深度学习计算机视觉模型在图像扰动下的泛化能力不足,主要源于相机参数和光照等采集条件的偏差;SNAP基准测试表明,即使在曝光良好的图像上,模型性能也显著低于人类水平且对相机设置的微小变化敏感。
English: Deep learning computer vision models struggle with generalization across image perturbations, particularly due to biases in capture conditions like camera settings and lighting, as demonstrated by the SNAP benchmark which reveals significant performance drops even on well-exposed images compared to human accuracy.
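Since SNAP sweeps shutter speed, ISO sensitivity, and aperture, it helps to recall how these settings collapse into a single exposure value; the snippet below uses the standard photographic EV formula and is a worked example, not code from the benchmark.

import math

# Exposure value normalized to ISO 100: EV = log2(N^2 / t) - log2(ISO / 100).
def exposure_value(aperture_f: float, shutter_s: float, iso: float) -> float:
    return math.log2(aperture_f ** 2 / shutter_s) - math.log2(iso / 100.0)

# f/4 at 1/125 s and ISO 400 corresponds to roughly EV 9 referenced to ISO 100.
print(round(exposure_value(4.0, 1 / 125, 400), 2))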

Authors:Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, Zhaoyu Yang, Guoyou Li, Shilan Zhang, Zichan Li, Yaxiong Chen, Shengwu Xiong, Peng Xu, Jiajun Zhang, Bowen Zhou, David Clifton, Luc Van Gool
Title: LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. Existing benchmarks are usually constructed in a task-oriented manner without any guarantee that different task samples come from the same data distribution, so they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports evaluating MLLMs on image-invariable prompts, from basic perception to compositional reasoning. In addition, our images were manually collected from social media, of which 53% were published after Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models were released after Dec. 2024, and none of them achieve an accuracy greater than 60% on the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/
Chinese: Lens基准通过3.4K张当代图像和6万多个问题构建了包含感知、理解与推理的三层评估体系,旨在检验多模态大模型的协同认知能力,当前顶尖模型在推理任务中的准确率均未超过60%。
English: The Lens benchmark introduces a multi-level evaluation framework with 3.4K contemporary images and 60K+ questions to assess MLLMs' synergistic capabilities across perception, understanding, and reasoning tasks, where current state-of-the-art models achieve below 60% accuracy in reasoning.

Authors:Hua Li, Shijie Lian, Zhiyuan Li, Runmin Cong, Chongyi Li, Laurence T. Yang, Weidong Zhang, Sam Kwong
Title: Advancing Marine Research: UWSAM Framework and UIIS10K Dataset for Precise Underwater Instance Segmentation
Abstract:
With recent breakthroughs in large-scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end-to-end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT-Huge image encoder into the smaller ViT-Small image encoder via the Mask GAT-based Underwater Knowledge Distillation (MG-UKD) method for effective visual representation learning. Furthermore, we design an End-to-end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets. Datasets and codes are available at https://github.com/LiamLian0727/UIIS10K.
中文摘要:针对分割一切模型(SAM)在水下实例分割任务中因领域知识缺乏和计算需求高而存在的性能局限,本研究提出高效模型UWSAM,通过知识蒸馏和自动提示生成技术,在多个水下数据集上实现了显著性能突破。
English Summary: The Segment Anything Model (SAM) struggles with underwater instance segmentation due to domain expertise gaps and high computational demands, leading to the development of UWSAM—an efficient model using knowledge distillation and automatic prompt generation that achieves superior performance on underwater datasets.

Authors:Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang
Title: Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
中文: 提出的自适应难负样本扰动学习(AHNPL)方法通过生成基于图像的难负样本并采用自适应对比学习,有效提升了视觉语言模型在组合推理任务上的性能。
English: The proposed Adaptive Hard Negative Perturbation Learning (AHNPL) method enhances Vision-Language Models by generating image-based hard negatives and employing adaptive contrastive learning to improve performance on compositional reasoning tasks.
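Below is a minimal sketch of a dynamic-margin contrastive term like the one described above: negatives whose similarity to the anchor approaches the positive's are treated as harder and receive a larger margin. The difficulty measure and margin range are illustrative assumptions, not the paper's loss.

import torch
import torch.nn.functional as F

# Hedged sketch: margin grows with estimated sample difficulty.
def dynamic_margin_loss(anchor, positive, negative,
                        base_margin=0.1, max_margin=0.4):
    sim_pos = F.cosine_similarity(anchor, positive)
    sim_neg = F.cosine_similarity(anchor, negative)
    # Difficulty in [0, 1]: near 1 when the negative is about as similar as the positive.
    difficulty = torch.clamp(1.0 - (sim_pos - sim_neg), 0.0, 1.0)
    margin = base_margin + (max_margin - base_margin) * difficulty
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()

a, p, n = (torch.randn(8, 512) for _ in range(3))
print(dynamic_margin_loss(a, p, n).item())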

Authors:Andrew Caunes, Thierry Chateau, Vincent Fremont
Title: seg_3D_by_PC2D: Multi-View Projection for Domain Generalization and Adaptation in 3D Semantic Segmentation
Abstract:
3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). Our approach first aligns Lidar scans into coherent 3D scenes and renders them from multiple virtual camera poses to create a large-scale synthetic 2D dataset (PC2D). We then use it to train a 2D segmentation model in-domain. During inference, the model processes hundreds of views per scene; the resulting logits are back-projected to 3D with an occlusion-aware voting scheme to generate final point-wise labels. Our framework is modular and enables extensive exploration of key design parameters, such as view generation optimization (VGO), visualization modality optimization (MODO), and 2D model choice. We evaluate on the nuScenes and SemanticKITTI datasets under both the DG and UDA settings. We achieve state-of-the-art results in UDA and close to state-of-the-art in DG, with particularly large gains on large, static classes. Our code and dataset generation tools will be publicly available at https://github.com/andrewcaunes/ia4markings
中文: 本研究提出一种多视角投影框架,通过将激光雷达扫描转换为合成二维数据进行训练,有效解决三维语义分割中的领域偏移问题,在自动驾驶数据集的无监督领域自适应任务中取得最优结果,并在领域泛化方面接近最佳水平。
English: This study introduces a multi-view projection framework that addresses domain shift in 3D semantic segmentation by converting Lidar scans into synthetic 2D data for training, achieving top results in unsupervised domain adaptation and near state-of-the-art in domain generalization on autonomous driving datasets.
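The back-projection step above can be summarized as occlusion-aware logit voting: in the sketch below, only the points visible in each rendered view accumulate that view's 2D logits, and points never seen stay unlabeled. Visibility masks and per-point logits are assumed to come from the rendering stage; this is an illustration, not the released code.

import numpy as np

# Accumulate per-view class logits onto visible 3D points, then take the argmax.
def vote_point_labels(num_points: int, num_classes: int, views):
    """views: iterable of (visible_point_ids, logits_per_visible_point)."""
    accum = np.zeros((num_points, num_classes), dtype=np.float32)
    counts = np.zeros(num_points, dtype=np.int32)
    for point_ids, logits in views:
        accum[point_ids] += logits          # occlusion-aware: only visible points vote
        counts[point_ids] += 1
    labels = accum.argmax(axis=1)
    labels[counts == 0] = -1                # points never seen stay unlabeled
    return labels

views = [(np.array([0, 2]), np.array([[0.1, 0.9], [0.8, 0.2]])),
         (np.array([2, 3]), np.array([[0.7, 0.3], [0.4, 0.6]]))]
print(vote_point_labels(num_points=4, num_classes=2, views=views))  # [ 1 -1  0  1]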

Authors:Zeqing Wang, Shiyuan Zhang, Chengpei Tang, Keze Wang
Title: TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
Abstract:
Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge (e.g., fruit decay and human aging), is a fundamental aspect of human visual understanding. Unlike temporal perception based on simple event sequences, this form of reasoning requires a deeper comprehension of how object states change over time. Although the current powerful Vision-Language Models (VLMs) have demonstrated impressive performance on a wide range of downstream tasks, their capacity to reason about temporal causality remains underexplored. To address this gap, we introduce \textbf{TimeCausality}, a novel benchmark specifically designed to evaluate the causal reasoning ability of VLMs in the temporal dimension. Based on our TimeCausality, we find that while the current SOTA open-source VLMs have achieved performance levels comparable to closed-source models like GPT-4o on various standard visual question answering tasks, they fall significantly behind on our benchmark compared with their closed-source competitors. Furthermore, even GPT-4o exhibits a marked drop in performance on TimeCausality compared to its results on other tasks. These findings underscore the critical need to incorporate temporal causality into the evaluation and development of VLMs, and they highlight an important challenge for the open-source VLM community moving forward. Code and Data are available at \href{https://github.com/Zeqing-Wang/TimeCausality }{TimeCausality}.
中文: TimeCausality基准测试表明,当前包括顶尖开源模型和GPT-4o等闭源模型在内的视觉语言模型,在时间因果推理方面表现显著不足,这揭示了其开发与评估体系中的重要缺陷。
English: The TimeCausality benchmark reveals that current Vision-Language Models, including top open-source and closed-source ones like GPT-4o, significantly underperform in reasoning about temporal causality, highlighting a critical gap in their development and evaluation.

Authors:Raza Imam, Rufael Marew, Mohammad Yaqub
Title: On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?
Abstract:
Medical Vision-Language Models (MVLMs) have achieved excellent generalization in medical image analysis, yet their performance under noisy, corrupted conditions remains largely untested. Clinical imaging is inherently susceptible to acquisition artifacts and noise; however, existing evaluations predominantly assess generally clean datasets, overlooking robustness -- i.e., the model's ability to perform under real-world distortions. To address this gap, we first introduce MediMeta-C, a corruption benchmark that systematically applies several perturbations across multiple medical imaging datasets. Combined with MedMNIST-C, this establishes a comprehensive robustness evaluation framework for MVLMs. We further propose RobustMedCLIP, a visual encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions. Through extensive experiments, we benchmark 5 major MVLMs across 5 medical imaging modalities, revealing that existing models exhibit severe degradation under corruption and struggle with domain-modality tradeoffs. Our findings highlight the necessity of diverse training and robust adaptation strategies, demonstrating that efficient low-rank adaptation, when paired with few-shot tuning, improves robustness while preserving generalization across modalities.
中文: 医学视觉语言模型在噪声条件下表现不佳,为此我们建立了MediMeta-C基准并开发了RobustMedCLIP适配方法,通过少量样本调优显著提升了模型的抗干扰能力。
English: Medical Vision-Language Models show excellent generalization but perform poorly under noisy conditions, prompting the creation of MediMeta-C benchmark and RobustMedCLIP adaptation to enhance robustness through few-shot tuning.

Authors:DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
Title: Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Abstract:
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.
Chinese: 研究表明,视觉语言模型在面对真实表情包图像的有害指令时比面对人工图像时更易受影响,凸显了进行生态效度安全评估和加强防护机制的迫切需求。
English: This study reveals that vision-language models are significantly more vulnerable to harmful prompts from real meme images than artificial ones, highlighting the urgent need for ecologically valid safety evaluations and enhanced protective measures.

Authors:Raphael Sulzer, Liuyun Duan, Nicolas Girard, Florent Lafarge
Title: The P$^3$ dataset: Pixels, Points and Polygons for Multimodal Building Vectorization
Abstract:
We present the P$^3$ dataset, a large-scale multimodal benchmark for building vectorization, constructed from aerial LiDAR point clouds, high-resolution aerial imagery, and vectorized 2D building outlines, collected across three continents. The dataset contains over 10 billion LiDAR points with decimeter-level accuracy and RGB images at a ground sampling distance of 25 centimeters. While many existing datasets primarily focus on the image modality, P$^3$ offers a complementary perspective by also incorporating dense 3D information. We demonstrate that LiDAR point clouds serve as a robust modality for predicting building polygons, both in hybrid and end-to-end learning frameworks. Moreover, fusing aerial LiDAR and imagery further improves accuracy and geometric quality of predicted polygons. The P$^3$ dataset is publicly available, along with code and pretrained weights of three state-of-the-art models for building polygon prediction at https://github.com/raphaelsulzer/PixelsPointsPolygons.
中文摘要:P$^3$数据集是一个融合航空激光雷达、高分辨率影像和矢量化建筑轮廓的大规模多模态基准数据集,通过多模态融合显著提升了建筑多边形预测的精度与几何质量。
English Summary: The P$^3$ dataset is a large-scale multimodal benchmark combining aerial LiDAR, high-resolution imagery, and vectorized building outlines across three continents, demonstrating LiDAR's effectiveness in predicting building polygons and achieving improved accuracy through multimodal fusion.

Authors:Yuchen Li, Chaoran Feng, Zhenyu Tang, Kaiyuan Deng, Wangbo Yu, Yonghong Tian, Li Yuan
Title: GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation
Abstract:
We introduce GS2E (Gaussian Splatting to Event), a large-scale synthetic event dataset for high-fidelity event vision tasks, captured from real-world sparse multi-view RGB images. Existing event datasets are often synthesized from dense RGB videos, which typically lack viewpoint diversity and geometric consistency, or depend on expensive, difficult-to-scale hardware setups. GS2E overcomes these limitations by first reconstructing photorealistic static scenes using 3D Gaussian Splatting, and subsequently employing a novel, physically-informed event simulation pipeline. This pipeline generally integrates adaptive trajectory interpolation with physically-consistent event contrast threshold modeling. Such an approach yields temporally dense and geometrically consistent event streams under diverse motion and lighting conditions, while ensuring strong alignment with underlying scene structures. Experimental results on event-based 3D reconstruction demonstrate GS2E's superior generalization capabilities and its practical value as a benchmark for advancing event vision research.
中文: GS2E是一个基于真实世界稀疏多视角RGB图像、通过3D高斯泼溅和新型事件模拟流程构建的大规模合成事件数据集,它克服了现有数据集的局限性,能在多样化条件下提供几何一致的事件流,有力推进事件视觉研究发展。
English: GS2E is a large-scale synthetic event dataset created from real-world sparse multi-view RGB images using 3D Gaussian Splatting and a novel event simulation pipeline, overcoming limitations of existing datasets by providing geometrically consistent event streams under diverse conditions for advancing event vision research.
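A simplified sketch of the contrast-threshold event model that simulators of this kind build on: a pixel fires a signed event whenever its log intensity has changed by more than a threshold since its last event. The adaptive trajectory interpolation and per-pixel threshold noise of the actual pipeline are omitted; this is illustrative, not the released code.

import numpy as np

# Emit +1/-1 per pixel when the log-intensity change exceeds the contrast threshold C.
def simulate_events(frame, last_log_intensity, C=0.2, eps=1e-6):
    log_i = np.log(frame + eps)
    delta = log_i - last_log_intensity
    polarity = np.sign(delta) * (np.abs(delta) >= C)         # +1, -1, or 0 per pixel
    fired = polarity != 0
    new_reference = np.where(fired, log_i, last_log_intensity)  # reset where fired
    return polarity.astype(np.int8), new_reference

prev = np.full((2, 2), 0.5)
curr = np.array([[0.8, 0.5], [0.3, 0.5]])
pol, ref = simulate_events(curr, np.log(prev + 1e-6))
print(pol)  # +1 where brightness rose enough, -1 where it fell, 0 otherwise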

Authors:Yanzhi Tian, Zeming Liu, Zhengyang Liu, Yuhang Guo
Title: Exploring In-Image Machine Translation with Real-World Background
Abstract:
In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted on simplified scenarios such as images of one-line text with black font in white backgrounds, which is far from reality and impractical for applications in the real world. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research of complex scenario IIMT, we design an IIMT dataset that includes subtitle text with real-world background. However previous IIMT models perform inadequately in complex scenarios. To address the issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on text-image directly, and fuses the translated text-image with the background, to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
Chinese: DebackX模型通过从复杂现实背景中分离并直接翻译文本再融合,提升了图像内机器翻译的翻译质量和视觉效果。
English: The DebackX model enhances In-Image Machine Translation by separating and translating text from complex real-world backgrounds before fusing it back, achieving superior translation quality and visual results.

Authors:Om Khangaonkar, Hamed Pirsiavash
Title: gen2seg: Generative Models Enable Generalizable Instance Segmentation
Abstract:
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
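As a loose, hypothetical reading of the instance coloring idea, the sketch below paints each ground-truth instance with its own color and penalizes the model for deviating from that coloring, which encourages boundary-respecting outputs; the random palette and plain MSE are our assumptions, not the paper's loss.

import torch

# Build an instance-colored target and compare the prediction against it.
def instance_coloring_target(instance_map: torch.Tensor, num_channels: int = 3):
    ids = instance_map.unique()
    palette = torch.rand(len(ids), num_channels)      # one random color per instance
    target = torch.zeros(*instance_map.shape, num_channels)
    for i, inst_id in enumerate(ids):
        target[instance_map == inst_id] = palette[i]
    return target.permute(2, 0, 1)                    # (C, H, W)

def instance_coloring_loss(pred_rgb: torch.Tensor, instance_map: torch.Tensor):
    return torch.mean((pred_rgb - instance_coloring_target(instance_map)) ** 2)

inst = torch.randint(0, 4, (64, 64))   # dummy instance id map
pred = torch.rand(3, 64, 64)           # dummy model output
print(instance_coloring_loss(pred, inst).item())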

Authors:Yifan Liu, Wuyang Li, Weihao Yu, Chenxin Li, Alexandre Alahi, Max Meng, Yixuan Yuan
Title: X-GRM: Large Gaussian Reconstruction Model for Sparse-view X-rays to Computed Tomography
Abstract:
Computed Tomography serves as an indispensable tool in clinical workflows, providing non-invasive visualization of internal anatomical structures. Existing CT reconstruction works are limited to small-capacity model architecture and inflexible volume representation. In this work, we present X-GRM (X-ray Gaussian Reconstruction Model), a large feedforward model for reconstructing 3D CT volumes from sparse-view 2D X-ray projections. X-GRM employs a scalable transformer-based architecture to encode sparse-view X-ray inputs, where tokens from different views are integrated efficiently. Then, these tokens are decoded into a novel volume representation, named Voxel-based Gaussian Splatting (VoxGS), which enables efficient CT volume extraction and differentiable X-ray rendering. This combination of a high-capacity model and flexible volume representation, empowers our model to produce high-quality reconstructions from various testing inputs, including in-domain and out-domain X-ray projections. Our codes are available at: https://github.com/CUHK-AIM-Group/X-GRM.
中文:X-GRM提出了一种基于Transformer的大规模模型,通过创新的体素高斯溅射表示法,能够从稀疏X射线投影中重建高质量三维CT图像,实现了灵活高效的临床成像应用。
English: X-GRM introduces a large transformer-based model that reconstructs high-quality 3D CT volumes from sparse X-ray projections using a novel Voxel-based Gaussian Splatting representation, enabling flexible and efficient clinical imaging.

Authors:Ting Huang, Zeyu Zhang, Ruicheng Zhang, Yang Zhao
Title: DC-Scene: Data-Centric Learning for 3D Scene Understanding
Abstract:
3D scene understanding plays a fundamental role in vision applications such as robotics, autonomous driving, and augmented reality. However, advancing learning-based 3D scene understanding remains challenging due to two key limitations: (1) the large scale and complexity of 3D scenes lead to higher computational costs and slower training compared to 2D counterparts; and (2) high-quality annotated 3D datasets are significantly scarcer than those available for 2D vision. These challenges underscore the need for more efficient learning paradigms. In this work, we propose DC-Scene, a data-centric framework tailored for 3D scene understanding, which emphasizes enhancing data quality and training efficiency. Specifically, we introduce a CLIP-driven dual-indicator quality (DIQ) filter, combining vision-language alignment scores with caption-loss perplexity, along with a curriculum scheduler that progressively expands the training pool from the top 25% to 75% of scene-caption pairs. This strategy filters out noisy samples and significantly reduces dependence on large-scale labeled 3D data. Extensive experiments on ScanRefer and Nr3D demonstrate that DC-Scene achieves state-of-the-art performance (86.1 CIDEr with the top-75% subset vs. 85.4 with the full dataset) while reducing training cost by approximately two-thirds, confirming that a compact set of high-quality samples can outperform exhaustive training. Code will be available at https://github.com/AIGeeksGroup/DC-Scene.
中文摘要:提出的DC-Scene框架通过数据质量筛选和渐进式课程学习策略,有效解决了3D场景理解中的计算成本高和数据稀缺问题,在降低训练成本的同时实现了更优性能。
English Summary: The proposed DC-Scene framework addresses 3D scene understanding challenges by implementing a data-centric approach with quality filtering and curriculum learning, achieving superior performance with reduced training costs.
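Below is a minimal sketch of a dual-indicator quality filter plus curriculum schedule in the spirit of DIQ: scene-caption pairs are ranked by a combination of a vision-language alignment score (higher is better) and caption-loss perplexity (lower is better), and the kept fraction grows from 25% to 75%. The equal weighting and min-max normalization are assumptions.

import numpy as np

# Rank pairs by a combined quality score and keep the top fraction for this stage.
def diq_select(clip_scores: np.ndarray, perplexities: np.ndarray, keep_frac: float):
    def rank01(x):  # min-max normalize to [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    quality = 0.5 * rank01(clip_scores) + 0.5 * (1.0 - rank01(perplexities))
    k = max(1, int(keep_frac * len(quality)))
    return np.argsort(-quality)[:k]        # indices of the retained pairs

scores = np.array([0.31, 0.22, 0.28, 0.35])
ppl = np.array([12.0, 40.0, 18.0, 10.0])
for frac in (0.25, 0.5, 0.75):             # curriculum: 25% -> 50% -> 75%
    print(frac, diq_select(scores, ppl, frac))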

Authors:Yisi Luo, Xile Zhao, Deyu Meng
Title: Continuous Representation Methods, Theories, and Applications: An Overview and Perspectives
Abstract:
Recently, continuous representation methods emerge as novel paradigms that characterize the intrinsic structures of real-world data through function representations that map positional coordinates to their corresponding values in the continuous space. As compared with the traditional discrete framework, the continuous framework demonstrates inherent superiority for data representation and reconstruction (e.g., image restoration, novel view synthesis, and waveform inversion) by offering inherent advantages including resolution flexibility, cross-modal adaptability, inherent smoothness, and parameter efficiency. In this review, we systematically examine recent advancements in continuous representation frameworks, focusing on three aspects: (i) Continuous representation method designs such as basis function representation, statistical modeling, tensor function decomposition, and implicit neural representation; (ii) Theoretical foundations of continuous representations such as approximation error analysis, convergence property, and implicit regularization; (iii) Real-world applications of continuous representations derived from computer vision, graphics, bioinformatics, and remote sensing. Furthermore, we outline future directions and perspectives to inspire exploration and deepen insights to facilitate continuous representation methods, theories, and applications. All referenced works are summarized in our open-source repository: https://github.com/YisiLuo/Continuous-Representation-Zoo
中文摘要:连续表示方法通过灵活高效的功能映射提供卓越的数据处理能力,推动了计算机视觉和生物信息学等领域的发展,并持续进行研究和开源资源分享。
English Summary: Continuous representation methods offer superior data handling through flexible, efficient function mappings, advancing fields like computer vision and bioinformatics with ongoing research and open-source resources.
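As a toy illustration of the continuous-representation paradigm described above (here via a basis-function design with random Fourier features), the following sketch fits a coarse 1D signal and then queries the fitted continuous function at a finer resolution; it stands in for no specific method surveyed in the paper.

```python
# Minimal sketch of a continuous (basis-function) representation: fit
# f(x) ~ sum_k w_k * phi_k(x) with random Fourier features, then query it
# at arbitrary resolution. Purely illustrative of the paradigm.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 64)                     # coarse samples
y_train = np.sin(6 * np.pi * x_train) + 0.3 * x_train

def features(x, freqs):
    """Random Fourier feature basis phi(x)."""
    ang = np.outer(x, freqs)
    return np.concatenate([np.sin(ang), np.cos(ang), np.ones((len(x), 1))], axis=1)

freqs = rng.uniform(0, 40, size=32)
w, *_ = np.linalg.lstsq(features(x_train, freqs), y_train, rcond=None)

# Resolution flexibility: evaluate the same continuous model on a finer grid.
x_fine = np.linspace(0, 1, 1024)
y_fine = features(x_fine, freqs) @ w
print(np.abs(features(x_train, freqs) @ w - y_train).max())  # fit error on the training grid
```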

Authors:Haotian Qin, Dongliang Chang, Yueying Gao, Bingyao Yu, Lei Chen, Zhanyu Ma
Title: Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection
Abstract:
Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. To address this issue, incorporating an information bottleneck network into the task presents a straightforward solution. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional bottleneck network to reduce feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model's generalization ability. We begin with a semantic analysis experiment, where we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as "bias". Therefore, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global "bias". Our model achieves exceptional generalization performance on the GenImage dataset and latest generative models. Our code is available at https://github.com/Ant0ny44/InfoFD.
中文摘要:本文提出InfoFD框架,通过多模态条件瓶颈网络和动态文本正交化技术,在CLIP特征空间中减少特征冗余并利用文本偏差,显著提升了AI生成图像检测的泛化能力。
English Summary: The paper introduces InfoFD, a text-guided framework that enhances AI-generated image detection by reducing feature redundancy in CLIP through a multimodal conditional bottleneck network and dynamic text orthogonalization, significantly improving generalization performance.
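A hedged sketch of the "bias" probe described in the abstract: comparing the cosine similarity of arbitrary text features against real versus fake image features. The CLIP embeddings are assumed to be extracted elsewhere; random placeholders stand in for them here.

```python
# Hedged sketch of the similarity-gap ("bias") measurement in CLIP space,
# assuming text and image features were already extracted by a CLIP model.
import numpy as np

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholder arrays standing in for CLIP features (text x d, images x d).
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(16, 512))
real_feats = rng.normal(size=(100, 512))
fake_feats = rng.normal(size=(100, 512))

bias = cosine(text_feats, fake_feats).mean() - cosine(text_feats, real_feats).mean()
print(f"mean similarity gap (fake - real): {bias:.4f}")
# The paper observes this gap for genuine CLIP features; with the random
# placeholders above it will hover around zero.
```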

Authors:Wenjie Liu, Zhongliang Liu, Junwei Shu, Changbo Wang, Yang Li
Title: GT^2-GS: Geometry-aware Texture Transfer for Gaussian Splatting
Abstract:
Transferring 2D textures to 3D modalities is of great significance for improving the efficiency of multimedia content creation. Existing approaches have rarely focused on transferring image textures onto 3D representations. 3D style transfer methods are capable of transferring abstract artistic styles to 3D scenes. However, these methods often overlook the geometric information of the scene, which makes it challenging to achieve high-quality 3D texture transfer results. In this paper, we present GT^2-GS, a geometry-aware texture transfer framework for Gaussian splatting. From the perspective of matching texture features with geometric information in rendered views, we identify the issue of insufficient texture features and propose a geometry-aware texture augmentation module to expand the texture feature set. Moreover, a geometry-consistent texture loss is proposed to optimize texture features into the scene representation. This loss function incorporates both camera pose and 3D geometric information of the scene, enabling controllable texture-oriented appearance editing. Finally, a geometry preservation strategy is introduced. By alternating between the texture transfer and geometry correction stages over multiple iterations, this strategy achieves a balance between learning texture features and preserving geometric integrity. Extensive experiments demonstrate the effectiveness and controllability of our method. Through geometric awareness, our approach achieves texture transfer results that better align with human visual perception. Our homepage is available at https://vpx-ecnu.github.io/GT2-GS-website.

Authors:Pinxin Liu, Haiyang Liu, Luchuan Song, Chenliang Xu
Title: Intentional Gesture: Deliver Your Intentions with Gestures for Speech
Abstract:
When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture
中文:提出的Intentional-Gesture框架通过将交流意图融入动作合成,解决了当前手势生成方法语义浅薄的问题,利用意图标注数据集和标记化动作表征实现了最先进的性能。
English: The proposed Intentional-Gesture framework addresses the semantic limitations of current gesture generation methods by incorporating communicative intentions into motion synthesis, achieving state-of-the-art performance through intention-annotated datasets and tokenized motion representations.

Authors:Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, Yixuan Yuan
Title: MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models
Abstract:
Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components: a Mono-Multi Feature Adapter that transforms monocular features into multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we convincingly demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods while maintaining computational efficiency with minimal trainable parameters. Codes are available at https://github.com/CUHK-AIM-Group/MonoSplat.
中文:MonoSplat是一种创新框架,利用预训练的单目深度模型增强3D高斯重建,在保持计算效率的同时,实现了跨场景的卓越泛化能力和渲染质量。
English: MonoSplat is a novel framework that leverages pre-trained monocular depth models to enhance 3D Gaussian reconstruction, achieving superior generalization and rendering quality across diverse scenes while maintaining computational efficiency.

Authors:Yangting Shi, Renjie He, Le Hui, Xiang Li, Jian Yang, Ming-Ming Cheng, Yimian Dai
Title: AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection
Abstract:
Omni-domain infrared small target detection (IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multi-modal framework that fundamentally reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at https://github.com/GrokCV/AuxDet.
中文摘要:本文提出的AuxDet框架通过引入辅助元数据构建多模态检测范式,有效解决了全域红外小目标检测中视觉模型的泛化局限,在复杂场景基准测试中展现出最优性能。
English Summary: The proposed AuxDet framework addresses the limitations of visual-only models in omni-domain infrared small target detection by incorporating auxiliary metadata through a multimodal approach, achieving superior performance on complex benchmarks.
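The sketch below illustrates one plausible form of MLP-based metadata fusion in the spirit of AuxDet: a metadata embedding predicts per-channel scale and shift terms that modulate the visual feature map. The layer sizes and the FiLM-style gating are assumptions, not the released code.

```python
# Hedged sketch of metadata-conditioned feature fusion: an MLP maps encoded
# acquisition metadata to per-channel modulation of the visual features.
import torch
import torch.nn as nn

class MetadataFusion(nn.Module):
    def __init__(self, visual_dim=256, meta_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * visual_dim),       # per-channel scale and shift
        )

    def forward(self, visual_feat, metadata):
        # visual_feat: (B, C, H, W), metadata: (B, meta_dim) numeric encoding
        scale, shift = self.mlp(metadata).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return visual_feat * (1 + scale) + shift     # metadata-modulated features

fused = MetadataFusion()(torch.randn(2, 256, 64, 64), torch.randn(2, 32))
print(fused.shape)  # torch.Size([2, 256, 64, 64])
```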

Authors:Qian Zhou, Xianda Guo, Jilong Wang, Chuanfu Shen, Zhongyuan Wang, Hua Zou, Qin Zou, Chao Liang, Long Chen, Gang Wu
Title: Exploring Generalized Gait Recognition: Reducing Redundancy and Noise within Indoor and Outdoor Datasets
Abstract:
Generalized gait recognition, which aims to achieve robust performance across diverse domains, remains a challenging problem due to severe domain shifts in viewpoints, appearances, and environments. While mixed-dataset training is widely used to enhance generalization, it introduces new obstacles including inter-dataset optimization conflicts and redundant or noisy samples, both of which hinder effective representation learning. To address these challenges, we propose a unified framework that systematically improves cross-domain gait recognition. First, we design a disentangled triplet loss that isolates supervision signals across datasets, mitigating gradient conflicts during optimization. Second, we introduce a targeted dataset distillation strategy that filters out the least informative 20% of training samples based on feature redundancy and prediction uncertainty, enhancing data efficiency. Extensive experiments on CASIA-B, OU-MVLP, Gait3D, and GREW demonstrate that our method significantly improves cross-dataset recognition for both GaitBase and DeepGaitV2 backbones, without sacrificing source-domain accuracy. Code will be released at https://github.com/li1er3/Generalized_Gait.
中文: 该统一框架通过解耦三元组损失解决跨数据集优化冲突,并采用针对性数据集蒸馏策略过滤冗余样本,显著提升了跨域步态识别性能,在多个基准测试中表现优异且不影响源域精度。
English: The proposed unified framework enhances cross-domain gait recognition by employing a disentangled triplet loss to resolve inter-dataset optimization conflicts and a targeted dataset distillation strategy to eliminate redundant samples, achieving superior performance across multiple benchmarks without compromising source-domain accuracy.
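A hedged sketch of a dataset-disentangled triplet loss: triplets are mined and averaged within each source dataset so that supervision signals from different datasets do not interfere. The hardest-positive/hardest-negative mining shown here is a simplification of whatever strategy the paper actually uses.

```python
# Hedged sketch: triplet loss computed independently per source dataset,
# then averaged, so gradients from different datasets do not conflict.
import torch
import torch.nn.functional as F

def disentangled_triplet_loss(feats, labels, dataset_ids, margin=0.2):
    losses = []
    for d in dataset_ids.unique():
        mask = dataset_ids == d
        f, y = feats[mask], labels[mask]
        dist = torch.cdist(f, f)                     # pairwise distances within one dataset
        loss_d = []
        for i in range(len(f)):
            pos = dist[i][(y == y[i]) & (torch.arange(len(f)) != i)]
            neg = dist[i][y != y[i]]
            if len(pos) and len(neg):
                loss_d.append(F.relu(pos.max() - neg.min() + margin))  # hardest pos/neg
        if loss_d:
            losses.append(torch.stack(loss_d).mean())
    return torch.stack(losses).mean() if losses else feats.sum() * 0.0

feats = torch.randn(32, 128, requires_grad=True)
labels = torch.randint(0, 8, (32,))
dataset_ids = torch.randint(0, 2, (32,))
print(disentangled_triplet_loss(feats, labels, dataset_ids))
```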

Authors:Xinran Wang, Songyu Xu, Xiangxuan Shan, Yuxuan Zhang, Muxi Diao, Xueyan Duan, Yanhua Huang, Kongming Liang, Zhanyu Ma
Title: CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation
Abstract:
Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects (shot scale, shot angle, composition, camera movement, lighting, color, and focal length) and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question-answer pairs and annotated descriptions to assess MLLMs' ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation on 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automated film production and appreciation. The code and benchmark can be accessed at https://github.com/PRIS-CV/CineTechBench.
中文: CineTechBench基准通过提供涵盖七个关键维度的600多张图像和120个剪辑的专业标注,填补了电影摄影专家标注数据的空白,全面评估多模态模型对电影技术的理解与生成能力。
English: The CineTechBench benchmark addresses the gap in expert-annotated cinematography data by providing over 600 images and 120 clips annotated across seven key dimensions, enabling comprehensive evaluation of multimodal models' understanding and generation of cinematic techniques.

Authors:Seongmin Hwang, Daeyoung Han, Moongu Jeon
Title: Multispectral Detection Transformer with Infrared-Centric Feature Fusion
Abstract:
Multispectral object detection aims to leverage complementary information from visible (RGB) and infrared (IR) modalities to enable robust performance under diverse environmental conditions. Our key insight, derived from wavelet analysis and empirical observations, is that IR images contain structurally rich high-frequency information critical for object detection, making an infrared-centric approach highly effective. To capitalize on this finding, we propose Infrared-Centric Fusion (IC-Fusion), a lightweight and modality-aware sensor fusion method that prioritizes infrared features while effectively integrating complementary RGB semantic context. IC-Fusion adopts a compact RGB backbone and designs a novel fusion module comprising a Multi-Scale Feature Distillation (MSFD) block to enhance RGB features and a three-stage fusion block with a Cross-Modal Channel Shuffle Gate (CCSG), a Cross-Modal Large Kernel Gate (CLKG), and a Channel Shuffle Projection (CSP) to facilitate effective cross-modal interaction. Experiments on the FLIR and LLVIP benchmarks demonstrate the superior effectiveness and efficiency of our IR-centric fusion strategy, further validating its benefits. Our code is available at https://github.com/smin-hwang/IC-Fusion.
中文: 本研究提出红外中心融合(IC-Fusion)方法,通过轻量级传感器融合优先处理红外特征并整合RGB语义信息,在基准数据集上验证了其卓越的性能与效率。
English: The study introduces Infrared-Centric Fusion (IC-Fusion), a lightweight sensor fusion method that prioritizes infrared features while integrating RGB semantic context, demonstrating superior performance and efficiency on benchmark datasets.

Authors:Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Title: DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer
Abstract:
Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates dual-level decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-task-oriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness. Our code is available at https://github.com/haiduo/DeepKD.
Chinese: DeepKD提出了一种双级解耦与自适应去噪的训练框架,通过独立动量更新器和动态掩码机制解决知识流冲突并过滤噪声信号,在多个数据集上验证了其有效性。
English: DeepKD introduces a dual-level decoupling framework with adaptive denoising to address conflicts between knowledge components and filter noisy signals in knowledge distillation, achieving superior performance across multiple datasets.
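The following sketch illustrates a dynamic top-k mask in the spirit of DTM: only the K most confident non-target classes of the teacher participate in the distillation loss, with K growing over training. The schedule, temperature, and loss form are assumptions rather than the authors' exact recipe.

```python
# Hedged sketch of dynamic top-k masking for distillation: distill only over
# the teacher's K most confident non-target classes, with K growing per epoch.
import torch
import torch.nn.functional as F

def dtm_distill_loss(student_logits, teacher_logits, target, epoch, total_epochs,
                     k_min=5, temperature=4.0):
    num_classes = teacher_logits.size(1)
    k = int(k_min + (num_classes - 1 - k_min) * epoch / max(total_epochs - 1, 1))

    masked_teacher = teacher_logits.clone()
    masked_teacher.scatter_(1, target.unsqueeze(1), float("-inf"))   # drop the target class
    topk_idx = masked_teacher.topk(k, dim=1).indices                 # K confident non-targets

    t = F.log_softmax(teacher_logits.gather(1, topk_idx) / temperature, dim=1)
    s = F.log_softmax(student_logits.gather(1, topk_idx) / temperature, dim=1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2

loss = dtm_distill_loss(torch.randn(8, 100), torch.randn(8, 100),
                        torch.randint(0, 100, (8,)), epoch=3, total_epochs=240)
print(loss)
```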

Authors:Muniba Noreen, Furqan Shaukat
Title: Lung Nodule-SSM: Self-Supervised Lung Nodule Detection and Classification in Thoracic CT Images
Abstract:
Lung cancer remains among the deadliest types of cancer in recent decades, and early lung nodule detection is crucial for improving patient outcomes. The limited availability of annotated medical imaging data remains a bottleneck in developing accurate computer-aided diagnosis (CAD) systems. Self-supervised learning can help leverage large amounts of unlabeled data to develop more robust CAD systems. With the recent advent of transformer-based architectures and their ability to generalize to unseen tasks, there has been an effort within the healthcare community to adapt them to various medical downstream tasks. Thus, we propose a novel "LungNodule-SSM" method, which utilizes self-supervised learning with DINOv2 as a backbone to enhance lung nodule detection and classification without annotated data. Our methodology has two stages: first, the DINOv2 model is pre-trained on unlabeled CT scans to learn robust feature representations; second, these features are fine-tuned using transformer-based architectures for lesion-level detection and accurate lung nodule diagnosis. The proposed method has been evaluated on the challenging LUNA 16 dataset, consisting of 888 CT scans, and compared with SOTA methods. Our experimental results show the superiority of our proposed method with an accuracy of 98.37%, demonstrating its effectiveness in lung nodule detection. The source code, datasets, and pre-processed data can be accessed at https://github.com/EMeRALDsNRPU/Lung-Nodule-SSM-Self-Supervised-Lung-Nodule-Detection-and-Classification/tree/main
中文: 提出的"LungNodule-SSM"方法采用基于DINOv2的自监督学习,在LUNA 16数据集上实现了98.37%的肺结节检测准确率,性能优于现有最先进方法。
English: The proposed "LungNodule-SSM" method leverages self-supervised learning with DINOv2 to achieve 98.37% accuracy in lung nodule detection on the LUNA 16 dataset, outperforming existing state-of-the-art approaches.

Authors:Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada
Title: MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks
Abstract:
Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: https://github.com/josesosajs/multimae-meets-eo.
Chinese: 本文提出了一种灵活的多模态多任务预训练策略,采用多模态多任务掩码自编码器(MultiMAE)处理地球观测数据,通过重建多种输入模态提升了迁移学习能力,在分类和分割任务上超越了现有最优方法。
English: This paper introduces a flexible multi-modal, multi-task pre-training strategy using a Multi-modal Multi-task Masked Autoencoder (MultiMAE) for Earth Observation data, which enhances transfer learning by reconstructing diverse input modalities and outperforms existing methods in classification and segmentation tasks.

Authors:Nisarga Nilavadi, Andrey Rudenko, Timm Linder
Title: UPTor: Unified 3D Human Pose Dynamics and Trajectory Prediction for Human-Robot Interaction
Abstract:
We introduce a unified approach to forecast the dynamics of human keypoints along with the motion trajectory based on a short sequence of input poses. While many studies address either full-body pose prediction or motion trajectory prediction, only a few attempt to merge them. We propose a motion transformation technique to simultaneously predict full-body pose and trajectory key-points in a global coordinate frame. We utilize an off-the-shelf 3D human pose estimation module, a graph attention network to encode the skeleton structure, and a compact, non-autoregressive transformer suitable for real-time motion prediction for human-robot interaction and human-aware navigation. We introduce a human navigation dataset "DARKO" with a specific focus on navigational activities that are relevant for human-aware mobile robot navigation. We perform extensive evaluation on Human3.6M, CMU-Mocap, and our DARKO dataset. In comparison to prior work, we show that our approach is compact, real-time, and accurate in predicting human navigation motion across all datasets. Result animations, our dataset, and code will be available at https://nisarganc.github.io/UPTor-page/

Authors:Daniya Najiha A. Kareem, Jean Lahoud, Mustansar Fiaz, Amandeep Kumar, Hisham Cholakkal
Title: Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets
Abstract:
Many practical medical imaging scenarios include categories that are under-represented but still crucial. The relevance of image recognition models to real-world applications lies in their ability to generalize to these rare classes as well as unseen classes. Real-world generalization requires taking into account the various complexities that can be encountered in the real world. First, training data is highly imbalanced, which may lead to the model exhibiting bias toward the more frequently represented classes. Moreover, real-world data may contain unseen classes that need to be identified, and model performance is affected by data scarcity. While medical image recognition has been extensively addressed in the literature, current methods do not take into account all the intricacies of real-world scenarios. To this end, we propose an open-set learning method for highly imbalanced medical datasets using a semi-supervised approach. Recognizing the adverse impact of the long-tail distribution on inherent model characteristics, we implement a feature-level regularization strategy complemented by a classifier normalization technique. We conduct extensive experiments on the publicly available ISIC2018, ISIC2019, and TissueMNIST datasets with various numbers of labelled samples. Our analysis shows that addressing the impact of long-tail data in classification significantly improves the overall performance of the network in terms of closed-set and open-set accuracies on all datasets. Our code and trained models will be made publicly available at https://github.com/Daniyanaj/OpenLTR.
中文摘要:本研究提出一种基于半监督学习的开放集识别方法,通过特征级正则化和分类器归一化技术解决医学影像中的长尾分布问题,在多个公开数据集上验证了该方法对闭集和开放集识别性能的显著提升。
English Summary: This study introduces an open-set learning method using semi-supervised techniques to address class imbalance and unseen class recognition in medical imaging, demonstrating improved performance across multiple datasets through feature regularization and classifier normalization.
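As one concrete reading of "classifier normalization", the sketch below uses a cosine classifier with L2-normalized weights and features so that head classes cannot dominate the logits through weight magnitude alone; this is a standard stand-in, not necessarily the paper's exact module.

```python
# Hedged sketch of classifier normalization for long-tailed data: a cosine
# classifier with unit-norm class prototypes and unit-norm features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedClassifier(nn.Module):
    def __init__(self, feat_dim=256, num_classes=8, scale=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, feats):
        w = F.normalize(self.weight, dim=1)          # unit-norm class prototypes
        f = F.normalize(feats, dim=1)                # unit-norm features
        return self.scale * f @ w.t()                # cosine logits

logits = NormalizedClassifier()(torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 8])
```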

Authors:Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
Title: MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Abstract:
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to their potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that uses multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach across various mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The codes are available at https://github.com/Nanhu-AI-Lab/MSViT.
中文: 本文提出MSViT这一新型脉冲驱动Transformer架构,通过采用多尺度脉冲注意力机制有效解决了现有SNN变换器在图像多尺度特征提取上的瓶颈,在多个数据集上实现了最优性能。
English: This paper introduces MSViT, a novel spike-driven Transformer architecture that employs multi-scale spiking attention to overcome feature extraction limitations in existing SNN-based transformers, achieving state-of-the-art performance across multiple datasets.

Authors:Xigui Li, Yuanye Zhou, Feiyang Xiao, Xin Guo, Chen Jiang, Tan Pan, Xingmeng Zhang, Cenyu Liu, Zeyun Miao, Jianchao Ge, Xiansheng Wang, Qimeng Wang, Yichi Zhang, Wenbo Zhang, Fengping Zhu, Limei Han, Yuan Qi, Chensen Lin, Yuan Cheng
Title: Aneumo: A Large-Scale Multimodal Aneurysm Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks
Abstract:
Intracranial aneurysms (IAs) are serious cerebrovascular lesions found in approximately 5% of the general population. Their rupture may lead to high mortality. Current methods for assessing IA risk focus on morphological and patient-specific factors, but the hemodynamic influences on IA development and rupture remain unclear. While accurate for hemodynamic studies, conventional computational fluid dynamics (CFD) methods are computationally intensive, hindering their deployment in large-scale or real-time clinical applications. To address this challenge, we curated a large-scale, high-fidelity aneurysm CFD dataset to facilitate the development of efficient machine learning algorithms for such applications. Based on 427 real aneurysm geometries, we synthesized 10,660 3D shapes via controlled deformation to simulate aneurysm evolution. The authenticity of these synthetic shapes was confirmed by neurosurgeons. CFD computations were performed on each shape under eight steady-state mass flow conditions, generating a total of 85,280 blood flow dynamics data covering key parameters. Furthermore, the dataset includes segmentation masks, which can support tasks that use images, point clouds or other multimodal data as input. Additionally, we introduced a benchmark for estimating flow parameters to assess current modeling methods. This dataset aims to advance aneurysm research and promote data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. The code and dataset are available at: https://github.com/Xigui-Li/Aneumo.
Chinese: 本研究通过合成3D形状进行流体动力学模拟,构建了一个大规模颅内动脉瘤血流动力学数据集,旨在支持机器学习应用,以促进临床风险评估和生物流体研究的高效发展。
English: This study presents a large-scale hemodynamic dataset of intracranial aneurysms, created through computational fluid dynamics simulations on synthetic 3D shapes, to enable machine learning applications for efficient clinical risk assessment and biofluid research.

Authors:Tuan-Vinh La, Minh-Hieu Nguyen, Minh-Son Dao
Title: KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection
Abstract:
Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) Existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of the neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at https://github.com/latuanvinh1998/KGAlign.
中文: 本文提出了一种新颖的多模态假新闻检测框架,融合视觉、文本和知识表征,通过细粒度对象细节、全局图像语义和外部知识,利用基于Transformer的分类器超越现有方法。
English: This paper introduces a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations, leveraging fine-grained object details, global image semantics, and external knowledge to outperform existing methods through a Transformer-based classifier.

Authors:Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
Title: FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
Abstract:
Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the FastCar framework to accelerate the decode phase for AR video generation by exploring the temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations) with detailed theoretical analysis and justification. Also, we develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge. Furthermore, by combining FastCar and sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: https://github.com/shawnricecake/fast-car
中文摘要:FastCar框架通过利用相邻帧间MLP输出的时间冗余性,采用缓存复用策略和硬件加速技术,将自回归视频生成的解码速度提升2.1倍以上,同时显著提高能效并支持高分辨率长视频生成。
English Summary: The FastCar framework accelerates auto-regressive video generation by exploiting temporal redundancy in MLP outputs, achieving over 2.1x decoding speedup and higher energy efficiency through optimized caching strategies and FPGA hardware acceleration.
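A hedged sketch of the cache-and-replay idea: when a frame's hidden states are sufficiently similar to the previous frame's (a stand-in for the Temporal Attention Score), the cached MLP output is reused instead of recomputed. The similarity measure and threshold are assumptions.

```python
# Hedged sketch of decode-phase MLP replay: reuse the cached MLP output when
# the current frame's hidden states are nearly identical to the previous ones.
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
cache = {"hidden": None, "out": None}

def mlp_with_replay(hidden, threshold=0.98):
    if cache["hidden"] is not None:
        score = torch.cosine_similarity(hidden.flatten(), cache["hidden"].flatten(), dim=0)
        if score > threshold:                        # temporally redundant frame
            return cache["out"], True                # replay the cached output
    out = mlp(hidden)
    cache["hidden"], cache["out"] = hidden.detach(), out.detach()
    return out, False

frame_a = torch.randn(1, 16, 64)
out_a, replayed_a = mlp_with_replay(frame_a)
out_b, replayed_b = mlp_with_replay(frame_a + 1e-4 * torch.randn_like(frame_a))
print(replayed_a, replayed_b)  # False True (the second frame reuses the cache)
```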

Authors:Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Title: DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance
Abstract:
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over the latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation in full resolution, and subsequently restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: https://github.com/shawnricecake/draft-attention
中文: 提出的DraftAttention框架通过GPU上的动态稀疏注意力加速视频扩散变换器,在保持生成质量的同时实现了高达1.75倍的端到端加速。
English: The proposed DraftAttention framework accelerates video diffusion transformers by using dynamic sparse attention on GPUs, achieving up to 1.75x speedup while maintaining generation quality.
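The sketch below illustrates low-resolution attention guidance under simplifying assumptions: queries and keys are pooled into coarse draft tokens, the draft attention map selects a few key blocks per query block, and full-resolution attention runs only over those blocks. The block size, top-k rule, and the unoptimized gathering loop are illustrative only, not the hardware-aligned reordering the paper describes.

```python
# Hedged sketch of draft-guided sparse attention: a pooled (draft) attention
# map picks which key/value blocks each query block attends to at full res.
import torch

def draft_guided_attention(q, k, v, block=16, keep=4):
    B, N, D = q.shape
    qb = q.view(B, N // block, block, D).mean(2)           # draft queries
    kb = k.view(B, N // block, block, D).mean(2)           # draft keys
    draft = torch.softmax(qb @ kb.transpose(1, 2) / D ** 0.5, dim=-1)
    top_blocks = draft.topk(keep, dim=-1).indices          # (B, query_blocks, keep)

    out = torch.zeros_like(q)
    for qi in range(N // block):
        idx = top_blocks[:, qi]                            # selected key blocks
        gk = torch.cat([k.view(B, -1, block, D)[b, idx[b]].reshape(1, -1, D)
                        for b in range(B)], dim=0)
        gv = torch.cat([v.view(B, -1, block, D)[b, idx[b]].reshape(1, -1, D)
                        for b in range(B)], dim=0)
        qs = q[:, qi * block:(qi + 1) * block]
        attn = torch.softmax(qs @ gk.transpose(1, 2) / D ** 0.5, dim=-1)
        out[:, qi * block:(qi + 1) * block] = attn @ gv
    return out

q = k = v = torch.randn(2, 256, 64)
print(draft_guided_attention(q, k, v).shape)  # torch.Size([2, 256, 64])
```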

Authors:Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, Wentao Zhang
Title: UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens
Abstract:
Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept $\langle bo\rangle$, generating "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of its hat is difficult. We call this kind of generation personalized knowledge-driven generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting both personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: https://github.com/arctanxarc/UniCTokens.
中文: UniCTokens提出了一种统一框架,将个性化概念令牌整合到视觉语言模型中,以同时提升理解和生成任务,在个性化知识驱动生成方面实现了领先性能。
English: UniCTokens introduces a unified framework that integrates personalized concept tokens within a vision-language model to enhance both understanding and generation tasks, achieving state-of-the-art performance in personalized knowledge-driven generation.

Authors:Yilin Ye, Junchao Huang, Xingchen Zeng, Jiazhi Xia, Wei Zeng
Title: AKRMap: Adaptive Kernel Regression for Trustworthy Visualization of Cross-Modal Embeddings
Abstract:
Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
Chinese Summary: 本文提出AKRMap这一新型降维技术,通过学习投影空间中度量景观的核回归,能够更准确地可视化跨模态嵌入,在定量实验中超越了PCA和t-SNE等传统方法。
English Summary: This paper introduces AKRMap, a novel dimensionality reduction technique that visualizes cross-modal embeddings more accurately by learning kernel regression of the metric landscape, outperforming traditional methods like PCA and t-SNE.
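To make the core idea concrete, the sketch below regresses a per-sample metric over 2D projection coordinates with plain Nadaraya-Watson kernel regression; AKRMap's adaptive, jointly optimized kernels and supervised projection network are replaced here by a fixed Gaussian bandwidth and random placeholder data.

```python
# Hedged sketch: estimate a smooth metric landscape over 2D projection
# coordinates with Nadaraya-Watson kernel regression (fixed Gaussian kernel).
import numpy as np

def kernel_regression(query_xy, sample_xy, sample_metric, bandwidth=0.1):
    d2 = ((query_xy[:, None, :] - sample_xy[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))                 # Gaussian kernel weights
    return (w * sample_metric[None, :]).sum(1) / (w.sum(1) + 1e-8)

rng = np.random.default_rng(0)
proj = rng.random((500, 2))                                # 2D embedding coordinates
metric = np.sin(4 * proj[:, 0]) + rng.normal(0, 0.05, 500) # per-sample metric values

grid = np.stack(np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50)), -1).reshape(-1, 2)
landscape = kernel_regression(grid, proj, metric)          # smooth metric map over the plot
print(landscape.shape)  # (2500,)
```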

Authors:Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin
Title: Beyond Words: Multimodal LLM Knows When to Speak
Abstract:
While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.
中文: 本研究提出MM-When2Speak多模态大语言模型,通过整合对齐的视觉、听觉和文本数据来预测最佳回应时机与类型,在对话响应时间准确性上相比现有模型提升高达四倍。
English: This study introduces MM-When2Speak, a multimodal LLM that leverages aligned visual, auditory, and textual data to enhance conversational AI by accurately predicting optimal response timing and type, achieving up to four times better timing accuracy than existing models.

Authors:Anna C. Doris, Md Ferdous Alam, Amin Heyrani Nobari, Faez Ahmed
Title: CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation
Abstract:
Efficient creation of accurate and editable 3D CAD models is critical in engineering design, significantly impacting cost and time-to-market in product innovation. Current manual workflows remain highly time-consuming and demand extensive user expertise. While recent developments in AI-driven CAD generation show promise, existing models are limited by incomplete representations of CAD operations, inability to generalize to real-world images, and low output accuracy. This paper introduces CAD-Coder, an open-source Vision-Language Model (VLM) explicitly fine-tuned to generate editable CAD code (CadQuery Python) directly from visual input. Leveraging a novel dataset that we created--GenCAD-Code, consisting of over 163k CAD-model image and code pairs--CAD-Coder outperforms state-of-the-art VLM baselines such as GPT-4.5 and Qwen2.5-VL-72B, achieving a 100% valid syntax rate and the highest accuracy in 3D solid similarity. Notably, our VLM demonstrates some signs of generalizability, successfully generating CAD code from real-world images and executing CAD operations unseen during fine-tuning. The performance and adaptability of CAD-Coder highlights the potential of VLMs fine-tuned on code to streamline CAD workflows for engineers and designers. CAD-Coder is publicly available at: https://github.com/anniedoris/CAD-Coder.
中文:本文介绍了CAD-Coder这一视觉语言模型,它能从视觉输入生成可编辑的CAD代码,在语法有效性和三维相似度上超越现有模型,并展现出对真实世界图像的泛化能力。
English: This paper introduces CAD-Coder, a vision-language model that generates editable CAD code from visual inputs, outperforming existing models with perfect syntax validity and high 3D similarity while showing generalization to real-world images.
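For context, here is an illustrative (not paper-provided) example of the kind of editable CadQuery program such a model is trained to emit; the part, dimensions, and export path are arbitrary.

```python
# Illustrative CadQuery script of the style CAD-Coder targets: a simple plate
# with a centered through-hole, exported as an editable STEP file.
import cadquery as cq

# A 30 x 20 x 10 mm plate with a centered 5 mm through-hole.
result = (
    cq.Workplane("XY")
    .box(30, 20, 10)
    .faces(">Z")
    .workplane()
    .hole(5)
)

cq.exporters.export(result, "plate.step")   # re-exportable, parametric CAD model
```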

Authors:Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, Wenhu Chen
Title: VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Abstract:
Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer; Second, a significant portion of questions in these benchmarks have strong priors to allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing questions with open-ended short-answer, which truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we conclude the following findings: (1) video LMMs show drastic performance (>25%) drops on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
中文摘要:现有长视频理解基准因过度依赖选择题而存在缺陷,允许猜测和利用先验知识,因此提出采用开放式问题的VideoEval-Pro新基准,能更真实评估多模态模型对长视频的实际理解能力。
English Summary: Current long video understanding benchmarks are flawed due to their reliance on multiple-choice questions that allow guessing and prior knowledge exploitation, prompting the creation of VideoEval-Pro with open-ended questions for more accurate evaluation of multimodal models' true comprehension abilities.

Authors:Fnu Mohbat, Mohammed J Zaki
Title: KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models
Abstract:
Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generates recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities, retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL.
中文摘要:KERL系统创新性地将食物知识图谱与大语言模型结合,通过个性化推荐、菜谱生成和营养分析功能,经实验验证其性能显著优于现有方法,提供了完整的饮食解决方案。
English Summary: KERL is a novel system that integrates food knowledge graphs with large language models to deliver personalized food recommendations, generate detailed recipes, and provide nutritional analysis, outperforming existing methods through comprehensive evaluation.
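A hedged sketch of the KG-augmented prompting flow: retrieved subgraph triples are serialized and passed to the LLM as context for constrained recipe selection. The prompt wording, the toy triples, and the omission of the actual retrieval and LLM call are all assumptions, not the released KERL code.

```python
# Hedged sketch of serializing retrieved KG triples into LLM context for
# constrained recipe selection; retrieval and the LLM call are omitted.
def build_prompt(question, triples):
    context = "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)
    return (
        "Use the following food knowledge graph facts to answer.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer with recipes that satisfy every stated constraint."
    )

triples = [
    ("Chickpea Curry", "hasIngredient", "chickpeas"),
    ("Chickpea Curry", "dietLabel", "vegan"),
    ("Chickpea Curry", "caloriesPerServing", "320"),
]
print(build_prompt("Suggest a vegan dinner under 400 calories.", triples))
```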

Authors:Chuanbo Tang, Zhuoyuan Li, Yifan Bian, Li Li, Dong Liu
Title: Neural Video Compression with Context Modulation
Abstract:
Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM. The code is available at https://github.com/Austin4USTC/DCMVC.
中文: 本文提出了一种通过流定向和上下文补偿两步调制方法,有效提升神经视频编码器中时间上下文的利用效率,相比H.266/VVC和现有先进方法实现了显著的码率节省。
English: This paper introduces a two-step modulation method using flow orientation and context compensation to enhance temporal context utilization in neural video codecs, achieving significant bitrate reductions compared to H.266/VVC and prior NVC methods.

Authors:Yuxuan Wang, Xuanyu Yi, Qingshan Xu, Yuan Zhou, Long Chen, Hanwang Zhang
Title: Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image
Abstract:
Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality personalization that significantly outperforms existing methods. The code will be released at https://github.com/Yuxuan-W/CP-GS.
中文:本文提出的CP-GS框架通过几何引导的渐进式外观传播和迭代微调,有效解决了单图像3D场景个性化中的视角偏差问题,在保持多视角一致性方面显著优于现有方法。
English: This paper introduces CP-GS, a framework that overcomes viewpoint bias in single-image 3D scene personalization by progressively propagating reference appearance to novel views using geometric guidance and iterative fine-tuning, achieving superior multi-view consistency compared to existing methods.

Authors:Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, Bihan Wen
Title: Sparc3D: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling
Abstract:
High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines, which compress meshes with a VAE (using either 2D or 3D supervision) and then perform latent diffusion sampling, often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced by the VAE. We introduce Sparc3D, a unified framework that combines a sparse deformable marching cubes representation Sparcubes with a novel encoder Sparconv-VAE. Sparcubes converts raw meshes into high-resolution ($1024^3$) surfaces with arbitrary topology by scattering signed distance and deformation fields onto a sparse cube, allowing differentiable optimization. Sparconv-VAE is the first modality-consistent variational autoencoder built entirely upon sparse convolutional networks, enabling efficient and near-lossless 3D reconstruction suitable for high-resolution generative modeling through latent diffusion. Sparc3D achieves state-of-the-art reconstruction fidelity on challenging inputs, including open surfaces, disconnected components, and intricate geometry. It preserves fine-grained shape details, reduces training and inference cost, and integrates naturally with latent diffusion models for scalable, high-resolution 3D generation.
中文:Sparc3D提出了一种结合稀疏可变形立方体和新型稀疏卷积VAE的统一框架,实现了高保真度的三维重建与生成,克服了现有方法中的细节丢失和模态不匹配问题,同时降低了计算成本。
English: Sparc3D introduces a unified framework using sparse deformable cubes and a novel sparse convolutional VAE to achieve high-fidelity 3D reconstruction and generation, overcoming detail loss and modality mismatches in existing methods while reducing computational costs.

Authors:Guillaume Vray, Devavrat Tomar, Xufeng Gao, Jean-Philippe Thiran, Evan Shelhamer, Behzad Bozorgtabar
Title: ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains
Abstract:
This paper introduces ReservoirTTA, a novel plug-in framework designed for prolonged test-time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain-specialized models -- an adaptive test-time model ensemble -- that both detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, and thereby enables domain-specific adaptation. This multi-model strategy overcomes key limitations of single model adaptation, such as catastrophic forgetting, inter-domain interference, and error accumulation, ensuring robust and stable performance on sustained non-stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug-in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on scene-level corruption benchmarks (ImageNet-C, CIFAR-10/100-C), object-level style shifts (DomainNet-126, PACS), and semantic segmentation (Cityscapes->ACDC), covering recurring and continuously evolving domain shifts, show that ReservoirTTA substantially improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods. Our code is publicly available at https://github.com/LTS5/ReservoirTTA.
中文: ReservoirTTA是一种新颖的插件框架,通过维护领域专用模型库来应对持续变化的测试领域,既能检测新领域又能实现领域特定适配,在长期域偏移场景中显著提升了适应精度并保持稳定性能。
English: ReservoirTTA is a plug-in framework that uses an adaptive ensemble of domain-specialized models to enable robust and stable test-time adaptation for continuously shifting domains, overcoming issues like catastrophic forgetting and outperforming existing methods across multiple benchmarks.
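The sketch below illustrates the routing idea under simplifying assumptions: incoming batches are described by a channel-wise mean/std style vector, a new domain-specialized model copy is spawned when a batch is far from all existing centroids, and each batch is routed to the nearest one. The style descriptor, distance threshold, and centroid update are placeholders.

```python
# Hedged sketch of a reservoir of domain-specialized models with online
# style-based routing; thresholds and the style statistic are assumptions.
import copy
import torch

class DomainReservoir:
    def __init__(self, base_model, new_domain_threshold=2.0, max_models=8):
        self.centroids, self.models = [], []
        self.base, self.tau, self.max = base_model, new_domain_threshold, max_models

    def style(self, x):                                    # (B, C, H, W) -> style vector
        return torch.cat([x.mean(dim=(0, 2, 3)), x.std(dim=(0, 2, 3))])

    def route(self, x):
        s = self.style(x)
        if self.centroids:
            dists = torch.stack([torch.norm(s - c) for c in self.centroids])
            i = int(dists.argmin())
            if dists[i] < self.tau or len(self.models) >= self.max:
                self.centroids[i] = 0.9 * self.centroids[i] + 0.1 * s   # update centroid
                return self.models[i]
        self.centroids.append(s)                           # treat the batch as a new domain
        self.models.append(copy.deepcopy(self.base))
        return self.models[-1]

reservoir = DomainReservoir(torch.nn.Conv2d(3, 8, 3))
model = reservoir.route(torch.randn(4, 3, 32, 32))
print(len(reservoir.models))  # 1
```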

Authors:Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
Title: Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Abstract:
Video large language models (VideoLLM) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework "Video Compression Commander" (VidCom2). By quantifying each frame's uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing 70.8% of the LLM generation latency. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at https://github.com/xuyang-liu16/VidCom2.
中文摘要:VidCom2框架通过基于帧独特性的自适应视觉令牌压缩,解决了视频大语言模型的效率问题,在显著降低延迟的同时保持了接近原始性能。
English Summary: The VidCom2 framework addresses efficiency issues in VideoLLMs by adaptively compressing visual tokens based on frame uniqueness, achieving near-original performance with significantly reduced latency.
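
A toy Python sketch of frame-adaptive token compression (an illustration of the general idea, not the released VidCom2 code; the uniqueness measure, budget allocation, and saliency proxy below are all assumptions):

    import torch

    def adaptive_token_budget(frame_feats, keep_ratio=0.25):
        """frame_feats: (T, N, D) visual tokens per frame.
        Give a larger token budget to frames that are less similar to the rest."""
        T, N, D = frame_feats.shape
        frame_repr = torch.nn.functional.normalize(frame_feats.mean(dim=1), dim=-1)
        sim = frame_repr @ frame_repr.T                  # (T, T) cosine similarities
        uniqueness = 1 - (sim.sum(dim=1) - 1) / (T - 1)  # high = distinctive frame
        weights = uniqueness / uniqueness.sum()
        budgets = (weights * keep_ratio * T * N).round().long().clamp(min=1, max=N)
        kept = []
        for t in range(T):
            # keep tokens most dissimilar to the frame mean (a simple saliency proxy)
            scores = 1 - (torch.nn.functional.normalize(frame_feats[t], dim=-1)
                          @ frame_repr[t])
            idx = scores.topk(int(budgets[t])).indices
            kept.append(frame_feats[t, idx])
        return kept  # list of (budget_t, D) tensors

    tokens = torch.randn(8, 196, 64)
    compressed = adaptive_token_budget(tokens)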

Authors:Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, Yunde Jia
Title: Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Abstract:
The matching formulation makes it naturally hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first problem is the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity. Besides, when we use the monocular feature in an iterative update structure, over-confidence in the disparity update leads to local optima. A direct fusion of a monocular depth map could alleviate the local-optima problem, but noisy disparity results computed in the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representations. The computed local ordering map is also used to re-weight the initial disparity update, resolving both the local-optima and noise problems. In addition, we formulate the final direct fusion of monocular depth to disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching effectively and efficiently. Experiments show significant performance improvements when generalizing from SceneFlow to the Middlebury and Booster datasets, while barely reducing efficiency.
中文: 立体匹配在处理病态区域时存在困难,但利用视觉基础模型的无偏单目先验可提升泛化能力;通过引入二元局部排序图和自适应配准方法,有效解决了深度对齐不准和局部最优等融合问题。
English: Stereo matching struggles with ill-posed regions, but leveraging unbiased monocular priors from vision foundation models can enhance generalization, though fusion challenges like depth misalignment and local optima require solutions such as binary local ordering maps and adaptive registration techniques.
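
A small Python sketch of a binary local ordering map (the neighborhood radius and wrap-around shifts are illustrative choices): each pixel records whether each neighbor is farther or nearer, which is invariant to the affine ambiguity of relative monocular depth.

    import numpy as np

    def binary_local_ordering(depth, radius=1):
        """For each pixel, record whether each neighbor within `radius` is
        farther (1) or nearer (0); the ordering survives affine depth scaling."""
        maps = []
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                if dy == 0 and dx == 0:
                    continue
                shifted = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
                maps.append((shifted > depth).astype(np.uint8))
        return np.stack(maps, axis=0)  # (num_neighbors, H, W)

    mono = np.random.rand(4, 5)                # affine-invariant monocular depth
    order_a = binary_local_ordering(mono)
    order_b = binary_local_ordering(2.0 * mono + 3.0)
    assert (order_a == order_b).all()          # ordering unchanged by affine map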

Authors:Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu
Title: DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Abstract:
Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
中文摘要:DeepEyes模型通过端到端强化学习实现了“图像思维”的交替多模态推理范式,在细粒度感知和推理任务中取得显著性能提升,同时展现出类人的视觉推理模式。
English Summary: The DeepEyes model introduces an interleaved multimodal reasoning paradigm that enables "thinking with images" through end-to-end reinforcement learning, achieving significant performance gains across perception and reasoning tasks while demonstrating human-like visual reasoning patterns.

Authors:Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, Shouhong Ding
Title: Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable
Abstract:
Existing detectors are often trained on biased datasets, leading to the possibility of overfitting on non-causal image attributes that are spuriously correlated with real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when applied to unbiased datasets. One common solution is to perform dataset alignment through generative reconstruction, matching the semantic content between real and synthetic images. However, we revisit this approach and show that pixel-level alignment alone is insufficient. The reconstructed images still suffer from frequency-level misalignment, which can perpetuate spurious correlations. To illustrate, we observe that reconstruction models tend to restore the high-frequency details lost in real images (possibly due to JPEG compression), inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images for testing detector performance on the most aligned dataset, and EvalGEN, featuring the latest generative models for assessing detectors under new generative architectures such as visual auto-regressive generators. Finally, our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO improves across 8 diverse benchmarks by a non-trivial margin, including a +7.2% gain on in-the-wild benchmarks, highlighting the improved generalizability of unbiased detectors. Our code is available at: https://github.com/roy-ch/Dual-Data-Alignment.
中文: 现有检测器因在带有偏见的训练集上学习而过度依赖虚假相关特征,但仅像素级对齐无法解决频域错位问题,因此提出的双数据对齐方法通过同时校准像素和频率显著提升了检测器的跨数据集泛化能力。
English: Current detectors trained on biased datasets often overfit to non-causal features, but pixel-level alignment alone fails to address frequency-level misalignments that reinforce spurious correlations, prompting the proposed Dual Data Alignment method which improves generalizability across multiple benchmarks.
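
A toy Python sketch of frequency-level alignment (illustrative of the diagnosis above, not the paper's exact DDA procedure; the radial cutoff and the hard band swap are assumptions): the synthetic image keeps its low-frequency content but inherits the real image's high-frequency band, so the pair no longer differs in high-frequency richness.

    import numpy as np

    def match_high_frequencies(synthetic, real, cutoff=0.25):
        """Replace the synthetic image's high-frequency band with the real one's."""
        fs, fr = np.fft.fft2(synthetic), np.fft.fft2(real)
        H, W = synthetic.shape
        fy = np.fft.fftfreq(H)[:, None]
        fx = np.fft.fftfreq(W)[None, :]
        low = np.sqrt(fy**2 + fx**2) <= cutoff   # boolean low-pass mask
        aligned = np.where(low, fs, fr)          # synthetic content, real high-freqs
        return np.real(np.fft.ifft2(aligned))

    synth = np.random.rand(64, 64)
    real = np.random.rand(64, 64)
    aligned = match_high_frequencies(synth, real)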

Authors:Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
Title: Vid2World: Crafting Video Diffusion Models to Interactive World Models
Abstract:
World models, which predict transitions based on historical observations and action sequences, have shown great promise in improving data efficiency for sequential decision making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their applicability in complex environments. In contrast, video diffusion models trained on large, internet-scale datasets have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World performs causalization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation. Furthermore, it introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model. Extensive experiments in robot manipulation and game simulation domains show that our method offers a scalable and effective approach for repurposing highly capable video diffusion models to interactive world models.

Authors:Yoorhim Cho, Hongyeob Kim, Semin Kim, Youjia Zhang, Yunseok Choi, Sungeun Hong
Title: RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data
Abstract:
Visuo-tactile perception aims to understand an object's tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding. Code is available at https://aim-skku.github.io/RA-Touch

Authors:Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji
Title: Speculative Decoding Reimagined for Multimodal Large Language Models
Abstract:
This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to $2.29\times$ for LLaVA-1.5-7B and up to $2.46\times$ for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.
中文: 本文提出多模态推测解码(MSD)方法,通过分离处理文本与视觉标记并采用两阶段训练策略,在保持精度的同时将多模态大语言模型推理速度提升最高达2.46倍。
English: This paper presents Multimodal Speculative Decoding (MSD), a method that accelerates Multimodal Large Language Models by separately processing text and visual tokens and employing a two-stage training strategy, achieving up to 2.46× speedup without accuracy loss.
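
As background, a minimal Python sketch of the generic speculative-decoding loop that MSD builds on (greedy draft-then-verify; the function names and the toy "models" below are stand-ins, and MSD's actual contributions -- decoupled text/visual token paths and two-stage training -- live inside the draft model, which is not reproduced here):

    import torch

    @torch.no_grad()
    def speculative_step(target_logits_fn, draft_logits_fn, prefix, k=4):
        """Draft k tokens greedily, then verify them with one target-model pass."""
        draft = prefix.clone()
        proposed = []
        for _ in range(k):
            tok = draft_logits_fn(draft)[-1].argmax()
            proposed.append(tok)
            draft = torch.cat([draft, tok.view(1)])
        tgt = target_logits_fn(draft)          # scores all proposed positions at once
        accepted = []
        for i, tok in enumerate(proposed):
            best = tgt[len(prefix) + i - 1].argmax()
            if best == tok:
                accepted.append(tok)           # draft token verified
            else:
                accepted.append(best)          # correct the token and stop
                break
        return torch.cat([prefix, torch.stack(accepted)])

    # toy "models": logits from a fixed random projection of token ids
    V = 50
    emb = torch.randn(V, V)
    target = lambda seq: emb[seq]                                   # per-position logits
    drafter = lambda seq: emb[seq] + 0.01 * torch.randn(len(seq), V)  # noisy draft
    out = speculative_step(target, drafter, torch.tensor([3, 7]), k=4)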

Authors:Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: Visual Agentic Reinforcement Fine-Tuning
Abstract:
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, are still less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
中文: 本研究提出的视觉智能体强化微调(Visual-ARFT)方法显著提升了大型视觉语言模型执行网页搜索和图像编码处理的能力,在智能体任务中超越基线模型和GPT-4o,并在多跳问答基准测试中展现出强大的泛化性能。
English: This research introduces Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), which significantly enhances Large Vision-Language Models' ability to perform web searches and image manipulation through coding, outperforming baseline models and GPT-4o in agentic tasks and demonstrating strong generalization on multi-hop QA benchmarks.

Authors:Bin-Bin Gao, Xiaochen Chen, Zhongyi Huang, Congchong Nie, Jun Liu, Jinxiang Lai, Guannan Jiang, Xi Wang, Chengjie Wang
Title: Decoupling Classifier for Boosting Few-shot Object Detection and Instance Segmentation
Abstract:
This paper focuses on few-shot object detection (FSOD) and instance segmentation (FSIS), which require a model to quickly adapt to novel classes with a few labeled instances. Existing methods severely suffer from biased classification because of the missing-label issue, which naturally exists in instance-level few-shot scenarios and is first formally identified by us. Our analysis suggests that the standard classification head of most FSOD or FSIS models needs to be decoupled to mitigate this biased classification. Therefore, we propose an embarrassingly simple but effective method that decouples the standard classifier into two heads. These two individual heads are then capable of independently addressing clear positive samples and the noisy negative samples caused by missing labels. In this way, the model can effectively learn novel classes while mitigating the effects of noisy negative samples. Without bells and whistles, our model, with no additional computational cost or parameters, consistently outperforms its baseline and the state of the art by a large margin on the PASCAL VOC and MS-COCO benchmarks for FSOD and FSIS tasks. The code is available at https://csgaobb.github.io/Projects/DCFS.
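
A minimal Python sketch of the decoupling idea (the class name, head fusion rule, and loss routing are illustrative assumptions, not the paper's exact design): one head is trained only on labeled foreground so that possibly mislabeled background never biases it, while a second head absorbs the noisy negatives.

    import torch
    import torch.nn as nn

    class DecoupledClassifier(nn.Module):
        """Two heads over shared RoI features: a positive head for labeled
        foreground and a negative head for (possibly mislabeled) background."""
        def __init__(self, dim, num_classes):
            super().__init__()
            self.pos_head = nn.Linear(dim, num_classes + 1)  # last class = background
            self.neg_head = nn.Linear(dim, num_classes + 1)

        def forward(self, feats, labels=None):
            if labels is None:                   # inference: fuse both heads
                return self.pos_head(feats) + self.neg_head(feats)
            fg = labels < (self.pos_head.out_features - 1)
            loss = feats.new_zeros(())
            if fg.any():                         # clean positives -> positive head
                loss = loss + nn.functional.cross_entropy(
                    self.pos_head(feats[fg]), labels[fg])
            if (~fg).any():                      # noisy negatives -> negative head
                loss = loss + nn.functional.cross_entropy(
                    self.neg_head(feats[~fg]), labels[~fg])
            return loss

    clf = DecoupledClassifier(dim=256, num_classes=20)
    loss = clf(torch.randn(8, 256), torch.randint(0, 21, (8,)))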

Authors:Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang
Title: UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
Abstract:
Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen the performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.

Authors:Changgu Chen, Xiaoyan Yang, Junwei Shu, Changbo Wang, Yang Li
Title: LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer
Abstract:
In recent years, large-scale pre-trained diffusion transformer models have made significant progress in video generation. While current DiT models can produce high-definition, high-frame-rate, and highly diverse videos, there is a lack of fine-grained control over the video content. Controlling the motion of subjects in videos using only prompts is challenging, especially when it comes to describing complex movements. Further, existing methods fail to control the motion in image-to-video generation, as the subject in the reference image often differs from the subject in the reference video in terms of initial position, size, and shape. To address this, we propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation. Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos in both text-to-video and image-to-video generation. To this end, we first introduce a foreground-background disentangle module to distinguish between moving subjects and backgrounds in the reference video, preventing interference in the target video generation. A reweighted motion transfer module is designed to allow the target video to reference the motion from the reference video. To avoid interference from the subject in the reference video, we propose an appearance separation module to suppress the appearance of the reference subject in the target video. We annotate the DAVIS dataset with detailed prompts for our experiments and design evaluation metrics to validate the effectiveness of our method. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability. Our homepage is available at https://vpx-ecnu.github.io/LMP-Website/

Authors:Jiaming Li, Sheng Wang, Xin Wang, Yitao Zhu, Honglin Xiong, Zixu Zhuang, Qian Wang
Title: ReactDiff: Latent Diffusion for Facial Reaction Generation
Abstract:
Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener's facial reactions. The challenge lies in capturing the relevance between video and audio while balancing appropriateness, realism, and diversity. While prior works have mostly focused on uni-modal inputs or simplified reaction mappings, recent approaches such as PerFRDiff have explored multi-modal inputs and the one-to-many nature of appropriate reaction mappings. In this work, we propose the Facial Reaction Diffusion (ReactDiff) framework that uniquely integrates a Multi-Modality Transformer with conditional diffusion in the latent space for enhanced reaction generation. Unlike existing methods, ReactDiff leverages intra- and inter-class attention for fine-grained multi-modal interaction, while the latent diffusion process between the encoder and decoder enables diverse yet contextually appropriate outputs. Experimental results demonstrate that ReactDiff significantly outperforms existing approaches, achieving a facial reaction correlation of 0.26 and diversity score of 0.094 while maintaining competitive realism. The code is open-sourced at https://github.com/Hunan-Tiger/ReactDiff.
中文摘要:ReactDiff框架通过融合多模态变换器与潜在空间条件扩散,利用细粒度跨模态注意力机制生成既真实又多样的听者面部反应,在性能上显著超越现有方法。
English Summary: The ReactDiff framework integrates a Multi-Modality Transformer with conditional latent diffusion to generate realistic and diverse listener facial reactions by capturing fine-grained audio-visual interactions, significantly outperforming existing methods.

Authors:Hongjun Choi, Eun Som Jeon, Ankita Shukla, Pavan Turaga
Title: Intra-class Patch Swap for Self-Distillation
Abstract:
Knowledge distillation (KD) is a valuable technique for compressing large deep learning models into smaller, edge-suitable networks. However, conventional KD frameworks rely on pre-trained high-capacity teacher networks, which introduce significant challenges such as increased memory/storage requirements, additional training costs, and ambiguity in selecting an appropriate teacher for a given student model. Although teacher-free distillation (self-distillation) has emerged as a promising alternative, many existing approaches still rely on architectural modifications or complex training procedures, which limit their generality and efficiency. To address these limitations, we propose a novel framework based on teacher-free distillation that operates using a single student network without any auxiliary components, architectural modifications, or additional learnable parameters. Our approach is built on a simple yet highly effective augmentation, called intra-class patch swap augmentation. This augmentation simulates a teacher-student dynamic within a single model by generating pairs of intra-class samples with varying confidence levels, and then applying instance-to-instance distillation to align their predictive distributions. Our method is conceptually simple, model-agnostic, and easy to implement, requiring only a single augmentation function. Extensive experiments across image classification, semantic segmentation, and object detection show that our method consistently outperforms both existing self-distillation baselines and conventional teacher-based KD approaches. These results suggest that the success of self-distillation could hinge on the design of the augmentation itself. Our code is available at https://github.com/hchoi71/Intra-class-Patch-Swap.
中文摘要:本文提出了一种新颖的无教师知识蒸馏方法,通过类内图像块交换增强技术,在单一学生网络中实现有效的自蒸馏,无需架构修改或额外参数即可在多种计算机视觉任务中取得优越性能。
English Summary: The paper introduces a novel teacher-free knowledge distillation method using intra-class patch swap augmentation to enable effective self-distillation within a single student network, achieving superior performance across multiple computer vision tasks without architectural changes or additional parameters.
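
A minimal Python sketch of the augmentation-plus-distillation recipe (the patch size, number of swaps, temperature, and the symmetric-KL simplification are assumptions; the paper's exact confidence-based teacher selection is not reproduced):

    import torch
    import torch.nn.functional as F

    def intra_class_patch_swap(x1, x2, patch=8, n_swaps=4):
        """Swap random aligned patches between two same-class images, producing
        a pair whose predictions should agree but differ in confidence."""
        a, b = x1.clone(), x2.clone()
        _, H, W = a.shape
        for _ in range(n_swaps):
            y = torch.randint(0, H - patch + 1, (1,)).item()
            x = torch.randint(0, W - patch + 1, (1,)).item()
            a[:, y:y+patch, x:x+patch], b[:, y:y+patch, x:x+patch] = \
                b[:, y:y+patch, x:x+patch].clone(), a[:, y:y+patch, x:x+patch].clone()
        return a, b

    def swap_distill_loss(model, x1, x2, T=4.0):
        """Instance-to-instance distillation on the swapped pair (KL between
        softened predictions; shown one-directional for brevity)."""
        a, b = intra_class_patch_swap(x1, x2)
        pa, pb = model(a.unsqueeze(0)), model(b.unsqueeze(0))
        return F.kl_div(F.log_softmax(pa / T, -1), F.softmax(pb / T, -1),
                        reduction="batchmean") * T * T

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3*32*32, 10))
    loss = swap_distill_loss(model, torch.randn(3, 32, 32), torch.randn(3, 32, 32))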

Authors:Zhenyu Li, Tianyi Shang, Pengjie Xu, Zhaojun Deng
Title: Place Recognition Meets Multiple Modalities: A Comprehensive Review, Current Challenges and Future Directions
Abstract:
Place recognition is a cornerstone of vehicle navigation and mapping, which is pivotal in enabling systems to determine whether a location has been previously visited. This capability is critical for tasks such as loop closure in Simultaneous Localization and Mapping (SLAM) and long-term navigation under varying environmental conditions. In this survey, we comprehensively review recent advancements in place recognition, emphasizing three representative methodological paradigms: Convolutional Neural Network (CNN)-based approaches, Transformer-based frameworks, and cross-modal strategies. We begin by elucidating the significance of place recognition within the broader context of autonomous systems. Subsequently, we trace the evolution of CNN-based methods, highlighting their contributions to robust visual descriptor learning and scalability in large-scale environments. We then examine the emerging class of Transformer-based models, which leverage self-attention mechanisms to capture global dependencies and offer improved generalization across diverse scenes. Furthermore, we discuss cross-modal approaches that integrate heterogeneous data sources such as Lidar, vision, and text description, thereby enhancing resilience to viewpoint, illumination, and seasonal variations. We also summarize standard datasets and evaluation metrics widely adopted in the literature. Finally, we identify current research challenges and outline prospective directions, including domain adaptation, real-time performance, and lifelong learning, to inspire future advancements in this domain. A unified framework of leading-edge place recognition methods (i.e., a code library) and the results of their experimental evaluations are available at https://github.com/CV4RA/SOTA-Place-Recognitioner.
中文摘要:本综述系统回顾了位置识别领域的最新进展,重点分析了基于CNN的方法、Transformer框架和跨模态策略三大代表性技术范式,并探讨了当前研究挑战与未来发展方向。
English Summary: This survey comprehensively reviews recent advancements in place recognition for autonomous navigation, focusing on CNN-based methods, Transformer frameworks, and cross-modal strategies while addressing current challenges and future directions.

Authors:Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang
Title: Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Abstract:
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
Chinese: Dolphin提出了一种新颖的两阶段文档解析模型,通过首先生成布局元素再并行解析内容,结合异构锚点提示和大规模数据集,实现了顶尖的解析性能和效率。
English: Dolphin introduces a novel two-stage document parsing model that first analyzes layout elements and then parses content in parallel, achieving state-of-the-art performance and efficiency through heterogeneous anchor prompting and a large-scale dataset.

Authors:Yibo Gao, Hangqi Zhou, Zheyao Gao, Bomin Wang, Shangqi Gao, Sihan Wang, Xiahai Zhuang
Title: Learning Concept-Driven Logical Rules for Interpretable and Generalizable Medical Image Classification
Abstract:
The pursuit of decision safety in clinical applications highlights the potential of concept-based methods in medical imaging. While these models offer active interpretability, they often suffer from concept leakages, where unintended information within soft concept representations undermines both interpretability and generalizability. Moreover, most concept-based models focus solely on local explanations (instance-level), neglecting the global decision logic (dataset-level). To address these limitations, we propose Concept Rule Learner (CRL), a novel framework to learn Boolean logical rules from binarized visual concepts. CRL employs logical layers to capture concept correlations and extract clinically meaningful rules, thereby providing both local and global interpretability. Experiments on two medical image classification tasks show that CRL achieves competitive performance with existing methods while significantly improving generalizability to out-of-distribution data. The code of our work is available at https://github.com/obiyoag/crl.
中文: 概念规则学习器(CRL)框架通过从二值化视觉概念中学习布尔逻辑规则,解决了医学影像中的概念泄漏和可解释性局限问题,在分类任务中实现了优异性能并显著提升了泛化能力。
English: The Concept Rule Learner (CRL) framework addresses concept leakage and limited interpretability in medical imaging by learning Boolean logical rules from binarized visual concepts, achieving competitive performance and enhanced generalizability in classification tasks.
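
A toy Python sketch of a differentiable logical layer over binarized concepts (an illustration of the general rule-learning idea; the soft-AND parameterization and names are assumptions, not the CRL architecture): a soft conjunction whose learned membership weights, once thresholded, read off as Boolean rules.

    import torch
    import torch.nn as nn

    class SoftLogicLayer(nn.Module):
        """Soft conjunction over binary concepts: with membership weights
        w in [0,1], AND(c) = prod_i (1 - w_i * (1 - c_i)); an OR layer is
        the dual via De Morgan. Rules are read off the thresholded weights."""
        def __init__(self, n_concepts, n_rules):
            super().__init__()
            self.w = nn.Parameter(torch.rand(n_rules, n_concepts))

        def forward(self, c):                    # c: (B, n_concepts) in {0,1}
            w = torch.sigmoid(self.w)
            return torch.prod(1 - w.unsqueeze(0) * (1 - c.unsqueeze(1)), dim=-1)

    concepts = (torch.rand(4, 6) > 0.5).float()  # binarized visual concepts
    rules = SoftLogicLayer(n_concepts=6, n_rules=3)(concepts)  # (4, 3) activations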

Authors:Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
Title: Adversarially Pretrained Transformers may be Universally Robust In-Context Learners
Abstract:
Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we show that transformers adversarially pretrained on diverse tasks can serve as robust foundation models and eliminate the need for adversarial training in downstream tasks. Specifically, we theoretically demonstrate that through in-context learning, a single adversarially pretrained transformer can robustly generalize to multiple unseen tasks without any additional training, i.e., without any parameter updates. This robustness stems from the model's focus on robust features and its resistance to attacks that exploit non-predictive features. Besides these positive findings, we also identify several limitations. Under certain (albeit unrealistic) conditions, no universally robust single-layer transformers exist. Moreover, robust transformers exhibit an accuracy-robustness trade-off and require a large number of in-context demonstrations. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.
中文:对抗性预训练的Transformer可作为鲁棒基础模型,通过上下文学习无需额外对抗训练即可泛化至未知任务,但也存在准确率-鲁棒性权衡及依赖大量示例等局限性。
English: Adversarially pretrained transformers can serve as robust foundation models, enabling generalization to unseen tasks without further adversarial training through in-context learning, though they face limitations like accuracy-robustness trade-offs and dependency on numerous demonstrations.

Authors:Jesper Duemose Nielsen, Karthik Gopinath, Andrew Hoopes, Adrian Dalca, Colin Magdamo, Steven Arnold, Sudeshna Das, Axel Thielscher, Juan Eugenio Iglesias, Oula Puonti
Title: End-to-end Cortical Surface Reconstruction from Clinical Magnetic Resonance Images
Abstract:
Surface-based cortical analysis is valuable for a variety of neuroimaging tasks, such as spatial normalization, parcellation, and gray matter (GM) thickness estimation. However, most tools for estimating cortical surfaces work exclusively on scans with at least 1 mm isotropic resolution and are tuned to a specific magnetic resonance (MR) contrast, often T1-weighted (T1w). This precludes application using most clinical MR scans, which are very heterogeneous in terms of contrast and resolution. Here, we use synthetic domain-randomized data to train the first neural network for explicit estimation of cortical surfaces from scans of any contrast and resolution, without retraining. Our method deforms a template mesh to the white matter (WM) surface, which guarantees topological correctness. This mesh is further deformed to estimate the GM surface. We compare our method to recon-all-clinical (RAC), an implicit surface reconstruction method which is currently the only other tool capable of processing heterogeneous clinical MR scans, on ADNI and a large clinical dataset (n=1,332). We show an approximately 50% reduction in cortical thickness error (from 0.50 to 0.24 mm) with respect to RAC and better recovery of the aging-related cortical thinning patterns detected by FreeSurfer on high-resolution T1w scans. Our method enables fast and accurate surface reconstruction of clinical scans, allowing studies (1) with sample sizes far beyond what is feasible in a research setting, and (2) of clinical populations that are difficult to enroll in research studies. The code is publicly available at https://github.com/simnibs/brainnet.
Chinese: 本研究提出了一种神经网络方法,能够从任意对比度和分辨率的临床MRI扫描中精确估计皮质表面,相较于现有工具将皮质厚度误差降低了50%,为更广泛的临床研究应用提供了可能。
English: This study introduces a neural network method that accurately estimates cortical surfaces from diverse clinical MRI scans of any contrast and resolution, achieving a 50% reduction in cortical thickness error compared to existing tools and enabling broader clinical research applications.

Authors:Zhidan Liu, Chengtang Yao, Jiaxi Zeng, Yuwei Wu, Yunde Jia
Title: Multi-Label Stereo Matching for Transparent Scene Depth Estimation
Abstract:
In this paper, we present a multi-label stereo matching method to simultaneously estimate the depth of the transparent objects and the occluded background in transparent scenes. Unlike previous methods that assume a unimodal distribution along the disparity dimension and formulate the matching as a single-label regression problem, we propose a multi-label regression formulation to estimate multiple depth values at the same pixel in transparent scenes. To resolve the multi-label regression problem, we introduce a pixel-wise multivariate Gaussian representation, where the mean vector encodes multiple depth values at the same pixel, and the covariance matrix determines whether a multi-label representation is necessary for a given pixel. The representation is iteratively predicted within a GRU framework. In each iteration, we first predict the update step for the mean parameters and then use both the update step and the updated mean parameters to estimate the covariance matrix. We also synthesize a dataset containing 10 scenes and 89 objects to validate the performance of transparent scene depth estimation. The experiments show that our method greatly improves the performance on transparent surfaces while preserving the background information for scene reconstruction. Code is available at https://github.com/BFZD233/TranScene.
中文摘要:本文提出一种多标签立体匹配方法,通过像素级高斯表示和GRU框架同时估计透明物体与背景的深度,基于合成数据集的实验表明该方法在保持背景信息的同时显著提升了透明表面的深度估计性能。
English Summary: This paper introduces a multi-label stereo matching method using pixel-wise Gaussian representations within a GRU framework to simultaneously estimate depths of transparent objects and backgrounds, demonstrating significant performance improvements through a synthesized dataset.
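
A loose Python sketch of decoding such a two-component pixel representation (heavily simplified: the full covariance matrix, GRU iterations, and gating rule are replaced by an assumed per-component variance threshold):

    import torch

    def decode_multilabel_depth(mean, var, gate_thresh=0.5):
        """mean: (H, W, 2) candidate depths per pixel (e.g., glass + background);
        var: (H, W, 2) per-component variances. Emit the second depth only where
        its variance is low enough to warrant a multi-label pixel."""
        primary = mean[..., 0]
        second_needed = var[..., 1] < gate_thresh
        secondary = torch.where(second_needed, mean[..., 1],
                                torch.full_like(primary, float("nan")))
        return primary, secondary   # NaN marks single-label pixels

    mean = torch.rand(4, 6, 2) * 10
    var = torch.rand(4, 6, 2)
    front, back = decode_multilabel_depth(mean, var)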

Authors:Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Title: LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
Abstract:
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at https://github.com/TechNomad-ds/LoVR-benchmark
中文: LoVR基准通过引入更长的视频、高质量细粒度标注和可扩展的标注框架,解决了现有视频文本检索数据集的局限性,为先进模型提出了新的挑战。
English: The LoVR benchmark addresses limitations in existing video-text retrieval datasets by introducing longer videos, high-quality fine-grained captions, and a scalable annotation framework, presenting new challenges for advanced models.

Authors:Ruihan Liu, Xiaoyi Wu, Xijun Chen, Liang Hu, Yunjiang Lou
Title: 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision
Abstract:
A comprehensive understanding of 3D scenes is essential for autonomous vehicles (AVs), and among various perception tasks, occupancy estimation plays a central role by providing a general representation of drivable and occupied space. However, most existing occupancy estimation methods rely on LiDAR or cameras, which perform poorly in degraded environments such as smoke, rain, snow, and fog. In this paper, we propose 4D-ROLLS, the first weakly supervised occupancy estimation method for 4D radar using the LiDAR point cloud as the supervisory signal. Specifically, we introduce a method for generating pseudo-LiDAR labels, including occupancy queries and LiDAR height maps, as multi-stage supervision to train the 4D radar occupancy estimation model. Then the model is aligned with the occupancy map produced by LiDAR, fine-tuning its accuracy in occupancy estimation. Extensive comparative experiments validate the exceptional performance of 4D-ROLLS. Its robustness in degraded environments and effectiveness in cross-dataset training are qualitatively demonstrated. The model is also seamlessly transferred to downstream tasks BEV segmentation and point cloud occupancy prediction, highlighting its potential for broader applications. The lightweight network enables the 4D-ROLLS model to achieve fast inference speeds at about 30 Hz on a 4060 GPU. The code of 4D-ROLLS will be made available at https://github.com/CLASS-Lab/4D-ROLLS.
中文: 本文提出的4D-ROLLS是首个基于4D雷达的弱监督占据估计方法,通过LiDAR生成的伪标签进行多阶段监督训练,在恶劣天气条件下表现出卓越的鲁棒性,并能快速推理及迁移至下游任务。
English: This paper introduces 4D-ROLLS, a weakly supervised method for 4D radar occupancy estimation that uses LiDAR-generated pseudo-labels to achieve robust performance in degraded weather conditions while enabling fast inference and seamless transfer to downstream tasks.

Authors:Zhenyu Bao, Qing Li, Guibiao Liao, Zhongyuan Zhao, Kanglin Liu
Title: MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction
Abstract:
3D Gaussian Splatting (3DGS) has gained significant attention in streamable dynamic novel view synthesis (DNVS) for its photorealistic rendering capability and computational efficiency. Despite much progress in improving rendering quality and optimization strategies, 3DGS-based streamable dynamic scene reconstruction still suffers from flickering artifacts and storage inefficiency, and struggles to model emerging objects. To tackle this, we introduce MGStream, which employs motion-related 3D Gaussians (3DGs) to reconstruct dynamic content and vanilla 3DGs for static content. The motion-related 3DGs are identified according to a motion mask and a clustering-based convex hull algorithm. Rigid deformation is applied to the motion-related 3DGs for modeling the dynamic content, and attention-based optimization on the motion-related 3DGs enables the reconstruction of emerging objects. As the deformation and optimization are conducted only on the motion-related 3DGs, MGStream avoids flickering artifacts and improves storage efficiency. Extensive experiments on the real-world datasets N3DV and MeetRoom demonstrate that MGStream surpasses existing streaming 3DGS-based approaches in terms of rendering quality, training/storage efficiency and temporal consistency. Our code is available at: https://github.com/pcl3dv/MGStream.
中文: MGStream通过引入运动相关的3D高斯模型处理动态场景、标准3D高斯模型处理静态场景,有效消除了闪烁伪影并提升了存储效率,在渲染质量和时间一致性方面优于现有流式3D高斯重建方法。
English: MGStream enhances streamable dynamic scene reconstruction by using motion-related 3D Gaussians for dynamic elements and vanilla 3D Gaussians for static ones, effectively reducing flickering artifacts and improving storage efficiency while achieving superior rendering quality and temporal consistency.

Authors:Matthew Raffel, Lizhong Chen
Title: FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer
Abstract:
The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multi-layer perceptron (MLP) with its increased expressiveness and interpretability. However, the KAN can be orders of magnitude slower due to its increased computational cost and training instability, limiting its applicability to larger-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, which can achieve FLOPs similar to the traditional Transformer with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our characterizations reveal that the KAT is still 123x slower in training speed, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls and, more specifically, to the backward pass of GR-KAN caused by inefficient gradient accumulation. To address this memory bottleneck, we propose FlashKAT, which builds on our restructured kernel that minimizes gradient accumulation with atomic adds and accesses to slow memory. Evaluations demonstrate that FlashKAT can achieve a training speedup of 86.5x compared with the state-of-the-art KAT, while reducing rounding errors in the coefficient gradients. Our code is available at https://github.com/OSU-STARLAB/FlashKAT.
中文: Kolmogorov-Arnold Transformer (KAT) 因梯度累积中的内存瓶颈导致训练速度大幅下降,而 FlashKAT 通过重构内核减少内存延迟,实现了 86.5 倍的加速并提高了梯度精度。
English: The Kolmogorov-Arnold Transformer (KAT) suffers from significant training slowdowns due to memory bottlenecks in gradient accumulation, which FlashKAT addresses by restructuring the kernel to minimize memory stalls, achieving an 86.5x speedup while improving gradient accuracy.
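
For context on what the bottlenecked kernel computes, a small Python sketch of a group-rational activation in the spirit of GR-KAN (orders, initialization, and the absolute-value denominator are illustrative assumptions; this says nothing about the FlashKAT kernel itself):

    import torch
    import torch.nn as nn

    class GroupRational(nn.Module):
        """Rational activation y = P(x)/Q(x), coefficients shared per channel
        group; |.| keeps the denominator positive."""
        def __init__(self, channels, groups=4, p_order=3, q_order=2):
            super().__init__()
            assert channels % groups == 0
            self.groups = groups
            self.p = nn.Parameter(torch.randn(groups, p_order + 1) * 0.1)
            self.q = nn.Parameter(torch.randn(groups, q_order) * 0.1)

        def forward(self, x):                    # x: (B, C)
            B, C = x.shape
            xg = x.view(B, self.groups, C // self.groups)
            num = sum(self.p[:, k, None] * xg**k for k in range(self.p.shape[1]))
            den = 1 + sum((self.q[:, k, None] * xg**(k + 1)).abs()
                          for k in range(self.q.shape[1]))
            return (num / den).view(B, C)

    y = GroupRational(channels=64)(torch.randn(8, 64))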

Authors:Chenning Yu, Sicun Gao
Title: Improving Compositional Generation with Diffusion Models Using Lift Scores
Abstract:
We introduce a novel resampling criterion using lift scores, for improving compositional generation in diffusion models. By leveraging the lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves relatively lower computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improve condition alignment for compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis. Our code is available at http://rainorangelemon.github.io/complift.
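
A hedged Python sketch of the per-condition check (an approximation of the idea: compare conditional vs. unconditional denoising error on noisy copies of a sample; the single-sigma noising, trial count, and acceptance rule are simplifications, not the paper's exact estimator):

    import torch

    @torch.no_grad()
    def lift_scores(eps_model, x0, conds, n_trials=8, sigma=1.0):
        """Per-condition score for a generated sample x0: positive when the
        condition helps the model denoise x0 better than the unconditional pass."""
        scores = []
        for c in conds:
            total = 0.0
            for _ in range(n_trials):
                noise = torch.randn_like(x0)
                xt = x0 + sigma * noise
                err_c = (eps_model(xt, c) - noise).pow(2).sum()
                err_u = (eps_model(xt, None) - noise).pow(2).sum()
                total += (err_u - err_c).item()
            scores.append(total / n_trials)
        return scores

    def accept(scores):
        # resampling criterion: keep the sample only if every condition is met
        return all(s > 0 for s in scores)

    eps_model = lambda xt, c: xt * 0.0   # dummy noise predictor, shapes only
    print(accept(lift_scores(eps_model, torch.randn(3, 8, 8), ["a", "b"])))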

Authors:Barkin Dagda, Muhammad Awais, Saber Fallah
Title: GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language Matching
Abstract:
Cross-view geo-localisation identifies the coarse geographical position of an automated vehicle by matching a ground-level image to a geo-tagged satellite image from a database. Despite advancements in cross-view geo-localisation, significant challenges persist, such as similar-looking scenes, which make it difficult to rank the correct image as the top match; existing approaches reach high recall rates but still fail at this. To address this challenge, this paper proposes GeoVLM, a novel approach which uses the zero-shot capabilities of vision language models to enable cross-view geo-localisation using interpretable cross-view language descriptions. GeoVLM is a trainable reranking approach which improves the best-match accuracy of cross-view geo-localisation. GeoVLM is evaluated on the standard benchmarks VIGOR and University-1652 and also in real-life driving environments using Cross-View United Kingdom, a new benchmark dataset introduced in this paper. The results show that GeoVLM improves the retrieval performance of cross-view geo-localisation compared to state-of-the-art methods with the help of explainable natural language descriptions. The code is available at https://github.com/CAV-Research-Lab/GeoVLM
Chinese: 本文提出GeoVLM,一种新颖的可训练重排序方法,利用视觉语言模型的零样本能力和可解释的跨视角语言描述,通过提高最佳匹配精度来改进跨视角地理定位,超越了现有最优方法。
English: This paper introduces GeoVLM, a novel trainable reranking method that leverages vision language models' zero-shot capabilities and interpretable cross-view language descriptions to enhance cross-view geo-localization by improving top-match accuracy over state-of-the-art approaches.

Authors:Rodrigo Fritz, Pablo Suárez-Serrato, Victor Mijangos, Anayanzi D. Martinez-Hernandez, Eduardo Ivan Velazquez Richards
Title: EuLearn: A 3D database for learning Euler characteristics
Abstract:
We present EuLearn, the first surface dataset equitably representing a diversity of topological types. We designed our embedded surfaces of uniformly varying genera relying on random knots, thus allowing our surfaces to knot with themselves. EuLearn contributes new topological datasets of meshes, point clouds, and scalar fields in 3D. We aim to facilitate the training of machine learning systems that can discern topological features. We experimented with specific emblematic 3D neural network architectures, finding that their vanilla implementations perform poorly on genus classification. To enhance performance, we developed a novel, non-Euclidean, statistical sampling method adapted to graph and manifold data. We also introduce adjacency-informed adaptations of PointNet and Transformer architectures that rely on our non-Euclidean sampling strategy. Our results demonstrate that incorporating topological information into deep learning workflows significantly improves performance on these otherwise challenging EuLearn datasets.
中文: EuLearn提供了首个公平代表多种拓扑类型的表面数据集,通过创新的非欧几里得采样方法和改进的PointNet与Transformer架构,显著提升了深度学习模型在拓扑特征识别上的性能。
English: EuLearn introduces the first equitable surface datasets with diverse topological types, enhancing machine learning systems' ability to discern topological features through novel non-Euclidean sampling and adapted architectures like PointNet and Transformer.

Authors:Xingyuan Lu, Yuxi Liu, Dongyu Zhang, Zhiyao Wu, Jing Ren, Feng Xia
Title: EmoMeta: A Multimodal Dataset for Fine-grained Emotion Classification in Chinese Metaphors
Abstract:
Metaphors play a pivotal role in expressing emotions, making them crucial for emotional intelligence. The advent of multimodal data and widespread communication has led to a proliferation of multimodal metaphors, amplifying the complexity of emotion classification compared to single-mode scenarios. However, the scarcity of research on constructing multimodal metaphorical fine-grained emotion datasets hampers progress in this domain. Moreover, existing studies predominantly focus on English, overlooking potential variations in emotional nuances across languages. To address these gaps, we introduce a multimodal dataset in Chinese comprising 5,000 text-image pairs of metaphorical advertisements. Each entry is meticulously annotated for metaphor occurrence, domain relations and fine-grained emotion classification encompassing joy, love, trust, fear, sadness, disgust, anger, surprise, anticipation, and neutral. Our dataset is publicly accessible (https://github.com/DUTIR-YSQ/EmoMeta), facilitating further advancements in this burgeoning field.
中文摘要:隐喻对情感智能至关重要,但多模态数据集匮乏且研究多限于英语,为此我们构建了包含5000个标注文本-图像对的中文多模态隐喻广告数据集,以推动该领域发展。
English Summary: Metaphors are essential for emotional intelligence, but the lack of multimodal datasets and cross-linguistic studies hinders progress, so we created a publicly available Chinese dataset of 5,000 annotated text-image metaphorical advertisements to advance this field.

Authors:Ruoyu Wang, Yi Ma, Shenghua Gao
Title: Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos
Abstract:
Currently, almost all state-of-the-art novel view synthesis and reconstruction models rely on calibrated cameras or additional geometric priors for training. These prerequisites significantly limit their applicability to massive uncalibrated data. To alleviate this requirement and unlock the potential for self-supervised training on large-scale uncalibrated videos, we propose a novel two-stage strategy to train a view synthesis model from only raw video frames or multi-view images, without providing camera parameters or other priors. In the first stage, we learn to reconstruct the scene implicitly in a latent space without relying on any explicit 3D representation. Specifically, we predict per-frame latent camera and scene context features, and employ a view synthesis model as a proxy for explicit rendering. This pretraining stage substantially reduces the optimization complexity and encourages the network to learn the underlying 3D consistency in a self-supervised manner. The learned latent camera and implicit scene representation have a large gap compared with the real 3D world. To reduce this gap, we introduce the second stage training by explicitly predicting 3D Gaussian primitives. We additionally apply explicit Gaussian Splatting rendering loss and depth projection loss to align the learned latent representations with physically grounded 3D geometry. In this way, Stage 1 provides a strong initialization and Stage 2 enforces 3D consistency -- the two stages are complementary and mutually beneficial. Extensive experiments demonstrate the effectiveness of our approach, achieving high-quality novel view synthesis and accurate camera pose estimation, compared to methods that employ supervision with calibration, pose, or depth information. The code is available at https://github.com/Dwawayu/Pensieve.
中文: 本文提出一种两阶段自监督方法,通过先学习隐式场景表示再采用显式3D高斯建模进行优化,实现了无需相机参数或几何先验的未标定图像新视角合成。
English: This paper introduces a two-stage self-supervised method that enables novel view synthesis from uncalibrated images by first learning implicit scene representations and then refining them with explicit 3D Gaussian modeling, eliminating the need for camera parameters or geometric priors.

Authors:Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao
Title: MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
Abstract:
While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at https://github.com/ModalMinds/MM-PRM.
Chinese: MM-PRM是一种通过自动化框架开发的流程奖励模型,通过提供步骤级监督来增强多模态推理能力,显著提升了在领域内和领域外基准测试中的表现。
English: MM-PRM is a process reward model developed through an automated framework to enhance multimodal reasoning by providing step-level supervision, which significantly improves performance on both in-domain and out-of-domain benchmarks.
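
A miniature Python sketch of the Monte Carlo step-labeling idea (the rollout counts, toy policy, and function names are illustrative; the paper's MCTS pipeline is richer than this): each step prefix is labeled by the fraction of rollouts that still reach the verified answer, yielding the soft labels used to train the PRM.

    import random

    def mc_step_labels(policy_rollout, problem, steps, answer, n_rollouts=16):
        """Soft label for each reasoning-step prefix = empirical success rate of
        completions sampled from that prefix."""
        labels = []
        for t in range(1, len(steps) + 1):
            prefix = steps[:t]
            hits = sum(policy_rollout(problem, prefix) == answer
                       for _ in range(n_rollouts))
            labels.append(hits / n_rollouts)
        return labels

    # toy policy: succeeds more often the further along the correct prefix it starts
    toy = lambda problem, prefix: 42 if random.random() < len(prefix) / 4 else 0
    print(mc_step_labels(toy, "p", ["s1", "s2", "s3"], answer=42))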

Authors:Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang
Title: G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
Abstract:
Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This "knowing-doing" gap significantly limits their potential as autonomous agents, with leading VLMs often performing badly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code including VLM-Gym and RL training are released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.
Chinese: 视觉语言模型在交互环境中存在“知行差距”,VLM-Gym通过强化学习驱动的训练解决了这一问题,其开发的模型通过感知与推理的相互促进,全面超越了主流专有模型。
English: Vision-Language Models face a "knowing-doing" gap in interactive environments, which VLM-Gym addresses by enabling RL-driven training of models that outperform leading proprietary ones through mutually bootstrapped perception and reasoning.

Authors:Zhuozhao Hu, Kaishen Yuan, Xin Liu, Zitong Yu, Yuan Zong, Jingang Shi, Huanjing Yue, Jingyu Yang
Title: FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning
Abstract:
Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person's emotional state based on facial data. Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights. However, traditional methods often struggle with limited interpretability, constrained generalization and reasoning abilities. Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks. Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks. The dataset and code will be available at https://github.com/953206211/FEALLM.
Chinese: 面部情感分析通过引入新的指令数据集和FEALLM模型,利用详细的面部动作单元描述和因果推理关系,显著提升了任务的可解释性与泛化能力,在多项基准测试中展现出卓越性能。
English: Facial Emotion Analysis is advanced by a new instruction dataset and FEALLM model, which improve interpretability and generalization through detailed facial action unit descriptions and causal reasoning, achieving robust performance across multiple benchmarks.

Authors:Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang
Title: VSA: Faster Video Diffusion with Trainable Sparse Attention
Abstract:
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53x with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6x and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
Chinese: VSA提出了一种可训练的稀疏注意力机制,通过聚焦关键令牌来降低计算需求,在不牺牲性能的前提下实现了视频扩散模型的高效扩展。
English: VSA introduces a trainable sparse attention mechanism that reduces computational demands by focusing on critical tokens, enabling efficient scaling of video diffusion models without sacrificing performance.
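To make the coarse-to-fine selection idea concrete, below is a minimal dense-mask sketch of tile-based sparse attention: tokens are mean-pooled into tiles, the top-k key tiles per query tile are kept, and exact attention is computed only inside the selected tiles. This is an illustrative reference implementation that still materializes the full score matrix; it is not the hardware-efficient fused kernel the paper describes, and the tile size and top-k values are placeholders.

```python
import torch
import torch.nn.functional as F

def tile_sparse_attention(q, k, v, tile=16, topk=4):
    """q, k, v: (batch, heads, seq, dim); seq must be divisible by tile."""
    B, H, S, D = q.shape
    T = S // tile
    # Coarse stage: mean-pool tokens into tiles and score query-tile vs key-tile pairs.
    qt = q.view(B, H, T, tile, D).mean(dim=3)
    kt = k.view(B, H, T, tile, D).mean(dim=3)
    tile_scores = qt @ kt.transpose(-1, -2) / D ** 0.5           # (B, H, T, T)
    sel = tile_scores.topk(topk, dim=-1).indices                 # critical key tiles
    tile_mask = torch.zeros_like(tile_scores, dtype=torch.bool).scatter_(-1, sel, True)
    # Expand the tile-level mask to a token-level mask.
    token_mask = tile_mask.repeat_interleave(tile, dim=2).repeat_interleave(tile, dim=3)
    # Fine stage: exact attention restricted to the selected tiles.
    scores = q @ k.transpose(-1, -2) / D ** 0.5                  # (B, H, S, S)
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 2, 64, 32)
print(tile_sparse_attention(q, k, v).shape)                      # torch.Size([1, 2, 64, 32])
```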

Authors:Ahmet Berke Gokmen, Yigit Ekin, Bahri Batuhan Bilecen, Aysegul Dundar
Title: RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers
Abstract:
We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.

Authors:Paula Feldman, Martin Sinnona, Claudio Delrieux, Viviana Siless, Emmanuel Iarussi
Title: VesselGPT: Autoregressive Modeling of Vascular Geometry
Abstract:
Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry makes accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous methods' parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code is available at https://github.com/LIA-DiTella/VesselGPT-MICCAI.
Chinese Summary: 本文提出一种结合VQ-VAE与GPT-2的自回归方法,通过离散化表征和B样条参数化实现高保真血管树合成,在保留形态细节的同时首次实现了血管的自回归生成。
English Summary: This paper introduces an autoregressive method using VQ-VAE and GPT-2 to synthesize anatomical trees, achieving high-fidelity reconstruction with compact representations while preserving morphological details through B-spline parameterization.
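The discrete "vocabulary" step this pipeline relies on can be illustrated with a minimal VQ-VAE-style quantizer: continuous vessel embeddings are snapped to their nearest codebook entries, and the resulting indices are the tokens a GPT-2 model can then predict autoregressively. Codebook size, dimensionality and loss weighting below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: nearest-code lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                                   # z: (batch, tokens, dim)
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)     # (batch*tokens, num_codes)
        idx = dists.argmin(dim=-1).view(z.shape[:-1])       # discrete token ids
        z_q = self.codebook(idx)
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()                        # straight-through estimator
        return z_q, idx, loss

vq = VectorQuantizer()
z_q, tokens, vq_loss = vq(torch.randn(2, 16, 64))
print(tokens.shape)                                          # torch.Size([2, 16]); GPT-2 input
```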

Authors:Gabriele Spadaro, Alberto Presta, Jhony H. Giraldo, Marco Grangetto, Wei Hu, Giuseppe Valenzise, Attilio Fiandrotti, Enzo Tartaglione
Title: Denoising Diffusion Probabilistic Model for Point Cloud Compression at Low Bit-Rates
Abstract:
Efficient compression of low-bit-rate point clouds is critical for bandwidth-constrained applications. However, existing techniques mainly focus on high-fidelity reconstruction, requiring many bits for compression. This paper proposes a Denoising Diffusion Probabilistic Model (DDPM) architecture for point cloud compression (DDPM-PCC) at low bit-rates. A PointNet encoder produces the condition vector for the generation, which is then quantized via a learnable vector quantizer. This configuration allows us to achieve low bit-rates while preserving quality. Experiments on ShapeNet and ModelNet40 show improved rate-distortion at low rates compared to standardized and state-of-the-art approaches. We publicly released the code at https://github.com/EIDOSLAB/DDPM-PCC.
Chinese: 本文提出一种基于DDPM的点云压缩方法,通过PointNet编码器和可学习向量量化器在低比特率下实现高效压缩,在基准数据集上展现出优于现有技术的率失真性能。
English: This paper introduces a DDPM-based point cloud compression method that achieves efficient low-bit-rate performance by using a PointNet encoder and learnable vector quantizer, demonstrating superior rate-distortion results on benchmark datasets.

Authors:Qiguang Chen, Libo Qin, Jinhao Liu, Yue Liao, Jiaqi Wang, Jingxuan Zhou, Wanxiang Che
Title: RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning
Abstract:
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at https://github.com/LightChen233/reasoning-boundary.
Chinese: 为解决思维链推理在评估和优化中的挑战,提出了推理边界框架++(RBF++),通过定义可测量的性能极限和引入处理不可测量能力(如多模态感知)的创新机制,在多种任务和模型中得到验证。
English: The Reasoning Boundary Framework++ (RBF++) is introduced to address challenges in evaluating and optimizing Chain-of-Thought (CoT) reasoning by defining measurable performance limits and handling unmeasurable capabilities like multimodal perception through innovative mechanisms, validated across diverse tasks and models.

Authors:Lincan Cai, Jingxuan Kang, Shuang Li, Wenxuan Ma, Binhui Xie, Zhida Qin, Jian Liang
Title: From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection
Abstract:
Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in alignment with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can inevitably introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space, and supplements global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods. Our code is available at https://github.com/BIT-DA/ABS.
Chinese Summary: 提出的ABS方法通过注意力引导裁剪和特征选择,解决了随机增强带来的问题,无需训练即可在视觉语言模型中实现零样本性能的显著提升,达到领先水平。
English Summary: The proposed ABS method enhances zero-shot performance in vision-language models by using attention-guided cropping and feature selection to address issues from random augmentations, achieving state-of-the-art results without training.

Authors:Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, Jinwoo Shin
Title: StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment
Abstract:
Learning robust representations from data often requires scale, which has led to the success of recent zero-shot models such as CLIP. However, the obtained robustness can easily deteriorate when these models are fine-tuned on other downstream tasks (e.g., of smaller scales). Previous works often interpret this phenomenon in the context of domain shift, developing fine-tuning methods that aim to preserve the original domain as much as possible. However, in a different context, fine-tuned models with limited data are also prone to learning features that are spurious to humans, such as background or texture. In this paper, we propose StarFT (Spurious Textual Alignment Regularization), a novel framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity. We introduce a regularization that aligns the output distribution for spuriosity-injected labels with the original zero-shot model, ensuring that the model is not induced to extract irrelevant features further from these descriptions. We leverage recent language models to obtain such spuriosity-injected labels by generating alternative textual descriptions that highlight potentially confounding features. Extensive experiments validate the robust generalization of StarFT and its emerging properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts both worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance.
Chinese: 本文提出StarFT框架,通过语言模型生成干扰性文本标签并采用正则化对齐输出分布,防止模型学习虚假特征,在多种场景下显著提升了最差组和平均分类准确率。
English: This paper introduces StarFT, a novel fine-tuning framework that enhances model robustness by preventing the learning of spurious features through regularization aligning outputs with spuriosity-injected labels generated by language models, significantly improving worst-group and average accuracy in various scenarios.
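A minimal sketch of the kind of regularizer the abstract describes: for a batch of images, the fine-tuned encoder's distribution over a set of spuriosity-injected text labels is kept close (via KL divergence) to the frozen zero-shot model's distribution. The feature shapes, temperature, and function name are illustrative assumptions, not StarFT's exact formulation.

```python
import torch
import torch.nn.functional as F

def spuriosity_alignment_loss(img_feat_ft, img_feat_zs, spurious_txt_feat, tau=0.01):
    """All inputs are L2-normalized CLIP-style features.
    img_feat_ft:       (batch, dim) image features from the model being fine-tuned
    img_feat_zs:       (batch, dim) image features from the frozen zero-shot model
    spurious_txt_feat: (num_labels, dim) embeddings of spuriosity-injected descriptions
    """
    logits_ft = img_feat_ft @ spurious_txt_feat.t() / tau
    logits_zs = img_feat_zs @ spurious_txt_feat.t() / tau
    # Keep the fine-tuned distribution over spurious descriptions close to the zero-shot one,
    # discouraging the model from latching onto background/texture cues.
    return F.kl_div(F.log_softmax(logits_ft, dim=-1),
                    F.softmax(logits_zs, dim=-1), reduction="batchmean")

loss = spuriosity_alignment_loss(F.normalize(torch.randn(8, 512), dim=-1),
                                 F.normalize(torch.randn(8, 512), dim=-1),
                                 F.normalize(torch.randn(20, 512), dim=-1))
print(loss)
```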

Authors:Seungjun Oh, Younggeun Lee, Hyejin Jeon, Eunbyung Park
Title: Hybrid 3D-4D Gaussian Splatting for Fast Dynamic Scene Representation
Abstract:
Recent advancements in dynamic 3D scene reconstruction have shown promising results, enabling high-fidelity 3D novel view synthesis with improved temporal consistency. Among these, 4D Gaussian Splatting (4DGS) has emerged as an appealing approach due to its ability to model high-fidelity spatial and temporal variations. However, existing methods suffer from substantial computational and memory overhead due to the redundant allocation of 4D Gaussians to static regions, which can also degrade image quality. In this work, we introduce hybrid 3D-4D Gaussian Splatting (3D-4DGS), a novel framework that adaptively represents static regions with 3D Gaussians while reserving 4D Gaussians for dynamic elements. Our method begins with a fully 4D Gaussian representation and iteratively converts temporally invariant Gaussians into 3D, significantly reducing the number of parameters and improving computational efficiency. Meanwhile, dynamic Gaussians retain their full 4D representation, capturing complex motions with high fidelity. Our approach achieves significantly faster training times compared to baseline 4D Gaussian Splatting methods while maintaining or improving the visual quality.

Authors:Qingling Shu, Sibao Chen, Xiao Wang, Zhihui You, Wei Lu, Jin Tang, Bin Luo
Title: Semantic Change Detection of Roads and Bridges: A Fine-grained Dataset and Multimodal Frequency-driven Detector
Abstract:
Accurate detection of road and bridge changes is crucial for urban planning and transportation management, yet presents unique challenges for general change detection (CD). Key difficulties arise from maintaining the continuity of roads and bridges as linear structures and disambiguating visually similar land covers (e.g., road construction vs. bare land). Existing spatial-domain models struggle with these issues, further hindered by the lack of specialized, semantically rich datasets. To fill these gaps, we introduce the Road and Bridge Semantic Change Detection (RB-SCD) dataset. As the first benchmark to systematically target semantic change detection of roads and bridges, RB-SCD offers comprehensive fine-grained annotations for 11 semantic change categories. This enables a detailed analysis of traffic infrastructure evolution. Building on this, we propose a novel framework, the Multimodal Frequency-Driven Change Detector (MFDCD). MFDCD integrates multimodal features in the frequency domain through two key components: (1) the Dynamic Frequency Coupler (DFC), which leverages wavelet transform to decompose visual features, enabling it to robustly model the continuity of linear transitions; and (2) the Textual Frequency Filter (TFF), which encodes semantic priors into frequency-domain graphs and applies filter banks to align them with visual features, resolving semantic ambiguities. Experiments demonstrate the state-of-the-art performance of MFDCD on RB-SCD and three public CD datasets. The code will be available at https://github.com/DaGuangDaGuang/RB-SCD.
Chinese Summary: 本研究提出了RB-SCD数据集和MFDCD框架,通过频域多模态特征处理线性连续性和语义模糊问题,显著提升了道路桥梁变化检测性能,取得了领先的实验效果。
English Summary: The study introduces the RB-SCD dataset and the MFDCD framework, which uses frequency-domain multimodal features to enhance the detection of road and bridge changes by addressing linear continuity and semantic ambiguity issues, achieving state-of-the-art results.

Authors:Sand. ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng Yin, Siran Zhang, Tingting Liu, Xianping Yin, Xiaoyu Yang, Xin Song, Xuan Hu, Yankai Zhang, Yuqiao Li
Title: MAGI-1: Autoregressive Video Generation at Scale
Abstract:
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
Chinese: MAGI-1是一种自回归世界模型,通过预测连续帧片段生成视频,在文本引导的图像转视频任务中表现出色,具有高时间一致性和可扩展的实时部署能力。
English: MAGI-1 is an autoregressive world model that generates videos by predicting sequential chunks of frames, achieving strong performance in text-conditioned image-to-video tasks with high temporal consistency and scalable, real-time deployment.

Authors:Yuzhen Chen, Hojun Son, Arpan Kusari
Title: MatPredict: a dataset and benchmark for learning material properties of diverse indoor objects
Abstract:
Determining material properties from camera images can expand the ability to identify complex objects in indoor environments, which is valuable for consumer robotics applications. To support this, we introduce MatPredict, a dataset that combines high-quality synthetic objects from the Replica dataset with the MatSynth dataset's material property classes to create objects with diverse material properties. We select 3D meshes of specific foreground objects and render them with different material properties. In total, we generate 18 commonly occurring objects with 14 different materials. We showcase how we provide variability in terms of lighting and camera placement for these objects. Next, we provide a benchmark for inferring material properties from visual images using these perturbed models in the scene, discussing the specific neural network models involved and their performance based on different image comparison metrics. By accurately simulating light interactions with different materials, we can enhance realism, which is crucial for training models effectively through large-scale simulations. This research aims to revolutionize perception in consumer robotics. The dataset is available at https://huggingface.co/datasets/UMTRI/MatPredict and the code at https://github.com/arpan-kusari/MatPredict.
Chinese: 本研究推出MatPredict数据集,通过融合合成物体与多样化材质属性来提升消费级机器人从图像中识别材质的能力,建立了神经网络性能基准,旨在革新机器人感知技术。
English: This research introduces MatPredict, a dataset combining synthetic objects with diverse material properties to enhance material recognition from images for consumer robotics, providing benchmarks for neural network performance and aiming to revolutionize robotic perception.

Authors:Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, Yunde Jia
Title: 3D Visual Illusion Depth Estimation
Abstract:
3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a robust depth estimation framework that uses common sense from a vision-language model to adaptively select reliable depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
Chinese: 本研究揭示机器视觉系统在深度估计任务中同样会受到三维视觉错觉的严重误导,并提出一种利用视觉语言常识的鲁棒框架,其性能超越了现有方法。
English: This study demonstrates that machine vision systems are significantly deceived by 3D visual illusions in depth estimation tasks, and proposes a robust framework leveraging vision-language common sense to outperform existing methods.

Authors:Xiao Wu, Xiaoqing Zhang, Zunjie Xiao, Lingxi Hu, Risa Higashita, Jiang Liu
Title: Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields in Efficient CNNs for Fair Medical Image Classification
Abstract:
Efficient convolutional neural network (CNN) architecture design has attracted growing research interest. However, such designs typically apply a single receptive field (RF), small asymmetric RFs, or pyramid RFs to learn different feature representations, and still encounter two significant challenges in medical image classification tasks: 1) They have limitations in efficiently capturing diverse lesion characteristics, e.g., tiny, coordination, small and salient, which play unique roles in the classification results, especially in imbalanced medical image classification. 2) The predictions generated by those CNNs are often unfair/biased, bringing a high risk when employing them in real-world medical diagnosis conditions. To tackle these issues, we develop a new concept, Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields (ERoHPRF), to simultaneously boost medical image classification performance and fairness. This concept aims to mimic the multi-expert consultation mode by applying a well-designed heterogeneous pyramid RF bag to effectively capture lesion characteristics of varying significance via convolution operations with multiple heterogeneous kernel sizes. Additionally, ERoHPRF introduces an expert-like structural reparameterization technique that merges its parameters with a two-stage strategy, ensuring competitive computation cost and inference speed compared to a single RF. To demonstrate the effectiveness and generalization ability of ERoHPRF, we incorporate it into mainstream efficient CNN architectures. The extensive experiments show that our proposed ERoHPRF maintains a better trade-off than state-of-the-art methods in terms of medical image classification, fairness, and computation overhead. The code of this paper is available at https://github.com/XiaoLing12138/Expert-Like-Reparameterization-of-Heterogeneous-Pyramid-Receptive-Fields.
Chinese: 本研究提出ERoHPRF方法,通过模拟多专家会诊模式,利用异构金字塔感受野有效提升医学图像分类的性能与公平性,同时保持计算效率的竞争优势。
English: The study introduces ERoHPRF, a novel approach using heterogeneous pyramid receptive fields to enhance both performance and fairness in medical image classification by mimicking multi-expert consultation, while maintaining computational efficiency.
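The "structural reparameterization" mentioned above follows a well-known general pattern (as in RepVGG-style merging): parallel convolution branches with different kernel sizes can be folded into one kernel at inference time by zero-padding the smaller kernels to the largest size. The snippet below is a generic illustration of that equivalence under those assumptions, not the paper's exact two-stage merging strategy.

```python
import torch
import torch.nn.functional as F

def merge_hetero_convs(kernels):
    """kernels: list of conv weights (out_c, in_c, k, k) with odd, possibly different k."""
    k_max = max(w.shape[-1] for w in kernels)
    merged = kernels[0].new_zeros(kernels[0].shape[:2] + (k_max, k_max))
    for w in kernels:
        pad = (k_max - w.shape[-1]) // 2
        merged = merged + F.pad(w, (pad, pad, pad, pad))    # centre the smaller kernels
    return merged

# Equivalence check: summing the branch outputs equals one conv with the merged kernel.
x = torch.randn(1, 8, 32, 32)
branches = [torch.randn(16, 8, k, k) for k in (1, 3, 5)]
y_sum = sum(F.conv2d(x, w, padding=w.shape[-1] // 2) for w in branches)
y_merged = F.conv2d(x, merge_hetero_convs(branches), padding=2)
print(torch.allclose(y_sum, y_merged, atol=1e-5))           # True
```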

Authors:Lorena Garcia-Foncillas Macias, Aaron Kujawa, Aya Elshalakany, Jonathan Shapey, Tom Vercauteren
Title: A generalisable head MRI defacing pipeline: Evaluation on 2,566 meningioma scans
Abstract:
Reliable MRI defacing techniques to safeguard patient privacy while preserving brain anatomy are critical for research collaboration. Existing methods often struggle with incomplete defacing or degradation of brain tissue regions. We present a robust, generalisable defacing pipeline for high-resolution MRI that integrates atlas-based registration with brain masking. Our method was evaluated on 2,566 heterogeneous clinical scans for meningioma and achieved a 99.92 per cent success rate (2,564/2,566) upon visual inspection. Excellent anatomical preservation is demonstrated with a Dice similarity coefficient of 0.9975 plus or minus 0.0023 between brain masks automatically extracted from the original and defaced volumes. Source code is available at https://github.com/cai4cai/defacing_pipeline.
Chinese: 本文提出了一种鲁棒且通用的MRI去身份化流程,通过整合基于图谱的配准和脑部掩膜技术,在保护患者隐私的同时有效保留脑部解剖结构,在临床扫描中实现了99.92%的成功率并展现出卓越的解剖完整性。
English: This paper introduces a robust and generalizable MRI defacing pipeline that ensures patient privacy by effectively removing facial features while preserving brain anatomy, achieving a 99.92% success rate on clinical scans and demonstrating excellent anatomical integrity.
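For reference, the anatomical-preservation metric quoted above (Dice similarity between brain masks extracted from the original and defaced volumes) can be computed as below; this is a generic illustrative helper, not code from the released pipeline.

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice similarity between two binary brain masks (numpy arrays of the same shape)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else 2.0 * np.logical_and(a, b).sum() / total
```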

Authors:Vinkle Srivastav, Juliette Puel, Jonathan Vappou, Elijah Van Houten, Paolo Cabras, Nicolas Padoy
Title: A Skull-Adaptive Framework for AI-Based 3D Transcranial Focused Ultrasound Simulation
Abstract:
Transcranial focused ultrasound (tFUS) is an emerging modality for non-invasive brain stimulation and therapeutic intervention, offering millimeter-scale spatial precision and the ability to target deep brain structures. However, the heterogeneous and anisotropic nature of the human skull introduces significant distortions to the propagating ultrasound wavefront, which require time-consuming patient-specific planning and corrections using numerical solvers for accurate targeting. To enable data-driven approaches in this domain, we introduce TFUScapes, the first large-scale, high-resolution dataset of tFUS simulations through anatomically realistic human skulls derived from T1-weighted MRI images. We have developed a scalable simulation engine pipeline using the k-Wave pseudo-spectral solver, where each simulation returns a steady-state pressure field generated by a focused ultrasound transducer placed at realistic scalp locations. In addition to the dataset, we present DeepTFUS, a deep learning model that estimates normalized pressure fields directly from input 3D CT volumes and transducer position. The model extends a U-Net backbone with transducer-aware conditioning, incorporating Fourier-encoded position embeddings and MLP layers to create global transducer embeddings. These embeddings are fused with U-Net encoder features via feature-wise modulation, dynamic convolutions, and cross-attention mechanisms. The model is trained using a combination of spatially weighted and gradient-sensitive loss functions, enabling it to approximate high-fidelity wavefields. The TFUScapes dataset is publicly released to accelerate research at the intersection of computational acoustics, neurotechnology, and deep learning. The project page is available at https://github.com/CAMMA-public/TFUScapes.
Chinese: 本研究推出了TFUScapes——首个基于真实颅骨解剖的大规模经颅聚焦超声模拟数据集,并开发了DeepTFUS深度学习模型,能够通过CT影像和换能器位置直接估算压力场,显著提升靶向精准度。
English: The study introduces TFUScapes, a large-scale dataset of transcranial focused ultrasound simulations through human skulls, and DeepTFUS, a deep learning model that efficiently estimates pressure fields from CT scans and transducer positions to improve targeting accuracy.
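Two of the conditioning ingredients named in the abstract, Fourier-encoded transducer position embeddings and feature-wise (FiLM-style) modulation of U-Net features, can be sketched as follows. Dimensions and module names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def fourier_encode(pos, num_bands=8):
    """pos: (batch, 3) transducer coordinates -> (batch, 3 * 2 * num_bands) embedding."""
    freqs = 2.0 ** torch.arange(num_bands, device=pos.device, dtype=pos.dtype)
    angles = pos.unsqueeze(-1) * freqs                      # (batch, 3, num_bands)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

class FiLM(nn.Module):
    """Feature-wise modulation of a 3D feature map by the transducer embedding."""
    def __init__(self, emb_dim, channels):
        super().__init__()
        self.proj = nn.Linear(emb_dim, 2 * channels)

    def forward(self, feat, emb):                           # feat: (B, C, D, H, W)
        scale, shift = self.proj(emb).chunk(2, dim=-1)
        scale = scale[..., None, None, None]
        shift = shift[..., None, None, None]
        return feat * (1 + scale) + shift

emb = fourier_encode(torch.rand(2, 3))                      # (2, 48)
out = FiLM(emb.shape[-1], 32)(torch.randn(2, 32, 8, 8, 8), emb)
print(out.shape)                                            # torch.Size([2, 32, 8, 8, 8])
```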

Authors:Kazuki Adachi, Shin'ya Yamaguchi, Tomoki Hamagami
Title: Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption
Abstract:
Pre-trained vision-language models such as contrastive language-image pre-training (CLIP) have demonstrated remarkable generalizability, which has enabled a wide range of applications represented by zero-shot classification. However, vision-language models still suffer when they face datasets with large gaps from training ones, i.e., distribution shifts. We found that CLIP is especially vulnerable to sensor degradation, a type of realistic distribution shift caused by sensor conditions such as weather, light, or noise. Collecting a new dataset from the test distribution for fine-tuning is highly costly, since sensor degradation occurs unexpectedly and varies widely. Thus, we investigate test-time adaptation (TTA) of zero-shot classification, which enables on-the-fly adaptation to the test distribution with unlabeled test data. Existing TTA methods for CLIP mainly focus on modifying image and text embeddings or predictions to address distribution shifts. Although these methods can adapt to domain shifts, such as fine-grained label spaces or different renditions in input images, they fail to adapt to distribution shifts caused by sensor degradation. We found that this is because image embeddings are "corrupted" in terms of uniformity, a measure related to the amount of information. To make models robust to sensor degradation, we propose a novel method called uniformity-aware information-balanced TTA (UnInfo). To address the corruption of image embeddings, we introduce uniformity-aware confidence maximization, information-aware loss balancing, and knowledge distillation from the exponential moving average (EMA) teacher. Through experiments, we demonstrate that our UnInfo improves accuracy under sensor degradation by retaining information in terms of uniformity.
Chinese: 预训练的视觉语言模型如CLIP在传感器退化这类分布偏移中表现不佳,而提出的UnInfo方法通过均匀性感知的置信度最大化、信息平衡和知识蒸馏,有效提升了模型对传感器退化的鲁棒性。
English: Pre-trained vision-language models like CLIP struggle with sensor degradation, a type of distribution shift, but the proposed UnInfo method enhances robustness by preserving uniformity in image embeddings through confidence maximization and information balancing.
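The "uniformity" of embeddings referred to above is commonly measured with the Gaussian-potential criterion of Wang & Isola (2020); a minimal sketch of that measure is shown below. The choice of t and applying it to raw image embeddings are assumptions for illustration, not the paper's exact usage.

```python
import torch
import torch.nn.functional as F

def uniformity(embeddings, t=2.0):
    """embeddings: (n, dim). Lower (more negative) values mean the L2-normalized
    embeddings are spread more uniformly on the unit hypersphere."""
    z = F.normalize(embeddings, dim=-1)
    sq_dists = torch.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

print(uniformity(torch.randn(256, 512)))
```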

Authors:Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta
Title: HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos
Abstract:
Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO performs contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features on multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot settings for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance on all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
Chinese: HiERO是一种弱监督方法,通过从未经编排的视频中提取人类活动的层次结构来增强推理能力,无需大量训练即在多项基准测试中达到领先水平。
English: HiERO is a weakly-supervised method that leverages hierarchical patterns in human activities from unscripted videos to enhance reasoning, achieving state-of-the-art performance across multiple benchmarks without extensive training.

Authors:Xiao Wang, Yu Jin, Lan Chen, Bo Jiang, Lin Zhu, Yonghong Tian, Jin Tang, Bin Luo
Title: Dynamic Graph Induced Contour-aware Heat Conduction Network for Event-based Object Detection
Abstract:
Event-based Vision Sensors (EVS) have demonstrated significant advantages over traditional RGB frame-based cameras in low-light conditions, high-speed motion capture, and low latency. Consequently, object detection based on EVS has attracted increasing attention from researchers. Current event stream object detection algorithms are typically built upon Convolutional Neural Networks (CNNs) or Transformers, which either capture limited local features using convolutional filters or incur high computational costs due to the utilization of self-attention. Recently proposed vision heat conduction backbone networks have shown a good balance between efficiency and accuracy; however, these models are not specifically designed for event stream data. They exhibit weak capability in modeling object contour information and fail to exploit the benefits of multi-scale features. To address these issues, this paper proposes a novel dynamic graph induced contour-aware heat conduction network for event stream based object detection, termed CvHeat-DET. The proposed model effectively leverages the clear contour information inherent in event streams to predict the thermal diffusivity coefficients within the heat conduction model, and integrates hierarchical structural graph features to enhance feature learning across multiple scales. Extensive experiments on three benchmark datasets for event stream-based object detection fully validated the effectiveness of the proposed model. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvDET.
Chinese: 本文提出CvHeat-DET动态图轮廓感知热传导网络,通过有效利用事件流固有轮廓信息并整合多尺度层次特征,显著提升了事件流目标检测性能,在三个基准数据集上得到充分验证。
English: This paper introduces CvHeat-DET, a novel dynamic graph-based contour-aware heat conduction network that enhances event stream object detection by leveraging inherent contour information and multi-scale hierarchical features, achieving superior performance across three benchmark datasets.

Authors:Shiao Wang, Xiao Wang, Liye Jin, Bo Jiang, Lin Zhu, Lan Chen, Yonghong Tian, Bin Luo
Title: Towards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach
Abstract:
Existing tracking algorithms typically rely on low-frame-rate RGB cameras coupled with computationally intensive deep neural network architectures to achieve effective tracking. However, such frame-based methods inherently face challenges in achieving low-latency performance and often fail in resource-constrained environments. Visual object tracking using bio-inspired event cameras has emerged as a promising research direction in recent years, offering distinct advantages for low-latency applications. In this paper, we propose a novel Slow-Fast Tracking paradigm that flexibly adapts to different operational requirements, termed SFTrack. The proposed framework supports two complementary modes, i.e., a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. Specifically, our framework first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones, yielding the slow and fast trackers, respectively. The fast tracker achieves low latency through a lightweight network design and by producing multiple bounding box outputs in a single forward pass. Finally, we seamlessly combine both trackers via supervised fine-tuning and further enhance the fast tracker's performance through a knowledge distillation strategy. Extensive experiments on public benchmarks, including FE240, COESOT, and EventVOT, demonstrate the effectiveness and efficiency of our proposed method across different real-world scenarios. The source code has been released on https://github.com/Event-AHU/SlowFast_Event_Track.
Chinese: 现有基于低帧率RGB相机的跟踪方法存在延迟高和资源受限问题,而提出的SFTrack采用仿生事件相机,通过图学习和FlashAttention骨干网络实现高精度慢速与高效快速的双模式跟踪,在FE240和COESOT等基准测试中验证了其有效性。
English: Existing tracking methods using low-frame-rate RGB cameras face latency and resource constraints, but the proposed SFTrack with bio-inspired event cameras introduces a dual-mode slow-fast paradigm that combines high-precision and efficient tracking through graph learning and FlashAttention backbones, validated on benchmarks like FE240 and COESOT.

Authors:Junzhi Ning, Cheng Tang, Kaijing Zhou, Diping Song, Lihao Liu, Ming Hu, Wei Li, Huihui Xu, Yanzhou Su, Tianbin Li, Jiyao Liu, Jin Ye, Sheng Zhang, Yuanfeng Ji, Junjun He
Title: RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions
Abstract:
The scarcity of high-quality, labelled retinal imaging data presents a significant challenge to the development of machine learning models for ophthalmology and hinders progress in the field. Existing methods for synthesising Colour Fundus Photographs (CFPs) largely rely on predefined disease labels, which restricts their ability to generate images that reflect fine-grained anatomical variations, subtle disease stages, and diverse pathological features beyond coarse class categories. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale, captioned retinal dataset comprising 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses a vision-language model (VLM) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Building on this dataset, we employ a novel three-step training framework, RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Through extensive experiments, our method demonstrates superior performance across multiple datasets, with 62.07% of text-driven synthetic CFPs indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 5%-10% in diabetic retinopathy grading and glaucoma detection. Code is available at https://github.com/uni-medical/retina-text2cfp.
Chinese: 本研究提出了RetinaLogos-1400k大规模标注视网膜数据集及创新训练框架,实现了对合成视网膜图像的精细语义控制,生成图像具有高度真实性,并显著提升了糖尿病视网膜病变和青光眼的诊断准确率。
English: This study introduces RetinaLogos-1400k, a large-scale captioned retinal dataset, and a novel training framework that enables fine-grained control over synthetic retinal images, achieving high realism and improving diagnostic accuracy for diabetic retinopathy and glaucoma.

Authors:Ben Liu, Zhen Qin
Title: Accelerate TarFlow Sampling with GS-Jacobi Iteration
Abstract:
Image generation models have achieved widespread applications. As an instance, the TarFlow model combines the transformer architecture with Normalizing Flow models, achieving state-of-the-art results on multiple benchmarks. However, due to the causal form of attention requiring sequential computation, TarFlow's sampling process is extremely slow. In this paper, we demonstrate that through a series of optimization strategies, TarFlow sampling can be greatly accelerated by using the Gauss-Seidel-Jacobi (abbreviated as GS-Jacobi) iteration method. Specifically, we find that blocks in the TarFlow model have varying importance: a small number of blocks play a major role in image generation tasks, while other blocks contribute relatively little; some blocks are sensitive to initial values and prone to numerical overflow, while others are relatively robust. Based on these two characteristics, we propose the Convergence Ranking Metric (CRM) and the Initial Guessing Metric (IGM): CRM is used to identify whether a TarFlow block is "simple" (converges in few iterations) or "tough" (requires more iterations); IGM is used to evaluate whether the initial value of the iteration is good. Experiments on four TarFlow models demonstrate that GS-Jacobi sampling can significantly enhance sampling efficiency while maintaining the quality of generated images (measured by FID), achieving speed-ups of 4.53x in Img128cond, 5.32x in AFHQ, 2.96x in Img64uncond, and 2.51x in Img64cond without degrading FID scores or sample quality. Code and checkpoints are accessible on https://github.com/encoreus/GS-Jacobi_for_TarFlow
Chinese: GS-Jacobi迭代法通过基于收敛性和初始值敏感度对模块进行优先级排序,显著加速了TarFlow的缓慢采样过程,在保持图像质量的同时实现了最高5.32倍的加速效果。
English: The GS-Jacobi iteration method accelerates TarFlow's slow sampling by prioritizing blocks based on their convergence and initial value sensitivity, achieving up to 5.32x speedup without quality loss.
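The core idea of Jacobi-style acceleration can be illustrated on a toy causal recurrence: a sequential update x_i = f(x_{<i}) is rewritten as a fixed point x = F(x) and all positions are refreshed in parallel until convergence, which for "simple" blocks happens in far fewer iterations than the sequence length. The toy map and tolerance below are illustrative only, not TarFlow's actual block inversion.

```python
import torch

def jacobi_solve(step_fn, x0, max_iters=100, tol=1e-6):
    """Iterate x <- step_fn(x) in parallel over all positions until the update stalls."""
    x = x0
    for it in range(1, max_iters + 1):
        x_new = step_fn(x)
        if (x_new - x).abs().max() < tol:
            return x_new, it
        x = x_new
    return x, max_iters

# Toy causal recurrence: x_i = b_i + 0.5 * x_{i-1}; a naive decoder would need 64 sequential steps.
b = torch.randn(64)
def causal_map(x):
    prev = torch.cat([torch.zeros(1), x[:-1]])
    return b + 0.5 * prev

x, iters = jacobi_solve(causal_map, torch.zeros_like(b))
print(iters)  # converges well before 64 parallel sweeps because the toy map is contractive
```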

Authors:Hulin Li
Title: Rethinking Features-Fused-Pyramid-Neck for Object Detection
Abstract:
Multi-head detectors typically employ a features-fused-pyramid-neck for multi-scale detection and are widely adopted in the industry. However, this approach faces feature misalignment when representations from different hierarchical levels of the feature pyramid are forcibly fused point-to-point. To address this issue, we designed an independent hierarchy pyramid (IHP) architecture to evaluate the effectiveness of the features-unfused-pyramid-neck for multi-head detectors. Subsequently, we introduced soft nearest neighbor interpolation (SNI) with a weight downscaling factor to mitigate the impact of feature fusion at different hierarchies while preserving key textures. Furthermore, we present a feature adaptive selection method for downsampling in extended spatial windows (ESD) to retain spatial features and enhance lightweight convolutional techniques (GSConvE). These advancements culminate in our secondary features alignment solution (SA) for real-time detection, achieving state-of-the-art results on Pascal VOC and MS COCO. Code will be released at https://github.com/AlanLi1997/rethinking-fpn. This paper has been accepted by ECCV 2024 and published by Springer Nature.
Chinese Summary: 本研究提出了一种二次特征对齐解决方案(SA),采用独立层次金字塔(IHP)和软最近邻插值(SNI)技术,有效解决了多头部检测器中特征错位问题,在Pascal VOC和MS COCO数据集上实现了最先进的实时检测性能。
English Summary: The study introduces a secondary features alignment solution (SA) that employs an independent hierarchy pyramid (IHP) and soft nearest neighbor interpolation (SNI) to address feature misalignment in multi-head detectors, achieving state-of-the-art real-time detection performance on Pascal VOC and MS COCO datasets.

Authors:Haibin He, Maoyuan Ye, Jing Zhang, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao
Title: Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?
Abstract:
Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities on relatively simple visual question answering, visual-text parsing, etc. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent need to improve reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.
Chinese: Reasoning-OCR基准测试旨在评估大型多模态模型利用OCR提取的视觉文本线索处理复杂逻辑推理的能力,揭示了当前模型的不足,以期推动该领域未来的发展。
English: The Reasoning-OCR benchmark is introduced to evaluate Large Multimodal Models' ability to handle complex logical reasoning using OCR-extracted visual-text cues, revealing their current limitations and aiming to inspire future advancements in this area.

Authors:Hangyu Li, Qin Zhao, Haoran Xu, Xinyu Jiang, Qingwei Ben, Feiyu Jia, Haoyu Zhao, Liang Xu, Jia Zeng, Hanqing Wang, Bo Dai, Junting Dong, Jiangmiao Pang
Title: TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation
Abstract:
Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines, ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces, there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities: (i) MoCap, (ii) VR device, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking, and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation. Code is now available at https://github.com/cyjdlhy/TeleOpBench.
Chinese: TeleOpBench被提出作为一个统一的模拟器基准,用于评估四种双手灵巧遥操作方式在30项任务中的表现,展示了仿真与现实之间的强相关性,并为未来研究提供了标准化平台。
English: TeleOpBench is introduced as a unified simulator benchmark for evaluating four bimanual dexterous teleoperation modalities across 30 tasks, demonstrating strong simulation-to-reality correlation and providing a standardized platform for future research.

Authors:Zihua Wang, Ruibo Li, Haozhe Du, Joey Tianyi Zhou, Yu Zhang, Xu Yang
Title: FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks
Abstract:
Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds. This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text -- an issue exacerbated by recent trends toward finer-grained visual tokenizations to boost performance. Speculative decoding has been effective in accelerating LLM inference by using a smaller draft model to generate candidate tokens, which are then selectively verified by the target model, improving speed without sacrificing output quality. While this strategy has been extended to LMMs, existing methods largely overlook the unique properties of visual inputs and depend solely on text-based draft models. In this work, we propose FLASH (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs, which leverages two key properties of multimodal data to design the draft model. First, to address redundancy in visual tokens, we propose a lightweight latent-aware token compression mechanism. Second, recognizing that visual objects often co-occur within a scene, we employ a semi-autoregressive decoding strategy to generate multiple tokens per forward pass. These innovations accelerate draft decoding while maintaining high acceptance rates, resulting in faster overall inference. Experiments show that FLASH significantly outperforms prior speculative decoding approaches in both unimodal and multimodal settings, achieving up to 2.68x speed-up on video captioning and 2.55x on visual instruction tuning tasks compared to the original LMM. Our code is available at https://github.com/ZihuaEvan/FlashSD/.
Chinese: FLASH框架通过潜在感知令牌压缩机制和半自回归解码策略,有效解决视觉令牌冗余和共现模式问题,在保持输出质量的同时实现最高2.68倍的推理加速。
English: The FLASH framework accelerates multimodal model inference by introducing a latent-aware token compression mechanism and semi-autoregressive decoding to address visual token redundancy and co-occurrence patterns, achieving up to 2.68× speed-up while maintaining output quality.
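For context, the verify-and-accept loop that speculative decoding relies on, shown here in its simplest greedy form with toy stand-ins for the draft and target models, looks roughly like this. FLASH's contributions concern how the draft model itself is built (latent-aware token compression, semi-autoregressive multi-token drafting), which is not reproduced here.

```python
import torch

def speculative_step(draft_next, target_logits, tokens, k=4):
    """Draft proposes k tokens one by one; the target checks all of them in a single pass
    (greedy acceptance: keep the matching prefix, then take the target's own token)."""
    proposal = tokens.clone()
    for _ in range(k):                                   # cheap draft-model loop
        proposal = torch.cat([proposal, draft_next(proposal).view(1)])
    logits = target_logits(proposal)                     # one expensive target pass
    accepted = tokens
    for i in range(len(tokens), len(proposal)):
        expected = logits[i - 1].argmax()                # target's prediction for position i
        accepted = torch.cat([accepted, expected.view(1)])
        if proposal[i] != expected:                      # first mismatch: stop accepting drafts
            break
    return accepted

# Toy "models": logits come from a fixed random table, and the draft mimics the target.
vocab = 100
table = torch.randn(vocab, vocab)
target_logits = lambda seq: table[seq]
draft_next = lambda seq: table[seq][-1].argmax()
print(speculative_step(draft_next, target_logits, torch.tensor([1, 2, 3])))
```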

Authors:Yinzhe Wang, Yiwen Xiao, Hu Wang, Yiping Xu, Yan Tian
Title: IA-MVS: Instance-Focused Adaptive Depth Sampling for Multi-View Stereo
Abstract:
Multi-view stereo (MVS) models based on progressive depth hypothesis narrowing have made remarkable advancements. However, existing methods have not fully exploited the fact that the depth coverage of an individual instance is smaller than that of the entire scene, which restricts further improvements in depth estimation precision. Moreover, inevitable deviations in the initial stage accumulate as the process advances. In this paper, we propose Instance-Adaptive MVS (IA-MVS). It enhances the precision of depth estimation by narrowing the depth hypothesis range and conducting refinement on each instance. Additionally, a filtering mechanism based on intra-instance depth continuity priors is incorporated to boost robustness. Furthermore, recognizing that existing confidence estimation can degrade IA-MVS performance on point clouds, we develop a detailed mathematical model for confidence estimation based on conditional probability. The proposed method can be widely applied to models based on MVSNet without imposing extra training burdens. Our method achieves state-of-the-art performance on the DTU benchmark. The source code is available at https://github.com/KevinWang73106/IA-MVS.
Chinese: 提出的实例自适应多视角立体方法通过针对每个实例缩小深度假设范围并引入深度连续性先验,在不增加训练负担的情况下,在DTU基准测试中取得了最先进的深度估计精度。
English: The proposed Instance-Adaptive MVS method improves depth estimation precision by narrowing depth hypotheses per instance and incorporating depth continuity priors, achieving state-of-the-art results on the DTU benchmark without additional training burdens.
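The instance-adaptive narrowing idea can be sketched very simply: given a coarse depth map from the previous stage and an instance mask, the next-stage depth hypotheses for that instance are drawn from a tightened per-instance interval rather than the whole scene's range. The percentile choice and margin below are illustrative assumptions, not IA-MVS's exact scheme.

```python
import numpy as np

def instance_depth_range(coarse_depth, instance_mask, margin=0.05):
    """coarse_depth: (H, W) float depth; instance_mask: (H, W) bool. Returns (d_min, d_max)."""
    d = coarse_depth[instance_mask]
    lo, hi = np.percentile(d, 2), np.percentile(d, 98)      # robust per-instance coverage
    pad = margin * (hi - lo)
    return lo - pad, hi + pad

depth = np.random.uniform(2.0, 10.0, size=(64, 64))         # scene-wide depth spread
mask = np.zeros((64, 64), dtype=bool); mask[20:40, 20:40] = True
depth[mask] = np.random.uniform(4.0, 4.5, size=mask.sum())  # the instance occupies a thin slab
print(instance_depth_range(depth, mask))                    # roughly (3.98, 4.52)
```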

Authors:Jiabin Chen, Haiping Wang, Jinpeng Li, Yuan Liu, Zhen Dong, Bisheng Yang
Title: SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence
Abstract:
We propose SpatialLLM, a novel approach advancing spatial intelligence tasks in complex urban scenes. Unlike previous methods requiring geographic analysis tools or domain expertise, SpatialLLM is a unified language model directly addressing various spatial intelligence tasks without any training, fine-tuning, or expert intervention. The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis. Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information and enable zero-shot execution of advanced spatial intelligence tasks, including urban planning, ecological analysis, traffic management, etc. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performances in urban analysis. We hope that SpatialLLM will provide a novel viable perspective for urban intelligent analysis and management. The code and dataset are available at https://github.com/WHU-USI3DV/SpatialLLM.
Chinese Summary: SpatialLLM提出了一种统一语言模型,无需训练或专家干预即可直接处理复杂城市场景中的空间智能任务,通过结构化场景描述驱动预训练大模型实现零样本城市分析。
English Summary: SpatialLLM introduces a unified language model that enables zero-shot spatial intelligence tasks in urban environments without training or expert input, leveraging structured scene descriptions to empower pre-trained LLMs for advanced urban analysis.

Authors:Tianming Liang, Haichao Jiang, Yuting Yang, Chaolei Tan, Shuai Li, Wei-Shi Zheng, Jian-Fang Hu
Title: Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Abstract:
Referring video object segmentation (RVOS) aims to identify, track and segment objects in a video based on language descriptions, and has received great attention in recent years. However, existing datasets remain focused on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changes. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on per-frame spatial evaluation, we introduce two new metrics to assess temporal and spatiotemporal consistency. We benchmark 6 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite its simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.

Authors:Mingyuan Zhou, Yi Gu, Zhendong Wang
Title: Few-Step Diffusion via Score identity Distillation
Abstract:
Diffusion distillation has emerged as a promising strategy for accelerating text-to-image (T2I) diffusion models by distilling a pretrained score network into a one- or few-step generator. While existing methods have made notable progress, they often rely on real or teacher-synthesized images to perform well when distilling high-resolution T2I diffusion models such as Stable Diffusion XL (SDXL), and their use of classifier-free guidance (CFG) introduces a persistent trade-off between text-image alignment and generation diversity. We address these challenges by optimizing Score identity Distillation (SiD) -- a data-free, one-step distillation framework -- for few-step generation. Backed by theoretical analysis that justifies matching a uniform mixture of outputs from all generation steps to the data distribution, our few-step distillation algorithm avoids step-specific networks and integrates seamlessly into existing pipelines, achieving state-of-the-art performance on SDXL at 1024x1024 resolution. To mitigate the alignment-diversity trade-off when real text-image pairs are available, we introduce a Diffusion GAN-based adversarial loss applied to the uniform mixture and propose two new guidance strategies: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexible setup improves diversity without sacrificing alignment. Comprehensive experiments on SD1.5 and SDXL demonstrate state-of-the-art performance in both one-step and few-step generation settings, along with robustness to the absence of real images. Our efficient PyTorch implementation, along with the resulting one- and few-step distilled generators, will be released publicly as a separate branch at https://github.com/mingyuanzhou/SiD-LSG.
Chinese Summary: SiD提出了一种无需真实数据、支持少步生成的蒸馏框架,通过创新性引导策略解决了高分辨率文生图模型中的对齐与多样性权衡问题,在不依赖真实图像的情况下实现了顶尖性能。
English Summary: SiD introduces a data-free, few-step distillation framework that overcomes the alignment-diversity trade-off in high-resolution text-to-image generation by employing novel guidance strategies, achieving state-of-the-art performance without relying on real images.
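For orientation, classifier-free guidance combines conditional and unconditional score predictions as below; the two strategies named in the abstract change where, and with what sign, this combination is applied. The notes are a reading of the abstract, not the released implementation.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG combination of unconditional and text-conditional score predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Standard CFG: scale > 1 in the teacher score network (better alignment, reduced diversity).
# Zero-CFG (per the abstract): CFG is disabled in the teacher (equivalent to scale = 1, i.e.
#   using the conditional prediction directly), and the fake score network drops text
#   conditioning altogether.
# Anti-CFG (per the abstract): the fake score network applies negative CFG, e.g. a scale
#   below zero, while the teacher remains guided.
```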

Authors:Zihan Su, Xuerui Qiu, Hongbin Xu, Tangyu Jiang, Junhao Zhuang, Chun Yuan, Ming Li, Shengfeng He, Fei Richard Yu
Title: Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking
Abstract:
The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. Code is publicly available at https://github.com/Sugewud/Safe-Sora
中文摘要:Safe-Sora是首个通过分层自适应匹配机制和新型三维小波增强Mamba架构,直接在AI生成视频中嵌入隐形图形水印的框架,在视频质量和版权保护方面实现了最先进的性能。
English Summary: Safe-Sora is the first framework to embed invisible graphical watermarks directly into AI-generated videos through a hierarchical adaptive matching mechanism and a novel 3D wavelet-enhanced Mamba architecture, achieving state-of-the-art performance in video quality and copyright protection.
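As a rough illustration of the coarse matching step described in the Safe-Sora abstract (assigning each watermark patch to its most visually similar frame), the sketch below compares flattened watermark patches against average-pooled frame descriptors with cosine similarity. The patch size, descriptor, and similarity measure are assumptions; the fine-grained spatial localization and the wavelet-enhanced Mamba embedding are omitted.

```python
import torch
import torch.nn.functional as F

def assign_watermark_patches(watermark, frames, patch=32):
    """Coarse patch-to-frame assignment sketch.
    watermark: (3, H, W) tensor; frames: (T, 3, H, W) tensor.
    Returns a list of (patch_index, frame_index) pairs."""
    # split the watermark into non-overlapping patches
    patches = F.unfold(watermark.unsqueeze(0), kernel_size=patch, stride=patch)
    patches = patches.squeeze(0).t()                                  # (N, 3*p*p)
    # crude frame descriptor: each frame average-pooled to patch resolution
    frame_desc = F.adaptive_avg_pool2d(frames, (patch, patch)).flatten(1)
    sim = F.normalize(patches, dim=1) @ F.normalize(frame_desc, dim=1).t()
    best = sim.argmax(dim=1)                                          # (N,)
    return [(i, int(f)) for i, f in enumerate(best)]
```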

Authors:Yaotian Yang, Yiwen Tang, Yizhe Chen, Xiao Chen, Jiangjie Qiu, Hao Xiong, Haoyu Yin, Zhiyao Luo, Yifei Zhang, Sijia Tao, Wentao Li, Qinghua Zhang, Yuqiang Li, Wanli Ouyang, Bin Zhao, Xiaonan Wang, Fei Wei
Title: AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use
Abstract:
Machine learning-based interatomic potentials and force fields depend critically on accurate atomic structures, yet such data are scarce due to the limited availability of experimentally resolved crystals. Although atomic-resolution electron microscopy offers a potential source of structural data, converting these images into simulation-ready formats remains labor-intensive and error-prone, creating a bottleneck for model training and validation. We introduce AutoMat, an end-to-end, agent-assisted pipeline that automatically transforms scanning transmission electron microscopy (STEM) images into atomic crystal structures and predicts their physical properties. AutoMat combines pattern-adaptive denoising, physics-guided template retrieval, symmetry-aware atomic reconstruction, fast relaxation and property prediction via MatterSim, and coordinated orchestration across all stages. We propose the first dedicated STEM2Mat-Bench for this task and evaluate performance using lattice RMSD, formation energy MAE, and structure-matching success rate. By orchestrating external tool calls, AutoMat enables a text-only LLM to outperform vision-language models in this domain, achieving closed-loop reasoning throughout the pipeline. In large-scale experiments over 450 structure samples, AutoMat substantially outperforms existing multimodal large language models and tools. These results validate both AutoMat and STEM2Mat-Bench, marking a key step toward bridging microscopy and atomistic simulation in materials science. The code and dataset are publicly available at https://github.com/yyt-2378/AutoMat and https://huggingface.co/datasets/yaotianvector/STEM2Mat.
中文:AutoMat是一种端到端的自动化流程,能将扫描透射电子显微镜图像转化为原子晶体结构并预测其物理性质,显著优于现有方法,为材料科学中显微技术与原子模拟搭建了桥梁。
English: AutoMat is an automated pipeline that converts scanning transmission electron microscopy images into atomic crystal structures and predicts their physical properties, significantly outperforming existing methods and bridging microscopy with atomistic simulation in materials science.

Authors:Mingqi Shao, Feng Xiong, Zhaoxu Sun, Mu Xu
Title: MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control
Abstract:
Recently, significant advances have been made in 3D object generation. Building upon the generated geometry, current pipelines typically employ image diffusion models to generate multi-view RGB images, followed by UV texture reconstruction through texture baking. While 3D geometry generation has improved significantly, supported by multiple open-source frameworks, 3D texture generation remains underexplored. In this work, we systematically investigate 3D texture generation through the lens of three core dimensions: reference-texture alignment, geometry-texture consistency, and local texture quality. To tackle these issues, we propose MVPainter, which employs data filtering and augmentation strategies to enhance texture fidelity and detail, and introduces ControlNet-based geometric conditioning to improve texture-geometry alignment. Furthermore, we extract physically-based rendering (PBR) attributes from the generated views to produce PBR meshes suitable for real-world rendering applications. MVPainter achieves state-of-the-art results across all three dimensions, as demonstrated by human-aligned evaluations. To facilitate further research and reproducibility, we also release our full pipeline as an open-source system, including data construction, model architecture, and evaluation tools.

Authors:Xiangpeng Tian, Xiangyu Liao, Xiao Liu, Meng Li, Chao Ren
Title: Degradation-Aware Feature Perturbation for All-in-One Image Restoration
Abstract:
All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/TxpHome/DFPIR.
Chinese: 本文提出了DFPIR,一种新型一体化图像恢复模型,通过引入退化感知特征扰动来调整特征空间以适配统一参数空间,在多项图像恢复任务中实现了最先进的性能。
English: The paper introduces DFPIR, a novel all-in-one image restoration model that uses degradation-aware feature perturbations to align diverse degradation types with a unified parameter space, achieving state-of-the-art performance across multiple tasks.
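A minimal sketch of the channel-wise perturbation idea follows, assuming each degradation type indexes a fixed channel permutation applied to intermediate features. The module name and the use of random permutations are illustrative; the attention-wise masking branch of the actual DGPB is not shown.

```python
import torch
import torch.nn as nn

class DegradationGuidedChannelShuffle(nn.Module):
    """Sketch (not the authors' DGPB): each degradation type owns a fixed
    channel permutation that is applied to the feature map, illustrating
    channel-wise perturbations selected by degradation type."""

    def __init__(self, channels, num_degradations, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        perms = torch.stack([torch.randperm(channels, generator=g)
                             for _ in range(num_degradations)])
        self.register_buffer("perms", perms)            # (num_degradations, C)

    def forward(self, feats, degradation_id):
        # feats: (B, C, H, W); degradation_id: (B,) long tensor
        perm = self.perms[degradation_id]               # (B, C)
        idx = perm[:, :, None, None].expand(-1, -1, feats.size(2), feats.size(3))
        return feats.gather(1, idx)                     # shuffle channels per sample
```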

Authors:Florent Chiaroni, Ali Ayub, Ola Ahmad
Title: ProMi: An Efficient Prototype-Mixture Baseline for Few-Shot Segmentation with Bounding-Box Annotations
Abstract:
In robotics applications, few-shot segmentation is crucial because it allows robots to perform complex tasks with minimal training data, facilitating their adaptation to diverse, real-world environments. However, pixel-level annotation of even a small number of images is highly time-consuming and costly. In this paper, we present a novel few-shot binary segmentation method based on bounding-box annotations instead of pixel-level labels. We introduce ProMi, an efficient prototype-mixture-based method that treats the background class as a mixture of distributions. Our approach is simple, training-free, and effective, accommodating coarse annotations with ease. Compared to existing baselines, ProMi achieves the best results across different datasets with significant gains, demonstrating its effectiveness. Furthermore, we present qualitative experiments tailored to real-world mobile robot tasks, demonstrating the applicability of our approach in such scenarios. Our code: https://github.com/ThalesGroup/promi.
Chinese: 本文提出了一种名为ProMi的新型小样本二元分割方法,它采用边界框标注而非像素级标签,在多个数据集上取得了最佳效果,并展示了在机器人实际应用中的有效性。
English: This paper introduces ProMi, a novel few-shot binary segmentation method that uses bounding-box annotations instead of pixel-level labels, achieving superior results across datasets and demonstrating real-world applicability in robotics.
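The training-free prototype-mixture idea can be sketched as follows, assuming precomputed support and query feature maps and a bounding box given in feature coordinates: the foreground prototype is the mean feature inside the box, the background is modeled as a small k-means mixture of features outside it, and query pixels take the label of their most similar prototype. The function name, number of background components, and similarity choice are assumptions, not the released ProMi code.

```python
import torch
import torch.nn.functional as F

def promi_like_predict(sup_feats, sup_box, qry_feats, k_bg=3, iters=10):
    """sup_feats, qry_feats: (C, H, W); sup_box = (x0, y0, x1, y1) in feature
    coordinates. Returns a boolean (H, W) foreground prediction for the query."""
    C, H, W = sup_feats.shape
    mask = torch.zeros(H, W, dtype=torch.bool)
    x0, y0, x1, y1 = sup_box
    mask[y0:y1, x0:x1] = True
    flat = sup_feats.flatten(1).t()                        # (H*W, C)
    fg_proto = flat[mask.flatten()].mean(0, keepdim=True)  # (1, C)

    bg = flat[~mask.flatten()]                             # background features
    centers = bg[torch.randperm(bg.size(0))[:k_bg]]        # (k_bg, C)
    for _ in range(iters):                                 # plain k-means
        assign = torch.cdist(bg, centers).argmin(1)
        centers = torch.stack([bg[assign == j].mean(0) if (assign == j).any()
                               else centers[j] for j in range(k_bg)])

    protos = torch.cat([fg_proto, centers], 0)             # (1 + k_bg, C)
    q = F.normalize(qry_feats.flatten(1).t(), dim=1)
    sim = q @ F.normalize(protos, dim=1).t()               # (H*W, 1 + k_bg)
    return (sim.argmax(1) == 0).view(H, W)                 # True = foreground
```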

Authors:Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang
Title: Video-GPT via Next Clip Diffusion
Abstract:
GPT has shown its remarkable success in natural language processing. However, the language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, the video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from the previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show our Video-GPT achieves the state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it can be well adapted to 6 mainstream video tasks in both video generation and understanding, showing its great generalization capacity in downstream tasks. The project page is at https://zhuangshaobin.github.io/Video-GPT.github.io/.

Authors:Wenchen Chen, Yanmei Zhang, Zhongwei Xiao, Jianping Chu, Xingbo Wang
Title: Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification
Abstract:
Few-shot classification of hyperspectral images (HSI) faces the challenge of scarce labeled samples. Self-Supervised learning (SSL) and Few-Shot Learning (FSL) offer promising avenues to address this issue. However, existing methods often struggle to adapt to the spatial geometric diversity of HSIs and lack sufficient spectral prior knowledge. To tackle these challenges, we propose a method, Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification (S4L-FSC), aimed at improving the performance of few-shot HSI classification. Specifically, we first leverage heterogeneous datasets to pretrain a spatial feature extractor using a designed Rotation-Mirror Self-Supervised Learning (RM-SSL) method, combined with FSL. This approach enables the model to learn the spatial geometric diversity of HSIs using rotation and mirroring labels as supervisory signals, while acquiring transferable spatial meta-knowledge through few-shot learning. Subsequently, homogeneous datasets are utilized to pretrain a spectral feature extractor via a combination of FSL and Masked Reconstruction Self-Supervised Learning (MR-SSL). The model learns to reconstruct original spectral information from randomly masked spectral vectors, inferring spectral dependencies. In parallel, FSL guides the model to extract pixel-level discriminative features, thereby embedding rich spectral priors into the model. This spectral-spatial pretraining method, along with the integration of knowledge from heterogeneous and homogeneous sources, significantly enhances model performance. Extensive experiments on four HSI datasets demonstrate the effectiveness and superiority of the proposed S4L-FSC approach for few-shot HSI classification.
中文: 提出的S4L-FSC方法通过结合自监督学习和少样本学习,分别预训练空间与光谱特征提取器,有效解决了高光谱图像空间几何多样性适应问题并融入了光谱先验知识,显著提升了少样本分类性能。
English: The proposed S4L-FSC method enhances few-shot hyperspectral image classification by combining self-supervised learning and few-shot learning to pretrain spatial and spectral feature extractors, effectively addressing spatial geometric diversity and incorporating spectral prior knowledge through heterogeneous and homogeneous datasets.
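For the rotation-mirror self-supervision (RM-SSL) described in the S4L-FSC abstract, a minimal sketch of the pretext-label generation is shown below: each spatial patch yields eight rotation/mirror views paired with integer labels that a classification head would be trained to predict. The exact augmentation set and loss are assumptions.

```python
import torch

def rotation_mirror_views(patch):
    """Produce the 8 rotation/mirror variants of a square spatial patch
    together with integer pretext labels. patch: (C, H, W) with H == W."""
    views, labels = [], []
    for flip in (False, True):
        base = torch.flip(patch, dims=[2]) if flip else patch   # horizontal mirror
        for k in range(4):                                      # 0, 90, 180, 270 degrees
            views.append(torch.rot90(base, k, dims=[1, 2]))
            labels.append(int(flip) * 4 + k)
    return torch.stack(views), torch.tensor(labels)
```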

Authors:Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
Title: SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Abstract:
Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose SSR (Spatial Sense and Reasoning), a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations to significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate that SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Our project page is at https://yliu-cs.github.io/SSR.
中文:提出的SSR框架将原始深度数据转化为结构化文本推理,通过知识蒸馏和新基准数据集显著增强了视觉语言模型的空间理解能力,无需重新训练即可实现卓越性能。
English: The proposed SSR framework transforms raw depth data into structured textual rationales to enhance spatial reasoning in Visual-Language Models, achieving superior performance through knowledge distillation and a new benchmark dataset without requiring model retraining.

Authors:Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou
Title: VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning
Abstract:
Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VIDEORFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VIDEORFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e., VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VIDEORFT achieves state-of-the-art performance on six video reasoning benchmarks.
中文: VIDEORFT将强化微调扩展至多模态大语言模型,通过构建高质量思维链数据集并引入语义一致性奖励机制,有效解决了视频推理的难题,在六大基准测试中实现了最优性能。
English: VIDEORFT extends reinforcement fine-tuning to multimodal large language models, addressing video reasoning challenges by generating high-quality chain-of-thought datasets and introducing a semantic-consistency reward, achieving state-of-the-art performance across six benchmarks.
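A hedged sketch of what a semantic-consistency reward could look like, assuming the textual reasoning and the visual evidence are both embedded into a shared space by some encoder: the reward combines cosine alignment with answer correctness and a format bonus. The weights and composition are illustrative assumptions, not the paper's exact reward.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_reward(text_emb, video_emb, answer_ok, fmt_ok,
                                w_sem=0.5, w_ans=1.0, w_fmt=0.2):
    """text_emb, video_emb: (B, D) embeddings; answer_ok, fmt_ok: (B,) bool
    tensors. Returns a (B,) reward tensor."""
    sem = F.cosine_similarity(text_emb, video_emb, dim=-1)   # in [-1, 1]
    sem = (sem + 1.0) / 2.0                                   # map to [0, 1]
    return w_sem * sem + w_ans * answer_ok.float() + w_fmt * fmt_ok.float()
```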

Authors:Zirun Guo, Minjie Hong, Tao Jin
Title: Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning
Abstract:
Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspiration from human learning progression--from simple to complex and easy to difficult, and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer and more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.
中文: 本文提出Observe-R1框架,通过渐进式学习和多模态约束增强大语言模型的多模态推理能力,在多项基准测试中实现更优性能。
English: This paper introduces Observe-R1, a framework that enhances multimodal reasoning in large language models through progressive learning and multimodal constraints, achieving superior performance on reasoning benchmarks.
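Two of the training signals described in the Observe-R1 abstract, the conciseness bonus and the dynamic difficulty weighting, can be sketched as simple functions; the length budget, bonus value, and the quadratic weighting curve are assumptions made for illustration, not the paper's constants.

```python
import torch

def concise_answer_bonus(correct, answer_len, max_len=64, bonus=0.5):
    """Reward correct answers, with a bonus when they stay within a length
    budget. correct: (B,) bool; answer_len: (B,) int tensor."""
    within = (answer_len <= max_len).float()
    return correct.float() * (1.0 + bonus * within)

def difficulty_weight(success_rate):
    """Sample-weighting sketch: the weight peaks for medium-difficulty prompts
    (group success rate near 0.5) and vanishes for prompts the model always
    solves or always fails."""
    return 4.0 * success_rate * (1.0 - success_rate)
```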

Authors:Siwei Xia, Li Sun, Tiantian Sun, Qingli Li
Title: DragLoRA: Online Optimization of LoRA Adapters for Drag-based Image Editing in Diffusion Model
Abstract:
Drag-based editing within pretrained diffusion models provides a precise and flexible way to manipulate foreground objects. Traditional methods optimize the input features obtained from DDIM inversion directly, adjusting them iteratively to guide handle points towards target locations. However, these approaches often suffer from limited accuracy due to the low representation ability of the feature in motion supervision, as well as inefficiencies caused by the large search space required for point tracking. To address these limitations, we present DragLoRA, a novel framework that integrates LoRA (Low-Rank Adaptation) adapters into the drag-based editing pipeline. To enhance the training of LoRA adapters, we introduce an additional denoising score distillation loss which regularizes the online model by aligning its output with that of the original model. Additionally, we improve the consistency of motion supervision by adapting the input features using the updated LoRA, giving a more stable and accurate input feature for subsequent operations. Building on this, we design an adaptive optimization scheme that dynamically toggles between two modes, prioritizing efficiency without compromising precision. Extensive experiments demonstrate that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing. The code for DragLoRA is available at: https://github.com/Sylvie-X/DragLoRA.
中文摘要:DragLoRA提出了一种新颖框架,通过集成LoRA适配器并结合改进的去噪分数蒸馏损失与自适应优化方案,显著提升了基于拖拽的图像编辑的精确度和计算效率。
English Summary: DragLoRA introduces a novel framework that integrates LoRA adapters with an enhanced denoising score distillation loss and adaptive optimization, significantly improving both precision and efficiency in drag-based image editing.

Authors:Qianyue Hu, Junyan Wu, Wei Lu, Xiangyang Luo
Title: VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning
Abstract:
Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak distorts representation learning embeddings to maximize identity variation, guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.

Authors:Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Title: LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Abstract:
Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and develop a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at https://github.com/MiliLab/LogicOCR.
中文: LogicOCR是一个包含1,100道选择题的新基准,旨在评估大型多模态模型在文本丰富图像上的逻辑推理能力,揭示了尽管OCR技术有所进步,但它们在视觉阅读与推理结合方面仍存在不足。
English: LogicOCR is a new benchmark with 1,100 multiple-choice questions designed to assess large multimodal models' logical reasoning on text-rich images, revealing their limitations in integrating visual reading with reasoning despite advances in OCR.

Authors:Sijie Zhao, Feng Liu, Enzhuo Zhang, Yiqing Guo, Pengfeng Xiao, Lei Bai, Xueliang Zhang, Hao Chen
Title: Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction
Abstract:
The proliferation of multi-source remote sensing data has propelled the development of deep learning for dense prediction, yet significant challenges in data and task unification persist. Current deep learning architectures for remote sensing are fundamentally rigid. They are engineered for fixed input-output configurations, restricting their adaptability to the heterogeneous spatial, temporal, and spectral dimensions inherent in real-world data. Furthermore, these models neglect the intrinsic correlations among semantic segmentation, binary change detection, and semantic change detection, necessitating the development of distinct models or task-specific decoders. This paradigm is also constrained to a predefined set of output semantic classes, where any change to the classes requires costly retraining. To overcome these limitations, we introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands by leveraging their metadata for a unified representation. Moreover, STSUN unifies disparate dense prediction tasks within a single architecture by conditioning the model on trainable task embeddings. Similarly, STSUN facilitates flexible prediction across multiple sets of semantic categories by integrating trainable category embeddings as metadata. Extensive experiments on multiple datasets with diverse Spatial-Temporal-Spectral configurations in multiple scenarios demonstrate that a single STSUN model effectively adapts to heterogeneous inputs and outputs, unifying various dense prediction tasks and diverse semantic class predictions. The proposed approach consistently achieves state-of-the-art performance, highlighting its robustness and generalizability for complex remote sensing applications.
中文: 提出的时空谱统一网络(STSUN)通过可训练的嵌入机制适应异构遥感数据并统一多种密集预测任务,在多样化场景中均实现了最先进的性能表现。
English: The proposed Spatial-Temporal-Spectral Unified Network (STSUN) overcomes the limitations of rigid deep learning architectures by adapting to heterogeneous remote sensing data and unifying multiple dense prediction tasks through trainable embeddings, achieving state-of-the-art performance across diverse scenarios.

Authors:Zhengyi Luo, Chen Tessler, Toru Lin, Ye Yuan, Tairan He, Wenli Xiao, Yunrong Guo, Gal Chechik, Kris Kitani, Linxi Fan, Yuke Zhu
Title: Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning
Abstract:
Human behavior is fundamentally shaped by visual perception -- our ability to interact with the world depends on actively gathering relevant information and adapting our movements accordingly. Behaviors like searching for objects, reaching, and hand-eye coordination naturally emerge from the structure of our sensory system. Inspired by these principles, we introduce Perceptive Dexterous Control (PDC), a framework for vision-driven dexterous whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues, without relying on privileged state information (e.g., 3D object positions and geometries). This perception-as-interface paradigm enables learning a single policy to perform multiple household tasks, including reaching, grasping, placing, and articulated object manipulation. We also show that training from scratch with reinforcement learning can produce emergent behaviors such as active search. These results demonstrate how vision-driven control and complex tasks induce human-like behaviors and can serve as the key ingredients in closing the perception-action loop for animation, robotics, and embodied AI.
Chinese: 感知灵巧控制(PDC)框架仅通过以自我为中心的视觉,使模拟人形机器人能够执行复杂的家务任务,并通过强化学习无需依赖特权状态信息即可习得主动搜索等涌现行为。
English: The Perceptive Dexterous Control (PDC) framework enables simulated humanoids to perform complex household tasks using only egocentric vision, learning emergent behaviors like active search through reinforcement learning without relying on privileged state information.

Authors:Yixiao Chen, Zhiyuan Ma, Guoli Jia, Che Jiang, Jianjun Li, Bowen Zhou
Title: Context-Aware Autoregressive Models for Multi-Conditional Image Generation
Abstract:
Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence--offering a concise solution for multi-conditional image generation tasks. In this work, we propose $\textbf{ContextAR}$, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions during inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, showing performance competitive with diffusion-based multi-conditional control approaches and the existing autoregressive baseline across diverse multi-condition-driven scenarios. Project page: $\href{https://context-ar.github.io/}{https://context-ar.github.io/.}$

Authors:ZhanFeng Feng, Long Peng, Xin Di, Yong Guo, Wenbo Li, Yulun Zhang, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha
Title: PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement
Abstract:
Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models' capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks and benchmarks. The code will be made publicly available at: https://github.com/xiaoBIGfeng/PMQ-VE.
中文: 摘要提出PMQ-VE这一新型量化方法,通过回溯式多帧量化和渐进式多教师蒸馏的两阶段策略,有效解决了现有视频增强量化方法在动态范围适应和模型学习能力方面的不足,在多个任务和基准测试中实现了最优性能。
English: The abstract introduces PMQ-VE, a novel quantization method that addresses the limitations of existing approaches in video enhancement by employing a two-stage process with backtracking-based multi-frame quantization and progressive multi-teacher distillation, achieving state-of-the-art performance while reducing computational demands.
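To illustrate the flavor of percentile-based clipping-bound initialization for low-bit quantization, the sketch below initializes bounds from activation percentiles and refines them with a simple 1-D search over shrink factors that minimizes quantization MSE. BMFQ's iterative search with pruning and backtracking, and the multi-teacher distillation, are not reproduced here; function names and constants are assumptions.

```python
import torch

def uniform_quantize(x, lo, hi, bits=4):
    """Uniform affine quantization of x into [lo, hi] with 2**bits levels."""
    levels = 2 ** bits - 1
    scale = max(hi - lo, 1e-8) / levels
    q = ((x.clamp(lo, hi) - lo) / scale).round()
    return q * scale + lo

def percentile_clip_init(x, bits=4, pct=0.999, num_shrink=16):
    """Percentile-based clipping-bound initialization plus a simple search
    over shrink factors minimizing quantization MSE (a sketch of the
    motivation behind BMFQ, not its backtracking search)."""
    flat = x.flatten()
    lo0 = torch.quantile(flat, 1.0 - pct).item()
    hi0 = torch.quantile(flat, pct).item()
    best, best_err = (lo0, hi0), float("inf")
    for s in torch.linspace(0.7, 1.0, num_shrink).tolist():
        lo, hi = lo0 * s, hi0 * s
        err = (uniform_quantize(x, lo, hi, bits) - x).pow(2).mean().item()
        if err < best_err:
            best, best_err = (lo, hi), err
    return best
```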

Authors:Hanchen Wang, Yixuan Wu, Yinan Feng, Peng Jin, Shihang Feng, Yiming Mao, James Wiskin, Baris Turkbey, Peter A. Pinto, Bradford J. Wood, Songting Luo, Yinpeng Chen, Emad Boctor, Youzuo Lin
Title: OpenPros: A Large-Scale Dataset for Limited View Prostate Ultrasound Computed Tomography
Abstract:
Prostate cancer is one of the most common and lethal cancers among men, making its early detection critically important. Although ultrasound imaging offers greater accessibility and cost-effectiveness compared to MRI, traditional transrectal ultrasound methods suffer from low sensitivity, especially in detecting anteriorly located tumors. Ultrasound computed tomography provides quantitative tissue characterization, but its clinical implementation faces significant challenges, particularly under anatomically constrained limited-angle acquisition conditions specific to prostate imaging. To address these unmet needs, we introduce OpenPros, the first large-scale benchmark dataset explicitly developed for limited-view prostate USCT. Our dataset includes over 280,000 paired samples of realistic 2D speed-of-sound (SOS) phantoms and corresponding ultrasound full-waveform data, generated from anatomically accurate 3D digital prostate models derived from real clinical MRI/CT scans and ex vivo ultrasound measurements, annotated by medical experts. Simulations are conducted under clinically realistic configurations using advanced finite-difference time-domain and Runge-Kutta acoustic wave solvers, both provided as open-source components. Through comprehensive baseline experiments, we demonstrate that state-of-the-art deep learning methods surpass traditional physics-based approaches in both inference efficiency and reconstruction accuracy. Nevertheless, current deep learning models still fall short of delivering clinically acceptable high-resolution images with sufficient accuracy. By publicly releasing OpenPros, we aim to encourage the development of advanced machine learning algorithms capable of bridging this performance gap and producing clinically usable, high-resolution, and highly accurate prostate ultrasound images. The dataset is publicly accessible at https://open-pros.github.io/.

Authors:Yeonkyung Lee, Woojung Han, Youngjun Jun, Hyeonmin Kim, Jungkyung Cho, Seong Jae Hwang
Title: PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning
Abstract:
Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely available and serves as a valuable resource for analyzing disease progression. To effectively incorporate patient-specific information, we propose PRETI, a retinal foundation model that integrates metadata-aware learning with robust self-supervised representation learning. We introduce Learnable Metadata Embedding (LME), which dynamically refines metadata representations. Additionally, we construct patient-level data pairs, associating images from the same individual to improve robustness against non-clinical variations. To further optimize retinal image representation, we propose Retina-Aware Adaptive Masking (RAAM), a strategy that selectively applies masking within the retinal region and dynamically adjusts the masking ratio during training. PRETI captures both global structures and fine-grained pathological details, resulting in superior diagnostic performance. Extensive experiments demonstrate that PRETI achieves state-of-the-art results across diverse diseases and biomarker predictions using in-house and public data, indicating the importance of metadata-guided foundation models in retinal disease analysis. Our code and pretrained model are available at https://github.com/MICV-yonsei/PRETI
中文: PRETI视网膜基础模型通过元数据感知学习和自适应掩码策略,有效结合患者信息与图像特征,在多种视网膜疾病诊断中实现了最优性能。
English: The PRETI retinal foundation model integrates metadata-aware learning with self-supervised representation techniques, employing learnable embeddings and adaptive masking to achieve state-of-the-art diagnostic performance across diverse retinal diseases.

Authors:Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, Haohuan Fu, Jianxi Huang, Juepeng Zheng
Title: Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
Abstract:
Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily in terms of insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 27,247 QA pairs and 19,615 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition; notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.

Authors:Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Zihan Wang, Yuan Xie, Shaohui Lin
Title: CompBench: Benchmarking Complex Instruction-guided Image Editing
Abstract:
While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems. The dataset, code, and models are available at https://comp-bench.github.io/.
中文:CompBench是一个专为复杂指令引导图像编辑设计的大规模基准,通过细粒度指令遵循和空间上下文推理的挑战性场景,揭示了当前模型的根本局限性,并为下一代系统发展提供了关键洞见。
English: CompBench is a large-scale benchmark designed to address the limitations of existing image editing evaluations by introducing complex scenarios requiring fine-grained instruction following and spatial-contextual reasoning, which exposes critical weaknesses in current models and provides key insights for future development.

Authors:Wenquan Lu, Jiaqi Zhang, Hugues Van Assel, Randall Balestriero
Title: Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum
Abstract:
Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.
Chinese: 本研究提出了一种自监督框架,通过采用基于去噪器训练的课程学习和教师引导正则化,增强了噪声鲁棒性表征学习,在无需推理阶段去噪器的情况下显著提升了模型在噪声数据上的性能。
English: This study introduces a self-supervised framework that enhances noise-robust representation learning by employing a denoiser-trained curriculum and teacher-guided regularization, eliminating the need for denoisers during inference while significantly improving model performance on noisy data.
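The two ingredients named in the abstract, the denoised-to-noisy curriculum and the teacher-guided regularization that anchors noisy embeddings to denoised ones, can be sketched as below. The switch epoch, loss form, and weighting are assumptions rather than the released recipe.

```python
import torch
import torch.nn.functional as F

def curriculum_batch(denoised, noisy, epoch, switch_epoch=50):
    """Denoised-to-noisy curriculum sketch: train on denoised images for the
    first `switch_epoch` epochs, then on the raw noisy images."""
    return denoised if epoch < switch_epoch else noisy

def teacher_anchor_loss(student, teacher, noisy, denoised, weight=1.0):
    """Teacher-guided regularization sketch (the loss form is an assumption):
    anchor the student's embedding of a noisy image to the frozen teacher's
    embedding of its denoised counterpart."""
    with torch.no_grad():
        target = teacher(denoised)
    return weight * F.mse_loss(student(noisy), target)
```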

Authors:Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh
Title: Learning to Highlight Audio by Watching Movies
Abstract:
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: https://wikichao.github.io/VisAH/.
中文摘要:本文提出了一种基于Transformer的多模态框架,通过视频引导的音频高亮技术来改善视听协调性,利用新构建的数据集在定量和主观评估中均优于现有基线方法。
English Summary: The paper introduces a transformer-based framework for visually-guided acoustic highlighting to enhance audio-visual harmony by transforming audio based on video cues, using a novel dataset and outperforming existing methods.

Authors:Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
Title: HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
Abstract:
Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of clinically relevant computational pathology solutions. The dataset can be accessed at https://github.com/HistAI/HISTAI.
Chinese: HISTAI数据集作为一个大型、多模态、开放获取的资源,包含超过6万张全切片图像及丰富的临床元数据,旨在弥补现有资源的不足,推动数字病理学中人工智能模型的稳健性和创新发展。
English: The HISTAI dataset is introduced as a large, multimodal, open-access collection of over 60,000 whole slide images with extensive clinical metadata to address limitations in existing resources and enhance AI model robustness in digital pathology.

Authors:Jiarui Wang, Huiyu Duan, Ziheng Jia, Yu Zhao, Woo Yi Yang, Zicheng Zhang, Zijian Chen, Juntong Wang, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Title: LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation
Abstract:
Recent advancements in large multimodal models (LMMs) have driven substantial progress in both text-to-video (T2V) generation and video-to-text (V2T) interpretation tasks. However, current AI-generated videos (AIGVs) still exhibit limitations in terms of perceptual quality and text-video alignment. Therefore, a reliable and scalable automatic model for AIGV evaluation is desirable, which heavily relies on the scale and quality of human annotations. To this end, we present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation, which features (i) comprehensive tasks, encompassing 3,050 extensive prompts across 20 fine-grained task dimensions, (ii) the largest human annotations, including 120K mean-opinion scores (MOSs) and 60K question-answering (QA) pairs annotated on 58,500 videos generated from 30 T2V models, and (iii) bidirectional benchmarking and evaluating for both T2V generation and V2T interpretation capabilities. Based on AIGVE-60K, we propose LOVE, a LMM-based metric for AIGV Evaluation from multiple dimensions including perceptual preference, text-video correspondence, and task-specific accuracy in terms of both instance level and model level. Comprehensive experiments demonstrate that LOVE not only achieves state-of-the-art performance on the AIGVE-60K dataset, but also generalizes effectively to a wide range of other AIGV evaluation benchmarks. These findings highlight the significance of the AIGVE-60K dataset. Database and codes are anonymously available at https://github.com/IntMeGroup/LOVE.
中文:AIGVE-60K数据集通过大规模人工标注和新型多模态评估指标LOVE,解决了AI生成视频在感知质量与文本对齐方面的评估难题,并在多个基准测试中展现出最优性能。
English: The AIGVE-60K dataset and benchmark address limitations in AI-generated video evaluation by providing extensive human annotations and introducing LOVE, a multimodal metric that achieves state-of-the-art performance across multiple evaluation dimensions.

Authors:Yijie Zheng, Jinxuan Yang, Yu Chen, Yaxuan Wang, Yihang Lu, Guoqing Li
Title: Beluga Whale Detection from Satellite Imagery with Point Labels
Abstract:
Very high-resolution (VHR) satellite imagery has emerged as a powerful tool for monitoring marine animals on a large scale. However, existing deep learning-based whale detection methods usually require manually created, high-quality bounding box annotations, which are labor-intensive to produce. Moreover, existing studies often exclude ``uncertain whales'', individuals that have ambiguous appearances in satellite imagery, limiting the applicability of these models in real-world scenarios. To address these limitations, this study introduces an automated pipeline for detecting beluga whales and harp seals in VHR satellite imagery. The pipeline leverages point annotations and the Segment Anything Model (SAM) to generate precise bounding box annotations, which are used to train YOLOv8 for multiclass detection of certain whales, uncertain whales, and harp seals. Experimental results demonstrated that SAM-generated annotations significantly improved detection performance, achieving higher $\text{F}_\text{1}$-scores compared to traditional buffer-based annotations. YOLOv8 trained on SAM-labeled boxes achieved an overall $\text{F}_\text{1}$-score of 72.2% for whales overall and 70.3% for harp seals, with superior performance in dense scenes. The proposed approach not only reduces the manual effort required for annotation but also enhances the detection of uncertain whales, offering a more comprehensive solution for marine animal monitoring. This method holds great potential for extending to other species, habitats, and remote sensing platforms, as well as for estimating whale biometrics, thereby advancing ecological monitoring and conservation efforts. The codes for our label and detection pipeline are publicly available at http://github.com/voyagerxvoyagerx/beluga-seeker .
Chinese: 本研究提出了一种自动化流程,利用点标注和Segment Anything Model(SAM)生成精确边界框来训练YOLOv8,显著提升了在超高分辨率卫星图像中对白鲸、不确定鲸鱼和竖琴海豹的检测性能,同时减少了人工标注工作量。
English: This study introduces an automated pipeline using point annotations and the Segment Anything Model (SAM) to generate precise bounding boxes for training YOLOv8, significantly improving the detection of beluga whales, uncertain whales, and harp seals in very high-resolution satellite imagery while reducing manual annotation efforts.
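One small piece of such a labeling pipeline, converting a binary mask (for instance, one returned by SAM for a point prompt) into a YOLO-format box line, might look like the sketch below. The function name and normalization convention are illustrative, not the released code.

```python
import numpy as np

def mask_to_yolo_box(mask, class_id):
    """Convert a binary mask into a YOLO label line: class cx cy w h,
    all normalized to [0, 1]. Returns None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    h, w = mask.shape
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    cx, cy = (x0 + x1 + 1) / 2 / w, (y0 + y1 + 1) / 2 / h
    bw, bh = (x1 - x0 + 1) / w, (y1 - y0 + 1) / h
    return f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"
```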

Authors:Yinghui Zhang, Tailin Chen, Yuchen Zhang, Zeyu Fu
Title: Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Abstract:
The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using pre-trained models and then incorporates a temporal cross-attention mechanism to capture dependencies between video and audio streams. The learned features are then processed by channel-wise and modality-wise fusion modules to obtain informative representations of videos. Our extensive experiments on a real-world dataset demonstrate that CMFusion significantly outperforms five widely used baselines in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation studies and parameter analyses further validate our design choices, highlighting the model's effectiveness in detecting hate videos. The source codes will be made publicly available at https://github.com/EvelynZ10/cmfusion.
Chinese: 针对TikTok和YouTube等平台上仇恨视频检测的难题,CMFusion模型通过时序跨模态注意力机制和通道-模态双融合策略,有效整合多模态特征,在真实数据集上显著提升了检测精度与召回率。
English: The proliferation of hate videos on platforms like TikTok and YouTube poses detection challenges, which CMFusion addresses through a multimodal fusion model that integrates temporal dynamics and cross-modal interactions, achieving superior performance in identifying nuanced harmful content.

Authors:Kazuhiko Kawamoto, Atsuhiro Endo, Hiroshi Kera
Title: Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment
Abstract:
Task arithmetic enables efficient model editing by representing task-specific changes as vectors in parameter space. Task arithmetic typically assumes that the source and target models are initialized from the same pre-trained parameters. This assumption limits its applicability in cross-model transfer settings, where models are independently pre-trained on different datasets. To address this challenge, we propose a method based on few-shot orthogonal alignment, which aligns task vectors to the parameter space of a differently pre-trained target model. These transformations preserve key properties of task vectors, such as norm and rank, and are learned using only a small number of labeled examples. We evaluate the method using two Vision Transformers pre-trained on YFCC100M and LAION400M, and test on eight classification datasets. Experimental results show that our method improves transfer accuracy over direct task vector application and achieves performance comparable to few-shot fine-tuning, while maintaining the modularity and reusability of task vectors. Our code is available at https://github.com/kawakera-lab/CrossModelTransfer.
中文摘要:本研究提出一种少样本正交对齐方法,使任务向量能够在独立预训练的模型间实现有效迁移,在保持任务向量模块化特性的同时,获得了与微调相当的性能表现。
English Summary: This study introduces a few-shot orthogonal alignment method to enable task vector transfer between independently pre-trained models, achieving comparable performance to fine-tuning while preserving task vector modularity.
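A minimal sketch of the alignment idea, assuming the task vector for one layer is a weight-delta matrix: an orthogonal map, kept orthogonal by PyTorch's orthogonal parametrization so that norm and rank are preserved, is learned on a few labelled examples and then applied before adding the delta to the target model's weight. The class name, shapes, and editing coefficient are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class TaskVectorAligner(nn.Module):
    """Few-shot orthogonal alignment sketch for a single weight matrix."""

    def __init__(self, dim):
        super().__init__()
        # the linear layer's weight is constrained to stay orthogonal
        self.rot = orthogonal(nn.Linear(dim, dim, bias=False))

    def forward(self, task_vector):
        # task_vector: (dim, k) slice of a source-model weight delta
        return self.rot.weight @ task_vector

def edit_target_weight(target_weight, task_vector, aligner, alpha=1.0):
    """Task-arithmetic edit of the target weight with the aligned vector."""
    return target_weight + alpha * aligner(task_vector)
```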

Authors:Mingcheng Qu, Guang Yang, Donglin Di, Tonghua Su, Yue Gao, Yang Song, Lei Fan
Title: Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance
Abstract:
Multimodal pathology-genomic analysis has become increasingly prominent in cancer survival prediction. However, existing studies mainly utilize multi-instance learning to aggregate patch-level features, neglecting the information loss of contextual and hierarchical details within pathology images. Furthermore, the disparity in data granularity and dimensionality between pathology and genomics leads to a significant modality imbalance. The high spatial resolution inherent in pathology data gives it a dominant role while overshadowing genomics in multimodal integration. In this paper, we propose a multimodal survival prediction framework that incorporates hypergraph learning to effectively capture both contextual and hierarchical details from pathology images. Moreover, it employs a modality rebalance mechanism and an interactive alignment fusion strategy to dynamically reweight the contributions of the two modalities, thereby mitigating the pathology-genomics imbalance. Quantitative and qualitative experiments are conducted on five TCGA datasets, demonstrating that our model outperforms advanced methods by over 3.4% in C-Index performance.
Chinese: 本文提出了一种多模态生存预测框架,利用超图学习捕捉病理图像的上下文和层次细节,并通过模态再平衡机制和交互对齐策略解决病理与基因组数据间的不平衡问题,在TCGA数据集上C-Index性能提升超过3.4%。
English: This paper introduces a multimodal survival prediction framework using hypergraph learning to capture contextual and hierarchical details from pathology images, along with a modality rebalance mechanism and interactive alignment to address the imbalance between pathology and genomic data, achieving over 3.4% higher C-Index performance on TCGA datasets.

Authors:Yian Zhao, Wanshi Xu, Ruochong Zheng, Pengchong Qiao, Chang Liu, Jie Chen
Title: iSegMan: Interactive Segment-and-Manipulate 3D Gaussians
Abstract:
The efficient rendering and explicit nature of 3DGS promote the advancement of 3D scene manipulation. However, existing methods typically encounter challenges in controlling the manipulation region and are unable to furnish the user with interactive feedback, which inevitably leads to unexpected results. Intuitively, incorporating interactive 3D segmentation tools can compensate for this deficiency. Nevertheless, existing segmentation frameworks impose a pre-processing step of scene-specific parameter training, which limits the efficiency and flexibility of scene manipulation. To deliver a 3D region control module that is well-suited for scene manipulation with reliable efficiency, we propose interactive Segment-and-Manipulate 3D Gaussians (iSegMan), an interactive segmentation and manipulation framework that only requires simple 2D user interactions in any view. To propagate user interactions to other views, we propose Epipolar-guided Interaction Propagation (EIP), which innovatively exploits epipolar constraint for efficient and robust interaction matching. To avoid scene-specific training to maintain efficiency, we further propose the novel Visibility-based Gaussian Voting (VGV), which obtains 2D segmentations from SAM and models the region extraction as a voting game between 2D Pixels and 3D Gaussians based on Gaussian visibility. Taking advantage of the efficient and precise region control of EIP and VGV, we put forth a Manipulation Toolbox to implement various functions on selected regions, enhancing the controllability, flexibility and practicality of scene manipulation. Extensive results on 3D scene manipulation and segmentation tasks fully demonstrate the significant advantages of iSegMan. Project page is available at https://zhao-yian.github.io/iSegMan.
中文摘要:提出的iSegMan框架通过简单的2D用户交互实现三维场景操控,采用创新的极线约束交互传播和基于可见性的高斯投票技术,无需场景特定训练即可完成高效分割。
English Summary: The proposed iSegMan framework enables interactive 3D scene manipulation through simple 2D user inputs, utilizing novel Epipolar-guided Interaction Propagation and Visibility-based Gaussian Voting techniques to achieve efficient segmentation without scene-specific training.
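
The Visibility-based Gaussian Voting step can be pictured as pixels inside a SAM mask casting visibility-weighted votes for the 3D Gaussians they see. The sketch below assumes a per-pixel layout of top-K visible Gaussian ids and weights, and the 0.5 threshold is illustrative; neither is taken from the released code.

```python
# Toy visibility-weighted voting between 2D mask pixels and 3D Gaussians (assumed data layout).
import numpy as np

def gaussian_voting(mask, visible_ids, visibility, num_gaussians, thresh=0.5):
    """mask: (H, W) bool SAM mask.
    visible_ids, visibility: (H, W, K) top-K visible Gaussian ids per pixel and their weights."""
    inside = np.zeros(num_gaussians)   # visibility mass from pixels inside the mask
    total = np.zeros(num_gaussians)    # visibility mass from all pixels
    np.add.at(total, visible_ids.reshape(-1), visibility.reshape(-1))
    np.add.at(inside, visible_ids[mask].reshape(-1), visibility[mask].reshape(-1))
    score = inside / np.maximum(total, 1e-8)
    return np.nonzero(score > thresh)[0]   # ids of Gaussians assigned to the selected region
```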

Authors:Haitao Li, Ziyu Li, Yiheng Mao, Zhengyao Ding, Zhengxing Huang
Title: DC-Seg: Disentangled Contrastive Learning for Brain Tumor Segmentation with Missing Modalities
Abstract:
Accurate segmentation of brain images typically requires the integration of complementary information from multiple image modalities. However, clinical data for all modalities may not be available for every patient, creating a significant challenge. To address this, previous studies encode multiple modalities into a shared latent space. While somewhat effective, this remains suboptimal, as each modality contains distinct and valuable information. In this study, we propose DC-Seg (Disentangled Contrastive Learning for Segmentation), a new method that explicitly disentangles images into modality-invariant anatomical representation and modality-specific representation, by using anatomical contrastive learning and modality contrastive learning respectively. This solution improves the separation of anatomical and modality-specific features by considering the modality gaps, leading to more robust representations. Furthermore, we introduce a segmentation-based regularizer that enhances the model's robustness to missing modalities. Extensive experiments on the BraTS 2020 and a private white matter hyperintensity (WMH) segmentation dataset demonstrate that DC-Seg outperforms state-of-the-art methods in handling incomplete multimodal brain tumor segmentation tasks with varying missing modalities, while also demonstrating strong generalizability in WMH segmentation. The code is available at https://github.com/CuCl-2/DC-Seg.
中文: DC-Seg通过解耦对比学习将脑图像分离为模态不变的解剖表征和模态特异性表征,有效提升了多模态数据缺失情况下的分割鲁棒性和准确性。
English: DC-Seg introduces a novel approach that disentangles brain images into modality-invariant anatomical and modality-specific representations through contrastive learning, significantly improving segmentation accuracy and robustness to missing modalities in clinical data.
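
The two contrastive objectives can be sketched with a standard InfoNCE loss, where anatomy codes of the same case from different modalities act as positives and other cases in the batch act as negatives; the tensor names, dimensions, and temperature below are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1):
    """Standard InfoNCE: row i of `positive` is the positive for row i of `anchor`;
    all other rows act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Hypothetical usage: anatomy codes of the same cases extracted from two modalities.
z_anat_t1, z_anat_t2 = torch.randn(8, 128), torch.randn(8, 128)
anatomical_loss = info_nce(z_anat_t1, z_anat_t2)
```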

Authors:Pengfei Lyu, Pak-Hei Yeung, Xiaosheng Yu, Jing Xia, Jianning Chi, Chengdong Wu, Jagath C. Rajapakse
Title: Bridging the Inter-Domain Gap through Low-Level Features for Cross-Modal Medical Image Segmentation
Abstract:
This paper addresses the task of cross-modal medical image segmentation by exploring unsupervised domain adaptation (UDA) approaches. We propose a model-agnostic UDA framework, LowBridge, which builds on a simple observation that cross-modal images share some similar low-level features (e.g., edges) as they are depicting the same structures. Specifically, we first train a generative model to recover the source images from their edge features, followed by training a segmentation model on the generated source images, separately. At test time, edge features from the target images are input to the pretrained generative model to generate source-style target domain images, which are then segmented using the pretrained segmentation network. Despite its simplicity, extensive experiments on various publicly available datasets demonstrate that LowBridge achieves state-of-the-art performance, outperforming eleven existing UDA approaches under different settings. Notably, further ablation studies show that LowBridge is agnostic to different types of generative and segmentation models, suggesting its potential to be seamlessly plugged with the most advanced models to achieve even more outstanding results in the future. The code is available at https://github.com/JoshuaLPF/LowBridge.
中文摘要:本文提出了一种模型无关的无监督域自适应框架LowBridge,通过利用跨模态图像共享的边缘特征将目标域图像转换为源域风格,在多种医学图像分割任务中实现了最优性能。
English Summary: This paper introduces LowBridge, a model-agnostic unsupervised domain adaptation framework for cross-modal medical image segmentation that leverages shared edge features to generate source-style images from target domains, achieving state-of-the-art performance across multiple datasets.
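
The test-time flow (extract edges from the target image, run the edge-to-source generator, then segment the source-style result) can be sketched as below. The Canny thresholds and the Conv2d stand-ins are placeholders, not the trained models from the paper.

```python
# Sketch of the LowBridge-style test-time flow: target edges -> source-style image -> segmentation.
import cv2
import numpy as np
import torch

def segment_target_image(target_gray: np.ndarray, generator, segmenter):
    edges = cv2.Canny(target_gray, 50, 150)                  # shared low-level cue across modalities
    e = torch.from_numpy(edges).float()[None, None] / 255.0  # (1, 1, H, W)
    with torch.no_grad():
        source_style = generator(e)       # edge features -> source-style image
        return segmenter(source_style)    # segmenter trained on (generated) source images

generator = torch.nn.Conv2d(1, 1, 3, padding=1)   # placeholder for the pretrained generative model
segmenter = torch.nn.Conv2d(1, 2, 1)              # placeholder for the pretrained segmentation model
mask = segment_target_image(np.zeros((64, 64), dtype=np.uint8), generator, segmenter)
```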

Authors:Shiming Chen, Dingjie Fu, Salman Khan, Fahad Shahbaz Khan
Title: GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder
Abstract:
Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7% performance gains and more than 60× faster training speed on AWA2. Codes are available at https://github.com/shiming-chen/GenZSL.
中文摘要:提出的GenZSL模型通过使用弱语义向量从相似已见类别中归纳新类样本,推进零样本学习,并采用类别多样性促进和目标类别引导优化策略,在效能和效率上实现显著提升。
English Summary: The proposed GenZSL model advances zero-shot learning by generating new class samples from similar seen classes using weak semantic vectors, enhancing diversity and optimizing performance with significant gains in efficacy and efficiency.

Authors:Xuannan Liu, Zekun Li, Zheqi He, Peipei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, Ran He
Title: Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Abstract:
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
中文摘要:Video-SafetyBench作为首个全面评估大型视觉语言模型在视频文本攻击下安全性的基准,通过2264个测试样本和新颖评估指标,揭示了模型在视频诱导攻击中的显著脆弱性。
English Summary: Video-SafetyBench is introduced as the first comprehensive benchmark to evaluate Large Vision-Language Models' safety against video-text attacks, revealing significant vulnerabilities through 2,264 test pairs and a novel evaluation metric.

Authors:Jiajun Qin, Yuan Pu, Zhuolun He, Seunggeun Kim, David Z. Pan, Bei Yu
Title: UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings
Abstract:
Current research has explored vision-language models for multi-modal embedding tasks, such as information retrieval, visual grounding, and classification. However, real-world scenarios often involve diverse modality combinations between queries and targets, such as text and image to text, text and image to text and image, and text to text and image. These diverse combinations pose significant challenges for existing models, as they struggle to align all modality combinations within a unified embedding space during training, which degrades performance at inference. To address this limitation, we propose UniMoCo, a novel vision-language model architecture designed for multi-modal embedding tasks. UniMoCo introduces a modality-completion module that generates visual features from textual inputs, ensuring modality completeness for both queries and targets. Additionally, we develop a specialized training strategy to align embeddings from both original and modality-completed inputs, ensuring consistency within the embedding space. This enables the model to robustly handle a wide range of modality combinations across embedding tasks. Experiments show that UniMoCo outperforms previous methods while demonstrating consistent robustness across diverse settings. More importantly, we identify and quantify the inherent bias in conventional approaches caused by imbalance of modality combinations in training data, which can be mitigated through our modality-completion paradigm. The code is available at https://github.com/HobbitQia/UniMoCo.
中文摘要:提出的UniMoCo模型通过引入模态补全模块和专门训练策略,解决了视觉语言任务中多种模态组合难以对齐的难题,在提升性能与鲁棒性的同时有效缓解了训练数据不平衡导致的固有偏差。
English Summary: The proposed UniMoCo model addresses the challenge of aligning diverse modality combinations in vision-language tasks by introducing a modality-completion module and specialized training strategy, demonstrating superior performance and robustness while mitigating inherent biases from training data imbalance.
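
A minimal sketch of the modality-completion idea: a small network predicts a pseudo visual feature from the text feature, and a consistency term encourages embeddings built from real and completed visuals to agree. The layer sizes and cosine-based consistency below are assumptions, not the repository's actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityCompletion(nn.Module):
    """Predicts a visual feature vector from a text feature vector, so text-only
    queries/targets can be treated as (text + pseudo-image)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(text_feat)

completion = ModalityCompletion()
text_feat = torch.randn(4, 512)
pseudo_visual = completion(text_feat)
# Consistency objective: embeddings with real vs. completed visuals should match.
real_visual = torch.randn(4, 512)
consistency = 1 - F.cosine_similarity(pseudo_visual, real_visual).mean()
```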

Authors:Jian Zhu, He Wang, Yang Xu, Zebin Wu, Zhihui Wei
Title: Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model
Abstract:
Hyperspectral and multispectral image (HSI-MSI) fusion involves combining a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Most deep learning-based methods for HSI-MSI fusion rely on large amounts of hyperspectral data for supervised training, which is often scarce in practical applications. In this paper, we propose a self-learning Adaptive Residual Guided Subspace Diffusion Model (ARGS-Diff), which only utilizes the observed images without any extra training data. Specifically, as the LR-HSI contains spectral information and the HR-MSI contains spatial information, we design two lightweight spectral and spatial diffusion models to separately learn the spectral and spatial distributions from them. Then, we use these two models to reconstruct HR-HSI from two low-dimensional components, i.e., the spectral basis and the reduced coefficient, during the reverse diffusion process. Furthermore, we introduce an Adaptive Residual Guided Module (ARGM), which refines the two components through a residual guided function at each sampling step, thereby stabilizing the sampling process. Extensive experimental results demonstrate that ARGS-Diff outperforms existing state-of-the-art methods in terms of both performance and computational efficiency in the field of HSI-MSI fusion. Code is available at https://github.com/Zhu1116/ARGS-Diff.
Chinese: 本文提出ARGS-Diff自学习扩散模型,无需额外训练数据即可融合低分辨率高光谱与高分辨率多光谱图像,通过自适应残差引导机制在性能与计算效率上均超越现有最优方法。
English: This paper introduces ARGS-Diff, a self-learning diffusion model that fuses low-resolution hyperspectral and high-resolution multispectral images without external training data, achieving superior performance and efficiency through adaptive residual guidance.
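
The subspace factorization at the heart of the method can be illustrated as a low-rank product of a spectral basis (driven by the LR-HSI) and a reduced spatial coefficient (driven by the HR-MSI); the shapes below are assumed for illustration.

```python
import torch

# Assumed shapes: B spectral bands, r subspace dimension, H x W spatial resolution.
B, r, H, W = 100, 8, 64, 64
spectral_basis = torch.randn(B, r)          # sampled by the spectral diffusion model
coefficient = torch.randn(r, H, W)          # sampled by the spatial diffusion model

# HR-HSI reconstruction as a low-rank product of the two components.
hr_hsi = torch.einsum('br,rhw->bhw', spectral_basis, coefficient)
print(hr_hsi.shape)   # torch.Size([100, 64, 64])
```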

Authors:Hancan Zhu, Jinhao Chen, Guanghua He
Title: MedVKAN: Efficient Feature Extraction with Mamba and KAN for Medical Image Segmentation
Abstract:
Medical image segmentation relies heavily on convolutional neural networks (CNNs) and Transformer-based models. However, CNNs are constrained by limited receptive fields, while Transformers suffer from scalability challenges due to their quadratic computational complexity. To address these limitations, recent advances have explored alternative architectures. The state-space model Mamba offers near-linear complexity while capturing long-range dependencies, and the Kolmogorov-Arnold Network (KAN) enhances nonlinear expressiveness by replacing fixed activation functions with learnable ones. Building on these strengths, we propose MedVKAN, an efficient feature extraction model integrating Mamba and KAN. Specifically, we introduce the EFC-KAN module, which enhances KAN with convolutional operations to improve local pixel interaction. We further design the VKAN module, integrating Mamba with EFC-KAN as a replacement for Transformer modules, significantly improving feature extraction. Extensive experiments on five public medical image segmentation datasets show that MedVKAN achieves state-of-the-art performance on four datasets and ranks second on the remaining one. These results validate the potential of Mamba and KAN for medical image segmentation while introducing an innovative and computationally efficient feature extraction framework. The code is available at: https://github.com/beginner-cjh/MedVKAN.
Chinese: MedVKAN结合了Mamba模型的近线性复杂度和改进的Kolmogorov-Arnold网络(KAN),有效克服了CNN和Transformer的局限性,在多数医学图像分割数据集上实现了最优性能。
English: MedVKAN integrates the Mamba model for near-linear complexity and the enhanced Kolmogorov-Arnold Network (KAN) to overcome limitations of CNNs and Transformers, achieving state-of-the-art performance on most medical image segmentation datasets.

Authors:Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, Jian Zhang
Title: EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Abstract:
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download at https://github.com/apple/ml-egodex.
中文摘要:EgoDex数据集通过苹果Vision Pro设备采集了829小时包含实时3D手部追踪的自我中心视角视频,覆盖194种日常家居操作任务,有效解决了操作模仿学习中的数据稀缺问题,为机器人和计算机视觉领域提供了重要资源。
English Summary: The EgoDex dataset addresses data scarcity in imitation learning for manipulation by providing 829 hours of egocentric video with real-time 3D hand tracking, collected using Apple Vision Pro across 194 household tasks, enabling advancements in robotics and computer vision.

Authors:Haipeng Fang, Sheng Tang, Juan Cao, Enshuo Zhang, Fan Tang, Tong-Yee Lee
Title: Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration
Abstract:
Diffusion transformers have shown exceptional performance in visual generation but incur high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of the diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: attend to prune feature redundancies in areas not attended by the diffusion process. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjusting, and prompt reweighting for different stages. Served in a post-training way, the proposed method can be integrated seamlessly into any DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method, for example, it achieves 1.55 times acceleration with negligible impact on image quality. Project page: https://github.com/ICTMCG/SDTM.
中文: 本研究提出SDTM方法,通过结构优先-细节补充的令牌合并策略,利用扩散去噪先验动态压缩特征冗余,在多种DiT架构上实现1.55倍加速且几乎不影响图像质量。
English: This study introduces SDTM, a structure-then-detail token merging method that leverages diffusion denoising priors to dynamically compress feature redundancies, achieving 1.55× acceleration with minimal quality loss across various DiT architectures.
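
Token merging in general can be sketched as folding the most mutually similar tokens into their nearest neighbors and averaging. The sketch below is a simplified stand-in: the merge ratio is fixed rather than scheduled by the structure-then-detail priors described in the paper.

```python
import torch

def merge_similar_tokens(x: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """x: (N, C) tokens. Merge away a `ratio` fraction of tokens by averaging each one
    into its most similar partner. A simplified stand-in for stage-aware token merging."""
    n = x.size(0)
    r = int(n * ratio)
    xn = torch.nn.functional.normalize(x, dim=-1)
    sim = xn @ xn.t()
    sim.fill_diagonal_(-1.0)
    best_sim, best_idx = sim.max(dim=-1)       # most similar partner for every token
    merge_src = best_sim.topk(r).indices       # tokens with the closest partners get merged away
    keep = torch.ones(n, dtype=torch.bool)
    keep[merge_src] = False
    out = x.clone()
    for s in merge_src.tolist():               # fold each merged token into its partner
        out[best_idx[s]] = (out[best_idx[s]] + x[s]) / 2
    return out[keep]

tokens = torch.randn(16, 64)
print(merge_similar_tokens(tokens).shape)      # torch.Size([12, 64])
```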

Authors:Jae Myung Kim, Stephan Alaniz, Cordelia Schmid, Zeynep Akata
Title: LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance
Abstract:
Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement. The code is available at https://github.com/ExplainableML/LoFT.
中文: LoFT框架通过在单个真实图像上微调LoRA权重并在推理时融合它们,提升了合成数据集的多样性和保真度,从而在模型训练中持续优于其他方法。
English: The LoFT framework enhances synthetic dataset generation by fine-tuning LoRA weights on individual real images and fusing them during inference, resulting in improved diversity and fidelity that consistently outperforms other methods in downstream model training.
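
The LoRA-fusion step can be sketched as adding a weighted sum of per-image low-rank updates B_i A_i to a base weight at inference time; the shapes and uniform fusion weights below are assumptions.

```python
import torch

def fuse_lora(base_weight, lora_pairs, alphas):
    """base_weight: (out, in). lora_pairs: list of (A, B) with A: (r, in), B: (out, r).
    Returns the base weight plus a weighted sum of low-rank updates."""
    delta = sum(a * (B @ A) for a, (A, B) in zip(alphas, lora_pairs))
    return base_weight + delta

W = torch.randn(64, 32)
loras = [(torch.randn(4, 32), torch.randn(64, 4)) for _ in range(3)]  # one adapter per real image
W_fused = fuse_lora(W, loras, alphas=[1 / 3, 1 / 3, 1 / 3])
print(W_fused.shape)   # torch.Size([64, 32])
```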

Authors:Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
Title: SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
Abstract:
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
中文: 本研究通过利用Blackwell GPU的FP4张量核心将注意力推理速度提升5倍,并开创性地将8位注意力应用于训练任务,在微调中实现无损性能,但预训练收敛较慢。
English: This study enhances attention efficiency by utilizing FP4 Tensor Cores in Blackwell GPUs for a 5x inference speedup and pioneers 8-bit attention for training, achieving lossless fine-tuning results despite slower pretraining convergence.
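
The low-bit attention idea can be illustrated with symmetric per-tensor INT8 quantization of Q and K before the QK^T product, dequantizing with the product of the scales. Real kernels use per-block scaling, FP4 tensor cores, and fused softmax, all of which this toy omits (integer accumulation is emulated in float here).

```python
# Toy symmetric per-tensor INT8 quantization of Q/K before the QK^T product.
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().amax() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (heads, seq, dim). Quantize, multiply, then dequantize with the scale product."""
    q8, sq = quantize_int8(q)
    k8, sk = quantize_int8(k)
    # Real kernels accumulate the int8 product in int32 on tensor cores; emulated in float here.
    scores = (q8.float() @ k8.float().transpose(-1, -2)) * (sq * sk)
    return scores / q.size(-1) ** 0.5

q, k = torch.randn(2, 8, 64), torch.randn(2, 8, 64)
print(int8_attention_scores(q, k).shape)  # torch.Size([2, 8, 8])
```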

Authors:Jiacheng Hou, Zhenjie Song, Ercan Engin Kuruoglu
Title: BrainNetMLP: An Efficient and Effective Baseline for Functional Brain Network Classification
Abstract:
Recent studies have made great progress in functional brain network classification by modeling the brain as a network of Regions of Interest (ROIs) and leveraging their connections to understand brain functionality and diagnose mental disorders. Various deep learning architectures, including Convolutional Neural Networks, Graph Neural Networks, and the recent Transformer, have been developed. However, despite the increasing complexity of these models, the performance gain has not been as salient. This raises a question: Does increasing model complexity necessarily lead to higher classification accuracy? In this paper, we revisit the simplest deep learning architecture, the Multi-Layer Perceptron (MLP), and propose a pure MLP-based method, named BrainNetMLP, for functional brain network classification, which capitalizes on the advantages of MLP, including efficient computation and fewer parameters. Moreover, BrainNetMLP incorporates a dual-branch structure to jointly capture both spatial connectivity and spectral information, enabling precise spatiotemporal feature fusion. We evaluate our proposed BrainNetMLP on two public and popular brain network classification datasets, the Human Connectome Project (HCP) and the Autism Brain Imaging Data Exchange (ABIDE). Experimental results demonstrate pure MLP-based methods can achieve state-of-the-art performance, revealing the potential of MLP-based models as more efficient yet effective alternatives in functional brain network classification. The code will be available at https://github.com/JayceonHo/BrainNetMLP.
中文: 本文提出BrainNetMLP,一种基于多层感知机的双分支结构方法,能同时捕捉空间连接和频谱信息,在功能脑网络分类中实现最优性能,同时展现出高效性和有效性。
English: This paper introduces BrainNetMLP, a pure Multi-Layer Perceptron-based method with a dual-branch structure that captures both spatial connectivity and spectral information, achieving state-of-the-art performance in functional brain network classification while demonstrating efficiency and effectiveness.
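
A hedged sketch of a dual-branch MLP over an ROI connectivity matrix: one branch reads the flattened spatial connectivity, the other a spectral view (eigenvalues are used here as a simple stand-in), and their features are concatenated for classification. The layer sizes and the choice of spectral descriptor are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualBranchMLP(nn.Module):
    """Toy dual-branch MLP over an ROI x ROI functional connectivity matrix."""
    def __init__(self, n_roi: int = 100, hidden: int = 256, n_classes: int = 2):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(n_roi * n_roi, hidden), nn.ReLU())
        self.spectral = nn.Sequential(nn.Linear(n_roi, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, conn: torch.Tensor) -> torch.Tensor:
        # conn: (B, n_roi, n_roi) symmetric connectivity matrices.
        spatial_feat = self.spatial(conn.flatten(1))
        eigvals = torch.linalg.eigvalsh(conn)        # simple spectral descriptor
        spectral_feat = self.spectral(eigvals)
        return self.head(torch.cat([spatial_feat, spectral_feat], dim=-1))

logits = DualBranchMLP()(torch.randn(4, 100, 100))
```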

Authors:Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, Zhe Gan
Title: GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
Abstract:
Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
Chinese: 本研究提出了GIE-Bench这一新基准,通过功能正确性和图像内容保持度指标评估文本引导图像编辑模型,发现GPT-Image-1虽在指令遵循上表现优异,但容易过度修改非目标区域。
English: This work introduces GIE-Bench, a new benchmark for evaluating text-guided image editing models through functional correctness and image content preservation metrics, revealing that while GPT-Image-1 excels in instruction accuracy, it tends to over-modify non-targeted areas.
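
The content-preservation side of the evaluation can be sketched by masking out the edited object and scoring only the remaining pixels; the masked PSNR below is an illustrative stand-in, not necessarily the benchmark's exact preservation score.

```python
import torch

def masked_preservation_score(src, edited, object_mask):
    """src, edited: (C, H, W) images in [0, 1]; object_mask: (H, W) bool, True on the edited object.
    Compares only non-target pixels as a stand-in for the preservation metric."""
    keep = ~object_mask
    diff = (src - edited).abs()[:, keep]
    mse = (diff ** 2).mean()
    psnr = 10 * torch.log10(1.0 / (mse + 1e-12))
    return psnr.item()

src, edited = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
mask = torch.zeros(64, 64, dtype=torch.bool)
mask[20:40, 20:40] = True   # hypothetical object-aware mask of the edited region
print(masked_preservation_score(src, edited, mask))
```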

Authors:Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, Bo Dai
Title: PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment
Abstract:
Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers such as rational global layout, physics-plausible contacts and visual effects like shadows and reflections while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered-images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also spatial and visual interactions among layers for global coherence.
Chinese: PSDiffusion是一种统一的扩散框架,通过单次前向处理同时生成多层图像,既保证各图层的高质量,又确保图层间的全局交互协调一致。
English: PSDiffusion is a unified diffusion framework that simultaneously generates multi-layer images with a single feed-forward process, ensuring high-quality individual layers and coherent global interactions among them.

Authors:Bohao Xing, Xin Liu, Guoying Zhao, Chengyu Liu, Xiaolan Fu, Heikki Kälviäinen
Title: EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
Abstract:
Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two dimensions: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs. By evaluating 38 LLMs and MLLMs on EmotionHallucer, we reveal that: i) most current models exhibit substantial issues with emotion hallucinations; ii) closed-source models outperform open-source ones in detecting emotion hallucinations, and reasoning capability provides additional advantages; iii) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available at https://github.com/xxtars/EmotionHallucer.
中文: 本文提出了首个检测多模态大语言模型中情绪幻觉的基准EmotionHallucer,揭示了当前模型普遍存在该问题,并提出的框架使检测性能平均提升9.90%。
English: This paper introduces EmotionHallucer, the first benchmark for detecting emotion hallucinations in Multimodal Large Language Models, revealing widespread issues and proposing a framework that improves detection by 9.90%.

Authors:Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Title: Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner
Abstract:
Recent advances in vision language models (VLMs) have enabled broad progress in the general medical field. However, pathology still remains a more challenging subdomain, with current pathology specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. Such shortcomings are largely attributable to the nature of current pathology datasets, which are primarily composed of image description pairs that lack the depth and structured diagnostic paradigms employed by real world pathologists. In this study, we leverage pathology textbooks and real world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples for reasoning incentivizing; (3) reinforcement learning using Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization strategies for multimodal reasoning quality refinement. To further assess the alignment quality of our dataset, we propose Patho-CLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both Patho-CLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, Visual Question Answering, and Multiple Choice Question. Our project is available at the Patho-R1 repository: https://github.com/Wenchuan-Zhang/Patho-R1.
中文摘要:本研究提出Patho-R1,这是一种基于病理学教材和专家知识构建的高质量数据集、通过三阶段训练流程开发的多模态病理推理模型,在多种病理任务中展现出卓越性能。
English Summary: This study introduces Patho-R1, a multimodal pathology reasoning model trained using a three-stage pipeline on high-quality datasets developed with pathology textbooks and expert input, achieving robust performance across various pathology tasks.
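
The group-relative element of GRPO can be sketched as standardizing each response's reward within its sampled group to obtain advantages; the clipping and KL terms of the full objective are omitted here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses.
    The advantage of each response is its reward standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.9, 0.4, 0.4], [1.0, 0.1, 0.6, 0.3]])
print(group_relative_advantages(rewards))
```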

Authors:Keunwoo Peter Yu, Joyce Chai
Title: Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
Abstract:
Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings -- perceptual updating and contingency awareness -- and propose a new benchmark task, Temporally-Grounded Language Generation (TGLG), to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, TRACE, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI), a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest -- highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data are available at https://github.com/yukw777/tglg.
中文摘要:该研究提出了时序语言生成(TGLG)基准任务,用于评估视觉语言模型在实时交互环境中的表现,并开发了时间同步交错模型(VLM-TSI),其性能虽优于基线但仍凸显了实现语义准确性与时间精准性协同的持续挑战。
English Summary: The study introduces Temporally-Grounded Language Generation (TGLG), a benchmark task for evaluating vision-language models in real-time interactive settings, proposing a new model (VLM-TSI) that outperforms baselines but highlights the ongoing challenges in achieving precise semantic and temporal alignment.

Authors:Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang
Title: DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
Abstract:
Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns-highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet generation benchmarks, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, experimental results on MS-COCO demonstrate that the purely convolutional DiCo exhibits strong potential for text-to-image generation. Code: https://github.com/shallowdream204/DiCo.
Chinese: Diffusion ConvNet(DiCo)模型通过采用带有紧凑通道注意力机制的增强卷积网络替代扩散变换器中计算成本高昂的全局自注意力,在图像生成任务中实现了卓越的生成性能和显著的效率提升。
English: The Diffusion ConvNet (DiCo) model replaces the computationally expensive global self-attention in Diffusion Transformers with an enhanced convolutional network featuring a compact channel attention mechanism, achieving superior generative performance and significant efficiency gains in image generation tasks.
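
The compact channel attention can be sketched as a squeeze-and-excitation style gate that rescales channels from globally pooled statistics; whether DiCo's module matches this exact form is an assumption.

```python
import torch
import torch.nn as nn

class CompactChannelAttention(nn.Module):
    """SE-style gate: pool spatially, pass through a small bottleneck MLP, rescale channels."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.SiLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        gate = self.fc(x.mean(dim=(2, 3)))                 # (B, C) channel weights
        return x * gate[:, :, None, None]

y = CompactChannelAttention(64)(torch.randn(2, 64, 32, 32))
```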

Authors:Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Title: One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework
Abstract:
Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods heavily rely on manually crafted text prompts, making it challenging to achieve a high erasure (efficacy) while minimizing the impact on other benign concepts (usability). In this paper, we attribute the limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution by directly integrating visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generating probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing outperforms state-of-the-art erasure approaches significantly with a better trade-off between efficacy and usability. Codes are available at https://github.com/Ferry-Li/Co-Erasing.
中文摘要:Co-Erasing框架通过结合文本提示与视觉监督,采用负向引导和概念精炼策略,在保持良性概念可用性的同时有效消除文本到图像模型中的不良概念,实现了效能与可用性的最佳平衡。
English Summary: The proposed Co-Erasing framework integrates visual supervision with text prompts to effectively remove undesirable concepts from text-to-image models while minimizing impact on benign concepts, achieving superior performance through negative guidance and concept refinement.
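
The negative-guidance idea can be sketched as steering the denoiser away from the erased concept's direction by subtracting a scaled concept term. The sign convention and scale below are illustrative, and the actual method applies this during fine-tuning with image-derived concept descriptions rather than at sampling time.

```python
import torch

def negative_guidance(eps_uncond, eps_concept, scale: float = 3.0):
    """eps_uncond: denoiser output without the target concept;
    eps_concept: output conditioned on the concept to erase.
    Pushes the prediction away from the concept direction (sign flipped relative to CFG)."""
    return eps_uncond - scale * (eps_concept - eps_uncond)

eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
guided = negative_guidance(eps_u, eps_c)
```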

Authors:Bin Liu, Chunyang Wang, Xuelian Liu, Bo Xiao, Guan Xi
Title: HyMamba: Mamba with Hybrid Geometry-Feature Coupling for Efficient Point Cloud Classification
Abstract:
Point cloud classification is one of the essential technologies for achieving intelligent perception of 3D environments by machines; its core challenge is to efficiently extract local and global features. Mamba leverages state space models (SSMs) for global point cloud modeling. Although prior Mamba-based point cloud processing methods pay attention to the limitation of the flattened sequence modeling mechanism in fusing local and global features, the critical issue of weakened local geometric relevance, caused by decoupling geometric structures from features in the input patches, remains not fully revealed; together, these issues limit local feature extraction. Therefore, we propose HyMamba, a geometry and feature coupled Mamba framework featuring: (1) Geometry-Feature Coupled Pooling (GFCP), which achieves physically interpretable geometric information coupling by dynamically aggregating adjacent geometric information into local features; (2) Collaborative Feature Enhancer (CoFE), which enhances sparse signal capture through cross-path feature hybridization while effectively integrating global and local contexts. We conducted extensive experiments on the ModelNet40 and ScanObjectNN datasets. The results demonstrate that the proposed model achieves superior classification performance, particularly on ModelNet40, where it elevates accuracy to 95.99% with merely 0.03M additional parameters. Furthermore, it attains 98.9% accuracy on the ModelNetFewShot dataset, validating its robust generalization capabilities under sparse samples. Our code and weights are available at https://github.com/L1277471578/HyMamba.
中文摘要:HyMamba通过几何-特征耦合框架的GFCP和CoFE模块,强化局部几何关联与特征融合,在ModelNet40上以仅0.03M参数增量实现95.99%的分类精度突破。
English Summary: HyMamba introduces a geometry-feature coupled framework with GFCP and CoFE modules to enhance local geometric relevance and feature integration, achieving state-of-the-art classification accuracy of 95.99% on ModelNet40 with minimal parameter increase.

Authors:Yuran Wang, Ruihai Wu, Yue Chen, Jiarui Wang, Jiaqi Liang, Ziyu Zhu, Haoran Geng, Jitendra Malik, Pieter Abbeel, Hao Dong
Title: DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy
Abstract:
Garment manipulation is a critical challenge due to the diversity in garment categories, geometries, and deformations. Despite this, humans can effortlessly handle garments, thanks to the dexterity of our hands. However, existing research in the field has struggled to replicate this level of dexterity, primarily hindered by the lack of realistic simulations of dexterous garment manipulation. Therefore, we propose DexGarmentLab, the first environment specifically designed for dexterous (especially bimanual) garment manipulation, which features large-scale high-quality 3D assets for 15 task scenarios, and refines simulation techniques tailored for garment modeling to reduce the sim-to-real gap. Previous data collection typically relies on teleoperation or training expert reinforcement learning (RL) policies, which are labor-intensive and inefficient. In this paper, we leverage garment structural correspondence to automatically generate a dataset with diverse trajectories using only a single expert demonstration, significantly reducing manual intervention. However, even extensive demonstrations cannot cover the infinite states of garments, which necessitates the exploration of new algorithms. To improve generalization across diverse garment shapes and deformations, we propose a Hierarchical gArment-manipuLation pOlicy (HALO). It first identifies transferable affordance points to accurately locate the manipulation area, then generates generalizable trajectories to complete the task. Through extensive experiments and detailed analysis of our method and baseline, we demonstrate that HALO consistently outperforms existing methods, successfully generalizing to previously unseen instances even with significant variations in shape and deformation where others fail. Our project page is available at: https://wayrise.github.io/DexGarmentLab/.
中文: DexGarmentLab首创了专用于灵巧双手衣物操作的环境,通过改进模拟技术和提出分层操作策略HALO,利用可迁移的功用点定位和泛化轨迹生成,在未接触过的衣物形状与变形场景中显著优于现有方法。
English: DexGarmentLab introduces the first specialized environment for dexterous bimanual garment manipulation, featuring refined simulations and a hierarchical policy (HALO) that outperforms existing methods by generalizing to unseen garment variations through transferable affordance points and trajectories.

Authors:An-Lan Wang, Jingqun Tang, Liao Lei, Hao Feng, Qi Liu, Xiang Fei, Jinghui Lu, Han Wang, Weiwei Liu, Hao Liu, Yuliang Liu, Xiang Bai, Can Huang
Title: WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
Abstract:
The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding. Our project homepage is available at https://bytedance.github.io/WildDoc.
中文摘要:本文提出了首个针对自然环境下文档理解评估的基准WildDoc,揭示了当前最先进的多模态模型在面对真实世界光照变化和物理变形等挑战时出现的显著性能下降问题。
English Summary: This paper introduces WildDoc, the first benchmark for evaluating document understanding in natural environments, revealing significant performance drops in state-of-the-art multimodal models when faced with real-world challenges like variable lighting and physical distortions.

Authors:Zongye Zhang, Bohan Kong, Qingjie Liu, Yunhong Wang
Title: Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion
Abstract:
Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence. The code is available at: https://github.com/zzysteve/MoMADiff
中文摘要:提出的MoMADiff框架结合掩码建模与扩散过程,通过帧级连续表示生成三维人体运动,相比现有方法在运动质量、文本遵循和关键帧控制方面表现更优。
English Summary: The proposed MoMADiff framework combines masked modeling with diffusion processes to generate 3D human motions from text, offering superior generalization and precise keyframe control compared to existing methods.

Authors:Bo Du, Xuekang Zhu, Xiaochen Ma, Chenfan Qu, Kaiwen Feng, Zhe Yang, Chi-Man Pun, Jian Liu, Jizhe Zhou
Title: ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization
Abstract:
The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL remains blank. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations on dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts in-depth analysis based on the ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.
中文摘要:ForensicHub作为首个全领域伪造图像检测与定位的统一基准,通过模块化架构整合四大检测领域,实现了跨领域模型比较与评估,为打破领域壁垒提供了关键解决方案。
English Summary: The ForensicHub benchmark addresses fragmentation in Fake Image Detection and Localization by unifying four domains through a modular architecture, implementing multiple baselines and benchmarks while providing key insights to advance the field.
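
The configuration-driven, interchangeable-component design can be pictured with a small registry that maps names in a config to dataset/model/evaluator constructors, so combinations across domains can be composed without code changes. The registry and class names below are illustrative, not the repository's actual API.

```python
# Toy registry-based pipeline composition; the real ForensicHub API may differ.
REGISTRY = {"dataset": {}, "model": {}, "evaluator": {}}

def register(kind, name):
    def deco(cls):
        REGISTRY[kind][name] = cls
        return cls
    return deco

@register("model", "simple_cnn")
class SimpleCNN:
    def __init__(self, channels=32):
        self.channels = channels

@register("evaluator", "pixel_f1")
class PixelF1:
    pass

def build(config):
    """config maps component kind -> {'name': registered_name, **kwargs}."""
    parts = {}
    for kind, spec in config.items():
        spec = dict(spec)                       # avoid mutating the caller's config
        cls = REGISTRY[kind][spec.pop("name")]
        parts[kind] = cls(**spec)
    return parts

pipeline = build({"model": {"name": "simple_cnn", "channels": 64},
                  "evaluator": {"name": "pixel_f1"}})
```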

Authors:Yunkang Cao, Yuqi Cheng, Xiaohao Xu, Yiheng Zhang, Yihan Sun, Yuxiang Tan, Yuxin Zhang, Xiaonan Huang, Weiming Shen
Title: Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark
Abstract:
The practical deployment of Visual Anomaly Detection (VAD) systems is hindered by their sensitivity to real-world imaging variations, particularly the complex interplay between viewpoint and illumination which drastically alters defect visibility. Current benchmarks largely overlook this critical challenge. We introduce Multi-View Multi-Illumination Anomaly Detection (M2AD), a new large-scale benchmark comprising 119,880 high-resolution images designed explicitly to probe VAD robustness under such interacting conditions. By systematically capturing 999 specimens across 10 categories using 12 synchronized views and 10 illumination settings (120 configurations total), M2AD enables rigorous evaluation. We establish two evaluation protocols: M2AD-Synergy tests the ability to fuse information across diverse configurations, and M2AD-Invariant measures single-image robustness against realistic view-illumination effects. Our extensive benchmarking shows that state-of-the-art VAD methods struggle significantly on M2AD, demonstrating the profound challenge posed by view-illumination interplay. This benchmark serves as an essential tool for developing and validating VAD methods capable of overcoming real-world complexities. Our full dataset and test suite will be released at https://hustcyq.github.io/M2AD to facilitate the field.
Chinese: M2AD基准通过12种视角和10种光照组合生成的119,880张高清图像,首次系统揭示了视觉异常检测方法在真实成像变化下的脆弱性,为攻克视角-光照交互难题提供了关键评估工具。
English: The M2AD benchmark addresses the critical gap in visual anomaly detection by introducing 119,880 high-resolution images across 120 view-illumination configurations, revealing significant performance drops in state-of-the-art methods under realistic imaging variations.

Authors:Chao Wang, Wei Lu, Xiang Li, Jian Yang, Lei Luo
Title: M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection
Abstract:
Single-source remote sensing object detection using optical or SAR images struggles in complex environments. Optical images offer rich textural details but are often affected by low-light, cloud-obscured, or low-resolution conditions, reducing the detection performance. SAR images are robust to weather, but suffer from speckle noise and limited semantic expressiveness. Optical and SAR images provide complementary advantages, and fusing them can significantly improve the detection accuracy. However, progress in this field is hindered by the lack of large-scale, standardized datasets. To address these challenges, we propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). It contains 112,184 precisely aligned image pairs and nearly one million labeled instances with arbitrary orientations, spanning six key categories. To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods. Furthermore, we propose E2E-OSDet, a novel end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and establishes a robust baseline for future studies. Extensive experiments on M4-SAR demonstrate that fusing optical and SAR data can improve mAP by 5.7% over single-source inputs, with particularly significant gains in complex environments. The dataset and code are publicly available at https://github.com/wchao0601/M4-SAR.
中文: 提出的M4-SAR数据集和E2E-OSDet框架通过光学与SAR图像融合解决了单源遥感在复杂环境中的局限性,使检测精度在复杂环境下提升了5.7% mAP。
English: The proposed M4-SAR dataset and E2E-OSDet framework address the limitations of single-source remote sensing by enabling optical-SAR fusion, improving detection accuracy by 5.7% mAP in complex environments.

Authors:Simeon Adebola, Shuangyu Xie, Chung Min Kim, Justin Kerr, Bart M. van Marrewijk, Mieke van Vlaardingen, Tim van Daalen, E. N. van Loo, Jose Luis Susa Rincon, Eugen Solowjow, Rick van de Zedde, Ken Goldberg
Title: GrowSplat: Constructing Temporal Digital Twins of Plants with Gaussian Splats
Abstract:
Accurate temporal reconstructions of plant growth are essential for plant phenotyping and breeding, yet remain challenging due to complex geometries, occlusions, and non-rigid deformations of plants. We present a novel framework for building temporal digital twins of plants by combining 3D Gaussian Splatting with a robust sample alignment pipeline. Our method begins by reconstructing Gaussian Splats from multi-view camera data, then leverages a two-stage registration approach: coarse alignment through feature-based matching and Fast Global Registration, followed by fine alignment with Iterative Closest Point. This pipeline yields a consistent 4D model of plant development in discrete time steps. We evaluate the approach on data from the Netherlands Plant Eco-phenotyping Center, demonstrating detailed temporal reconstructions of Sequoia and Quinoa species. Videos and Images can be seen at https://berkeleyautomation.github.io/GrowSplat/

Authors:Saad Manzur, Bryan Vela, Brandon Vela, Aditya Agrawal, Lan-Anh Dang-Vu, David Li, Wayne Hayes
Title: PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation via Pose Lifting Networks
Abstract:
Reliable three-dimensional human pose estimation (3D HPE) remains challenging due to the differences in viewpoints, environments, and camera conventions among datasets. As a result, methods that achieve near-optimal in-dataset accuracy often degrade on unseen datasets. In practice, however, systems must adapt to diverse viewpoints, environments, and camera setups--conditions that differ significantly from those encountered during training. Measuring cross-dataset performance is a vital process, but extremely labor-intensive when done manually for human pose estimation. To address these challenges, we automate this evaluation using PoseBench3D, a standardized testing framework that enables consistent and fair cross-dataset comparisons on previously unseen data. PoseBench3D streamlines testing across four widely used 3D HPE datasets via a single, configurable interface. Using this framework, we re-evaluate 18 methods and report over 100 cross-dataset results under Protocol 1: MPJPE and Protocol 2: PA-MPJPE, revealing systematic generalization gaps and the impact of common preprocessing and dataset setup choices. The PoseBench3D code is found at: https://github.com/bryanjvela/PoseBench3D
中文: PoseBench3D是一个标准化框架,通过自动化三维人体姿态估计的跨数据集评估,在统一协议下测试了18种方法,揭示了系统性泛化差距。
English: PoseBench3D is a standardized framework that automates cross-dataset evaluation for 3D human pose estimation, revealing generalization gaps by testing 18 methods across four datasets under consistent protocols.
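
The two protocols correspond to standard pose metrics: MPJPE is the mean per-joint Euclidean error (typically after root alignment), and PA-MPJPE first fits a Procrustes similarity transform of the prediction onto the ground truth. A conventional implementation of both is sketched below; units and root alignment are assumed to be handled by the caller.

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (J, 3) joint positions (assumed already root-aligned)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: optimal similarity transform of pred onto gt first."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:             # avoid reflections
        Vt[-1] *= -1
        s[-1] *= -1
        R = (U @ Vt).T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

pred, gt = np.random.randn(17, 3), np.random.randn(17, 3)
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```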

Authors:Alexey Magay, Dhurba Tripathi, Yu Hao, Yi Fang
Title: A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision
Abstract:
People with blindness and low vision (pBLV) face significant challenges, struggling to navigate environments and locate objects due to limited visual cues. Spatial reasoning is crucial for these individuals, as it enables them to understand and interpret the spatial relationships in their surroundings, enhancing their ability to navigate and interact more safely and independently. Current multimodal large language models (MLLMs) for low vision people lack the spatial reasoning capabilities needed to effectively assist in these tasks. Moreover, there is a notable absence of lightweight, easy-to-use systems that allow pBLV to effectively perceive and interact with their surrounding environment. In this paper, we propose a novel spatially enhanced multimodal large language model based approach for visually impaired individuals. By fine-tuning the MLLM to incorporate spatial reasoning capabilities, our method significantly improves the understanding of environmental context, which is critical for navigation and object recognition. The innovation extends to a hardware component, designed as an attachment for glasses, ensuring increased accessibility and ease of use. This integration leverages advanced VLMs to interpret visual data and provide real-time, spatially aware feedback to the user. Our approach aims to bridge the gap between advanced machine learning models and practical, user-friendly assistive devices, offering a robust solution for visually impaired users to navigate their surroundings more effectively and independently. The paper includes an in-depth evaluation using the VizWiz dataset. Additionally, we design a comprehensive dataset to evaluate our method's effectiveness in real-world situations, demonstrating substantial improvements in accuracy and user experience.

Authors:Filippo Leveni, Luca Magri, Cesare Alippi, Giacomo Boracchi
Title: Hashing for Structure-based Anomaly Detection
Abstract:
We focus on the problem of identifying samples in a set that do not conform to structured patterns represented by low-dimensional manifolds. An effective way to solve this problem is to embed data in a high dimensional space, called Preference Space, where anomalies can be identified as the most isolated points. In this work, we employ Locality Sensitive Hashing to avoid explicit computation of distances in high dimensions and thus improve Anomaly Detection efficiency. Specifically, we present an isolation-based anomaly detection technique designed to work in the Preference Space which achieves state-of-the-art performance at a lower computational cost. Code is publicly available at https://github.com/ineveLoppiliF/Hashing-for-Structure-based-Anomaly-Detection.
中文摘要:本研究提出一种高效异常检测方法,通过将数据嵌入高维偏好空间并采用局部敏感哈希技术,在保持最优性能的同时显著降低了计算成本。
English Summary: This study introduces an efficient anomaly detection method that identifies outliers by embedding data into a high-dimensional Preference Space and using Locality Sensitive Hashing to reduce computational costs while maintaining state-of-the-art performance.
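A minimal sketch of the isolation-by-hashing idea described above, assuming a random-hyperplane LSH and a bucket-occupancy score; the hash family, scoring rule, and toy data are illustrative assumptions, not the paper's exact isolation procedure:

```python
# Sketch: points falling into sparsely populated LSH buckets are treated as more
# isolated, and hence more anomalous. Several hash tables are averaged.
import numpy as np
from collections import Counter

def lsh_isolation_scores(X, n_tables=10, n_bits=8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_tables):
        planes = rng.normal(size=(d, n_bits))          # random hyperplanes
        codes = (X @ planes > 0).astype(np.uint8)      # binary hash code per point
        keys = [c.tobytes() for c in codes]
        counts = Counter(keys)                         # bucket occupancy
        scores += np.array([1.0 / counts[k] for k in keys])  # sparse bucket -> high score
    return scores / n_tables

# Toy example: inliers on a 1D manifold (a line) plus a few off-manifold outliers.
t = np.linspace(0, 1, 200)
inliers = np.stack([t, 2 * t], axis=1) + 0.01 * np.random.default_rng(1).normal(size=(200, 2))
outliers = np.random.default_rng(2).uniform(-1, 3, size=(5, 2))
X = np.vstack([inliers, outliers])
print(lsh_isolation_scores(X)[-5:])   # the appended outliers should score comparatively high
```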

Authors:Kaifa Yang, Qi Yang, Zhu Li, Yiling Xu
Title: Textured mesh Quality Assessment using Geometry and Color Field Similarity
Abstract:
Textured mesh quality assessment (TMQA) is critical for various 3D mesh applications. However, existing TMQA methods often struggle to provide accurate and robust evaluations. Motivated by the effectiveness of fields in representing both 3D geometry and color information, we propose a novel point-based TMQA method called field mesh quality metric (FMQM). FMQM utilizes signed distance fields and a newly proposed color field named nearest surface point color field to realize effective mesh feature description. Four features related to visual perception are extracted from the geometry and color fields: geometry similarity, geometry gradient similarity, space color distribution similarity, and space color gradient similarity. Experimental results on three benchmark datasets demonstrate that FMQM outperforms state-of-the-art (SOTA) TMQA metrics. Furthermore, FMQM exhibits low computational complexity, making it a practical and efficient solution for real-world applications in 3D graphics and visualization. Our code is publicly available at: https://github.com/yyyykf/FMQM.
中文: 提出的场网格质量度量(FMQM)利用符号距离场和颜色场提取四种视觉感知特征,在纹理网格质量评估中优于现有最优方法,兼具高精度和低计算复杂度。
English: The proposed Field Mesh Quality Metric (FMQM) leverages signed distance and color fields to extract four visual perception features, outperforming state-of-the-art methods in textured mesh quality assessment with high accuracy and low computational complexity.
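A hedged sketch of a field-similarity comparison in the spirit of the abstract above, assuming signed distance fields have already been sampled at shared query points; the SSIM-style ratio and the synthetic samples are illustrative, not FMQM's actual feature definitions:

```python
import numpy as np

def field_similarity(f_ref, f_dist, c=1e-6):
    """Mean pointwise similarity between two scalar fields sampled at the same query points."""
    sim = (2.0 * f_ref * f_dist + c) / (f_ref**2 + f_dist**2 + c)
    return float(np.mean(sim))

# Toy usage with synthetic SDF samples standing in for reference/distorted meshes.
rng = np.random.default_rng(0)
sdf_ref = rng.normal(size=10000)
sdf_dist = sdf_ref + 0.05 * rng.normal(size=10000)    # mild geometric distortion
print(field_similarity(sdf_ref, sdf_dist))            # close to 1.0 for mild distortion
```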

Authors:Manyu Li, Ruian He, Zixian Zhang, Weimin Tan, Bo Yan
Title: Unifying Segment Anything in Microscopy with Multimodal Large Language Model
Abstract:
Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have currently achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute the deficiency to a lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose using MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to prompt SAM. Our method achieves performance improvements of 7.71% in Dice and 12.10% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 6.79% in Dice and 10.08% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.
Chinese: 本文提出uLLSAM方法,通过将多模态大语言模型的视觉语言知识融入Segment Anything模型,显著提升了其在生物医学图像分割中的准确性和跨领域泛化能力。
English: This paper introduces uLLSAM, a method that enhances the Segment Anything Model (SAM) by integrating vision-language knowledge from Multimodal Large Language Models, significantly improving its accuracy and generalization for biomedical image segmentation across both in-domain and out-of-domain datasets.

Authors:NingFeng Que, Xiaofei Wang, Jingjing Chen, Yixuan Jiang, Chao Li
Title: Adaptive Spatial Transcriptomics Interpolation via Cross-modal Cross-slice Modeling
Abstract:
Spatial transcriptomics (ST) is a promising technique that characterizes the spatial gene profiling patterns within the tissue context. Comprehensive ST analysis depends on consecutive slices for 3D spatial insights, whereas the missing intermediate tissue sections and high costs limit the practical feasibility of generating multi-slice ST. In this paper, we propose C2-STi, the first attempt for interpolating missing ST slices at arbitrary intermediate positions between adjacent ST slices. Though intuitive, effective ST interpolation presents significant challenges, including 1) limited continuity across heterogeneous tissue sections, 2) complex intrinsic correlation across genes, and 3) intricate cellular structures and biological semantics within each tissue section. To mitigate these challenges, in C2-STi, we design 1) a distance-aware local structural modulation module to adaptively capture cross-slice deformations and enhance positional correlations between ST slices, 2) a pyramid gene co-expression correlation module to capture multi-scale biological associations among genes, and 3) a cross-modal alignment module that integrates the ST-paired hematoxylin and eosin (H&E)-stained images to filter and align the essential cellular features across ST and H&E images. Extensive experiments on the public dataset demonstrate our superiority over state-of-the-art approaches on both single-slice and multi-slice ST interpolation. Codes are available at https://github.com/XiaofeiWang2018/C2-STi.
中文: 本文提出C2-STi方法,通过设计跨切片变形感知、基因共表达关联和多模态对齐模块,有效解决空间转录组切片插值中的组织异质性等难题,在实验中展现出优越性能。
English: This paper introduces C2-STi, a novel method for interpolating missing spatial transcriptomics slices by addressing challenges such as tissue heterogeneity and gene correlations through specialized modules, demonstrating superior performance in experiments.

Authors:Sayed Mehedi Azim, Brian Corbett, Iman Dehzangi
Title: ROIsGAN: A Region Guided Generative Adversarial Framework for Murine Hippocampal Subregion Segmentation
Abstract:
The hippocampus, a critical brain structure involved in memory processing and various neurodegenerative and psychiatric disorders, comprises three key subregions: the dentate gyrus (DG), Cornu Ammonis 1 (CA1), and Cornu Ammonis 3 (CA3). Accurate segmentation of these subregions from histological tissue images is essential for advancing our understanding of disease mechanisms, developmental dynamics, and therapeutic interventions. However, no existing methods address the automated segmentation of hippocampal subregions from tissue images, particularly from immunohistochemistry (IHC) images. To bridge this gap, we introduce a novel set of four comprehensive murine hippocampal IHC datasets featuring distinct staining modalities: cFos, NeuN, and multiplexed stains combining cFos, NeuN, and either ΔFosB or GAD67, capturing structural, neuronal activity, and plasticity associated information. Additionally, we propose ROIsGAN, a region-guided U-Net-based generative adversarial network tailored for hippocampal subregion segmentation. By leveraging adversarial learning, ROIsGAN enhances boundary delineation and structural detail refinement through a novel region-guided discriminator loss combining Dice and binary cross-entropy loss. Evaluated across DG, CA1, and CA3 subregions, ROIsGAN consistently outperforms conventional segmentation models, achieving performance gains ranging from 1-10% in Dice score and up to 11% in Intersection over Union (IoU), particularly under challenging staining conditions. Our work establishes foundational datasets and methods for automated hippocampal segmentation, enabling scalable, high-precision analysis of tissue images in neuroscience research. Our generated datasets, proposed model as a standalone tool, and its corresponding source code are publicly available at: https://github.com/MehediAzim/ROIsGAN
中文: 本研究提出了ROIsGAN这一新型生成对抗网络及四个海马体免疫组化数据集,实现了海马体亚区的自动分割,相比现有方法性能显著提升,为神经科学研究提供了基础资源。
English: This study introduces ROIsGAN, a novel generative adversarial network, along with four comprehensive hippocampal immunohistochemistry datasets to automate the segmentation of hippocampal subregions, achieving significant performance improvements over existing methods and providing foundational resources for neuroscience research.
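A small sketch of the Dice plus binary cross-entropy combination that the region-guided discriminator loss builds on; the tensor shapes and equal weighting are assumptions, and applying it to discriminator outputs over subregion masks is specific to the paper:

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits: torch.Tensor, target: torch.Tensor, smooth: float = 1.0):
    """logits, target: (B, 1, H, W); target is a binary subregion mask (e.g. DG, CA1, or CA3)."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + smooth) / (probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + smooth)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none").mean(dim=(1, 2, 3))
    return (dice + bce).mean()

pred = torch.randn(2, 1, 128, 128)
mask = (torch.rand(2, 1, 128, 128) > 0.5).float()
print(dice_bce_loss(pred, mask))
```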

Authors:Jianyang Xie, Yitian Zhao, Yanda Meng, He Zhao, Anh Nguyen, Yalin Zheng
Title: Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?
Abstract:
Spatial-temporal graph convolutional networks (ST-GCNs) showcase impressive performance in skeleton-based human action recognition (HAR). However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. Additionally, a novel sparse ST-GCNs generator is proposed, which trains a sparse architecture from a randomly initialized dense network while maintaining comparable performance levels to the dense components. Moreover, we generate multi-level sparsity ST-GCNs by integrating sparse structures at various sparsity levels and demonstrate that the assembled model yields a significant enhancement in HAR performance. Thorough experiments on four datasets, including NTU-RGB+D 60(120), Kinetics-400, and FineGYM, demonstrate that the proposed sparse ST-GCNs can achieve comparable performance to their dense components. Even with 95% fewer parameters, the sparse ST-GCNs exhibit a degradation of <1% in top-1 accuracy. Meanwhile, the multi-level sparsity ST-GCNs, which require only 66% of the parameters of the dense ST-GCNs, demonstrate an improvement of >1% in top-1 accuracy. The code is available at https://github.com/davelailai/Sparse-ST-GCN.
中文: 稀疏时空图卷积网络在显著减少参数的同时保持了与密集网络相当的性能,而多级稀疏模型进一步提升了精度并降低了参数量。
English: Sparse ST-GCNs achieve comparable performance to dense networks with significantly fewer parameters, and multi-level sparsity models further enhance accuracy while reducing parameter counts.
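A hedged sketch of lottery-ticket-style magnitude pruning, the kind of sparsification the abstract investigates; this is generic pruning applied to one temporal convolution, not the paper's sparse-architecture generator:

```python
import torch
import torch.nn as nn

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= threshold).float()

# A temporal convolution of the kind used inside ST-GCN blocks (9-frame kernel).
layer = nn.Conv2d(64, 64, kernel_size=(9, 1), padding=(4, 0))
mask = magnitude_mask(layer.weight.data, sparsity=0.95)     # keep roughly 5% of weights
layer.weight.data.mul_(mask)
print(f"remaining weights: {int(mask.sum().item())} / {mask.numel()}")
# During sparse training, the mask would be re-applied after every optimizer step
# so pruned weights stay at zero.
```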

Authors:Yuki Kawana, Shintaro Shiba, Quan Kong, Norimasa Kobori
Title: GA3CE: Unconstrained 3D Gaze Estimation with Gaze-Aware 3D Context Encoding
Abstract:
We propose a novel 3D gaze estimation approach that learns spatial relationships between the subject and objects in the scene, and outputs 3D gaze direction. Our method targets unconstrained settings, including cases where close-up views of the subject's eyes are unavailable, such as when the subject is distant or facing away. Previous approaches typically rely on either 2D appearance alone or incorporate limited spatial cues using depth maps in the non-learnable post-processing step. Estimating 3D gaze direction from 2D observations in these scenarios is challenging; variations in subject pose, scene layout, and gaze direction, combined with differing camera poses, yield diverse 2D appearances and 3D gaze directions even when targeting the same 3D scene. To address this issue, we propose GA3CE: Gaze-Aware 3D Context Encoding. Our method represents subject and scene using 3D poses and object positions, treating them as 3D context to learn spatial relationships in 3D space. Inspired by human vision, we align this context in an egocentric space, significantly reducing spatial complexity. Furthermore, we propose D$^3$ (direction-distance-decomposed) positional encoding to better capture the spatial relationship between 3D context and gaze direction in direction and distance space. Experiments demonstrate substantial improvements, reducing mean angle error by 13%-37% compared to leading baselines on benchmark datasets in single-frame settings.
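A minimal sketch of a direction-distance-decomposed positional encoding in the spirit of the D$^3$ encoding above: a 3D offset is split into a unit direction and a scalar distance, each mapped to standard Fourier features. The frequency count and the log-distance transform are illustrative assumptions:

```python
import numpy as np

def sinusoidal(x, n_freqs=4):
    """Standard Fourier features applied elementwise to the last axis of x."""
    freqs = 2.0 ** np.arange(n_freqs)                    # 1, 2, 4, 8
    angles = x[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(*x.shape[:-1], -1)

def d3_positional_encoding(offset, n_freqs=4, eps=1e-8):
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)   # distance component
    direction = offset / (dist + eps)                       # unit-direction component
    return np.concatenate([sinusoidal(direction, n_freqs),
                           sinusoidal(np.log1p(dist), n_freqs)], axis=-1)

# Toy usage: egocentric offset from the subject's head to a scene object.
offset = np.array([[0.5, -0.2, 2.0]])
print(d3_positional_encoding(offset).shape)   # (1, 32): 24 direction dims + 8 distance dims
```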

Authors:Xingye Cui, Junhai Luo, Jiakun Deng, Kexuan Li, Xiangyu Qiu, Zhenming Peng
Title: ARFC-WAHNet: Adaptive Receptive Field Convolution and Wavelet-Attentive Hierarchical Network for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) is critical in both civilian and military applications. However, the limited texture and structural information in infrared images makes accurate detection particularly challenging. Although recent deep learning-based methods have improved performance, their use of conventional convolution kernels limits adaptability to complex scenes and diverse targets. Moreover, pooling operations often cause feature loss and insufficient exploitation of image information. To address these issues, we propose an adaptive receptive field convolution and wavelet-attentive hierarchical network for infrared small target detection (ARFC-WAHNet). This network incorporates a multi-receptive field feature interaction convolution (MRFFIConv) module to adaptively extract discriminative features by integrating multiple convolutional branches with a gated unit. A wavelet frequency enhancement downsampling (WFED) module leverages Haar wavelet transform and frequency-domain reconstruction to enhance target features and suppress background noise. Additionally, we introduce a high-low feature fusion (HLFF) module for integrating low-level details with high-level semantics, and a global median enhancement attention (GMEA) module to improve feature diversity and expressiveness via global attention. Experiments on public datasets SIRST, NUDT-SIRST, and IRSTD-1k demonstrate that ARFC-WAHNet outperforms recent state-of-the-art methods in both detection accuracy and robustness, particularly under complex backgrounds. The code is available at https://github.com/Leaf2001/ARFC-WAHNet.
中文摘要:提出的ARFC-WAHNet通过自适应感受野卷积和小波注意力模块,有效解决了红外小目标检测中的特征提取不足问题,在复杂背景下相比现有方法展现出更优的检测精度和鲁棒性。
English Summary: The proposed ARFC-WAHNet introduces adaptive receptive field convolution and wavelet-attentive modules to overcome limitations in infrared small target detection, achieving superior accuracy and robustness across complex scenarios compared to existing methods.
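A hedged sketch of the 2x2 Haar decomposition underlying a wavelet frequency-enhanced downsampling block: the feature map is split into one low-frequency and three high-frequency sub-bands at half resolution. How the bands are reweighted and recombined is the paper's contribution and is not reproduced here:

```python
import torch

def haar_downsample(x: torch.Tensor):
    """x: (B, C, H, W) with even H, W -> (LL, LH, HL, HH), each (B, C, H/2, W/2)."""
    a = x[..., 0::2, 0::2]   # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2     # low-frequency (approximation) band
    lh = (a + b - c - d) / 2     # horizontal detail
    hl = (a - b + c - d) / 2     # vertical detail
    hh = (a - b - c + d) / 2     # diagonal detail
    return ll, lh, hl, hh

feat = torch.randn(1, 32, 64, 64)
ll, lh, hl, hh = haar_downsample(feat)
print(ll.shape)   # torch.Size([1, 32, 32, 32])
```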

Authors:Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
Title: MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Abstract:
Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.
中文: 当前多模态模型因缺乏对数学图形的精细理解而在推理上受限,为此我们利用代码监督和大规模数据集开发了MathCoder-VL模型,在几何问题求解上超越了GPT-4o等模型,实现了开源模型的最优性能。
English: Current multimodal models struggle with mathematical reasoning due to a lack of detailed figure understanding, so we developed MathCoder-VL using code supervision and a large-scale dataset to achieve state-of-the-art performance, surpassing models like GPT-4o in geometry problem-solving.

Authors:Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis
Title: Multi-Token Prediction Needs Registers
Abstract:
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
中文: MuToR是一种创新的多令牌预测方法,通过在输入序列中插入可学习的寄存器令牌来预测未来目标,具有参数增量极少、无需改变模型架构即可兼容现有预训练模型的特点,在语言和视觉任务的微调与预训练中均表现出优越性能。
English: MuToR is a novel multi-token prediction method that integrates learnable register tokens into input sequences to predict future targets, offering minimal parameter overhead, architectural compatibility with existing models, and enhanced performance across fine-tuning and pretraining scenarios in both language and vision tasks.
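A toy sketch of interleaving learnable register tokens into a token sequence so each register carries a future-token target while ordinary positions keep the usual next-token label. The register id, insertion interval, prediction horizon, and ignore index are illustrative assumptions, not the paper's recipe:

```python
import torch

REGISTER_ID = 50257      # assumed id of a newly added register token
IGNORE = -100            # standard ignore index for cross-entropy

def interleave_registers(input_ids: torch.Tensor, horizon: int = 2, every: int = 4):
    """input_ids: (T,) -> (ids_with_registers, labels) where registers predict t + horizon."""
    ids, labels = [], []
    for t, tok in enumerate(input_ids.tolist()):
        ids.append(tok)
        labels.append(input_ids[t + 1].item() if t + 1 < len(input_ids) else IGNORE)
        if (t + 1) % every == 0 and t + horizon < len(input_ids):
            ids.append(REGISTER_ID)                        # register placed after this position...
            labels.append(input_ids[t + horizon].item())   # ...tasked with a future target
    return torch.tensor(ids), torch.tensor(labels)

seq = torch.arange(100, 112)          # dummy token ids
ids, labels = interleave_registers(seq)
print(ids.tolist())
print(labels.tolist())
```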

Authors:Raman Dutt, Pedro Sanchez, Yongchen Yao, Steven McDonagh, Sotirios A. Tsaftaris, Timothy Hospedales
Title: CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs
Abstract:
We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models. Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples. CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures. Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons. Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models. Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at https://raman1121.github.io/CheXGenBench/

Authors:Fengdi Zhang, Hongkun Cao, Ruqi Huang
Title: Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting
Abstract:
To reduce storage and computational costs, 3D Gaussian splatting (3DGS) seeks to minimize the number of Gaussians used while preserving high rendering quality, introducing an inherent trade-off between Gaussian quantity and rendering quality. Existing methods strive for better quantity-quality performance, but lack the ability for users to intuitively adjust this trade-off to suit practical needs such as model deployment under diverse hardware and communication constraints. Here, we present ControlGS, a 3DGS optimization method that achieves semantically meaningful and cross-scene consistent quantity-quality control. Through a single training run using a fixed setup and a user-specified hyperparameter reflecting quantity-quality preference, ControlGS can automatically find desirable quantity-quality trade-off points across diverse scenes, from compact objects to large outdoor scenes. It also outperforms baselines by achieving higher rendering quality with fewer Gaussians, and supports a broad adjustment range with stepless control over the trade-off. Project page: https://zhang-fengdi.github.io/ControlGS/

Authors:Jiaming Liang, Lihuan Dai, Xiaoqi Sheng, Xiangguang Chen, Chun Yao, Guihua Tao, Qibin Leng, Hongmin Cai, Xi Zhong
Title: HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation
Abstract:
Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address those challenges, we have made two major contributions: First, we publicly disseminate the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the public BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68% in the Dice score while maintaining solid robustness. The dataset and code are public via https://github.com/JeMing-creater/HWA-UNETR.
Chinese: 本研究发布了首个大规模开源胃癌多模态MRI数据集GCM 2025,并提出了HWA-UNETR新型三维分割框架,该框架在Dice指标上以最高1.68%的优势超越现有方法,展现出卓越性能。
English: This study introduces the GCM 2025 dataset, the first large-scale open-source gastric cancer multimodal MRI collection, and proposes HWA-UNETR, a novel 3D segmentation framework that achieves superior performance with up to 1.68% higher Dice scores than existing methods.

Authors:Andrei Arhire, Radu Timofte
Title: Learned Lightweight Smartphone ISP with Unpaired Data
Abstract:
The Image Signal Processor (ISP) is a fundamental component in modern smartphone cameras responsible for conversion of RAW sensor image data to RGB images with a strong focus on perceptual quality. Recent work highlights the potential of deep learning approaches and their ability to capture details with a quality increasingly close to that of professional cameras. A difficult and costly step when developing a learned ISP is the acquisition of pixel-wise aligned paired data that maps the raw data captured by a smartphone camera sensor to high-quality reference images. In this work, we address this challenge by proposing a novel training method for a learnable ISP that eliminates the need for direct correspondences between raw images and ground-truth data with matching content. Our unpaired approach employs a multi-term loss function guided by adversarial training with multiple discriminators processing feature maps from pre-trained networks to maintain content structure while learning color and texture characteristics from the target RGB dataset. Using lightweight neural network architectures suitable for mobile devices as backbones, we evaluated our method on the Zurich RAW to RGB and Fujifilm UltraISP datasets. Compared to paired training methods, our unpaired learning strategy shows strong potential and achieves high fidelity across multiple evaluation metrics. The code and pre-trained models are available at https://github.com/AndreiiArhire/Learned-Lightweight-Smartphone-ISP-with-Unpaired-Data.
中文: 本研究提出一种无需配对数据的可学习图像信号处理器训练方法,通过多判别器对抗训练保持内容结构并学习目标RGB数据集的色彩纹理,在移动设备上实现了优异的性能。
English: This study introduces an unpaired training method for a learnable Image Signal Processor (ISP) that eliminates the need for aligned data by using adversarial training with multiple discriminators to maintain content structure while adapting color and texture from target RGB datasets, achieving competitive performance on mobile devices.

Authors:Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong
Title: SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity
Abstract:
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving 16x, 10x, and 5x improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
中文: SpikeVideoFormer是一种高效的脉冲驱动视频Transformer,具有线性时间复杂度,在视频任务中实现最先进性能,同时相比现有方法显著提升了能效。
English: SpikeVideoFormer is an efficient spike-driven video Transformer with linear temporal complexity that achieves state-of-the-art performance in video tasks while significantly improving energy efficiency over existing methods.
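A hedged sketch of attention scores derived from Hamming similarity between binary spike queries and keys: for {0,1} vectors the Hamming distance reduces to sums and a dot product, so the score matrix stays matmul-friendly. The normalization and the linear-complexity space-time factorization from the paper are not reproduced:

```python
import torch

def hamming_attention(q, k, v):
    """q, k: (T, d) binary spike tensors; v: (T, d) values."""
    d = q.shape[-1]
    dot = q @ k.T                                              # positions where both spike
    hamming = q.sum(-1, keepdim=True) + k.sum(-1) - 2 * dot    # valid for {0, 1} entries
    scores = 1.0 - hamming / d                                 # similarity in [0, 1]
    weights = scores / scores.sum(-1, keepdim=True).clamp(min=1e-6)
    return weights @ v

q = (torch.rand(8, 16) > 0.7).float()
k = (torch.rand(8, 16) > 0.7).float()
v = torch.randn(8, 16)
print(hamming_attention(q, k, v).shape)    # torch.Size([8, 16])
```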

Authors:Jie Zhu, Jirong Zha, Ding Li, Leye Wang
Title: A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability
Abstract:
Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: the self-supervised training method and details are unknown to the adversary, who in practice typically faces a black-box system. In this setting, considering that a self-supervised model could be trained with completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and the stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stop and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at https://github.com/JiePKU/PartCrop.
中文: 本文提出PartCrop方法,通过在表示空间中裁剪物体部件进行查询,实现了对视觉自监督模型的黑盒成员推理攻击,实验证明该方法具有普适有效性,并提出了相应的防御方案。
English: This paper introduces PartCrop, a unified membership inference attack method for visual self-supervised models that operates in a black-box setting by cropping object parts to detect training data, demonstrating effectiveness across various models and datasets while also proposing defense strategies.
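A rough sketch of the part-response signal such a crop-based query relies on: embed random part crops and the full image with the (black-box) encoder and use their similarity as the attack feature, expecting training-set members to show stronger part responses. The randomly initialized ResNet, crop policy, and scoring below are placeholders, not the actual PartCrop pipeline:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18
from torchvision.transforms.functional import resized_crop

encoder = resnet18(weights=None)           # stand-in for the victim self-supervised encoder
encoder.fc = torch.nn.Identity()
encoder.eval()

@torch.no_grad()
def part_response(image: torch.Tensor, n_crops: int = 8, crop_size: int = 64):
    """image: (3, H, W) in [0, 1]; returns the mean crop-to-image cosine similarity."""
    _, h, w = image.shape
    full = F.normalize(encoder(image.unsqueeze(0)), dim=-1)
    sims = []
    for _ in range(n_crops):
        top = torch.randint(0, h - crop_size + 1, (1,)).item()
        left = torch.randint(0, w - crop_size + 1, (1,)).item()
        crop = resized_crop(image, top, left, crop_size, crop_size, [224, 224])
        part = F.normalize(encoder(crop.unsqueeze(0)), dim=-1)
        sims.append(F.cosine_similarity(full, part).item())
    return sum(sims) / len(sims)

img = torch.rand(3, 224, 224)
print(part_response(img))   # an attack would threshold or classify this score
```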

Authors:Yue Wang, Shuai Xu, Xuelin Zhu, Yicong Li
Title: MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning
Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.
中文: 提出的多阶段跨模态交互模型通过自适应聚合器将局部与全局视觉特征逐步融入文本表示,有效增强了CLIP对细粒度信息的感知能力,在三个基准数据集上验证了其优越性。
English: The proposed Multi-Stage Cross-modal Interaction (MSCI) model enhances CLIP's fine-grained perception by progressively integrating local and global visual features into textual representations through adaptive aggregators, demonstrating superior performance across three benchmark datasets.

Authors:Mengqiu Xu, Kaixin Chen, Heng Guo, Yixiang Huang, Ming Wu, Zhenwei Shi, Chuang Zhang, Jun Guo
Title: MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting
Abstract:
Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at https://github.com/kaka0910/MFogHub.
中文: MFogHub 是首个整合了来自15个沿海雾区和六颗卫星的超过68,000个标注海洋雾样本的多区域多卫星数据集,能够全面评估检测与预测模型,并解决不同条件下的泛化难题。
English: MFogHub is the first multi-regional and multi-satellite dataset integrating over 68,000 annotated marine fog samples from 15 coastal regions and six satellites, enabling comprehensive evaluation of detection and forecasting models while addressing generalization challenges across diverse conditions.

Authors:Pavel Korotaev, Petr Surovtsev, Alexander Kapitanov, Karina Kvanchiani, Aleksandr Nagaev
Title: HandReader: Advanced Techniques for Efficient Fingerspelling Recognition
Abstract:
Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE) operated on keypoints as tensors. Such composition of keypoints in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoint coordinates. We also introduce HandReader$_{RGB+KP}$, an architecture with a joint encoder that benefits from both RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.
中文: 本文提出HandReader,一套包含三种架构的手指拼写识别系统,通过新颖时序模块处理RGB和关键点数据,在现有及新推出的俄语数据集上均取得最优性能。
English: This paper introduces HandReader, a set of three architectures for fingerspelling recognition that achieve state-of-the-art results by processing RGB and keypoint data through novel temporal modules, validated on both existing and a new Russian dataset.

Authors:Wenhao Shen, Wanqi Yin, Xiaofeng Yang, Cheng Chen, Chaoyue Song, Zhongang Cai, Lei Yang, Hao Wang, Guosheng Lin
Title: ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization
Abstract:
Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: https://github.com/shenwenhao01/ADHMR.
中文:提出的ADHMR框架通过偏好优化对齐基于扩散的人体网格恢复模型,利用新型HMR-Scorer评估预测并创建偏好数据集进行微调,实现了最先进的性能表现。
English: The proposed ADHMR framework enhances human mesh recovery by aligning a diffusion-based model through preference optimization, using a novel HMR-Scorer to evaluate predictions and create a preference dataset for fine-tuning, achieving state-of-the-art performance.
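A compact sketch of the direct preference optimization objective on a winner/loser pair, with the model and reference log-likelihoods of each mesh prediction abstracted as scalars; obtaining those log-probabilities from the diffusion-based HMR model is model-specific and not shown:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """All arguments: (B,) log-likelihoods under the finetuned / frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy numbers: the winner prediction should become relatively more likely than the loser.
logp_w, logp_l = torch.tensor([-4.0]), torch.tensor([-5.5])
ref_w, ref_l = torch.tensor([-4.2]), torch.tensor([-4.9])
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```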

Authors:Yanbo Ding, Xirui Hu, Zhizhi Guo, Chi Zhang, Yali Wang
Title: MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation
Abstract:
Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that our MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are on: https://github.com/DINGYANB/MTVCrafter.
中文摘要:MTVCrafter提出首个直接建模原始3D运动序列的框架,通过4D运动令牌突破2D姿态图像的局限,在复杂3D世界中实现更灵活的人类图像动画,并在开放场景中展现出卓越的泛化能力。
English Summary: MTVCrafter introduces a novel framework using 4D motion tokens to overcome limitations of 2D pose images in human image animation, achieving state-of-the-art performance with superior generalization across diverse characters and scenarios.
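A minimal sketch of the vector-quantization step a 4D motion tokenizer builds on: per-frame, per-joint motion features are snapped to their nearest codebook entries, yielding discrete motion tokens. The codebook size, feature dimension, and the encoder producing the features are illustrative assumptions:

```python
import torch

codebook = torch.randn(1024, 64)                # (num_codes, code_dim), normally learned

def quantize(motion_feats: torch.Tensor):
    """motion_feats: (T, J, 64) per-frame, per-joint features -> (token ids (T, J), quantized feats)."""
    flat = motion_feats.reshape(-1, motion_feats.shape[-1])
    dists = torch.cdist(flat, codebook)          # distances to every codebook entry
    idx = dists.argmin(dim=-1)                   # nearest-neighbour assignment
    return idx.reshape(motion_feats.shape[:-1]), codebook[idx].reshape(motion_feats.shape)

feats = torch.randn(16, 24, 64)                  # 16 frames, 24 joints
tokens, quantized = quantize(feats)
print(tokens.shape, quantized.shape)             # torch.Size([16, 24]) torch.Size([16, 24, 64])
```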

Authors:Haozhe Luo, Ziyu Zhou, Zixin Shu, Aurélie Pahud de Mortanges, Robert Berke, Mauricio Reyes
Title: On the Interplay of Human-AI Alignment,Fairness, and Performance Trade-offs in Medical Imaging
Abstract:
Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at https://github.com/Roypic/Aligner.
中文摘要:在医学影像中,人机对齐能持续缩小公平性差距并提升泛化能力,但过度对齐需采用校准策略以平衡专家指导与自动化效率。
English Summary: Human-AI alignment in medical imaging consistently reduces fairness gaps and improves generalization, though excessive alignment requires calibrated strategies to balance expert guidance with automated efficiency.

Authors:Ziad Kheil, Lucas Robinet, Laurent Risser, Soleakhena Ken
Title: IMITATE: Image Registration with Context for unknown time frame recovery
Abstract:
In this paper, we formulate a novel image registration formalism dedicated to the estimation of unknown condition-related images, based on two or more known images and their associated conditions. We show how to practically model this formalism by using a new conditional U-Net architecture, which fully takes into account the conditional information and does not need any fixed image. Our formalism is then applied to imaging moving tumors for radiotherapy treatment at different breathing amplitudes using 4D-CT (3D+t) scans of thoracoabdominal regions. This driving application is particularly complex as it requires stitching a collection of sequential 2D slices into several 3D volumes at different organ positions. Movement interpolation with standard methods then generates well-known reconstruction artefacts in the assembled volumes due to irregular patient breathing, hysteresis and poor correlation of breathing signal to internal motion. Results obtained on 4D-CT clinical data showcase artefact-free volumes achieved at real-time latencies. The code is publicly available at https://github.com/Kheil-Z/IMITATE.
Chinese: 本文提出了一种基于条件U-Net架构的新型图像配准方法,能够从已知图像估计未知条件相关图像,并在胸腹部4D-CT扫描中成功消除了移动肿瘤放疗中的重建伪影。
English: This paper introduces a novel image registration method using a conditional U-Net architecture to estimate unknown condition-related images from known ones, successfully eliminating reconstruction artifacts in 4D-CT scans for radiotherapy of moving tumors.

Authors:Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, Qing Li
Title: FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation
Abstract:
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model will predict the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.
中文摘要:本文提出FlowDreamer模型,通过显式三维场景流和扩散方法预测机器人操作中的未来视觉帧,在多项基准测试中相比基线模型在语义相似度、像素质量和成功率上均表现出显著优势。
English Summary: This paper introduces FlowDreamer, an RGB-D world model that uses explicit 3D scene flow and diffusion to predict future frames for robot manipulation, achieving superior performance across multiple benchmarks compared to baseline methods.

Authors:Ijazul Haq, Yingjie Zhang, Irfan Ali Khan
Title: PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language
Abstract:
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.
中文: 本研究评估了大型多模态模型在低资源普什图语光学字符识别任务中的表现,利用新开发的合成数据集PsOCR进行测试,结果表明Gemini整体表现最佳,而Qwen-7B在开源模型中最为突出。
English: This study assesses the performance of Large Multimodal Models on Optical Character Recognition for the low-resource Pashto language, using a newly developed synthetic dataset, PsOCR, and finds that Gemini leads overall while Qwen-7B excels among open-source models.

Authors:Yuan Gao, Shaobo Xia, Sheng Nie, Cheng Wang, Xiaohuan Xi, Bisheng Yang
Title: APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds
Abstract:
Airborne laser scanning (ALS) point cloud segmentation is a fundamental task for large-scale 3D scene understanding. In real-world applications, models are typically fixed after training. However, domain shifts caused by changes in the environment, sensor types, or sensor degradation often lead to a decline in model performance. Continual Test-Time Adaptation (CTTA) offers a solution by adapting a source-pretrained model to evolving, unlabeled target domains. Despite its potential, research on ALS point clouds remains limited, facing challenges such as the absence of standardized datasets and the risk of catastrophic forgetting and error accumulation during prolonged adaptation. To tackle these challenges, we propose APCoTTA, the first CTTA method tailored for ALS point cloud semantic segmentation. We propose a dynamic trainable layer selection module. This module utilizes gradient information to select low-confidence layers for training, and the remaining layers are kept frozen, mitigating catastrophic forgetting. To further reduce error accumulation, we propose an entropy-based consistency loss. By discarding high-entropy samples, we apply the consistency loss only to reliable samples, enhancing model stability. In addition, we propose a random parameter interpolation mechanism, which randomly blends parameters from the selected trainable layers with those of the source model. This approach helps balance target adaptation and source knowledge retention, further alleviating forgetting. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Experimental results demonstrate that APCoTTA achieves the best performance on two benchmarks, with mIoU improvements of approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.
Chinese: APCoTTA方法通过动态可训练层选择、基于熵的一致性损失和随机参数插值机制,解决了机载激光扫描点云连续测试时适应中的灾难性遗忘和误差累积问题,并在新构建的基准数据集上实现了mIoU指标的显著提升。
English: The APCoTTA method introduces dynamic trainable layer selection, entropy-based consistency loss, and random parameter interpolation to address catastrophic forgetting and error accumulation in continuous test-time adaptation for ALS point cloud segmentation, achieving significant mIoU improvements on newly established benchmarks.
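A hedged sketch of an entropy-gated consistency loss of the kind described above: predictions for two views of the same scan are compared only on points whose prediction entropy is below a threshold, so unreliable points are excluded. The threshold rule and the two-view setup are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def entropy_gated_consistency(logits_a, logits_b, max_entropy_ratio=0.4):
    """logits_a, logits_b: (N, C) per-point class logits for two views of the same point cloud."""
    p = F.softmax(logits_a, dim=-1)
    entropy = -(p * torch.log(p + 1e-8)).sum(-1)
    threshold = max_entropy_ratio * torch.log(torch.tensor(float(logits_a.shape[-1])))
    reliable = entropy < threshold                 # keep only confident points
    if reliable.sum() == 0:
        return logits_a.new_zeros(())
    targets = p[reliable].detach()
    log_q = F.log_softmax(logits_b[reliable], dim=-1)
    return -(targets * log_q).sum(-1).mean()       # cross-entropy toward the confident posterior

logits_a = torch.randn(1000, 8)                    # e.g. 8 ALS semantic classes
logits_b = logits_a + 0.1 * torch.randn(1000, 8)
print(entropy_gated_consistency(logits_a, logits_b))
```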

Authors:Jiakun Deng, Kexuan Li, Xingye Cui, Jiaxuan Li, Chang Long, Tian Pu, Zhenming Peng
Title: CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding network (CSPENet) for ISTD. We first design a surround-convergent prior extraction module (SCPEM) that effectively captures the intrinsic characteristic of target contour pixel gradients converging toward their center. This module concurrently extracts two collaborative priors: a boosted saliency prior for accurate target localization and multi-scale structural priors for comprehensively enriching contour detail representation. Building upon this, we propose a dual-branch priors embedding architecture (DBPEA) that establishes differentiated feature fusion pathways, embedding these two priors at optimal network positions to achieve performance enhancement. Finally, we develop an attention-guided feature enhancement module (AGFEM) to refine feature representations and improve saliency estimation accuracy. Experimental results on public datasets NUDT-SIRST, IRSTD-1k, and NUAA-SIRST demonstrate that our CSPENet outperforms other state-of-the-art methods in detection performance. The code is available at https://github.com/IDIP2025/CSPENet.
中文: 提出的CSPENet通过嵌入轮廓感知显著性先验的特殊模块,在基准数据集上实现了优于现有方法的红外小目标检测性能。
English: The proposed CSPENet enhances infrared small target detection by embedding contour-aware saliency priors through specialized modules, achieving superior performance on benchmark datasets compared to existing methods.

Authors:Zhe Shan, Lei Zhou, Liu Mao, Shaofan Chen, Chuanqiu Ren, Xia Xie
Title: Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset
Abstract:
In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world and potentially contribute to the occurrence of non-registration problems. Second, we develop distinct image transformation schemes tailored to various scenarios to convert the available registration change detection dataset into a non-registration version. Finally, we demonstrate that non-registration change detection can cause catastrophic damage to the state-of-the-art methods. Our code and dataset are available at https://github.com/ShanZard/NRCD.
中文: 本研究提出了一种用于处理紧急情况的非配准变化检测遥感任务,界定了导致非配准问题的现实场景,相应转换了数据集,并证明现有先进方法在此类挑战下表现严重受损。
English: This study introduces a non-registration change detection method for remote sensing to handle emergencies, defines real-world scenarios causing non-registration issues, transforms datasets accordingly, and shows that current top methods suffer severely from these challenges.

Authors:Bin-Bin Gao, Yue Zhou, Jiangtao Yan, Yuezhi Cai, Weixi Zhang, Meng Wang, Jun Liu, Yong Liu, Lei Wang, Chengjie Wang
Title: AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Abstract:
Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and possesses a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.
中文摘要:AdaptCLIP是一种基于CLIP模型的新方法,通过三个简单适配器实现跨领域视觉异常检测,无需目标域微调即可在工业和医疗基准上取得最优性能。
English Summary: AdaptCLIP is a novel method that enhances CLIP models with three simple adapters for universal visual anomaly detection, achieving state-of-the-art performance across multiple domains without requiring fine-tuning on target data.

Authors:Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, Guanghui Ren
Title: EnerVerse-AC: Envisioning Embodied Environments with Action Condition
Abstract:
Robotic imitation learning has advanced from solving static tasks to addressing dynamic interaction scenarios, but testing and evaluation remain costly and challenging due to the need for real-time interaction with dynamic environments. We propose EnerVerse-AC (EVAC), an action-conditional world model that generates future visual observations based on an agent's predicted actions, enabling realistic and controllable robotic inference. Building on prior architectures, EVAC introduces a multi-level action-conditioning mechanism and ray map encoding for dynamic multi-view image generation while expanding training data with diverse failure trajectories to improve generalization. As both a data engine and evaluator, EVAC augments human-collected trajectories into diverse datasets and generates realistic, action-conditioned video observations for policy testing, eliminating the need for physical robots or complex simulations. This approach significantly reduces costs while maintaining high fidelity in robotic manipulation evaluation. Extensive experiments validate the effectiveness of our method. Code, checkpoints, and datasets can be found at .
中文:EnerVerse-AC (EVAC) 是一种动作条件世界模型,能根据预测动作生成逼真的视觉观测,通过增强数据集和产生可控视频预测,无需物理机器人或复杂仿真即可实现经济高效的策略测试与评估。
English: EnerVerse-AC (EVAC) is an action-conditional world model that generates realistic visual observations for robotic inference, enabling cost-effective policy testing and evaluation without physical robots or complex simulations by augmenting datasets and producing controllable video predictions.

Authors:Xixuan Hao, Yutian Jiang, Xingchen Zou, Jiabo Liu, Yifang Yin, Yuxuan Liang
Title: Unlocking Location Intelligence: A Survey from Deep Learning to The LLM Era
Abstract:
Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in LI. The summary of the up-to-date paper list can be found at https://github.com/CityMind-Lab/Awesome-Location-Intelligence and will undergo continuous updates.
中文总结:位置智能正通过深度学习和大型语言模型的地理空间表示学习技术发生变革,实现了自动化特征提取与跨模态推理,本综述系统回顾了该领域进展并提出了未来研究方向。
English Summary: Location Intelligence is being transformed by geospatial representation learning through deep learning and large language models, which enable automated feature extraction and cross-modal reasoning, with this survey providing a comprehensive review, taxonomy, and future research directions.

Authors:Nadav Magar, Amir Hertz, Eric Tabellion, Yael Pritch, Alex Rav-Acha, Ariel Shamir, Yedid Hoshen
Title: LightLab: Controlling Light Sources in Images with Diffusion Models
Abstract:
We present a simple, yet effective diffusion-based method for fine-grained, parametric control over light sources in an image. Existing relighting methods either rely on multiple input views to perform inverse rendering at inference time, or fail to provide explicit control over light changes. Our method fine-tunes a diffusion model on a small set of real raw photograph pairs, supplemented by synthetically rendered images at scale, to elicit its photorealistic prior for relighting. We leverage the linearity of light to synthesize image pairs depicting controlled light changes of either a target light source or ambient illumination. Using this data and an appropriate fine-tuning scheme, we train a model for precise illumination changes with explicit control over light intensity and color. Lastly, we show how our method can achieve compelling light editing results, and outperforms existing methods based on user preference.
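A minimal sketch of the linearity-of-light idea the abstract relies on: in linear (raw) intensity space, a relit image can be composed as the ambient image plus a scaled, tinted copy of the target light's contribution. The function name, tint parameterization, and toy arrays below are illustrative assumptions, not the paper's data pipeline.

import numpy as np

def synthesize_relit_pair(ambient, lit, intensity, color=(1.0, 1.0, 1.0)):
    # ambient: HxWx3 linear-RGB image with the target light OFF
    # lit:     same scene with the target light at full strength
    # intensity: scalar in [0, 1] controlling the light's strength
    # color:   per-channel tint applied to the light's contribution
    contribution = lit - ambient                 # isolate the target light's contribution
    tint = np.asarray(color).reshape(1, 1, 3)
    return ambient + intensity * tint * contribution

# Example: render the light at 40% strength with a warm tint.
ambient = np.random.rand(64, 64, 3).astype(np.float32)
lit = ambient + 0.5 * np.random.rand(64, 64, 3).astype(np.float32)
relit = synthesize_relit_pair(ambient, lit, intensity=0.4, color=(1.0, 0.85, 0.7))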

Authors:Guillermo Gomez-Trenado, Pablo Mesejo, Oscar Cordón, Stéphane Lathuilière
Title: Don't Forget your Inverse DDIM for Image Editing
Abstract:
The field of text-to-image generation has undergone significant advancements with the introduction of diffusion models. Nevertheless, the challenge of editing real images persists, as most methods are either computationally intensive or produce poor reconstructions. This paper introduces SAGE (Self-Attention Guidance for image Editing) - a novel technique leveraging pre-trained diffusion models for image editing. SAGE builds upon the DDIM algorithm and incorporates a novel guidance mechanism utilizing the self-attention layers of the diffusion U-Net. This mechanism computes a reconstruction objective based on attention maps generated during the inverse DDIM process, enabling efficient reconstruction of unedited regions without the need to precisely reconstruct the entire input image. Thus, SAGE directly addresses the key challenges in image editing. The superiority of SAGE over other methods is demonstrated through quantitative and qualitative evaluations and confirmed by a statistically validated comprehensive user study, in which all 47 surveyed users preferred SAGE over competing methods. Additionally, SAGE ranks as the top-performing method in seven out of 10 quantitative analyses and secures second and third places in the remaining three.

Authors:Nicola Marinello, Simen Cassiman, Jonas Heylen, Marc Proesmans, Luc Van Gool
Title: Camera-Only 3D Panoptic Scene Completion for Autonomous Driving through Differentiable Object Shapes
Abstract:
Autonomous vehicles need a complete map of their surroundings to plan and act. This has sparked research into the tasks of 3D occupancy prediction, 3D scene completion, and 3D panoptic scene completion, which predict a dense map of the ego vehicle's surroundings as a voxel grid. Scene completion extends occupancy prediction by predicting occluded regions of the voxel grid, and panoptic scene completion further extends this task by also distinguishing object instances within the same class; both aspects are crucial for path planning and decision-making. However, 3D panoptic scene completion is currently underexplored. This work introduces a novel framework for 3D panoptic scene completion that extends existing 3D semantic scene completion models. We propose an Object Module and Panoptic Module that can easily be integrated with 3D occupancy and scene completion methods presented in the literature. Our approach leverages the available annotations in occupancy benchmarks, allowing individual object shapes to be learned as a differentiable problem. The code is available at https://github.com/nicolamarinello/OffsetOcc .
中文: 自动驾驶汽车需要全面的环境地图,因此本研究提出了一种新颖的3D全景场景补全框架,通过集成物体模块和全景模块来增强现有模型,从而提升路径规划与决策能力。
English: Autonomous vehicles require comprehensive environmental mapping, leading to the development of a novel 3D panoptic scene completion framework that enhances existing models with object and panoptic modules for improved path planning and decision-making.

Authors:Jeffrey Wen, Rizwan Ahmad, Philip Schniter
Title: Conformal Bounds on Full-Reference Image Quality for Imaging Inverse Problems
Abstract:
In imaging inverse problems, we would like to know how close the recovered image is to the true image in terms of full-reference image quality (FRIQ) metrics like PSNR, SSIM, LPIPS, etc. This is especially important in safety-critical applications like medical imaging, where knowing that, say, the SSIM was poor could potentially avoid a costly misdiagnosis. But since we don't know the true image, computing FRIQ is non-trivial. In this work, we combine conformal prediction with approximate posterior sampling to construct bounds on FRIQ that are guaranteed to hold up to a user-specified error probability. We demonstrate our approach on image denoising and accelerated magnetic resonance imaging (MRI) problems. Code is available at https://github.com/jwen307/quality_uq.
中文: 本研究结合了共形预测和近似后验采样,为成像逆问题中的全参考图像质量指标构建了具有保证的边界,并在图像去噪和加速磁共振成像中进行了验证。
English: This study introduces a method using conformal prediction and approximate posterior sampling to create guaranteed bounds on full-reference image quality metrics for imaging inverse problems, validated through denoising and MRI applications.
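A minimal split-conformal sketch of the kind of bound described: given a calibration set with heuristic FRIQ estimates (e.g., computed from approximate posterior samples) and the corresponding true metric values, a conformal quantile converts the heuristic into a bound with a marginal coverage guarantee. The score definition and variable names are assumptions for illustration, not the paper's exact construction.

import numpy as np

def conformal_upper_bound(calib_estimates, calib_true, test_estimate, alpha=0.05):
    # Split-conformal upper bound on an error-like FRIQ metric (e.g., LPIPS).
    # calib_estimates: heuristic metric estimates on a calibration set
    # calib_true:      true metric values on the same calibration set
    # test_estimate:   heuristic estimate for a new test image
    scores = calib_true - calib_estimates            # nonconformity scores
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    return test_estimate + q                         # holds with prob. >= 1 - alpha (marginally)

# Toy usage with synthetic calibration data.
rng = np.random.default_rng(0)
est = rng.uniform(0.1, 0.4, size=200)
true = est + rng.normal(0.0, 0.02, size=200)
print(conformal_upper_bound(est, true, test_estimate=0.25, alpha=0.1))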

Authors:Dongyi He, Shiyang Li, Bin Jiang, He Yan
Title: Spec2VolCAMU-Net: A Spectrogram-to-Volume Model for EEG-to-fMRI Reconstruction based on Multi-directional Time-Frequency Convolutional Attention Encoder and Vision-Mamba U-Net
Abstract:
High-resolution functional magnetic resonance imaging (fMRI) is essential for mapping human brain activity; however, it remains costly and logistically challenging. If comparable volumes could be generated directly from widely available scalp electroencephalography (EEG), advanced neuroimaging would become significantly more accessible. Existing EEG-to-fMRI generators rely on plain Convolutional Neural Networks (CNNs) that fail to capture cross-channel time-frequency cues or on heavy transformer/Generative Adversarial Network (GAN) decoders that strain memory and stability. To address these limitations, we propose Spec2VolCAMU-Net, a lightweight architecture featuring a Multi-directional Time-Frequency Convolutional Attention Encoder for rich feature extraction and a Vision-Mamba U-Net decoder that uses linear-time state-space blocks for efficient long-range spatial modelling. We frame the goal of this work as establishing a new state of the art in the spatial fidelity of single-volume reconstruction, a foundational prerequisite for the ultimate aim of generating temporally coherent fMRI time series. Trained end-to-end with a hybrid SSI-MSE loss, Spec2VolCAMU-Net achieves state-of-the-art fidelity on three public benchmarks, recording Structural Similarity Index (SSIM) scores of 0.693 on NODDI, 0.725 on Oddball and 0.788 on CN-EPFL, representing improvements of 14.5%, 14.9%, and 16.9% respectively over previous best SSIM scores. Furthermore, it achieves competitive Peak Signal-to-Noise Ratio (PSNR) scores, particularly excelling on the CN-EPFL dataset with a 4.6% improvement over the previous best PSNR, thus striking a better balance in reconstruction quality. The proposed model is lightweight and efficient, making it suitable for real-time applications in clinical and research settings. The code is available at https://github.com/hdy6438/Spec2VolCAMU-Net.
Chinese: 本研究提出的Spec2VolCAMU-Net轻量级模型,通过从脑电图数据重建高保真功能磁共振成像体积,在多个基准测试中实现了结构相似性和信噪比的显著提升,创造了该领域的最新性能记录。
English: This study introduces Spec2VolCAMU-Net, a lightweight model that sets a new state of the art in reconstructing high-fidelity fMRI volumes from EEG data, achieving significant improvements in structural similarity and signal-to-noise ratio across multiple benchmarks.
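A rough sketch of a hybrid structural-similarity plus MSE training objective of the kind the abstract names. The window size, constants, and weighting below are assumptions (the paper's exact settings are not given here), and volumetric fMRI data would use 3D pooling instead of the 2D version shown.

import torch
import torch.nn.functional as F

def ssim(x, y, win=7, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM with a uniform window; inputs are (B, C, H, W) in [0, 1].
    pad = win // 2
    mu_x = F.avg_pool2d(x, win, 1, pad)
    mu_y = F.avg_pool2d(y, win, 1, pad)
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def hybrid_ssi_mse_loss(pred, target, ssim_weight=0.5):
    # Blend a structural-similarity term with plain MSE.
    return ssim_weight * (1.0 - ssim(pred, target)) + (1.0 - ssim_weight) * F.mse_loss(pred, target)

pred = torch.rand(2, 1, 64, 64)
target = torch.rand(2, 1, 64, 64)
print(hybrid_ssi_mse_loss(pred, target))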

Authors:Yuelin Zhang, Qingpeng Ding, Long Lei, Yongxuan Feng, Raymond Shing-Yan Tang, Shing Shin Cheng
Title: MrTrack: Register Mamba for Needle Tracking with Rapid Reciprocating Motion during Ultrasound-Guided Aspiration Biopsy
Abstract:
Ultrasound-guided fine needle aspiration (FNA) biopsy is a common minimally invasive diagnostic procedure. However, an aspiration needle tracker addressing rapid reciprocating motion is still missing. MrTrack, an aspiration needle tracker with a mamba-based register mechanism, is proposed. MrTrack leverages a Mamba-based register extractor to sequentially distill global context from each historical search map, storing these temporal cues in a register bank. The Mamba-based register retriever then retrieves temporal prompts from the register bank to provide external cues when current vision features are temporarily unusable due to rapid reciprocating motion and imaging degradation. A self-supervised register diversify loss is proposed to encourage feature diversity and dimension independence within the learned register, mitigating feature collapse. Comprehensive experiments conducted on both robotic and manual aspiration biopsy datasets demonstrate that MrTrack not only outperforms state-of-the-art trackers in accuracy and robustness but also achieves superior inference efficiency. Project page: https://github.com/PieceZhang/MrTrack
中文: MrTrack采用基于Mamba的寄存器机制,在超声引导活检中有效追踪快速往复运动的穿刺针,实现了更高的精确度、鲁棒性和推理效率。
English: MrTrack introduces a Mamba-based register mechanism to track aspiration needles during rapid motion, achieving superior accuracy, robustness, and efficiency in ultrasound-guided biopsies.

Authors:Ma Changfeng, Bi Ran, Guo Jie, Wang Chongjun, Guo Yanwen
Title: Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians
Abstract:
Current learning-based methods predict NeRF or 3D Gaussians from point clouds to achieve photo-realistic rendering but still depend on categorical priors, dense point clouds, or additional refinements. Hence, we introduce a novel point cloud rendering method by predicting 2D Gaussians from point clouds. Our method incorporates two identical modules with an entire-patch architecture enabling the network to be generalized to multiple datasets. The module normalizes and initializes the Gaussians utilizing the point cloud information including normals, colors and distances. Then, splitting decoders are employed to refine the initial Gaussians by duplicating them and predicting more accurate results, making our methodology effectively accommodate sparse point clouds as well. Once trained, our approach exhibits direct generalization to point clouds across different categories. The predicted Gaussians are employed directly for rendering without additional refinement on the rendered images, retaining the benefits of 2D Gaussians. We conduct extensive experiments on various datasets, and the results demonstrate the superiority and generalization of our method, which achieves SOTA performance. The code is available at https://github.com/murcherful/GauPCRender.
中文: 本文提出了一种新颖的点云渲染方法,通过从点云预测二维高斯分布,无需类别先验或密集输入即可实现跨多个数据集的先进渲染效果和泛化能力。
English: This paper introduces a novel point cloud rendering method that predicts 2D Gaussians directly from point clouds, eliminating the need for categorical priors or dense inputs while achieving state-of-the-art performance and generalization across multiple datasets.

Authors:Srinivas Ravuri, Yuan Xu, Martin Ludwig Zehetner, Ketan Motlag, Sahin Albayrak
Title: APR-Transformer: Initial Pose Estimation for Localization in Complex Environments through Absolute Pose Regression
Abstract:
Precise initialization plays a critical role in the performance of localization algorithms, especially in the context of robotics, autonomous driving, and computer vision. Poor localization accuracy is often a consequence of inaccurate initial poses, particularly noticeable in GNSS-denied environments where GPS signals are primarily relied upon for initialization. Recent advances in leveraging deep neural networks for pose regression have led to significant improvements in both accuracy and robustness, especially in estimating complex spatial relationships and orientations. In this paper, we introduce APR-Transformer, a model architecture inspired by state-of-the-art methods, which predicts absolute pose (3D position and 3D orientation) using either image or LiDAR data. We demonstrate that our proposed method achieves state-of-the-art performance on established benchmark datasets such as the Radar Oxford Robot-Car and DeepLoc datasets. Furthermore, we extend our experiments to include our custom complex APR-BeIntelli dataset. Additionally, we validate the reliability of our approach in GNSS-denied environments by deploying the model in real-time on an autonomous test vehicle. This showcases the practical feasibility and effectiveness of our approach. The source code is available at: https://github.com/GT-ARC/APR-Transformer.
中文: APR-Transformer模型通过图像或激光雷达数据实现了最精确的三维位置与方向预测,在GNSS受限环境中经自动驾驶实车验证展现出卓越的可靠性。
English: The APR-Transformer model achieves state-of-the-art accuracy in predicting 3D position and orientation using image or LiDAR data, demonstrating robust performance in GNSS-denied environments through real-world autonomous vehicle testing.
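A generic absolute-pose-regression head of the kind such methods attach to a backbone: one branch regresses 3D translation, another a unit quaternion, trained with a weighted loss. The layer sizes and loss weighting are illustrative assumptions, not the APR-Transformer configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseRegressionHead(nn.Module):
    # Maps a global feature vector to a 3D position and a unit quaternion.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.fc_t = nn.Linear(hidden, 3)   # translation
        self.fc_q = nn.Linear(hidden, 4)   # orientation as a quaternion

    def forward(self, feats):
        h = self.mlp(feats)
        t = self.fc_t(h)
        q = F.normalize(self.fc_q(h), dim=-1)  # keep the quaternion on the unit sphere
        return t, q

def pose_loss(t_pred, q_pred, t_gt, q_gt, beta=10.0):
    # Quaternions are sign-ambiguous: q and -q encode the same rotation.
    q_err = torch.min((q_pred - q_gt).norm(dim=-1), (q_pred + q_gt).norm(dim=-1))
    return F.mse_loss(t_pred, t_gt) + beta * q_err.mean()

head = PoseRegressionHead()
t, q = head(torch.randn(4, 512))
print(pose_loss(t, q, torch.randn(4, 3), F.normalize(torch.randn(4, 4), dim=-1)))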

Authors:Pengli Zhu, Yingji Fu, Nanguang Chen, Anqi Qiu
Title: Q-space Guided Collaborative Attention Translation Network for Flexible Diffusion-Weighted Images Synthesis
Abstract:
In this study, we propose a novel Q-space Guided Collaborative Attention Translation Network (Q-CATN) for multi-shell, high-angular resolution DWI (MS-HARDI) synthesis from flexible q-space sampling, leveraging the commonly acquired structural MRI data. Q-CATN employs a collaborative attention mechanism to effectively extract complementary information from multiple modalities and dynamically adjust its internal representations based on flexible q-space information, eliminating the need for fixed sampling schemes. Additionally, we introduce a range of task-specific constraints to preserve anatomical fidelity in DWI, enabling Q-CATN to accurately learn the intrinsic relationships between directional DWI signal distributions and q-space. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that Q-CATN outperforms existing methods, including 1D-qDL, 2D-qDL, MESC-SD, and QGAN, in estimating parameter maps and fiber tracts both quantitatively and qualitatively, while preserving fine-grained details. Notably, its ability to accommodate flexible q-space sampling highlights its potential as a promising toolkit for clinical and research applications. Our code is available at https://github.com/Idea89560041/Q-CATN.
中文: 本研究提出Q-CATN新型网络,通过协同注意力机制和灵活q空间采样,从结构MRI合成高质量DWI数据,在定量和定性评估中均优于现有方法,同时保持解剖细节。
English: This study introduces Q-CATN, a novel network that uses collaborative attention and flexible q-space sampling to synthesize high-quality DWI data from structural MRI, outperforming existing methods in both quantitative and qualitative evaluations while preserving anatomical details.

Authors:Fares Bougourzi, Abdenour Hadid
Title: Recent Advances in Medical Imaging Segmentation: A Survey
Abstract:
Medical imaging is a cornerstone of modern healthcare, driving advancements in diagnosis, treatment planning, and patient care. Among its various tasks, segmentation remains one of the most challenging problems due to factors such as data accessibility, annotation complexity, structural variability, variation in medical imaging modalities, and privacy constraints. Despite recent progress, achieving robust generalization and domain adaptation remains a significant hurdle, particularly given the resource-intensive nature of some proposed models and their reliance on domain expertise. This survey explores cutting-edge advancements in medical image segmentation, focusing on methodologies such as Generative AI, Few-Shot Learning, Foundation Models, and Universal Models. These approaches offer promising solutions to longstanding challenges. We provide a comprehensive overview of the theoretical foundations, state-of-the-art techniques, and recent applications of these methods. Finally, we discuss inherent limitations, unresolved issues, and future research directions aimed at enhancing the practicality and accessibility of segmentation models in medical imaging. We are maintaining a GitHub repository (https://github.com/faresbougourzi/Awesome-DL-for-Medical-Imaging-Segmentation) to continue tracking and updating innovations in this field.
中文摘要:本综述探讨了生成式人工智能和基础模型等先进方法,以解决医学图像分割中泛化能力和数据可及性等长期挑战,同时讨论了未来研究方向并维护持续更新的资源库。
English Summary: This survey examines advanced methods like Generative AI and Foundation Models to tackle persistent challenges in medical image segmentation, such as generalization and data accessibility, while also discussing future research directions and maintaining an updated resource repository.

Authors:Bin-Bin Gao
Title: MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning
Abstract:
Zero- and few-shot visual anomaly segmentation relies on powerful vision-language models that detect unseen anomalies using manually designed textual prompts. However, visual representations are inherently independent of language. In this paper, we explore the potential of a pure visual foundation model as an alternative to widely used vision-language models for universal visual anomaly segmentation. We present a novel paradigm that unifies anomaly segmentation into change segmentation. This paradigm enables us to leverage large-scale synthetic image pairs, featuring object-level and local region changes, derived from existing image datasets, which are independent of target anomaly datasets. We propose a one-prompt Meta-learning framework for Universal Anomaly Segmentation (MetaUAS) that is trained on this synthetic dataset and then generalizes well to segment any novel or unseen visual anomalies in the real world. To handle geometrical variations between prompt and query images, we propose a soft feature alignment module that bridges paired-image change perception and single-image semantic segmentation. This is the first work to achieve universal anomaly segmentation using a pure vision model without relying on special anomaly detection datasets and pre-trained visual-language models. Our method effectively and efficiently segments any anomalies with only one normal image prompt and is training-free, requiring no guidance from language. Our MetaUAS significantly outperforms previous zero-shot, few-shot, and even full-shot anomaly segmentation methods. The code and pre-trained models are available at https://github.com/gaobb/MetaUAS.
Chinese Summary: 本文提出MetaUAS纯视觉基础模型,通过将异常分割统一为变化检测任务,利用合成图像对进行训练,实现无需训练、不依赖语言的单样本异常检测能力。
English Summary: This paper introduces MetaUAS, a pure visual foundation model that achieves universal anomaly segmentation by reframing it as change detection, using synthetic image pairs for training and enabling training-free, language-independent anomaly detection with just one normal image prompt.

Authors:Bin-Bin Gao
Title: Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt
Abstract:
Unsupervised reconstruction networks using self-attention transformers have achieved state-of-the-art performance for multi-class (unified) anomaly detection with a single model. However, these self-attention reconstruction models primarily operate on target features, which may result in perfect reconstruction for both normal and anomaly features due to high consistency with context, leading to failure in detecting anomalies. Additionally, these models often produce inaccurate anomaly segmentation due to performing reconstruction in a low spatial resolution latent space. To enable reconstruction models enjoying high efficiency while enhancing their generalization for unified anomaly detection, we propose a simple yet effective method that reconstructs normal features and restores anomaly features with just One Normal Image Prompt (OneNIP). In contrast to previous work, OneNIP allows for the first time to reconstruct or restore anomalies with just one normal image prompt, effectively boosting unified anomaly detection performance. Furthermore, we propose a supervised refiner that regresses reconstruction errors by using both real normal and synthesized anomalous images, which significantly improves pixel-level anomaly segmentation. OneNIP outperforms previous methods on three industry anomaly detection benchmarks: MVTec, BTAD, and VisA. The code and pre-trained models are available at https://github.com/gaobb/OneNIP.
中文: 提出的OneNIP方法通过单张正常图像提示重构正常特征并修复异常,显著提升了工业基准测试中的统一异常检测性能,同时改进了像素级分割精度。
English: The proposed OneNIP method enables unified anomaly detection by reconstructing normal features and restoring anomalies using a single normal image prompt, significantly enhancing performance across industry benchmarks while improving pixel-level segmentation accuracy.

Authors:Guan Gui, Bin-Bin Gao, Jun Liu, Chengjie Wang, Yunsheng Wu
Title: Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation
Abstract:
Anomaly detection is a practical and challenging task due to the scarcity of anomaly samples in industrial inspection. Some existing anomaly detection methods address this issue by synthesizing anomalies with noise or external data. However, there is always a large semantic gap between synthetic and real-world anomalies, resulting in weak performance in anomaly detection. To solve the problem, we propose a few-shot Anomaly-driven Generation (AnoGen) method, which guides the diffusion model to generate realistic and diverse anomalies with only a few real anomalies, thereby benefiting training anomaly detection models. Specifically, our work is divided into three stages. In the first stage, we learn the anomaly distribution based on a few given real anomalies and inject the learned knowledge into an embedding. In the second stage, we use the embedding and given bounding boxes to guide the diffusion model to generate realistic and diverse anomalies on specific objects (or textures). In the final stage, we propose a weakly-supervised anomaly detection method to train a more powerful model with generated anomalies. Our method builds upon DRAEM and DesTSeg as the foundation model and conducts experiments on the commonly used industrial anomaly detection dataset, MVTec. The experiments demonstrate that our generated anomalies effectively improve the model performance of both anomaly classification and segmentation tasks simultaneously, e.g., DRAEM and DesTSeg achieved a 5.8% and 1.5% improvement in the AU-PR metric on the segmentation task, respectively. The code and generated anomalous data are available at https://github.com/gaobb/AnoGen.
Chinese: 提出的AnoGen方法利用少量真实异常引导扩散模型生成逼真且多样的异常数据,显著提升了在MVTec等工业数据集上异常检测模型在分类和分割任务中的性能。
English: The proposed AnoGen method uses a few real anomalies to guide a diffusion model in generating realistic and diverse anomalies, significantly enhancing the performance of anomaly detection models in classification and segmentation tasks on industrial datasets like MVTec.

Authors:Yitao Zhu, Yuan Yin, Zhenrong Shen, Zihao Zhao, Haiyu Song, Sheng Wang, Dinggang Shen, Qian Wang
Title: UniCAD: Efficient and Extendable Architecture for Multi-Task Computer-Aided Diagnosis System
Abstract:
The growing complexity and scale of visual model pre-training have made developing and deploying multi-task computer-aided diagnosis (CAD) systems increasingly challenging and resource-intensive. Furthermore, the medical imaging community lacks an open-source CAD platform to enable the rapid creation of efficient and extendable diagnostic models. To address these issues, we propose UniCAD, a unified architecture that leverages the robust capabilities of pre-trained vision foundation models to seamlessly handle both 2D and 3D medical images while requiring only minimal task-specific parameters. UniCAD introduces two key innovations: (1) Efficiency: A low-rank adaptation strategy is employed to adapt a pre-trained visual model to the medical image domain, achieving performance on par with fully fine-tuned counterparts while introducing only 0.17% trainable parameters. (2) Plug-and-Play: A modular architecture that combines a frozen foundation model with multiple plug-and-play experts, enabling diverse tasks and seamless functionality expansion. Building on this unified CAD architecture, we establish an open-source platform where researchers can share and access lightweight CAD experts, fostering a more equitable and efficient research ecosystem. Comprehensive experiments across 12 diverse medical datasets demonstrate that UniCAD consistently outperforms existing methods in both accuracy and deployment efficiency. The source code and project page are available at https://mii-laboratory.github.io/UniCAD/.
中文:UniCAD是一种统一架构,利用预训练视觉模型高效处理2D和3D医学图像,仅需极少量任务特定参数,在精度和部署效率上均优于现有方法,并建立了开源平台促进协作研究。
English: UniCAD is a unified architecture that utilizes pre-trained vision models to efficiently handle 2D and 3D medical images with minimal task-specific parameters, outperforming existing methods in accuracy and deployment efficiency while establishing an open-source platform for collaborative research.
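A minimal low-rank adaptation (LoRA) sketch showing how a frozen pre-trained linear layer can be adapted with a tiny trainable update, which is the general mechanism behind the small trainable-parameter fraction reported; the rank, scaling, and where such adapters are inserted are assumptions here, not UniCAD's configuration.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen pre-trained linear layer plus a trainable low-rank update.
    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the foundation weights frozen
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # start as an identity-preserving update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.4f}")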

Authors:Jianlin Sun, Xiaolin Fang, Juwei Guan, Dongdong Gui, Teqi Wang, Tongxin Zhu
Title: DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection
Abstract:
The core challenge in Camouflage Object Detection (COD) lies in the indistinguishable similarity between targets and backgrounds in terms of color, texture, and shape. This causes existing methods to either lose edge details (such as hair-like fine structures) due to over-reliance on global semantic information or be disturbed by similar backgrounds (such as vegetation patterns) when relying solely on local features. We propose DRRNet, a four-stage architecture characterized by a "context-detail-fusion-refinement" pipeline to address these issues. Specifically, we introduce an Omni-Context Feature Extraction Module to capture global camouflage patterns and a Local Detail Extraction Module to supplement microstructural information for the full-scene context module. We then design a module for forming dual representations of scene understanding and structural awareness, which fuses panoramic features and local features across various scales. In the decoder, we also introduce a reverse refinement module that leverages spatial edge priors and frequency-domain noise suppression to perform a two-stage inverse refinement of the output. By applying two successive rounds of inverse refinement, the model effectively suppresses background interference and enhances the continuity of object boundaries. Experimental results demonstrate that DRRNet significantly outperforms state-of-the-art methods on benchmark datasets. Our code is available at https://github.com/jerrySunning/DRRNet.
中文摘要:DRRNet提出包含"上下文-细节-融合-优化"四阶段架构,通过全局伪装模式提取与局部细节补充的双重表征融合,结合逆向优化模块实现背景干扰抑制与边界连续性增强,在伪装目标检测任务中显著优于现有方法。
English Summary: DRRNet introduces a four-stage "context-detail-fusion-refinement" architecture with specialized modules for global context and local detail extraction, achieving superior camouflage object detection through dual-representation fusion and reverse refinement that suppresses background noise while preserving boundary continuity.

Authors:Zechao Guan, Feng Yan, Shuai Du, Lin Ma, Qingshan Liu
Title: TopoDiT-3D: Topology-Aware Diffusion Transformer with Bottleneck Structure for 3D Point Cloud Generation
Abstract:
Recent advancements in Diffusion Transformer (DiT) models have significantly improved 3D point cloud generation. However, existing methods primarily focus on local feature extraction while overlooking global topological information, such as voids, which are crucial for maintaining shape consistency and capturing complex geometries. To address this limitation, we propose TopoDiT-3D, a Topology-Aware Diffusion Transformer with a bottleneck structure for 3D point cloud generation. Specifically, we design the bottleneck structure utilizing Perceiver Resampler, which not only offers a mode to integrate topological information extracted through persistent homology into feature learning, but also adaptively filters out redundant local features to improve training efficiency. Experimental results demonstrate that TopoDiT-3D outperforms state-of-the-art models in visual quality, diversity, and training efficiency. Furthermore, TopoDiT-3D demonstrates the importance of rich topological information for 3D point cloud generation and its synergy with conventional local feature learning. Videos and code are available at https://github.com/Zechao-Guan/TopoDiT-3D.
中文:提出的TopoDiT-3D模型通过采用带感知重采样器的瓶颈结构整合全局拓扑信息,在三维点云生成中实现了优于现有方法的视觉质量、多样性和训练效率。
English: The proposed TopoDiT-3D model enhances 3D point cloud generation by incorporating global topological information through a bottleneck structure with Perceiver Resampler, improving visual quality, diversity, and training efficiency over existing methods.
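A minimal Perceiver-Resampler-style bottleneck in the spirit of the one described: a fixed set of learnable queries cross-attends to the input tokens (point-patch features, optionally concatenated with topology tokens), compressing them into a fixed-length latent set. All dimensions and the single-block structure are illustrative assumptions.

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    # Compress a variable-length token set into a fixed number of latents.
    def __init__(self, dim=256, num_latents=64, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, tokens):
        # tokens: (B, N, dim) local point features, possibly with topology tokens appended.
        q = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)   # latents attend to all input tokens
        return out + self.ff(out)

resampler = PerceiverResampler()
point_tokens = torch.randn(2, 1024, 256)   # e.g., grouped point-patch features
topo_tokens = torch.randn(2, 16, 256)      # e.g., embeddings of persistence diagrams
latents = resampler(torch.cat([point_tokens, topo_tokens], dim=1))
print(latents.shape)  # (2, 64, 256)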

Authors:Dayong Liang, Changmeng Zheng, Zhiyuan Wen, Yi Cai, Xiao-Yong Wei, Qing Li
Title: Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
Abstract:
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs' latent knowledge of object functionalities, converting passive recognition into active reasoning about how objects work together. Finally, we introduce a long-term memory reinforcement learning strategy with a specialized interaction-focused reward function that transforms transient patterns into long-term reasoning heuristics. Extensive experiments demonstrate that our approach significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong improvements on complex scene understanding tasks. The source code can be accessed at https://github.com/open_upon_acceptance.
中文: 本文提出的交互增强场景图推理框架通过融合空间基础与交互感知推理,结合长期记忆强化机制,有效解决了传统场景图在复杂交互推理中的局限性,显著提升了视觉语言模型对交互密集型任务的理解能力。
English: This paper introduces Interaction-augmented Scene Graph Reasoning (ISGR), a framework that overcomes traditional scene graphs' limitations by integrating spatial grounding with interaction-aware reasoning and long-term memory reinforcement, significantly enhancing vision-language models' performance on complex interaction understanding tasks.

Authors:Yuhang Wang, Abdulaziz Alhuraish, Shengming Yuan, Hao Zhou
Title: OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions
Abstract:
Lane Keeping Assist (LKA) is widely adopted in modern vehicles, yet its real-world performance remains underexplored due to proprietary systems and limited data access. This paper presents OpenLKA, the first open, large-scale dataset for LKA evaluation and improvement. It includes 400 hours of driving data from 62 production vehicle models, collected through extensive road testing in Tampa, Florida and global contributions from the Comma.ai driving community. The dataset spans a wide range of challenging scenarios, including complex road geometries, degraded lane markings, adverse weather, lighting conditions and surrounding traffic. The dataset is multimodal, comprising: i) full CAN bus streams, decoded using custom reverse-engineered DBC files to extract key LKA events (e.g., system disengagements, lane detection failures); ii) synchronized high-resolution dash-cam video; iii) real-time outputs from Openpilot, providing accurate estimates of road curvature and lane positioning; iv) enhanced scene annotations generated by Vision Language Models, describing lane visibility, pavement quality, weather, lighting, and traffic conditions. By integrating vehicle-internal signals with high-fidelity perception and rich semantic context, OpenLKA provides a comprehensive platform for benchmarking the real-world performance of production LKA systems, identifying safety-critical operational scenarios, and assessing the readiness of current road infrastructure for autonomous driving. The dataset is publicly available at: https://github.com/OpenLKA/OpenLKA.
Chinese: 本文推出了首个用于评估和改进车道保持辅助系统的开放大规模数据集OpenLKA,包含来自62款车型的400小时多模态驾驶数据,可全面评估实际性能并识别关键安全场景。
English: This paper introduces OpenLKA, the first open and large-scale dataset for evaluating and improving Lane Keeping Assist systems, featuring 400 hours of multimodal driving data from 62 vehicle models to benchmark real-world performance and identify safety-critical scenarios.

Authors:Owen Kwon, Abraham George, Alison Bartsch, Amir Barati Farimani
Title: RT-Cache: Training-Free Retrieval for Real-Time Manipulation
Abstract:
Real robots are expected to repeat the same behavior in new environments with very little new data, yet modern controllers either incur heavy per-step inference or require deployment-time fine-tuning. We propose RT-Cache, a training-free retrieval-as-control pipeline that caches diverse image-action trajectories in a unified vector memory and, at test time, embeds the current frame to retrieve and replay multi-step snippets, replacing per-step model calls. A hierarchical search keeps lookups sub-second at million scale, shifting cost from compute to storage and enabling real-time control on modest GPUs. Across real-robot tasks and large open logs, RT-Cache achieves higher success and lower completion time than strong retrieval baselines (approximately 2x higher success and ~30% faster in our settings), and a single-episode anchoring study shows immediate adaptation to a more complex, contact-rich task without fine-tuning. RT-Cache turns experience into an append-only memory, offering a simple, scalable path to few-shot deployment today and a foundation for multimodal keys and optional integration with high-level policies. Project page: https://rt-cache.github.io/.
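A minimal retrieval-as-control sketch: frame embeddings key an append-only memory of multi-step action snippets, and at test time the nearest key's snippet is replayed instead of running a policy at every step. The embedding dimension, snippet length, and exact-similarity search are placeholders; a system at million scale would use a hierarchical or approximate nearest-neighbor index as the abstract describes.

import numpy as np

class TrajectoryCache:
    # Append-only memory mapping frame embeddings to multi-step action snippets.
    def __init__(self):
        self.keys, self.snippets = [], []

    def add(self, frame_embedding, action_snippet):
        self.keys.append(frame_embedding / np.linalg.norm(frame_embedding))
        self.snippets.append(np.asarray(action_snippet))

    def retrieve(self, frame_embedding):
        query = frame_embedding / np.linalg.norm(frame_embedding)
        sims = np.stack(self.keys) @ query          # cosine similarity over all keys
        return self.snippets[int(np.argmax(sims))]  # replay the best-matching snippet

# Toy usage with random vectors standing in for a learned image encoder.
rng = np.random.default_rng(0)
cache = TrajectoryCache()
for _ in range(100):
    cache.add(rng.normal(size=128), rng.normal(size=(5, 7)))  # 5-step, 7-DoF snippets
snippet = cache.retrieve(rng.normal(size=128))
print(snippet.shape)  # (5, 7): actions executed open-loop before the next lookup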

Authors:Wei-Long Tian, Peng Gao, Xiao Liu, Long Xu, Hamido Fujita, Hanan Aljuai, Mao-Li Wang
Title: Towards Adaptive Meta-Gradient Adversarial Examples for Visual Tracking
Abstract:
In recent years, visual tracking methods based on convolutional neural networks and Transformers have achieved remarkable performance and have been successfully applied in fields such as autonomous driving. However, the numerous security issues exposed by deep learning models have gradually affected the reliable application of visual tracking methods in real-world scenarios. Therefore, how to reveal the security vulnerabilities of existing visual trackers through effective adversarial attacks has become a critical problem that needs to be addressed. To this end, we propose an adaptive meta-gradient adversarial attack (AMGA) method for visual tracking. This method integrates multi-model ensembles and meta-learning strategies, combining momentum mechanisms and Gaussian smoothing, which can significantly enhance the transferability and attack effectiveness of adversarial examples. AMGA randomly selects models from a large model repository, constructs diverse tracking scenarios, and iteratively performs both white- and black-box adversarial attacks in each scenario, optimizing the gradient directions of each model. This paradigm minimizes the gap between white- and black-box adversarial attacks, thus achieving excellent attack performance in black-box scenarios. Extensive experimental results on large-scale datasets such as OTB2015, LaSOT, and GOT-10k demonstrate that AMGA significantly improves the attack performance, transferability, and deception of adversarial examples. Codes and data are available at https://github.com/pgao-lab/AMGA.
中文: 提出的自适应元梯度对抗攻击(AMGA)方法通过整合多模型集成和元学习策略,显著提升了视觉跟踪中对抗样本的迁移性和攻击效果,在黑盒场景下实现了卓越性能。
English: The proposed Adaptive Meta-Gradient Adversarial Attack (AMGA) method enhances the transferability and effectiveness of adversarial examples in visual tracking by integrating multi-model ensembles and meta-learning, achieving superior performance in black-box scenarios.
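A simplified momentum-based ensemble attack in the spirit of the momentum mechanism mentioned above; it is not the full AMGA meta-learning procedure (no model-repository sampling, meta-gradients, or Gaussian smoothing). The loss function and toy models are placeholders.

import torch

def momentum_ensemble_attack(x, models, loss_fn, eps=8 / 255, steps=10, mu=1.0):
    # Iterative sign-gradient attack: average gradients over an ensemble and
    # accumulate them with momentum to improve black-box transferability.
    alpha = eps / steps
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = sum(loss_fn(m(x_adv)) for m in models) / len(models)
        grad = torch.autograd.grad(loss, x_adv)[0]
        g = mu * g + grad / grad.abs().mean().clamp_min(1e-12)   # momentum accumulation
        step = alpha * g.sign()
        x_adv = torch.max(torch.min(x_adv.detach() + step, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv

# Toy usage with stand-in models and a placeholder objective to maximize.
models = [torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.Flatten(),
                              torch.nn.LazyLinear(10)) for _ in range(2)]
x = torch.rand(1, 3, 32, 32)
adv = momentum_ensemble_attack(x, models, loss_fn=lambda out: out.var())
print((adv - x).abs().max())   # perturbation stays within the epsilon ball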

Authors:Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji
Title: Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Abstract:
In standard pre-training of large vision-language models (LVLMs), the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token for LVLM training based on its probability under this reference. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.
Chinese: PRIOR是一种新颖的视觉语言预训练方法,通过基于纯文本参考模型的重要性分数对图像相关标记进行差异化损失加权,有效减少大型视觉语言模型的幻觉现象,相比标准的下一个标记预测方法实现了显著的性能提升。
English: PRIOR is a novel vision-language pre-training method that reduces hallucination in large vision-language models by prioritizing image-related tokens through differential loss weighting based on importance scores from a text-only reference model, achieving significant performance improvements over standard next-token prediction.
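A minimal sketch of reference-model token reweighting: each caption token's cross-entropy is scaled by an importance score derived from a text-only LM's probability, so tokens that are easy to predict without the image contribute less. The specific weighting function and normalization below are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def weighted_ntp_loss(lvlm_logits, target_ids, ref_token_logprobs):
    # lvlm_logits:        (B, T, V) logits of the vision-language model
    # target_ids:         (B, T) ground-truth caption token ids
    # ref_token_logprobs: (B, T) log p_ref(token | previous text only)
    ce = F.cross_entropy(lvlm_logits.flatten(0, 1), target_ids.flatten(),
                         reduction="none").view_as(target_ids).float()
    weights = 1.0 - ref_token_logprobs.exp()        # one simple choice of importance score
    weights = weights / weights.mean().clamp_min(1e-6)
    return (weights * ce).mean()

logits = torch.randn(2, 16, 1000)
targets = torch.randint(0, 1000, (2, 16))
ref_lp = torch.log(torch.rand(2, 16).clamp(0.05, 0.95))
print(weighted_ntp_loss(logits, targets, ref_lp))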

Authors:Yancheng Wang, Nebojsa Jojic, Yingzhen Yang
Title: Differentiable Channel Selection in Self-Attention For Person Re-Identification
Abstract:
In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at https://github.com/Statistical-Deep-Learning/DCS-Attention.
中文: 本文提出的DCS-Attention模块通过可微分通道选择机制,基于信息瓶颈原理优化特征提取,在行人重识别任务中实现了最优性能。
English: The paper introduces the DCS-Attention module, which differentially selects informative channels based on the Information Bottleneck principle to enhance feature extraction and achieve state-of-the-art performance in person Re-ID tasks.
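An illustrative sketch of differentiable channel selection inside self-attention: a learnable per-channel gate, relaxed with a sigmoid, masks the channels used to compute the attention weights. This omits the paper's DNAS variants and the variational IB objective; the shapes and gating form are assumptions.

import torch
import torch.nn as nn

class ChannelSelectAttention(nn.Module):
    # Self-attention whose similarity is computed over a learned subset of channels.
    def __init__(self, dim=256, temperature=1.0):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.gate_logits = nn.Parameter(torch.zeros(dim))  # one logit per channel
        self.temperature = temperature
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (B, N, dim) token features from a backbone.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        gate = torch.sigmoid(self.gate_logits / self.temperature)  # soft channel mask
        attn = torch.softmax((q * gate) @ (k * gate).transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

block = ChannelSelectAttention(dim=256)
tokens = torch.randn(2, 49, 256)   # e.g., a 7x7 feature map flattened into tokens
print(block(tokens).shape)         # (2, 49, 256)

In practice, a sparsity-encouraging penalty on the gates (such as an IB-style bound) would be added to the training loss so that only informative channels remain active.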

Authors:Kangxian Xie, Yufei Zhu, Kaiming Kuang, Li Zhang, Hongwei Bran Li, Mingchen Gao, Jiancheng Yang
Title: Template-Guided Reconstruction of Pulmonary Segments with Neural Implicit Functions
Abstract:
High-quality 3D reconstruction of pulmonary segments plays a crucial role in segmentectomy and surgical treatment planning for lung cancer. Due to the resolution requirement of the target reconstruction, conventional deep learning-based methods often suffer from computational resource constraints or limited granularity. Conversely, implicit modeling is favored due to its computational efficiency and continuous representation at any resolution. We propose a neural implicit function-based method to learn a 3D surface to achieve anatomy-aware, precise pulmonary segment reconstruction, represented as a shape by deforming a learnable template. Additionally, we introduce two clinically relevant evaluation metrics to assess the reconstruction comprehensively. Further, due to the absence of publicly available shape datasets to benchmark reconstruction algorithms, we developed a shape dataset named Lung3D, including the 3D models of 800 labeled pulmonary segments and the corresponding airways, arteries, veins, and intersegmental veins. We demonstrate that the proposed approach outperforms existing methods, providing a new perspective for pulmonary segment reconstruction. Code and data will be available at https://github.com/M3DV/ImPulSe.
中文摘要:本研究提出的基于神经隐式函数的方法,通过可学习模板变形实现解剖感知的精确肺段三维重建,其计算效率和重建质量优于现有方法,并辅以新型Lung3D数据集和临床评估指标。
English Summary: The proposed neural implicit function-based method enables high-quality, anatomy-aware 3D pulmonary segment reconstruction with superior computational efficiency and precision compared to existing approaches, supported by a new Lung3D dataset and clinical evaluation metrics.

Authors:Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji
Title: Behind Maya: Building a Multilingual Vision Language Model
Abstract:
In recent times, we have seen a rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages, but underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
中文:近期大型视觉语言模型在主流语言上表现优异,但在低资源语言和文化多样性方面存在不足,为此我们推出了开源多语言模型Maya,它通过支持八种语言的多语言数据集和模型,显著提升了跨文化视觉语言任务的理解能力。
English: Recent advances in large Vision-Language Models have excelled in major languages but struggle with low-resource languages and cultural diversity, prompting the introduction of Maya, an open-source multilingual VLM that enhances cross-cultural understanding through a multilingual dataset and model supporting eight languages.

Authors:Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, Hengxu You, Juntong Peng, Junge Zhang, Zehao Wang, Rui Song, Mingxuan Yan, Walter Zimmer, Xingcheng Zhou, Peiran Li, Zhaohan Lu, Chia-Ju Chen, Yue Huang, Ryan A. Rossi, Lichao Sun, Hongkai Yu, Zhiwen Fan, Frank Hao Yang, Yuhao Kang, Ross Greer, Chenxi Liu, Eun Hak Lee, Xuan Di, Xinyue Ye, Liu Ren, Alois Knoll, Xiaopeng Li, Shuiwang Ji, Masayoshi Tomizuka, Marco Pavone, Tianbao Yang, Jing Du, Ming-Hsuan Yang, Hua Wei, Ziran Wang, Yang Zhou, Jiachen Li, Zhengzhong Tu
Title: Generative AI for Autonomous Driving: Frontiers and Opportunities
Abstract:
Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.
中文: 生成式人工智能通过内容创作与推理能力正彻底变革自动驾驶技术,为实现L5级全自动驾驶提供了最有前景的路径,同时需应对安全性、泛化能力和实际应用等关键挑战。
English: Generative AI is revolutionizing autonomous driving by enabling advanced content creation and reasoning, offering the most promising path toward achieving Level 5 autonomy while addressing key challenges in safety, generalization, and implementation.

Authors:Hanjung Kim, Jaehyun Kang, Hyolim Kang, Meedeum Cho, Seon Joo Kim, Youngwoon Lee
Title: UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
Abstract:
Mimicry is a fundamental learning mechanism in humans, enabling individuals to learn new tasks by observing and imitating experts. However, applying this ability to robots presents significant challenges due to the inherent differences between human and robot embodiments in both their visual appearance and physical capabilities. While previous methods bridge this gap using cross-embodiment datasets with shared scenes and tasks, collecting such aligned data between humans and robots at scale is not trivial. In this paper, we propose UniSkill, a novel framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels, enabling skills extracted from human video prompts to effectively transfer to robot policies trained only on robot data. Our experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts. The project website can be found at: https://kimhanjung.github.io/UniSkill.

Authors:Huiyan Qi, Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Ee-Peng Lim
Title: Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion
Abstract:
Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited due to the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF²) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representation to achieve accurate nutritional prediction. During testing, ingredient predictions are refined using large multimodal models by data augmentation and majority voting. Our experiments on both FastFood and Nutrition5k datasets validate the effectiveness of our proposed method built in different backbones (e.g., Resnet, InceptionV3 and ViT), which demonstrates the importance of ingredient information in nutrition estimation. https://huiyanqi.github.io/fastfood-nutrition-estimation/.

Authors:Zongchuang Zhao, Haoyu Fu, Dingkang Liang, Xin Zhou, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai
Title: Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
Abstract:
Large Vision-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and partial objects within scenes, struggling to achieve comprehensive scene understanding. Meanwhile, existing LVLMs suffer from the lack of a mapping relationship between 2D and 3D and insufficient integration of 3D object localization and instruction understanding. To tackle these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image language pairs spanning dense scene captions and diverse interactive tasks. Furthermore, we propose DriveMonkey, a simple yet effective framework that seamlessly integrates LVLMs with a spatial processor using a series of learnable queries. The spatial processor, designed as a plug-and-play component, can be initialized with pre-trained 3D detectors to improve 3D perception. Our experiments show that DriveMonkey outperforms general LVLMs, especially achieving a 9.86% notable improvement on the 3D visual grounding task. The dataset and code will be released at https://github.com/zc-zhao/DriveMonkey.
Chinese: 本研究提出了DriveMonkey框架,通过集成空间处理与3D感知能力来增强大型视觉语言模型在自动驾驶中的应用,利用NuInteract数据集显著提升了场景理解和3D视觉定位性能。
English: This study introduces DriveMonkey, a framework that enhances large visual-language models for autonomous driving by integrating spatial processing with 3D perception, using the NuInteract dataset to improve scene understanding and 3D visual grounding.

Authors:Xiaolei Qin, Di Wang, Jing Zhang, Fengxiang Wang, Xin Su, Bo Du, Liangpei Zhang
Title: TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series
Abstract:
Satellite image time series (SITS) provide continuous observations of the Earth's surface, making them essential for applications such as environmental management and disaster assessment. However, existing spatiotemporal foundation models rely on plain vision transformers, which encode entire temporal sequences without explicitly capturing multiscale spatiotemporal relationships between land objects. This limitation hinders their effectiveness in downstream tasks. To overcome this challenge, we propose TiMo, a novel hierarchical vision transformer foundation model tailored for SITS analysis. At its core, we introduce a spatiotemporal gyroscope attention mechanism that dynamically captures evolving multiscale patterns across both time and space. For pre-training, we curate MillionST, a large-scale dataset of one million images from 100,000 geographic locations, each captured across 10 temporal phases over five years, encompassing diverse geospatial changes and seasonal variations. Leveraging this dataset, we adapt masked image modeling to pre-train TiMo, enabling it to effectively learn and encode generalizable spatiotemporal representations. Extensive experiments across multiple spatiotemporal tasks, including deforestation monitoring, land cover segmentation, crop type classification, and flood detection, demonstrate TiMo's superiority over state-of-the-art methods. Code, model, and dataset will be released at https://github.com/MiliLab/TiMo.
中文: 提出的TiMo模型通过时空陀螺仪注意力机制捕捉卫星图像时间序列中的多尺度模式,并利用大规模MillionST数据集进行预训练,在多项环境监测任务中实现了优于现有方法的性能。
English: The proposed TiMo model introduces a spatiotemporal gyroscope attention mechanism to capture multiscale patterns in satellite image time series, achieving superior performance across multiple environmental monitoring tasks through pre-training on the large-scale MillionST dataset.

Authors:Edoardo Bianchi, Antonio Liotta
Title: SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation
Abstract:
Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. In fact, when evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment. Project page at https://edowhite.github.io/SkillFormer
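A minimal cross-view fusion sketch in the spirit described: egocentric tokens cross-attend to exocentric tokens and a learnable gate blends the attended features back into the ego stream. The dimensions and gating form are illustrative assumptions, not SkillFormer's exact CrossViewFusion module.

import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    # Fuse egocentric and exocentric token streams with gated cross-attention.
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, ego_tokens, exo_tokens):
        # ego/exo tokens: (B, N, dim) per-view features from a video backbone.
        attended, _ = self.cross_attn(ego_tokens, exo_tokens, exo_tokens)
        g = self.gate(torch.cat([ego_tokens, attended], dim=-1))  # learned per-token mix
        return g * attended + (1.0 - g) * ego_tokens

fusion = CrossViewFusion()
fused = fusion(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
print(fused.shape)  # (2, 196, 768), ready for a proficiency-classification head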

Authors:Holly Dinkel, Marcel Büsching, Alberta Longhini, Brian Coltin, Trey Smith, Danica Kragic, Mårten Björkman, Timothy Bretl
Title: DLO-Splatting: Tracking Deformable Linear Objects Using 3D Gaussian Splatting
Abstract:
This work presents DLO-Splatting, an algorithm for estimating the 3D shape of Deformable Linear Objects (DLOs) from multi-view RGB images and gripper state information through prediction-update filtering. The DLO-Splatting algorithm uses a position-based dynamics model with shape smoothness and rigidity dampening corrections to predict the object shape. Optimization with a 3D Gaussian Splatting-based rendering loss iteratively renders and refines the prediction to align it with the visual observations in the update step. Initial experiments demonstrate promising results in a knot tying scenario, which is challenging for existing vision-only methods.

Authors:Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
Title: Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Abstract:
Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.
中文摘要:提出的视觉引导解码(VGD)方法利用大语言模型和CLIP指导,为文生图模型生成连贯、可读的提示文本,无需额外训练即可在可解释性和上下文相关性上超越现有技术。
English Summary: The proposed Visually Guided Decoding (VGD) method uses large language models and CLIP guidance to generate coherent, human-readable prompts for text-to-image models, outperforming existing techniques in interpretability and contextual relevance without requiring additional training.
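A minimal sketch of the gradient-free, CLIP-guided decoding idea: at each step, top-k continuations proposed by a language model are re-ranked by their CLIP similarity to the target image. Model choices (GPT-2, CLIP ViT-B/32) and the simple additive re-ranking rule are illustrative assumptions, not the authors' exact procedure.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(texts, image):
    inputs = clip_proc(text=texts, images=image, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    return clip(**inputs).logits_per_image[0]        # similarity of image to each text

@torch.no_grad()
def visually_guided_decode(image, prefix="a photo of", steps=12, k=20, alpha=0.02):
    ids = lm_tok(prefix, return_tensors="pt").input_ids.to(device)
    for _ in range(steps):
        logits = lm(ids).logits[0, -1]                       # next-token distribution
        topk = torch.topk(logits, k)
        cands = [torch.cat([ids, t.view(1, 1)], dim=1) for t in topk.indices]
        texts = [lm_tok.decode(c[0], skip_special_tokens=True) for c in cands]
        # fluency (renormalized LM log-probs) plus visual alignment (CLIP logits)
        score = topk.values.log_softmax(-1) + alpha * clip_scores(texts, image)
        ids = cands[int(score.argmax())]
    return lm_tok.decode(ids[0], skip_special_tokens=True)

# prompt = visually_guided_decode(Image.open("target.jpg"))
```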

Authors:Ziyuan He, Zhiqing Guo, Liejun Wang, Gaobo Yang, Yunfeng Diao, Dan Ma
Title: WaveGuard: Robust Deepfake Detection and Source Tracing via Dual-Tree Complex Wavelet and Graph Neural Networks
Abstract:
Deepfake technology poses increasing risks such as privacy invasion and identity theft. To address these threats, we propose WaveGuard, a proactive watermarking framework that enhances robustness and imperceptibility via frequency-domain embedding and graph-based structural consistency. Specifically, we embed watermarks into high-frequency sub-bands using Dual-Tree Complex Wavelet Transform (DT-CWT) and employ a Structural Consistency Graph Neural Network (SC-GNN) to preserve visual quality. We also design an attention module to refine embedding precision. Experimental results on face swap and reenactment tasks demonstrate that WaveGuard outperforms state-of-the-art methods in both robustness and visual quality. Code is available at https://github.com/vpsg-research/WaveGuard.
中文:摘要介绍了WaveGuard,一种利用频域嵌入和图结构一致性的主动水印框架,通过实验证明其在对抗深度伪造威胁方面具有卓越的鲁棒性和视觉质量。
English: The abstract introduces WaveGuard, a proactive watermarking framework that uses frequency-domain embedding and graph-based structural consistency to combat deepfake threats, showing superior robustness and visual quality in experiments.

Authors:Haofeng Liu, Mingqi Gao, Xuxiao Luo, Ziyue Wang, Guanyi Qin, Junde Wu, Yueming Jin
Title: ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking
Abstract:
Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies the reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real-time at 61.2 FPS. Our code and datasets will be available at https://github.com/jinlab-imvr/ReSurgSAM2.
Chinese: ReSurgSAM2提出了一种两阶段的手术场景分割框架,通过跨模态检测和长期跟踪提升精度与效率,实现了61.2 FPS的实时性能。
English: ReSurgSAM2 introduces a two-stage framework for surgical scene segmentation, enhancing accuracy and efficiency through cross-modal detection and long-term tracking, achieving real-time performance at 61.2 FPS.

Authors:Xiao Ni, Carsten Kuehnel, Xiaoyi Jiang
Title: Thermal Detection of People with Mobility Restrictions for Barrier Reduction at Traffic Lights Controlled Intersections
Abstract:
Rapid advances in deep learning for computer vision have driven the adoption of RGB camera-based adaptive traffic light systems to improve traffic safety and pedestrian comfort. However, these systems often overlook the needs of people with mobility restrictions. Moreover, the use of RGB cameras presents significant challenges, including limited detection performance under adverse weather or low-visibility conditions, as well as heightened privacy concerns. To address these issues, we propose a fully automated, thermal detector-based traffic light system that dynamically adjusts signal durations for individuals with walking impairments or mobility burden and triggers the auditory signal for visually impaired individuals, thereby advancing towards barrier-free intersections for all users. To this end, we build the thermal dataset for people with mobility restrictions (TD4PWMR), designed to capture diverse pedestrian scenarios, particularly focusing on individuals with mobility aids or mobility burden under varying environmental conditions, such as different lighting, weather, and crowded urban settings. While thermal imaging offers advantages in terms of privacy and robustness to adverse conditions, it also introduces inherent hurdles for object detection due to its lack of color and fine texture details and the generally lower resolution of thermal images. To overcome these limitations, we develop YOLO-Thermal, a novel variant of the YOLO architecture that integrates advanced feature extraction and attention mechanisms for enhanced detection accuracy and robustness in thermal imaging. Experiments demonstrate that the proposed thermal detector outperforms existing detectors, while the proposed traffic light system effectively enhances barrier-free intersections. The source codes and dataset are available at https://github.com/leon2014dresden/YOLO-THERMAL.
中文: 深度学习进步催生了基于RGB摄像头的交通灯系统,但其忽视行动不便者且存在天气和隐私问题,因此开发了基于热成像检测器的系统,结合YOLO-Thermal提升无障碍交叉口的效能。
English: Deep learning advancements have led to RGB-based traffic light systems, but they neglect mobility-impaired individuals and face weather and privacy issues, prompting the development of a thermal detector-based system with YOLO-Thermal for enhanced barrier-free intersections.

Authors:Zheang Huai, Hui Tang, Yi Li, Zhuangzhuang Chen, Xiaomeng Li
Title: Leveraging Segment Anything Model for Source-Free Domain Adaptation via Dual Feature Guided Auto-Prompting
Abstract:
Source-free domain adaptation (SFDA) for segmentation aims at adapting a model trained in the source domain to perform well in the target domain with only the source model and unlabeled target data. Inspired by the recent success of Segment Anything Model (SAM) which exhibits the generality of segmenting images of various modalities and in different domains given human-annotated prompts like bounding boxes or points, we for the first time explore the potentials of Segment Anything Model for SFDA via automatedly finding an accurate bounding box prompt. We find that the bounding boxes directly generated with existing SFDA approaches are defective due to the domain gap. To tackle this issue, we propose a novel Dual Feature Guided (DFG) auto-prompting approach to search for the box prompt. Specifically, the source model is first trained in a feature aggregation phase, which not only preliminarily adapts the source model to the target domain but also builds a feature distribution well-prepared for box prompt search. In the second phase, based on two feature distribution observations, we gradually expand the box prompt with the guidance of the target model feature and the SAM feature to handle the class-wise clustered target features and the class-wise dispersed target features, respectively. To remove the potentially enlarged false positive regions caused by the over-confident prediction of the target model, the refined pseudo-labels produced by SAM are further postprocessed based on connectivity analysis. Experiments on 3D and 2D datasets indicate that our approach yields superior performance compared to conventional methods. Code is available at https://github.com/xmed-lab/DFG.
Chinese: 本研究提出了一种双重特征引导的自动提示方法,通过利用Segment Anything模型自动生成精确的边界框提示来改进无源域自适应分割,在3D和2D数据集上相比传统方法实现了更优的性能。
English: This study introduces a Dual Feature Guided auto-prompting method to enhance source-free domain adaptation for segmentation by leveraging the Segment Anything Model to automatically generate accurate bounding box prompts, achieving superior performance on both 3D and 2D datasets compared to conventional approaches.
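A hedged sketch of feature-guided box-prompt expansion followed by SAM prediction, in the spirit of the DFG auto-prompting idea: a candidate box is grown while newly covered features remain similar to the current foreground. The expansion criterion (cosine similarity of the enlarged region to the current foreground mean) and all thresholds are simplified assumptions, not the paper's exact rule.

```python
import numpy as np
import torch
import torch.nn.functional as F

def expand_box(feat_map, box, sim_thresh=0.8, step=2, max_iter=20):
    # feat_map: (C, H, W) target-model feature map; box: [x0, y0, x1, y1] in feature coords.
    C, H, W = feat_map.shape
    x0, y0, x1, y1 = box
    for _ in range(max_iter):
        inside = feat_map[:, y0:y1, x0:x1].reshape(C, -1).mean(dim=1)   # foreground prototype
        nx0, ny0 = max(x0 - step, 0), max(y0 - step, 0)
        nx1, ny1 = min(x1 + step, W), min(y1 + step, H)
        region = feat_map[:, ny0:ny1, nx0:nx1].reshape(C, -1)
        sim = F.cosine_similarity(region, inside[:, None], dim=0).mean()
        if sim < sim_thresh or (nx0, ny0, nx1, ny1) == (x0, y0, x1, y1):
            break                                                       # stop growing the prompt
        x0, y0, x1, y1 = nx0, ny0, nx1, ny1
    return [x0, y0, x1, y1]

# The resulting box (rescaled to image coords) can be fed to a SAM predictor, e.g.
# masks, scores, _ = predictor.predict(box=np.array(expanded_box), multimask_output=False)
```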

Authors:Wenkui Yang, Zhida Zhang, Xiaoqiang Zhou, Junxian Duan, Jie Cao
Title: TT-DF: A Large-Scale Diffusion-Based Dataset and Benchmark for Human Body Forgery Detection
Abstract:
The emergence and popularity of facial deepfake methods spur the vigorous development of deepfake datasets and facial forgery detection, which to some extent alleviates the security concerns about facial-related artificial intelligence technologies. However, when it comes to human body forgery, there has been a persistent lack of datasets and detection methods, due to the later inception and complexity of human body generation methods. To mitigate this issue, we introduce TikTok-DeepFake (TT-DF), a novel large-scale diffusion-based dataset containing 6,120 forged videos with 1,378,857 synthetic frames, specifically tailored for body forgery detection. TT-DF offers a wide variety of forgery methods, involving multiple advanced human image animation models utilized for manipulation, two generative configurations based on the disentanglement of identity and pose information, as well as different compressed versions. The aim is to simulate any potential unseen forged data in the wild as comprehensively as possible, and we also furnish a benchmark on TT-DF. Additionally, we propose an adapted body forgery detection model, Temporal Optical Flow Network (TOF-Net), which exploits the spatiotemporal inconsistencies and optical flow distribution differences between natural data and forged data. Our experiments demonstrate that TOF-Net achieves favorable performance on TT-DF, outperforming current state-of-the-art extendable facial forgery detection models. For our TT-DF dataset, please refer to https://github.com/HashTAG00002/TT-DF.
Chinese: TikTok-DeepFake数据集的推出填补了人体伪造检测领域的数据空白,同时提出的TOF-Net模型通过利用时空和光流分析实现了优越的检测性能。
English: The introduction of the TikTok-DeepFake dataset addresses the scarcity of resources for human body forgery detection, while the proposed TOF-Net model demonstrates superior performance by leveraging spatiotemporal and optical flow analysis.

Authors:Nibir Chandra Mandal, Oishee Bintey Hoque, Abhijin Adiga, Samarth Swarup, Mandy Wilson, Lu Feng, Yangfeng Ji, Miaomiao Zhang, Geoffrey Fox, Madhav Marathe
Title: IrrMap: A Large-Scale Comprehensive Dataset for Irrigation Method Mapping
Abstract:
We introduce IrrMap, the first large-scale dataset (1.1 million patches) for irrigation method mapping across regions. IrrMap consists of multi-resolution satellite imagery from LandSat and Sentinel, along with key auxiliary data such as crop type, land use, and vegetation indices. The dataset spans 1,687,899 farms and 14,117,330 acres across multiple western U.S. states from 2013 to 2023, providing a rich and diverse foundation for irrigation analysis and ensuring geospatial alignment and quality control. The dataset is ML-ready, with standardized 224x224 GeoTIFF patches, multiple input modalities, carefully chosen train-test splits, and accompanying dataloaders for seamless deep learning model training and benchmarking in irrigation mapping. The dataset is also accompanied by a complete pipeline for dataset generation, enabling researchers to extend IrrMap to new regions for irrigation data collection or adapt it with minimal effort for other similar applications in agricultural and geospatial analysis. We also analyze the irrigation method distribution across crop groups, spatial irrigation patterns (using Shannon diversity indices), and irrigated area variations for both LandSat and Sentinel, providing insights into regional and resolution-based differences. To promote further exploration, we openly release IrrMap, along with the derived datasets, benchmark models, and pipeline code, through a GitHub repository: https://github.com/Nibir088/IrrMap and Data repository: https://huggingface.co/Nibir/IrrMap, providing comprehensive documentation and implementation details.
中文: IrrMap是首个用于灌溉方法测绘的大规模数据集,包含美国多个农场2013-2023年的多分辨率卫星影像和辅助数据,专为机器学习应用设计,并附带完整文档开放共享。
English: IrrMap is the first large-scale dataset for irrigation method mapping, featuring multi-resolution satellite imagery and auxiliary data from 2013-2023 across U.S. farms, designed for machine learning applications and openly available with full documentation.
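A hedged sketch of loading ML-ready 224x224 GeoTIFF patches like those described for IrrMap into a PyTorch dataloader. The file layout, band order, and absence of label handling are assumptions; the released dataloaders in the repository should be preferred.

```python
import glob
import numpy as np
import rasterio
import torch
from torch.utils.data import Dataset, DataLoader

class PatchDataset(Dataset):
    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/*.tif"))   # assumed flat directory of patches

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with rasterio.open(self.paths[idx]) as src:
            patch = src.read().astype(np.float32)         # (bands, 224, 224)
        return torch.from_numpy(patch)

# loader = DataLoader(PatchDataset("irrmap/train"), batch_size=32, shuffle=True)
```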

Authors:Midi Wan, Pengfei Li, Yizhuo Liang, Di Wu, Yushan Pan, Guangzhen Zhu, Hao Wang
Title: Skeleton-Guided Diffusion Model for Accurate Foot X-ray Synthesis in Hallux Valgus Diagnosis
Abstract:
Medical image synthesis plays a crucial role in providing anatomically accurate images for diagnosis and treatment. Hallux valgus, which affects approximately 19% of the global population, requires frequent weight-bearing X-rays for assessment, placing additional strain on both patients and healthcare providers. Existing X-ray models often struggle to balance image fidelity, skeletal consistency, and physical constraints, particularly in diffusion-based methods that lack skeletal guidance. We propose the Skeletal-Constrained Conditional Diffusion Model (SCCDM) and introduce KCC, a foot evaluation method utilizing skeletal landmarks. SCCDM incorporates multi-scale feature extraction and attention mechanisms, improving the Structural Similarity Index (SSIM) by 5.72% (0.794) and Peak Signal-to-Noise Ratio (PSNR) by 18.34% (21.40 dB). When combined with KCC, the model achieves an average score of 0.85, demonstrating strong clinical applicability. The code is available at https://github.com/midisec/SCCDM.
中文: 提出的骨骼约束条件扩散模型(SCCDM)通过多尺度特征提取和注意力机制,显著提升了用于拇外翻评估的医学图像合成质量,SSIM提高5.72%,PSNR提升18.34%,结合KCC评估方法展现出优异的临床适用性。
English: The proposed Skeletal-Constrained Conditional Diffusion Model (SCCDM) with multi-scale feature extraction and attention mechanisms significantly improves medical image synthesis for hallux valgus assessment, achieving 5.72% higher SSIM and 18.34% better PSNR while demonstrating strong clinical utility when paired with the KCC evaluation method.

Authors:He Huang, Qi Yang, Mufan Liu, Yiling Xu, Zhu Li
Title: ADC-GS: Anchor-Driven Deformable and Compressed Gaussian Splatting for Dynamic Scene Reconstruction
Abstract:
Existing 4D Gaussian Splatting methods rely on per-Gaussian deformation from a canonical space to target frames, which overlooks redundancy among adjacent Gaussian primitives and results in suboptimal performance. To address this limitation, we propose Anchor-Driven Deformable and Compressed Gaussian Splatting (ADC-GS), a compact and efficient representation for dynamic scene reconstruction. Specifically, ADC-GS organizes Gaussian primitives into an anchor-based structure within the canonical space, enhanced by a temporal significance-based anchor refinement strategy. To reduce deformation redundancy, ADC-GS introduces a hierarchical coarse-to-fine pipeline that captures motions at varying granularities. Moreover, a rate-distortion optimization is adopted to achieve an optimal balance between bitrate consumption and representation fidelity. Experimental results demonstrate that ADC-GS outperforms the per-Gaussian deformation approaches in rendering speed by 300%-800% while achieving state-of-the-art storage efficiency without compromising rendering quality. The code is released at https://github.com/H-Huang774/ADC-GS.git.
中文: 现有4D高斯泼溅方法因逐高斯形变导致效率低下,而提出的ADC-GS通过锚点结构和分层运动管道解决了这一问题,在保持渲染质量的同时实现300%-800%的加速和最优存储效率。
English: Existing 4D Gaussian Splatting methods suffer from inefficiency due to per-Gaussian deformation, but the proposed ADC-GS overcomes this with an anchor-based structure and hierarchical motion pipeline, achieving 300%-800% faster rendering and superior storage efficiency without quality loss.

Authors:Laszlo Szilagyi, Francis Engelmann, Jeannette Bohg
Title: SLAG: Scalable Language-Augmented Gaussian Splatting
Abstract:
Language-augmented scene representations hold great promise for large-scale robotics applications such as search-and-rescue, smart cities, and mining. Many of these scenarios are time-sensitive, requiring rapid scene encoding while also being data-intensive, necessitating scalable solutions. Deploying these representations on robots with limited computational resources further adds to the challenge. To address this, we introduce SLAG, a multi-GPU framework for language-augmented Gaussian splatting that enhances the speed and scalability of embedding large scenes. Our method integrates 2D visual-language model features into 3D scenes using SAM and CLIP. Unlike prior approaches, SLAG eliminates the need for a loss function to compute per-Gaussian language embeddings. Instead, it derives embeddings from 3D Gaussian scene parameters via a normalized weighted average, enabling highly parallelized scene encoding. Additionally, we introduce a vector database for efficient embedding storage and retrieval. Our experiments show that SLAG achieves an 18 times speedup in embedding computation on a 16-GPU setup compared to OpenGaussian, while preserving embedding quality on the ScanNet and LERF datasets. For more details, visit our project website: https://slag-project.github.io/.
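A minimal sketch of the "normalized weighted average" idea described for SLAG: each Gaussian's language embedding is an average of the 2D vision-language features it touches, weighted by its rendering contribution, with no per-Gaussian loss. The weight definition (alpha-blending contributions) and the flat sample layout are assumptions.

```python
import torch

def aggregate_gaussian_embeddings(pixel_feats, weights, gaussian_ids, num_gaussians):
    """
    pixel_feats:  (N, D) 2D vision-language features (e.g., SAM/CLIP-derived) at sampled pixels.
    weights:      (N,)   rendering contribution of the responsible Gaussian at each sample.
    gaussian_ids: (N,)   long tensor, index of the contributing Gaussian per sample.
    Returns (num_gaussians, D) per-Gaussian language embeddings.
    """
    D = pixel_feats.shape[1]
    emb = torch.zeros(num_gaussians, D, device=pixel_feats.device)
    norm = torch.zeros(num_gaussians, 1, device=pixel_feats.device)
    emb.index_add_(0, gaussian_ids, weights[:, None] * pixel_feats)   # weighted sum per Gaussian
    norm.index_add_(0, gaussian_ids, weights[:, None])                # total weight per Gaussian
    return emb / norm.clamp_min(1e-8)
```

Because the aggregation is a pure scatter-add, it parallelizes trivially across GPUs, which matches the scalability motivation in the abstract.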

Authors:Luu Tung Hai, Thinh D. Le, Zhicheng Ding, Qing Tian, Truong-Son Hy
Title: Topology-Guided Knowledge Distillation for Efficient Point Cloud Processing
Abstract:
Point cloud processing has gained significant attention due to its critical role in applications such as autonomous driving and 3D object recognition. However, deploying high-performance models like Point Transformer V3 in resource-constrained environments remains challenging due to their high computational and memory demands. This work introduces a novel distillation framework that leverages topology-aware representations and gradient-guided knowledge distillation to effectively transfer knowledge from a high-capacity teacher to a lightweight student model. Our approach captures the underlying geometric structures of point clouds while selectively guiding the student model's learning process through gradient-based feature alignment. Experimental results on the nuScenes, SemanticKITTI, and Waymo datasets demonstrate that the proposed method achieves competitive performance, with an approximately 16x reduction in model size and a nearly 1.9x decrease in inference time compared to its teacher model. Notably, on nuScenes, our method achieves state-of-the-art performance among knowledge distillation techniques trained solely on LiDAR data, surpassing prior knowledge distillation baselines in segmentation performance. Our implementation is available publicly at: https://github.com/HySonLab/PointDistill
中文: 本研究提出了一种新颖的蒸馏框架,利用拓扑感知表示和梯度引导对齐将知识从高容量教师模型迁移至轻量级学生模型,在多个数据集上实现了具有竞争力的性能,同时显著减小了模型规模并缩短了推理时间。
English: This study presents a novel distillation framework using topology-aware representations and gradient-guided alignment to transfer knowledge from a high-capacity teacher to a lightweight student model, achieving competitive performance with significantly reduced model size and inference time across multiple datasets.
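A hedged sketch of gradient-guided feature distillation: student features are pulled toward teacher features, with each point weighted by how strongly the teacher's task loss depends on that feature. The saliency-based weighting is an illustrative assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gradient_guided_distill_loss(student_feats, teacher_feats, teacher_task_loss):
    # student_feats, teacher_feats: (N, C) point-wise features.
    # teacher_feats must be part of the graph of teacher_task_loss (requires_grad=True).
    grads = torch.autograd.grad(teacher_task_loss, teacher_feats, retain_graph=True)[0]
    weights = grads.norm(dim=1, keepdim=True)                 # per-point saliency
    weights = weights / weights.sum().clamp_min(1e-8)
    per_point = F.mse_loss(student_feats, teacher_feats.detach(),
                           reduction="none").mean(dim=1, keepdim=True)
    return (weights.detach() * per_point).sum()               # salient points dominate the loss
```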

Authors:Yu Cheng, Arushi Goel, Hakan Bilen
Title: Visually Interpretable Subtask Reasoning for Visual Question Answering
Abstract:
Answering complex visual questions like `Which red furniture can be used for sitting?' requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
中文: VISTAR是一种子任务驱动的训练框架,通过生成逐步推理解释来增强多模态大语言模型,无需外部模型即可同时提升准确性和可解释性。
English: VISTAR is a subtask-driven training framework that enhances multimodal large language models by generating step-by-step reasoning explanations, improving both accuracy and interpretability without external models.

Authors:Songyin Wu, Zhaoyang Lv, Yufeng Zhu, Duncan Frost, Zhengqin Li, Ling-Qi Yan, Carl Ren, Richard Newcombe, Zhao Dong
Title: Monocular Online Reconstruction with Enhanced Detail Preservation
Abstract:
We propose an online 3D Gaussian-based dense mapping framework for photorealistic details reconstruction from a monocular image stream. Our approach addresses two key challenges in monocular online reconstruction: distributing Gaussians without relying on depth maps and ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: the Hierarchical Gaussian Management Module for effective Gaussian distribution and the Global Consistency Optimization Module for maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians for capturing details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency. Moreover, it integrates seamlessly with various tracking systems, ensuring generality and scalability.

Authors:Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, Huazhe Xu
Title: H$^3$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
Abstract:
Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce $\textbf{Triply-Hierarchical Diffusion Policy}~(\textbf{H$^{\mathbf{3}}$DP})$, a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H$^{3}$DP contains $\mathbf{3}$ levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H$^{3}$DP yields a $\mathbf{+27.5\%}$ average relative improvement over baselines across $\mathbf{44}$ simulation tasks and achieves superior performance in $\mathbf{4}$ challenging bimanual real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/.

Authors:Kanchana Ranasinghe, Xiang Li, E-Ro Nguyen, Cristina Mata, Jongwoo Park, Michael S Ryoo
Title: Pixel Motion as Universal Representation for Robot Control
Abstract:
We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion, a universal, interpretable, and motion-centric representation, can be extracted from videos in a weakly-supervised manner, enabling diffusion model training on any video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals. This hierarchical decoupling enables flexible, scalable, and generalizable robot control under both unsupervised and supervised settings, bridging the gap between language, motion, and action. Check out https://kahnchana.github.io/LangToMo

Authors:Chenze Shao, Fandong Meng, Jie Zhou
Title: Continuous Visual Autoregressive Generation via Score Maximization
Abstract:
Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. Source code: https://github.com/shaochenze/EAR.
Chinese: 本文提出了一种连续视觉自回归框架,通过利用严格适当评分规则(主要采用能量分数)来避免矢量量化,从而在连续空间中直接进行视觉自回归生成而无需信息损失。
English: The paper introduces a Continuous Visual AutoRegressive (VAR) framework that eliminates the need for vector quantization by leveraging strictly proper scoring rules, primarily using the energy score to enable direct autoregressive generation of continuous visual data without information loss.
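A minimal sketch of an energy-score objective in the spirit of this framework: draw several samples from the model per context and minimize E||X - y|| - 0.5 E||X - X'||, the negated strictly proper energy score, which needs no likelihood. The multi-sample interface is an assumption about how the predictive distribution is accessed.

```python
import torch

def energy_score_loss(samples, target):
    """
    samples: (M, D) i.i.d. draws from the model's predictive distribution for one position.
    target:  (D,)   ground-truth continuous token (e.g., a patch embedding).
    Returns a scalar loss whose minimization maximizes the energy score.
    """
    M = samples.shape[0]
    fidelity = (samples - target).norm(dim=-1).mean()          # E ||X - y||
    pdist = torch.cdist(samples, samples)                      # pairwise ||X - X'||
    diversity = pdist.sum() / (M * (M - 1))                    # exclude i == j terms
    return fidelity - 0.5 * diversity
```

The fidelity term pulls samples toward the target, while the diversity term discourages the degenerate collapse to a single point, which is what makes the score strictly proper.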

Authors:Quang Vinh Nguyen, Minh Duc Nguyen, Thanh Hoang Son Vo, Hyung-Jeong Yang, Soo-Hyung Kim
Title: Anatomical Attention Alignment representation for Radiology Report Generation
Abstract:
Automated Radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists' workload and improving access to high-quality diagnostic services. Existing encoder-decoder models only rely on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal text generation. To address this, we propose Anatomical Attention Alignment Network (A3Net), a framework that enhances visual-textual understanding by constructing hyper-visual representations. Our approach integrates a knowledge dictionary of anatomical structures with patch-level visual features, enabling the model to effectively associate image regions with their corresponding anatomical entities. This structured representation improves semantic reasoning, interpretability, and cross-modal alignment, ultimately enhancing the accuracy and clinical relevance of generated reports. Experimental results on IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net significantly improves both visual perception and text generation quality. Our code is available at \href{https://github.com/Vinh-AI/A3Net}{GitHub}.
中文: A3Net框架通过整合解剖学知识与视觉特征,提升了自动化放射学报告生成的语义推理能力和报告准确性,在IU X-Ray和MIMIC-CXR数据集上验证了其优越性能。
English: The A3Net framework enhances automated radiology report generation by integrating anatomical knowledge with visual features, improving semantic reasoning and report accuracy as demonstrated on IU X-Ray and MIMIC-CXR datasets.
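A hedged sketch of one way to fuse patch-level visual features with a dictionary of anatomical entity embeddings via cross-attention, as the description suggests. The learned dictionary, single attention layer, and residual fusion are illustrative assumptions about the "hyper-visual representation".

```python
import torch
import torch.nn as nn

class AnatomyAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_entities=32, heads=8):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(num_entities, dim))   # anatomical entries
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_feats):
        # patch_feats: (B, N_patches, dim) from the image encoder.
        B = patch_feats.shape[0]
        mem = self.dictionary.unsqueeze(0).expand(B, -1, -1)
        fused, _ = self.attn(query=patch_feats, key=mem, value=mem)      # patches attend to anatomy
        return patch_feats + fused                                       # residual hyper-visual features
```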

Authors:Feng Yuan, Yifan Gao, Wenbin Wu, Keqing Wu, Xiaotong Guo, Jie Jiang, Xin Gao
Title: ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation
Abstract:
Accurate multi-modal medical image translation requires harmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for preserving modality-specific edge and texture details, and Mamba's selective state-space modeling for efficient long- and short-range feature dependencies. Structurally, our dual-resolution framework leverages SAM2's image encoder to capture organ-scale semantics from high-resolution inputs, while a parallel CNNs branch extracts fine-grained local features. The Robust Feature Fusion Network (RFFN) integrates these representations, and the Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using spiral scanning and bidirectional state-space dynamics. A three-stage skip fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank Adaptation (LoRA+) fine-tuning to enable precise domain specialization while maintaining the foundational capabilities of the pre-trained components. Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details to enhance diagnostic accuracy in clinical applications. The code is available at https://github.com/gatina-yone/ABS-Mamba
中文: ABS-Mamba通过整合SAM2的器官语义感知、CNN的局部特征提取与Mamba的依赖建模,在双分辨率框架下实现了保留解剖结构和细节的高保真跨模态医学图像生成,经临床数据集验证优于现有方法。
English: ABS-Mamba introduces a dual-resolution framework combining SAM2 for organ-level semantics, specialized CNNs for local details, and Mamba for feature dependencies, achieving superior cross-modal medical image synthesis validated on clinical datasets.

Authors:Oriol Barbany, Adrià Colomé, Carme Torras
Title: Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding
Abstract:
Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.

Authors:Sarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh, Myrthe A. D. Buser, Gabriel E. Humpire Mamani, Lina Xu, Lisa C. Adams, Jawed Nawabi, Keno K. Bressem, Bram van Ginneken, Mathias Prokop, Alessa Hering
Title: Robust Kidney Abnormality Segmentation: A Validation Study of an AI-Based Framework
Abstract:
Kidney abnormality segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and abnormalities, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated kidney abnormality segmentation algorithm, made publicly available for clinical and research use. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
Chinese: 本研究利用nnU-Net框架开发了一种稳健的肾脏异常分割算法,通过多样化数据集验证确保了高性能和可靠性,所有代码已公开供临床和研究使用。
English: This research develops a robust kidney abnormality segmentation algorithm using the nnU-Net framework, validated across diverse datasets to ensure high performance and reliability, with all code made publicly available for clinical and research use.
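A small sketch of the reported evaluation metrics (Dice coefficient and 95th-percentile Hausdorff distance) computed per case, here using MedPy's binary metrics; the paper's actual evaluation code may differ, and voxel spacing handling is an assumption.

```python
import numpy as np
from medpy.metric.binary import dc, hd95

def evaluate_case(pred_mask: np.ndarray, gt_mask: np.ndarray, spacing):
    """pred_mask, gt_mask: binary numpy volumes; spacing: per-axis voxel size in mm."""
    dice = dc(pred_mask, gt_mask)                                   # overlap in [0, 1]
    hausdorff95 = hd95(pred_mask, gt_mask, voxelspacing=spacing)    # surface distance in mm
    return dice, hausdorff95
```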

Authors:Kamil Jeziorek, Tomasz Kryjak
Title: Self-Supervised Event Representations: Towards Accurate, Real-Time Perception on SoC FPGAs
Abstract:
Event cameras offer significant advantages over traditional frame-based sensors. These include microsecond temporal resolution, robustness under varying lighting conditions and low power consumption. Nevertheless, the effective processing of their sparse, asynchronous event streams remains challenging. Existing approaches to this problem can be categorised into two distinct groups. The first group involves the direct processing of event data with neural models, such as Spiking Neural Networks or Graph Convolutional Neural Networks. However, this approach is often accompanied by a compromise in terms of qualitative performance. The second group involves the conversion of events into dense representations with handcrafted aggregation functions, which can boost accuracy at the cost of temporal fidelity. This paper introduces a novel Self-Supervised Event Representation (SSER) method leveraging Gated Recurrent Unit (GRU) networks to achieve precise per-pixel encoding of event timestamps and polarities without temporal discretisation. The recurrent layers are trained in a self-supervised manner to maximise the fidelity of event-time encoding. The inference is performed with event representations generated asynchronously, thus ensuring compatibility with high-throughput sensors. The experimental validation demonstrates that SSER outperforms aggregation-based baselines, achieving mAP improvements of 2.4% and 0.6% on the Gen1 and 1 Mpx object detection datasets, respectively. Furthermore, the paper presents the first hardware implementation of recurrent representation for event data on a System-on-Chip FPGA, achieving sub-microsecond latency and power consumption between 1-2 W, suitable for real-time, power-efficient applications. Code is available at https://github.com/vision-agh/RecRepEvent.
中文: 本文提出了一种利用GRU网络的自监督事件表示方法,无需时间离散化即可编码事件数据,在实现卓越目标检测性能的同时,通过硬件部署达到了亚微秒级延迟和低功耗的高效表现。
English: This paper introduces a self-supervised event representation method using GRU networks to encode event data without temporal discretization, achieving superior object detection performance and efficient hardware implementation with sub-microsecond latency and low power consumption.
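A hedged, inference-only sketch of per-pixel recurrent event encoding: a GRU cell keeps one hidden state per pixel and updates it with each event's (timestamp, polarity), with no temporal binning. The feature size, the 304x240 default resolution, and the omission of the self-supervised training objective are illustrative simplifications.

```python
import torch
import torch.nn as nn

class EventGRUEncoder(nn.Module):
    """Keeps one GRU hidden state per pixel and updates it event by event."""
    def __init__(self, height=240, width=304, hidden=8):
        super().__init__()
        self.height, self.width, self.hidden = height, width, hidden
        self.cell = nn.GRUCell(input_size=2, hidden_size=hidden)

    @torch.no_grad()
    def forward(self, events):
        # events: (N, 4) tensor with rows (x, y, timestamp, polarity), time-ordered.
        state = torch.zeros(self.height * self.width, self.hidden)
        for x, y, t, p in events.tolist():
            idx = int(y) * self.width + int(x)
            inp = torch.tensor([[float(t), float(p)]])
            state[idx] = self.cell(inp, state[idx:idx + 1])[0]   # asynchronous per-pixel update
        return state.view(self.height, self.width, self.hidden)
```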

Authors:Hao Li, Sicheng Li, Xiang Gao, Abudouaihati Batuer, Lu Yu, Yiyi Liao
Title: GIFStream: 4D Gaussian-based Immersive Video with Feature Stream
Abstract:
Immersive video offers a six-degrees-of-freedom (6-DoF) free viewing experience, potentially playing a key role in future video technology. Recently, 4D Gaussian Splatting has gained attention as an effective approach for immersive video due to its high rendering efficiency and quality, though maintaining quality with manageable storage remains challenging. To address this, we introduce GIFStream, a novel 4D Gaussian representation using a canonical space and a deformation field enhanced with time-dependent feature streams. These feature streams enable complex motion modeling and allow efficient compression by leveraging temporal correspondence and motion-aware pruning. Additionally, we incorporate both temporal and spatial compression networks for end-to-end compression. Experimental results show that GIFStream delivers high-quality immersive video at 30 Mbps, with real-time rendering and fast decoding on an RTX 4090. Project page: https://xdimlab.github.io/GIFStream

Authors:Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang
Title: Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Abstract:
We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: - Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives. - We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Besides the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy gradient RL working in the visual tokens can significantly boost the visual generation benchmark, surpassing all the existing models by a large margin. Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. When combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM. Project Page: https://selftok-team.github.io/report/.

Authors:Mohamed Ali Souibgui, Changkyu Choi, Andrey Barsky, Kangsoo Jung, Ernest Valveny, Dimosthenis Karatzas
Title: DocVXQA: Context-Aware Visual Explanations for Document Question Answering
Abstract:
We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are \textit{contextually sufficient} while remaining \textit{representation-efficient}. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.
中文: DocVXQA是一种新型视觉自解释文档问答框架,不仅能生成准确答案,还能通过视觉热图突出关键区域,提供上下文充分且表征高效的解释,从而在保证性能的同时增强用户信任。
English: DocVXQA is a visually self-explainable document question answering framework that generates accurate answers while learning visual heatmaps to highlight critical regions, ensuring contextually sufficient and representation-efficient explanations for enhanced user trust and balanced performance.

Authors:Hongkun Dou, Zeyu Li, Xingyu Jiang, Hongjue Li, Lijun Yang, Wen Yao, Yue Deng
Title: You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts
Abstract:
Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by $\sim 90\%$ while maintaining superior performance. Code is available at https://github.com/deng-ai-lab/SDO.
中文: 本文提出了一种名为“快捷扩散优化”(SDO)的高效计算方法,通过在生成过程中仅保留单步计算图来优化扩散模型,在保持性能的同时将计算成本降低约90%。
English: This paper introduces Shortcut Diffusion Optimization (SDO), a computationally efficient method that optimizes diffusion models by retaining only one step's computational graph during generation, reducing costs by approximately 90% while maintaining performance.
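A hedged sketch of the one-step gradient shortcut idea: sample without gradients, recompute only the final denoising step with a graph, and reuse that step's gradient as a surrogate update for the quantity being optimized (here, the initial latent). The surrogate-gradient update and the `denoise_step`/`metric_fn` interfaces are illustrative assumptions, not the authors' exact procedure.

```python
import torch

def shortcut_optimize(latent, denoise_step, num_steps, metric_fn, iters=10, lr=0.05):
    """Optimize an initial latent against a differentiable metric while keeping
    the computational graph of only the final denoising step."""
    latent = latent.detach()
    for _ in range(iters):
        with torch.no_grad():
            x = latent
            for t in range(num_steps - 1):
                x = denoise_step(x, t)               # bulk of sampling: no graph kept
        x = x.detach().requires_grad_(True)
        out = denoise_step(x, num_steps - 1)         # only this step keeps a graph
        grad = torch.autograd.grad(metric_fn(out), x)[0]
        latent = latent - lr * grad                  # shortcut: reuse the one-step gradient
    return latent
```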

Authors:Wei Li, Ming Hu, Guoan Wang, Lihao Liu, Kaijing Zhou, Junzhi Ning, Xin Guo, Zongyuan Ge, Lixu Gu, Junjun He
Title: Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model
Abstract:
In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor consumption. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering downstream tasks of ophthalmic surgical workflow understanding. Code is available at https://github.com/uni-medical/Ophora.
中文: 本文提出了Ophora模型,能够根据自然语言指令生成高质量眼科手术视频,通过构建专用数据集和渐进式调优方案,有效解决了医疗数据稀缺问题。
English: The paper introduces Ophora, an AI model that generates realistic ophthalmic surgical videos from natural language instructions, addressing data scarcity through a curated dataset and progressive tuning method.

Authors:Peng Sun, Yi Jiang, Tao Lin
Title: Unified Continuous Generative Models
Abstract:
Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.
中文: 本文提出了一个统一框架来训练、采样和分析连续生成模型,在减少采样步骤的同时,为多步和少步方法均实现了最先进的性能。
English: This paper introduces a unified framework for training, sampling, and analyzing continuous generative models, achieving state-of-the-art performance with fewer steps across both multi-step and few-step approaches.

Authors:Wenhao Hu, Paul Henderson, José Cano
Title: ICE-Pruning: An Iterative Cost-Efficient Pruning Pipeline for Deep Neural Networks
Abstract:
Pruning is a widely used method for compressing Deep Neural Networks (DNNs), where less relevant parameters are removed from a DNN model to reduce its size. However, removing parameters reduces model accuracy, so pruning is typically combined with fine-tuning, and sometimes other operations such as rewinding weights, to recover accuracy. A common approach is to repeatedly prune and then fine-tune, with increasing amounts of model parameters being removed in each step. While straightforward to implement, pruning pipelines that follow this approach are computationally expensive due to the need for repeated fine-tuning. In this paper we propose ICE-Pruning, an iterative pruning pipeline for DNNs that significantly decreases the time required for pruning by reducing the overall cost of fine-tuning, while maintaining a similar accuracy to existing pruning pipelines. ICE-Pruning is based on three main components: i) an automatic mechanism to determine after which pruning steps fine-tuning should be performed; ii) a freezing strategy for faster fine-tuning in each pruning step; and iii) a custom pruning-aware learning rate scheduler to further improve the accuracy of each pruning step and reduce the overall time consumption. We also propose an efficient auto-tuning stage for the hyperparameters (e.g., freezing percentage) introduced by the three components. We evaluate ICE-Pruning on several DNN models and datasets, showing that it can accelerate pruning by up to 9.61x. Code is available at https://github.com/gicLAB/ICE-Pruning
Chinese: ICE-Pruning是一种创新的迭代剪枝流程,通过优化微调过程在保持模型精度的同时显著加速深度神经网络压缩,最高可实现9.61倍的加速效果。
English: ICE-Pruning is an innovative iterative pipeline that accelerates deep neural network compression by optimizing fine-tuning processes while preserving model accuracy, achieving up to 9.61x speedup.
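A hedged sketch of an iterative prune-then-(sometimes)-fine-tune loop in the spirit of the pipeline described above. The trigger for fine-tuning (an accuracy-drop threshold) and the simple "freeze the earliest layers" rule are simplified assumptions standing in for the paper's automatic mechanism, freezing strategy, and scheduler.

```python
import torch
import torch.nn.utils.prune as prune

def iterative_prune(model, conv_layers, amounts, evaluate, fine_tune,
                    drop_threshold=0.01, freeze_frac=0.5):
    """conv_layers: list of nn.Conv2d modules; amounts: per-step pruning ratios."""
    baseline = evaluate(model)
    for amount in amounts:                                    # e.g. [0.1, 0.2, 0.3, ...]
        for layer in conv_layers:
            prune.l1_unstructured(layer, name="weight", amount=amount)
        acc = evaluate(model)
        if baseline - acc > drop_threshold:                   # fine-tune only when accuracy drops
            n_freeze = int(freeze_frac * len(conv_layers))
            for layer in conv_layers[:n_freeze]:              # freeze early layers for speed
                for p in layer.parameters():
                    p.requires_grad_(False)
            fine_tune(model)
            for layer in conv_layers[:n_freeze]:
                for p in layer.parameters():
                    p.requires_grad_(True)
            baseline = evaluate(model)
    return model
```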

Authors:Chunpeng Li, Ya-tang Li
Title: Feature Visualization in 3D Convolutional Neural Networks
Abstract:
Understanding the computations of convolutional neural networks requires effective visualization of their kernels. While maximal activation methods have proven successful in highlighting the preferred features of 2D convolutional kernels, directly applying these techniques to 3D convolutions often leads to uninterpretable results due to the higher dimensionality and complexity of 3D features. To address this challenge, we propose a novel visualization approach for 3D convolutional kernels that disentangles their texture and motion preferences. Our method begins with a data-driven decomposition of the optimal input that maximally activates a given kernel. We then introduce a two-stage optimization strategy to extract distinct texture and motion components from this input. Applying our approach to visualize kernels at various depths of several pre-trained models, we find that the resulting visualizations--particularly those capturing motion--clearly reveal the preferred dynamic patterns encoded by 3D kernels. These results demonstrate the effectiveness of our method in providing interpretable insights into 3D convolutional operations. Code is available at https://github.com/YatangLiLab/3DKernelVisualizer.
Chinese: 本文提出了一种新颖的三维卷积核可视化方法,通过两阶段优化策略分离其纹理和运动偏好,从而清晰揭示其编码的动态模式,增强了对三维卷积操作的可解释性。
English: This paper introduces a novel visualization method for 3D convolutional kernels that separates their texture and motion preferences through a two-stage optimization strategy, providing interpretable insights into their dynamic patterns.
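A minimal sketch of the activation-maximization starting point for a 3D convolutional kernel: optimize a video-shaped input so a chosen channel's mean response is maximal. The texture/motion decomposition described in the paper is not reproduced here, and the input shape and optimizer settings are illustrative.

```python
import torch

def maximize_activation(model, layer, channel, shape=(1, 3, 16, 64, 64),
                        steps=200, lr=0.05):
    """Returns an input (batch, channels, frames, H, W) that strongly drives `channel` of `layer`."""
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    x = torch.randn(shape, requires_grad=True)        # assumes model and x on the same device
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        model(x)
        loss = -acts["out"][:, channel].mean()        # maximize the mean channel response
        loss.backward()
        opt.step()
    handle.remove()
    return x.detach()
```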

Authors:Yuqi Cheng, Yunkang Cao, Dongfang Wang, Weiming Shen, Wenlong Li
Title: Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection
Abstract:
Point cloud anomaly detection is essential for various industrial applications. The huge computation and storage costs caused by the increasing number of product classes limit the application of single-class unsupervised methods, necessitating the development of multi-class unsupervised methods. However, the feature similarity between normal and anomalous points from different class data leads to the feature confusion problem, which greatly hinders the performance of multi-class methods. Therefore, we introduce a multi-class point cloud anomaly detection method, named GLFM, leveraging global-local feature matching to progressively separate data that are prone to confusion across multiple classes. Specifically, GLFM is structured into three stages: Stage-I proposes an anomaly synthesis pipeline that stretches point clouds to create abundant anomaly data that are utilized to adapt the point cloud feature extractor for better feature representation. Stage-II establishes the global and local memory banks according to the global and local feature distributions of all the training data, weakening the impact of feature confusion on the establishment of the memory bank. Stage-III implements anomaly detection of test data leveraging its feature distance from the global and local memory banks. Extensive experiments on the MVTec 3D-AD, Real3D-AD and actual industry parts datasets showcase our proposed GLFM's superior point cloud anomaly detection performance. The code is available at https://github.com/hustCYQ/GLFM-Multi-class-3DAD.
中文: GLFM是一种创新的多类别点云异常检测方法,通过三阶段全局-局部特征匹配策略有效解决特征混淆问题,在多个数据集上展现出卓越的检测性能。
English: GLFM is a novel multi-class point cloud anomaly detection method that employs a three-stage global-local feature matching strategy to effectively address feature confusion and enhance detection performance across diverse datasets.
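A hedged sketch of the memory-bank scoring step (Stage-III): a test point's anomaly score is its distance to the nearest entry in a global and a local memory bank built from normal training features. How the banks are constructed and how the two distances are fused are simplified assumptions here.

```python
import torch

def anomaly_scores(test_feats, global_bank, local_bank, w=0.5):
    """
    test_feats:  (N, D) features of test points.
    global_bank: (Mg, D) and local_bank: (Ml, D) features from normal training data.
    Returns (N,) per-point anomaly scores (larger = more anomalous).
    """
    d_global = torch.cdist(test_feats, global_bank).min(dim=1).values   # nearest-neighbor distance
    d_local = torch.cdist(test_feats, local_bank).min(dim=1).values
    return w * d_global + (1 - w) * d_local
```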

Authors:Hongda Qin, Xiao Lu, Zhiyong Wei, Yihong Cao, Kailun Yang, Ningjiang Chen
Title: Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection
Abstract:
Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain to raise the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector's backbone has the same structure as the image encoder of VLM, limiting the detector framework selection. To address this problem, we propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain by fully utilizing the semantic information of the VLM. Specifically, we first construct prompts to transfer style semantics embedded in the VLM to an image translation network. This facilitates the generation of style diversified images with explicit semantic information. Then, we propose image-level style mixing between the diversified images and source domain images. This effectively mines the semantic information for image augmentation without relying on specific augmentation selections. Finally, we propose feature-level style mixing in a double-pipeline manner, allowing feature augmentation to be model-agnostic and can work seamlessly with the mainstream detector frameworks, including the one-stage, two-stage, and transformer-based detectors. Extensive experiments demonstrate the effectiveness of our approach across various benchmark datasets, including real to cartoon and normal to adverse weather tasks. The source code and pre-trained models will be publicly available at https://github.com/qinhongda8/LDDS.
Chinese: 提出的语言驱动双重风格混合(LDDS)方法通过利用视觉语言模型的语义信息多样化源域,实现了与多种检测器框架兼容的模型无关图像和特征增强,从而提升单域泛化能力。
English: The proposed Language-Driven Dual Style Mixing (LDDS) method enhances single-domain generalization by diversifying the source domain through VLM-based semantic information, enabling model-agnostic image and feature augmentation that works with various detector frameworks.
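A hedged sketch of feature-level style mixing via an AdaIN-style statistics swap: the channel-wise mean and standard deviation of source features are interpolated toward those of style-diversified features, which keeps the operation model-agnostic. LDDS's exact mixing rule may differ; this is an illustrative baseline form.

```python
import torch

def mix_feature_styles(src, sty, alpha=0.5, eps=1e-6):
    # src, sty: (B, C, H, W) backbone feature maps from source and style-diversified images.
    mu_s = src.mean(dim=(2, 3), keepdim=True)
    std_s = src.std(dim=(2, 3), keepdim=True)
    mu_t = sty.mean(dim=(2, 3), keepdim=True)
    std_t = sty.std(dim=(2, 3), keepdim=True)
    mu = alpha * mu_t + (1 - alpha) * mu_s            # interpolate channel statistics
    std = alpha * std_t + (1 - alpha) * std_s
    return (src - mu_s) / (std_s + eps) * std + mu    # re-normalize source content to mixed style
```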

Authors:Zihang Liu, Zhenyu Zhang, Hao Tang
Title: Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution
Abstract:
Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at https://github.com/Liu-Zihang/SAMSR.
中文: SAMSR是一种语义引导的扩散框架,通过引入分割掩码优化噪声并优先处理语义丰富区域,显著提升了复杂图像的超分辨率感知质量和细节恢复能力。
English: SAMSR is a semantic-guided diffusion framework that enhances image super-resolution by integrating segmentation masks to refine noise and prioritize semantically rich regions, significantly improving perceptual quality and detail recovery in complex images.
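A small, hedged sketch of per-pixel noise strength modulated by semantic weights derived from segmentation masks, one plausible reading of the SAM-Noise idea; the actual refinement and residual-transfer schedule in SAMSR are more involved, and the modulation rule below is purely illustrative.

```python
import torch

def semantic_noise(lr_image, sem_weight, base_sigma=1.0):
    # lr_image: (B, C, H, W) low-resolution input; sem_weight: (B, 1, H, W) in [0, 1] from SAM masks.
    noise = torch.randn_like(lr_image)
    sigma = base_sigma * (0.5 + 0.5 * sem_weight)     # stronger stochasticity in semantically rich regions
    return lr_image + sigma * noise
```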

Authors:Jitesh Joshi, Youngjun Cho
Title: Efficient and Robust Multidimensional Attention in Remote Physiological Sensing through Target Signal Constrained Factorization
Abstract:
Remote physiological sensing using camera-based technologies offers transformative potential for non-invasive vital sign monitoring across healthcare and human-computer interaction domains. Although deep learning approaches have advanced the extraction of physiological signals from video data, existing methods have not been sufficiently assessed for their robustness to domain shifts. These shifts in remote physiological sensing include variations in ambient conditions, camera specifications, head movements, facial poses, and physiological states which often impact real-world performance significantly. Cross-dataset evaluation provides an objective measure to assess generalization capabilities across these domain shifts. We introduce Target Signal Constrained Factorization module (TSFM), a novel multidimensional attention mechanism that explicitly incorporates physiological signal characteristics as factorization constraints, allowing more precise feature extraction. Building on this innovation, we present MMRPhys, an efficient dual-branch 3D-CNN architecture designed for simultaneous multitask estimation of photoplethysmography (rPPG) and respiratory (rRSP) signals from multimodal RGB and thermal video inputs. Through comprehensive cross-dataset evaluation on five benchmark datasets, we demonstrate that MMRPhys with TSFM significantly outperforms state-of-the-art methods in generalization across domain shifts for rPPG and rRSP estimation, while maintaining a minimal inference latency suitable for real-time applications. Our approach establishes new benchmarks for robust multitask and multimodal physiological sensing and offers a computationally efficient framework for practical deployment in unconstrained environments. The web browser-based application featuring on-device real-time inference of MMRPhys model is available at https://physiologicailab.github.io/mmrphys-live

Authors:Zhengye Zhang, Sirui Zhao, Shifeng Liu, Shukang Yin, Xinglong Mao, Tong Xu, Enhong Chen
Title: MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception
Abstract:
Micro-expressions (MEs) are crucial psychological responses with significant potential for affective computing. However, current automatic micro-expression recognition (MER) research primarily focuses on discrete emotion classification, neglecting a convincing analysis of the subtle dynamic movements and inherent emotional cues. The rapid progress in multimodal large language models (MLLMs), known for their strong multimodal comprehension and language generation abilities, offers new possibilities. MLLMs have shown success in various vision-language tasks, indicating their potential to understand MEs comprehensively, including both fine-grained motion patterns and underlying emotional semantics. Nevertheless, challenges remain due to the subtle intensity and short duration of MEs, as existing MLLMs are not designed to capture such delicate frame-level facial dynamics. In this paper, we propose a novel Micro-Expression Large Language Model (MELLM), which incorporates a subtle facial motion perception strategy with the strong inference capabilities of MLLMs, representing the first exploration of MLLMs in the domain of ME analysis. Specifically, to explicitly guide the MLLM toward motion-sensitive regions, we construct an interpretable motion-enhanced color map by fusing onset-apex optical flow dynamics with the corresponding grayscale onset frame as the model input. Additionally, specialized fine-tuning strategies are incorporated to further enhance the model's visual perception of MEs. Furthermore, we construct an instruction-description dataset based on Facial Action Coding System (FACS) annotations and emotion labels to train our MELLM. Comprehensive evaluations across multiple benchmark datasets demonstrate that our model exhibits superior robustness and generalization capabilities in ME understanding (MEU). Code is available at https://github.com/zyzhangUstc/MELLM.
中文摘要:本文提出MELLM模型,通过融合细微面部动作感知与多模态大语言模型,首次实现了对微表情动态特征和情感语义的全面解析,在多个基准测试中展现出卓越的鲁棒性和泛化能力。
English Summary: This paper introduces MELLM, a novel micro-expression analysis model that integrates subtle facial motion perception with multimodal large language models to enhance the recognition of fine-grained dynamic movements and emotional cues, demonstrating superior robustness and generalization in micro-expression understanding.
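The motion-enhanced input can be approximated with off-the-shelf tools: the sketch below computes dense optical flow between the onset and apex frames (Farneback flow is an assumption; the abstract does not name the estimator), renders the flow as a color map, and blends it with the grayscale onset frame. The blending weight is also an assumption.

    import cv2
    import numpy as np

    def motion_enhanced_map(onset_gray, apex_gray, alpha=0.5):
        """Blend an optical-flow color rendering with the grayscale onset frame.
        Inputs are uint8 grayscale frames of identical size (sketch only)."""
        flow = cv2.calcOpticalFlowFarneback(onset_gray, apex_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv = np.zeros((*onset_gray.shape, 3), dtype=np.uint8)
        hsv[..., 0] = ang * 180 / np.pi / 2                        # flow direction -> hue
        hsv[..., 1] = 255
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> value
        flow_rgb = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
        onset_rgb = cv2.cvtColor(onset_gray, cv2.COLOR_GRAY2BGR)
        # Fuse the flow rendering with the onset frame appearance.
        return cv2.addWeighted(flow_rgb, alpha, onset_rgb, 1 - alpha, 0)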

Authors:Peng Li, Suizhi Ma, Jialiang Chen, Yuan Liu, Congyi Zhang, Wei Xue, Wenhan Luo, Alla Sheffer, Wenping Wang, Yike Guo
Title: CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation
Abstract:
Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each component of the generated 3D model. Any modifications of the input image lead to an entire regeneration of the 3D models. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.
中文: CMD方法通过条件多视角扩散模型,实现了按组件生成3D模型的功能,并支持仅修改单张渲染图像即可对模型进行灵活局部编辑,显著提升了生成质量与控制能力。
English: The CMD method introduces a conditional multiview diffusion model that enables both component-by-component generation and flexible local editing of 3D models by modifying a single rendered image, significantly improving generation quality and control.

Authors:Bidur Khanal, Sandesh Pokhrel, Sanjay Bhandari, Ramesh Rana, Nikesh Shrestha, Ram Bahadur Gurung, Cristian Linte, Angus Watson, Yash Raj Shrestha, Binod Bhattarai
Title: Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models
Abstract:
Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination--the tendency to generate descriptions that are inconsistent with the visual content--remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts systematically review these reports, and identify and correct potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLM is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: https://github.com/bhattarailab/Hallucination-Aware-VLM.
中文: 视觉语言模型在医学影像应用中潜力巨大,但存在幻觉生成问题;本研究通过构建胃肠道数据集并采用幻觉感知微调方法,有效提升了诊断报告的准确性。
English: Vision-Language Models (VLMs) show promise in medical imaging but suffer from hallucination issues, which are addressed in this study through a curated gastrointestinal dataset and a novel hallucination-aware fine-tuning approach that improves diagnostic report accuracy.

Authors:Wei Shang, Dongwei Ren, Wanying Zhang, Pengfei Zhu, Qinghua Hu, Wangmeng Zuo
Title: High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution
Abstract:
The primary challenge in accelerating image super-resolution lies in reducing computation while maintaining performance and adaptability. Motivated by the observation that high-frequency regions (e.g., edges and textures) are most critical for reconstruction, we propose a training-free adaptive masking module for acceleration that dynamically focuses computation on these challenging areas. Specifically, our method first extracts high-frequency components via Gaussian blur subtraction and adaptively generates binary masks using K-means clustering to identify regions requiring intensive processing. Our method can be easily integrated with both CNNs and Transformers. For CNN-based architectures, we replace standard $3 \times 3$ convolutions with an unfold operation followed by $1 \times 1$ convolutions, enabling pixel-wise sparse computation guided by the mask. For Transformer-based models, we partition the mask into non-overlapping windows and selectively process tokens based on their average values. During inference, unnecessary pixels or windows are pruned, significantly reducing computation. Moreover, our method supports dilation-based mask adjustment to control the processing scope without retraining, and is robust to unseen degradations (e.g., noise, compression). Extensive experiments on benchmarks demonstrate that our method reduces FLOPs by 24--43% for state-of-the-art models (e.g., CARN, SwinIR) while achieving comparable or better quantitative metrics. The source code is available at https://github.com/shangwei5/AMSR
中文: 本文提出了一种无需训练的自适应掩码模块,通过动态聚焦高频区域的计算来加速图像超分辨率,在先进模型中减少24–43%的计算量,同时保持性能。
English: This paper introduces a training-free adaptive masking module that accelerates image super-resolution by dynamically focusing computation on high-frequency regions, reducing FLOPs by 24–43% for state-of-the-art models while maintaining performance.
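A minimal sketch of the two steps named in the abstract, Gaussian blur subtraction followed by K-means binarization, is shown below; the kernel size, sigma, and cluster count are assumed values, and the actual module additionally supports dilation-based mask adjustment and window partitioning for Transformers.

    import cv2
    import numpy as np

    def high_freq_mask(img_gray, ksize=21, sigma=3.0):
        """Binary mask of high-frequency regions (sketch of the described steps)."""
        # High-frequency component via Gaussian blur subtraction.
        img = img_gray.astype(np.float32)
        high = np.abs(img - cv2.GaussianBlur(img, (ksize, ksize), sigma))
        # 1D K-means (k=2) on the high-frequency magnitudes; the cluster with the
        # larger centroid marks pixels that need intensive processing.
        samples = high.reshape(-1, 1).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
        _, labels, centers = cv2.kmeans(samples, 2, None, criteria, 3,
                                        cv2.KMEANS_PP_CENTERS)
        fg = int(np.argmax(centers))
        return (labels.reshape(img_gray.shape) == fg).astype(np.uint8)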

Authors:Fei Zhou, Yi Li, Mingqing Zhu
Title: Transformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network
Abstract:
In this paper, the dual-optical attention fusion crowd head point counting model (TAPNet) is proposed to address the difficulty of accurate crowd counting under the UAV view in complex scenes, such as dense occlusion and low light. The model designs a dual-optical attention fusion module (DAFP) by introducing complementary information from infrared images to improve the accuracy and robustness of all-day crowd counting. In order to fully utilize different modal information and solve the problem of inaccurate localization caused by systematic misalignment between image pairs, this paper also proposes an adaptive two-optical feature decomposition fusion module (AFDF). In addition, we optimize the training strategy to improve the model robustness through spatial random offset data augmentation. Experiments on two challenging public datasets, DroneRGBT and GAIIC2, show that the proposed method outperforms existing techniques, especially in challenging dense low-light scenes. Code is available at https://github.com/zz-zik/TAPNet
中文: 本文提出TAPNet双光注意力融合模型,通过结合红外图像数据和优化训练策略,显著提升了无人机视角下密集遮挡和低光等复杂场景的人群计数精度。
English: This paper introduces TAPNet, a dual-optical attention fusion model that enhances UAV crowd counting accuracy in complex conditions like occlusion and low light by integrating infrared data and optimized training strategies.

Authors:Ye Zhu, Yunan Wang, Zitong Yu
Title: Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning
Abstract:
Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulated types, designed to detect and localize highly authentic fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully uses unimodal and mutual modal features to mine the intrinsic semantics of news. Under shallow inference, we propose the momentum distillation-based light punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, and an adaptive cross-modal fusion module to enhance mutual modal features. Under deep inference, we design a two-branch framework to augment the image and text unimodal features, respectively merging with mutual modalities features, for four predictions via dedicated detection and localization projections. Experiments on both mainstream and our proposed datasets demonstrate the superiority of the model. Codes and dataset are released at https://github.com/yunan-wang33/sdml.
中文摘要:本文提出了新型多模态假新闻检测数据集(MFND)和浅层-深层多任务学习模型(SDML),通过创新的对比学习与融合机制,充分利用单模态及跨模态特征,有效实现了对高仿真假新闻的检测与定位。
English Summary: This paper introduces a novel Multimodal Fake News Detection dataset (MFND) and proposes a Shallow-Deep Multitask Learning (SDML) model that effectively utilizes unimodal and cross-modal features to detect and localize sophisticated fake news through innovative contrastive learning and fusion techniques.

Authors:Morui Zhu, Yongqi Zhu, Yihao Zhu, Qi Chen, Deyuan Qu, Song Fu, Qing Yang
Title: M3CAD: Towards Generic Cooperative Autonomous Driving Benchmark
Abstract:
We introduce M$^3$CAD, a novel benchmark designed to advance research in generic cooperative autonomous driving. M$^3$CAD comprises 204 sequences with 30k frames, spanning a diverse range of cooperative driving scenarios. Each sequence includes multiple vehicles and sensing modalities, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M$^3$CAD to support both single-vehicle and multi-vehicle autonomous driving research, significantly broadening the scope of research in the field. To our knowledge, M$^3$CAD is the most comprehensive benchmark specifically tailored for cooperative multi-task autonomous driving research. We evaluate the state-of-the-art end-to-end solution on M$^3$CAD to establish baseline performance. To foster cooperative autonomous driving research, we also propose E2EC, a simple yet effective framework for cooperative driving that leverages inter-vehicle shared information for improved path planning. We release M$^3$CAD, along with our baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available on https://github.com/zhumorui/M3CAD
中文: M³CAD是一个专为协作式自动驾驶研究设计的综合性基准,包含204个序列和3万帧多模态数据,支持多任务研究并公开资源以推动该领域发展。
English: M³CAD is a comprehensive benchmark featuring 204 sequences with 30k frames of multimodal data to advance cooperative autonomous driving research, supporting tasks like detection, tracking, and planning while establishing baseline performance and releasing resources publicly.

Authors:Xuefeng Jiang, Jia Li, Nannan Wu, Zhiyuan Wu, Xujing Li, Sheng Sun, Gang Xu, Yuwei Wang, Qi Li, Min Liu
Title: FNBench: Benchmarking Robust Federated Learning against Noisy Labels
Abstract:
Robustness to label noise within data is a significant challenge in federated learning (FL). From the data-centric perspective, the data quality of distributed datasets cannot be guaranteed since annotations of different clients contain complicated label noise of varying degrees, which causes performance degradation. There have been some early attempts to tackle noisy labels in FL. However, there exists a lack of benchmark studies on comprehensively evaluating their practical performance under unified settings. To this end, we propose the first benchmark study FNBench to provide an experimental investigation which considers three diverse label noise patterns covering synthetic label noise, imperfect human-annotation errors and systematic errors. Our evaluation incorporates eighteen state-of-the-art methods over five image recognition datasets and one text classification dataset. Meanwhile, we provide observations to understand why noisy labels impair FL, and additionally exploit a representation-aware regularization method to enhance the robustness of existing methods against noisy labels based on our observations. Finally, we discuss the limitations of this work and propose three future directions. To facilitate related communities, our source code is open-sourced at https://github.com/Sprinter1999/FNBench.
中文: 联邦学习面临客户端标签噪声的挑战,为此我们推出了首个基准研究FNBench,在多种噪声模式下评估了18种方法,并提出了一种基于表示的规则化方法以提高抗噪性。
English: Federated learning faces challenges from varying label noise across clients, prompting the creation of FNBench, the first benchmark that evaluates 18 methods under diverse noise patterns and introduces a regularization technique to enhance robustness.

Authors:Lei Hu, Zhiyong Gan, Ling Deng, Jinglin Liang, Lingyu Liang, Shuangping Huang, Tianshui Chen
Title: ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection
Abstract:
Continual Anomaly Detection (CAD) enables anomaly detection models to learn new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replays high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at https://github.com/HULEI7/ReplayCAD.
Chinese: ReplayCAD提出了一种基于扩散模型的生成式回放框架,通过保留像素级细节来缓解持续异常检测中的灾难性遗忘并提升分割精度,在多个数据集上实现了显著的性能提升。
English: ReplayCAD introduces a diffusion-driven generative replay framework that preserves pixel-level details to mitigate catastrophic forgetting and enhance segmentation in continual anomaly detection, achieving state-of-the-art performance with significant improvements.

Authors:Maxim Vashkevich, Egor Krivalcevich
Title: Compact and Efficient Neural Networks for Image Recognition Based on Learned 2D Separable Transform
Abstract:
The paper presents a learned two-dimensional separable transform (LST) that can be considered a new type of computational layer for constructing neural network (NN) architectures for image recognition tasks. The LST is based on the idea of sharing the weights of one fully-connected (FC) layer to process all rows of an image. After that, a second shared FC layer is used to process all columns of the image representation obtained from the first layer. The use of LST layers in an NN architecture significantly reduces the number of model parameters compared to models that use stacked FC layers. We show that an NN classifier based on a single LST layer followed by an FC layer achieves 98.02% accuracy on the MNIST dataset, while having only 9.5k parameters. We also implemented an LST-based classifier for handwritten digit recognition on the FPGA platform to demonstrate the efficiency of the suggested approach for designing a compact and high-performance implementation of NN models. Git repository with supplementary materials: https://github.com/Mak-Sim/LST-2d
中文: 该论文提出一种学习型二维可分离变换(LST)作为神经网络层,通过共享全连接层权重处理图像行列,在MNIST数据集上以仅9.5k参数实现98.02%准确率,并展示了在FPGA平台的高效部署。
English: This paper introduces a learned two-dimensional separable transform (LST) as a neural network layer that reduces parameters while maintaining high accuracy, achieving 98.02% on MNIST with only 9.5k parameters and demonstrating efficient FPGA implementation.
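A minimal PyTorch sketch of the row/column weight-sharing idea follows; the hidden sizes and the classifier head are placeholders for illustration, not the paper's 9.5k-parameter configuration.

    import torch
    import torch.nn as nn

    class LST2d(nn.Module):
        """One shared FC layer transforms every row, then a second shared FC
        layer transforms every column of the intermediate representation."""
        def __init__(self, h_in, w_in, h_out, w_out):
            super().__init__()
            self.row_fc = nn.Linear(w_in, w_out)   # shared across all rows
            self.col_fc = nn.Linear(h_in, h_out)   # shared across all columns

        def forward(self, x):                       # x: (B, H, W)
            x = self.row_fc(x)                      # rows   -> (B, H, W_out)
            x = self.col_fc(x.transpose(1, 2))      # columns -> (B, W_out, H_out)
            return x.transpose(1, 2)                # (B, H_out, W_out)

    # Toy MNIST-style classifier: one LST layer followed by an FC layer
    # (dimensions are placeholders, not the paper's exact model).
    model = nn.Sequential(LST2d(28, 28, 12, 12), nn.Flatten(), nn.ReLU(),
                          nn.Linear(12 * 12, 10))
    logits = model(torch.randn(4, 28, 28))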

Authors:Feng Liu, Ziwang Fu, Yunlong Wang, Qijian Zheng
Title: TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition
Abstract:
The fusion technique is the key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the other during cross-modal interaction, and the features that can reinforce a modality may contain only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, for the redundant features, we make one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with another modality. To better capture the complementary information between the modalities, we obtain the fused weight vector by splicing and use the weight vector to achieve feature reinforcement of the modalities. We apply TACFN to the RAVDESS and IEMOCAP datasets. For fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement compared to other methods and reaches the state-of-the-art. All code and models could be accessed from https://github.com/shuzihuaiyu/TACFN.
中文: 提出的基于Transformer的自适应跨模态融合网络(TACFN)通过自注意力机制进行模态内特征选择和权重融合,有效解决了跨模态情感识别中的冗余特征问题并增强了互补性,在标准数据集上取得了最优性能。
English: The proposed Transformer-based Adaptive Cross-modal Fusion Network (TACFN) addresses redundancy and enhances complementary feature capture in multimodal emotion recognition by using self-attention for intra-modal selection and weight-based reinforcement, achieving state-of-the-art results on benchmark datasets.

Authors:Jing Hu, Kaiwei Yu, Hongjiang Xian, Shu Hu, Xin Wang
Title: Improving Generalization of Medical Image Registration Foundation Model
Abstract:
Deformable registration is a fundamental task in medical image processing, aiming to achieve precise alignment by establishing nonlinear correspondences between images. Traditional methods offer good adaptability and interpretability but are limited by computational efficiency. Although deep learning approaches have significantly improved registration speed and accuracy, they often lack flexibility and generalizability across different datasets and tasks. In recent years, foundation models have emerged as a promising direction, leveraging large and diverse datasets to learn universal features and transformation patterns for image registration, thus demonstrating strong cross-task transferability. However, these models still face challenges in generalization and robustness when encountering novel anatomical structures, varying imaging conditions, or unseen modalities. To address these limitations, this paper incorporates Sharpness-Aware Minimization (SAM) into foundation models to enhance their generalization and robustness in medical image registration. By optimizing the flatness of the loss landscape, SAM improves model stability across diverse data distributions and strengthens its ability to handle complex clinical scenarios. Experimental results show that foundation models integrated with SAM achieve significant improvements in cross-dataset registration performance, offering new insights for the advancement of medical image registration technology. Our code is available at https://github.com/Promise13/fm_sam.
中文: 本文通过将锐度感知最小化(SAM)融入基础模型,提升了医学图像配准的泛化能力和鲁棒性,使其能更好地适应不同数据集和复杂临床情况。
English: This paper enhances medical image registration by integrating Sharpness-Aware Minimization (SAM) into foundation models, improving generalization and robustness across diverse datasets and clinical scenarios.
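For reference, a generic two-step Sharpness-Aware Minimization update looks roughly as follows; loss_fn(model, batch) is an assumed closure, base_opt is assumed to hold all model parameters, and this is the standard SAM recipe rather than the authors' registration training loop.

    import torch

    def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
        """One generic Sharpness-Aware Minimization update (sketch)."""
        # 1) gradients at the current weights
        loss = loss_fn(model, batch)
        loss.backward()
        params = [p for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        # 2) ascend to the locally worst-case weights w + eps
        eps = []
        with torch.no_grad():
            for p in params:
                e = rho * p.grad / (grad_norm + 1e-12)
                p.add_(e)
                eps.append(e)
        base_opt.zero_grad()
        # 3) gradients at the perturbed weights
        loss_fn(model, batch).backward()
        # 4) restore the weights and take the base optimizer step
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.sub_(e)
        base_opt.step()
        base_opt.zero_grad()
        return loss.item()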

Authors:Hang Wang, Zhi-Qi Cheng, Chenhao Lin, Chao Shen, Lei Zhang
Title: HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation
Abstract:
Text-to-image synthesis has progressed to the point where models can generate visually compelling images from natural language prompts. Yet, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, particularly in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that employs bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, achieving a 0.69 improvement in Frechet Inception Distance (FID) and a 0.0295 gain in CLIP Score. These results demonstrate HCMA's effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image generation. Our code is available at https://github.com/hwang-cs-ime/HCMA.
中文: 提出的分层跨模态对齐(HCMA)框架通过结合全局语义对齐和基于边界框的局部空间控制,在文本到图像生成中实现了对复杂语义的精确捕捉与空间约束的严格遵循,在基准测试中以显著提升的FID和CLIP分数刷新了最优性能。
English: The proposed Hierarchical Cross-Modal Alignment (HCMA) framework enhances text-to-image generation by integrating global semantic alignment and local spatial control through bounding-box layouts, achieving state-of-the-art performance on benchmark datasets with significant improvements in FID and CLIP Score metrics.

Authors:Haoyang Xie, Feng Ju
Title: Text-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities
Abstract:
Computer-aided design (CAD) is fundamental to modern engineering and manufacturing, but creating CAD models still requires expert knowledge and specialized software. Recent advances in large language models (LLMs) open up the possibility of generative CAD, where natural language is directly translated into parametric 3D models. However, most existing methods generate task-specific command sequences that pretrained models cannot directly handle. These sequences must be converted into CAD representations such as CAD vectors before a 3D model can be produced, which requires training models from scratch and adds unnecessary complexity. To tackle this issue, we propose generating CadQuery code directly from text, leveraging the strengths of pretrained LLMs to produce 3D models without intermediate representations, using this Python-based scripting language. Since LLMs already excel at Python generation and spatial reasoning, fine-tuning them on Text-to-CadQuery data proves highly effective. Given that these capabilities typically improve with scale, we hypothesize that larger models will perform better after fine-tuning. To enable this, we augment the Text2CAD dataset with 170,000 CadQuery annotations. We fine-tune six open-source LLMs of varying sizes and observe consistent improvements. Our best model achieves a top-1 exact match of 69.3%, up from 58.8%, and reduces Chamfer Distance by 48.6%. Project page: https://github.com/Text-to-CadQuery/Text-to-CadQuery.
Chinese: 本研究提出了一种利用预训练大语言模型直接从文本生成CadQuery代码的方法,省去中间表示步骤,在创建参数化3D模型方面实现了准确性和效率的显著提升。
English: This research introduces a method to generate CadQuery code directly from text using pretrained large language models, eliminating intermediate representations and achieving significant improvements in accuracy and efficiency for creating parametric 3D models.
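For context, CadQuery scripts of the kind this pipeline targets are short parametric Python programs; the example below (a plate with a through-hole) is purely illustrative and is not drawn from the augmented Text2CAD dataset.

    import cadquery as cq
    from cadquery import exporters

    # Hypothetical prompt: "A 20 x 10 x 5 mm rectangular plate with a 3 mm
    # hole through the top face."
    result = (
        cq.Workplane("XY")
        .box(20, 10, 5)      # base plate
        .faces(">Z")         # select the top face
        .workplane()
        .hole(3)             # 3 mm diameter through-hole
    )
    exporters.export(result, "plate.step")   # parametric model -> STEP file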

Authors:Md Rakibul Hasan, Pouria Behnoudfar, Dan MacKinlay, Thomas Poulet
Title: PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
Abstract:
Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super-Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional SR methods, even with limited training data (e.g., only 13% of training data is required to achieve performance similar to SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning by improving accuracy and efficiency, enhancing process understanding, and broadening applications to scientific research. We publicly release the complete source code of PC-SRGAN and all experiments at https://github.com/hasan-rakibul/PC-SRGAN.
中文:PC-SRGAN通过生成物理一致的图像革新了超分辨率技术,在极少训练数据下显著提升精度和效率,并推动科学机器学习应用的发展。
English: PC-SRGAN revolutionizes Super-Resolution by generating physically consistent images, significantly improving accuracy and efficiency with minimal training data while advancing scientific machine learning applications.

Authors:Shehryar Khattak, Timon Homberger, Lukas Bernreiter, Julian Nubert, Olov Andersson, Roland Siegwart, Kostas Alexis, Marco Hutter
Title: CompSLAM: Complementary Hierarchical Multi-Modal Localization and Mapping for Robot Autonomy in Underground Environments
Abstract:
Robot autonomy in unknown, GPS-denied, and complex underground environments requires real-time, robust, and accurate onboard pose estimation and mapping for reliable operations. This becomes particularly challenging in perception-degraded subterranean conditions under harsh environmental factors, including darkness, dust, and geometrically self-similar structures. This paper details CompSLAM, a highly resilient and hierarchical multi-modal localization and mapping framework designed to address these challenges. Its flexible architecture achieves resilience through redundancy by leveraging the complementary nature of pose estimates derived from diverse sensor modalities. Developed during the DARPA Subterranean Challenge, CompSLAM was successfully deployed on all aerial, legged, and wheeled robots of Team Cerberus during their competition-winning final run. Furthermore, it has proven to be a reliable odometry and mapping solution in various subsequent projects, with extensions enabling multi-robot map sharing for marsupial robotic deployments and collaborative mapping. This paper also introduces a comprehensive dataset acquired by a manually teleoperated quadrupedal robot, covering a significant portion of the DARPA Subterranean Challenge finals course. This dataset evaluates CompSLAM's robustness to sensor degradations as the robot traverses 740 meters in an environment characterized by highly variable geometries and demanding lighting conditions. The CompSLAM code and the DARPA SubT Finals dataset are made publicly available for the benefit of the robotics community.
Chinese: CompSLAM是一种在DARPA地下挑战赛中开发的多模态定位与绘图框架,能在恶劣的地下环境中实现稳健的实时操作,并已成功应用于多种机器人平台。
English: CompSLAM is a resilient, multi-modal framework for real-time localization and mapping in challenging underground environments, developed during the DARPA Subterranean Challenge and successfully deployed on various robots.

Authors:Valfride Nascimento, Gabriel E. Lima, Rafael O. Ribeiro, William Robson Schwartz, Rayson Laroca, David Menotti
Title: Toward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark
Abstract:
Recent advancements in super-resolution for License Plate Recognition (LPR) have sought to address challenges posed by low-resolution (LR) and degraded images in surveillance, traffic monitoring, and forensic applications. However, existing studies have relied on private datasets and simplistic degradation models. To address this gap, we introduce UFPR-SR-Plates, a novel dataset containing 10,000 tracks with 100,000 paired low and high-resolution license plate images captured under real-world conditions. We establish a benchmark using multiple sequential LR and high-resolution (HR) images per vehicle -- five of each -- and two state-of-the-art models for super-resolution of license plates. We also investigate three fusion strategies to evaluate how combining predictions from a leading Optical Character Recognition (OCR) model for multiple super-resolved license plates enhances overall performance. Our findings demonstrate that super-resolution significantly boosts LPR performance, with further improvements observed when applying majority vote-based fusion techniques. Specifically, the Layout-Aware and Character-Driven Network (LCDNet) model combined with the Majority Vote by Character Position (MVCP) strategy led to the highest recognition rates, increasing from 1.7% with low-resolution images to 31.1% with super-resolution, and up to 44.7% when combining OCR outputs from five super-resolved images. These findings underscore the critical role of super-resolution and temporal information in enhancing LPR accuracy under real-world, adverse conditions. The proposed dataset is publicly available to support further research and can be accessed at: https://valfride.github.io/nascimento2024toward/
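A Majority Vote by Character Position strategy can be sketched in a few lines; the padding and tie handling below are simplifications, and the function name is assumed rather than taken from the benchmark code.

    from collections import Counter

    def majority_vote_by_position(predictions):
        """Fuse OCR strings from several super-resolved frames by voting
        per character position (MVCP-style sketch)."""
        length = max(len(p) for p in predictions)
        fused = []
        for i in range(length):
            votes = Counter(p[i] for p in predictions if len(p) > i)
            fused.append(votes.most_common(1)[0][0])
        return "".join(fused)

    plates = ["ABC1234", "A8C1234", "ABC1Z34", "ABC1234", "4BC1234"]
    print(majority_vote_by_position(plates))   # -> ABC1234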

Authors:Gabriele Rosi, Fabio Cermelli
Title: Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation
Abstract:
Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open-vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land-cover). We evaluate 5 open-vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi-class segmentation through a confidence-based mask merging strategy. Our extensive experiments reveal that open-vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high variability depending on the input prompt. Through comprehensive quantitative and qualitative analysis, we identify the strengths and weaknesses of both prompting modalities, providing valuable insights to guide future research in vision foundation models for segmentation tasks.
中文: Show or Tell (SoT)基准测试在14个数据集中评估了语义分割的视觉与文本提示,发现开放词汇方法擅长处理文本易描述概念而视觉提示结果波动较大,为视觉基础模型的未来发展提供了重要参考。
English: The Show or Tell (SoT) benchmark evaluates both visual and textual prompts for semantic segmentation across 14 datasets, revealing that open-vocabulary methods excel with text-friendly concepts while visual prompts show high variability, offering insights for future vision model development.
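One plausible reading of the confidence-based mask merging step is a per-pixel argmax over per-class score maps with a background threshold, sketched below; the threshold value and the background handling are assumptions, not the benchmark's exact rule.

    import numpy as np

    def merge_masks(score_maps, class_names, bg_thresh=0.5):
        """Per-pixel argmax over (C, H, W) per-class confidence maps,
        with low-confidence pixels assigned to background."""
        best = score_maps.argmax(axis=0)                    # winning class index per pixel
        confident = score_maps.max(axis=0) >= bg_thresh     # suppress weak predictions
        labels = np.where(confident, best + 1, 0)           # 0 = background
        return labels, ["background"] + list(class_names)

    scores = np.random.rand(3, 4, 4)                        # toy: 3 classes on a 4x4 image
    label_map, names = merge_masks(scores, ["cat", "dog", "bird"])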

Authors:Yufan Deng, Yuhao Zhang, Chen Geng, Shangzhe Wu, Jiajun Wu
Title: Anymate: A Dataset and Baselines for Learning 3D Object Rigging
Abstract:
Rigging and skinning are essential steps to create realistic 3D animations, often requiring significant expertise and manual effort. Traditional attempts at automating these processes rely heavily on geometric heuristics and often struggle with objects of complex geometry. Recent data-driven approaches show potential for better generality, but are often constrained by limited training data. We present the Anymate Dataset, a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information -- 70 times larger than existing datasets. Using this dataset, we propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. We systematically design and experiment with various architectures as baselines for each module and conduct comprehensive evaluations on our dataset to compare their performance. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning. Code and dataset can be found at https://anymate3d.github.io/.

Authors:Radu Alexandru Rosu, Keyu Wu, Yao Feng, Youyi Zheng, Michael J. Black
Title: DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models
Abstract:
We address the task of generating 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-processing to decode, upsample, and add realism. These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. First, we address the lack of 3D hair data by automating the creation of the largest synthetic hair dataset to date, containing 40K hairstyles. Second, we leverage the synthetic hair dataset to learn an image-conditioned diffusion-transformer model that generates accurate 3D strands from a single frontal image. By using a pretrained image backbone, our method generalizes to in-the-wild images despite being trained only on synthetic data. Our diffusion model predicts a scalp texture map in which any point in the map contains the latent code for an individual hair strand. These codes are directly decoded to 3D strands without post-processing techniques. Representing individual strands, instead of guide strands, enables the transformer to model the detailed spatial structure of complex hairstyles. With this, DiffLocks can recover highly curled hair, like afro hairstyles, from a single image for the first time. Data and code are available at https://radualexandru.github.io/difflocks/

Authors:Wenqi Zeng, Yuqi Sun, Chenxi Ma, Weimin Tan, Bo Yan
Title: MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks
Abstract:
Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, a specialized dermatology VLM capable of delivering professional and detailed diagnostic analysis remains underdeveloped, primarily due to less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset that encompasses 3 imaging modalities (clinical, dermoscopic, and pathological) and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following vision question answering (VQA) samples (9 times the size of the current largest dermatology VQA dataset). Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT) and zero-shot classification tasks across 8 datasets reveal its exceptional performance for skin diseases in comparison to both general and medical VLMs. The introduction of MM-Skin and SkinVL offers a meaningful contribution to advancing the development of clinical dermatology VLM assistants. MM-Skin is available at https://github.com/ZwQ803/MM-Skin
中文: 该研究提出了MM-Skin,一个包含多种成像模式和大规模高质量图文对的多模态皮肤病数据集,并开发了SkinVL这一专业视觉语言模型,在皮肤病解读任务中展现出优于现有模型的卓越性能。
English: The study introduces MM-Skin, a large-scale multimodal dermatology dataset with diverse imaging modalities and high-quality image-text pairs, and develops SkinVL, a specialized vision-language model that demonstrates superior performance in skin disease interpretation compared to existing models.

Authors:Shuaiyi Huang, Mara Levy, Anubhav Gupta, Daniel Ekpo, Ruijie Zheng, Abhinav Shrivastava
Title: TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations
Abstract:
Preference feedback collected by human or VLM annotators is often noisy, presenting a significant challenge for preference-based reinforcement learning that relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously, where each model views its small-loss preference pairs as useful knowledge and teaches such useful pairs to its peer network for updating the parameters. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its effective robustness in handling noisy preference feedback. Project page: https://shuaiyihuang.github.io/publications/TREND.
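Schematically, one tri-teaching update can look like the sketch below: each reward model keeps its small-loss (likely clean) preference pairs and a peer trains on them. The pair_losses callable, the cyclic peer assignment, and the keep ratio are assumptions for illustration, not the paper's exact procedure.

    import torch

    def tri_teaching_step(models, optimizers, batch, pair_losses, keep_ratio=0.7):
        """One schematic tri-teaching update over a batch of preference pairs."""
        # 1) each reward model ranks the pairs by its own loss and keeps the
        #    small-loss (likely clean) fraction
        selections = []
        for m in models:
            with torch.no_grad():
                losses = pair_losses(m, batch)            # (N,) loss per preference pair
            k = max(1, int(keep_ratio * losses.numel()))
            selections.append(torch.topk(losses, k, largest=False).indices)
        # 2) each model is then updated on the pairs selected by a peer (cyclic here)
        for i, (m, opt) in enumerate(zip(models, optimizers)):
            idx = selections[(i + 1) % len(models)]
            opt.zero_grad()
            pair_losses(m, batch)[idx].mean().backward()
            opt.step()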

Authors:Dongqian Guo, Wencheng Han, Pang Lyu, Yuxi Zhou, Jianbing Shen
Title: Towards Better Cephalometric Landmark Detection with Diffusion Data Generation
Abstract:
Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large-scale vision models. To address these challenges, we have developed an innovative data generation method capable of producing diverse cephalometric X-ray images along with corresponding annotations without human intervention. To achieve this, our approach begins by constructing new cephalometric landmark annotations using anatomical priors. Then, we employ a diffusion-based generator to create realistic X-ray images that correspond closely with these annotations. To achieve precise control in producing samples with different attributes, we introduce a novel prompt-annotated cephalometric X-ray image dataset. This dataset includes real cephalometric X-ray images and detailed medical text prompts describing the images. By leveraging these detailed prompts, our method improves the generation process to control different styles and attributes. Facilitated by the large, diverse generated data, we introduce large-scale vision detection models into the cephalometric landmark detection task to improve accuracy. Experimental results demonstrate that training with the generated data substantially enhances performance. Compared to methods without using the generated data, our approach improves the Success Detection Rate (SDR) by 6.5%, attaining a notable 82.2%. All code and data are available at: https://um-lab.github.io/cepha-generation
中文摘要:本研究提出了一种创新的数据生成方法,通过解剖学先验和扩散模型生成具有自动标注的多样化头影测量X射线图像,使大规模视觉模型成功检测率达到82.2%,显著提升了标志点检测性能。
English Summary: This study introduces an innovative data generation method using anatomical priors and diffusion models to create diverse cephalometric X-ray images with automated annotations, enabling large-scale vision models to achieve an 82.2% success detection rate and significantly improving landmark detection performance.

Authors:Congqi Cao, Peiheng Han, Yueran Zhang, Yating Yu, Qinyi Lv, Lingtong Min, Yanning Zhang
Title: Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition
Abstract:
Large-scale pre-trained models have achieved remarkable success in language and image tasks, leading an increasing number of studies to explore the application of pre-trained image models, such as CLIP, in the domain of few-shot action recognition (FSAR). However, current methods generally suffer from several problems: 1) Direct fine-tuning often undermines the generalization capability of the pre-trained model; 2) The exploration of task-specific information is insufficient in the visual tasks; 3) The semantic order information is typically overlooked during text modeling; 4) Existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. To address these, we propose Task-Adapter++, a parameter-efficient dual adaptation method for both image and text encoders. Specifically, to make full use of the variations across different few-shot learning tasks, we design a task-specific adaptation for the image encoder so that the most discriminative information can be well noticed during feature extraction. Furthermore, we leverage large language models (LLMs) to generate detailed sequential sub-action descriptions for each action class, and introduce semantic order adapters into the text encoder to effectively model the sequential relationships between these sub-actions. Finally, we develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as semantic descriptions. Extensive experiments fully demonstrate the effectiveness and superiority of the proposed method, which achieves state-of-the-art performance on 5 benchmarks consistently. The code is open-sourced at https://github.com/Jaulin-Bage/Task-Adapter-pp.
中文: 本文提出Task-Adapter++方法,通过图像编码器的任务适配、大语言模型生成的序列化子动作描述以及细粒度跨模态对齐策略,在少样本动作识别任务中实现了最优性能,并在五个基准测试中取得领先成果。
English: This paper introduces Task-Adapter++, a parameter-efficient dual adaptation method that enhances few-shot action recognition by incorporating task-specific image adaptation, sequential sub-action modeling with LLMs, and a fine-grained cross-modal alignment strategy, achieving state-of-the-art results on multiple benchmarks.

Authors:Weihong Li, Xiaoqiong Liu, Heng Fan, Libo Zhang
Title: CGTrack: Cascade Gating Network with Hierarchical Feature Aggregation for UAV Tracking
Abstract:
Recent advancements in visual object tracking have markedly improved the capabilities of unmanned aerial vehicle (UAV) tracking, which is a critical component in real-world robotics applications. While the integration of hierarchical lightweight networks has become a prevalent strategy for enhancing efficiency in UAV tracking, it often results in a significant drop in network capacity, which further exacerbates challenges in UAV scenarios, such as frequent occlusions and extreme changes in viewing angles. To address these issues, we introduce a novel family of UAV trackers, termed CGTrack, which combines explicit and implicit techniques to expand network capacity within a coarse-to-fine framework. Specifically, we first introduce a Hierarchical Feature Cascade (HFC) module that leverages the spirit of feature reuse to increase network capacity by integrating the deep semantic cues with the rich spatial information, incurring minimal computational costs while enhancing feature representation. Based on this, we design a novel Lightweight Gated Center Head (LGCH) that utilizes gating mechanisms to decouple target-oriented coordinates from previously expanded features, which contain dense local discriminative information. Extensive experiments on three challenging UAV tracking benchmarks demonstrate that CGTrack achieves state-of-the-art performance while running fast. Code will be available at https://github.com/Nightwatch-Fox11/CGTrack.
Chinese: CGTrack模型通过引入分层特征级联模块和轻量门控中心头,在提升无人机跟踪网络容量和效率的同时,在多个基准测试中实现了最先进的性能且运行迅速。
English: The CGTrack model introduces a Hierarchical Feature Cascade module and a Lightweight Gated Center Head to enhance UAV tracking by boosting network capacity and efficiency, achieving top performance on benchmarks with high speed.

Authors:Jianjian Yin, Yi Chen, Chengyu Li, Zhichao Zheng, Yanhui Gu, Junsheng Zhou
Title: DFEN: Dual Feature Equalization Network for Medical Image Segmentation
Abstract:
Current methods for medical image segmentation primarily focus on extracting contextual feature information from the perspective of the whole image. While these methods have shown effective performance, none of them account for the fact that pixels at boundaries and in regions with few class pixels capture more contextual feature information from other classes, leading to misclassification caused by this unequal contextual feature information. In this paper, we propose a dual feature equalization network based on the hybrid architecture of Swin Transformer and Convolutional Neural Network, aiming to augment the pixel feature representations by image-level equalization feature information and class-level equalization feature information. Firstly, the image-level feature equalization module is designed to equalize the contextual information of pixels within the image. Secondly, we aggregate regions of the same class to equalize the pixel feature representations of the corresponding class by the class-level feature equalization module. Finally, the pixel feature representations are enhanced by learning weights for image-level equalization feature information and class-level equalization feature information. In addition, Swin Transformer is utilized as both the encoder and decoder, thereby bolstering the ability of the model to capture long-range dependencies and spatial correlations. We conducted extensive experiments on Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC2017), Automated Cardiac Diagnosis Challenge (ACDC) and PH$^2$ datasets. The experimental results demonstrate that our method has achieved state-of-the-art performance. Our code is publicly available at https://github.com/JianJianYin/DFEN.
中文摘要:本文提出一种基于Swin Transformer与CNN混合架构的双重特征均衡网络,通过图像级和类别级特征均衡增强像素特征表示,在多个医学影像数据集上实现了最先进的性能。
English Summary: This paper introduces a dual feature equalization network combining Swin Transformer and CNN to enhance pixel feature representation through image-level and class-level feature equalization, achieving state-of-the-art performance on multiple medical image datasets.

Authors:Hanzhe Liang, Aoran Wang, Jie Zhou, Xin Jin, Can Gao, Jinbao Wang
Title: Examining the Source of Defects from a Mechanical Perspective for 3D Anomaly Detection
Abstract:
In this paper, we explore a novel approach to 3D anomaly detection (AD) that goes beyond merely identifying anomalies based on structural characteristics. Our primary perspective is that most anomalies arise from unpredictable defective forces originating from both internal and external sources. To address these anomalies, we seek out opposing forces that can help correct them. Therefore, we introduce the Mechanics Complementary Model-based Framework for the 3D-AD task (MC4AD), which generates internal and external corrective forces for each point. We first propose a Diverse Anomaly-Generation (DA-Gen) module designed to simulate various types of anomalies. Next, we present the Corrective Force Prediction Network (CFP-Net), which uses complementary representations for point-level analysis to simulate the different contributions from internal and external corrective forces. To ensure the corrective forces are constrained effectively, we have developed a combined loss function that includes a new symmetric loss and an overall loss. Notably, we implement a Hierarchical Quality Control (HQC) strategy based on a three-way decision process and contribute a dataset titled Anomaly-IntraVariance, which incorporates intraclass variance to evaluate our model. As a result, the proposed MC4AD has been proven effective through theory and experimentation. The experimental results demonstrate that our approach yields nine state-of-the-art results, achieving optimal results with minimal parameters and the fastest inference speed across five existing datasets, in addition to the proposed Anomaly-IntraVariance dataset. The source code is available at https://github.com/hzzzzzhappy/MC4AD
Chinese: 本文提出MC4AD框架,通过生成内外矫正力的力学互补模型解决3D异常检测问题,在多个数据集上以最少参数和最快推理速度实现了最优性能。
English: This paper introduces MC4AD, a novel 3D anomaly detection framework that generates corrective forces to address anomalies through a complementary mechanics model, achieving state-of-the-art performance with minimal parameters and fast inference across multiple datasets.

Authors:Chunlai Dong, Haochao Ying, Qibo Qiu, Jinhong Wang, Danny Chen, Jian Wu
Title: Dual-level Fuzzy Learning with Patch Guidance for Image Ordinal Regression
Abstract:
Ordinal regression bridges regression and classification by assigning objects to ordered classes. While human experts rely on discriminative patch-level features for decisions, current approaches are limited by the availability of only image-level ordinal labels, overlooking fine-grained patch-level characteristics. In this paper, we propose a Dual-level Fuzzy Learning with Patch Guidance framework, named DFPG, which learns precise feature-based grading boundaries from ambiguous ordinal labels with patch-level supervision. Specifically, we propose patch-labeling and filtering strategies to enable the model to focus on patch-level features exclusively with only image-level ordinal labels available. We further design a dual-level fuzzy learning module, which leverages fuzzy logic to quantitatively capture and handle label ambiguity from both patch-wise and channel-wise perspectives. Extensive experiments on various image ordinal regression datasets demonstrate the superiority of our proposed method, further confirming its ability to distinguish samples from difficult-to-classify categories. The code is available at https://github.com/ZJUMAI/DFPG-ord.
中文摘要:提出的DFPG框架通过双重模糊学习与局部引导,利用图像级标签提取局部特征,有效处理标记模糊性并提升困难类别的分类性能。
English Summary: The proposed DFPG framework introduces dual-level fuzzy learning with patch guidance to address ordinal regression by leveraging patch-level features from image-level labels, effectively handling label ambiguity and improving classification of challenging categories.

Authors:Zhiyuan Chen, Keyi Li, Yifan Jia, Le Ye, Yufei Ma
Title: Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition
Abstract:
Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computation complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to utilize the inherent temporal similarity to skip redundant computations of DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method always achieves better performance than existing naive caching methods with a similar computation resource budget. When compared with 35-step DDIM, our method eliminates more than 45% computation and improves IS by 12 at the cost of an FID increase of less than 0.06. Code is available at https://github.com/ccccczzy/icc.
中文: 本文提出了一种无需训练的扩散变换器加速方法——增量校准缓存,通过低秩近似和通道感知SVD在减少45%以上计算量的同时保持图像生成质量。
English: This paper introduces a training-free acceleration method for diffusion transformers called increment-calibrated caching, which uses low-rank approximation and channel-aware SVD to reduce computations by over 45% while maintaining image quality.
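As a rough illustration of the channel-aware SVD component, the sketch below builds a low-rank approximation after scaling rows by a per-channel factor so that outlier channels are reconstructed more faithfully; the scaling rule and rank are assumptions, and this omits the caching and calibration machinery itself.

    import torch

    def low_rank_calibration(matrix, rank=8, channel_scale=None):
        """Rank-r approximation with optional per-channel (row) scaling before the SVD."""
        m = matrix if channel_scale is None else matrix * channel_scale[:, None]
        U, S, Vh = torch.linalg.svd(m, full_matrices=False)
        approx = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
        if channel_scale is not None:
            approx = approx / channel_scale[:, None]   # undo the scaling
        return approx

    w = torch.randn(256, 512)
    scale = w.abs().amax(dim=1).clamp(min=1e-6)   # emphasize channels with large activations
    w_lowrank = low_rank_calibration(w, rank=16, channel_scale=scale)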

Authors:Amin Ghafourian, Andrew Lee, Dechen Gao, Tyler Beer, Kin Yen, Iman Soltani
Title: Automating Infrastructure Surveying: A Framework for Geometric Measurements and Compliance Assessment Using Point Cloud Data
Abstract:
Automation can play a prominent role in improving efficiency, accuracy, and scalability in infrastructure surveying and assessing construction and compliance standards. This paper presents a framework for automation of geometric measurements and compliance assessment using point cloud data. The proposed approach integrates deep learning-based detection and segmentation, in conjunction with geometric and signal processing techniques, to automate surveying tasks. As a proof of concept, we apply this framework to automatically evaluate the compliance of curb ramps with the Americans with Disabilities Act (ADA), demonstrating the utility of point cloud data in survey automation. The method leverages a newly collected, large annotated dataset of curb ramps, made publicly available as part of this work, to facilitate robust model training and evaluation. Experimental results, including comparison with manual field measurements of several ramps, validate the accuracy and reliability of the proposed method, highlighting its potential to significantly reduce manual effort and improve consistency in infrastructure assessment. Beyond ADA compliance, the proposed framework lays the groundwork for broader applications in infrastructure surveying and automated construction evaluation, promoting wider adoption of point cloud data in these domains. The annotated database, manual ramp survey data, and developed algorithms are publicly available on the project's GitHub page: https://github.com/Soltanilara/SurveyAutomation.
中文: 本文提出了一种利用点云数据和深度学习的自动化框架,通过评估ADA路缘坡道合规性展示了其在基础设施检测中的高效性,显著减少了人工工作量并提高了评估一致性。
English: This paper introduces an automated framework using point cloud data and deep learning to efficiently assess infrastructure compliance, demonstrated through ADA curb ramp evaluations, which reduces manual effort and enhances assessment consistency.
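To give a concrete flavor of the geometric measurements such a pipeline produces, the sketch below fits a plane to a segmented ramp point cloud and converts the plane normal into a running slope, then compares it against the commonly cited 8.33% (1:12) ADA limit. This is an illustrative simplification rather than the released algorithm; real assessments also check cross slope, width, and other criteria.

```python
import numpy as np

def ramp_running_slope(points: np.ndarray) -> float:
    """Fit a plane to segmented ramp points (N, 3) and return slope in percent."""
    centered = points - points.mean(axis=0)
    _, _, vh = np.linalg.svd(centered, full_matrices=False)
    nx, ny, nz = np.abs(vh[-1])       # direction of least variance = plane normal
    return 100.0 * np.hypot(nx, ny) / max(nz, 1e-9)

def exceeds_running_slope_limit(points: np.ndarray, limit_pct: float = 8.33) -> bool:
    # 8.33% (1:12) is the commonly cited ADA maximum for a ramp's running slope.
    return ramp_running_slope(points) > limit_pct
```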

Authors:Zhangchi Hu, Peixi Wu, Jie Chen, Huyue Zhu, Yijun Wang, Yansong Peng, Hebei Li, Xiaoyan Sun
Title: Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection
Abstract:
Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. Besides, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code is available at https://github.com/RicePasteM/Dome-DETR.
Chinese: Dome-DETR提出了一种密度导向的特征查询操作框架,通过轻量级提取器和自适应查询初始化,在保持计算效率的同时实现了最先进的微小目标检测性能。
English: Dome-DETR introduces a density-oriented feature-query manipulation framework with a lightweight extractor and adaptive query initialization, achieving state-of-the-art tiny object detection performance while maintaining computational efficiency.

Authors:Ho-Joong Kim, Yearang Lee, Jung-Ho Hong, Seong-Whan Lee
Title: DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer
Abstract:
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of originally designed architectures for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder that consists of multi-scale deformable attention and feedforward network with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Code is available at: https://github.com/Dotori-HJ/DiGIT
中文: 本文提出DiGIT模型,通过多扩张门控编码器减少多尺度特征冗余,并结合中心-邻近区域集成解码器增强时序上下文捕捉能力,在多个基准测试中实现了最先进的时序动作检测性能。
English: This paper introduces DiGIT, a novel transformer-based model for temporal action detection that addresses limitations in existing query-based detectors by replacing standard components with a multi-dilated gated encoder to reduce feature redundancy and a central-adjacent region integrated decoder for enhanced temporal context capture, achieving state-of-the-art performance on multiple benchmarks.
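To make the encoder idea concrete, here is a minimal sketch of a multi-dilated gated temporal block: parallel dilated 1D convolutions cover short- and long-range context, and a learned gate mixes them per time step. It is an assumed simplification standing in for the paper's encoder, not its actual architecture.

```python
import torch
import torch.nn as nn

class MultiDilatedGatedBlock(nn.Module):
    """Hypothetical encoder block: parallel dilated temporal convolutions
    fused by a learned gate, standing in for multi-scale deformable attention."""
    def __init__(self, dim: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        )
        self.gate = nn.Conv1d(dim, len(dilations), kernel_size=1)
        self.proj = nn.Conv1d(dim, dim, kernel_size=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); output keeps the same shape via a residual path
        h = x.transpose(1, 2)                                            # (B, C, T)
        branch_outs = torch.stack([b(h) for b in self.branches], dim=1)  # (B, K, C, T)
        gates = torch.softmax(self.gate(h), dim=1).unsqueeze(2)          # (B, K, 1, T)
        mixed = (gates * branch_outs).sum(dim=1)                         # (B, C, T)
        out = self.proj(mixed).transpose(1, 2)                           # (B, T, C)
        return self.norm(x + out)
```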

Authors:Etai Sella, Yanir Kleiman, Hadar Averbuch-Elor
Title: InstanceGen: Image Generation with Instance-level Instructions
Abstract:
Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.

Authors:Gengyan Li, Paulo Gotardo, Timo Bolkart, Stephan Garbin, Kripasindhu Sarkar, Abhimitra Meka, Alexandros Lattas, Thabo Beeler
Title: TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling
Abstract:
Sparse volumetric reconstruction and rendering via 3D Gaussian splatting have recently enabled animatable 3D head avatars that are rendered under arbitrary viewpoints with impressive photorealism. Today, such photoreal avatars are seen as a key component in emerging applications in telepresence, extended reality, and entertainment. Building a photoreal avatar requires estimating the complex non-rigid motion of different facial components as seen in input video images; due to inaccurate motion estimation, animatable models typically present a loss of fidelity and detail when compared to their non-animatable counterparts, built from an individual facial expression. Also, recent state-of-the-art models are often affected by memory limitations that reduce the number of 3D Gaussians used for modeling, leading to lower detail and quality. To address these problems, we present a new high-detail 3D head avatar model that improves upon the state of the art, largely increasing the number of 3D Gaussians and modeling quality for rendering at 4K resolution. Our high-quality model is reconstructed from multiview input video and builds on top of a mesh-based 3D morphable model, which provides a coarse deformation layer for the head. Photoreal appearance is modelled by 3D Gaussians embedded within the continuous UVD tangent space of this mesh, allowing for more effective densification where most needed. Additionally, these Gaussians are warped by a novel UVD deformation field to capture subtle, localized motion. Our key contribution is the novel deformable Gaussian encoding and overall fitting procedure that allows our head model to preserve appearance detail, while capturing facial motion and other transient high-frequency features such as skin wrinkling.

Authors:Guilherme Vieira Neto, Marcos Eduardo Valle
Title: V-EfficientNets: Vector-Valued Efficiently Scaled Convolutional Neural Network Models
Abstract:
EfficientNet models are convolutional neural networks optimized for parameter allocation by jointly balancing network width, depth, and resolution. Renowned for their exceptional accuracy, these models have become a standard for image classification tasks across diverse computer vision benchmarks. While traditional neural networks learn correlations between feature channels during training, vector-valued neural networks inherently treat multidimensional data as coherent entities, naturally incorporating inter-channel relationships. This paper introduces vector-valued EfficientNets (V-EfficientNets), a novel extension of EfficientNet designed to process arbitrary vector-valued data. The proposed models are evaluated on a medical image classification task, achieving an average accuracy of 99.46% on the ALL-IDB2 dataset for detecting acute lymphoblastic leukemia. V-EfficientNets demonstrate remarkable efficiency, significantly reducing parameters while outperforming state-of-the-art models, including the original EfficientNet. The source code is available at https://github.com/mevalle/v-nets.
中文摘要:V-EfficientNets将EfficientNet的参数优化扩展至向量值数据,在白血病检测中以更少参数实现99.46%准确率,超越现有最优模型。
English Summary: V-EfficientNets extend EfficientNet's parameter optimization to vector-valued data, achieving 99.46% accuracy in leukemia detection with fewer parameters than state-of-the-art models.

Authors:Seraj Al Mahmud Mostafa, Chenxi Wang, Jia Yue, Yuta Hozumi, Jianwu Wang
Title: Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling
Abstract:
Object localization in satellite imagery is particularly challenging due to the high variability of objects, low spatial resolution, and interference from noise and dominant features such as clouds and city lights. In this research, we focus on three satellite datasets: upper atmospheric Gravity Waves (GW), mesospheric Bores (Bore), and Ocean Eddies (OE), each presenting its own unique challenges. These challenges include the variability in the scale and appearance of the main object patterns, where the size, shape, and feature extent of objects of interest can differ significantly. To address these challenges, we introduce YOLO-DCAP, a novel enhanced version of YOLOv5 designed to improve object localization in these complex scenarios. YOLO-DCAP incorporates a Multi-scale Dilated Residual Convolution (MDRC) block to capture multi-scale features with varying dilation rates, and an Attention-aided Spatial Pooling (AaSP) module to focus on globally relevant spatial regions, enhancing feature selection. These structural improvements help to better localize objects in satellite imagery. Experimental results demonstrate that YOLO-DCAP significantly outperforms both the YOLO base model and state-of-the-art approaches, achieving an average improvement of 20.95% in mAP50 and 32.23% in IoU over the base model, and 7.35% and 9.84%, respectively, over state-of-the-art alternatives, consistently across all three satellite datasets. These consistent gains highlight the robustness and generalizability of the proposed approach. Our code is open sourced at https://github.com/AI-4-atmosphere-remote-sensing/satellite-object-localization.
中文: 本研究提出的YOLO-DCAP改进模型通过多尺度空洞残差卷积和注意力空间池化模块,在三种卫星数据集上实现显著性能提升,相比基准模型mAP50提高20.95%、IoU提升32.23%,展现了卓越的物体定位能力。
English: This research introduces YOLO-DCAP, an enhanced YOLOv5 model with multi-scale dilated convolution and attention mechanisms, which significantly improves object localization in satellite imagery by achieving over 20% higher mAP50 and 32% better IoU than base models across three challenging datasets.
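A plausible PyTorch sketch of a multi-scale dilated residual convolution block of the kind MDRC describes is shown below; the exact kernel sizes, channel widths, and normalization used in YOLO-DCAP are assumptions here.

```python
import torch
import torch.nn as nn

class MDRCBlock(nn.Module):
    """Sketch of a multi-scale dilated residual convolution block: parallel
    3x3 convolutions with different dilation rates are concatenated, projected
    back to the input width, and added to the input as a residual."""
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(multi_scale)   # residual connection
```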

Authors:Yueh-Cheng Liu, Lukas Höllein, Matthias Nießner, Angela Dai
Title: QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
Abstract:
Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the needs of heuristics for densification. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by up to 48% in comparison to state of the art methods.

Authors:Peihao Wang, Yuehao Wang, Dilin Wang, Sreyas Mohan, Zhiwen Fan, Lemeng Wu, Ruisi Cai, Yu-Ying Yeh, Zhangyang Wang, Qiang Liu, Rakesh Ranjan
Title: Steepest Descent Density Control for Compact 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time, high-resolution novel view synthesis. By representing scenes as a mixture of Gaussian primitives, 3DGS leverages GPU rasterization pipelines for efficient rendering and reconstruction. To optimize scene coverage and capture fine details, 3DGS employs a densification algorithm to generate additional points. However, this process often leads to redundant point clouds, resulting in excessive memory usage, slower performance, and substantial storage demands, posing significant challenges for deployment on resource-constrained devices. To address this limitation, we propose a theoretical framework that demystifies and improves density control in 3DGS. Our analysis reveals that splitting is crucial for escaping saddle points. Through an optimization-theoretic approach, we establish the necessary conditions for densification, determine the minimal number of offspring Gaussians, identify the optimal parameter update direction, and provide an analytical solution for normalizing offspring opacity. Building on these insights, we introduce SteepGS, incorporating steepest density control, a principled strategy that minimizes loss while maintaining a compact point cloud. SteepGS achieves a ~50% reduction in Gaussian points without compromising rendering quality, significantly enhancing both efficiency and scalability.
中文摘要:3D高斯泼溅技术虽能实现实时视图合成,但存在点云冗余问题;SteepGS通过优化密度控制策略,在保持渲染质量的同时将高斯点数量减少约50%。
English Summary: 3D Gaussian Splatting enables real-time view synthesis but suffers from redundant points, which SteepGS addresses by optimizing density control to halve the Gaussian count while preserving quality.

Authors:Mikhail Chaichuk, Sushant Gautam, Steven Hicks, Elena Tutubalina
Title: Prompt to Polyp: Medical Text-Conditioned Image Synthesis with Diffusion Models
Abstract:
The generation of realistic medical images from text descriptions has significant potential to address data scarcity challenges in healthcare AI while preserving patient privacy. This paper presents a comprehensive study of text-to-image synthesis in the medical domain, comparing two distinct approaches: (1) fine-tuning large pre-trained latent diffusion models and (2) training small, domain-specific models. We introduce a novel model named MSDM, an optimized architecture based on Stable Diffusion that integrates a clinical text encoder, variational autoencoder, and cross-attention mechanisms to better align medical text prompts with generated images. Our study compares two approaches: fine-tuning large pre-trained models (FLUX, Kandinsky) versus training compact domain-specific models (MSDM). Evaluation across colonoscopy (MedVQA-GI) and radiology (ROCOv2) datasets reveals that while large models achieve higher fidelity, our optimized MSDM delivers comparable quality with lower computational costs. Quantitative metrics and qualitative evaluations by medical experts reveal strengths and limitations of each approach.
中文: 本研究比较了医学影像领域的文本到图像生成方法,发现微调大型模型可获得更高保真度,而新型MSDM模型能以更低计算成本实现相当质量。
English: This study compares text-to-image synthesis methods in medical imaging, showing that fine-tuning large models achieves higher fidelity while the novel MSDM model offers comparable quality with reduced computational costs.

Authors:Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Title: X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Abstract:
As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce X-Transfer, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as super transferability: a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through surrogate scaling, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models. The code is publicly available in our GitHub repository: https://github.com/HanxunH/XTransferBench.
中文: 本文提出X-Transfer攻击方法,通过动态代理缩放生成具有超级迁移性的通用对抗扰动,能高效地跨任务、跨领域破坏CLIP模型。
English: The paper introduces X-Transfer, a novel attack method that generates a universal adversarial perturbation with super transferability, efficiently compromising CLIP models across various tasks and domains through dynamic surrogate scaling.
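Stripped of the surrogate-scaling strategy that is the paper's main contribution, the underlying UAP optimization can be sketched as below: a single perturbation is optimized to push adversarial embeddings away from clean ones across a handful of surrogate CLIP image encoders. The encoder and dataloader interfaces are assumptions, and the fixed 224x224 resolution is for illustration only.

```python
import torch

def train_uap(surrogate_encoders, dataloader, epsilon=8 / 255, epochs=5, lr=1e-2):
    """Minimal universal adversarial perturbation loop over fixed surrogates.
    Each `enc(images)` is assumed to return an image embedding of shape (B, D)."""
    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(epochs):
        for images, _ in dataloader:
            loss = 0.0
            for enc in surrogate_encoders:
                clean = enc(images).detach()
                adv = enc((images + delta).clamp(0, 1))
                # minimizing cosine similarity pushes adversarial embeddings away
                loss = loss + torch.cosine_similarity(clean, adv, dim=-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-epsilon, epsilon)   # keep the perturbation bounded
    return delta.detach()
```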

Authors:Hongyi Chen, Yunchao Yao, Yufei Ye, Zhixuan Xu, Homanga Bharadhwaj, Jiashun Wang, Shubham Tulsiani, Zackory Erickson, Jeffrey Ichnowski
Title: Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions
Abstract:
Functional grasp is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. However, most prior work either focuses on power grasping, which simply involves holding an object still, or relies on costly teleoperated robot demonstrations to teach robots how to grasp each object functionally. Instead, we propose extracting human grasp information from web images since they depict natural and functional object interactions, thereby bypassing the need for curated demonstrations. We reconstruct human hand-object interaction (HOI) 3D meshes from RGB images, retarget the human hand to multi-finger robot hands, and align the noisy object mesh with its accurate 3D shape. We show that these relatively low-quality HOI data from inexpensive web sources can effectively train a functional grasping model. To further expand the grasp dataset for seen and unseen objects, we use the initially-trained grasping policy with web data in the IsaacGym simulator to generate physically feasible grasps while preserving functionality. We train the grasping model on 10 object categories and evaluate it on 9 unseen objects, including challenging items such as syringes, pens, spray bottles, and tongs, which are underrepresented in existing datasets. The model trained on the web HOI dataset achieves a 75.8% success rate on seen objects and 61.8% across all objects in simulation, a 6.7% improvement in success rate and a 1.8x increase in functionality ratings over baselines. Simulator-augmented data further boosts performance from 61.8% to 83.4%. The sim-to-real transfer to the LEAP Hand achieves an 85% success rate. Project website is at: https://web2grasp.github.io/.

Authors:Thomas Sommariva, Simone Calderara, Angelo Porrello
Title: How to Train Your Metamorphic Deep Neural Network
Abstract:
Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the underlying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: https://github.com/TSommariva/HTTY_NeuMeta.
中文: 本研究提出了一种训练算法,扩展了神经变形能力,实现全网络变形且精度损失最小,从而能够跨多种配置可扩展地生成压缩模型。
English: This work introduces a training algorithm that extends Neural Metamorphosis to enable full-network metamorphosis with minimal accuracy loss, allowing scalable generation of compressed models across various configurations.

Authors:Yiming Qin, Zhu Xu, Yang Liu
Title: Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation
Abstract:
Recent text-to-3D models can render high-quality assets, yet they still stumble on objects with complex attributes. The key obstacles are: (1) existing text-to-3D approaches typically lift text-to-image models to extract semantics via text encoders, while the text encoder exhibits limited comprehension ability for long descriptions, leading to deviated cross-attention focus and, subsequently, wrong attribute binding in generated results. (2) Occluded object parts demand a disciplined generation order and explicit part disentanglement. Though some works introduce manual efforts to alleviate the above issues, their quality is unstable and highly reliant on manual information. To tackle the above problems, we propose an automated method, Hierarchical-Chain-of-Generation (HCoG). It leverages a large language model to decompose the long description into blocks representing different object parts, and orders them from inside out according to occlusions, forming a hierarchical chain. Within each block we first coarsely create components, then precisely bind attributes via target-region localization and corresponding 3D Gaussian kernel optimization. Between blocks, we introduce Gaussian Extension and Label Elimination to seamlessly generate new parts by extending new Gaussian kernels, re-assigning semantic labels, and eliminating unnecessary kernels, ensuring that only relevant parts are added without disrupting previously optimized parts. Experiments confirm that HCoG yields structurally coherent, attribute-faithful 3D objects with complex attributes. The code is available at https://github.com/Wakals/GASCOL.
中文: 现有文本到3D模型因文本编码器对长描述理解不足及遮挡部分处理困难,在复杂属性对象生成上存在局限;为此提出的HCoG方法通过大语言模型分解描述并优化3D高斯核,实现了结构连贯且属性准确的自动生成。
English: Current text-to-3D models struggle with complex attributes due to text encoder limitations in understanding long descriptions and handling occluded parts, prompting the development of HCoG, an automated method that uses a large language model to decompose descriptions and optimize 3D Gaussian kernels for coherent and faithful object generation.

Authors:Xingyu Jiang, Ning Gao, Xiuhui Zhang, Hongkun Dou, Shaowen Fu, Xiaoqing Zhong, Hongjue Li, Yue Deng
Title: Image Restoration via Multi-domain Learning
Abstract:
Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both training and real-time deployment. Furthermore, instead of investigating the commonalities among different degradations, most existing restoration methods focus on modifying the Transformer under limited restoration priors. In this work, we first review various degradation phenomena from a multi-domain perspective, identifying common priors. Then, we introduce a novel restoration framework, which integrates multi-domain learning into the Transformer. Specifically, in the Token Mixer, we propose a Spatial-Wavelet-Fourier multi-domain structure that facilitates local-region-global multi-receptive field modeling to replace vanilla self-attention. Additionally, in the Feed-Forward Network, we incorporate multi-scale learning to fuse multi-domain features at different resolutions. Comprehensive experimental results across ten restoration tasks, such as dehazing, desnowing, motion deblurring, defocus deblurring, rain streak/raindrop removal, cloud removal, shadow removal, underwater enhancement and low-light enhancement, demonstrate that our proposed model outperforms state-of-the-art methods and achieves a favorable trade-off among restoration performance, parameter size, computational cost and inference latency. The code is available at: https://github.com/deng-ai-lab/SWFormer.
中文: 本文提出了一种新颖的图像恢复框架,将多领域学习融入Transformer中,通过多感受野结构替代自注意力机制并结合多尺度学习,在十项恢复任务中实现卓越性能与高效平衡。
English: This paper introduces a novel image restoration framework that integrates multi-domain learning into Transformer, replacing self-attention with a multi-receptive field structure and incorporating multi-scale learning to achieve superior performance across ten restoration tasks with balanced efficiency.
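The multi-domain token mixer can be pictured with two of its three branches: a depthwise convolution for the local spatial domain and a learnable filter applied in the Fourier domain for a global receptive field. The wavelet branch and the paper's exact fusion scheme are omitted; shapes and names below are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class SpatialFourierMixer(nn.Module):
    """Sketch of a multi-domain token mixer: a depthwise convolution covers the
    local spatial domain, a learnable filter in the Fourier domain covers the
    global receptive field. The spatial size is fixed at construction here."""
    def __init__(self, dim: int, h: int, w: int):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        # complex-valued global filter over rfft2 frequencies (real/imag pair)
        self.freq_filter = nn.Parameter(torch.randn(dim, h, w // 2 + 1, 2) * 0.02)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        local = self.spatial(x)
        freq = torch.fft.rfft2(x, norm="ortho")
        freq = freq * torch.view_as_complex(self.freq_filter)
        global_branch = torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
        return self.proj(local + global_branch)
```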

Authors:Yunfan Lu, Xiaogang Xu, Pengteng Li, Yusheng Wang, Yi Cui, Huizai Yao, Hui Xiong
Title: From Events to Enhancement: A Survey on Event-Based Imaging Technologies
Abstract:
Event cameras offering high dynamic range and low latency have emerged as disruptive technologies in imaging. Despite growing research on leveraging these benefits for different imaging tasks, a comprehensive study of recent advances and challenges is still lacking. This limits the broader understanding of how to utilize events in universal imaging applications. In this survey, we first introduce a physical model and the characteristics of different event sensors as the foundation. Following this, we highlight the advancement and interaction of image/video enhancement tasks with events. Additionally, we explore advanced tasks, which capture richer light information with events, e.g., light field estimation, multi-view generation, and photometric tasks. Finally, we discuss new challenges and open questions offering a perspective for this rapidly evolving field. Continuously updated resources are available at this link: https://github.com/yunfanLu/Awesome-Event-Imaging
Chinese: 本综述全面探讨了事件相机的物理模型、成像任务的进展及新挑战,强调了其在通用成像应用中的潜力。
English: This survey provides a comprehensive overview of event cameras' physical models, advancements in imaging tasks, and emerging challenges, highlighting their potential in universal imaging applications.

Authors:Yonwoo Choi
Title: SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation
Abstract:
Creating high-quality animatable 3D human avatars from a single image remains a significant challenge in computer vision due to the inherent difficulty of reconstructing complete 3D information from a single viewpoint. Current approaches face a clear limitation: 3D Gaussian Splatting (3DGS) methods produce high-quality results but require multiple views or video sequences, while video diffusion models can generate animations from single images but struggle with consistency and identity preservation. We present SVAD, a novel approach that addresses these limitations by leveraging complementary strengths of existing techniques. Our method generates synthetic training data through video diffusion, enhances it with identity preservation and image restoration modules, and utilizes this refined data to train 3DGS avatars. Comprehensive evaluations demonstrate that SVAD outperforms state-of-the-art (SOTA) single-image methods in maintaining identity consistency and fine details across novel poses and viewpoints, while enabling real-time rendering capabilities. Through our data augmentation pipeline, we overcome the dependency on dense monocular or multi-view training data typically required by traditional 3DGS approaches. Extensive quantitative and qualitative comparisons show that our method achieves superior performance across multiple metrics against baseline models. By effectively combining the generative power of diffusion models with both the high-quality results and rendering efficiency of 3DGS, our work establishes a new approach for high-fidelity avatar generation from a single image input.

Authors:Beichen Wen, Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Title: 3D Scene Generation: A Survey
Abstract:
3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/hzxie/Awesome-3D-Scene-Generation.
中文: 本综述系统梳理了三维场景生成的前沿方法,将其归纳为四种技术范式并分析其理论基础、性能权衡及应用场景,同时展望了高保真度、物理感知生成等未来方向。
English: This survey systematically reviews state-of-the-art 3D scene generation methods, categorizing them into four paradigms and analyzing their technical foundations, trade-offs, and applications, while highlighting future directions like higher fidelity and physics-aware generation.

Authors:Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani
Title: DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
Abstract:
Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.

Authors:Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
Title: Flow-GRPO: Training Flow Matching Models via Online RL
Abstract:
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
Chinese: Flow-GRPO首次将在线强化学习融入流匹配模型,通过ODE到SDE转换和降噪减少策略提高采样效率,在文本到图像任务中显著提升性能,同时未牺牲图像质量或多样性。
English: Flow-GRPO introduces the first online reinforcement learning integration into flow matching models, employing ODE-to-SDE conversion and Denoising Reduction strategies to enhance sampling efficiency and achieve significant performance improvements in text-to-image tasks without compromising image quality or diversity.
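The GRPO part of the recipe is easy to illustrate: rewards are standardized within the group of samples generated for the same prompt, and the resulting advantages weight the policy update. The snippet below shows only that advantage computation; the ODE-to-SDE sampling and reward models are not reproduced here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: rewards of shape (num_prompts, group_size) are
    standardized within each group of samples drawn for the same prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# usage: 4 prompts, 8 image samples per prompt, scored by some reward model
advantages = group_relative_advantages(torch.rand(4, 8))
```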

Authors:Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, Jun-Yan Zhu
Title: Generating Physically Stable and Buildable Brick Structures from Text
Abstract:
We introduce BrickGPT, the first approach for generating physically stable interconnecting brick assembly models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of brick structures, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that BrickGPT produces stable, diverse, and aesthetically pleasing brick structures that align closely with the input text prompts. We also develop a text-based brick texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We release our new dataset, StableText2Brick, containing over 47,000 brick structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/BrickGPT/.
中文: BrickGPT是首个通过文本提示生成物理稳定互锁砖块模型的方法,它基于大规模数据集训练,并在推理中结合物理验证,确保设计稳定、多样且符合文本描述。
English: BrickGPT is the first method that generates physically stable interlocking brick models from text prompts by training on a large dataset and incorporating physics-based validation during inference, producing designs that are stable, diverse, and align well with input descriptions.
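The inference-time behavior described above can be sketched as rejection sampling with rollback: sample the next brick, keep it only if a feasibility check passes, and undo the previous brick when no valid continuation is found. `model.next_brick` and `is_feasible` are hypothetical stand-ins for next-token prediction and the stability/assembly check.

```python
def generate_structure(model, prompt_tokens, is_feasible, max_bricks=200, max_retries=8):
    """Sketch of physics-aware autoregressive decoding with rollback."""
    sequence, bricks = list(prompt_tokens), []
    while len(bricks) < max_bricks:
        placed = False
        for _ in range(max_retries):
            candidate = model.next_brick(sequence)        # sample one brick token
            if candidate is None:                         # end-of-structure token
                return bricks
            if is_feasible(bricks, candidate):            # physics + assembly constraints
                sequence.append(candidate)
                bricks.append(candidate)
                placed = True
                break
        if not placed and bricks:                         # roll back the previous brick
            bricks.pop()
            sequence.pop()
        elif not placed:
            break                                         # nothing feasible from the start
    return bricks
```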

Authors:Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Title: TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Abstract:
Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
中文:TokLIP是一种创新的视觉分词器,通过将高层语义融入矢量量化标记并支持高效的端到端训练,显著提升了多模态理解能力,在理解和生成任务中均实现了卓越的数据效率和性能表现。
English: TokLIP is a novel visual tokenizer that enhances multimodal comprehension by integrating high-level semantics into vector-quantized tokens while enabling efficient end-to-end training, achieving superior data efficiency and performance in both understanding and generation tasks.

Authors:Sooyoung Park, Arda Senocak, Joon Son Chung
Title: Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
Abstract:
Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to sound source localization, proposing a self-supervised method that operates without explicit text input. We introduce a framework that maps audio into tokens compatible with CLIP's text encoder, producing audio-driven embeddings. These embeddings are used to generate sounding region masks, from which visual features are extracted and aligned with the audio embeddings through a contrastive audio-visual correspondence objective. Our findings show that the alignment knowledge of a pre-trained multimodal foundation model enables our method to generate more complete and compact localization for sounding objects. We further propose an LLM-guided extension that distills object-aware audio-visual scene understanding into the model during training to enhance alignment. Extensive experiments across five diverse tasks demonstrate that our method, in all variants, outperforms state-of-the-art approaches and achieves strong generalization in zero-shot settings.
中文: 本研究将CLIP扩展至声源定位,通过自监督框架将音频转换为CLIP兼容的嵌入表示,生成发声区域掩码并通过对比学习对齐视听特征,结合LLM引导的扩展增强对象感知的场景理解能力,在多项任务中超越现有最优方法并展现零样本泛化优势。
English: This study extends CLIP to sound source localization through a self-supervised framework that converts audio into CLIP-compatible embeddings, generating sounding object masks and aligning visual features with audio via contrastive learning, enhanced by an LLM-guided extension for improved object-aware scene understanding, outperforming state-of-the-art methods across multiple tasks.

Authors:Yifan Bian, Chuanbo Tang, Li Li, Dong Liu
Title: Augmented Deep Contexts for Spatially Embedded Video Coding
Abstract:
Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve the limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firstly, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Secondly, to address the misalignment issue in latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. At last, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces 11.9% more bitrate than the previous state-of-the-art NVC while providing an additional low-resolution bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.
中文摘要:本研究提出的空间嵌入视频编解码器(SEVC)通过整合空间参考和混合上下文,克服了纯时序神经视频编解码器的局限性,在提供额外低码率码流的同时实现了11.9%的码率降低和更优的率失真性能。
English Summary: The proposed Spatially Embedded Video Codec (SEVC) overcomes limitations of temporal-only neural video codecs by integrating spatial references and hybrid contexts, achieving superior rate-distortion performance with 11.9% bitrate reduction while providing an additional low-resolution stream.

Authors:Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Abdelrahman Eldesokey, Peter Wonka, Gabriel Brostow, Sara Vicente, Guillermo Garcia-Hernando
Title: PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Abstract:
We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLM models.

Authors:Qian Zeng, Chenggong Hu, Mingli Song, Jie Song
Title: Diffusion Model Quantization: A Review
Abstract:
Recent success of large text-to-image models has empirically underscored the exceptional performance of diffusion models in generative tasks. To facilitate their efficient deployment on resource-constrained edge devices, model quantization has emerged as a pivotal technique for both compression and acceleration. This survey offers a thorough review of the latest advancements in diffusion model quantization, encapsulating and analyzing the current state of the art in this rapidly advancing domain. First, we provide an overview of the key challenges encountered in the quantization of diffusion models, including those based on U-Net architectures and Diffusion Transformers (DiT). We then present a comprehensive taxonomy of prevalent quantization techniques, engaging in an in-depth discussion of their underlying principles. Subsequently, we perform a meticulous analysis of representative diffusion model quantization schemes from both qualitative and quantitative perspectives. From a quantitative standpoint, we rigorously benchmark a variety of methods using widely recognized datasets, delivering an extensive evaluation of the most recent and impactful research in the field. From a qualitative standpoint, we categorize and synthesize the effects of quantization errors, elucidating these impacts through both visual analysis and trajectory examination. In conclusion, we outline prospective avenues for future research, proposing novel directions for the quantization of generative models in practical applications. The list of related papers, corresponding codes, pre-trained models and comparison results are publicly available at the survey project homepage https://github.com/TaylorJocelyn/Diffusion-Model-Quantization.
中文摘要:本综述全面探讨了扩散模型量化技术的最新进展,系统分析了量化方法的性能表现与视觉影响,并指出了该领域未来的研究方向。
English Summary: This survey comprehensively reviews recent advances in diffusion model quantization for efficient deployment on edge devices, analyzing both quantitative performance and qualitative effects while outlining future research directions.
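For readers new to the topic, the baseline operation that the surveyed methods build on is plain post-training quantization; a per-tensor symmetric variant is sketched below. The surveyed schemes layer calibration, mixed precision, and timestep-aware corrections on top of this.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    """Per-tensor symmetric post-training quantization: returns the integer
    tensor and the scale needed to dequantize (w is approximately q * scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8 if bits == 8 else torch.int32), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```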

Authors:Wei Peng, Kang Liu, Jianchen Hu, Meng Zhang
Title: Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models
Abstract:
Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to biomedical image classification tasks in few-shot scenarios. However, most current prompt learning methods use only text prompts and ignore the particular structures (such as the complex anatomical structures and subtle pathological features) in biomedical images. In this work, we propose Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. In designing the text prompt, Biomed-DPT constructs a dual prompt including the template-driven clinical prompts and the large language model (LLM)-driven domain-adapted prompts, then extracts the clinical knowledge from the domain-adapted prompts through knowledge distillation. In designing the vision prompt, Biomed-DPT introduces the zero vector as a soft prompt to leverage attention re-weighting, avoiding focus on non-diagnostic regions and the recognition of non-critical pathological features. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06% in base classes and 75.97% in novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at https://github.com/Kanyooo/Biomed-DPT.
Chinese: Biomed-DPT提出了一种知识增强的双模态提示调优技术,通过结合临床提示和领域适配的文本提示以及视觉软提示,在多个生物医学图像数据集和模态上显著提升了分类准确率。
English: Biomed-DPT introduces a knowledge-enhanced dual modality prompt tuning technique that combines clinical and domain-adapted text prompts with vision soft prompts to significantly improve biomedical image classification accuracy across multiple datasets and modalities.

Authors:Shashank Agnihotri, Amaan Ansari, Annika Dackermann, Fabian Rösch, Margret Keuper
Title: DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions
Abstract:
Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field. To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation
中文: 深度学习在立体视觉的视差估计中表现出色,但易受分布偏移和对抗攻击影响,为此推出了DispBench这一综合基准工具,用于评估多种数据集和干扰场景下的方法鲁棒性。
English: Deep learning excels in disparity estimation for stereo vision but faces reliability issues from distribution shifts and adversarial attacks, prompting the creation of DispBench, a comprehensive tool to benchmark robustness across various corruptions and datasets.
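The benchmarking pattern such a tool follows can be summarized in a short sweep: evaluate the disparity model on every corruption type and severity and record the end-point error. The corruption callable and model signature below are assumptions for illustration, not the DispBench API.

```python
import torch

def epe(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """End-point error: mean absolute disparity difference over valid pixels."""
    valid = gt > 0
    return (pred - gt).abs()[valid].mean().item()

def robustness_report(model, dataset, corruptions, severities=(1, 3, 5)):
    """Sketch of a corruption-robustness sweep. `corruptions` maps a name to a
    callable (left, right, severity) -> (left', right'); assumed interface."""
    report = {}
    for name, corrupt in corruptions.items():
        for sev in severities:
            errors = []
            for left, right, gt in dataset:
                left_c, right_c = corrupt(left, right, sev)
                with torch.no_grad():
                    pred = model(left_c, right_c)
                errors.append(epe(pred, gt))
            report[(name, sev)] = sum(errors) / len(errors)
    return report
```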

Authors:Zuntao Liu, Hao Zhuang, Junjie Jiang, Yuhang Song, Zheng Fang
Title: Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow
Abstract:
Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets. Our project page is available at https://wynelio.github.io/E-NMSTFlow.

Authors:Wenyang Liu, Jianjun Gao, Kim-Hui Yap
Title: SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal
Abstract:
Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model's effectiveness, a shared CNN-based feature encoder is introduced before dual networks to extract common features that both networks can leverage. Our code will be available at https://github.com/wenyang001/SSH-Net.
中文: SSH-Net是一种自监督混合网络,通过合成参考图像并采用具有共享特征编码的双网络架构来解决带噪图像水印去除问题,无需配对数据集即可实现高效处理。
English: SSH-Net is a self-supervised hybrid network that addresses noisy image watermark removal by synthesizing reference images and employing a dual-network architecture with shared feature encoding, eliminating the need for paired datasets.
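A structural sketch of the dual-network layout is given below: a shared CNN encoder feeds a lightweight denoising branch and a deeper branch for joint watermark-and-noise removal. Plain convolutions stand in for the paper's Transformer blocks, and all widths are assumptions.

```python
import torch
import torch.nn as nn

class DualBranchRestorer(nn.Module):
    """Structural sketch of an SSH-Net-style design: one shared feature encoder,
    a light branch for denoising only, and a deeper branch (Transformer blocks
    in the paper, plain convs here) for joint watermark + noise removal."""
    def __init__(self, channels: int = 48):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.denoise_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )
        self.watermark_head = nn.Sequential(   # stand-in for the Transformer branch
            *[nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                            nn.ReLU(inplace=True)) for _ in range(4)],
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, x: torch.Tensor):
        feats = self.shared(x)
        # returns (denoised image, watermark-and-noise-free image)
        return self.denoise_head(feats), self.watermark_head(feats)
```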

Authors:Hyunho Song, Dongjae Lee, Seunghun Oh, Minwoo Jung, Ayoung Kim
Title: The City that Never Settles: Simulation-based LiDAR Dataset for Long-Term Place Recognition Under Extreme Structural Changes
Abstract:
Large-scale construction and demolition significantly challenge long-term place recognition (PR) by drastically reshaping urban and suburban environments. Existing datasets predominantly reflect limited or indoor-focused changes, failing to adequately represent extensive outdoor transformations. To bridge this gap, we introduce the City that Never Settles (CNS) dataset, a simulation-based dataset created using the CARLA simulator, capturing major structural changes-such as building construction and demolition-across diverse maps and sequences. Additionally, we propose TCR_sym, a symmetric version of the original TCR metric, enabling consistent measurement of structural changes irrespective of source-target ordering. Quantitative comparisons demonstrate that CNS encompasses more extensive transformations than current real-world benchmarks. Evaluations of state-of-the-art LiDAR-based PR methods on CNS reveal substantial performance degradation, underscoring the need for robust algorithms capable of handling significant environmental changes. Our dataset is available at https://github.com/Hyunho111/CNS_dataset.
中文摘要:本研究提出的“永动之城”(CNS)数据集通过模拟建筑建造与拆除等大规模户外结构变化,弥补了现有数据集的不足,并证明当前先进的地点识别方法在应对剧烈环境变化时存在显著性能缺陷。
English Summary: The City that Never Settles (CNS) dataset addresses the limitations of existing datasets by simulating large-scale outdoor structural changes through building construction and demolition, revealing significant performance drops in current place recognition methods when faced with such transformations.

Authors:Tommaso Apicella, Alessio Xompero, Andrea Cavallaro
Title: Visual Affordances: Enabling Robots to Understand Object Functionality
Abstract:
Human-robot interaction for assistive technologies relies on the prediction of affordances, which are the potential actions a robot can perform on objects. Predicting object affordances from visual perception is formulated differently for tasks such as grasping detection, affordance classification, affordance segmentation, and hand-object interaction synthesis. In this work, we highlight the reproducibility issue in these redefinitions, making comparative benchmarks unfair and unreliable. To address this problem, we propose a unified formulation for visual affordance prediction, provide a comprehensive and systematic review of previous works highlighting strengths and limitations of methods and datasets, and analyse what challenges reproducibility. To favour transparency, we introduce the Affordance Sheet, a document to detail the proposed solution, the datasets, and the validation. As the physical properties of an object influence the interaction with the robot, we present a generic framework that links visual affordance prediction to the physical world. Using the weight of an object as an example for this framework, we discuss how estimating object mass can affect the affordance prediction. Our approach bridges the gap between affordance perception and robot actuation, and accounts for the complete information about objects of interest and how the robot interacts with them to accomplish its task.

Authors:Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin
Title: FG-CLIP: Fine-Grained Visual and Textual Alignment
Abstract:
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
中文:FG-CLIP通过大规模长文本描述、区域级标注和困难负样本增强CLIP的细粒度理解能力,在多项多模态任务中实现卓越性能。
English: FG-CLIP enhances CLIP's fine-grained understanding through large-scale long captions, region-specific annotations, and hard negative samples, achieving superior performance across multiple multimodal tasks.
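The third ingredient, hard fine-grained negatives, amounts to letting each image compete against extra near-miss captions inside the contrastive loss. A hedged sketch of such an image-to-text objective is below; FG-CLIP's actual loss and weighting may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img_emb, txt_emb, hard_txt_emb, temperature=0.07):
    """Image-to-text InfoNCE where each image additionally competes against
    K hard negative captions. Shapes: img_emb (B, D), txt_emb (B, D),
    hard_txt_emb (B, K, D); all embeddings are assumed L2-normalized."""
    batch_sims = img_emb @ txt_emb.T                          # (B, B), diagonal = positives
    hard_sims = torch.einsum("bd,bkd->bk", img_emb, hard_txt_emb)  # (B, K) hard negatives
    logits = torch.cat([batch_sims, hard_sims], dim=1) / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)
```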

Authors:Tingting Liao, Yujian Zheng, Adilbek Karmanov, Liwen Hu, Leyang Jin, Yuliang Xiu, Hao Li
Title: SOAP: Style-Omniscient Animatable Portraits
Abstract:
Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at https://github.com/TingtingLiao/soap.
中文: SOAP是一种风格普适的框架,能够从任意肖像生成具有完整骨骼绑定和一致拓扑的3D虚拟形象,通过自适应优化和可微分渲染技术,不仅支持基于FACS的面部动画,还能精细保留发饰与编发等复杂细节。
English: SOAP is a style-omniscient framework that generates fully rigged, topology-consistent 3D avatars from any portrait, supporting FACS-based animation and preserving intricate details like accessories and hairstyles through adaptive optimization and differentiable rendering.

Authors:Xin Bi, Zhichao Li, Yuxuan Xia, Panpan Tong, Lijuan Zhang, Yang Chen, Junsheng Fu
Title: Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition
Abstract:
Accurate online map matching is fundamental to vehicle navigation and the activation of intelligent driving functions. Current online map matching methods are prone to errors in complex road networks, especially in multilevel road areas. To address this challenge, we propose an online Standard Definition (SD) map matching method by constructing a Hidden Markov Model (HMM) with multiple probability factors. Our proposed method can achieve accurate map matching even in complex road networks by carefully leveraging lane markings and scenario recognition in the design of the probability factors. First, the lane markings are generated by a multi-lane tracking method and associated with the SD map using HMM to build an enriched SD map. In areas covered by the enriched SD map, the vehicle can re-localize itself by performing Iterative Closest Point (ICP) registration on the lane markings. Then, the probability factor accounting for the lane marking detection can be obtained using the association probability between adjacent lanes and roads. Second, the driving scenario recognition model is applied to generate the emission probability factor of scenario recognition, which improves the performance of map matching on elevated roads and the ordinary urban roads underneath them. We validate our method through extensive road tests in Europe and China, and the experimental results show that our proposed method effectively improves online map matching accuracy compared to other existing methods, especially in multilevel road areas. Specifically, the experiments show that our proposed method achieves $F_1$ scores of 98.04% and 94.60% on the Zenseact Open Dataset and test data of multilevel road areas in Shanghai, respectively, significantly outperforming benchmark methods. The implementation is available at https://github.com/TRV-Lab/LMSR-OMM.
Chinese: 本研究提出了一种基于隐马尔可夫模型和多概率因子的在线SD地图匹配方法,通过整合车道线关联和场景识别,在复杂多层道路区域显著提高了匹配精度,欧洲和中国的道路测试已验证其有效性。
English: This study introduces an online SD map matching method using a Hidden Markov Model with multiple probability factors, including lane marking association and scenario recognition, which significantly enhances accuracy in complex multilevel road networks as validated by tests in Europe and China.
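
Illustrative sketch (not from the paper): the multi-factor HMM formulation can be made concrete by combining several emission probability factors (e.g. GNSS distance, lane-marking association, scenario recognition) as a product and running standard Viterbi decoding over candidate roads. The factorization, variable names, and smoothing constant are assumptions; the paper's actual factors and solver may differ.

```python
import numpy as np

def viterbi_map_match(emission_factors, transition, init):
    """Most-likely road sequence under an HMM whose emission probability is a
    product of independent factors.

    emission_factors: list of (T, N) arrays, one per factor
    transition:       (N, N) road-to-road transition probabilities
    init:             (N,) initial road probabilities
    """
    emission = np.prod(np.stack(emission_factors), axis=0)     # (T, N) combined factor
    T, N = emission.shape
    log_e, log_t = np.log(emission + 1e-12), np.log(transition + 1e-12)

    score = np.log(init + 1e-12) + log_e[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_t                           # cand[i, j]: from road i to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_e[t]

    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                               # backtrack the best sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```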

Authors:Lang Nie, Chunyu Lin, Kang Liao, Yun Zhang, Shuaicheng Liu, Yao Zhao
Title: StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps
Abstract:
We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can still exhibit undesired warping shakes that degrade the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we assume a virtual midplane between the original image planes and project both onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss that simultaneously encourages content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch, which sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments show that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.
Chinese: 本研究提出了StabStitch++视频拼接框架,通过无监督学习同时实现空间对齐和时间稳定,有效消除拼接视频中的扭曲抖动,并在保持实时性的同时显著提升视觉体验。
English: This research introduces StabStitch++, a novel video stitching framework that simultaneously achieves spatial alignment and temporal stabilization through unsupervised learning, effectively eliminating warping shakes in stitched videos while maintaining real-time performance.

Authors:Wanjiang Weng, Xiaofeng Tan, Hongsong Wang, Pan Zhou
Title: ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
Abstract:
Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: https://wengwanjiang.github.io/ReAlign-page/.

Authors:Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Abstract:
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
中文: 推理作为智能的核心,在开放多模态环境中对人工智能至关重要;大型多模态推理模型从模块化流程发展为统一框架,以应对泛化与自主行为等挑战,本文通过发展路线图对此进行了系统综述。
English: Reasoning is central to intelligence and crucial for robust AI in open, multimodal environments, with Large Multimodal Reasoning Models evolving from modular pipelines to unified frameworks to address challenges like generalization and agentic behavior, as surveyed through a developmental roadmap.

Authors:I-Chun Arthur Liu, Jason Chen, Gaurav Sukhatme, Daniel Seita
Title: D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation
Abstract:
Learning bimanual manipulation is challenging due to its high dimensionality and tight coordination required between two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 300 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our project website is at: https://dcodaaug.github.io/D-CODA/.

Authors:Shashank Agnihotri, David Schader, Nico Sharei, Mehmet Ege Kaçar, Margret Keuper
Title: Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?
Abstract:
Deep learning (DL) models are widely used in real-world applications but remain vulnerable to distribution shifts, especially those caused by weather and lighting changes. Collecting diverse real-world data for testing the robustness of DL models is resource-intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real-world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real-world and synthetic corruption datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption-specific correlations, providing key insights into when synthetic corruptions succeed in representing real-world corruptions. Open-source Code: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/segmentation_david/semantic_segmentation
中文: 深度学习模型易受现实世界分布变化影响,本研究证实合成扰动可作为语义分割稳健性测试的有效替代,与真实数据表现出强性能相关性。
English: Deep learning models are vulnerable to real-world distribution shifts, and this study confirms that synthetic corruptions serve as a reliable proxy for robustness testing in semantic segmentation, showing strong performance correlation with real-world data.
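
Illustrative sketch (not from the paper): the kind of correlation analysis the study describes, where per-model scores under real and synthetic corruptions are compared via Pearson correlation. The numbers below are made-up placeholders, not results from the paper.

```python
import numpy as np

# Hypothetical per-model mIoU scores under real-world and synthetic corruptions.
real_miou = np.array([52.1, 47.3, 61.0, 58.4, 44.9])
synthetic_miou = np.array([50.7, 45.2, 62.3, 57.1, 43.0])

# A Pearson correlation close to 1 would support using synthetic corruptions
# as a proxy for real-world robustness ranking.
r = np.corrcoef(real_miou, synthetic_miou)[0, 1]
print(f"Pearson correlation across models: {r:.3f}")
```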

Authors:Richard Liu, Daniel Fu, Noah Tan, Itai Lang, Rana Hanocka
Title: WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction
Abstract:
In this work we present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.

Authors:Bangyan Liao, Zhenjun Zhao, Haoang Li, Yi Zhou, Yingping Zeng, Hao Li, Peidong Liu
Title: Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
Abstract:
Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a "soft" association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs' locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous works. The code is publicly available at https://github.com/WU-CVGL/GlobustVP.
中文: 本文提出GlobustVP方法,通过凸松弛技术和软关联方案,在曼哈顿世界中高效估计消失点,实现了计算效率、鲁棒性和全局最优性之间的良好平衡。
English: This paper introduces GlobustVP, a novel method that uses convex relaxation and a soft association scheme to efficiently estimate vanishing points in Manhattan worlds, achieving a balance between computational efficiency, robustness, and global optimality.
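
Illustrative sketch (not from the paper): the truncated multi-selection error behind the "soft" association can be written as charging each line the smallest squared consistency residual over the three VPs, capped at a truncation threshold so outlier lines have bounded influence. Variable names, normalization, and the threshold are assumptions; the SDP relaxation and the GlobustVP solver are not reproduced here.

```python
import numpy as np

def truncated_multiselection_error(lines, vps, trunc=0.01):
    """Soft line-to-VP association cost.

    lines: (M, 3) homogeneous line coefficients, unit-normalised
    vps:   (3, 3) candidate vanishing points (homogeneous, unit-normalised)

    A line consistent with a VP satisfies |l^T v| ~ 0; each line contributes the
    smallest squared residual over the three VPs, truncated at `trunc`.
    """
    residuals = (lines @ vps.T) ** 2      # (M, 3) squared algebraic residuals
    per_line = residuals.min(axis=1)      # soft selection of the best-fitting VP
    return np.minimum(per_line, trunc).sum()
```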

Authors:Yi Lin, Dong Zhang, Xiao Fang, Yufan Chen, Kwang-Ting Cheng, Hao Chen
Title: Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation
Abstract:
Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO.
中文:CTO网络架构融合了CNN、视觉变换器和边缘检测技术,显著提升了医学图像分割的精度,尤其在边界区域表现出色,无需额外数据即在多个数据集上达到领先水平。
English: The CTO network architecture integrates CNNs, Vision Transformers, and edge detection to enhance medical image segmentation accuracy, particularly in boundary areas, achieving state-of-the-art results across multiple datasets without extra data requirements.
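
Illustrative sketch (not from the paper): how an explicit edge operator can supply boundary guidance, here deriving a binary boundary mask from a segmentation label map with a Sobel filter. The operator choice, threshold, and use of scipy are assumptions for this sketch, not CTO's exact pipeline.

```python
import numpy as np
from scipy.ndimage import convolve

def boundary_mask(label_map, threshold=0.1):
    """Binary boundary mask from a segmentation label map via Sobel gradients."""
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    sobel_y = sobel_x.T
    lbl = label_map.astype(float)
    gx = convolve(lbl, sobel_x, mode='nearest')
    gy = convolve(lbl, sobel_y, mode='nearest')
    grad = np.hypot(gx, gy)
    # Pixels with large label gradients are boundary pixels.
    return (grad > threshold * (grad.max() + 1e-12)).astype(np.uint8)
```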

Authors:Abdulaziz Almuzairee, Rohan Patil, Dwait Bhatt, Henrik I. Christensen
Title: Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation
Abstract:
Vision is well-known for its use in manipulation, especially using visual servoing. Due to the 3D nature of the world, using multiple camera views and merging them creates better representations for Q-learning and in turn, trains more sample efficient policies. Nevertheless, these multi-view policies are sensitive to failing cameras and can be burdensome to deploy. To mitigate these issues, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while simultaneously disentangling views by augmenting multi-view feature inputs with single-view features. This produces robust policies and allows lightweight deployment. We demonstrate the efficiency and robustness of our approach using Meta-World and ManiSkill3. For project website and code, see https://aalmuzairee.github.io/mad
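
Illustrative sketch (not from the paper): one reading of the merge-and-disentangle idea, where per-view features are fused (here by a simple mean) for a multi-view Q-estimate while the same Q-head is also applied to each single-view feature, so the policy stays usable if a camera drops. Layer sizes, the flattened-image input, and the mean fusion are assumptions, not the MAD implementation.

```python
import torch
import torch.nn as nn

class MergeAndDisentangle(nn.Module):
    """Shared Q-head over merged multi-view features and individual view features."""
    def __init__(self, feat_dim=256, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3 * 84 * 84, feat_dim), nn.ReLU())
        self.q_head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                    nn.Linear(256, action_dim))

    def forward(self, views):                 # views: (B, V, 3*84*84) flattened images
        feats = self.encoder(views)           # (B, V, feat_dim) per-view features
        merged = feats.mean(dim=1)            # merged multi-view feature
        q_merged = self.q_head(merged)        # (B, action_dim)
        q_single = self.q_head(feats)         # (B, V, action_dim), per-view estimates
        return q_merged, q_single             # both can be supervised during training
```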

Authors:Yuning Du, Jingshuai Liu, Rohan Dharmakumar, Sotirios A. Tsaftaris
Title: Active Sampling for MRI-based Sequential Decision Making
Abstract:
Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach will be to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples. Such work shows that single diagnostic decisions can be made, but if we aspire to see MRI as a true PoC, multiple and sequential decisions are necessary while minimizing the number of samples acquired. We present a novel multi-objective reinforcement learning framework enabling comprehensive, sequential, diagnostic evaluation from undersampled k-space data. Our approach during inference actively adapts to sequential decisions to optimally sample. To achieve this, we introduce a training methodology that identifies the samples that contribute the best to each diagnostic objective using a step-wise weighting reward function. We evaluate our approach in two sequential knee pathology assessment tasks: ACL sprain detection and cartilage thickness loss assessment. Our framework achieves diagnostic performance competitive with various policy-based benchmarks on disease detection, severity quantification, and overall sequential diagnosis, while substantially saving k-space samples. Our approach paves the way for the future of MRI as a comprehensive and affordable PoC device. Our code is publicly available at https://github.com/vios-s/MRI_Sequential_Active_Sampling
中文摘要:本研究提出了一种多目标强化学习框架,通过自适应优化采样策略,能够从欠采样的MRI k空间数据中实现连续诊断评估,在保持膝关节病变诊断准确性的同时大幅减少采样需求。
English Summary: This study introduces a multi-objective reinforcement learning framework that enables sequential diagnostic evaluations from undersampled MRI k-space data, significantly reducing sampling requirements while maintaining competitive diagnostic accuracy for knee pathology assessments.

Authors:Kunlun Xu, Xu Zou, Gang Hua, Jiahuan Zhou
Title: Componential Prompt-Knowledge Alignment for Domain Incremental Learning
Abstract:
Domain Incremental Learning (DIL) aims to learn from non-stationary data streams across domains while retaining and utilizing past knowledge. Although prompt-based methods effectively store multi-domain knowledge in prompt parameters and achieve strong performance through cross-domain prompt fusion, we reveal an intrinsic limitation: component-wise misalignment between domain-specific prompts leads to conflicting knowledge integration and degraded predictions. This arises from the random positioning of knowledge components within prompts, where irrelevant component fusion introduces interference. To address this, we propose Componential Prompt-Knowledge Alignment (KA-Prompt), a novel prompt-based DIL method that introduces component-aware prompt-knowledge alignment during training, significantly improving both the learning and inference capacity of the model. KA-Prompt operates in two phases: (1) Initial Componential Structure Configuring, where a set of old prompts containing knowledge relevant to the new domain is mined via greedy search and then exploited to initialize new prompts, achieving reusable knowledge transfer and establishing intrinsic alignment between new and old prompts. (2) Online Alignment Preservation, which dynamically identifies the target old prompts and applies adaptive componential consistency constraints as new prompts evolve. Extensive experiments on DIL benchmarks demonstrate the effectiveness of our KA-Prompt. Our source code is available at https://github.com/zhoujiahuan1991/ICML2025-KA-Prompt
中文摘要:本文针对基于提示的领域增量学习中领域特定提示组件错位导致知识冲突的问题,提出KA-Prompt方法,通过组件感知的提示-知识对齐机制显著提升模型的学习和推理能力。
English Summary: This paper identifies a limitation in prompt-based Domain Incremental Learning where misaligned domain-specific prompts cause conflicting knowledge integration, and proposes KA-Prompt with component-aware alignment to enhance model learning and inference capabilities.

Authors:Ashutosh Singandhupe, Sanket Lokhande, Hung Manh La
Title: Registration of 3D Point Sets Using Exponential-based Similarity Matrix
Abstract:
Point cloud registration is a fundamental problem in computer vision and robotics, involving the alignment of 3D point sets captured from varying viewpoints using depth sensors such as LiDAR or structured light. In modern robotic systems, especially those focused on mapping, it is essential to merge multiple views of the same environment accurately. However, state-of-the-art registration techniques often struggle when large rotational differences exist between point sets or when the data is significantly corrupted by sensor noise. These challenges can lead to misalignments and, consequently, to inaccurate or distorted 3D reconstructions. In this work, we address both these limitations by proposing a robust modification to the classic Iterative Closest Point (ICP) algorithm. Our method, termed Exponential Similarity Matrix ICP (ESM-ICP), integrates a Gaussian-inspired exponential weighting scheme to construct a similarity matrix that dynamically adapts across iterations. This matrix facilitates improved estimation of both rotational and translational components during alignment. We demonstrate the robustness of ESM-ICP in two challenging scenarios: (i) large rotational discrepancies between the source and target point clouds, and (ii) data corrupted by non-Gaussian noise. Our results show that ESM-ICP outperforms traditional geometric registration techniques as well as several recent learning-based methods. To encourage reproducibility and community engagement, our full implementation is made publicly available on GitHub. https://github.com/aralab-unr/ESM_ICP
中文: 本文提出ESM-ICP算法,通过引入指数相似性矩阵改进传统迭代最近点方法,有效解决了点云在大旋转差异和非高斯噪声下的配准难题,其性能优于现有主流方法。
English: This paper introduces ESM-ICP, a robust variant of the Iterative Closest Point algorithm that uses an exponential similarity matrix to improve registration accuracy under large rotational differences and non-Gaussian noise, outperforming existing methods.
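
Illustrative sketch (not from the paper): the exponential weighting idea as a single weighted ICP iteration, where correspondence residuals pass through a Gaussian-inspired exponential so that distant (likely spurious) matches contribute little to a weighted Kabsch estimate of the rigid transform. The weighting form, sigma value, and brute-force matching are assumptions of this sketch rather than the exact ESM-ICP similarity matrix.

```python
import numpy as np

def esm_icp_step(src, tgt, sigma=0.05):
    """One weighted ICP iteration; returns a rigid transform (R, t) mapping src toward tgt.

    src, tgt: (N, 3) and (M, 3) point sets.
    """
    # Nearest-neighbour correspondences (brute force for clarity).
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)    # (N, M) squared distances
    nn_idx = d2.argmin(axis=1)
    matched = tgt[nn_idx]

    # Exponential similarity weights: distant matches are down-weighted.
    w = np.exp(-d2[np.arange(len(src)), nn_idx] / (2 * sigma ** 2))
    w = w / (w.sum() + 1e-12)

    # Weighted Kabsch: optimal rotation/translation for the weighted pairs.
    mu_s = (w[:, None] * src).sum(0)
    mu_t = (w[:, None] * matched).sum(0)
    H = (w[:, None] * (src - mu_s)).T @ (matched - mu_t)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t
```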

Authors:Qi Zhou, Yukai Shi, Xiaojun Yang, Xiaoyu Xian, Lunjia Liao, Ruimao Zhang, Liang Lin
Title: DFVO: Learning Darkness-free Visible and Infrared Image Disentanglement and Fusion All at Once
Abstract:
Visible and infrared image fusion is one of the most crucial tasks in the field of image fusion, aiming to generate fused images with clear structural information and high-quality texture features for high-level vision tasks. However, when faced with severe illumination degradation in visible images, the fusion results of existing image fusion methods often exhibit blurry and dim visual effects, posing major challenges for autonomous driving. To this end, a Darkness-Free network is proposed to handle Visible and infrared image disentanglement and fusion all at Once (DFVO), which employs a cascaded multi-task approach to replace the traditional two-stage cascaded training (enhancement and fusion), addressing the issue of information entropy loss caused by hierarchical data transmission. Specifically, we construct a latent-common feature extractor (LCFE) to obtain latent features for the cascaded tasks strategy. Firstly, a details-extraction module (DEM) is devised to acquire high-frequency semantic information. Secondly, we design a hyper cross-attention module (HCAM) to extract low-frequency information and preserve texture features from source images. Finally, a relevant loss function is designed to guide the holistic network learning, thereby achieving better image fusion. Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art alternatives in terms of qualitative and quantitative evaluations. Particularly, DFVO can generate clearer, more informative, and more evenly illuminated fusion results in the dark environments, achieving best performance on the LLVIP dataset with 63.258 dB PSNR and 0.724 CC, providing more effective information for high-level vision tasks. Our code is publicly accessible at https://github.com/DaVin-Qi530/DFVO.
中文摘要:提出的Darkness-Free网络(DFVO)通过单级级联框架同时处理可见光与红外图像的分离与融合,解决了弱光条件下传统方法融合结果模糊的问题,在黑暗环境中能生成更清晰、信息更丰富的融合图像。
English Summary: The proposed Darkness-Free network (DFVO) addresses blurry fusion results under poor lighting by integrating visible-infrared image disentanglement and fusion in a single cascaded framework, achieving superior performance with clearer, more informative outputs in dark environments.

Authors:Jing Hu, Chengming Feng, Shu Hu, Ming-Ching Chang, Xin Li, Xi Wu, Xin Wang
Title: RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation
Abstract:
Arbitrary style transfer aims to apply the style of any given artistic image to another content image. However, existing deep learning-based methods often incur significant computational costs to generate diverse stylized results. Motivated by this, we propose RLMiniStyler, a novel reinforcement learning-based framework for arbitrary style transfer. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while keeping the model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across various image resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Codes are available at https://github.com/fengxiaoming520/RLMiniStyler.
中文: 本文提出RLMiniStyler强化学习框架,通过迭代策略引导和不确定性感知的多任务学习策略,以更低成本生成多样化高质量风格化图像序列,在多个分辨率图像上验证了其优越性。
English: This paper introduces RLMiniStyler, a reinforcement learning-based framework that efficiently generates diverse stylized image sequences through iterative policy guidance and an uncertainty-aware multi-task learning strategy, achieving high-quality results with reduced computational costs.
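
Illustrative sketch (not from the paper): a common way to realize uncertainty-aware multi-task weighting, and one plausible reading of the strategy described here, is homoscedastic-uncertainty weighting with a learnable log-variance per loss term (Kendall et al. style). The module below is a generic sketch, not RLMiniStyler's exact formulation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine per-task losses with learnable uncertainty-based weights."""
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # learnable log sigma^2 per task

    def forward(self, losses):
        # losses: iterable of per-task scalar losses (e.g. content loss, style loss)
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])           # higher uncertainty -> lower weight
            total = total + precision * loss + 0.5 * self.log_vars[i]
        return total
```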

Authors:Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian
Title: DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Abstract:
Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.
Chinese: DeCLIP通过解耦自注意力机制为“内容”和“上下文”特征来增强CLIP,有效提升局部判别能力和空间一致性,在开放词汇密集预测任务中表现卓越。
English: DeCLIP enhances CLIP by decoupling self-attention into "content" and "context" features to improve local discriminability and spatial consistency, achieving superior performance in open-vocabulary dense prediction tasks.
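
Illustrative sketch (not from the paper): one way a self-attention layer could be decoupled into "content" and "context" branches, where the context branch keeps standard query-key attention to preserve spatial correlations and the content branch uses query-query self-similarity to favor locally consistent regions. The exact branch definitions, projections, and alignment targets in DeCLIP may differ.

```python
import torch
import torch.nn.functional as F

def decoupled_attention(x, Wq, Wk, Wv):
    """Split one self-attention layer into content and context features.

    x: (B, N, D) image tokens; Wq, Wk, Wv: (D, D) projection matrices.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.size(-1)

    # 'Context' branch: standard query-key attention, keeps spatial correlations.
    ctx_attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    context = ctx_attn @ v

    # 'Content' branch: query-query self-similarity, emphasising locally
    # consistent regions for better local discriminability.
    cnt_attn = F.softmax(q @ q.transpose(-2, -1) / d ** 0.5, dim=-1)
    content = cnt_attn @ v
    return content, context
```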

Authors:Jie Sun, Heng Liu, Yongzhen Wang, Xiao-Ping Zhang, Mingqiang Wei
Title: WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing
Abstract:
In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail enhancement. This coarse-to-fine strategy enables WDMamba to effectively capture features specific to each stage of the dehazing process, resulting in high-quality restored images. Specifically, in the low-frequency restoration stage, we integrate Mamba blocks to reconstruct global structures with linear complexity, efficiently removing overall haze and producing a coarse restored image. Thereafter, the detail enhancement stage reinstates fine-grained information that may have been overlooked during the previous phase, culminating in the final dehazed output. Furthermore, to enhance detail retention and achieve more natural dehazing, we introduce a self-guided contrastive regularization during network training. By utilizing the coarse restored output as a hard negative example, our model learns more discriminative representations, substantially boosting the overall dehazing performance. Extensive evaluations on public dehazing benchmarks demonstrate that our method surpasses state-of-the-art approaches both qualitatively and quantitatively. Code is available at https://github.com/SunJ000/WDMamba.
中文摘要:本文提出WDMamba新型去雾框架,利用雾霾相关的小波退化先验将去雾过程分解为低频恢复和细节增强两阶段,通过Mamba模块重构全局结构并结合自引导对比正则化恢复精细信息,在公共基准测试中实现了最优的去雾效果。
English Summary: This paper introduces WDMamba, a novel dehazing framework that leverages a haze-specific wavelet degradation prior to decompose image dehazing into low-frequency restoration using Mamba blocks for global structure recovery and detail enhancement for fine-grained information, achieving state-of-the-art performance through a self-guided contrastive regularization during training.
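
Illustrative sketch (not from the paper): the wavelet degradation prior can be probed with a one-level 2D DWT; if haze-related information concentrates in the low-frequency approximation sub-band, the energy fraction computed below should be dominated by that band for hazy inputs. The wavelet choice ('haar') and single decomposition level are assumptions of this sketch.

```python
import numpy as np
import pywt

def wavelet_low_frequency_energy(image):
    """Fraction of signal energy in the low-frequency sub-band of a grayscale image."""
    cA, (cH, cV, cD) = pywt.dwt2(image.astype(float), 'haar')
    low = np.sum(cA ** 2)                                   # approximation (low-frequency) band
    high = np.sum(cH ** 2) + np.sum(cV ** 2) + np.sum(cD ** 2)  # detail (high-frequency) bands
    return low / (low + high + 1e-12)
```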

Authors:Yi Li, Zhiyuan Zhang, Jiangnan Xia, Jianghan Cheng, Qilong Wu, Junwei Li, Yibin Tian, Hui Kong
Title: TS-Diff: Two-Stage Diffusion Model for Low-Light RAW Image Enhancement
Abstract:
This paper presents a novel Two-Stage Diffusion Model (TS-Diff) for enhancing extremely low-light RAW images. In the pre-training stage, TS-Diff synthesizes noisy images by constructing multiple virtual cameras based on a noise space. Camera Feature Integration (CFI) modules are then designed to enable the model to learn generalizable features across diverse virtual cameras. During the aligning stage, CFIs are averaged to create a target-specific CFI$^T$, which is fine-tuned using a small amount of real RAW data to adapt to the noise characteristics of specific cameras. A structural reparameterization technique further simplifies CFI$^T$ for efficient deployment. To address color shifts during the diffusion process, a color corrector is introduced to ensure color consistency by dynamically adjusting global color distributions. Additionally, a novel dataset, QID, is constructed, featuring quantifiable illumination levels and a wide dynamic range, providing a comprehensive benchmark for training and evaluation under extreme low-light conditions. Experimental results demonstrate that TS-Diff achieves state-of-the-art performance on multiple datasets, including QID, SID, and ELD, excelling in denoising, generalization, and color consistency across various cameras and illumination levels. These findings highlight the robustness and versatility of TS-Diff, making it a practical solution for low-light imaging applications. Source codes and models are available at https://github.com/CircccleK/TS-Diff
中文: 本文提出TS-Diff模型,通过虚拟相机预训练和针对特定设备的微调两阶段方法增强极暗RAW图像,在多个数据集上实现了领先的去噪效果与色彩一致性。
English: This paper introduces TS-Diff, a two-stage diffusion model that enhances extremely low-light RAW images through pre-training with virtual cameras and fine-tuning for specific devices, achieving top performance in denoising and color consistency across multiple datasets.

Authors:Yajie Fu, Chaorui Huang, Junwei Li, Hui Kong, Yibin Tian, Huakang Li, Zhiyuan Zhang
Title: HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation
Abstract:
We propose HDiffTG, a novel 3D Human Pose Estimation (3DHPE) method that integrates Transformer, Graph Convolutional Network (GCN), and diffusion model into a unified framework. HDiffTG leverages the strengths of these techniques to significantly improve pose estimation accuracy and robustness while maintaining a lightweight design. The Transformer captures global spatiotemporal dependencies, the GCN models local skeletal structures, and the diffusion model provides step-by-step optimization for fine-tuning, achieving a complementary balance between global and local features. This integration enhances the model's ability to handle pose estimation under occlusions and in complex scenarios. Furthermore, we introduce lightweight optimizations to the integrated model and refine the objective function design to reduce computational overhead without compromising performance. Evaluation results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HDiffTG achieves state-of-the-art (SOTA) performance on the MPI-INF-3DHP dataset while excelling in both accuracy and computational efficiency. Additionally, the model exhibits exceptional robustness in noisy and occluded environments. Source codes and models are available at https://github.com/CirceJie/HDiffTG
中文: HDiffTG是一种创新的三维人体姿态估计方法,将Transformer、图卷积网络和扩散模型整合到统一框架中,在保持计算效率的同时实现了顶尖的精度与鲁棒性。
English: HDiffTG is a novel 3D human pose estimation method that integrates Transformer, GCN, and diffusion models into a unified framework, achieving state-of-the-art accuracy and robustness while maintaining computational efficiency.

Authors:Yisen Feng, Haoyu Zhang, Meng Liu, Weili Guan, Liqiang Nie
Title: Object-Shot Enhanced Grounding Network for Egocentric Video
Abstract:
Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
中文:OSGNet提出了一种对象-镜头增强的定位网络,通过提取对象信息和利用自我中心视频的镜头运动来增强模态对齐能力,在三个数据集上实现了最先进的性能。
English: OSGNet introduces an object-shot enhanced grounding network that enriches video representation with object information and leverages egocentric shot movements to improve modality alignment, achieving state-of-the-art performance on three datasets.

Authors:Trinh T. L. Vuong, Jin Tae Kwak
Title: VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
Abstract:
We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, single patch images, automatically keyframe-extracted clips, and manually segmented video pathology images, to mimic the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.
中文:VideoPath-LLaVA是计算病理学中首个集成多种图像场景的大型多模态模型,通过模拟病理学家的自然诊断过程生成详细组织学描述和最终诊断,并借助创新的数据处理和训练方法设立了病理视频分析新基准。
English: VideoPath-LLaVA is the first large multimodal model in computational pathology that integrates multiple image scenarios to mimic pathologists' diagnostic process, generating detailed descriptions and definitive diagnoses while setting a new benchmark through innovative data handling and training methods.

Authors:Hail Song, Wonsik Shin, Naeun Lee, Soomin Chung, Nojun Kwak, Woontack Woo
Title: S3D: Sketch-Driven 3D Model Generation
Abstract:
Generating high-quality 3D models from 2D sketches is a challenging task due to the inherent ambiguity and sparsity of sketch data. In this paper, we present S3D, a novel framework that converts simple hand-drawn sketches into detailed 3D models. Our method utilizes a U-Net-based encoder-decoder architecture to convert sketches into face segmentation masks, which are then used to generate a 3D representation that can be rendered from novel views. To ensure robust consistency between the sketch domain and the 3D output, we introduce a novel style-alignment loss that aligns the U-Net bottleneck features with the initial encoder outputs of the 3D generation module, significantly enhancing reconstruction fidelity. To further enhance the network's robustness, we apply augmentation techniques to the sketch dataset. This streamlined framework demonstrates the effectiveness of S3D in generating high-quality 3D models from sketch inputs. The source code for this project is publicly available at https://github.com/hailsong/S3D.
Chinese: S3D框架通过U-Net架构和创新的风格对齐损失函数,成功将简单的二维草图转化为精细的三维模型,确保了高质量的重建效果。
English: The S3D framework effectively converts simple 2D sketches into detailed 3D models using a U-Net architecture and a novel style-alignment loss to ensure high reconstruction fidelity.

Authors:Zixiang Ai, Zichen Liu, Jiahuan Zhou
Title: Vision Graph Prompting via Semantic Low-Rank Decomposition
Abstract:
Vision GNN (ViG) demonstrates superior performance by representing images as graph structures, providing a more natural way to capture irregular semantic patterns beyond traditional grid or sequence-based representations. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. However, existing prompting methods are primarily designed for Transformer-based models, neglecting the rich topological relationships among nodes and edges in graph-based representations, limiting their capacity to model complex semantics. In this paper, we propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures. Our core insight reveals that semantically connected components in the graph exhibit low-rank properties. Building on this observation, we introduce a semantic low-rank prompting method that decomposes low-rank semantic features and integrates them with prompts on vision graph topologies, capturing both global structural patterns and fine-grained semantic dependencies. Extensive experiments demonstrate our method significantly improves ViG's transfer performance on diverse downstream tasks, achieving results comparable to full fine-tuning while maintaining parameter efficiency. Our code is available at https://github.com/zhoujiahuan1991/ICML2025-VGP.
中文: 提出的视觉图提示(VGP)框架利用低秩语义特征增强视觉图神经网络的高效微调,在多项任务中达到与全参数微调相当的性能。
English: The proposed Vision Graph Prompting (VGP) framework leverages low-rank semantic features to enhance parameter-efficient fine-tuning of Vision GNNs, achieving performance comparable to full fine-tuning across diverse tasks.
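
Illustrative sketch (not from the paper): the low-rank observation can be shown with a truncated SVD of vision-graph node features, where the rank-r component captures the dominant, semantically coherent structure the prompts would build on. The rank value and the way components would be injected into prompts are assumptions of this sketch, not VGP's actual mechanism.

```python
import numpy as np

def low_rank_semantic_features(node_feats, rank=4):
    """Rank-r approximation of vision-graph node features via truncated SVD.

    node_feats: (N, D) features of the N graph nodes.
    """
    U, S, Vt = np.linalg.svd(node_feats, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # (N, D) rank-r reconstruction
    residual = node_feats - low_rank                  # remaining high-rank detail
    return low_rank, residual
```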

Authors:Zixiang Ai, Zichen Liu, Yuanhang Lei, Zhenyu Cui, Xu Zou, Jiahuan Zhou
Title: GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model
Abstract:
Pre-trained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model's feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19% of trainable parameters. Our code is available at https://github.com/zhoujiahuan1991/ICML2025-GAPrompt.
中文: 提出的几何感知点云提示(GAPrompt)通过辅助输入和传播机制整合几何线索,以仅2.19%的可训练参数实现与全微调相媲美的性能,显著提升了3D视觉模型的适应性。
English: The proposed Geometry-Aware Point Cloud Prompt (GAPrompt) enhances 3D vision models by integrating geometric cues through auxiliary inputs and propagation mechanisms, achieving competitive performance with full fine-tuning while using only 2.19% of trainable parameters.

Authors:Xuyang Wang, Siyuan Duan, Qizhi Li, Guiduo Duan, Yuan Sun, Dezhong Peng
Title: Reliable Disentanglement Multi-view Learning Against View Adversarial Attacks
Abstract:
Trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. However, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, thereby deceiving or disrupting multi-view models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this tricky problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of corresponding evidences, which is extracted by a pretrained evidence extractor. Then, we employ the feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potential informative features from them. Finally, to further ignore the irreparable adversarial interferences, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that RDML outperforms the state-of-the-art methods by a relatively large margin. Our code is available at https://github.com/Willy1005/2025-IJCAI-RDML.
中文摘要:本文提出RDML框架,通过证据引导的解耦学习分离多视图数据中的正常特征与对抗扰动,采用特征重校准和视图级证据注意力机制,显著提升了多视图分类任务在对抗攻击下的可靠性。
English Summary: This paper introduces RDML, a reliable multi-view learning framework that addresses adversarial unreliability by disentangling clean and adversarial features using evidence-guided mechanisms to enhance model robustness against attacks.

Authors:Feng Gao, Sheng Liu, Chuanzheng Gong, Xiaowei Zhou, Jiayi Wang, Junyu Dong, Qian Du
Title: Prototype-Based Information Compensation Network for Multi-Source Remote Sensing Data Classification
Abstract:
Multi-source remote sensing data joint classification aims to provide accuracy and reliability of land cover classification by leveraging the complementary information from multiple data sources. Existing methods confront two challenges: inter-frequency multi-source feature coupling and inconsistency of complementary information exploration. To solve these issues, we present a Prototype-based Information Compensation Network (PICNet) for land cover classification based on HSI and SAR/LiDAR data. Specifically, we first design a frequency interaction module to enhance the inter-frequency coupling in multi-source feature extraction. The multi-source features are first decoupled into high- and low-frequency components. Then, these features are recoupled to achieve efficient inter-frequency communication. Afterward, we design a prototype-based information compensation module to model the global multi-source complementary information. Two sets of learnable modality prototypes are introduced to represent the global modality information of multi-source data. Subsequently, cross-modal feature integration and alignment are achieved through cross-attention computation between the modality-specific prototype vectors and the raw feature representations. Extensive experiments on three public datasets demonstrate the significant superiority of our PICNet over state-of-the-art methods. The codes are available at https://github.com/oucailab/PICNet.
中文:提出的基于原型的信息补偿网络(PICNet)通过频率交互模块和原型补偿机制,解决了多源遥感分类中的频间特征耦合与互补信息不一致问题,在三个公开数据集上验证了其显著优越性。
English: The proposed Prototype-based Information Compensation Network (PICNet) addresses inter-frequency feature coupling and complementary information inconsistency in multi-source remote sensing classification by introducing frequency interaction and prototype-based compensation modules, demonstrating superior performance on three public datasets.
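
Illustrative sketch (not from the paper): the first step of the frequency interaction module, decoupling features into high- and low-frequency components, shown here as a simple FFT low-pass split. The cutoff and the FFT-based implementation are assumptions; PICNet's actual decomposition and the recoupling stage are not reproduced here.

```python
import numpy as np

def split_frequencies(feature_map, cutoff=0.1):
    """Split a 2D feature map into low- and high-frequency parts with an FFT low-pass mask."""
    H, W = feature_map.shape
    spec = np.fft.fftshift(np.fft.fft2(feature_map))
    fy = np.fft.fftshift(np.fft.fftfreq(H))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(W))[None, :]
    mask = np.sqrt(fy ** 2 + fx ** 2) <= cutoff          # keep frequencies below the cutoff radius
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
    high = feature_map - low
    return low, high
```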

Authors:Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, Han Zhao, Siteng Huang, Donglin Wang
Title: OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
Abstract:
Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of sufficient open-source work for further performance analysis and optimization. To address this problem, this paper summarizes and compares the structural designs of existing dual-system architectures and conducts systematic empirical evaluations of their core design elements. Ultimately, it provides a low-cost open-source model for further exploration. The project will be continuously updated with additional experimental conclusions and open-source models with improved performance. Project page: https://openhelix-robot.github.io/.
中文: 本文针对双系统VLA架构开源资源不足的问题,通过总结结构设计、开展实证评估,并提供一个低成本开源模型以供进一步探索。
English: This paper addresses the lack of open-source dual-system VLA architectures by summarizing their structural designs, conducting empirical evaluations, and providing a low-cost open-source model for further exploration.

Authors:Shuang Zeng, Chee Hong Lee, Micky C Nnamdi, Wenqi Shi, J Ben Tamo, Lei Zhu, Hangzhou He, Xinliang Zhang, Qian Chen, May D. Wang, Yanye Lu, Qiushi Ren
Title: Novel Extraction of Discriminative Fine-Grained Feature to Improve Retinal Vessel Segmentation
Abstract:
Retinal vessel segmentation is a vital early detection method for several severe ocular diseases. Despite significant progress in retinal vessel segmentation with the advancement of Neural Networks, there are still challenges to overcome. Specifically, retinal vessel segmentation aims to predict the class label for every pixel within a fundus image, with a primary focus on intra-image discrimination, making it vital for models to extract more discriminative features. Nevertheless, existing methods primarily focus on minimizing the difference between the output from the decoder and the label, but overlook the feature-level fine-grained representations from the encoder. To address these issues, we propose a novel Attention U-shaped Kolmogorov-Arnold Network named AttUKAN along with a novel Label-guided Pixel-wise Contrastive Loss for retinal vessel segmentation. Specifically, we implement Attention Gates into Kolmogorov-Arnold Networks to enhance model sensitivity by suppressing irrelevant feature activations and model interpretability by non-linear modeling of KAN blocks. Additionally, we design a novel Label-guided Pixel-wise Contrastive Loss to supervise our proposed AttUKAN to extract more discriminative features by distinguishing between foreground vessel-pixel pairs and background pairs. Experiments are conducted on four public datasets (DRIVE, STARE, CHASE_DB1, and HRF) and our private dataset. AttUKAN achieves F1 scores of 82.50%, 81.14%, 81.34%, 80.21% and 80.09%, along with MIoU scores of 70.24%, 68.64%, 68.59%, 67.21% and 66.94% on the above datasets, the highest among 11 networks for retinal vessel segmentation. Quantitative and qualitative results show that our AttUKAN achieves state-of-the-art performance and outperforms existing retinal vessel segmentation methods. Our code will be available at https://github.com/stevezs315/AttUKAN.
Chinese: 研究者提出了一种新型注意力U形Kolmogorov-Arnold网络AttUKAN及标签引导的像素级对比损失,通过注意力机制和对比学习增强特征判别能力,在视网膜血管分割任务中实现了最优性能。
English: The authors propose AttUKAN, a novel Attention U-shaped Kolmogorov-Arnold Network with Label-guided Pixel-wise Contrastive Loss, which achieves state-of-the-art retinal vessel segmentation performance by enhancing feature discrimination through attention mechanisms and contrastive learning.
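
Illustrative sketch (not from the paper): what a label-guided pixel-wise contrastive term can look like, where vessel pixels sampled from the encoder features act as anchors and positives, background pixels act as negatives, and an InfoNCE-style loss pushes the two apart. The sampling scheme, pair construction, and temperature are assumptions, not AttUKAN's exact loss.

```python
import torch
import torch.nn.functional as F

def label_guided_pixel_contrastive_loss(feats, labels, num_pairs=512, temperature=0.1):
    """Contrast sampled vessel-pixel pairs against vessel/background pairs.

    feats:  (B, C, H, W) encoder feature maps
    labels: (B, H, W) binary vessel masks
    """
    B, C, H, W = feats.shape
    f = F.normalize(feats.permute(0, 2, 3, 1).reshape(-1, C), dim=-1)   # (B*H*W, C)
    y = labels.reshape(-1)

    pos_idx = torch.nonzero(y > 0.5).squeeze(1)
    neg_idx = torch.nonzero(y <= 0.5).squeeze(1)
    if pos_idx.numel() < 2 or neg_idx.numel() < 1:
        return feats.sum() * 0.0                      # degenerate batch, no contrastive signal

    n = min(num_pairs, pos_idx.numel(), neg_idx.numel())
    anchors = f[pos_idx[torch.randperm(pos_idx.numel(), device=feats.device)[:n]]]
    positives = f[pos_idx[torch.randperm(pos_idx.numel(), device=feats.device)[:n]]]
    negatives = f[neg_idx[torch.randperm(neg_idx.numel(), device=feats.device)[:n]]]

    pos_sim = (anchors * positives).sum(-1) / temperature
    neg_sim = (anchors * negatives).sum(-1) / temperature
    logits = torch.stack([pos_sim, neg_sim], dim=1)   # (n, 2): positive pair vs negative pair
    targets = torch.zeros(n, dtype=torch.long, device=feats.device)
    return F.cross_entropy(logits, targets)
```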

Authors:Tin Mišić, Karlo Koledić, Fabio Bonsignorio, Ivan Petrović, Ivan Marković
Title: An Active Inference Model of Covert and Overt Visual Attention
Abstract:
The ability to selectively attend to relevant stimuli while filtering out distractions is essential for agents that process complex, high-dimensional sensory input. This paper introduces a model of covert and overt visual attention through the framework of active inference, utilizing dynamic optimization of sensory precisions to minimize free-energy. The model determines visual sensory precisions based on both current environmental beliefs and sensory input, influencing attentional allocation in both covert and overt modalities. To test the effectiveness of the model, we analyze its behavior in the Posner cueing task and a simple target focus task using two-dimensional (2D) visual data. Reaction times are measured to investigate the interplay between exogenous and endogenous attention, as well as valid and invalid cueing. The results show that exogenous and valid cues generally lead to faster reaction times compared to endogenous and invalid cues. Furthermore, the model exhibits behavior similar to inhibition of return, where previously attended locations become suppressed after a specific cue-target onset asynchrony interval. Lastly, we investigate different aspects of overt attention and show that involuntary, reflexive saccades occur faster than intentional ones, but at the expense of adaptability.
中文: 本文提出了一种基于主动推理的视觉注意力模型,通过动态优化感官精度来管理注意分配,实验表明外源性及有效线索能引发更快反应,并揭示了反射性与意向性眼动在速度与适应性上的差异。
English: This paper presents an active inference model that optimizes visual attention by dynamically adjusting sensory precisions, demonstrating through experiments that exogenous and valid cues yield faster responses and revealing distinct temporal characteristics between reflexive and intentional eye movements.

Authors:Chongsheng Zhang, Shuwen Wu, Yingqi Chen, Yi Men, Gaojuan Fan, Matthias Aßenmacher, Christian Heumann, João Gama
Title: Explainable Coarse-to-Fine Ancient Manuscript Duplicates Discovery
Abstract:
Ancient manuscripts are the primary source of ancient linguistic corpora. However, many ancient manuscripts exhibit duplications due to unintentional repeated publication or deliberate forgery. The Dead Sea Scrolls, for example, include counterfeit fragments, whereas Oracle Bones (OB) contain both republished materials and fabricated specimens. Identifying ancient manuscript duplicates is of great significance for both archaeological curation and ancient history study. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoint matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our model with state-of-the-art content-based image retrieval and image matching methods, showing that our model yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, with significantly higher computational efficiency. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which had been missed by domain experts for decades. Code, model and real-world results are available at: https://github.com/cszhangLMU/OBD-Finder/.
中文: 本研究提出了一种渐进式甲骨文重复发现框架,结合无监督关键点匹配与文本内容分析,在实现高效检索的同时,发现了60多对曾被专家遗漏的甲骨文重复样本。
English: This study introduces a progressive framework for detecting duplicate Oracle Bones by integrating unsupervised keypoint matching with text-based content analysis, achieving high retrieval accuracy and efficiency while uncovering over 60 previously missed duplicates.

Authors:Xuechao Wang, Sven Nomm, Junqing Huang, Kadri Medijainen, Aaro Toomela, Michael Ruzhansky
Title: PointExplainer: Towards Transparent Parkinson's Disease Diagnosis
Abstract:
Deep neural networks have shown potential in analyzing digitized hand-drawn signals for early diagnosis of Parkinson's disease. However, the lack of clear interpretability in existing diagnostic methods presents a challenge to clinical trust. In this paper, we propose PointExplainer, an explainable diagnostic strategy to identify hand-drawn regions that drive model diagnosis. Specifically, PointExplainer assigns discrete attribution values to hand-drawn segments, explicitly quantifying their relative contributions to the model's decision. Its key components include: (i) a diagnosis module, which encodes hand-drawn signals into 3D point clouds to represent hand-drawn trajectories, and (ii) an explanation module, which trains an interpretable surrogate model to approximate the local behavior of the black-box diagnostic model. We also introduce consistency measures to further address the issue of faithfulness in explanations. Extensive experiments on two benchmark datasets and a newly constructed dataset show that PointExplainer can provide intuitive explanations with no diagnostic performance degradation. The source code is available at https://github.com/chaoxuewang/PointExplainer.
中文摘要:PointExplainer是一种可解释的诊断策略,通过为手绘片段分配归因值来识别推动帕金森病诊断的关键区域,在保持诊断性能的同时提供直观解释。
English Summary: PointExplainer is an interpretable diagnostic method that identifies key hand-drawn segments for Parkinson's disease detection while maintaining diagnostic accuracy through attribution values and consistency measures.
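The explanation module trains an interpretable surrogate to approximate the black-box diagnosis model locally. A minimal LIME-style sketch of that idea is shown below; the `diagnose` callable and the segment representation are hypothetical placeholders, and the paper's actual surrogate and consistency measures may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge

def segment_attributions(diagnose, segments, num_perturbations=500, seed=0):
    """LIME-style surrogate explanation sketch.

    diagnose:  callable mapping a list of kept segments to a scalar model score
               (hypothetical stand-in for the black-box diagnosis module).
    segments:  list of hand-drawn trajectory segments (e.g. arrays of (x, y, t) points).
    Returns one attribution value per segment.
    """
    rng = np.random.default_rng(seed)
    n = len(segments)
    masks = rng.integers(0, 2, size=(num_perturbations, n))   # which segments are kept
    scores = np.array([
        diagnose([s for s, keep in zip(segments, m) if keep]) for m in masks
    ])
    # Interpretable linear surrogate: each learned coefficient is a segment attribution.
    surrogate = Ridge(alpha=1.0).fit(masks, scores)
    return surrogate.coef_
```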

Authors:Ioannis Nasios
Title: AI-driven multi-source data fusion for algal bloom severity classification in small inland water bodies: Leveraging Sentinel-2, DEM, and NOAA climate data
Abstract:
Harmful algal blooms are a growing threat to inland water quality and public health worldwide, creating an urgent need for efficient, accurate, and cost-effective detection methods. This research introduces a high-performing methodology that integrates multiple open-source remote sensing data with advanced artificial intelligence models. Key data sources include Copernicus Sentinel-2 optical imagery, the Copernicus Digital Elevation Model (DEM), and NOAA's High-Resolution Rapid Refresh (HRRR) climate data, all efficiently retrieved using platforms like Google Earth Engine (GEE) and Microsoft Planetary Computer (MPC). The NIR and two SWIR bands from Sentinel-2, the altitude from the elevation model, the temperature and wind from NOAA as well as the longitude and latitude were the most important features. The approach combines two types of machine learning models, tree-based models and a neural network, into an ensemble for classifying algal bloom severity. While the tree models performed strongly on their own, incorporating a neural network added robustness and demonstrated how deep learning models can effectively use diverse remote sensing inputs. The method leverages high-resolution satellite imagery and AI-driven analysis to monitor algal blooms dynamically, and although initially developed for a NASA competition in the U.S., it shows potential for global application. The complete code is available for further adaptation and practical implementation, illustrating the convergence of remote sensing data and AI to address critical environmental challenges (https://github.com/IoannisNasios/HarmfulAlgalBloomDetection).
中文: 本研究提出一种集成树模型与神经网络的AI方法,结合多源遥感数据高效检测有害藻华,虽最初针对美国水域开发,但具备全球应用潜力,为应对水环境挑战提供了可扩展方案。
English: This study presents an AI-driven ensemble method combining tree-based models and neural networks with multi-source remote sensing data to efficiently detect harmful algal blooms, offering a scalable solution initially developed for U.S. waters but with global applicability.
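The severity classifier is an ensemble of tree-based models and a neural network. The sketch below shows one plausible way to combine the two with scikit-learn; the specific models, features, and blending weight are illustrative assumptions rather than the competition configuration.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X holds per-sample features such as Sentinel-2 NIR/SWIR reflectance, DEM altitude,
# HRRR temperature and wind, longitude and latitude; y holds severity classes.
def fit_severity_ensemble(X, y):
    tree_model = GradientBoostingClassifier().fit(X, y)
    nn_model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
    ).fit(X, y)
    return tree_model, nn_model

def predict_severity(models, X, tree_weight=0.6):
    """Blend class probabilities from both models and pick the most likely class."""
    tree_model, nn_model = models
    proba = tree_weight * tree_model.predict_proba(X) + (1 - tree_weight) * nn_model.predict_proba(X)
    return proba.argmax(axis=1)
```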

Authors:Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: Multi-Agent System for Comprehensive Soccer Understanding
Abstract:
Recent advances in soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Concretely, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge-driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K multimodal (text, image, video) multi-choice QA pairs across 13 distinct tasks; (iii) we introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) extensive evaluations and comparisons with representative MLLMs on SoccerBench highlight the superiority of our agentic system.
中文: 本文提出了一个全面的足球理解框架,包含多模态知识库SoccerWiki、含1万问答对的综合基准SoccerBench,以及通过协同推理展现卓越性能的多智能体系统SoccerAgent。
English: This paper introduces a holistic soccer understanding framework featuring SoccerWiki, a multimodal knowledge base; SoccerBench, a comprehensive benchmark with 10K QA pairs; and SoccerAgent, a multi-agent system that demonstrates superior performance through collaborative reasoning.

Authors:Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
Title: FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
Abstract:
Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/
中文摘要:FlexiAct提出了一种将参考视频中的动作迁移到任意目标图像的新方法,通过RefAdapter和FAE技术突破布局与视角等空间限制,同时保持身份一致性。
English Summary: FlexiAct is a novel method that transfers actions from a reference video to any target image, overcoming spatial constraints like layout and viewpoint through RefAdapter and FAE techniques while maintaining identity consistency.

Authors:Shiqi Li, Jihua Zhu, Yifan Xie, Naiwen Hu, Di Wang
Title: Matching Distance and Geometric Distribution Aided Learning Multiview Point Cloud Registration
Abstract:
Multiview point cloud registration plays a crucial role in robotics, automation, and computer vision fields. This paper concentrates on pose graph construction and motion synchronization within multiview registration. Previous methods for pose graph construction often pruned fully connected graphs or constructed sparse graphs using global features aggregated from local descriptors, which may not consistently yield reliable results. To identify dependable pairs for pose graph construction, we design a network model that extracts information from the matching distance between point cloud pairs. For motion synchronization, we propose another neural network model to calculate the absolute pose in a data-driven manner, rather than optimizing inaccurate handcrafted loss functions. Our model takes geometric distribution information into account and employs a modified attention mechanism to facilitate flexible and reliable feature interaction. Experimental results on diverse indoor and outdoor datasets confirm the effectiveness and generalizability of our approach. The source code is available at https://github.com/Shi-Qi-Li/MDGD.
中文: 本文提出了一种新的多视角点云配准方法,通过神经网络分析点云对间的匹配距离构建可靠姿态图,并采用数据驱动方式和改进注意力机制计算绝对姿态,在多个数据集上验证了方法的有效性和泛化能力。
English: This paper introduces a novel multiview point cloud registration method that uses neural networks to construct a reliable pose graph by analyzing matching distances and to compute absolute poses through a data-driven approach with an enhanced attention mechanism, demonstrating strong performance across various datasets.

Authors:Yifan Xiang, Zhenxi Zhang, Bin Li, Yixuan Weng, Shoujun Zhou, Yangfan He, Keqin Li
Title: ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant
Abstract:
Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. In this regard, existing methods face three main limitations: (i) their training data lacks multi-object sets in which relations among objects are learnable; (ii) built on this limited training data, their models overlook the relations between different personalized concepts and fail to reason over them; and (iii) their experiments mainly focus on a single personalized concept, where evaluations are limited to recognition and captioning tasks. To address these limitations, we present a new dataset named ReGraP, consisting of 120 sets of personalized knowledge. Each set includes images, KGs, and CoT QA pairs derived from the KGs, enabling more structured and sophisticated reasoning pathways. We propose ReGraP-LLaVA, an MLLM trained with the corresponding KGs and CoT QA pairs, where soft and hard graph prompting methods are designed to align KGs within the model's semantic space. We establish the ReGraP Benchmark, which contains diverse task types: multiple-choice, fill-in-the-blank, True/False, and descriptive questions in both open- and closed-ended settings. The proposed benchmark is designed to evaluate the relational reasoning and knowledge-connection capability of personalized MLLMs. We conduct experiments on the proposed ReGraP-LLaVA and other competitive MLLMs. Results show that the proposed model not only learns personalized knowledge but also performs relational reasoning in responses, achieving SoTA performance compared with competitive methods. All the codes and datasets are released at: https://github.com/xyfyyds/ReGraP.
中文: 针对现有个性化多模态大模型在关系推理上的不足,本研究提出ReGraP数据集和ReGraP-LLaVA模型,通过知识图谱和思维链技术实现了最优的关系推理能力。
English: Recent personalized MLLMs struggle with relational reasoning due to limited training data, prompting the introduction of the ReGraP dataset and ReGraP-LLaVA model, which achieve state-of-the-art performance by incorporating structured knowledge graphs and chain-of-thought reasoning.

Authors:Alessandro Simoni, Francesco Pelosin
Title: Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map
Abstract:
Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models. The code is publicly available at https://github.com/covisionlab/diffusion_labeling.
Chinese: 本文提出了一种基于扩散模型的工业缺陷数据集生成方法,通过利用增强的边界框表示来生成精确的分割掩码,有效提升了缺陷一致性和空间精度,并验证了该方法在缩小合成与真实数据差距方面的有效性。
English: This paper introduces a diffusion-based pipeline that generates high-fidelity industrial defect datasets with minimal supervision, using enriched bounding box representations to ensure accurate segmentation masks and improve defect consistency and spatial accuracy.

Authors:Zhiyu Pan, Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou
Title: Fixed-Length Dense Fingerprint Representation
Abstract:
Fixed-length fingerprint representations, which map each fingerprint to a compact and fixed-size feature vector, are computationally efficient and well-suited for large-scale matching. However, designing a robust representation that effectively handles diverse fingerprint modalities, pose variations, and noise interference remains a significant challenge. In this work, we propose a fixed-length dense descriptor of fingerprints, and introduce FLARE, a fingerprint matching framework that integrates the Fixed-Length dense descriptor with pose-based Alignment and Robust Enhancement. This fixed-length representation employs a three-dimensional dense descriptor to effectively capture spatial relationships among fingerprint ridge structures, enabling robust and locally discriminative representations. To ensure consistency within this dense feature space, FLARE incorporates pose-based alignment using complementary estimation methods, along with dual enhancement strategies that refine ridge clarity while preserving the original fingerprint modality. The proposed dense descriptor supports fixed-length representation while maintaining spatial correspondence, enabling fast and accurate similarity computation. Extensive experiments demonstrate that FLARE achieves superior performance across rolled, plain, latent, and contactless fingerprints, significantly outperforming existing methods in cross-modality and low-quality scenarios. Further analysis validates the effectiveness of the dense descriptor design, as well as the impact of alignment and enhancement modules on the accuracy of dense descriptor matching. Experimental results highlight the effectiveness and generalizability of FLARE as a unified and scalable solution for robust fingerprint representation and matching. The implementation and code will be publicly available at https://github.com/Yu-Yy/FLARE.
中文: FLARE是一个指纹匹配框架,通过结合固定长度密集描述符、姿态对齐和增强策略,能够在多种指纹模态及低质量场景下实现鲁棒且高效的匹配性能。
English: FLARE is a fingerprint matching framework that uses a fixed-length dense descriptor with pose alignment and enhancement to achieve robust, high-performance matching across various fingerprint types and challenging conditions.

Authors:Mishal Fatima, Steffen Jung, Margret Keuper
Title: Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models
Abstract:
Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet-1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we show that current methods aiming to mitigate harmful spurious features do not take these factors into account, and hence fail to achieve considerable performance gains in worst-group accuracy when the size and location of core features in an image change. The dataset and implementation code are available at https://github.com/Mishalfatima/Corner_Cases.
中文: 图像背景因人类审美偏好会产生虚假关联,导致模型尤其在小尺寸或偏离中心的物体上依赖误导性特征,而现有缓解方法未能有效应对这些偏差。
English: Image backgrounds can cause spurious correlations due to human aesthetic preferences, leading models to rely on these misleading features especially when objects are small or off-center, and current mitigation methods fail to address these biases effectively.
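The two biases studied here are the region-of-interest (ROI) to image ratio and how far the object sits from the image center. A small helper to compute both from a bounding box might look like the sketch below; the exact definitions used to build Hard-Spurious-ImageNet may differ.

```python
import numpy as np

def roi_statistics(bbox, image_size):
    """Size and position statistics highlighted by the paper.

    bbox:       (x_min, y_min, x_max, y_max) object bounding box in pixels.
    image_size: (width, height) of the image.
    Returns the ROI-to-image area ratio and the normalized distance of the
    object center from the image center (0 = centered, ~0.7 = in a corner).
    """
    x0, y0, x1, y1 = bbox
    w, h = image_size
    roi_ratio = ((x1 - x0) * (y1 - y0)) / (w * h)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    center_offset = np.hypot((cx - w / 2) / w, (cy - h / 2) / h)
    return roi_ratio, center_offset
```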

Authors:João Alves, Pia Haubro Andersen, Rikke Gade
Title: Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment
Abstract:
The Equine Facial Action Coding System (EquiFACS) enables the systematic annotation of facial movements through distinct Action Units (AUs). It serves as a crucial tool for assessing affective states in horses by identifying subtle facial expressions associated with discomfort. However, the field of horse affective state assessment is constrained by the scarcity of annotated data, as manually labelling facial AUs is both time-consuming and costly. To address this challenge, automated annotation systems are essential for leveraging existing datasets and improving affective state detection tools. In this work, we study different methods for specific ear AU detection and localization from horse videos. We leverage past work on deep learning-based video feature extraction combined with recurrent neural networks for the video classification task, as well as a classic optical flow based approach. We achieve 87.5% accuracy in classifying the presence of ear movement on a public horse video dataset, demonstrating the potential of our approach. We discuss future directions to develop these systems, with the aim of bridging the gap between automated AU detection and practical applications in equine welfare and veterinary diagnostics. Our code will be made publicly available at https://github.com/jmalves5/read-my-ears.
中文: 本研究开发了基于视频分析的自动化马耳动作检测方法,达到87.5%的准确率,旨在提升马匹情感状态评估水平并推动动物福利实践应用。
English: The study develops automated methods for detecting horse ear movements using video analysis, achieving 87.5% accuracy to improve affective state assessment and advance equine welfare applications.
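One of the baselines above is a classic optical-flow approach. A minimal sketch with OpenCV's Farneback flow is shown below; the ear region is assumed to come from an upstream detector, and using the mean flow magnitude as a movement cue is an illustrative simplification of the paper's pipeline.

```python
import cv2
import numpy as np

def ear_motion_score(frame_prev, frame_next, ear_box):
    """Classic optical-flow cue for ear movement (a sketch, not the full pipeline).

    frame_prev, frame_next: consecutive grayscale frames (uint8 arrays).
    ear_box: (x, y, w, h) region around the ear, assumed given by an upstream detector.
    Returns the mean flow magnitude inside the ear region; thresholding this value
    gives a simple movement / no-movement decision.
    """
    x, y, w, h = ear_box
    prev_roi = frame_prev[y:y + h, x:x + w]
    next_roi = frame_next[y:y + h, x:x + w]
    # Farneback dense optical flow (positional args: pyr_scale, levels, winsize,
    # iterations, poly_n, poly_sigma, flags).
    flow = cv2.calcOpticalFlowFarneback(prev_roi, next_roi, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    return float(magnitude.mean())
```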

Authors:Mengfei Duan, Kailun Yang, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Zhiyong Li, Shutao Li
Title: Panoramic Out-of-Distribution Segmentation
Abstract:
Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), achieving OoS for panoramas. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.
Chinese: 本文提出了全景分布外分割(PanOoS)这一新任务,通过文本引导的提示分布学习解决方案POS有效解决了全景图像中异常检测的难题,并建立了两个基准数据集,实验证明该方法在性能指标上显著优于现有技术。
English: This paper introduces a novel task called Panoramic Out-of-distribution Segmentation (PanOoS) to address the limitations of existing methods in identifying outliers in panoramic images, proposing a solution named POS that utilizes text-guided prompt distribution learning to enhance segmentation performance and establishing two new benchmarks for evaluation.
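The two metrics reported above, AuPRC and FPR95, can be computed from per-pixel anomaly scores with scikit-learn as sketched below; this is a generic evaluation helper, not the PanOoS code.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

def oos_metrics(anomaly_scores, oos_labels):
    """Out-of-distribution segmentation metrics from flattened per-pixel scores.

    anomaly_scores: 1D array of outlier scores (higher = more anomalous).
    oos_labels:     1 for out-of-distribution pixels, 0 for in-distribution pixels.
    """
    precision, recall, _ = precision_recall_curve(oos_labels, anomaly_scores)
    auprc = auc(recall, precision)                 # area under the PR curve
    fpr, tpr, _ = roc_curve(oos_labels, anomaly_scores)
    fpr95 = float(np.interp(0.95, tpr, fpr))       # false-positive rate at 95% TPR
    return auprc, fpr95
```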

Authors:Chuyu Zhao, Hao Huang, Jiashuo Guo, Ziyu Shen, Zhongwei Zhou, Jie Liu, Zekuan Yu
Title: RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT
Abstract:
Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group dual-student, semi-supervised framework. Each group contains two student models guided by a shared teacher network. By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. Specifically, RAIL introduces two instructive mechanisms. Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas. In the unsupervised phase, Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training. This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels. Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation. Our code will be available at https://github.com/Tournesol-Saturday/RAIL.
中文: 提出的RAIL框架通过双组训练和区域感知指导机制,有效解决了半监督3D牙齿分割中的结构模糊区域和不可靠伪标签问题,在有限标注条件下实现了最优性能。
English: The proposed RAIL framework enhances semi-supervised 3D tooth segmentation by introducing dual-group training and region-aware instructive mechanisms, effectively addressing ambiguous regions and unreliable pseudo-labels to achieve state-of-the-art performance with limited annotations.

Authors:Shenglan Li, Rui Yao, Yong Zhou, Hancheng Zhu, Kunyang Sun, Bing Liu, Zhiwen Shao, Jiaqi Zhao
Title: Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking
Abstract:
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region caused by erroneous pseudo-labels, or the introduction of background noise, reduces the effectiveness of modality fusion, while pseudo-label noise triggered by similar objects can further degrade tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object's coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.
中文: GDSTrack通过动态图融合和时间扩散技术,有效解决了自监督RGB-T跟踪中的伪标签噪声和模态融合问题,在多个数据集上表现优于现有先进方法。
English: GDSTrack introduces dynamic graph fusion and temporal diffusion to enhance self-supervised RGB-T tracking by mitigating pseudo-label noise and improving modality integration, outperforming current methods on multiple datasets.

Authors:Zhanyuan Jia, Ni Yao, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Fubao Zhu, Chen Zhao, Weihua Zhou
Title: UPMAD-Net: A Brain Tumor Segmentation Network with Uncertainty Guidance and Adaptive Multimodal Feature Fusion
Abstract:
Background: Brain tumor segmentation has a significant impact on the diagnosis and treatment of brain tumors. Accurate brain tumor segmentation remains challenging due to their irregular shapes, vague boundaries, and high variability. Objective: We propose a brain tumor segmentation method that combines deep learning with prior knowledge derived from a region-growing algorithm. Methods: The proposed method utilizes a multi-scale feature fusion (MSFF) module and adaptive attention mechanisms (AAM) to extract multi-scale features and capture global contextual information. To enhance the model's robustness in low-confidence regions, the Monte Carlo Dropout (MC Dropout) strategy is employed for uncertainty estimation. Results: Extensive experiments demonstrate that the proposed method achieves superior performance on Brain Tumor Segmentation (BraTS) datasets, significantly outperforming various state-of-the-art methods. On the BraTS2021 dataset, the test Dice scores are 89.18% for Enhancing Tumor (ET) segmentation, 93.67% for Whole Tumor (WT) segmentation, and 91.23% for Tumor Core (TC) segmentation. On the BraTS2019 validation set, the validation Dice scores are 87.43%, 90.92%, and 90.40% for ET, WT, and TC segmentation, respectively. Ablation studies further confirmed the contribution of each module to segmentation accuracy, indicating that each component played a vital role in overall performance improvement. Conclusion: This study proposed a novel 3D brain tumor segmentation network based on the U-Net architecture. By incorporating the prior knowledge and employing the uncertainty estimation method, the robustness and performance were improved. The code for the proposed method is available at https://github.com/chenzhao2023/UPMAD_Net_BrainSeg.
Chinese: 本研究提出了一种融合深度学习和先验知识的新型3D脑肿瘤分割方法,通过多尺度特征融合和不确定性估计,在BraTS数据集上实现了优于现有技术的分割精度。
English: This study introduces a novel 3D brain tumor segmentation method combining deep learning with prior knowledge, achieving superior performance on BraTS datasets through multi-scale feature fusion and uncertainty estimation.
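The uncertainty estimation relies on Monte Carlo Dropout: several stochastic forward passes with dropout kept active, where the per-voxel standard deviation flags low-confidence regions. A generic PyTorch sketch (not the authors' code) follows.

```python
import torch

def mc_dropout_predict(model, volume, num_passes=10):
    """Monte Carlo Dropout sketch for voxel-wise uncertainty estimation.

    model:  a segmentation network containing dropout layers.
    volume: input tensor of shape (1, C, D, H, W).
    Returns the mean softmax probability map and its per-voxel standard deviation.
    """
    model.eval()
    # Keep dropout active at inference time while other layers stay in eval mode.
    for module in model.modules():
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d, torch.nn.Dropout3d)):
            module.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(volume), dim=1) for _ in range(num_passes)])
    return probs.mean(dim=0), probs.std(dim=0)
```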

Authors:Qi Gan, Sao Mai Nguyen, Eric Fenaux, Stephan Clémençon, Mounîm El Yacoubi
Title: Polar Coordinate-Based 2D Pose Prior with Neural Distance Field
Abstract:
Human pose capture is essential for sports analysis, enabling precise evaluation of athletes' movements. While deep learning-based human pose estimation (HPE) models from RGB videos have achieved impressive performance on public datasets, their effectiveness in real-world sports scenarios is often hindered by motion blur, occlusions, and domain shifts across different pose representations. Fine-tuning these models can partially alleviate such challenges but typically requires large-scale annotated data and still struggles to generalize across diverse sports environments. To address these limitations, we propose a 2D pose prior-guided refinement approach based on Neural Distance Fields (NDF). Unlike existing approaches that rely solely on angular representations of human poses, we introduce a polar coordinate-based representation that explicitly incorporates joint connection lengths, enabling a more accurate correction of erroneous pose estimations. Additionally, we define a novel non-geodesic distance metric that separates angular and radial discrepancies, which we demonstrate is better suited for polar representations than traditional geodesic distances. To mitigate data scarcity, we develop a gradient-based batch-projection augmentation strategy, which synthesizes realistic pose samples through iterative refinement. Our method is evaluated on a long jump dataset, demonstrating its ability to improve 2D pose estimation across multiple pose representations, making it robust across different domains. Experimental results show that our approach enhances pose plausibility while requiring only limited training data. Code is available at: https://github.com/QGAN2019/polar-NDF.
中文摘要:本文提出了一种基于神经距离场的二维姿态优化方法,通过极坐标表示和新型非测地距离度量,有效提升运动场景中姿态估计的精度,仅需少量训练数据即可增强跨域鲁棒性。
English Summary: This paper introduces a 2D pose refinement method using Neural Distance Fields with polar coordinate representations and a novel non-geodesic distance metric to improve pose estimation accuracy in sports scenarios, requiring minimal training data while enhancing robustness across domains.
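The key representational choices are a polar, per-bone encoding of the 2D pose (keeping explicit connection lengths) and a non-geodesic distance that treats angular and radial errors separately. The sketch below illustrates both under simple assumptions (per-bone mean aggregation, unit weights); the paper's exact metric may differ.

```python
import numpy as np

def to_polar(joints, parents):
    """Convert 2D joints to a polar, bone-wise representation (illustrative layout).

    joints:  (J, 2) array of joint positions.
    parents: length-J list giving each joint's parent index (-1 for the root).
    Returns per-bone (radius, angle) pairs, i.e. connection length and orientation.
    """
    polar = []
    for j, p in enumerate(parents):
        if p < 0:
            continue
        dx, dy = joints[j] - joints[p]
        polar.append((np.hypot(dx, dy), np.arctan2(dy, dx)))
    return np.asarray(polar)                              # (num_bones, 2)

def polar_distance(pose_a, pose_b, w_radial=1.0, w_angular=1.0):
    """Non-geodesic distance scoring radial and angular errors separately (weights assumed)."""
    dr = np.abs(pose_a[:, 0] - pose_b[:, 0])
    dtheta = np.abs(np.arctan2(np.sin(pose_a[:, 1] - pose_b[:, 1]),
                               np.cos(pose_a[:, 1] - pose_b[:, 1])))  # wrapped to [0, pi]
    return float(w_radial * dr.mean() + w_angular * dtheta.mean())
```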

Authors:Yepeng Liu, Wenpeng Lai, Zhou Zhao, Yuxuan Xiong, Jinchi Zhu, Jun Cheng, Yongchao Xu
Title: LiftFeat: 3D Geometry-Aware Local Feature Matching
Abstract:
Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat, which lifts the robustness of raw descriptors by aggregating 3D geometric features. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal labels, supervising the extraction of 3D geometric features in terms of predicted surface normals. We then design a 3D geometry-aware feature lifting module to fuse surface normal features with raw 2D descriptor features. Integrating such 3D geometric features enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at: https://github.com/lyp-deeplearning/LiftFeat.
Chinese: 本文提出了一种名为LiftFeat的轻量级网络,通过融合3D几何特征与2D描述符来提升特征匹配的鲁棒性,在多项视觉任务中展现出优越性能。
English: This paper introduces LiftFeat, a lightweight network that enhances feature matching robustness by integrating 3D geometric features with 2D descriptors, demonstrating superior performance in challenging visual tasks.

Authors:Shanshan Song, Hui Tang, Honglong Yang, Xiaomeng Li
Title: DDaTR: Dynamic Difference-aware Temporal Residual Network for Longitudinal Radiology Report Generation
Abstract:
Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a visual pre-trained encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the information of difference across exams and thus underrepresent the expected progressions, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the dynamic difference-aware module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.
中文: 本文提出了一种动态差异感知时序残差网络(DDaTR),通过专门模块有效捕捉空间和时间相关性,在多个基准测试中显著优于现有方法,提升了纵向放射学报告生成的性能。
English: The paper introduces a dynamic difference-aware temporal residual network (DDaTR) that enhances longitudinal radiology report generation by effectively capturing spatial and temporal correlations through specialized modules, significantly outperforming existing methods on multiple benchmarks.

Authors:Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Yang Li, Minghan Qin, Yu Li, Haoqian Wang
Title: GUAVA: Generalizable Upper Body 3D Gaussian Avatar
Abstract:
Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training on individual IDs, which is both complex and time-consuming. Furthermore, limited by SMPLX's expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling techniques to infer Ubody (upper-body) Gaussians from a single image. The rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA significantly outperforms previous methods in rendering quality and offers significant speed improvements, with reconstruction times in the sub-second range (0.1s), and supports real-time animation and rendering.

Authors:Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Lei Sun, Xiangxiang Chu
Title: FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
Abstract:
Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (e.g., Chinese, Korean, Japanese). To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only 0.1M training examples, a 97% reduction compared to the 2.9M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.
中文:FLUX-Text是一种基于DiT架构的多语言场景文本编辑方法,通过轻量级模块增强字形生成能力,在仅需0.1M训练样本的情况下(比主流方法减少97%),在视觉质量和文本保真度上均优于现有方法。
English: FLUX-Text is a multilingual scene text editing method using a DiT architecture that enhances glyph generation through lightweight modules and achieves superior visual quality and text fidelity with 97% fewer training examples than existing methods.

Authors:Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Abstract:
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks; during this phase, correct reasoning outputs are retained for rejection sampling to refine the model. (3) Finally, incorrectly predicted samples are used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
中文摘要:本文提出首个基于思维链的多模态奖励模型UnifiedReward-Think,通过强化微调方法结合正确与错误推理路径,显著提升了视觉任务中奖励信号的可靠性和鲁棒性。
English Summary: This paper introduces UnifiedReward-Think, the first multimodal reward model incorporating explicit chain-of-thought reasoning to enhance reliability and robustness in vision tasks through a reinforcement fine-tuning approach that leverages both correct and incorrect reasoning paths.
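GRPO-style fine-tuning scores several sampled reasoning chains per prompt and standardizes their rewards within the group. A minimal sketch of that group-relative advantage computation (not the full policy update) follows.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO-style fine-tuning (minimal sketch).

    group_rewards: (G,) rewards for G > 1 sampled reasoning chains for the same prompt.
    Each chain's advantage is its reward standardized within the group, so chains that
    reason better than their siblings are reinforced and worse ones are penalized.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: rewards for four sampled chains on one prompt.
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```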

Authors:Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre
Title: Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach
Abstract:
Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate their comparison, we propose a cost-effective method for predicting a model's performance on multiple downstream tasks without the need for fine-tuning on each one. This method is based on what we call "capabilities encoding." The utility of this novel approach is twofold: we demonstrate its potential to simplify the selection of a foundation model for a given new task, and we employ it to offer a fresh perspective on the existing literature, suggesting avenues for future research. Codes are available at https://github.com/pierreadorni/capabilities-encoding.
中文: 基础模型在计算机视觉领域实现了重大突破,一次训练即可应对多种任务,但遥感领域尚无模型能在所有下游任务中全面领先,因此提出基于“能力编码”的经济高效方法,无需逐项微调即可预测模型性能,简化模型选择并为未来研究指明方向。
English: Foundation models in computer vision enable versatile task handling post-training, yet no single remote sensing model excels across all tasks, prompting a cost-effective performance prediction method using "capabilities encoding" to simplify model selection and guide future research.

Authors:Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Yaoqi Huang, Hongyu Lyu, Nguyen Hoang Khoi Tran, Tzu-Yun Tseng, Stewart Worrall
Title: OccCylindrical: Multi-Modal Fusion with Cylindrical Representation for 3D Semantic Occupancy Prediction
Abstract:
The safe operation of autonomous vehicles (AVs) is highly dependent on their understanding of the surroundings. For this, the task of 3D semantic occupancy prediction divides the space around the sensors into voxels, and labels each voxel with both occupancy and semantic information. Recent perception models have used multisensor fusion to perform this task. However, existing multisensor fusion-based approaches focus mainly on using sensor information in the Cartesian coordinate system. This ignores the distribution of the sensor readings, leading to a loss of fine-grained details and performance degradation. In this paper, we propose OccCylindrical that merges and refines the different modality features under cylindrical coordinates. Our method preserves more fine-grained geometry detail that leads to better performance. Extensive experiments conducted on the nuScenes dataset, including challenging rainy and nighttime scenarios, confirm our approach's effectiveness and state-of-the-art performance. The code will be available at: https://github.com/DanielMing123/OccCylindrical
Chinese: 本文提出OccCylindrical方法,通过在柱坐标系下优化多传感器特征,提升自动驾驶车辆的三维语义占据预测能力,保留更精细的几何细节,并在nuScenes数据集上实现了领先性能。
English: This paper introduces OccCylindrical, a method that enhances 3D semantic occupancy prediction for autonomous vehicles by refining multisensor features in cylindrical coordinates, preserving fine-grained details and achieving state-of-the-art performance on the nuScenes dataset.
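The core idea is to merge and refine features in cylindrical rather than Cartesian coordinates, which better matches the radial distribution of LiDAR returns. The conversion itself is simple, as sketched below; the multi-modal fusion network built on top of it is the paper's contribution and is not shown.

```python
import numpy as np

def cartesian_to_cylindrical(points):
    """Map points (x, y, z) to cylindrical coordinates (rho, phi, z).

    Voxelizing in (rho, phi, z) keeps angular resolution roughly uniform with range,
    which is the representational idea behind OccCylindrical; this helper is only a
    coordinate conversion, not the fusion model itself.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.hypot(x, y)          # radial distance from the sensor axis
    phi = np.arctan2(y, x)        # azimuth angle in (-pi, pi]
    return np.stack([rho, phi, z], axis=1)
```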

Authors:Davide Talon, Federico Girella, Ziyue Liu, Marco Cristani, Yiming Wang
Title: Seeing the Abstract: Translating the Abstract Language for Vision Language Models
Abstract:
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.

Authors:Teng Zhou, Jax Luo, Yuping Sun, Yiheng Tan, Shun Yao, Nazim Haouchine, Scott Raymond
Title: Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation
Abstract:
Accurate MRI-to-CT translation promises the integration of complementary imaging information without the need for additional imaging sessions. Given the practical challenges associated with acquiring paired MRI and CT scans, the development of robust methods capable of leveraging unpaired datasets is essential for advancing the MRI-to-CT translation. Current unpaired MRI-to-CT translation methods, which predominantly rely on cycle consistency and contrastive learning frameworks, frequently encounter challenges in accurately translating anatomical features that are highly discernible on CT but less distinguishable on MRI, such as bone structures. This limitation renders these approaches less suitable for applications in radiation therapy, where precise bone representation is essential for accurate treatment planning. To address this challenge, we propose a path- and bone-contour regularized approach for unpaired MRI-to-CT translation. In our method, MRI and CT images are projected to a shared latent space, where the MRI-to-CT mapping is modeled as a continuous flow governed by neural ordinary differential equations. The optimal mapping is obtained by minimizing the transition path length of the flow. To enhance the accuracy of translated bone structures, we introduce a trainable neural network to generate bone contours from MRI and implement mechanisms to directly and indirectly encourage the model to focus on bone contours and their adjacent regions. Evaluations conducted on three datasets demonstrate that our method outperforms existing unpaired MRI-to-CT translation approaches, achieving lower overall error rates. Moreover, in a downstream bone segmentation task, our approach exhibits superior performance in preserving the fidelity of bone structures. Our code is available at: https://github.com/kennysyp/PaBoT.
中文摘要:本研究提出了一种新颖的无配对MRI到CT转换方法,通过神经常微分方程和骨骼轮廓正则化技术,显著提升了骨骼等关键解剖结构的转换精度,特别适用于放射治疗规划。
English Summary: This study introduces a novel unpaired MRI-to-CT translation method that uses neural ODEs and bone-contour regularization to improve anatomical accuracy, particularly for bone structures critical in radiation therapy.

Authors:Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu
Title: Image Recognition with Online Lightweight Vision Transformer: A Survey
Abstract:
The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet they lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit their real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate relevant methods for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT
中文摘要:本文综述了为图像识别设计轻量级视觉Transformer的策略,重点研究高效组件设计、动态网络和知识蒸馏,以解决计算难题并分析性能权衡,同时展望了未来研究方向。
English Summary: This paper reviews strategies for creating lightweight vision transformers for image recognition, focusing on efficient design, dynamic networks, and knowledge distillation to address computational challenges while analyzing performance trade-offs.
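Of the three areas surveyed, knowledge distillation is the most self-contained to illustrate. Below is the standard logit-distillation objective used by many lightweight ViT training recipes; the temperature and weighting are illustrative defaults, not values prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Standard logit distillation: hard-label cross-entropy plus softened KL to the teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2          # rescale gradients to be temperature-independent
    return alpha * ce + (1 - alpha) * kd
```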

Authors:Lei Wang, Senmao Li, Fei Yang, Jianye Wang, Ziheng Zhang, Yuhan Liu, Yaxing Wang, Jian Yang
Title: Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability
Abstract:
Diffusion models focus on constructing basic image structures in their early stages, while refined details, including local features and textures, are generated in later stages. Thus the same network layers are forced to learn both structural and textural information simultaneously, which differs significantly from traditional deep learning architectures (e.g., ResNet or GANs) that capture or generate image semantic information at different layers. This difference inspires us to explore time-wise diffusion models. We initially investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large parameters) contributes to denoising, substantially improving generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method, termed "MaskUNet", that enhances generation quality with a negligible number of additional parameters. Our method fully leverages timestep- and sample-dependent effective U-Net parameters. To optimize MaskUNet, we offer two fine-tuning strategies: a training-based approach and a training-free approach, including tailored networks and optimization functions. In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score and further demonstrates its effectiveness in downstream task evaluations. Project page: https://gudaochangsheng.github.io/MaskUnet-Page/
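The central observation is that zeroing out selected U-Net parameters at certain timesteps can improve denoising. Below is a minimal sketch of timestep-dependent weight masking, assuming a diffusers-style UNet2DConditionModel interface and externally supplied masks; it illustrates the idea only and is not the MaskUNet training or mask-selection procedure.

```python
import torch

def masked_forward(unet, latents, timestep, encoder_hidden_states, masks):
    """Run one denoising step with selected parameters zeroed out (illustrative sketch).

    masks: dict mapping parameter names to binary tensors of matching shape, e.g. one
           mask per diffusion timestep bucket. Parameters are zeroed for this forward
           pass and restored afterwards.
    """
    params = dict(unet.named_parameters())
    originals = {}
    with torch.no_grad():
        for name, mask in masks.items():
            originals[name] = params[name].detach().clone()
            params[name].mul_(mask)          # zero out the selected entries
    try:
        # Assumes a diffusers-style UNet2DConditionModel call signature.
        out = unet(latents, timestep, encoder_hidden_states=encoder_hidden_states).sample
    finally:
        with torch.no_grad():
            for name, saved in originals.items():
                params[name].copy_(saved)    # restore the original weights
    return out
```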

Authors:Pau Amargant, Peter Hönig, Markus Vincze
Title: Sim2Real Transfer for Vision-Based Grasp Verification
Abstract:
The verification of successful grasps is a crucial aspect of robot manipulation, particularly when handling deformable objects. Traditional methods relying on force and tactile sensors often struggle with deformable and non-rigid objects. In this work, we present a vision-based approach for grasp verification to determine whether the robotic gripper has successfully grasped an object. Our method employs a two-stage architecture: first, a YOLO-based object detection model detects and locates the robot's gripper, and then a ResNet-based classifier determines the presence of an object. To address the limitations of real-world data capture, we introduce HSR-GraspSynth, a synthetic dataset designed to simulate diverse grasping scenarios. Furthermore, we explore the use of Visual Question Answering capabilities as a zero-shot baseline against which we compare our model. Experimental results demonstrate that our approach achieves high accuracy in real-world environments, with potential for integration into grasping pipelines. Code and datasets are publicly available at https://github.com/pauamargant/HSR-GraspSynth.
中文: 本研究提出了一种基于视觉的抓取验证方法,采用YOLO和ResNet双阶段架构并辅以合成数据集HSR-GraspSynth,在现实环境中对可变形物体的机器人抓取实现了高精度识别。
English: This study introduces a vision-based grasp verification method using a two-stage YOLO and ResNet architecture, supplemented by a synthetic dataset HSR-GraspSynth, achieving high real-world accuracy for robotic manipulation of deformable objects.

Authors:Saeed Ebrahimi, Sahar Rahimi, Ali Dabouei, Srinjoy Das, Jeremy M. Dawson, Nasser M. Nasrabadi
Title: GIF: Generative Inspiration for Face Recognition at Scale
Abstract:
Aiming to reduce the computational cost of Softmax in the massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the computational cost still scales linearly with the number of identities in the dataset, only at a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, input-to-label matching is performed through a dot product between the feature vector of the input and the Softmax centroids. Inspired by generative modeling, we present a simple yet effective method that substitutes scalar labels with structured identity codes, i.e., sequences of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic w.r.t. the number of identities. We demonstrate the benefits of the proposed method by conducting experiments. In particular, our method outperforms its competitors by 1.52% and 0.6% at TAR@FAR=1e-4 on IJB-B and IJB-C, respectively, while transforming the association between computational cost and the number of identities from linear to logarithmic. See code at https://github.com/msed-Ebrahimi/GIF
中文摘要:本研究提出一种受生成模型启发的创新方法,通过将标量标签替换为结构化身份代码,将人脸识别的计算成本从与身份数量的线性关系转变为对数关系,并在基准测试中实现了更优的性能表现。
English Summary: This study introduces a generative modeling-inspired method that replaces scalar labels with structured identity codes to transform the computational cost of face recognition from linear to logarithmic relative to the number of identities, achieving superior performance on benchmark datasets.
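The logarithmic cost comes from replacing a single N-way Softmax with a short sequence of small classifications. A toy tokenization that turns a scalar identity label into a fixed-length base-k code is sketched below; the base and the plain positional expansion are illustrative, and the paper's actual tokenization scheme is likely more structured.

```python
import math

def identity_code(label, num_identities, base=16):
    """Tokenize a scalar identity label into a fixed-length sequence of integers.

    The code length grows as ceil(log_base(num_identities)), which is where the
    logarithmic (rather than linear) relationship between classification cost and
    the number of identities comes from.
    """
    length = max(1, math.ceil(math.log(num_identities, base)))
    digits = []
    for _ in range(length):
        digits.append(label % base)
        label //= base
    return digits[::-1]

# Example: 1,000,000 identities -> codes of length 5 over a 16-way vocabulary,
# so the model predicts 5 tokens of 16 classes instead of one 1M-way Softmax.
print(identity_code(123456, num_identities=1_000_000, base=16))  # [1, 14, 2, 4, 0]
```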

Authors:Nikolay Safonov, Alexey Bryncev, Andrey Moskalenko, Dmitry Kulikov, Dmitry Vatolin, Radu Timofte, Haibo Lei, Qifan Gao, Qing Luo, Yaqing Li, Jie Song, Shaozhe Hao, Meisong Zheng, Jingyi Xu, Chengbin Wu, Jiahui Liu, Ying Chen, Xin Deng, Mai Xu, Peipei Liang, Jie Ma, Junjie Jin, Yingxue Pang, Fangzhou Luo, Kai Chen, Shijie Zhao, Mingyang Wu, Renjie Li, Yushen Zuo, Shengyun Zhong, Zhengzhong Tu
Title: NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results
Abstract:
This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such videos. Given the widespread use of UGC on short-form video platforms, this task holds substantial practical importance. The evaluation was based on subjective quality assessment in crowdsourcing, obtaining votes from over 8000 assessors. The challenge attracted more than 25 teams submitting solutions, 7 of which passed the final phase with source code verification. The outcomes may provide insights into the state-of-the-art in UGC video enhancement and highlight emerging trends and effective strategies in this evolving research area. All data, including the processed videos and subjective comparison votes and scores, is made publicly available at https://github.com/msu-video-group/NTIRE25_UGC_Video_Enhancement.
中文: 本文介绍了NTIRE 2025用户生成内容视频增强挑战赛,该赛事吸引了25支以上团队开发算法以改善现实世界中质量退化的视频,并通过大规模主观评估进行测试,推动了用户生成内容质量提升的实用解决方案。
English: This paper introduces the NTIRE 2025 UGC Video Enhancement Challenge, which engaged over 25 teams in developing algorithms to improve real-world degraded videos and evaluated them through large-scale subjective assessments, advancing practical solutions for user-generated content quality.

Authors:Anjila Budathoki, Manish Dhakal
Title: Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation
Abstract:
Adversarial attacks have been fairly well explored for computer vision and vision-language models. However, adversarial attacks on vision-language segmentation models (VLSMs) remain under-explored, especially for medical image analysis. Thus, we investigate the robustness of VLSMs against adversarial attacks for 2D medical images across different modalities: radiology, photography, and endoscopy. The main idea of this project is to assess the robustness of fine-tuned VLSMs in the medical domain, where the risk associated with failures is high. First, we fine-tune pre-trained VLSMs for medical image segmentation with adapters. Then, we employ adversarial attacks -- projected gradient descent (PGD) and fast gradient sign method (FGSM) -- on the fine-tuned models to determine their robustness against adversaries. We report the models' performance decline to analyze the impact of the adversaries. The results exhibit significant drops in the DSC and IoU scores after the introduction of these adversaries. Furthermore, we also explored universal perturbations but were not able to find one for the medical images. Code: https://github.com/anjilab/secure-private-ai
中文: 本研究探讨了在医学影像中微调后的视觉语言分割模型对抗性攻击的鲁棒性,发现尽管未能成功生成通用扰动,但在遭受PGD和FGSM攻击时模型性能出现显著下降。
English: This study investigates the robustness of fine-tuned vision-language segmentation models against adversarial attacks in medical imaging, revealing significant performance declines when subjected to PGD and FGSM attacks despite unsuccessful attempts to generate universal perturbations.
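For reference, the FGSM attack used in this evaluation perturbs the input in the direction of the sign of the loss gradient. The sketch below shows a generic single-step FGSM against a segmentation model; the model, loss, and epsilon value are placeholders, not the paper's exact setup.

```python
# Generic single-step FGSM attack on a segmentation model (illustrative sketch).
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, mask, epsilon=8 / 255):
    """Return an adversarial image within an L-infinity ball of radius epsilon."""
    image = image.clone().detach().requires_grad_(True)
    logits = model(image)                        # (B, C, H, W) segmentation logits
    loss = F.cross_entropy(logits, mask)         # mask: (B, H, W) class indices
    loss.backward()
    adv = image + epsilon * image.grad.sign()    # step in the gradient-sign direction
    return adv.clamp(0.0, 1.0).detach()          # keep pixels in the valid range

# PGD is the iterative variant: repeat a smaller FGSM step and re-project onto the
# epsilon-ball around the original image after every iteration.
```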

Authors:Ruiqi Wang, Hao Zhang
Title: RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Abstract:
We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting input expressions that are more general than what prior works were designed to handle. Specifically, our inputs encompass both object- and part-level labels as well as implicit references pointing to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESAnything, leverages Chain-of-Thoughts (CoT) reasoning, where the key idea is attribute prompting. We generate detailed descriptions of object/part attributes including shape, color, and location for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundational image segmentation model. Our approach encourages deep reasoning about object or part attributes related to function, style, design, etc., enabling the system to handle implicit queries without any part annotations for training or fine-tuning. As the first zero-shot and LLM-based RES method, RESAnything achieves clearly superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on challenging scenarios involving implicit queries and complex part-level relations. Finally, we contribute a new benchmark dataset to offer ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.

Authors:Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
Title: R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Abstract:
Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data samples from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
中文: 本文提出StableReinforce算法,通过优化训练损失、优势估计和奖励设计来稳定强化学习训练,在构建的多模态偏好数据集上训练的奖励模型在各项基准测试中取得了显著性能提升。
English: This paper introduces StableReinforce, a refined reinforcement learning algorithm that stabilizes training and enhances multimodal reward models, achieving significant performance improvements on benchmarks with collected preference data.

Authors:Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang
Title: No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
Abstract:
Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches necessitate either introducing an external, complex representation training framework or relying on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple and straightforward method that obtains representation guidance in a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer at an earlier layer (with higher noise) to that at a later layer (with lower noise), progressively enhancing overall representation learning during the generative training process alone. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that are heavily dependent on powerful external representation priors.
中文摘要:最新研究表明,学习有意义的内部表征可提升扩散变换器的训练效率与生成质量,由此提出的自表征对齐方法无需外部组件,仅通过跨噪声层级的潜在表征对齐即实现显著性能提升,效果媲美依赖强外部先验的方法。
English Summary: Recent research shows that learning meaningful internal representations improves diffusion transformer training and generation quality, leading to the Self-Representation Alignment method that aligns latent representations across noise levels without external components, achieving significant performance gains.
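As described, SRA aligns an earlier-layer latent (computed at higher noise) to a later-layer latent (computed at lower noise) during generative training. The code below is a hedged sketch of one way such an alignment loss could look; the projection head, the stop-gradient on the target, and the choice of layers are assumptions, not details confirmed by the abstract.

```python
# Sketch of a self-representation alignment loss between two transformer layers.
# Assumptions: the later-layer/lower-noise feature acts as a stop-gradient teacher,
# and a small projection head maps the student feature into the target space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfRepresentationAlignment(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, early_feat: torch.Tensor, late_feat: torch.Tensor) -> torch.Tensor:
        """early_feat: (B, N, D) from an earlier layer at higher noise (student);
        late_feat:  (B, N, D) from a later layer at lower noise (teacher)."""
        student = F.normalize(self.proj(early_feat), dim=-1)
        teacher = F.normalize(late_feat.detach(), dim=-1)       # stop-gradient target
        return (1.0 - (student * teacher).sum(dim=-1)).mean()   # negative cosine similarity

# Usage inside generative training (schematic):
#   total_loss = diffusion_loss + lam * sra(h_early, h_late)
```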

Authors:Zinan Guo, Pengze Zhang, Yanze Wu, Chong Mou, Songtao Zhao, Qian He
Title: MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing
Abstract:
Current multi-subject customization approaches encounter two critical challenges: the difficulty in acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR - a simple yet effective framework to achieve robust multi-subject customization while requiring only single-subject training data. Firstly, to break the data limitation, we introduce debiased diptych learning. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction via static attention routing and dual-branch LoRA. Secondly, to eliminate cross-subject entanglement, we introduce dynamic attention routing mechanism, which adaptively establishes bijective mappings between generated images and conditional subjects. This design not only achieves decoupling of multi-subject representations but also maintains scalable generalization performance with increasing reference subjects. Comprehensive experiments demonstrate that our MUSAR outperforms existing methods - even those trained on multi-subject dataset - in image quality, subject consistency, and interaction naturalness, despite requiring only single-subject dataset.
中文:提出的MUSAR框架通过去偏双联学习和动态注意力路由机制,仅需单主体训练数据即可解决多主体定制中的数据获取困难和属性纠缠问题,实现卓越的多主体生成效果。
English: The proposed MUSAR framework effectively addresses multi-subject customization challenges by using single-subject training data through debiased diptych learning and dynamic attention routing to eliminate data limitations and attribute entanglement.

Authors:Yankai Jiang, Peng Zhang, Donglin Yang, Yuan Tian, Hai Lin, Xiaosong Wang
Title: Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
Abstract:
We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. Codes are available at https://github.com/Yankai96/DiffuGTS.
Chinese: 本文提出DiffuGTS框架,通过利用冻结的医学基础扩散模型生成异常感知注意力图,并采用残差学习优化分割掩码,实现了跨多种数据集和肿瘤类别的零样本泛化肿瘤分割,性能显著优于现有先进方法。
English: This paper introduces DiffuGTS, a novel framework that utilizes frozen medical foundation diffusion models for zero-shot generalizable tumor segmentation by generating anomaly-aware attention maps and refining masks through residual learning, achieving superior performance across multiple datasets and tumor categories.

Authors:Binghong Chen, Tingting Chai, Wei Jiang, Yuanrong Xu, Guanglu Zhou, Xiangqian Wu
Title: Multi-View Learning with Context-Guided Receptance for Image Denoising
Abstract:
Image denoising is essential in low-level vision applications such as photography and automated driving. Existing methods struggle to distinguish complex noise patterns in real-world scenes and consume significant computational resources due to their reliance on Transformer-based models. In this work, we propose the Context-guided Receptance Weighted Key-Value model, combining enhanced multi-view feature integration with efficient sequence modeling. Our approach introduces the Context-guided Token Shift (CTS) paradigm, which effectively captures local spatial dependencies and enhances the model's ability to model real-world noise distributions. Additionally, the Frequency Mix (FMix) module is designed to extract frequency-domain features and isolate noise in the high-frequency spectrum, and it is integrated with spatial representations through a multi-view learning process. To improve computational efficiency, the Bidirectional WKV (BiWKV) mechanism is adopted, enabling full pixel-sequence interaction with linear complexity while overcoming causal selection constraints. The model is validated on multiple real-world image denoising datasets, outperforming existing state-of-the-art methods quantitatively and reducing inference time by up to 40%. Qualitative results further demonstrate the ability of our model to restore fine details in various scenes.
Chinese: 本文提出用于图像去噪的上下文引导接收加权键值模型,通过创新机制整合空间和频率特征,有效处理真实世界噪声模式,同时相比现有方法将推理时间减少高达40%。
English: This paper introduces the Context-guided Receptance Weighted Key-Value model for image denoising, which integrates spatial and frequency features through novel mechanisms to effectively handle real-world noise patterns while reducing inference time by up to 40% compared to existing methods.

Authors:Vincent Brebion, Julien Moreau, Franck Davoine
Title: DELTA: Dense Depth from Events and LiDAR using Transformer's Attention
Abstract:
Event cameras and LiDARs provide complementary yet distinct data: respectively, asynchronous detections of changes in lighting versus sparse but accurate depth information at a fixed rate. To this day, few works have explored the combination of these two modalities. In this article, we propose a novel neural-network-based method for fusing event and LiDAR data in order to estimate dense depth maps. Our architecture, DELTA, exploits the concepts of self- and cross-attention to model the spatial and temporal relations within and between the event and LiDAR data. Following a thorough evaluation, we demonstrate that DELTA sets a new state of the art in the event-based depth estimation problem, and that it is able to reduce the errors up to four times for close ranges compared to the previous SOTA.
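DELTA's core ingredient is attention within and across the two modalities. The following is a minimal, generic sketch of cross-attention between event-derived tokens and LiDAR-derived tokens using PyTorch's built-in multi-head attention; the token shapes and dimensionality are illustrative, not the architecture from the paper.

```python
# Generic self- and cross-attention fusion of two token streams (illustrative sketch).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, event_tokens, lidar_tokens):
        """event_tokens: (B, N_e, D) from the event stream; lidar_tokens: (B, N_l, D)."""
        x, _ = self.self_attn(event_tokens, event_tokens, event_tokens)  # within-modality
        x = self.norm1(x + event_tokens)
        y, _ = self.cross_attn(x, lidar_tokens, lidar_tokens)            # event queries attend to LiDAR
        return self.norm2(y + x)                                         # fused tokens for a depth head

fused = CrossModalFusion()(torch.randn(2, 1024, 256), torch.randn(2, 512, 256))
```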

Authors:Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Title: Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Abstract:
Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
中文摘要:本综述探讨了多模态理解与图像生成模型的统一路径,通过分析三种主流架构范式并指出关键挑战,为这一新兴领域的未来研究提供指导。
English Summary: This survey explores the unification of multimodal understanding and image generation models, analyzing three architectural paradigms and addressing key challenges to guide future research in this emerging field.

Authors:Xiongjun Guan, Zhiyu Pan, Jianjiang Feng, Jie Zhou
Title: Finger Pose Estimation for Under-screen Fingerprint Sensor
Abstract:
Two-dimensional pose estimation plays a crucial role in fingerprint recognition by facilitating global alignment and reducing pose-induced variations. However, existing methods are still unsatisfactory when handling large-angle or small-area inputs. These limitations are particularly pronounced on fingerprints captured by under-screen fingerprint sensors in smartphones. In this paper, we present a novel dual-modal-input network for under-screen fingerprint pose estimation. Our approach effectively integrates two distinct yet complementary modalities: texture details extracted from ridge patches through the under-screen fingerprint sensor, and rough contours derived from capacitive images obtained via the touch screen. This collaborative integration endows our network with more comprehensive and discriminative information, substantially improving the accuracy and stability of pose estimation. A decoupled probability distribution prediction task is designed, instead of the traditional supervised forms of numerical regression or heatmap voting, to facilitate the training process. Additionally, we incorporate a Mixture of Experts (MoE) based feature fusion mechanism and a relationship-driven cross-domain knowledge transfer strategy to further strengthen feature extraction and fusion capabilities. Extensive experiments are conducted on several public datasets and two private datasets. The results indicate that our method is significantly superior to previous state-of-the-art (SOTA) methods and remarkably boosts the recognition ability of fingerprint recognition algorithms. Our code is available at https://github.com/XiongjunGuan/DRACO.
中文摘要:本文提出了一种新颖的双模态输入网络,通过整合来自屏下指纹传感器的脊线纹理细节和触摸屏电容图像的粗略轮廓,显著提高了指纹姿态估计的准确性和稳定性,在多个数据集上均优于现有最优方法。
English Summary: This paper introduces a novel dual-modal network that integrates texture details from ridge patches and rough contours from capacitive images to significantly enhance the accuracy and stability of under-screen fingerprint pose estimation, outperforming previous state-of-the-art methods.

Authors:Inclusion AI, Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang
Title: Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Abstract:
We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing novel multi-scale learnable tokens and a multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressively fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones - such as ChatGPT-4o with native image generation updated on March 25, 2025 - underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in its alpha stage and will soon be further refined.
中文:Ming-Lite-Uni是一个开源多模态框架,通过原生自回归模型和视觉生成器统一视觉与语言,能够执行文本到图像生成和指令式图像编辑任务,实验证明其具有卓越性能。
English: Ming-Lite-Uni is an open-source multimodal framework that unifies vision and language through a native autoregressive model and visual generator, enabling text-to-image generation and image editing while demonstrating strong performance in experiments.

Authors:Zichen Liu, Xu Zou, Gang Hua, Jiahuan Zhou
Title: Token Coordinated Prompt Attention is Needed for Visual Prompting
Abstract:
Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViTs) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViTs. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of the CLS token (global information aggregation) and image tokens (local feature extraction), we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with the CLS token and image tokens, respectively, through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise attention interactions, improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code is available at https://github.com/zhoujiahuan1991/ICML2025-TCPA.
中文: 提出的令牌协调提示注意力(TCPA)模块通过为不同令牌分配专用提示——CLS提示用于全局信息,图像提示用于局部特征,解决了现有视觉提示方法的局限性,通过基于注意力的交互提升了特征的多样性和判别力。
English: The proposed Token Coordinated Prompt Attention (TCPA) module addresses limitations in existing visual prompting methods by assigning specialized prompts to different tokens—CLS Prompts for global information and Image Prompts for local features—enhancing feature diversity and discriminative power through attention-based interactions.

Authors:Sungheon Jeong, Jihong Park, Mohsen Imani
Title: Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
Abstract:
Most existing video anomaly detectors rely solely on RGB frames, which lack the temporal resolution needed to capture abrupt or transient motion cues, key indicators of anomalous events. To address this limitation, we propose Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that synthesizes event representations directly from RGB videos and fuses them with image features through a principled, uncertainty-aware process. The system (i) models heavy-tailed sensor noise with a Student's t likelihood, deriving value-level inverse-variance weights via a Laplace approximation; (ii) applies Kalman-style frame-wise updates to balance modalities over time; and (iii) iteratively refines the fused latent state to erase residual cross-modal noise. Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new state of the art across multiple real-world anomaly detection benchmarks. These findings highlight the utility of synthetic event representations in emphasizing motion cues that are often underrepresented in RGB frames, enabling accurate and robust video understanding across diverse applications without requiring dedicated event sensors. Code and models are available at https://github.com/EavnJeong/IEF-VAD.
中文: 提出的IEF-VAD框架通过从RGB视频合成事件表征,并采用不确定性感知方法将其与图像特征融合,无需专用事件传感器即可实现最先进的视频异常检测性能。
English: The proposed IEF-VAD framework enhances video anomaly detection by synthesizing event representations from RGB videos and fusing them with image features through an uncertainty-aware process, achieving state-of-the-art performance without requiring dedicated event sensors.
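The fusion step described above weights each modality by an estimate of its reliability. The snippet below sketches only the generic inverse-variance idea: each feature value gets a variance estimate, and the fused value is the precision-weighted average. The Student's-t/Laplace-approximation machinery and Kalman-style temporal updates from the paper are omitted.

```python
# Precision-weighted (inverse-variance) fusion of image and event features (sketch).
import torch

def uncertainty_weighted_fusion(img_feat, img_var, evt_feat, evt_var, eps=1e-6):
    """Fuse two feature tensors of identical shape using inverse-variance weights.
    Smaller variance (higher confidence) gives a modality more influence per value."""
    w_img = 1.0 / (img_var + eps)
    w_evt = 1.0 / (evt_var + eps)
    fused = (w_img * img_feat + w_evt * evt_feat) / (w_img + w_evt)
    fused_var = 1.0 / (w_img + w_evt)   # variance of the fused estimate
    return fused, fused_var

img, evt = torch.randn(4, 512), torch.randn(4, 512)
fused, var = uncertainty_weighted_fusion(
    img, torch.full_like(img, 0.2), evt, torch.full_like(evt, 0.8)
)
```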

Authors:Huangyue Yu, Baoxiong Jia, Yixin Chen, Yandan Yang, Puhao Li, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang
Title: MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans
Abstract:
Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates the precise replication of real-world object diversity. Existing datasets demonstrate that this process heavily relies on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15,366 objects spanning 831 fine-grained categories. Then, we introduce Scan2Sim, a robust multi-modal alignment model, which enables the automated, high-quality replacement of assets, thereby eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small item layouts for robotic manipulation and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScenes' potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: https://meta-scenes.github.io/.
中文摘要:MetaScenes通过基于真实扫描的大规模可模拟3D场景数据集和Scan2Sim自动资产替换模型,解决了艺术家驱动设计的可扩展性难题,为具身AI提供了增强泛化能力和仿真到现实应用的新可能。
English Summary: MetaScenes introduces a large-scale, simulatable 3D scene dataset from real-world scans and Scan2Sim, an automated asset replacement model, to overcome scalability challenges in artist-driven designs and enhance Embodied AI through improved generalization and sim-to-real applications.

Authors:Ming Li, Xin Gu, Fan Chen, Xiaoying Xing, Longyin Wen, Chen Chen, Sijie Zhu
Title: SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
Abstract:
Due to the challenges of manually collecting accurate editing data, existing datasets are typically constructed using various automated methods, leading to noisy supervision signals caused by the mismatch between editing instructions and original-edited image pairs. Recent efforts attempt to improve editing models through generating higher-quality edited images, pre-training on recognition tasks, or introducing vision-language models (VLMs) but fail to resolve this fundamental issue. In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. This includes rectifying the editing instructions to better align with the original-edited image pairs and using contrastive editing instructions to further enhance their effectiveness. Specifically, we find that editing models exhibit specific generation attributes at different inference steps, independent of the text. Based on these prior attributes, we define a unified guide for VLMs to rectify editing instructions. However, there are some challenging editing scenarios that cannot be resolved solely with rectified instructions. To this end, we further construct contrastive supervision signals with positive and negative instructions and introduce them into the model training using triplet loss, thereby further facilitating supervision effectiveness. Our method does not require the VLM modules or pre-training tasks used in previous work, offering a more direct and efficient way to provide better supervision signals, and providing a novel, simple, and effective solution for instruction-based image editing. Results on multiple benchmarks demonstrate that our method significantly outperforms existing approaches. Compared with previous SOTA SmartEdit, we achieve 9.19% improvements on the Real-Edit benchmark with 30x less training data and 13x smaller model size.
中文: 本文提出了一种新颖的图像编辑指令优化方法,通过校正指令和构建对比监督信号来提升编辑效果,在显著减少训练数据和模型规模的同时,实现了优于现有方法的性能表现。
English: This paper introduces a novel approach to improve instruction-based image editing by constructing more effective editing instructions through rectification and contrastive supervision, achieving superior performance with significantly reduced training data and model size compared to existing methods.
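The contrastive-supervision idea pairs each image pair with a positive (rectified) instruction and a negative instruction, pulling the model's edit representation toward the former. A minimal triplet-loss sketch is shown below; the encoder interfaces and margin are hypothetical placeholders rather than the paper's actual training code.

```python
# Triplet loss over positive vs. negative editing instructions (illustrative sketch).
import torch
import torch.nn.functional as F

def instruction_triplet_loss(edit_embed, pos_instr_embed, neg_instr_embed, margin=0.2):
    """edit_embed:      (B, D) embedding of the actual original->edited change
       pos_instr_embed: (B, D) embedding of the rectified (correct) instruction
       neg_instr_embed: (B, D) embedding of a contrastive (wrong) instruction"""
    anchor = F.normalize(edit_embed, dim=-1)
    pos = F.normalize(pos_instr_embed, dim=-1)
    neg = F.normalize(neg_instr_embed, dim=-1)
    d_pos = 1.0 - (anchor * pos).sum(-1)   # cosine distance to the positive instruction
    d_neg = 1.0 - (anchor * neg).sum(-1)   # cosine distance to the negative instruction
    return F.relu(d_pos - d_neg + margin).mean()

loss = instruction_triplet_loss(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768))
```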

Authors:Bobo Lian, Dandan Wang, Chenjian Wu, Minxin Chen
Title: Sparse Ellipsoidal Radial Basis Function Network for Point Cloud Surface Representation
Abstract:
Point cloud surface representation is a fundamental problem in computer graphics and vision. This paper presents a machine learning approach for approximating the signed distance function (SDF) of a point cloud using a sparse ellipsoidal radial basis function network, enabling a compact and accurate surface representation. Given the SDF values defined on the grid points constructed from the point cloud, our method approximates the SDF accurately with as few ellipsoidal radial basis functions (ERBFs) as possible, i.e., represents the SDF of a point cloud by sparse ERBFs. To balance sparsity and approximation precision, a dynamic multi-objective optimization strategy is introduced, which adaptively adds the regularization terms and jointly optimizes the weights, centers, shapes, and orientations of ERBFs. To improve computational efficiency, a nearest-neighbor-based data structure is employed, restricting function calculations to points near each Gaussian kernel center. The computations for each kernel are further parallelized on CUDA, which significantly improves the optimization speed. Additionally, a hierarchical octree-based refinement strategy is designed for training. Specifically, the initialization and optimization of network parameters are conducted using coarse grid points in the octree lattice structure. Subsequently, fine lattice points are progressively incorporated to accelerate model convergence and enhance training efficiency. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms previous sparse representation approaches in terms of accuracy, robustness, and computational efficiency. The corresponding executable program is publicly available at https://github.com/lianbobo/SE-RBFNet.git.
Chinese: 本文提出一种机器学习方法,利用稀疏椭球径向基函数网络精确高效地逼近点云的符号距离函数,在精度、鲁棒性和计算效率方面均优于现有稀疏表示方法。
English: This paper introduces a machine learning method that uses a sparse ellipsoidal radial basis function network to accurately and efficiently approximate the signed distance function of point clouds, achieving superior performance in accuracy, robustness, and computational efficiency compared to previous approaches.
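An ellipsoidal RBF generalizes the isotropic Gaussian kernel by using a full (anisotropic, rotated) quadratic form in the exponent. The sketch below evaluates a sparse sum of such kernels at query points; the parameterization details from the paper (how shapes and rotations are optimized, the multi-objective sparsity schedule, the CUDA/octree machinery) are not shown.

```python
# Evaluate an SDF approximated by a sparse sum of ellipsoidal radial basis functions.
import numpy as np

def erbf_sdf(query, weights, centers, precisions):
    """query:      (Q, 3) evaluation points
       weights:    (K,)   per-kernel weights
       centers:    (K, 3) kernel centers
       precisions: (K, 3, 3) symmetric positive-definite matrices (inverse ellipsoid shapes)
       Returns the approximated SDF value at each query point."""
    diff = query[:, None, :] - centers[None, :, :]               # (Q, K, 3)
    quad = np.einsum("qki,kij,qkj->qk", diff, precisions, diff)  # anisotropic squared distance
    return (weights[None, :] * np.exp(-quad)).sum(axis=1)        # (Q,)

# Tiny example with two kernels.
w = np.array([1.0, -0.5])
c = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P = np.stack([np.diag([4.0, 1.0, 1.0]), np.eye(3)])
print(erbf_sdf(np.array([[0.1, 0.0, 0.0]]), w, c, P))
```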

Authors:Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong
Title: VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
Abstract:
Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.
中文摘要:VAEmo是一个两阶段框架,首先通过无情感标签的预训练学习统一视听表征,随后利用生成的情感描述注入语义信息,以紧凑设计在多个人机交互基准上实现了最先进的视听情感识别性能。
English Summary: VAEmo is a two-stage framework that first learns unified audiovisual representations without emotion labels and then injects emotion-aware semantics through generated textual descriptions, achieving state-of-the-art performance in audiovisual emotion recognition with efficient cross-modal encoding.

Authors:Zhichuan Wang, Yang Zhou, Jinhai Xiang, Yulong Wang, Xinwei He
Title: TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment
Abstract:
Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings with an iterative optimization strategy by confident query-target sample pairs in a self-boosting manner. Additionally, TeDA integrates textual descriptions generated by a multimodal language model (InternVL) to enhance 3D object understanding, leveraging CLIP's aligned feature space to fuse visual and textual cues. Extensive experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods, even those requiring extensive training. We also experimented with depth maps on Objaverse-LVIS, further validating its effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.
中文: 本文提出测试时分布对齐(TeDA)框架,通过多视角投影和迭代优化将预训练的CLIP模型适配于零样本3D物体检索,无需大量训练即可超越现有最优方法。
English: This paper introduces Testing-time Distribution Alignment (TeDA), a novel framework that adapts the pre-trained CLIP model for zero-shot 3D object retrieval by aligning 2D and 3D distributions through multi-view projections and iterative optimization, outperforming existing methods without requiring extensive training.

Authors:Madhukar Reddy Vongala, Saurabh Srivastava, Jana Košecká
Title: Compositional Image-Text Matching and Retrieval by Grounding Entities
Abstract:
Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform entity grounding and compositional image and text matching [Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR, learninglocalizeCVPR24]. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors and dynamically adjust the baseline global image embedding. The final embedding is obtained by computing a weighted combination of the sub-image embeddings. The resulting embedding is then used for similarity computation with the text embedding, yielding an average 1.5% improvement in image-text matching accuracy on the Visual Genome and SVO Probes datasets [krishna2017visualgenome, svo]. Notably, the enhanced embeddings demonstrate superior retrieval performance, achieving significant gains on the Flickr30K and MS-COCO retrieval benchmarks [flickr30ke, mscoco] and improving the state-of-the-art Recall@1 by 12% and 0.4%, respectively. Our code is available at https://github.com/madhukarreddyvongala/GroundingCLIP.
中文: 本研究提出一种无需学习的CLIP嵌入增强方法,通过结合开放词汇检测器定位的物体和关系子图像嵌入来改进组合式图文匹配,在多个基准测试中显著提升了准确率。
English: This study introduces a learning-free enhancement to CLIP embeddings that improves compositional image-text matching by incorporating localized object and relation embeddings from open-vocabulary detectors, achieving notable accuracy gains on multiple benchmarks.
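The augmentation itself is training-free: the global CLIP image embedding is combined with embeddings of detector-localized sub-images before computing similarity with the text embedding. The sketch below shows that weighted combination in the abstract's terms; the weighting scheme (a single alpha) and the detector call are placeholders, not the paper's exact formulation.

```python
# Training-free compositional adjustment of a CLIP image embedding (illustrative sketch).
import torch
import torch.nn.functional as F

def composed_image_embedding(global_emb, subimage_embs, alpha=0.5):
    """global_emb:    (D,) CLIP embedding of the full image
       subimage_embs: (M, D) CLIP embeddings of detector-localized entity/relation crops
       alpha:         weight given to the global embedding (hypothetical value)."""
    sub_mean = F.normalize(subimage_embs, dim=-1).mean(dim=0)
    combined = alpha * F.normalize(global_emb, dim=-1) + (1.0 - alpha) * sub_mean
    return F.normalize(combined, dim=-1)

def image_text_score(image_emb, text_emb):
    """Cosine similarity used for matching and retrieval."""
    return (image_emb * F.normalize(text_emb, dim=-1)).sum()

score = image_text_score(
    composed_image_embedding(torch.randn(512), torch.randn(3, 512)),
    torch.randn(512),
)
```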

Authors:Wenchuan Wang, Mengqi Huang, Yijing Tu, Zhendong Mao
Title: DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
Abstract:
Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention by focusing on identity and motion consistency. Existing works typically follow the isolated customized paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to construct interdependencies between dimensions collaboratively. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically switches the training step (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive evaluation benchmark than existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion metrics. Page: https://wenc-k.github.io/dualreal-customization

Authors:Yamini Sri Krubha, Aryana Hou, Braden Vester, Web Walker, Xin Wang, Li Lin, Shu Hu
Title: Robust AI-Generated Face Detection with Imbalanced Data
Abstract:
Deepfakes, created using advanced AI techniques such as Variational Autoencoders and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at https://github.com/Purdue-M2/SP_CUP.
中文: 基于先进AI的深度伪造技术严重威胁数字信任,尽管检测方法已从局部特征分析发展到全局异常捕捉,但仍面临分布偏移和类别不平衡的挑战,因此提出结合动态损失重加权和排序优化的新框架,以提升泛化能力和检测性能。
English: Deepfakes created with advanced AI pose serious threats to digital trust, and while detection methods have improved, they still struggle with distribution shifts and class imbalance, leading to the proposal of a new framework combining dynamic loss reweighting and ranking-based optimization for better generalization and performance.

Authors:Tao Zhu, Qi Yu, Xinru Dong, Shiyu Li, Yue Liu, Jinlong Jiang, Lei Shu
Title: ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications
Abstract:
Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP. Code is available at https://github.com/modadundun/ProDisc-VAD.
中文:ProDisc-VAD是一种高效框架,通过结合基于原型的正常性建模和伪实例对比学习,解决了弱监督视频异常检测中的标签模糊问题,并以极少的参数量实现了领先性能。
English: ProDisc-VAD is an efficient framework that addresses label ambiguity in weakly-supervised video anomaly detection by combining prototype-based normality modeling and pseudo-instance contrastive learning, achieving state-of-the-art performance with minimal parameters.
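The PIDE idea restricts contrastive learning to the most reliable pseudo-instances, i.e. the snippets with the highest and lowest anomaly scores in each bag. A hedged sketch of that selection plus a simple separability objective is given below; the exact loss form and the number of selected instances are assumptions, not the paper's implementation.

```python
# Select extreme-scoring instances in each video bag and push their scores apart (sketch).
import torch
import torch.nn.functional as F

def pseudo_instance_loss(scores, k=3, margin=0.8):
    """scores: (B, T) per-snippet anomaly scores for a batch of videos (bags).
    Uses the top-k snippets as pseudo-anomalous and the bottom-k as pseudo-normal."""
    top_scores = scores.topk(k, dim=1).values            # most anomalous snippets per video
    bot_scores = (-scores).topk(k, dim=1).values.neg()   # least anomalous snippets per video
    # Encourage a margin between the two groups (a simple separability objective).
    return F.relu(margin - (top_scores.mean(dim=1) - bot_scores.mean(dim=1))).mean()

loss = pseudo_instance_loss(torch.rand(4, 32))
```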

Authors:Shubhendu Jena, Amine Ouasfi, Mae Younes, Adnane Boukhayma
Title: Sparfels: Fast Reconstruction from Sparse Unposed Imagery
Abstract:
We present a method for sparse-view reconstruction with surface element splatting that runs within 3 minutes on a consumer-grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations, to instantiate a bundle-adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization during 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment during training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting on reconstruction and novel view benchmarks based on established multi-view datasets.

Authors:Shubhendu Jena, Shishir Reddy Vutukur, Adnane Boukhayma
Title: SparSplat: Fast Multi-View Reconstruction with Generalizable 2D Gaussian Splatting
Abstract:
Recovering 3D information from scenes via multi-view stereo reconstruction (MVS) and novel view synthesis (NVS) is inherently challenging, particularly in scenarios involving sparse-view setups. The advent of 3D Gaussian Splatting (3DGS) enabled real-time, photorealistic NVS. Following this, 2D Gaussian Splatting (2DGS) leveraged perspective-accurate 2D Gaussian primitive rasterization to achieve accurate geometry representation during rendering, improving 3D scene reconstruction while maintaining real-time performance. Recent approaches have tackled the problem of sparse real-time NVS using 3DGS within a generalizable, MVS-based learning framework to regress 3D Gaussian parameters. Our work extends this line of research by addressing the challenge of generalizable sparse 3D reconstruction and NVS jointly, and manages to perform successfully at both tasks. We propose an MVS-based learning pipeline that regresses 2DGS surface element parameters in a feed-forward fashion to perform 3D shape reconstruction and NVS from sparse-view images. We further show that our generalizable pipeline can benefit from preexisting foundational multi-view deep visual features. The resulting model attains state-of-the-art results on the DTU sparse 3D reconstruction benchmark in terms of Chamfer distance to ground truth, as well as state-of-the-art NVS. It also demonstrates strong generalization on the BlendedMVS and Tanks and Temples datasets. We note that our model outperforms the prior state of the art in feed-forward sparse-view reconstruction based on volume rendering of implicit representations, while offering almost two orders of magnitude higher inference speed.

Authors:Xiaorui Zhao, Xinyue Zhou, Peibei Cao, Junyu Lou, Shuhang Gu
Title: HiLLIE: Human-in-the-Loop Training for Low-Light Image Enhancement
Abstract:
Developing effective approaches to generate enhanced results that align well with human visual preferences for high-quality, well-lit images remains a challenge in low-light image enhancement (LLIE). In this paper, we propose HiLLIE, a human-in-the-loop LLIE training framework that improves the visual quality of unsupervised LLIE model outputs through iterative training stages. At each stage, we introduce human guidance into the training process through efficient visual quality annotations of enhanced outputs. Subsequently, we employ a tailored image quality assessment (IQA) model to learn the human visual preferences encoded in the acquired labels, which is then utilized to guide the training process of an enhancement model. With only a small number of pairwise ranking annotations required at each stage, our approach continually improves the IQA model's capability to simulate human visual assessment of enhanced outputs, thus leading to visually appealing LLIE results. Extensive experiments demonstrate that our approach significantly improves unsupervised LLIE model performance in both quantitative and qualitative terms. The code and collected ranking dataset will be available at https://github.com/LabShuHangGU/HiLLIE.
中文摘要:本文提出HiLLIE人机交互训练框架,通过少量排序标注和定制化质量评估模型迭代学习人类视觉偏好,显著提升无监督低光照图像增强的视觉质量。
English Summary: This paper introduces HiLLIE, a human-in-the-loop training framework that iteratively improves low-light image enhancement by incorporating human visual preferences through minimal ranking annotations and a tailored quality assessment model.
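The IQA model in this loop only needs to agree with human pairwise preferences, so a margin-based ranking loss over annotated pairs is a natural training signal. The sketch below illustrates that step; the backbone (reduced here to a linear head over precomputed features) and the margin value are placeholders rather than the released training code.

```python
# Train a quality predictor from human pairwise ranking annotations (illustrative sketch).
import torch
import torch.nn as nn

class QualityHead(nn.Module):
    """Placeholder IQA scorer: any image backbone ending in a scalar quality score."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, feats):            # feats: (B, feat_dim) image features
        return self.score(feats).squeeze(-1)

def pairwise_ranking_loss(score_preferred, score_other, margin=0.1):
    """Hinge loss: the human-preferred image should score higher by at least `margin`."""
    return torch.relu(margin - (score_preferred - score_other)).mean()

iqa = QualityHead()
feats_a, feats_b = torch.randn(16, 512), torch.randn(16, 512)  # A preferred over B by annotators
loss = pairwise_ranking_loss(iqa(feats_a), iqa(feats_b))
```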

Authors:Volodymyr Havrylov, Haiwen Huang, Dan Zhang, Andreas Geiger
Title: Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation
Abstract:
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. As VFMs' popularity grows, there is increasing interest in understanding their effectiveness for dense prediction tasks. However, VFMs typically produce low-resolution features, limiting their direct applicability in this context. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines the resolution of VFM features. To assess the effectiveness of this approach, we investigate Interactive Segmentation (IS) as a novel benchmark for evaluating feature upsampling methods on VFMs. Due to its inherent multimodal input, consisting of an image and a set of user-defined clicks, as well as its dense mask output, IS creates a challenging environment that demands comprehensive visual scene understanding. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves the quality of VFM features. The code is released at https://github.com/havrylovv/iSegProbe
中文: 视觉基础模型通过任务无关的特征上采样模块提升低分辨率特征,并以交互式分割为基准验证了合适的上采样策略能显著改善特征质量。
English: Vision Foundation Models (VFMs) are enhanced for dense prediction tasks by employing task-agnostic feature upsampling, with interactive segmentation serving as a benchmark to demonstrate that proper upsampling strategies significantly improve feature quality.

Authors:Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu
Title: RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Abstract:
Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.
中文: RTV-Bench作为评估多模态大语言模型实时视频分析能力的新基准,通过动态问题设计和多维评估发现现有模型需优化架构以适应动态场景,开源实时模型虽优于离线版本但仍落后于顶尖专有模型。
English: RTV-Bench is a new benchmark designed to evaluate Multimodal Large Language Models' real-time video analysis capabilities through evolving questions and multi-dimensional assessment, revealing that current models need architectural improvements for better performance in dynamic environments.

Authors:Branko Brkljač, Vladimir Kalušev, Branislav Popović, Milan Sečujski
Title: Transforming faces into video stories -- VideoFace2.0
Abstract:
Face detection and face recognition have been in the focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool to efficiently create structured video stories, i.e., identity-based information catalogs. VideoFace2.0 is the name of the developed system for spatial and temporal localization of each unique face in the input video, i.e., face re-identification (ReID), which also allows their cataloging, characterization, and creation of structured video outputs for later downstream tasks. The developed near-real-time solution is primarily designed to be utilized in application scenarios involving TV production, media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. Conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition, and passive tracking-by-detection in order to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. Presented results are based on a test implementation that achieves between 18-25 fps on a consumer-type notebook. Ablation experiments also confirmed that the proposed algorithm brings a relative reduction in the number of false identities in the range of 73%-93%. We hope that the presented work and shared code implementation will stimulate further interest in the development of similar, application-specific video analysis tools and lower the entry barrier for the production of high-quality multi-modal datasets in the future.
Chinese: VideoFace2.0系统是一种先进的视频分析工具,通过结合人脸检测、识别和追踪技术实现实时人脸重识别,能够高效创建结构化视频输出,适用于电视制作、媒体分析和机器学习数据集构建等场景。
English: The VideoFace2.0 system is an advanced video analytics tool that performs real-time face re-identification by combining face detection, recognition, and tracking, enabling efficient cataloging and structured video output for applications in TV production, media analysis, and machine learning dataset creation.

Authors:Zhenxing Mi, Ping Yin, Xue Xiao, Dan Xu
Title: Learning Heterogeneous Mixture of Scene Experts for Large-scale Neural Radiance Fields
Abstract:
Recent NeRF methods on large-scale scenes have underlined the importance of scene decomposition for scalable NeRFs. Although achieving reasonable scalability, there are several critical problems remaining unexplored, i.e., learnable decomposition, modeling scene heterogeneity, and modeling efficiency. In this paper, we introduce Switch-NeRF++, a Heterogeneous Mixture of Hash Experts (HMoHE) network that addresses these challenges within a unified framework. It is a highly scalable NeRF that learns heterogeneous decomposition and heterogeneous NeRFs efficiently for large-scale scenes in an end-to-end manner. In our framework, a gating network learns to decompose scenes and allocates 3D points to specialized NeRF experts. This gating network is co-optimized with the experts by our proposed Sparsely Gated Mixture of Experts (MoE) NeRF framework. We incorporate a hash-based gating network and distinct heterogeneous hash experts. The hash-based gating efficiently learns the decomposition of the large-scale scene. The distinct heterogeneous hash experts consist of hash grids of different resolution ranges, enabling effective learning of the heterogeneous representation of different scene parts. These design choices make our framework an end-to-end and highly scalable NeRF solution for real-world large-scale scene modeling to achieve both quality and efficiency. We evaluate our accuracy and scalability on existing large-scale NeRF datasets and a new dataset with very large-scale scenes ($>6.5km^2$) from UrbanBIS. Extensive experiments demonstrate that our approach can be easily scaled to various large-scale scenes and achieve state-of-the-art scene rendering accuracy. Furthermore, our method exhibits significant efficiency, with an 8x acceleration in training and a 16x acceleration in rendering compared to Switch-NeRF. Codes will be released at https://github.com/MiZhenxing/Switch-NeRF.
中文: Switch-NeRF++提出了一种异构哈希专家混合框架,能高效学习大规模场景的分解和异构表示,在实现顶尖渲染精度的同时显著提升了训练和渲染速度。
English: Switch-NeRF++ introduces a Heterogeneous Mixture of Hash Experts framework that efficiently learns scene decomposition and heterogeneous representations for large-scale scenes, achieving state-of-the-art rendering accuracy with significant training and rendering speed improvements.

Authors:Jiayi Cheng, Can Gao, Jie Zhou, Jiajun Wen, Tao Dai, Jinbao Wang
Title: MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection
Abstract:
3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. Therefore, this paper presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the AD task. MC3D-AD is evaluated on the two publicly available Real3D-AD and Anomaly-ShapeNet datasets, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvements in object-level AUROC on Real3D-AD and Anomaly-ShapeNet, respectively. The code is available at https://github.com/iCAN-SZU/MC3D-AD.
中文: 本文提出了一种统一的多类别三维异常检测模型,通过结合局部和全局几何感知信息重构各类别的正常表示,在基准数据集上相比现有单类别方法展现出显著优越性能。
English: This paper introduces a unified multi-category 3D anomaly detection model that leverages local and global geometry-aware information to reconstruct normal representations across categories, demonstrating superior performance over existing single-category methods on benchmark datasets.

Authors:Qi Yang, Le Yang, Geert Van Der Auwera, Zhu Li
Title: HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder
Abstract:
Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing compact 3DGS representation via implicit data embedding. They have long coding times and highly customized data format, making it difficult for widespread deployment. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to perform further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. At the current stage, HybridGS does not include any modules aimed at improving 3DGS quality during generation. But experiment results show that it still provides comparable reconstruction performance against state-of-the-art methods, with evidently higher encoding and decoding speed. The code is publicly available at https://github.com/Qi-Yangsjtu/HybridGS.
Chinese: HybridGS提出了一种新的3DGS压缩框架,通过结合紧凑生成和标准化点云编码,在保持重建质量的同时显著提升了编解码速度。
English: HybridGS introduces a novel 3DGS compression framework that combines compact generation with standardized point cloud encoding, achieving comparable reconstruction quality and faster processing speeds than current methods.

Authors:Xingyu Miao, Haoran Duan, Yang Long, Jungong Han
Title: Rethinking Score Distilling Sampling for 3D Editing and Generation
Abstract:
Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often can not generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available on: https://github.com/xingy038/UDS.
中文: UDS通过优化SDS中的梯度项,统一了3D生成与编辑功能,在两项任务中均超越基线方法,呈现更丰富的细节。
English: UDS unifies 3D generation and editing by refining gradient terms in SDS, outperforming baselines in both tasks with richer details.

Authors:Siddharth Kothari, Srinivasan Murali, Sankalp Kothari, Ujjwal Verma, Jaya Sreevalsan-Nair
Title: Adversarial Robustness of Deep Learning Models for Inland Water Body Segmentation from SAR Images
Abstract:
Inland water body segmentation from Synthetic Aperture Radar (SAR) images is an important task needed for several applications, such as flood mapping. While SAR sensors capture data in all-weather conditions as high-resolution images, differentiating water and water-like surfaces from SAR images is not straightforward. Inland water bodies, such as large river basins, have complex geometry, which adds to the challenge of segmentation. U-Net is a widely used deep learning model for land-water segmentation of SAR images. In practice, manual annotation is often used to generate the corresponding water masks as ground truth. Manual annotation of the images is prone to label noise owing to data poisoning attacks, especially due to complex geometry. In this work, we simulate manual errors in the form of adversarial attacks on the U-Net model and study the robustness of the model to human errors in annotation. Our results indicate that U-Net can tolerate a certain level of corruption before its performance drops significantly. This finding highlights the crucial role that the quality of manual annotations plays in determining the effectiveness of the segmentation model. The code and the new dataset, along with adversarial examples for robust training, are publicly available. (GitHub link - https://github.com/GVCL/IWSeg-SAR-Poison.git)
中文摘要:本研究通过对抗性攻击模拟人工标注错误,探讨了U-Net模型在SAR图像内陆水域分割中的鲁棒性,发现模型虽能承受一定程度的数据污染,但标注质量对分割效果具有决定性影响。
English Summary: This study investigates the robustness of the U-Net model for inland water segmentation in SAR images against simulated human annotation errors introduced through adversarial attacks, finding that while the model can tolerate some corruption, annotation quality critically impacts segmentation performance.

Authors:Chenghong Li, Hongjie Liao, Yihao Zhi, Xihe Yang, Zhengwentai Sun, Jiahao Chang, Shuguang Cui, Xiaoguang Han
Title: MVHumanNet++: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization
Abstract:
In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while significant progress has been achieved in object-centric tasks through large-scale datasets like Objaverse and MVImgNet, human-centric tasks have seen limited advancement, largely due to the absence of a comparable large-scale human dataset. To bridge this gap, we present MVHumanNet++, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using multi-view human capture systems, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. Additionally, the proposed MVHumanNet++ dataset is enhanced with newly processed normal maps and depth maps, significantly expanding its applicability and utility for advanced human-centric research. To explore the potential of our proposed MVHumanNet++ dataset in various 2D and 3D visual tasks, we conducted several pilot studies to demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet++. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet++ dataset with annotations will foster further innovations in the domain of 3D human-centric tasks at scale. MVHumanNet++ is publicly available at https://kevinlee09.github.io/research/MVHumanNet++/.

Authors:Jiesong Bai, Yuhao Yin, Yihang Dong, Xiaofeng Zhang, Chi-Man Pun, Xuhang Chen
Title: LensNet: An End-to-End Learning Framework for Empirical Point Spread Function Modeling and Lensless Imaging Reconstruction
Abstract:
Lensless imaging stands out as a promising alternative to conventional lens-based systems, particularly in scenarios demanding ultracompact form factors and cost-effective architectures. However, such systems are fundamentally governed by the Point Spread Function (PSF), which dictates how a point source contributes to the final captured signal. Traditional lensless techniques often require explicit calibrations and extensive pre-processing, relying on static or approximate PSF models. These rigid strategies can result in limited adaptability to real-world challenges, including noise, system imperfections, and dynamic scene variations, thus impeding high-fidelity reconstruction. In this paper, we propose LensNet, an end-to-end deep learning framework that integrates spatial-domain and frequency-domain representations in a unified pipeline. Central to our approach is a learnable Coded Mask Simulator (CMS) that enables dynamic, data-driven estimation of the PSF during training, effectively mitigating the shortcomings of fixed or sparsely calibrated kernels. By embedding a Wiener filtering component, LensNet refines global structure and restores fine-scale details, thus alleviating the dependency on multiple handcrafted pre-processing steps. Extensive experiments demonstrate LensNet's robust performance and superior reconstruction quality compared to state-of-the-art methods, particularly in preserving high-frequency details and attenuating noise. The proposed framework establishes a novel convergence between physics-based modeling and data-driven learning, paving the way for more accurate, flexible, and practical lensless imaging solutions for applications ranging from miniature sensors to medical diagnostics. The code is available at https://github.com/baijiesong/Lensnet.
中文: LensNet是一种端到端的深度学习框架,通过可学习的编码掩模模拟器动态估计点扩散函数,结合空间和频域表示,在无透镜成像中实现了卓越的重建质量和降噪效果。
English: LensNet is an end-to-end deep learning framework that integrates spatial and frequency domains with a learnable coded mask simulator for dynamic PSF estimation, achieving superior reconstruction quality and noise reduction in lensless imaging.

Authors:Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo
Title: Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion
Abstract:
Generating gestures from human speech has gained tremendous progress in animating virtual avatars. While existing methods can synthesize gestures accompanying an individual's self-talk, they overlook the practicality of concurrent gesture modeling for two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits handling this issue. To fulfill this goal, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. Additionally, we propose Co$^3$Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis including two-person interactive movements. Considering the asymmetric body dynamics of two speakers, our framework is built upon two cooperative generation branches conditioned on separated speaker audio. Specifically, to enhance the coordination of human postures with respect to corresponding speaker audios while interacting with the conversational partner, we present a Temporal Interaction Module (TIM). TIM can effectively model the temporal association representation between two speakers' gesture sequences as interaction guidance and fuse it into the concurrent gesture generation. Then, we devise a mutual attention mechanism to further holistically boost learning dependencies of interacted concurrent motions, thereby enabling us to generate vivid and coherent gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
中文: 本研究提出了一种新框架和数据集,用于合成双人对话中的同步手势,填补了并发语音手势建模的空白,并在性能上超越了现有方法。
English: This study introduces a novel framework and dataset for synthesizing synchronized gestures in two-person conversations, addressing the gap in concurrent co-speech gesture modeling and outperforming existing methods.

Authors:Kaidong Zhang, Rongtao Xu, Pengzhen Ren, Junfan Lin, Hefeng Wu, Liang Lin, Xiaodan Liang
Title: RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation
Abstract:
Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.
中文摘要:RoBridge是一种分层智能架构,通过结合高层认知规划与强化学习,有效解决了机器人操作中的程序性和陈述性技能困境,在新任务和仿真到现实的迁移中均实现了高成功率。
English Summary: RoBridge is a hierarchical architecture that integrates cognitive planning with reinforcement learning to overcome procedural and declarative skill limitations in robotic manipulation, achieving high success rates in new tasks and sim-to-real generalization.

Authors:Yifan Liu, Ruichen Yao, Yaokun Liu, Ruohan Zong, Zelin Li, Yang Zhang, Dong Wang
Title: Component-Based Fairness in Face Attribute Classification with Bayesian Network-informed Meta Learning
Abstract:
The widespread integration of face recognition technologies into various applications (e.g., access control and personalized advertising) necessitates a critical emphasis on fairness. While previous efforts have focused on demographic fairness, the fairness of individual biological face components remains unexplored. In this paper, we focus on face component fairness, a fairness notion defined by biological face features. To the best of our knowledge, ours is the first work to mitigate bias of face attribute prediction at the biological feature level. In this work, we identify two key challenges in optimizing face component fairness: attribute label scarcity and attribute inter-dependencies, both of which limit the effectiveness of bias mitigation from previous approaches. To address these issues, we propose Bayesian Network-informed Meta Reweighting (BNMR), which incorporates a Bayesian Network calibrator to guide an adaptive meta-learning-based sample reweighting process. During the training process of our approach, the Bayesian Network calibrator dynamically tracks model bias and encodes prior probabilities for face component attributes to overcome the above challenges. To demonstrate the efficacy of our approach, we conduct extensive experiments on a large-scale real-world human face dataset. Our results show that BNMR is able to consistently outperform recent face bias mitigation baselines. Moreover, our results suggest a positive impact of face component fairness on the commonly considered demographic fairness (e.g., gender). Our findings pave the way for new research avenues on face component fairness, suggesting that face component fairness could serve as a potential surrogate objective for demographic fairness. The code for our work is publicly available at https://github.com/yliuaa/BNMR-FairCompFace.git.
中文: 本文提出BNMR方法,首次从生物特征层面解决人脸组件公平性问题,通过贝叶斯网络校准和元学习加权机制有效提升人脸属性预测的公平性,并证实其对人口统计公平性的积极影响。
English: This paper introduces BNMR, a novel method that addresses fairness in face recognition by focusing on biological face components, effectively overcoming challenges like attribute scarcity and dependencies to enhance both component and demographic fairness.

Authors:Jun Li, Yijue Zhang, Haibo Shi, Minhong Li, Qiwei Li, Xiaohua Qian
Title: A Dual-Task Synergy-Driven Generalization Framework for Pancreatic Cancer Segmentation in CT Scans
Abstract:
Pancreatic cancer, characterized by its notable prevalence and mortality rates, demands accurate lesion delineation for effective diagnosis and therapeutic interventions. The generalizability of extant methods is frequently compromised due to the pronounced variability in imaging and the heterogeneous characteristics of pancreatic lesions, which may mimic normal tissues and exhibit significant inter-patient variability. Thus, we propose a generalization framework that synergizes pixel-level classification and regression tasks, to accurately delineate lesions and improve model stability. This framework not only seeks to align segmentation contours with actual lesions but also uses regression to elucidate spatial relationships between diseased and normal tissues, thereby improving tumor localization and morphological characterization. Enhanced by the reciprocal transformation of task outputs, our approach integrates additional regression supervision within the segmentation context, bolstering the model's generalization ability from a dual-task perspective. In addition, dual self-supervised learning in feature spaces and output spaces augments the model's representational capability and stability across different imaging views. Experiments on 594 samples drawn from three datasets with significant imaging differences demonstrate that our method achieves generalized pancreas segmentation results comparable to mainstream in-domain validation performance (Dice: 84.07%). More importantly, it successfully improves the results of the highly challenging cross-lesion generalized pancreatic cancer segmentation task by 9.51%. Thus, our model constitutes a resilient and efficient foundational technological support for pancreatic disease management and wider medical applications. The codes will be released at https://github.com/SJTUBME-QianLab/Dual-Task-Seg.
中文摘要:本研究提出一种结合像素级分类与回归的双任务泛化框架,通过协同优化提升胰腺病灶分割精度,在跨数据集验证中显著提升模型稳定性与分割性能。
English Summary: This study introduces a dual-task generalization framework combining pixel-level classification and regression to enhance pancreatic lesion segmentation, improving model stability and achieving significant performance gains in cross-dataset validation.
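
The abstract does not spell out the regression target, so the sketch below simply pairs a Dice segmentation loss with an MSE regression loss on a hypothetical distance-map target, as one plausible reading of dual-task supervision; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss for a (B, 1, H, W) lesion mask."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()


def dual_task_loss(seg_logits, reg_pred, mask, dist_map, w_reg=0.5):
    """Pixel-level classification (segmentation) plus regression supervision.

    dist_map is a hypothetical regression target (e.g., a distance transform of
    the lesion mask); the target and weighting used by the paper may differ.
    """
    loss_seg = dice_loss(seg_logits, mask)
    loss_reg = F.mse_loss(reg_pred, dist_map)
    return loss_seg + w_reg * loss_reg
```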

Authors:Stefanos Gkikas, Raul Fernandez Rojas, Manolis Tsiknakis
Title: PainFormer: a Vision Foundation Model for Automatic Pain Assessment
Abstract:
Pain is a manifold condition that impacts a significant percentage of the population. Accurate and reliable pain evaluation for the people suffering is crucial to developing effective and advanced pain management protocols. Automatic pain assessment systems provide continuous monitoring and support decision-making processes, ultimately aiming to alleviate distress and prevent functionality decline. This study introduces PainFormer, a vision foundation model based on multi-task learning principles trained simultaneously on 14 tasks/datasets with a total of 10.9 million samples. Functioning as an embedding extractor for various input modalities, the foundation model provides feature representations to the Embedding-Mixer, a transformer-based module that performs the final pain assessment. Extensive experiments employing behavioral modalities - including RGB, synthetic thermal, and estimated depth videos - and physiological modalities such as ECG, EMG, GSR, and fNIRS revealed that PainFormer effectively extracts high-quality embeddings from diverse input modalities. The proposed framework is evaluated on two pain datasets, BioVid and AI4Pain, and directly compared to 75 different methodologies documented in the literature. Experiments conducted in unimodal and multimodal settings demonstrate state-of-the-art performances across modalities and pave the way toward general-purpose models for automatic pain assessment. The foundation model's architecture (code) and weights are available at: https://github.com/GkikasStefanos/PainFormer.
中文: 本研究提出PainFormer多任务视觉基础模型,通过14个数据集训练,能从行为和生理输入中提取高质量嵌入表征,在单模态与多模态场景下均实现了自动疼痛评估的最先进性能。
English: This study introduces PainFormer, a multi-task vision foundation model trained on 14 datasets to extract high-quality embeddings from behavioral and physiological inputs, achieving state-of-the-art performance in automatic pain assessment across unimodal and multimodal settings.

Authors:Zhen Yao, Xiaowen Ying, Zhiyu Zhu, Mooi Choo Chuah
Title: Learning Flow-Guided Registration for RGB-Event Semantic Segmentation
Abstract:
Event cameras capture microsecond-level motion cues that complement RGB sensors. However, the prevailing paradigm of treating RGB-Event perception as a fusion problem is ill-posed, as it ignores the intrinsic (i) Spatiotemporal and (ii) Modal Misalignment, unlike other RGB-X sensing domains. To tackle these limitations, we recast RGB-Event segmentation from fusion to registration. We propose BRENet, a novel flow-guided bidirectional framework that adaptively matches correspondence between the asymmetric modalities. Specifically, it leverages temporally aligned optical flows as a coarse-grained guide, along with fine-grained event temporal features, to generate precise forward and backward pixel pairings for registration. This pairing mechanism converts the inherent motion lag into terms governed by flow estimation error, bridging modality gaps. Moreover, we introduce Motion-Enhanced Event Tensor (MET), a new representation that transforms sparse event streams into a dense, temporally coherent form. Extensive experiments on four large-scale datasets validate our approach, establishing flow-guided registration as a promising direction for RGB-Event segmentation. Our code is available at: https://github.com/zyaocoder/BRENet.
中文: 本研究将RGB-事件分割重新定义为配准问题而非融合,提出BRENet框架通过流引导的双向匹配和运动增强事件张量,有效弥合了模态间的差异。
English: The study reframes RGB-Event segmentation as a registration problem instead of fusion, introducing BRENet with flow-guided bidirectional matching and a Motion-Enhanced Event Tensor to bridge modality gaps effectively.

Authors:Daoan Zhang, Che Jiang, Ruoshi Xu, Biaoxiang Chen, Zijian Jin, Yutian Lu, Jianguo Zhang, Liang Yong, Jiebo Luo, Shengda Luo
Title: WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation
Abstract:
Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning, both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce WorldGenBench, a benchmark designed to systematically evaluate T2I models' world knowledge grounding and implicit inferential capabilities, covering both the humanities and nature domains. We propose the Knowledge Checklist Score, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary auto-regressive models like GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: https://dwanzhang-ai.github.io/WorldGenBench/

Authors:Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber
Title: VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Abstract:
Synthetic video generation has gained significant attention for its realism and broad applications, but remains prone to violations of common sense and physical laws. This highlights the need for reliable abnormality detectors that understand such principles and are robust to hallucinations. To address this, we introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic videos generated by models like Veo2, Sora, and Kling, paired with expert-crafted counterintuitive QA to evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors. VideoHallu evaluates MLLMs' abnormality detection abilities with examples across alignment, consistency, commonsense, and physics. We benchmark SOTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen2.5-VL, Video-R1, and VideoChat-R1. We observe that these models perform well on many real-world benchmarks like MVBench and MovieChat, but still struggle with basic physics-based and commonsense reasoning in synthetic videos. We further show that post-training with Group Relative Policy Optimization (GRPO), using curriculum learning on datasets combining video QA with counterintuitive commonsense and physics reasoning over real and synthetic videos, improves MLLMs' abnormality detection and critical thinking, demonstrating the value of targeted training for improving their understanding of commonsense and physical laws. Our code is available at https://github.com/zli12321/VideoHallu.git.
中文: VideoHallu基准通过合成视频异常评估多模态大模型的批判性思维能力,发现尽管在真实场景表现优异,模型仍难以处理常识和物理推理,但采用GRPO的针对性训练能显著提升这方面的理解能力。
English: The VideoHallu benchmark evaluates MLLMs' critical thinking on synthetic video abnormalities, revealing their struggles with commonsense and physics reasoning despite strong real-world performance, while demonstrating that targeted training with GRPO significantly enhances these capabilities.

Authors:Zhe Zhang, Mingxiu Cai, Hanxiao Wang, Gaochang Wu, Tianyou Chai, Xiatian Zhu
Title: CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering
Abstract:
Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
Chinese: 本文提出CostFilter-AD方法,通过构建并过滤输入图像与正常样本间的匹配成本,有效抑制噪声同时保留边缘结构和捕捉细微异常,可作为通用后处理模块提升各类无监督异常检测方法的性能。
English: The paper introduces CostFilter-AD, a novel unsupervised anomaly detection method that refines anomaly localization by filtering matching costs between input and normal samples, effectively reducing noise while enhancing edge and subtle anomaly detection across various UAD approaches.
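
As a rough sketch of the matching-cost idea, the snippet below builds a cost volume from cosine similarities between input features and a bank of normal reference features; the learned filtering network is omitted, and the use of cosine similarity and these tensor shapes are assumptions of this sketch rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F


def build_cost_volume(query_feats, normal_feats):
    """Matching cost volume between an input image and normal reference features.

    query_feats:  (C, H, W) features of the test image.
    normal_feats: (K, C) bank of K reference features from normal samples.
    Returns a (K, H, W) volume; smaller cost = better match to normality.
    """
    C, H, W = query_feats.shape
    q = F.normalize(query_feats.reshape(C, -1), dim=0)   # (C, H*W), unit-norm per pixel
    n = F.normalize(normal_feats, dim=1)                  # (K, C),  unit-norm per reference
    sim = n @ q                                           # (K, H*W) cosine similarities
    return (1.0 - sim).reshape(-1, H, W)                  # (K, H, W) matching costs


# A crude anomaly map before any learned filtering: best match per pixel.
# anomaly_map = build_cost_volume(feats, bank).min(dim=0).values
```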

Authors:Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
Title: Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Abstract:
LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.
中文: 大型多模态语言模型存在记忆敏感信息的风险,可通过多模态提示提取,为此开发了UnLOK-VQA基准来评估定向遗忘方法,结果显示多模态攻击更有效,而更大模型具有更强的安全鲁棒性。
English: Large multimodal language models risk memorizing sensitive data, which can be extracted via multimodal prompts, prompting the development of UnLOK-VQA as a benchmark to evaluate targeted unlearning methods and revealing that multimodal attacks are more effective while larger models offer better safety.

Authors:Wenqi Guo, Mohamed Shehata, Shan Du
Title: ZS-VCOS: Zero-Shot Video Camouflaged Object Segmentation By Optical Flow and Open Vocabulary Object Detection
Abstract:
Camouflaged object segmentation presents unique challenges compared to traditional segmentation tasks, primarily due to the high similarity in patterns and colors between camouflaged objects and their backgrounds. Effective solutions to this problem have significant implications in critical areas such as pest control, defect detection, and lesion segmentation in medical imaging. Prior research has predominantly emphasized supervised or unsupervised pre-training methods, leaving zero-shot approaches significantly underdeveloped. Existing zero-shot techniques commonly utilize the Segment Anything Model (SAM) in automatic mode or rely on vision-language models to generate cues for segmentation; however, their performances remain unsatisfactory due to the similarity of the camouflaged object and the background. This work studies how to avoid training by integrating large pre-trained models like SAM-2 and Owl-v2 with temporal information into a modular pipeline. Evaluated on the MoCA-Mask dataset, our approach achieves outstanding performance improvements, significantly outperforming existing zero-shot methods by raising the F-measure ($F_\beta^w$) from 0.296 to 0.628. Our approach also surpasses supervised methods, increasing the F-measure from 0.476 to 0.628. Additionally, evaluation on the MoCA-Filter dataset demonstrates an increase in the success rate from 0.628 to 0.697 when compared with FlowSAM, a supervised transfer method. A thorough ablation study further validates the individual contributions of each component. Besides our main contributions, we also highlight inconsistencies in previous work regarding metrics and settings. Code can be found at https://github.com/weathon/vcos.
中文: 本研究提出了一种新颖的零样本伪装物体分割方法,通过将大型预训练模型与时间信息相结合,在基准数据集上相比现有方法取得了显著的性能提升。
English: This study introduces a novel zero-shot approach for camouflaged object segmentation by integrating large pre-trained models with temporal information, achieving significant performance improvements over existing methods on benchmark datasets.

Authors:Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr
Title: VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models
Abstract:
The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: https://github.com/SPIN-UMass/VidStamp
中文摘要:VIDSTAMP是一种新型水印框架,通过在视频扩散模型的潜在空间中直接嵌入高容量水印,在保持与未加水印视频相当的视觉质量的同时,实现了对篡改的更强鲁棒性。
English Summary: VIDSTAMP is a novel watermarking framework that embeds high-capacity watermarks directly into video diffusion models' latent space, achieving superior robustness against tampering while maintaining visual quality comparable to unwatermarked videos.

Authors:Fahong Zhang, Yilei Shi, Xiao Xiang Zhu
Title: Global Collinearity-aware Polygonizer for Polygonal Building Mapping in Remote Sensing
Abstract:
This paper addresses the challenge of mapping polygonal buildings from remote sensing images and introduces a novel algorithm, the Global Collinearity-aware Polygonizer (GCP). GCP, built upon an instance segmentation framework, processes binary masks produced by any instance segmentation model. The algorithm begins by collecting polylines sampled along the contours of the binary masks. These polylines undergo a refinement process using a transformer-based regression module to ensure they accurately fit the contours of the targeted building instances. Subsequently, a collinearity-aware polygon simplification module simplifies these refined polylines and generates the final polygon representation. This module employs a dynamic programming technique to optimize an objective function that balances the simplicity and fidelity of the polygons, achieving globally optimal solutions. Furthermore, the optimized collinearity-aware objective is seamlessly integrated into network training, enhancing the cohesiveness of the entire pipeline. The effectiveness of GCP has been validated on two public benchmarks for polygonal building mapping. Further experiments reveal that applying the collinearity-aware polygon simplification module to arbitrary polylines, without prior knowledge, enhances accuracy over traditional methods such as the Douglas-Peucker algorithm. This finding underscores the broad applicability of GCP. The code for the proposed method will be made available at https://github.com/zhu-xlab.
Chinese: 本文提出了一种新颖的全局共线性感知多边形化算法(GCP),通过基于变换器的回归模块和动态规划技术优化遥感图像中的建筑物轮廓,最终生成精确的多边形表示。
English: This paper introduces the Global Collinearity-aware Polygonizer (GCP), a novel algorithm that refines and simplifies building contours from remote sensing images using transformer-based regression and dynamic programming to produce accurate polygonal representations.
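
The simplicity/fidelity trade-off solved by dynamic programming can be illustrated with a generic shortest-path style polyline simplification; the sketch below is not GCP's collinearity-aware objective, only the same kind of globally optimal trade-off, with the chord-error cost and per-vertex penalty chosen for illustration.

```python
import numpy as np


def seg_error(points, i, j):
    """Max perpendicular distance of points[i+1:j] to the chord points[i]-points[j]."""
    a, b = points[i], points[j]
    if j - i < 2:
        return 0.0
    d = b - a
    n = np.array([-d[1], d[0]])
    dists = np.abs((points[i + 1:j] - a) @ n) / (np.linalg.norm(n) + 1e-12)
    return float(dists.max())


def simplify_polyline(points, lam=1.0):
    """Globally optimal simplification: minimize chord errors + lam per kept vertex.

    points: (N, 2) open polyline. Returns the indices of kept vertices.
    """
    N = len(points)
    dp = np.full(N, np.inf)
    prev = np.full(N, -1, dtype=int)
    dp[0] = 0.0
    for j in range(1, N):
        for i in range(j):
            cost = dp[i] + seg_error(points, i, j) + lam
            if cost < dp[j]:
                dp[j], prev[j] = cost, i
    keep, k = [], N - 1
    while k >= 0:
        keep.append(k)
        k = prev[k]
    return keep[::-1]
```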

Authors:Vladimir Somers, Baptiste Standaert, Victor Joos, Alexandre Alahi, Christophe De Vleeschouwer
Title: CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking
Abstract:
Online multi-object tracking has been recently dominated by tracking-by-detection (TbD) methods, where recent advances rely on increasingly sophisticated heuristics for tracklet representation, feature fusion, and multi-stage matching. The key strength of TbD lies in its modular design, enabling the integration of specialized off-the-shelf models like motion predictors and re-identification. However, the extensive usage of human-crafted rules for temporal associations makes these methods inherently limited in their ability to capture the complex interplay between various tracking cues. In this work, we introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation, that learns resilient association strategies directly from data, breaking free from hand-crafted heuristics while maintaining TbD's valuable modularity. At its core, CAMEL employs two transformer-based modules and relies on a novel association-centric training scheme to effectively model the complex interactions between tracked targets and their various association cues. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks. Our code is available at https://github.com/TrackingLaboratory/CAMELTrack.
中文: CAMEL提出了一种数据驱动的关联模块,通过直接学习数据中的关联策略摆脱人工启发式规则,同时保持检测跟踪的模块化优势,在多目标在线跟踪中实现了最先进的性能。
English: CAMEL introduces a data-driven association module that learns resilient tracking strategies without hand-crafted heuristics while preserving the modularity of tracking-by-detection, achieving state-of-the-art performance in online multi-object tracking.

Authors:Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
Title: CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Abstract:
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGG Sound, and the ADE20K Sound dataset on zero-shot retrieval, classification and localization tasks demonstrating state-of-the-art performance and outperforming more complex architectures.
Chinese Summary: 提出的CAV-MAE Sync框架通过将音频序列与视频帧进行时间对齐、分离对比与重建目标、并增强空间定位能力,解决了视听学习中的关键挑战,在多个基准测试中实现了最先进的性能。
English Summary: The proposed CAV-MAE Sync framework addresses key audio-visual learning challenges by aligning audio sequences with video frames temporally, separating contrastive and reconstruction objectives, and enhancing spatial localization, achieving state-of-the-art results across multiple benchmarks.

Authors:Youngsik Yun, Jeongmin Bae, Hyunseung Son, Seoha Kim, Hahyun Lee, Gun Bang, Youngjung Uh
Title: Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting
Abstract:
Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable artifacts in static regions. This paper identifies that errors such as noise in real-world recordings cause temporal inconsistency in online reconstruction. We propose a method that enhances temporal consistency in online reconstruction from temporally inconsistent observations, which are inevitable with real cameras. We show that our method restores the ideal observation by subtracting the learned error. We demonstrate that applying our method to various baselines significantly enhances both temporal consistency and rendering quality across datasets. Code, video results, and checkpoints are available at https://bbangsik13.github.io/OR2.

Authors:Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hänsch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakuş, Paul L. Rosin
Title: Core-Set Selection for Data-efficient Land Cover Segmentation
Abstract:
The increasing accessibility of remotely sensed data and the potential of such data to inform large-scale decision-making has driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models must be trained on large datasets. However, the common assumption that broadly larger datasets lead to better outcomes tends to overlook the complexities of the data distribution, the potential for introducing biases and noise, and the computational resources required for processing and storing vast datasets. Therefore, effective solutions should consider both the quantity and quality of data. In this paper, we propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets that rely on imagery only, labels only, and a combination of each. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets: DFC2022, Vaihingen, and Potsdam. In each of the datasets, we demonstrate that training on a subset of samples outperforms the random baseline, and some approaches outperform training on all available data. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
中文: 针对地球观测任务,通过采用仅依赖图像、标签或其组合的新型核心集选择方法,可提升数据质量,使深度学习模型在遥感图像分割中的表现优于随机选择,有时甚至超过使用全部数据的效果。
English: Deep learning models for Earth Observation can achieve better performance by focusing on data quality through novel core-set selection methods, which outperform random selection and sometimes even full dataset training in remote sensing image segmentation tasks.
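
The paper proposes six selection strategies over imagery, labels, or both; as a generic illustration of the imagery-only flavor, the sketch below runs k-center greedy (farthest-point) selection over per-image embeddings. How those embeddings are computed, and the budget, are assumptions of this sketch rather than the paper's recipes.

```python
import numpy as np


def k_center_greedy(features, budget, seed=0):
    """Select `budget` samples that cover the feature space (k-center greedy).

    features: (N, D) array, one embedding per image (e.g., from a pretrained
    backbone; the choice of embedding is left open). Returns core-set indices.
    """
    rng = np.random.default_rng(seed)
    N = len(features)
    selected = [int(rng.integers(N))]
    # distance of every sample to its nearest already-selected sample
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dist))   # farthest point from the current core-set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```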

Authors:Kui Jiang, Yan Luo, Junjun Jiang, Xin Xu, Fei Ma, Fei Yu
Title: RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement
Abstract:
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models like Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and global relation modeling, limiting their efficacy in complex underwater environments. To address this, we enhance conventional Mamba with a sorting-based scanning mechanism that dynamically reorders scanning sequences based on the statistical distribution of spatial correlation across all pixels. In this way, it encourages the network to prioritize the most informative components--structural and semantic features. Upon building this mechanism, we devise a Visually Self-adaptive State Block (VSSB) that harmonizes the dynamic sorting of Mamba with input-dependent dynamic convolution, enabling coherent integration of global context and local relational cues. This design helps eliminate global focus bias, especially for widely distributed content, which greatly weakens the statistical frequency. For robust feature extraction and refinement, we design a cross-feature bridge (CFB) to adaptively fuse multi-scale representations. These efforts compose the novel relation-driven Mamba framework for effective UIE (RD-UIE). Extensive experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba in both quantitative metrics and visual fidelity, achieving an average performance gain of 0.55 dB across the three benchmarks. Our code is available at https://github.com/kkoucy/RD-UIE/tree/main
中文: 本研究提出了一种关系驱动的Mamba框架(RD-UIE),通过动态排序像素序列并整合全局上下文与局部特征,显著提升了水下图像增强效果,在性能指标上超越现有最佳方法。
English: The study introduces a relation-driven Mamba framework (RD-UIE) that enhances underwater image processing by dynamically sorting pixel sequences and integrating global context with local features, achieving superior performance over existing methods.
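
The reorder/scan/restore pattern behind sorting-based scanning can be sketched as follows; the sorting key (similarity to the mean feature) and the scan function are placeholders, not the statistic or state space block actually used in RD-UIE.

```python
import torch
import torch.nn.functional as F


def sorted_scan(x, scan_fn):
    """Reorder pixels by a correlation statistic, run a 1D scan, then restore order.

    x: (B, L, C) flattened image features. scan_fn: any sequence model, e.g. a
    Mamba/SSM block (placeholder here). The sorting key below, cosine similarity
    to the mean feature, is only an illustrative choice.
    """
    key = F.cosine_similarity(x, x.mean(dim=1, keepdim=True), dim=-1)   # (B, L)
    order = torch.argsort(key, dim=1, descending=True)                   # most "informative" first
    x_sorted = torch.gather(x, 1, order.unsqueeze(-1).expand_as(x))
    y_sorted = scan_fn(x_sorted)                                          # sequential 1D scan
    inv = torch.argsort(order, dim=1)                                     # undo the permutation
    return torch.gather(y_sorted, 1, inv.unsqueeze(-1).expand_as(y_sorted))
```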

Authors:Ziyan Guo, Haoxuan Qu, Hossein Rahmani, Dewen Soh, Ping Hu, Qiuhong Ke, Jun Liu
Title: TSTMotion: Training-free Scene-aware Text-to-motion Generation
Abstract:
Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to the expensive cost. To mitigate this challenge, we are the first to propose a Training-free Scene-aware Text-to-Motion framework, dubbed TSTMotion, that efficiently empowers pre-trained blank-background motion generators with the scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code on our project page: https://tstmotion.github.io/.
中文: 提出的TSTMotion框架无需训练,通过利用基础模型生成和验证三维场景中的运动引导,有效增强了预训练的空白背景运动生成器的场景感知能力。
English: The proposed TSTMotion framework is a training-free solution that enhances pre-trained blank-background motion generators with scene-aware capabilities by leveraging foundation models to create and validate motion guidance for 3D environments.

Authors:Jiangtong Tan, Hu Yu, Jie Huang, Jie Xiao, Feng Zhao
Title: FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis
Abstract:
Long video generation involves generating extended videos using models trained on short videos, suffering from distribution shifts due to varying frame counts. It necessitates the use of local information from the original short frames to enhance visual and motion quality, and global information from the entire long frames to ensure appearance consistency. Existing training-free methods struggle to effectively integrate the benefits of both, as appearance and motion in videos are closely coupled, leading to motion inconsistency and visual quality. In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. With this insight, we propose FreePCA, a training-free long video generation paradigm based on PCA that simultaneously achieves high consistency and quality. Concretely, we decouple consistent appearance and motion intensity features by measuring cosine similarity in the principal component space. Critically, we progressively integrate these features to preserve original quality and ensure smooth transitions, while further enhancing consistency by reusing the mean statistics of the initial noise. Experiments demonstrate that FreePCA can be applied to various video diffusion models without requiring training, leading to substantial improvements. Code is available at https://github.com/JosephTiTan/FreePCA.
中文:FreePCA提出了一种无需训练的方法,通过主成分分析解耦并整合长视频生成中的全局外观一致性与局部运动质量,无需额外训练即可显著提升性能。
English: FreePCA introduces a training-free method using Principal Component Analysis to decouple and integrate global appearance consistency with local motion quality in long video generation, significantly improving performance without additional training.
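
A toy version of PCA-based decoupling of frame features into appearance-like and motion-like parts is sketched below; FreePCA's actual decoupling uses cosine similarity in the principal component space and operates inside a video diffusion model, so this is only a conceptual illustration with an assumed split by temporal variance.

```python
import numpy as np


def pca_split(frame_feats, k_appearance=8):
    """Split per-frame features into appearance-like and motion-like parts via PCA.

    frame_feats: (T, D) one pooled feature per frame. Components whose projections
    vary least over time are treated as shared appearance; the rest as motion.
    """
    mean = frame_feats.mean(axis=0, keepdims=True)
    X = frame_feats - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # principal directions
    proj = X @ Vt.T                                     # (T, n_components)
    stability = proj.var(axis=0)                        # low variance = temporally consistent
    app_idx = np.argsort(stability)[:k_appearance]
    mot_idx = np.argsort(stability)[k_appearance:]
    appearance = mean + proj[:, app_idx] @ Vt[app_idx]  # consistent-appearance reconstruction
    motion = proj[:, mot_idx] @ Vt[mot_idx]             # residual, motion-like part
    return appearance, motion
```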

Authors:Daneul Kim, Jaeah Lee, Jaesik Park
Title: Improving Editability in Image Generation with Layer-wise Memory
Abstract:
Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing: especially with maintaining previous edits along with adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.

Authors:Yeonsang Shin, Jihwan Kim, Yumin Song, Kyungseung Lee, Hyunhee Chung, Taeyoung Na
Title: Generating Animated Layouts as Structured Text Representations
Abstract:
Despite the remarkable progress in text-to-video models, achieving precise control over text elements and animated graphics remains a significant challenge, especially in applications such as video advertisements. To address this limitation, we introduce Animated Layout Generation, a novel approach to extend static graphic layouts with temporal dynamics. We propose a Structured Text Representation for fine-grained video control through hierarchical visual elements. To demonstrate the effectiveness of our approach, we present VAKER (Video Ad maKER), a text-to-video advertisement generation pipeline that combines a three-stage generation process with Unstructured Text Reasoning for seamless integration with LLMs. VAKER fully automates video advertisement generation by incorporating dynamic layout trajectories for objects and graphics across specific video frames. Through extensive evaluations, we demonstrate that VAKER significantly outperforms existing methods in generating video advertisements. Project Page: https://yeonsangshin.github.io/projects/Vaker

Authors:Viktor Kocur, Charalambos Tzamos, Yaqing Ding, Zuzana Berger Haladova, Torsten Sattler, Zuzana Kukelova
Title: Are Minimal Radial Distortion Solvers Really Necessary for Relative Pose Estimation?
Abstract:
Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion. Not modeling radial distortion leads to (significantly) worse results. However, minimal radial distortion solvers are significantly more complex than pinhole solvers, both in terms of run-time and implementation efforts. This paper compares radial distortion solvers with two simple-to-implement approaches that do not use minimal radial distortion solvers: The first approach combines an efficient pinhole solver with sampled radial undistortion parameters, where the sampled parameters are used for undistortion prior to applying the pinhole solver. The second approach uses a state-of-the-art neural network to estimate the distortion parameters rather than sampling them from a set of potential values. Extensive experiments on multiple datasets, and different camera setups, show that complex minimal radial distortion solvers are not necessary in practice. We discuss under which conditions a simple sampling of radial undistortion parameters is preferable over calibrating cameras using a learning-based prior approach. Code and newly created benchmark for relative pose estimation under radial distortion are available at https://github.com/kocurvik/rdnet.
Chinese: 本文证明在相对位姿估计中无需使用复杂的径向畸变最小化解算器,通过简单的畸变参数采样或基于神经网络的校准方法即可在实际应用中达到相似效果。
English: This paper demonstrates that complex minimal radial distortion solvers are unnecessary for relative pose estimation, showing that simple sampling of undistortion parameters or neural network-based calibration can achieve comparable results in practice.
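
The first, sampling-based alternative can be sketched as: undistort correspondences with a one-parameter division model for a few candidate distortion values, run a standard pinhole essential-matrix solver on each, and keep the hypothesis with the most inliers. OpenCV's findEssentialMat stands in for the pinhole solver here; the candidate values, the division model, and the threshold are assumptions of this sketch, not the paper's exact setup.

```python
import cv2
import numpy as np


def undistort_division(pts, lam):
    """One-parameter division model: x_u = x_d / (1 + lam * r^2), pts in normalized coords."""
    r2 = np.sum(pts ** 2, axis=1, keepdims=True)
    return pts / (1.0 + lam * r2)


def relative_pose_with_sampled_distortion(pts1, pts2, lambdas=(-0.3, -0.15, 0.0)):
    """Try a few candidate distortion values, solve pinhole relative pose for each,
    and keep the solution with the largest RANSAC inlier count."""
    K = np.eye(3)  # points assumed already normalized by the intrinsics
    best = None
    for lam in lambdas:
        u1 = undistort_division(pts1, lam)
        u2 = undistort_division(pts2, lam)
        E, mask = cv2.findEssentialMat(u1, u2, K, method=cv2.RANSAC, threshold=1e-3)
        inliers = int(mask.sum()) if mask is not None else 0
        if best is None or inliers > best[0]:
            best = (inliers, lam, E)
    return best  # (inlier count, chosen lambda, essential matrix)
```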

Authors:Fadi Abdeladhim Zidi, Abdelkrim Ouafi, Fares Bougourzi, Cosimo Distante, Abdelmalik Taleb-Ahmed
Title: Advancing Wheat Crop Analysis: A Survey of Deep Learning Approaches Using Hyperspectral Imaging
Abstract:
As one of the most widely cultivated and consumed crops, wheat is essential to global food security. However, wheat production is increasingly challenged by pests, diseases, climate change, and water scarcity, threatening yields. Traditional crop monitoring methods are labor-intensive and often ineffective for early issue detection. Hyperspectral imaging (HSI) has emerged as a non-destructive and efficient technology for remote crop health assessment. However, the high dimensionality of HSI data and limited availability of labeled samples present notable challenges. In recent years, deep learning has shown great promise in addressing these challenges due to its ability to extract and analyze complex structures. Despite advancements in applying deep learning methods to HSI data for wheat crop analysis, no comprehensive survey currently exists in this field. This review addresses this gap by summarizing benchmark datasets, tracking advancements in deep learning methods, and analyzing key applications such as variety classification, disease detection, and yield estimation. It also highlights the strengths, limitations, and future opportunities in leveraging deep learning methods for HSI-based wheat crop analysis. We have listed the current state-of-the-art papers and will continue to track and update them at https://github.com/fadi-07/Awesome-Wheat-HSI-DeepLearning.
Chinese: 本综述填补了深度学习在小麦作物高光谱成像分析领域缺乏全面调研的空白,通过总结数据集、追踪方法进展并分析品种分类、病害检测和产量估算等关键应用,同时指出了局限性和未来机遇。
English: This review addresses the lack of a comprehensive survey on deep learning applications for hyperspectral imaging in wheat crop analysis by summarizing datasets, tracking method advancements, and examining key uses like disease detection and yield estimation, while highlighting limitations and future opportunities.

Authors:Wufei Ma, Luoxin Ye, Celso M de Melo, Jieneng Chen, Alan Yuille
Title: SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Abstract:
Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability of 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multi-modal model with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on objects' 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real images. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our SpatialLLM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o performance by 8.7%. Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction. Our project page is available at: https://3d-spatial-reasoning.github.io/spatial-llm/

Authors:Branko Brkljač, Milan Brkljač
Title: Person detection and re-identification in open-world settings of retail stores and public spaces
Abstract:
Practical applications of computer vision in smart cities usually assume system integration and operation in challenging open-world environments. In the case of the person re-identification task, the main goal is to retrieve information on whether a specific person has appeared in another place at a different time instance of the same video, or over multiple camera feeds. This typically assumes collecting raw data from video surveillance cameras in different places and under varying illumination conditions. In the considered open-world setting it also requires detection and localization of the person inside the analyzed video frame before the main re-identification step. With multi-person and multi-camera setups the system complexity becomes higher, requiring sophisticated tracking solutions and re-identification models. In this work we will discuss existing challenges in system design architectures, consider possible solutions based on different computer vision techniques, and describe applications of such systems in retail stores and public spaces for improved marketing analytics. In order to analyse the sensitivity of the person re-identification task under different open-world environments, the performance of one close-to-real-time solution will be demonstrated over several video captures and live camera feeds. Finally, based on the conducted experiments we will indicate further research directions and possible system improvements.
Chinese: 本文探讨了智能城市中行人重识别系统在开放环境下面临的挑战与解决方案,重点分析了系统架构设计、实时性能评估以及在零售和公共空间中的实际应用与优化方向。
English: This abstract discusses the challenges and solutions for person re-identification in smart city applications, focusing on system design, performance evaluation in real-world settings, and potential improvements for retail and public space analytics.

Authors:Zhengbin Zhang, Yan Wu, Hongkun Zhang
Title: Fast2comm: Collaborative perception combined with prior knowledge
Abstract:
Collaborative perception has the potential to significantly enhance perceptual accuracy through the sharing of complementary information among agents. However, real-world collaborative perception faces persistent challenges, particularly in balancing perception performance and bandwidth limitations, as well as coping with localization errors. To address these challenges, we propose Fast2comm, a prior knowledge-based collaborative perception framework. Specifically, (1) we propose a prior-supervised confidence feature generation method that effectively distinguishes foreground from background by producing highly discriminative confidence features; (2) we propose a GT Bounding Box-based spatial prior feature selection strategy to ensure that only the most informative prior-knowledge features are selected and shared, thereby minimizing background noise and optimizing bandwidth efficiency while enhancing adaptability to localization inaccuracies; (3) we decouple the feature fusion strategies between model training and testing phases, enabling dynamic bandwidth adaptation. To comprehensively validate our framework, we conduct extensive experiments on both real-world and simulated datasets. The results demonstrate the superior performance of our model and highlight the necessity of the proposed methods. Our code is available at https://github.com/Zhangzhengbin-TJ/Fast2comm.
Chinese: 提出的Fast2comm框架通过生成区分性置信度特征、选择信息丰富的空间先验特征以优化带宽和适应性,并解耦融合策略实现动态带宽使用,在实验中展现出卓越性能。
English: The proposed Fast2comm framework enhances collaborative perception by generating discriminative confidence features, selecting informative spatial prior features to optimize bandwidth and adaptability, and decoupling fusion strategies for dynamic bandwidth use, achieving superior performance in experiments.

Authors:Jiajia Li, Xinda Qi, Seyed Hamidreza Nabaei, Meiqi Liu, Dong Chen, Xin Zhang, Xunyuan Yin, Zhaojian Li
Title: A Survey on 3D Reconstruction Techniques in Plant Phenotyping: From Classical Methods to Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and Beyond
Abstract:
Plant phenotyping plays a pivotal role in understanding plant traits and their interactions with the environment, making it crucial for advancing precision agriculture and crop improvement. 3D reconstruction technologies have emerged as powerful tools for capturing detailed plant morphology and structure, offering significant potential for accurate and automated phenotyping. This paper provides a comprehensive review of the 3D reconstruction techniques for plant phenotyping, covering classical reconstruction methods, emerging Neural Radiance Fields (NeRF), and the novel 3D Gaussian Splatting (3DGS) approach. Classical methods, which often rely on high-resolution sensors, are widely adopted due to their simplicity and flexibility in representing plant structures. However, they face challenges such as data density, noise, and scalability. NeRF, a recent advancement, enables high-quality, photorealistic 3D reconstructions from sparse viewpoints, but its computational cost and applicability in outdoor environments remain areas of active research. The emerging 3DGS technique introduces a new paradigm in reconstructing plant structures by representing geometry through Gaussian primitives, offering potential benefits in both efficiency and scalability. We review the methodologies, applications, and performance of these approaches in plant phenotyping and discuss their respective strengths, limitations, and future prospects (https://github.com/JiajiaLi04/3D-Reconstruction-Plants). Through this review, we aim to provide insights into how these diverse 3D reconstruction techniques can be effectively leveraged for automated and high-throughput plant phenotyping, contributing to the next generation of agricultural technology.
Chinese: 本文综述了植物表型分析中的三维重建技术,比较了传统方法、神经辐射场(NeRF)和三维高斯泼溅(3DGS),重点分析了它们的优势、局限性和在推动自动化农业技术发展方面的潜力。
English: This paper reviews 3D reconstruction techniques for plant phenotyping, comparing classical methods, Neural Radiance Fields (NeRF), and 3D Gaussian Splatting (3DGS), highlighting their strengths, limitations, and potential for advancing automated agricultural technology.

Authors:Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Title: T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Abstract:
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
Chinese: 本文提出T2I-R1模型,通过结合双层思维链推理与强化学习,在语义规划和像素处理层面优化文本到图像生成,显著提升了生成性能并超越了现有先进模型。
English: This paper introduces T2I-R1, a reasoning-enhanced text-to-image model that integrates a bi-level chain-of-thought process with reinforcement learning to optimize both semantic planning and pixel processing, achieving significant performance improvements over baseline and state-of-the-art models.

Authors:Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, Georgios Pavlakos
Title: RayZer: A Self-supervised Large View Synthesis Model
Abstract:
We present RayZer, a self-supervised multi-view 3D vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to "oracle" methods that rely on pose annotations in both training and testing. Project: https://hwjiang1510.github.io/RayZer/
Chinese: RayZer是一种无需3D监督的自监督三维视觉模型,能够从无位姿图像中重建场景并合成新视角,其性能可与依赖位姿标注的方法相媲美。
English: RayZer is a self-supervised 3D vision model that reconstructs scenes and synthesizes novel views from unposed images without 3D supervision, achieving performance comparable to pose-dependent methods.

Authors:Wayne Wu, Honglin He, Chaoyuan Zhang, Jack He, Seth Z. Zhao, Ran Gong, Quanyi Li, Bolei Zhou
Title: Towards Autonomous Micromobility through Scalable Urban Simulation
Abstract:
Micromobility, which utilizes lightweight mobile machines moving in urban public spaces, such as delivery robots and mobility scooters, emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices presents a viable solution for enhancing safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM - a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules: Hierarchical Urban Generation pipeline, Interactive Dynamics Generation strategy, and Asynchronous Scene Sampling scheme, to improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH - a suite of essential tasks and benchmarks to gauge various capabilities of the AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core skills of the agents: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as the wheeled and legged robots, across these tasks. Experiments on diverse terrains and urban structures reveal each robot's strengths and limitations.

Authors:Tiange Luo, Lajanugen Logeswaran, Justin Johnson, Honglak Lee
Title: Visual Test-time Scaling for GUI Agent Grounding
Abstract:
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on Screenspot-pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in interactive settings. We achieve a new state-of-the-art grounding performance of 61.6% on the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model. Our code will be released publicly at https://github.com/tiangeluo/RegionFocus.
Chinese: RegionFocus是一种视觉测试时缩放方法,通过动态聚焦相关区域并采用图像作为地图的机制,显著提升了视觉语言模型代理的性能,在基准测试中实现了超过24%的性能提升,并以61.6%的准确率创下了新的最先进接地性能记录。
English: RegionFocus is a visual test-time scaling method that enhances Vision Language Model Agents by dynamically zooming into relevant regions and using an image-as-map mechanism, achieving significant performance improvements of over 24% on benchmarks and setting a new state-of-the-art grounding accuracy of 61.6%.

Authors:Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand
Title: MINERVA: Evaluating Complex Video Reasoning
Abstract:
Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.
Chinese: MINERVA数据集通过提供带有详细推理路径的复杂问题,弥补了现有视频基准仅关注结果监督的不足,旨在检验多模态模型是否真正具备结合感知与时间信息进行视频推理的能力。
English: The MINERVA dataset addresses the limitations of current video benchmarks by providing detailed reasoning traces alongside complex questions, challenging multimodal models to demonstrate genuine perceptual and temporal reasoning rather than relying on superficial cues.

Authors:Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Huiyu Zhou, Jinchang Ren, Shiming Xiang, Xiangtai Li, Guangliang Cheng
Title: Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook
Abstract:
Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade-offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high-resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing, systematically analyzing about 120 Mamba-based remote sensing studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro-architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro-architectural integrations, including CNN-Transformer-Mamba hybrids and frequency-domain adaptations, (iv) rigorous benchmarking against state-of-the-art methods in multiple application tasks, such as object detection, semantic segmentation, change detection, etc. and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM-based methods. We curate an open-source repository (https://github.com/BaoBao0926/Awesome-Mamba-in-Remote-Sensing) to foster community-driven advancements.
Chinese: 本综述全面评述了基于Mamba的状态空间模型在遥感领域的变革性应用,通过线性计算扩展和全局上下文建模克服了CNN与ViT的局限,并系统分析了其创新架构、实际应用与未来发展方向。
English: This survey comprehensively reviews Mamba-based State Space Models as a transformative solution for remote sensing, overcoming limitations of CNNs and ViTs through linear computational scaling and global context modeling, while systematically analyzing innovations, applications, and future directions.

Authors:Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner
Title: Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction
Abstract:
We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.

Authors:Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal
Title: Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities
Abstract:
Multimodal magnetic resonance imaging (MRI) constitutes the first line of investigation for clinicians in the care of brain tumors, providing crucial insights for surgery planning, treatment monitoring, and biomarker identification. Pre-training on large datasets has been shown to help models learn transferable representations and adapt with minimal labeled data. This behavior is especially valuable in medical imaging, where annotations are often scarce. However, applying this paradigm to multimodal medical data introduces a challenge: most existing approaches assume that all imaging modalities are available during both pre-training and fine-tuning. In practice, missing modalities often occur due to acquisition issues, specialist unavailability, or specific experimental designs on small in-house datasets. Consequently, a common approach involves training a separate model for each desired modality combination, making the process both resource-intensive and impractical for clinical use. Therefore, we introduce BM-MAE, a masked image modeling pre-training strategy tailored for multimodal MRI data. The same pre-trained model seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. This allows fine-tuning on any subset of modalities without requiring architectural changes, while still benefiting from a model pre-trained on the full set of modalities. Extensive experiments show that the proposed pre-training strategy outperforms or remains competitive with baselines that require separate pre-training for each modality subset, while substantially surpassing training from scratch on several downstream tasks. Additionally, it can quickly and efficiently reconstruct missing modalities, highlighting its practical value. Code and trained models are available at: https://github.com/Lucas-rbnt/BM-MAE
Chinese: BM-MAE是一种针对多模态MRI的预训练策略,能适应任意可用模态组合,无需调整架构即可实现高效微调和缺失模态重建。
English: BM-MAE is a pre-training strategy for multimodal MRI that adapts to any available modality combination, enabling efficient fine-tuning and missing modality reconstruction without architectural changes.
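For context, BM-MAE builds on masked image modeling; the snippet below shows only the generic MAE-style random masking step over whatever modality tokens happen to be available, not the authors' multimodal scheme. The (B, N, D) token layout and the 75% mask ratio are assumptions for illustration.

```python
import torch

def random_patch_mask(tokens, mask_ratio=0.75):
    """MAE-style random masking over patch tokens.
    tokens: (B, N, D) patch embeddings from the available modalities.
    Returns kept tokens, a 0/1 mask (1 = masked), and indices to restore order."""
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # per-token random scores
    ids_shuffle = torch.argsort(noise, dim=1)        # ascending: lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to the original token order
    return kept, mask, ids_restore
```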

Authors:Jorgen Cani, Christos Diou, Spyridon Evangelatos, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Title: X-ray illicit object detection using hybrid CNN-transformer neural network architectures
Abstract:
In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging the literature has been dominated by the use of CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS datasets), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.
Chinese: 本研究评估了用于X射线安检物体检测的混合CNN-Transformer架构,发现相较于YOLOv8等传统卷积网络,该架构在应对领域偏移时展现出更强鲁棒性,基于公开数据集的全面测试揭示了不同架构组合的性能特点。
English: This study evaluates hybrid CNN-transformer architectures for X-ray security object detection, finding they offer enhanced robustness against domain shifts compared to conventional CNNs like YOLOv8, with comprehensive testing on public datasets revealing distinct performance trade-offs.

Authors:Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Title: KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution
Abstract:
Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in solving the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at https://antonibigata.github.io/KeySync.
Chinese: KeySync提出了一种两阶段框架,有效解决了唇形同步中的时间一致性、表情泄漏和面部遮挡问题,在唇形重建和跨同步方面取得了最先进的成果。
English: KeySync introduces a two-stage framework that effectively addresses temporal consistency, expression leakage, and facial occlusions in lip synchronization, achieving state-of-the-art results in lip reconstruction and cross-synchronization.

Authors:Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh
Title: JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Abstract:
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, namely, adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timesteps of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation. The project page is available at https://byungki-k.github.io/JointDiT/.
Chinese: JointDiT是一种扩散变换器,通过自适应调度权重和不平衡时间步采样技术,实现了RGB与深度信息的联合分布建模,能生成高质量图像和精确深度图。
English: JointDiT is a diffusion transformer that models the joint distribution of RGB and depth, generating high-fidelity images and accurate depth maps through adaptive scheduling weights and unbalanced timestep sampling.

Authors:Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yu Zhang, Yuchi Huo, Chao Wu
Title: Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly
Abstract:
3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method, which utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons against several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses the supervised learning method. The code has been released at https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.
Chinese: 本文提出了一种零样本三维零件装配方法,利用预训练扩散模型作为判别器指导零件组装而无需标注数据,通过创新的推开策略和迭代优化,在实验中超越了监督学习方法的效果。
English: This paper introduces a zero-shot 3D part assembly method using pre-trained diffusion models as discriminators to guide part manipulation without labeled data, achieving superior performance over supervised methods through a novel pushing-away strategy and iterative optimization.
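Since the abstract frames the diffusion-guided assembly as an Iterative Closest Point (ICP) process, the following is a minimal sketch of the classic ICP loop (nearest-neighbour matching plus an SVD-based rigid fit). It illustrates only the standard ICP machinery the method is related to; the diffusion-model guidance and pushing-away strategy are not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_fit_transform(A, B):
    """Least-squares rigid transform (R, t) mapping point set A onto B (Kabsch/SVD)."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    H = (A - cA).T @ (B - cB)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cB - R @ cA
    return R, t

def icp(source, target, iters=30):
    """Classic ICP: match each source point to its nearest target point,
    solve for the rigid transform, apply it, and repeat."""
    tree = cKDTree(target)
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        _, idx = tree.query(src)
        R, t = best_fit_transform(src, target[idx])
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```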

Authors:Wenxuan Liu, Yao Deng, Kang Chen, Xian Zhong, Zhaofei Yu, Tiejun Huang
Title: SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos
Abstract:
Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at https://github.com/lwxfight/sota.
Chinese: 提出的SOTA框架利用脉冲相机的高时间分辨率提升视觉显著性检测,通过微观和全局去偏组件解决复合噪声与样本偏差问题,在多种条件下优于现有方法。
English: The proposed SOTA framework leverages spike cameras' high temporal resolution to enhance saliency detection while addressing composite noise and bias through micro and global debiasing components, outperforming existing methods in diverse conditions.

Authors:Feng Xue, Wenzhuang Xu, Guofeng Zhong, Anlong Ming, Nicu Sebe
Title: Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation
Abstract:
Open-vocabulary 3D panoptic segmentation has recently emerged as a significant trend. Top-performing methods currently integrate 2D segmentation with geometry-aware 3D primitives. However, this advantage is lost without high-fidelity 3D point clouds, as in methods based on Neural Radiance Fields (NeRF), which are limited by an insufficient capacity to maintain consistency across partial observations. To address this, recent works have utilized contrastive loss or cross-view association pre-processing for view consensus. In contrast to them, we present Cues3D, a compact approach that relies solely on NeRF instead of pre-associations. The core idea is that NeRF's implicit 3D field inherently establishes a globally consistent geometry, enabling effective object distinction without explicit cross-view supervision. We propose a three-phase training framework for NeRF, initialization-disambiguation-refinement, whereby the instance IDs are corrected using the initially-learned knowledge. Additionally, an instance disambiguation method is proposed to match NeRF-rendered 3D masks and ensure globally unique 3D instance identities. With the aid of Cues3D, we obtain highly consistent and unique 3D instance IDs for each object across views with a balanced version of NeRF. Our experiments are conducted on ScanNet v2, ScanNet200, ScanNet++, and Replica datasets for 3D instance, panoptic, and semantic segmentation tasks. Cues3D outperforms other 2D image-based methods and competes with the latest 2D-3D merging based methods, while even surpassing them when using additional 3D point clouds. The code will be released at https://github.com/mRobotit/Cues3D.
Chinese: Cues3D是一种基于神经辐射场隐式三维几何的新型开放词汇全景分割方法,无需跨视图监督即可实现全局一致的目标区分,在多个数据集上超越了现有方法。
English: Cues3D is a novel method for open-vocabulary 3D panoptic segmentation that leverages Neural Radiance Field's implicit 3D geometry to achieve globally consistent object distinction without cross-view supervision, outperforming existing approaches across multiple datasets.

Authors:Usman Muhammad, Jorma Laaksonen, Lyudmila Mihaylova
Title: Towards Lightweight Hyperspectral Image Super-Resolution with Depthwise Separable Dilated Convolutional Network
Abstract:
Deep neural networks have demonstrated highly competitive performance in super-resolution (SR) for natural images by learning mappings from low-resolution (LR) to high-resolution (HR) images. However, hyperspectral super-resolution remains an ill-posed problem due to the high spectral dimensionality of the data and the scarcity of available training samples. Moreover, existing methods often rely on large models with a high number of parameters or require the fusion with panchromatic or RGB images, both of which are often impractical in real-world scenarios. Inspired by the MobileNet architecture, we introduce a lightweight depthwise separable dilated convolutional network (DSDCN) to address the aforementioned challenges. Specifically, our model leverages multiple depthwise separable convolutions, similar to the MobileNet architecture, and further incorporates a dilated convolution fusion block to make the model more flexible for the extraction of both spatial and spectral features. In addition, we propose a custom loss function that combines mean squared error (MSE), an L2 norm regularization-based constraint, and a spectral angle-based loss, ensuring the preservation of both spectral and spatial details. The proposed model achieves very competitive performance on two publicly available hyperspectral datasets, making it well-suited for hyperspectral image super-resolution tasks. The source codes are publicly available at: https://github.com/Usman1021/lightweight.
Chinese: 作者提出了一种轻量级的深度可分离扩张卷积网络(DSDCN)及定制损失函数,有效解决了高光谱超分辨率难题,在公开数据集上取得优异性能的同时保持了模型的高效性。
English: The authors propose a lightweight depthwise separable dilated convolutional network (DSDCN) with a custom loss function to effectively address hyperspectral super-resolution challenges, achieving competitive performance on public datasets while maintaining model efficiency.
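As a rough PyTorch sketch of the two ingredients named in the abstract, the block below pairs a depthwise convolution (with dilation) with a 1x1 pointwise convolution, and combines MSE, an L2 weight penalty, and a spectral-angle term in the loss. Layer sizes, loss weights, and the exact composition are assumptions for illustration, not the published DSDCN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableDilatedBlock(nn.Module):
    """Depthwise 3x3 conv with dilation followed by a 1x1 pointwise conv."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

def sr_loss(pred, target, model, l2_weight=1e-4, sam_weight=0.1):
    """MSE + L2 weight regularization + spectral-angle term, as a stand-in for the
    custom loss described in the abstract. pred/target: (B, C, H, W) cubes."""
    mse = F.mse_loss(pred, target)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    cos = F.cosine_similarity(pred, target, dim=1).clamp(-1 + 1e-7, 1 - 1e-7)
    sam = torch.acos(cos).mean()   # mean spectral angle over pixels
    return mse + l2_weight * l2 + sam_weight * sam
```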

Authors:Seungjun Shin, Suji Kim, Dokwan Oh
Title: Efficient Neural Video Representation with Temporally Coherent Modulation
Abstract:
Implicit neural representations (INR) have found successful applications across diverse domains. To employ INR in real-life, it is important to speed up training. In the field of INR for video applications, the state-of-the-art approach employs grid-type parametric encoding and successfully achieves a faster encoding speed in comparison to its predecessors. However, the grid usage, which does not consider the video's dynamic nature, leads to redundant use of trainable parameters. As a result, it has significantly lower parameter efficiency and higher bitrate compared to NeRV-style methods that do not use a parametric encoding. To address the problem, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that can capture dynamic characteristics of video. By decomposing the spatio-temporal 3D video data into a set of 2D grids with flow information, NVTM enables learning video representation rapidly and uses parameters efficiently. Our framework enables processing temporally corresponding pixels at once, resulting in the fastest encoding speed for a reasonable video quality, especially when compared to the NeRV-style method, with a speed increase of over 3 times. Also, it achieves average improvements of 1.54 dB/0.019 in PSNR/LPIPS on UVG (Dynamic) (even with 10% fewer parameters) and of 1.84 dB/0.013 in PSNR/LPIPS on MCL-JCV (Dynamic), compared to previous grid-type works. By expanding this to compression tasks, we demonstrate comparable performance to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super resolution, frame interpolation and video inpainting. Project page is https://sujiikim.github.io/NVTM/.

Authors:Derong Jin, Ruohan Gao
Title: Differentiable Room Acoustic Rendering with Multi-View Vision Priors
Abstract:
An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.

Authors:Jiuwu Hao, Liguo Sun, Yuting Wan, Yueyang Wu, Ti Xiang, Haolin Song, Pin Lv
Title: Is Intermediate Fusion All You Need for UAV-based Collaborative Perception?
Abstract:
Collaborative perception enhances environmental awareness through inter-agent communication and is regarded as a promising solution to intelligent transportation systems. However, existing collaborative methods for Unmanned Aerial Vehicles (UAVs) overlook the unique characteristics of the UAV perspective, resulting in substantial communication overhead. To address this issue, we propose a novel communication-efficient collaborative perception framework based on late-intermediate fusion, dubbed LIF. The core concept is to exchange informative and compact detection results and shift the fusion stage to the feature representation level. In particular, we leverage vision-guided positional embedding (VPE) and box-based virtual augmented feature (BoBEV) to effectively integrate complementary information from various agents. Additionally, we innovatively introduce an uncertainty-driven communication mechanism that uses uncertainty evaluation to select high-quality and reliable shared areas. Experimental results demonstrate that our LIF achieves superior performance with minimal communication bandwidth, proving its effectiveness and practicality. Code and models are available at https://github.com/uestchjw/LIF.
Chinese: 本文提出LIF框架,通过后期中间融合和不确定性驱动机制,实现无人机间高效通信的协同感知,在保证性能的同时大幅降低通信开销。
English: This paper introduces LIF, a communication-efficient collaborative perception framework for UAVs that uses late-intermediate fusion and an uncertainty-driven mechanism to minimize bandwidth while maintaining high performance.

Authors:Bahram Jafrasteh, Wei Peng, Cheng Wan, Yimin Luo, Ehsan Adeli, Qingyu Zhao
Title: WASABI: A Metric for Evaluating Morphometric Plausibility of Synthetic Brain MRIs
Abstract:
Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial anatomical fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the anatomical realism of synthetic brain MRIs. WASABI leverages SynthSeg, a deep learning-based brain parcellation tool, to derive volumetric measures of brain regions in each MRI and uses the multivariate Wasserstein distance to compare distributions between real and synthetic anatomies. Based on controlled experiments on two real datasets and synthetic MRIs from five generative models, WASABI demonstrates higher sensitivity in quantifying anatomical discrepancies compared to traditional image-level metrics, even when synthetic images achieve near-perfect visual quality. Our findings advocate for shifting the evaluation paradigm beyond visual inspection and conventional metrics, emphasizing anatomical fidelity as a crucial benchmark for clinically meaningful brain MRI synthesis. Our code is available at https://github.com/BahramJafrasteh/wasabi-mri.
Chinese: 本研究提出名为WASABI的新指标,通过结合脑区分割与Wasserstein距离来评估合成脑部MRI的解剖学真实性,实验表明即使合成图像视觉质量近乎完美,该指标仍能比传统方法更敏感地检测解剖差异。
English: This study introduces WASABI, a novel metric using Wasserstein distance and brain parcellation to evaluate anatomical realism in synthetic MRIs, demonstrating superior sensitivity to anatomical discrepancies compared to traditional metrics despite high visual quality.
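WASABI compares distributions of SynthSeg-derived region volumes with a multivariate Wasserstein distance. One simple way to make that concrete is to fit Gaussians to the real and synthetic volume matrices and evaluate the closed-form 2-Wasserstein distance between them, as sketched below; the Gaussian approximation and the (n_subjects, n_regions) input layout are assumptions, not necessarily the paper's estimator.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(vol_real, vol_synth):
    """2-Wasserstein distance between Gaussian fits of regional brain volumes.
    vol_real, vol_synth: (n_subjects, n_regions) arrays of per-region volumes."""
    mu_r, mu_s = vol_real.mean(axis=0), vol_synth.mean(axis=0)
    C_r = np.cov(vol_real, rowvar=False)
    C_s = np.cov(vol_synth, rowvar=False)
    Cs_half = sqrtm(C_s)
    cross = sqrtm(Cs_half @ C_r @ Cs_half)      # matrix geometric-mean term
    w2_sq = np.sum((mu_r - mu_s) ** 2) + np.trace(C_r + C_s - 2.0 * np.real(cross))
    return float(np.sqrt(max(w2_sq, 0.0)))
```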

Authors:Ting Qiao, Yingjia Wang, Xing Liu, Sixing Wu, Jianbing Li, Yiming Li
Title: Cert-SSB: Toward Certified Sample-Specific Backdoor Defense
Abstract:
Deep neural networks (DNNs) are vulnerable to backdoor attacks, where an attacker manipulates a small portion of the training data to implant hidden backdoors into the model. The compromised model behaves normally on clean samples but misclassifies backdoored samples into the attacker-specified target class, posing a significant threat to real-world DNN applications. Currently, several empirical defense methods have been proposed to mitigate backdoor attacks, but they are often bypassed by more advanced backdoor techniques. In contrast, certified defenses based on randomized smoothing have shown promise by adding random noise to training and testing samples to counteract backdoor attacks. In this paper, we reveal that existing randomized smoothing defenses implicitly assume that all samples are equidistant from the decision boundary. However, it may not hold in practice, leading to suboptimal certification performance. To address this issue, we propose a sample-specific certified backdoor defense method, termed Cert-SSB. Cert-SSB first employs stochastic gradient ascent to optimize the noise magnitude for each sample, ensuring a sample-specific noise level that is then applied to multiple poisoned training sets to retrain several smoothed models. After that, Cert-SSB aggregates the predictions of multiple smoothed models to generate the final robust prediction. In particular, in this case, existing certification methods become inapplicable since the optimized noise varies across different samples. To conquer this challenge, we introduce a storage-update-based certification method, which dynamically adjusts each sample's certification region to improve certification performance. We conduct extensive experiments on multiple benchmark datasets, demonstrating the effectiveness of our proposed method. Our code is available at https://github.com/NcepuQiaoTing/Cert-SSB.
Chinese: 深度神经网络易受后门攻击,现有基于随机平滑的认证防御方法假设所有样本与决策边界等距,导致性能不佳;本文提出Cert-SSB方法,通过为每个样本优化噪声水平并采用动态认证机制,有效提升了防御能力。
English: Deep neural networks are susceptible to backdoor attacks, but existing certified defenses using randomized smoothing assume uniform sample distances from decision boundaries, leading to suboptimal performance; this paper introduces Cert-SSB, a sample-specific defense that optimizes noise levels per sample and employs a dynamic certification method to enhance robustness.
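For background, the sketch below shows the standard randomized-smoothing prediction with a per-sample noise level and the usual Cohen-style certified L2 radius. Cert-SSB's per-sample noise optimization and storage-update certification are not reproduced; the Monte-Carlo sample count, class count, and the use of the empirical top-class probability (rather than a proper confidence lower bound) are simplifying assumptions.

```python
import torch
from scipy.stats import norm

def smoothed_predict(model, x, sigma, n=1000, num_classes=10):
    """Monte-Carlo estimate of the smoothed classifier g(x) = argmax_c P(f(x + noise) = c)
    under Gaussian noise with a per-sample magnitude sigma."""
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)
            counts[model(noisy.unsqueeze(0)).argmax(dim=1)] += 1
    top = int(counts.argmax())
    p_top = float(counts[top] / n)
    if p_top <= 0.5:
        return -1, 0.0                               # abstain: not confident enough
    radius = sigma * norm.ppf(min(p_top, 1 - 1e-6))  # certified L2 radius (Cohen et al.)
    return top, radius
```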

Authors:Abu Mohammed Raisuddin, Jesper Holmblad, Hamed Haghighi, Yuri Poledna, Maikol Funk Drechsler, Valentina Donzella, Eren Erdal Aksoy
Title: REHEARSE-3D: A Multi-modal Emulated Rain Dataset for 3D Point Cloud De-raining
Abstract:
Sensor degradation poses a significant challenge in autonomous driving. During heavy rainfall, the interference from raindrops can adversely affect the quality of LiDAR point clouds, resulting in, for instance, inaccurate point measurements. This, in turn, can potentially lead to safety concerns if autonomous driving systems are not weather-aware, i.e., if they are unable to discern such changes. In this study, we release a new, large-scale, multi-modal emulated rain dataset, REHEARSE-3D, to promote research advancements in 3D point cloud de-raining. Distinct from the most relevant competitors, our dataset is unique in several respects. First, it is the largest point-wise annotated dataset, and second, it is the only one with high-resolution LiDAR data (LiDAR-256) enriched with 4D Radar point clouds logged in both daytime and nighttime conditions in a controlled weather environment. Furthermore, REHEARSE-3D involves rain-characteristic information, which is of significant value not only for sensor noise modeling but also for analyzing the impact of weather at a point level. Leveraging REHEARSE-3D, we benchmark raindrop detection and removal in fused LiDAR and 4D Radar point clouds. Our comprehensive study further evaluates the performance of various statistical and deep-learning models. Upon publication, the dataset and benchmark models will be made publicly available at: https://sporsho.github.io/REHEARSE3D.

Authors:Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe
Title: Visual Text Processing: A Comprehensive Review and Unified Evaluation
Abstract:
Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at https://github.com/shuyansy/Visual-Text-Processing-survey.
Chinese: 本综述全面分析了视觉文本处理的最新进展,探讨了文本特征的选择与融合关键问题,并引入了新基准和评估指标,旨在为该领域的未来探索与创新提供基础资源。
English: This survey provides a comprehensive analysis of recent advances in visual text processing, addressing key questions about textual features and their integration into frameworks, while introducing a new benchmark and evaluation metric to guide future innovations in the field.

Authors:Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan
Title: HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation
Abstract:
The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.
Chinese: HoloTime框架通过集成视频扩散模型生成全景视频并重建4D场景,能将单个提示或图像转化为时空一致的沉浸式环境,为VR/AR应用提供完整的4D体验解决方案。
English: The HoloTime framework integrates video diffusion models to generate panoramic videos and reconstruct 4D scenes, enabling fully immersive VR/AR experiences by transforming single prompts or images into spatially and temporally consistent environments.

Authors:Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu
Title: Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Abstract:
With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine
Chinese: Mcity数据引擎是一个开源系统,旨在解决机器学习模型数据选择和标注的难题,特别关注智能交通系统中的稀有类别,提供从数据采集到模型部署的完整开发周期。
English: The Mcity Data Engine is an open-source system designed to address the challenge of selecting and labeling data for machine learning models, particularly focusing on rare classes in Intelligent Transportation Systems by providing a complete development cycle from data acquisition to model deployment.
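In spirit, the open-vocabulary selection step can be approximated by scoring unlabeled images against a free-form text prompt and keeping the top-ranked ones for labeling; the sketch below does this with an off-the-shelf CLIP model from Hugging Face. The model name, prompt, and ranking heuristic are assumptions for illustration; the engine's actual selection pipeline is in the linked repository.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def rank_images_by_prompt(image_paths, prompt="a photo of a delivery robot"):
    """Rank candidate images by CLIP similarity to an open-vocabulary text prompt."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = proc(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        sims = (img @ txt.T).squeeze(-1)             # cosine similarity per image
    order = sims.argsort(descending=True).tolist()
    return [image_paths[i] for i in order]
```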

Authors:Uzair Shah, Marco Agus, Daniya Boges, Vanessa Chiappini, Mahmood Alzubaidi, Jens Schneider, Markus Hadwiger, Pierre J. Magistretti, Mowafa Househ, Corrado Calì
Title: SAM4EM: Efficient memory-based two stage prompt-free segment anything model adapter for complex 3D neuroscience electron microscopy stacks
Abstract:
We present SAM4EM, a novel approach for 3D segmentation of complex neural structures in electron microscopy (EM) data by leveraging the Segment Anything Model (SAM) alongside advanced fine-tuning strategies. Our contributions include the development of a prompt-free adapter for SAM using two stage mask decoding to automatically generate prompt embeddings, a dual-stage fine-tuning method based on Low-Rank Adaptation (LoRA) for enhancing segmentation with limited annotated data, and a 3D memory attention mechanism to ensure segmentation consistency across 3D stacks. We further release a unique benchmark dataset for the segmentation of astrocytic processes and synapses. We evaluated our method on challenging neuroscience segmentation benchmarks, specifically targeting mitochondria, glia, and synapses, with significant accuracy improvements over state-of-the-art (SOTA) methods, including recent SAM-based adapters developed for the medical domain and other vision transformer-based approaches. Experimental results indicate that our approach outperforms existing solutions in the segmentation of complex processes like glia and post-synaptic densities. Our code and models are available at https://github.com/Uzshah/SAM4EM.
中文: SAM4EM提出了一种新方法,通过结合无提示适配器、双阶段微调和3D记忆注意力机制来增强分割一切模型,实现了电子显微镜数据中神经结构的3D分割,在复杂基准测试中表现优异,并发布了专用数据集。
English: SAM4EM introduces a novel method for 3D segmentation of neural structures in electron microscopy data by enhancing the Segment Anything Model with a prompt-free adapter, dual-stage fine-tuning, and 3D memory attention, achieving superior accuracy on challenging benchmarks and releasing a specialized dataset.

Authors:Mengting Wei, Yante Li, Tuomas Varanka, Yan Jiang, Guoying Zhao
Title: MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance
Abstract:
In this study, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This not only enables precise extraction of motion features from driving videos, but also contributes to the faithful preservation of face shape and geometry. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. These maps serve as motion guidance and are encoded into the denoising UNet through a specifically designed Geometric Guidance Encoder (GGE). A multi-layer feature fusion module with integrated self-attention mechanisms is used to combine facial appearance and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at https://github.com/weimengting/MagicPortrait.
中文: 本研究提出一种视频人脸重演方法,将FLAME三维人脸模型融入隐式扩散框架,通过几何引导和特征融合增强形状一致性与运动控制,实现了高质量、精准的面部动画生成。
English: This study introduces a video face reenactment method that integrates the FLAME 3D face model into a latent diffusion framework, enhancing shape consistency and motion control through geometric guidance and feature fusion for high-quality, precise facial animations.

Authors:Qinfeng Zhu, Yunxi Jiang, Lei Fan
Title: ClassWise-CRF: Category-Specific Fusion for Enhanced Semantic Segmentation of Remote Sensing Imagery
Abstract:
We propose a result-level category-specific fusion architecture called ClassWise-CRF. This architecture employs a two-stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise-CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category-specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category-specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise-CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise-CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise-CRF architecture in semantic segmentation of remote sensing images. The full code is available at https://github.com/zhuqinfeng1999/ClassWise-CRF.
中文:ClassWise-CRF架构通过类别自适应加权融合专家网络预测并结合CRF优化,在遥感图像语义分割中显著提升了mIoU指标。
English: The ClassWise-CRF architecture enhances semantic segmentation by adaptively fusing expert networks' predictions using category-specific weighting and CRF optimization, achieving significant mIoU improvements on remote sensing datasets.
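To make the fusion step concrete, here is a minimal sketch of the category-specific exponential weighting driven by validation IoU priors that the abstract describes; the exponent alpha, the normalization, and the array shapes are illustrative assumptions rather than the authors' exact formulation, and the CRF refinement stage is omitted.

```python
import numpy as np

def classwise_fuse(probs, val_iou, alpha=2.0):
    """Fuse per-pixel class probabilities from several expert networks.

    probs:   list of arrays, each (H, W, C), softmax output of one network
    val_iou: array (N, C), validation IoU of network n on class c (the prior)
    alpha:   assumed exponent controlling how strongly the IoU prior dominates
    """
    weights = val_iou ** alpha                               # exponential weighting
    weights = weights / weights.sum(axis=0, keepdims=True)   # normalize per class
    fused = np.zeros_like(probs[0])
    for n, p in enumerate(probs):
        fused += p * weights[n]                              # (C,) broadcast over (H, W, C)
    return fused.argmax(axis=-1)                             # fused label map, before CRF refinement
```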

Authors:Hebaixu Wang, Jing Zhang, Haonan Guo, Di Wang, Jiayi Ma, Bo Du
Title: DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration
Abstract:
Diffusion models have achieved remarkable progress in universal image restoration. While existing methods speed up inference by reducing sampling steps, substantial step intervals often introduce cumulative errors. Moreover, they struggle to balance the commonality of degradation representations and restoration quality. To address these challenges, we introduce DGSolver, a diffusion generalist solver with universal posterior sampling. We first derive the exact ordinary differential equations for generalist diffusion models and tailor high-order solvers with a queue-based accelerated sampling strategy to improve both accuracy and efficiency. We then integrate universal posterior sampling to better approximate manifold-constrained gradients, yielding a more accurate noise estimation and correcting errors in inverse inference. Extensive experiments show that DGSolver outperforms state-of-the-art methods in restoration accuracy, stability, and scalability, both qualitatively and quantitatively. Code and models will be available at https://github.com/MiliLab/DGSolver.
中文: DGSolver 是一种扩散通用求解器,通过结合高阶求解器与加速采样策略及通用后验采样,有效提升了图像修复的精度、稳定性和扩展性,显著优于现有技术。
English: DGSolver is a diffusion generalist solver that enhances image restoration by employing high-order solvers with accelerated sampling and universal posterior sampling, achieving superior accuracy, stability, and scalability over existing methods.

Authors:Xinyu Li, Qi Yao, Yuanda Wang
Title: GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers
Abstract:
Garment sewing patterns are fundamental design elements that bridge the gap between design concepts and practical manufacturing. The generative modeling of sewing patterns is crucial for creating diversified garments. However, existing approaches are limited either by reliance on a single input modality or by suboptimal generation efficiency. In this work, we present GarmentDiffusion, a new generative model capable of producing centimeter-precise, vectorized 3D sewing patterns from multimodal inputs (text, image, and incomplete sewing pattern). Our method efficiently encodes 3D sewing pattern parameters into compact edge token representations, achieving a sequence length that is 10 times shorter than that of the autoregressive SewingGPT in DressCode. By employing a diffusion transformer, we simultaneously denoise all edge tokens along the temporal axis, while maintaining a constant number of denoising steps regardless of dataset-specific edge and panel statistics. With all combination of designs of our model, the sewing pattern generation speed is accelerated by 100 times compared to SewingGPT. We achieve new state-of-the-art results on DressCodeData, as well as on the largest sewing pattern dataset, namely GarmentCodeData. The project website is available at https://shenfu-research.github.io/Garment-Diffusion/.

Authors:Jingjing Liu, Nian Wu, Xianchao Xiu, Jianhua Zhang
Title: Robust Orthogonal NMF with Label Propagation for Image Clustering
Abstract:
Non-negative matrix factorization (NMF) is a popular unsupervised learning approach widely used in image clustering. However, in real-world clustering scenarios, most existing NMF methods are highly sensitive to noise corruption and are unable to effectively leverage limited supervised information. To overcome these drawbacks, we propose a unified non-convex framework with label propagation called robust orthogonal nonnegative matrix factorization (RONMF). This method not only considers the graph Laplacian and label propagation as regularization terms but also introduces a more effective non-convex structure to measure the reconstruction error and imposes orthogonal constraints on the basis matrix to reduce the noise corruption, thereby achieving higher robustness. To solve RONMF, we develop an alternating direction method of multipliers (ADMM)-based optimization algorithm. In particular, all subproblems have closed-form solutions, which ensures its efficiency. Experimental evaluations on eight public image datasets demonstrate that the proposed RONMF outperforms state-of-the-art NMF methods across various standard metrics and shows excellent robustness. The code will be available at https://github.com/slinda-liu.
中文: 本文提出了一种鲁棒正交非负矩阵分解(RONMF)方法,通过结合标签传播和非凸优化来提高聚类精度和抗噪性,在八个图像数据集上的实验验证了其优越性能。
English: This paper introduces a robust orthogonal nonnegative matrix factorization (RONMF) method that enhances clustering accuracy and noise resistance by incorporating label propagation and non-convex optimization, validated through superior performance on eight image datasets.

Authors:Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Title: SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
Abstract:
With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at https://github.com/zackhxn/SeriesBench-CVPR2025.
中文: 本文提出了SeriesBench这一评估多模态大语言模型对叙事驱动视频系列理解能力的新基准,并开发了PC-DCoT推理框架,该框架能有效提升模型在分析复杂剧情结构和角色关系方面的性能表现。
English: This paper introduces SeriesBench, a novel benchmark for evaluating Multi-modal Large Language Models' understanding of narrative-driven video series, and proposes PC-DCoT, a reasoning framework that enhances models' performance in analyzing complex plot structures and character relationships.

Authors:Weicai Yan, Wang Lin, Zirun Guo, Ye Wang, Fangming Feng, Xiaoda Yang, Zehan Wang, Tao Jin
Title: Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision
Abstract:
Prompt learning has demonstrated promising results in fine-tuning pre-trained multimodal models. However, the performance improvement is limited when applied to more complex and fine-grained tasks. The reason is that most existing methods directly optimize the parameters involved in the prompt generation process through loss backpropagation, which constrains the richness and specificity of the prompt representations. In this paper, we propose Diffusion-Driven Prompt Generator (Diff-Prompt), aiming to use the diffusion model to generate rich and fine-grained prompt information for complex downstream tasks. Specifically, our approach consists of three stages. In the first stage, we train a Mask-VAE to compress the masks into latent space. In the second stage, we leverage an improved Diffusion Transformer (DiT) to train a prompt generator in the latent space, using the masks for supervision. In the third stage, we align the denoising process of the prompt generator with the pre-trained model in the semantic space, and use the generated prompts to fine-tune the model. We conduct experiments on a complex pixel-level downstream task, referring expression comprehension, and compare our method with various parameter-efficient fine-tuning approaches. Diff-Prompt achieves a maximum improvement of 8.87 in R@1 and 14.05 in R@5 compared to the foundation model and also outperforms other state-of-the-art methods across multiple metrics. The experimental results validate the effectiveness of our approach and highlight the potential of using generative models for prompt generation. Code is available at https://github.com/Kelvin-ywc/diff-prompt.
中文: 提示学习在微调多模态模型中展现出潜力,但在复杂任务中因提示信息不够丰富而受限,为此提出的Diff-Prompt方法利用扩散模型生成精细提示,在实验中取得了显著性能提升。
English: Prompt learning shows potential in fine-tuning multimodal models but struggles with complex tasks due to limited prompt richness, leading to the proposed Diff-Prompt method that uses a diffusion model to generate detailed prompts and achieves significant performance improvements in experiments.

Authors:Yumeng Shi, Quanyu Long, Wenya Wang
Title: Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
Abstract:
Video question answering benefits from the rich information in videos, enabling various applications. However, the large volume of tokens generated from long videos presents challenges to memory efficiency and model performance. To alleviate this, existing works propose to compress video inputs, but often overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. We propose a novel token selection strategy, explore-then-select, that adaptively adjusts static and dynamic information based on question requirements. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. Then it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our framework is plug-and-play and can be seamlessly integrated within diverse video language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) on multiple video question answering benchmarks. Our code is available at https://github.com/ANDgate99/Explore-Then-Select.
Chinese: 我们提出的“探索后选择”框架根据查询需求自适应平衡静态与动态视频标记,通过即插即用集成在视频问答基准上实现最高5.8%的性能提升。
English: Our proposed "explore-then-select" framework adaptively balances static and dynamic video tokens based on query requirements, achieving up to 5.8% performance gains on video QA benchmarks through plug-and-play integration.
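A rough sketch of the query-aware selection step under simple assumptions: candidate key-frame and delta-frame tokens are pooled, scored against a question embedding, and the top-scoring tokens are kept within the budget. The plain similarity score below is a stand-in for the paper's attention-based metric.

```python
import torch

def select_tokens(key_tokens, delta_tokens, query_emb, budget):
    """key_tokens: (Nk, D) spatial-detail tokens, delta_tokens: (Nd, D) temporal-change
    tokens, query_emb: (D,) question embedding, budget: number of tokens to keep."""
    tokens = torch.cat([key_tokens, delta_tokens], dim=0)        # pool both allocations
    scores = tokens @ query_emb / (query_emb.norm() + 1e-6)      # query-aware relevance
    keep = scores.topk(min(budget, tokens.size(0))).indices      # select within the budget
    return tokens[keep]
```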

Authors:Hong Zhang, Zhongjie Duan, Xingjun Wang, Yuze Zhao, Weiyi Lu, Zhipeng Di, Yixuan Xu, Yingda Chen, Yu Zhang
Title: Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
Abstract:
Unified multimodal generative models aim to integrate image understanding and generation abilities, offering significant advantages in harnessing multimodal corpora, particularly interleaved text-image data. However, existing unified models exhibit limitations in image synthesis quality, autoregressive error accumulation, and image editing capability. In this work, we propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. This shared space serves as a bridge for the autoregressive and diffusion models, which seamlessly integrates their complementary strengths in cross-modal modeling. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy that aligns training-inference dynamics by prefilling input sequences with learnable embeddings. After multi-stage and multi-task training on our constructed large-scale dataset with 26.3 million samples, Nexus-Gen achieves state-of-the-art performance on the evaluation benchmarks spanning image understanding, generation and editing tasks. All models, datasets, and source codes are released in https://github.com/modelscope/Nexus-Gen to facilitate further advancements across the field.
中文: Nexus-Gen提出了一种在共享图像嵌入空间中融合自回归与扩散模型的统一架构,通过创新的预填充自回归策略和多任务训练,在图像理解、生成和编辑任务中实现了最先进的性能。
English: Nexus-Gen introduces a unified architecture combining autoregressive and diffusion models in a shared image embedding space, achieving state-of-the-art performance in image understanding, generation, and editing through innovative prefilled autoregression and multi-task training.
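The prefilled-autoregression idea can be illustrated with a small sketch: positions that will hold predicted image embeddings are filled with shared learnable placeholders, so the autoregressive backbone sees the same inputs at training and inference instead of ground-truth embeddings it would never see at test time. The backbone, token count, and dimensions below are placeholders, not the released Nexus-Gen configuration.

```python
import torch
import torch.nn as nn

class PrefilledImageDecoder(nn.Module):
    """Sketch of prefilled autoregression for image-embedding prediction."""
    def __init__(self, backbone, n_img_tokens=64, dim=1024):
        super().__init__()
        self.backbone = backbone                                  # any (B, L, D) -> (B, L, D) sequence model
        self.placeholder = nn.Parameter(torch.randn(n_img_tokens, dim) * 0.02)

    def forward(self, text_embeds):                               # (B, L_text, D)
        b = text_embeds.size(0)
        prefill = self.placeholder.unsqueeze(0).expand(b, -1, -1)
        seq = torch.cat([text_embeds, prefill], dim=1)            # identical inputs at train and test time
        out = self.backbone(seq)
        return out[:, -self.placeholder.size(0):]                 # predicted image embeddings
```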

Authors:Luoting Zhuang, Seyed Mohammad Hossein Tabatabaei, Ramin Salehi-Rad, Linh M. Tran, Denise R. Aberle, Ashley E. Prosper, William Hsu
Title: Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction
Abstract:
Machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists' assessments of nodules, guiding the model to learn clinically relevant, robust, and explainable imaging features for predicting lung cancer. We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST) with 1,246 nodules and semantic features. Additionally, the Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic text features and predict the one-year lung cancer diagnosis. Our model outperformed state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP, we also obtained predictions on semantic features through zero-shot inference, such as nodule margin (AUROC: 0.812), nodule consistency (0.812), and pleural attachment (0.840). Our approach surpasses the SOTA models in predicting lung cancer across datasets collected from diverse clinical settings, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings. The code is available at https://github.com/luotingzhuang/CLIP_nodule.
中文: 本研究整合放射科医生的语义特征与微调的CLIP模型预测肺癌,在多个数据集中实现卓越性能并提供可解释结果。
English: This research integrates radiologists' semantic features with a fine-tuned CLIP model to predict lung cancer, achieving superior performance and explainable results across multiple datasets.

Authors:Khoa Tuan Nguyen, Ho-min Park, Gaeun Oh, Joris Vankerschaver, Wesley De Neve
Title: Towards Improved Cervical Cancer Screening: Vision Transformer-Based Classification and Interpretability
Abstract:
We propose a novel approach to cervical cell image classification for cervical cancer screening using the EVA-02 transformer model. We developed a four-step pipeline: fine-tuning EVA-02, feature extraction, selecting important features through multiple machine learning models, and training a new artificial neural network with optional loss weighting for improved generalization. With this design, our best model achieved an F1-score of 0.85227, outperforming the baseline EVA-02 model (0.84878). We also utilized Kernel SHAP analysis and identified key features correlating with cell morphology and staining characteristics, providing interpretable insights into the decision-making process of the fine-tuned model. Our code is available at https://github.com/Khoa-NT/isbi2025_ps3c.
中文: 我们提出了一种基于EVA-02转换器模型的宫颈细胞图像分类新方法,通过四步流程实现了0.85227的优异F1分数,并利用SHAP分析提供了可解释的决策依据。
English: We introduce a novel cervical cell classification method using the EVA-02 transformer model with a four-step pipeline that achieved a superior F1-score of 0.85227 and provided interpretable insights through SHAP analysis.

Authors:Zhelun Shen, Zhuo Li, Chenming Wu, Zhibo Rao, Lina Liu, Yuchao Dai, Liangjun Zhang
Title: CMD: Constraining Multimodal Distribution for Domain Adaptation in Stereo Matching
Abstract:
Recently, learning-based stereo matching methods have achieved great improvement in public benchmarks, where soft argmin and smooth L1 loss play a core contribution to their success. However, in unsupervised domain adaptation scenarios, we observe that these two operations often yield multimodal disparity probability distributions in target domains, resulting in degraded generalization. In this paper, we propose a novel approach, Constrain Multi-modal Distribution (CMD), to address this issue. Specifically, we introduce uncertainty-regularized minimization and anisotropic soft argmin to encourage the network to produce predominantly unimodal disparity distributions in the target domain, thereby improving prediction accuracy. Experimentally, we apply the proposed method to multiple representative stereo-matching networks and conduct domain adaptation from synthetic data to unlabeled real-world scenes. Results consistently demonstrate improved generalization in both top-performing and domain-adaptable stereo-matching models. The code for CMD will be available at: https://github.com/gallenszl/CMD.
中文: 本文提出约束多模态分布(CMD)方法,通过不确定性正则化最小化和各向异性软argmin操作,在立体匹配的无监督域适应中促进单模态视差分布,从而提升模型在真实场景中的泛化性能。
English: This paper introduces the Constrain Multi-modal Distribution (CMD) method to enhance unsupervised domain adaptation in stereo matching by promoting unimodal disparity distributions through uncertainty-regularized minimization and anisotropic soft argmin, improving generalization in real-world scenarios.
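For context, the soft-argmin operation that CMD constrains is the standard disparity regression sketched below; the temperature knob is only an illustrative addition, and the paper's anisotropic variant and uncertainty-regularized loss are not reproduced here.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost_volume, temperature=1.0):
    """Expected disparity from a cost volume of shape (B, D, H, W)."""
    b, d, h, w = cost_volume.shape
    prob = F.softmax(-cost_volume / temperature, dim=1)           # disparity probability distribution
    disp = torch.arange(d, device=cost_volume.device).float().view(1, d, 1, 1)
    return (prob * disp).sum(dim=1)                               # (B, H, W); a multimodal prob degrades this estimate
```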

Authors:Jinpeng Wang, Tianci Luo, Yaohua Zha, Yan Feng, Ruisheng Luo, Bin Chen, Tao Dai, Long Chen, Yaowei Wang, Shu-Tao Xia
Title: Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning
Abstract:
Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single "ideal" prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone, Condenser ensures accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-arts across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code is open-sourced at https://github.com/gimpong/CVPR25-Condenser.
中文: 本研究提出Condenser方法,通过协同压缩多个提示实现视觉上下文学习优化,在基准测试中展现出优于现有技术的效率与性能表现。
English: The study introduces Condenser, a prompt condensation method that collaboratively compresses multiple prompts to enhance visual in-context learning, outperforming existing techniques in efficiency and performance across benchmarks.

Authors:Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, Yuheng Li
Title: YoChameleon: Personalized Vision and Language Generation
Abstract:
Large Multimodal Models (e.g., GPT-4, Gemini, Chameleon) have evolved into powerful tools with millions of users. However, they remain generic models and lack personalized knowledge of specific user concepts. Previous work has explored personalization for text generation, yet it remains unclear how these methods can be adapted to new modalities, such as image generation. In this paper, we introduce Yo'Chameleon, the first attempt to study personalization for large multimodal models. Given 3-5 images of a particular concept, Yo'Chameleon leverages soft-prompt tuning to embed subject-specific information to (i) answer questions about the subject and (ii) recreate pixel-level details to produce images of the subject in new contexts. Yo'Chameleon is trained with (i) a self-prompting optimization mechanism to balance performance across multiple modalities, and (ii) a ``soft-positive" image generation approach to enhance image quality in a few-shot setting.
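A minimal sketch of the soft-prompt tuning mechanism the abstract relies on: a handful of learnable vectors are prepended to the frozen model's input embeddings and are the only trained parameters. The token count and width are assumptions, and the self-prompting and soft-positive strategies are not shown.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Subject-specific learnable prompt vectors for a frozen multimodal model."""
    def __init__(self, n_tokens=16, dim=4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds):                              # (B, L, D) frozen-model token embeddings
        b = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)           # (B, n_tokens + L, D)
```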

Authors:Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li
Title: X-Fusion: Introducing New Modality to Frozen Large Language Models
Abstract:
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.

Authors:Quentin Guimard, Moreno D'Incà, Massimiliano Mancini, Elisa Ricci
Title: Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
Abstract:
A person downloading a pre-trained model from the web should be aware of its biases. Existing approaches for bias identification rely on datasets containing labels for the task of interest, something that a non-expert may not have access to, or may not have the necessary resources to collect: this greatly limits the number of tasks where model biases can be identified. In this work, we present Classifier-to-Bias (C2B), the first bias discovery framework that works without access to any labeled data: it only relies on a textual description of the classification task to identify biases in the target classification model. This description is fed to a large language model to generate bias proposals and corresponding captions depicting biases together with task-specific target labels. A retrieval model collects images for those captions, which are then used to assess the accuracy of the model w.r.t. the given biases. C2B is training-free, does not require any annotations, has no constraints on the list of biases, and can be applied to any pre-trained model on any classification task. Experiments on two publicly available datasets show that C2B discovers biases beyond those of the original datasets and outperforms a recent state-of-the-art bias detection baseline that relies on task-specific annotations, being a promising first step toward addressing task-agnostic unsupervised bias detection.
中文:C2B框架无需标注数据即可检测预训练模型中的偏见,它通过任务描述生成偏见建议并评估模型准确性,其性能优于依赖标注的方法,推动了无监督偏见检测的发展。
English: The C2B framework enables bias detection in pre-trained models without labeled data by using task descriptions to generate bias proposals and assess model accuracy, outperforming annotation-dependent methods and advancing unsupervised bias discovery.

Authors:Lorenzo Pellegrini, Davide Cozzolino, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Luisa Verdoliva, Marco Prati, Marco Ramilli
Title: AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection
Abstract:
The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.
Chinese: Ai-GenBench 是一种新颖的时序基准,旨在通过评估检测方法对新型生成模型(如从GAN到扩散模型的过渡)的泛化能力,来稳健检测AI生成图像,同时为实际应用提供标准化工具和数据集。
English: Ai-GenBench is a novel temporal benchmark designed to robustly detect AI-generated images by evaluating detection methods' generalization to new generative models, such as the transition from GANs to diffusion models, while providing standardized tools and datasets for practical use.

Authors:Mainak Singha, Subhankar Roy, Sarthak Mehrotra, Ankit Jha, Moloud Abdar, Biplab Banerjee, Elisa Ricci
Title: FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
Abstract:
In federated learning, textual prompt tuning adapts Vision-Language Models (e.g., CLIP) by tuning lightweight input tokens (or prompts) on local client data, while keeping network weights frozen. After training, only the prompts are shared by the clients with the central server for aggregation. However, textual prompt tuning suffers from overfitting to known concepts, limiting its generalizability to unseen concepts. To address this limitation, we propose Multimodal Visual Prompt Tuning (FedMVP) that conditions the prompts on multimodal contextual information - derived from the input image and textual attribute features of a class. At the core of FedMVP is a PromptFormer module that synergistically aligns textual and visual features through a cross-attention mechanism. The dynamically generated multimodal visual prompts are then input to the frozen vision encoder of CLIP, and trained with a combination of CLIP similarity loss and a consistency loss. Extensive evaluation on 20 datasets, spanning three generalization settings, demonstrates that FedMVP not only preserves performance on in-distribution classes and domains, but also displays higher generalizability to unseen classes and domains, surpassing state-of-the-art methods by a notable margin of +1.57% - 2.26%. Code is available at https://github.com/mainaksingha01/FedMVP.
中文摘要:提出的FedMVP方法通过结合图像和文本特征生成多模态视觉提示,在保持已知概念性能的同时,显著提升了联邦学习对未知概念的泛化能力。
English Summary: The proposed FedMVP method enhances federated learning by generating multimodal visual prompts using both image and text features, significantly improving generalization to unseen concepts while maintaining performance on known ones.
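The core of FedMVP, conditioning prompts on multimodal context through cross-attention, can be sketched as follows; layer sizes, the number of prompts, and the projection are illustrative stand-ins for the actual PromptFormer module.

```python
import torch
import torch.nn as nn

class PromptFormerSketch(nn.Module):
    """Cross-attention from textual attribute features (queries) to image features
    (keys/values), producing multimodal visual prompts for the frozen CLIP encoder."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, P, D) class-attribute embeddings, image_feats: (B, N, D) patch features
        fused, _ = self.attn(text_feats, image_feats, image_feats)
        return self.proj(fused)                                   # (B, P, D) dynamic visual prompts
```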

Authors:Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, Yi Yang
Title: In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
Abstract:
Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M) and computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing ICEdit, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) An in-context editing paradigm without architectural modifications; (2) Minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1% of the training data and 1% trainable parameters compared to previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing. Codes and demos can be found in https://river-zhang.github.io/ICEdit-gh-pages/.
中文摘要:ICEdit提出了一种基于扩散变换器的高效指令图像编辑方法,通过上下文编辑、参数高效微调和噪声样本优化三大创新,仅用极少训练数据和参数即实现了最优编辑性能。
English Summary: ICEdit introduces an efficient instruction-based image editing method using Diffusion Transformers, achieving superior performance with minimal training data and parameters through in-context editing, parameter-efficient fine-tuning, and noise sample optimization.

Authors:Long Liu, Cihui Yang
Title: OG-HFYOLO: Orientation gradient guidance and heterogeneous feature fusion for deformation table cell instance segmentation
Abstract:
Table structure recognition is a key task in document analysis. However, the geometric deformation in deformed tables causes a weak correlation between content information and structure, resulting in downstream tasks not being able to obtain accurate content information. To obtain fine-grained spatial coordinates of cells, we propose the OG-HFYOLO model, which enhances the edge response by Gradient Orientation-aware Extractor, combines a Heterogeneous Kernel Cross Fusion module and a scale-aware loss function to adapt to multi-scale objective features, and introduces mask-driven non-maximal suppression in the post-processing, which replaces the traditional bounding box suppression mechanism. Furthermore, we also propose a data generator, filling the gap in the dataset for fine-grained deformation table cell spatial coordinate localization, and derive a large-scale dataset named Deformation Wired Table (DWTAL). Experiments show that our proposed model demonstrates excellent segmentation accuracy on all mainstream instance segmentation models. The dataset and the source code are open source: https://github.com/justliulong/OGHFYOLO.
中文: 针对表格结构识别中的几何变形问题,本文提出了OG-HFYOLO模型以增强边缘响应和多尺度特征适应,并创建了DWTAL数据集来支持细粒度单元格空间坐标定位。
English: The OG-HFYOLO model is proposed to address geometric deformation in table structure recognition by enhancing edge detection and multi-scale feature adaptation, while a new dataset DWTAL is created to support fine-grained spatial coordinate localization.
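The mask-driven suppression mentioned in the abstract can be approximated by greedy NMS computed on mask IoU rather than box IoU, sketched below; the exact criterion used by OG-HFYOLO may differ.

```python
import numpy as np

def mask_nms(masks, scores, iou_thresh=0.5):
    """masks: (N, H, W) boolean instance masks, scores: (N,) confidences.
    Returns indices of kept instances."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        inter = (masks[i] & masks[rest]).sum(axis=(1, 2))         # overlap with remaining masks
        union = (masks[i] | masks[rest]).sum(axis=(1, 2))
        iou = inter / np.maximum(union, 1)
        order = rest[iou < iou_thresh]                            # drop heavily overlapping masks
    return keep
```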

Authors:Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen
Title: Dynamic Attention Analysis for Backdoor Detection in Text-to-Image Diffusion Models
Abstract:
Recent studies have revealed that text-to-image diffusion models are vulnerable to backdoor attacks, where attackers implant stealthy textual triggers to manipulate model outputs. Previous backdoor detection methods primarily focus on the static features of backdoor samples. However, a vital property of diffusion models is their inherent dynamism. This study introduces a novel backdoor detection perspective named Dynamic Attention Analysis (DAA), showing that these dynamic characteristics serve as better indicators for backdoor detection. Specifically, by examining the dynamic evolution of cross-attention maps, we observe that backdoor samples exhibit distinct feature evolution patterns at the <EOS> token compared to benign samples. To quantify these dynamic anomalies, we first introduce DAA-I, which treats the tokens' attention maps as spatially independent and measures dynamic feature using the Frobenius norm. Furthermore, to better capture the interactions between attention maps and refine the feature, we propose a dynamical system-based approach, referred to as DAA-S. This model formulates the spatial correlations among attention maps using a graph-based state equation and we theoretically analyze the global asymptotic stability of this method. Extensive experiments across five representative backdoor attack scenarios demonstrate that our approach significantly surpasses existing detection methods, achieving an average F1 Score of 79.49% and an AUC of 87.67%. The code is available at https://github.com/Robin-WZQ/DAA.
中文: 本研究提出动态注意力分析(DAA)方法,通过捕捉交叉注意力图的动态演化模式来有效检测文本到图像扩散模型中的后门攻击,在多项实验中平均F1分数达79.49%,显著优于现有检测方法。
English: This study introduces Dynamic Attention Analysis (DAA), a novel detection method that leverages the dynamic evolution patterns in cross-attention maps to effectively identify backdoor attacks in text-to-image diffusion models, significantly outperforming existing approaches with an average F1 Score of 79.49%.
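The spirit of DAA-I, measuring how the <EOS> cross-attention map evolves across denoising steps with a Frobenius norm, can be sketched as a simple statistic; the exact feature and the graph-based DAA-S refinement are not reproduced here.

```python
import numpy as np

def daa_i_feature(eos_attn_maps):
    """eos_attn_maps: (T, H, W), cross-attention map at the <EOS> token per timestep.
    Returns the mean Frobenius norm of step-to-step changes; backdoored prompts are
    expected to show a different evolution pattern than benign ones."""
    diffs = np.diff(eos_attn_maps, axis=0)                        # (T-1, H, W) step-to-step changes
    norms = np.linalg.norm(diffs.reshape(len(diffs), -1), axis=1) # Frobenius norm per step
    return norms.mean()
```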

Authors:Jiajun Ding, Beiyao Zhu, Xiaosheng Liu, Lishen Zhang, Zhao Liu
Title: LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight
Abstract:
This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, which bridges the gap of the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed based on the nnUNet format. By systematic technical validation and evaluation of the preprocessing process, annotation quality and automatic segmentation algorithm, the deep learning model trained based on this dataset is verified to achieve accurate segmentation of lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which proves the applicability and stability of this dataset in accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at https://github.com/SuperD0122/LymphAtlas-.
中文: 本研究通过整合PET和CT数据构建了三维多模态淋巴瘤分割数据集,基于深度学习实现病灶精准分割并推动精准医疗发展。
English: This study creates a 3D multimodal lymphoma segmentation dataset by integrating PET and CT data, enabling accurate lesion segmentation through deep learning and advancing precision medicine.

Authors:Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang
Title: PixelHacker: Image Inpainting with Structural and Semantic Consistency
Abstract:
Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (potential 116 and 21 categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/PixelHacker.
中文: PixelHacker提出了一种新颖的潜在类别引导范式,通过分别编码前景背景特征并在去噪过程中注入,有效解决了图像修复中的结构与语义难题,在多个数据集上全面超越现有最优方法。
English: PixelHacker introduces a novel latent categories guidance paradigm and a diffusion-based model that significantly outperforms state-of-the-art methods by addressing structural and semantic challenges in image inpainting through separate foreground/background embeddings and linear attention.
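A rough sketch of injecting fixed-size foreground/background embeddings into denoising features via linear attention; the kernel (elu + 1), dimensions, and residual form are assumptions, not the released PixelHacker design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionInjection(nn.Module):
    """Injects a small set of latent-category embeddings into UNet features."""
    def __init__(self, dim=320, n_ctx=16):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # foreground/background embeddings
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, feats):                                     # (B, N, D) denoising features
        q = F.elu(self.to_q(feats)) + 1
        k = F.elu(self.to_k(self.ctx)) + 1                        # (n_ctx, D)
        v = self.to_v(self.ctx)                                   # (n_ctx, D)
        kv = k.T @ v                                              # (D, D) linear-attention summary
        z = q @ k.sum(dim=0, keepdim=True).T                      # (B, N, 1) normalizer
        return feats + (q @ kv) / (z + 1e-6)                      # residual injection of context
```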

Authors:Amaan Izhar, Nurul Japar, Norisma Idris, Ting Dang
Title: MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation
Abstract:
Medical image reporting (MIR) aims to generate structured clinical descriptions from radiological images. Existing methods struggle with fine-grained feature extraction, multimodal alignment, and generalization across diverse imaging types, often relying on vanilla transformers and focusing primarily on chest X-rays. We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion, designed to address these limitations. Our architecture includes: (i) a multiscale vision encoder (MSVE) for capturing anatomical details at varying resolutions, (ii) a multihead dual-branch latent attention (MDLA) module for vision-language alignment through latent bottleneck representations, and (iii) a modulated mixture-of-experts (MoE) decoder for adaptive expert specialization. We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results on COVCTR, MMR, PGROSS, and ROCO datasets. Extensive experiments and ablations confirm improved clinical accuracy, cross-modal alignment, and model interpretability. Code is available at https://github.com/AI-14/micar-vl-moe.
Chinese: 提出的MicarVLMoE模型通过门控交叉对齐融合架构解决了细粒度特征提取和多模态对齐的局限,在包括CT扫描和MRI在内的多种医学影像类型上实现了最先进的性能。
English: The proposed MicarVLMoE model overcomes limitations in fine-grained feature extraction and multimodal alignment through its gated cross-aligned fusion architecture, achieving state-of-the-art performance across diverse medical imaging types including CT scans and MRI.
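A compact sketch of a gated mixture-of-experts feed-forward block of the kind the MoE decoder uses; the expert count, widths, and dense (non-sparse) routing are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMoE(nn.Module):
    """Token-wise gated mixture of expert MLPs."""
    def __init__(self, dim=512, n_experts=4, hidden=1024):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                         # (B, L, D)
        weights = F.softmax(self.gate(x), dim=-1)                 # (B, L, E) routing weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, L, D, E)
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)    # weighted expert combination
```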

Authors:Cedric Le Gentil, Leonardo Brizi, Daniil Lisus, Xinyuan Qiao, Giorgio Grisetti, Timothy D. Barfoot
Title: DRO: Doppler-Aware Direct Radar Odometry
Abstract:
A renaissance in radar-based sensing for mobile robotic applications is underway. Compared to cameras or lidars, millimetre-wave radars have the ability to `see' through thin walls, vegetation, and adversarial weather conditions such as heavy rain, fog, snow, and dust. In this paper, we propose a novel SE(2) odometry approach for spinning frequency-modulated continuous-wave radars. Our method performs scan-to-local-map registration of the incoming radar data in a direct manner using all the radar intensity information without the need for feature or point cloud extraction. The method performs locally continuous trajectory estimation and accounts for both motion and Doppler distortion of the radar scans. If the radar possesses a specific frequency modulation pattern that makes radial Doppler velocities observable, an additional Doppler-based constraint is formulated to improve the velocity estimate and enable odometry in geometrically feature-deprived scenarios (e.g., featureless tunnels). Our method has been validated on over 250km of on-road data sourced from public datasets (Boreas and MulRan) and collected using our automotive platform. With the aid of a gyroscope, it outperforms state-of-the-art methods and achieves an average relative translation error of 0.26% on the Boreas leaderboard. When using data with the appropriate Doppler-enabling frequency modulation pattern, the translation error is reduced to 0.18% in similar environments. We also benchmarked our algorithm using 1.5 hours of data collected with a mobile robot in off-road environments with various levels of structure to demonstrate its versatility. Our real-time implementation is publicly available: https://github.com/utiasASRL/dro.
中文: 本文提出了一种用于旋转雷达的新型SE(2)里程计方法,通过直接扫描到地图配准和多普勒约束,在多种环境中实现了卓越的定位精度,并经过大量数据验证。
English: A novel SE(2) odometry method for spinning radars is introduced, using direct scan-to-map registration and Doppler constraints to achieve superior accuracy in diverse environments, as validated by extensive testing.
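For static scenes, the Doppler constraint the method exploits reduces to the classic least-squares ego-velocity estimate sketched below: each return's measured range-rate is the projection of the sensor velocity onto its line of sight. This is the standard constraint, not the paper's full continuous-time SE(2) estimator, and the sign convention is an assumption.

```python
import numpy as np

def doppler_ego_velocity(azimuths, range_rates):
    """azimuths: (N,) bearing of each radar return [rad], range_rates: (N,) measured
    radial (Doppler) velocities [m/s]. Assumes static targets and the convention that
    a target being approached yields a negative range-rate."""
    A = -np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)   # line-of-sight projections
    v, *_ = np.linalg.lstsq(A, range_rates, rcond=None)
    return v                                                      # (vx, vy) sensor-frame velocity
```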

Authors:Junlin Guo, James R. Zimmer-Dauphinee, Jordan M. Nieusma, Siqi Lu, Quan Liu, Ruining Deng, Can Cui, Jialin Yue, Yizhe Lin, Tianyuan Yao, Juming Xiong, Junchao Zhu, Chongyu Qu, Yuechen Yang, Mitchell Wilkes, Xiao Wang, Parker VanValkenburgh, Steven A. Wernke, Yuankai Huo
Title: DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes
Abstract:
By mapping sites at large scales using remotely sensed data, archaeologists can generate unique insights into long-term demographic trends, inter-regional social networks, and past adaptations to climate change. Remote sensing surveys complement field-based approaches, and their reach can be especially great when combined with deep learning and computer vision techniques. However, conventional supervised deep learning methods face challenges in annotating fine-grained archaeological features at scale. While recent vision foundation models have shown remarkable success in learning large-scale remote sensing data with minimal annotations, most off-the-shelf solutions are designed for RGB images rather than multi-spectral satellite imagery, such as the 8-band data used in our study. In this paper, we introduce DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, specifically tailored for Andean archaeology. DeepAndes incorporates a customized DINOv2 self-supervised learning algorithm optimized for 8-band multi-spectral imagery, marking the first foundation model designed explicitly for the Andes region. We evaluate its image understanding performance through imbalanced image classification, image instance retrieval, and pixel-level semantic segmentation tasks. Our experiments show that DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few-shot learning scenarios, significantly outperforming models trained from scratch or pre-trained on smaller datasets. This underscores the effectiveness of large-scale self-supervised pre-training in archaeological remote sensing. Codes will be available on https://github.com/geopacha/DeepAndes.
Chinese: 本文提出了DeepAndes,一种基于Transformer架构的视觉基础模型,通过在三百万张多光谱卫星图像上训练,并针对8波段影像优化自监督学习算法,在考古遥感任务中表现出色,显著优于现有模型在少样本场景下的性能。
English: This paper introduces DeepAndes, a transformer-based vision foundation model trained on three million multi-spectral satellite images, which achieves superior performance in archaeological remote sensing tasks by optimizing self-supervised learning for 8-band imagery and outperforming existing models in few-shot scenarios.
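The main architectural change needed to move a DINOv2-style ViT from RGB to the 8-band imagery used here is the patch-embedding layer, sketched below with an illustrative patch size and width.

```python
import torch
import torch.nn as nn

class MultiSpectralPatchEmbed(nn.Module):
    """ViT patch embedding with 8 input bands instead of 3 RGB channels."""
    def __init__(self, in_bands=8, patch=14, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_bands, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                         # (B, 8, H, W) multi-spectral tile
        return self.proj(x).flatten(2).transpose(1, 2)            # (B, num_patches, dim) tokens
```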

Authors:Noriyuki Kugo, Xiang Li, Zixin Li, Ashish Gupta, Arpandeep Khatua, Nidhish Jain, Chaitanya Patel, Yuta Kyuragi, Yasunori Ishii, Masamoto Tanabiki, Kazuki Kozuka, Ehsan Adeli
Title: VideoMultiAgents: A Multi-Agent Framework for Video Question Answering
Abstract:
Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with a question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving the answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%). The source code is available at https://github.com/PanasonicConnect/VideoMultiAgents.
Chinese: VideoMultiAgents框架通过整合视觉、场景图和文本处理的专业代理,并辅以问题引导的标题生成,有效提升了视频理解能力,在多个基准测试中实现了最先进的性能。
English: The VideoMultiAgents framework overcomes limitations in capturing temporal and interactive contexts by integrating specialized agents for multimodal reasoning and question-guided caption generation, achieving state-of-the-art performance across multiple benchmarks.

Authors:Yu-Ju Tsai, Brian Price, Qing Liu, Luis Figueroa, Daniil Pakhomov, Zhihong Ding, Scott Cohen, Ming-Hsuan Yang
Title: CompleteMe: Reference-based Human Image Completion
Abstract:
Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model's attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques. Project page: https://liagm.github.io/CompleteMe/

Authors:Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
Title: Learning Streaming Video Representation via Multitask Training
Abstract:
Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed as StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.

Authors:Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, Marc Pollefeys
Title: MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion
Abstract:
While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap, low-parallax or high-symmetry scenarios. Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. Thanks to a tight integration of monocular and multi-view constraints, our approach significantly outperforms existing ones under extreme viewpoint changes, while maintaining strong performance in standard conditions. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. This makes our approach the first capable of reliably reconstructing challenging indoor environments from few images. Through principled uncertainty propagation, it is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation. Our code is publicly available at https://github.com/cvg/mpsfm.
中文: 本研究通过整合基于深度学习的单目深度和法线先验,克服了运动恢复结构在极端视角变化等挑战性场景中的局限,实现了即便使用少量图像也能进行鲁棒三维重建。
English: This research overcomes Structure-from-Motion's limitations in challenging scenarios by integrating deep learning-based monocular depth and normal priors, achieving robust reconstruction even with extreme viewpoint changes and few images.

Authors:Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi
Title: Shopformer: Transformer-Based Framework for Detecting Shoplifting via Human Pose
Abstract:
Shoplifting remains a costly issue for the retail sector, but traditional surveillance systems, which are mostly based on human monitoring, are still largely ineffective, with only about 2% of shoplifters being arrested. Existing AI-based approaches rely on pixel-level video analysis which raises privacy concerns, is sensitive to environmental variations, and demands significant computational resources. To address these limitations, we introduce Shopformer, a novel transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video. We propose a custom tokenization strategy that converts pose sequences into compact embeddings for efficient transformer processing. To the best of our knowledge, this is the first pose-sequence-based transformer model for shoplifting detection. Evaluated on real-world pose data, our method outperforms state-of-the-art anomaly detection models, offering a privacy-preserving, and scalable solution for real-time retail surveillance. The code base for this work is available at https://github.com/TeCSAR-UNCC/Shopformer.
中文摘要:Shopformer提出了一种基于姿态序列分析的Transformer模型,通过将动作轨迹转换为紧凑嵌入来检测商店盗窃,在保护隐私的同时显著提升了检测效率与实时监控能力。
English Summary: Shopformer introduces a privacy-focused transformer model that detects shoplifting through pose sequence analysis instead of raw video, outperforming existing methods while addressing computational and surveillance limitations.
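
A minimal sketch of the pose-sequence-to-token idea: flatten each pose frame, project it to a compact token, and classify the sequence with a standard transformer encoder. The `PoseTokenizer` name and all layer sizes are illustrative assumptions, not Shopformer's actual architecture:

```python
import torch
import torch.nn as nn

class PoseTokenizer(nn.Module):
    """Hypothetical sketch: embed (J joints x 2 coords) pose frames as
    tokens and classify the sequence as normal vs. shoplifting."""

    def __init__(self, num_joints=17, dim=64):
        super().__init__()
        self.embed = nn.Linear(num_joints * 2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 2)

    def forward(self, poses):                      # poses: (B, T, J, 2)
        b, t, j, c = poses.shape
        tokens = self.embed(poses.reshape(b, t, j * c))
        feats = self.encoder(tokens).mean(dim=1)   # temporal average pool
        return self.head(feats)

logits = PoseTokenizer()(torch.randn(4, 30, 17, 2))  # (4, 2)
```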

Authors:Yunfei Wan, Jianheng Liu, Chunran Zheng, Jiarong Lin, Fu Zhang
Title: Mesh-Learner: Texturing Mesh with Spherical Harmonics
Abstract:
In this paper, we present a 3D reconstruction and rendering framework termed Mesh-Learner that is natively compatible with traditional rasterization pipelines. It integrates mesh and spherical harmonic (SH) texture (i.e., texture filled with SH coefficients) into the learning process to learn each mesh's view-dependent radiance end-to-end. Images are rendered by interpolating surrounding SH Texels at each pixel's sampling point using a novel interpolation method. Conversely, gradients from each pixel are back-propagated to the related SH Texels in SH textures. Mesh-Learner exploits graphics features of the rasterization pipeline (texture sampling, deferred rendering) to render, which makes Mesh-Learner naturally compatible with tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics) that are based on rasterization pipelines. Our system can train on vast, effectively unlimited scenes because we transfer only the SH textures within the frustum to the GPU for training. At other times, the SH textures are stored in CPU RAM, which results in moderate GPU memory usage. The rendering results on interpolation and extrapolation sequences in the Replica and FAST-LIVO2 datasets achieve state-of-the-art performance compared to existing methods (e.g., 3D Gaussian Splatting and M2-Mapping). To benefit society, the code will be available at https://github.com/hku-mars/Mesh-Learner.
中文: 本文提出Mesh-Learner框架,通过结合网格和球谐纹理进行端到端视角相关辐射度学习,在保持与传统光栅化流程兼容的同时,在多个数据集上实现了最先进的渲染性能。
English: This paper introduces Mesh-Learner, a 3D reconstruction framework that integrates mesh and spherical harmonic textures for end-to-end view-dependent radiance learning, achieving state-of-the-art performance while maintaining compatibility with traditional rasterization pipelines and tools like Blender.

Authors:Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Title: DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images
Abstract:
This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model's ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.
中文摘要:DeeCLIP是一种利用CLIP-ViT和融合学习检测AI生成图像的创新框架,通过特征融合模块和三元组损失增强模型鲁棒性,仅用少量训练数据即在多种生成模型测试中展现优异性能。
English Summary: DeeCLIP is a robust framework that detects AI-generated images by integrating CLIP-ViT with fusion learning and triplet loss, achieving high accuracy and generalization across multiple generative models despite limited training data.
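
The triplet objective mentioned in the abstract is standard; a generic sketch (not DeeCLIP's exact formulation or margin) looks like this:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-class (real/real or fake/fake) embeddings together and
    push real/fake pairs apart; a generic triplet loss, not DeeCLIP's
    exact hyperparameters."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = (F.normalize(torch.randn(8, 512), dim=-1) for _ in range(3))
loss = triplet_loss(a, p, n)
```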

Authors:Yulong Guo, Zilun Zhang, Yongheng Shang, Tiancheng Zhao, Shuiguang Deng, Yingchun Yang, Jianwei Yin
Title: SRMF: A Data Augmentation and Multimodal Fusion Approach for Long-Tail UHR Satellite Image Segmentation
Abstract:
The long-tail problem presents a significant challenge to the advancement of semantic segmentation in ultra-high-resolution (UHR) satellite imagery. While previous efforts in UHR semantic segmentation have largely focused on multi-branch network architectures that emphasize multi-scale feature extraction and fusion, they have often overlooked the importance of addressing the long-tail issue. In contrast to prior UHR methods that focused on independent feature extraction, we emphasize data augmentation and multimodal feature fusion to alleviate the long-tail problem. In this paper, we introduce SRMF, a novel framework for semantic segmentation in UHR satellite imagery. Our approach addresses the long-tail class distribution by incorporating a multi-scale cropping technique alongside a data augmentation strategy based on semantic reordering and resampling. To further enhance model performance, we propose a multimodal fusion-based general representation knowledge injection method, which, for the first time, fuses text and visual features without the need for individual region text descriptions, extracting more robust features. Extensive experiments on the URUR, GID, and FBP datasets demonstrate that our method improves mIoU by 3.33%, 0.66%, and 0.98%, respectively, achieving state-of-the-art performance. Code is available at: https://github.com/BinSpa/SRMF.git.
中文摘要:SRMF框架通过多尺度裁剪、语义重排序数据增强以及文本-视觉多模态融合,解决了超高分辨率卫星图像分割中的长尾分布问题,在多个数据集上实现了最优性能。
English Summary: The SRMF framework tackles the long-tail problem in ultra-high-resolution satellite image segmentation through multi-scale cropping, semantic reordering data augmentation, and multimodal text-visual fusion, achieving state-of-the-art performance on multiple datasets.

Authors:Xiaoyu Liu, Mingshuai Yao, Yabo Zhang, Xianhui Lin, Peiran Ren, Xiaoming Li, Ming Liu, Wangmeng Zuo
Title: AnimateAnywhere: Rouse the Background in Human Image Animation
Abstract:
Human image animation aims to generate human videos of given characters and backgrounds that adhere to the desired pose sequence. However, existing methods focus more on human actions while neglecting the generation of background, which typically leads to static results or inharmonious movements. The community has explored camera pose-guided animation tasks, yet preparing the camera trajectory is impractical for most entertainment applications and ordinary users. As a remedy, we present an AnimateAnywhere framework, rousing the background in human image animation without requirements on camera trajectories. In particular, based on our key insight that the movement of the human body often reflects the motion of the background, we introduce a background motion learner (BML) to learn background motions from human pose sequences. To encourage the model to learn more accurate cross-frame correspondences, we further deploy an epipolar constraint on the 3D attention map. Specifically, the mask used to suppress geometrically unreasonable attention is carefully constructed by combining an epipolar mask and the current 3D attention map. Extensive experiments demonstrate that our AnimateAnywhere effectively learns the background motion from human pose sequences, achieving state-of-the-art performance in generating human animation results with vivid and realistic backgrounds. The source code and model will be available at https://github.com/liuxiaoyu1104/AnimateAnywhere.
中文摘要:AnimateAnywhere框架通过背景运动学习器从人体姿态序列中推断背景运动,无需预设相机轨迹即可生成具有生动逼真背景的人物动画,实现了最先进的性能。
English Summary: The AnimateAnywhere framework introduces a background motion learner that generates dynamic and realistic backgrounds in human image animations by inferring motion from human pose sequences, eliminating the need for predefined camera trajectories.

Authors:Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu
Title: Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video
Abstract:
Neural Radiance Fields (NeRF) have demonstrated a superior capability to represent 3D geometry but require accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training from neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and Scannet show our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods. Our code is available at https://github.com/HoangChuongNguyen/cope-nerf.
中文: 该方法将连续相机运动建模为时间依赖的速度,无需依赖姿态初始化或深度先验,实现了更优的相机姿态与深度估计,同时保持了具有竞争力的新视角合成效果。
English: The proposed method models continuous camera motions as time-dependent velocities to eliminate dependencies on pose initialization or depth priors, enabling superior camera pose and depth estimation while maintaining competitive novel-view synthesis performance.
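
A toy sketch of the core idea of integrating per-step velocities into chained camera poses, with assumed shapes and a unit time step; the paper's continuous, learned formulation is more involved:

```python
import torch

def integrate_motion(omega, v, dt=1.0):
    """Turn per-step angular velocity `omega` (T, 3) and linear velocity
    `v` (T, 3) into relative SE(3) transforms, then chain them into
    poses relative to a single reference time step."""
    def hat(w):  # skew-symmetric matrix of an axis-angle vector
        zero = torch.zeros_like(w[..., 0])
        return torch.stack([
            torch.stack([zero, -w[..., 2], w[..., 1]], -1),
            torch.stack([w[..., 2], zero, -w[..., 0]], -1),
            torch.stack([-w[..., 1], w[..., 0], zero], -1)], -2)

    R_rel = torch.matrix_exp(hat(omega * dt))   # (T, 3, 3) relative rotations
    t_rel = v * dt                              # (T, 3) relative translations
    poses, R, t = [], torch.eye(3), torch.zeros(3)
    for i in range(omega.shape[0]):             # accumulate along the video
        R, t = R @ R_rel[i], t + R @ t_rel[i]   # old R composes both updates
        poses.append((R, t))
    return poses

poses = integrate_motion(torch.randn(5, 3) * 0.01, torch.randn(5, 3) * 0.01)
```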

Authors:Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, Ziyang Ren
Title: STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction
Abstract:
3D occupancy and scene flow offer a detailed and dynamic representation of a 3D scene. Recognizing the sparsity and complexity of 3D space, previous vision-centric methods have employed implicit learning-based approaches to model spatial and temporal information. However, these approaches struggle to capture local details and diminish the model's spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. Specifically, we propose a sparse occlusion-aware attention mechanism, integrated with a cascade refinement strategy, which accurately renovates 3D features with the guidance of occupied state information. Additionally, we introduce a novel method for modeling long-term dynamic interactions, which reduces computational costs and preserves spatial information. Compared to the previous state-of-the-art methods, our efficient explicit renovation strategy not only delivers superior performance in terms of RayIoU and mAVE for occupancy and scene flow prediction but also markedly reduces GPU memory usage during training, bringing it down to 8.7GB. Our code is available at https://github.com/lzzzzzm/STCOcc
中文: 本文提出了一种基于显式状态的建模方法,通过稀疏遮挡感知注意力和级联优化策略改进三维特征重建,在占用和场景流预测中实现了更优性能并显著降低了GPU内存消耗。
English: This paper introduces an explicit state-based modeling method with sparse occlusion-aware attention and cascade refinement to enhance 3D feature renovation for occupancy and scene flow prediction, achieving superior performance and reduced GPU memory usage.

Authors:Valerie Zermatten, Javiera Castillo-Navarro, Pallavi Jain, Devis Tuia, Diego Marcos
Title: EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia
Abstract:
The presence of species provides key insights into the ecological properties of a location such as land cover, climatic conditions or even soil properties. We propose a method to predict such ecological properties directly from remote sensing (RS) images by aligning them with species habitat descriptions. We introduce the EcoWikiRS dataset, consisting of high-resolution aerial images, the corresponding geolocated species observations, and, for each species, the textual descriptions of their habitat from Wikipedia. EcoWikiRS offers a scalable way of supervision for RS vision language models (RS-VLMs) for ecology. This is a setting with weak and noisy supervision, where, for instance, some text may describe properties that are specific only to part of the species' niche or may be irrelevant to a given image. We tackle this by proposing WINCEL, a weighted version of the InfoNCE loss. We evaluate our model on the task of ecosystem zero-shot classification by following the habitat definitions from the European Nature Information System (EUNIS). Our results show that our approach helps in understanding RS images in a more ecologically meaningful manner. The code and the dataset are available at https://github.com/eceo-epfl/EcoWikiRS.
中文: 本研究提出EcoWikiRS数据集,整合了航空图像与物种栖息地描述,并开发了WINCEL加权损失方法,能在弱监督条件下从遥感图像预测生态属性,有效提升了零样本生态系统分类等任务的生态学意义解读能力。
English: The study introduces EcoWikiRS, a dataset combining aerial images with species habitat descriptions, and proposes WINCEL, a weighted loss method to predict ecological properties from remote sensing images despite noisy supervision, enhancing ecological understanding in tasks like zero-shot ecosystem classification.
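
A hedged sketch of a weighted InfoNCE in the spirit of WINCEL, where an assumed per-pair relevance weight down-weights noisy habitat text; the paper's actual weighting scheme may differ:

```python
import torch
import torch.nn.functional as F

def weighted_infonce(img_emb, txt_emb, weights, tau=0.07):
    """Contrastive image-text loss where each pair's contribution is
    scaled by a relevance weight, so habitat sentences that do not match
    the image contribute less. Weights are assumed inputs here."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                 # (B, B) similarities
    targets = torch.arange(len(img_emb))                 # matched pairs on diag
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_pair).sum() / weights.sum()

loss = weighted_infonce(torch.randn(8, 256), torch.randn(8, 256), torch.rand(8))
```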

Authors:Abhishek Kuriyal, Elliot Vincent, Mathieu Aubry, Loic Landrieu
Title: CoDEx: Combining Domain Expertise for Spatial Generalization in Satellite Image Analysis
Abstract:
Global variations in terrain appearance raise a major challenge for satellite image analysis, leading to poor model performance when training on locations that differ from those encountered at test time. This remains true even with recent large global datasets. To address this challenge, we propose a novel domain-generalization framework for satellite images. Instead of trying to learn a single generalizable model, we train one expert model per training domain, while learning experts' similarity and encouraging similar experts to be consistent. A model selection module then identifies the most suitable experts for a given test sample and aggregates their predictions. Experiments on four datasets (DynamicEarthNet, MUDS, OSCD, and FMoW) demonstrate consistent gains over existing domain generalization and adaptation methods. Our code is publicly available at https://github.com/Abhishek19009/CoDEx.
中文: 针对卫星图像中地形变化带来的挑战,提出了一种新的领域泛化框架,通过训练多个专家模型并聚合其预测,在四个数据集上均取得了优于现有方法的性能提升。
English: A novel domain-generalization framework for satellite images trains multiple expert models and aggregates their predictions to overcome terrain variation challenges, demonstrating consistent improvements across four datasets.
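
A minimal sketch of the test-time selection-and-aggregation step, assuming per-expert suitability scores are given (in the paper they come from a learned model selection module):

```python
import torch

def aggregate_experts(expert_logits, expert_scores, top_k=3):
    """Given per-expert predictions (E, B, C) and a suitability score per
    expert (E,), keep the top-k experts and average their class
    probabilities, weighted by score."""
    scores, idx = expert_scores.topk(top_k)
    weights = torch.softmax(scores, dim=0)                 # (k,)
    probs = expert_logits[idx].softmax(dim=-1)             # (k, B, C)
    return (weights[:, None, None] * probs).sum(dim=0)     # (B, C)

fused = aggregate_experts(torch.randn(5, 4, 10), torch.randn(5))
```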

Authors:Nikolaos Chaidos, Angeliki Dimitriou, Nikolaos Spanos, Athanasios Voulodimos, Giorgos Stamou
Title: Explaining Vision GNNs: A Semantic and Visual Analysis of Graph-based Image Classification
Abstract:
Graph Neural Networks (GNNs) have emerged as an efficient alternative to convolutional approaches for vision tasks such as image classification, leveraging patch-based representations instead of raw pixels. These methods construct graphs where image patches serve as nodes, and edges are established based on patch similarity or classification relevance. Despite their efficiency, the explainability of GNN-based vision models remains underexplored, even though graphs are naturally interpretable. In this work, we analyze the semantic consistency of the graphs formed at different layers of GNN-based image classifiers, focusing on how well they preserve object structures and meaningful relationships. A comprehensive analysis is presented by quantifying the extent to which inter-layer graph connections reflect semantic similarity and spatial coherence. Explanations from standard and adversarial settings are also compared to assess whether they reflect the classifiers' robustness. Additionally, we visualize the flow of information across layers through heatmap-based visualization techniques, thereby highlighting the models' explainability. Our findings demonstrate that the decision-making processes of these models can be effectively explained, while also revealing that their reasoning does not necessarily align with human perception, especially in deeper layers.
中文摘要:本研究通过分析图神经网络各层语义一致性和可视化信息流动,评估其在图像分类中的可解释性,发现虽然模型决策可被有效解释,但其深层推理逻辑与人类感知存在差异。
English Summary: This study evaluates the explainability of Graph Neural Networks (GNNs) in image classification by analyzing semantic consistency across layers and visualizing information flow, revealing that while model decisions are interpretable, their reasoning diverges from human perception in deeper layers.

Authors:Sangmin Kim, Seunguk Do, Jaesik Park
Title: ShowMak3r: Compositional TV Show Reconstruction
Abstract:
Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. Reconstruction is difficult due to (1) actors occluding each other and exhibiting diverse facial expressions, (2) cluttered stages, and (3) small-baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows scenes to be edited much as video clips are assembled in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using a depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on the Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page: https://nstar1125.github.io/showmak3r

Authors:Seongmin Hwang, Daeyoung Han, Moongu Jeon
Title: DG-DETR: Toward Domain Generalized Detection Transformer
Abstract:
End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-and-play method that improves out-of-distribution (OOD) robustness for DETRs. Specifically, we propose a novel domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space. Additionally, we leverage a wavelet decomposition to disentangle features into domain-invariant and domain-specific components, enabling synthesis of diverse latent styles while preserving the semantic features of objects. Experimental results validate the effectiveness of DG-DETR. Our code is available at https://github.com/sminhwang/DG-DETR.
中文: DG-DETR通过提出领域无关查询选择和小波分解的即插即用方法,有效提升了DETR检测器在分布外场景下的鲁棒性,填补了Transformer检测器在领域泛化研究中的空白。
English: DG-DETR introduces a plug-and-play method using domain-agnostic query selection and wavelet decomposition to enhance DETRs' out-of-distribution robustness, effectively addressing domain generalization gaps in transformer-based detectors.
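
The orthogonal projection step can be sketched directly; here `style_basis` is an assumed matrix spanning the instance-specific style space, not DG-DETR's actual construction of it:

```python
import torch

def remove_style(queries, style_basis):
    """Project object queries onto a style subspace and subtract that
    component, keeping only the part orthogonal to style.

    queries: (N, D); style_basis: (D, K) columns spanning the style space.
    """
    # P = S (S^T S)^{-1} S^T projects onto span(style_basis)
    s = style_basis
    proj = s @ torch.linalg.solve(s.t() @ s, s.t() @ queries.t())  # (D, N)
    return queries - proj.t()

clean = remove_style(torch.randn(100, 256), torch.randn(256, 4))
```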

Authors:Peijian Zeng, Feiyan Pang, Zhanbo Wang, Aimin Yang
Title: LR-IAD:Mask-Free Industrial Anomaly Detection with Logical Reasoning
Abstract:
Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8% and 11.1% of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing. Code to reproduce the experiment is available at https://github.com/LilaKen/LR-IAD.
中文: 本研究提出了一种无需掩码标注的推理框架和动态奖励函数,以解决工业异常检测中的类别不平衡问题,在MVTec-AD和VisA数据集上分别实现了36%和16%的准确率提升,同时提供可解释的缺陷定位,显著降低了实施成本。
English: This study introduces a mask-free reasoning framework with a dynamic reward function to address class imbalance in industrial anomaly detection, achieving state-of-the-art accuracy improvements of 36% on MVTec-AD and 16% on VisA while providing interpretable defect localization without costly mask annotations.
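
An illustrative sketch of a rarity-weighted reward of the kind the abstract describes, using inverse-frequency weights; the paper's exact reward shaping is assumed, not quoted:

```python
import torch

def rarity_reward(correct, labels, class_counts):
    """Correct predictions on under-represented classes earn a larger
    reward, inversely proportional to class frequency."""
    freq = class_counts[labels].float() / class_counts.sum()
    weights = 1.0 / (freq + 1e-6)
    weights = weights / weights.mean()          # normalize the scale
    return (correct.float() * weights).mean()

counts = torch.tensor([762, 238])               # normal vs. defect imbalance
r = rarity_reward(torch.tensor([1, 0, 1]), torch.tensor([0, 1, 1]), counts)
```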

Authors:Xinyang Li, Chengjie Yi, Jiawei Lai, Mingbao Lin, Yansong Qu, Shengchuan Zhang, Liujuan Cao
Title: SynergyAmodal: Deocclude Anything with Text Control
Abstract:
Image deocclusion (or amodal completion) aims to recover the invisible regions (\ie, shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at https://github.com/imlixinyang/SynergyAmodal.
中文: SynergyAmodal框架通过整合真实图像、人类专业知识和生成先验,共同合成包含1.6万对样本的多样化数据,有效解决了图像去遮挡领域高质量数据稀缺的问题,实现了优异的零样本泛化能力和文本可控性。
English: The SynergyAmodal framework addresses the scarcity of high-quality amodal completion data by integrating in-the-wild images, human expertise, and generative priors to co-synthesize a diverse 16K-pair dataset, enabling effective zero-shot generalization and textual controllability.

Authors:Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake Aaron Richards
Title: Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Abstract:
Robust tooling and publicly available pre-trained models have helped drive recent advances in mechanistic interpretability for language models. However, similar progress in vision mechanistic interpretability has been hindered by the lack of accessible frameworks and pre-trained weights. We present Prisma (Access the codebase here: https://github.com/Prisma-Multimodal/ViT-Prisma), an open-source framework designed to accelerate vision mechanistic interpretability research, providing a unified toolkit for accessing 75+ vision and video transformers; support for sparse autoencoder (SAE), transcoder, and crosscoder training; a suite of 80+ pre-trained SAE weights; activation caching, circuit analysis tools, and visualization tools; and educational resources. Our analysis reveals surprising findings, including that effective vision SAEs can exhibit substantially lower sparsity patterns than language SAEs, and that in some instances, SAE reconstructions can decrease model loss. Prisma enables new research directions for understanding vision model internals while lowering barriers to entry in this emerging field.
中文:Prisma是一个开源框架,通过提供统一的工具包来访问视觉变换器、训练工具、预训练权重和分析资源,加速视觉机制可解释性研究,同时揭示了视觉稀疏自编码器相比语言模型具有更低稀疏性等惊人发现。
English: Prisma is an open-source framework that accelerates vision mechanistic interpretability research by providing a unified toolkit for accessing vision transformers, training tools, pre-trained weights, and analytical resources, while revealing surprising findings such as lower sparsity in vision SAEs compared to language models.
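
A minimal sparse autoencoder of the kind Prisma supports training on vision-transformer activations (sizes and the L1 penalty weight are illustrative, not Prisma's defaults):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU dictionary trained to reconstruct activations
    under an L1 sparsity penalty."""

    def __init__(self, d_model=768, d_hidden=768 * 8):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, acts, l1_coeff=1e-3):
        codes = torch.relu(self.enc(acts))
        recon = self.dec(codes)
        loss = (recon - acts).pow(2).mean() + l1_coeff * codes.abs().mean()
        return recon, codes, loss

sae = SparseAutoencoder()
recon, codes, loss = sae(torch.randn(64, 768))
```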

Authors:Yejin Jeong, Donghun Lee
Title: CLIP-KOA: Enhancing Knee Osteoarthritis Diagnosis with Multi-Modal Learning and Symmetry-Aware Loss Functions
Abstract:
Knee osteoarthritis (KOA) is a common chronic musculoskeletal disorder worldwide, making early diagnosis crucial. Currently, the Kellgren and Lawrence (KL) grading system is widely used to assess KOA severity. However, its high inter-observer variability and subjectivity hinder diagnostic consistency. To address these limitations, automated diagnostic techniques using deep learning have been actively explored in recent years. In this study, we propose a CLIP-based framework (CLIP-KOA) to enhance the consistency and reliability of KOA grade prediction. To achieve this, we introduce a learning approach that integrates image and text information and incorporate Symmetry Loss and Consistency Loss to ensure prediction consistency between the original and flipped images. CLIP-KOA achieves state-of-the-art accuracy of 71.86% on the KOA severity prediction task, and ablation studies show that CLIP-KOA yields a 2.36% improvement in accuracy over the standard CLIP model due to our contribution. This study shows a novel direction for data-driven medical prediction, not only improving the reliability of fine-grained diagnosis but also exploring multimodal methods for medical image analysis. Our code is available at https://github.com/anonymized-link.
中文: 本研究提出的CLIP-KOA框架通过融合图像与文本信息并采用对称性损失和一致性损失,显著提升了膝骨关节炎严重程度预测的准确性和可靠性,达到了71.86%的最新准确率。
English: This study introduces CLIP-KOA, a deep learning framework that integrates image and text data with specialized loss functions to enhance the consistency and accuracy of knee osteoarthritis severity prediction, achieving state-of-the-art 71.86% accuracy.
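
A hedged sketch of the flip-consistency idea behind the Symmetry and Consistency losses, using a KL term between predictions on an image and its horizontal flip; CLIP-KOA's exact losses may be formulated differently:

```python
import torch
import torch.nn.functional as F

def symmetry_consistency_loss(model, images):
    """Encourage a classifier to give the same grade distribution to an
    X-ray and its horizontal flip. `model` is any classifier returning
    logits over KL grades."""
    p = F.log_softmax(model(images), dim=-1)
    p_flip = F.softmax(model(torch.flip(images, dims=[-1])), dim=-1)
    return F.kl_div(p, p_flip, reduction="batchmean")

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 5))
loss = symmetry_consistency_loss(model, torch.randn(4, 3, 32, 32))
```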

Authors:Dehao Yuan, Cornelia Fermüller
Title: A Real-Time Event-Based Normal Flow Estimator
Abstract:
This paper presents a real-time, asynchronous, event-based normal flow estimator. It follows the same algorithm as Learning Normal Flow Directly From Event Neighborhoods, but with a more optimized implementation. The original method treats event slices as 3D point clouds, encodes each event's local geometry into a fixed-length vector, and uses a multi-layer perceptron to predict normal flow. It constructs representations by multiplying an adjacency matrix with a feature matrix, resulting in quadratic time complexity with respect to the number of events. In contrast, we leverage the fact that event coordinates are integers and reformulate the representation step as a pooling operation. This achieves the same effect as the adjacency matrix but with much lower computational cost. As a result, our method supports real-time normal flow prediction on event cameras. Our estimator uses 1 GB of CUDA memory and runs at 4 million normal flows per second on an RTX 3070, or 6 million per second on an RTX A5000. We release the CUDA implementation along with a Python interface at https://github.com/dhyuan99/VecKM_flow_cpp.
中文: 本文提出了一种实时、异步、基于事件的法向流估计器,通过将表示步骤重新定义为池化操作,显著降低了计算复杂度,从而在事件相机上实现了高效的法向流预测。
English: This paper introduces a real-time, asynchronous, event-based normal flow estimator that optimizes the original method by reformulating the representation step as a pooling operation, reducing computational complexity and enabling efficient performance on event cameras.
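
The pooling reformulation can be sketched in a few lines: because event coordinates are integers, a per-event neighborhood sum reduces to a scatter-add into a dense grid plus a box filter, avoiding the quadratic adjacency-matrix product. The window size and shapes below are assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def neighborhood_sum(xy, feats, H, W, k=5):
    """Sum each event's neighbors' features over a (k x k) window by
    scatter-adding features into a grid, box-filtering the grid, and
    reading it back at the event pixels -- linear in the number of
    events instead of an O(N^2) adjacency multiply."""
    c = feats.shape[1]
    grid = torch.zeros(c, H * W)
    flat = xy[:, 1] * W + xy[:, 0]                       # linear pixel index
    grid.index_add_(1, flat, feats.t())                  # pool events to pixels
    grid = grid.reshape(1, c, H, W)
    summed = F.avg_pool2d(grid, k, stride=1, padding=k // 2) * k * k
    return summed.reshape(c, H * W).t()[flat]            # per-event sums

xy = torch.randint(0, 64, (1000, 2))
out = neighborhood_sum(xy, torch.randn(1000, 8), 64, 64)
```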

Authors:Jakub Zadrożny, Hakan Bilen
Title: HumMorph: Generalized Dynamic Human Neural Fields from Few Views
Abstract:
We introduce HumMorph, a novel generalized approach to free-viewpoint rendering of dynamic human bodies with explicit pose control. HumMorph renders a human actor in any specified pose given a few observed views (starting from just one) in arbitrary poses. Our method enables fast inference as it relies only on feed-forward passes through the model. We first construct a coarse representation of the actor in the canonical T-pose, which combines visual features from individual partial observations and fills missing information using learned prior knowledge. The coarse representation is complemented by fine-grained pixel-aligned features extracted directly from the observed views, which provide high-resolution appearance information. We show that HumMorph is competitive with the state-of-the-art when only a single input view is available; however, we achieve significantly better visual quality given just two monocular observations. Moreover, previous generalized methods assume access to accurate body shape and pose parameters obtained using synchronized multi-camera setups. In contrast, we consider a more practical scenario where these body parameters are noisily estimated directly from the observed views. Our experimental results demonstrate that our architecture is more robust to errors in the noisy parameters and clearly outperforms the state of the art in this setting.

Authors:Jiahao Lu, Chong Yin, Silvia Ingala, Kenny Erleben, Michael Bachmann Nielsen, Sune Darkner
Title: MERA: Multimodal and Multiscale Self-Explanatory Model with Considerably Reduced Annotation for Lung Nodule Diagnosis
Abstract:
Lung cancer, a leading cause of cancer-related deaths globally, emphasises the importance of early detection for better patient outcomes. Pulmonary nodules, often early indicators of lung cancer, necessitate accurate, timely diagnosis. Despite Explainable Artificial Intelligence (XAI) advances, many existing systems struggle to provide clear, comprehensive explanations, especially with limited labelled data. This study introduces MERA, a Multimodal and Multiscale self-Explanatory model designed for lung nodule diagnosis with considerably Reduced Annotation requirements. MERA integrates unsupervised and weakly supervised learning strategies (self-supervised learning techniques and a Vision Transformer architecture for unsupervised feature extraction) and a hierarchical prediction mechanism leveraging sparse annotations via semi-supervised active learning in the learned latent space. MERA explains its decisions on multiple levels: model-level global explanations via semantic latent space clustering, instance-level case-based explanations showing similar instances, local visual explanations via attention maps, and concept explanations using critical nodule attributes. Evaluations on the public LIDC dataset show MERA's superior diagnostic accuracy and self-explainability. With only 1% of samples annotated, MERA achieves diagnostic accuracy comparable to or exceeding state-of-the-art methods requiring full annotation. The model's inherent design delivers comprehensive, robust, multilevel explanations aligned closely with clinical practice, enhancing trustworthiness and transparency. The demonstrated viability of unsupervised and weakly supervised learning lowers the barrier to deploying diagnostic AI in broader medical domains. Our complete code is open-source and available at: https://github.com/diku-dk/credanno.
中文: 本研究提出MERA多模态自解释模型,通过结合无监督学习和分层解释机制,在仅需少量标注的情况下实现肺结节的高精度诊断,并提供符合临床实践的多层次解释。
English: This study introduces MERA, a multimodal self-explanatory model that achieves high diagnostic accuracy for lung nodules with minimal annotation by combining unsupervised learning and hierarchical explanations, while providing comprehensive clinical insights.

Authors:Ni Yao, Xiangyu Liu, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chengyang Li, Fubao Zhu, Weihua Zhou, Chen Zhao
Title: Myocardial Region-guided Feature Aggregation Net for Automatic Coronary artery Segmentation and Stenosis Assessment using Coronary Computed Tomography Angiography
Abstract:
Coronary artery disease (CAD) remains a leading cause of mortality worldwide, requiring accurate segmentation and stenosis detection using Coronary Computed Tomography Angiography (CCTA). Existing methods struggle with challenges such as low contrast, morphological variability, and small vessel segmentation. To address these limitations, we propose the Myocardial Region-guided Feature Aggregation Net, a novel U-shaped dual-encoder architecture that integrates anatomical prior knowledge to enhance robustness in coronary artery segmentation. Our framework incorporates three key innovations: (1) a Myocardial Region-guided Module that directs attention to coronary regions via myocardial contour expansion and multi-scale feature fusion, (2) a Residual Feature Extraction Encoding Module that combines parallel spatial channel attention with residual blocks to enhance local-global feature discrimination, and (3) a Multi-scale Feature Fusion Module for adaptive aggregation of hierarchical vascular features. Additionally, Monte Carlo dropout quantifies prediction uncertainty, supporting clinical interpretability. For stenosis detection, a morphology-based centerline extraction algorithm separates the vascular tree into anatomical branches, enabling cross-sectional area quantification and stenosis grading. The superiority of MGFA-Net was demonstrated by achieving a Dice score of 85.04%, an accuracy of 84.24%, an HD95 of 6.1294 mm, and an improvement of 5.46% in true positive rate for stenosis detection compared to 3D U-Net. The integrated segmentation-to-stenosis pipeline provides automated, clinically interpretable CAD assessment, bridging deep learning with anatomical prior knowledge for precision medicine. Our code is publicly available at http://github.com/chenzhao2023/MGFA_CCTA
Chinese: 本研究提出心肌区域引导特征聚合网络,通过整合解剖学先验知识和多尺度特征融合的双编码器架构,显著提升了冠状动脉分割和狭窄检测性能,实现了85.04%的Dice分数等优越指标。
English: This study introduces the Myocardial Region-guided Feature Aggregation Net, a dual-encoder architecture that enhances coronary artery segmentation and stenosis detection by integrating anatomical priors and multi-scale feature fusion, achieving superior performance metrics including an 85.04% Dice score.
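
A minimal sketch of the Monte Carlo dropout uncertainty estimate mentioned in the abstract, applicable to any segmentation network containing dropout layers (the toy network below is purely illustrative):

```python
import torch

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Keep dropout active at inference, run several stochastic forward
    passes, and report the per-pixel mean prediction and variance as an
    uncertainty map."""
    model.train()                      # keeps dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x).sigmoid() for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)

net = torch.nn.Sequential(torch.nn.Conv2d(1, 1, 3, padding=1),
                          torch.nn.Dropout(0.2))
mean, var = mc_dropout_uncertainty(net, torch.randn(1, 1, 32, 32))
```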

Authors:Shuhao Kang, Martin Y. Liao, Yan Xia, Olaf Wysocki, Boris Jutzi, Daniel Cremers
Title: OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion
Abstract:
LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel framework for LiDAR place recognition that leverages OpenStreetMap (OSM) as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components. First, a cross-modal visibility mask that identifies observable regions from both modalities to guide feature alignment. Second, an adaptive radial fusion module that dynamically consolidates radial features into discriminative global descriptors. Extensive experiments on KITTI and KITTI-360 datasets demonstrate OPAL's superiority, achieving 15.98% higher recall at 1m threshold for top-1 retrieved matches, along with 12x faster inference speed compared to the state-of-the-art approach. Code and data are publicly available at: https://github.com/kang-1-2-3/OPAL.
中文摘要:本文提出OPAL这一新型激光雷达地点识别框架,利用开放街道地图作为轻量级先验,通过跨模态可见性掩码和自适应径向融合模块弥合数据差异,在精度和推理速度上均显著超越现有最优方法。
English Summary: The paper introduces OPAL, a novel LiDAR place recognition framework that utilizes OpenStreetMap as a lightweight prior, overcoming domain gaps through cross-modal visibility masks and adaptive radial fusion to achieve superior accuracy and faster inference than existing methods.

Authors:Loc Phuc Truong Nguyen, Hung Truong Thanh Nguyen, Hung Cao
Title: ODExAI: A Comprehensive Object Detection Explainable AI Evaluation
Abstract:
Explainable Artificial Intelligence (XAI) techniques for interpreting object detection models remain in an early stage, with no established standards for systematic evaluation. This absence of consensus hinders both the comparative analysis of methods and the informed selection of suitable approaches. To address this gap, we introduce the Object Detection Explainable AI Evaluation (ODExAI), a comprehensive framework designed to assess XAI methods in object detection based on three core dimensions: localization accuracy, faithfulness to model behavior, and computational complexity. We benchmark a set of XAI methods across two widely used object detectors (YOLOX and Faster R-CNN) and standard datasets (MS-COCO and PASCAL VOC). Empirical results demonstrate that region-based methods (e.g., D-CLOSE) achieve strong localization (PG = 88.49%) and high model faithfulness (OA = 0.863), though with substantial computational overhead (Time = 71.42s). On the other hand, CAM-based methods (e.g., G-CAME) achieve superior localization (PG = 96.13%) and significantly lower runtime (Time = 0.54s), but at the expense of reduced faithfulness (OA = 0.549). These findings demonstrate critical trade-offs among existing XAI approaches and reinforce the need for task-specific evaluation when deploying them in object detection pipelines. Our implementation and evaluation benchmarks are publicly available at: https://github.com/Analytics-Everywhere-Lab/odexai.
中文: ODExAI框架从定位准确性、模型忠实性和计算效率三个维度评估目标检测可解释AI方法,揭示了基于区域与基于CAM方法之间的性能权衡。
English: The ODExAI framework evaluates XAI methods for object detection across localization, faithfulness, and efficiency, revealing trade-offs between region-based and CAM-based approaches.
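
The Pointing Game (PG) localization check used for scoring explanations has a simple form; a sketch under the assumption of a single ground-truth box per explanation:

```python
import torch

def pointing_game(saliency, gt_box):
    """An explanation counts as a hit when the most salient pixel falls
    inside the ground-truth box (x1, y1, x2, y2)."""
    idx = saliency.flatten().argmax()
    y, x = divmod(idx.item(), saliency.shape[1])
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

hit = pointing_game(torch.rand(224, 224), (50, 60, 150, 180))
```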

Authors:De Cheng, Lingfeng He, Nannan Wang, Dingwen Zhang, Xinbo Gao
Title: Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID
Abstract:
Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This oversight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up an optimization objective for specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is applied to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. The optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to dynamically mine reliable positive sample sets for the global and part features and optimize inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performance compared to state-of-the-art methods. Our code is available at https://github.com/FranklinLingfeng/code-for-SALCR.
中文:提出的SALCR框架通过语义对齐学习和协作优化模块,在无监督可见光-红外行人重识别中解决跨模态差异问题,实现细粒度模式优化的跨模态互补对齐。
English: The proposed SALCR framework addresses cross-modality variations in unsupervised visible-infrared person re-identification by introducing semantic-aligned learning and collaborative refinement modules to achieve complementary alignment between modalities through fine-grained pattern optimization.

Authors:Yuming Zhao, Qijian Zhang, Junhui Hou, Jiazhi Xia, Wenping Wang, Ying He
Title: FlexPara: Flexible Neural Surface Parameterization
Abstract:
Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm. The code will be publicly available at https://github.com/AidenZhao/FlexPara
中文摘要:FlexPara是一种无监督神经优化框架,通过建立三维表面点与自适应变形二维UV坐标之间的映射,实现了无需人工切割缝的灵活全局和多图表表面参数化。
English Summary: FlexPara is an unsupervised neural framework that enables flexible global and multi-chart surface parameterization through adaptive point-wise mappings between 3D surfaces and 2D UV coordinates, eliminating the need for manual cutting seams.

Authors:Lei Zhong, Chuan Guo, Yiming Xie, Jiawei Wang, Changjian Li
Title: Sketch2Anim: Towards Transferring Sketch Storyboards into 3D Animation
Abstract:
Storyboarding is widely used for creating 3D animations. Animators use the 2D sketches in storyboards as references to craft the desired 3D animations through a trial-and-error process. The traditional approach requires exceptional expertise and is both labor-intensive and time-consuming. Consequently, there is a high demand for automated methods that can directly translate 2D storyboard sketches into 3D animations. This task remains under-explored to date. Inspired by the significant advancements of motion diffusion models, we propose to address it from the perspective of conditional motion synthesis. We thus present Sketch2Anim, composed of two key modules for sketch constraint understanding and motion generation. Specifically, due to the large domain gap between the 2D sketch and 3D motion, instead of directly conditioning on 2D inputs, we design a 3D conditional motion generator that simultaneously leverages 3D keyposes, joint trajectories, and action words, to achieve precise and fine-grained motion control. Then, we invent a neural mapper dedicated to aligning user-provided 2D sketches with their corresponding 3D keyposes and trajectories in a shared embedding space, enabling, for the first time, direct 2D control of motion generation. Our approach successfully transfers storyboards into high-quality 3D motions and inherently supports direct 3D animation editing, thanks to the flexibility of our multi-conditional motion generator. Comprehensive experiments and evaluations, and a user perceptual study demonstrate the effectiveness of our approach.

Authors:Zhangshuo Qi, Luqi Cheng, Zijie Zhou, Guangming Xiong
Title: LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition
Abstract:
In autonomous driving, place recognition is critical for global localization in GPS-denied environments. LiDAR and radar-based place recognition methods have garnered increasing attention, as LiDAR provides precise ranging, whereas radar excels in adverse weather resilience. However, effectively leveraging LiDAR-radar fusion for place recognition remains challenging. The noisy and sparse nature of radar data limits its potential to further improve recognition accuracy. In addition, heterogeneous radar configurations complicate the development of unified cross-modality fusion frameworks. In this paper, we propose LRFusionPR, which improves recognition accuracy and robustness by fusing LiDAR with either single-chip or scanning radar. Technically, a dual-branch network is proposed to fuse different modalities within the unified polar coordinate bird's eye view (BEV) representation. In the fusion branch, cross-attention is utilized to perform cross-modality feature interactions. The knowledge from the fusion branch is simultaneously transferred to the distillation branch, which takes radar as its only input to further improve the robustness. Ultimately, the descriptors from both branches are concatenated, producing the multimodal global descriptor for place retrieval. Extensive evaluations on multiple datasets demonstrate that our LRFusionPR achieves accurate place recognition, while maintaining robustness under varying weather conditions. Our open-source code will be released at https://github.com/QiZS-BIT/LRFusionPR.
中文: LRFusionPR通过双分支网络在统一鸟瞰图表示中融合激光雷达与雷达数据,提升了自动驾驶场景识别的精度和恶劣天气下的鲁棒性。
English: LRFusionPR enhances autonomous driving place recognition by fusing LiDAR and radar data through a dual-branch network in a unified BEV representation, achieving improved accuracy and weather robustness.

Authors:Yuan Li, Ziqian Bai, Feitong Tan, Zhaopeng Cui, Sean Fanello, Yinda Zhang
Title: IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos
Abstract:
We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.

Authors:Huiling Zheng, Xian Zhong, Bin Liu, Yi Xiao, Bihan Wen, Xiaofeng Li
Title: PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification
Abstract:
The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and underutilized spectral complementarity. Existing methods often fail to decouple shared structural features from modality-complementary radiometric attributes, causing feature conflicts and information loss. To address this, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-complementary) components in the Fourier domain, thus reinforcing shared structures while preserving complementary characteristics to improve fusion quality. Unlike prior approaches that overlook the distinct physical properties encoded in frequency spectra, PAD is the first to introduce explicit amplitude-phase decoupling for multi-modal fusion. Specifically, PAD comprises two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features via convolution-guided scaling to enhance geometric consistency; and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high-frequency and low-frequency patterns using frequency-adaptive multilayer perceptrons, leveraging SAR's morphological sensitivity and RGB's spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK datasets demonstrate state-of-the-art performance. Our work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at https://github.com/RanFeng2/PAD.
中文: 本文提出相位-振幅解耦(PAD)框架,通过在傅里叶域分离相位和振幅分量,强化共享结构并保留互补特征,有效提升了SAR与RGB影像在多模态融合中的土地覆盖分类性能。
English: This paper introduces Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase and amplitude components in the Fourier domain to enhance multi-modal fusion of SAR and RGB imagery for land cover classification by reinforcing shared structures while preserving complementary characteristics.
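
The core amplitude-phase decomposition is a few lines of torch.fft; the PSC and ASF modules that actually process the two spectra are omitted here, so this is only the decomposition step the framework builds on:

```python
import torch

def phase_amplitude_split(x):
    """Take a 2D FFT of each feature map and separate it into amplitude
    (modality-complementary) and phase (modality-shared) spectra."""
    spec = torch.fft.fft2(x)                 # (B, C, H, W) complex spectrum
    return spec.abs(), spec.angle()          # amplitude, phase

def recombine(amplitude, phase):
    """Rebuild features from fused amplitude and corrected phase via the
    polar form and an inverse FFT."""
    spec = torch.polar(amplitude, phase)
    return torch.fft.ifft2(spec).real

amp, pha = phase_amplitude_split(torch.randn(2, 16, 32, 32))
recon = recombine(amp, pha)                  # matches the input up to FP error
```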

Authors:Jialang Lu, Huayu Zhao, Huiyu Zhai, Xingxing Yang, Shini Han
Title: DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning
Abstract:
There has long been a belief that high-level semantics learning can benefit various downstream computer vision tasks. However, in the low-light image enhancement (LLIE) community, existing methods learn a brute-force mapping between the low-light and normal-light domains without considering the semantic information of different regions, especially in those extremely dark regions that suffer from severe information loss. To address this issue, we propose a new deep semantic prior-guided framework (DeepSPG) based on Retinex image decomposition for LLIE to explore informative semantic knowledge via a pre-trained semantic segmentation model and multimodal learning. Notably, we incorporate both image-level semantic prior and text-level semantic prior and thus formulate a multimodal learning framework with combinatorial deep semantic prior guidance for LLIE. Specifically, we incorporate semantic knowledge to guide the enhancement process via three designs: an image-level semantic prior guidance by leveraging hierarchical semantic features from a pre-trained semantic segmentation model; a text-level semantic prior guidance by integrating natural language semantic constraints via a pre-trained vision-language model; a multi-scale semantic-aware structure that facilitates effective semantic feature incorporation. Eventually, our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets. The implementation details and code are publicly available at https://github.com/Wenyuzhy/DeepSPG.
中文: 提出的DeepSPG框架通过整合图像分割和自然语言的多模态语义先验来增强低光图像,在多个基准测试中超越了现有方法。
English: The proposed DeepSPG framework enhances low-light images by integrating multimodal semantic priors from both image segmentation and natural language, outperforming existing methods across multiple benchmarks.

Authors:Mohammad Mahdi Abootorabi, Omid Ghahroodi, Pardis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, Erfan Sadraiye, Kiarash Kiani Feriz, Mahdi Teymouri Nahad, Ali Moghadasi, Abolfazl Eshagh Abianeh, Nizi Nazar, Hamid R. Rabiee, Mahdieh Soleymani Baghshah, Meisam Ahmadi, Ehsaneddin Asgari
Title: Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions
Abstract:
Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: https://github.com/llm-lab-org/Generative-AI-for-Character-Animation-Survey.
Chinese: 生成式AI通过整合面部、手势和动作合成等领域的进展,正在革新角色动画技术,本综述为研究人员提供了全面指导,涵盖当前技术与未来发展方向。
English: Generative AI is revolutionizing character animation by integrating advancements in facial, gesture, and motion synthesis, offering a comprehensive review to guide researchers through current technologies and future directions.

Authors:Gal Almog, Ariel Shamir, Ohad Fried
Title: REED-VAE: RE-Encode Decode Training for Iterative Image Editing with Diffusion Models
Abstract:
While latent diffusion models achieve impressive image editing results, their application to iterative editing of the same image is severely restricted. When trying to apply consecutive edit operations using current models, they accumulate artifacts and noise due to repeated transitions between pixel and latent spaces. Some methods have attempted to address this limitation by performing the entire edit chain within the latent space, sacrificing flexibility by supporting only a limited, predetermined set of diffusion editing operations. We present a RE-encode decode (REED) training scheme for variational autoencoders (VAEs), which promotes image quality preservation even after many iterations. Our work enables multi-method iterative image editing: users can perform a variety of iterative edit operations, with each operation building on the output of the previous one using both diffusion-based operations and conventional editing techniques. We demonstrate the advantage of REED-VAE across a range of image editing scenarios, including text-based and mask-based editing frameworks. In addition, we show how REED-VAE enhances the overall editability of images, increasing the likelihood of successful and precise edit operations. We hope that this work will serve as a benchmark for the newly introduced task of multi-method image editing. Our code and models will be available at https://github.com/galmog/REED-VAE
中文: REED训练方案通过优化变分自编码器,实现了高质量的多方法迭代图像编辑,即使在多次操作后仍能保持图像质量,支持扩散模型与传统编辑技术的灵活结合。
English: The REED training scheme for VAEs enables high-quality iterative image editing by preserving image integrity across multiple operations, supporting both diffusion-based and conventional techniques without accumulating artifacts.

Authors:Xuyin Qi, Zeyu Zhang, Canxuan Gang, Hao Zhang, Lei Zhang, Zhiwei Zhang, Yang Zhao
Title: MediAug: Exploring Visual Augmentation in Medical Imaging
Abstract:
Data augmentation is essential in medical imaging for improving classification accuracy, lesion detection, and organ segmentation under limited data conditions. However, two significant challenges remain. First, a pronounced domain gap between natural photographs and medical images can distort critical disease features. Second, augmentation studies in medical imaging are fragmented and limited to single tasks or architectures, leaving the benefits of advanced mix-based strategies unclear. To address these challenges, we propose a unified evaluation framework with six mix-based augmentation methods integrated with both convolutional and transformer backbones on brain tumour MRI and eye disease fundus datasets. Our contributions are threefold. (1) We introduce MediAug, a comprehensive and reproducible benchmark for advanced data augmentation in medical imaging. (2) We systematically evaluate MixUp, YOCO, CropMix, CutMix, AugMix, and SnapMix with ResNet-50 and ViT-B backbones. (3) We demonstrate through extensive experiments that MixUp yields the greatest improvement on the brain tumor classification task for ResNet-50 with 79.19% accuracy and SnapMix yields the greatest improvement for ViT-B with 99.44% accuracy, and that YOCO yields the greatest improvement on the eye disease classification task for ResNet-50 with 91.60% accuracy and CutMix yields the greatest improvement for ViT-B with 97.94% accuracy. Code will be available at https://github.com/AIGeeksGroup/MediAug.
Chinese: 本研究提出了MediAug这一统一基准,用于评估六种混合数据增强方法在医学影像任务中的表现,结果表明MixUp和SnapMix在脑肿瘤分类中准确率最高,而YOCO和CutMix在不同神经网络架构的眼疾分类任务中表现最优。
English: This study introduces MediAug, a unified benchmark for evaluating six mix-based data augmentation methods on medical imaging tasks, demonstrating that MixUp and SnapMix achieve the highest accuracy for brain tumor classification, while YOCO and CutMix excel in eye disease classification across different neural network architectures.
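As context for the methods compared above, batch-level MixUp (the strongest augmentation for ResNet-50 in the brain-tumour results) is straightforward to reproduce. The sketch below is a generic, minimal implementation rather than the authors' MediAug code; the Beta parameter alpha=0.4 and the soft-target loss are illustrative choices, and the soft-target form of cross_entropy requires a recent PyTorch release.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(images, labels, num_classes, alpha=0.4):
    """Minimal MixUp: blend each sample with a randomly permuted partner.

    images: (B, C, H, W) float tensor, labels: (B,) long tensor.
    Returns mixed images and soft (blended one-hot) targets.
    """
    lam = np.random.beta(alpha, alpha)           # mixing coefficient drawn once per batch
    perm = torch.randperm(images.size(0))        # random partner for each sample
    mixed = lam * images + (1.0 - lam) * images[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    targets = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed, targets

# Usage with any classifier (e.g., a torchvision ResNet-50):
#   mixed, targets = mixup_batch(x, y, num_classes=4)
#   loss = F.cross_entropy(model(mixed), targets)   # soft targets need PyTorch >= 1.10
```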

Authors:Ali Nazari, Mohsen Ebrahimi Moghaddam, Omidreza Borzoei
Title: Kinship Verification through a Forest Neural Network
Abstract:
Early methods used face representations in kinship verification, which are less accurate than joint representations of parents' and children's facial images learned from scratch. We propose an approach featuring graph neural network concepts to utilize face representations and have comparable results to joint representation algorithms. Moreover, we designed the structure of the classification module and introduced a new combination of losses to engage the center loss gradually in training our network. Additionally, we conducted experiments on KinFaceW-I and II, demonstrating the effectiveness of our approach. We achieved the best result on KinFaceW-II, an average improvement of nearly 1.6 for all kinship types, and we were near the best on KinFaceW-I. The code is available at https://github.com/ali-nazari/Kinship-Verification
Chinese: 我们采用图神经网络优化面部表征进行亲属关系验证,其效果与联合表征算法相当,并在KinFaceW数据集上实现了精度提升,创下新记录。
English: Our method employs graph neural networks to enhance face representations for kinship verification, achieving results comparable to joint representation algorithms and setting new benchmarks on KinFaceW datasets with improved accuracy.
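The abstract mentions a new combination of losses that engages the center loss gradually during training. The snippet below is a hypothetical sketch of one common way to implement such a schedule, ramping the center-loss weight linearly over a warm-up period; the CenterLoss form, warm-up length, and maximum weight are assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Pulls each feature vector toward a learnable per-class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

def combined_loss(logits, features, labels, center_loss,
                  epoch, warmup_epochs=20, max_center_weight=0.01):
    """Cross-entropy plus a center loss whose weight ramps up linearly with the epoch."""
    weight = max_center_weight * min(1.0, epoch / warmup_epochs)   # gradual engagement
    return F.cross_entropy(logits, labels) + weight * center_loss(features, labels)
```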

Authors:Shahad Albastaki, Anabia Sohail, Iyyakutti Iyappan Ganapathi, Basit Alawode, Asim Khan, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood
Title: Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Abstract:
In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Supported by novel loss functions, our model captures a broader range of information, enriches feature representation, improves discriminative ability, and enhances generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP
中文摘要:本研究提出了一种多分辨率视觉语言模型,通过在不同放大倍数下对齐视觉与文本数据,显著提升了计算病理学中的癌症亚型分类和组织分析性能,并在多项任务中超越了现有先进方法。
English Summary: This study introduces a multi-resolution vision-language model for computational pathology that enhances cancer subtype classification and tissue analysis by aligning visual and textual data across different magnifications, outperforming existing methods on multiple tasks.

Authors:Jiayi Chen, Yubin Ke, Lin Peng, He Wang
Title: Dexonomy: Synthesizing All Dexterous Grasp Types in a Grasp Taxonomy
Abstract:
Generalizable dexterous grasping with suitable grasp types is a fundamental skill for intelligent robots. Developing such skills requires a large-scale and high-quality dataset that covers numerous grasp types (i.e., at least those categorized by the GRASP taxonomy), but collecting such data is extremely challenging. Existing automatic grasp synthesis methods are often limited to specific grasp types or object categories, hindering scalability. This work proposes an efficient pipeline capable of synthesizing contact-rich, penetration-free, and physically plausible grasps for any grasp type, object, and articulated hand. Starting from a single human-annotated template for each hand and grasp type, our pipeline tackles the complicated synthesis problem with two stages: optimize the object to fit the hand template first, and then locally refine the hand to fit the object in simulation. To validate the synthesized grasps, we introduce a contact-aware control strategy that allows the hand to apply the appropriate force at each contact point to the object. Those validated grasps can also be used as new grasp templates to facilitate future synthesis. Experiments show that our method significantly outperforms previous type-unaware grasp synthesis baselines in simulation. Using our algorithm, we construct a dataset containing 10.7k objects and 9.5M grasps, covering 31 grasp types in the GRASP taxonomy. Finally, we train a type-conditional generative model that successfully performs the desired grasp type from single-view object point clouds, achieving an 82.3% success rate in real-world experiments. Project page: https://pku-epic.github.io/Dexonomy.

Authors:Hang Yu, Jiahao Wen, Zhedong Zheng
Title: CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval
Abstract:
Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability during pretraining to facilitate the subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history for errors encountered within multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.
中文: 本文提出了一种基于跨模态自适应元学习的领域无关预训练框架CAMeL,通过动态误差样本记忆单元和自适应双速更新策略增强模型泛化能力,在真实场景基准测试中超越现有最优方法,并展现出对偏差合成数据和噪声文本标注的强鲁棒性。
English: This paper introduces a domain-agnostic pretraining framework called CAMeL that enhances model generalization through cross-modality adaptive meta-learning, outperforming state-of-the-art methods on real-world benchmarks while demonstrating robustness to biased synthetic data and noisy text annotations.
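The adaptive dual-speed update strategy described above balances fast adaptation to new tasks against slow updates that protect knowledge from earlier tasks. The following is a minimal Reptile-style sketch of such a fast/slow scheme; the inner-loop optimizer, step counts, and interpolation rate are illustrative assumptions and not the CAMeL implementation.

```python
import copy
import torch

def dual_speed_step(model, task_batches, loss_fn, fast_lr=1e-3, slow_rate=0.1, inner_steps=3):
    """Fast inner-loop adaptation on one task, then a slow interpolation into the base weights."""
    fast_model = copy.deepcopy(model)                    # temporary fast learner
    inner_opt = torch.optim.SGD(fast_model.parameters(), lr=fast_lr)
    for _ in range(inner_steps):                         # fast adaptation to the new task
        for x, y in task_batches:
            inner_opt.zero_grad()
            loss_fn(fast_model(x), y).backward()
            inner_opt.step()
    with torch.no_grad():                                # slow update of the historical weights
        for slow_p, fast_p in zip(model.parameters(), fast_model.parameters()):
            slow_p.add_(slow_rate * (fast_p - slow_p))
```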

Authors:Letian Huang, Dongwei Ye, Jialin Dan, Chengzhi Tao, Huiwen Liu, Kun Zhou, Bo Ren, Yuanqi Li, Yanwen Guo, Jie Guo
Title: TransparentGS: Fast Inverse Rendering of Transparent Objects with Gaussians
Abstract:
The emergence of neural and Gaussian-based radiance field methods has led to considerable advancements in novel view synthesis and 3D object reconstruction. Nonetheless, specular reflection and refraction continue to pose significant challenges due to the instability and incorrect overfitting of radiance fields to high-frequency light variations. Currently, even 3D Gaussian Splatting (3D-GS), as a powerful and efficient tool, falls short in recovering transparent objects with nearby contents due to the existence of apparent secondary ray effects. To address this issue, we propose TransparentGS, a fast inverse rendering pipeline for transparent objects based on 3D-GS. The main contributions are three-fold. Firstly, an efficient representation of transparent objects, transparent Gaussian primitives, is designed to enable specular refraction through a deferred refraction strategy. Secondly, we leverage Gaussian light field probes (GaussProbe) to encode both ambient light and nearby contents in a unified framework. Thirdly, a depth-based iterative probes query (IterQuery) algorithm is proposed to reduce the parallax errors in our probe-based framework. Experiments demonstrate the speed and accuracy of our approach in recovering transparent objects from complex environments, as well as several applications in computer graphics and vision.

Authors:Nader Zantout, Haochen Zhang, Pujith Kachana, Jinkai Qiu, Guofei Chen, Ji Zhang, Wenshan Wang
Title: SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models
Abstract:
Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run real-time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation on previously unseen real-world environments. All source code for the system pipeline is publicly released at https://github.com/nzantout/SORT3D.
中文: SORT3D是一种创新方法,它结合2D物体属性、空间推理和大语言模型,无需文本到3D训练数据即可实现零样本三维物体定位,在未知环境中达到最先进性能。
English: SORT3D is a novel method that combines 2D object attributes with spatial reasoning and large language models to enable zero-shot 3D object grounding without requiring text-to-3D training data, achieving state-of-the-art performance in unseen environments.
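The heuristics-based spatial reasoning toolbox mentioned above exposes spatial predicates that the LLM can invoke while reasoning over candidate objects. The functions below are hypothetical examples of two such predicates over 3D object centroids; the actual predicates, coordinate conventions, and thresholds in the released SORT3D toolbox may differ.

```python
import numpy as np

def is_near(center_a, center_b, threshold_m=1.5):
    """True if two object centroids (x, y, z in metres) lie within a distance threshold."""
    return np.linalg.norm(np.asarray(center_a) - np.asarray(center_b)) < threshold_m

def is_left_of(center_a, center_b, viewpoint):
    """View-dependent check: is object A to the left of object B as seen from the viewpoint?

    Assumes a z-up coordinate frame; the sign convention may need flipping for other frames.
    """
    forward = np.asarray(center_b) - np.asarray(viewpoint)   # viewing direction toward B
    to_a = np.asarray(center_a) - np.asarray(viewpoint)
    cross_z = forward[0] * to_a[1] - forward[1] * to_a[0]    # z-component of the planar cross product
    return cross_z > 0
```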

Authors:Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao
Title: Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual question answering. However, current benchmarks typically focus on knowledge-centric evaluations that assess domain-specific expertise, often neglecting the core ability to reason about fundamental mathematical elements and visual concepts. We identify a gap in evaluating elementary-level math problems, which rely on explicit visual dependencies, requiring models to discern, integrate, and reason across multiple images while incorporating commonsense knowledge, all of which are crucial for advancing toward broader AGI capabilities. To address this gap, we introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy. Our findings highlight the ongoing challenges in visual-mathematical integration and suggest avenues for future LVLM advancements. The project can be found at https://alibaba-damo-academy.github.io/VCBench/.

Authors:Shintaro Shiba, Quan Kong, Norimasa Kobori
Title: E-VLC: A Real-World Dataset for Event-based Visible Light Communication And Localization
Abstract:
Optical communication using modulated LEDs (e.g., visible light communication) is an emerging application for event cameras, thanks to their high spatio-temporal resolutions. Event cameras can be used simply to decode the LED signals and also to localize the camera relative to the LED marker positions. However, there is no public dataset to benchmark the decoding and localization in various real-world settings. We present, to the best of our knowledge, the first public dataset that consists of an event camera, a frame camera, and ground-truth poses that are precisely synchronized with hardware triggers. It provides various camera motions with various sensitivities in different scene brightness settings, both indoor and outdoor. Furthermore, we propose a novel method of localization that leverages the Contrast Maximization framework for motion estimation and compensation. The detailed analysis and experimental results demonstrate the advantages of LED-based localization with events over the conventional AR-marker-based one with frames, as well as the efficacy of the proposed method in localization. We hope that the proposed dataset serves as a future benchmark for both motion-related classical computer vision tasks and LED marker decoding tasks simultaneously, paving the way to broadening applications of event cameras on mobile devices. https://woven-visionai.github.io/evlc-dataset

Authors:Ryo Yamaki, Shintaro Shiba, Guillermo Gallego, Yoshimitsu Aoki
Title: Iterative Event-based Motion Segmentation by Variational Contrast Maximization
Abstract:
Event cameras provide rich signals that are suitable for motion estimation since they respond to changes in the scene. As any visual changes in the scene produce event data, it is paramount to classify the data into different motions (i.e., motion segmentation), which is useful for various tasks such as object detection and visual servoing. We propose an iterative motion segmentation method by classifying events into background (e.g., dominant motion hypothesis) and foreground (independent motion residuals), thus extending the Contrast Maximization framework. Experimental results demonstrate that the proposed method successfully classifies event clusters both for public and self-recorded datasets, producing sharp, motion-compensated edge-like images. The proposed method achieves state-of-the-art accuracy on moving object detection benchmarks with an improvement of over 30%, and demonstrates its applicability to more complex and noisy real-world scenes. We hope this work broadens the sensitivity of Contrast Maximization with respect to both motion parameters and input events, thus contributing to theoretical advancements in event-based motion segmentation estimation. https://github.com/aoki-media-lab/event_based_segmentation_vcmax
中文摘要:该研究提出的迭代运动分割方法将事件相机数据分类为背景和前景运动,扩展了对比度最大化框架,在运动目标检测中实现了超过30%的性能提升,达到了当前最优精度水平。
English Summary: The proposed iterative motion segmentation method classifies event camera data into background and foreground motions, extending the Contrast Maximization framework to achieve state-of-the-art accuracy with over 30% improvement in moving object detection.
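For readers unfamiliar with Contrast Maximization, which the method above extends, a motion hypothesis is scored by warping events with it and measuring the sharpness (for example, the variance) of the resulting image of warped events. The sketch below illustrates that objective for a constant 2D translational flow with nearest-pixel accumulation; it is a generic illustration, not the paper's variational formulation, and practical implementations use differentiable bilinear voting instead of rounding.

```python
import torch

def contrast_of_warped_events(xs, ys, ts, flow, img_size=(180, 240)):
    """Variance of the image of warped events (IWE) under a constant 2D flow.

    xs, ys: event pixel coordinates (float tensors); ts: timestamps relative to a
    reference time; flow: tensor [vx, vy] in pixels per second.
    A higher variance means sharper, better motion-compensated edges.
    """
    h, w = img_size
    x_w = (xs - flow[0] * ts).round().long().clamp(0, w - 1)   # warp events back to t_ref
    y_w = (ys - flow[1] * ts).round().long().clamp(0, h - 1)
    iwe = torch.zeros(h, w)
    iwe.index_put_((y_w, x_w), torch.ones_like(ts), accumulate=True)  # accumulate event counts
    return iwe.var()
```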

Authors:Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka
Title: LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning
Abstract:
We present layered ray intersections (LaRI), a new method for unseen geometry reasoning from a single image. Unlike conventional depth estimation that is limited to the visible surface, LaRI models multiple surfaces intersected by the camera rays using layered point maps. Benefiting from the compact and layered representation, LaRI enables complete, efficient, and view-aligned geometric reasoning to unify object- and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. We build a complete training data generation pipeline for synthetic and real-world data, including 3D objects and scenes, with necessary data cleaning steps and coordination between rendering engines. As a generic method, LaRI's performance is validated in two scenarios: It yields comparable object-level results to the recent large generative model using 4% of its training data and 17% of its parameters. Meanwhile, it achieves scene-level occluded geometry reasoning in a single feed-forward pass.

Authors:Jens Hooge, Gerard Sanroma-Guell, Faidra Stavropoulou, Alexander Ullmann, Gesine Knobloch, Mark Klemens, Carola Schmidt, Sabine Weckbach, Andreas Bolz
Title: HepatoGEN: Generating Hepatobiliary Phase MRI with Perceptual and Adversarial Models
Abstract:
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays a crucial role in the detection and characterization of focal liver lesions, with the hepatobiliary phase (HBP) providing essential diagnostic information. However, acquiring HBP images requires prolonged scan times, which may compromise patient comfort and scanner throughput. In this study, we propose a deep learning based approach for synthesizing HBP images from earlier contrast phases (precontrast and transitional) and compare three generative models: a perceptual U-Net, a perceptual GAN (pGAN), and a denoising diffusion probabilistic model (DDPM). We curated a multi-site DCE-MRI dataset from diverse clinical settings and introduced a contrast evolution score (CES) to assess training data quality, enhancing model performance. Quantitative evaluation using pixel-wise and perceptual metrics, combined with qualitative assessment through blinded radiologist reviews, showed that pGAN achieved the best quantitative performance but introduced heterogeneous contrast in out-of-distribution cases. In contrast, the U-Net produced consistent liver enhancement with fewer artifacts, while DDPM underperformed due to limited preservation of fine structural details. These findings demonstrate the feasibility of synthetic HBP image generation as a means to reduce scan time without compromising diagnostic utility, highlighting the clinical potential of deep learning for dynamic contrast enhancement in liver MRI. A project demo is available at: https://jhooge.github.io/hepatogen

Authors:Xinmin Feng, Zhuoyuan Li, Li Li, Dong Liu, Feng Wu
Title: Partition Map-Based Fast Block Partitioning for VVC Inter Coding
Abstract:
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding, and thus improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion (RD) performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjontegaard Delta Bit Rate (BDBR) under the random access configuration.
中文: 该方法通过基于神经网络的划分图预测和双阈值决策方案,在随机访问配置下实现了51.30%的编码时间节省,仅带来2.12%的BDBR性能损失。
English: The proposed neural network-based method with a dual-threshold scheme significantly reduces VVC encoding complexity by 51.30% while maintaining minimal rate-distortion performance loss.
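The dual-threshold decision scheme above trades complexity reduction against rate-distortion loss: confident partition-map predictions are accepted directly, while uncertain blocks fall back to the normal RDO search. The snippet below is a schematic illustration of such a rule with made-up threshold values; it is not taken from the authors' encoder integration.

```python
def decide_partition(prob_split, high_thr=0.9, low_thr=0.1):
    """Dual-threshold rule for one block, given the network's predicted split probability.

    Returns 'split' or 'no_split' when the prediction is confident enough to skip the
    rate-distortion check, and 'rdo_search' when the encoder should test both options.
    """
    if prob_split >= high_thr:
        return "split"        # trust the prediction; skip evaluating the non-split mode
    if prob_split <= low_thr:
        return "no_split"     # trust the prediction; skip evaluating the split modes
    return "rdo_search"       # uncertain: fall back to the full rate-distortion search
```

Raising high_thr and lowering low_thr shifts the trade-off toward a smaller BDBR loss at the cost of less encoding-time saving.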

Authors:Kesen Zhao, Beier Zhu, Qianru Sun, Hanwang Zhang
Title: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Abstract:
Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only existing work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one preferred and the other dispreferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception, identifying key regions and reasoning based on them, UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT. The code is available at https://github.com/kesenzhao/UV-CoT.
中文: UV-CoT框架通过无监督的视觉思维链推理,利用偏好优化比较模型生成的边界框而无需标注,提升了空间推理能力并在多个数据集上展现出优越的泛化性能。
English: The UV-CoT framework introduces an unsupervised approach for visual chain-of-thought reasoning, using preference optimization to compare model-generated bounding boxes without annotations, enhancing spatial reasoning and generalization across datasets.
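The pipeline above turns the evaluator MLLM's ranking of region-conditioned answers into preferred/dispreferred supervision. Below is a hypothetical sketch of that pairing step together with a generic pairwise preference loss; UV-CoT's actual negative log-likelihood objective and prompting templates are not reproduced here.

```python
import torch
import torch.nn.functional as F

def preference_pairs(ranked_candidates):
    """Turn a best-to-worst ranking of candidate boxes into (preferred, dispreferred) pairs."""
    pairs = []
    for i in range(len(ranked_candidates)):
        for j in range(i + 1, len(ranked_candidates)):
            pairs.append((ranked_candidates[i], ranked_candidates[j]))
    return pairs

def pairwise_preference_loss(score_preferred, score_dispreferred):
    """Generic Bradley-Terry-style loss: push the preferred candidate's score above the other's."""
    return -F.logsigmoid(score_preferred - score_dispreferred).mean()
```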

Authors:Zhengru Fang, Zhenghao Liu, Jingjing Wang, Senkang Hu, Yu Guo, Yiqin Deng, Yuguang Fang
Title: Task-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy
Abstract:
To support the Low Altitude Economy (LAE), it is essential to achieve precise localization of unmanned aerial vehicles (UAVs) in urban areas where global positioning system (GPS) signals are unavailable. Vision-based methods offer a viable alternative but face severe bandwidth, memory and processing constraints on lightweight UAVs. Inspired by mammalian spatial cognition, we propose a task-oriented communication framework, where UAVs equipped with multi-camera systems extract compact multi-view features and offload localization tasks to edge servers. We introduce the Orthogonally-constrained Variational Information Bottleneck encoder (O-VIB), which incorporates automatic relevance determination (ARD) to prune non-informative features while enforcing orthogonality to minimize redundancy. This enables efficient and accurate localization with minimal transmission cost. Extensive evaluation on a dedicated LAE UAV dataset shows that O-VIB achieves high-precision localization under stringent bandwidth budgets. Code and dataset will be made publicly available at: github.com/fangzr/TOC-Edge-Aerial.
中文摘要:本研究提出了一种面向任务的通信框架,采用正交约束变分信息瓶颈编码器,通过向边缘服务器传输紧凑的多视角特征,实现在GPS信号缺失城市环境中无人机的高效精准定位。
English Summary: The study introduces a task-oriented communication framework using an Orthogonally-constrained Variational Information Bottleneck encoder to enable efficient and accurate UAV localization in GPS-denied urban environments by transmitting compact multi-view features to edge servers.
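The O-VIB encoder combines a variational information bottleneck, which compresses the multi-view features into a compact code for transmission, with an orthogonality constraint that reduces redundancy across code dimensions. The sketch below is a generic illustration of those two ingredients, a Gaussian KL bottleneck plus a soft orthogonality penalty on the projection weights; the architecture, loss weights, and the ARD pruning mechanism are simplified assumptions.

```python
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    """Toy VIB encoder: features -> (mu, log_var) -> sampled compact code."""
    def __init__(self, in_dim=512, code_dim=32):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_dim * 2, bias=False)

    def forward(self, feats):
        mu, log_var = self.proj(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization trick
        return z, mu, log_var

def vib_loss(task_loss, mu, log_var, proj_weight, beta=1e-3, ortho_weight=1e-2):
    """Task loss + KL bottleneck term + soft orthogonality penalty on the projection rows."""
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1).mean()
    gram = proj_weight @ proj_weight.t()
    ortho = (gram - torch.eye(gram.size(0), device=gram.device)).pow(2).sum()
    return task_loss + beta * kl + ortho_weight * ortho
```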

Authors:Elena Plekhanova, Damien Robert, Johannes Dollinger, Emilia Arens, Philipp Brun, Jan Dirk Wegner, Niklaus Zimmermann
Title: SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology
Abstract:
With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce corresponding SSL4Eco, a multi-date Sentinel-2 dataset, on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state of the art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.
Chinese: 为解决现有地理空间模型的偏差并更好地捕捉全球植被季节性,我们提出了基于物候学采样的SSL4Eco数据集,通过季节对比学习方法在多种生态任务上实现了最优性能。
English: To address biases in existing geospatial models and better capture global vegetation seasonality, we introduce SSL4Eco, a phenology-informed dataset, and demonstrate its superior performance on ecological tasks through a season-contrastive learning approach.

Authors:Nanjie Yao, Gangjian Zhang, Wenhao Shen, Jian Shu, Hao Wang
Title: Unify3D: An Augmented Holistic End-to-end Monocular 3D Human Reconstruction via Anatomy Shaping and Twins Negotiating
Abstract:
Monocular 3D clothed human reconstruction aims to create a complete 3D avatar from a single image. To compensate for the geometric information missing from a single RGB image, current methods typically resort to a preceding model that supplies an explicit geometric representation, and the reconstruction stage then models both this representation and the input image. This routine is constrained by the preceding model and overlooks the integrity of the reconstruction task. To address this, this paper introduces a novel paradigm that treats human reconstruction as a holistic process, utilizing an end-to-end network for direct prediction from 2D image to 3D avatar and eliminating any explicit intermediate geometric representation. Based on this, we further propose a novel reconstruction framework consisting of two core components: the Anatomy Shaping Extraction module, which captures implicit shape features that account for the specifics of human anatomy, and the Twins Negotiating Reconstruction U-Net, which enhances reconstruction through feature interaction between two U-Nets of different modalities. Moreover, we propose a Comic Data Augmentation strategy and construct 15k+ 3D human scans to bolster model performance on more complex inputs. Extensive experiments on two test sets and many in-the-wild cases show the superiority of our method over SOTA methods. Our demos can be found at: https://e2e3dgsrecon.github.io/e2e3dgsrecon/.

Authors:Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara
Title: E-InMeMo: Enhanced Prompting for Visual In-Context Learning
Abstract:
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
Chinese: 提出的E-InMeMo方法通过在上下文对中引入可学习的扰动来增强视觉上下文学习,在标准视觉任务中实现了显著性能提升,其中前景分割mIoU指标比基线方法提高7.99,单目标检测提升17.04。
English: The proposed E-InMeMo method enhances visual in-context learning by integrating learnable perturbations into in-context pairs, achieving state-of-the-art performance improvements of 7.99 mIoU in foreground segmentation and 17.04 mIoU in single object detection over baseline methods.
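E-InMeMo's central idea is a learnable perturbation added to the in-context pair before the frozen vision model sees it, so only the perturbation is optimized. The code below is a minimal sketch of that idea with a full-resolution additive prompt; the prompt shape, whether it is also applied to the label image, and the frozen-model interface are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """A learnable additive perturbation applied to the in-context (support) images only."""
    def __init__(self, channels=3, height=224, width=224):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, channels, height, width))

    def forward(self, support_img, support_label_img):
        # The query image is left untouched; only the in-context pair is perturbed.
        return support_img + self.delta, support_label_img + self.delta

# Training sketch: only prompt.delta receives gradients, the large vision model stays frozen.
#   prompted_in, prompted_out = prompt(support_in, support_out)
#   pred = frozen_model(prompted_in, prompted_out, query_img)
#   task_loss(pred, query_target).backward(); optimizer.step()
```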

Authors:Zhuohao Yan, Shaoquan Feng, Xingxing Li, Yuxuan Zhou, Chunxi Xia, Shengyu Li
Title: S3MOT: Monocular 3D Object Tracking with Selective State Space Model
Abstract:
Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) we introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at https://github.com/bytepioneerX/s3mot.
中文: 本研究提出了三项创新技术——用于高效数据关联的HSSM、提升目标重识别能力的FCOE和优化位姿估计的VeloSSM,显著推进了单目三维多目标跟踪技术,在KITTI基准测试中实现了最先进的性能。
English: This study introduces three innovative techniques—HSSM for efficient data association, FCOE for improved object re-identification, and VeloSSM for enhanced pose estimation—to advance monocular 3D multi-object tracking, achieving state-of-the-art performance on the KITTI benchmark.

Authors:Prachi Garg, Joseph K J, Vineeth N Balasubramanian, Necati Cihan Camgoz, Chengde Wan, Kenrick Kin, Weiguang Si, Shugao Ma, Fernando De La Torre
Title: POET: Prompt Offset Tuning for Continual Human Action Adaptation
Abstract:
As extended reality (XR) is redefining how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually. Importantly, a user should be able to add new classes in a low-shot and efficient manner, while this process should not require storing or replaying any of user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt-Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities; they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks. We contribute two new benchmarks for our new problem setting in human action recognition: (i) NTU RGB+D dataset for activity recognition, and (ii) SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms comprehensive benchmarks. Source code at https://github.com/humansensinglab/POET-continual-action-recognition.
Chinese: 本研究提出POET方法,通过隐私保护的少样本持续动作识别技术,让用户能在不存储敏感数据的情况下为XR设备高效添加新动作类别,并在两个新基准数据集上表现优于现有方法。
English: This research introduces POET, a privacy-aware few-shot continual action recognition method that enables users to efficiently add new action classes to XR devices without storing sensitive data, outperforming existing benchmarks on two new datasets.

Authors:Jianyu Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, Shilong Li, Hui Huang, Jiaheng Liu, Yucheng Wang, Chenchen Jing, Xingwei Qu, Xiao Zhang, Yingshui Tan, Yanan Wu, Jihao Gu, Yangguang Li, Jianke Zhu
Title: DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. By leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (namely oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The data and code are available at https://github.com/Kizna1ver/DREAM.
中文: DREAM方法通过监督微调和强化学习系统性地解构多模态大语言模型中的风险,在保持正常任务性能的同时将安全有效性评分比GPT-4V提升了16.17%。
English: The DREAM method enhances safety in Multimodal Large Language Models by systematically disentangling risks through supervised fine-tuning and reinforcement learning, achieving a 16.17% improvement in safety scores over GPT-4V without compromising performance.

Authors:Chen Chen, Daochang Liu, Mubarak Shah, Chang Xu
Title: Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
Abstract:
Text-to-image diffusion models have demonstrated remarkable capabilities in creating images highly aligned with user prompts, yet their proclivity for memorizing training set images has sparked concerns about the originality of the generated images and privacy issues, potentially leading to legal complications for both model owners and users, particularly when the memorized images contain proprietary content. Although methods to mitigate these issues have been suggested, enhancing privacy often results in a significant decrease in the utility of the outputs, as indicated by text-alignment scores. To bridge the research gap, we introduce a novel method, PRSS, which refines the classifier-free guidance approach in diffusion models by integrating prompt re-anchoring (PR) to improve privacy and incorporating semantic prompt search (SS) to enhance utility. Extensive experiments across various privacy levels demonstrate that our approach consistently improves the privacy-utility trade-off, establishing a new state-of-the-art.
中文: 文本到图像扩散模型存在记忆训练数据的问题,引发隐私和原创性担忧,而新的PRSS方法通过提示重锚定和语义搜索提升了隐私保护和实用性,实现了更优的平衡。
English: Text-to-image diffusion models face challenges with memorizing training data, raising privacy and originality concerns, but the new PRSS method enhances both privacy and utility through prompt re-anchoring and semantic search, achieving a superior trade-off.
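PRSS operates on classifier-free guidance (CFG), so the standard CFG update is useful background: at every denoising step the unconditional and conditional noise predictions are combined with a guidance scale. The snippet shows vanilla CFG only; the prompt re-anchoring and semantic prompt search terms that PRSS adds on top are not reproduced here.

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Standard CFG: extrapolate from the unconditional toward the conditional noise prediction.

    eps_uncond / eps_cond are the diffusion model's noise predictions for an empty prompt
    and for the user prompt, respectively. Larger scales follow the prompt more closely
    (higher utility) but can also increase the risk of reproducing memorized training images.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```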

Authors:Kaiyuan Tang, Siyuan Yao, Chaoli Wang
Title: iVR-GS: Inverse Volume Rendering for Explorable Visualization via Editable 3D Gaussian Splatting
Abstract:
In volume visualization, users can interactively explore the three-dimensional data by specifying color and opacity mappings in the transfer function (TF) or adjusting lighting parameters, facilitating meaningful interpretation of the underlying structure. However, rendering large-scale volumes demands powerful GPUs and high-speed memory access for real-time performance. While existing novel view synthesis (NVS) methods offer faster rendering speeds with lower hardware requirements, the visible parts of a reconstructed scene are fixed and constrained by preset TF settings, significantly limiting user exploration. This paper introduces inverse volume rendering via Gaussian splatting (iVR-GS), an innovative NVS method that reduces the rendering cost while enabling scene editing for interactive volume exploration. Specifically, we compose multiple iVR-GS models associated with basic TFs covering disjoint visible parts to make the entire volumetric scene visible. Each basic model contains a collection of 3D editable Gaussians, where each Gaussian is a 3D spatial point that supports real-time scene rendering and editing. We demonstrate the superior reconstruction quality and composability of iVR-GS against other NVS solutions (Plenoxels, CCNeRF, and base 3DGS) on various volume datasets. The code is available at https://github.com/TouKaienn/iVR-GS.
中文摘要:本文提出iVR-GS方法,通过高斯抛洒实现逆向体绘制,在降低渲染成本的同时支持场景实时编辑,为交互式体数据探索提供了高质量的重建效果。
English Summary: This paper introduces iVR-GS, an inverse volume rendering method using Gaussian splatting that enables real-time scene editing and reduces rendering costs while maintaining high reconstruction quality for interactive volume exploration.

Authors:Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo
Title: Subject-driven Video Generation via Disentangled Identity and Motion
Abstract:
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics in a zero-shot manner, without additional tuning. Traditional tuning-free methods for video customization often rely on large annotated video datasets, which are computationally expensive and require extensive annotation. In contrast, we introduce the use of an image customization dataset directly in training video customization models, factorizing the video customization into two parts: (1) identity injection through the image customization dataset and (2) temporal modeling preservation with a small set of unannotated videos through the image-to-video training method. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings and demonstrating the effectiveness of our framework.
Chinese: 我们提出了一种零样本视频定制模型,通过图像数据集进行身份注入,并利用未标注视频保持时序建模,将主体特定学习与时序动态解耦,无需额外调优即可实现优异的主体一致性。
English: We introduce a zero-shot video customization model that separates subject-specific learning from temporal dynamics using an image dataset for identity injection and unannotated videos for temporal modeling, achieving superior subject consistency without additional tuning.

Authors:Yuxin Yao, Yan Zhang, Zhening Huang, Joan Lasenby
Title: SmallGS: Gaussian Splatting-based Camera Pose Estimation for Small-Baseline Videos
Abstract:
Dynamic videos with small baseline motions are ubiquitous in daily life, especially on social media. However, these videos present a challenge to existing pose estimation frameworks due to ambiguous features, drift accumulation, and insufficient triangulation constraints. Gaussian splatting, which maintains an explicit representation for scenes, provides a reliable novel view rasterization when the viewpoint change is small. Inspired by this, we propose SmallGS, a camera pose estimation framework that is specifically designed for small-baseline videos. SmallGS optimizes sequential camera poses using Gaussian splatting, which reconstructs the scene from the first frame in each video segment to provide a stable reference for the rest. The temporal consistency of Gaussian splatting within limited viewpoint differences reduces the requirement for the sufficient depth variation that traditional camera pose estimation relies on. We further incorporate pretrained robust visual features, e.g., DINOv2, into Gaussian splatting, where high-dimensional feature map rendering enhances the robustness of camera pose estimation. By freezing the Gaussian splatting and optimizing camera viewpoints based on rasterized features, SmallGS effectively learns camera poses without requiring explicit feature correspondences or strong parallax motion. We verify the effectiveness of SmallGS on small-baseline videos from the TUM-Dynamics sequences, where it achieves impressive accuracy in camera pose estimation compared to MonST3R and DROID-SLAM for small-baseline videos in dynamic scenes. Our project page is at: https://yuxinyao620.github.io/SmallGS

Authors:Tetiana Martyniuk, Gilles Puy, Alexandre Boulch, Renaud Marlet, Raoul de Charette
Title: LiDPM: Rethinking Point Diffusion for Lidar Scene Completion
Abstract:
Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. It contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is https://astra-vision.github.io/LiDPM .
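The key observation above is that a vanilla DDPM suffices for lidar scene completion when the reverse chain is started from a well-chosen point rather than from pure noise. A minimal sketch of this "start partway through the reverse chain" idea is shown below; the noise schedule, starting step, and denoiser interface are illustrative assumptions and do not reproduce the paper's settings.

```python
import torch

@torch.no_grad()
def complete_from_partial(denoiser, coarse_points, betas, start_step=400):
    """Run the reverse DDPM chain from an intermediate timestep seeded by a coarse scene.

    coarse_points: (N, 3) rough or partial point positions used as the starting point.
    betas: (T,) noise schedule. denoiser(x, t) is assumed to predict the added noise.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    # Diffuse the coarse input to the chosen starting step instead of sampling pure noise.
    t0 = start_step
    x = alpha_bar[t0].sqrt() * coarse_points \
        + (1.0 - alpha_bar[t0]).sqrt() * torch.randn_like(coarse_points)

    for t in range(t0, -1, -1):                          # standard DDPM reverse updates
        eps = denoiser(x, torch.tensor([t]))
        mean = (x - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x
```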

Authors:Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
Title: Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
Abstract:
Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
Chinese: Token-Shuffle方法通过沿通道合并局部标记并在处理后恢复空间布局,减少自回归模型中的图像标记数量,实现了高效的2048x2048分辨率图像合成,在基准测试和人工评估中均优于现有模型。
English: The Token-Shuffle method reduces image tokens in autoregressive models by merging local tokens along channels and restoring spatial arrangements after processing, enabling efficient 2048x2048 resolution image synthesis that outperforms existing models in benchmarks and human evaluations.
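The token-shuffle and token-unshuffle operations described above merge s x s spatially neighbouring visual tokens along the channel dimension before the Transformer and restore the spatial layout afterwards, analogous to pixel-unshuffle/pixel-shuffle on a token grid. Below is a minimal tensor-level sketch of both operations; the fusion MLPs the paper places around them and the exact dimensions are omitted.

```python
import torch

def token_shuffle(tokens, grid_h, grid_w, s=2):
    """Merge s*s neighbouring tokens along channels: (B, H*W, D) -> (B, H*W/s^2, D*s^2)."""
    b, _, d = tokens.shape
    x = tokens.view(b, grid_h, grid_w, d)
    x = x.view(b, grid_h // s, s, grid_w // s, s, d)
    x = x.permute(0, 1, 3, 2, 4, 5)                     # group each s*s neighbourhood together
    return x.reshape(b, (grid_h // s) * (grid_w // s), s * s * d)

def token_unshuffle(tokens, grid_h, grid_w, s=2):
    """Inverse operation: restore (B, H*W, D) from the merged token sequence."""
    b, _, d_merged = tokens.shape
    d = d_merged // (s * s)
    x = tokens.view(b, grid_h // s, grid_w // s, s, s, d)
    x = x.permute(0, 1, 3, 2, 4, 5)                     # interleave the neighbourhood back
    return x.reshape(b, grid_h * grid_w, d)

# Round trip check: token_unshuffle(token_shuffle(t, 32, 32), 32, 32) recovers t.
```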

Authors:Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang
Title: Step1X-Edit: A Practical Framework for General Image Editing
Abstract:
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
中文: 近年来,多模态模型如GPT-4o和Gemini2 Flash在图像编辑领域取得显著进展,但开源模型仍存在差距;为此,我们推出Step1X-Edit,采用多模态大语言模型和扩散解码器,在GEdit-Bench评估中性能接近领先专有模型,推动了该领域的发展。
English: Recent advancements in multimodal models like GPT-4o and Gemini2 Flash have enhanced image editing, but a performance gap remains with open-source alternatives, prompting the introduction of Step1X-Edit, which uses a Multimodal LLM and diffusion decoder to achieve comparable results and sets a new benchmark in the field.

Authors:Shucheng Gong, Lingzhe Zhao, Wenpu Li, Hong Xie, Yin Zhang, Shiyu Zhao, Peidong Liu
Title: Casual3DHDR: Deblurring High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos
Abstract:
Photo-realistic novel view synthesis from multi-view images, such as neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS), has gained significant attention for its superior performance. However, most existing methods rely on low dynamic range (LDR) images, limiting their ability to capture detailed scenes in high-contrast environments. While some prior works address high dynamic range (HDR) scene reconstruction, they typically require multi-view sharp images with varying exposure times captured at fixed camera positions, which is time-consuming and impractical. To make data acquisition more flexible, we propose Casual3DHDR, a robust one-stage method that reconstructs 3D HDR scenes from casually-captured auto-exposure (AE) videos, even under severe motion blur and unknown, varying exposure times. Our approach integrates a continuous-time camera trajectory into a unified physical imaging model, jointly optimizing exposure times, camera trajectory, and the camera response function (CRF). Extensive experiments on synthetic and real-world datasets demonstrate that Casual3DHDR outperforms existing methods in robustness and rendering quality. Our source code and dataset will be available at https://lingzhezhao.github.io/CasualHDRSplat/

Authors:Lutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yinchuan Li, Yingcong Chen
Title: DiMeR: Disentangled Mesh Reconstruction Model
Abstract:
We propose DiMeR, a novel geometry-texture disentangled feed-forward model with 3D supervision for sparse-view mesh reconstruction. Existing methods confront two persistent obstacles: (i) textures can conceal geometric errors, i.e., visually plausible images can be rendered even with wrong geometry, producing multiple ambiguous optimization objectives in geometry-texture mixed solution space for similar objects; and (ii) prevailing mesh extraction methods are redundant, unstable, and lack 3D supervision. To solve these challenges, we rethink the inductive bias for mesh reconstruction. First, we disentangle the unified geometry-texture solution space, where a single input admits multiple feasible solutions, into geometry and texture spaces individually. Specifically, given that normal maps are strictly consistent with geometry and accurately capture surface variations, the normal maps serve as the sole input for geometry prediction in DiMeR, while the texture is estimated from RGB images. Second, we streamline the algorithm of mesh extraction by eliminating modules with low performance/cost ratios and redesigning regularization losses with 3D supervision. Notably, DiMeR still accepts raw RGB images as input by leveraging foundation models for normal prediction. Extensive experiments demonstrate that DiMeR generalises across sparse-view-, single-image-, and text-to-3D tasks, consistently outperforming baselines. On the GSO and OmniObject3D datasets, DiMeR significantly reduces Chamfer Distance by more than 30%.
Chinese: DiMeR提出了一种具有三维监督的几何-纹理解耦模型,解决了稀疏视图重建中的优化模糊和网格提取低效问题,在基准数据集上将倒角距离显著降低了30%以上。
English: DiMeR introduces a geometry-texture disentangled model with 3D supervision to address ambiguous optimization and inefficient mesh extraction in sparse-view reconstruction, achieving over 30% improvement in Chamfer Distance on benchmark datasets.

Authors:Fengchun Liu, Tong Zhang, Chunying Zhang
Title: STCL:Curriculum learning Strategies for deep learning image steganography models
Abstract:
To address the poor quality of steganographic images and the slow network convergence of deep learning-based image steganography models, this paper proposes a Steganography Curriculum Learning training strategy (STCL) for deep learning image steganography models. The strategy selects only easy images for training while the model's fitting ability is still weak in the initial stage and gradually expands to more difficult images; it consists of a difficulty evaluation strategy based on teacher models and a knee point-based training scheduling strategy. First, multiple teacher models are trained, and the consistency of the quality of steganographic images across these teacher models is used as the difficulty score to construct training subsets ordered from easy to difficult. Second, a training control strategy based on knee points is proposed to reduce the possibility of overfitting on small training subsets and to accelerate the training process. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed scheme improves model performance under multiple algorithmic frameworks, achieving high PSNR, SSIM, and decoding accuracy, while the steganographic images generated by models trained with STCL obtain low steganalysis scores. You can find our code at https://github.com/chaos-boops/STCL.
中文: 本文提出了一种隐写课程学习策略,通过从易到难的图像逐步训练模型,有效提升了隐写图像质量、加速了网络收敛,并在多个数据集上验证了其优越性能。
English: This paper introduces a Steganography Curriculum Learning (STCL) strategy that enhances image steganography models by training on progressively difficult images, improving quality and efficiency while reducing detectability across multiple datasets.
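To make the curriculum construction concrete, the sketch below illustrates one plausible reading of the abstract: difficulty is scored by how consistently several teacher models handle an image (approximated here by the spread of stego-image PSNR across teachers), and the training subset advances when the loss curve flattens. The function names, the use of standard deviation as the consistency measure, and the simple knee test are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch of STCL-style curriculum construction: difficulty is the
# disagreement (std. dev.) of stego-image PSNR across several teacher models,
# and the training subset expands when the loss curve reaches a knee.
import numpy as np

def difficulty_scores(psnr_per_teacher: np.ndarray) -> np.ndarray:
    """psnr_per_teacher: (num_teachers, num_images) PSNR of each teacher's stego image."""
    # Low variance across teachers -> consistently easy; high variance -> hard.
    return psnr_per_teacher.std(axis=0)

def build_curriculum(scores: np.ndarray, num_stages: int = 3):
    """Split image indices into easy-to-hard subsets of roughly equal size."""
    order = np.argsort(scores)            # easiest (most consistent) first
    return np.array_split(order, num_stages)

def knee_reached(losses, window: int = 5, tol: float = 1e-3) -> bool:
    """Crude knee test: advance the stage once the smoothed loss stops improving."""
    if len(losses) < 2 * window:
        return False
    recent = np.mean(losses[-window:])
    earlier = np.mean(losses[-2 * window:-window])
    return (earlier - recent) < tol

# Toy usage with random stand-ins for teacher PSNRs and a loss history.
rng = np.random.default_rng(0)
scores = difficulty_scores(rng.uniform(30, 45, size=(3, 100)))
stages = build_curriculum(scores)
loss_history = list(np.linspace(1.0, 0.2, 40)) + [0.2] * 10
print("stage sizes:", [len(s) for s in stages], "advance:", knee_reached(loss_history))
```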

Authors:Lin Che, Yizi Chen, Tanhua Jin, Martin Raubal, Konrad Schindler, Peter Kiefer
Title: Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior
Abstract:
Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data ("Tobler's law"), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at https://github.com/lin102/CCGP.
Chinese: 本研究提出一种结合地理先验的无监督对比聚类模型,利用街景图像生成土地利用地图,为城市规划提供无需标注数据的可扩展定制化解决方案。
English: This study introduces an unsupervised contrastive clustering model with geographical priors to generate land use maps from street view images, offering a scalable and adaptable solution for urban planning without requiring labeled data.

Authors:Anyi Xiao, Cihui Yang
Title: TableCenterNet: A one-stage network for table structure recognition
Abstract:
Table structure recognition aims to parse tables in unstructured data into machine-understandable formats. Recent methods address this problem through a two-stage process or optimized one-stage approaches. However, these methods either require multiple networks to be serially trained and perform more time-consuming sequential decoding, or rely on complex post-processing algorithms to parse the logical structure of tables. They struggle to balance cross-scenario adaptability, robustness, and computational efficiency. In this paper, we propose a one-stage end-to-end table structure parsing network called TableCenterNet. This network unifies the prediction of table spatial and logical structure into a parallel regression task for the first time, and implicitly learns the spatial-logical location mapping laws of cells through a synergistic architecture of shared feature extraction layers and task-specific decoding. Compared with two-stage methods, our method is easier to train and faster to infer. Experiments on benchmark datasets show that TableCenterNet can effectively parse table structures in diverse scenarios and achieve state-of-the-art performance on the TableGraph-24k dataset. Code is available at https://github.com/dreamy-xay/TableCenterNet.
中文摘要:本文提出TableCenterNet,一种单阶段端到端网络,首次将表格空间与逻辑结构解析统一为并行回归任务,通过共享特征提取和任务解码的协同架构,在跨场景适应性和计算效率上优于现有方法。
English Summary: This paper introduces TableCenterNet, a one-stage end-to-end network that unifies spatial and logical table structure parsing into parallel regression, achieving superior adaptability, robustness, and efficiency compared to existing methods.

Authors:Zihan Cheng, Jintao Guo, Jian Zhang, Lei Qi, Luping Zhou, Yinghuan Shi, Yang Gao
Title: Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation
Abstract:
To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity with input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by the success, in the paper, we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model's generalizability under domain shift issues. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model's learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To our best knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, which exceeds previous SOTA of 88.61%. The code is available at https://github.com/orange-czh/Mamba-Sea.
Chinese: 本文提出Mamba-Sea框架,通过全局到局部的序列增强策略提升医学图像分割的领域泛化能力,在Prostate数据集上首次突破90%的Dice系数,超越了现有最佳性能。
English: This paper introduces Mamba-Sea, a novel Mamba-based framework that enhances domain generalization in medical image segmentation through global-to-local sequence augmentation, achieving state-of-the-art performance with over 90% Dice coefficient on the Prostate dataset.
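The local, sequence-wise augmentation can be pictured as a MixStyle-like perturbation applied to a random contiguous span of tokens. The sketch below is an assumed reading of that idea; the shapes, noise model, and hyperparameters are illustrative rather than the paper's implementation.

```python
# A minimal, hypothetical sketch of sequence-wise style augmentation in the spirit
# of Mamba-Sea's local augmentation: tokens inside a random contiguous sub-sequence
# are re-styled by resampling their per-channel mean/std statistics.
import torch

def sequence_style_augment(x: torch.Tensor, noise_scale: float = 0.1) -> torch.Tensor:
    """x: (batch, length, channels) token sequence."""
    b, length, c = x.shape
    out = x.clone()
    for i in range(b):
        # Pick a random contiguous sub-sequence.
        start = torch.randint(0, length - 1, (1,)).item()
        end = torch.randint(start + 1, length + 1, (1,)).item()
        seg = x[i, start:end]                                  # (L_seg, C)
        mu, sigma = seg.mean(0, keepdim=True), seg.std(0, keepdim=True) + 1e-6
        # Resample style statistics around the originals (models appearance shift).
        new_mu = mu + noise_scale * torch.randn_like(mu) * mu.abs()
        new_sigma = sigma * (1.0 + noise_scale * torch.randn_like(sigma))
        out[i, start:end] = (seg - mu) / sigma * new_sigma + new_mu
    return out

tokens = torch.randn(2, 64, 32)
print(sequence_style_augment(tokens).shape)  # torch.Size([2, 64, 32])
```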

Authors:De-An Huang, Subhashree Radhakrishnan, Zhiding Yu, Jan Kautz
Title: FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
Abstract:
There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: https://github.com/NVlabs/FRAG
中文: 提出的FRAG框架通过独立选择相关帧而无需长上下文模型,有效提升了现有大模型在长视频和长文档任务上的性能表现。
English: The proposed FRAG framework enhances long input processing by selecting relevant frames independently without long context models, significantly improving performance on both video and document tasks using existing LMMs.
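Because the selection step scores each frame independently and then takes a simple Top-K, the core of FRAG can be sketched in a few lines. The `score_frame` callable below is a hypothetical stand-in for however the underlying LMM produces a relevance score; only the independent-scoring-plus-Top-K structure reflects the abstract.

```python
# A hedged sketch of the FRAG idea: score each frame independently (stubbed here as a
# hypothetical `score_frame` callable), keep the Top-K frames, and only pass those to
# the generation step. No long-context processing is needed for selection.
from typing import Callable, List, Sequence

def frag_select(frames: Sequence, query: str,
                score_frame: Callable[[object, str], float],
                k: int = 8) -> List[int]:
    """Return indices of the k frames most relevant to the query."""
    scores = [score_frame(f, query) for f in frames]          # independent per-frame scoring
    return sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]

# Toy usage with a dummy relevance scorer over "frames" that are just strings.
frames = [f"frame_{i}" for i in range(100)]
dummy_scorer = lambda frame, query: float(hash((frame, query)) % 1000)
selected = frag_select(frames, "find the red car", dummy_scorer, k=4)
print("selected frame indices:", selected)
```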

Authors:Mingxuan Wu, Huang Huang, Justin Kerr, Chung Min Kim, Anthony Zhang, Brent Yi, Angjoo Kanazawa
Title: Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
Abstract:
Humans can resort to long-form inspection to build intuition on predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos. We also find that POD's performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.

Authors:Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
Title: Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Abstract:
The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.

Authors:Min Wei, Chaohui Yu, Jingkai Zhou, Fan Wang
Title: 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
Abstract:
Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expense of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods. The project page is at this link https://2y7c3.github.io/3DV-TON/
中文: 该研究提出3DV-TON框架,通过可动画3D网格生成高保真且时序一致的视频试穿效果,其性能优于现有方法,并配有新的基准数据集支持。
English: The study introduces 3DV-TON, a diffusion-based framework that uses animatable 3D meshes to produce high-fidelity and temporally consistent video try-on results, outperforming existing methods and supported by a new benchmark dataset.

Authors:Francesc Marti-Escofet, Benedikt Blumenstiel, Linus Scheibenreif, Paolo Fraccaro, Konrad Schindler
Title: Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models
Abstract:
Earth observation (EO) is crucial for monitoring environmental changes, responding to disasters, and managing natural resources. In this context, foundation models facilitate remote sensing image analysis to retrieve relevant geoinformation accurately and efficiently. However, as these models grow in size, fine-tuning becomes increasingly challenging due to the associated computational resources and costs, limiting their accessibility and scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained features and even degrade model generalization. To address this, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution. In this paper, we conduct extensive experiments with various foundation model architectures and PEFT techniques to evaluate their effectiveness on five different EO datasets. Our results provide a comprehensive comparison, offering insights into when and how PEFT methods support the adaptation of pre-trained geospatial models. We demonstrate that PEFT techniques match or even exceed full fine-tuning performance and enhance model generalisation to unseen geographic regions, while reducing training time and memory requirements. Additional experiments investigate the effect of architecture choices such as the decoder type or the use of metadata, suggesting UNet decoders and fine-tuning without metadata as the recommended configuration. We have integrated all evaluated foundation models and techniques into the open-source package TerraTorch to support quick, scalable, and cost-effective model adaptation.
中文: 参数高效微调(PEFT)技术通过匹配甚至超越全参数微调的性能,在降低计算成本的同时提升模型对未知地理区域的泛化能力,为地球观测任务提供了有效的预训练模型适配方案。
English: Parameter-Efficient Fine-Tuning (PEFT) techniques effectively adapt large foundation models for Earth observation tasks by matching or surpassing full fine-tuning performance while reducing computational costs and enhancing generalization to new regions.
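As one concrete example of the kind of PEFT technique such comparisons cover (the abstract does not list the specific methods, so using LoRA here is an assumption for illustration), a low-rank adapter wrapped around a frozen pre-trained linear layer looks roughly like this:

```python
# Illustrative only: a minimal LoRA-style adapter, one representative PEFT technique.
# Only the two small low-rank factors are trained; the pre-trained weight stays frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank update.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, f"trainable params: {trainable}")   # only the low-rank factors train
```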

Authors:Hassan Keshvarikhojasteh, Mihail Tifrea, Sibylle Hess, Josien P. W. Pluim, Mitko Veta
Title: A Spatially-Aware Multiple Instance Learning Framework for Digital Pathology
Abstract:
Multiple instance learning (MIL) is a promising approach for weakly supervised classification in pathology using whole slide images (WSIs). However, conventional MIL methods such as Attention-Based Deep Multiple Instance Learning (ABMIL) typically disregard spatial interactions among patches that are crucial to pathological diagnosis. Recent advancements, such as Transformer based MIL (TransMIL), have incorporated spatial context and inter-patch relationships. However, it remains unclear whether explicitly modeling patch relationships yields similar performance gains in ABMIL, which relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs Transformer-based layers, introducing a fundamental architectural shift at the cost of substantially increased computational complexity. In this work, we enhance the ABMIL framework by integrating interaction-aware representations to address this question. Our proposed model, Global ABMIL (GABMIL), explicitly captures inter-instance dependencies while preserving computational efficiency. Experimental results on two publicly available datasets for tumor subtyping in breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage point improvement in AUPRC and a 5 percentage point increase in the Kappa score over ABMIL, with minimal or no additional computational overhead. These findings underscore the importance of incorporating patch interactions within MIL frameworks. Our code is available at \href{https://github.com/tueimage/GABMIL}{\texttt{GABMIL}}.
中文: 本研究提出的GABMIL模型通过显式建模病理图像中斑块间的相互作用,在保持计算效率的同时显著提升了肿瘤分型的分类性能。
English: This study introduces GABMIL, an enhanced version of ABMIL that explicitly models inter-patch dependencies in whole slide images, achieving significant performance improvements in tumor subtyping with minimal computational overhead.

Authors:Yu Hong, Xiao Cai, Pengpeng Zeng, Shuai Zhang, Jingkuan Song, Lianli Gao, Heng Tao Shen
Title: Towards Generalized and Training-Free Text-Guided Semantic Manipulation
Abstract:
Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt to match a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant contents. With the powerful generative capabilities of the diffusion model, the task has shown the potential to generate high-fidelity visual content. Nevertheless, existing methods typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noises in the diffusion model are strongly correlated with the semantic changes. Motivated by this, we propose a novel $\textit{GTF}$ for text-guided semantic manipulation, which has the following attractive capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., Plug-and-play) across different modalities (i.e., modality-agnostic); and 2) $\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results via simply controlling the geometric relationship between noises without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state-of-the-art in semantic manipulation.
中文: 文本引导的语义操控利用扩散模型实现精准图像编辑,而提出的GTF方法提供了一种无需训练、通用性强的解决方案,支持跨模态的多种语义修改且无需微调。
English: Text-guided semantic manipulation enables targeted image editing using diffusion models, and the proposed GTF method offers a generalized, training-free solution that supports multiple semantic changes across modalities without fine-tuning.

Authors:Yinqi Li, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
Title: DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks
Abstract:
Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by "inverting" a pretrained layout-to-image diffusion model. To this end, a gradient-based discrete optimization approach for replacing the heavy prediction enumeration process, and a prior distribution model for making more accurate use of the Bayes' rule, are proposed respectively. Empirical results show that this method is on par with basic discriminative object detection baselines on COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at https://github.com/LiYinqi/DIVE .
中文摘要:本文通过逆向应用预训练的布局到图像扩散模型,将生成式扩散模型的判别能力从分类扩展到更复杂的物体检测任务,提出的优化方法在COCO数据集上达到基础判别模型水平,并大幅加速了基于扩散的分类方法且保持精度。
English Summary: This paper extends pretrained diffusion models from generative to discriminative tasks, specifically object detection, by inverting a layout-to-image model and introducing optimization techniques that achieve competitive performance on COCO while accelerating classification without accuracy loss.

Authors:Kai Cui, Jia Li, Yu Liu, Xuesong Zhang, Zhenzhen Hu, Meng Wang
Title: PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
Abstract:
Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.
中文摘要:PhysioSync是一种新型预训练框架,通过建模脑电与外周生理信号的跨模态动态同步关系,并在多时间分辨率下捕捉情绪波动,显著提升了基于脑电的情绪识别性能。
English Summary: PhysioSync is a novel pre-training framework that enhances EEG-based emotion recognition by modeling dynamic cross-modal synchronization with peripheral physiological signals and capturing emotional fluctuations across multiple temporal resolutions.
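A minimal way to picture the cross-modal alignment objective is a symmetric InfoNCE loss between temporally matched EEG and PPS embeddings; the exact form of CM-CA in the paper may differ, so treat the following as an assumed illustration.

```python
# A minimal sketch (not the authors' exact loss) of cross-modal contrastive alignment
# between EEG and peripheral-signal embeddings: matched EEG/PPS windows are pulled
# together, mismatched pairs pushed apart, symmetrically in both retrieval directions.
import torch
import torch.nn.functional as F

def cross_modal_infonce(eeg_emb: torch.Tensor, pps_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """eeg_emb, pps_emb: (batch, dim) embeddings of temporally aligned windows."""
    eeg = F.normalize(eeg_emb, dim=-1)
    pps = F.normalize(pps_emb, dim=-1)
    logits = eeg @ pps.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(eeg.size(0), device=eeg.device)
    # Symmetric loss: EEG->PPS and PPS->EEG retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = cross_modal_infonce(torch.randn(16, 128), torch.randn(16, 128))
print(float(loss))
```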

Authors:Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
Title: OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection
Abstract:
We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at https://github.com/AlbertoFdezHdez/OUI.
中文: 过拟合-欠拟合指示器(OUI)是一种创新工具,无需验证数据即可通过监测训练动态有效识别最佳权重衰减超参数,从而及早发现过拟合或欠拟合问题以提升模型泛化能力。
English: The Overfitting-Underfitting Indicator (OUI) is a novel tool that efficiently identifies optimal Weight Decay hyperparameters by monitoring training dynamics without validation data, enabling early detection of overfitting or underfitting for improved model generalization.

Authors:Ning Li, Antai Andy Liu, Jingran Zhang, Justin Cui
Title: Latent Video Dataset Distillation
Abstract:
Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves a new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a 2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance increase. Our code is available at https://github.com/liningresearch/Latent_Video_Dataset_Distillation.
中文: 本研究提出了一种新颖的视频数据集蒸馏方法,在潜在空间中使用变分编码器并结合多样性感知样本选择,实现了最先进的性能,相比现有方法有显著提升。
English: This study introduces a novel video dataset distillation method that operates in the latent space using a variational encoder and incorporates diversity-aware sample selection, achieving state-of-the-art performance with significant improvements over prior methods.
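The abstract does not spell out the diversity-aware selection rule; one simple strategy that balances representativeness and diversity in a latent space is farthest-point sampling seeded at the latent centroid, sketched below purely as an illustration.

```python
# Illustration only: pick latent codes that are both representative and diverse via
# farthest-point sampling, starting from the sample closest to the latent centroid.
import numpy as np

def diverse_select(latents: np.ndarray, k: int) -> list[int]:
    """latents: (N, D) latent codes (e.g., from a variational encoder)."""
    centroid = latents.mean(axis=0, keepdims=True)
    chosen = [int(np.linalg.norm(latents - centroid, axis=1).argmin())]  # most representative
    dists = np.linalg.norm(latents - latents[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())                                        # most diverse so far
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(latents - latents[nxt], axis=1))
    return chosen

codes = np.random.default_rng(1).normal(size=(500, 64))
print(diverse_select(codes, k=5))
```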

Authors:Kartikay Tehlan, Thomas Wendler
Title: Physiological neural representation for personalised tracer kinetic parameter estimation from dynamic PET
Abstract:
Dynamic positron emission tomography (PET) with [$^{18}$F]FDG enables non-invasive quantification of glucose metabolism through kinetic analysis, often modelled by the two-tissue compartment model (TCKM). However, voxel-wise kinetic parameter estimation using conventional methods is computationally intensive and limited by spatial resolution. Deep neural networks (DNNs) offer an alternative but require large training datasets and significant computational resources. To address these limitations, we propose a physiological neural representation based on implicit neural representations (INRs) for personalized kinetic parameter estimation. INRs, which learn continuous functions, allow for efficient, high-resolution parametric imaging with reduced data requirements. Our method also integrates anatomical priors from a 3D CT foundation model to enhance robustness and precision in kinetic modelling. We evaluate our approach on an [$^{18}$F]FDG dynamic PET/CT dataset and compare it to state-of-the-art DNNs. Results demonstrate superior spatial resolution, lower mean-squared error, and improved anatomical consistency, particularly in tumour and highly vascularized regions. Our findings highlight the potential of INRs for personalized, data-efficient tracer kinetic modelling, enabling applications in tumour characterization, segmentation, and prognostic assessment.
中文: 本研究提出了一种基于隐式神经表示的生理神经表示方法,用于动态PET成像中的个性化动力学参数估计,结合解剖先验知识,相比传统方法实现了更高分辨率、更低误差和更好的解剖一致性。
English: This study introduces a physiological neural representation using implicit neural representations (INRs) for personalized kinetic parameter estimation in dynamic PET imaging, integrating anatomical priors to achieve higher resolution, lower error, and improved anatomical consistency compared to conventional methods.

Authors:Valentin Langer, Kartikay Tehlan, Thomas Wendler
Title: Anatomy-constrained modelling of image-derived input functions in dynamic PET using multi-organ segmentation
Abstract:
Accurate kinetic analysis of [$^{18}$F]FDG distribution in dynamic positron emission tomography (PET) requires anatomically constrained modelling of image-derived input functions (IDIFs). Traditionally, IDIFs are obtained from the aorta, neglecting anatomical variations and complex vascular contributions. This study proposes a multi-organ segmentation-based approach that integrates IDIFs from the aorta, portal vein, pulmonary artery, and ureters. Using high-resolution CT segmentations of the liver, lungs, kidneys, and bladder, we incorporate organ-specific blood supply sources to improve kinetic modelling. Our method was evaluated on dynamic [$^{18}$F]FDG PET data from nine patients, resulting in a mean squared error (MSE) reduction of $13.39\%$ for the liver and $10.42\%$ for the lungs. These initial results highlight the potential of multiple IDIFs in improving anatomical modelling and fully leveraging dynamic PET imaging. This approach could facilitate the integration of tracer kinetic modelling into clinical routine.
中文: 本研究提出一种多器官分割方法,通过整合来自多个血管的图像衍生输入函数来改进动态PET成像中的动力学建模,在肝脏和肺部分析中实现了显著的误差降低。
English: This study introduces a multi-organ segmentation method that integrates image-derived input functions from multiple vessels to enhance kinetic modeling in dynamic PET imaging, achieving significant error reduction in liver and lung analyses.

Authors:Xinqi Xiong, Andrea Dunn Beltran, Jun Myeong Choi, Marc Niethammer, Roni Sengupta
Title: PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation
Abstract:
Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.
中文摘要:本研究提出一种新颖的图像转换框架,通过将稳定扩散模型与ControlNet结合,并利用逐像素着色图作为结构约束,能够生成更真实的内窥镜图像并显著提升深度估计精度。
English Summary: This study introduces an innovative image-to-image translation framework that combines Stable Diffusion with ControlNet, using Per-Pixel Shading maps as structural constraints to generate realistic endoscopic images and improve depth estimation performance.

Authors:Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu
Title: DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Abstract:
We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.
中文: DyMU是一种无需训练的框架,通过根据图像复杂度动态合并视觉标记并重建注意力动态,显著降低视觉语言模型的计算负担(减少32%-85%的标记数量),同时在不同任务中保持性能。
English: DyMU is a training-free framework that dynamically reduces the computational cost of vision-language models by merging visual tokens based on image complexity and reconstructing attention dynamics, achieving significant token reduction (32%-85%) while maintaining performance across various tasks.
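The token-merging step can be pictured as pairing each token with its most similar counterpart and averaging the r most similar pairs, with r chosen per image according to its complexity. The sketch below is a rough, assumed variant of that idea (collision handling and the complexity-to-r mapping are simplified), not the released DToMe implementation.

```python
# A rough, hypothetical sketch of similarity-based visual token merging in the spirit
# of DToMe: tokens are split into two sets, the r most similar cross-set pairs are
# averaged into one token each, reducing the sequence from N to N - r tokens.
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """tokens: (N, C) visual tokens; returns (N - r, C) after merging r pairs."""
    a, b = tokens[0::2], tokens[1::2]                       # two disjoint sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T # (|A|, |B|) cosine similarities
    best_sim, best_idx = sim.max(dim=-1)                    # best partner in B for each A token
    merge_a = best_sim.topk(min(r, a.size(0))).indices      # A tokens to merge away
    keep_mask = torch.ones(a.size(0), dtype=torch.bool)
    keep_mask[merge_a] = False
    b = b.clone()
    # Average merged A tokens into their B partners (collisions handled naively here).
    b[best_idx[merge_a]] = 0.5 * (b[best_idx[merge_a]] + a[merge_a])
    return torch.cat([a[keep_mask], b], dim=0)

x = torch.randn(576, 1024)                 # e.g., ViT patch tokens
print(merge_tokens(x, r=200).shape)        # torch.Size([376, 1024])
```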

Authors:David Yan, Alexander Raistrick, Jia Deng
Title: Procedural Dataset Generation for Zero-Shot Stereo Matching
Abstract:
Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains largely unexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We collect the best settings to produce Infinigen-Stereo, a procedural generator specifically optimized for zero-shot stereo datasets. Models trained only on data from our system outperform robust baselines trained on a combination of existing synthetic datasets and have stronger zero-shot stereo matching performance than public checkpoints from prior works. We open source our system at https://github.com/princeton-vl/InfinigenStereo to enable further research on procedural stereo datasets.
中文: 本研究推出了专为零样本立体匹配优化的程序化生成数据集Infinigen-Stereo,仅使用该数据训练的模型超越了现有合成数据集组合的效果,并显著提升了零样本立体匹配性能。
English: The study introduces Infinigen-Stereo, a procedurally generated dataset optimized for zero-shot stereo matching, which outperforms existing synthetic datasets and enhances model performance when trained solely on its data.

Authors:Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi
Title: Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
Abstract:
Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
中文摘要:本研究提出广义邻域注意力(GNA)解决稀疏注意力效率问题,通过优化Blackwell架构实现,在生成模型中最高可获得46%的端到端加速效果。
English Summary: The study introduces Generalized Neighborhood Attention (GNA) to address sparse attention inefficiencies, achieving up to 46% speedup on generative models through optimized Blackwell architecture implementation.
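For intuition about the sparsity patterns GNA describes, the toy helper below builds a dense boolean mask for 2D blocked neighborhood attention; real kernels never materialize such a mask, so this is only for inspecting which query/key pairs a given pattern allows.

```python
# Toy dense-mask construction for 2D blocked neighborhood attention: two pixels may
# attend to each other if their blocks are within `window_blocks` of each other.
import torch

def blocked_neighborhood_mask(h: int, w: int, block: int, window_blocks: int) -> torch.Tensor:
    """Boolean (h*w, h*w) mask: True where attention is allowed between pixel pairs."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    by, bx = ys.flatten() // block, xs.flatten() // block        # block coordinates per pixel
    dy = (by[:, None] - by[None, :]).abs()
    dx = (bx[:, None] - bx[None, :]).abs()
    return (dy <= window_blocks) & (dx <= window_blocks)

mask = blocked_neighborhood_mask(h=16, w=16, block=4, window_blocks=1)
print(mask.shape, f"density = {mask.float().mean().item():.2f}")  # fraction of pairs visited
```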

Authors:Ruotong Wang, Mingli Zhu, Jiarong Ou, Rui Chen, Xin Tao, Pengfei Wan, Baoyuan Wu
Title: BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
Abstract:
Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at https://wrt2000.github.io/BadVideo2025/.
中文摘要:BadVideo是首个针对文本到视频生成模型的后门攻击框架,通过时空组合和动态元素转换利用模型固有冗余嵌入恶意内容,在保持正常生成质量的同时成功规避内容审核系统。
English Summary: BadVideo is the first backdoor attack framework exploiting the inherent redundancy in text-to-video models to embed hidden malicious content through spatio-temporal manipulation, effectively bypassing content moderation while maintaining output quality.

Authors:Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
Title: Decoupled Global-Local Alignment for Improving Compositional Understanding
Abstract:
Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, this effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA
中文: DeGLA框架通过自蒸馏机制和新型对比损失,在提升CLIP组合理解能力的同时有效保留其通用性能,在多项基准测试中实现了显著性能提升。
English: The DeGLA framework enhances CLIP's compositional understanding through self-distillation and novel contrastive losses while preserving its general capabilities, achieving significant performance gains on multiple benchmarks.
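The self-distillation constraint can be sketched as an EMA teacher plus a similarity loss that keeps the fine-tuned embeddings close to the pre-trained ones; the specific loss form below is an assumption, and only the EMA-teacher structure follows the abstract.

```python
# Minimal sketch of EMA-teacher self-distillation: the student encoder is pulled toward
# a frozen exponential-moving-average copy of itself to limit catastrophic forgetting.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def distill_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Keep the fine-tuned embedding space close to the pre-trained one.
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1).mean()

student = nn.Linear(512, 256)                       # stand-in for the image-text encoder
teacher = copy.deepcopy(student).requires_grad_(False)
x = torch.randn(8, 512)
loss = distill_loss(student(x), teacher(x))
loss.backward()
ema_update(teacher, student)
print(float(loss))
```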

Authors:Jialiang Zhang, Feng Gao, Yanhai Gan, Junyu Dong, Qian Du
Title: Frequency-Compensated Network for Daily Arctic Sea Ice Concentration Prediction
Abstract:
Accurately forecasting sea ice concentration (SIC) in the Arctic is critical to global ecosystem health and navigation safety. However, current methods are still confronted with two challenges: 1) they rarely explore the long-term feature dependencies in the frequency domain; 2) they can hardly preserve high-frequency details, so changes in the marginal area of the sea ice cannot be accurately captured. To this end, we present a Frequency-Compensated Network (FCNet) for Arctic SIC prediction on a daily basis. In particular, we design a dual-branch network, including branches for frequency feature extraction and convolutional feature extraction. For frequency feature extraction, we design an adaptive frequency filter block, which integrates trainable layers with Fourier-based filters. By adding frequency features, the FCNet can achieve refined prediction of edges and details. For convolutional feature extraction, we propose a high-frequency enhancement block to separate high- and low-frequency information. Moreover, high-frequency features are enhanced via channel-wise attention, and a temporal attention unit is employed for low-frequency feature extraction to capture long-range sea ice changes. Extensive experiments are conducted on a satellite-derived daily SIC dataset, and the results verify the effectiveness of the proposed FCNet. Our codes and data will be made publicly available at: https://github.com/oucailab/FCNet.
中文: 频率补偿网络(FCNet)通过结合频域分析和卷积特征,提升了北极海冰密集度的预测精度,有效捕捉长期依赖关系并保留边缘区域的高频细节。
English: The Frequency-Compensated Network (FCNet) enhances Arctic sea ice concentration forecasting by integrating frequency-domain analysis and convolutional features to capture long-term dependencies and preserve high-frequency details in marginal ice zones.
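An "adaptive frequency filter block" that integrates trainable layers with Fourier-based filters can be approximated by a learnable complex-valued mask applied in the rFFT domain. The module below is a hedged sketch under that assumption; the paper's actual block design and initialization may differ.

```python
# Sketch of a learnable spectral filter: rFFT the feature map, multiply by a trainable
# complex mask (one weight per channel and frequency bin), then inverse-transform.
import torch
import torch.nn as nn

class AdaptiveFrequencyFilter(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        w = torch.zeros(channels, height, width // 2 + 1, 2)
        w[..., 0] = 1.0                                     # real part 1, imag 0: identity at init
        self.weight = nn.Parameter(w)                       # learnable complex filter per bin

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")              # (B, C, H, W//2 + 1), complex
        freq = freq * torch.view_as_complex(self.weight)     # element-wise spectral filtering
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")

layer = AdaptiveFrequencyFilter(channels=16, height=64, width=64)
print(layer(torch.randn(2, 16, 64, 64)).shape)               # torch.Size([2, 16, 64, 64])
```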

Authors:Xinru Meng, Han Sun, Jiamei Liu, Ningzhong Liu, Huiyu Zhou
Title: Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation
Abstract:
Source-free domain adaptation (SFDA), which involves adapting models without access to source data, is both demanding and challenging. Existing SFDA techniques typically rely on pseudo-labels generated from confidence levels, leading to negative transfer due to significant noise. To tackle this problem, Energy-Based Pseudo-Label Refining (EBPR) is proposed for SFDA. Pseudo-labels are created for all sample clusters according to their energy scores. Global and class energy thresholds are computed to selectively filter pseudo-labels. Furthermore, a contrastive learning strategy is introduced to filter difficult samples, aligning them with their augmented versions to learn more discriminative features. Our method is validated on the Office-31, Office-Home, and VisDA-C datasets, where it consistently outperforms state-of-the-art methods.
中文:提出的基于能量的伪标签细化方法通过能量阈值筛选伪标签并采用对比学习增强特征区分度,有效解决了无源域适应中的噪声问题,在多个基准数据集上展现出优越性能。
English: The proposed Energy-Based Pseudo-Label Refining (EBPR) method addresses noise in source-free domain adaptation by filtering pseudo-labels using energy thresholds and enhancing feature discrimination through contrastive learning, demonstrating superior performance on benchmark datasets.
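The energy-based filtering step admits a compact illustration: score each target sample by the energy of its logits and keep a pseudo-label only if it passes both a global and a per-class energy threshold. The quantile-based thresholds below are an assumption; the abstract only states that global and class thresholds are computed.

```python
# Assumed form of energy-based pseudo-label filtering: energy = -logsumexp(logits),
# keep a pseudo-label only if it passes the global and the per-class energy threshold.
import torch

def filter_pseudo_labels(logits: torch.Tensor, quantile: float = 0.5):
    """logits: (N, num_classes) target-domain predictions. Returns (keep_mask, pseudo_labels)."""
    energy = -torch.logsumexp(logits, dim=1)          # lower energy ~ more confident
    pseudo = logits.argmax(dim=1)
    global_thr = energy.quantile(quantile)
    keep = energy <= global_thr
    for c in pseudo.unique():                          # per-class threshold refinement
        cls_mask = pseudo == c
        cls_thr = energy[cls_mask].quantile(quantile)
        keep[cls_mask] &= energy[cls_mask] <= cls_thr
    return keep, pseudo

logits = torch.randn(1000, 31)                         # e.g., Office-31 has 31 classes
keep, labels = filter_pseudo_labels(logits)
print(f"kept {int(keep.sum())} / {len(labels)} pseudo-labels")
```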

Authors:Gerardus Croonen, Andreas Trondl, Julia Simon, Daniel Steininger
Title: SemanticSugarBeets: A Multi-Task Framework and Dataset for Inspecting Harvest and Storage Characteristics of Sugar Beets
Abstract:
While sugar beets are stored prior to processing, they lose sugar due to factors such as microorganisms present in adherent soil and excess vegetation. Their automated visual inspection promises to aid in quality assurance and thereby increase efficiency throughout the processing chain of sugar production. In this work, we present a novel high-quality annotated dataset and two-stage method for the detection, semantic segmentation and mass estimation of post-harvest and post-storage sugar beets in monocular RGB images. We conduct extensive ablation experiments for the detection of sugar beets and their fine-grained semantic segmentation regarding damages, rot, soil adhesion and excess vegetation. For these tasks, we evaluate multiple image sizes, model architectures and encoders, as well as the influence of environmental conditions. Our experiments show an mAP50-95 of 98.8 for sugar-beet detection and an mIoU of 64.0 for the best-performing segmentation model.
Chinese Summary: 本研究提出了一种新颖的标注数据集和两阶段方法,用于在单目RGB图像中检测、语义分割和估算收获后储存的甜菜质量,以应对因微生物和多余植被导致的储存期间糖分损失问题。
English Summary: This study introduces a new annotated dataset and a two-stage method for detecting, segmenting, and estimating the mass of sugar beets in images to address sugar loss during storage caused by microorganisms and vegetation.

Authors:Wenping Ma, Boyou Xue, Mengru Ma, Chuang Chen, Hekai Zhang, Hao Zhu
Title: A Diff-Attention Aware State Space Fusion Model for Remote Sensing Classification
Abstract:
Multispectral (MS) and panchromatic (PAN) images describe the same land surface, so these images not only have their own advantages but also share a great deal of similar information. To separate this shared information from the modality-specific advantages and reduce feature redundancy in the fusion stage, this paper introduces a diff-attention aware state space fusion model (DAS2F-Model) for multimodal remote sensing image classification. Based on the selective state space model, a cross-modal diff-attention module (CMDA-Module) is designed to extract and separate the common features and the respective dominant features of MS and PAN images. Within it, the space preserving visual mamba (SPVM) retains image spatial features and captures local features by optimizing visual mamba's input reasonably. Considering that features in the fusion stage will have large semantic differences after feature separation and that simple fusion operations struggle to effectively integrate these significantly different features, an attention-aware linear fusion module (AALF-Module) is proposed. It performs pixel-wise linear fusion by calculating influence coefficients. This mechanism can fuse features with large semantic differences while keeping the feature size unchanged. Empirical evaluations indicate that the presented method achieves better results than alternative approaches. The relevant code can be found at: https://github.com/AVKSKVL/DAS-F-Model
Chinese: 本文提出了一种DAS2F模型,通过跨模态注意力模块分离多光谱与全色图像的共有及独有特征,并采用注意力感知的线性融合方法进行有效整合,从而提升遥感图像分类性能。
English: This paper introduces a DAS2F-Model that separates common and distinct features of multispectral and panchromatic images using a cross-modal attention module and fuses them with an attention-aware linear method to improve remote sensing classification.

Authors:Giacomo Pacini, Lorenzo Bianchi, Luca Ciampi, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
Title: CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones
Abstract:
Class-agnostic counting (CAC) aims to estimate the number of objects in images without being restricted to predefined categories. However, while current exemplar-based CAC methods offer flexibility at inference time, they still rely heavily on labeled data for training, which limits scalability and generalization to many downstream use cases. In this paper, we introduce CountingDINO, the first training-free exemplar-based CAC framework that exploits a fully unsupervised feature extractor. Specifically, our approach employs self-supervised vision-only backbones to extract object-aware features, and it eliminates the need for annotated data throughout the entire proposed pipeline. At inference time, we extract latent object prototypes via ROI-Align from DINO features and use them as convolutional kernels to generate similarity maps. These are then transformed into density maps through a simple yet effective normalization scheme. We evaluate our approach on the FSC-147 benchmark, where we consistently outperform a baseline based on an SOTA unsupervised object detector under the same label- and training-free setting. Additionally, we achieve competitive results -- and in some cases surpass -- training-free methods that rely on supervised backbones, non-training-free unsupervised methods, as well as several fully supervised SOTA approaches. This demonstrates that label- and training-free CAC can be both scalable and effective. Code: https://lorebianchi98.github.io/CountingDINO/.
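The training-free pipeline can be sketched end to end: pool an exemplar prototype from self-supervised features with ROI-Align, correlate it with the feature map to obtain a similarity map, and normalize so that one exemplar integrates to roughly one. The background-suppression and normalization details below are simplified assumptions (and torchvision is assumed available), not the released code.

```python
# Simplified, hypothetical sketch of exemplar-based, training-free counting: an exemplar
# prototype pooled from backbone features is used as a convolution kernel to produce a
# similarity map, which is normalized so that its values sum to an object count.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def count_objects(feat: torch.Tensor, exemplar_box: torch.Tensor) -> float:
    """feat: (1, C, H, W) self-supervised features; exemplar_box: (1, 4) box (x1, y1, x2, y2) in feature coords."""
    proto = roi_align(feat, [exemplar_box], output_size=(3, 3))    # (1, C, 3, 3) exemplar prototype
    sim = F.conv2d(feat, proto, padding=1)                         # (1, 1, H, W) similarity map
    sim = torch.relu(sim - sim.mean())                             # crude background suppression
    x1, y1, x2, y2 = exemplar_box[0].round().long().tolist()
    exemplar_mass = sim[0, 0, y1:y2 + 1, x1:x2 + 1].sum().clamp_min(1e-6)
    density = sim / exemplar_mass                                  # one exemplar integrates to ~1
    return float(density.sum())

feat = torch.randn(1, 384, 40, 40)                                 # e.g., ViT-S patch features
box = torch.tensor([[10.0, 12.0, 15.0, 18.0]])
print(f"estimated count: {count_objects(feat, box):.1f}")
```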

Authors:Yahao Lu, Yuehui Li, Xingyuan Guo, Shuai Yuan, Yukai Shi, Liang Lin
Title: Rethinking Generalizable Infrared Small Target Detection: A Real-scene Benchmark and Cross-view Representation Learning
Abstract:
Infrared small target detection (ISTD) is highly sensitive to sensor type, observation conditions, and the intrinsic properties of the target. These factors can introduce substantial variations in the distribution of acquired infrared image data, a phenomenon known as domain shift. Such distribution discrepancies significantly hinder the generalization capability of ISTD models across diverse scenarios. To tackle this challenge, this paper introduces an ISTD framework enhanced by domain adaptation. To alleviate distribution shift between datasets and achieve cross-sample alignment, we introduce Cross-view Channel Alignment (CCA). Additionally, we propose the Cross-view Top-K Fusion strategy, which integrates target information with diverse background features, enhancing the model's ability to extract critical data characteristics. To further mitigate the impact of noise on ISTD, we develop a Noise-guided Representation learning strategy. This approach enables the model to learn more noise-resistant feature representations, improving its generalization capability across diverse noisy domains. Finally, we develop a dedicated infrared small target dataset, RealScene-ISTD. Compared to state-of-the-art methods, our approach demonstrates superior performance in terms of detection probability (Pd), false alarm rate (Fa), and intersection over union (IoU). The code is available at: https://github.com/luy0222/RealScene-ISTD.
中文摘要:本文提出了一种增强领域适应的红外小目标检测框架,通过跨视图对齐和抗噪表征学习解决领域偏移问题,并在新构建的数据集上实现了更优的检测性能。
English Summary: This paper introduces a domain adaptation-enhanced framework for infrared small target detection that addresses domain shift through cross-view alignment and noise-resistant representation learning, achieving superior performance on a newly developed dataset.

Authors:Shun Zou, Yi Zou, Juncheng Li, Guangwei Gao, Guojun Qi
Title: Cross Paradigm Representation and Alignment Transformer for Image Deraining
Abstract:
Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer's robustness in other image restoration tasks and downstream applications.
Chinese: 提出的CPRAformer通过分层对齐和两种自注意力机制整合全局-局部与空间-通道表征,在多个基准测试中实现了最先进的图像去雨性能。
English: The proposed CPRAformer integrates global-local and spatial-channel representations through hierarchical alignment and two specialized self-attention mechanisms, achieving state-of-the-art image deraining performance across multiple benchmarks.

Authors:Duy-Tho Le, Trung Pham, Jianfei Cai, Hamid Rezatofighi
Title: Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Any Convex Parametric Shapes
Abstract:
Optimizing the similarity between parametric shapes is crucial for numerous computer vision tasks, where Intersection over Union (IoU) stands as the canonical measure. However, existing optimization methods exhibit significant shortcomings: regression-based losses like L1/L2 lack correlation with IoU, IoU-based losses are unstable and limited to simple shapes, and task-specific methods are computationally intensive and not generalizable across domains. As a result, the current landscape of parametric shape objective functions has become scattered, with each domain proposing distinct IoU approximations. To address this, we unify the parametric shape optimization objective functions by introducing Marginalized Generalized IoU (MGIoU), a novel loss function that overcomes these challenges by projecting structured convex shapes onto their unique shape Normals to compute one-dimensional normalized GIoU. MGIoU offers a simple, efficient, fully differentiable approximation strongly correlated with IoU. We then extend MGIoU to MGIoU+ that supports optimizing unstructured convex shapes. Together, MGIoU and MGIoU+ unify parametric shape optimization across diverse applications. Experiments on standard benchmarks demonstrate that MGIoU and MGIoU+ consistently outperform existing losses while reducing loss computation latency by 10-40x. Additionally, MGIoU and MGIoU+ satisfy metric properties and scale-invariance, ensuring robustness as an objective function. We further propose MGIoU- for minimizing overlaps in tasks like collision-free trajectory prediction. Code is available at https://ldtho.github.io/MGIoU
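Read literally, the loss projects both convex shapes onto the target shape's edge normals and averages 1-D GIoU values over those axes. The NumPy sketch below follows that reading (duplicate normals of a rectangle are left in for simplicity); it is illustrative, not the authors' released implementation.

```python
# Toy sketch of the marginalized-GIoU idea: project both convex polygons onto the
# target polygon's edge normals and average the 1-D GIoU values over those axes.
import numpy as np

def giou_1d(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union - (hull - union) / hull

def mgiou(poly_pred: np.ndarray, poly_gt: np.ndarray) -> float:
    """poly_*: (N, 2) vertices of convex polygons in order."""
    edges = np.roll(poly_gt, -1, axis=0) - poly_gt
    normals = np.stack([-edges[:, 1], edges[:, 0]], axis=1)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    vals = []
    for n in normals:                       # duplicate directions are harmless here
        p = poly_pred @ n
        g = poly_gt @ n
        vals.append(giou_1d((p.min(), p.max()), (g.min(), g.max())))
    return float(np.mean(vals))             # in [-1, 1]; 1 - mgiou can serve as a loss

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
shifted = square + 0.25
print(f"MGIoU(square, shifted square) = {mgiou(shifted, square):.3f}")
```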

Authors:Hariseetharam Gunduboina, Muhammad Haris Khan, Biplab Banerjee
Title: FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing
Abstract:
In recent years, large-scale vision-language models (VLMs) like CLIP have gained attention for their zero-shot inference using instructional text prompts. While these models excel in general computer vision, their potential for domain generalization in remote sensing (RS) remains underexplored. Existing approaches enhance prompt learning by generating visual prompt tokens but rely on full-image features, introducing noise and background artifacts that vary within a class, causing misclassification. To address this, we propose FrogDogNet, a novel prompt learning framework integrating Fourier frequency filtering and self-attention to improve RS scene classification and domain generalization. FrogDogNet selectively retains invariant low-frequency components while eliminating noise and irrelevant backgrounds, ensuring robust feature representation across domains. The model first extracts significant features via projection and self-attention, then applies frequency-based filtering to preserve essential structural information for prompt learning. Extensive experiments on four RS datasets and three domain generalization tasks show that FrogDogNet consistently outperforms state-of-the-art prompt learning methods, demonstrating superior adaptability across domain shifts. Our findings highlight the effectiveness of frequency-based invariant feature retention in generalization, paving the way for broader applications. Our code is available at https://github.com/HariseetharamG/FrogDogNet
Chinese: FrogDogNet提出了一种新颖的提示学习框架,通过傅里叶频率滤波和自注意力机制保留不变的低频特征并消除噪声,显著提升了遥感场景分类和领域泛化能力,在多个数据集和任务中持续优于现有最优方法。
English: FrogDogNet introduces a novel prompt learning framework that leverages Fourier frequency filtering and self-attention to enhance domain generalization in remote sensing by preserving invariant low-frequency features while eliminating noise, consistently outperforming state-of-the-art methods across multiple datasets and tasks.
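A minimal sketch of the frequency-based filtering idea is given below, assuming a simple centered low-pass mask in the 2-D Fourier domain. The ratio of retained frequencies and the placement of this step inside the prompt-learning pipeline are assumptions, not the paper's exact design.

```python
import torch

def retain_low_frequencies(feat, keep_ratio=0.25):
    """Keep a centered low-frequency square of a [B, C, H, W] feature map and zero the rest."""
    f = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    B, C, H, W = feat.shape
    h, w = int(H * keep_ratio), int(W * keep_ratio)
    mask = torch.zeros(H, W, device=feat.device)
    mask[H // 2 - h // 2:H // 2 + h // 2, W // 2 - w // 2:W // 2 + w // 2] = 1
    f = f * mask                                   # suppress high-frequency (noise/background) content
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real
```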

Authors:Yuanjian Wang, Yufei Deng, Rong Xiao, Jiahao Fan, Chenwei Tang, Deng Xiong, Jiancheng Lv
Title: SaENeRF: Suppressing Artifacts in Event-based Neural Radiance Fields
Abstract:
Event cameras are neuromorphic vision sensors that asynchronously capture changes in logarithmic brightness, offering significant advantages such as low latency, low power consumption, low bandwidth, and high dynamic range. While these characteristics make them ideal for high-speed scenarios, reconstructing geometrically consistent and photometrically accurate 3D representations from event data remains fundamentally challenging. Current event-based Neural Radiance Fields (NeRF) methods partially address these challenges but suffer from persistent artifacts caused by aggressive network learning in early stages and the inherent noise of event cameras. To overcome these limitations, we present SaENeRF, a novel self-supervised framework that effectively suppresses artifacts and enables 3D-consistent, dense, and photorealistic NeRF reconstruction of static scenes solely from event streams. Our approach normalizes predicted radiance variations based on accumulated event polarities, facilitating progressive and rapid learning for scene representation construction. Additionally, we introduce regularization losses specifically designed to suppress artifacts in regions where photometric changes fall below the event threshold and simultaneously enhance the light intensity difference of non-zero events, thereby improving the visual fidelity of the reconstructed scene. Extensive qualitative and quantitative experiments demonstrate that our method significantly reduces artifacts and achieves superior reconstruction quality compared to existing methods. The code is available at https://github.com/Mr-firework/SaENeRF.
中文: 事件相机具有低延迟和高动态范围优势,但存在噪声和早期学习伪影问题;SaENeRF通过自监督的归一化处理和正则化损失,有效抑制伪影并实现高质量神经辐射场重建。
English: Event cameras offer low latency and high dynamic range but struggle with 3D reconstruction due to noise and early learning artifacts, which SaENeRF addresses through self-supervised normalization and regularization for superior, artifact-free NeRF results.
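For context, a generic event-supervision term is sketched below under assumed notation; it is not SaENeRF's exact loss. The rendered log-radiance change between two timestamps is compared against the signed event count scaled by the contrast threshold.

```python
import torch
import torch.nn.functional as F

def event_render_loss(logL_t0, logL_t1, event_count, threshold=0.25):
    """logL_t0/logL_t1: rendered log-radiance at two timestamps;
    event_count: signed accumulated polarities per pixel; threshold is an assumed contrast threshold."""
    pred_change = logL_t1 - logL_t0
    target_change = event_count * threshold
    return F.mse_loss(pred_change, target_change)
```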

Authors:Fengchun Liu, Tong Zhang, Chunying Zhang
Title: CLPSTNet: A Progressive Multi-Scale Convolutional Steganography Model Integrating Curriculum Learning
Abstract:
In recent years, a large number of works have introduced Convolutional Neural Networks (CNNs) into image steganography, replacing traditional steganography based on hand-crafted features and prior-knowledge design with methods in which neural networks learn information embedding autonomously. However, due to the inherent complexity of digital images, issues of invisibility and security persist when using CNN models for information embedding. In this paper, we propose the Curriculum Learning Progressive Steganography Network (CLPSTNet). The network consists of multiple progressive multi-scale convolutional modules that integrate Inception structures and dilated convolutions. Each module contains multiple branching pathways: it starts with small convolutional kernels and dilation rates to extract basic, local feature information from the feature map, and gradually expands to convolutions with larger kernels and dilation rates that perceive a larger receptive field, realizing multi-scale feature extraction from shallow to deep and from fine to coarse, so that shallow secret-information features are refined at different fusion stages. The experimental results show that the proposed CLPSTNet not only achieves high PSNR, SSIM, and decoding accuracy on three large public datasets, ALASKA2, VOC2012 and ImageNet, but also generates steganographic images with low steganalysis scores. You can find our code at https://github.com/chaos-boops/CLPSTNet.
Chinese: 本文提出CLPSTNet,一种基于课程学习的渐进式隐写网络,通过多尺度卷积模块实现从细到粗的特征提取,在多个数据集上展现出卓越的不可见性、安全性和性能指标。
English: This paper introduces CLPSTNet, a curriculum learning-based progressive steganography network that employs multi-scale convolutional modules to enhance feature extraction from fine to coarse, achieving superior invisibility, security, and performance metrics across multiple datasets.
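A hedged sketch of one progressive multi-scale convolutional module follows: parallel branches with growing kernel sizes and dilation rates, concatenated along channels. Branch widths, kernel sizes, and dilation rates here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleBranchBlock(nn.Module):
    """Inception-style block whose branches go from fine (small kernel, low dilation)
    to coarse (large kernel, high dilation); outputs are concatenated channel-wise."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 3
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=2, dilation=2),
            nn.Conv2d(in_ch, out_ch - 2 * branch_ch, kernel_size=5, padding=8, dilation=4),
        ])

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)
```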

Authors:André Longon
Title: Naturally Computed Scale Invariance in the Residual Stream of ResNet18
Abstract:
An important capacity in visual object recognition is invariance to image-altering variables which leave the identity of objects unchanged, such as lighting, rotation, and scale. How do neural networks achieve this? Prior mechanistic interpretability research has illuminated some invariance-building circuitry in InceptionV1, but the results are limited and networks with different architectures have remained largely unexplored. This work investigates ResNet18 with a particular focus on its residual stream, an architectural component which InceptionV1 lacks. We observe that many convolutional channels in intermediate blocks exhibit scale invariant properties, computed by the element-wise residual summation of scale equivariant representations: the block input's smaller-scale copy with the block pre-sum output's larger-scale copy. Through subsequent ablation experiments, we attempt to causally link these neural properties with scale-robust object recognition behavior. Our tentative findings suggest how the residual stream computes scale invariance and its possible role in behavior. Code is available at: https://github.com/cest-andre/residual-stream-interp
中文: 本研究揭示了ResNet18通过残差流整合多尺度等变表征来实现尺度不变性的机制,这种机制可能支撑了模型跨尺度物体识别的稳健表现。
English: This study explores how ResNet18's residual stream achieves scale invariance by combining scale-equivariant representations, potentially enabling robust object recognition across different sizes.

Authors:Henry Marichal, Verónica Casaravilla, Candice Power, Karolain Mello, Joaquín Mazarino, Christine Lucas, Ludmila Profumo, Diego Passarella, Gregory Randall
Title: DeepCS-TRD, a Deep Learning-based Cross-Section Tree Ring Detector
Abstract:
Here, we propose Deep CS-TRD, a new automatic algorithm for detecting tree rings in whole cross-sections. It substitutes the edge detection step of CS-TRD with a deep-learning-based approach (U-Net), which allows the application of the method to different image domains: microscopy, scanner, or smartphone acquired, and species (Pinus taeda, Gleditsia triacanthos and Salix glauca). Additionally, we introduce two publicly available datasets of annotated images to the community. The proposed method outperforms state-of-the-art approaches in macro images (Pinus taeda and Gleditsia triacanthos) while showing slightly lower performance in microscopy images of Salix glauca. To our knowledge, this is the first paper that studies automatic tree ring detection for such different species and acquisition conditions. The dataset and source code are available at https://github.com/hmarichal93/deepcstrd
中文摘要:Deep CS-TRD提出了一种基于U-Net深度学习的自动树木年轮检测算法,适用于多种图像类型和树种,在多数情况下优于现有方法,并公开了数据集和源代码。
English Summary: Deep CS-TRD introduces a deep learning-based algorithm using U-Net for automatic tree ring detection across multiple image types and species, outperforming existing methods in most cases while providing open datasets and code.

Authors:Obed Korshie Dzikunu, Amirhossein Toosi, Shadab Ahamed, Sara Harsini, Francois Benard, Xiaoxiao Li, Arman Rahmim
Title: Comprehensive Evaluation of Quantitative Measurements from Automated Deep Segmentations of PSMA PET/CT Images
Abstract:
This study performs a comprehensive evaluation of quantitative measurements as extracted from automated deep-learning-based segmentation methods, beyond traditional Dice Similarity Coefficient assessments, focusing on six quantitative metrics, namely SUVmax, SUVmean, total lesion activity (TLA), tumor volume (TMTV), lesion count, and lesion spread. We analyzed 380 prostate-specific membrane antigen (PSMA) targeted [18F]DCFPyL PET/CT scans of patients with biochemical recurrence of prostate cancer, training deep neural networks, U-Net, Attention U-Net and SegResNet with four loss functions: Dice Loss, Dice Cross Entropy, Dice Focal Loss, and our proposed L1 weighted Dice Focal Loss (L1DFL). Evaluations indicated that Attention U-Net paired with L1DFL achieved the strongest correlation with the ground truth (concordance correlation = 0.90-0.99 for SUVmax and TLA), whereas models employing the Dice Loss and the other two compound losses, particularly with SegResNet, underperformed. Equivalence testing (TOST, alpha = 0.05, Delta = 20%) confirmed high performance for SUV metrics, lesion count and TLA, with L1DFL yielding the best performance. By contrast, tumor volume and lesion spread exhibited greater variability. Bland-Altman, Coverage Probability, and Total Deviation Index analyses further highlighted that our proposed L1DFL minimizes variability in quantification of the ground truth clinical measures. The code is publicly available at: https://github.com/ObedDzik/pca_segment.git.
中文: 本研究评估了深度学习分割方法在前列腺癌PET/CT扫描中的应用,发现采用L1DFL损失函数的Attention U-Net模型与临床测量值具有最佳相关性,同时显著降低了定量评估的变异性。
English: This study evaluates deep-learning segmentation methods for prostate cancer PET/CT scans, finding that Attention U-Net with the proposed L1DFL loss function achieves superior correlation with clinical measurements while reducing variability in quantitative assessments.
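For reference, a standard Dice + Focal loss for binary segmentation is sketched below; the paper's contribution, the L1-weighted variant (L1DFL), adds a weighting scheme that is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dice_focal_loss(logits, target, gamma=2.0, eps=1e-6):
    """Baseline Dice + Focal loss for binary masks shaped [B, 1, H, W] (or [B, 1, D, H, W])."""
    prob = torch.sigmoid(logits)
    dims = tuple(range(1, target.dim()))                      # reduce over all but the batch dim
    inter = (prob * target).sum(dims)
    dice = 1 - (2 * inter + eps) / (prob.sum(dims) + target.sum(dims) + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    focal = ((1 - torch.exp(-bce)) ** gamma * bce).mean(dims)  # (1 - p_t)^gamma * CE
    return (dice + focal).mean()
```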

Authors:Jingchao Wang, Hong Wang, Wenlong Zhang, Kunhua Ji, Dingjiang Huang, Yefeng Zheng
Title: Progressive Language-guided Visual Learning for Multi-Task Visual Grounding
Abstract:
Multi-task visual grounding (MTVG) includes two sub-tasks, i.e., Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). The existing representative approaches generally follow a research pipeline which mainly consists of three core procedures: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Albeit achieving remarkable performance, this research line has two limitations: 1) The linguistic content has not been fully injected into the entire visual backbone for boosting more effective visual feature extraction, and it needs an extra cross-modal interaction module; 2) The relationship between REC and RES tasks is not effectively exploited to help the collaborative prediction for more accurate output. To deal with these problems, in this paper, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects the language information to help learn linguistic-related visual features. In this manner, our PLVL does not need an additional cross-modal fusion module while fully introducing the language guidance. Furthermore, we analyze that the localization center for REC would help identify the to-be-segmented object region for RES to some extent. Inspired by this investigation, we design a multi-task head to accomplish collaborative predictions for these two sub-tasks. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that our PLVL clearly outperforms the representative methods in both REC and RES tasks. https://github.com/jcwang0602/PLVL
中文摘要:本文提出PLVL框架,通过渐进式语言引导视觉学习,无需额外跨模态交互模块即可实现语言与视觉的深度融合,并利用多任务头协同优化指称表达理解与分割任务,显著提升了性能。
English Summary: The paper introduces PLVL, a framework that progressively integrates language guidance into visual learning for multi-task visual grounding, eliminating the need for separate cross-modal fusion and enhancing collaborative predictions between referring expression comprehension and segmentation.

Authors:Xingxing Zuo, Nikhil Ranganathan, Connor Lee, Georgia Gkioxari, Soon-Jo Chung
Title: MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation
Abstract:
Monocular depth estimation (MDE) from thermal images is a crucial technology for robotic systems operating in challenging conditions such as fog, smoke, and low light. The limited availability of labeled thermal data constrains the generalization capabilities of thermal MDE models compared to foundational RGB MDE models, which benefit from datasets of millions of images across diverse scenarios. To address this challenge, we introduce a novel pipeline that enhances thermal MDE through knowledge distillation from a versatile RGB MDE model. Our approach features a confidence-aware distillation method that utilizes the predicted confidence of the RGB MDE to selectively strengthen the thermal MDE model, capitalizing on the strengths of the RGB model while mitigating its weaknesses. Our method significantly improves the accuracy of the thermal MDE, independent of the availability of labeled depth supervision, and greatly expands its applicability to new scenarios. In our experiments on new scenarios without labeled depth, the proposed confidence-aware distillation method reduces the absolute relative error of thermal MDE by 22.88% compared to the baseline without distillation.
中文摘要:本研究提出一种置信度感知蒸馏方法,通过从RGB单目深度估计模型迁移知识来提升热图像深度估计性能,在无标注热数据情况下将相对误差降低了22.88%。
English Summary: This study introduces a confidence-aware distillation method that transfers knowledge from RGB-based monocular depth estimation models to enhance thermal image depth estimation, achieving a 22.88% error reduction without requiring labeled thermal data.
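The confidence-aware distillation idea can be sketched as a per-pixel loss weighted by the RGB teacher's predicted confidence. The threshold, loss form, and masking rule below are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_distill_loss(student_depth, teacher_depth, teacher_conf, tau=0.5):
    """Weight the per-pixel distillation error by the teacher's confidence and
    ignore pixels whose confidence falls below an assumed threshold tau."""
    weight = teacher_conf * (teacher_conf > tau).float()
    err = F.smooth_l1_loss(student_depth, teacher_depth, reduction="none")
    return (weight * err).sum() / weight.sum().clamp(min=1e-6)
```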

Authors:Jiaxing Xu, Kai He, Yue Tang, Wei Li, Mengcheng Lan, Xia Dong, Yiping Ke, Mengling Feng
Title: BrainPrompt: Multi-Level Brain Prompt Enhancement for Neurological Condition Identification
Abstract:
Neurological conditions, such as Alzheimer's Disease, are challenging to diagnose, particularly in the early stages where symptoms closely resemble healthy controls. Existing brain network analysis methods primarily focus on graph-based models that rely solely on imaging data, which may overlook important non-imaging factors and limit the model's predictive power and interpretability. In this paper, we present BrainPrompt, an innovative framework that enhances Graph Neural Networks (GNNs) by integrating Large Language Models (LLMs) with knowledge-driven prompts, enabling more effective capture of complex, non-imaging information and external knowledge for neurological disease identification. BrainPrompt integrates three types of knowledge-driven prompts: (1) ROI-level prompts to encode the identity and function of each brain region, (2) subject-level prompts that incorporate demographic information, and (3) disease-level prompts to capture the temporal progression of disease. By leveraging these multi-level prompts, BrainPrompt effectively harnesses knowledge-enhanced multi-modal information from LLMs, enhancing the model's capability to predict neurological disease stages and meanwhile offers more interpretable results. We evaluate BrainPrompt on two resting-state functional Magnetic Resonance Imaging (fMRI) datasets from neurological disorders, showing its superiority over state-of-the-art methods. Additionally, a biomarker study demonstrates the framework's ability to extract valuable and interpretable information aligned with domain knowledge in neuroscience. The code is available at https://github.com/AngusMonroe/BrainPrompt
中文摘要:BrainPrompt是一种创新框架,通过将大型语言模型与知识驱动提示相结合来增强图神经网络,能够有效捕捉复杂的非成像信息并提高神经系统疾病诊断的可解释性。
English Summary: BrainPrompt is a novel framework that integrates Large Language Models with knowledge-driven prompts into Graph Neural Networks to enhance neurological disease diagnosis by capturing complex non-imaging information and improving interpretability.

Authors:Ziqi Pang, Yu-Xiong Wang
Title: MR. Video: "MapReduce" is the Principle for Long Video Understanding
Abstract:
We propose MR. Video, an agentic long video understanding framework that demonstrates the simple yet effective MapReduce principle for processing long videos: (1) Map: independently and densely perceiving short video clips, and (2) Reduce: jointly aggregating information from all clips. Compared with sequence-to-sequence vision-language models (VLMs), MR. Video performs detailed short video perception without being limited by context length. Compared with existing video agents that typically rely on sequential key segment selection, the Map operation enables simpler and more scalable sequence parallel perception of short video segments. Its Reduce step allows for more comprehensive context aggregation and reasoning, surpassing explicit key segment retrieval. This MapReduce principle is applicable to both VLMs and video agents, and we use LLM agents to validate its effectiveness. In practice, MR. Video employs two MapReduce stages: (A) Captioning: generating captions for short video clips (map), then standardizing repeated characters and objects into shared names (reduce); (B) Analysis: for each user question, analyzing relevant information from individual short videos (map), and integrating them into a final answer (reduce). MR. Video achieves over 10% accuracy improvement on the challenging LVBench compared to state-of-the-art VLMs and video agents. Code is available at: https://github.com/ziqipang/MR-Video
中文: MR. Video框架采用MapReduce原理处理长视频,通过独立感知短视频片段并联合整合信息,在LVBench上相比现有方法实现了超过10%的准确率提升。
English: MR. Video is an agentic framework that applies the MapReduce principle to long video understanding by independently perceiving short clips and jointly aggregating their information, achieving over 10% accuracy improvement on LVBench compared to existing methods.
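The MapReduce principle itself is simple to sketch: caption every short clip independently (map), then aggregate all captions in a single reasoning call (reduce). In the sketch below, `caption_fn` and `reduce_fn` are placeholders for the VLM/LLM calls, not the paper's actual prompts.

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce_video(clips, caption_fn, reduce_fn):
    """Map: caption each short clip independently (parallelizable).
    Reduce: jointly aggregate all captions into a single answer."""
    with ThreadPoolExecutor() as pool:
        captions = list(pool.map(caption_fn, clips))
    return reduce_fn(captions)
```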

Authors:Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu
Title: Survey of Video Diffusion Models: Foundations, Implementations, and Applications
Abstract:
Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial networks-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c) which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more updated, and more fine-grained perspective on diffusion-based approaches with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. A structured list of related works involved in this survey is also available on https://github.com/Eyeline-Research/Survey-Video-Diffusion.
中文: 扩散模型以卓越的时序一致性和视觉质量革新了视频生成领域,本综述系统性地梳理了其技术演进与方法体系,在剖析运动连贯性等挑战的同时,为研究者提供了涵盖评估指标与工程实践的完整资源库。
English: Diffusion models have revolutionized video generation with superior quality and temporal consistency, though challenges in motion coherence and efficiency persist, as this comprehensive survey systematically reviews their evolution, methodologies, and applications while offering an updated perspective on the field.

Authors:Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, Hongsheng Li
Title: From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Abstract:
Recent text-to-image diffusion models achieve impressive visual quality through extensive scaling of training data and model parameters, yet they often struggle with complex scenes and fine-grained details. Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. ReflectionFlow introduces three complementary inference-time scaling axes: (1) noise-level scaling to optimize latent initialization; (2) prompt-level scaling for precise semantic guidance; and most notably, (3) reflection-level scaling, which explicitly provides actionable reflections to iteratively assess and correct previous generations. To facilitate reflection-level scaling, we construct GenRef, a large-scale dataset comprising 1 million triplets, each containing a reflection, a flawed image, and an enhanced image. Leveraging this dataset, we efficiently perform reflection tuning on state-of-the-art diffusion transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified framework. Experimental results show that ReflectionFlow significantly outperforms naive noise-level scaling methods, offering a scalable and compute-efficient solution toward higher-quality image synthesis on challenging tasks.
中文摘要:ReflectionFlow是一种推理时框架,通过噪声级、提示级和反思级三个扩展维度,使扩散模型能够迭代反思并优化输出,在复杂任务中显著提升图像生成质量。
English Summary: ReflectionFlow is an inference-time framework that enables diffusion models to iteratively reflect on and refine their outputs through three scaling axes, significantly improving image quality on complex tasks.
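Reflection-level scaling amounts to an iterative generate-critique-refine loop at inference time. The sketch below assumes placeholder callables (`generate`, `critique`, `refine`) and a fixed round budget, and omits the noise-level and prompt-level axes.

```python
def reflect_and_refine(prompt, generate, critique, refine, max_rounds=3):
    """Generate an image, ask a model to describe its flaws, and condition the
    next generation on that reflection until no flaws remain or the budget runs out."""
    image = generate(prompt)
    for _ in range(max_rounds):
        reflection = critique(prompt, image)   # textual description of remaining flaws
        if not reflection:
            break
        image = refine(prompt, image, reflection)
    return image
```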

Authors:Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
Title: LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Abstract:
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally-aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in Live-CC-5M dataset for pre-training and Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.
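The core data construction, densely interleaving ASR words and video frames by timestamp, can be sketched as a simple merge. Tokenization and frame-sampling details are omitted, and the pair format below is an assumption.

```python
def interleave_asr_and_frames(words, frames):
    """words: list of (timestamp, word); frames: list of (timestamp, frame).
    Returns one time-ordered stream of ("word", ...) and ("frame", ...) items."""
    stream = sorted(
        [(t, "word", w) for t, w in words] + [(t, "frame", f) for t, f in frames],
        key=lambda item: item[0],
    )
    return [(kind, payload) for _, kind, payload in stream]
```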

Authors:Song Wang, Xiaolu Liu, Lingdong Kong, Jianyun Xu, Chunyong Hu, Gongfan Fang, Wentong Li, Jianke Zhu, Xinchao Wang
Title: PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
Abstract:
Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: https://github.com/songw-zju/PointLoRA.
中文: PointLoRA提出了一种参数高效的微调方法,结合低秩适应与多尺度令牌选择来优化点云模型,仅使用3.43%的可训练参数就能实现优异性能,同时提升全局和局部特征提取能力。
English: PointLoRA introduces a parameter-efficient fine-tuning method that integrates low-rank adaptation with multi-scale token selection to optimize point cloud models, achieving competitive performance using only 3.43% of trainable parameters while enhancing global and local feature extraction.
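A minimal LoRA wrapper illustrates the low-rank adaptation part of PointLoRA (the multi-scale token selection is not shown); the rank and scaling values are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update B @ A,
    as typically inserted into attention/MLP projections of a transformer."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```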

Authors:Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng
Title: FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Abstract:
Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance, yet existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image. Additionally, our framework incorporates a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
Chinese: FreeGraftor提出了一种无需训练的框架,通过跨图像特征嫁接和创新的噪声初始化策略,在无需微调的情况下实现了卓越的主体保真度和文本对齐效果,有效解决了主体驱动图像生成中保真度与效率的权衡问题。
English: FreeGraftor introduces a training-free framework that overcomes the fidelity-efficiency trade-off in subject-driven image generation by employing cross-image feature grafting and a novel noise initialization strategy, achieving superior subject fidelity and text alignment without requiring fine-tuning.

Authors:Ekaterina Kondrateva, Sandzhi Barg, Mikhail Vasiliev
Title: Benchmarking the Reproducibility of Brain MRI Segmentation Across Scanners and Time
Abstract:
Accurate and reproducible brain morphometry from structural MRI is critical for monitoring neuroanatomical changes across time and across imaging domains. Although deep learning has accelerated segmentation workflows, scanner-induced variability and reproducibility limitations remain, especially in longitudinal and multi-site settings. In this study, we benchmark two modern segmentation pipelines, FastSurfer and SynthSeg, both integrated into FreeSurfer, one of the most widely adopted tools in neuroimaging. Using two complementary datasets, a 17-year longitudinal cohort (SIMON) and a 9-site test-retest cohort (SRPBS), we quantify inter-scan segmentation variability using the Dice coefficient, Surface Dice, Hausdorff Distance (HD95), and Mean Absolute Percentage Error (MAPE). Our results reveal up to 7-8% volume variation in small subcortical structures such as the amygdala and ventral diencephalon, even under controlled test-retest conditions. This raises a key question: is it feasible to detect subtle longitudinal changes on the order of 5-10% in pea-sized brain regions, given the magnitude of domain-induced morphometric noise? We further analyze the effects of registration templates and interpolation modes, and propose surface-based quality filtering to improve segmentation reliability. This study provides a reproducible benchmark for morphometric reproducibility and emphasizes the need for harmonization strategies in real-world neuroimaging studies. Code and figures: https://github.com/kondratevakate/brain-mri-segmentation
中文: 本研究对FastSurfer和SynthSeg流程进行基准测试,发现小型脑区存在高达7-8%的体积变异,强调在神经影像研究中需采用协调策略来提升分割可靠性。
English: This study benchmarks FastSurfer and SynthSeg pipelines, revealing up to 7-8% volume variation in small brain structures and emphasizing the need for harmonization strategies to improve segmentation reliability in neuroimaging.
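Two of the reported reproducibility metrics are straightforward to compute from paired test-retest segmentations. The sketch below shows the Dice coefficient on binary masks and the volume MAPE, assuming structures have already been matched across scans.

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks of the same structure."""
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum())

def volume_mape(vol_ref, vol_retest):
    """Mean absolute percentage error between paired structure volumes (reference vs. retest)."""
    vol_ref = np.asarray(vol_ref, float)
    vol_retest = np.asarray(vol_retest, float)
    return np.mean(np.abs(vol_ref - vol_retest) / vol_ref) * 100
```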

Authors:Chang Zong, Bin Li, Shoujun Zhou, Jian Wan, Lei Zhang
Title: Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions
Abstract:
Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.
中文: 本文提出In-VAL任务,模拟人与视频的交互过程,通过Ask2Loc框架解决语义鸿沟问题来定位教学片段,相比传统方法性能提升高达14.91 mIoU。
English: The paper introduces In-VAL, a task simulating human-video interactions to locate instructional segments by resolving semantic gaps through the Ask2Loc framework, which enhances performance by up to 14.91 mIoU over traditional methods.

Authors:Lotfi Abdelkrim Mecharbat, Ibrahim Almakky, Martin Takac, Mohammad Yaqub
Title: MedNNS: Supernet-based Medical Task-Adaptive Neural Network Search
Abstract:
Deep learning (DL) has achieved remarkable progress in the field of medical imaging. However, adapting DL models to medical tasks remains a significant challenge, primarily due to two key factors: (1) architecture selection, as different tasks necessitate specialized model designs, and (2) weight initialization, which directly impacts the convergence speed and final performance of the models. Although transfer learning from ImageNet is a widely adopted strategy, its effectiveness is constrained by the substantial differences between natural and medical images. To address these challenges, we introduce Medical Neural Network Search (MedNNS), the first Neural Network Search framework for medical imaging applications. MedNNS jointly optimizes architecture selection and weight initialization by constructing a meta-space that encodes datasets and models based on how well they perform together. We build this space using a Supernetwork-based approach, expanding the model zoo size by 51x over previous state-of-the-art (SOTA) methods. Moreover, we introduce rank loss and Fréchet Inception Distance (FID) loss into the construction of the space to capture inter-model and inter-dataset relationships, thereby achieving more accurate alignment in the meta-space. Experimental results across multiple datasets demonstrate that MedNNS significantly outperforms both ImageNet pre-trained DL models and SOTA Neural Architecture Search (NAS) methods, achieving an average accuracy improvement of 1.7% across datasets while converging substantially faster. The code and the processed meta-space are available at https://github.com/BioMedIA-MBZUAI/MedNNS.
中文: 深度学习在医学影像领域取得显著进展,但模型适应仍具挑战;MedNNS通过构建元空间联合优化架构与权重初始化,以更高精度和更快收敛速度超越现有方法。
English: Deep learning has advanced medical imaging but faces challenges in model adaptation, which MedNNS addresses by jointly optimizing architecture and weight initialization through a meta-space, outperforming existing methods with higher accuracy and faster convergence.

Authors:Diego de Oliveira Hitzges, Suman Ghosh, Guillermo Gallego
Title: DERD-Net: Learning Depth from Event-based Ray Densities
Abstract:
Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high-speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows for increases in depth completeness by more than 3-fold while still yielding a reduction in median absolute error of at least 30%. Given its remarkable performance and effective processing of event-data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: https://github.com/tub-rip/DERD-Net
Chinese: 本文提出了一种基于事件相机的新型深度学习框架,通过3D卷积和循环结构处理局部视差空间图像,在单目和立体设置中均实现最先进性能,同时保持恒定计算复杂度。
English: This paper introduces a novel deep learning framework for event-based depth estimation that processes local disparity space images with 3D convolutions and recurrent structures, achieving state-of-the-art performance in both monocular and stereo setups while maintaining constant computational complexity.
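A simplified sketch of processing a local DSI sub-volume with 3D convolutions to score depth hypotheses for its center pixel is given below. The paper's recurrent structure, exact head, and patch geometry are omitted, so this illustrates only the input/output shapes rather than DERD-Net itself.

```python
import torch
import torch.nn as nn

class LocalDSINet(nn.Module):
    """Input: a local DSI patch [B, 1, D, h, w] (D depth bins, h x w spatial window).
    Output: one score per depth bin for the patch's center pixel."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, 1)

    def forward(self, dsi_patch):
        feat = self.conv(dsi_patch)                      # [B, 16, D, h, w]
        feat = feat.mean(dim=(-2, -1)).transpose(1, 2)   # pool spatially -> [B, D, 16]
        return self.head(feat).squeeze(-1)               # [B, D] per-depth-bin scores
```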

Authors:Lei Xu, Mehmet Yamac, Mete Ahishali, Moncef Gabbouj
Title: Multi-Scale Tensorial Summation and Dimensional Reduction Guided Neural Network for Edge Detection
Abstract:
Edge detection has attracted considerable attention thanks to its exceptional ability to enhance performance in downstream computer vision tasks. In recent years, various deep learning methods have been explored for edge detection tasks resulting in a significant performance improvement compared to conventional computer vision algorithms. In neural networks, edge detection tasks require considerably large receptive fields to provide satisfactory performance. In a typical convolutional operation, such a large receptive field can be achieved by utilizing a significant number of consecutive layers, which yields deep network structures. Recently, a Multi-scale Tensorial Summation (MTS) factorization operator was presented, which can achieve very large receptive fields even from the initial layers. In this paper, we propose a novel MTS Dimensional Reduction (MTS-DR) module guided neural network, MTS-DR-Net, for the edge detection task. The MTS-DR-Net uses MTS layers, and corresponding MTS-DR blocks as a new backbone to remove redundant information initially. Such a dimensional reduction module enables the neural network to focus specifically on relevant information (i.e., necessary subspaces). Finally, a weight U-shaped refinement module follows MTS-DR blocks in the MTS-DR-Net. We conducted extensive experiments on two benchmark edge detection datasets: BSDS500 and BIPEDv2 to verify the effectiveness of our model. The implementation of the proposed MTS-DR-Net can be found at https://github.com/LeiXuAI/MTS-DR-Net.git.
Chinese: 本文提出了一种新型的MTS-DR-Net神经网络,它采用多尺度张量求和降维模块来高效获取大感受野并消除冗余信息,在BSDS500和BIPEDv2数据集上的实验验证了其显著提升边缘检测性能的有效性。
English: This paper introduces MTS-DR-Net, a novel neural network that employs Multi-scale Tensorial Summation Dimensional Reduction modules to achieve large receptive fields efficiently and remove redundant information, significantly enhancing edge detection performance as validated on BSDS500 and BIPEDv2 datasets.

Authors:Qirui Yang, Fangpu Zhang, Yeying Jin, Qihua Cheng, Peng-Tao Jiang, Huanjing Yue, Jingyu Yang
Title: DSDNet: Raw Domain Demoiréing via Dual Color-Space Synergy
Abstract:
With the rapid advancement of mobile imaging, capturing screens using smartphones has become a prevalent practice in distance learning and conference recording. However, moiré artifacts, caused by frequency aliasing between display screens and camera sensors, are further amplified by the image signal processing pipeline, leading to severe visual degradation. Existing sRGB domain demoiréing methods struggle with irreversible information loss, while recent two-stage raw domain approaches suffer from information bottlenecks and inference inefficiency. To address these limitations, we propose a single-stage raw domain demoiréing framework, Dual-Stream Demoiréing Network (DSDNet), which leverages the synergy of raw and YCbCr images to remove moiré while preserving luminance and color fidelity. Specifically, to guide luminance correction and moiré removal, we design a raw-to-YCbCr mapping pipeline and introduce the Synergic Attention with Dynamic Modulation (SADM) module. This module enriches the raw-to-sRGB conversion with cross-domain contextual features. Furthermore, to better guide color fidelity, we develop a Luminance-Chrominance Adaptive Transformer (LCAT), which decouples luminance and chrominance representations. Extensive experiments demonstrate that DSDNet outperforms state-of-the-art methods in both visual quality and quantitative evaluation and achieves an inference speed 2.4x faster than the second-best method, highlighting its practical advantages. We provide an anonymous online demo at https://xxxxxxxxdsdnet.github.io/DSDNet/.
Chinese: 本文提出的DSDNet单阶段原始域去摩尔纹框架,在有效消除摩尔纹的同时保持了色彩保真度,并实现了比现有方法显著更快的推理速度。
English: This paper introduces DSDNet, a single-stage raw domain framework that effectively removes moiré artifacts while maintaining color fidelity and achieving significantly faster inference speeds than existing methods.

Authors:Manjunath D, Aniruddh Sikdar, Prajwal Gurunath, Sumanth Udupa, Suresh Sundaram
Title: SAGA: Semantic-Aware Gray color Augmentation for Visible-to-Thermal Domain Adaptation across Multi-View Drone and Ground-Based Vision Systems
Abstract:
Domain-adaptive thermal object detection plays a key role in facilitating visible (RGB)-to-thermal (IR) adaptation by reducing the need for co-registered image pairs and minimizing reliance on large annotated IR datasets. However, inherent limitations of IR images, such as the lack of color and texture cues, pose challenges for RGB-trained models, leading to increased false positives and poor-quality pseudo-labels. To address this, we propose Semantic-Aware Gray color Augmentation (SAGA), a novel strategy for mitigating color bias and bridging the domain gap by extracting object-level features relevant to IR images. Additionally, to validate the proposed SAGA for drone imagery, we introduce the IndraEye, a multi-sensor (RGB-IR) dataset designed for diverse applications. The dataset contains 5,612 images with 145,666 instances, captured from diverse angles, altitudes, backgrounds, and times of day, offering valuable opportunities for multimodal learning, domain adaptation for object detection and segmentation, and exploration of sensor-specific strengths and weaknesses. IndraEye aims to enhance the development of more robust and accurate aerial perception systems, especially in challenging environments. Experimental results show that SAGA significantly improves RGB-to-IR adaptation on autonomous driving datasets and the IndraEye dataset, achieving consistent performance gains of +0.4% to +7.6% (mAP) when integrated with state-of-the-art domain adaptation techniques. The dataset and codes are available at https://github.com/airliisc/IndraEye.
中文: 本文提出SAGA方法,通过语义感知的灰度色彩增强减少颜色偏差,有效弥合RGB到红外的域间差异,并在新型IndraEye数据集上验证了其对无人机图像域自适应目标检测的显著提升效果。
English: The paper introduces SAGA, a semantic-aware gray color augmentation method that effectively bridges the RGB-to-IR domain gap by reducing color bias and improving object detection, validated on the new IndraEye dataset which enhances multimodal aerial perception with significant performance gains.
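The gray-color augmentation can be sketched as replacing the RGB values inside object masks with their grayscale intensity, suppressing color cues before RGB-to-IR adaptation. The semantic selection policy (which classes or instances to convert) is the paper's contribution and is not reproduced here.

```python
import numpy as np

def semantic_gray_augment(image, masks):
    """image: [H, W, 3] array; masks: iterable of boolean [H, W] object masks.
    Returns a copy where masked object pixels are replaced by their luminance."""
    out = image.astype(np.float32)
    gray = image @ np.array([0.299, 0.587, 0.114], dtype=np.float32)  # ITU-R BT.601 luminance
    for m in masks:
        out[m] = gray[m, None]                                        # broadcast gray value to RGB
    return out.astype(image.dtype)
```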

Authors:Yannic Neuhaus, Matthias Hein
Title: RePOPE: Impact of Annotation Errors on the POPE Benchmark
Abstract:
Since data annotation is costly, benchmark datasets often incorporate labels from established image datasets. In this work, we assess the impact of label errors in MSCOCO on the frequently used object hallucination benchmark POPE. We re-annotate the benchmark images and identify an imbalance in annotation errors across different subsets. Evaluating multiple models on the revised labels, which we denote as RePOPE, we observe notable shifts in model rankings, highlighting the impact of label quality. Code and data are available at https://github.com/YanNeu/RePOPE.
中文摘要:本研究评估了MSCOCO数据集中的标签错误对POPE物体幻觉基准的影响,通过重新标注数据创建RePOPE后发现,模型性能排名发生显著变化,凸显了标签质量的重要性。
English Summary: This study evaluates how label errors in the MSCOCO dataset affect the POPE object hallucination benchmark, revealing that re-annotating the data (RePOPE) significantly alters model performance rankings and underscores the importance of label quality.
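Since POPE is a yes/no object-existence benchmark, the effect of re-annotation reduces to re-scoring the same model answers against the revised labels. A minimal scoring sketch (accuracy and F1) is shown below.

```python
def pope_metrics(predictions, labels):
    """predictions, labels: parallel lists of "yes"/"no" strings.
    Run once with the original labels and once with the revised (RePOPE-style) labels."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    acc = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1
```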

Authors:Yunfeng Li, Bo Wang, Jiahao Wan, Xueyi Wu, Ye Li
Title: SonarT165: A Large-scale Benchmark and STFTrack Framework for Acoustic Object Tracking
Abstract:
Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules, a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM module integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM module introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve feature quality, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/SonarT165.
中文: 为解决水下声学目标跟踪缺乏统一基准的问题,本研究提出了首个大规模基准SonarT165和高效框架STFTrack,该框架通过多视图模板融合和最优轨迹校正等创新模块,实现了最先进的性能。
English: To address the lack of a unified benchmark for underwater acoustic object tracking, this study introduces SonarT165, the first large-scale benchmark, and proposes STFTrack, an efficient framework with novel modules that achieves state-of-the-art performance by integrating multi-view features and correcting trajectories.

Authors:Zizhi Chen, Xinyu Zhang, Minghao Han, Yizhou Liu, Ziyun Qian, Weifeng Zhang, Xukun Zhang, Jingwei Wei, Lihua Zhang
Title: VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining
Abstract:
In histopathology, tissue sections are typically stained using common H&E staining or special stains (MAS, PAS, PASM, etc.) to clearly visualize specific tissue structures. The rapid advancement of deep learning offers an effective solution for generating virtually stained images, significantly reducing the time and labor costs associated with traditional histochemical staining. However, a new challenge arises in separating the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents. Additionally, virtual staining often overlooks essential pathological knowledge and the physical properties of staining, resulting in only style-level transfer. To address these issues, we introduce, for the first time in virtual staining tasks, a pathological vision-language large model (VLM) as an auxiliary tool. We integrate contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors to leverage the extensive knowledge of the pathological VLM. This approach is designed to describe, frame, and enhance the direction of virtual staining. Furthermore, we have developed a data augmentation method based on the constraints of the VLM. This method utilizes the VLM's powerful image interpretation capabilities to further integrate image style and structural information, proving beneficial in high-precision pathological diagnostics. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate that our method can generate highly realistic images and enhance the accuracy of downstream tasks, such as glomerular detection and segmentation. Our code is available at: https://github.com/CZZZZZZZZZZZZZZZZZ/VPGAN-HARBOR
中文摘要:本研究首次在虚拟染色任务中引入病理视觉语言大模型,通过对比学习提示和概念锚点整合,显著提升图像真实感及下游病理检测的准确性。
English Summary: This study introduces a pathological vision-language model to enhance virtual staining by integrating contrastive prompts and concept anchors, improving realism and accuracy in medical imaging tasks.

Authors:Yixuan Zhu, Haolin Wang, Ao Li, Wenliang Zhao, Yansong Tang, Jingxuan Niu, Lei Chen, Jie Zhou, Jiwen Lu
Title: InstaRevive: One-Step Image Enhancement via Dynamic Score Matching
Abstract:
Image enhancement finds wide-ranging applications in real-world scenarios due to complex environments and the inherent limitations of imaging devices. Recent diffusion-based methods yield promising outcomes but necessitate prolonged and computationally intensive iterative sampling. In response, we propose InstaRevive, a straightforward yet powerful image enhancement framework that employs score-based diffusion distillation to harness potent generative capability and minimize the sampling steps. To fully exploit the potential of the pre-trained diffusion model, we devise a practical and effective diffusion distillation pipeline using dynamic control to address inaccuracies in updating direction during score matching. Our control strategy enables a dynamic diffusing scope, facilitating precise learning of denoising trajectories within the diffusion model and ensuring accurate distribution matching gradients during training. Additionally, to enrich guidance for the generative power, we incorporate textual prompts via image captioning as auxiliary conditions, fostering further exploration of the diffusion model. Extensive experiments substantiate the efficacy of our framework across a diverse array of challenging tasks and datasets, unveiling the compelling efficacy and efficiency of InstaRevive in delivering high-quality and visually appealing results. Code is available at https://github.com/EternalEvan/InstaRevive.
中文:InstaRevive是一种高效的图像增强框架,通过基于分数的扩散蒸馏技术结合动态控制和文本提示,在减少采样步骤的同时仍能生成高质量的视觉效果。
English: InstaRevive is an efficient image enhancement framework that uses score-based diffusion distillation with dynamic control and text prompts to reduce sampling steps while maintaining high-quality generative results.

Authors:Eammon A. Littler, Emmanuel A. Mannoh, Ethan P. M. LaRochelle
Title: Fluorescence Reference Target Quantitative Analysis Library
Abstract:
Standardized performance evaluation of fluorescence imaging systems remains a critical unmet need in the field of fluorescence-guided surgery (FGS). While the American Association of Physicists in Medicine (AAPM) TG311 report and recent FDA draft guidance provide recommended metrics for system characterization, practical tools for extracting these metrics remain limited, inconsistent, and often inaccessible. We present QUEL-QAL, an open-source Python library designed to streamline and standardize the quantitative analysis of fluorescence images using solid reference targets. The library provides a modular, reproducible workflow that includes region of interest (ROI) detection, statistical analysis, and visualization capabilities. QUEL-QAL supports key metrics such as response linearity, limit of detection, depth sensitivity, and spatial resolution, in alignment with regulatory and academic guidance. Built on widely adopted Python packages, the library is designed to be extensible, enabling users to adapt it to novel target designs and analysis protocols. By promoting transparency, reproducibility, and regulatory alignment, QUEL-QAL offers a foundational tool to support standardized benchmarking and accelerate the development and evaluation of fluorescence imaging systems.
Chinese: QUEL-QAL 是一个开源 Python 库,通过提供模块化工作流程来标准化荧光成像系统的定量分析,解决了荧光引导手术领域对统一评估工具的迫切需求。
English: QUEL-QAL is an open-source Python library that standardizes the quantitative analysis of fluorescence imaging systems by providing modular workflows for key performance metrics, addressing the unmet need for consistent evaluation tools in fluorescence-guided surgery.
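One of the supported metrics, response linearity, can be illustrated with a generic least-squares fit of mean ROI intensity against fluorophore concentration. This is not the QUEL-QAL API, just the underlying computation.

```python
import numpy as np

def response_linearity(concentrations, roi_means):
    """Fit mean ROI intensity vs. concentration with a line and report slope, intercept, and R^2."""
    x = np.asarray(concentrations, float)
    y = np.asarray(roi_means, float)
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r2
```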

Authors:Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Title: CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Abstract:
Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion as well as difficulty in counting in images. Code and data: https://github.com/atinpothiraj/CAPTURe
中文摘要:CAPTURe任务通过评估视觉语言模型对遮挡物后方图案化物体的计数能力,发现即使先进模型也难以进行空间推理,而人类表现近乎完美。
English Summary: The CAPTURe task evaluates vision-language models' ability to count patterned objects behind occlusions, revealing that even advanced models struggle with spatial reasoning about hidden objects while humans perform nearly flawlessly.

Authors:David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin
Title: IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Abstract:
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, merely achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
中文: 现有的多模态大模型评估主要关注图像推理或通用视频理解任务,忽视了图像上下文在视频理解中的重要作用,因此提出了首个全面的图像基础视频感知与推理基准IV-Bench,发现当前先进模型在此任务上表现严重不足,最高准确率仅达28.9%,并揭示了影响性能的关键因素。
English: Current multimodal models are evaluated primarily on image reasoning or general video tasks, neglecting the role of image context in video understanding, so IV-Bench is introduced as the first comprehensive benchmark for image-grounded video perception and reasoning, revealing that state-of-the-art models significantly underperform with at most 28.9% accuracy and highlighting key influencing factors.

Authors:Tajamul Ashraf, Rajes Manna, Partha Sarathi Purkayastha, Tavaheed Tariq, Janibul Bashir
Title: Context Aware Grounded Teacher for Source Free Object Detection
Abstract:
We focus on the Source Free Object Detection (SFOD) problem, when source data is unavailable during adaptation, and the model must adapt to the unlabeled target domain. In medical imaging, several approaches have leveraged a semi-supervised student-teacher architecture to bridge domain discrepancy. Context imbalance in labeled training data and significant domain shifts between domains can lead to biased teacher models that produce inaccurate pseudolabels, degrading the student model's performance and causing a mode collapse. Class imbalance, particularly when one class significantly outnumbers another, leads to contextual bias. To tackle the problem of context bias and the significant performance drop of the student model in the SFOD setting, we introduce Grounded Teacher (GT) as a standard framework. In this study, we model contextual relationships using a dedicated relational context module and leverage it to mitigate inherent biases in the model. This approach enables us to apply augmentations to closely related classes, across and within domains, enhancing the performance of underrepresented classes while keeping the effect on dominant classes minimal. We further improve the quality of predictions by implementing an expert foundational branch to supervise the student model. We validate the effectiveness of our approach in mitigating context bias under the SFOD setting through experiments on three medical datasets supported by comprehensive ablation studies. All relevant resources, including preprocessed data, trained model weights, and code, are publicly available at https://github.com/Tajamul21/Grounded_Teacher.
中文摘要:Grounded Teacher框架通过建模上下文关系并实施针对性增强,解决了源自由目标检测中的上下文偏见和性能下降问题,有效提升弱势类别表现,同时对主导类别影响极小。
English Summary: The Grounded Teacher framework addresses context bias and performance degradation in Source Free Object Detection by modeling contextual relationships and applying targeted augmentations to improve underrepresented class performance while maintaining minimal impact on dominant classes.
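For readers unfamiliar with student-teacher adaptation, the sketch below shows the generic mean-teacher EMA update such pipelines rely on. It is a minimal illustration under that assumption, not the exact Grounded Teacher training code.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    """Exponential-moving-average update of the teacher from the student.

    This is the standard mean-teacher rule used by many student-teacher
    adaptation pipelines; the exact schedule in Grounded Teacher may differ.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

# Typical loop (schematically): the teacher produces pseudo-labels on weakly
# augmented target images, the student trains on strongly augmented views,
# and the teacher is refreshed with ema_update(teacher, student) each step.
```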

Authors:Ankit Dhiman, Manan Shah, R Venkatesh Babu
Title: MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World
Abstract:
Diffusion models have become central to various image editing tasks, yet they often fail to fully adhere to physical laws, particularly with effects like shadows, reflections, and occlusions. In this work, we address the challenge of generating photorealistic mirror reflections using diffusion-based generative models. Despite extensive training data, existing diffusion models frequently overlook the nuanced details crucial to authentic mirror reflections. Recent approaches have attempted to resolve this by creating synthetic datasets and framing reflection generation as an inpainting task; however, they struggle to generalize across different object orientations and positions relative to the mirror. Our method overcomes these limitations by introducing key augmentations into the synthetic data pipeline: (1) random object positioning, (2) randomized rotations, and (3) grounding of objects, significantly enhancing generalization across poses and placements. To further address spatial relationships and occlusions in scenes with multiple objects, we implement a strategy to pair objects during dataset generation, resulting in a dataset robust enough to handle these complex scenarios. Achieving generalization to real-world scenes remains a challenge, so we introduce a three-stage training curriculum to develop the MirrorFusion 2.0 model to improve real-world performance. We provide extensive qualitative and quantitative evaluations to support our approach. The project page is available at: https://mirror-verse.github.io/.

Authors:Huimin Zeng, Jiacheng Li, Zhiwei Xiong
Title: Plug-and-Play Versatile Compressed Video Enhancement
Abstract:
As a widely adopted technique in data transmission, video compression effectively reduces the size of files, making it possible for real-time cloud computing. However, it comes at the cost of visual quality, posing challenges to the robustness of downstream vision models. In this work, we present a versatile codec-aware enhancement framework that reuses codec information to adaptively enhance videos under different compression settings, assisting various downstream vision tasks without introducing computation bottleneck. Specifically, the proposed codec-aware framework consists of a compression-aware adaptation (CAA) network that employs a hierarchical adaptation mechanism to estimate parameters of the frame-wise enhancement network, namely the bitstream-aware enhancement (BAE) network. The BAE network further leverages temporal and spatial priors embedded in the bitstream to effectively improve the quality of compressed input frames. Extensive experimental results demonstrate the superior quality enhancement performance of our framework over existing enhancement methods, as well as its versatility in assisting multiple downstream tasks on compressed videos as a plug-and-play module. Code and models are available at https://huimin-zeng.github.io/PnP-VCVE/.

Authors:Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
Title: Towards Understanding Camera Motions in Any Video
Abstract:
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
中文摘要:CameraBench是一个用于评估和改进摄像机运动理解的大规模数据集和基准,包含专家标注的视频和运动基元分类法,揭示了现有模型的局限性,并通过微调实现了性能提升。
English Summary: CameraBench is a comprehensive dataset and benchmark for evaluating camera motion understanding, featuring expert-annotated videos and a taxonomy of motion primitives that reveals the limitations of current models and enables improved performance through fine-tuning.

Authors:Wei Fang, Priyadarshini Panda
Title: Event2Vec: Processing neuromorphic events directly by representations in vector space
Abstract:
Neuromorphic event cameras have overwhelming advantages in temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, event cameras output asynchronous, sparse, and irregular events, which are not compatible with mainstream computer vision and deep learning methods. Various methods have been proposed to solve this issue but at the cost of long preprocessing procedures, losing temporal resolution, or being incompatible with massively parallel computation. Inspired by the great success of word to vector (word2vec), we summarize the similarities between words and events, then propose the first event to vector (event2vec) representation. We validate event2vec on classifying the ASL-DVS dataset, showing better parameter efficiency, accuracy, and speed than previous graph/image/voxel-based representations. Beyond task performance, the most attractive advantage of event2vec is that it aligns events to the domain of natural language processing, showing the promising prospect of integrating events into large language and multimodal models. Our code, models, and training logs are available at https://github.com/fangwei123456/event2vec.
中文摘要:event2vec方法提出了一种新颖的表征方式,使神经网络能够直接处理神经形态事件数据,不仅实现了高参数效率和吞吐量,还为与大型语言模型的集成开辟了新途径。
English Summary: The event2vec method introduces a novel representation that enables neural networks to process neuromorphic event data directly, offering high parameter efficiency and throughput while paving the way for integration with large language models.
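The word2vec analogy can be made concrete with a minimal sketch: each discretized event is mapped to an integer id and looked up in an embedding table. The tokenization below (one id per (x, y, polarity) triple) is a hypothetical simplification, not necessarily how event2vec actually discretizes events.

```python
import torch
import torch.nn as nn

# Hypothetical discretization: an event (x, y, polarity) is mapped to a single
# integer id, mirroring how a word is mapped to a vocabulary index. The actual
# event2vec tokenization may differ from this sketch.
H, W, EMBED_DIM = 128, 128, 64

embedding = nn.Embedding(num_embeddings=H * W * 2, embedding_dim=EMBED_DIM)

def events_to_vectors(x, y, polarity):
    """x, y, polarity: integer tensors of shape (N,) for N events."""
    token_ids = (y * W + x) * 2 + polarity          # one id per (x, y, p) triple
    return embedding(token_ids)                      # (N, EMBED_DIM) event vectors

vecs = events_to_vectors(torch.tensor([3, 70]), torch.tensor([5, 90]), torch.tensor([0, 1]))
print(vecs.shape)  # torch.Size([2, 64])
```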

Authors:Yuan-Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, David Acuna
Title: LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
Abstract:
Recent reasoning models through test-time scaling have demonstrated that long chain-of-thoughts can unlock substantial performance boosts in hard reasoning tasks such as math and code. However, the benefit of such long thoughts for system-2 reasoning is relatively less explored in other domains such as perceptual tasks where shallower, system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts to elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4 points improvement over 5 vision-centric benchmarks, including +11.8 points on V$^*$ Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark, MMLU-Pro, by +2 points.

Authors:Yike Zhang, Eduardo Davalos, Jack Noble
Title: Vision6D: 3D-to-2D Interactive Visualization and Annotation Tool for 6D Pose Estimation
Abstract:
Accurate 6D pose estimation has gained more attention over the years for robotics-assisted tasks that require precise interaction with physical objects. This paper presents an interactive 3D-to-2D visualization and annotation tool to support the 6D pose estimation research community. To the best of our knowledge, the proposed work is the first tool that allows users to visualize and manipulate 3D objects interactively on a 2D real-world scene, along with a comprehensive user study. This system supports robust 6D camera pose annotation by providing both visual cues and spatial relationships to determine object position and orientation in various environments. The annotation feature in Vision6D is particularly helpful in scenarios where the transformation matrix between the camera and world objects is unknown, as it enables accurate annotation of these objects' poses using only the camera intrinsic matrix. This capability serves as a foundational step in developing and training advanced pose estimation models across various domains. We evaluate Vision6D's effectiveness by utilizing widely-used open-source pose estimation datasets Linemod and HANDAL through comparisons between the default ground-truth camera poses and manual annotations. A user study was performed to show that Vision6D generates accurate pose annotations via visual cues in an intuitive 3D user interface. This approach aims to bridge the gap between 2D scene projections and 3D scenes, offering an effective way for researchers and developers to solve 6D pose annotation-related problems. The software is open-source and publicly available at https://github.com/InteractiveGL/vision6D.
中文: 本文介绍了Vision6D这一交互式3D到2D可视化标注工具,它通过让用户在真实场景中操控3D物体来实现精确的6D姿态估计,并附有用户研究和开源发布。
English: This paper introduces Vision6D, an interactive 3D-to-2D visualization and annotation tool that enables precise 6D pose estimation by allowing users to manipulate 3D objects in real-world scenes, supported by a user study and open-source availability.
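Annotating a pose with only the camera intrinsic matrix boils down to overlaying the 3D model under a candidate pose via the pinhole projection. A minimal sketch of that projection, with hypothetical intrinsics and pose values, is shown below.

```python
import numpy as np

def project_points(points_3d: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pinhole projection of object-frame 3D points into the image.

    points_3d: (N, 3) points on the object model, K: (3, 3) camera intrinsics,
    R: (3, 3) rotation and t: (3,) translation of the candidate 6D pose.
    Returns (N, 2) pixel coordinates.
    """
    cam = points_3d @ R.T + t              # object frame -> camera frame
    uv = cam @ K.T                         # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]          # perspective divide

# Hypothetical example: overlay one model vertex under a candidate pose.
K = np.array([[572.4, 0.0, 325.3], [0.0, 573.6, 242.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.5])     # half a metre in front of the camera
print(project_points(np.array([[0.01, 0.02, 0.0]]), K, R, t))
```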

Authors:Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma
Title: Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
Abstract:
Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge in Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and the capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.

Authors:Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
Title: VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Abstract:
Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These various types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.
中文: VisuLogic是一个旨在评估多模态模型真实视觉推理能力的新基准,结果显示当前模型表现不佳,准确率低于30%,远落后于人类水平。
English: VisuLogic is a new benchmark designed to evaluate genuine visual reasoning in multimodal models, revealing that current models perform poorly with accuracy below 30%, significantly lagging behind human capabilities.

Authors:Xiaoyu Han, Shunyuan Zheng, Zonglin Li, Chenyang Wang, Xin Sun, Quanling Meng
Title: Shape-Guided Clothing Warping for Virtual Try-On
Abstract:
Image-based virtual try-on aims to seamlessly fit in-shop clothing to a person image while maintaining pose consistency. Existing methods commonly employ the thin plate spline (TPS) transformation or appearance flow to deform in-shop clothing for aligning with the person's body. Despite their promising performance, these methods often lack precise control over fine details, leading to inconsistencies in shape between clothing and the person's body as well as distortions in exposed limb regions. To tackle these challenges, we propose a novel shape-guided clothing warping method for virtual try-on, dubbed SCW-VTON, which incorporates global shape constraints and additional limb textures to enhance the realism and consistency of the warped clothing and try-on results. To integrate global shape constraints for clothing warping, we devise a dual-path clothing warping module comprising a shape path and a flow path. The former path captures the clothing shape aligned with the person's body, while the latter path leverages the mapping between the pre- and post-deformation of the clothing shape to guide the estimation of appearance flow. Furthermore, to alleviate distortions in limb regions of try-on results, we integrate detailed limb guidance by developing a limb reconstruction network based on masked image modeling. Through the utilization of SCW-VTON, we are able to generate try-on results with enhanced clothing shape consistency and precise control over details. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/xyhanHIT/SCW-VTON.
中文摘要:提出的SCW-VTON方法通过整合全局形状约束和肢体纹理重建,有效提升了虚拟试衣中服装与身体的贴合度并减少肢体区域变形,在定性和定量评估中均优于现有方法。
English Summary: The proposed SCW-VTON method enhances virtual clothing try-on by incorporating global shape constraints and limb texture reconstruction to improve clothing-body alignment and reduce limb distortions, outperforming existing techniques.

Authors:Xianpan Zhou
Title: Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform
Abstract:
The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline including shot boundary detection, OCR, border detection, motion filtering, and fine-grained bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: https://tinytigerpan.github.io/tiger200k/

Authors:Chengxi Han, Xiaoyu Su, Zhiqiang Wei, Meiqi Hu, Yichu Xu
Title: HSANET: A Hybrid Self-Cross Attention Network For Remote Sensing Change Detection
Abstract:
The remote sensing image change detection task is an essential method for large-scale monitoring. We propose HSANet, a network that uses hierarchical convolution to extract multi-scale features. It incorporates hybrid self-attention and cross-attention mechanisms to learn and fuse global and cross-scale information. This enables HSANet to capture global context at different scales and integrate cross-scale features, refining edge details and improving detection performance. We will also open-source our model code: https://github.com/ChengxiHAN/HSANet.
Chinese: HSANet提出了一种结合混合自注意力和交叉注意力的分层网络,用于提升遥感变化检测中的多尺度特征融合和边缘细节,代码已开源在GitHub上。
English: HSANet introduces a hierarchical network with hybrid self-attention and cross-attention to enhance multi-scale feature fusion and edge detail in remote sensing change detection, with code available on GitHub.
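As a generic illustration of the cross-scale fusion idea, the sketch below lets coarse-scale tokens attend to fine-scale tokens with standard multi-head cross-attention. It is not the actual HSANet module, whose hybrid self-/cross-attention design may differ.

```python
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Generic cross-attention: coarse-scale tokens attend to fine-scale tokens.

    A standard multi-head attention block, shown only to illustrate the kind of
    cross-scale fusion described in the abstract, not HSANet's exact design.
    """
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, coarse_tokens, fine_tokens):
        fused, _ = self.attn(query=coarse_tokens, key=fine_tokens, value=fine_tokens)
        return fused + coarse_tokens   # residual fusion of cross-scale context

x_coarse = torch.randn(2, 256, 64)   # e.g. a flattened 16x16 feature map
x_fine = torch.randn(2, 1024, 64)    # e.g. a flattened 32x32 feature map
print(CrossScaleAttention()(x_coarse, x_fine).shape)  # torch.Size([2, 256, 64])
```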

Authors:Huzheng Yang, Katherine Xu, Michael D. Grossberg, Yutong Bai, Jianbo Shi
Title: "I Know It When I See It": Mood Spaces for Connecting and Expressing Visual Concepts
Abstract:
Expressing complex concepts is easy when they can be labeled or quantified, but many ideas are hard to define yet instantly recognizable. We propose a Mood Board, where users convey abstract concepts with examples that hint at the intended direction of attribute changes. We compute an underlying Mood Space that 1) factors out irrelevant features and 2) finds the connections between images, thus bringing relevant concepts closer. We invent a fibration computation to compress/decompress pre-trained features into/from a compact space, 50-100x smaller. The main innovation is learning to mimic the pairwise affinity relationship of the image tokens across exemplars. To focus on the coarse-to-fine hierarchical structures in the Mood Space, we compute the top eigenvector structure from the affinity matrix and define a loss in the eigenvector space. The resulting Mood Space is locally linear and compact, allowing image-level operations, such as object averaging, visual analogy, and pose transfer, to be performed as a simple vector operation in Mood Space. Our learning is efficient in computation without any fine-tuning, needs only a few (2-20) exemplars, and takes less than a minute to learn.

Authors:Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
Title: EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Abstract:
In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.
中文:EasyEdit2是一个即插即用的框架,通过测试时干预让用户无需深厚技术知识即可轻松控制大型语言模型的行为,仅需一个示例即可精确调整模型响应。
English: EasyEdit2 is a plug-and-play framework that enables users to easily control Large Language Model behaviors through test-time interventions, requiring minimal technical knowledge and allowing precise adjustments with just a single example.
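The underlying activation-steering idea, adding a steering vector to hidden states at inference time without touching parameters, can be sketched with a plain PyTorch forward hook. The layer name and steering vector below are hypothetical; EasyEdit2's own steering vector generator and applier modules wrap this logic differently.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, steering_vector: torch.Tensor, alpha: float = 1.0):
    """Register a forward hook that shifts a layer's output by a steering vector.

    Illustrates the general activation-steering idea (behaviour changes at
    inference without modifying parameters); names here are hypothetical and
    this is not the EasyEdit2 API.
    """
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Usage sketch (hypothetical module path for a loaded causal LM):
# handle = add_steering_hook(model.transformer.h[10], steering_vec, alpha=4.0)
# ... run generation ...
# handle.remove()
```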

Authors:Minh-Quan Viet Bui, Jongmin Park, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, Munchurl Kim
Title: MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video
Abstract:
We present MoBGS, a novel deblurring dynamic 3D Gaussian Splatting (3DGS) framework capable of reconstructing sharp and high-quality novel spatio-temporal views from blurry monocular videos in an end-to-end manner. Existing dynamic novel view synthesis (NVS) methods are highly sensitive to motion blur in casually captured videos, resulting in significant degradation of rendering quality. While recent approaches address motion-blurred inputs for NVS, they primarily focus on static scene reconstruction and lack dedicated motion modeling for dynamic objects. To overcome these limitations, our MoBGS introduces a novel Blur-adaptive Latent Camera Estimation (BLCE) method for effective latent camera trajectory estimation, improving global camera motion deblurring. In addition, we propose a physically-inspired Latent Camera-induced Exposure Estimation (LCEE) method to ensure consistent deblurring of both global camera and local object motion. Our MoBGS framework ensures the temporal consistency of unseen latent timestamps and robust motion decomposition of static and dynamic regions. Extensive experiments on the Stereo Blur dataset and real-world blurry videos show that our MoBGS significantly outperforms the very recent advanced methods (DyBluRF and Deblur4DGS), achieving state-of-the-art performance for dynamic NVS under motion blur.

Authors:Mohamed el amine Boudjoghra, Ivan Laptev, Angela Dai
Title: ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
Abstract:
With the fast pace of 3D capture technology and resulting abundance of 3D data, effective 3D scene editing becomes essential for a variety of graphics applications. In this work we present ScanEdit, an instruction-driven method for functional editing of complex, real-world 3D scans. To model large and interdependent sets of objects, we propose a hierarchically-guided approach. Given a 3D scan decomposed into its object instances, we first construct a hierarchical scene graph representation to enable effective, tractable editing. We then leverage reasoning capabilities of Large Language Models (LLMs) and translate high-level language instructions into actionable commands applied hierarchically to the scene graph. Finally, ScanEdit integrates LLM-based guidance with explicit physical constraints and generates realistic scenes where object arrangements obey both physics and common sense. In our extensive experimental evaluation ScanEdit outperforms the state of the art and demonstrates excellent results for a variety of real-world scenes and input instructions.

Authors:Shiben Liu, Huijie Fan, Qiang Wang, Baojie Fan, Yandong Tang, Liangqiong Qu
Title: Distribution-aware Forgetting Compensation for Exemplar-Free Lifelong Person Re-identification
Abstract:
Lifelong Person Re-identification (LReID) suffers from a key challenge in preserving old knowledge while adapting to new information. The existing solutions include rehearsal-based and rehearsal-free methods to address this challenge. Rehearsal-based approaches rely on knowledge distillation, continuously accumulating forgetting during the distillation process. Rehearsal-free methods insufficiently learn the distribution of each domain, leading to forgetfulness over time. To solve these issues, we propose a novel Distribution-aware Forgetting Compensation (DAFC) model that explores cross-domain shared representation learning and domain-specific distribution integration without using old exemplars or knowledge distillation. We propose a Text-driven Prompt Aggregation (TPA) that utilizes text features to enrich prompt elements and guide the prompt model to learn fine-grained representations for each instance. This can enhance the differentiation of identity information and establish the foundation for domain distribution awareness. Then, Distribution-based Awareness and Integration (DAI) is designed to capture each domain-specific distribution by a dedicated expert network and adaptively consolidate them into a shared region in high-dimensional space. In this manner, DAI can consolidate and enhance cross-domain shared representation learning while alleviating catastrophic forgetting. Furthermore, we develop a Knowledge Consolidation Mechanism (KCM) that comprises instance-level discrimination and cross-domain consistency alignment strategies to facilitate model adaptive learning of new knowledge from the current domain and promote knowledge consolidation learning between acquired domain-specific distributions, respectively. Experimental results show that our DAFC outperforms state-of-the-art methods. Our code is available at https://github.com/LiuShiBen/DAFC.
中文摘要:提出的分布感知遗忘补偿(DAFC)模型通过文本驱动提示聚合和领域分布感知,在不依赖旧样本的情况下增强跨领域表征学习并缓解灾难性遗忘,从而解决终身行人重识别的关键挑战。
English Summary: The proposed Distribution-aware Forgetting Compensation (DAFC) model addresses lifelong person re-identification challenges by integrating text-driven prompt aggregation and domain distribution awareness to enhance cross-domain representation learning while mitigating catastrophic forgetting without relying on old exemplars.

Authors:Weijie He, Mushui Liu, Yunlong Yu, Zhao Wang, Chao Wu
Title: DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation
Abstract:
Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a training-free framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) A Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generates physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) A Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving precise control over individual entities; and (3) An Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis. The code is released at https://github.com/XiaoBuL/DyST-XL.
Chinese: DyST-XL 是一种无需训练的框架,通过整合动态布局规划、双提示注意力控制和实体一致性约束,提升文本到视频模型在合成具有精确时空关系的复杂视频方面的性能。
English: DyST-XL is a training-free framework that enhances text-to-video models by integrating dynamic layout planning, dual-prompt attention control, and entity-consistency constraints to improve compositional video generation with precise spatial-temporal relationships.

Authors:Xin Li, Xijun Wang, Bingchen Li, Kun Yuan, Yizhen Shao, Suhang Yao, Ming Sun, Chao Zhou, Radu Timofte, Zhibo Chen
Title: NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: KwaiSR Dataset and Study
Abstract:
In this work, we build the first benchmark dataset for short-form UGC Image Super-resolution in the wild, termed KwaiSR, intended to advance research on developing image super-resolution algorithms for short-form UGC platforms. This dataset, collected from the Kwai Platform, is composed of two parts, i.e., a synthetic part and a wild part. Among them, the synthetic dataset, including 1,900 image pairs, is produced by simulating the degradation following the distribution of real-world low-quality short-form UGC images, aiming to provide the ground truth for training and objective comparison in the validation/testing. The wild dataset contains low-quality images collected directly from the Kwai Platform, which are filtered using the quality assessment method KVQ from the Kwai Platform. As a result, the KwaiSR dataset contains 1800 synthetic image pairs and 1900 wild images, which are divided into training, validation, and testing parts with a ratio of 8:1:1. Based on the KwaiSR dataset, we organize the second NTIRE challenge on short-form UGC video quality assessment and enhancement (NTIRE 2025), which attracted many researchers to develop algorithms for it. The results of this competition reveal that our KwaiSR dataset is highly challenging for existing image SR methods, which is expected to lead to a new direction in the image super-resolution field. The dataset can be found at https://lixinustc.github.io/NTIRE2025-KVQE-KwaSR-KVQ.github.io/.
中文: 本研究推出了首个短格式用户生成图像超分辨率基准数据集KwaiSR,通过合成与真实数据推动算法发展,并对现有方法构成挑战。
English: This study introduces KwaiSR, the first benchmark dataset for super-resolution of short-form user-generated images, designed to advance algorithm development and challenge existing methods through its synthetic and wild components.

Authors:Hong-Tao Yu, Xiu-Shen Wei, Yuxin Peng, Serge Belongie
Title: Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks-fundamental to computer vision-remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.
中文摘要:本研究提出包含101万问题和33万图像的细粒度评估基准FG-BMK,通过对12个大型视觉语言模型的系统评估,揭示了训练范式对性能影响的关键发现,为未来模型发展提供了重要指导。
English Summary: This study introduces FG-BMK, a comprehensive fine-grained evaluation benchmark with 1.01 million questions and 0.33 million images, systematically assessing twelve LVLMs to reveal critical insights about training paradigms and performance limitations for advancing future model development.

Authors:Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, Fan Wang
Title: RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild
Abstract:
Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective that, as long as the foundation model is powerful enough, straightforward model modifications with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our sufficient analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as TikTok dataset and UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.
中文:本文提出的RealisDance-DiT方法通过对强大视频基础模型进行最小化架构修改并采用高效微调策略,在应对多样化现实挑战时大幅优于现有方法,实现了更优的可控角色动画效果。
English: This paper introduces RealisDance-DiT, a method that achieves superior controllable character animation by minimally modifying a powerful video foundation model and implementing efficient fine-tuning strategies, significantly outperforming existing approaches across diverse real-world challenges.

Authors:Qianyu Zhu, Junjie Wang, Jeremiah Hu, Jia Ai, Yong Lee
Title: PIV-FlowDiffuser:Transfer-learning-based denoising diffusion models for PIV
Abstract:
Deep learning algorithms have significantly reduced the computational time and improved the spatial resolution of particle image velocimetry (PIV). However, models trained on synthetic datasets might have a degraded performance on practical particle images due to domain gaps. As a result, special residual patterns are often observed for the vector fields of deep learning-based estimators. To reduce this characteristic noise step by step, we employ a denoising diffusion model (FlowDiffuser) for PIV analysis, and train the data-hungry iterative denoising diffusion model via a transfer learning strategy, resulting in our PIV-FlowDiffuser method. Specifically, we (1) pre-train a FlowDiffuser model with multiple optical flow datasets from the computer vision community, such as Sintel and KITTI; and (2) fine-tune the pre-trained model on synthetic PIV datasets. Note that the PIV images are upsampled by a factor of two to resolve the small-scale turbulent flow structures. The visualized results indicate that our PIV-FlowDiffuser effectively suppresses the noise patterns. As a result, the denoising diffusion model reduces the average end-point error (AEE) by 59.4% over the RAFT256-PIV baseline on the classic Cai's dataset. Besides, PIV-FlowDiffuser exhibits enhanced generalization performance on unseen particle images due to transfer learning. Overall, this study highlights transfer-learning-based denoising diffusion models for PIV; interested readers are referred to the repository https://github.com/Zhu-Qianyu/PIV-FlowDiffuser for a detailed implementation.
中文: 本研究提出PIV-FlowDiffuser,一种基于迁移学习的去噪扩散模型,能有效降低粒子图像测速中的噪声,将平均端点误差减少59.4%,并提升对实际粒子图像的泛化性能。
English: This study introduces PIV-FlowDiffuser, a denoising diffusion model trained via transfer learning that effectively reduces noise in particle image velocimetry, cutting the average endpoint error by 59.4% and improving generalization on real-world particle images.

Authors:Geng Li, Jinglin Xu, Yunzhen Zhao, Yuxin Peng
Title: DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Abstract:
Humans can effortlessly locate desired objects in cluttered environments, relying on a cognitive mechanism known as visual search to efficiently filter out irrelevant information and focus on task-related regions. Inspired by this process, we propose Dyfo (Dynamic Focus), a training-free dynamic focusing visual search method that enhances fine-grained visual understanding in large multimodal models (LMMs). Unlike existing approaches which require additional modules or data collection, Dyfo leverages a bidirectional interaction between LMMs and visual experts, using a Monte Carlo Tree Search (MCTS) algorithm to simulate human-like focus adjustments. This enables LMMs to focus on key visual regions while filtering out irrelevant content, without introducing additional training caused by vocabulary expansion or the integration of specialized localization modules. Experimental results demonstrate that Dyfo significantly improves fine-grained visual understanding and reduces hallucination issues in LMMs, achieving superior performance across both fixed and dynamic resolution models. The code is available at https://github.com/PKU-ICST-MIPL/DyFo_CVPR2025
中文:Dyfo是一种无需训练的视觉搜索方法,通过蒙特卡洛树搜索动态聚焦关键区域,无需额外模块或数据即可提升大型多模态模型的细粒度理解能力。
English: Dyfo is a training-free visual search method that enhances fine-grained understanding in large multimodal models by dynamically focusing on key regions using Monte Carlo Tree Search, improving performance without extra modules or data.

Authors:Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue
Title: OmniAudio: Generating Spatial Audio from 360-Degree Video
Abstract:
Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and perspective video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets are available at https://github.com/liuhuadai/OmniAudio. The project website is available at https://OmniAudio-360V2SA.github.io.
Chinese: 传统视频转音频方法缺乏空间线索,本研究提出360V2SA新任务,通过OmniAudio框架从360度视频生成空间音频,在Sphere360数据集上取得了最优性能。
English: Traditional video-to-audio methods lack spatial cues, so this study introduces 360V2SA, a novel task using the OmniAudio framework to generate spatial audio from 360-degree videos, achieving state-of-the-art results on the Sphere360 dataset.
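For context on the FOA target format, the sketch below encodes a mono source at a given azimuth and elevation into the four first-order ambisonic channels under one common convention (ACN channel order, SN3D normalization). OmniAudio generates such channels from video rather than computing them from a known source position; this is only background on the representation.

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Encode a mono signal into first-order ambisonics (ACN order, SN3D norm).

    azimuth/elevation in radians; returns an array of shape (4, T) with channels
    ordered W, Y, Z, X. Other conventions (e.g. FuMa) scale W by 1/sqrt(2).
    """
    w = mono
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])

signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of a 440 Hz tone
foa = encode_foa(signal, azimuth=np.pi / 4, elevation=0.0)     # source 45 degrees to the left
print(foa.shape)  # (4, 16000)
```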

Authors:Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, Yanwei Fu
Title: Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Abstract:
Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.
中文: Uni3C是一个统一的3D增强框架,通过即插即用的控制模块和联合对齐的3D世界引导,实现了视频生成中相机与人体运动的精准控制,在控制能力和运动质量上均显著优于现有方法。
English: Uni3C is a unified 3D-enhanced framework that achieves precise control of both camera and human motion in video generation through a plug-and-play control module and jointly aligned 3D world guidance, outperforming existing methods in controllability and motion quality.

Authors:Aihua Zheng, Yongqi Sun, Zi Wang, Chenglong Li, Jin Tang
Title: Collaborative Enhancement Network for Low-quality Multi-spectral Vehicle Re-identification
Abstract:
The performance of multi-spectral vehicle Re-identification (ReID) is significantly degraded when some important discriminative cues in visible, near infrared and thermal infrared spectra are lost. Existing methods generate or enhance missing details in low-quality spectra data using the high-quality one, generally called the primary spectrum, but how to determine the primary spectrum is a challenging problem. In addition, when the quality of the primary spectrum is low, the enhancement effect would be greatly degraded, thus limiting the performance of multi-spectral vehicle ReID. To address these problems, we propose the Collaborative Enhancement Network (CoEN), which generates a high-quality proxy from all spectra data and leverages it to supervise the selection of primary spectrum and enhance all spectra features in a collaborative manner, for robust multi-spectral vehicle ReID. First, to integrate the rich cues from all spectra data, we design the Proxy Generator (PG) to progressively aggregate multi-spectral features. Second, we design the Dynamic Quality Sort Module (DQSM), which sorts all spectra data by measuring their correlations with the proxy, to accurately select the primary spectra with the highest correlation. Finally, we design the Collaborative Enhancement Module (CEM) to effectively compensate for missing contents of all spectra by collaboratively leveraging the primary spectra and the proxy, thereby mitigating the impact of low-quality primary spectra. Extensive experiments on three benchmark datasets are conducted to validate the efficacy of the proposed approach against other multi-spectral vehicle ReID methods. The code will be released at https://github.com/yongqisun/CoEN.
Chinese: 提出的协同增强网络(CoEN)通过生成高质量代理来指导主光谱选择并协同增强所有光谱特征,有效应对多光谱车辆重识别中数据质量变化带来的挑战。
English: The proposed Collaborative Enhancement Network (CoEN) addresses multi-spectral vehicle ReID challenges by generating a high-quality proxy to guide primary spectrum selection and collaboratively enhance all spectral features, improving robustness against data quality variations.

Authors:Chris Dongjoo Kim, Jihwan Moon, Sangwoo Moon, Heeseung Yun, Sihaeng Lee, Aniruddha Kembhavi, Soonyoung Lee, Gunhee Kim, Sangho Lee, Christopher Clark
Title: ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
Abstract:
The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real-time, offers a promising solution to these issues while also allowing swift adaptations in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning involves identifying and prioritizing data that enhances performance on target downstream tasks. We propose the Relevance and Specificity-based online filtering framework (ReSpec) that selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real-time, eliminating the need for extensive storage and compute. Evaluating on large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5% of the data while incurring minimal compute. The source code is available at https://github.com/cdjkim/ReSpec.
Chinese: ReSpec框架通过基于模态对齐、任务相关性、特异性和处理效率实时筛选视频文本数据,显著提升了在线学习效率,在仅使用少量数据和计算的情况下,于零样本视频检索任务中实现了最优性能。
English: The ReSpec framework enhances online learning efficiency by filtering video-text data in real-time based on modality alignment, task relevance, specificity, and processing efficiency, achieving top performance on zero-shot retrieval tasks with minimal data and computation.
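The two core scoring ideas, relevance to the target task and specificity relative to a root embedding, can be sketched as a per-sample keep/drop decision. The actual ReSpec scoring is probabilistic; the thresholds, embedding dimensions, and function names below are illustrative assumptions only.

```python
import numpy as np

def respec_keep(clip_embedding, task_references, root_embedding,
                relevance_threshold=0.3, specificity_threshold=0.5):
    """Decide whether to keep one incoming video-text sample.

    Relevance: best cosine similarity to target-task reference embeddings.
    Specificity: distance to a 'root' embedding representing the least
    specific data. This is a minimal sketch, not the ReSpec implementation.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    relevance = max(cosine(clip_embedding, ref) for ref in task_references)
    specificity = float(np.linalg.norm(clip_embedding - root_embedding))
    return relevance >= relevance_threshold and specificity >= specificity_threshold

rng = np.random.default_rng(0)
refs = [rng.normal(size=512) for _ in range(8)]            # hypothetical task references
print(respec_keep(rng.normal(size=512), refs, np.zeros(512)))
```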

Authors:Ryu Tadokoro, Tsukasa Takagi, Shin-ichi Maeda
Title: Segmentation with Noisy Labels via Spatially Correlated Distributions
Abstract:
In semantic segmentation, the accuracy of models heavily depends on the high-quality annotations. However, in many practical scenarios such as medical imaging and remote sensing, obtaining true annotations is not straightforward and usually requires significant human labor. Relying on human labor often introduces annotation errors, including mislabeling, omissions, and inconsistency between annotators. In the case of remote sensing, differences in procurement time can lead to misaligned ground truth annotations. These label errors are not independently distributed, and instead usually appear in spatially connected regions where adjacent pixels are more likely to share the same errors. To address these issues, we propose an approximate Bayesian estimation based on a probabilistic model that assumes training data includes label errors, incorporating the tendency for these errors to occur with spatial correlations between adjacent pixels. Bayesian inference requires computing the posterior distribution of label errors, which becomes intractable when spatial correlations are present. We represent the correlation of label errors between adjacent pixels through a Gaussian distribution whose covariance is structured by a Kac-Murdock-Szegö (KMS) matrix, solving the computational challenges. Through experiments on multiple segmentation tasks, we confirm that leveraging the spatial correlation of label errors significantly improves performance. Notably, in specific tasks such as lung segmentation, the proposed method achieves performance comparable to training with clean labels under moderate noise levels. Code is available at https://github.com/pfnet-research/Bayesian_SpatialCorr.
Chinese Summary: 该研究提出了一种基于概率模型的贝叶斯估计方法,用于处理语义分割中空间相关的标注误差,在医学影像和遥感等任务中显著提升了模型性能。
English Summary: The study introduces a Bayesian estimation method using a probabilistic model that accounts for spatially correlated label errors in semantic segmentation, significantly enhancing model performance across tasks like medical imaging and remote sensing.
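The Kac-Murdock-Szego covariance mentioned above has a simple closed form, Sigma[i, j] = sigma^2 * rho^|i - j|. The sketch below builds it for a 1-D run of pixels so the geometric decay of the correlation between neighbouring labels is visible; the parameter values are illustrative, not the paper's.

```python
import numpy as np

def kms_covariance(n: int, rho: float, sigma: float = 1.0) -> np.ndarray:
    """Kac-Murdock-Szego covariance: Sigma[i, j] = sigma^2 * rho^|i - j|.

    This first-order autoregressive structure lets adjacent pixels share
    correlated label errors while keeping the covariance tractable
    (its inverse is tridiagonal).
    """
    idx = np.arange(n)
    return sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])

cov = kms_covariance(n=5, rho=0.7)
print(np.round(cov, 3))
# Neighbouring pixels (|i - j| = 1) have correlation 0.7, decaying geometrically.
```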

Authors:Wenhui Zhu, Peijie Qiu, Xiwen Chen, Zhangsihao Yang, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang
Title: How Effective Can Dropout Be in Multiple Instance Learning?
Abstract:
Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. Despite this, the most commonly used technique (i.e., dropout) for mitigating this issue has yet to be explored in MIL. In this paper, we empirically explore how effective dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization even under noise attack. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with a negligible computational cost. The code is available at https://github.com/ChongQingNoSubway/MILDropout.
Chinese: 本文提出MIL-Dropout新方法,通过选择性丢弃包内最重要的实例来提升多示例学习性能,在噪声环境下仍能增强泛化能力且计算成本可忽略。
English: This paper introduces MIL-Dropout, a novel method that improves Multiple Instance Learning (MIL) performance by selectively dropping top-k important instances, enhancing generalization even under noise with minimal computational cost.
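The key observation, dropping the top-k most important instances in a bag, can be sketched with attention scores as the importance proxy. How MIL-Dropout ranks and selects instances in practice is more systematic than this plain top-k illustration.

```python
import torch

def drop_topk_instances(bag_feats: torch.Tensor, importance: torch.Tensor, k: int):
    """Zero out the k instances with the highest importance scores in a bag.

    bag_feats: (N, D) instance features; importance: (N,) scores, e.g. attention
    weights from the MIL aggregator. A sketch of the top-k dropping idea only.
    """
    drop_idx = torch.topk(importance, k=min(k, importance.numel())).indices
    mask = torch.ones(bag_feats.shape[0], 1, device=bag_feats.device)
    mask[drop_idx] = 0.0
    return bag_feats * mask

feats = torch.randn(50, 256)                          # a bag of 50 instance embeddings
scores = torch.rand(50)                               # hypothetical attention scores
print(drop_topk_instances(feats, scores, k=5).shape)  # torch.Size([50, 256])
```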

Authors:Bowei Zhang, Lei Ke, Adam W. Harley, Katerina Fragkiadaki
Title: TAPIP3D: Tracking Any Point in Persistent 3D Geometry
Abstract:
We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines multi-frame motion estimates, enabling robust point tracking over long time horizons. To handle the irregular structure of 3D point distributions, we propose a 3D Neighborhood-to-Neighborhood (N2N) attention mechanism - a 3D-aware contextualization strategy that builds informative, spatially coherent feature neighborhoods to support precise trajectory estimation. Our 3D-centric formulation significantly improves performance over existing 3D point tracking methods and even surpasses state-of-the-art 2D pixel trackers in accuracy when reliable depth is available. The model supports inference in both camera-centric (unstabilized) and world-centric (stabilized) coordinates, with experiments showing that compensating for camera motion leads to substantial gains in tracking robustness. By replacing the conventional 2D square correlation windows used in prior 2D and 3D trackers with a spatially grounded 3D attention mechanism, TAPIP3D achieves strong and consistent results across multiple 3D point tracking benchmarks. Project Page: https://tapip3d.github.io
中文: TAPIP3D提出了一种新颖的3D点跟踪方法,通过深度和相机运动信息将视频特征稳定至三维空间,并采用3D邻域注意力机制实现长期鲁棒跟踪,在深度数据可靠时其性能超越现有3D与2D跟踪器。
English: TAPIP3D introduces a novel 3D point tracking method that stabilizes video features in 3D space using depth and camera motion, employing a 3D N2N attention mechanism to achieve robust long-term tracking and surpass both 3D and 2D trackers in performance when depth data is reliable.
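A condensed sketch of the camera-stabilization step described above: pixels are unprojected with depth and mapped into a shared world frame so that camera motion is cancelled before tracking. Intrinsics `K` and per-frame camera-to-world poses are assumed given; this is not the authors' implementation:
```python
import torch

def lift_to_world(uv: torch.Tensor, depth: torch.Tensor, K: torch.Tensor,
                  cam_to_world: torch.Tensor) -> torch.Tensor:
    """Unproject pixels to camera space and map them into a shared world frame.

    uv:            (N, 2) pixel coordinates
    depth:         (N,)   metric depth per pixel
    K:             (3, 3) camera intrinsics
    cam_to_world:  (4, 4) camera pose for this frame
    """
    ones = torch.ones(uv.shape[0], 1, dtype=uv.dtype)
    pix_h = torch.cat([uv, ones], dim=1)                    # homogeneous pixels
    rays = (torch.linalg.inv(K) @ pix_h.T).T                # back-projected rays
    pts_cam = rays * depth.unsqueeze(-1)                    # points in the camera frame
    pts_h = torch.cat([pts_cam, ones], dim=1)
    pts_world = (cam_to_world @ pts_h.T).T[:, :3]           # camera motion cancelled
    return pts_world
```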

Authors:Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang
Title: Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Abstract:
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, hence form an impossible language for LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: https://DDT-LLaMA.github.io/.
Chinese: 本文提出了一种基于扩散时间步的递归视觉标记方法,构建了统一的多模态理解与生成框架,在两项任务上均实现了优于现有模型的性能表现。
English: This paper introduces a novel approach to Multimodal Large Language Models by developing recursive visual tokens based on diffusion timesteps, enabling unified visual comprehension and generation that outperforms existing methods.

Authors:Yeoreum Lee, Jinwook Jung, Sungyong Baik
Title: Mitigating Parameter Interference in Model Merging via Sharpness-Aware Fine-Tuning
Abstract:
Large-scale deep learning models with a pretraining-finetuning paradigm have led to a surge of numerous task-specific models fine-tuned from a common pre-trained model. Recently, several research efforts have been made on merging these large models into a single multi-task model, particularly with simple arithmetic on parameters. Such merging methodology faces a central challenge: interference between model parameters fine-tuned on different tasks. Few recent works have focused on designing a new fine-tuning scheme that can lead to small parameter interference, however at the cost of the performance of each task-specific fine-tuned model and thereby limiting that of a merged model. To improve the performance of a merged model, we note that a fine-tuning scheme should aim for (1) smaller parameter interference and (2) better performance of each fine-tuned model on the corresponding task. In this work, we aim to design a new fine-tuning objective function to work towards these two goals. In the course of this process, we find such objective function to be strikingly similar to sharpness-aware minimization (SAM) objective function, which aims to achieve generalization by finding flat minima. Drawing upon our observation, we propose to fine-tune pre-trained models via sharpness-aware minimization. The experimental and theoretical results showcase the effectiveness and orthogonality of our proposed approach, improving performance upon various merging and fine-tuning methods. Our code is available at https://github.com/baiklab/SAFT-Merge.
Chinese Summary: 本研究提出了一种利用锐度感知最小化的新微调方法,通过减少参数干扰和提升各任务模型的性能,有效增强了合并多任务模型的整体表现。
English Summary: This research introduces a new fine-tuning method using sharpness-aware minimization to enhance the performance of merged multi-task models by reducing parameter interference and improving task-specific model efficacy.
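For reference, a generic sharpness-aware minimization step of the kind the paper builds on (perturb the weights along the gradient, take the descent gradient at the perturbed point, then update the original weights); this is textbook SAM, not the authors' exact merging-aware recipe:
```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho: float = 0.05):
    """One generic sharpness-aware minimization (SAM) step; hyperparameters are illustrative."""
    inputs, targets = batch
    optimizer.zero_grad()

    # 1) ascent: move each weight a small step along its gradient direction
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    perturbs = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)
            perturbs[p] = e
    optimizer.zero_grad()

    # 2) descent: gradient at the perturbed weights, applied after restoring the originals
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in perturbs.items():
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```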

Authors:Zhenkui Yang, Zeyi Huang, Ge Wang, Han Ding, Tony Xiao Han, Fei Wang
Title: Talk is Not Always Cheap: Promoting Wireless Sensing Models with Text Prompts
Abstract:
Wireless signal-based human sensing technologies, such as WiFi, millimeter-wave (mmWave) radar, and Radio Frequency Identification (RFID), enable the detection and interpretation of human presence, posture, and activities, thereby providing critical support for applications in public security, healthcare, and smart environments. These technologies exhibit notable advantages due to their non-contact operation and environmental adaptability; however, existing systems often fail to leverage the textual information inherent in datasets. To address this, we propose an innovative text-enhanced wireless sensing framework, WiTalk, that seamlessly integrates semantic knowledge through three hierarchical prompt strategies-label-only, brief description, and detailed action description-without requiring architectural modifications or incurring additional data costs. We rigorously validate this framework across three public benchmark datasets: XRF55 for human action recognition (HAR), and WiFiTAL and XRFV2 for WiFi temporal action localization (TAL). Experimental results demonstrate significant performance improvements: on XRF55, accuracy for WiFi, RFID, and mmWave increases by 3.9%, 2.59%, and 0.46%, respectively; on WiFiTAL, the average performance of WiFiTAD improves by 4.98%; and on XRFV2, the mean average precision gains across various methods range from 4.02% to 13.68%. Our codes have been included in https://github.com/yangzhenkui/WiTalk.
中文:提出的WiTalk框架通过集成语义文本提示来增强无线人体感知,无需修改架构或增加数据成本,即在多个数据集上显著提升了识别准确率。
English: The proposed WiTalk framework enhances wireless human sensing by integrating semantic text prompts, significantly improving accuracy across multiple datasets without architectural changes or extra data costs.
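A toy sketch of the three prompt granularities and a simple text-alignment loss; the prompt wording and the cosine objective are illustrative assumptions, not the paper's exact templates or training loss:
```python
import torch
import torch.nn.functional as F

# Three prompt granularities for one action class; the wording is illustrative.
def build_prompts(label: str, brief: str, detail: str) -> dict:
    return {
        "label_only": label,
        "brief": f"a person is {brief}",
        "detailed": f"a person is {brief}; {detail}",
    }

def text_alignment_loss(signal_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """Pull the wireless-sensing feature toward the text embedding of its class prompt."""
    return 1.0 - F.cosine_similarity(signal_feat, text_feat, dim=-1).mean()

prompts = build_prompts(
    label="wave",
    brief="waving one hand",
    detail="the arm is raised and swings from side to side repeatedly",
)
```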

Authors:Siyi Jiao, Wenzheng Zeng, Yerong Li, Huayu Zhang, Changxin Gao, Nong Sang, Mike Zheng Shou
Title: MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation
Abstract:
Human instance matting aims to estimate an alpha matte for each human instance in an image, which is challenging as it easily fails in complex cases requiring disentangling mingled pixels belonging to multiple instances along hairy and thin boundary structures. In this work, we address this by introducing MP-Mat, a novel 3D-and-instance-aware matting framework with multiplane representation, where the multiplane concept is designed from two different perspectives: scene geometry level and instance level. Specifically, we first build feature-level multiplane representations to split the scene into multiple planes based on depth differences. This approach makes the scene representation 3D-aware, and can serve as an effective clue for splitting instances in different 3D positions, thereby improving interpretability and boundary handling ability especially in occlusion areas. Then, we introduce another multiplane representation that splits the scene in an instance-level perspective, and represents each instance with both matte and color. We also treat background as a special instance, which is often overlooked by existing methods. Such an instance-level representation facilitates both foreground and background content awareness, and is useful for other down-stream tasks like image editing. Once built, the representation can be reused to realize controllable instance-level image editing with high efficiency. Extensive experiments validate the clear advantage of MP-Mat in matting task. We also demonstrate its superiority in image editing tasks, an area under-explored by existing matting-focused methods, where our approach under zero-shot inference even outperforms trained specialized image editing techniques by large margins. Code is open-sourced at https://github.com/JiaoSiyi/MPMat.git.
中文: MP-Mat通过结合三维感知和实例级别的多重平面表示框架,有效提升了复杂场景下人体实例抠图的精确度,在抠图和图像编辑任务中均展现出卓越性能。
English: MP-Mat introduces a dual multiplane framework combining 3D-aware and instance-level representations to improve human instance matting accuracy in complex scenarios, demonstrating superior performance in both matting and image editing tasks.
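A simplified stand-in for the geometry-level multiplane idea: splitting a depth map into plane masks by depth quantiles. The actual method builds these planes at the feature level; this sketch only illustrates the depth-based partitioning:
```python
import torch

def depth_planes(depth: torch.Tensor, num_planes: int = 4) -> torch.Tensor:
    """Split a depth map (H, W) into binary plane masks using depth quantiles."""
    edges = torch.quantile(depth.flatten(),
                           torch.linspace(0, 1, num_planes + 1, dtype=depth.dtype))
    masks = []
    for i in range(num_planes):
        lo, hi = edges[i], edges[i + 1]
        if i == num_planes - 1:
            m = (depth >= lo) & (depth <= hi)   # include the farthest pixels in the last plane
        else:
            m = (depth >= lo) & (depth < hi)
        masks.append(m)
    return torch.stack(masks).float()           # (num_planes, H, W)
```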

Authors:Zheng Chen, Jingkai Wang, Kai Liu, Jue Gong, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Jianxing Zhang, Jinlong Wu, Jun Wang, Zheng Xie, Hakjae Jeon, Suejin Han, Hyung-Ju Chun, Hyunhee Park, Zhicun Yin, Junjie Chen, Ming Liu, Xiaoming Li, Chao Zhou, Wangmeng Zuo, Weixia Zhang, Dingquan Li, Kede Ma, Yun Zhang, Zhuofan Zheng, Yuyue Liu, Shizhen Tang, Zihao Zhang, Yi Ning, Hao Jiang, Wenjie An, Kangmeng Yu, Chenyang Wang, Kui Jiang, Xianming Liu, Junjun Jiang, Yingfu Zhang, Gang He, Siqi Wang, Kepeng Xu, Zhenyang Liu, Changxin Zhou, Shanlan Shen, Yubo Duan, Yiang Chen, Jin Guo, Mengru Yang, Jen-Wei Lee, Chia-Ming Lee, Chih-Chung Hsu, Hu Peng, Chunming He
Title: NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results
Abstract:
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
中文: 本文综述了NTIRE 2025真实人脸复原挑战赛,重点介绍了提升感知质量和身份一致性的解决方案,共有13支团队提交模型,其中10支在最终排名中取得有效成绩。
English: This paper reviews the NTIRE 2025 challenge on real-world face restoration, focusing on solutions that enhance perceptual quality and identity consistency, with 13 teams submitting models and 10 achieving valid rankings.

Authors:Wenke Xia, Ruoxuan Feng, Dong Wang, Di Hu
Title: Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Abstract:
Building a generalizable self-correction system is crucial for robots to recover from failures. Despite advancements in Multimodal Large Language Models (MLLMs) that empower robots with semantic reflection ability for failure, translating semantic reflection into how to correct fine-grained robotic actions remains a significant challenge. To address this gap, we build the Phoenix framework, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction. In this motion-based self-reflection framework, we start with a dual-process motion adjustment mechanism with MLLMs to translate the semantic reflection into coarse-grained motion instruction adjustment. To leverage this motion instruction for guiding how to correct fine-grained robotic actions, a multi-task motion-conditioned diffusion policy is proposed to integrate visual observations for high-frequency robotic action correction. By combining these two models, we could shift the demand for generalization capability from the low-level manipulation policy to the MLLMs-driven motion adjustment model and facilitate precise, fine-grained robotic action correction. Utilizing this framework, we further develop a lifelong learning method to automatically improve the model's capability from interactions with dynamic environments. The experiments conducted in both the RoboMimic simulation and real-world scenarios prove the superior generalization and robustness of our framework across a variety of manipulation tasks. Our code is released at https://github.com/GeWu-Lab/Motion-based-Self-Reflection-Framework.
中文: Phoenix框架通过运动指令连接语义反思与机器人动作修正,采用双流程调整机制和扩散策略实现精确控制,并通过终身学习提升在动态环境中的适应能力。
English: The Phoenix framework bridges semantic reflection and robotic action correction through motion instruction, enabling precise adjustments via a dual-process mechanism and diffusion policy, with lifelong learning enhancing its adaptability in diverse environments.

Authors:Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Zahra Moammeri, Ahmad Mahmoudi-Aznaveh, Ali Karbasi, Hossein Motamednia, Liangyan Li, Guanhua Zhao, Kevin Le, Yimo Ning, Haoxuan Huang, Jun Chen
Title: NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results
Abstract:
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
中文摘要:NTIRE 2025图像超分辨率挑战赛旨在通过双赛道设计(恢复赛道与感知赛道)推动超分辨率技术发展,共有25支团队提交有效方案,该竞赛成果将成为领域重要基准。
English Summary: The NTIRE 2025 image super-resolution challenge at CVPR 2025 seeks advanced solutions for reconstructing high-resolution images from downsampled inputs, featuring dual evaluation tracks for restoration accuracy and perceptual quality with 25 team submissions.

Authors:Liang Peng, Boxi Wu, Haoran Cheng, Yibo Zhao, Xiaofei He
Title: SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization
Abstract:
Previous text-to-image diffusion models typically employ supervised fine-tuning (SFT) to enhance pre-trained base models. However, this approach primarily minimizes the loss of mean squared error (MSE) at the pixel level, neglecting the need for global optimization at the image level, which is crucial for achieving high perceptual quality and structural coherence. In this paper, we introduce Self-sUpervised Direct preference Optimization (SUDO), a novel paradigm that optimizes both fine-grained details at the pixel level and global image quality. By integrating direct preference optimization into the model, SUDO generates preference image pairs in a self-supervised manner, enabling the model to prioritize global-level learning while complementing the pixel-level MSE loss. As an effective alternative to supervised fine-tuning, SUDO can be seamlessly applied to any text-to-image diffusion model. Importantly, it eliminates the need for costly data collection and annotation efforts typically associated with traditional direct preference optimization methods. Through extensive experiments on widely-used models, including Stable Diffusion 1.5 and XL, we demonstrate that SUDO significantly enhances both global and local image quality. The codes are provided at https://github.com/SPengLiang/SUDO.
中文摘要:本文提出SUDO方法,通过自监督方式优化文本到图像扩散模型的像素级细节和全局图像质量,无需昂贵数据标注即可显著提升图像结构连贯性。
English Summary: This paper introduces SUDO, a self-supervised method that optimizes both pixel-level details and global image quality in text-to-image diffusion models, eliminating the need for costly data annotation while significantly enhancing image coherence.

Authors:Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, Feng Guo
Title: Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding
Abstract:
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions, failing to adequately assess VLLMs' performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs' capabilities in perception and reasoning. Experiments on 14 SOTA VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose VLLMs and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We released the benchmark toolbox and the fine-tuned model at: https://github.com/tong-zeng/DVBench.git.
Chinese Summary: 视觉大语言模型在通用视觉任务中表现出色,但在自动驾驶等安全关键领域的应用仍存局限;新推出的DVBench基准测试揭示了现有模型在复杂驾驶场景理解上的不足,并通过领域微调显著提升了模型性能,为开发符合实际安全要求的视觉大语言模型提供了重要评估框架。
English Summary: Vision Large Language Models (VLLMs) show strong performance in general visual tasks but struggle with safety-critical autonomous driving scenarios, as demonstrated by the new DVBench benchmark which revealed significant performance gaps and the need for domain-specific fine-tuning to improve their applicability.

Authors:Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, Daniel Cremers
Title: Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
Abstract:
Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.

Authors:Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu
Title: DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
Abstract:
In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512*512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.

Authors:Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, Lei Zhu
Title: Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
Abstract:
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster for inference, making high-resolution video generation more scalable and practical for real-world applications.
中文:Turbo2K通过知识蒸馏和分层合成框架,在保持丰富细节的同时实现高效2K视频生成,推理速度提升高达20倍,大幅降低了计算成本。
English: Turbo2K is an efficient framework that enables high-quality 2K video generation through knowledge distillation and a hierarchical synthesis approach, achieving up to 20x faster inference while maintaining visual detail.
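A minimal sketch of cross-architecture feature distillation of the kind described (student DiT features projected and matched to teacher features); the linear projection and MSE objective are illustrative assumptions, not the paper's exact loss:
```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    """Match student DiT features to teacher features through a learned projection."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # student_feat: (B, N, student_dim), teacher_feat: (B, N, teacher_dim)
        return nn.functional.mse_loss(self.proj(student_feat), teacher_feat.detach())
```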

Authors:Mingya Zhang, Liang Wang, Limei Gu, Tingsheng Ling, Xianping Tao
Title: WT-BCP: Wavelet Transform based Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation
Abstract:
Semi-supervised medical image segmentation (SSMIS) shows promise in reducing reliance on scarce labeled medical data. However, the SSMIS field confronts challenges such as distribution mismatches between labeled and unlabeled data, artificial perturbations causing training biases, and inadequate use of raw image information, especially low-frequency (LF) and high-frequency (HF) components. To address these challenges, we propose a Wavelet Transform based Bidirectional Copy-Paste SSMIS framework, named WT-BCP, which improves upon the Mean Teacher approach. Our method enhances unlabeled data understanding by copying random crops between labeled and unlabeled images and employs WT to extract LF and HF details. We propose a multi-input and multi-output model named XNet-Plus, to receive the fused information after WT. Moreover, consistency training among multiple outputs helps to mitigate learning biases introduced by artificial perturbations. During consistency training, the mixed images resulting from WT are fed into both models, with the student model's output being supervised by pseudo-labels and ground-truth. Extensive experiments conducted on 2D and 3D datasets confirm the effectiveness of our model. Code: https://github.com/simzhangbest/WT-BCP.
中文:WT-BCP框架通过结合小波变换提取频率细节和双向复制粘贴提升数据利用,有效解决了半监督医学图像分割中的分布不匹配和训练偏差问题。
English: The WT-BCP framework enhances semi-supervised medical image segmentation by integrating wavelet transform for frequency detail extraction and bidirectional copy-paste to improve data utilization, effectively addressing distribution mismatches and training biases.
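A rough sketch of the two ingredients named above: a 2D wavelet split into low/high-frequency parts (using PyWavelets, assumed installed) and a bidirectional copy-paste of random crops between a labeled and an unlabeled image. The real framework applies these inside a Mean Teacher pipeline:
```python
import numpy as np
import pywt

def wavelet_split(img: np.ndarray):
    """2D DWT into a low-frequency approximation and high-frequency detail bands."""
    lf, (lh, hl, hh) = pywt.dwt2(img, "haar")
    return lf, (lh, hl, hh)

def bidirectional_copy_paste(labeled: np.ndarray, unlabeled: np.ndarray, crop: int = 64):
    """Swap a random square crop between a labeled and an unlabeled image (simplified)."""
    h, w = labeled.shape[:2]
    y, x = np.random.randint(0, h - crop), np.random.randint(0, w - crop)
    mixed_l, mixed_u = labeled.copy(), unlabeled.copy()
    mixed_l[y:y + crop, x:x + crop] = unlabeled[y:y + crop, x:x + crop]
    mixed_u[y:y + crop, x:x + crop] = labeled[y:y + crop, x:x + crop]
    return mixed_l, mixed_u
```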

Authors:Chuhao Liu, Zhijian Qiao, Jieqi Shi, Ke Wang, Peize Liu, Shaojie Shen
Title: SG-Reg: Generalizable and Efficient Scene Graph Registration
Abstract:
This paper addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against a remote agent, or against a prior map. The hand-crafted descriptors in classical semantic-aided registration, or the ground-truth annotation reliance in learning-based scene graph registration, impede their application in practical real-world environments. To address the challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open-set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, we employ a robust pose estimator to decide transformation according to the correspondences. We manage to maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method in a two-agent SLAM benchmark. It significantly outperforms the hand-crafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame. Code available at: http://github.com/HKUST-Aerial-Robotics/SG-Reg.
中文: 本文提出一种新型场景图网络,通过融合多模态语义特征实现刚性语义场景图的高效鲁棒配准,在显著提升配准成功率的同时,仅需极少的GPU资源和通信带宽。
English: This paper introduces a novel scene graph network that fuses multimodal semantic features for efficient and robust registration of rigid semantic scene graphs, significantly outperforming existing methods in success rate while requiring minimal GPU resources and communication bandwidth.

Authors:Qiang Chen, Xiao Wang, Haowen Wang, Bo Jiang, Lin Zhu, Dawei Zhang, Yonghong Tian, Jin Tang
Title: Adversarial Attack for RGB-Event based Visual Object Tracking
Abstract:
Visual object tracking is a crucial research topic in the fields of computer vision and multi-modal fusion. Among various approaches, robust visual tracking that combines RGB frames with Event streams has attracted increasing attention from researchers. While striving for high accuracy and efficiency in tracking, it is also important to explore how to effectively conduct adversarial attacks and defenses on RGB-Event stream tracking algorithms, yet research in this area remains relatively scarce. To bridge this gap, in this paper, we propose a cross-modal adversarial attack algorithm for RGB-Event visual tracking. Because of the diverse representations of Event streams, and given that Event voxels and frames are more commonly used, this paper will focus on these two representations for an in-depth study. Specifically, for the RGB-Event voxel, we first optimize the perturbation by adversarial loss to generate RGB frame adversarial examples. For discrete Event voxel representations, we propose a two-step attack strategy, more in detail, we first inject Event voxels into the target region as initialized adversarial examples, then, conduct a gradient-guided optimization by perturbing the spatial location of the Event voxels. For the RGB-Event frame based tracking, we optimize the cross-modal universal perturbation by integrating the gradient information from multimodal data. We evaluate the proposed approach against attacks on three widely used RGB-Event Tracking datasets, i.e., COESOT, FE108, and VisEvent. Extensive experiments show that our method significantly reduces the performance of the tracker across numerous datasets in both unimodal and multimodal scenarios. The source code will be released on https://github.com/Event-AHU/Adversarial_Attack_Defense
中文: 本文提出了一种针对RGB-Event视觉跟踪的跨模态对抗攻击算法,通过对事件体素和帧表示分别采用扰动优化策略生成对抗样本,在多数据集实验中显著降低了跟踪器的性能。
English: This paper introduces a cross-modal adversarial attack algorithm targeting RGB-Event visual tracking, which effectively degrades tracker performance by generating adversarial examples for both RGB-Event voxel and frame representations through optimized perturbation strategies.
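For the RGB-frame side, the perturbation optimization resembles a standard PGD loop; the sketch below assumes a differentiable loss that measures tracking quality and omits the event-voxel injection and position perturbation described in the abstract:
```python
import torch

def pgd_attack_rgb(frames: torch.Tensor, loss_fn, steps: int = 10,
                   eps: float = 8 / 255, alpha: float = 2 / 255) -> torch.Tensor:
    """Generic PGD perturbation of RGB frames against a differentiable tracking loss."""
    adv = frames.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)                       # e.g. negative tracking confidence
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            adv = frames + (adv - frames).clamp(-eps, eps)   # project back into the eps-ball
            adv = adv.clamp(0.0, 1.0)
    return adv.detach()
```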

Authors:Minho Park, Taewoong Kang, Jooyeol Yun, Sungwon Hwang, Jaegul Choo
Title: SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation
Abstract:
The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications. The code is available here. https://github.com/pmh9960/SphereDiff
中文: SphereDiff提出了一种新颖的球形潜在表示和采样方法,无需额外调优即可直接使用预训练扩散模型生成高质量360度全景图像和视频,有效解决了等距柱状投影带来的失真问题。
English: SphereDiff introduces a novel spherical latent representation and sampling method that enables high-fidelity 360-degree panoramic image and video generation using pretrained diffusion models without additional tuning, effectively mitigating distortions from equirectangular projection.
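One simple way to place latents nearly uniformly on the sphere, avoiding the pole oversampling of equirectangular grids, is a Fibonacci lattice; this is an illustrative sampling sketch, not necessarily the paper's exact scheme:
```python
import torch

def fibonacci_sphere(n: int) -> torch.Tensor:
    """Nearly uniform directions on the unit sphere via golden-angle spacing."""
    i = torch.arange(n, dtype=torch.float32)
    phi = torch.pi * (3.0 - 5.0 ** 0.5) * i                  # golden-angle increments
    z = 1.0 - 2.0 * (i + 0.5) / n                            # evenly spaced heights
    r = torch.sqrt(torch.clamp(1.0 - z * z, min=0.0))
    return torch.stack([r * torch.cos(phi), r * torch.sin(phi), z], dim=-1)  # (n, 3)
```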

Authors:Jindong Li, Yongguang Li, Yali Fu, Jiahong Liu, Yixin Liu, Menglin Yang, Irwin King
Title: CLIP-Powered Domain Generalization and Domain Adaptation: A Comprehensive Survey
Abstract:
As machine learning evolves, domain generalization (DG) and domain adaptation (DA) have become crucial for enhancing model robustness across diverse environments. Contrastive Language-Image Pretraining (CLIP) plays a significant role in these tasks, offering powerful zero-shot capabilities that allow models to perform effectively in unseen domains. However, there remains a significant gap in the literature, as no comprehensive survey currently exists that systematically explores the applications of CLIP in DG and DA, highlighting the necessity for this review. This survey presents a comprehensive review of CLIP's applications in DG and DA. In DG, we categorize methods into optimizing prompt learning for task alignment and leveraging CLIP as a backbone for effective feature extraction, both enhancing model adaptability. For DA, we examine both source-available methods utilizing labeled source data and source-free approaches primarily based on target domain data, emphasizing knowledge transfer mechanisms and strategies for improved performance across diverse contexts. Key challenges, including overfitting, domain diversity, and computational efficiency, are addressed, alongside future research opportunities to advance robustness and efficiency in practical applications. By synthesizing existing literature and pinpointing critical gaps, this survey provides valuable insights for researchers and practitioners, proposing directions for effectively leveraging CLIP to enhance methodologies in domain generalization and adaptation. Ultimately, this work aims to foster innovation and collaboration in the quest for more resilient machine learning models that can perform reliably across diverse real-world scenarios. A more up-to-date version of the papers is maintained at: https://github.com/jindongli-Ai/Survey_on_CLIP-Powered_Domain_Generalization_and_Adaptation.
中文: 本综述系统梳理了CLIP在领域泛化与自适应中的应用方法,通过分类讨论优化策略并应对过拟合、领域差异等关键挑战,旨在提升模型在多样化场景中的鲁棒性能。
English: This survey comprehensively reviews CLIP's applications in domain generalization and adaptation, categorizing methods and addressing challenges like overfitting and domain diversity to enhance model robustness across diverse environments.
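For context, the zero-shot mechanism that most surveyed DG/DA methods build on is CLIP's image-text similarity; a standard usage sketch with the open-source CLIP package is shown below (the image path and class names are placeholders):
```python
import torch
import clip                     # https://github.com/openai/CLIP, assumed installed
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)   # placeholder image
texts = clip.tokenize([f"a photo of a {c}" for c in ["dog", "cat", "car"]]).to(device)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(texts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)   # zero-shot class probabilities
```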

Authors:Liu Xiao, Li Zhiyuan, Lin Yueyu
Title: Cross-attention for State-based model RWKV-7
Abstract:
We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, utilizing a generalized delta rule with vector-valued gating and low-rank adaptations (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV's non-diagonal, input-dependent transition matrix enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks like $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Frechet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256x256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state manipulation. Code at https://github.com/TorchRWKV/flash-linear-attention
Chinese: CrossWKV是RWKV-7模型中的一种新型交叉注意力机制,通过单次处理整合文本和图像模态,以线性复杂度实现卓越的跨模态对齐,在文本到图像生成中达到领先性能。
English: CrossWKV is a novel cross-attention mechanism for the RWKV-7 model that enhances text-to-image generation by integrating text and image modalities in a single pass, achieving state-of-the-art performance with superior cross-modal alignment and linear complexity.

Authors:Jie Wang, Nana Yu, Zihao Zhang, Yahong Han
Title: Visual Consensus Prompting for Co-Salient Object Detection
Abstract:
Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task, addressing two key limitations. It introduces, for the first time, a parameter-efficient prompt tuning paradigm and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with minimized tunable parameters. Concretely, the primary insight of the purposeful Consensus Prompt Generator (CPG) is to enforce limited tunable parameters to focus on co-salient representations and generate consensus prompts. The formulated Consensus Prompt Disperser (CPD) leverages consensus prompts to form task-specific visual consensus prompts, thereby arousing the powerful potential of pre-trained models in addressing CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving the new state of the art (with 6.8% improvement in F_m metrics on the most challenging CoCA dataset). Source code has been available at https://github.com/WJ-CV/VCP.
Chinese: 现有协同显著目标检测方法存在共识提取滞后和参数效率低下的问题,因此本文提出了一种视觉共识提示(VCP)架构,通过参数高效的提示调优将共识嵌入提示,在冻结基础模型的同时显著提升了检测性能。
English: Current CoSOD methods face inefficiencies due to delayed consensus guidance and parameter-heavy fine-tuning, leading to the proposal of a novel Visual Consensus Prompts (VCP) architecture that integrates parameter-efficient prompt tuning to enhance performance with minimal updates.
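A generic visual-prompt-tuning sketch showing the parameter-efficient principle (learnable prompt tokens prepended to a frozen backbone's token sequence); the consensus-specific CPG/CPD modules are not reproduced here, and the backbone interface is an assumption:
```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Prepend learnable prompt tokens to a frozen backbone's token sequence."""
    def __init__(self, backbone: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)              # foundation model stays frozen
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, embed_dim) patch embeddings; backbone is assumed to accept a token sequence
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, tokens], dim=1))
```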

Authors:Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe
Title: Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation
Abstract:
Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing model size, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models. Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To fuse the intrinsic degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed for enhancing spatial-aware local-global interactions and enriching the restoration details from the frequency perspective. Extensive benchmarking in the all-in-one restoration setting confirms AnyIR's SOTA performance, reducing model complexity by around 82% in parameters and 85% in FLOPs. Our code will be available at our Project page (https://amazingren.github.io/AnyIR/)
中文:AnyIR提出了一种统一的图像修复模型,通过联合嵌入和空频并行融合策略利用退化相似性,在不依赖大型语言模型的情况下实现了最先进的性能,同时将参数量减少约82%、计算量降低85%。
English: AnyIR introduces a unified image restoration model that leverages degradation similarities through joint embeddings and spatial-frequency fusion, achieving state-of-the-art performance while reducing parameters by 82% and FLOPs by 85% without relying on large language models.
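A simplified reading of the spatial-frequency parallel fusion: a convolutional spatial branch in parallel with an FFT-domain branch, summed at the output. The layer choices are assumptions for illustration, not the authors' exact design:
```python
import torch
import torch.nn as nn

class SpatialFrequencyFusion(nn.Module):
    """Parallel spatial (conv) and frequency (FFT-domain) branches with additive fusion."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.freq = nn.Conv2d(2 * channels, 2 * channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = torch.cat([spec.real, spec.imag], dim=1)       # process real/imag jointly
        spec = self.freq(spec)
        real, imag = spec.chunk(2, dim=1)
        freq_out = torch.fft.irfft2(torch.complex(real, imag),
                                    s=x.shape[-2:], norm="ortho")
        return self.spatial(x) + freq_out
```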

Authors:Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Title: Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Abstract:
Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.
中文: 该摘要提出利用多模态大语言模型进行可解释的虚假图像检测,通过设计六种专用提示词构建新型框架,相比传统方法在鲁棒性和可解释性方面实现显著提升。
English: This abstract proposes using Multi-modal Large Language Models for explainable fake image detection, developing a framework with six specialized prompts to enhance robustness and transparency over traditional methods.

Authors:Wenbing Zhu, Lidong Wang, Ziqing Zhou, Chengjie Wang, Yurui Pan, Ruoyi Zhang, Zhuhao Chen, Linjie Cheng, Bin-Bin Gao, Jiangning Zhang, Zhenye Gan, Yuxie Wang, Yulong Chen, Shuguang Qian, Mingmin Chi, Bo Peng, Lizhuang Ma
Title: Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection
Abstract:
The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D3, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. Real-IAD D3 features finer defects, diverse anomalies, and greater scale across 20 categories, providing a challenging benchmark for multimodal IAD. Additionally, we introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality, enhancing detection performance. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance. The dataset and code are publicly accessible for research purposes at https://realiad4ad.github.io/Real-IAD D3
中文: Real-IAD D3数据集通过整合RGB图像、三维点云和伪三维深度信息,提出了一种高精度多模态工业异常检测方法,显著提升了20个类别中的检测鲁棒性和性能。
English: The Real-IAD D3 dataset introduces a high-precision multimodal approach for industrial anomaly detection, combining RGB, 3D point clouds, and pseudo-3D depth to enhance detection robustness and performance across 20 categories.
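The pseudo-3D modality comes from photometric stereo; the classic Lambertian least-squares recipe below shows how surface normals can be recovered from images under known light directions (a generic sketch, not the dataset's exact capture pipeline):
```python
import numpy as np

def photometric_stereo_normals(intensities: np.ndarray, lights: np.ndarray) -> np.ndarray:
    """Least-squares surface normals from grayscale images under known point lights.

    intensities: (M, H, W) images, lights: (M, 3) unit light directions
    """
    m, h, w = intensities.shape
    g = np.linalg.pinv(lights) @ intensities.reshape(m, -1)   # (3, H*W) albedo-scaled normals
    norm = np.linalg.norm(g, axis=0, keepdims=True) + 1e-8
    return (g / norm).reshape(3, h, w)                        # unit normal map
```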

Authors:Wei Dong, Han Zhou, Seyed Amirreza Mousavi, Jun Chen
Title: Retinex-guided Histogram Transformer for Mask-free Shadow Removal
Abstract:
While deep learning methods have achieved notable progress in shadow removal, many existing approaches rely on shadow masks that are difficult to obtain, limiting their generalization to real-world scenes. In this work, we propose ReHiT, an efficient mask-free shadow removal framework based on a hybrid CNN-Transformer architecture guided by Retinex theory. We first introduce a dual-branch pipeline to separately model reflectance and illumination components, and each is restored by our developed Illumination-Guided Hybrid CNN-Transformer (IG-HCT) module. Second, besides the CNN-based blocks that are capable of learning residual dense features and performing multi-scale semantic fusion, we develop the Illumination-Guided Histogram Transformer Block (IGHB) to effectively handle non-uniform illumination and spatially complex shadows. Extensive experiments on several benchmark datasets validate the effectiveness of our approach over existing mask-free methods. Trained solely on the NTIRE 2025 Shadow Removal Challenge dataset, our solution delivers competitive results with one of the smallest parameter sizes and fastest inference speeds among top-ranked entries, highlighting its applicability for real-world applications with limited computational resources. The code is available at https://github.com/dongw22/oath.
Chinese: 提出的ReHiT框架基于Retinex理论,采用混合CNN-Transformer架构实现无需阴影掩码的阴影去除方法,在基准数据集上以最小的计算资源取得了具有竞争力的性能。
English: The proposed ReHiT framework introduces a mask-free shadow removal method using a hybrid CNN-Transformer guided by Retinex theory, achieving efficient performance with minimal computational resources and competitive results on benchmark datasets.

Authors:Wei Dong, Yan Min, Han Zhou, Jun Chen
Title: Towards Scale-Aware Low-Light Enhancement via Structure-Guided Transformer Design
Abstract:
Current Low-light Image Enhancement (LLIE) techniques predominantly rely on either direct Low-Light (LL) to Normal-Light (NL) mappings or guidance from semantic features or illumination maps. Nonetheless, the intrinsic ill-posedness of LLIE and the difficulty in retrieving robust semantics from heavily corrupted images hinder their effectiveness in extremely low-light environments. To tackle this challenge, we present SG-LLIE, a new multi-scale CNN-Transformer hybrid framework guided by structure priors. Different from employing pre-trained models for the extraction of semantics or illumination maps, we choose to extract robust structure priors based on illumination-invariant edge detectors. Moreover, we develop a CNN-Transformer Hybrid Structure-Guided Feature Extractor (HSGFE) module at each scale within the UNet encoder-decoder architecture. Besides the CNN blocks which excel in multi-scale feature extraction and fusion, we introduce a Structure-Guided Transformer Block (SGTB) in each HSGFE that incorporates structural priors to modulate the enhancement process. Extensive experiments show that our method achieves state-of-the-art performance on several LLIE benchmarks in both quantitative metrics and visual quality. Our solution ranks second in the NTIRE 2025 Low-Light Enhancement Challenge. Code is released at https://github.com/minyan8/imagine.
Chinese: SG-LLIE提出了一种多尺度CNN-Transformer混合框架,通过基于光照不变边缘检测器提取稳健结构先验来突破现有低光照图像增强方法的局限,在多个基准测试中取得最优性能,并荣获NTIRE 2025低光照增强挑战赛第二名。
English: SG-LLIE introduces a multi-scale CNN-Transformer hybrid framework that leverages robust structure priors from illumination-invariant edge detection to overcome the limitations of existing low-light image enhancement methods, achieving state-of-the-art results on benchmarks and ranking second in the NTIRE 2025 challenge.

Authors:Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang
Title: LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Abstract:
Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
中文: 本研究提出了一种新颖的特征上采样方法,通过基于坐标的交叉注意力变换器和自蒸馏伪地面实况特征来提高视觉基础模型的分辨率,在像素级任务中显著优于现有技术。
English: This study introduces a novel feature upsampling method using a coordinate-based cross-attention transformer and self-distilled pseudo-groundtruth features to enhance the resolution of vision foundation models, significantly outperforming existing techniques in pixel-level tasks.
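A condensed sketch of the coordinate-based cross-attention idea: queries built from target-pixel coordinates attend to low-resolution VFM features. The high-resolution image branch and the training objective are omitted; shapes and layer choices are illustrative assumptions:
```python
import torch
import torch.nn as nn

class CoordCrossAttnUpsampler(nn.Module):
    """Coordinate queries attend to low-resolution features to produce per-pixel features."""
    def __init__(self, feat_dim: int, num_heads: int = 8):
        super().__init__()
        self.coord_proj = nn.Linear(2, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, coords: torch.Tensor, lr_feats: torch.Tensor) -> torch.Tensor:
        # coords:   (B, H*W, 2) normalized target-pixel coordinates
        # lr_feats: (B, N, feat_dim) flattened low-resolution features
        q = self.coord_proj(coords)
        out, _ = self.attn(q, lr_feats, lr_feats)
        return out                                  # (B, H*W, feat_dim) upsampled features
```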

Authors:Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
Title: Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a simple Multi-Layer Perceptron (MLP). This work investigates the feasibility and effectiveness of ingesting NeRFs into an MLLM. We introduce LLaNA, the first MLLM able to perform new tasks such as NeRF captioning and Q&A, by directly processing the weights of a NeRF's MLP. Notably, LLaNA is able to extract information about the represented objects without the need to render images or materialize 3D data structures. In addition, we build the first large-scale NeRF-language dataset, composed of more than 300K NeRFs trained on ShapeNet and Objaverse, with paired textual annotations that enable various NeRF-language tasks. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that directly processing NeRF weights leads to better performance on NeRF-Language tasks compared to approaches that rely on either 2D or 3D representations derived from NeRFs.
中文摘要:本文提出了首个能够直接处理神经辐射场权重的多模态大语言模型LLaNA,通过省略图像渲染或三维数据结构重建的步骤,实现了神经辐射场描述和问答等新任务,并展现出优越性能。
English Summary: This paper introduces LLaNA, the first Multimodal Large Language Model capable of directly processing Neural Radiance Field weights to perform novel tasks like NeRF captioning and Q&A, achieving superior performance by eliminating the need for rendered images or 3D data structures.

Authors:Mehmet Yamaç, Muhammad Numan Yousaf, Serkan Kiranyaz, Moncef Gabbouj
Title: Multiscale Tensor Summation Factorization as a New Neural Network Layer (MTS Layer) for Multidimensional Data Processing
Abstract:
Multilayer perceptrons (MLP), or fully connected artificial neural networks, are known for performing vector-matrix multiplications using learnable weight matrices; however, their practical application in many machine learning tasks, especially in computer vision, can be limited due to the high dimensionality of input-output pairs at each layer. To improve efficiency, convolutional operators have been utilized to facilitate weight sharing and local connections, yet they are constrained by limited receptive fields. In this paper, we introduce Multiscale Tensor Summation (MTS) Factorization, a novel neural network operator that implements tensor summation at multiple scales, where each tensor to be summed is obtained through Tucker-decomposition-like mode products. Unlike other tensor decomposition methods in the literature, MTS is not introduced as a network compression tool; instead, as a new backbone neural layer. MTS not only reduces the number of parameters required while enhancing the efficiency of weight optimization compared to traditional dense layers (i.e., unfactorized weight matrices in MLP layers), but it also demonstrates clear advantages over convolutional layers. The proof-of-concept experimental comparison of the proposed MTS networks with MLPs and Convolutional Neural Networks (CNNs) demonstrates their effectiveness across various tasks, such as classification, compression, and signal restoration. Additionally, when integrated with modern non-linear units such as the multi-head gate (MHG), also introduced in this study, the corresponding neural network, MTSNet, demonstrates a more favorable complexity-performance tradeoff compared to state-of-the-art transformers in various computer vision applications. The software implementation of the MTS layer and the corresponding MTS-based networks, MTSNets, is shared at https://github.com/mehmetyamac/MTSNet.
中文摘要:本文提出的多尺度张量求和分解作为一种新型神经网络层,在减少参数的同时,在多种计算机视觉任务中展现出优于传统全连接层和卷积网络的性能。
English summary: This paper introduces Multiscale Tensor Summation (MTS) Factorization, a novel neural network layer that reduces parameters while outperforming traditional MLPs and CNNs across various computer vision tasks.
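The building block is a Tucker-style sequence of mode products; the sketch below shows one such term with einsum, with an MTS-like layer summing several of these terms across scales (shapes are illustrative, not the library's API):
```python
import torch

def mode_products(x: torch.Tensor, u1: torch.Tensor, u2: torch.Tensor, u3: torch.Tensor):
    """Tucker-style mode products of a 3-way tensor with three factor matrices.

    x: (I, J, K); u1: (I2, I); u2: (J2, J); u3: (K2, K) -> output (I2, J2, K2)
    """
    y = torch.einsum("ijk,ai->ajk", x, u1)   # mode-1 product
    y = torch.einsum("ajk,bj->abk", y, u2)   # mode-2 product
    y = torch.einsum("abk,ck->abc", y, u3)   # mode-3 product
    return y
```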

Authors:Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgöz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, Fadime Sener
Title: Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
Abstract:
We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens - verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces token count by 22x over existing methods in representing one hour of long-term observations while effectively encoding fine-granularity of the present. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.

Authors:Yang Yue, Yulin Wang, Chenxin Tao, Pan Liu, Shiji Song, Gao Huang
Title: CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning
Abstract:
Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code & pre-trained models are available at https://github.com/LeapLabTHU/CheXWorld.
中文摘要:CheXWorld是首个针对放射影像的自监督世界模型,它能同时建模局部解剖结构、全局解剖布局和领域变化这三个关键医学知识维度,在多项医学图像任务中显著优于现有方法。
English Summary: CheXWorld is a self-supervised world model for radiographic images that captures three key aspects of medical knowledge—local anatomical structures, global anatomical layouts, and domain variations—demonstrating superior performance in medical image classification and segmentation tasks compared to existing methods.

Authors:Zhu Zhu, Shuo Jiang, Jingyuan Zheng, Yawen Li, Yifei Chen, Manli Zhao, Weizhong Gu, Feiwei Qin, Jinhu Wang, Gang Yu
Title: Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis
Abstract:
Neuroblastoma, an adrenal-derived tumor, is among the most common pediatric solid malignancies, characterized by significant clinical heterogeneity. Timely and accurate pathological diagnosis from hematoxylin and eosin-stained whole-slide images is critical for patient prognosis. However, current diagnostic practices primarily rely on subjective manual examination by pathologists, leading to inconsistent accuracy. Existing automated whole-slide image classification methods encounter challenges such as poor interpretability, limited feature extraction capabilities, and high computational costs, restricting their practical clinical deployment. To overcome these limitations, we propose CMSwinKAN, a contrastive-learning-based multi-scale feature fusion model tailored for pathological image classification, which enhances the Swin Transformer architecture by integrating a Kernel Activation Network within its multilayer perceptron and classification head modules, significantly improving both interpretability and accuracy. By fusing multi-scale features and leveraging contrastive learning strategies, CMSwinKAN mimics clinicians' comprehensive approach, effectively capturing global and local tissue characteristics. Additionally, we introduce a heuristic soft voting mechanism guided by clinical insights to bridge patch-level predictions to whole-slide image-level classifications seamlessly. We verified CMSwinKAN on the publicly available BreakHis dataset and the PpNTs dataset, which was established by our hospital. Results demonstrate that CMSwinKAN performs better than existing state-of-the-art pathology-specific models pre-trained on large datasets. Our source code is available at https://github.com/JSLiam94/CMSwinKAN.
中文: CMSwinKAN模型通过多尺度特征融合和对比学习策略,在病理图像分类中实现了比现有方法更高的准确性和可解释性。
English: The proposed CMSwinKAN model enhances pathological image classification by integrating multi-scale feature fusion and contrastive learning, achieving superior accuracy and interpretability compared to existing methods.
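The heuristic soft-voting step, bridging patch-level predictions to a slide-level call, can be pictured as a confidence-weighted average of patch class probabilities; the weighting below is an illustrative assumption, not the paper's clinically guided rule.

import numpy as np

def soft_vote(patch_probs, confidence_power=2.0):
    """Sketch: aggregate patch-level class probabilities into a slide-level
    prediction, weighting confident patches more heavily."""
    patch_probs = np.asarray(patch_probs)               # (n_patches, n_classes)
    conf = patch_probs.max(axis=1) ** confidence_power  # per-patch confidence weight
    weights = conf / conf.sum()
    slide_probs = (weights[:, None] * patch_probs).sum(axis=0)
    return slide_probs.argmax(), slide_probs

pred, probs = soft_vote([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
print(pred, probs.round(3))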

Authors:Mengyuan Li, Changhong Fu, Ziyu Lu, Zijie Zhang, Haobo Zuo, Liangliang Yao
Title: AnyTSR: Any-Scale Thermal Super-Resolution for UAV
Abstract:
Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAV) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to address this issue, while most existing SR methods are designed for fixed-scale SR. They are computationally expensive and inflexible in practical applications. To address the above issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAV within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature code to enable more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors and generates more accurate and detailed high-resolution images. The code is located at https://github.com/vision4robotics/AnyTSR.
中文: 本文提出AnyTSR方法,通过创新的图像编码器和任意尺度上采样器提升无人机热成像细节并减少伪影,在所有缩放比例下均优于现有技术。
English: This paper introduces AnyTSR, a novel any-scale thermal super-resolution method for UAVs that enhances image detail and reduces artifacts through an innovative encoder and upsampler, outperforming existing techniques across all scaling factors.
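The any-scale upsampler can be pictured with a LIIF-style sketch: each output coordinate samples a local feature and an MLP decodes it together with the coordinate offset. The nearest-neighbour sampling, offset definition, and MLP shape here are assumptions; AnyTSR's actual local feature ensemble is richer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AnyScaleUpsamplerSketch(nn.Module):
    """Minimal coordinate-conditioned upsampler: each output pixel queries the
    nearest low-res feature and an MLP decodes it with the coordinate offset."""
    def __init__(self, feat_dim=16, out_ch=1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, 64), nn.ReLU(),
                                 nn.Linear(64, out_ch))

    def forward(self, feat, out_h, out_w):              # feat: (B, C, h, w)
        b, c, h, w = feat.shape
        ys = torch.linspace(-1, 1, out_h)
        xs = torch.linspace(-1, 1, out_w)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(b, out_h, out_w, 2)
        sampled = F.grid_sample(feat, grid, mode="nearest", align_corners=True)
        # Offset between the query coordinate and the sampled feature's centre,
        # approximated by the fractional part of the source position.
        src_x = (gx + 1) / 2 * (w - 1)
        src_y = (gy + 1) / 2 * (h - 1)
        offset = torch.stack((src_x - src_x.round(), src_y - src_y.round()), -1)
        offset = offset.expand(b, out_h, out_w, 2)
        inp = torch.cat((sampled.permute(0, 2, 3, 1), offset), dim=-1)
        return self.mlp(inp).permute(0, 3, 1, 2)        # (B, out_ch, out_h, out_w)

up = AnyScaleUpsamplerSketch()
print(up(torch.randn(1, 16, 24, 24), 51, 37).shape)     # arbitrary target scale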

Authors:Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Title: EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model
Abstract:
Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instruction data; (ii) Benchmark. The absence of a comprehensive and systematic benchmark for evaluating diagnostic performance; (iii) Model. The difficulty of adapting holistic visual architectures to fine-grained, region-specific ophthalmic lesion identification. In this paper, we propose the Eyecare Kit, which systematically tackles the aforementioned three key challenges with a tailored dataset, benchmark, and model: First, we construct a multi-agent data engine with real-life ophthalmology data to produce Eyecare-100K, a high-quality ophthalmic visual instruction dataset. Subsequently, we design Eyecare-Bench, a benchmark that comprehensively evaluates the overall performance of LVLMs on intelligent ophthalmic diagnosis tasks across multiple dimensions. Finally, we develop EyecareGPT, thoroughly optimized for fine-grained ophthalmic visual understanding, which incorporates an adaptive resolution mechanism and a layer-wise dense connector. Extensive experimental results indicate that EyecareGPT achieves state-of-the-art performance in a range of ophthalmic tasks, underscoring its significant potential for the advancement of open research in intelligent ophthalmic diagnosis. Our project is available at https://github.com/DCDmllm/EyecareGPT.
中文摘要:医学大型视觉语言模型在医疗领域潜力显著,但其依赖通用数据和粗粒度视觉理解限制了智能眼科诊断的应用;本文提出的Eyecare Kit通过定制数据集、评估基准和优化模型系统解决了这一难题,在多项眼科任务中实现了领先性能。
English Summary: Medical Large Vision-Language Models show promise in healthcare but face limitations in ophthalmic diagnosis due to reliance on general data and coarse visual understanding, which the proposed Eyecare Kit addresses with a tailored dataset, benchmark, and optimized model to achieve state-of-the-art performance.

Authors:Yushen He, Lei Zhao, Tianchen Deng, Zipeng Fang, Weidong Chen
Title: Lightweight LiDAR-Camera 3D Dynamic Object Detection and Multi-Class Trajectory Prediction
Abstract:
Service mobile robots are often required to avoid dynamic objects while performing their tasks, but they usually have only limited computational resources. We therefore present a lightweight multi-modal framework for 3D object detection and trajectory prediction. Our system synergistically integrates LiDAR and camera inputs to achieve real-time perception of pedestrians, vehicles, and riders in 3D space. The framework proposes two novel modules: 1) a Cross-Modal Deformable Transformer (CMDT) for object detection with high accuracy and an acceptable amount of computation, and 2) a Reference Trajectory-based Multi-Class Transformer (RTMCT) for efficient and diverse trajectory prediction of multi-class objects with flexible trajectory lengths. Evaluations on the CODa benchmark demonstrate superior performance over existing methods across detection (+2.03% in mAP) and trajectory prediction (-0.408m in minADE5 of pedestrians) metrics. Remarkably, the system exhibits exceptional deployability - when implemented on a wheelchair robot with an entry-level NVIDIA 3060 GPU, it achieves real-time inference at 13.2 fps. To facilitate reproducibility and practical deployment, we release the related code of the method at https://github.com/TossherO/3D_Perception and its ROS inference version at https://github.com/TossherO/ros_packages.
Chinese: 本文提出了一种轻量级多模态框架,通过融合激光雷达和相机数据实现实时3D物体检测与轨迹预测,在有限计算资源下展现出卓越性能与部署能力。
English: This paper introduces a lightweight multi-modal framework that integrates LiDAR and camera data for real-time 3D object detection and trajectory prediction, achieving superior performance and deployability on limited computational resources.

Authors:Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, Chang Wen Chen
Title: Compile Scene Graphs with Reinforcement Learning
Abstract:
Next-token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. As an effective way to model language, image, video, and other modalities, the use of LLMs for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. It requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset and subsequently refined using reinforcement learning to enhance its ability to generate scene graphs in an end-to-end manner. The SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. We design a set of graph-centric rewards, including three recall-based variants -- Hard Recall, Hard Recall+Relax, and Soft Recall -- which evaluate semantic and spatial alignment between predictions and ground truth at the object and relation levels. A format consistency reward further ensures that outputs follow the expected structural schema. Extensive experiments on the VG150 and PSG benchmarks show that R1-SGG substantially reduces failure rates and achieves strong performance in Recall and mean Recall, surpassing traditional SGG models and existing multimodal language models. Our code is available at https://github.com/gpt4vision/R1-SGG
中文摘要:R1-SGG是一种通过监督微调和基于图结构奖励的强化学习训练的多模态大模型,能够端到端生成结构化场景图,在视觉基准测试中显著优于传统场景图生成模型。
English Summary: R1-SGG is a multimodal language model trained with supervised fine-tuning and reinforcement learning using graph-centric rewards to generate structured scene graphs end-to-end, achieving superior performance on visual benchmarks compared to traditional methods.
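A minimal sketch of the recall-style reward family: the hard variant below scores the fraction of ground-truth triplets reproduced exactly, while the paper's Relax and Soft variants add semantic and spatial tolerance (omitted here).

def hard_recall_reward(pred_triplets, gt_triplets):
    """Sketch of a hard-recall reward for scene-graph generation: fraction of
    ground-truth (subject, predicate, object) triplets exactly matched."""
    if not gt_triplets:
        return 0.0
    gt = set(gt_triplets)
    matched = sum(1 for t in set(pred_triplets) if t in gt)
    return matched / len(gt)

gt = [("person", "riding", "horse"), ("horse", "on", "grass")]
pred = [("person", "riding", "horse"), ("person", "near", "tree")]
print(hard_recall_reward(pred, gt))  # 0.5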

Authors:Jun Zeng, KC Santosh, Deepak Rajan Nayak, Thomas de Lange, Jonas Varkey, Tyler Berzin, Debesh Jha
Title: FocusNet: Transformer-enhanced Polyp Segmentation with Local and Pooling Attention
Abstract:
Colonoscopy is vital in the early diagnosis of colorectal polyps. Regular screenings can effectively prevent benign polyps from progressing to colorectal cancer (CRC). While deep learning has made impressive strides in polyp segmentation, most existing models are trained on single-modality and single-center data, making them less effective in real-world clinical environments. To overcome these limitations, we propose FocusNet, a Transformer-enhanced focus attention network designed to improve polyp segmentation. FocusNet incorporates three essential modules: the Cross-semantic Interaction Decoder Module (CIDM) for generating coarse segmentation maps, the Detail Enhancement Module (DEM) for refining shallow features, and the Focus Attention Module (FAM) to balance local detail and global context through local and pooling attention mechanisms. We evaluate our model on PolypDB, a newly introduced dataset with multi-modality and multi-center data for building more reliable segmentation methods. Extensive experiments show that FocusNet consistently outperforms existing state-of-the-art approaches with high Dice coefficients of 82.47% on the BLI modality, 88.46% on FICE, 92.04% on LCI, 82.09% on NBI, and 93.42% on WLI, demonstrating its accuracy and robustness across five different modalities. The source code for FocusNet is available at https://github.com/JunZengz/FocusNet.
Chinese: FocusNet通过集成交叉语义交互解码器、细节增强模块和聚焦注意力模块,在多模态结肠镜数据集上实现了82.09%-93.42%的优异分割精度,显著提升了息肉分割的鲁棒性和临床适用性。
English: FocusNet, a Transformer-enhanced network with specialized modules, significantly improves polyp segmentation accuracy and robustness across multiple imaging modalities, outperforming existing methods with dice coefficients ranging from 82.09% to 93.42%.
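The pooling-attention half of the FAM idea can be sketched as attention whose keys and values are average-pooled to a coarse grid, giving every position a cheap global summary; the pooled-only path, grid size, and head count below are illustrative assumptions.

import torch
import torch.nn as nn

class PoolingAttentionSketch(nn.Module):
    """Illustrative pooling-attention block: keys/values are average-pooled so
    every position attends to a compact global summary."""
    def __init__(self, dim=32, pool=4, heads=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        kv = self.pool(x).flatten(2).transpose(1, 2)   # (B, pool*pool, C)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(2, 32, 16, 16)
print(PoolingAttentionSketch()(x).shape)  # torch.Size([2, 32, 16, 16])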

Authors:Yuchen Rao, Stefan Ainetter, Sinisa Stekovic, Vincent Lepetit, Friedrich Fraundorfer
Title: Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding
Abstract:
High-level 3D scene understanding is essential in many applications. However, the challenges of generating accurate 3D annotations make development of deep learning models difficult. We turn to recent advancements in automatic retrieval of synthetic CAD models, and show that data generated by such methods can be used as high-quality ground truth for training supervised deep learning models. More exactly, we employ a pipeline akin to the one previously used to automatically annotate objects in ScanNet scenes with their 9D poses and CAD models. This time, we apply it to the recent ScanNet++ v1 dataset, which previously lacked such annotations. Our findings demonstrate that it is not only possible to train deep learning models on these automatically-obtained annotations but that the resulting models outperform those trained on manually annotated data. We validate this on two distinct tasks: point cloud completion and single-view CAD model retrieval and alignment. Our results underscore the potential of automatic 3D annotations to enhance model performance while significantly reducing annotation costs. To support future research in 3D scene understanding, we will release our annotations, which we call SCANnotate++, along with our trained models.

Authors:Shuobin Wei, Zhuang Zhou, Zhengan Lu, Zizhao Yuan, Binghua Su
Title: HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework
Abstract:
In RGB-D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images. However, most existing methods overlook the inherent differences in how RGB and depth images express information. Properly distinguishing the processing of RGB and depth images is essential to fully exploiting their unique and significant characteristics. To address this, we propose a novel heterogeneous dual-branch framework called HDBFormer, specifically designed to handle these modality differences. For RGB images, which contain rich detail, we employ both a basic encoder and a detail encoder to extract local and global features. For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters. Additionally, we introduce the Modality Information Interaction Module (MIIM), which combines transformers with large-kernel convolutions to efficiently exchange global and local information across modalities. Extensive experiments show that HDBFormer achieves state-of-the-art performance on the NYUDepthv2 and SUN-RGBD datasets. The code is available at: https://github.com/Weishuobin/HDBFormer.
中文: HDBFormer提出了一种异构双分支框架,通过分别处理RGB和深度图像的特性,并利用模态信息交互模块,在室内场景语义分割中实现了领先的性能。
English: HDBFormer introduces a heterogeneous dual-branch framework with specialized encoders for RGB and depth modalities, enhanced by a Modality Information Interaction Module to achieve state-of-the-art indoor scene segmentation performance.

Authors:Yang Wu, Yun Zhu, Kaihua Zhang, Jianjun Qian, Jin Xie, Jian Yang
Title: WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion
Abstract:
3D scene perception demands a large amount of adverse-weather LiDAR data, yet the cost of LiDAR data collection presents a significant scaling-up challenge. To this end, a series of LiDAR simulators have been proposed. Yet, they can only simulate a single adverse weather with a single physical model, and the fidelity of the generated data is quite limited. This paper presents WeatherGen, the first unified diverse-weather LiDAR data diffusion generation framework, significantly improving fidelity. Specifically, we first design a map-based data producer, which can provide a vast amount of high-quality diverse-weather data for training purposes. Then, we utilize the diffusion-denoising paradigm to construct a diffusion model. Among them, we propose a spider mamba generator to restore the disturbed diverse weather data gradually. The spider mamba models the feature interactions by scanning the LiDAR beam circle or central ray, excellently maintaining the physical structure of the LiDAR data. Subsequently, following the generator to transfer real-world knowledge, we design a latent feature aligner. Afterward, we devise a contrastive learning-based controller, which equips weather control signals with compact semantic knowledge through language supervision, guiding the diffusion model to generate more discriminative data. Extensive evaluations demonstrate the high generation quality of WeatherGen. Through WeatherGen, we construct the mini-weather dataset, promoting the performance of the downstream task under adverse weather conditions. Code is available: https://github.com/wuyang98/weathergen
中文:WeatherGen是一个统一的扩散框架,通过蜘蛛曼巴生成器和对比学习生成高保真、多天气的激光雷达数据,显著提升了恶劣天气下的三维感知性能。
English: WeatherGen is a unified diffusion framework that generates high-fidelity, diverse-weather LiDAR data using a spider mamba generator and contrastive learning, significantly improving 3D perception in adverse conditions.

Authors:Yihao Ouyang, Xunheng Kuang, Mengjia Xiong, Zhida Wang, Yuanquan Wang
Title: A Novel Hybrid Approach for Retinal Vessel Segmentation with Dynamic Long-Range Dependency and Multi-Scale Retinal Edge Fusion Enhancement
Abstract:
Accurate retinal vessel segmentation provides essential structural information for ophthalmic image analysis. However, existing methods struggle with challenges such as multi-scale vessel variability, complex curvatures, and ambiguous boundaries. While Convolutional Neural Networks (CNNs), Transformer-based models and Mamba-based architectures have advanced the field, they often suffer from vascular discontinuities or edge feature ambiguity. To address these limitations, we propose a novel hybrid framework that synergistically integrates CNNs and Mamba for high-precision retinal vessel segmentation. Our approach introduces three key innovations: 1) The proposed High-Resolution Edge Fuse Network is a high-resolution preserving hybrid segmentation framework that combines a multi-scale backbone with the Multi-scale Retina Edge Fusion (MREF) module to enhance edge features, ensuring accurate and robust vessel segmentation. 2) The Dynamic Snake Visual State Space block combines Dynamic Snake Convolution with Mamba to adaptively capture vessel curvature details and long-range dependencies. An improved eight-directional 2D Snake-Selective Scan mechanism and a dynamic weighting strategy enhance the perception of complex vascular topologies. 3) The MREF module enhances boundary precision through multi-scale edge feature aggregation, suppressing noise while emphasizing critical vessel structures across scales. Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance, particularly in maintaining vascular continuity and effectively segmenting vessels in low-contrast regions. This work provides a robust method for clinical applications requiring accurate retinal vessel analysis. The code is available at https://github.com/frank-oy/HREFNet.
中文: 本文提出了一种结合CNN和Mamba的新型混合框架,通过三项关键创新解决了视网膜血管分割中的多尺度变化和边界模糊等难题,在保持血管连续性和低对比度区域分割方面实现了最先进的性能。
English: This paper introduces a novel hybrid framework combining CNNs and Mamba to overcome retinal vessel segmentation challenges like multi-scale variability and ambiguous boundaries, achieving state-of-the-art performance through three key innovations that enhance edge features and vascular continuity.

Authors:Haoyang Luo, Linwei Tao, Minjing Dong, Chang Xu
Title: Beyond One-Hot Labels: Semantic Mixing for Model Calibration
Abstract:
Model calibration seeks to ensure that models produce confidence scores that accurately reflect the true likelihood of their predictions being correct. However, existing calibration approaches are fundamentally tied to datasets of one-hot labels, implicitly assuming full certainty in all the annotations. Such datasets are effective for classification but provide insufficient knowledge of uncertainty for model calibration, necessitating the curation of datasets with numerically rich ground-truth confidence values. However, due to the scarcity of uncertain visual examples, such samples are not easily available as real datasets. In this paper, we introduce calibration-aware data augmentation to create synthetic datasets of diverse samples and their ground-truth uncertainty. Specifically, we present \textbf{Calibration-aware Semantic Mixing (CSM)}, a novel framework that generates training samples with mixed class characteristics and annotates them with distinct confidence scores via diffusion models. Based on this framework, we propose calibrated reannotation to tackle the misalignment between the annotated confidence score and the mixing ratio during the diffusion reverse process. In addition, we explore the loss functions that better fit the new data representation paradigm. Experimental results demonstrate that CSM achieves superior calibration compared to the state-of-the-art calibration approaches. Our code is \href{https://github.com/E-Galois/CSM}{available here}.
中文摘要:本文提出了一种校准感知的数据增强方法CSM,通过扩散模型生成具有混合类别特征的合成样本并标注精确置信度,从而超越传统独热编码数据集的局限,显著提升了模型校准性能。
English Summary: This paper introduces a calibration-aware data augmentation method called CSM, which generates synthetic datasets with mixed-class samples and precise confidence scores using diffusion models to improve model calibration beyond traditional one-hot labeled datasets.
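The core data-representation change, soft targets whose confidence follows the mixing ratio, can be sketched as below; deriving the ratio from the diffusion process and the calibrated reannotation step are abstracted away.

import torch

def mixed_confidence_target(class_a, class_b, mix_ratio, num_classes):
    """Sketch of a calibration-aware soft target for a sample synthesised with
    `mix_ratio` of class_a and (1 - mix_ratio) of class_b."""
    target = torch.zeros(num_classes)
    target[class_a] = mix_ratio
    target[class_b] = 1.0 - mix_ratio
    return target

def soft_cross_entropy(logits, soft_targets):
    # Cross-entropy against soft targets, matching the soft-label paradigm.
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

t = mixed_confidence_target(class_a=3, class_b=7, mix_ratio=0.7, num_classes=10)
loss = soft_cross_entropy(torch.randn(1, 10), t.unsqueeze(0))
print(t, loss.item())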

Authors:Jinhao Li, Zijian Chen, Tingzhu Chen, Zhiji Liu, Changbo Wang
Title: OBIFormer: A Fast Attentive Denoising Framework for Oracle Bone Inscriptions
Abstract:
Oracle bone inscriptions (OBIs) are the earliest known form of Chinese characters and serve as a valuable resource for research in anthropology and archaeology. However, most excavated fragments are severely degraded due to thousands of years of natural weathering, corrosion, and man-made destruction, making automatic OBI recognition extremely challenging. Previous methods either focus on pixel-level information or utilize vanilla transformers for glyph-based OBI denoising, which leads to tremendous computational overhead. Therefore, this paper proposes a fast attentive denoising framework for oracle bone inscriptions, i.e., OBIFormer. It leverages channel-wise self-attention, glyph extraction, and selective kernel feature fusion to reconstruct denoised images precisely while being computationally efficient. Our OBIFormer achieves state-of-the-art denoising performance for PSNR and SSIM metrics on synthetic and original OBI datasets. Furthermore, comprehensive experiments on a real oracle dataset demonstrate the great potential of our OBIFormer in assisting automatic OBI recognition. The code will be made available at https://github.com/LJHolyGround/OBIFormer.
中文摘要:本文提出OBIFormer这一快速注意力去噪框架,通过通道自注意力与字形提取技术高效修复受损甲骨文,在保持计算效率的同时实现了最优去噪性能。
English Summary: This paper introduces OBIFormer, a fast attentive denoising framework that effectively restores degraded oracle bone inscriptions using channel-wise self-attention and glyph extraction, achieving state-of-the-art performance while maintaining computational efficiency.
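Channel-wise self-attention, the ingredient that keeps the attention cost independent of image size, can be sketched as a transposed attention whose map is C x C rather than (HW) x (HW); glyph extraction and selective kernel fusion are omitted, and the block below is only an assumed minimal form.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttentionSketch(nn.Module):
    """Sketch of channel-wise (transposed) self-attention: attention is computed
    across channels, so cost grows with channels instead of spatial resolution."""
    def __init__(self, dim=16):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                                  # (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)   # (B, C, C)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out)

x = torch.randn(1, 16, 32, 32)
print(ChannelSelfAttentionSketch()(x).shape)  # torch.Size([1, 16, 32, 32])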

Authors:Yipeng Sun, Linda-Sophie Schneider, Mingxuan Gu, Siyuan Mei, Chengze Ye, Fabian Wagner, Siming Bayer, Andreas Maier
Title: Filter2Noise: Interpretable Self-Supervised Single-Image Denoising for Low-Dose CT with Attention-Guided Bilateral Filtering
Abstract:
Effective denoising is crucial in low-dose CT to enhance subtle structures and low-contrast lesions while preventing diagnostic errors. Supervised methods struggle with limited paired datasets, and self-supervised approaches often require multiple noisy images and rely on deep networks like U-Net, offering little insight into the denoising mechanism. To address these challenges, we propose an interpretable self-supervised single-image denoising framework -- Filter2Noise (F2N). Our approach introduces an Attention-Guided Bilateral Filter that is adapted to each noisy input through a lightweight module that predicts spatially varying filter parameters, which can be visualized and adjusted post-training for user-controlled denoising in specific regions of interest. To enable single-image training, we introduce a novel downsampling shuffle strategy with a new self-supervised loss function that extends the concept of Noise2Noise to a single image and addresses spatially correlated noise. On the Mayo Clinic 2016 low-dose CT dataset, F2N outperforms the leading self-supervised single-image method (ZS-N2N) by 4.59 dB PSNR while improving transparency, user control, and parametric efficiency. These features provide key advantages for medical applications that require precise and interpretable noise reduction. Our code is available at https://github.com/sypsyp97/Filter2Noise.git.
中文摘要:提出的Filter2Noise框架通过引入可调节参数的注意力引导双边滤波器,实现了可解释的单图像自监督CT去噪,在性能和透明度方面均优于现有方法。
English Summary: The proposed Filter2Noise framework introduces an interpretable self-supervised method for single-image CT denoising using an attention-guided bilateral filter with adjustable parameters, achieving superior performance and transparency compared to existing approaches.
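The single-image self-supervision can be pictured in the spirit of ZS-N2N: split one noisy image into two half-resolution views with (roughly) independent noise and train the denoiser to map each view to the other. The checkerboard split and plain MSE below are assumptions; F2N's downsampling shuffle and loss handle spatially correlated noise differently.

import torch
import torch.nn.functional as F

def checkerboard_split(noisy):
    """Average diagonal pixel pairs of each 2x2 block to get two half-resolution
    views of a single noisy image (illustrative split, not the paper's exact one)."""
    a = (noisy[..., 0::2, 0::2] + noisy[..., 1::2, 1::2]) / 2
    b = (noisy[..., 0::2, 1::2] + noisy[..., 1::2, 0::2]) / 2
    return a, b

def self_supervised_loss(denoiser, noisy):
    # Train the denoiser so that denoise(view A) predicts view B and vice versa.
    a, b = checkerboard_split(noisy)
    return F.mse_loss(denoiser(a), b) + F.mse_loss(denoiser(b), a)

denoiser = torch.nn.Conv2d(1, 1, 3, padding=1)   # stand-in for the filter module
noisy = torch.rand(1, 1, 64, 64)
print(self_supervised_loss(denoiser, noisy).item())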

Authors:Alex Ergasti, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Title: U-Shape Mamba: State Space Model for faster diffusion
Abstract:
Diffusion models have become the most popular approach for high-quality image generation, but their high computational cost still remains a significant challenge. To address this problem, we propose U-Shape Mamba (USM), a novel diffusion model that leverages Mamba-based layers within a U-Net-like hierarchical structure. By progressively reducing sequence length in the encoder and restoring it in the decoder through Mamba blocks, USM significantly lowers computational overhead while maintaining strong generative capabilities. Experimental results against Zigma, which is currently the most efficient Mamba-based diffusion model, demonstrate that USM achieves one-third the GFlops, requires less memory and is faster, while outperforming Zigma in image quality. Frechet Inception Distance (FID) is improved by 15.3, 0.84 and 2.7 points on AFHQ, CelebAHQ and COCO datasets, respectively. These findings highlight USM as a highly efficient and scalable solution for diffusion-based generative models, making high-quality image synthesis more accessible to the research community while reducing computational costs.
Chinese: USM是一种新型扩散模型,采用类似U-Net结构和Mamba层,在显著降低三分之二计算量、减少内存需求并加快速度的同时,图像质量超越了当前最高效的Mamba模型Zigma。
English: USM is a novel diffusion model that uses a U-Net-like structure with Mamba layers to cut computational costs to one-third, reduce memory usage, and speed up processing while enhancing image quality over the leading efficient model, Zigma.

Authors:Wang Liu, Zhiyu Wang, Xin Guo, Puhong Duan, Xudong Kang, Shutao Li
Title: Learning from Noisy Pseudo-labels for All-Weather Land Cover Mapping
Abstract:
Semantic segmentation of SAR images has garnered significant attention in remote sensing due to the immunity of SAR sensors to cloudy weather and light conditions. Nevertheless, SAR imagery lacks detailed information and is plagued by significant speckle noise, rendering the annotation or segmentation of SAR images a formidable task. Recent efforts have resorted to annotating paired optical-SAR images to generate pseudo-labels through the utilization of an optical image segmentation network. However, these pseudo-labels are laden with noise, leading to suboptimal performance in SAR image segmentation. In this study, we introduce a more precise method for generating pseudo-labels by incorporating semi-supervised learning alongside a novel image resolution alignment augmentation. Furthermore, we introduce a symmetric cross-entropy loss to mitigate the impact of noisy pseudo-labels. Additionally, a bag of training and testing tricks is utilized to generate better land-cover mapping results. Our experiments on the GRSS data fusion contest indicate the effectiveness of the proposed method, which achieves first place. The code is available at https://github.com/StuLiu/DFC2025Track1.git.
中文摘要:本研究提出了一种结合半监督学习和图像分辨率对齐的改进方法,用于生成SAR图像分割的伪标签,并在GRSS数据融合竞赛中取得了最优成绩。
English Summary: This study introduces an enhanced method for generating pseudo-labels for SAR image segmentation using semi-supervised learning and image resolution alignment, achieving top performance in the GRSS data fusion contest.
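The symmetric cross-entropy term used against noisy pseudo-labels follows the standard formulation from the noisy-label literature: ordinary CE plus a reverse CE where the clamped one-hot label plays the role of the prediction. A minimal sketch, with the clamp constant and weights as assumptions:

import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=1.0, beta=1.0, clamp=-4.0):
    """Sketch of symmetric cross-entropy: CE plus a reverse CE term that is
    robust to label noise because log(0) is replaced by a finite constant."""
    num_classes = logits.size(-1)
    ce = F.cross_entropy(logits, targets)
    pred = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes).float()
    rce = -(pred * torch.clamp(torch.log(one_hot + 1e-12), min=clamp)).sum(-1).mean()
    return alpha * ce + beta * rce

logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
print(symmetric_cross_entropy(logits, labels).item())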

Authors:Juliette Bertrand, Sophie Giffard-Roisin, James Hollingsworth, Julien Mairal
Title: MicroFlow: Domain-Specific Optical Flow for Ground Deformation Estimation in Seismic Events
Abstract:
Dense ground displacement measurements are crucial for geological studies but are impractical to collect directly. Traditionally, displacement fields are estimated using patch matching on optical satellite images from different acquisition times. While deep learning-based optical flow models are promising, their adoption in ground deformation analysis is hindered by challenges such as the absence of real ground truth, the need for sub-pixel precision, and temporal variations due to geological or anthropogenic changes. In particular, we identify that deep learning models relying on explicit correlation layers struggle at estimating small displacements in real-world conditions. Instead, we propose a model that employs iterative refinements with explicit warping layers and a correlation-independent backbone, enabling sub-pixel precision. Additionally, a non-convex variant of Total Variation regularization preserves fault-line sharpness while maintaining smoothness elsewhere. Our model significantly outperforms widely used geophysics methods on semi-synthetic benchmarks and generalizes well to challenging real-world scenarios captured by both medium- and high-resolution sensors. Project page: https://jbertrand89.github.io/microflow/.
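The edge-preserving regularizer can be illustrated with a non-convex penalty on the displacement field's spatial gradients; the log penalty and epsilon below are illustrative choices, not necessarily the paper's exact variant of Total Variation.

import torch

def nonconvex_tv(flow, eps=0.01):
    """Sketch of a non-convex TV-style regulariser on a displacement field
    (B, 2, H, W): a log penalty on gradient magnitude penalises large jumps
    sub-linearly, preserving sharp fault lines while smoothing elsewhere."""
    dx = flow[..., :, 1:] - flow[..., :, :-1]
    dy = flow[..., 1:, :] - flow[..., :-1, :]
    mag = torch.sqrt(dx[..., :-1, :] ** 2 + dy[..., :, :-1] ** 2 + 1e-12)
    return torch.log1p(mag / eps).mean()

flow = torch.randn(1, 2, 32, 32) * 0.1
print(nonconvex_tv(flow).item())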

Authors:Shimou Ling, Liang Zhang, Jiangwei Zhao, Lili Pan, Hongliang Li
Title: LoRA-Based Continual Learning with Constraints on Critical Parameter Changes
Abstract:
LoRA-based continual learning represents a promising avenue for leveraging pre-trained models in downstream continual learning tasks. Recent studies have shown that orthogonal LoRA tuning effectively mitigates forgetting. However, this work unveils that under orthogonal LoRA tuning, the critical parameters for pre-tasks still change notably after learning post-tasks. To address this problem, we directly propose freezing the most critical parameter matrices in the Vision Transformer (ViT) for pre-tasks before learning post-tasks. In addition, building on orthogonal LoRA tuning, we propose orthogonal LoRA composition (LoRAC) based on QR decomposition, which may further enhance the plasticity of our method. Elaborate ablation studies and extensive comparisons demonstrate the effectiveness of our proposed method. Our results indicate that our method achieves state-of-the-art (SOTA) performance on several well-known continual learning benchmarks. For instance, on the Split CIFAR-100 dataset, our method shows a 6.35\% improvement in accuracy and a 3.24\% reduction in forgetting compared to previous methods. Our code is available at https://github.com/learninginvision/LoRAC-IPC.
Chinese: 本文提出一种在视觉Transformer中冻结关键参数矩阵并结合正交LoRA组合的方法,有效提升了持续学习性能,在Split CIFAR-100等基准测试中实现了最优的准确率提升和遗忘减少。
English: This paper introduces a method that freezes key parameter matrices in Vision Transformers and employs orthogonal LoRA composition to enhance continual learning, achieving state-of-the-art performance with significant improvements in accuracy and reduced forgetting on benchmarks like Split CIFAR-100.
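The QR-based orthogonal composition can be sketched as projecting a new task's LoRA basis onto the orthogonal complement of the bases accumulated from earlier tasks; matrix shapes and the row-wise basis convention below are assumptions.

import torch

def orthogonal_lora_basis(prev_A, new_A):
    """Sketch of QR-based orthogonal LoRA composition: the new task's
    down-projection is replaced by a basis orthogonal to previous tasks'."""
    # prev_A: (r_prev, d), new_A: (r_new, d); rows span the input subspaces.
    stacked = torch.cat([prev_A, new_A], dim=0).t()   # (d, r_prev + r_new)
    q, _ = torch.linalg.qr(stacked)                   # orthonormal columns
    # Columns beyond the previous rank form an orthogonal basis for the new task.
    return q[:, prev_A.size(0):].t()                  # (r_new, d)

prev_A = torch.randn(4, 64)
new_A = torch.randn(4, 64)
ortho_new = orthogonal_lora_basis(prev_A, new_A)
print((ortho_new @ prev_A.t()).abs().max().item())    # ~0: orthogonal to prior basis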

Authors:Xiangbo Gao, Yuheng Wu, Rujia Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu
Title: LangCoop: Collaborative Driving with Language
Abstract:
Multi-agent collaboration holds great promise for enhancing the safety, reliability, and mobility of autonomous driving systems by enabling information sharing among multiple connected agents. However, existing multi-agent communication approaches are hindered by limitations of existing communication media, including high bandwidth demands, agent heterogeneity, and information loss. To address these challenges, we introduce LangCoop, a new paradigm for collaborative autonomous driving that leverages natural language as a compact yet expressive medium for inter-agent communication. LangCoop features two key innovations: Mixture Model Modular Chain-of-thought (M$^3$CoT) for structured zero-shot vision-language reasoning and Natural Language Information Packaging (LangPack) for efficiently packaging information into concise, language-based messages. Through extensive experiments conducted in the CARLA simulations, we demonstrate that LangCoop achieves a remarkable 96\% reduction in communication bandwidth (< 2KB per message) compared to image-based communication, while maintaining competitive driving performance in the closed-loop evaluation. Our project page and code are at https://xiangbogaobarry.github.io/LangCoop/.

Authors:Shashank Shriram, Srinivasa Perisetla, Aryan Keskar, Harsha Krishnaswamy, Tonko Emil Westerhof Bossen, Andreas Møgelmose, Ross Greer
Title: Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety
Abstract:
Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out-of-label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection to improve hazard identification and explanation. Our pipeline combines a Vision-Language Model (VLM) and a Large Language Model (LLM) to detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge-of-Out-of-Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define a means of hazard detection and labeling evaluation on the extended dataset using cosine similarity. This evaluation considers the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large-scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision-language-based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at https://github.com/mi3labucm/COOOLER.git
中文摘要:本文提出了一种融合视觉语言推理与零样本目标检测的多模态方法,通过突破预定义类别的限制并提升定位精度,来改进自动驾驶中的危险识别能力。
English Summary: This paper introduces a multimodal system combining vision-language reasoning and zero-shot object detection to enhance autonomous driving hazard identification by overcoming limitations of predefined categories and improving localization accuracy.
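The evaluation protocol reduces to cosine similarity between embedded descriptions; the sketch below assumes descriptions have already been encoded by some sentence-embedding model and simply averages per-hazard similarities over a video.

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two description embeddings (any text encoder
    can stand in for the one used to embed hazard descriptions)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def video_score(pred_embeddings, gt_embeddings):
    # Average per-hazard similarity between matched predicted and annotated descriptions.
    sims = [cosine_similarity(p, g) for p, g in zip(pred_embeddings, gt_embeddings)]
    return sum(sims) / len(sims)

# Toy embeddings standing in for encoded hazard descriptions.
pred = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3]]
gt = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2]]
print(round(video_score(pred, gt), 3))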

Authors:Yasin Almalioglu, Andrzej Kucik, Geoffrey French, Dafni Antotsiou, Alexander Adam, Cedric Archambeau
Title: SAR Object Detection with Self-Supervised Pretraining and Curriculum-Aware Sampling
Abstract:
Object detection in satellite-borne Synthetic Aperture Radar (SAR) imagery holds immense potential in tasks such as urban monitoring and disaster response. However, the inherent complexities of SAR data and the scarcity of annotations present significant challenges in the advancement of object detection in this domain. Notably, the detection of small objects in satellite-borne SAR images poses a particularly intricate problem, because of the technology's relatively low spatial resolution and inherent noise. Furthermore, the lack of large labelled SAR datasets hinders the development of supervised deep learning-based object detection models. In this paper, we introduce TRANSAR, a novel self-supervised end-to-end vision transformer-based SAR object detection model that incorporates masked image pre-training on an unlabeled SAR image dataset that spans more than $25,700$ km$^2$ of ground area. Unlike the traditional object detection formulation, our approach capitalises on auxiliary binary semantic segmentation, designed to segregate objects of interest, especially the smaller ones, from the background during post-tuning. In addition, to address the innate class imbalance due to the disproportion of the object to the image size, we introduce an adaptive sampling scheduler that dynamically adjusts the target class distribution during training based on curriculum learning and model feedback. This approach allows us to outperform conventional supervised architectures such as DeepLabv3 or UNet, and state-of-the-art self-supervised learning-based architectures such as DPT, SegFormer or UperNet, as shown by extensive evaluations on benchmark SAR datasets.

Authors:Jiang-Xin Shi, Tong Wei, Yu-Feng Li
Title: LIFT+: Lightweight Fine-Tuning for Long-Tail Learning
Abstract:
The fine-tuning paradigm has emerged as a prominent approach for addressing long-tail learning tasks in the era of foundation models. However, the impact of fine-tuning strategies on long-tail learning performance remains unexplored. In this work, we disclose that existing paradigms exhibit a profound misuse of fine-tuning methods, leaving significant room for improvement in both efficiency and accuracy. Specifically, we reveal that heavy fine-tuning (fine-tuning a large proportion of model parameters) can lead to non-negligible performance deterioration on tail classes, whereas lightweight fine-tuning demonstrates superior effectiveness. Through comprehensive theoretical and empirical validation, we identify this phenomenon as stemming from inconsistent class conditional distributions induced by heavy fine-tuning. Building on this insight, we propose LIFT+, an innovative lightweight fine-tuning framework to optimize consistent class conditions. Furthermore, LIFT+ incorporates semantic-aware initialization, minimalist data augmentation, and test-time ensembling to enhance adaptation and generalization of foundation models. Our framework provides an efficient and accurate pipeline that facilitates fast convergence and model compactness. Extensive experiments demonstrate that LIFT+ significantly reduces both training epochs (from $\sim$100 to $\leq$15) and learned parameters (less than 1%), while surpassing state-of-the-art approaches by a considerable margin. The source code is available at https://github.com/shijxcs/LIFT-plus.
中文: 该研究揭示过度微调会损害尾部类别性能,并提出LIFT+轻量框架,通过大幅减少训练轮次和参数,显著提升模型效率与精度。
English: The study reveals that heavy fine-tuning impairs performance on tail classes, while proposing LIFT+, a lightweight framework that enhances efficiency and accuracy by reducing training epochs and parameters significantly.

Authors:Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer
Title: Perception Encoder: The best visual embeddings are not at the output of the network
Abstract:
We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves best-in-class results on a wide variety of tasks, including (1) zero-shot image and video classification and retrieval, simultaneously obtaining 86.6 average zero-shot ImageNet robustness and 76.9 zero-shot Kinetics-400 video classification; (2) document, image, and video Q&A, enabling 94.6 DocVQA, 80.9 InfographicVQA, and 82.7 PerceptionTest with an 8B LLM; and (3) spatial tasks such as detection, tracking, and depth estimation, setting a new COCO state-of-the-art of 66.0 box mAP. To foster further research, we release our models, code, and novel dataset of synthetically and human-annotated videos: https://github.com/facebookresearch/perception_models
中文: 感知编码器(PE)是一种通过对比性视觉-语言学习训练的最先进视觉编码器,借助对齐方法提取隐藏嵌入,在分类、问答和空间预测等多种任务中均取得了顶尖成果。
English: The Perception Encoder (PE) is a state-of-the-art vision encoder trained with contrastive vision-language learning, achieving top-tier results across diverse tasks like classification, Q&A, and spatial predictions by extracting hidden embeddings through alignment methods.
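The practical takeaway, that the useful embeddings sit in intermediate layers rather than at the output, can be illustrated with a forward hook; the torchvision ResNet here is only a stand-in for the PE encoder, and the layer choice and any downstream alignment head are assumptions.

import torch
import torchvision

# Grab features from an intermediate layer instead of the network output.
model = torchvision.models.resnet18(weights=None).eval()
features = {}

def hook(_module, _inp, out):
    features["intermediate"] = out.flatten(1)   # to be pooled/aligned downstream

handle = model.layer3.register_forward_hook(hook)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
handle.remove()
print(features["intermediate"].shape)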

Authors:Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
Title: PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Abstract:
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models. https://github.com/facebookresearch/perception_models
中文: 本文提出完全开放可复现的感知语言模型框架,避免使用专有模型蒸馏,发布280万人工标注视频数据以弥补关键数据缺口,并建立PLM-VideoBench评估体系来推动透明化的视频理解研究。
English: This paper introduces a fully open and reproducible Perception Language Model framework that avoids proprietary model distillation, releases 2.8M human-labeled video annotations to address data gaps, and establishes PLM-VideoBench for comprehensive video understanding evaluation.

Authors:Fei Shen, Jian Yu, Cong Wang, Xin Jiang, Xiaoyu Du, Jinhui Tang
Title: IMAGGarment: Fine-Grained Garment Generation for Controllable Fashion Design
Abstract:
This paper presents IMAGGarment, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance. Code, models, and datasets are publicly available at https://github.com/muzishen/IMAGGarment.
中文摘要:IMAGGarment是一种细粒度服装生成框架,通过两阶段训练策略分别建模全局外观和局部细节,实现了对轮廓、颜色和徽标位置的高精度控制,在结构稳定性和局部可控性方面优于现有方法。
English Summary: IMAGGarment is a fine-grained garment generation framework that enables high-fidelity clothing synthesis with precise control over silhouette, color, and logo placement through a two-stage training approach, outperforming existing methods in structural stability and local controllability.

Authors:Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, Martin R. Oswald
Title: ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos
Abstract:
Creating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to reconstruct the details and enhance generalizability to out-of-distribution poses faithfully. Aiming to learn the spatial correlation between human and scene accurately, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime. Our project page is at https://eth-ait.github.io/ODHSR.

Authors:Kaiyue Sun, Xian Liu, Yao Teng, Xihui Liu
Title: Personalized Text-to-Image Generation with Auto-Regressive Models
Abstract:
Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on the auto-regressive model demonstrate that this method achieves comparable subject fidelity and prompt following to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.
中文摘要:本文通过两阶段训练策略优化自回归模型进行个性化图像合成,实现了与主流扩散模型相当的主体保真度和提示跟随能力。
English Summary: This paper explores optimizing auto-regressive models for personalized image synthesis through a two-stage training approach, achieving performance comparable to leading diffusion-based methods.
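The two-stage recipe can be sketched on a toy auto-regressive LM: first optimize (only) the embedding that represents the new subject, then unfreeze and fine-tune the transformer layers; the model, data, and optimizers below are placeholders rather than the paper's setup.

import torch
import torch.nn as nn

vocab, dim = 100, 32
embed = nn.Embedding(vocab + 1, dim)            # +1 slot for the new subject token
blocks = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)
head = nn.Linear(dim, vocab + 1)

def loss_fn(tokens):
    # Next-token prediction loss on a toy token sequence.
    h = blocks(embed(tokens[:, :-1]))
    return nn.functional.cross_entropy(head(h).flatten(0, 1), tokens[:, 1:].flatten())

tokens = torch.randint(0, vocab + 1, (2, 8))

# Stage 1: optimise the embedding table only (in practice, just the new row).
opt1 = torch.optim.Adam(embed.parameters(), lr=1e-3)
opt1.zero_grad(); loss_fn(tokens).backward(); opt1.step()

# Stage 2: additionally fine-tune the transformer layers.
opt2 = torch.optim.Adam(list(embed.parameters()) + list(blocks.parameters()), lr=1e-4)
opt2.zero_grad(); loss_fn(tokens).backward(); opt2.step()
print("two-stage sketch ran")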

Authors:Andrew Melnik, Benjamin Alt, Giang Nguyen, Artur Wilkowski, Maciej Stefańczyk, Qirui Wu, Sinan Harms, Helge Rhodin, Manolis Savva, Michael Beetz
Title: Digital Twin Generation from Visual Data: A Survey
Abstract:
This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction work. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: https://github.com/ndrwmlnk/awesome-digital-twins
中文: 本调查全面综述了从视频生成数字孪生的前沿方法,分析了多种技术及其应用,并探讨了关键挑战与未来研究方向。
English: This survey provides a comprehensive overview of state-of-the-art methods for creating digital twins from videos, analyzing various techniques and their applications while addressing key challenges and future research directions.

Authors:Shaohui Dai, Yansong Qu, Zheyan Li, Xinyang Li, Shengchuan Zhang, Liujuan Cao
Title: Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs
Abstract:
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over $30\times$ faster. Our code will be available at https://github.com/Atrovast/THGS.
Chinese: 本文提出了一种免训练框架,通过从3D高斯基元构建超点图来实现高效且视图一致的开集词汇3D场景理解,在速度和分割精度上均显著优于现有方法。
English: This paper introduces a training-free framework that constructs a superpoint graph from 3D Gaussian primitives to achieve efficient and view-consistent open-vocabulary 3D scene understanding, significantly outperforming existing methods in both speed and segmentation accuracy.
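The reprojection strategy, lifting 2D semantic features onto view-consistent superpoints without iterative training, boils down to a grouped average; the sketch assumes projected pixel locations and superpoint assignments are already known and ignores visibility weighting.

import torch

def lift_features_to_superpoints(pixel_feats, point_pixels, point_superpoint, num_superpoints):
    """Sketch: each Gaussian/point grabs the 2D semantic feature at its projected
    pixel, and features are averaged per superpoint."""
    # pixel_feats: (H, W, D); point_pixels: (N, 2) integer (row, col); point_superpoint: (N,)
    point_feats = pixel_feats[point_pixels[:, 0], point_pixels[:, 1]]      # (N, D)
    d = pixel_feats.size(-1)
    sums = torch.zeros(num_superpoints, d).index_add_(0, point_superpoint, point_feats)
    counts = torch.zeros(num_superpoints).index_add_(0, point_superpoint,
                                                     torch.ones(len(point_superpoint)))
    return sums / counts.clamp(min=1).unsqueeze(-1)                        # (S, D)

feats = torch.randn(48, 64, 8)
pixels = torch.randint(0, 48, (100, 2))
pixels[:, 1] = torch.randint(0, 64, (100,))
sp_id = torch.randint(0, 10, (100,))
print(lift_features_to_superpoints(feats, pixels, sp_id, 10).shape)  # torch.Size([10, 8])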

Authors:Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie
Title: $\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark
Abstract:
We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured ``Chain-of-Edit'' pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a ``curse of synthetic data'': when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
中文: 我们推出了$\texttt{Complex-Edit}$基准,用于评估基于指令的图像编辑模型在不同复杂度下的表现,揭示了开源与专有模型间的性能差距及合成数据对输出质量的影响等关键发现。
English: We introduce $\texttt{Complex-Edit}$, a benchmark for evaluating instruction-based image editing models across complexity levels, revealing key insights such as performance gaps between open-source and proprietary models and the impact of synthetic data on output quality.
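The Best-of-N strategy reported to help both direct and step-by-step editing is just sample-and-select; `generate` and `score` below are assumed callables standing in for the editing model and the VLM-based auto-evaluator.

import random

def best_of_n(generate, score, instruction, image, n=4):
    """Sample N edited candidates and keep the one the scorer rates highest."""
    candidates = [generate(image, instruction) for _ in range(n)]
    return max(candidates, key=lambda c: score(c, instruction, image))

# Toy stand-ins: 'editing' appends a random suffix, scoring prefers longer outputs.
edited = best_of_n(
    generate=lambda img, ins: img + "*" * random.randint(1, 5),
    score=lambda cand, ins, img: len(cand),
    instruction="add stars", image="base", n=4)
print(edited)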

Authors:Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Yabin Zhang, Ao-Xiang Zhang, Tianwu Zhi, Jianzhao Liu, Yang Li, Jingwen Xu, Yiting Liao, Yushen Zuo, Mingyang Wu, Renjie Li, Shengyun Zhong, Zhengzhong Tu, Yufan Liu, Xiangguang Chen, Zuowei Cao, Minhao Tang, Shan Liu, Kexin Zhang, Jingfen Xie, Yan Wang, Kai Chen, Shijie Zhao, Yunchen Zhang, Xiangkai Xu, Hong Gao, Ji Shi, Yiming Bao, Xiugang Dong, Xiangsheng Zhou, Yaofeng Tu, Ying Liang, Yiwen Wang, Xinning Chai, Yuxuan Zhang, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song, Wei Sun, Kang Fu, Linhan Cao, Dandan Zhu, Kaiwei Zhang, Yucheng Zhu, Zicheng Zhang, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Zhi Jin, Jiawei Wu, Wei Wang, Wenjian Zhang, Yuhai Lan, Gaoxiong Yi, Hengyuan Na, Wang Luo, Di Wu, MingYin Bai, Jiawang Du, Zilong Lu, Zhenyu Jiang, Hui Zeng, Ziguan Cui, Zongliang Gan, Guijin Tang, Xinglin Xie, Kehuan Song, Xiaoqiang Lu, Licheng Jiao, Fang Liu, Xu Liu, Puhua Chen, Ha Thu Nguyen, Katrien De Moor, Seyed Ali Amirshahi, Mohamed-Chaker Larabi, Qi Tang, Linfeng He, Zhiyong Gao, Zixuan Gao, Guohua Zhang, Zhiye Huang, Yi Deng, Qingmiao Jiang, Lu Chen, Yi Yang, Xi Liao, Nourine Mohammed Nadir, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Meiqin Liu, Chao Yao, Yao Zhao
Title: NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Abstract:
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-ChallengeCVPR-NTIRE2025.
中文: 本文综述了NTIRE 2025挑战赛,该赛事通过高效视频质量评估和基于扩散的超分辨率两个赛道,专注于提升快手、TikTok等平台的短格式用户生成内容视频质量与用户体验,共吸引266名参与者并收到18份有效提交。
English: This paper reviews the NTIRE 2025 Challenge focusing on short-form UGC video quality assessment and enhancement through two tracks—efficient VQA and diffusion-based super-resolution—aimed at improving user experience on platforms like Kwai and TikTok, attracting 266 participants and 18 valid submissions.

Authors:Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie
Title: Science-T2I: Addressing Scientific Illusions in Image Synthesis
Abstract:
We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.

Authors:Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei
Title: VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Abstract:
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
中文摘要:VistaDPO是一种新颖的视频层次时空直接偏好优化框架,通过实例、时序和感知三个层级优化文本-视频偏好对齐,并构建了包含7200对问答数据的新数据集,有效解决了大型视频模型中的视频-语言错位和幻觉问题,显著提升了多项基准测试性能。
English Summary: VistaDPO is a novel framework that addresses video-language misalignment and hallucination in Large Video Models by optimizing text-video preference alignment across instance, temporal, and perceptive levels, using a newly constructed 7.2K QA dataset to significantly enhance model performance on multiple benchmarks.

Authors:Yongqian Peng, Yuxi Ma, Mengmeng Wang, Yuxuan Wang, Yizhou Wang, Chi Zhang, Yixin Zhu, Zilong Zheng
Title: Probing and Inducing Combinational Creativity in Vision-Language Models
Abstract:
The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity--defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts--or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs through the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, the best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLMs' outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.

Authors:Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li
Title: EventVAD: Training-Free Event-Aware Video Anomaly Detection
Abstract:
Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a substantial amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) performance in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
中文摘要:EventVAD是一种事件感知的视频异常检测框架,通过结合动态图架构和多模态大语言模型进行时序事件推理,在无需训练的场景下实现了最先进的检测性能,能有效捕捉细粒度视觉转换和多样化事件。
English Summary: EventVAD is an event-aware video anomaly detection framework that integrates dynamic graph modeling and multimodal large language models to achieve state-of-the-art performance in training-free settings by capturing fine-grained events through temporal reasoning.

Authors:Riza Velioglu, Petra Bevandic, Robin Chan, Barbara Hammer
Title: MGT: Extending Virtual Try-Off to Multi-Garment Scenarios
Abstract:
Computer vision is transforming the fashion industry through Virtual Try-On (VTON) and Virtual Try-Off (VTOFF). VTON generates images of a person in a specified garment using a target photo and a standardized garment image, while a more challenging variant, Person-to-Person Virtual Try-On (p2p-VTON), uses a photo of another person wearing the garment. VTOFF, in contrast, extracts standardized garment images from photos of clothed individuals. We introduce Multi-Garment TryOffDiff (MGT), a diffusion-based VTOFF model capable of handling diverse garment types, including upper-body, lower-body, and dresses. MGT builds on a latent diffusion architecture with SigLIP-based image conditioning to capture garment characteristics such as shape, texture, and pattern. To address garment diversity, MGT incorporates class-specific embeddings, achieving state-of-the-art VTOFF results on VITON-HD and competitive performance on DressCode. When paired with VTON models, it further enhances p2p-VTON by reducing unwanted attribute transfer, such as skin tone, ensuring preservation of person-specific characteristics. Demo, code, and models are available at: https://rizavelioglu.github.io/tryoffdiff/

Authors:Prasanna Reddy Pulakurthi, Majid Rabbani, Celso M. de Melo, Sohail A. Dianat, Raghuveer M. Rao
Title: Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data
Abstract:
This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground objects and spatially shuffling background patches. This effectively increases the diversity of the training data, improving model robustness and generalization. Evaluations on the PACS dataset for SFDA demonstrate that our augmentation strategy consistently outperforms existing methods, achieving significant accuracy improvements in both single-target and multi-target adaptation settings. By augmenting training data through structured transformations, our method enables model generalization across domains, providing a scalable solution for reducing reliance on manually annotated datasets. Furthermore, experiments on Market-1501 and DukeMTMC-reID datasets validate the effectiveness of our approach for person ReID, surpassing traditional augmentation techniques. The code is available at https://github.com/PrasannaPulakurthi/Foreground-Background-Augmentation
Chinese: 本文提出了一种双区域增强方法,通过对前景和背景区域进行针对性变换来提升模型鲁棒性和泛化能力,减少对大规模标注数据的依赖,在跨域适应和行人重识别任务中表现优异。
English: This paper presents a dual-region augmentation method that enhances model robustness and generalization for computer vision tasks by applying targeted transformations to foreground and background regions, reducing dependency on large labeled datasets.
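A minimal sketch of the dual-region idea, assuming a binary foreground mask is available; the noise scale and patch size are illustrative defaults rather than the paper's settings.

```python
import numpy as np

def dual_region_augment(img, fg_mask, noise_std=0.1, patch=16, rng=None):
    """Add Gaussian noise to foreground pixels and shuffle background patches.

    img:     (H, W, 3) float image in [0, 1]
    fg_mask: (H, W) boolean foreground mask
    """
    rng = rng or np.random.default_rng()
    out = img.copy()

    # 1) random noise perturbation on the foreground object
    noise = rng.normal(0.0, noise_std, img.shape)
    out[fg_mask] = np.clip(out[fg_mask] + noise[fg_mask], 0.0, 1.0)

    # 2) spatially shuffle background patches on a coarse grid
    H, W, _ = img.shape
    coords = [(y, x) for y in range(0, H - patch + 1, patch)
                     for x in range(0, W - patch + 1, patch)
              if not fg_mask[y:y + patch, x:x + patch].any()]  # background-only patches
    shuffled = list(coords)
    rng.shuffle(shuffled)
    for (y, x), (sy, sx) in zip(coords, shuffled):
        out[y:y + patch, x:x + patch] = img[sy:sy + patch, sx:sx + patch]
    return out
```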

Authors:Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, Yahui Zhou
Title: SkyReels-V2: Infinite-length Film Generative Model
Abstract:
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided by human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
中文: 摘要指出视频生成在提示遵循、动态质量和时长方面存在挑战,并提出了SkyReels-V2模型,该模型融合多模态大语言模型、多阶段训练和扩散框架,实现了无限长度的高质量影片生成。
English: The abstract highlights persistent challenges in video generation, such as balancing prompt adherence and motion quality, and introduces SkyReels-V2, a model that integrates MLLM, multi-stage training, and a diffusion framework to enable infinite-length, high-quality film synthesis.

Authors:Wenqi Dong, Bangbang Yang, Zesong Yang, Yuan Li, Tao Hu, Hujun Bao, Yuewen Ma, Zhaopeng Cui
Title: HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation
Abstract:
Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.
中文摘要:HiScene提出了一种层次化框架,通过将场景视为等距视图下的分层对象,结合视频扩散的不可见补全技术和形状先验注入,实现了具备组合结构的高保真3D场景生成。
English Summary: HiScene introduces a hierarchical framework that bridges 2D image and 3D object generation, enabling high-fidelity scene creation with compositional structure through video-diffusion-based amodal completion and shape prior injection.

Authors:Yang Yue, Yulin Wang, Haojun Jiang, Pan Liu, Shiji Song, Gao Huang
Title: EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance
Abstract:
Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at https://github.com/LeapLabTHU/EchoWorld.
中文: EchoWorld是一种运动感知的世界建模框架,通过编码解剖知识和运动动态来提升超声心动图探头引导,利用历史视觉运动数据进行预训练和微调,显著减少了引导误差。
English: EchoWorld is a motion-aware world modeling framework that enhances echocardiography probe guidance by encoding anatomical knowledge and motion dynamics, significantly reducing errors through pre-training and fine-tuning with historical visual-motion data.

Authors:Linkang Du, Zheng Zhu, Min Chen, Zhou Su, Shouling Ji, Peng Cheng, Jiming Chen, Zhikun Zhang
Title: ArtistAuditor: Auditing Artist Style Pirate in Text-to-Image Generation Models
Abstract:
Text-to-image models based on diffusion processes, such as DALL-E, Stable Diffusion, and Midjourney, are capable of transforming texts into detailed images and have widespread applications in art and design. As such, amateur users can easily imitate professional-level paintings by collecting an artist's work and fine-tuning the model, leading to concerns about artworks' copyright infringement. To tackle these issues, previous studies either add visually imperceptible perturbation to the artwork to change its underlying styles (perturbation-based methods) or embed post-training detectable watermarks in the artwork (watermark-based methods). However, when the artwork or the model has been published online, i.e., modification to the original artwork or model retraining is not feasible, these strategies might not be viable. To this end, we propose a novel method for data-use auditing in the text-to-image generation model. The general idea of ArtistAuditor is to identify if a suspicious model has been finetuned using the artworks of specific artists by analyzing the features related to the style. Concretely, ArtistAuditor employs a style extractor to obtain the multi-granularity style representations and treats artworks as samplings of an artist's style. Then, ArtistAuditor queries a trained discriminator to gain the auditing decisions. The experimental results on six combinations of models and datasets show that ArtistAuditor can achieve high AUC values (> 0.937). By studying ArtistAuditor's transferability and core modules, we provide valuable insights into the practical implementation. Finally, we demonstrate the effectiveness of ArtistAuditor in real-world cases by an online platform Scenario. ArtistAuditor is open-sourced at https://github.com/Jozenn/ArtistAuditor.
中文:ArtistAuditor是一种新颖的数据使用审计方法,通过分析多粒度风格表征来识别文本到图像模型是否使用了特定艺术家的作品进行微调,在实验和实际应用中均展现出高效能。
English: ArtistAuditor is a novel data-use auditing method that identifies if a text-to-image model has been fine-tuned using specific artists' works by analyzing multi-granularity style representations, achieving high effectiveness in experiments and real-world applications.

Authors:Dachun Kai, Yueyi Zhang, Jin Wang, Zeyu Xiao, Zhiwei Xiong, Xiaoyan Sun
Title: Event-Enhanced Blurry Video Super-Resolution
Abstract:
In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is +2.59 dB more accurate and 7.28$\times$ faster than the recent best BVSR baseline FMA-Net. Code: https://github.com/DachunKai/Ev-DeblurVSR.
中文: 本文提出Ev-DeblurVSR事件增强网络,通过融合事件信号与互惠特征去模糊模块及混合可变形对齐模块,在模糊视频超分辨率任务中实现了最优性能,在真实数据上显著提升了精度与处理速度。
English: This paper introduces Ev-DeblurVSR, an event-enhanced network that integrates event signals with reciprocal feature deblurring and hybrid deformable alignment to achieve state-of-the-art performance in blurry video super-resolution, significantly improving accuracy and speed on real-world data.

Authors:Yide Liu, Haijiang Sun, Xiaowen Zhang, Qiaoyuan Liu, Zhouchang Chen, Chongzhuo Xiao
Title: TTRD3: Texture Transfer Residual Denoising Dual Diffusion Model for Remote Sensing Image Super-Resolution
Abstract:
Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) Difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) Limited prior information causing semantic inconsistency in reconstructions, and (3) Trade-off imbalance between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) employing parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction and noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving 1.43% LPIPS improvement and 3.67% FID enhancement compared to best-performing baselines. Code/model: https://github.com/LED-666/TTRD3.
中文: 提出的纹理传递残差去噪双重扩散模型(TTRD3)通过多尺度特征提取、稀疏纹理引导和双重扩散框架,解决了遥感图像超分辨率中的关键难题,在多项指标上优于现有最佳方法,LPIPS提升1.43%,FID改善3.67%。
English: The proposed Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) addresses key challenges in Remote Sensing Image Super-Resolution by incorporating multi-scale feature extraction, texture transfer guidance, and a dual diffusion framework, demonstrating superior performance with 1.43% LPIPS and 3.67% FID improvements over existing methods.

Authors:Guoqing Zhang, Jingyun Yang, Yang Li
Title: Hierarchical Feature Learning for Medical Point Clouds via State Space Model
Abstract:
Deep learning-based point cloud modeling has been widely investigated as an indispensable component of general shape analysis. Recently, transformer and state space model (SSM) have shown promising capacities in point cloud learning. However, limited research has been conducted on medical point clouds, which have great potential in disease diagnosis and treatment. This paper presents an SSM-based hierarchical feature learning framework for medical point cloud understanding. Specifically, we down-sample input into multiple levels through the farthest point sampling. At each level, we perform a series of k-nearest neighbor (KNN) queries to aggregate multi-scale structural information. To assist SSM in processing point clouds, we introduce coordinate-order and inside-out scanning strategies for efficient serialization of irregular points. Point features are calculated progressively from short neighbor sequences and long point sequences through vanilla and group Point SSM blocks, to capture both local patterns and long-range dependencies. To evaluate the proposed method, we build a large-scale medical point cloud dataset named MedPointS for anatomy classification, completion, and segmentation. Extensive experiments conducted on MedPointS demonstrate that our method achieves superior performance across all tasks. The dataset is available at https://flemme-docs.readthedocs.io/en/latest/medpoints.html. Code is merged to a public medical imaging platform: https://github.com/wlsdzyzl/flemme.
中文: 本文提出了一种基于状态空间模型的分层特征学习框架,通过多尺度结构信息聚合和创新的点云序列化策略,在新建的MedPointS医疗点云数据集上实现了分类、补全和分割任务的最优性能。
English: This paper introduces a hierarchical SSM-based framework for medical point cloud analysis, employing multi-scale feature aggregation and novel scanning strategies to achieve state-of-the-art performance on the newly created MedPointS dataset for classification, completion, and segmentation tasks.
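The sampling-and-grouping step described above can be sketched as farthest point sampling followed by KNN grouping; the serialization strategies and Point SSM blocks are omitted, and all names and sizes are illustrative.

```python
import numpy as np

def farthest_point_sample(xyz, m, rng=None):
    """Select m points from xyz (N, 3) via greedy farthest point sampling."""
    rng = rng or np.random.default_rng()
    n = len(xyz)
    idx = np.zeros(m, dtype=int)
    idx[0] = rng.integers(n)
    dist = np.full(n, np.inf)
    for i in range(1, m):
        # keep, for every point, its distance to the closest already-selected center
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[idx[i - 1]], axis=1))
        idx[i] = int(dist.argmax())                 # pick the point farthest from all centers
    return idx

def knn_group(xyz, centers_idx, k):
    """For each sampled center, return the indices of its k nearest neighbors."""
    d = np.linalg.norm(xyz[None, :, :] - xyz[centers_idx][:, None, :], axis=-1)  # (m, N)
    return np.argsort(d, axis=1)[:, :k]             # (m, k) neighbor indices per center

# one level of the hierarchy: downsample, then group local neighborhoods
pts = np.random.default_rng(0).random((1024, 3))
centers = farthest_point_sample(pts, m=256)
neighbors = knn_group(pts, centers, k=16)           # these groups would feed the Point SSM blocks
```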

Authors:Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
Title: Disentangling Polysemantic Channels in Convolutional Neural Networks
Abstract:
Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, utilizing that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
Chinese: 本文提出一种算法,将卷积神经网络中的多义通道分解为多个单一概念通道,通过基于不同激活模式重构权重来提高模型可解释性。
English: This paper introduces an algorithm to disentangle polysemantic channels in CNNs into distinct concept-specific channels, enhancing interpretability by restructuring weights based on unique activation patterns.
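As a toy illustration of the core intuition (concepts sharing a channel exhibit distinct activation patterns in the previous layer, so the channel's incoming weights can be split accordingly), the sketch below clusters previous-layer activations and masks a duplicated weight vector per cluster. This is not the authors' algorithm; the clustering step and masking rule are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_polysemantic_channel(W_in, prev_acts, n_concepts=2):
    """Duplicate one channel's incoming weights so each copy serves one concept.

    W_in:      (D,) incoming weight vector of the polysemantic channel
    prev_acts: (S, D) previous-layer activations on samples that activate the channel
    Returns one weight vector per concept, zeroed outside that concept's support.
    """
    labels = KMeans(n_clusters=n_concepts, n_init=10).fit_predict(prev_acts)
    new_weights = []
    for c in range(n_concepts):
        # dimensions used more strongly by concept c than on average
        support = prev_acts[labels == c].mean(axis=0) > prev_acts.mean(axis=0)
        w_c = np.where(support, W_in, 0.0)   # keep weights only on that concept's inputs
        new_weights.append(w_c)
    return new_weights  # list of (D,) vectors, one disentangled channel per concept
```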

Authors:Youyi Zhan, Tianjia Shao, Yin Yang, Kun Zhou
Title: Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
Abstract:
Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance but with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that can reconstruct high-fidelity pose-dependent appearance with details and meanwhile can be rendered in real time. Our Gaussian avatar is empowered by spatially distributed MLPs which are explicitly located at different positions on the human body. The parameters stored in each Gaussian are obtained by interpolating from the outputs of its nearby MLPs based on their distances. To avoid undesirably smooth Gaussian property changes during interpolation, for each Gaussian we define a set of Gaussian offset basis, and a linear combination of the basis represents the Gaussian property offsets relative to the neutral properties. Then we propose to let the MLPs output a set of coefficients corresponding to the basis. In this way, although Gaussian coefficients are derived from interpolation and change smoothly, the Gaussian offset basis is learned freely without constraints. The smoothly varying coefficients combined with the freely learned basis can still produce distinctly different Gaussian property offsets, enabling the model to learn high-frequency spatial signals. We further use control points to constrain the Gaussians distributed on a surface layer rather than allowing them to be irregularly distributed inside the body, to help the human avatar generalize better when animated under novel poses. Compared to the state-of-the-art method, our method achieves better appearance quality with finer details while the rendering speed is significantly faster under novel views and novel poses.
中文: 本文提出了一种新颖的高斯人体化身表示方法,通过分布在人体不同位置的多层感知机重建高保真的姿态相关外观细节,同时实现实时渲染,在视觉质量和速度上均优于现有方法。
English: This paper introduces a novel Gaussian human avatar representation that uses spatially distributed MLPs on the body to reconstruct high-fidelity pose-dependent appearance details while enabling real-time rendering, outperforming existing methods in both visual quality and speed.
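A minimal sketch of the interpolation scheme described above: coefficients output by nearby body-anchored MLPs are blended by distance, then combined with each Gaussian's own offset basis. Shapes, the inverse-distance weighting, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def interpolate_gaussian_offsets(gauss_pos, mlp_pos, mlp_coeffs, offset_basis, k=4):
    """Distance-weighted interpolation of MLP coefficients, combined with per-Gaussian offset basis.

    gauss_pos:    (G, 3)    Gaussian positions on the body surface
    mlp_pos:      (M, 3)    positions of the spatially distributed MLPs
    mlp_coeffs:   (M, B)    coefficients output by each MLP for the current pose
    offset_basis: (G, B, P) per-Gaussian offset basis over P Gaussian properties
    """
    d = torch.cdist(gauss_pos, mlp_pos)                        # (G, M) pairwise distances
    knn_d, knn_idx = d.topk(k, largest=False)                  # k nearest MLPs per Gaussian
    w = 1.0 / (knn_d + 1e-6)
    w = w / w.sum(dim=1, keepdim=True)                         # normalized inverse-distance weights
    coeffs = (w.unsqueeze(-1) * mlp_coeffs[knn_idx]).sum(1)    # (G, B) smoothly varying coefficients
    offsets = torch.einsum("gb,gbp->gp", coeffs, offset_basis) # (G, P) property offsets vs. neutral
    return offsets
```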

Authors:Pengxuan Yang, Yupeng Zheng, Qichao Zhang, Kefei Zhu, Zebin Xing, Qiao Lin, Yun-Fu Liu, Zhiguo Su, Dongbin Zhao
Title: UncAD: Towards Safe End-to-end Autonomous Driving via Online Map Uncertainty
Abstract:
End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at https://github.com/pengxuanyang/UncAD.
中文: 提出的UncAD范式通过估计并利用在线地图的不确定性来指导轨迹预测与选择,以微小参数代价显著降低碰撞与冲突率,从而提升自动驾驶安全性。
English: The proposed UncAD paradigm enhances autonomous driving safety by estimating and utilizing online map uncertainty to guide trajectory prediction and selection, significantly reducing collision and conflict rates with minimal parameter increase.
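A minimal sketch of an uncertainty-collision-aware selection rule in the spirit described above: each candidate trajectory is scored by its collision risk, inflated where online-map uncertainty is high, and the lowest-cost candidate is kept. The cost terms and weights are illustrative, not the paper's formulation.

```python
import numpy as np

def select_trajectory(trajs, collision_risk, map_uncertainty, alpha=1.0, beta=0.5):
    """Pick the trajectory with the lowest uncertainty-aware collision cost.

    trajs:           list of K candidate trajectories (arbitrary payloads)
    collision_risk:  (K,) estimated collision probability per candidate
    map_uncertainty: (K,) mean online-map uncertainty along each candidate
    """
    # base collision cost, further inflated for candidates that rely on uncertain map regions
    cost = alpha * collision_risk + beta * collision_risk * map_uncertainty
    best = int(np.argmin(cost))
    return trajs[best], float(cost[best])
```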

Authors:Inzamamul Alam, Md Tanvir Islam, Simon S. Woo
Title: Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal
Abstract:
As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown due to the inadequacy of existing embedding techniques, which lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks while preserving essential image features. A reverse diffusion process ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations on state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available at https://github.com/inzamamulDU/SADRE.
中文: 本文提出的SADRE框架通过显著性引导的噪声注入和扩散重建技术,在保持图像质量的同时有效去除水印,其性能优于现有方法。
English: This paper proposes the SADRE framework, which effectively removes watermarks through saliency-guided noise injection and diffusion reconstruction while maintaining image quality, outperforming existing methods.
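A minimal sketch of the saliency-guided noise injection described above, with the noise level adapted to an estimated watermark strength; the scaling rule and names are illustrative assumptions, and the diffusion-based reconstruction stage is omitted.

```python
import numpy as np

def saliency_guided_noise(latent, saliency, watermark_strength, base_std=0.1, rng=None):
    """Inject noise into a latent, concentrated on salient (likely watermarked) regions.

    latent:             (C, H, W) latent representation
    saliency:           (H, W) saliency mask in [0, 1]
    watermark_strength: scalar in [0, 1] controlling the adaptive noise level
    """
    rng = rng or np.random.default_rng()
    std = base_std * (0.5 + watermark_strength)       # adaptive noise level
    noise = rng.normal(0.0, std, latent.shape)
    return latent + noise * saliency[None, :, :]      # perturb salient regions, keep the rest intact
```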

Authors:Mingwei Li, Pu Pang, Hehe Fan, Hua Huang, Yi Yang
Title: TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors
Abstract:
Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard $\alpha$-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over $\alpha$-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset are available at https://longxiang-ai.github.io/TSGS/.
Chinese: TSGS提出了一种双阶段框架,将几何学习与外观建模分离,解决了3D高斯溅射中的透明-深度困境,在TransLab数据集上验证了其能同时实现透明表面的精确几何重建和逼真渲染。
English: TSGS introduces a dual-stage framework that separates geometry learning from appearance modeling to overcome the transparency-depth dilemma in 3D Gaussian Splatting, achieving superior geometric accuracy and photorealistic rendering for transparent surfaces as validated on the TransLab dataset.
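A minimal sketch of the first-surface depth extraction described above for a single ray: slide a window over the $\alpha$-blending weights, take the window with the largest accumulated weight as the most likely surface location, and return the weighted average depth inside it. The window size and names are illustrative.

```python
import numpy as np

def first_surface_depth(depths, alpha_weights, window=5):
    """Estimate the first-surface depth along one ray from alpha-blending weights.

    depths:        (N,) per-Gaussian depths along the ray, sorted front to back
    alpha_weights: (N,) alpha-blending weights of the same Gaussians
    """
    if len(depths) <= window:
        sl = slice(0, len(depths))
    else:
        sums = np.convolve(alpha_weights, np.ones(window), mode="valid")  # windowed weight mass
        start = int(sums.argmax())                        # most likely surface location
        sl = slice(start, start + window)
    w = alpha_weights[sl]
    return float((w * depths[sl]).sum() / max(w.sum(), 1e-8))  # robust weighted average depth
```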

Authors:Leyang Li, Shilin Lu, Yan Ren, Adams Wai-Kin Kong
Title: Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts
Abstract:
Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
Chinese Summary: 本文提出ANT微调框架,通过自动引导去噪轨迹避免生成有害内容,在单概念与多概念消除任务中均达到最优性能,且不损害图像生成质量。
English Summary: The paper introduces ANT, a finetuning framework that automatically guides denoising trajectories to prevent harmful content generation in text-to-image models, achieving state-of-the-art performance in both single and multi-concept erasure without compromising image quality.
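The key insight above can be illustrated with a small change to a standard classifier-free guidance step: keep ordinary guidance early to preserve structure, then flip the condition direction in mid-to-late denoising to steer away from the unwanted concept. This sketches the insight only, not the ANT finetuning objective; the threshold and names are illustrative.

```python
def guided_noise(eps_uncond, eps_concept, t, num_steps, scale=7.5, reverse_after=0.5):
    """Classifier-free guidance that flips the condition direction late in denoising.

    eps_uncond:  noise prediction without the unwanted concept
    eps_concept: noise prediction conditioned on the unwanted concept
    t:           current timestep, counting down from num_steps toward 0
    """
    progress = 1.0 - t / num_steps           # 0 at the start of denoising, 1 at the end
    direction = 1.0 if progress < reverse_after else -1.0  # reverse in mid-to-late stages
    return eps_uncond + direction * scale * (eps_concept - eps_uncond)
```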

Authors:Siyu Chen, Ting Han, Changshe Zhang, Xin Luo, Meiliu Wu, Guorong Cai, Jinhe Su
Title: Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
Abstract:
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.
中文: DepthForge通过整合深度信息增强视觉基础模型的几何一致性,在语义分割中显著提升了泛化能力,尤其在极端环境下表现卓越。
English: DepthForge enhances Vision Foundation Models' generalization in semantic segmentation by integrating depth information to improve geometric consistency, achieving superior performance across diverse conditions.

Authors:Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa
Title: Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Abstract:
Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters and limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs on large datasets or degraded zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with one epoch of training on small image-text datasets without degrading zero-shot performance. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This helps the model both retain past knowledge and learn new knowledge for aligning features. Our extensive experiments with multiple classification and retrieval tasks show that CLIP-Refine succeeds in mitigating the modality gap and improving the zero-shot performance.
中文: CLIP-Refine是一种后预训练方法,通过随机特征对齐和混合对比蒸馏技术,有效缩小CLIP模型中的模态差距,在提升零样本性能的同时避免性能下降。
English: CLIP-Refine is a post-pre-training method that reduces the modality gap in CLIP models by aligning image and text features through random feature alignment and hybrid contrastive-distillation, improving zero-shot performance without degradation.
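A minimal sketch of the RaFA term described above: both modalities are pulled toward random reference vectors drawn from a shared prior. The Gaussian prior on the unit sphere and the MSE distance are illustrative assumptions; the HyCD term is omitted.

```python
import torch
import torch.nn.functional as F

def rafa_loss(img_feats, txt_feats):
    """Random feature alignment: pull both modalities toward samples from a shared prior.

    img_feats, txt_feats: (B, D) image and text embeddings from the CLIP encoders
    """
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    ref = F.normalize(torch.randn_like(img_feats), dim=-1)  # random reference vectors from the prior
    return F.mse_loss(img_feats, ref) + F.mse_loss(txt_feats, ref)
```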

Authors:Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby T. Tan, Radu Timofte, Qiyu Rong, Hongyuan Jing, Mengmeng Zhang, Jinglong Li, Xiangyu Lu, Yi Ren, Yuting Liu, Meng Zhang, Xiang Chen, Qiyuan Guan, Jiangxin Dong, Jinshan Pan, Conglin Gou, Qirui Yang, Fangpu Zhang, Yunlong Lin, Sixiang Chen, Guoxi Huang, Ruirui Lin, Yan Zhang, Jingyu Yang, Huanjing Yue, Jiyuan Chen, Qiaosi Yi, Hongjun Wang, Chenxi Xie, Shuai Li, Yuhui Wu, Kaiyi Ma, Jiakui Hu, Juncheng Li, Liwen Pan, Guangwei Gao, Wenjie Li, Zhenyu Jin, Heng Guo, Zhanyu Ma, Yubo Wang, Jinghua Wang, Wangzhi Xing, Anjusree Karnavar, Diqi Chen, Mohammad Aminul Islam, Hao Yang, Ruikun Zhang, Liyuan Pan, Qianhao Luo, XinCao, Han Zhou, Yan Min, Wei Dong, Jun Chen, Taoyi Wu, Weijia Dou, Yu Wang, Shengjie Zhao, Yongcheng Huang, Xingyu Han, Anyan Huang, Hongtao Wu, Hong Wang, Yefeng Zheng, Abhijeet Kumar, Aman Kumar, Marcos V. Conde, Paula Garrido, Daniel Feijoo, Juan C. Benito, Guanglu Dong, Xin Lin, Siyuan Liu, Tianheng Zheng, Jiayu Zhong, Shouyi Wang, Xiangtai Li, Lanqing Guo, Lu Qi, Chao Ren, Shuaibo Wang, Shilong Zhang, Wanyu Zhou, Yunze Wu, Qinzhong Tan, Jieyuan Pei, Zhuoxuan Li, Jiayu Wang, Haoyu Bian, Haoran Sun, Subhajit Paul, Ni Tang, Junhao Huang, Zihan Cheng, Hongyun Zhu, Yuehan Wu, Kaixin Deng, Hang Ouyang, Tianxin Xiao, Fan Yang, Zhizun Luo, Zeyu Xiao, Zhuoyuan Li, Nguyen Pham Hoang Le, An Dinh Thien, Son T. Luu, Kiet Van Nguyen, Ronghua Xu, Xianmin Tian, Weijian Zhou, Jiacheng Zhang, Yuqian Chen, Yihang Duan, Yujie Wu, Suresh Raikwar, Arsh Garg, Kritika, Jianhua Zheng, Xiaoshan Ma, Ruolin Zhao, Yongyu Yang, Yongsheng Liang, Guiming Huang, Qiang Li, Hongbin Zhang, Xiangyu Zheng, A. N. Rajagopalan
Title: NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Abstract:
This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at https://lixinustc.github.io/CVPR-NTIRE2025-RainDrop-Competition.github.io/.
中文: NTIRE 2025挑战赛通过构建包含昼夜不同聚焦模式的雨滴清晰数据集,为雨滴去除任务设立了新基准,吸引了361名参赛者,其中32支团队在最终测试中取得了顶尖性能。
English: The NTIRE 2025 Challenge introduced a comprehensive Raindrop Clarity dataset to benchmark raindrop removal techniques under diverse lighting and focus conditions, attracting 361 participants with 32 teams achieving state-of-the-art results.

Authors:Qianqian Sun, Jixiang Luo, Dell Zhang, Xuelong Li
Title: SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding
Abstract:
Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at https://github.com/smileformylove/SmartFreeEdit.
中文: SmartFreeEdit提出了一种创新的端到端框架,将多模态大语言模型与超图增强修复架构相结合,通过自然语言指令实现精确的无掩码图像编辑,有效解决了空间推理和语义一致性等关键难题。
English: SmartFreeEdit introduces an innovative end-to-end framework that integrates a multimodal large language model with hypergraph-enhanced inpainting, enabling precise, mask-free image editing guided by natural language instructions while overcoming challenges in spatial reasoning and semantic consistency.

Authors:Naibang Wang, Deyong Shang, Yan Gong, Xiaoxi Hu, Ziying Song, Lei Yang, Yuhan Huang, Xiaoyu Wang, Jianli Lu
Title: Collaborative Perception Datasets for Autonomous Driving: A Review
Abstract:
Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: https://github.com/frankwnb/Collaborative-Perception-Datasets-for-Autonomous-Driving.
中文摘要:本文作为首个专注于协同感知数据集的系统性综述,从多维度比较现有资源,分析合作范式与传感器配置,并指出数据集可扩展性、标准化等关键挑战与未来方向。
English Summary: This comprehensive review analyzes collaborative perception datasets for autonomous driving, comparing them across cooperation paradigms, sensor configurations, and application scenarios while identifying key challenges like dataset scalability and standardization.

Authors:Qishan Wang, Shuyong Gao, Junjie Hu, Jiawen Yu, Xuan Tong, You Li, Wenqiang Zhang
Title: HSS-IAD: A Heterogeneous Same-Sort Industrial Anomaly Detection Dataset
Abstract:
Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at https://github.com/Qiqigeww/HSS-IAD-Dataset.
中文: HSS-IAD数据集通过提供具有多样化结构和外观的金属部件及真实缺陷标注,弥补了现有工业异常检测数据集的不足,为多类别无监督异常检测方法在真实工厂环境下的评估提供了更有效的基准。
English: The HSS-IAD dataset addresses limitations in current industrial anomaly detection datasets by providing diverse metallic parts with realistic defects, enabling more accurate evaluation of multi-class unsupervised anomaly detection methods under real-world conditions.

Authors:Junyi Cao, Evangelos Kalogerakis
Title: SOPHY: Learning to Generate Simulation-Ready Objects with Physical Materials
Abstract:
We present SOPHY, a generative model for 3D physics-aware shape synthesis. Unlike existing 3D generative models that focus solely on static geometry or 4D models that produce physics-agnostic animations, our method jointly synthesizes shape, texture, and material properties related to physics-grounded dynamics, making the generated objects ready for simulations and interactive, dynamic environments. To train our model, we introduce a dataset of 3D objects annotated with detailed physical material attributes, along with an efficient pipeline for material annotation. Our method enables applications such as text-driven generation of interactive, physics-aware 3D objects and single-image reconstruction of physically plausible shapes. Furthermore, our experiments show that jointly modeling shape and material properties enhances the realism and fidelity of the generated shapes, improving performance on both generative geometry and physical plausibility.

Authors:Lvmin Zhang, Maneesh Agrawala
Title: Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
Abstract:
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to that of image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
中文: FramePack是一种神经网络结构,通过压缩输入帧以保持固定的变换器上下文长度,从而高效训练下一帧视频预测模型,实现与图像扩散相当的计算瓶颈和更高批处理量,并采用防漂移采样方法减少误差累积。
English: FramePack is a neural network structure that enables efficient training of next-frame video prediction models by compressing input frames to maintain a fixed transformer context length, allowing for higher batch sizes and reduced computational bottlenecks similar to image diffusion, while incorporating an anti-drifting sampling method to minimize error accumulation.
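A minimal sketch of the packing idea: older frames are pooled to progressively fewer tokens under a geometric schedule so the transformer context stays roughly fixed regardless of video length. The schedule, pooling operator, and budget are illustrative assumptions, not the paper's kernel configuration.

```python
import torch
import torch.nn.functional as F

def pack_frames(frame_tokens, budget=1024):
    """Compress a history of frame token maps into a roughly fixed-length context.

    frame_tokens: list of (T_i, D) token tensors, newest frame last
    budget:       target number of context tokens, independent of video length
    """
    packed = []
    for age, tokens in enumerate(reversed(frame_tokens)):   # newest frame first
        quota = max(1, budget // (2 ** (age + 1)))           # geometric schedule: older -> fewer tokens
        if tokens.shape[0] > quota:                          # pool the frame's tokens down to its quota
            tokens = F.adaptive_avg_pool1d(tokens.t().unsqueeze(0), quota)[0].t()
        packed.append(tokens)
    return torch.cat(packed, dim=0)  # context length stays near the budget for any video length
```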

Authors:Yun-Cheng Li, Sen Lei, Yi-Tao Zhao, Heng-Chao Li, Jun Li, Antonio Plaza
Title: SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping
Abstract:
Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noises. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-Based Network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented changed information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at https://github.com/SUPERMAN123000/FAEWNet.
中文摘要:提出的FAEWNet模型通过结合SAM编码器、分布感知傅里叶适配器和边缘约束变形模块,有效解决了建筑变化检测中的领域差异和时序错位问题,在多个数据集上取得了最优性能。
English Summary: The proposed FAEWNet model overcomes limitations in building change detection by integrating a SAM encoder with a distribution-aware Fourier adapter and an edge-constrained warping module, achieving state-of-the-art performance on multiple datasets.

Authors:Wentao Wu, Xiao Wang, Chenglong Li, Bo Jiang, Jin Tang, Bin Luo, Qi Liu
Title: CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework
Abstract:
Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for the RGB-Event perception. This framework accepts multi-modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-event fusion based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model's ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model's capability for multi-modal understanding and capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for the pre-training. Extensive experiments on five downstream tasks fully demonstrated the effectiveness of CM3AE. Source code and pre-trained models will be released on https://github.com/Event-AHU/CM3AE.
中文:CM3AE框架提出了一种新颖的RGB-事件感知预训练方法,通过融合多模态数据并采用融合重建与对比学习策略,有效提升跨模态理解能力,在多种下游任务中展现出卓越性能。
English: The CM3AE framework introduces a novel pre-training approach for RGB-Event perception by integrating multi-modal data and employing fusion reconstruction and contrastive learning to enhance cross-modal understanding and performance across diverse downstream tasks.

Authors:Yan Wu, Korrawe Karunratanakul, Zhengyi Luo, Siyu Tang
Title: UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
Abstract:
Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.

Authors:Kaustav Chanda, Aayush Atul Verma, Arpitsinh Vaghela, Yezhou Yang, Bharatesh Chakravarthi
Title: Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space
Abstract:
Event cameras promise a paradigm shift in vision sensing with their low latency, high dynamic range, and asynchronous nature of events. Unfortunately, the scarcity of high-quality labeled datasets hinders their widespread adoption in deep learning-driven computer vision. To mitigate this, several simulators have been proposed to generate synthetic event data for training models for detection and estimation tasks. However, the fundamentally different sensor design of event cameras compared to traditional frame-based cameras poses a challenge for accurate simulation. As a result, most simulated data fail to mimic data captured by real event cameras. Inspired by existing work on using deep features for image comparison, we introduce event quality score (EQS), a quality metric that utilizes activations of the RVT architecture. Through sim-to-real experiments on the DSEC driving dataset, it is shown that a higher EQS implies improved generalization to real-world data after training on simulated events. Thus, optimizing for EQS can lead to developing more realistic event camera simulators, effectively reducing the simulation gap. EQS is available at https://github.com/eventbasedvision/EQS.
中文: 事件相机具有低延迟和高动态范围等优势,但其在深度学习中的应用受限于高质量标注数据的稀缺以及模拟其独特传感器设计的挑战,为此引入事件质量评分(EQS)指标来提升模拟器的真实性并缩小仿真差距。
English: Event cameras offer advantages like low latency and high dynamic range, but their adoption in deep learning is limited by the scarcity of labeled data and the challenge of accurately simulating their unique sensor design, leading to the introduction of the event quality score (EQS) metric to improve simulator realism and reduce the simulation gap.
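
A hedged sketch of a latent-space realism score in the spirit of EQS; the Fréchet-style distance over pre-extracted features is an assumption (the actual metric builds on RVT activations), and the feature arrays here are hypothetical inputs.

```python
import numpy as np

def event_quality_score(sim_feats, real_feats):
    """sim_feats, real_feats: (N, D) arrays of deep activations extracted from
    simulated and real event streams. Smaller distributional distance in this
    latent space is mapped to a higher (more realistic) score."""
    mu_s, mu_r = sim_feats.mean(axis=0), real_feats.mean(axis=0)
    cov_s = np.cov(sim_feats, rowvar=False)
    cov_r = np.cov(real_feats, rowvar=False)
    diff = mu_s - mu_r
    # trace of the matrix square root of cov_s @ cov_r, via its eigenvalues
    eigvals = np.linalg.eigvals(cov_s @ cov_r).real
    covmean_trace = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    frechet = diff @ diff + np.trace(cov_s) + np.trace(cov_r) - 2.0 * covmean_trace
    return 1.0 / (1.0 + max(frechet, 0.0))  # higher = closer to real data
```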

Authors:Chaitanya Patel, Juan Carlos Niebles, Ehsan Adeli
Title: AdaVid: Adaptive Video-Language Pretraining
Abstract:
Contrastive video-language pretraining has demonstrated great success in learning rich and robust video representations. However, deploying such video encoders on compute-constrained edge devices remains challenging due to their high computational demands. Additionally, existing models are typically trained to process only short video clips, often limited to 4 to 64 frames. In this paper, we introduce AdaVid, a flexible architectural framework designed to learn efficient video encoders that can dynamically adapt their computational footprint based on available resources. At the heart of AdaVid is an adaptive transformer block, inspired by Matryoshka Representation Learning, which allows the model to adjust its hidden embedding dimension at inference time. We show that AdaVid-EgoVLP, trained on video-narration pairs from the large-scale Ego4D dataset, matches the performance of the standard EgoVLP on short video-language benchmarks using only half the compute, and even outperforms EgoVLP when given equal computational resources. We further explore the trade-off between frame count and compute on the challenging Diving48 classification benchmark, showing that AdaVid enables the use of more frames without exceeding computational limits. To handle longer videos, we also propose a lightweight hierarchical network that aggregates short clip features, achieving a strong balance between compute efficiency and accuracy across several long video benchmarks.
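
A minimal sketch of the Matryoshka-style idea behind the adaptive block; the sliced-linear layer and the names (`AdaptiveLinear`, `active_dim`) are illustrative assumptions, not AdaVid's actual implementation.

```python
import torch
import torch.nn as nn

class AdaptiveLinear(nn.Module):
    """One trained weight matrix whose leading sub-block is itself usable,
    so the same layer can run at several hidden widths at inference time."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x, active_dim):
        w = self.weight[:active_dim, :active_dim]
        b = self.bias[:active_dim]
        return x[..., :active_dim] @ w.t() + b

layer = AdaptiveLinear(dim=768)
tokens = torch.randn(2, 64, 768)
low_compute = layer(tokens, active_dim=384)   # half width, roughly a quarter of the FLOPs
full_compute = layer(tokens, active_dim=768)
```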

Authors:Minmin Yang, Huantao Ren, Senem Velipasalar
Title: 3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap
Abstract:
Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available at https://github.com/LexieYang/3D-PointZshotS.
中文:提出的3D-PointZshotS框架通过引入潜在几何原型和自一致性损失来弥合语义与视觉间的鸿沟,显著提升了三维点云零样本分割性能,在多个数据集上表现优异。
English: The proposed 3D-PointZshotS framework enhances zero-shot 3D point cloud segmentation by incorporating latent geometric prototypes and a self-consistency loss to bridge the semantic-visual gap, achieving superior performance on multiple datasets.

Authors:Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, Qinglin Lu
Title: InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework
Abstract:
Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing 10-million-level samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at https://github.com/Tencent/InstantCharacter.
中文摘要:InstantCharacter是一个基于扩散变换器的可扩展框架,通过创新的适配器和大规模数据集训练,实现了跨领域的高保真、文本可控角色定制,解决了现有方法的泛化能力和图像质量问题。
English Summary: InstantCharacter is a scalable framework built on a diffusion transformer that overcomes limitations of existing methods by achieving high-fidelity, text-controllable character customization across diverse domains through a novel adapter and large-scale dataset training.

Authors:Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan
Title: WORLDMEM: Long-term Consistent World Simulation with Memory
Abstract:
World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
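
A compact sketch of state-keyed memory retrieval as described above; the distance-based top-k lookup and the (pose, timestamp) state-vector layout are assumptions for illustration only.

```python
import torch

class MemoryBank:
    """Stores memory units (frame feature, state) and returns the units whose
    states (e.g. camera pose + timestamp) are closest to the query state, to be
    consumed by a memory-attention layer."""
    def __init__(self):
        self.feats, self.states = [], []

    def write(self, feat, state):          # feat: (D,), state: (S,)
        self.feats.append(feat)
        self.states.append(state)

    def read(self, query_state, k=4):
        states = torch.stack(self.states)                         # (N, S)
        dist = torch.cdist(query_state.unsqueeze(0), states)[0]   # (N,)
        idx = dist.topk(min(k, len(self.feats)), largest=False).indices
        return torch.stack([self.feats[i] for i in idx])          # (k, D)
```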

Authors:Sidun Liu, Wenyu Li, Peng Qiao, Yong Dou
Title: Regist3R: Incremental Registration with Stereo Foundation Model
Abstract:
Multi-view 3D reconstruction has remained an essential yet challenging problem in the field of computer vision. While DUSt3R and its successors have achieved breakthroughs in 3D reconstruction from unposed images, these methods exhibit significant limitations when scaling to multi-view scenarios, including high computational cost and cumulative error induced by global alignment. To address these challenges, we propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. Regist3R leverages an incremental reconstruction paradigm, enabling large-scale 3D reconstructions from unordered and many-view image collections. We evaluate Regist3R on public datasets for camera pose estimation and 3D reconstruction. Our experiments demonstrate that Regist3R achieves comparable performance with optimization-based methods while significantly improving computational efficiency, and outperforms existing multi-view reconstruction models. Furthermore, to assess its performance in real-world applications, we introduce a challenging oblique aerial dataset which has long spatial spans and hundreds of views. The results highlight the effectiveness of Regist3R. We also demonstrate the first attempt to reconstruct large-scale scenes encompassing over thousands of views through pointmap-based foundation models, showcasing its potential for practical applications in large-scale 3D reconstruction tasks, including urban modeling, aerial mapping, and beyond.
中文: Regist3R是一种新颖的立体基础模型,能够从多视角图像实现高效可扩展的增量式三维重建,在保持与优化方法相当性能的同时显著提升计算效率,并在真实场景数据上超越现有模型。
English: Regist3R is a novel stereo foundation model that enables efficient and scalable incremental 3D reconstruction from multi-view images, achieving comparable performance to optimization-based methods while significantly improving computational efficiency and outperforming existing models on challenging real-world datasets.

Authors:Liam Schoneveld, Zhe Chen, Davide Davoli, Jiapeng Tang, Saimon Terazawa, Ko Nishino, Matthias Nießner
Title: SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
Abstract:
Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.

Authors:Aditya Prakash, Benjamin Lundell, Dmitry Andreychuk, David Forsyth, Saurabh Gupta, Harpreet Sawhney
Title: How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions
Abstract:
We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.

Authors:Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer
Title: FLIP Reasoning Challenge
Abstract:
Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-sourced and closed-sourced models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than using the raw images directly (75.2% vs. 69.6% for Gemini 1.5 Pro). Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
中文: FLIP数据集作为评估AI推理能力的基准,通过视觉序列任务揭示现有模型与人类表现差距显著,而图像描述和集成方法能有效提升其准确性。
English: The FLIP dataset is introduced as a benchmark to evaluate AI reasoning through visual sequence tasks, revealing that current models significantly lag behind human performance and benefit from captioning and ensemble methods.

Authors:Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, Chun Yuan, Ying Shan
Title: Cobra: Efficient Line Art COlorization with BRoAder References
Abstract:
The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: https://zhuang2002.github.io/Cobra/.
中文摘要:Cobra提出了一种高效的线稿上色方法,采用因果稀疏DiT架构处理200多张参考图像并保持低延迟,实现了工业级的精准度和速度要求。
English Summary: Cobra introduces an efficient line art colorization method using a Causal Sparse DiT architecture to handle over 200 reference images with low latency, achieving industrial-grade accuracy and speed.

Authors:Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan R. Richter, Vladlen Koltun
Title: CoMotion: Concurrent Multi-person 3D Motion
Abstract:
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion.
中文: 本文提出一种单目三维多人姿态跟踪系统,通过从输入图像直接更新姿态实现拥挤场景中的时序一致性,在保持顶尖精度的同时实现了更快、更稳定的在线跟踪效果。
English: This paper presents a monocular 3D multi-person pose tracking system that achieves temporal coherence in crowded scenes through direct pose updates from input images, matching state-of-the-art accuracy while enabling faster and more robust online tracking.

Authors:Yuan Luo, Rudolf Hoffmann, Yan Xia, Olaf Wysocki, Benedikt Schwab, Thomas H. Kolbe, Daniel Cremers
Title: RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning
Abstract:
Semantic 3D city models are easily accessible worldwide, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via an SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean average precision (mAP) and 3.51% in mean average recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available at https://gpp-communication.github.io/RADLER.
中文: 本文提出了包含同步雷达图像对和语义3D城市模型的新数据集RadarCity,并开发了RADLER神经网络,通过结合对比自监督学习与语义3D先验知识来增强雷达目标检测性能,较现有方法实现了显著提升。
English: This paper introduces RadarCity, a novel dataset with synchronized radar-image pairs and semantic 3D city models, and proposes RADLER, a neural network that enhances radar object detection by integrating contrastive self-supervised learning with semantic 3D priors, achieving significant performance improvements over existing methods.

Authors:Yike Liu, Haipeng Li, Shuaicheng Liu, Bing Zeng
Title: CodingHomo: Bootstrapping Deep Homography With Video Coding
Abstract:
Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: https://github.com/liuyike422/CodingHomo.
中文: 本文提出CodingHomo无监督框架,通过利用视频编码中的运动向量并结合掩码引导融合与单应性估计模块,有效提升了单应性估计的精度,在鲁棒性和泛化性方面优于现有先进方法。
English: This paper introduces CodingHomo, an unsupervised framework that enhances homography estimation by leveraging motion vectors from video coding through its Mask-Guided Fusion and Homography Estimation modules, achieving superior robustness and outperforming existing methods.

Authors:Xiaojun Ye, Chun Wang, Yiren Song, Sheng Zhou, Liangcheng Li, Jiajun Bu
Title: FocusedAD: Character-centric Movie Audio Description
Abstract:
Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie understanding. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at https://github.com/Thorin215/FocusedAD.
中文摘要:FocusedAD是一种新颖的以角色为中心的框架,通过追踪角色并融合上下文线索来生成与情节相关的电影音频描述,在多个基准测试中实现了最先进的性能。
English Summary: FocusedAD is a novel character-centric framework that generates plot-relevant movie audio descriptions by tracking characters and incorporating contextual cues, achieving state-of-the-art performance across multiple benchmarks.

Authors:Shuo Li, Fang Liu, Zehua Hao, Xinyi Wang, Lingling Li, Xu Liu, Puhua Chen, Wenping Ma
Title: Logits DeConfusion with CLIP for Few-Shot Learning
Abstract:
With its powerful visual-language alignment capability, CLIP performs well in zero-shot and few-shot learning tasks. However, we found in experiments that CLIP's logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. Our MAF extracts features from different levels and fuses them uniformly to enhance feature representation. Our ICD learnably eliminates inter-class confusion in logits with a residual structure. Experimental results show that our method can significantly improve the classification performance and alleviate the inter-class confusion problem. The code is available at https://github.com/LiShuo1001/LDC.
中文: 提出的Logits DeConfusion方法通过多级适配器融合和类间去混淆模块,有效解决了CLIP在分类任务中的类间混淆问题,显著提升了分类性能。
English: The proposed Logits DeConfusion method, integrating Multi-level Adapter Fusion and Inter-Class Deconfusion modules, effectively mitigates CLIP's inter-class confusion in logits, significantly boosting classification accuracy.
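
A minimal sketch of the residual de-confusion idea; the two-layer correction network and the subtraction form are assumptions about the ICD module, shown only to make the mechanism concrete.

```python
import torch
import torch.nn as nn

class InterClassDeconfusion(nn.Module):
    """Learns a correction over the class dimension and removes it from the
    raw CLIP logits via a residual structure, so systematic confusion between
    similar classes can be cancelled out."""
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.correction = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, clip_logits):
        return clip_logits - self.correction(clip_logits)

deconfuse = InterClassDeconfusion(num_classes=100)
logits = torch.randn(8, 100)
clean_logits = deconfuse(logits)
```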

Authors:Tao Wen, Jiepeng Wang, Yabo Chen, Shugong Xu, Chi Zhang, Xuelong Li
Title: Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image
Abstract:
Accurate and generalizable metric depth estimation is crucial for various computer vision applications but remains challenging due to the diverse depth scales encountered in indoor and outdoor environments. In this paper, we introduce Metric-Solver, a novel sliding anchor-based metric depth estimation method that dynamically adapts to varying scene scales. Our approach leverages an anchor-based representation, where a reference depth serves as an anchor to separate and normalize the scene depth into two components: scaled near-field depth and tapered far-field depth. The anchor acts as a normalization factor, enabling the near-field depth to be normalized within a consistent range while mapping far-field depth smoothly toward zero. Through this approach, any depth from zero to infinity in the scene can be represented within a unified representation, effectively eliminating the need to manually account for scene scale variations. More importantly, for the same scene, the anchor can slide along the depth axis, dynamically adjusting to different depth scales. A smaller anchor provides higher resolution in the near-field, improving depth precision for closer objects while a larger anchor improves depth estimation in far regions. This adaptability enables the model to handle depth predictions at varying distances and ensure strong generalization across datasets. Our design enables a unified and adaptive depth representation across diverse environments. Extensive experiments demonstrate that Metric-Solver outperforms existing methods in both accuracy and cross-dataset generalization.
中文: Metric-Solver提出了一种基于滑动锚点的方法,通过将深度归一化为近场和远场分量来动态适应不同场景尺度,实现了统一的深度表示,并在多样环境中展现出卓越的泛化能力。
English: Metric-Solver introduces a sliding anchor-based method that dynamically adapts to varying scene scales by normalizing depth into near-field and far-field components, enabling unified depth representation and superior generalization across diverse environments.
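
A worked sketch of one possible anchor-based parameterization consistent with the description above (linearly scaled near-field, far field tapered toward zero); the exact functional form used by Metric-Solver is an assumption here.

```python
import torch

def encode_depth(depth, anchor):
    """depth: tensor of metric depths in [0, inf); anchor: reference depth.
    Near-field depths (<= anchor) are scaled linearly into [0, 1]; far-field
    depths are tapered smoothly toward zero as anchor / depth."""
    near = torch.clamp(depth / anchor, max=1.0)
    far = torch.where(depth > anchor,
                      anchor / depth.clamp(min=1e-6),
                      torch.ones_like(depth))
    return torch.stack([near, far], dim=-1)

def decode_depth(code, anchor):
    near, far = code[..., 0], code[..., 1]
    return torch.where(far < 1.0, anchor / far.clamp(min=1e-6), near * anchor)

# Sliding the anchor trades resolution: a small anchor resolves nearby objects
# finely, a large anchor spreads representation capacity over far regions.
d = torch.tensor([0.5, 2.0, 10.0, 200.0])
print(decode_depth(encode_depth(d, anchor=5.0), anchor=5.0))  # recovers d
```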

Authors:Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang
Title: DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
Abstract:
Given a single labeled example, in-context segmentation aims to segment corresponding objects. This setting, known as one-shot segmentation in few-shot learning, explores the segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method based on prompt-tuning to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insights are to enhance the features of the SAM's prompt encoder in segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on fused features and initial visual prompts. Next, a dual-branch design is provided by using the discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to adopt our proposed dual consistency method into the mask tube. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
中文: 本研究提出的DC-SAM方法通过增强视觉提示和双重一致性机制,将Segment Anything模型适配于上下文分割任务,在图像和新建立的视频基准测试中均取得了最优性能。
English: The proposed DC-SAM method adapts Segment Anything Models for in-context segmentation by enhancing visual prompts and implementing dual consistency mechanisms, achieving state-of-the-art performance on both image and newly established video benchmarks.

Authors:Yizhuo Wu, Francesco Fioranelli, Chang Gao
Title: RadMamba: Efficient Human Activity Recognition through Radar-based Micro-Doppler-Oriented Mamba State-Space Model
Abstract:
Radar-based human activity recognition (HAR) has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as Vision Transformer (ViT) and State-Space Model (SSM) architectures, offer improved modeling capabilities and have made efforts toward lightweight designs. However, their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model's 99.8% classification accuracy on Dataset DIAT with only 1/400 of its parameters and equals the leading models' 92.0% accuracy on Dataset CI4R with merely 1/10 of their parameters. In scenarios with continuous sequences of actions evaluated on Dataset UoG2020, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: https://github.com/lab-emi/AIRHAR.
中文: RadMamba提出了一种参数高效的Mamba状态空间模型,用于雷达人体活动识别,在多个数据集上以极低的计算复杂度实现了顶尖的分类精度。
English: RadMamba introduces a parameter-efficient Mamba State-Space Model for radar-based human activity recognition, achieving top-tier accuracy with drastically reduced computational complexity across multiple datasets.

Authors:Mohamad Dalal, Artur Xarles, Anthony Cioppa, Silvio Giancola, Marc Van Droogenbroeck, Bernard Ghanem, Albert Clapés, Sergio Escalera, Thomas B. Moeslund
Title: Action Anticipation from SoccerNet Football Video Broadcasts
Abstract:
Artificial intelligence has revolutionized the way we analyze sports videos, whether to understand the actions of games in long untrimmed videos or to anticipate the player's motion in future frames. Despite these efforts, little attention has been given to anticipating game actions before they occur. In this work, we introduce the task of action anticipation for football broadcast videos, which consists of predicting future actions in unobserved future frames, within a five- or ten-second anticipation window. To benchmark this task, we release a new dataset, namely the SoccerNet Ball Action Anticipation dataset, based on SoccerNet Ball Action Spotting. Additionally, we propose a Football Action ANticipation TRAnsformer (FAANTRA), a baseline method that adapts FUTR, a state-of-the-art action anticipation model, to predict ball-related actions. To evaluate action anticipation, we introduce new metrics, including mAP@$\delta$, which evaluates the temporal precision of predicted future actions, as well as mAP@$\infty$, which evaluates their occurrence within the anticipation window. We also conduct extensive ablation studies to examine the impact of various task settings, input configurations, and model architectures. Experimental results highlight both the feasibility and challenges of action anticipation in football videos, providing valuable insights into the design of predictive models for sports analytics. By forecasting actions before they unfold, our work will enable applications in automated broadcasting, tactical analysis, and player decision-making. Our dataset and code are publicly available at https://github.com/MohamadDalal/FAANTRA.
中文: 本文提出了足球视频中动作预测的新任务,通过FAANTRA模型和新建数据集在5-10秒窗口内实现动作预判,填补了体育视频分析中前瞻性研究的空白。
English: This paper introduces a novel task of anticipating football actions in broadcast videos, proposing the FAANTRA model and a new dataset to benchmark predictive capabilities within 5-10 second windows while addressing current gaps in sports video analysis.

Authors:Pascal Schlachter, Jonathan Fuss, Bin Yang
Title: Analysis of Pseudo-Labeling for Online Source-Free Universal Domain Adaptation
Abstract:
A domain (distribution) shift between training and test data often hinders the real-world performance of deep neural networks, necessitating unsupervised domain adaptation (UDA) to bridge this gap. Online source-free UDA has emerged as a solution for practical scenarios where access to source data is restricted and target data is received as a continuous stream. However, the open-world nature of many real-world applications additionally introduces category shifts, meaning that the source and target label spaces may differ. Online source-free universal domain adaptation (SF-UniDA) addresses this challenge. Existing methods mainly rely on self-training with pseudo-labels, yet the relationship between pseudo-labeling and adaptation outcomes remains unstudied. To bridge this gap, we conduct a systematic analysis through controlled experiments with simulated pseudo-labeling, offering valuable insights into pseudo-labeling for online SF-UniDA. Our findings reveal a substantial gap between the current state-of-the-art and the upper bound of adaptation achieved with perfect pseudo-labeling. Moreover, we show that a contrastive loss enables effective adaptation even with moderate pseudo-label accuracy, while a cross-entropy (CE) loss, though less robust to pseudo-label errors, achieves superior results when pseudo-labeling approaches perfection. Lastly, our findings indicate that pseudo-label accuracy is generally more crucial than quantity, suggesting that prioritizing fewer but high-confidence pseudo-labels is beneficial. Overall, our study highlights the critical role of pseudo-labeling in (online) SF-UniDA and provides actionable insights to drive future advancements in the field. Our code is available at https://github.com/pascalschlachter/PLAnalysis.
中文: 本研究系统分析了在线无源通用域自适应中的伪标签技术,揭示了当前方法与理想性能间的显著差距,并证明对比损失在中等精度伪标签下表现稳健,而交叉熵损失在接近完美伪标签时更优,强调伪标签质量比数量更为关键。
English: This study systematically analyzes pseudo-labeling in online source-free universal domain adaptation, revealing a significant performance gap from ideal conditions and demonstrating that contrastive loss adapts well with moderate pseudo-label accuracy while cross-entropy loss excels with near-perfect labels, emphasizing the importance of prioritizing pseudo-label quality over quantity.
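
A small sketch of the "fewer but high-confidence pseudo-labels" takeaway; the threshold rule below is an illustrative stand-in, not the selection scheme analyzed in the paper.

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits, threshold=0.9):
    """Keep only target samples whose softmax confidence exceeds the threshold,
    trading pseudo-label quantity for pseudo-label accuracy."""
    probs = F.softmax(logits, dim=-1)
    confidence, labels = probs.max(dim=-1)
    mask = confidence >= threshold
    return labels[mask], mask

logits = torch.randn(32, 10) * 3.0
pseudo, keep = select_pseudo_labels(logits)
print(f"kept {keep.sum().item()} of {len(keep)} samples")
```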

Authors:Linjuan Fan, Di Wen, Kunyu Peng, Kailun Yang, Jiaming Zhang, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiamin Wu, Xudong Han, Rainer Stiefelhagen
Title: Exploring Video-Based Driver Activity Recognition under Noisy Labels
Abstract:
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyperparameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.
Chinese: 本文提出了首个用于驾驶员活动识别的标签噪声学习方法,通过基于聚类的表示学习和无需超参数的样本选择策略,在Drive&Act数据集上实现了优于其他方法的性能。
English: This paper introduces the first label noise learning method for driver activity recognition, utilizing clustering-based representation learning and a hyperparameter-free sample selection strategy to achieve superior performance on the Drive&Act dataset.

Authors:Lvpan Cai, Haowei Wang, Jiayi Ji, YanShu ZhouMen, Yiwei Ma, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji
Title: Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach
Abstract:
The rise of AI-generated image editing tools has made localized forgeries increasingly realistic, posing challenges for visual content integrity. Although recent efforts have explored localized AIGC detection, existing datasets predominantly focus on object-level forgeries while overlooking broader scene edits in regions such as sky or ground. To address these limitations, we introduce BR-Gen, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, which are based on semantic calibration to ensure high-quality samples. BR-Gen is constructed through a fully automated Perception-Creation-Evaluation pipeline to ensure semantic coherence and visual realism. In addition, we further propose NFA-ViT, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying forgery-related features across the entire image. NFA-ViT mines heterogeneous regions in images, i.e., potential edited areas, via noise fingerprints. Subsequently, an attention mechanism is introduced to compel the interaction between normal and abnormal features, thereby propagating the generalization traces throughout the entire image, allowing subtle forgeries to influence a broader context and improving overall detection robustness. Extensive experiments demonstrate that BR-Gen constructs entirely new scenarios that are not covered by existing methods. Going a step further, NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks. All data and code are available at https://github.com/clpbc/BR-Gen.
中文摘要:本文提出了BR-Gen这一针对现有方法忽视的场景级图像伪造的大规模数据集,并开发了NFA-ViT噪声引导视觉Transformer,通过特征增强和跨区域注意力机制有效提升局部伪造检测能力。
English Summary: This paper introduces BR-Gen, a large-scale dataset addressing scene-level image forgeries overlooked by existing methods, and proposes NFA-ViT, a noise-guided vision transformer that enhances localized forgery detection through feature amplification and cross-region attention mechanisms.

Authors:Qishan Wang, Jia Guo, Shuyong Gao, Haofen Wang, Li Xiong, Junjie Hu, Hanqi Guo, Wenqiang Zhang
Title: Search is All You Need for Few-shot Anomaly Detection
Abstract:
Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies - support augmentation to enhance feature matching adaptability and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD's exceptional performance. Using only 1 normal image as support, our method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8% respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at https://github.com/Qiqigeww/VisionAD.
Chinese: VisionAD提出了一种简单高效的近邻搜索框架,用于少样本异常检测,无需训练或复杂提示工程,即在多个基准测试中显著超越现有最优方法。
English: VisionAD introduces a simple yet effective nearest-neighbor framework for few-shot anomaly detection, outperforming state-of-the-art methods across multiple benchmarks without requiring training or complex prompt engineering.
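
A minimal nearest-neighbor scoring sketch in the spirit of the framework above; feature extraction, augmentation, and the class-aware memory layout are omitted, and the tensor shapes are assumptions.

```python
import torch

def anomaly_scores(query_feats, memory_feats):
    """query_feats: (Nq, D) patch features of a test image; memory_feats:
    (Nm, D) features collected from the few normal support images. Each patch
    is scored by its distance to the nearest normal feature; the image-level
    score is the maximum patch score."""
    dist = torch.cdist(query_feats, memory_feats)   # (Nq, Nm)
    patch_scores = dist.min(dim=1).values
    return patch_scores, patch_scores.max()
```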

Authors:Yushuai Sun, Zikun Zhou, Dongmei Jiang, Yaowei Wang, Jun Yu, Guangming Lu, Wenjie Pei
Title: Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval
Abstract:
Asymmetric retrieval is a typical scenario in real-world retrieval systems, where compatible models of varying capacities are deployed on platforms with different resource configurations. Existing methods generally train pre-defined networks or subnetworks with capacities specifically designed for pre-determined platforms, using compatible learning. Nevertheless, these methods suffer from limited flexibility for multi-platform deployment. For example, when introducing a new platform into the retrieval systems, developers have to train an additional model at an appropriate capacity that is compatible with existing models via backward-compatible learning. In this paper, we propose a Prunable Network with self-compatibility, which allows developers to generate compatible subnetworks at any desired capacity through post-training pruning. Thus it allows the creation of a sparse subnetwork matching the resources of the new platform without additional training. Specifically, we optimize both the architecture and weight of subnetworks at different capacities within a dense network in compatible learning. We also design a conflict-aware gradient integration scheme to handle the gradient conflicts between the dense network and subnetworks during compatible learning. Extensive experiments on diverse benchmarks and visual backbones demonstrate the effectiveness of our method. Our code and model are available at https://github.com/Bunny-Black/PrunNet.
Chinese: 本文提出了一种具有自兼容性的可修剪网络,通过训练后修剪能够按需生成任意容量的兼容子网络,无需额外训练即可适配新平台部署。
English: This paper introduces a Prunable Network with self-compatibility that enables the generation of compatible subnetworks at any capacity through post-training pruning, eliminating the need for additional training when deploying to new platforms.
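
A hedged sketch of the deployment step only: generating a sparse subnetwork at an arbitrary capacity via global magnitude pruning. The actual method optimizes subnetwork architecture and weights during compatible training; this snippet just illustrates post-training pruning to a target budget.

```python
import torch

def prune_to_capacity(weights, keep_ratio):
    """weights: dict of weight tensors; keep_ratio: fraction of parameters to
    keep for the new platform. Zeroes the globally smallest-magnitude weights."""
    flat = torch.cat([w.abs().flatten() for w in weights.values()])
    num_prune = int(flat.numel() * (1.0 - keep_ratio))
    if num_prune == 0:
        return dict(weights)
    threshold = flat.kthvalue(num_prune).values
    return {name: w * (w.abs() > threshold) for name, w in weights.items()}

# Hypothetical usage: derive a 25%-capacity subnetwork without retraining.
dense = {"layer1": torch.randn(256, 256), "layer2": torch.randn(128, 256)}
sparse = prune_to_capacity(dense, keep_ratio=0.25)
```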

Authors:Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang
Title: GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision
Abstract:
We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.
Chinese: 本文提出GrabS,一种两阶段无监督方法,通过从数据集中学习以对象为中心的先验知识,并利用具身智能体查询发现对象,在无需人工标注的情况下实现了超越现有方法的3D分割性能。
English: This paper introduces GrabS, a two-stage unsupervised method that learns object-centric priors from datasets and uses an embodied agent to discover objects, achieving superior 3D segmentation performance over existing approaches without human labels.

Authors:Zongye Zhang, Wenrui Cai, Qingjie Liu, Yunhong Wang
Title: SkeletonX: Data-Efficient Skeleton-based Action Recognition via Cross-sample Feature Aggregation
Abstract:
While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degeneration. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and much fewer FLOPs. The code and data are available at: https://github.com/zzysteve/SkeletonX
中文摘要:本文提出SkeletonX轻量级训练框架,通过构建样本对和特征聚合模块,利用表演者差异性和动作共性,有效提升基于GCN的骨架动作识别模型在有限标注数据下的性能表现。
English summary: This paper introduces SkeletonX, a lightweight training pipeline that enhances skeleton action recognition with limited labeled data by leveraging performer variability and action commonality through sample pair construction and feature aggregation.
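
A small sketch of cross-sample aggregation along the two attributes named above (same action, different performer); the random pairing and linear blending are illustrative assumptions rather than the paper's exact modules.

```python
import torch

def aggregate_pairs(feats, action_labels, performer_ids, alpha=0.5):
    """feats: (N, D) sample features; for each sample, blend in a partner that
    shares the action label but comes from a different performer, emphasizing
    action commonality over performer-specific variation."""
    out = feats.clone()
    for i in range(feats.size(0)):
        candidates = torch.nonzero(
            (action_labels == action_labels[i]) & (performer_ids != performer_ids[i])
        ).flatten()
        if candidates.numel() > 0:
            j = candidates[torch.randint(candidates.numel(), (1,))].item()
            out[i] = alpha * feats[i] + (1.0 - alpha) * feats[j]
    return out
```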

Authors:Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, Yaohui Wang
Title: The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
Abstract:
The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. To address potential inaccuracies and ambiguous details in LLM-generated prompts, RAPO refines the naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts.
中文: RAPO是一种新颖的检索增强提示优化框架,通过双重优化分支改进用户提示,使其与训练提示分布对齐,从而提升生成视频的静态与动态质量。
English: RAPO is a novel retrieval-augmented prompt optimization framework that refines user prompts through dual optimization branches to enhance video generation quality by aligning them with training prompt distributions.
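
A toy sketch of the retrieval branch; the word-to-modifier graph, its construction, and the simple lookup below are assumptions made only to illustrate retrieval-augmented prompt refinement.

```python
def augment_prompt(prompt, relation_graph, max_modifiers=3):
    """Look up each word of the user prompt in a learned word -> modifiers
    graph and append a few retrieved modifiers, nudging the prompt toward the
    vocabulary and structure of the training prompts."""
    retrieved = []
    for word in prompt.lower().split():
        retrieved.extend(relation_graph.get(word, []))
    modifiers = list(dict.fromkeys(retrieved))[:max_modifiers]  # dedupe, keep order
    return prompt if not modifiers else prompt + ", " + ", ".join(modifiers)

# Hypothetical graph entries:
graph = {"cat": ["fluffy fur", "soft lighting"], "running": ["motion blur"]}
print(augment_prompt("a cat running in a garden", graph))
```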

Authors:Muhammad Shahid Muneer, Simon S. Woo
Title: Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset
Abstract:
In the past years, we have witnessed the remarkable success of Text-to-Image (T2I) models and their widespread use on the web. Extensive research in making T2I models produce hyper-realistic images has led to new concerns, such as generating Not-Safe-For-Work (NSFW) web content and polluting the web society. To help prevent misuse of T2I models and create a safer web environment for users, features like NSFW filters and post-hoc security checks are used in these models. However, recent work unveiled how these methods can easily fail to prevent misuse. In particular, adversarial attacks on text and image modalities can easily circumvent defensive measures. Moreover, there is currently no robust multimodal NSFW dataset that includes both prompt and image pairs and adversarial examples. This work first proposes a million-scale prompt and image dataset generated using open-source diffusion models. Second, we develop a multimodal defense to distinguish safe and NSFW text and images, which is robust against adversarial attacks and directly alleviates current challenges. Our extensive experiments show that our model performs well against existing SOTA NSFW detection methods in terms of accuracy and recall, drastically reducing the Attack Success Rate (ASR) in multimodal adversarial attack scenarios. Code: https://github.com/shahidmuneer/multimodal-nsfw-defense.
中文: 本研究通过构建百万规模的多模态数据集和开发鲁棒防御系统,有效提升了文本到图像模型对不良内容的检测能力,显著降低了多模态对抗攻击的成功率。
English: This research addresses the vulnerability of Text-to-Image models to adversarial attacks by introducing a million-scale multimodal dataset and a robust defense system that significantly enhances NSFW content detection and reduces attack success rates.

Authors:Xingwu Ji, Haochen Niu, Dexin Duan, Rendong Ying, Fei Wen, Peilin Liu
Title: An Online Adaptation Method for Robust Depth Estimation and Visual Odometry in the Open World
Abstract:
Recently, learning-based robotic navigation systems have gained extensive research attention and made significant progress. However, the diversity of open-world scenarios poses a major challenge for the generalization of such systems to practical scenarios. Specifically, learned systems for scene measurement and state estimation tend to degrade when the application scenarios deviate from the training data, resulting in unreliable depth and pose estimation. Toward addressing this problem, this work aims to develop a visual odometry system that can quickly adapt to diverse novel environments in an online manner. To this end, we construct a self-supervised online adaptation framework for monocular visual odometry aided by an online-updated depth estimation module. Firstly, we design a monocular depth estimation network with lightweight refiner modules, which enables efficient online adaptation. Then, we construct an objective for self-supervised learning of the depth estimation module based on the output of the visual odometry system and the contextual semantic information of the scene. Specifically, a sparse depth densification module and a dynamic consistency enhancement module are proposed to leverage camera poses and contextual semantics to generate pseudo-depths and valid masks for the online adaptation. Finally, we demonstrate the robustness and generalization capability of the proposed method in comparison with state-of-the-art learning-based approaches on urban, in-house datasets and a robot platform. Code is publicly available at: https://github.com/jixingwu/SOL-SLAM.
中文: 本文提出了一种自监督在线自适应单目视觉里程计框架,通过轻量级深度优化和场景语义整合,实现了对多样化新环境的高效泛化能力。
English: This paper introduces a self-supervised online adaptation framework for monocular visual odometry that enables efficient generalization to diverse environments through lightweight depth refinement and contextual semantic integration.
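The core of the online adaptation step is a self-supervised depth objective restricted to pixels where the pseudo-depth is trusted. The snippet below is an illustrative sketch of that idea, assuming pseudo-depths densified from VO output and a binary validity mask; the loss form and names are assumptions, not the paper's exact implementation.

```python
# Illustrative masked self-supervised depth loss: supervise the online-updated
# depth network only where a validity mask marks pseudo-depths as reliable.
import torch

def masked_pseudo_depth_loss(pred_depth, pseudo_depth, valid_mask, eps=1e-6):
    """L1 loss between predicted and pseudo depth, restricted to valid pixels."""
    diff = (pred_depth - pseudo_depth).abs() * valid_mask
    return diff.sum() / (valid_mask.sum() + eps)

pred = torch.rand(1, 1, 192, 640)                   # network prediction
pseudo = torch.rand(1, 1, 192, 640)                 # densified pseudo-depth from VO poses
mask = (torch.rand(1, 1, 192, 640) > 0.5).float()   # dynamic-consistency valid mask
print(masked_pseudo_depth_loss(pred, pseudo, mask).item())
```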

Authors:Amirhossein Dadashzadeh, Parsa Esmati, Majid Mirmehdi
Title: Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
Abstract:
Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: https://github.com/Plrbear/Co-Star
中文:Co-STAR框架通过将课程学习与教师模型和CLIP的协作自训练相结合,采用基于可靠性的权重函数和自适应课程正则化,有效解决了伪标签噪声和过度自信预测问题,在多个视频领域自适应基准测试中表现优异。
English: The Co-STAR framework enhances source-free unsupervised video domain adaptation by combining curriculum learning with collaborative self-training between a teacher model and CLIP, using reliability-based weighting and adaptive regularization to address noisy pseudo-labels and over-confident predictions, achieving superior performance across benchmarks.
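The reliability-based weighting can be pictured as follows: samples where the teacher and CLIP distributions agree in both directions receive weights near 1, while disagreeing (uncertain) samples are down-weighted. The symmetric-KL form and temperature below are assumptions chosen for illustration, not Co-STAR's exact rule.

```python
# Sketch of a reliability weight from bidirectional teacher-CLIP prediction alignment.
import torch
import torch.nn.functional as F

def reliability_weight(teacher_logits, clip_logits, tau=1.0):
    p = F.softmax(teacher_logits / tau, dim=-1)
    q = F.softmax(clip_logits / tau, dim=-1)
    # Symmetric KL measures bidirectional disagreement between the two models.
    kl_pq = (p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())).sum(-1)
    kl_qp = (q * (q.clamp_min(1e-8).log() - p.clamp_min(1e-8).log())).sum(-1)
    return torch.exp(-0.5 * (kl_pq + kl_qp))   # in (0, 1], higher = more reliable

w = reliability_weight(torch.randn(8, 12), torch.randn(8, 12))
print(w.shape, w.min().item(), w.max().item())
```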

Authors:Mansoor Hayat, Supavadee Aramvith, Subrata Bhattacharjee, Nouman Ahmad
Title: Attention GhostUNet++: Enhanced Segmentation of Adipose Tissue and Liver in CT Images
Abstract:
Accurate segmentation of abdominal adipose tissue, including subcutaneous (SAT) and visceral adipose tissue (VAT), along with liver segmentation, is essential for understanding body composition and associated health risks such as type 2 diabetes and cardiovascular disease. This study proposes Attention GhostUNet++, a novel deep learning model incorporating Channel, Spatial, and Depth Attention mechanisms into the Ghost UNet++ bottleneck for automated, precise segmentation. Evaluated on the AATTCT-IDS and LiTS datasets, the model achieved Dice coefficients of 0.9430 for VAT, 0.9639 for SAT, and 0.9652 for liver segmentation, surpassing baseline models. Despite minor limitations in boundary detail segmentation, the proposed model significantly enhances feature refinement, contextual understanding, and computational efficiency, offering a robust solution for body composition analysis. The implementation of the proposed Attention GhostUNet++ model is available at: https://github.com/MansoorHayat777/Attention-GhostUNetPlusPlus.
中文摘要:本研究提出Attention GhostUNet++深度学习模型,通过集成多种注意力机制实现腹部脂肪组织和肝脏的精准分割,在保持高计算效率的同时显著超越了基准模型的性能表现。
English Summary: The study introduces Attention GhostUNet++, a deep learning model integrating multiple attention mechanisms for precise abdominal adipose and liver tissue segmentation, achieving superior performance over baseline models with enhanced computational efficiency.

Authors:Huaxiang Zhang, Hao Zhang, Aoran Mei, Zhongxue Gan, Guo-Niu Zhu
Title: SO-DETR: Leveraging Dual-Domain Features and Knowledge Distillation for Small Object Detection
Abstract:
Detection Transformer-based methods have achieved significant advancements in general object detection. However, challenges remain in effectively detecting small objects. One key difficulty is that existing encoders struggle to efficiently fuse low-level features. Additionally, the query selection strategies are not effectively tailored for small objects. To address these challenges, this paper proposes an efficient model, Small Object Detection Transformer (SO-DETR). The model comprises three key components: a dual-domain hybrid encoder, an enhanced query selection mechanism, and a knowledge distillation strategy. The dual-domain hybrid encoder integrates spatial and frequency domains to fuse multi-scale features effectively. This approach enhances the representation of high-resolution features while maintaining relatively low computational overhead. The enhanced query selection mechanism optimizes query initialization by dynamically selecting high-scoring anchor boxes using expanded IoU, thereby improving the allocation of query resources. Furthermore, by incorporating a lightweight backbone network and implementing a knowledge distillation strategy, we develop an efficient detector for small objects. Experimental results on the VisDrone-2019-DET and UAVVaste datasets demonstrate that SO-DETR outperforms existing methods with similar computational demands. The project page is available at https://github.com/ValiantDiligent/SO_DETR.
中文摘要:本文提出SO-DETR模型,通过双域混合编码器、优化查询选择机制和知识蒸馏策略,有效提升小物体检测性能,在基准数据集上表现优于现有方法。
English Summary: The paper introduces SO-DETR, a transformer-based model that enhances small object detection through a dual-domain hybrid encoder, improved query selection, and knowledge distillation, achieving superior performance on benchmark datasets.

Authors:Ziqi Pang, Xin Xu, Yu-Xiong Wang
Title: Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
Abstract:
With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at https://github.com/ziqipang/ADDP.
中文: 本研究通过分析生成扩散过程与感知任务的匹配差距,针对不同去噪步骤提出定制化学习目标和数据增强方法,显著提升了扩散模型在判别任务中的感知精度和交互性,无需改动架构即可实现最优性能。
English: This study enhances generative diffusion models for discriminative tasks by addressing critical gaps in alignment, focusing on tailored learning objectives for different denoising steps and diffusion-specific data augmentation to improve perception accuracy and interactivity, achieving state-of-the-art results without architectural changes.
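Finding (1), that earlier (high-noise) denoising steps contribute disproportionately to perception quality, suggests weighting the training loss by timestep. The schedule below is only an assumed, illustrative monotone weighting; the paper's actual objective may differ.

```python
# Hedged sketch of a timestep-weighted perception training objective.
import torch

def timestep_weighted_loss(per_step_losses, timesteps, T=1000):
    """per_step_losses: (B,) loss per sample; timesteps: (B,) sampled t in [0, T)."""
    # Earlier denoising steps correspond to larger t (higher noise) and get larger weights.
    weights = 1.0 + timesteps.float() / T        # in [1, 2], assumed schedule
    return (weights * per_step_losses).mean()

losses = torch.rand(16)
t = torch.randint(0, 1000, (16,))
print(timestep_weighted_loss(losses, t).item())
```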

Authors:Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
Title: SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
Abstract:
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training could lead to significant improvements on generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image could be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
中文:SimpleAR是一种简单的自回归框架,仅用5亿参数即可生成高保真的1024x1024图像,在多项基准测试中表现优异,并通过监督微调和GRPO训练显著提升生成质量,同时借助推理加速技术将生成时间缩短至约14秒。
English: SimpleAR is a straightforward autoregressive framework that generates high-fidelity 1024x1024 images with just 0.5B parameters, achieving competitive benchmark results and significant improvements through SFT and GRPO training, while reducing generation time to about 14 seconds with inference acceleration.

Authors:An Zhao, Shengyuan Zhang, Ling Yang, Zejian Li, Jiale Wu, Haoran Xu, AnYang Wei, Perry Pengyun GU, Lingyun Sun
Title: Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion
Abstract:
The application of diffusion models in 3D LiDAR scene completion is limited due to diffusion's slow sampling speed. Score distillation accelerates diffusion sampling but with performance degradation, while post-training with direct preference optimization (DPO) boosts performance using preference data. This paper proposes Distillation-DPO, a novel diffusion distillation framework for LiDAR scene completion with preference alignment. First, the student model generates paired completion scenes with different initial noises. Second, using LiDAR scene evaluation metrics as preference, we construct winning and losing sample pairs. Such a construction is reasonable, since most LiDAR scene metrics are informative but non-differentiable and thus cannot be optimized directly. Third, Distillation-DPO optimizes the student model by exploiting the difference in score functions between the teacher and student models on the paired completion scenes. This procedure is repeated until convergence. Extensive experiments demonstrate that, compared to state-of-the-art LiDAR scene completion diffusion models, Distillation-DPO achieves higher-quality scene completion while accelerating the completion speed by more than 5-fold. To the best of our knowledge, our method is the first to explore adopting preference learning in distillation, and it provides insights into preference-aligned distillation. Our code is publicly available at https://github.com/happyw1nd/DistillationDPO.
Chinese: 本文提出Distillation-DPO,一种新颖的扩散蒸馏框架,通过策略优化对齐偏好来增强3D LiDAR场景补全,相比现有方法实现了5倍以上的速度提升和更高质量的结果。
English: This paper introduces Distillation-DPO, a novel diffusion distillation framework that enhances 3D LiDAR scene completion by aligning preferences through policy optimization, achieving over 5-fold speed improvement and higher quality results compared to existing methods.
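The preference-aligned distillation step can be pictured as a DPO-style objective built on teacher-student score differences for the winning and losing completions. The loss form, score-distance, and beta below are assumptions for illustration only, not the paper's equations.

```python
# Rough sketch: a non-differentiable LiDAR metric ranks two student completions into
# winner/loser, then a DPO-style loss prefers a smaller teacher-student score gap on
# the winner than on the loser.
import torch
import torch.nn.functional as F

def distill_dpo_loss(s_teacher_w, s_student_w, s_teacher_l, s_student_l, beta=0.1):
    # Teacher-student score-function gap on winning and losing completions.
    gap_w = (s_teacher_w - s_student_w).pow(2).flatten(1).mean(1)
    gap_l = (s_teacher_l - s_student_l).pow(2).flatten(1).mean(1)
    # Push the student to track the teacher more closely on the winning sample.
    return -F.logsigmoid(beta * (gap_l - gap_w)).mean()

shape = (2, 3, 64, 64)  # toy tensors standing in for score predictions on scenes
loss = distill_dpo_loss(torch.randn(shape), torch.randn(shape),
                        torch.randn(shape), torch.randn(shape))
print(loss.item())
```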

Authors:Lewis Clifton, Xin Tian, Duangdao Palasuwan, Phandee Watanaboonyongcharoen, Ponlapat Rojnuckarin, Nantheera Anantrasirichai
Title: Mamba-Based Ensemble learning for White Blood Cell Classification
Abstract:
White blood cell (WBC) classification assists in assessing immune health and diagnosing various diseases, yet manual classification is labor-intensive and prone to inconsistencies. Recent advancements in deep learning have shown promise over traditional methods; however, challenges such as data imbalance and the computational demands of Transformer-based models, which do not scale well with input size, limit their practical application. This paper introduces a novel framework that leverages Mamba models integrated with ensemble learning to improve WBC classification. Mamba models, known for their linear complexity, provide a scalable alternative to Transformer-based approaches, making them suitable for deployment in resource-constrained environments. Additionally, we introduce a new WBC dataset, Chula-WBC-8, for benchmarking. Our approach not only validates the effectiveness of Mamba models in this domain but also demonstrates their potential to significantly enhance classification efficiency without compromising accuracy. The source code can be found at https://github.com/LewisClifton/Mamba-WBC-Classification.
中文:本文提出了一种结合Mamba模型与集成学习的新框架,通过线性复杂度模型提升白细胞分类效率,并发布了Chula-WBC-8基准数据集,为资源受限环境提供了可行的解决方案。
English: This paper introduces a novel framework using Mamba models with ensemble learning to improve white blood cell classification, offering a scalable and efficient alternative to Transformer-based methods while introducing a new benchmark dataset.
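The ensemble-learning component can be as simple as soft voting over member models. The sketch below shows standard probability averaging across several classifiers (e.g., Mamba-based backbones); the member architectures and equal weighting are illustrative assumptions.

```python
# Minimal soft-voting ensemble: average member softmax probabilities, then argmax.
import torch
import torch.nn.functional as F

def soft_vote(list_of_logits):
    """list_of_logits: list of (B, C) tensors, one per ensemble member."""
    probs = torch.stack([F.softmax(l, dim=-1) for l in list_of_logits], dim=0)
    return probs.mean(dim=0).argmax(dim=-1)   # (B,) predicted WBC class indices

members = [torch.randn(5, 8) for _ in range(3)]  # 3 models, 5 images, 8 WBC classes
print(soft_vote(members))
```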

Authors:Panagiotis Agrafiotis, Begüm Demir
Title: Deep Learning-based Bathymetry Retrieval without In-situ Depths using Remote Sensing Imagery and SfM-MVS DSMs with Data Gaps
Abstract:
Accurate, detailed, and high-frequency bathymetry is crucial for shallow seabed areas facing intense climatological and anthropogenic pressures. Current methods utilizing airborne or satellite optical imagery to derive bathymetry primarily rely on either SfM-MVS with refraction correction or Spectrally Derived Bathymetry (SDB). However, SDB methods often require extensive manual fieldwork or costly reference data, while SfM-MVS approaches face challenges even after refraction correction. These include depth data gaps and noise in environments with homogeneous visual textures, which hinder the creation of accurate and complete Digital Surface Models (DSMs) of the seabed. To address these challenges, this work introduces a methodology that combines the high-fidelity 3D reconstruction capabilities of SfM-MVS methods and state-of-the-art refraction correction techniques with the spectral analysis capabilities of a new deep learning-based method for bathymetry prediction. This integration enables a synergistic approach where SfM-MVS derived DSMs with data gaps are used as training data to generate complete bathymetric maps. In this context, we propose Swin-BathyUNet, which combines U-Net with Swin Transformer self-attention layers and a cross-attention mechanism, specifically tailored for SDB. Swin-BathyUNet is designed to improve bathymetric accuracy by capturing long-range spatial relationships and can also function as a standalone solution for standard SDB with various training depth data, independent of the SfM-MVS output. Experimental results at two very different test sites in the Mediterranean and Baltic Seas demonstrate the effectiveness of the proposed approach, showing improvements in bathymetric accuracy, detail, coverage, and noise reduction in the predicted DSM. The code is available at https://github.com/pagraf/Swin-BathyUNet.
中文: 本研究提出Swin-BathyUNet深度学习模型,通过结合SfM-MVS与光谱分析及折射校正技术,在地中海和波罗的海的实验中生成高精度、全覆盖的海底地形图,显著提升了测深精度并降低了噪声干扰。
English: This study introduces Swin-BathyUNet, a deep learning model that integrates SfM-MVS with spectral analysis and refraction correction to produce accurate, high-coverage seabed maps, demonstrating improved bathymetric precision and reduced noise in Mediterranean and Baltic Sea tests.

Authors:Liu Yang, Huiyu Duan, Yucheng Zhu, Xiaohong Liu, Lu Liu, Zitong Xu, Guangji Ma, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Title: Omni$^2$: Unifying Omnidirectional Image Generation and Editing in an Omni Model
Abstract:
$360^{\circ}$ omnidirectional images (ODIs) have gained considerable attention recently, and are widely used in various virtual reality (VR) and augmented reality (AR) applications. However, capturing such images is expensive and requires specialized equipment, making ODI synthesis increasingly important. While common 2D image generation and editing methods are rapidly advancing, these models struggle to deliver satisfactory results when generating or editing ODIs due to the unique format and broad 360$^{\circ}$ Field-of-View (FoV) of ODIs. To bridge this gap, we construct Any2Omni, the first comprehensive ODI generation-editing dataset, comprising 60,000+ training samples covering diverse input conditions and up to 9 ODI generation and editing tasks. Built upon Any2Omni, we propose an Omni model for Omni-directional image generation and editing (Omni$^2$), with the capability of handling various ODI generation and editing tasks under diverse input conditions using one model. Extensive experiments demonstrate the superiority and effectiveness of the proposed Omni$^2$ model for both the ODI generation and editing tasks. Both the Any2Omni dataset and the Omni$^2$ model are publicly available at: https://github.com/IntMeGroup/Omni2.
中文: 作者提出了首个全面的全向图像生成与编辑数据集Any2Omni,并开发了Omni²模型,该单一模型能有效处理多种全向图像任务,实验证明其具有卓越性能。
English: The authors introduce Any2Omni, a comprehensive dataset for omnidirectional image (ODI) generation and editing, and propose the Omni² model, which effectively handles multiple ODI tasks using a single framework, demonstrating superior performance in experiments.

Authors:Jingkun Chen, Haoran Duan, Xiao Zhang, Boyan Gao, Tao Tan, Vicente Grau, Jungong Han
Title: From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation
Abstract:
Medical image segmentation remains challenging due to the high cost of pixel-level annotations for training. In the context of weak supervision, clinician gaze data captures regions of diagnostic interest; however, its sparsity limits its use for segmentation. In contrast, vision-language models (VLMs) provide semantic context through textual descriptions but lack the explanation precision required. Recognizing that neither source alone suffices, we propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths. Our key insight is that gaze data indicates where clinicians focus during diagnosis, while VLMs explain why those regions are significant. To implement this, the teacher model first learns from gaze points enhanced by VLM-generated descriptions of lesion morphology, establishing a foundation for guiding the student model. The teacher then directs the student through three strategies: (1) Multi-scale feature alignment to fuse visual cues with textual semantics; (2) Confidence-weighted consistency constraints to focus on reliable predictions; (3) Adaptive masking to limit error propagation in uncertain areas. Experiments on the Kvasir-SEG, NCI-ISBI, and ISIC datasets show that our method achieves Dice scores of 80.78%, 80.53%, and 84.22%, respectively, improving by 3-5% over gaze baselines without increasing the annotation burden. By preserving correlations among predictions, gaze data, and lesion descriptions, our framework also maintains clinical interpretability. This work illustrates how integrating human visual attention with AI-generated semantic context can effectively overcome the limitations of individual weak supervision signals, thereby advancing the development of deployable, annotation-efficient medical AI systems. Code is available at: https://github.com/jingkunchen/FGI.git.
中文摘要:本研究提出一种师生框架,通过整合临床医生的注视数据和视觉语言模型来增强医学图像分割效果,在多个数据集上实现Dice分数3-5%的提升,同时保持了临床可解释性。
English Summary: This study introduces a teacher-student framework that synergistically integrates clinician gaze data with vision-language models to enhance medical image segmentation, achieving 3-5% Dice score improvements across multiple datasets while preserving clinical interpretability.
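Strategy (2), confidence-weighted consistency, can be sketched as weighting the student-teacher agreement term by the teacher's per-pixel confidence. The weighting (max of foreground/background probability) and loss form below are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a confidence-weighted consistency constraint for segmentation.
import torch
import torch.nn.functional as F

def confidence_weighted_consistency(student_logits, teacher_logits):
    """Both inputs: (B, 1, H, W) binary segmentation logits."""
    t_prob = torch.sigmoid(teacher_logits)
    confidence = torch.max(t_prob, 1.0 - t_prob)       # per-pixel teacher confidence
    per_pixel = F.binary_cross_entropy_with_logits(
        student_logits, (t_prob > 0.5).float(), reduction="none")
    return (confidence * per_pixel).mean()

loss = confidence_weighted_consistency(torch.randn(2, 1, 64, 64),
                                       torch.randn(2, 1, 64, 64))
print(loss.item())
```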

Authors:Yuezhe Yang, Boyu Yang, Yaqian Wang, Yang He, Xingbo Dong, Zhe Jin
Title: Explicit and Implicit Representations in AI-based 3D Reconstruction for Radiology: A Systematic Review
Abstract:
The demand for high-quality medical imaging in clinical practice and assisted diagnosis has made 3D reconstruction in radiological imaging a key research focus. Artificial intelligence (AI) has emerged as a promising approach to enhancing reconstruction accuracy while reducing acquisition and processing time, thereby minimizing patient radiation exposure and discomfort and ultimately benefiting clinical diagnosis. This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches based on their underlying principles. Explicit methods include point-based, volume-based, and Gaussian representations, while implicit methods encompass implicit prior embedding and neural radiance fields. Additionally, we examine commonly used evaluation metrics and benchmark datasets. Finally, we discuss the current state of development, key challenges, and future research directions in this evolving field. Our project available on: https://github.com/Bean-Young/AI4Radiology.
中文: 本综述探讨了放射影像中基于人工智能的先进三维重建算法,将其分为显式和隐式方法,并讨论了评估指标、当前挑战及未来方向,旨在提升临床诊断效果。
English: This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches, while also addressing evaluation metrics, challenges, and future directions to enhance clinical diagnosis.

Authors:Henghui Ding, Chang Liu, Nikhila Ravi, Shuting He, Yunchao Wei, Song Bai, Philip Torr, Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang, Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma, Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang, Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu, Haobo Yuan, Xiangtai Li, Tao Zhang, Lu Qi, Ming-Hsuan Yang
Title: PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
Abstract:
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state-of-the-art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.
中文总结:本报告概述了在CVPR 2025举办的第四届PVUW挑战赛,重点介绍了MOSE和MeViS两个赛道的成果、参与方法及旨在更好反映真实场景的新数据集,为复杂视频分割领域提供了重要见解。
English Summary: This report summarizes the 4th PVUW Challenge at CVPR 2025, detailing the MOSE and MeViS tracks' outcomes, methodologies, and new datasets that advance complex video segmentation research.

Authors:Yeongmin Kim, Sotiris Anagnostidis, Yuming Du, Edgar Schönfeld, Jonas Kohler, Markos Georgopoulos, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu
Title: Autoregressive Distillation of Diffusion Transformers
Abstract:
Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD on class-conditioned generation on ImageNet and on T2I synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256. Moreover, ARD reaches an FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: https://github.com/alsdudrla10/ARD.
Chinese: 提出的自回归蒸馏(ARD)方法通过利用历史ODE轨迹数据和改进的Transformer架构,有效解决了扩散模型中的曝光偏差问题,在极小计算开销下实现了更优的图像生成质量与效率。
English: The proposed AutoRegressive Distillation (ARD) method addresses exposure bias in diffusion models by utilizing historical ODE trajectory data with modified transformer architecture, achieving superior image generation quality and efficiency with minimal computational overhead.
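The block-wise causal attention mask mentioned above restricts each trajectory step's tokens to attend only to their own block and earlier blocks. The snippet below is an illustrative construction of such a mask with toy block and token counts; ARD's exact masking may differ in detail.

```python
# Illustrative block-wise causal attention mask over a trajectory history,
# where each trajectory state contributes a block of K tokens.
import torch

def blockwise_causal_mask(num_blocks, tokens_per_block):
    n = num_blocks * tokens_per_block
    block_id = torch.arange(n) // tokens_per_block          # block index per token
    # allowed[i, j] is True when token i may attend to token j.
    allowed = block_id.unsqueeze(1) >= block_id.unsqueeze(0)
    return allowed

mask = blockwise_causal_mask(num_blocks=4, tokens_per_block=3)
print(mask.int())   # lower block-triangular pattern of 3x3 all-ones blocks
```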

Authors:Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Title: UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
Abstract:
This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720P (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
中文: UniAnimate-DiT项目基于Wan2.1模型,通过LoRA微调和轻量级姿态编码器实现了高保真、时序一致的人像动画,并展现出从480p到720p的强大泛化能力。
English: UniAnimate-DiT utilizes the Wan2.1 model with LoRA fine-tuning and a lightweight pose encoder to achieve high-fidelity, temporally consistent human animations that generalize well from 480p to 720p resolution.
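The lightweight pose encoder built from stacked 3D convolutions can be sketched as follows, mapping a driving-pose video to a spatio-temporal feature volume. Channel sizes, strides, and activations are illustrative assumptions, not the released model's configuration.

```python
# Minimal sketch of a pose encoder made of stacked 3D convolutions.
import torch
import torch.nn as nn

class PoseEncoder3D(nn.Module):
    def __init__(self, in_ch=3, dims=(32, 64, 128)):
        super().__init__()
        layers, c = [], in_ch
        for d in dims:
            layers += [nn.Conv3d(c, d, kernel_size=3, stride=(1, 2, 2), padding=1),
                       nn.SiLU()]
            c = d
        self.net = nn.Sequential(*layers)

    def forward(self, pose_video):        # (B, C, T, H, W)
        return self.net(pose_video)       # downsampled spatially, same temporal length

enc = PoseEncoder3D()
feat = enc(torch.randn(1, 3, 16, 96, 96))
print(feat.shape)   # torch.Size([1, 128, 16, 12, 12])
```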

Authors:Xinning Chai, Yao Zhang, Yuxuan Zhang, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song
Title: Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution
Abstract:
Convolutional neural networks (CNNs) have been widely used in efficient image super-resolution. However, for CNN-based methods, performance gains often require deeper networks and larger feature maps, which increase complexity and inference costs. Inspired by LoRA's success in fine-tuning large language models, we explore its application to lightweight models and propose Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA), which improves model performance without increasing architectural complexity or inference costs. Specifically, we integrate ConvLoRA into the efficient SR network SPAN by replacing the SPAB module with the proposed SConvLB module and incorporating ConvLoRA layers into both the pixel shuffle block and its preceding convolutional layer. DSCLoRA leverages low-rank decomposition for parameter updates and employs a spatial feature affinity-based knowledge distillation strategy to transfer second-order statistical information from teacher models (pre-trained SPAN) to student models (ours). This method preserves the core knowledge of lightweight models and facilitates optimal solution discovery under certain conditions. Experiments on benchmark datasets show that DSCLoRA improves PSNR and SSIM over SPAN while maintaining its efficiency and competitive image quality. Notably, DSCLoRA ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Our code and models are made publicly available at https://github.com/Yaozzz666/DSCF-SR.
Chinese: 提出的DSCLoRA方法通过结合低秩适应和知识蒸馏,提升了轻量级超分辨率模型的性能,在不增加计算成本的情况下获得了更高的PSNR和SSIM分数,并在NTIRE 2025挑战赛中荣获第一名。
English: The proposed DSCLoRA method enhances the performance of lightweight super-resolution models by integrating low-rank adaptation and knowledge distillation, achieving superior PSNR and SSIM scores without increasing computational costs, as evidenced by its first-place ranking in the NTIRE 2025 challenge.
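The ConvLoRA idea of adding a trainable low-rank update to a frozen convolution can be sketched as below: a rank-r 3x3 convolution followed by a 1x1 convolution forms the update. Rank, scaling, and placement are illustrative assumptions rather than the DSCLoRA configuration.

```python
# Sketch of a ConvLoRA-style layer: frozen base conv + trainable low-rank update.
import torch
import torch.nn as nn

class ConvLoRA(nn.Module):
    def __init__(self, in_ch, out_ch, rank=4, scale=1.0):
        super().__init__()
        self.base = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.base.requires_grad_(False)                   # frozen (pretrained) weights
        self.down = nn.Conv2d(in_ch, rank, 3, padding=1)  # low-rank "A"
        self.up = nn.Conv2d(rank, out_ch, 1)              # low-rank "B"
        nn.init.zeros_(self.up.weight)                    # start as an identity update
        nn.init.zeros_(self.up.bias)
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = ConvLoRA(48, 48)
print(layer(torch.randn(1, 48, 64, 64)).shape)   # torch.Size([1, 48, 64, 64])
```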

Authors:Xinyi Liu, Xiaoyi Zhang, Ziyun Zhang, Yan Lu
Title: UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Abstract:
Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, GUI instruction grounding, which maps a user instruction to the location of the corresponding element on the given screenshot, remains a critical challenge, particularly due to limited public training datasets and resource-intensive manual instruction data annotation. In this paper, we delve into unexplored challenges in this task including element-to-screen ratio, unbalanced element type, and implicit instruction. To address these challenges, we introduce a large-scale data synthesis pipeline UI-E2I-Synth for generating varying complex instruction datasets using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the advancements of the proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding. We will release the corresponding artifacts at https://microsoft.github.io/FIVE-UI-Evol/.

Authors:Lijun Sheng, Jian Liang, Zilei Wang, Ran He
Title: R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
Abstract:
Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at https://github.com/TomSheng21/R-TPT.
中文: 针对视觉语言模型易受对抗攻击的问题,本文提出R-TPT方法,通过优化熵目标和加权集成策略,在无需标注数据的情况下实现推理阶段的灵活防御。
English: Vision-language models face heightened adversarial risks, so this paper introduces R-TPT, a test-time prompt tuning method that strengthens defenses without labeled data by optimizing entropy and employing weighted ensembling.
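The two ingredients described above, pointwise entropy minimization and reliability-weighted ensembling over augmented views, can be sketched as follows. The reliability score (negative entropy) and temperatures are assumptions for illustration, not the released R-TPT implementation.

```python
# Hedged sketch: minimize per-view pointwise entropy, then ensemble views
# weighted by their reliability (low-entropy views count more).
import torch
import torch.nn.functional as F

def view_entropies(logits):               # logits: (V, C) over V augmented views
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(-1)        # (V,)

def rtpt_objective_and_prediction(logits):
    ent = view_entropies(logits)
    objective = ent.mean()                                # pointwise entropy minimization
    weights = F.softmax(-ent, dim=0)                      # reliability weights per view
    probs = F.softmax(logits, dim=-1)
    ensembled = (weights.unsqueeze(-1) * probs).sum(0)    # (C,)
    return objective, ensembled.argmax()

obj, pred = rtpt_objective_and_prediction(torch.randn(16, 10))
print(obj.item(), pred.item())
```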

Authors:P. Tomkiewicz, J. Jaworski, P. Zielonka, A. Wilinski
Title: K-means Enhanced Density Gradient Analysis for Urban and Transport Metrics Using Multi-Modal Satellite Imagery
Abstract:
This paper presents a novel computational approach for evaluating urban metrics through density gradient analysis using multi-modal satellite imagery, with applications including public transport and other urban systems. By combining optical and Synthetic Aperture Radar (SAR) data, we develop a method to segment urban areas, identify urban centers, and quantify density gradients. Our approach calculates two key metrics: the density gradient coefficient ($α$) and the minimum effective distance (LD) at which density reaches a target threshold. We further employ machine learning techniques, specifically K-means clustering, to objectively identify uniform and high-variability regions within density gradient plots. We demonstrate that these metrics provide an effective screening tool for public transport analyses by revealing the underlying urban structure. Through comparative analysis of two representative cities with contrasting urban morphologies (monocentric vs polycentric), we establish relationships between density gradient characteristics and public transport network topologies. Cities with clear density peaks in their gradient plots indicate distinct urban centers requiring different transport strategies than those with more uniform density distributions. This methodology offers urban planners a cost-effective, globally applicable approach to preliminary public transport assessment using freely available satellite data. The complete implementation, with additional examples and documentation, is available in an open-source repository under the MIT license at https://github.com/nexri/Satellite-Imagery-Urban-Analysis.
中文: 本文提出了一种利用多模态卫星影像和机器学习分析城市密度梯度的新计算方法,为基于城市结构的公共交通系统评估提供了一种经济有效的工具。
English: This paper introduces a novel computational method using multi-modal satellite imagery and machine learning to analyze urban density gradients, providing a cost-effective tool for evaluating public transport systems based on urban structure.
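The K-means step on the density gradient plot can be pictured with a toy example: sample (distance-from-center, local gradient variability) points from a synthetic exponential density profile and cluster them into comparatively uniform versus high-variability regions. The synthetic profile and feature choice are assumptions for illustration only.

```python
# Toy sketch of K-means clustering of a density gradient profile.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
distance = np.linspace(0, 20, 200)                                    # km from the urban center
density = np.exp(-0.25 * distance) + 0.05 * rng.standard_normal(200)  # alpha ~ 0.25 (toy)

# Local variability of the gradient, estimated from finite differences.
local_var = np.abs(np.gradient(density))
features = np.column_stack([distance, local_var])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print("cluster sizes:", np.bincount(labels))
```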

Authors:René Peinl
Title: Using LLMs as prompt modifier to avoid biases in AI image generators
Abstract:
This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at https://iisys-hof.github.io/llm-prompt-img-gen/

Authors:Jiaxin Huang, Sheng Miao, BangBang Yang, Yuewen Ma, Yiyi Liao
Title: Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
Abstract:
Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion. See our project page: https://xdimlab.github.io/Vivid4D/.

Authors:Elman Ghazaei, Erchan Aptoula
Title: Change State Space Models for Remote Sensing Change Detection
Abstract:
Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.
Chinese: 变化状态空间模型(CSSM)是一种专为遥感变化检测设计的新型架构,通过聚焦双时相图像间的相关变化来减少参数和计算复杂度,在基准数据集上以更低计算成本超越了卷积网络、视觉Transformer和基于Mamba的模型。
English: The Change State Space Model (CSSM) is a novel architecture designed specifically for remote sensing change detection, which efficiently focuses on relevant changes between bi-temporal images to reduce parameters and computational complexity while outperforming ConvNets, ViTs, and Mamba-based models on benchmark datasets.

Authors:Alireza Salehi, Mohammadreza Salehi, Reshad Hosseini, Cees G. M. Snoek, Makoto Yamada, Mohammad Sabokrou
Title: Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection
Abstract:
Anomaly Detection involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible. Recently, the rich pretraining knowledge of CLIP has shown promising zero-shot generalization in detecting anomalies without the need for training samples from target domains. However, CLIP's coarse-grained image-text alignment limits localization and detection performance for fine-grained anomalies due to: (1) spatial misalignment, and (2) the limited sensitivity of global features to local anomalous patterns. In this paper, we propose Crane which tackles both problems. First, we introduce a correlation-based attention module to retain spatial alignment more accurately. Second, to boost the model's awareness of fine-grained anomalies, we condition the learnable prompts of the text encoder on image context extracted from the vision encoder and perform a local-to-global representation fusion. Moreover, our method can incorporate vision foundation models such as DINOv2 to further enhance spatial understanding and localization. The key insight of Crane is to balance learnable adaptations for modeling anomalous concepts with non-learnable adaptations that preserve and exploit generalized pretrained knowledge, thereby minimizing in-domain overfitting and maximizing performance on unseen domains. Extensive evaluation across 14 diverse industrial and medical datasets demonstrates that Crane consistently improves the state-of-the-art ZSAD from 2% to 28%, at both image and pixel levels, while remaining competitive in inference speed. The code is available at https://github.com/AlirezaSalehy/Crane.
中文: 本文提出Crane方法,通过相关性注意力机制和图像上下文驱动的可学习提示,解决CLIP在零样本异常检测中的空间错位问题,在保持推理速度的同时,在14个工业与医疗数据集上实现了2%至28%的性能提升。
English: This paper introduces Crane, a method that enhances zero-shot anomaly detection by addressing CLIP's spatial misalignment and insensitivity to local anomalies through correlation-based attention and context-aware prompts, achieving state-of-the-art performance across diverse datasets without compromising inference speed.

Authors:Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Yu Wang
Title: QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models
Abstract:
In typical multimodal tasks, such as Visual Question Answering (VQA), adversarial attacks targeting a specific image and question can lead large vision-language models (LVLMs) to provide incorrect answers. However, it is common for a single image to be associated with multiple questions, and LVLMs may still answer other questions correctly even for an adversarial image attacked by a specific question. To address this, we introduce the query-agnostic visual attack (QAVA), which aims to create robust adversarial examples that generate incorrect responses to unspecified and unknown questions. Compared to traditional adversarial attacks focused on specific images and questions, QAVA significantly enhances the effectiveness and efficiency of attacks on images when the question is unknown, achieving performance comparable to attacks on known target questions. Our research broadens the scope of visual adversarial attacks on LVLMs in practical settings, uncovering previously overlooked vulnerabilities, particularly in the context of visual adversarial threats. The code is available at https://github.com/btzyd/qava.
中文: 本文提出查询无关视觉攻击(QAVA)方法,通过生成鲁棒性对抗图像,使大型视觉语言模型在面对未知问题时仍产生错误回答,在目标问题不明确的情况下显著提升了攻击效果与效率。
English: This paper introduces Query-Agnostic Visual Attack (QAVA), a method that generates robust adversarial images capable of misleading large vision-language models into giving incorrect answers to unspecified questions, significantly enhancing attack effectiveness and efficiency when the target question is unknown.
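Conceptually, a query-agnostic attack optimizes a single image perturbation against a pool of sampled questions rather than one fixed query. The PGD-style sketch below illustrates that idea; `model_loss` is a hypothetical stand-in for an LVLM loss, and all parameters are illustrative, not the paper's settings.

```python
# Conceptual PGD over a pool of questions, producing one query-agnostic perturbation.
import torch

def query_agnostic_attack(image, questions, model_loss, eps=8/255, alpha=1/255, steps=20):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Maximize the average loss over the question pool (attack objective).
        loss = torch.stack([model_loss(image + delta, q) for q in questions]).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)                              # L_inf budget
            delta.copy_((image + delta).clamp(0, 1) - image)     # keep a valid image
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

# Toy usage with a dummy differentiable loss in place of a real LVLM.
img = torch.rand(1, 3, 224, 224)
dummy_loss = lambda x, q: (x.mean() - len(q)) ** 2
adv = query_agnostic_attack(img, ["what color?", "how many?"], dummy_loss)
print((adv - img).abs().max().item() <= 8/255 + 1e-6)
```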

Authors:Andrea Simonelli, Norman Müller, Peter Kontschieder
Title: Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation
Abstract:
The increasing availability of digital 3D environments, whether through image-based 3D reconstruction, generation, or scans obtained by robots, is driving innovation across various applications. These come with a significant demand for 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and performing well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a 3D interactive segmentation method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as the ones obtained by Gaussian Splatting. The project web-page is available at https://simonelli-andrea.github.io/easy3d.

Authors:Hyejin Lee, Seokjun Hong, Jeonghoon Song, Haechan Cho, Zhixiong Jin, Byeonghun Kim, Joobin Jin, Jaegyun Im, Byeongjoon Noh, Hwasoo Yeo
Title: DRIFT open dataset: A drone-derived intelligence for traffic analysis in urban environment
Abstract:
Reliable traffic data are essential for understanding urban mobility and developing effective traffic management strategies. This study introduces the DRone-derived Intelligence For Traffic analysis (DRIFT) dataset, a large-scale urban traffic dataset collected systematically from synchronized drone videos at approximately 250 meters altitude, covering nine interconnected intersections in Daejeon, South Korea. DRIFT provides high-resolution vehicle trajectories that include directional information, processed through video synchronization and orthomap alignment, resulting in a comprehensive dataset of 81,699 vehicle trajectories. Through our DRIFT dataset, researchers can simultaneously analyze traffic at multiple scales - from individual vehicle maneuvers like lane-changes and safety metrics such as time-to-collision to aggregate network flow dynamics across interconnected urban intersections. The DRIFT dataset is structured to enable immediate use without additional preprocessing, complemented by open-source models for object detection and trajectory extraction, as well as associated analytical tools. DRIFT is expected to significantly contribute to academic research and practical applications, such as traffic flow analysis and simulation studies. The dataset and related resources are publicly accessible at https://github.com/AIxMobility/The-DRIFT.
中文: DRIFT数据集通过同步无人机视频提供大规模高分辨率车辆轨迹数据,支持从微观车辆行为到宏观路网流量的多尺度交通分析,无需额外预处理即可直接应用。
English: The DRIFT dataset offers a large-scale collection of high-resolution vehicle trajectories from synchronized drone videos, enabling multi-scale traffic analysis from individual maneuvers to network dynamics without preprocessing.

Authors:Chenyang Zhu, Xing Zhang, Yuyang Sun, Ching-Chun Chang, Isao Echizen
Title: AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era
Abstract:
Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored, despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples. Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in the anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods. Dataset and code can be found at https://flytweety.github.io/AnimeDL2M/.
中文摘要:针对扩散模型使动漫图像伪造检测日益困难的问题,本研究提出了首个大规模动漫专用基准数据集AnimeDL-2M和专门模型AniXplore,该模型在检测篡改动漫内容方面显著优于现有方法。
English Summary: Recent advances in diffusion models have made anime image forgery detection increasingly difficult, prompting the creation of AnimeDL-2M, the first large-scale anime-specific benchmark, and AniXplore, a novel model that outperforms existing methods in detecting manipulated anime content.

Authors:Eunsoo Im, Changhyun Jee, Jung Kwon Lee
Title: GATE3D: Generalized Attention-based Task-synergized Estimation in 3D*
Abstract:
The emerging trend in computer vision emphasizes developing universal models capable of simultaneously addressing multiple diverse tasks. Such universality typically requires joint training across multi-domain datasets to ensure effective generalization. However, monocular 3D object detection presents unique challenges in multi-domain training due to the scarcity of datasets annotated with accurate 3D ground-truth labels, especially beyond typical road-based autonomous driving contexts. To address this challenge, we introduce a novel weakly supervised framework leveraging pseudo-labels. Current pretrained models often struggle to accurately detect pedestrians in non-road environments due to inherent dataset biases. Unlike generalized image-based 2D object detection models, achieving similar generalization in monocular 3D detection remains largely unexplored. In this paper, we propose GATE3D, a novel framework designed specifically for generalized monocular 3D object detection via weak supervision. GATE3D effectively bridges domain gaps by employing consistency losses between 2D and 3D predictions. Remarkably, our model achieves competitive performance on the KITTI benchmark as well as on an indoor-office dataset collected by us to evaluate the generalization capabilities of our framework. Our results demonstrate that GATE3D significantly accelerates learning from limited annotated data through effective pre-training strategies, highlighting substantial potential for broader impacts in robotics, augmented reality, and virtual reality applications. Project page: https://ies0411.github.io/GATE3D/

Authors:Bo-Cheng Hu, Ge-Peng Ji, Dian Shao, Deng-Ping Fan
Title: PraNet-V2: Dual-Supervised Reverse Attention for Medical Image Segmentation
Abstract:
Accurate medical image segmentation is essential for effective diagnosis and treatment. Previously, PraNet-V1 was proposed to enhance polyp segmentation by introducing a reverse attention (RA) module that utilizes background information. However, PraNet-V1 struggles with multi-class segmentation tasks. To address this limitation, we propose PraNet-V2, which, compared to PraNet-V1, effectively performs a broader range of tasks including multi-class segmentation. At the core of PraNet-V2 is the Dual-Supervised Reverse Attention (DSRA) module, which incorporates explicit background supervision, independent background modeling, and semantically enriched attention fusion. Our PraNet-V2 framework demonstrates strong performance on four polyp segmentation datasets. Additionally, by integrating DSRA to iteratively enhance foreground segmentation results in three state-of-the-art semantic segmentation models, we achieve up to a 1.36% improvement in mean Dice score. Code is available at: https://github.com/ai4colonoscopy/PraNet-V2/tree/main/binary_seg/jittor.
Chinese: PraNet-V2通过引入双重监督反向注意力模块,解决了PraNet-V1在多类别分割任务中的不足,在多个息肉分割数据集上表现出色,并显著提升了语义分割模型的平均Dice分数。
English: PraNet-V2 introduces a Dual-Supervised Reverse Attention module to overcome PraNet-V1's limitations in multi-class segmentation, achieving enhanced performance across polyp segmentation datasets and improving mean Dice scores in semantic segmentation models.
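The reverse-attention idea with explicit background supervision can be sketched as below: a coarse foreground prediction is inverted into a background attention map that reweights features, and that background map itself receives a supervision signal. Shapes, fusion, and loss weighting are illustrative assumptions, not the DSRA module itself.

```python
# Minimal sketch of reverse attention with an explicit background supervision term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseAttentionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feats, coarse_logits):
        rev = 1.0 - torch.sigmoid(coarse_logits)          # background attention map
        refined = self.refine(feats * rev)                # residual refinement
        return coarse_logits + refined, rev

block = ReverseAttentionBlock(32)
feats, coarse = torch.randn(2, 32, 44, 44), torch.randn(2, 1, 44, 44)
gt = torch.randint(0, 2, (2, 1, 44, 44)).float()
logits, rev = block(feats, coarse)
seg_loss = F.binary_cross_entropy_with_logits(logits, gt)
bg_loss = F.binary_cross_entropy(rev.clamp(1e-6, 1 - 1e-6), 1.0 - gt)  # background supervision
print((seg_loss + bg_loss).item())
```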

Authors:Sami Arja, Nimrod Kruger, Alexandre Marcireau, Nicholas Owen Ralph, Saeed Afshar, Gregory Cohen
Title: Seeing like a Cephalopod: Colour Vision with a Monochrome Event Camera
Abstract:
Cephalopods exhibit unique colour discrimination capabilities despite having one type of photoreceptor, relying instead on chromatic aberration induced by their ocular optics and pupil shapes to perceive spectral information. We took inspiration from this biological mechanism to design a spectral imaging system that combines a ball lens with an event-based camera. Our approach relies on a motorised system that shifts the focal position, mirroring the adaptive lens motion in cephalopods. This approach has enabled us to achieve wavelength-dependent focusing across the visible light and near-infrared spectrum, making the event camera a spectral sensor. We characterise chromatic aberration effects, using both event-based and conventional frame-based sensors, validating the effectiveness of bio-inspired spectral discrimination both in simulation and in a real setup, as well as assessing the spectral discrimination performance. Our proposed approach provides a robust spectral sensing capability without conventional colour filters or computational demosaicing. This approach opens new pathways toward new spectral sensing systems inspired by nature's evolutionary solutions. Code and analysis are available at: https://samiarja.github.io/neuromorphic_octopus_eye/

Authors:Yubin Gu, Yuan Meng, Kaihang Zheng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Liujuan Cao, Rongrong Ji
Title: An Efficient and Mixed Heterogeneous Model for Image Restoration
Abstract:
Image restoration (IR), as a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. CNNs excel in efficient inference, whereas Transformers and Mamba excel at capturing long-range dependencies and modeling global contexts. While each architecture has demonstrated success in specialized, single-task settings, limited efforts have been made to effectively integrate heterogeneous architectures to jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of the input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at https://github.com/ClimBin/RestorMixer.
中文: RestorMixer是一种新颖的混合架构融合模型,有效结合了CNN、Transformer和Mamba的优势,能够在保持高效推理的同时解决多种图像复原任务。
English: RestorMixer is a novel mixed-architecture fusion model that effectively integrates CNNs, Transformers, and Mambas to address diverse image restoration challenges while maintaining high inference efficiency.

Authors:Peipei Song, Long Zhang, Long Lan, Weidong Chen, Dan Guo, Xun Yang, Meng Wang
Title: Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering
Abstract:
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the up-to-date method GMMFormer on TVR.
English Summary: AMDNet introduces an efficient approach for partially relevant video retrieval by actively discovering semantically consistent moments using learnable span anchors and masked attention, achieving superior performance with fewer parameters.
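As a rough illustration of the moment diversity loss the abstract describes (encouraging different moments to cover distinct regions), the sketch below penalizes temporal overlap between moments parameterized by learnable span anchors. The tensor layout, IoU formulation, and function name are assumptions for illustration, not the authors' released code.

```python
import torch

def moment_diversity_loss(centers: torch.Tensor, widths: torch.Tensor) -> torch.Tensor:
    """Penalize temporal overlap between predicted moments (hypothetical formulation).

    centers, widths: (B, M) normalized moment centers and widths in [0, 1],
    e.g. produced by learnable span anchors passed through a sigmoid.
    """
    starts = (centers - widths / 2).clamp(0, 1)           # (B, M)
    ends = (centers + widths / 2).clamp(0, 1)             # (B, M)
    # Pairwise temporal IoU between all moments in each video.
    inter = (torch.minimum(ends.unsqueeze(2), ends.unsqueeze(1))
             - torch.maximum(starts.unsqueeze(2), starts.unsqueeze(1))).clamp(min=0)
    union = (ends.unsqueeze(2) - starts.unsqueeze(2)) + \
            (ends.unsqueeze(1) - starts.unsqueeze(1)) - inter
    iou = inter / union.clamp(min=1e-6)                    # (B, M, M)
    # Only off-diagonal pairs count: overlapping distinct moments are discouraged.
    m = centers.shape[1]
    off_diag = 1.0 - torch.eye(m, device=centers.device)
    return (iou * off_diag).sum(dim=(1, 2)).mean() / max(m * (m - 1), 1)

# Toy usage: 2 videos, 4 learnable moments each.
loss = moment_diversity_loss(torch.rand(2, 4), torch.rand(2, 4) * 0.3)
print(loss.item())
```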

Authors:Aviral Chharia, Tianyu Ren, Tomotake Furuhata, Kenji Shimada
Title: Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task
Abstract:
Recognizing safety violations in construction environments is critical yet remains underexplored in computer vision. Existing models predominantly rely on 2D object detection, which fails to capture the complexities of real-world violations due to: (i) an oversimplified task formulation treating violation recognition merely as object detection, (ii) inadequate validation under realistic conditions, (iii) absence of standardized baselines, and (iv) limited scalability from the unavailability of synthetic dataset generators for diverse construction scenarios. To address these challenges, we introduce Safe-Construct, the first framework that reformulates violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding. We also propose the Synthetic Indoor Construction Site Generator (SICSG) to create diverse, scalable training data, overcoming data limitations. Safe-Construct achieves a 7.6% improvement over state-of-the-art methods across four violation types. We rigorously evaluate our approach in near-realistic settings, incorporating four violations, four workers, 14 objects, and challenging conditions like occlusions (worker-object, worker-worker) and variable illumination (back-lighting, overexposure, sunlight). By integrating 3D multi-view spatial understanding and synthetic data generation, Safe-Construct sets a new benchmark for scalable and robust safety monitoring in high-risk industries. Project Website: https://Safe-Construct.github.io/Safe-Construct

Authors:Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Yuchi Huo, Rui Wang, Chi Zhang, Xuelong Li
Title: OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
Abstract:
In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
中文: OmniVDiff提出了一种统一视频扩散框架,通过动态调整多种视觉模态的角色,实现了文本生成视频、视频理解及属性条件生成等多样化任务。
English: OmniVDiff introduces a unified video diffusion framework that dynamically adapts multiple visual modalities for versatile tasks including text-to-video generation, video understanding, and attribute-conditioned synthesis.

Authors:Christophe Bolduc, Yannick Hold-Geoffroy, Zhixin Shu, Jean-François Lalonde
Title: GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR
Abstract:
We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially variant lighting. Our approach yields state-of-the-art results on HDR estimations and their applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR to evaluate images as light sources. We assess our method using a combination of this novel dataset and an existing dataset from the literature. Project page: https://lvsn.github.io/gaslight/

Authors:Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon
Title: TFMPathy: Tabular Foundation Model for Privacy-Aware, Generalisable Empathy Detection from Videos
Abstract:
Detecting empathy from video interactions is an emerging area of research, particularly in healthcare and social robotics. However, privacy and ethical concerns often prevent the release of raw video data, with many datasets instead shared as pre-extracted tabular features. Previous work on such datasets has established classical tree-based models as the state of the art. Motivated by recent successes of large-scale foundation models for text, we investigate the potential of tabular foundation models (TFMs) for empathy detection from video-derived tabular data. Our proposed system, TFMPathy, is demonstrated with two recent TFMs (TabPFN v2 and TabICL) under both in-context learning and fine-tuning paradigms. On a public human-robot interaction benchmark, TFMPathy significantly improves empathy detection accuracy reported in the literature. While the established evaluation protocol in the literature does not ensure cross-subject generalisation, our evaluation scheme also captures such generalisation. We show that TFMPathy under a fine-tuning setup has better cross-subject generalisation capacity over baseline methods (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). Given the ongoing privacy and ethical constraints around raw video sharing, the proposed TFMPathy system provides a practical and scalable path toward building AI systems dependent on human-centred video datasets. Our code is publicly available at https://github.com/hasan-rakibul/TFMPathy (will be made available upon acceptance of this paper).
中文: 提出的TFMPathy系统利用表格基础模型,在解决隐私限制的同时,显著提升了基于视频衍生数据的共情检测准确率和跨被试泛化能力。
English: The proposed TFMPathy system leverages tabular foundation models to significantly improve empathy detection accuracy and cross-subject generalization from video-derived data while addressing privacy constraints.
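The abstract stresses an evaluation scheme that captures cross-subject generalisation. A minimal sketch of such a protocol is shown below using scikit-learn's GroupKFold so that no subject appears in both train and test splits; the synthetic features and the GradientBoostingClassifier are placeholders standing in for the video-derived tabular data and the tabular foundation models (TabPFN v2 / TabICL), whose APIs are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in for video-derived tabular features: 120 samples, 16 features,
# 12 subjects. In TFMPathy the classifier would be a tabular foundation model.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 16))
y = rng.integers(0, 2, size=120)
subjects = np.repeat(np.arange(12), 10)

accs, aucs = [], []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=subjects):
    # No subject appears in both train and test, so the score reflects
    # cross-subject generalisation rather than within-subject memorisation.
    clf = GradientBoostingClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]
    accs.append(accuracy_score(y[test_idx], (prob > 0.5).astype(int)))
    aucs.append(roc_auc_score(y[test_idx], prob))

print(f"cross-subject accuracy: {np.mean(accs):.3f}, AUC: {np.mean(aucs):.3f}")
```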

Authors:Xiulong Liu, Anurag Kumar, Paul Calamia, Sebastia V. Amengual, Calvin Murdock, Ishwarya Ananthabhotla, Philip Robinson, Eli Shlizerman, Vamsi Krishna Ithapu, Ruohan Gao
Title: Hearing Anywhere in Any Environment
Abstract:
In mixed reality applications, a realistic acoustic experience in spatial environments is as crucial as the visual experience for achieving true immersion. Despite recent advances in neural approaches for Room Impulse Response (RIR) estimation, most existing methods are limited to the single environment on which they are trained, lacking the ability to generalize to new rooms with different geometries and surface materials. We aim to develop a unified model capable of reconstructing the spatial acoustic experience of any environment with minimum additional measurements. To this end, we present xRIR, a framework for cross-room RIR prediction. The core of our generalizable approach lies in combining a geometric feature extractor, which captures spatial context from panorama depth images, with a RIR encoder that extracts detailed acoustic features from only a few reference RIR samples. To evaluate our method, we introduce ACOUSTICROOMS, a new dataset featuring high-fidelity simulation of over 300,000 RIRs from 260 rooms. Experiments show that our method strongly outperforms a series of baselines. Furthermore, we successfully perform sim-to-real transfer by evaluating our model on four real-world environments, demonstrating the generalizability of our approach and the realism of our dataset.

Authors:Ankit Kumar Shaw, Kun Jiang, Tuopu Wen, Chandan Kumar Sah, Yining Shi, Mengmeng Yang, Diange Yang, Xiaoli Lian
Title: CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates
Abstract:
The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffers from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at https://Ankit-Zefan.github.io/CleanMap/
中文摘要:CleanMAP是一个基于多模态大语言模型的蒸馏框架,通过评估车道可见性并融合高质量局部地图来优化众包高清地图更新,在自动驾驶导航中实现了卓越的精度和与人工评估的高度一致性。
English Summary: CleanMAP is a multimodal large language model-based framework that refines crowdsourced data for high-definition map updates by scoring lane visibility and fusing top-quality local maps, achieving superior accuracy and human alignment in autonomous navigation.
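A minimal sketch of the selection step described in the abstract (keep local maps whose confidence lies within the optimal range, then fuse the top-k) is given below. It assumes "best score minus 10%" means 10% of the best score, the fusion is stubbed as a simple average, and all names are hypothetical rather than the CleanMAP implementation.

```python
from typing import List
import numpy as np

def select_maps_for_fusion(scores: List[float], k: int = 3) -> List[int]:
    """Return indices of up to k highest-scoring local maps whose confidence
    lies within the optimal range (best score minus 10% of the best score).
    Scores are assumed to be on the 0-10 lane-visibility scale."""
    best = max(scores)
    threshold = best - 0.1 * best
    eligible = [i for i, s in enumerate(scores) if s >= threshold]
    eligible.sort(key=lambda i: scores[i], reverse=True)
    return eligible[:k]

def fuse_local_maps(local_maps: List[np.ndarray], indices: List[int]) -> np.ndarray:
    """Placeholder fusion: average the selected local maps (the actual strategy
    in CleanMAP is confidence-driven and more involved)."""
    return np.mean([local_maps[i] for i in indices], axis=0)

# Toy usage: five local maps with MLLM-assigned visibility scores.
scores = [8.7, 9.2, 6.1, 8.9, 9.0]
maps = [np.random.rand(4, 2) for _ in scores]     # e.g. 4 lane points, (x, y) each
chosen = select_maps_for_fusion(scores, k=3)
fused = fuse_local_maps(maps, chosen)
print(chosen, fused.shape)
```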

Authors:Zi-Han Jiang, Chien-Wei Lin, Wei-Hua Li, Hsuan-Tung Liu, Yi-Ren Yeh, Chu-Song Chen
Title: Relation-Rich Visual Document Generator for Visual Information Extraction
Abstract:
Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to the layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates, or adopt rule-based approaches that limit layout diversity. Besides, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between the contents and layouts. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format which captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotations efforts. Experimental results have demonstrated that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at https://github.com/AI-Application-and-Integration-Lab/RIDGE .
中文摘要:提出的RIDGE模型通过基于大语言模型的内容生成和从OCR结果驱动布局生成的两阶段方法,克服了视觉文档理解中布局多样性和数据稀缺的挑战,显著提升了各类VIE基准测试的性能。
English Summary: The proposed RIDGE model overcomes limitations in visual document understanding by generating relation-rich documents through content creation using LLMs and content-driven layout generation from OCR data, significantly improving performance on VIE benchmarks.

Authors:Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte
Title: LEMUR Neural Network Dataset: Towards Seamless AutoML
Abstract:
Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr
中文: LEMUR 是一个开源数据集与框架,通过统一模板标准化 PyTorch 神经网络模型,集成自动化优化与分析工具,旨在加速 AutoML 研究并保障实验可复现性。
English: LEMUR is an open-source dataset and framework that standardizes PyTorch neural networks across various tasks, integrating automated optimization and analysis tools to accelerate AutoML research and ensure reproducibility.
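The abstract mentions automated hyperparameter optimization via Optuna. The sketch below shows generic Optuna usage over a toy objective, not LEMUR's own API: the search space, the objective, and the stand-in "accuracy" are illustrative assumptions.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space for a small model; LEMUR's actual templates
    # and metrics live in its structured model database, not shown here.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 1, 8)
    # Stand-in for "validation accuracy after training with these settings".
    return 1.0 - abs(lr - 1e-3) * 10 - abs(depth - 4) * 0.01

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print("best params:", study.best_params, "best value:", round(study.best_value, 4))
```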

Authors:Mingyang Zhu, Yinting Liu, Mingyu Li, Jiacheng Wang
Title: PathSeqSAM: Sequential Modeling for Pathology Image Segmentation with SAM2
Abstract:
Current methods for pathology image segmentation typically treat 2D slices independently, ignoring valuable cross-slice information. We present PathSeqSAM, a novel approach that treats 2D pathology slices as sequential video frames using SAM2's memory mechanisms. Our method introduces a distance-aware attention mechanism that accounts for variable physical distances between slices and employs LoRA for domain adaptation. Evaluated on the KPI Challenge 2024 dataset for glomeruli segmentation, PathSeqSAM demonstrates improved segmentation quality, particularly in challenging cases that benefit from cross-slice context. We have publicly released our code at https://github.com/JackyyyWang/PathSeqSAM.
中文: PathSeqSAM提出了一种创新方法,将二维病理切片视为连续视频帧,利用SAM2的记忆机制结合距离感知注意力机制和LoRA进行领域适配,在KPI挑战赛2024数据集上实现了更优的分割效果。
English: PathSeqSAM introduces a novel method that treats 2D pathology slices as sequential video frames, utilizing SAM2's memory mechanisms with a distance-aware attention mechanism and LoRA for domain adaptation, achieving improved segmentation quality on the KPI Challenge 2024 dataset.
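One plausible reading of the distance-aware attention described in the abstract is an attention bias that decays with the physical distance between slices, so that nearby slices contribute more context. The sketch below is that interpretation only; the tensor layout, the linear decay, and the temperature are assumptions, not the PathSeqSAM module.

```python
import torch
import torch.nn.functional as F

def distance_aware_attention(q, k, v, slice_positions, tau: float = 1.0):
    """Cross-slice attention with a bias that decays with physical distance.

    q, k, v: (S, D) one query/key/value vector per slice.
    slice_positions: (S,) physical position of each slice (e.g. in micrometres).
    Illustrative formulation only.
    """
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5                                   # (S, S)
    dist = (slice_positions[:, None] - slice_positions[None, :]).abs()
    logits = logits - dist / tau                                  # nearer slices attend more
    return F.softmax(logits, dim=-1) @ v

# Toy usage: 5 slices with uneven physical spacing.
S, D = 5, 32
q = k = v = torch.randn(S, D)
positions = torch.tensor([0.0, 4.0, 5.0, 12.0, 20.0])
out = distance_aware_attention(q, k, v, positions)
print(out.shape)  # torch.Size([5, 32])
```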

Authors:Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
Title: ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Abstract:
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, though they are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.
中文摘要:ColorBench是一个评估视觉语言模型颜色理解能力的新基准,揭示了现有模型在颜色感知、推理和鲁棒性方面存在显著不足,尽管模型规模扩大和思维链推理能带来一定提升。
English Summary: ColorBench is a novel benchmark that evaluates vision-language models' color understanding, revealing significant limitations in their ability to perceive, reason with, and maintain robustness regarding colors, despite scaling laws and CoT reasoning offering some improvements.

Authors:Yasser Benigmim, Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette
Title: FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
Abstract:
In this paper, we challenge the conventional practice in Open-Vocabulary Semantic Segmentation (OVSS) of using averaged class-wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., "a photo of a {class}", "a sketch of a {class}"). We investigate the impact of templates for OVSS, and find that for each class, there exist single-template classifiers (which we refer to as class-experts) that significantly outperform the conventional averaged classifier. First, to identify these class-experts, we introduce a novel approach that estimates them without any labeled data or training. By leveraging the class-wise prediction entropy of single-template classifiers, we select those yielding the lowest entropy as the most reliable class-experts. Second, we combine the outputs of class-experts in a new fusion process. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering an improvement without the need for additional labels or training. Extensive experiments show that FLOSS consistently enhances state-of-the-art OVSS models, generalizes well across datasets with different distribution shifts, and delivers substantial improvements in low-data scenarios where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS .
中文摘要:本文提出FLOSS方法,通过熵分析筛选单模板类别专家分类器并进行融合,无需额外训练或标注即可显著提升开放词汇语义分割性能。
English Summary: This paper introduces FLOSS, a plug-and-play method that identifies single-template class-expert classifiers through entropy analysis and combines them to significantly enhance Open-Vocabulary Semantic Segmentation without requiring additional training or labels.
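A rough sketch of the label-free selection step described above: for each class, measure the prediction entropy of every single-template classifier on unlabeled data and keep the lowest-entropy one as that class's expert. The tensor layout and the per-class entropy estimate are assumptions made for illustration; the paper's exact formulation may differ.

```python
import torch

def select_class_experts(probs: torch.Tensor) -> torch.Tensor:
    """Pick one template per class by minimum class-wise prediction entropy.

    probs: (T, N, C) softmax outputs of T single-template classifiers over
    N unlabeled pixels/images for C classes (layout assumed for illustration).
    Returns a (C,) tensor: the chosen template index for each class.
    """
    eps = 1e-8
    entropy = torch.zeros(probs.shape[0], probs.shape[2])
    for t in range(probs.shape[0]):
        pred = probs[t].argmax(dim=-1)                    # (N,) hard predictions
        for c in range(probs.shape[2]):
            p = probs[t, pred == c]                       # samples predicted as class c
            if p.numel() == 0:
                entropy[t, c] = float("inf")              # template never predicts c
                continue
            entropy[t, c] = (-(p * (p + eps).log()).sum(dim=-1)).mean()
    return entropy.argmin(dim=0)                          # lowest-entropy template per class

# Toy usage: 4 templates, 100 samples, 6 classes.
experts = select_class_experts(torch.softmax(torch.randn(4, 100, 6), dim=-1))
print(experts)
```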

Authors:Zeren Jiang, Shaofei Wang, Siyu Tang
Title: DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting
Abstract:
Creating relightable and animatable human avatars from monocular videos is a rising research topic with a range of applications, e.g. virtual reality, sports, and video games. Previous works utilize neural fields together with physically based rendering (PBR) to estimate geometry and disentangle appearance properties of human avatars. However, one drawback of these methods is the slow rendering speed due to the expensive Monte Carlo ray tracing. To tackle this problem, we propose to distill the knowledge from implicit neural fields (teacher) to explicit 2D Gaussian splatting (student) representation to take advantage of the fast rasterization property of Gaussian splatting. To avoid ray-tracing, we employ the split-sum approximation for PBR appearance. We also propose novel part-wise ambient occlusion probes for shadow computation. Shadow prediction is achieved by querying these probes only once per pixel, which paves the way for real-time relighting of avatars. These techniques combined give high-quality relighting results with realistic shadow effects. Our experiments demonstrate that the proposed student model achieves comparable or even better relighting results with our teacher model while being 370 times faster at inference time, achieving a 67 FPS rendering speed.

Authors:Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
Title: MIEB: Massive Image Embedding Benchmark
Abstract:
Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
中文摘要:我们推出了大规模图像嵌入基准(MIEB),通过涵盖38种语言的130项任务全面评估图像及图文嵌入模型,发现没有单一模型能在所有类别中表现卓越,同时揭示了先进视觉模型在文本视觉表征方面的优势及其在混合编码和干扰环境下图文匹配的局限性。
English Summary: The Massive Image Embedding Benchmark (MIEB) is introduced to comprehensively evaluate image and image-text embedding models across 130 tasks in 38 languages, revealing that no single model excels in all categories while uncovering both strengths and limitations in advanced vision models.

Authors:Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
Title: Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Abstract:
Multimodal Large Language Models (MLLMs) achieve remarkable performance for fine-grained pixel-level understanding tasks. However, all the works rely heavily on extra components, such as vision encoder (CLIP), segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by the recent works on Single trAnsformer as a unified vIsion-Language Model (SAIL) design, where these works jointly learn vision tokens and text tokens in transformers. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Secondly, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Thirdly, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using a manual check. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that our Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.
Chinese: Pixel-SAIL是一种高度简化的多模态大语言模型,通过三项关键技术改进,在像素级理解任务中取得了可比甚至更优的性能,无需依赖视觉编码器等额外组件。
English: Pixel-SAIL is a highly simplified multimodal large language model that achieves comparable or superior performance in pixel-level understanding tasks through three key technical improvements, eliminating the need for extra components like vision encoders.
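One common way to realize a "learnable upsampling module to refine visual token features" is a linear expansion of each token into a small spatial patch followed by a convolutional refinement. The sketch below is that generic design under assumed shapes; it is not the released Pixel-SAIL module.

```python
import torch
import torch.nn as nn

class LearnableUpsampler(nn.Module):
    """Upsample a grid of vision tokens by a factor r with learnable weights
    (illustrative design; the actual Pixel-SAIL module may differ)."""

    def __init__(self, dim: int, r: int = 2):
        super().__init__()
        self.r = r
        self.expand = nn.Linear(dim, dim * r * r)   # each token predicts an r x r patch
        self.refine = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, h*w, dim) -> (B, (h*r)*(w*r), dim)
        b, _, dim = tokens.shape
        x = self.expand(tokens).view(b, h, w, dim, self.r, self.r)
        x = x.permute(0, 3, 1, 4, 2, 5).reshape(b, dim, h * self.r, w * self.r)
        x = self.refine(x)
        return x.flatten(2).transpose(1, 2)

up = LearnableUpsampler(dim=64, r=2)
out = up(torch.randn(1, 16 * 16, 64), h=16, w=16)
print(out.shape)  # torch.Size([1, 1024, 64])
```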

Authors:Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang
Title: The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Abstract:
This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.
中文摘要:SAIL是一种统一的多模态大语言模型,在单一Transformer架构中整合视觉与语言处理,无需预训练视觉编码器即可达到模块化模型的性能水平,同时展现出更强的可扩展性和独特的跨模态交互特性。
English Summary: SAIL is a unified multimodal large language model that integrates vision and language processing within a single transformer architecture, achieving performance comparable to modular models while offering enhanced scalability and distinct cross-modal interaction patterns.

Authors:Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu
Title: RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
Abstract:
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
中文: RealWebAssist 是一个新颖的基准测试,旨在评估AI代理处理现实世界中连续、模糊用户指令的能力,当前模型在意图推理和图形界面定位方面仍面临显著挑战。
English: RealWebAssist is a new benchmark designed to evaluate AI agents' ability to handle sequential, ambiguous real-world user instructions for long-horizon web tasks, where current models struggle with intent reasoning and GUI grounding.

Authors:Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
Title: Multimodal Long Video Modeling Based on Temporal Dynamic Context
Abstract:
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modality like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
中文: 本文提出时序动态上下文(TDC)方法,通过将视频分割为语义场景、使用时序上下文压缩器减少标记数量,并采用思维链策略处理超长视频,在视频与音频理解基准测试中表现优异。
English: This paper introduces Temporal Dynamic Context (TDC), a dynamic long video encoding method that segments videos into scenes, compresses tokens using a temporal context compressor, and employs a chain-of-thought strategy for enhanced video and audio understanding, achieving strong performance on benchmarks.
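A minimal sketch of the first step described above (splitting the video into semantically consistent scenes from inter-frame similarities) is shown below, using cosine similarity between consecutive frame embeddings and a threshold. The feature source and the 0.85 threshold are arbitrary illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def segment_scenes(frame_feats: torch.Tensor, threshold: float = 0.85):
    """Split a video into scenes wherever consecutive-frame similarity drops.

    frame_feats: (T, D) per-frame embeddings (e.g. from a visual encoder).
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    feats = F.normalize(frame_feats, dim=-1)
    sim = (feats[:-1] * feats[1:]).sum(dim=-1)        # cosine sim of adjacent frames
    boundaries = [0] + [i + 1 for i, s in enumerate(sim.tolist()) if s < threshold]
    boundaries.append(frame_feats.shape[0])
    return list(zip(boundaries[:-1], boundaries[1:]))

# Toy usage: 12 frames with an abrupt content change after frame 5.
feats = torch.cat([torch.randn(1, 128).repeat(6, 1) + 0.01 * torch.randn(6, 128),
                   torch.randn(1, 128).repeat(6, 1) + 0.01 * torch.randn(6, 128)])
print(segment_scenes(feats))  # typically [(0, 6), (6, 12)]
```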

Authors:Taihang Hu, Linxuan Li, Kai Wang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Title: Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing
Abstract:
Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at https://github.com/hutaiHang/ATM
中文摘要:本文提出隐式结构锁定(ISLock)方法,通过自注意力对齐实现无需训练的自回归图像编辑,在保持结构一致性的同时弥合了与扩散模型的性能差距。
English Summary: This paper introduces Implicit Structure Locking (ISLock), a training-free editing method for autoregressive image models that preserves structural consistency through self-attention alignment, bridging the performance gap with diffusion models.

Authors:Jian Liu, Wei Sun, Hui Yang, Jin Zheng, Zichen Geng, Hossein Rahmani, Ajmal Mian
Title: MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model
Abstract:
Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.
中文: MonoDiff9D是一种基于扩散模型的方法,它通过利用概率扩散模型和DINOv2的粗略深度估计,无需形状先验或CAD模型即可实现最先进的单目类别级9D物体姿态估计。
English: MonoDiff9D is a diffusion-based method that achieves state-of-the-art monocular category-level 9D object pose estimation without requiring shape priors or CAD models by leveraging probabilistic diffusion models and coarse depth estimation from DINOv2.
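The abstract describes converting a coarse monocular depth estimate into a point cloud. The sketch below shows the standard pinhole back-projection that such a step implies; the intrinsics and the random depth map are placeholders (in MonoDiff9D the depth would come from zero-shot DINOv2-based estimation).

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a depth map (H, W), in metres, into an (H*W, 3) point cloud
    using a pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy usage with placeholder intrinsics and a random depth map.
depth = np.random.uniform(0.5, 2.0, size=(48, 64))
points = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=32.0, cy=24.0)
print(points.shape)  # (3072, 3)
```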

Authors:Jiaxin Lu, Chun-Hao Paul Huang, Uttaran Bhattacharya, Qixing Huang, Yi Zhou
Title: HUMOTO: A 4D Dataset of Mocap Human Object Interactions
Abstract:
We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 736 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: https://jiaxin-lu.github.io/humoto/ .

Authors:Hao Sun, Fenggen Yu, Huiyao Xu, Tao Zhang, Changqing Zou
Title: LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis
Abstract:
Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data, such as RAW sensor inputs or multi-exposure sequences, which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from a learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constraints and a diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.

Authors:Nguyen Ngoc Dat, Tom Richardson, Matthew Watson, Kilian Meier, Jenna Kline, Sid Reid, Guy Maalouf, Duncan Hine, Majid Mirmehdi, Tilo Burghardt
Title: WildLive: Near Real-time Visual Wildlife Tracking onboard UAVs
Abstract:
Live tracking of wildlife via high-resolution video processing directly onboard drones is widely unexplored and most existing solutions rely on streaming video to ground stations to support navigation. Yet, both autonomous animal-reactive flight control beyond visual line of sight and/or mission-specific individual and behaviour recognition tasks rely to some degree on this capability. In response, we introduce WildLive - a near real-time animal detection and tracking framework for high-resolution imagery running directly onboard uncrewed aerial vehicles (UAVs). The system performs multi-animal detection and tracking at 17.81fps for HD and 7.53fps on 4K video streams suitable for operation during higher altitude flights to minimise animal disturbance. Our system is optimised for Jetson Orin AGX onboard hardware. It integrates the efficiency of sparse optical flow tracking and mission-specific sampling with device-optimised and proven YOLO-driven object detection and segmentation techniques. Essentially, computational resource is focused onto spatio-temporal regions of high uncertainty to significantly improve UAV processing speeds. Alongside, we introduce our WildLive dataset, which comprises 200K+ annotated animal instances across 19K+ frames from 4K UAV videos collected at the Ol Pejeta Conservancy in Kenya. All frames contain ground truth bounding boxes, segmentation masks, as well as individual tracklets and tracking point trajectories. We compare our system against current object tracking approaches including OC-SORT, ByteTrack, and SORT. Our multi-animal tracking experiments with onboard hardware confirm that near real-time high-resolution wildlife tracking is possible on UAVs whilst maintaining high accuracy levels as needed for future navigational and mission-specific animal-centric operational autonomy. Our materials are available at: https://dat-nguyenvn.github.io/WildLive/
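The abstract mentions combining sparse optical flow tracking with YOLO-driven detection onboard the UAV. The sketch below shows only the sparse-flow half using OpenCV's pyramidal Lucas-Kanade tracker on two synthetic frames; in WildLive the points would be seeded inside detections on real UAV footage, and the parameter values here are arbitrary.

```python
import cv2
import numpy as np

# Two synthetic grayscale frames; in practice these would be consecutive UAV frames.
prev = np.zeros((240, 320), dtype=np.uint8)
cv2.rectangle(prev, (100, 100), (140, 140), 255, -1)
curr = np.roll(prev, shift=5, axis=1)          # the "animal" moves 5 px to the right

# Seed sparse points on the previous frame (Shi-Tomasi corners here).
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=50, qualityLevel=0.01, minDistance=5)

# Pyramidal Lucas-Kanade sparse optical flow to the current frame.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)
tracked = p1[status.flatten() == 1]
seeded = p0[status.flatten() == 1]
print("mean displacement (px):", np.mean(np.linalg.norm(tracked - seeded, axis=-1)))
```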

Authors:Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
Title: Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
Abstract:
Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.
中文: GUI代理面临高质量数据稀缺的挑战,通过在中期训练阶段让视觉语言模型学习多样化推理任务,可显著提升其在图形界面规划场景中的泛化能力和性能表现。
English: GUI agents face data scarcity issues, but training Vision Language Models on diverse reasoning tasks during mid-training significantly enhances their generalization and performance across GUI planning scenarios.

Authors:Yating Liu, Yaowei Li, Xiangyuan Lan, Wenming Yang, Zimo Liu, Qingmin Liao
Title: UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
Abstract:
Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained models. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
中文: 本文提出UP-Person方法,通过整合Prefix、LoRA和Adapter三个轻量化组件,在仅微调4.7%参数的情况下有效迁移CLIP的多模态知识,在多个行人检索数据集上实现了最优性能。
English: This paper introduces UP-Person, a parameter-efficient transfer learning method that enhances text-based person retrieval by integrating Prefix, LoRA, and Adapter components to optimize CLIP's multi-modal knowledge with minimal parameter tuning, achieving state-of-the-art results on multiple datasets.
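One plausible reading of an adapter "designed in parallel with layer normalization" is a bottleneck branch whose output is added alongside the LayerNorm path so it can shift the overall feature distribution. The sketch below is that interpretation only, with assumed dimensions and a hypothetical class name, not the released UP-Person code.

```python
import torch
import torch.nn as nn

class LAdapter(nn.Module):
    """Bottleneck adapter placed in parallel with a LayerNorm (illustrative only)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)       # start as an identity-preserving branch
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The adapter branch runs alongside (not after) the layer normalization,
        # so it can adjust the overall feature distribution.
        return self.norm(x) + self.up(self.act(self.down(x)))

tokens = torch.randn(2, 77, 512)             # e.g. a batch of CLIP text tokens
print(LAdapter(512)(tokens).shape)           # torch.Size([2, 77, 512])
```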

Authors:Bingwen Zhu, Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Yidi Wu, Huyang Sun, Zuxuan Wu
Title: Aligning Anime Video Generation with Human Feedback
Abstract:
Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns, leading to issues such as motion distortion and flickering artifacts, which result in misalignment with human preferences. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. In this work, we propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment. Specifically, we construct the first multi-dimensional reward dataset for anime videos, comprising 30k human-annotated samples that incorporate human preferences for both visual appearance and visual consistency. Based on this, we develop AnimeReward, a powerful reward model that employs specialized vision-language models for different evaluation dimensions to guide preference alignment. Furthermore, we introduce Gap-Aware Preference Optimization (GAPO), a novel training method that explicitly incorporates preference gaps into the optimization process, enhancing alignment performance and efficiency. Extensive experimental results show that AnimeReward outperforms existing reward models, and the inclusion of GAPO leads to superior alignment in both quantitative benchmarks and human evaluations, demonstrating the effectiveness of our pipeline in enhancing anime video quality. Our code and dataset are publicly available at https://github.com/bilibili/Index-anisora.
中文摘要:本研究提出了一种通过构建首个动漫视频多维度奖励数据集并开发AnimeReward模型及间隙感知偏好优化方法,来提升动漫视频生成质量、使其更符合人类偏好的创新流程。
English Summary: This study introduces a pipeline that improves anime video generation by creating the first multi-dimensional reward dataset with human feedback and developing the AnimeReward model alongside Gap-Aware Preference Optimization to better align outputs with human preferences.
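The abstract says GAPO "explicitly incorporates preference gaps into the optimization process". One hedged reading is a pairwise preference loss whose required margin grows with the reward gap between the preferred and rejected videos, as sketched below; this is an illustrative interpretation with hypothetical names, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gap_aware_preference_loss(logp_win: torch.Tensor,
                              logp_lose: torch.Tensor,
                              reward_win: torch.Tensor,
                              reward_lose: torch.Tensor,
                              beta: float = 0.1) -> torch.Tensor:
    """Pairwise preference loss whose margin grows with the reward gap.

    logp_win / logp_lose: policy log-likelihoods of preferred / rejected videos.
    reward_win / reward_lose: AnimeReward-style scores for the same pair.
    Illustrative reading of "gap-aware"; not the released GAPO objective.
    """
    gap = (reward_win - reward_lose).clamp(min=0)     # preference gap, >= 0
    margin = beta * (logp_win - logp_lose)
    # Pairs with a larger gap must be separated by a larger likelihood margin.
    return -F.logsigmoid(margin - gap).mean()

# Toy usage for a batch of 4 preference pairs.
loss = gap_aware_preference_loss(torch.randn(4), torch.randn(4),
                                 torch.tensor([0.9, 0.8, 0.7, 0.95]),
                                 torch.tensor([0.2, 0.6, 0.1, 0.3]))
print(loss.item())
```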

Authors:Hao Ren, Yiming Zeng, Zetong Bi, Zhaoliang Wan, Junlong Huang, Hui Cheng
Title: Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models
Abstract:
Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by initiating from denoising Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose a novel, unified visual navigation framework leveraging the denoising diffusion bridge models named NaviBridger. This approach enables action generation by initiating from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Code is available at https://github.com/hren20/NaiviBridger.
中文摘要:提出的 NaviBridger 框架通过利用扩散桥模型从信息性先验动作生成动作,改进了视觉导航性能,减少了去噪步骤,在仿真和真实环境中均优于基线方法。
English Summary: The proposed NaviBridger framework improves visual navigation by using diffusion bridge models to generate actions from informative prior actions, reducing denoising steps and outperforming baseline methods in both simulated and real-world environments.

Authors:Xiao Wang, Haiyang Wang, Shiao Wang, Qiang Chen, Jiandong Jin, Haoyu Song, Bo Jiang, Chenglong Li
Title: RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
Abstract:
Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released on https://github.com/Event-AHU/OpenPAR
中文: 本文提出了一种新颖的RGB-事件多模态行人属性识别任务和数据集(EventPAR),通过结合事件相机弥补RGB相机的局限性,在复杂环境下提升性能,并开发了基于RWKV的新框架取得领先成果。
English: This paper introduces a novel multi-modal RGB-Event pedestrian attribute recognition task and dataset (EventPAR), addressing limitations of RGB cameras by incorporating event cameras for improved performance in challenging conditions, along with a new RWKV-based framework achieving state-of-the-art results.

Authors:Yiming Zeng, Hao Ren, Shuhang Wang, Junlong Huang, Hui Cheng
Title: NaviDiffusor: Cost-Guided Diffusion Model for Visual Navigation
Abstract:
Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.
中文摘要:本文提出一种混合视觉导航方法,通过将条件扩散模型与可微分成本梯度相结合,无需重新训练即可生成有效路径,在多种环境中展现出优于基线方法的零样本迁移性能。
English Summary: This paper introduces a hybrid visual navigation approach that combines a conditional diffusion model with differentiable cost gradients to generate valid paths without retraining, demonstrating superior zero-shot performance in diverse environments compared to baseline methods.
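A minimal sketch of the inference-time guidance described above: at each refinement step, the gradient of a differentiable cost computed on the intermediate path nudges the sample toward valid trajectories. The learned conditional diffusion model is replaced by a commented no-op, and the toy obstacle cost, step counts, and guidance weight are assumptions for illustration.

```python
import torch

def cost(path: torch.Tensor, obstacle: torch.Tensor, radius: float = 0.5) -> torch.Tensor:
    """Toy differentiable cost: penalize waypoints closer than `radius` to an obstacle."""
    dist = torch.linalg.norm(path - obstacle, dim=-1)
    return torch.relu(radius - dist).pow(2).sum()

def guided_denoise(path: torch.Tensor, steps: int = 50, guidance: float = 0.3):
    """Cost-guided refinement loop in the spirit of NaviDiffusor's inference
    (the learned conditional diffusion model is stubbed out here)."""
    obstacle = torch.tensor([1.0, 0.0])
    for _ in range(steps):
        path = path.detach().requires_grad_(True)
        # 1) a denoiser step would go here, e.g. path = denoiser(path, t, rgb_obs)
        # 2) the gradient of the differentiable scene/task cost steers the sample.
        c = cost(path, obstacle)
        grad, = torch.autograd.grad(c, path)
        path = path - guidance * grad
    return path.detach()

# Toy usage: an 8-waypoint straight path through the obstacle gets pushed away from it.
init = torch.stack([torch.linspace(0, 2, 8), torch.zeros(8)], dim=-1)
print(guided_denoise(init))
```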

Authors:Si-Tong Wei, Rui-Huan Wang, Chuan-Zhi Zhou, Baoquan Chen, Peng-Shuai Wang
Title: OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation
Abstract:
Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact multiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time 13-fold and generation time 69-fold, enabling the efficient training of high-resolution 3D shapes, e.g., $1024^3$, on just four NVIDIA 4090 GPUs within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation. Our code and trained models are available at https://github.com/octree-nn/octgpt.
中文: OctGPT是一种新颖的多尺度自回归模型,通过序列化八叉树表示和优化的变换器技术,显著提升了3D形状生成的效率和性能,可与扩散模型相媲美。
English: OctGPT is a novel multiscale autoregressive model that significantly enhances efficiency and performance in 3D shape generation, rivaling diffusion models through innovations like serialized octree representation and optimized transformers.

Authors:Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
Title: Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration
Abstract:
All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a Sparse Prompt Module (SPM) that efficiently captures degradation-specific features while minimizing redundancy, and a Contrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.
中文: 本文提出对比提示学习(CPL)框架,通过稀疏提示模块减少冗余和对比提示正则化强化任务边界,有效提升全场景图像修复中的提示与任务对齐能力,在多项基准测试中实现了最先进的性能。
English: This paper introduces Contrastive Prompt Learning (CPL), a novel framework that enhances prompt-task alignment in all-in-one image restoration through a Sparse Prompt Module to reduce redundancy and Contrastive Prompt Regularization to strengthen task boundaries, achieving state-of-the-art performance across multiple benchmarks.
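
The Contrastive Prompt Regularization described above can be pictured as an InfoNCE-style term over prompt embeddings, where prompts from the same degradation type act as positives and prompts from other degradation types as negatives. The sketch below is a hedged illustration under that reading; the function name, temperature, and masking details are assumptions rather than CPL's released code.

```python
import torch
import torch.nn.functional as F

def contrastive_prompt_regularization(prompts: torch.Tensor,
                                       task_ids: torch.Tensor,
                                       temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style regularizer over prompt embeddings.

    prompts:  (N, D) prompt embeddings from a batch of degraded inputs.
    task_ids: (N,)   integer degradation type per prompt.
    Prompts from the same degradation act as positives, others as negatives.
    """
    z = F.normalize(prompts, dim=-1)
    sim = z @ z.t() / temperature                     # (N, N) cosine similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (task_ids[:, None] == task_ids[None, :]) & ~eye

    # log-softmax over all other prompts (self excluded from the denominator)
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()

prompts = torch.randn(8, 64)
task_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])     # e.g. rain / haze / noise / blur
print(contrastive_prompt_regularization(prompts, task_ids))
```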

Authors:Dongliang Luo, Hanshen Zhu, Ziyang Zhang, Dingkang Liang, Xudong Xie, Yuliang Liu, Xiang Bai
Title: SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
Abstract:
Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervisions caused by inconsistency between teacher/student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under 0.5%, 1%, and 2% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text spotters.
中文: 本文提出SemiETS半监督端到端文本检测框架,通过生成可靠的分层伪标签和利用双向信息流解决伪标签不一致及监督次优问题,在多种数据集和设置下均实现卓越性能。
English: This paper introduces SemiETS, a semi-supervised framework for end-to-end text spotting that addresses challenges like inconsistent pseudo labels and sub-optimal supervision by generating reliable hierarchical pseudo labels and leveraging bidirectional information flow, achieving superior performance across various datasets and settings.

Authors:Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
Title: FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Abstract:
We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset. https://github.com/starriver030515/FUSION
中文: FUSION是一种通过文本引导编码和递归对齐实现深度视觉语言融合的多模态大语言模型,在多个基准测试中以更少的视觉标记超越了更大规模的模型。
English: FUSION is a multimodal large language model that achieves deep vision-language integration through text-guided encoding and recursive alignment, outperforming larger models with fewer vision tokens across multiple benchmarks.

Authors:Maria Tzelepi, Vasileios Mezaris
Title: Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge
Abstract:
Memes have become a dominant form of communication in social media in recent years. Memes are typically humorous and harmless; however, some memes promote hate speech and are thereby harmful to individuals and groups based on their identity. Therefore, detecting hateful content in memes has emerged as a task of critical importance. The need for understanding the complex interactions of images and their embedded text renders hateful meme detection a challenging multimodal task. In this paper, we propose to address the aforementioned task by leveraging knowledge encoded in powerful Large Multimodal Models (LMM). Specifically, we propose to exploit LMMs in a two-fold manner. First, by extracting knowledge oriented to the hateful meme detection task in order to build strong meme representations. In particular, generic semantic descriptions and emotions that the images along with their embedded texts elicit are extracted, which are then used to train a simple classification head for hateful meme detection. Second, by developing a novel hard mining approach that directly introduces LMM-encoded knowledge into the training process, providing further improvements. We perform extensive experiments on two datasets that validate the effectiveness of the proposed method, achieving state-of-the-art performance. Our code and trained models are publicly available at: https://github.com/IDT-ITI/LMM-CLIP-meme.
中文: 本文提出了一种新颖的仇恨表情包检测方法,通过利用大型多模态模型提取任务导向知识并结合硬挖掘策略,经广泛实验验证实现了最优性能。
English: This paper presents a novel method for detecting hateful memes by leveraging Large Multimodal Models (LMMs) to extract task-oriented knowledge and employing a hard mining approach, achieving state-of-the-art performance through extensive experiments.

Authors:Zihao Liu, Mingwen Ou, Zunnan Xu, Jiaqi Huang, Haonan Han, Ronghui Li, Xiu Li
Title: Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis
Abstract:
Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand's motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism that suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics. Our project is available at https://monkek123king.github.io/S2C_page/.

Authors:Changwei Wang, Shunpeng Chen, Yukun Song, Rongtao Xu, Zherui Zhang, Jiguang Zhang, Haoran Yang, Yu Zhang, Kexue Fu, Shide Du, Zhiwei Xu, Longxiang Gao, Li Guo, Shibiao Xu
Title: Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) is aimed at predicting the location of a query image by referencing a database of geotagged images. For the VPR task, a few discriminative local regions in an image often carry the decisive cues, while mundane background regions contribute little or even cause perceptual aliasing because they overlap easily across scenes. However, existing methods lack precise modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to improve the performance of image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of local correspondence ground truth for the VPR task. Third, we propose a re-ranking pipeline that is both efficient and precise, guided by the discriminative regions. Finally, experimental results show that our FoL achieves the state-of-the-art on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at https://github.com/chenshunpeng/FoL
中文摘要:Focus on Local (FoL) 方法通过挖掘图像中的判别性局部区域并引入伪相关监督,在视觉位置识别的图像检索和重排序阶段均实现了最优性能,同时显著提升了计算效率。
English Summary: The Focus on Local (FoL) approach enhances Visual Place Recognition by mining discriminative local regions and introducing pseudo-correlation supervision, achieving state-of-the-art performance in both image retrieval and re-ranking while improving computational efficiency.

Authors:Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy
Title: A Survey on Efficient Vision-Language Models
Abstract:
Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures, frameworks and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.
中文摘要:本综述探讨了在边缘设备上优化视觉语言模型效率的关键技术,通过精简架构和权衡性能与内存使用来应对计算挑战,并建立了持续更新的GitHub仓库以推动该领域深入研究。
English Summary: This survey examines techniques for optimizing vision-language models to enhance efficiency on edge devices, addressing computational challenges through compact architectures and performance-memory trade-offs while maintaining an updated GitHub repository to support ongoing research.

Authors:Yao Yuan, Pan Gao, Qun Dai, Jie Qin, Wei Xiang
Title: Uncertainty Guided Refinement for Fine-Grained Salient Object Detection
Abstract:
Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model's ability to make reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model's perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among multilevel features. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model's perception capability of uncertain regions, generating a highly-saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over state-of-the-art methodologies. Code will be released at https://github.com/I2-Multimedia-Lab/UGRAN.
Chinese: 本文提出不确定性引导细化注意力网络(UGRAN),通过多级交互注意力、尺度空间一致性注意力和不确定性细化注意力模块,结合自适应动态分区机制,有效提升模型对不确定区域的感知能力,在多个基准数据集上实现了优于现有方法的显著目标检测性能。
English: This paper introduces the Uncertainty Guided Refinement Attention Network (UGRAN) to address issues of unsaturated regions and shadows in salient object detection by enhancing the model's perception of uncertain regions through innovative modules and an adaptive dynamic partition mechanism, demonstrating superior performance on benchmark datasets.
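
A common way to derive the kind of uncertainty map the abstract describes is to measure how far each pixel's predicted saliency probability is from a confident 0 or 1, and then use that map to reweight refinement features. The minimal PyTorch sketch below follows that reading; the exact form used inside the URA module may differ.

```python
import torch

def uncertainty_map(saliency: torch.Tensor) -> torch.Tensor:
    """Pixel-wise uncertainty from a saliency probability map in [0, 1].

    Confident pixels (p near 0 or 1) get low uncertainty; ambiguous pixels
    (p near 0.5) get values near 1.
    """
    return 1.0 - (2.0 * saliency - 1.0).abs()

# reweight refinement features so ambiguous regions receive more attention
saliency = torch.rand(2, 1, 64, 64)          # (B, 1, H, W) predicted saliency
features = torch.randn(2, 32, 64, 64)        # (B, C, H, W) decoder features
refined = features * (1.0 + uncertainty_map(saliency))
print(refined.shape)                          # torch.Size([2, 32, 64, 64])
```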

Authors:Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum
Title: KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
Abstract:
Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in https://github.com/XingruiWang/KeyVID.
中文: KeyVID是一种音频到视频的框架,它首先从音频中定位关键帧时间点生成视觉关键帧,再通过插值补充中间帧,在保持计算效率的同时显著提升了动态动作的同步性和视频质量。
English: KeyVID is an audio-to-video framework that first identifies key moments from audio to generate visual keyframes, then interpolates intermediate frames, enhancing synchronization and quality for dynamic motions while maintaining computational efficiency.
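
As a rough stand-in for the audio keyframe localization step, the sketch below picks candidate keyframe timestamps at peaks of a short-time RMS energy envelope. KeyVID's localizer is learned from data, so this hand-crafted peak-picking is only an assumption-laden illustration of the interface (audio in, keyframe times out); `audio_keyframe_times` and all thresholds are mine.

```python
import numpy as np
from scipy.signal import find_peaks

def audio_keyframe_times(wave: np.ndarray, sr: int, fps: int = 24,
                         num_keyframes: int = 8) -> np.ndarray:
    """Pick keyframe timestamps (seconds) at peaks of the short-time RMS energy."""
    hop = sr // fps                                   # one energy value per video frame
    n_frames = len(wave) // hop
    frames = wave[: n_frames * hop].reshape(n_frames, hop)
    energy = np.sqrt((frames ** 2).mean(axis=1))      # RMS per video frame
    peaks, props = find_peaks(energy, distance=fps // 4, prominence=0.0)
    order = np.argsort(props["prominences"])[::-1][:num_keyframes]
    return np.sort(peaks[order]) / fps

# toy signal: tone bursts on a noise floor, sampled at 16 kHz for 4 seconds
sr = 16000
t = np.linspace(0, 4, 4 * sr, endpoint=False)
wave = 0.1 * np.random.randn(t.size) + np.sin(2 * np.pi * 3 * t) * (np.sin(2 * np.pi * 0.5 * t) > 0.9)
print(audio_keyframe_times(wave, sr))
```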

Authors:Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao
Title: SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
Abstract:
Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at https://github.com/earth-insights/SegEarth-R1.
中文摘要:本文提出地理空间像素推理新任务,通过构建EarthReason基准数据集和SegEarth-R1模型,将分层视觉编码与大语言模型相结合,在遥感图像隐式查询推理任务中实现了最先进的性能表现。
English Summary: This paper introduces geospatial pixel reasoning, a novel task for implicit querying and reasoning in remote sensing, supported by the EarthReason dataset and the SegEarth-R1 model that achieves state-of-the-art performance by integrating hierarchical vision encoding with large language models.

Authors:Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
Title: TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Abstract:
Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.
中文:近期强化学习提升了大型多模态模型的推理能力,但针对计算资源有限的研究者,我们提出的小规模视频推理模型TinyLLaVA-Video-R1在通用视频问答数据集上不仅显著增强了推理和思考能力,还展现出“顿悟时刻”的新兴特性,为小模型视频推理探索提供了实用参考。
English: Recent advances in reinforcement learning have enhanced large multimodal models' reasoning, yet small-scale models like TinyLLaVA-Video-R1, with under 4B parameters, show significant reasoning improvements and emergent "aha moments" on general Video-QA datasets, offering practical insights for resource-limited research.

Authors:Atharv Mahesh Mane, Dulanga Weerakoon, Vigneshwaran Subbaraju, Sougata Sen, Sanjay E. Sarma, Archan Misra
Title: Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
Abstract:
3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework, Imputer, and use it to curate a new benchmark dataset, ImputeRefer, for 3D-ERU by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% improvement in accuracy as compared to other 3D-ERU models and ~9% compared to other purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.
中文: 本研究提出了ImputeRefer基准数据集和Ges3ViG模型,通过将人类指向手势融入3D场景分析,在三维具身参照理解任务中显著超越了现有方法的性能表现。
English: The study introduces ImputeRefer, a novel benchmark dataset, and Ges3ViG, a model for 3D Embodied Reference Understanding that significantly outperforms existing methods by incorporating human pointing gestures into 3D scene analysis.

Authors:Jiuchen Chen, Xinyu Yan, Qizhi Xu, Kaiqi Li
Title: Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images
Abstract:
Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 $\times$ 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 $\times$ 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset are available at https://github.com/CastleChen339/DehazeXL.
中文: DehazeXL是一种新颖的去雾方法,能有效平衡全局上下文与局部特征,以最小GPU内存实现大尺寸高分辨率图像的端到端处理并达到最优性能,同时提供了新的8KDehaze数据集支持。
English: DehazeXL is a novel haze removal method that effectively balances global context and local features, enabling end-to-end processing of large high-resolution images with minimal GPU memory while achieving state-of-the-art performance, supported by a new 8KDehaze dataset.

Authors:Lexington Whalen, Zhenbang Du, Haoran You, Chaojian Li, Sixu Li, Yingyan, Lin
Title: Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training
Abstract:
Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at https://github.com/GATECH-EIC/Early-Bird-Diffusion.
中文:EB-Diff-Train方法通过利用基于时间步重要性的自适应稀疏性早期鸟彩票,在不损失生成质量的前提下显著加速扩散模型训练。
English: The EB-Diff-Train method accelerates diffusion model training by utilizing early-bird tickets that adapt sparsity based on timestep importance, achieving significant speedups without quality loss.
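
The early-bird criterion the abstract builds on is usually operationalized by comparing magnitude-pruning masks taken at different points in training and declaring a ticket once the mask stops changing. A minimal PyTorch sketch of that criterion, with helper names of my own choosing, is below; the timestep-aware, region-specific sparsity of EB-Diff-Train is not reproduced here.

```python
import torch

def prune_mask(model: torch.nn.Module, sparsity: float) -> torch.Tensor:
    """Global magnitude-pruning mask over all weight tensors, flattened."""
    weights = torch.cat([p.detach().abs().flatten()
                         for n, p in model.named_parameters() if "weight" in n])
    k = int(sparsity * weights.numel())
    threshold = weights.kthvalue(k).values if k > 0 else weights.new_tensor(0.0)
    return weights > threshold

def is_early_bird(prev_mask: torch.Tensor, cur_mask: torch.Tensor,
                  tol: float = 0.01) -> bool:
    """The ticket has emerged when the mask barely changes between checks."""
    hamming = (prev_mask != cur_mask).float().mean().item()
    return hamming < tol

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
m0 = prune_mask(model, sparsity=0.5)
# ... train for a while, then re-check ...
m1 = prune_mask(model, sparsity=0.5)
print(is_early_bird(m0, m1))   # True here because no training happened in between
```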

Authors:Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, Shanghang Zhang
Title: EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler
Abstract:
Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: https://github.com/PKUHaoWang/EmbodiedOcc2.
Chinese Summary: EmbodiedOcc++通过引入几何引导优化模块实现平面结构对齐,并结合语义感知不确定性采样器优化高斯更新,在保持计算效率的同时显著提升了三维占据预测的几何细节与边缘精度。
English Summary: EmbodiedOcc++ enhances 3D occupancy prediction by incorporating a Geometry-guided Refinement Module for planar surface alignment and a Semantic-aware Uncertainty Sampler for optimized Gaussian updates, achieving state-of-the-art performance with improved geometric accuracy.

Authors:Shuchao Duan, Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
Title: Trajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders
Abstract:
Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation, that encodes these subtle motions from a high-level structural perspective. Hence, we introduce Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at https://github.com/shuchaoduan/TraMP-Former.
中文:TraMP-Former框架通过融合面部关键点轨迹特征与视觉语义线索,在神经障碍数据集上实现了最先进的性能,推动了自动化面部表情质量评估的发展。
English: The TraMP-Former framework advances automated facial expression quality assessment by integrating landmark trajectory features and visual semantic cues, achieving state-of-the-art performance on neurological disorder datasets.

Authors:Ting Huang, Zeyu Zhang, Yemin Wang, Hao Tang
Title: 3D CoCa: Contrastive Learners are 3D Captioners
Abstract:
3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms current state-of-the-arts by 10.2% and 5.76% in CIDEr at 0.5IoU, respectively. Code will be available at https://github.com/AIGeeksGroup/3DCoCa.
中文: 提出的3D CoCa框架将对比学习与描述生成整合到统一架构中,无需外部检测器即可增强空间推理和语义对齐,在基准测试中实现了最先进的性能。
English: The proposed 3D CoCa framework integrates contrastive learning and caption generation in a unified architecture, achieving state-of-the-art performance on benchmarks by enhancing spatial reasoning and semantic alignment without external detectors.
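
The joint objective described above, contrastive alignment plus caption generation in one architecture, can be written as a CLIP-style symmetric InfoNCE term over pooled scene/text embeddings added to a token-level cross-entropy for the caption decoder. The sketch below shows that combination with illustrative shapes and weighting; it is not the paper's exact loss implementation.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(scene_emb: torch.Tensor, text_emb: torch.Tensor,
                    caption_logits: torch.Tensor, caption_tokens: torch.Tensor,
                    temperature: float = 0.07, lam: float = 1.0) -> torch.Tensor:
    """Joint objective: symmetric InfoNCE over (scene, text) pairs + caption CE.

    scene_emb, text_emb: (B, D) pooled embeddings of matched scene/caption pairs.
    caption_logits:      (B, T, V) decoder logits.
    caption_tokens:      (B, T) ground-truth token ids.
    """
    s = F.normalize(scene_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.t() / temperature
    targets = torch.arange(s.size(0), device=s.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    captioning = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    return contrastive + lam * captioning

B, D, T, V = 4, 256, 12, 1000
loss = coca_style_loss(torch.randn(B, D), torch.randn(B, D),
                       torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(loss)
```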

Authors:Yexing Xu, Longguang Wang, Minglin Chen, Sheng Ao, Li Li, Yulan Guo
Title: DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering
Abstract:
Although 3D Gaussian Splatting (3DGS) has demonstrated promising results in novel view synthesis, its performance degrades dramatically with sparse inputs and generates undesirable artifacts. As the number of training views decreases, the novel view synthesis task degrades to a highly under-determined problem such that existing methods suffer from the notorious overfitting issue. Interestingly, we observe that models with fewer Gaussian primitives exhibit less overfitting under sparse inputs. Inspired by this observation, we propose a Random Dropout Regularization (RDR) to exploit the advantages of low-complexity models to alleviate overfitting. In addition, to remedy the lack of high-frequency details for these models, an Edge-guided Splitting Strategy (ESS) is developed. With these two techniques, our method (termed DropoutGS) provides a simple yet effective plug-in approach to improve the generalization performance of existing 3DGS methods. Extensive experiments show that our DropoutGS produces state-of-the-art performance under sparse views on benchmark datasets including Blender, LLFF, and DTU. The project page is at: https://xuyx55.github.io/DropoutGS/.
Chinese: 本文提出DropoutGS方法,通过随机丢弃正则化减少过拟合和边缘引导分裂策略保留细节,显著提升了稀疏输入下3D高斯泼溅的性能,在多个基准数据集上达到最优效果。
English: This paper introduces DropoutGS, a method that enhances 3D Gaussian Splatting by incorporating Random Dropout Regularization to reduce overfitting and an Edge-guided Splitting Strategy to preserve details, achieving state-of-the-art performance with sparse inputs.
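
The Random Dropout Regularization idea, rendering each training iteration with a random subset of the Gaussian primitives, is simple enough to sketch directly. The snippet below drops primitives with a Bernoulli keep mask over all per-Gaussian attributes; integration with an actual 3DGS rasterizer and the Edge-guided Splitting Strategy are omitted, and `dropout_gaussians` is an assumed helper name.

```python
import torch

def dropout_gaussians(params: dict, keep_prob: float = 0.8) -> dict:
    """Keep a random subset of Gaussian primitives for this training iteration.

    params maps attribute names (e.g. 'xyz', 'opacity', 'scale') to (N, ...)
    tensors that share the same leading primitive dimension.
    """
    n = next(iter(params.values())).shape[0]
    keep = torch.rand(n) < keep_prob                  # Bernoulli mask over primitives
    return {k: v[keep] for k, v in params.items()}

gaussians = {"xyz": torch.randn(10000, 3),
             "opacity": torch.rand(10000, 1),
             "scale": torch.rand(10000, 3)}
subset = dropout_gaussians(gaussians, keep_prob=0.8)
print(subset["xyz"].shape[0], "of", gaussians["xyz"].shape[0], "Gaussians kept")
```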

Authors:Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang
Title: Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Abstract:
Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far gone unevaluated. In this work, we present a systematic review of VLM-based detection and segmentation, treat the VLM as a foundational model, and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) As for detection tasks, we evaluate VLMs under three finetuning granularities: \textit{zero prediction}, \textit{visual fine-tuning}, and \textit{text prompt}, and further analyze how different finetuning strategies impact performance under varied tasks. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work will be valuable to pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem and familiarizing them with the current status of the progress, while providing promising directions for future research. A project associated with this review and evaluation has been created at https://github.com/better-chao/perceptual_abilities_evaluation.
Chinese: 本研究首次对视觉语言模型在多种检测与分割任务中进行系统性评估,揭示了其性能特点,并为未来模型设计与训练策略提供了重要见解。
English: This study conducts the first comprehensive evaluation of Vision-Language Models (VLMs) across multiple detection and segmentation tasks, revealing their performance characteristics and providing insights into model design and training strategies.

Authors:Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao
Title: D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation
Abstract:
Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at the first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT) at the second stage generates images by predicting multi-grained noise, consisting of coarse-grained (fewer latent codes in smooth regions) and fine-grained (more latent codes in detailed regions), through a novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough noise prediction with detail-region correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at https://github.com/jiawn-creator/Dynamic-DiT.
中文摘要:本文提出了一种新颖的两阶段框架,通过动态变分自编码器和动态扩散变换器根据图像区域的信息密度进行自适应压缩,利用多粒度噪声预测实现了生成图像全局一致性与局部真实性的统一提升。
English Summary: This paper introduces a novel two-stage framework with Dynamic VAE and Dynamic Diffusion Transformer that dynamically compresses image regions based on their information density, achieving enhanced global consistency and local realism in generated images through multi-grained noise prediction.

Authors:Lin Zhu, Xinbing Wang, Chenghu Zhou, Nanyang Ye
Title: Bayesian Cross-Modal Alignment Learning for Few-Shot Out-of-Distribution Generalization
Abstract:
Recent advances in large pre-trained models showed promising results in few-shot learning. However, their generalization ability on two-dimensional Out-of-Distribution (OoD) data, i.e., correlation shift and diversity shift, has not been thoroughly investigated. Research has shown that even with a significant amount of training data, few methods can achieve better performance than the standard empirical risk minimization method (ERM) in OoD generalization. This few-shot OoD generalization dilemma emerges as a challenging direction in deep neural network generalization research, where the performance suffers from overfitting on few-shot examples and OoD generalization errors. In this paper, leveraging a broader supervision source, we explore a novel Bayesian cross-modal image-text alignment learning method (Bayes-CAL) to address this issue. Specifically, the model is designed so that only text representations are fine-tuned, via a Bayesian modelling approach with gradient orthogonalization loss and invariant risk minimization (IRM) loss. The Bayesian approach is essentially introduced to avoid overfitting the base classes observed during training and improve generalization to broader unseen classes. The dedicated loss is introduced to achieve better image-text alignment by disentangling the causal and non-causal parts of image features. Numerical experiments demonstrate that Bayes-CAL achieves state-of-the-art OoD generalization performance on two-dimensional distribution shifts. Moreover, compared with CLIP-like models, Bayes-CAL yields more stable generalization performance on unseen classes. Our code is available at https://github.com/LinLLLL/BayesCAL.
中文: 本文提出的贝叶斯跨模态对齐方法Bayes-CAL通过专门设计的损失函数微调文本表征,有效解决了小样本分布外泛化问题,在未见类别上展现出更稳定的性能表现。
English: This paper introduces Bayes-CAL, a Bayesian cross-modal alignment method that addresses few-shot OoD generalization by fine-tuning text representations with specialized losses to prevent overfitting and improve stability on unseen classes.
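
The abstract mentions a gradient orthogonalization loss and an IRM loss on the fine-tuned text representations without spelling them out. The sketch below shows the standard IRMv1 penalty together with one plausible gradient-orthogonality penalty (squared cosine similarity between the gradients of two objectives); treat both as assumptions about the general recipe rather than Bayes-CAL's exact formulation.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """IRMv1 penalty: squared gradient of the risk w.r.t. a dummy scale w = 1."""
    w = torch.ones(1, requires_grad=True, device=logits.device)
    risk = F.cross_entropy(logits * w, labels)
    (grad,) = torch.autograd.grad(risk, w, create_graph=True)
    return (grad ** 2).sum()

def grad_orthogonality(loss_a: torch.Tensor, loss_b: torch.Tensor,
                       params: list) -> torch.Tensor:
    """Penalize alignment between the gradients of two objectives."""
    ga = torch.cat([g.flatten() for g in torch.autograd.grad(
        loss_a, params, create_graph=True, retain_graph=True)])
    gb = torch.cat([g.flatten() for g in torch.autograd.grad(
        loss_b, params, create_graph=True, retain_graph=True)])
    return F.cosine_similarity(ga, gb, dim=0) ** 2

# toy usage on a linear head standing in for the fine-tuned text branch
head = torch.nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
logits = head(x)
loss_cls = F.cross_entropy(logits, y)
loss_irm = irm_penalty(logits, y)
total = loss_cls + 1.0 * loss_irm + 0.1 * grad_orthogonality(loss_cls, loss_irm, list(head.parameters()))
print(total)
```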

Authors:Jiawei Wu, Zhifei Yang, Zhe Wang, Zhi Jin
Title: Gradient as Conditions: Rethinking HOG for All-in-one Image Restoration
Abstract:
All-in-one image restoration (AIR) aims to address diverse degradations within a unified model by leveraging informative degradation conditions to guide the restoration process. However, existing methods often rely on implicitly learned priors, which may entangle feature representations and hinder performance in complex or unseen scenarios. We observe that the Histogram of Oriented Gradients (HOG), a classical gradient representation, has strong discriminative capability across diverse degradations, making it a powerful and interpretable prior for AIR. Based on this insight, we propose HOGformer, a Transformer-based model that integrates learnable HOG features for degradation-aware restoration. The core of HOGformer is a Dynamic HOG-aware Self-Attention (DHOGSA) mechanism, which adaptively models long-range spatial dependencies conditioned on degradation-specific cues encoded by HOG descriptors. To further adapt to the heterogeneity of degradations in AIR, we propose a Dynamic Interaction Feed-Forward (DIFF) module that facilitates channel-spatial interactions, enabling robust feature transformation under diverse degradations. Besides, we propose a HOG loss to explicitly enhance structural fidelity and edge sharpness. Extensive experiments on a variety of benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes well to complex real-world scenarios. Code is available at https://github.com/Fire-friend/HOGformer.
Chinese: HOGformer是一种基于Transformer的模型,它利用方向梯度直方图(HOG)作为可解释的先验知识进行一体化图像恢复,通过动态HOG感知自注意力机制和自适应特征交互模块,实现了最先进的性能表现。
English: HOGformer is a Transformer-based model that utilizes Histogram of Oriented Gradients (HOG) as an interpretable prior for all-in-one image restoration, achieving state-of-the-art performance by integrating dynamic HOG-aware self-attention and adaptive feature interaction modules.
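
To make the "HOG as a degradation condition" idea tangible, the sketch below extracts a global HOG descriptor with scikit-image and uses it to predict per-channel scale/shift modulation of restoration features (FiLM-style conditioning). HOGformer instead injects learnable HOG features inside its attention and feed-forward blocks, so this is a simplified stand-in; `HOGFiLM` and the cell/block sizes are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import hog

def hog_condition(image: np.ndarray) -> torch.Tensor:
    """Global HOG descriptor of a grayscale image in [0, 1], as a 1D tensor."""
    desc = hog(image, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2), feature_vector=True)
    return torch.from_numpy(desc).float()

class HOGFiLM(nn.Module):
    """Modulate restoration features with scale/shift predicted from the HOG prior."""
    def __init__(self, hog_dim: int, channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(hog_dim, 2 * channels)

    def forward(self, feats: torch.Tensor, hog_vec: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(hog_vec).chunk(2, dim=-1)
        return feats * (1 + scale[None, :, None, None]) + shift[None, :, None, None]

image = np.random.rand(128, 128)                      # degraded grayscale input
cond = hog_condition(image)
film = HOGFiLM(hog_dim=cond.numel(), channels=32)
feats = torch.randn(1, 32, 128, 128)                  # restoration backbone features
print(film(feats, cond).shape)                        # torch.Size([1, 32, 128, 128])
```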

Authors:Iason Chaimalas, Arnas Vyšniauskas, Gabriel Brostow
Title: Explorer: Robust Collection of Interactable GUI Elements
Abstract:
Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data-collection to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML's precision on a specific app. We therefore take the perspective that a given user needs confidence, that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML-training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e. interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.
Chinese: Explorer系统通过专注于检测按钮和文本字段等交互元素,利用实时应用数据训练机器学习模型,实现个性化的图形用户界面自动化,从而提供精确的用户特定可访问性,并通过语音命令进行路径规划。
English: The Explorer system enables personalized automation of graphical user interfaces by focusing on detecting interactive elements like buttons and text fields, using live application data to train machine learning models for precise, user-specific accessibility and path planning through audio commands.

Authors:You Wu, Xucheng Wang, Xiangyang Yang, Mengyuan Liu, Dan Zeng, Hengzhou Ye, Shuiwang Li
Title: Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking
Abstract:
Single-stream architectures using Vision Transformer (ViT) backbones have recently shown great potential for real-time UAV tracking. However, frequent occlusions from obstacles like buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. This random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Code is available at https://github.com/wuyou3474/ORTrack.
中文: 本研究提出OR Track框架,通过基于空间Cox过程的随机掩码学习抗遮挡特征表示,以增强单流视觉Transformer模型在无人机跟踪中的性能,并利用自适应知识蒸馏开发出高效的实时版本OR Track-D。
English: This study introduces ORTrack, a novel framework that enhances single-stream Vision Transformer models for UAV tracking by learning occlusion-robust representations through spatial Cox process-based random masking, and develops ORTrack-D via adaptive knowledge distillation for efficient real-time performance.
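
The occlusion-simulating masking can be illustrated with a homogeneous special case of the spatial Cox process the abstract mentions: draw the number of masked patch tokens from a Poisson distribution, mask them, and ask the encoder to produce the same representation as for the clean view. The PyTorch sketch below follows that reading; the tiny encoder and the MSE invariance loss are placeholders for ORTrack's actual ViT and objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def poisson_patch_mask(num_patches: int, rate: float = 12.0) -> torch.Tensor:
    """Boolean mask over patch tokens; the number of masked patches ~ Poisson(rate)."""
    k = min(int(torch.poisson(torch.tensor(float(rate))).item()), num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[torch.randperm(num_patches)[:k]] = True
    return mask

encoder = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
tokens = torch.randn(1, 196, 256)                     # (B, N, D) template/search patch tokens
mask = poisson_patch_mask(tokens.shape[1])

masked = tokens.clone()
masked[:, mask] = 0.0                                 # crude stand-in for occluded patches
# occlusion-robustness objective: the masked view should yield the same
# representation as the clean view (clean branch treated as the target)
loss = F.mse_loss(encoder(masked), encoder(tokens).detach())
print(f"{mask.sum().item()} patches masked, invariance loss {loss.item():.4f}")
```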

Authors:Tzoulio Chamiti, Leandro Di Bella, Adrian Munteanu, Nikos Deligiannis
Title: ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking
Abstract:
Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but both require supervised training and potentially struggle to generalize to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge, enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The code is available at https://github.com/Tzoulio/ReferGPT
中文: ReferGPT是一种零样本多目标跟踪框架,通过具备空间知识的多模态大语言模型生成三维感知描述,并采用基于CLIP的语义匹配策略,在无需训练的情况下于自动驾驶基准测试中展现出竞争优势。
English: ReferGPT is a zero-shot framework that uses a multi-modal large language model with spatial knowledge to generate 3D-aware captions and employs a CLIP-based matching strategy for robust object tracking without training, achieving competitive performance in autonomous driving benchmarks.
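
The query-matching strategy, semantic similarity from CLIP text features blended with a fuzzy string match, reduces to a weighted score like the one below. The embeddings here are random placeholders standing in for CLIP features, and `difflib` is used for the fuzzy ratio; the blend weight `alpha` is an assumption.

```python
import difflib
import numpy as np

def matching_score(caption_emb: np.ndarray, query_emb: np.ndarray,
                   caption: str, query: str, alpha: float = 0.7) -> float:
    """Blend semantic (embedding cosine) and lexical (fuzzy ratio) similarity."""
    cos = float(caption_emb @ query_emb /
                (np.linalg.norm(caption_emb) * np.linalg.norm(query_emb) + 1e-8))
    fuzzy = difflib.SequenceMatcher(None, caption.lower(), query.lower()).ratio()
    return alpha * cos + (1.0 - alpha) * fuzzy

# placeholder embeddings standing in for CLIP text features of each string
rng = np.random.default_rng(0)
cap_emb, qry_emb = rng.normal(size=512), rng.normal(size=512)
score = matching_score(cap_emb, qry_emb,
                       caption="a dark grey car driving ahead in the same lane",
                       query="the moving car in front of the ego vehicle")
print(round(score, 3))
```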

Authors:Shengyu Gong, Yueyang Li, Zijian Kang, Weiming Zeng, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang
Title: LEL: A Novel Lipschitz Continuity-constrained Ensemble Learning Model for EEG-based Emotion Recognition
Abstract:
The accurate and efficient recognition of emotional states in oneself and others is critical, as impairments in this ability can lead to significant psychosocial difficulties. While electroencephalography (EEG) offers a powerful tool for emotion detection, current EEG-based emotion recognition (EER) methods face key limitations: insufficient model stability, limited accuracy in processing high-dimensional nonlinear EEG signals, and poor robustness against intra-subject variability and signal noise. To address these challenges, we introduce LEL (Lipschitz continuity-constrained Ensemble Learning), a novel framework that enhances EEG-based emotion recognition. By integrating Lipschitz continuity constraints, LEL ensures greater model stability and improves generalization, thereby reducing sensitivity to signal variability and noise while significantly boosting the model's overall accuracy and robustness. Its ensemble learning strategy optimizes overall performance by fusing decisions from multiple classifiers to reduce single-model bias and variance. Experimental results on three public benchmark datasets (EAV, FACED and SEED) demonstrated the LEL's state-of-the-art performance, achieving average recognition accuracies of 76.43%, 83.00% and 87.22%, respectively. The official implementation codes are released at https://github.com/NZWANG/LEL.
Chinese: LEL框架通过整合Lipschitz连续性约束和集成学习,显著提升了基于脑电图的情绪识别性能,在多个基准数据集上实现了最优的准确性和鲁棒性。
English: The LEL framework enhances EEG-based emotion recognition by integrating Lipschitz continuity constraints and ensemble learning, achieving state-of-the-art accuracy and robustness on benchmark datasets.
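
The abstract does not specify how the Lipschitz continuity constraint is enforced, so the sketch below uses spectral normalization, one common way to bound a layer's Lipschitz constant, on a small EEG feature classifier, and combines several such classifiers by soft voting. It is a hedged illustration of the two ingredients (constraint plus ensemble), not LEL's released architecture.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_classifier(in_dim: int, num_classes: int) -> nn.Module:
    """Small MLP whose linear layers are spectrally normalized, bounding each
    layer's Lipschitz constant (one common way to impose such a constraint)."""
    return nn.Sequential(
        spectral_norm(nn.Linear(in_dim, 128)), nn.ReLU(),
        spectral_norm(nn.Linear(128, num_classes)),
    )

def ensemble_predict(models: list, x: torch.Tensor) -> torch.Tensor:
    """Soft voting: average class probabilities across ensemble members."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
    return probs.argmax(dim=-1)

features = torch.randn(16, 310)                       # e.g. differential-entropy EEG features
ensemble = [make_classifier(310, 3) for _ in range(5)]
print(ensemble_predict(ensemble, features))
```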

Authors:Yunfei Long, Abhinav Kumar, Xiaoming Liu, Daniel Morris
Title: RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection
Abstract:
Radar hits reflect from points both on the boundary of and interior to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves state-of-the-art radar-camera detection performance on nuScenes. Our source code is available at https://github.com/longyunf/riccardo.
Chinese: 本文提出一种方法,通过显式建模雷达命中分布来改进雷达-相机融合,利用预测核与上下文融合优化匹配得分,在nuScenes数据集上实现了最先进的检测性能。
English: This paper introduces a method that explicitly models radar hit distributions to enhance radar-camera fusion, achieving state-of-the-art detection performance on nuScenes by refining matching scores through predicted kernels and contextual fusion.
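
The kernel-matching step can be pictured as follows: predict a radar-hit distribution around a monocular detection, then score candidate object positions by how much predicted density falls on the measured radar points. The NumPy sketch below uses a hand-built anisotropic Gaussian aligned with the box heading in place of the learned distribution model; all names and box parameters are illustrative.

```python
import numpy as np

def hit_kernel(points: np.ndarray, center: np.ndarray, length: float, width: float,
               yaw: float) -> np.ndarray:
    """Anisotropic Gaussian kernel over BEV radar points, aligned with the box heading."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, s], [-s, c]])                   # world -> object frame
    local = (points - center) @ R.T
    sigma = np.array([length / 2.0, width / 2.0])
    return np.exp(-0.5 * np.sum((local / sigma) ** 2, axis=1))

def matching_scores(radar_points: np.ndarray, candidates: np.ndarray,
                    length: float, width: float, yaw: float) -> np.ndarray:
    """Score each candidate center by how well measured hits match the kernel."""
    return np.array([hit_kernel(radar_points, c, length, width, yaw).sum()
                     for c in candidates])

radar = np.array([[10.2, 5.1], [10.8, 4.9], [11.5, 5.3]])     # measured hits (x, y) in BEV
monocular_center = np.array([11.0, 5.0])                       # rough center from the camera
offsets = np.linspace(-1.0, 1.0, 5)                            # candidate shifts along x
candidates = monocular_center + np.stack([offsets, np.zeros_like(offsets)], axis=1)
scores = matching_scores(radar, candidates, length=4.5, width=1.9, yaw=0.0)
print(candidates[scores.argmax()])                             # refined position estimate
```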

Authors:Zhijie Shen, Chunyu Lin, Shujuan Huang, Lang Nie, Kang Liao, Yao Zhao
Title: You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian Splatting
Abstract:
Recently, reconstructing scenes from a single panoramic image using advanced 3D Gaussian Splatting (3DGS) techniques has attracted growing interest. Panoramic images offer a 360$\times$ 180 field of view (FoV), capturing the entire scene in a single shot. However, panoramic images introduce severe distortion, making it challenging to render 3D Gaussians into 2D distorted equirectangular space directly. Converting equirectangular images to cubemap projections partially alleviates this problem but introduces new challenges, such as projection distortion and discontinuities across cube-face boundaries. To address these limitations, we present a novel framework, named TPGS, to bridge continuous panoramic 3D scene reconstruction with perspective Gaussian splatting. Firstly, we introduce a Transition Plane between adjacent cube faces to enable smoother transitions in splatting directions and mitigate optimization ambiguity in the boundary region. Moreover, an intra-to-inter face optimization strategy is proposed to enhance local details and restore visual consistency across cube-face boundaries. Specifically, we optimize 3D Gaussians within individual cube faces and then fine-tune them in the stitched panoramic space. Additionally, we introduce a spherical sampling technique to eliminate visible stitching seams. Extensive experiments on indoor and outdoor, egocentric, and roaming benchmark datasets demonstrate that our approach outperforms existing state-of-the-art methods. Code and models will be available at https://github.com/zhijieshen-bjtu/TPGS.
中文摘要:TPGS框架通过引入过渡平面和优化策略,解决了全景3D高斯溅射中的失真和边界问题,实现了更优的场景重建效果。
English Summary: The TPGS framework addresses distortion and boundary issues in panoramic 3D Gaussian splatting by introducing a transition plane and optimization strategy to achieve superior scene reconstruction.

Authors:Yongchang Wu, Zipeng Qi, Zhenwei Shi, Zhengxia Zou
Title: BlockGaussian: Efficient Large-Scale Scene Novel View Synthesis via Adaptive Block-Based Gaussian Splatting
Abstract:
The recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated remarkable potential in novel view synthesis tasks. The divide-and-conquer paradigm has enabled large-scale scene reconstruction, but significant challenges remain in scene partitioning, optimization, and merging processes. This paper introduces BlockGaussian, a novel framework incorporating a content-aware scene partition strategy and visibility-aware block optimization to achieve efficient and high-quality large-scale scene reconstruction. Specifically, our approach considers the content-complexity variation across different regions and balances computational load during scene partitioning, enabling efficient scene reconstruction. To tackle the supervision mismatch issue during independent block optimization, we introduce auxiliary points during individual block optimization to align the ground-truth supervision, which enhances the reconstruction quality. Furthermore, we propose a pseudo-view geometry constraint that effectively mitigates rendering degradation caused by airspace floaters during block merging. Extensive experiments on large-scale scenes demonstrate that our approach achieves state-of-the-art performance in both reconstruction efficiency and rendering quality, with a 5x speedup in optimization and an average PSNR improvement of 1.21 dB on multiple benchmarks. Notably, BlockGaussian significantly reduces computational requirements, enabling large-scale scene reconstruction on a single 24GB VRAM device. The project page is available at https://github.com/SunshineWYC/BlockGaussian
中文:BlockGaussian通过内容感知的场景分割和可见性感知的块优化,实现了高效的大规模3D场景重建,在优化速度提升5倍的同时PSNR提高1.21分贝,并显著降低了硬件需求。
English: BlockGaussian introduces a content-aware partitioning and visibility-aware optimization framework that achieves efficient large-scale 3D scene reconstruction with 5x faster optimization and 1.21 dB PSNR improvement while reducing hardware requirements.

Authors:Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu
Title: GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Abstract:
In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
中文摘要:GigaTok通过语义正则化将分词器特征与预训练视觉编码器对齐,解决了视觉分词器扩展中重建与生成的矛盾,在30亿参数规模下实现了重建、生成和表征学习的同步提升。
English Summary: GigaTok introduces semantic regularization to align tokenizer features with pre-trained visual encoders, enabling scalable visual tokenizers that simultaneously enhance image reconstruction, generation, and representation learning while achieving state-of-the-art performance at 3 billion parameters.
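
The semantic regularization term can be read as aligning (projected) tokenizer features with features from a frozen pre-trained visual encoder, e.g. via a cosine distance averaged over spatial tokens. The sketch below implements that reading; the projection head, the choice of frozen encoder, and the loss weight are assumptions rather than GigaTok's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticRegularizer(nn.Module):
    """Align tokenizer features with a frozen pre-trained encoder's features."""
    def __init__(self, tok_dim: int, sem_dim: int):
        super().__init__()
        self.proj = nn.Linear(tok_dim, sem_dim)       # tokenizer space -> semantic space

    def forward(self, tok_feats: torch.Tensor, sem_feats: torch.Tensor) -> torch.Tensor:
        """tok_feats: (B, N, tok_dim) from the tokenizer encoder.
        sem_feats:  (B, N, sem_dim) from a frozen semantic encoder, detached."""
        p = F.normalize(self.proj(tok_feats), dim=-1)
        q = F.normalize(sem_feats.detach(), dim=-1)
        return (1.0 - (p * q).sum(dim=-1)).mean()     # mean cosine distance

reg = SemanticRegularizer(tok_dim=256, sem_dim=768)
tok_feats = torch.randn(2, 196, 256)                  # tokenizer encoder features
sem_feats = torch.randn(2, 196, 768)                  # frozen semantic encoder features
print(reg(tok_feats, sem_feats))                      # added to the usual tokenizer losses
```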

Authors:Masashi Hatano, Zhifan Zhu, Hideo Saito, Dima Damen
Title: The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation
Abstract:
Forecasting hand motion and pose from an egocentric perspective is essential for understanding human intention. However, existing methods focus solely on predicting positions without considering articulation, and only when the hands are visible in the field of view. This limitation overlooks the fact that approximate hand positions can still be inferred even when they are outside the camera's view. In this paper, we propose a method to forecast the 3D trajectories and poses of both hands from an egocentric video, both in and out of the field of view. We propose a diffusion-based transformer architecture for Egocentric Hand Forecasting, EgoH4, which takes as input the observation sequence and camera poses, then predicts future 3D motion and poses for both hands of the camera wearer. We leverage full-body pose information, allowing other joints to provide constraints on hand motion. We denoise the hand and body joints along with a visibility predictor for hand joints and a 3D-to-2D reprojection loss that minimizes the error when hands are in-view. We evaluate EgoH4 on the Ego-Exo4D dataset, combining subsets with body and hand annotations. We train on 156K sequences and evaluate on 34K sequences, respectively. EgoH4 improves the performance by 3.4cm and 5.1cm over the baseline in terms of ADE for hand trajectory forecasting and MPJPE for hand pose forecasting. Project page: https://masashi-hatano.github.io/EgoH4/
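A hedged sketch of the visibility-masked 3D-to-2D reprojection loss mentioned above: predicted 3D hand joints are projected into the egocentric image, and the 2D error is penalised only where the joints are visible. The camera convention (world-to-camera rotation R, translation t, intrinsics K) and the tensor shapes are assumptions.

```python
import torch

def reprojection_loss(joints_3d, K, R, t, joints_2d_gt, vis_mask):
    """joints_3d: (B, J, 3) world coords, K: (B, 3, 3), R: (B, 3, 3), t: (B, 3),
    joints_2d_gt: (B, J, 2), vis_mask: (B, J) with 1 for in-view joints."""
    cam = torch.einsum('bij,bkj->bki', R, joints_3d) + t[:, None, :]   # world -> camera
    uvw = torch.einsum('bij,bkj->bki', K, cam)                          # camera -> pixels
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)
    err = ((uv - joints_2d_gt) ** 2).sum(dim=-1)                        # per-joint squared error
    return (err * vis_mask).sum() / vis_mask.sum().clamp(min=1.0)
```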

Authors:Ian Noronha, Advait Prasad Jawaji, Juan Camilo Soto, Jiajun An, Yan Gu, Upinder Kaur
Title: MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction
Abstract:
Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this critical area.
中文: MBE-ARI数据集通过提供奶牛与机器人交互的多模态同步数据及高精度四足动物姿态估计模型,填补了动物-机器人交互研究领域的资源空白,为推进双向通信研究奠定了坚实基础。
English: The MBE-ARI dataset addresses the lack of resources in animal-robot interaction by providing synchronized multimodal data of cow-robot interactions and a high-precision quadruped pose estimation model, establishing a foundation for advancing bidirectional communication research.

Authors:Gabriele Lozupone, Alessandro Bria, Francesco Fontanella, Frederick J. A. Meijer, Claudio De Stefano, Henkjan Huisman
Title: Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging
Abstract:
This study presents Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging, focusing on Alzheimer disease (AD) using brain MR from the ADNI database as a case study. Unlike conventional diffusion autoencoders operating in image space, LDAE applies the diffusion process in a compressed latent representation, improving computational efficiency and making 3D medical imaging representation learning tractable. To validate the proposed approach, we explore two key hypotheses: (i) LDAE effectively captures meaningful semantic representations on 3D brain MR associated with AD and ageing, and (ii) LDAE achieves high-quality image generation and reconstruction while being computationally efficient. Experimental results support both hypotheses: (i) linear-probe evaluations demonstrate promising diagnostic performance for AD (ROC-AUC: 90%, ACC: 84%) and age prediction (MAE: 4.1 years, RMSE: 5.2 years); (ii) the learned semantic representations enable attribute manipulation, yielding anatomically plausible modifications; (iii) semantic interpolation experiments show strong reconstruction of missing scans, with SSIM of 0.969 (MSE: 0.0019) for a 6-month gap. Even for longer gaps (24 months), the model maintains robust performance (SSIM > 0.93, MSE < 0.004), indicating an ability to capture temporal progression trends; (iv) compared to conventional diffusion autoencoders, LDAE significantly increases inference throughput (20x faster) while also enhancing reconstruction quality. These findings position LDAE as a promising framework for scalable medical imaging applications, with the potential to serve as a foundation model for medical image analysis. Code available at https://github.com/GabrieleLozupone/LDAE
中文: 本研究提出的潜在扩散自编码器在压缩潜在空间应用扩散过程,实现了高效的3D医学影像表征学习,不仅对阿尔茨海默病展现出优异诊断性能,还具备高质量图像重建能力和显著提升的计算效率。
English: This study introduces the Latent Diffusion Autoencoder (LDAE), a novel framework that applies diffusion processes in compressed latent space to achieve efficient 3D medical imaging representation learning, demonstrating strong diagnostic performance for Alzheimer's disease and high-quality image reconstruction with significantly improved computational efficiency.

Authors:Renu Sharma, Debasmita Pal, Arun Ross
Title: Task-conditioned Ensemble of Expert Models for Continuous Learning
Abstract:
One of the major challenges in machine learning is maintaining the accuracy of the deployed model (e.g., a classifier) in a non-stationary environment. The non-stationary environment results in distribution shifts and, consequently, a degradation in accuracy. Continuous learning of the deployed model with new data could be one remedy. However, the question arises as to how we should update the model with new training data so that it retains its accuracy on the old data while adapting to the new data. In this work, we propose a task-conditioned ensemble of models to maintain the performance of the existing model. The method involves an ensemble of expert models based on task membership information. The in-domain models, which are based on the local outlier concept (and are distinct from the expert models), provide task membership information dynamically at run-time for each probe sample. To evaluate the proposed method, we experiment with three setups: the first represents distribution shift between tasks (LivDet-Iris-2017), the second represents distribution shift both between and within tasks (LivDet-Iris-2020), and the third represents disjoint distribution between tasks (Split MNIST). The experiments highlight the benefits of the proposed method. The source code is available at https://github.com/iPRoBe-lab/Continuous_Learning_FE_DM.
中文: 本研究针对非稳态环境中模型精度下降的问题,提出了一种基于任务条件的集成方法,通过局部离群值概念动态组合专家模型,使模型在适应新数据的同时保持对原有数据的性能。
English: This work addresses the challenge of maintaining model accuracy in non-stationary environments by proposing a task-conditioned ensemble method that dynamically combines expert models using local outlier concepts to adapt to new data while preserving performance on old data.
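A minimal sketch of task-conditioned routing of this kind, using scikit-learn's LocalOutlierFactor as a stand-in for the paper's in-domain models: each task keeps an outlier model fitted on its own training features, and a probe is routed to the expert whose task it appears most in-domain for. Class and method names are illustrative, not the paper's API.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

class TaskRouter:
    """Route each probe to the expert whose task it looks most in-domain for."""
    def __init__(self, n_neighbors: int = 20):
        self.n_neighbors = n_neighbors
        self.in_domain_models = []

    def add_task(self, task_features: np.ndarray):
        lof = LocalOutlierFactor(n_neighbors=self.n_neighbors, novelty=True)
        lof.fit(task_features)          # features of this task's training data
        self.in_domain_models.append(lof)

    def route(self, probe_features: np.ndarray) -> np.ndarray:
        # Higher score_samples = more typical of that task's distribution.
        scores = np.stack([m.score_samples(probe_features)
                           for m in self.in_domain_models], axis=1)
        return scores.argmax(axis=1)    # index of the expert model to query
```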

Authors:Gesina Schwalbe, Georgii Mikriukov, Edgar Heinert, Stavros Gerolymatos, Mert Keser, Alois Knoll, Matthias Rottmann, Annika Mütze
Title: On Background Bias of Post-Hoc Concept Embeddings in Computer Vision DNNs
Abstract:
The thriving research field of concept-based explainable artificial intelligence (C-XAI) investigates how human-interpretable semantic concepts embed in the latent spaces of deep neural networks (DNNs). Post-hoc approaches therein use a set of examples to specify a concept, and determine its embeddings in DNN latent space using data driven techniques. This proved useful to uncover biases between different target (foreground or concept) classes. However, given that the background is mostly uncontrolled during training, an important question has been left unattended so far: To what extent are state-of-the-art, data-driven post-hoc C-XAI approaches themselves prone to biases with respect to their backgrounds? E.g., wild animals mostly occur against vegetation backgrounds, and they seldom appear on roads. Even simple and robust C-XAI methods might abuse this shortcut for enhanced performance. A dangerous performance degradation of the concept-corner cases of animals on the road could thus remain undiscovered. This work validates and thoroughly confirms that established Net2Vec-based concept segmentation techniques frequently capture background biases, including alarming ones, such as underperformance on road scenes. For the analysis, we compare 3 established techniques from the domain of background randomization on >50 concepts from 2 datasets, and 7 diverse DNN architectures. Our results indicate that even low-cost setups can provide both valuable insight and improved background robustness.
Chinese: 基于概念的可解释人工智能方法常从训练数据中无意识地学习背景偏差,导致物体出现在非典型背景场景时性能不可靠,这一点已通过跨多个数据集和神经网络架构的系统性测试得到验证。
English: Concept-based explainable AI methods often unintentionally learn background biases from training data, leading to unreliable performance in scenarios where objects appear against atypical backgrounds, as demonstrated through systematic testing across multiple datasets and neural network architectures.

Authors:Maria Santos-Villafranca, Dustin Carrión-Ojeda, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, Simone Schaub-Meyer
Title: Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities
Abstract:
Existing methods for egocentric action recognition often rely solely on RGB videos, while additional modalities, e.g., audio, can improve accuracy in challenging scenarios. However, most prior multimodal approaches assume all modalities are available at inference, leading to significant accuracy drops, or even failure, when inputs are missing. To address this, we introduce KARMMA, a multimodal Knowledge distillation approach for egocentric Action Recognition robust to Missing ModAlities that requires no modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that benefits from all available modalities while remaining robust to missing ones, making it suitable for diverse multimodal scenarios without retraining. Our student uses approximately 50% fewer computational resources than our teacher, resulting in a lightweight and fast model. Experiments on Epic-Kitchens and Something-Something show that our student achieves competitive accuracy while significantly reducing accuracy drops under missing modality conditions.

Authors:Tommaso Galliena, Tommaso Apicella, Stefano Rosa, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Title: Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Abstract:
We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations available at https://hsp-iit.github.io/embodied-captioning/
Chinese: 本文提出一种自监督框架,通过共识机制提炼伪描述并使用对比学习微调模型,显著提升了智能体在主动探索环境中描述物体的准确性和一致性。
English: This paper introduces a self-supervised framework that enhances object description accuracy and consistency in active exploration by distilling pseudo-captions through consensus and fine-tuning captioning models with contrastive learning.

Authors:Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens
Title: A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Disease Detection from Retinal Fundus Images
Abstract:
In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for retinal disease detection. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model's decision process. We evaluated our method on two medical tasks focused on disease detection using color fundus images. Our model achieves state-of-the-art predictive performance compared to black-box and interpretable models and provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.
中文: 本文提出了一种可解释的CNN-Transformer混合模型,用于视网膜疾病检测,该模型在实现最先进预测性能的同时,能生成局部化的证据图。
English: This paper introduces an interpretable hybrid CNN-Transformer model for retinal disease detection that generates localized evidence maps while achieving state-of-the-art predictive performance on color fundus images.

Authors:Yi Chen, Tianchen Deng, Wentao Zhao, Xiaoning Wang, Wenqian Xi, Weidong Chen, Jingchuan Wang
Title: SN-LiDAR: Semantic Neural Fields for Novel Space-time View LiDAR Synthesis
Abstract:
Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large-scale pre-trained models, making semantic annotation time-consuming and labor-intensive. To address this challenge, we propose SN-LiDAR, a method that jointly performs accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and leverage a CNN-based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI-360 demonstrate the superiority of SN-LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large-scale scenes. Codes will be available on https://github.com/dtc111111/SN-Lidar.
中文:针对激光雷达点云新视角合成研究中常忽略语义标签的问题,本文提出SN-LiDAR方法,通过从粗到精的平面网格特征表征和基于CNN的局部特征提取,同步实现精确语义分割、高质量几何重建与逼真激光雷达合成,在多个基准数据集上展现出卓越性能。
English: Recent research on LiDAR point cloud novel view synthesis often neglects semantic labels, but the proposed SN-LiDAR method jointly achieves accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis through a coarse-to-fine feature representation and CNN-based local feature extraction, demonstrating superior performance on benchmark datasets.

Authors:Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, Xiongkuo Min
Title: LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
Abstract:
Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perception, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K, and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at https://github.com/IntMeGroup/LMM4LMM.
Chinese: 针对大型多模态模型图像生成的评估难题,我们提出了包含大规模人工标注的综合基准EvalMi-50K,并开发了基于LMM的评估指标LMM4LMM,该指标在感知质量与图文对齐度评估中表现出卓越性能与泛化能力。
English: To address the limitations in evaluating large multimodal models' image generation, we introduce EvalMi-50K, a comprehensive benchmark with extensive human annotations, and propose LMM4LMM, an LMM-based metric that demonstrates superior performance and generalization in assessing perceptual quality and text-image alignment.

Authors:Qinghongbing Xie, Zijian Liang, Long Zeng
Title: DSM: Building A Diverse Semantic Map for 3D Visual Grounding
Abstract:
In recent years, with the growing research and application of vision-language models (VLMs) in robotics, there has been an increasing trend of utilizing VLMs for robotic scene understanding tasks. Existing approaches that use VLMs for 3D Visual Grounding tasks often focus on obtaining scene information through geometric and visual information, overlooking the extraction of diverse semantic information from the scene and the understanding of rich implicit semantic attributes, such as appearance, physics, and affordance. The 3D scene graph, which combines geometry and language, is an ideal representation method for environmental perception and is an effective carrier for language models in 3D Visual Grounding tasks. To address these issues, we propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks. This method leverages VLMs to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy. We enhance the understanding of grounding information based on DSM and introduce a novel approach named DSM-Grounding. Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art. In addition, we have deployed this method on robots to validate its effectiveness in navigation and grasping tasks.

Authors:Jinghe Yang, Mingming Gong, Ye Pu
Title: Knowledge Distillation for Underwater Feature Extraction and Matching via GAN-synthesized Images
Abstract:
Autonomous Underwater Vehicles (AUVs) play a crucial role in underwater exploration. Vision-based methods offer cost-effective solutions for localization and mapping in the absence of conventional sensors like GPS and LiDAR. However, underwater environments present significant challenges for feature extraction and matching due to image blurring and noise caused by attenuation, scattering, and the interference of marine snow. In this paper, we aim to improve the robustness of the feature extraction and matching in the turbid underwater environment using the cross-modal knowledge distillation method that transfers the in-air feature extraction and matching models to underwater settings using synthetic underwater images as the medium. We first propose a novel adaptive GAN-synthesis method to estimate water parameters and underwater noise distribution, to generate environment-specific synthetic underwater images. We then introduce a general knowledge distillation framework compatible with different teacher models. The evaluation of GAN-based synthesis highlights the significance of the new components, i.e. GAN-synthesized noise and forward scattering, in the proposed model. Additionally, VSLAM, as a representative downstream application of feature extraction and matching, is employed on real underwater sequences to validate the effectiveness of the transferred model. Project page: https://github.com/Jinghe-mel/UFEN-GAN.
中文摘要:本文通过跨模态知识蒸馏方法,利用自适应GAN合成技术生成水下合成图像,将空中特征提取与匹配模型迁移至浑浊水下环境,有效提升了水下特征提取与匹配的鲁棒性。
English Summary: This paper enhances feature extraction and matching in turbid underwater environments through cross-modal knowledge distillation, using synthetic underwater images generated by an adaptive GAN-synthesis method to transfer in-air models to underwater settings.
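A minimal sketch of the cross-modal distillation step, under stated assumptions: the frozen in-air teacher processes the clear image while the student processes its GAN-synthesised underwater counterpart, and the student is trained to match the teacher's dense features. The MSE objective and the callable signatures are assumptions; the paper's framework is described as being compatible with different teacher models.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(student, teacher, clear_img, synthetic_uw_img):
    """student/teacher: modules mapping an image batch to a dense feature map.
    The student sees the synthetic underwater image and must reproduce the
    features the frozen in-air teacher extracts from the clear image."""
    with torch.no_grad():
        target = teacher(clear_img)            # (B, C, H, W) in-air features
    pred = student(synthetic_uw_img)           # (B, C, H, W) underwater features
    return F.mse_loss(pred, target)
```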

Authors:Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong
Title: F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos
Abstract:
Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F$^3$ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of video datasets for precise F$^3$ event detection. Datasets in F$^3$Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F$^3$Set contains several sports datasets, and this framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F$^3$Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F$^3$ED, for F$^3$ event detections, achieving superior performance. The dataset, model, and benchmark code are available at https://github.com/F3Set/F3Set.
中文: F³Set基准通过提供大规模数据集并引入F³ED新方法,解决了视频中快速、频繁和细粒度事件检测的难题,其性能优于现有技术。
English: The F³Set benchmark addresses the challenge of detecting fast, frequent, and fine-grained events in videos by providing large-scale datasets and introducing F³ED, a novel method that outperforms existing techniques.

Authors:Guangcong Zheng, Teng Li, Xianpan Zhou, Xi Li
Title: RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements
Abstract:
Recent advances in camera-controllable video generation have been constrained by the reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack metric-scale geometric consistency, which is critical for synthesizing realistic object motions and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations at https://github.com/ZGCTroy/RealCam-Vid.
中文: 本文指出当前基于静态场景数据集的相机可控视频生成方法存在局限,并推出了首个开源的高分辨率动态场景数据集,具备公制尺度相机标注,以提升复杂环境中物体运动真实性和相机轨迹精确性。
English: This summary highlights the limitations of current camera-controllable video generation methods that rely on static-scene datasets and introduces a new open-source dynamic-scene dataset with metric-scale camera annotations to enable more realistic motion synthesis and precise camera control.

Authors:Danielle Sullivan-Pao, Nicole Tian, Pooya Khorrami
Title: LoRAX: LoRA eXpandable Networks for Continual Synthetic Image Attribution
Abstract:
As generative AI image technologies become more widespread and advanced, there is a growing need for strong attribution models. These models are crucial for verifying the authenticity of images and identifying the architecture of their originating generative models, which is key to maintaining media integrity. However, attribution models struggle to generalize to unseen models, and traditional fine-tuning methods for updating these models have been shown to be impractical in real-world settings. To address these challenges, we propose LoRA eXpandable Networks (LoRAX), a parameter-efficient class incremental algorithm that adapts to novel generative image models without the need for full retraining. Our approach trains an extremely parameter-efficient feature extractor per continual learning task via Low Rank Adaptation. Each task-specific feature extractor learns distinct features while only requiring a small fraction of the parameters present in the underlying feature extractor's backbone model. Our extensive experimentation shows LoRAX outperforms or remains competitive with state-of-the-art class incremental learning algorithms on the Continual Deepfake Detection benchmark across all training scenarios and memory settings, while requiring less than 3% of the number of trainable parameters per feature extractor compared to the full-rank implementation. LoRAX code is available at: https://github.com/mit-ll/lorax_cil.
中文: 针对生成式AI图像溯源模型难以适应新型模型的挑战,我们提出LoRAX参数高效增量学习算法,该方案在持续深度伪造检测任务中表现卓越,且每个特征提取器所需训练参数不足全量训练的3%。
English: To address the limitations of attribution models in adapting to new generative AI image technologies, we introduce LoRAX, a parameter-efficient incremental learning algorithm that excels in continual deepfake detection with minimal trainable parameters.
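For reference, a generic Low-Rank Adaptation layer of the kind LoRAX trains per continual-learning task: the base weights stay frozen and only a small low-rank update is learned. This is a minimal sketch; the rank, scaling, and initialization are common defaults, not values taken from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank, task-specific update:
    y = base(x) + x A^T B^T * (alpha / r). Only A and B are trained per task."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # backbone stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```

In a class-incremental setting, one pair (A, B) per task can be stored and the appropriate adapter selected or ensembled at inference.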

Authors:Shalini Maiti, Lourdes Agapito, Filippos Kokkinos
Title: Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects
Abstract:
Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. Gen3DEval evaluates text fidelity, appearance, and surface quality by analyzing 3D surface normals, without requiring ground-truth comparisons, bridging the gap between automated metrics and user preferences. Compared to state-of-the-art task-agnostic models, Gen3DEval demonstrates superior performance in user-aligned evaluations, placing it as a comprehensive and accessible benchmark for future research on text-to-3D generation. The project page can be found here: \href{https://shalini-maiti.github.io/gen3deval.github.io/}{https://shalini-maiti.github.io/gen3deval.github.io/}.

Authors:Lucian Chauvin, Somil Gupta, Angelina Ibarra, Joshua Peeples
Title: Benchmarking Suite for Synthetic Aperture Radar Imagery Anomaly Detection (SARIAD) Algorithms
Abstract:
Anomaly detection is a key research challenge in computer vision and machine learning with applications in many fields from quality control to radar imaging. In radar imaging, specifically synthetic aperture radar (SAR), anomaly detection can be used for the classification, detection, and segmentation of objects of interest. However, there has been no unified framework for developing and benchmarking these methods on SAR imagery. To address this issue, we introduce SAR imagery anomaly detection (SARIAD). In conjunction with Anomalib, a deep-learning library for anomaly detection, SARIAD provides a comprehensive suite of algorithms and datasets for assessing and developing anomaly detection approaches on SAR imagery. SARIAD specifically integrates multiple SAR datasets along with tools to effectively apply various anomaly detection algorithms to SAR imagery. Several anomaly detection metrics and visualizations are available. Overall, SARIAD acts as a central package for benchmarking SAR models and datasets to allow for reproducible research in the field of anomaly detection in SAR imagery. This package is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/SARIAD.
中文: SARIAD是一个综合性工具包,集成了合成孔径雷达(SAR)图像异常检测的数据集与算法,为该领域的方法开发、性能评估及可重复研究提供了统一平台。
English: SARIAD is a comprehensive toolkit that integrates datasets and algorithms for developing and benchmarking anomaly detection methods in synthetic aperture radar (SAR) imagery, enabling reproducible research in the field.

Authors:Junbang Liu, Enpei Huang, Dongxing Mao, Hui Zhang, Xinyuan Song, Yongxin Ni
Title: ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting
Abstract:
Creating 3D content from single-view images is a challenging problem that has attracted considerable attention in recent years. Current approaches typically utilize score distillation sampling (SDS) from pre-trained 2D diffusion models to generate multi-view 3D representations. Although some methods have made notable progress by balancing generation speed and model quality, their performance is often limited by the visual inconsistencies of the diffusion model outputs. In this work, we propose ContrastiveGaussian, which integrates contrastive learning into the generative process. By using a perceptual loss, we effectively differentiate between positive and negative samples, leveraging the visual inconsistencies to improve 3D generation quality. To further enhance sample differentiation and improve contrastive learning, we incorporate a super-resolution model and introduce another Quantity-Aware Triplet Loss to address varying sample distributions during training. Our experiments demonstrate that our approach achieves superior texture fidelity and improved geometric consistency.
中文: 本文提出ContrastiveGaussian方法,通过将对比学习和感知损失融入生成过程,有效利用扩散模型输出的视觉不一致性来提升单图像3D生成质量,实现了更优的纹理保真度和几何一致性。
English: This paper introduces ContrastiveGaussian, a method that enhances 3D generation from single images by integrating contrastive learning and perceptual loss to address visual inconsistencies in diffusion model outputs, achieving superior texture fidelity and geometric consistency.
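The paper's Quantity-Aware Triplet Loss is not specified in the abstract, so the sketch below shows only a plain perceptual-feature triplet loss as the underlying idea: pull renderings that agree with the reference view (positives) together and push inconsistent diffusion outputs (negatives) away in feature space. The margin and the cosine-distance choice are assumptions.

```python
import torch
import torch.nn.functional as F

def perceptual_triplet_loss(anchor_feat: torch.Tensor,
                            positive_feat: torch.Tensor,
                            negative_feat: torch.Tensor,
                            margin: float = 0.2) -> torch.Tensor:
    """Triplet loss on perceptual features (B, D): anchor should be closer to
    the positive (view-consistent) sample than to the negative one by a margin."""
    d_pos = 1.0 - F.cosine_similarity(anchor_feat, positive_feat, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor_feat, negative_feat, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```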

Authors:Chengyu Yang, Chengjun Liu
Title: Interpretable Automatic Rosacea Detection with Whitened Cosine Similarity
Abstract:
According to the National Rosacea Society, approximately sixteen million Americans suffer from rosacea, a common skin condition that causes flushing or long-term redness on a person's face. To increase rosacea awareness and to better assist physicians in diagnosing this disease, we propose an interpretable automatic rosacea detection method based on whitened cosine similarity in this paper. The contributions of the proposed method are three-fold. First, the proposed method can automatically distinguish patients suffering from rosacea from people who do not have the disease, with significantly higher accuracy on unseen test data than other methods, including both classical deep learning and statistical methods. Second, the proposed method addresses the interpretability issue by measuring the similarity between the test sample and the means of two classes, namely the rosacea class versus the normal class, which allows both medical professionals and patients to understand and trust the results. Finally, the proposed method will not only help increase awareness of rosacea in the general population, but will also help remind patients who suffer from this disease of possible early treatment, as rosacea is more treatable in its early stages. The code and data are available at https://github.com/chengyuyang-njit/ICCRD-2025.
Chinese: 本文提出了一种基于白化余弦相似度的可解释自动红斑痤疮检测方法,其准确率优于现有方法,有助于提升医生诊断能力并增强公众对该疾病的认识。
English: This paper introduces an interpretable automatic rosacea detection method using whitened cosine similarity, achieving higher accuracy than existing approaches and enhancing both physician diagnosis and public awareness of the condition.
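A minimal sketch of whitened cosine similarity classification as described: features are whitened with statistics estimated from training data, and a test sample is assigned to whichever class (rosacea vs. normal) has the more similar whitened mean. The eigendecomposition-based whitening and the regularizer eps are implementation assumptions.

```python
import numpy as np

def fit_whitening(X: np.ndarray, eps: float = 1e-5):
    """Estimate a whitening transform W such that (X - mu) @ W has ~identity covariance."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False) + eps * np.eye(X.shape[1])
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return mu, W

def whitened_cosine_score(x: np.ndarray, class_mean: np.ndarray,
                          mu: np.ndarray, W: np.ndarray) -> float:
    """Cosine similarity between the whitened test sample and a whitened class mean."""
    xw, mw = (x - mu) @ W, (class_mean - mu) @ W
    return float(xw @ mw / (np.linalg.norm(xw) * np.linalg.norm(mw) + 1e-12))

# Classification rule: predict the class whose mean yields the higher score.
```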

Authors:Sushant Gautam, Jingdao Chen
Title: X-DECODE: EXtreme Deblurring with Curriculum Optimization and Domain Equalization
Abstract:
Restoring severely blurred images remains a significant challenge in computer vision, impacting applications in autonomous driving, medical imaging, and photography. This paper introduces a novel training strategy based on curriculum learning to improve the robustness of deep learning models for extreme image deblurring. Unlike conventional approaches that train on only low to moderate blur levels, our method progressively increases the difficulty by introducing images with higher blur severity over time, allowing the model to adapt incrementally. Additionally, we integrate perceptual and hinge loss during training to enhance fine detail restoration and improve training stability. We experimented with various curriculum learning strategies and explored the impact of the train-test domain gap on the deblurring performance. Experimental results on the Extreme-GoPro dataset showed that our method outperforms the next best method by 14% in SSIM, whereas experiments on the Extreme-KITTI dataset showed that our method outperforms the next best by 18% in SSIM. Ablation studies showed that a linear curriculum progression outperforms step-wise, sigmoid, and exponential progressions, while hyperparameter settings such as the training blur percentage and loss function formulation all play important roles in addressing extreme blur artifacts. Datasets and code are available at https://github.com/RAPTOR-MSSTATE/XDECODE
Chinese: 本文提出一种基于课程学习的训练策略,通过渐进增加模糊程度并结合感知与铰链损失,在极端图像去模糊任务中取得显著突破,在基准数据集上的SSIM指标比现有最佳方法高出14-18%。
English: This paper proposes a curriculum learning-based training strategy that progressively increases blur severity and integrates perceptual and hinge loss, achieving significant improvements in extreme image deblurring with 14-18% higher SSIM scores than existing methods on benchmark datasets.
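A minimal sketch of the linear curriculum progression the ablations favor: the maximum blur severity admitted into training grows linearly with the epoch, so early epochs see only mild blur and later epochs see the full range. The function names, the 1-10 severity scale, and the per-level dataset dictionary are assumptions.

```python
def max_blur_level(epoch: int, total_epochs: int, max_severity: int = 10) -> int:
    """Linear curriculum: hardest admitted blur level grows from 1 to max_severity."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return 1 + round(frac * (max_severity - 1))

def curriculum_samples(dataset_by_level: dict, epoch: int, total_epochs: int):
    """dataset_by_level maps blur severity -> list of training samples."""
    allowed = max_blur_level(epoch, total_epochs)
    return [s for lvl, samples in dataset_by_level.items() if lvl <= allowed for s in samples]

# Example: with 100 epochs, epoch 0 trains only on level-1 blur,
# epoch 50 admits levels up to ~6, and the final epochs see all levels.
```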

Authors:Angelina Ibarra, Joshua Peeples
Title: Patch distribution modeling framework adaptive cosine estimator (PaDiM-ACE) for anomaly detection and localization in synthetic aperture radar imagery
Abstract:
This work presents a new approach to anomaly detection and localization in synthetic aperture radar imagery (SAR), expanding upon the existing patch distribution modeling framework (PaDiM). We introduce the adaptive cosine estimator (ACE) detection statistic. PaDiM uses the Mahalanobis distance at inference, an unbounded metric. ACE instead uses the cosine similarity metric, providing bounded anomaly detection scores. The proposed method is evaluated across multiple SAR datasets, with performance metrics including the area under the receiver operating characteristic curve (AUROC) at the image and pixel level, aiming for increased performance in anomaly detection and localization of SAR imagery. The code is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/PaDiM-ACE.
中文: 本研究提出自适应余弦估计器(ACE),通过用有界的余弦相似度替代PaDiM中无界的马氏距离,增强了合成孔径雷达图像异常检测与定位能力,在多个数据集上通过AUROC指标验证了性能提升。
English: This study introduces an adaptive cosine estimator (ACE) to enhance SAR image anomaly detection and localization by replacing PaDiM's unbounded Mahalanobis distance with bounded cosine similarity, demonstrating improved performance across multiple datasets through AUROC metrics.
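For contrast, the classic forms of the two statistics: the unbounded Mahalanobis distance used by vanilla PaDiM and the adaptive cosine estimator, which is bounded in [0, 1]. How PaDiM-ACE defines the per-location target signature is not stated in the abstract, so s below is an assumed target embedding.

```python
import numpy as np

def mahalanobis_sq(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Unbounded score used by vanilla PaDiM for a patch embedding x."""
    d = x - mu
    return float(d @ cov_inv @ d)

def ace_score(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray, s: np.ndarray) -> float:
    """Classic adaptive cosine estimator: squared cosine between the whitened
    test embedding and a whitened target signature s, bounded in [0, 1]."""
    d, t = x - mu, s - mu
    num = (t @ cov_inv @ d) ** 2
    den = (t @ cov_inv @ t) * (d @ cov_inv @ d)
    return float(num / (den + 1e-12))
```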

Authors:Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
Title: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Abstract:
Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
中文摘要:ScreenSpot-Pro是一个针对专业GUI环境评估多模态大语言模型的新基准,现有模型在此表现不佳,而提出的ScreenSeekeR方法无需额外训练即可显著提升性能。
English Summary: ScreenSpot-Pro is a new benchmark for evaluating Multi-modal Large Language Models (MLLMs) in professional GUI environments, where existing models struggle, and the proposed ScreenSeekeR method significantly improves performance without additional training.

Authors:Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, Ping Luo
Title: PixelFlow: Pixel-Space Generative Models with Flow
Abstract:
We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and enabling the whole model end-to-end trainable. Through efficient cascade flow modeling, PixelFlow achieves affordable computation cost in pixel space. It achieves an FID of 1.98 on 256$\times$256 ImageNet class-conditional image generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models. Code and models are available at https://github.com/ShoufaChen/PixelFlow.
中文: PixelFlow提出了一种直接在像素空间操作的图像生成模型系列,无需变分自编码器即可实现端到端训练,并通过高效级联流建模在图像质量和语义控制方面表现出色。
English: PixelFlow introduces a novel family of image generation models that operate directly in pixel space, eliminating the need for VAEs and achieving state-of-the-art performance with efficient cascade flow modeling.

Authors:Lang Lin, Xueyang Yu, Ziqi Pang, Yu-Xiong Wang
Title: GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
Abstract:
This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLM-based methods commonly struggle with the dilemma between "Ref" and "VOS": they either specialize in understanding a few key frames (global reasoning) or tracking objects on continuous frames (local reasoning), and rely on external VOS or frame selectors to mitigate the other end of the challenge. However, our framework GLUS shows that global and local consistency can be unified into a single video segmentation MLLM: a set of sparse "context frames" provides global information, while a stream of continuous "query frames" conducts local object tracking. This is further supported by jointly training the MLLM with a pre-trained VOS memory bank to simultaneously digest short-range and long-range temporal information. To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false-positive objects and a self-refined framework to identify crucial frames and perform propagation. By collectively integrating these insights, our GLUS delivers a simple yet effective baseline, achieving new state-of-the-art for MLLMs on the MeViS and Ref-Youtube-VOS benchmark. Our project page is at https://glus-video.github.io/.

Authors:Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: MM-IFEngine: Towards Multimodal Instruction Following
Abstract:
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2$\%$), MIA (+7.6$\%$), and IFEval (+12.3$\%$). We have fully open-sourced the datasets (both SFT and DPO), evaluation code and training scripts at https://github.com/SYuan03/MM-IFEngine.
中文: 本研究提出MM-IFEngine框架,通过生成高质量图像-指令数据集(MM-IFInstruct-23k与MM-IFDPO-23k)和MM-IFEval评估基准,显著提升了多模态大模型的指令跟随能力,在多个基准测试中取得突破性进展。
English: This study introduces MM-IFEngine, a pipeline generating high-quality image-instruction data (MM-IFInstruct-23k and MM-IFDPO-23k) and the MM-IFEval benchmark to enhance multimodal instruction-following in MLLMs, achieving significant performance improvements across multiple benchmarks.

Authors:Yuanhong Yu, Xingyi He, Chen Zhao, Junhao Yu, Jiaqi Yang, Ruizhen Hu, Yujun Shen, Xing Zhu, Xiaowei Zhou, Sida Peng
Title: BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
Abstract:
This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As object semantic points, object corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications.
中文: 本文提出了一种基于RGB的通用物体姿态估计方法,通过将物体边界框角点作为中间表示,结合新型参考点合成器在稀疏视角和遮挡场景中实现稳健性能,在多个基准数据集上展现出优于现有方法的表现。
English: This paper introduces a generalizable RGB-based method for object pose estimation that uses object corner points as an intermediate representation, enabling robust performance in sparse-view and occlusion scenarios through a novel reference-based point synthesizer and demonstrating superior results on benchmark datasets.
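Once the eight box corners are available in 3D (recovered from the reference views) and in 2D (predicted in the target view), the pose follows from a standard PnP solve. Below is a sketch using OpenCV; the EPnP flag is an arbitrary but reasonable choice for eight correspondences, not necessarily the solver used in the paper.

```python
import numpy as np
import cv2

def pose_from_corners(corners_3d: np.ndarray, corners_2d: np.ndarray, K: np.ndarray):
    """corners_3d: (8, 3) bounding-box corners in the object frame,
    corners_2d: (8, 2) corners predicted in the target view, K: (3, 3) intrinsics."""
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)          # object rotation matrix; tvec is the translation
    return ok, R, tvec
```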

Authors:En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao
Title: Perception-R1: Pioneering Perception Policy with Reinforcement Learning
Abstract:
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that the perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
中文: 本研究提出Perception-R1强化学习框架,通过优化感知复杂度处理与奖励机制设计,在多类视觉感知任务中实现显著性能提升,为感知策略学习建立了新基准。
English: This study introduces Perception-R1, a scalable reinforcement learning framework that enhances visual perception tasks by addressing perceptual complexity and reward design, achieving significant performance improvements across multiple benchmarks.
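The exact reward design is not given in the abstract; the sketch below only shows the general shape of a rule-based reward for a grounding rollout, combining a small format term with IoU against the ground-truth box. The weighting is an assumption.

```python
def iou(box_a, box_b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_reward(pred_box, gt_box, answer_well_formed: bool) -> float:
    """Rule-based reward sketch: a small format bonus for a parseable box,
    plus the IoU against the ground-truth annotation."""
    if not answer_well_formed:
        return 0.0
    return 0.1 + iou(pred_box, gt_box)
```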

Authors:Yunhan Yang, Yuan-Chen Guo, Yukun Huang, Zi-Xin Zou, Zhipeng Yu, Yangguang Li, Yan-Pei Cao, Xihui Liu
Title: HoloPart: Generative 3D Part Amodal Segmentation
Abstract:
3D part amodal segmentation--decomposing a 3D shape into complete, semantically meaningful parts, even when occluded--is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surface patches, limiting their utility. Inspired by 2D amodal segmentation, we introduce this novel task to the 3D domain and propose a practical, two-stage approach, addressing the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data. First, we leverage existing 3D part segmentation to obtain initial, incomplete part segments. Second, we introduce HoloPart, a novel diffusion-based model, to complete these segments into full 3D parts. HoloPart utilizes a specialized architecture with local attention to capture fine-grained part geometry and global shape context attention to ensure overall shape consistency. We introduce new benchmarks based on the ABO and PartObjaverse-Tiny datasets and demonstrate that HoloPart significantly outperforms state-of-the-art shape completion methods. By incorporating HoloPart with existing segmentation techniques, we achieve promising results on 3D part amodal segmentation, opening new avenues for applications in geometry editing, animation, and material assignment.

Authors:Rundong Luo, Matthew Wallingford, Ali Farhadi, Noah Snavely, Wei-Chiu Ma
Title: Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
Abstract:
360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360° generation: given a perspective video as input, our goal is to generate a full panoramic video that is consistent with the original video. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain spatio-temporal consistency. To address these challenges, we first leverage the abundant 360° videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware operations to facilitate the learning process and improve the quality of 360° video generation. Experimental results demonstrate that our model can generate realistic and coherent 360° videos from in-the-wild perspective video. In addition, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.

Authors:Yifan Ding, Arturas Aleksandraus, Amirhossein Ahmadian, Jonas Unger, Fredrik Lindsten, Gabriel Eilertsen
Title: Revisiting Likelihood-Based Out-of-Distribution Detection by Modeling Representations
Abstract:
Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning systems, particularly in safety-critical applications. Likelihood-based deep generative models have historically faced criticism for their unsatisfactory performance in OOD detection, often assigning higher likelihood to OOD data than in-distribution samples when applied to image data. In this work, we demonstrate that likelihood is not inherently flawed. Rather, several properties of the image space prevent likelihood from serving as a valid detection score. Given a sufficiently good likelihood estimator, specifically using the probability flow formulation of a diffusion model, we show that likelihood-based methods can still perform on par with state-of-the-art methods when applied in the representation space of pre-trained encoders. The code of our work can be found at $\href{https://github.com/limchaos/Likelihood-OOD.git}{\texttt{https://github.com/limchaos/Likelihood-OOD.git}}$.
Chinese: 本研究证明,通过扩散模型在预训练编码器的表征空间中使用基于似然的方法,能够实现最先进的分布外检测性能,从而挑战了先前对似然方法的批评。
English: This study demonstrates that likelihood-based methods can achieve state-of-the-art out-of-distribution detection performance when applied in the representation space of pre-trained encoders using diffusion models, challenging previous criticisms of likelihood approaches.
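The paper scores likelihood with a diffusion model's probability-flow formulation; as a lightweight stand-in, the sketch below fits a Gaussian mixture in the representation space of a frozen encoder and flags low-likelihood samples. It illustrates the same recipe of modeling representations rather than pixels, not the paper's actual estimator.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_density(train_feats: np.ndarray, n_components: int = 10) -> GaussianMixture:
    """Fit a density model on in-distribution encoder features (N, D)."""
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(train_feats)

def ood_scores(gmm: GaussianMixture, test_feats: np.ndarray) -> np.ndarray:
    """Negative log-likelihood as the OOD score: higher means more likely OOD."""
    return -gmm.score_samples(test_feats)
```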

Authors:Yihao Wang, Zhong Qian, Peifeng Li
Title: FMNV: A Dataset of Media-Published News Videos for Fake News Detection
Abstract:
News media, particularly video-based platforms, have become deeply embedded in daily life, concurrently amplifying the risks of misinformation dissemination. Consequently, multimodal fake news detection has garnered significant research attention. However, existing datasets predominantly comprise user-generated videos characterized by crude editing and limited public engagement, whereas professionally crafted fake news videos disseminated by media outlets, often politically or virally motivated, pose substantially greater societal harm. To address this gap, we construct FMNV, a novel dataset exclusively composed of news videos published by media organizations. Through empirical analysis of existing datasets and our curated collection, we categorize fake news videos into four distinct types. Building upon this taxonomy, we employ Large Language Models (LLMs) to automatically generate deceptive content by manipulating authentic media-published news videos. Furthermore, we propose FMNVD, a baseline model featuring a dual-stream architecture that integrates spatio-temporal motion features from a 3D ResNeXt-101 backbone and static visual semantics from CLIP. The two streams are fused via an attention-based mechanism, while co-attention modules refine the visual, textual, and audio features for effective multi-modal aggregation. Comparative experiments demonstrate both the generalization capability of FMNV across multiple baselines and the superior detection efficacy of FMNVD. This work establishes critical benchmarks for detecting high-impact fake news in media ecosystems while advancing methodologies for cross-modal inconsistency analysis. Our dataset is available at https://github.com/DennisIW/FMNV.
中文:本研究提出了由专业制作的假新闻视频构成的新数据集FMNV,并开发了FMNVD双流检测模型,通过融合多模态特征有效应对高影响力虚假信息的社会危害。
English: This research introduces FMNV, a novel dataset of professionally produced fake news videos, and proposes FMNVD, a dual-stream detection model that effectively integrates multimodal features to address the societal harm of high-impact misinformation.
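A minimal sketch of attention-based fusion of the two streams (3D ResNeXt-101 motion features and CLIP visual semantics): both are projected to a common width and combined with learned weights. The gating formulation, projection layers, and dimensions are assumptions, and the co-attention refinement of text and audio features described in the abstract is omitted.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse a spatio-temporal motion feature and a static semantic feature
    into a single clip-level representation with learned attention weights."""
    def __init__(self, dim_motion: int, dim_semantic: int, dim_out: int):
        super().__init__()
        self.proj_m = nn.Linear(dim_motion, dim_out)
        self.proj_s = nn.Linear(dim_semantic, dim_out)
        self.gate = nn.Linear(2 * dim_out, 2)

    def forward(self, f_motion: torch.Tensor, f_semantic: torch.Tensor) -> torch.Tensor:
        m, s = self.proj_m(f_motion), self.proj_s(f_semantic)              # (B, dim_out) each
        w = torch.softmax(self.gate(torch.cat([m, s], dim=-1)), dim=-1)    # (B, 2) weights
        return w[:, :1] * m + w[:, 1:] * s
```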

Authors:Anne-Sofie Maerten, Li-Wei Chen, Stefanie De Winter, Christophe Bossens, Johan Wagemans
Title: LAPIS: A novel dataset for personalized image aesthetic assessment
Abstract:
We present the Leuven Art Personalized Image Set (LAPIS), a novel dataset for personalized image aesthetic assessment (PIAA). It is the first dataset with images of artworks that is suitable for PIAA. LAPIS consists of 11,723 images and was meticulously curated in collaboration with art historians. Each image has an aesthetics score and a set of image attributes known to relate to aesthetic appreciation. Besides rich image attributes, LAPIS offers rich personal attributes of each annotator. We implemented two existing state-of-the-art PIAA models and assessed their performance on LAPIS. We assess the contribution of personal attributes and image attributes through ablation studies and find that performance deteriorates when certain personal and image attributes are removed. An analysis of failure cases reveals that both existing models make similar incorrect predictions, highlighting the need for improvements in artistic image aesthetic assessment. The LAPIS project page can be found at: https://github.com/Anne-SofieMaerten/LAPIS
中文: LAPIS是首个适用于个性化图像美学评估的艺术作品数据集,包含11,723张精心策展的图像,消融研究表明个人属性和图像属性对模型性能均至关重要,而现有模型在艺术图像美学评估中仍存在需要改进的系统性误判。
English: The LAPIS dataset is the first personalized image aesthetic assessment collection featuring 11,723 curated artworks, where ablation studies demonstrate that both personal and image attributes are crucial for model performance, while current models show consistent prediction errors requiring further refinement.

Authors:Yujin Wang, Jiarui Wu, Yichen Bian, Fan Zhang, Tianfan Xue
Title: S2R-HDR: A Large-Scale Rendered Dataset for HDR Fusion
Abstract:
The generalization of learning-based high dynamic range (HDR) fusion is often limited by the availability of training data, as collecting large-scale HDR images from dynamic scenes is both costly and technically challenging. To address these challenges, we propose S2R-HDR, the first large-scale high-quality synthetic dataset for HDR fusion, with 24,000 HDR samples. Using Unreal Engine 5, we design a diverse set of realistic HDR scenes that encompass various dynamic elements, motion types, high dynamic range scenes, and lighting. Additionally, we develop an efficient rendering pipeline to generate realistic HDR images. To further mitigate the domain gap between synthetic and real-world data, we introduce S2R-Adapter, a domain adaptation method designed to bridge this gap and enhance the generalization ability of models. Experimental results on real-world datasets demonstrate that our approach achieves state-of-the-art HDR reconstruction performance. Dataset and code will be available at https://openimaginglab.github.io/S2R-HDR.

Authors:Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao
Title: VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Abstract:
Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1
中文: DeepSeek R1 基于确定性答案的规则奖励强化学习方法被成功扩展到视觉语言模型VLM-R1中,有效提升了视觉推理任务的性能表现和泛化能力。
English: DeepSeek R1's reinforcement learning approach, using rule-based rewards for tasks with clear answers, is successfully extended to vision-language models through VLM-R1, enhancing both performance and generalization in visual reasoning tasks.
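
The rule-based reward idea is easy to make concrete: for a task with a deterministic answer, such as object grounding, the reward can be computed exactly from the model output. The toy function below parses a bounding box from generated text and scores it by IoU; the output format and scoring rule are assumptions for illustration and are not claimed to match VLM-R1's reward definition.
```python
import re

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def rule_based_reward(model_output: str, gt_box) -> float:
    # Expect the model to emit a box as "[x1, y1, x2, y2]"; reward = IoU with ground truth.
    m = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", model_output)
    if m is None:
        return 0.0  # unparseable output receives zero reward
    pred = tuple(float(g) for g in m.groups())
    return iou(pred, gt_box)

print(rule_based_reward("The dog is at [10, 10, 50, 50].", (12, 8, 48, 52)))  # ~0.83
```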

Authors:Andrés Bell-Navas, María Villalba-Orero, Enrique Lara-Pezzi, Jesús Garicano-Mena, Soledad Le Clainche
Title: Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases
Abstract:
Heart disease is the leading cause of death worldwide. According to the World Health Organization (WHO), approximately 18 million deaths occur each year due to heart disease. In particular, heart failure (HF) presses the healthcare industry to develop systems for its early, rapid, and effective prediction. This work presents an automatic system based on a novel deep learning framework which analyses echocardiography video sequences in real time for the challenging and more specific task of heart failure time prediction. This system works in two stages. The first one transforms the data from a database of echocardiography video sequences into a machine learning-compatible collection of annotated images which can be used in the training phase of any machine learning-based framework, including a deep learning-based one. This stage includes the use of the Higher Order Dynamic Mode Decomposition (HODMD) algorithm for both data augmentation and feature extraction. The second stage builds and trains a Vision Transformer (ViT). Self-supervised learning (SSL) methods, so far barely explored in the heart failure prediction literature, are adopted to effectively train the ViT from scratch, even with scarce databases. The designed neural network analyses images from echocardiography sequences to estimate the time at which a heart failure will happen. The results obtained show the efficacy of the HODMD algorithm and the superiority of the proposed system with respect to several established ViT and Convolutional Neural Network (CNN) architectures. The source code will be incorporated into the next version release of the ModelFLOWs-app software (https://github.com/modelflows/ModelFLOWs-app).
中文: 心脏病是全球主要死因,本研究提出一种基于新型深度学习框架的自动化系统,通过包含HODMD算法处理数据和视觉变换器预测的两阶段方法,实时分析超声心动图视频来预测心力衰竭的发生时间。
English: Heart disease is the leading global cause of death, and this study introduces an automated system using a novel deep learning framework that analyzes echocardiography videos in real-time to predict the timing of heart failure through a two-stage process involving HODMD for data processing and a Vision Transformer for prediction.

Authors:Moritz Rempe, Fabian Hörst, Helmut Becker, Marco Schlimbach, Lukas Rotkopf, Kevin Kröninger, Jens Kleesiek
Title: PhaseGen: A Diffusion-Based Approach for Complex-Valued MRI Data Generation
Abstract:
Magnetic resonance imaging (MRI) raw data, or k-space data, is complex-valued, containing both magnitude and phase information. However, clinical and existing Artificial Intelligence (AI)-based methods focus only on magnitude images, discarding the phase data despite its potential for downstream tasks, such as tumor segmentation and classification. In this work, we introduce PhaseGen, a novel complex-valued diffusion model for generating synthetic MRI raw data conditioned on magnitude images, commonly used in clinical practice. This enables the creation of artificial complex-valued raw data, allowing pretraining for models that require k-space information. We evaluate PhaseGen on two tasks: skull-stripping directly in k-space and MRI reconstruction using the publicly available FastMRI dataset. Our results show that training with synthetic phase data significantly improves generalization for skull-stripping on real-world data, with an increased segmentation accuracy from 41.1% to 80.1%, and enhances MRI reconstruction when combined with limited real-world data. This work presents a step forward in utilizing generative AI to bridge the gap between magnitude-based datasets and the complex-valued nature of MRI raw data. This approach allows researchers to leverage the vast amount of available image domain data in combination with the information-rich k-space data for more accurate and efficient diagnostic tasks. We make our code publicly available at https://github.com/TIO-IKIM/PhaseGen.
中文: 本研究提出PhaseGen这一复数值扩散模型,能从临床常用的磁共振幅度图像生成合成k空间数据,通过弥合图像数据与复数值原始数据之间的鸿沟,显著提升了颅骨剥离和图像重建等任务的性能。
English: This study introduces PhaseGen, a complex-valued diffusion model that generates synthetic MRI k-space data from magnitude images, enhancing tasks like skull-stripping and MRI reconstruction by bridging the gap between clinical image data and the full potential of complex-valued raw data.
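
The relationship the model exploits, namely that a magnitude image plus a phase map determines complex-valued data whose Fourier transform is k-space, can be written in a few lines of NumPy. The phase below is random purely for illustration; in PhaseGen it would be produced by the conditional diffusion model.
```python
import numpy as np

rng = np.random.default_rng(0)
magnitude = rng.random((256, 256))                  # stand-in for a clinical magnitude image
phase = rng.uniform(-np.pi, np.pi, (256, 256))      # what a generative model would predict
complex_image = magnitude * np.exp(1j * phase)      # complex-valued image
k_space = np.fft.fftshift(np.fft.fft2(complex_image))  # synthetic raw (k-space) data

# Round trip back to image space and check magnitude consistency.
recon = np.fft.ifft2(np.fft.ifftshift(k_space))
assert np.allclose(np.abs(recon), magnitude)
```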

Authors:Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinhao Li, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yuhao Dong, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen, Zongyu Lin
Title: Kimi-VL Technical Report
Abstract:
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
Chinese: Kimi-VL是一款高效的开源视觉语言模型,具备先进的多模态推理、长上下文理解和强大智能体能力,在仅激活28亿参数的情况下,于多项挑战性任务中展现出卓越性能。
English: Kimi-VL is an efficient open-source vision-language model with advanced multimodal reasoning, long-context understanding, and strong agent capabilities, achieving competitive performance across various challenging tasks while activating only 2.8B parameters.

Authors:Hengrun Zhao, Yunzhi Zhuge, Yifan Wang, Lijun Wang, Huchuan Lu, Yu Zeng
Title: Learning Universal Features for Generalizable Image Forgery Localization
Abstract:
In recent years, advanced image editing and generation methods have rapidly evolved, making detecting and locating forged image content increasingly challenging. Most existing image forgery detection methods rely on identifying the edited traces left in the image. However, because the traces of different forgeries are distinct, these methods can identify familiar forgeries included in the training data but struggle to handle unseen ones. In response, we present an approach for Generalizable Image Forgery Localization (GIFL). Once trained, our model can detect both seen and unseen forgeries, providing a more practical and efficient solution to counter false information in the era of generative AI. Our method focuses on learning general features from the pristine content rather than traces of specific forgeries, which are relatively consistent across different types of forgeries and therefore can be used as universal features to locate unseen forgeries. Additionally, as existing image forgery datasets are still dominated by traditional hand-crafted forgeries, we construct a new dataset consisting of images edited by various popular deep generative image editing methods to further encourage research in detecting images manipulated by deep generative models. Extensive experimental results show that the proposed approach outperforms state-of-the-art methods in the detection of unseen forgeries and also demonstrates competitive results for seen forgeries. The code and dataset are available at https://github.com/ZhaoHengrun/GIFL.
中文摘要:提出的通用图像伪造定位方法通过从原始图像内容中学习通用特征,有效检测已知和未知伪造类型,并借助新构建的数据集和广泛实验验证了其超越现有方法的性能。
English Summary: The proposed Generalizable Image Forgery Localization (GIFL) method learns universal features from pristine image content to effectively detect both seen and unseen forgeries, outperforming existing approaches through a new dataset and extensive experiments.

Authors:Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun
Title: How Can Objects Help Video-Language Understanding?
Abstract:
Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? To one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly modeled. To the other extreme, image captions by themselves provide strong empirical performances for understanding tasks, despite missing fine-grained spatiotemporal information. To answer this question, we introduce ObjectMLLM, a framework capable of leveraging arbitrary computer vision algorithm to extract and integrate structured visual representation. Through extensive evaluations on six video question answering benchmarks, we confirm that explicit integration of object-centric representation remains necessary. Surprisingly, we observe that the simple approach of quantizing the continuous, structured object information and representing them as plain text performs the best, offering a data-efficient approach to integrate other visual perception modules into MLLM design. Our code and models are released at https://github.com/brown-palm/ObjectMLLM.
Chinese: ObjectMLLM 证实了在多模态大语言模型中显式的以对象为中心的表示仍然必要,将结构化对象数据量化为纯文本的方法被证明是整合视觉感知模块的最有效方式。
English: ObjectMLLM demonstrates that explicit object-centric representation remains essential in multimodal large language models, with quantizing structured object data into plain text proving most effective for integrating visual perception modules.
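
The quantize-to-plain-text idea can be sketched directly: continuous box coordinates are binned and rendered as a short text string that can be spliced into the prompt. The bin count and the text template below are assumptions, not the exact format used by ObjectMLLM.
```python
def box_to_text(box, image_w, image_h, num_bins=100):
    # Quantize normalized coordinates into integer bins and render them as plain text
    # that can be inserted into an MLLM prompt alongside the question.
    x1, y1, x2, y2 = box
    q = lambda v, size: min(num_bins - 1, int(v / size * num_bins))
    return f"<box>{q(x1, image_w)} {q(y1, image_h)} {q(x2, image_w)} {q(y2, image_h)}</box>"

print(box_to_text((120.0, 45.5, 380.0, 300.0), image_w=640, image_h=360))
# -> "<box>18 12 59 83</box>"
```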

Authors:Anning Hu, Ang Li, Xirui Jin, Danping Zou
Title: ThermoStereoRT: Thermal Stereo Matching in Real Time via Knowledge Distillation and Attention-based Refinement
Abstract:
We introduce ThermoStereoRT, a real-time thermal stereo matching method designed for all-weather conditions that recovers disparity from two rectified thermal stereo images, envisioning applications such as night-time drone surveillance or under-bed cleaning robots. Leveraging a lightweight yet powerful backbone, ThermoStereoRT constructs a 3D cost volume from thermal images and employs multi-scale attention mechanisms to produce an initial disparity map. To refine this map, we design a novel channel and spatial attention module. Addressing the challenge of sparse ground truth data in thermal imagery, we utilize knowledge distillation to boost performance without increasing computational demands. Comprehensive evaluations on multiple datasets demonstrate that ThermoStereoRT delivers both real-time capacity and robust accuracy, making it a promising solution for real-world deployment in various challenging environments. Our code will be released on https://github.com/SJTU-ViSYS-team/ThermoStereoRT
中文: ThermoStereoRT是一种实时热成像立体匹配方法,采用轻量级架构和多尺度注意力机制生成精确视差图,并通过知识蒸馏增强性能,适用于夜间无人机监控等全天候场景。
English: ThermoStereoRT is a real-time thermal stereo matching method that uses a lightweight backbone and multi-scale attention to generate accurate disparity maps, enhanced by knowledge distillation for robust performance in all-weather applications like drone surveillance.
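
One plausible way to combine sparse ground truth with a teacher signal, in the spirit of the knowledge distillation described above, is sketched below; the loss form, weighting, and masking are illustrative assumptions rather than the paper's training objective.
```python
import torch
import torch.nn.functional as F

def distillation_disparity_loss(student_disp, teacher_disp, gt_disp, gt_valid_mask, alpha=0.5):
    # Supervised term only where sparse ground truth exists; distillation term everywhere.
    sup = F.smooth_l1_loss(student_disp[gt_valid_mask], gt_disp[gt_valid_mask])
    distill = F.smooth_l1_loss(student_disp, teacher_disp.detach())
    return alpha * sup + (1 - alpha) * distill

student = torch.rand(2, 1, 60, 80, requires_grad=True)
teacher = torch.rand(2, 1, 60, 80)
gt = torch.rand(2, 1, 60, 80)
mask = torch.rand(2, 1, 60, 80) > 0.9   # sparse ground-truth pixels
print(distillation_disparity_loss(student, teacher, gt, mask))
```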

Authors:Linyan Huang, Haonan Lin, Yanning Zhou, Kaiwen Xiao
Title: FlexIP: Dynamic Control of Preservation and Personality for Customized Image Generation
Abstract:
With the rapid advancement of 2D generative models, preserving subject identity while enabling diverse editing has emerged as a critical research focus. Existing methods typically face inherent trade-offs between identity preservation and personalized manipulation. We introduce FlexIP, a novel framework that decouples these objectives through two dedicated components: a Personalization Adapter for stylistic manipulation and a Preservation Adapter for identity maintenance. By explicitly injecting both control mechanisms into the generative model, our framework enables flexible parameterized control during inference through dynamic tuning of the weight adapter. Experimental results demonstrate that our approach breaks through the performance limitations of conventional methods, achieving superior identity preservation while supporting more diverse personalized generation capabilities (Project Page: https://flexip-tech.github.io/flexip/).
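
A minimal sketch of the dynamic-weighting idea: two adapters whose contributions are blended by an inference-time scalar. The simple linear adapters and the residual formulation are assumptions for illustration, not FlexIP's architecture.
```python
import torch
import torch.nn as nn

class DualAdapterBlock(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.preservation = nn.Linear(dim, dim)      # identity-preserving branch
        self.personalization = nn.Linear(dim, dim)   # stylistic-editing branch

    def forward(self, h, preserve_weight=0.5):
        # preserve_weight in [0, 1] trades identity preservation against editability.
        return h + preserve_weight * self.preservation(h) \
                 + (1.0 - preserve_weight) * self.personalization(h)

block = DualAdapterBlock()
features = torch.randn(1, 77, 768)
print(block(features, preserve_weight=0.8).shape)  # torch.Size([1, 77, 768])
```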

Authors:Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, Peter Peer
Title: ID-Booth: Identity-consistent Face Generation with Diffusion Models
Abstract:
Recent advances in generative modeling have enabled the generation of high-quality synthetic data that is applicable in a variety of domains, including face recognition. Here, state-of-the-art generative models typically rely on conditioning and fine-tuning of powerful pretrained diffusion models to facilitate the synthesis of realistic images of a desired identity. Yet, these models often do not consider the identity of subjects during training, leading to poor consistency between generated and intended identities. In contrast, methods that employ identity-based training objectives tend to overfit on various aspects of the identity, and in turn, lower the diversity of images that can be generated. To address these issues, we present in this paper a novel generative diffusion-based framework, called ID-Booth. ID-Booth consists of a denoising network responsible for data generation, a variational auto-encoder for mapping images to and from a lower-dimensional latent space and a text encoder that allows for prompt-based control over the generation procedure. The framework utilizes a novel triplet identity training objective and enables identity-consistent image generation while retaining the synthesis capabilities of pretrained diffusion models. Experiments with a state-of-the-art latent diffusion model and diverse prompts reveal that our method facilitates better intra-identity consistency and inter-identity separability than competing methods, while achieving higher image diversity. In turn, the produced data allows for effective augmentation of small-scale datasets and training of better-performing recognition models in a privacy-preserving manner. The source code for the ID-Booth framework is publicly available at https://github.com/dariant/ID-Booth.
中文摘要:本文提出ID-Booth这一新型扩散框架,通过三重身份训练目标增强合成图像的身份一致性,在保持预训练模型生成能力的同时,实现了比现有方法更优的身份保持与图像多样性。
English Summary: The paper introduces ID-Booth, a novel diffusion-based framework that enhances identity consistency in synthetic image generation through a triplet training objective, achieving superior identity preservation and image diversity compared to existing methods.
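
A triplet identity objective of the general kind mentioned above can be written as a standard triplet margin loss over identity embeddings; the cosine distance, margin, and embedding source here are assumptions, not ID-Booth's exact formulation.
```python
import torch
import torch.nn.functional as F

def triplet_identity_loss(anchor, positive, negative, margin=0.3):
    # Pull same-identity embeddings together, push different identities apart.
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

emb = lambda: F.normalize(torch.randn(8, 512), dim=-1)   # stand-in identity embeddings
print(triplet_identity_loss(emb(), emb(), emb()))
```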

Authors:Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Xieyuanli Chen, Hesheng Wang
Title: Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction
Abstract:
Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awareness of multimodal environmental information from both 2D and 3D observations, hindering the further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multimodal information as input encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompt. Besides, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories compared to the state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models will be released at https://github.com/IRMVLab/MMTwin.
中文: 该摘要提出MMTwin新型扩散模型,通过融合二维RGB图像、三维点云、历史手部路径及文本提示等多模态输入,并集成自我运动与手部轨迹预测双扩散机制,在实验中展现出卓越的三维手部轨迹预测性能及环境泛化能力。
English: This abstract introduces MMTwin, a novel diffusion model that leverages multimodal inputs including 2D RGB images, 3D point clouds, past hand waypoints, and text prompts to enhance 3D hand trajectory prediction by concurrently modeling camera egomotion and future hand movements, demonstrating superior performance and generalization in experiments.

Authors:Zhe Wang, Yuhua Ru, Aladine Chetouani, Fang Chen, Fabian Bauer, Liping Zhang, Didier Hans, Rachid Jennane, Mohamed Jarraya, Yung Hsin Chen
Title: MoEDiff-SR: Mixture of Experts-Guided Diffusion Model for Region-Adaptive MRI Super-Resolution
Abstract:
Magnetic Resonance Imaging (MRI) at lower field strengths (e.g., 3T) suffers from limited spatial resolution, making it challenging to capture fine anatomical details essential for clinical diagnosis and neuroimaging research. To overcome this limitation, we propose MoEDiff-SR, a Mixture of Experts (MoE)-guided diffusion model for region-adaptive MRI Super-Resolution (SR). Unlike conventional diffusion-based SR models that apply a uniform denoising process across the entire image, MoEDiff-SR dynamically selects specialized denoising experts at a fine-grained token level, ensuring region-specific adaptation and enhanced SR performance. Specifically, our approach first employs a Transformer-based feature extractor to compute multi-scale patch embeddings, capturing both global structural information and local texture details. The extracted feature embeddings are then fed into an MoE gating network, which assigns adaptive weights to multiple diffusion-based denoisers, each specializing in different brain MRI characteristics, such as centrum semiovale, sulcal and gyral cortex, and grey-white matter junction. The final output is produced by aggregating the denoised results from these specialized experts according to dynamically assigned gating probabilities. Experimental results demonstrate that MoEDiff-SR outperforms existing state-of-the-art methods in terms of quantitative image quality metrics, perceptual fidelity, and computational efficiency. Difference maps from each expert further highlight their distinct specializations, confirming the effective region-specific denoising capability and the interpretability of expert contributions. Additionally, clinical evaluation validates its superior diagnostic capability in identifying subtle pathological features, emphasizing its practical relevance in clinical neuroimaging. Our code is available at https://github.com/ZWang78/MoEDiff-SR.
中文: MoEDiff-SR是一种新型扩散模型,通过混合专家机制动态选择针对不同脑区的专用去噪器,在定量指标和临床诊断方面显著提升了磁共振图像超分辨率性能。
English: MoEDiff-SR is a novel diffusion model that uses a Mixture of Experts to dynamically apply specialized denoising for different brain regions, significantly improving MRI super-resolution performance in both quantitative metrics and clinical diagnosis.
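
The token-level gating over specialized denoisers can be illustrated with a compact mixture-of-experts block: a gate produces per-token weights and the expert outputs are aggregated accordingly. The linear experts and gate below are placeholders, not the paper's diffusion denoisers.
```python
import torch
import torch.nn as nn

class TokenMoE(nn.Module):
    def __init__(self, dim=256, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, tokens):
        # tokens: (B, N, dim); per-token gating weights: (B, N, num_experts)
        weights = torch.softmax(self.gate(tokens), dim=-1)
        outputs = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return (outputs * weights.unsqueeze(2)).sum(dim=-1)

moe = TokenMoE()
print(moe(torch.randn(2, 1024, 256)).shape)  # torch.Size([2, 1024, 256])
```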

Authors:Gene Chou, Wenqi Xian, Guandao Yang, Mohamed Abdelfattah, Bharath Hariharan, Noah Snavely, Ning Yu, Paul Debevec
Title: FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
Abstract:
A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth
Chinese: FlashDepth 是一种实时视频深度估计方法,能够在24帧/秒下生成高分辨率、准确且一致的深度图,以较少训练数据显著超越现有模型的速度和边界清晰度,同时保持竞争力精度。
English: FlashDepth is a real-time video depth estimation method that achieves high-resolution, accurate, and consistent depth maps at 24 FPS, outperforming existing models in speed and boundary sharpness while maintaining competitive accuracy with minimal training data.

Authors:Alexander Rubinstein, Ameya Prabhu, Matthias Bethge, Seong Joon Oh
Title: Are We Done with Object-Centric Learning?
Abstract:
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene. This approach underpins various aims, including out-of-distribution (OOD) generalization, sample-efficient composition, and modeling of structured environments. Most research has focused on developing unsupervised mechanisms that separate objects into discrete slots in the representation space, evaluated using unsupervised object discovery. However, with recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently. This achieves remarkable zero-shot performance on OOD object discovery benchmarks, is scalable to foundation models, and can handle a variable number of slots out-of-the-box. Hence, the goal of OCL methods to obtain object-centric representations has been largely achieved. Despite this progress, a key question remains: How does the ability to separate objects within a scene contribute to broader OCL objectives, such as OOD generalization? We address this by investigating the OOD generalization challenge caused by spurious background cues through the lens of OCL. We propose a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), demonstrating that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods. However, challenges in real-world applications remain. We provide the toolbox for the OCL community to use scalable object-centric representations, and focus on practical applications and fundamental questions, such as understanding object perception in human cognition. Our code is available here: https://github.com/AlexanderRubinstein/OCCAM.
Chinese: 物体中心学习(OCL)旨在从场景中分离物体以提升泛化能力和效率,提出的OCCAM方法通过基于分割的编码显著优于传统基于槽位的方法,同时解决了实际应用中的挑战。
English: Object-centric learning (OCL) aims to isolate objects from scenes for improved generalization and efficiency, with the proposed OCCAM method using segmentation-based encoding to outperform traditional slot-based approaches while addressing challenges in real-world applications.
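
The core of the training-free probe is simple to sketch: mask out everything but the object, then classify the masked input with a frozen encoder. The stand-in encoder and classifier below are assumptions; OCCAM's choice of segmentation and feature models is not reproduced here.
```python
import torch
import torch.nn as nn

def classify_masked_object(image, mask, encoder: nn.Module, classifier: nn.Module):
    # image: (3, H, W) in [0, 1]; mask: (H, W) boolean object mask.
    masked = image * mask.unsqueeze(0)          # zero out background cues
    features = encoder(masked.unsqueeze(0))     # any frozen image encoder
    return classifier(features).argmax(dim=-1)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # stand-in encoder
classifier = nn.Linear(128, 10)
img, msk = torch.rand(3, 64, 64), torch.rand(64, 64) > 0.5
print(classify_masked_object(img, msk, encoder, classifier))
```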

Authors:Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Tianshuo Peng, Shufei Zhang, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Peng Gao, Bo Zhang
Title: OmniCaptioner: One Captioner to Rule Them All
Abstract:
We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
中文: OmniCaptioner 是一个通用的视觉描述框架,可为多种视觉领域生成精细的文本描述,增强大语言模型的视觉推理能力、改进图像生成并实现高效的监督微调。
English: OmniCaptioner is a unified visual captioning framework that generates detailed descriptions for diverse visual domains, enhancing visual reasoning with LLMs, improving image generation, and enabling efficient supervised fine-tuning.

Authors:Mengchen Zhang, Tong Wu, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin
Title: GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
Abstract:
Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: Traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions describing specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.
中文: 本文提出了GenDoP模型,基于DataDoP数据集训练的自回归Transformer,能根据文本引导生成富有艺术表现力的相机轨迹,在可控性和运动稳定性上优于现有方法。
English: This paper introduces GenDoP, an auto-regressive Transformer model trained on the DataDoP dataset to generate artistic and text-aligned camera trajectories, offering superior controllability and motion stability over existing methods.

Authors:Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: A Unified Agentic Framework for Evaluating Conditional Image Generation
Abstract:
Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.
中文: 本文提出CIGEval框架,利用大型多模态模型全面评估条件图像生成任务,在多项实验中达到接近人工评估的相关性,并超越现有最优方法。
English: This paper presents CIGEval, an agentic framework using large multimodal models to comprehensively evaluate conditional image generation, achieving near-human correlation in assessments and surpassing previous state-of-the-art methods.

Authors:Jonas Loos, Lorenz Linhardt
Title: Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies
Abstract:
Diffusion models have demonstrated remarkable capabilities in synthesizing realistic images, spurring interest in using their representations for various downstream tasks. To better understand the robustness of these representations, we analyze popular Stable Diffusion models using representational similarity and norms. Our findings reveal three phenomena: (1) the presence of a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts. These findings underscore the need to further investigate the properties of diffusion model representations before considering them for downstream tasks that require robust features. Project page: https://jonasloos.github.io/sd-representation-anomalies

Authors:Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong
Title: SIGMAN:Scaling 3D Human Gaussian Generation with Millions of Assets
Abstract:
3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization, which compresses multi-view images into Gaussians via a UV-structured VAE and pairs this with DiT-based conditional generation; this transforms the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift and supports end-to-end inference. In addition, we employ the multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.
中文摘要:本文提出了一种潜在空间生成范式,通过高斯压缩和条件生成将三维人体数字化转化为可学习的分布偏移问题,同时构建了百万规模的HGS-1M数据集,实现了具有精细纹理和衣物变形的高质量三维人体重建。
English Summary: This paper introduces a latent space generation paradigm that transforms 3D human digitization into a learnable distribution shift using Gaussian compression and conditional generation, while also creating a million-scale HGS-1M dataset to enable high-quality 3D human reconstruction with detailed textures and clothing deformation.

Authors:Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales
Title: Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition
Abstract:
Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.
中文: 本研究使用新型FoodNExTDB数据库和专家加权召回率指标评估了六种视觉语言模型,发现闭源模型在基础食物识别表现出色,但在烹饪方式等精细区分方面仍存在不足。
English: This study evaluates six Vision-Language Models for food recognition using the novel FoodNExTDB database and Expert-Weighted Recall metric, finding closed-source models excel at basic recognition but struggle with fine-grained distinctions like cooking styles.

Authors:Emmanuelle Bourigault, Amir Jamaludin, Abdullah Hamdi
Title: UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
Abstract:
In medical imaging, the primary challenge is collecting large-scale labeled data due to privacy concerns, logistics, and high labeling costs. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 MRI 3D samples (equivalent to 17.9 million 2D images) and more than 1.37 billion 2D segmentation masks of 72 organs, all based on the UK Biobank MRI dataset. We utilize automatic labeling, introduce an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (referred to as UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by demonstrating zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from similar domains (e.g., abdominal MRI). To further mitigate the effect of noisy labels, we propose a novel method called Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model, Swin-BOB, for 3D medical image segmentation based on the Swin-UNetr architecture, achieving state-of-the-art results in several benchmarks in 3D medical imaging, including the BRATS brain MRI tumor challenge (with a 0.4% improvement) and the BTCV abdominal CT scan benchmark (with a 1.3% improvement). The pre-trained models and the code are available at https://emmanuelleb985.github.io/ukbob , and the filtered labels will be made available with the UK Biobank.
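
Entropy-based test-time adaptation in its generic form is sketched below: the segmentation model is updated on the test sample to reduce the entropy of its own predictions. This is a standard TTA recipe given as context, not necessarily the exact ETTA procedure.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def entropy_tta_step(model: nn.Module, volume: torch.Tensor, optimizer) -> torch.Tensor:
    logits = model(volume)                                    # (B, C, D, H, W) class logits
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.detach()

model = nn.Conv3d(1, 4, kernel_size=3, padding=1)             # stand-in segmentation head
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
print(entropy_tta_step(model, torch.randn(1, 1, 16, 32, 32), opt))
```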

Authors:Jiawei Mao, Yuhan Wang, Yucheng Tang, Daguang Xu, Kang Wang, Yang Yang, Zongwei Zhou, Yuyin Zhou
Title: MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs
Abstract:
This paper presents MedSegFactory, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. It aims to serve as an unlimited data repository, supplying image-mask pairs to enhance existing segmentation tools. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other's generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.
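
The bidirectional cross-conditioning between the image and mask streams can be pictured as two cross-attention calls, one in each direction, applied at every denoising step. The symmetric design and dimensions below are illustrative assumptions, not the Joint Cross-Attention module itself.
```python
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.img_from_mask = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, mask_tokens):
        # Each stream queries the other so the two denoising trajectories stay aligned.
        img_out, _ = self.img_from_mask(img_tokens, mask_tokens, mask_tokens)
        mask_out, _ = self.mask_from_img(mask_tokens, img_tokens, img_tokens)
        return img_tokens + img_out, mask_tokens + mask_out

jca = JointCrossAttention()
a, b = jca(torch.randn(2, 256, 320), torch.randn(2, 256, 320))
print(a.shape, b.shape)  # torch.Size([2, 256, 320]) torch.Size([2, 256, 320])
```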

Authors:Chang Nie, Yiqing Xu, Guangming Wang, Zhe Liu, Yanzi Miao, Hesheng Wang
Title: MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking
Abstract:
Moving object segmentation (MOS) plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This innovative approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.
中文摘要:提出的MovSAM框架利用思维链增强的多模态大语言模型,通过逻辑推理实现单幅图像中的运动物体分割,在多项基准测试中达到92.5%的最先进性能,有效解决了多帧方法失效的应用场景。
English Summary: The proposed MovSAM framework utilizes a Chain-of-Thought enhanced Multimodal Large Language Model to enable logic-driven moving object segmentation from single images, achieving state-of-the-art performance of 92.5% on benchmarks while addressing scenarios where multi-frame methods fail.

Authors:Rishubh Parihar, Srinjay Sarkar, Sarthak Vora, Jogendra Kundu, R. Venkatesh Babu
Title: MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection
Abstract:
Current monocular 3D detectors are held back by the limited diversity and scale of real-world datasets. While data augmentation certainly helps, it's particularly difficult to generate realistic scene-aware augmented data for outdoor settings. Most current approaches to synthetic data generation focus on realistic object appearance through improved rendering techniques. However, we show that where and how objects are positioned is just as crucial for training effective 3D monocular detectors. The key obstacle lies in automatically determining realistic object placement parameters - including position, dimensions, and directional alignment when introducing synthetic objects into actual scenes. To address this, we introduce MonoPlace3D, a novel system that considers the 3D scene content to create realistic augmentations. Specifically, given a background scene, MonoPlace3D learns a distribution over plausible 3D bounding boxes. Subsequently, we render realistic objects and place them according to the locations sampled from the learned distribution. Our comprehensive evaluation on two standard datasets KITTI and NuScenes, demonstrates that MonoPlace3D significantly improves the accuracy of multiple existing monocular 3D detectors while being highly data efficient.

Authors:Rishubh Parihar, Vaibhav Agrawal, Sachidanand VS, R. Venkatesh Babu
Title: Compass Control: Multi Object Orientation Control for Text-to-Image Generation
Abstract:
Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model with a set of orientation-aware compass tokens, one for each object, along with text tokens. A light-weight encoder network predicts these compass tokens taking object orientation as the input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and leads to entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model is able to achieve precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, quantified with extensive evaluations and a user study.

Authors:Pedro Hermosilla, Christian Stippel, Leon Sick
Title: Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
Abstract:
Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also surpasses existing self-supervised approaches by a large margin. The model and training code can be found at our Github repository (https://github.com/phermosilla/msm).
中文: 自监督学习已革新二维视觉但在三维场景理解中应用有限;本文提出一种鲁棒评估协议和新型掩码场景建模方法,使自监督三维特征无需微调即可达到监督学习性能。
English: Self-supervised learning has revolutionized 2D vision but remains underutilized in 3D scene understanding; this paper introduces both a robust evaluation protocol and a novel Masked Scene Modeling approach that enables self-supervised 3D features to match supervised performance without fine-tuning.
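
Linear probing, the evaluation setting emphasized above, amounts to fitting only a linear classifier on frozen features. The synthetic features and label counts below are placeholders; the point is that no backbone weights are updated.
```python
import torch
import torch.nn as nn

features = torch.randn(10_000, 96)          # frozen per-point features from a 3D backbone
labels = torch.randint(0, 20, (10_000,))    # semantic labels

probe = nn.Linear(96, 20)                   # only this layer is trained
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()

print("final probe loss:", loss.item())
```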

Authors:Anil Armagan, Albert Saà-Garriga, Bruno Manganelli, Kyuwon Kim, M. Kerim Yucel
Title: GSta: Efficient Training Scheme with Siestaed Gaussians for Monocular 3D Scene Reconstruction
Abstract:
Gaussian Splatting (GS) is a popular approach for 3D reconstruction, mostly due to its ability to converge reasonably fast, faithfully represent the scene and render (novel) views in a fast fashion. However, it suffers from large storage and memory requirements, and its training speed still lags behind the hash-grid based radiance field approaches (e.g. Instant-NGP), which makes such models especially difficult to deploy in robotics scenarios, where 3D reconstruction is crucial for accurate operation. In this paper, we propose GSta, which dynamically identifies Gaussians that have converged well during training, based on their positional and color gradient norms. By forcing such Gaussians into a siesta and stopping their updates (freezing) during training, we improve training speed with competitive accuracy compared to the state of the art. We also propose an early stopping mechanism based on the PSNR values computed on a subset of training images. Combined with other improvements, such as integrating a learning rate scheduler, GSta achieves an improved Pareto front in convergence speed, memory and storage requirements, while preserving quality. We also show that GSta can improve other methods and complement orthogonal approaches in efficiency improvement; once combined with Trick-GS, GSta achieves up to 5x faster training and a 16x smaller disk size compared to vanilla GS, while having comparable accuracy and consuming only half the peak memory. More visualisations are available at https://anilarmagan.github.io/SRUK-GSta.
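
The siesta heuristic can be sketched as a running mask over Gaussians whose position and color gradient norms have dropped below a threshold; frozen Gaussians are simply excluded from subsequent updates. The thresholds and tensor layout below are assumptions, not GSta's tuned values.
```python
import torch

def update_siesta_mask(pos_grad, color_grad, frozen, pos_thresh=1e-6, color_thresh=1e-6):
    # pos_grad: (N, 3), color_grad: (N, 3); frozen: (N,) bool mask carried across steps.
    converged = (pos_grad.norm(dim=-1) < pos_thresh) & (color_grad.norm(dim=-1) < color_thresh)
    return frozen | converged   # once asleep, a Gaussian stays asleep

n = 100_000
frozen = torch.zeros(n, dtype=torch.bool)
frozen = update_siesta_mask(torch.randn(n, 3) * 1e-7, torch.randn(n, 3) * 1e-7, frozen)
print("fraction frozen:", frozen.float().mean().item())
```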

Authors:Alexandre Banks, Richard Cook, Septimiu E. Salcudean
Title: Setup-Invariant Augmented Reality for Teaching by Demonstration with Surgical Robots
Abstract:
Augmented reality (AR) is an effective tool in robotic surgery education as it combines exploratory learning with three-dimensional guidance. However, existing AR systems require expert supervision and do not account for differences in the mentor and mentee robot configurations. To enable novices to train outside the operating room while receiving expert-informed guidance, we present dV-STEAR: an open-source system that plays back task-aligned expert demonstrations without assuming identical setup joint positions between expert and novice. Pose estimation was rigorously quantified, showing a registration error of 3.86 (SD=2.01)mm. In a user study (N=24), dV-STEAR significantly improved novice performance on tasks from the Fundamentals of Laparoscopic Surgery. In a single-handed ring-over-wire task, dV-STEAR increased completion speed (p=0.03) and reduced collision time (p=0.01) compared to dry-lab training alone. During a pick-and-place task, it improved success rates (p=0.004). Across both tasks, participants using dV-STEAR exhibited significantly more balanced hand use and reported lower frustration levels. This work presents a novel educational tool implemented on the da Vinci Research Kit, demonstrates its effectiveness in teaching novices, and builds the foundation for further AR integration into robot-assisted surgery.
中文:dV-STEAR系统通过增强现实技术让新手外科医生能在专家指导下进行训练,无需相同设备配置即可显著提升手术任务表现并降低操作挫败感。
English: The dV-STEAR system enables novice surgeons to train with expert guidance in augmented reality, significantly improving performance and reducing frustration in robotic surgery tasks without requiring identical equipment setups.

Authors:Elia Peruzzo, Dejia Xu, Xingqian Xu, Humphrey Shi, Nicu Sebe
Title: RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism
Abstract:
Video generation is experiencing rapid growth, driven by advances in diffusion models and the development of better and larger datasets. However, producing high-quality videos remains challenging due to the high-dimensional data and the complexity of the task. Recent efforts have primarily focused on enhancing visual quality and addressing temporal inconsistencies, such as flickering. Despite progress in these areas, the generated videos often fall short in terms of motion complexity and physical plausibility, with many outputs either appearing static or exhibiting unrealistic motion. In this work, we propose a framework to improve the realism of motion in generated videos, exploring a complementary direction to much of the existing literature. Specifically, we advocate for the incorporation of a retrieval mechanism during the generation phase. The retrieved videos act as grounding signals, providing the model with demonstrations of how the objects move. Our pipeline is designed to apply to any text-to-video diffusion model, conditioning a pretrained model on the retrieved samples with minimal fine-tuning. We demonstrate the superiority of our approach through established metrics, recently proposed benchmarks, and qualitative results, and we highlight additional applications of the framework.
中文: 本文提出一种检索增强框架,通过在生成过程中引入真实视频作为运动参照来提升视频生成的运动真实感,该方法仅需少量微调即可适配现有模型,并在多项评估中展现优越性能。
English: This paper introduces a retrieval-enhanced framework to improve motion realism in video generation by grounding diffusion models with real video examples, requiring minimal fine-tuning and demonstrating superior performance across metrics and benchmarks.

Authors:Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu
Title: Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Abstract:
High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at https://github.com/GeWu-Lab/Patch-Matters
中文: 高质量图像描述对于跨模态应用至关重要,本研究提出了一种分而治之再聚合的策略,通过语义和空间分块增强细节并减少多模态模型中的幻觉,无需重新训练。
English: High-quality image captions are essential for cross-modal applications, and this study introduces a divide-then-aggregate strategy using semantic and spatial patches to enhance detail and reduce hallucinations in multimodal models without retraining.
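
A skeleton of the divide-then-aggregate loop is shown below: caption each spatial patch, then merge the local captions into a global description. The grid size, the caption_fn interface, the FakeImage stand-in, and the naive concatenation are all illustrative assumptions; the paper's semantic patching and filtering steps are not reproduced.
```python
from typing import Callable, List

def divide_then_aggregate(image, caption_fn: Callable, grid: int = 2) -> str:
    # image: any object exposing .size and .crop (e.g. a PIL.Image);
    # caption_fn: maps an image region to a short description (an MLLM would back it).
    w, h = image.size
    local: List[str] = []
    for i in range(grid):
        for j in range(grid):
            box = (j * w // grid, i * h // grid, (j + 1) * w // grid, (i + 1) * h // grid)
            local.append(caption_fn(image.crop(box)))
    # A real pipeline would filter hallucinated or inconsistent local details here.
    return caption_fn(image) + " Details: " + " ".join(local)

class FakeImage:  # tiny stand-in so the sketch runs without PIL
    size = (640, 480)
    def crop(self, box):
        return f"region {box}"
    def __repr__(self):
        return "full image"

print(divide_then_aggregate(FakeImage(), caption_fn=lambda region: f"[caption of {region}]"))
```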

Authors:Nan Peng, Xun Zhou, Mingming Wang, Guisong Chen, Wenqi Xu
Title: Uni-PrevPredMap: Extending PrevPredMap to a Unified Framework of Prior-Informed Modeling for Online Vectorized HD Map Construction
Abstract:
Safety constitutes a foundational imperative for autonomous driving systems, necessitating the maximal incorporation of accessible external prior information. This study establishes that temporal perception buffers and cost-efficient maps inherently form complementary prior sources for online vectorized high-definition (HD) map construction. We present Uni-PrevPredMap, a unified prior-informed framework that systematically integrates two synergistic information sources: previous predictions and simulated outdated HD maps. The framework introduces two core innovations: a tile-indexed 3D vectorized global map processor enabling efficient refreshment, storage, and retrieval of 3D vectorized priors; a tri-mode operational optimization paradigm ensuring consistency across non-prior, temporal-prior, and temporal-map-fusion-prior scenarios while mitigating reliance on idealized map fidelity assumptions. Uni-PrevPredMap achieves state-of-the-art performance in map-absent scenarios across established online vectorized HD map construction benchmarks. When provided with simulated outdated HD maps, the framework exhibits robust capabilities in error-resilient prior fusion, empirically confirming the synergistic complementarity between previous predictions and simulated outdated HD maps. Code will be available at https://github.com/pnnnnnnn/Uni-PrevPredMap.
中文:Uni-PrevPredMap框架通过整合时序感知缓存与模拟过时地图,在在线矢量化高精地图构建中实现了最优性能,有效验证了这两种先验信息源之间的互补协同性与错误容忍能力。
English: Uni-PrevPredMap is a unified framework that integrates temporal perception buffers and simulated outdated maps to achieve state-of-the-art performance in online vectorized HD map construction, demonstrating robust error resilience and complementary synergy between these prior information sources.
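To make the tile-indexed global map processor described above more concrete, here is a minimal, hedged sketch of what tile-indexed storage, refreshment, and retrieval of 3D vectorized priors could look like. The class name, tile size, and query logic are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a tile-indexed store for 3D
# vectorized map elements, supporting insertion, refresh, and retrieval by tile.
from collections import defaultdict
import numpy as np

class TileIndexedVectorMap:
    def __init__(self, tile_size=30.0):
        self.tile_size = tile_size                 # tile edge length in meters (assumed)
        self.tiles = defaultdict(list)             # (ix, iy) -> list of 3D polylines

    def _tile_index(self, xy):
        return tuple(np.floor(np.asarray(xy) / self.tile_size).astype(int))

    def insert(self, polyline):
        """Store a 3D polyline of shape (N, 3) under the tile of its centroid."""
        key = self._tile_index(polyline[:, :2].mean(axis=0))
        self.tiles[key].append(polyline)

    def refresh(self, polyline):
        """Replace the priors in the polyline's tile with the newest prediction."""
        key = self._tile_index(polyline[:, :2].mean(axis=0))
        self.tiles[key] = [polyline]

    def query(self, ego_xy, radius=60.0):
        """Retrieve all polylines whose tiles fall inside a square ROI around the ego pose."""
        n = int(np.ceil(radius / self.tile_size))
        cx, cy = self._tile_index(ego_xy)
        out = []
        for ix in range(cx - n, cx + n + 1):
            for iy in range(cy - n, cy + n + 1):
                out.extend(self.tiles.get((ix, iy), []))
        return out

# Usage: store a lane divider, then fetch priors around the ego vehicle.
m = TileIndexedVectorMap()
m.insert(np.array([[0.0, 0.0, 0.0], [5.0, 0.1, 0.0], [10.0, 0.2, 0.0]]))
priors = m.query(ego_xy=(2.0, 0.0))
```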

Authors:Hu Cui, Tessai Hayama
Title: HGMamba: Enhancing 3D Human Pose Estimation with a HyperGCN-Mamba Network
Abstract:
3D human pose lifting is a promising research area that leverages estimated and ground-truth 2D human pose data for training. While existing approaches primarily aim to enhance the performance of estimated 2D poses, they often struggle when applied to ground-truth 2D pose data. We observe that achieving accurate 3D pose reconstruction from ground-truth 2D poses requires precise modeling of local pose structures, alongside the ability to extract robust global spatio-temporal features. To address these challenges, we propose a novel Hyper-GCN and Shuffle Mamba (HGMamba) block, which processes input data through two parallel streams: Hyper-GCN and Shuffle-Mamba. The Hyper-GCN stream models the human body structure as hypergraphs with varying levels of granularity to effectively capture local joint dependencies. Meanwhile, the Shuffle Mamba stream leverages a state space model to perform spatio-temporal scanning across all joints, enabling the establishment of global dependencies. By adaptively fusing these two representations, HGMamba achieves strong global feature modeling while excelling at local structure modeling. We stack multiple HGMamba blocks to create three variants of our model, allowing users to select the most suitable configuration based on the desired speed-accuracy trade-off. Extensive evaluations on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate the effectiveness of our approach. HGMamba-B achieves state-of-the-art results, with P1 errors of 38.65 mm and 14.33 mm on the respective datasets. Code and models are available: https://github.com/HuCui2022/HGMamba
中文: 提出的HGMamba模型通过并行超图卷积和混洗曼巴流分别捕捉局部关节依赖和全局时空特征,在基准数据集上实现了最先进的三维姿态重建精度。
English: The proposed HGMamba model introduces parallel Hyper-GCN and Shuffle Mamba streams to capture local joint dependencies and global spatio-temporal features, achieving state-of-the-art 3D pose reconstruction accuracy on benchmark datasets.

Authors:Ludvig Dillén, Per-Erik Forssén, Johan Edstedt
Title: FACT: Multinomial Misalignment Classification for Point Cloud Registration
Abstract:
We present FACT, a method for predicting alignment quality (i.e., registration error) of registered lidar point cloud pairs. This is useful e.g. for quality assurance of large, automatically registered 3D models. FACT extracts local features from a registered pair and processes them with a point transformer-based network to predict a misalignment class. We generalize prior work that studies binary alignment classification of registration errors by recasting it as multinomial misalignment classification. To achieve this, we introduce a custom regression-by-classification loss function that combines the cross-entropy and Wasserstein losses, and demonstrate that it outperforms both direct regression and prior binary classification. FACT successfully classifies point-cloud pairs registered with both the classical ICP and GeoTransformer, while other choices, such as standard point-cloud-quality metrics and registration residuals, are shown to be poor predictors of misalignment. On a synthetically perturbed point-cloud task introduced by the CorAl method, we show that FACT achieves substantially better performance than CorAl. Finally, we demonstrate how FACT can assist experts in correcting misaligned point-cloud maps. Our code is available at https://github.com/LudvigDillen/FACT_for_PCMC.
中文: FACT是一种通过点变换网络和定制损失函数预测激光雷达点云对配准质量的方法,在分类配准误差方面优于现有方法。
English: FACT is a method that predicts the alignment quality of registered lidar point cloud pairs using a point transformer-based network and a custom loss function, outperforming existing approaches in classifying misalignment.
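As an illustration of a regression-by-classification loss of the kind described above, the sketch below combines cross-entropy with a 1D Wasserstein term over ordered misalignment classes. The weighting and the one-hot target are assumptions for demonstration, not FACT's exact formulation.

```python
# Hedged sketch of a cross-entropy + 1D Wasserstein loss over ordered classes.
import torch
import torch.nn.functional as F

def ordinal_ce_wasserstein_loss(logits, target, lam=1.0):
    """logits: (B, K) scores over K ordered misalignment classes; target: (B,) labels."""
    ce = F.cross_entropy(logits, target)

    probs = logits.softmax(dim=-1)                                   # (B, K)
    onehot = F.one_hot(target, num_classes=logits.size(-1)).float()
    # On an ordered 1D support, Wasserstein-1 equals the L1 distance between CDFs.
    w1 = (probs.cumsum(-1) - onehot.cumsum(-1)).abs().sum(-1).mean()
    return ce + lam * w1

# Example: 4 misalignment classes, batch of 2.
logits = torch.randn(2, 4, requires_grad=True)
loss = ordinal_ce_wasserstein_loss(logits, torch.tensor([0, 3]))
loss.backward()
```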

Authors:Xiaohang Yang, Qing Wang, Jiahao Yang, Gregory Slabaugh, Shanxin Yuan
Title: STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints
Abstract:
Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches. Code page: https://github.com/XiaohangYang829/STaR.
中文摘要:本文提出STaR模型,通过空间模块的密集形状表征与穿透约束保障几何合理性,结合时序模块的变换器与一致性约束实现运动平滑,在保持运动语义的同时显著减少了穿透现象。
English Summary: The paper introduces STaR, a sequence-to-sequence model that integrates spatial and temporal modules with penetration and consistency constraints to achieve balanced motion retargeting by preserving semantics while ensuring geometric plausibility and temporal smoothness.

Authors:Adam McArthur, Stephanie Wichuk, Stephen Burnside, Andrew Kirby, Alexander Scammon, Damian Sol, Abhilash Hareendranathan, Jacob L. Jaremko
Title: Retuve: Automated Multi-Modality Analysis of Hip Dysplasia with Open Source AI
Abstract:
Developmental dysplasia of the hip (DDH) poses significant diagnostic challenges, hindering timely intervention. Current screening methodologies lack standardization, and AI-driven studies suffer from reproducibility issues due to limited data and code availability. To address these limitations, we introduce Retuve, an open-source framework for multi-modality DDH analysis, encompassing both ultrasound (US) and X-ray imaging. Retuve provides a complete and reproducible workflow, offering open datasets comprising expert-annotated US and X-ray images, pre-trained models with training code and weights, and a user-friendly Python Application Programming Interface (API). The framework integrates segmentation and landmark detection models, enabling automated measurement of key diagnostic parameters such as the alpha angle and acetabular index. By adhering to open-source principles, Retuve promotes transparency, collaboration, and accessibility in DDH research. This initiative has the potential to democratize DDH screening, facilitate early diagnosis, and ultimately improve patient outcomes by enabling widespread screening and early intervention. The GitHub repository/code can be found here: https://github.com/radoss-org/retuve
中文:Retuve是一个开源框架,旨在解决髋关节发育不良的诊断难题,通过提供标准化的多模态分析工具、开放数据集和预训练模型,促进筛查准确性和早期干预,改善患者预后。
English: Retuve is an open-source framework designed to overcome diagnostic challenges in developmental dysplasia of the hip by providing standardized, multi-modality analysis tools with accessible datasets and models to enhance screening accuracy and early intervention.

Authors:Huzaifa Arif, Keerthiram Murugesan, Payel Das, Alex Gittens, Pin-Yu Chen
Title: PEEL the Layers and Find Yourself: Revisiting Inference-time Data Leakage for Residual Neural Networks
Abstract:
This paper explores inference-time data leakage risks of deep neural networks (NNs), where a curious and honest model service provider is interested in retrieving users' private data inputs solely based on the model inference results. Particularly, we revisit residual NNs due to their popularity in computer vision and our hypothesis that residual blocks are a primary cause of data leakage owing to the use of skip connections. By formulating inference-time data leakage as a constrained optimization problem, we propose a novel backward feature inversion method, \textbf{PEEL}, which can effectively recover block-wise input features from the intermediate output of residual NNs. The surprising results in high-quality input data recovery can be explained by the intuition that the output from these residual blocks can be considered as a noisy version of the input and thus the output retains sufficient information for input recovery. We demonstrate the effectiveness of our layer-by-layer feature inversion method on facial image datasets and pre-trained classifiers. Our results show that PEEL outperforms the state-of-the-art recovery methods by an order of magnitude when evaluated by mean squared error (MSE). The code is available at \href{https://github.com/Huzaifa-Arif/PEEL}{https://github.com/Huzaifa-Arif/PEEL}
中文: 本文提出PEEL这一新型逆向特征还原方法,通过将残差神经网络中间输出视为输入的噪声版本,有效实现输入数据恢复,在人脸图像恢复任务中展现出比现有技术更优越的性能。
English: This paper introduces PEEL, a novel backward feature inversion method that effectively recovers input data from residual neural networks by treating intermediate outputs as noisy versions of inputs, demonstrating superior performance over existing techniques in facial image recovery.
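A minimal sketch of the block-inversion intuition described above: since a residual block computes y = x + F(x), the input can often be recovered by optimizing a candidate input so that the block reproduces the observed intermediate output. The optimizer, step count, and toy block are assumptions for illustration, not the released PEEL code.

```python
# Invert a single residual block y = x + F(x) by optimizing a candidate input.
import torch

def invert_residual_block(block, y, x_shape, steps=500, lr=0.1):
    """block: a module computing x + F(x); y: observed output of that block."""
    x_hat = torch.zeros(x_shape, requires_grad=True)
    opt = torch.optim.Adam([x_hat], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(block(x_hat), y)
        loss.backward()
        opt.step()
    return x_hat.detach()

# Toy residual block; the paper applies this idea block by block, from the last
# observed layer back toward the input image.
class ToyResBlock(torch.nn.Module):
    def __init__(self, c=8):
        super().__init__()
        self.f = torch.nn.Sequential(torch.nn.Conv2d(c, c, 3, padding=1),
                                     torch.nn.ReLU(),
                                     torch.nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.f(x)

block = ToyResBlock().eval()
x_true = torch.randn(1, 8, 16, 16)
with torch.no_grad():
    y = block(x_true)
x_rec = invert_residual_block(block, y, x_true.shape)
```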

Authors:Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, Muhammed Kocabas
Title: PromptHMR: Promptable Human Mesh Recovery
Abstract:
Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.

Authors:Hritam Basak, Zhaozheng Yin
Title: SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
Abstract:
Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation due to two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confuse classes with similar visual appearance due to limited supervision; and (2) a skewed and imbalanced training data distribution favors source representation learning while impeding exploration of the limited information about tail classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code is available: \href{https://github.com/hritam-98/SemiDAViL}{GitHub}.
中文: 本文首次提出语言引导的半监督领域自适应方法用于语义分割,通过利用视觉语言模型增强类别区分能力并采用新型损失函数解决数据不平衡问题,显著超越了现有最优方法。
English: This paper introduces the first language-guided semi-supervised domain adaptation (SSDA) method for semantic segmentation, leveraging vision-language models to enhance class discrimination and address data imbalance through novel loss formulations, achieving state-of-the-art performance.
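One common way to realize a class-balanced segmentation loss of the kind mentioned above is to weight per-class cross-entropy by the inverse effective number of pixels per class (Cui et al., 2019). The sketch below is illustrative only and may differ from SemiDAViL's exact formulation; the beta value and ignore index are assumptions.

```python
# Illustrative class-balanced cross-entropy for semantic segmentation.
import torch
import torch.nn.functional as F

def class_balanced_weights(pixel_counts, beta=0.9999):
    counts = torch.as_tensor(pixel_counts, dtype=torch.float)
    effective = 1.0 - torch.pow(beta, counts)            # effective number of samples
    weights = (1.0 - beta) / effective.clamp_min(1e-8)
    return weights * len(counts) / weights.sum()          # normalize to mean 1

def balanced_seg_loss(logits, labels, pixel_counts):
    """logits: (B, C, H, W); labels: (B, H, W); pixel_counts: per-class pixel totals."""
    w = class_balanced_weights(pixel_counts).to(logits.device)
    return F.cross_entropy(logits, labels, weight=w, ignore_index=255)

# Example with 3 classes and a heavily dominant head class.
logits = torch.randn(2, 3, 8, 8)
labels = torch.randint(0, 3, (2, 8, 8))
loss = balanced_seg_loss(logits, labels, pixel_counts=[1e6, 5e4, 2e3])
```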

Authors:Hicham Talaoubrid, Anissa Mokraoui, Ismail Ben Ayed, Axel Prouvost, Sonimith Hang, Monit Korn, Rémi Harvey
Title: Analyzing the Impact of Low-Rank Adaptation for Cross-Domain Few-Shot Object Detection in Aerial Images
Abstract:
This paper investigates the application of Low-Rank Adaptation (LoRA) to small models for cross-domain few-shot object detection in aerial images. Originally designed for large-scale models, LoRA helps mitigate overfitting, making it a promising approach for resource-constrained settings. We integrate LoRA into DiffusionDet, and evaluate its performance on the DOTA and DIOR datasets. Our results show that LoRA applied after an initial fine-tuning slightly improves performance in low-shot settings (e.g., 1-shot and 5-shot), while full fine-tuning remains more effective in higher-shot configurations. These findings highlight LoRA's potential for efficient adaptation in aerial object detection, encouraging further research into parameter-efficient fine-tuning strategies for few-shot learning. Our code is available here: https://github.com/HichTala/LoRA-DiffusionDet.
中文: 本研究将低秩自适应(LoRA)应用于小模型进行少样本航空目标检测,表明其在少样本场景下能轻微提升性能,而完整微调在数据充足时效果更佳。
English: This study applies Low-Rank Adaptation (LoRA) to small models for few-shot aerial object detection, showing it slightly improves performance in low-shot scenarios while full fine-tuning works better with more data.
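For readers unfamiliar with LoRA, the sketch below shows the standard low-rank adapter around a frozen linear layer: the pretrained weight is kept fixed and only the low-rank factors are trained. How the paper wires such adapters into DiffusionDet is not reproduced here; the rank and alpha values are illustrative.

```python
# Minimal LoRA layer: frozen base weight plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

layer = LoRALinear(nn.Linear(256, 256), r=4)
out = layer(torch.randn(8, 256))                          # shape (8, 256)
```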

Authors:Matvei Popov, Aymen Kallala, Anirudha Ramesh, Narimane Hennouni, Shivesh Khaitan, Rick Gentry, Alain-Sam Cohen
Title: Leveraging State Space Models in Long Range Genomics
Abstract:
Long-range dependencies are critical for understanding genomic structure and function, yet most conventional methods struggle with them. Widely adopted transformer-based models, while excelling at short-context tasks, are limited by the attention module's quadratic computational complexity and inability to extrapolate to sequences longer than those seen in training. In this work, we explore State Space Models (SSMs) as a promising alternative by benchmarking two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics modeling tasks under conditions parallel to a 50M parameter transformer baseline. We discover that SSMs match transformer performance and exhibit impressive zero-shot extrapolation across multiple tasks, handling contexts 10 to 100 times longer than those seen during training, indicating more generalizable representations better suited for modeling the long and complex human genome. Moreover, we demonstrate that these models can efficiently process sequences of 1M tokens on a single GPU, allowing for modeling entire genomic regions at once, even in labs with limited compute. Our findings establish SSMs as efficient and scalable for long-context genomic analysis.

Authors:Mohsen Jenadeleh, Jon Sneyers, Panqi Jia, Shima Mohammadi, Joao Ascenso, Dietmar Saupe
Title: Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression
Abstract:
Learning-based image compression methods have recently emerged as promising alternatives to traditional codecs, offering improved rate-distortion performance and perceptual quality. JPEG AI represents the latest standardized framework in this domain, leveraging deep neural networks for high-fidelity image reconstruction. In this study, we present a comprehensive subjective visual quality assessment of JPEG AI-compressed images using the JPEG AIC-3 methodology, which quantifies perceptual differences in terms of Just Noticeable Difference (JND) units. We generated a dataset of 50 compressed images with fine-grained distortion levels from five diverse sources. A large-scale crowdsourced experiment collected 96,200 triplet responses from 459 participants. We reconstructed JND-based quality scales using a unified model based on boosted and plain triplet comparisons. Additionally, we evaluated the alignment of multiple objective image quality metrics with human perception in the high-fidelity range. The CVVDP metric achieved the overall highest performance; however, most metrics including CVVDP were overly optimistic in predicting the quality of JPEG AI-compressed images. These findings emphasize the necessity for rigorous subjective evaluations in the development and benchmarking of modern image codecs, particularly in the high-fidelity range. Another technical contribution is the introduction of the well-known Meng-Rosenthal-Rubin statistical test to the field of Quality of Experience research. This test can reliably assess the significance of difference in performance of quality metrics in terms of correlation between metrics and ground truth. The complete dataset, including all subjective scores, is publicly available at https://github.com/jpeg-aic/dataset-JPEG-AI-SDR25.
中文: 本研究对JPEG AI压缩图像进行了全面的主观质量评估,发现多数客观指标高估了其感知质量,并强调了在编解码器开发中严格人类评估的必要性。
English: This study conducts a comprehensive subjective quality assessment of JPEG AI-compressed images, revealing that most objective metrics overestimate their perceptual quality and underscoring the need for rigorous human evaluation in codec development.
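The Meng-Rosenthal-Rubin test mentioned above compares two dependent correlations that share one variable, such as two quality metrics correlated with the same subjective ground-truth scores. The hedged sketch below follows the standard 1992 formulation as I understand it; the input values are illustrative and the significance handling is an assumption.

```python
# Hedged sketch of the Meng-Rosenthal-Rubin (1992) z-test for two dependent correlations.
import numpy as np
from scipy import stats

def meng_rosenthal_rubin(r1, r2, r12, n):
    """r1, r2: each metric's correlation with ground truth;
    r12: correlation between the two metrics; n: number of samples."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)               # Fisher z-transform
    r_sq_bar = (r1 ** 2 + r2 ** 2) / 2.0
    f = min((1.0 - r12) / (2.0 * (1.0 - r_sq_bar)), 1.0)  # capped at 1
    h = (1.0 - f * r_sq_bar) / (1.0 - r_sq_bar)
    z = (z1 - z2) * np.sqrt((n - 3) / (2.0 * (1.0 - r12) * h))
    p = 2.0 * (1.0 - stats.norm.cdf(abs(z)))               # two-sided p-value
    return z, p

z, p = meng_rosenthal_rubin(r1=0.92, r2=0.88, r12=0.85, n=50)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```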

Authors:Hongbin Liang, Hezhe Qiao, Wei Huang, Qizhou Wang, Mingsheng Shang, Lin Chen
Title: Temporal-contextual Event Learning for Pedestrian Crossing Intent Prediction
Abstract:
Ensuring the safety of vulnerable road users through accurate prediction of pedestrian crossing intention (PCI) plays a crucial role in the context of autonomous and assisted driving. Analyzing the set of observation video frames in ego-view has been widely used in most PCI prediction methods to forecast the cross intent. However, they struggle to capture the critical events related to pedestrian behaviour along the temporal dimension due to the high redundancy of the video frames, which results in the sub-optimal performance of PCI prediction. Our research addresses the challenge by introducing a novel approach called \underline{T}emporal-\underline{c}ontextual Event \underline{L}earning (TCL). The TCL is composed of the Temporal Merging Module (TMM), which aims to manage the redundancy by clustering the observed video frames into multiple key temporal events. Then, the Contextual Attention Block (CAB) is employed to adaptively aggregate multiple event features along with visual and non-visual data. By synthesizing the temporal feature extraction and contextual attention on the key information across the critical events, TCL can learn expressive representation for the PCI prediction. Extensive experiments are carried out on three widely adopted datasets, including PIE, JAAD-beh, and JAAD-all. The results show that TCL substantially surpasses the state-of-the-art methods. Our code can be accessed at https://github.com/dadaguailhb/TCL.
Chinese: 本研究提出了一种新颖的时序上下文事件学习(TCL)方法,通过将视频帧聚类为关键时序事件并自适应整合上下文特征,显著提升了行人过街意图预测性能,在多个数据集上大幅超越现有最优方法。
English: This study introduces a novel Temporal-contextual Event Learning (TCL) approach that enhances pedestrian crossing intention prediction by clustering video frames into key temporal events and adaptively integrating contextual features, significantly outperforming existing methods across multiple datasets.

Authors:Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim
Title: D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes
Abstract:
We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera poses. To overcome this, we propose D^2USt3R that regresses 4D pointmaps that simultaneously capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating both spatial and temporal aspects, our approach successfully encapsulates spatio-temporal dense correspondence into the proposed 4D pointmaps, enhancing downstream tasks. Extensive experimental evaluations demonstrate that our proposed approach consistently achieves superior reconstruction performance across various datasets featuring complex motions.

Authors:Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
Abstract:
Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. Recent approaches have investigated training-free strategies to enable high-resolution image synthesis with pre-trained models. However, these techniques often struggle with generating high-quality visuals and tend to exhibit artifacts or low-fidelity details, as they typically rely solely on the endpoint of the low-resolution sampling trajectory while neglecting intermediate states that are critical for preserving structure and synthesizing finer detail. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential of pre-trained flow models. Specifically, HiFlow establishes a virtual reference flow within the high-resolution space that effectively captures the characteristics of low-resolution flow information, offering guidance for high-resolution generation through three key aspects: initialization alignment for low-frequency consistency, direction alignment for structure preservation, and acceleration alignment for detail fidelity. By leveraging such flow-aligned guidance, HiFlow substantially elevates the quality of high-resolution image synthesis of T2I models and demonstrates versatility across their personalized variants. Extensive experiments validate HiFlow's capability in achieving superior high-resolution image quality over state-of-the-art methods.
中文: HiFlow是一种无需训练的框架,通过初始化、方向和加速三个维度的流对齐指导,有效提升文本到图像模型的高分辨率生成质量,在保持结构的同时增强细节保真度。
English: HiFlow is a training-free framework that enhances high-resolution image synthesis in text-to-image models by aligning low-resolution flow information across initialization, direction, and acceleration to preserve structure and improve detail fidelity.

Authors:Xiaoxing Hu, Ziyang Gong, Yupei Wang, Yuru Jia, Gen Luo, Xue Yang
Title: Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed to conquer RS artifacts. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple-yet-effective approaches enable Earth-Adapter to more efficiently overcome the disturbances caused by artifacts than previous PEFT methods, significantly enhancing the FMs' performance on RS scenarios. Experiments on Domain Adaptation (DA) and Domain Generalization (DG) semantic segmentation benchmarks showcase the Earth-Adapter's effectiveness. Compared with the baseline Rein, Earth-Adapter significantly improves mIoU by 9.0% on DA and 3.1% on DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth-Adapter.
中文: Earth-Adapter是一种专为遥感设计的参数高效微调方法,通过混合频率适配技术有效分离并克服图像伪影,在领域适应和泛化基准测试中显著优于现有方法。
English: Earth-Adapter is a novel Parameter-Efficient Fine-Tuning method designed specifically for remote sensing, using a Mixture of Frequency Adaptation to effectively separate and overcome artifacts, significantly outperforming previous methods in domain adaptation and generalization benchmarks.
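To give a feel for frequency-domain adaptation of the kind described above, the sketch below splits a feature map into low- and high-frequency bands with the FFT and mixes per-band adapter outputs with learned weights. The cutoff, adapter shape, and routing are assumptions for illustration, not Earth-Adapter's actual design.

```python
# Illustrative frequency-band adapter mixture (not the released Earth-Adapter).
import torch
import torch.nn as nn

class FrequencyAdapterMixture(nn.Module):
    def __init__(self, channels, bottleneck=16, cutoff=0.25):
        super().__init__()
        self.cutoff = cutoff
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, bottleneck, 1), nn.GELU(),
                          nn.Conv2d(bottleneck, channels, 1))
            for _ in range(2)                              # one adapter per frequency band
        ])
        self.router = nn.Parameter(torch.zeros(2))          # mixture logits

    def split_bands(self, x):
        freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        h, w = x.shape[-2:]
        yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        low_mask = ((yy.abs() < self.cutoff) & (xx.abs() < self.cutoff)).float().to(x.device)
        low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1))).real
        return low, x - low                                  # low- / high-frequency parts

    def forward(self, x):
        bands = self.split_bands(x)
        w = self.router.softmax(dim=0)
        return x + sum(w[i] * self.adapters[i](b) for i, b in enumerate(bands))

feat = torch.randn(1, 64, 32, 32)
out = FrequencyAdapterMixture(64)(feat)                      # same shape as input
```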

Authors:Yiming Liang, Tianhan Xu, Yuta Kikuchi
Title: HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation
Abstract:
We present Hierarchical Motion Representation (HiMoR), a novel deformation representation for 3D Gaussian primitives capable of achieving high-quality monocular dynamic 3D reconstruction. The insight behind HiMoR is that motions in everyday scenes can be decomposed into coarser motions that serve as the foundation for finer details. Using a tree structure, HiMoR's nodes represent different levels of motion detail, with shallower nodes modeling coarse motion for temporal smoothness and deeper nodes capturing finer motion. Additionally, our model uses a few shared motion bases to represent motions of different sets of nodes, aligning with the assumption that motion tends to be smooth and simple. This motion representation design provides Gaussians with a more structured deformation, maximizing the use of temporal relationships to tackle the challenging task of monocular dynamic 3D reconstruction. We also propose using a more reliable perceptual metric as an alternative, given that pixel-level metrics for evaluating monocular dynamic 3D reconstruction can sometimes fail to accurately reflect the true quality of reconstruction. Extensive experiments demonstrate our method's efficacy in achieving superior novel view synthesis from challenging monocular videos with complex motions.

Authors:Qing Xu, Zhenye Lou, Chenxin Li, Yue Li, Xiangjian He, Tesema Fiseha Berhanu, Rong Qu, Wenting Duan, Zhen Chen
Title: HER-Seg: Holistically Efficient Segmentation for High-Resolution Medical Images
Abstract:
High-resolution segmentation is critical for precise disease diagnosis by extracting fine-grained morphological details. Existing hierarchical encoder-decoder frameworks have demonstrated remarkable adaptability across diverse medical segmentation tasks. While beneficial, they usually incur huge computation and memory costs when handling large-size segmentation, which limits their applications in foundation model building and real-world clinical scenarios. To address this limitation, we propose a holistically efficient framework for high-resolution medical image segmentation, called HER-Seg. Specifically, we first devise a computation-efficient image encoder (CE-Encoder) to model long-range dependencies with linear complexity while maintaining sufficient representations. In particular, we introduce the dual-gated linear attention (DLA) mechanism to perform cascaded token filtering, selectively retaining important tokens while ignoring irrelevant ones to enhance attention computation efficiency. Then, we introduce a memory-efficient mask decoder (ME-Decoder) to eliminate the demand for the hierarchical structure by leveraging cross-scale segmentation decoding. Extensive experiments reveal that HER-Seg outperforms state-of-the-art methods in high-resolution medical 2D, 3D and video segmentation tasks. In particular, our HER-Seg requires only 0.59GB training GPU memory and 9.39G inference FLOPs per 1024$\times$1024 image, demonstrating superior memory and computation efficiency. The code is available at https://github.com/xq141839/HER-Seg.
Chinese: HER-Seg是一种高效的高分辨率医学图像分割框架,通过计算高效编码器和内存高效解码器显著降低资源消耗,在多种分割任务中超越现有最优方法。
English: HER-Seg is a highly efficient framework for high-resolution medical image segmentation that reduces computational and memory costs through a computation-efficient encoder and memory-efficient decoder, outperforming state-of-the-art methods in various tasks.

Authors:Yujia Hu, Songhua Liu, Xingyi Yang, Xinchao Wang
Title: Flash Sculptor: Modular 3D Worlds from Objects
Abstract:
Existing text-to-3D and image-to-3D models often struggle with complex scenes involving multiple objects and intricate interactions. Although some recent attempts have explored such compositional scenarios, they still require an extensive process of optimizing the entire layout, which is highly cumbersome if not infeasible at all. To overcome these challenges, we propose Flash Sculptor in this paper, a simple yet effective framework for compositional 3D scene/object reconstruction from a single image. At the heart of Flash Sculptor lies a divide-and-conquer strategy, which decouples compositional scene reconstruction into a sequence of sub-tasks, including handling the appearance, rotation, scale, and translation of each individual instance. Specifically, for rotation, we introduce a coarse-to-fine scheme that brings the best of both worlds--efficiency and accuracy--while for translation, we develop an outlier-removal-based algorithm that ensures robust and precise parameters in a single step, without any iterative optimization. Extensive experiments demonstrate that Flash Sculptor achieves at least a 3 times speedup over existing compositional 3D methods, while setting new benchmarks in compositional 3D reconstruction performance. Codes are available at https://github.com/YujiaHu1109/Flash-Sculptor.
中文: Flash Sculptor采用分治策略实现单图像组合式3D重建,通过粗到精旋转估计和离群点去除算法,在提速三倍的同时刷新了性能基准。
English: Flash Sculptor introduces a divide-and-conquer framework for efficient compositional 3D reconstruction from a single image, achieving at least 3x faster speeds and superior performance without iterative optimization.

Authors:Saad Wazir, Daeyoung Kim
Title: Rethinking the Nested U-Net Approach: Enhancing Biomarker Segmentation with Attention Mechanisms and Multiscale Feature Fusion
Abstract:
Identifying biomarkers in medical images is vital for a wide range of biotech applications. However, recent Transformer and CNN based methods often struggle with variations in morphology and staining, which limits their feature extraction capabilities. In medical image segmentation, where data samples are often limited, state-of-the-art (SOTA) methods improve accuracy by using pre-trained encoders, while end-to-end approaches typically fall short due to difficulties in transferring multiscale features effectively between encoders and decoders. To handle these challenges, we introduce a nested UNet architecture that captures both local and global context through Multiscale Feature Fusion and Attention Mechanisms. This design improves feature integration from encoders, highlights key channels and regions, and restores spatial details to enhance segmentation performance. Our method surpasses SOTA approaches, as evidenced by experiments across four datasets and detailed ablation studies. Code: https://github.com/saadwazir/ReN-UNet
中文摘要:该研究提出的嵌套UNet架构通过多尺度特征融合和注意力机制,有效解决了医学图像分割中特征提取的局限性,在多个数据集上超越了现有最优方法的性能表现。
English Summary: The proposed nested UNet architecture with multiscale feature fusion and attention mechanisms effectively addresses limitations in medical image segmentation by enhancing feature integration and spatial detail restoration, outperforming current state-of-the-art methods across multiple datasets.

Authors:Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
Title: V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs' visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic Elo-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs' ability to perform real-time, vision-grounded interactions. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings. Code is publicly available at https://github.com/CSU-JPG/V-MAGE.
中文: 我们提出V-MAGE这一基于游戏的评估框架,用于测试多模态大语言模型在交互环境中的动态视觉推理能力,结果显示尽管模型在简单任务中接近人类水平,但在需要复杂推理的任务中仍存在显著差距。
English: We introduce V-MAGE, a game-based framework that evaluates Multimodal Large Language Models' dynamic visual reasoning in interactive environments, revealing significant performance gaps in complex tasks compared to humans despite near-human ability in simple scenarios.
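A standard Elo update, of the kind a dynamic game-based ranking could build on, is sketched below. The K-factor and the way V-MAGE accounts for difficulty levels are not specified here and are assumptions for illustration.

```python
# Basic Elo rating update between two competing models.
def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a: 1.0 if model A wins the round, 0.5 for a draw, 0.0 if it loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# A model beating a higher-rated opponent gains more than the expected share of points.
print(elo_update(1500, 1600, score_a=1.0))   # approximately (1520.5, 1579.5)
```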

Authors:Davide Sferrazza, Gabriele Berton, Gabriele Trivigno, Carlo Masone
Title: To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) is a critical task in computer vision, traditionally enhanced by re-ranking retrieval results with image matching. However, recent advancements in VPR methods have significantly improved performance, challenging the necessity of re-ranking. In this work, we show that modern retrieval systems often reach a point where re-ranking can degrade results, as current VPR datasets are largely saturated. We propose using image matching as a verification step to assess retrieval confidence, demonstrating that inlier counts can reliably predict when re-ranking is beneficial. Our findings shift the paradigm of retrieval pipelines, offering insights for more robust and adaptive VPR systems. The code is available at https://github.com/FarInHeight/To-Match-or-Not-to-Match.
中文: 本研究证明,在现代检索系统中,视觉位置识别的重排序可能降低性能,并提出改用图像匹配作为验证步骤,通过内点数量判断何时重排序更有利。
English: This study demonstrates that re-ranking in Visual Place Recognition can degrade performance with modern retrieval systems, proposing instead to use image matching as a verification step to determine when re-ranking is beneficial based on inlier counts.
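The verification idea above can be sketched as a simple gate: match the query against the top retrieved image and only fall back to re-ranking when the inlier count signals low retrieval confidence. The SIFT matcher, RANSAC homography, and threshold below are placeholders, not the paper's exact pipeline.

```python
# Hedged sketch: gate re-ranking on the inlier count of a query/top-1 match.
import cv2
import numpy as np

def count_inliers(img_query, img_top1, ratio=0.75):
    sift = cv2.SIFT_create()
    kq, dq = sift.detectAndCompute(img_query, None)
    kt, dt = sift.detectAndCompute(img_top1, None)
    if dq is None or dt is None:
        return 0
    matches = cv2.BFMatcher().knnMatch(dq, dt, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]  # Lowe's ratio test
    if len(good) < 4:
        return 0
    src = np.float32([kq[m.queryIdx].pt for m in good])
    dst = np.float32([kt[m.trainIdx].pt for m in good])
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0

def maybe_rerank(query, candidates, rerank_fn, inlier_threshold=30):
    """Keep the retrieval order when the top-1 match is confident; otherwise re-rank."""
    if count_inliers(query, candidates[0]) >= inlier_threshold:
        return candidates
    return rerank_fn(query, candidates)
```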

Authors:Dahyun Kang, Ahmet Iscen, Eunchan Jo, Sua Choi, Minsu Cho, Cordelia Schmid
Title: Memory-Modular Classification: Learning to Generalize with Memory Replacement
Abstract:
We propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning. Our model enables effective generalization to new classes by simply replacing the memory contents, without the need for model retraining. Unlike traditional models that encode both world knowledge and task-specific skills into their weights during training, our model stores knowledge in the external memory of web-crawled image and text data. At inference time, the model dynamically selects relevant content from the memory based on the input image, allowing it to adapt to arbitrary classes by simply replacing the memory contents. The key differentiator is that our learner meta-learns to perform classification tasks with noisy web data from unseen classes, resulting in robust performance across various classification scenarios. Experimental results demonstrate the promising performance and versatility of our approach in handling diverse classification tasks, including zero-shot/few-shot classification of unseen classes, fine-grained classification, and class-incremental classification.
中文: 本文提出了一种用于图像分类的记忆模块化学习器,它将知识记忆与推理分离,通过更新外部记忆无需重新训练即可泛化到新类别,并利用网络数据进行元学习,在多种分类任务中展现出稳健性能。
English: This paper introduces a memory-modular learner for image classification that decouples knowledge memorization from reasoning, enabling generalization to new classes by updating external memory without retraining and achieving robust performance across diverse tasks through meta-learning with web data.
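A simplified sketch of the memory-replacement idea follows: class knowledge lives in an external memory of (embedding, label) pairs, and a query is classified by similarity-weighted voting over its nearest memory entries, so swapping the memory swaps the label space with no retraining. The feature extractor, k, and voting rule are illustrative, and the paper's meta-learned reasoning module is not reproduced here.

```python
# Illustrative retrieval-based classification from an external, replaceable memory.
import torch

def classify_with_memory(query_emb, memory_embs, memory_labels, k=16):
    """query_emb: (D,); memory_embs: (M, D) L2-normalized; memory_labels: (M,) long."""
    sims = memory_embs @ query_emb                        # cosine similarity to all entries
    topk = sims.topk(k)
    votes = torch.zeros(int(memory_labels.max()) + 1)
    # Similarity-weighted voting over the retrieved memory entries.
    votes.index_add_(0, memory_labels[topk.indices], topk.values)
    return int(votes.argmax())

# New classes? Just replace memory_embs / memory_labels with new web-crawled data.
mem = torch.nn.functional.normalize(torch.randn(1000, 128), dim=-1)
labels = torch.randint(0, 10, (1000,))
q = torch.nn.functional.normalize(torch.randn(128), dim=-1)
pred = classify_with_memory(q, mem, labels)
```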

Authors:Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Title: Latent Multimodal Reconstruction for Misinformation Detection
Abstract:
Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. To support fact-checkers, researchers have focused on developing datasets and methods for multimodal misinformation detection (MMD). Due to the scarcity of large-scale annotated MMD datasets, recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic examples that lack real-world complexity, limiting model robustness. Meanwhile, Large Vision-Language Models (LVLMs) remain underexplored for generating diverse and realistic synthetic data for MMD. To address this, we introduce "MisCaption This!", a collection of LVLM-generated miscaptioned image datasets. Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" generalize better to real-world misinformation while LAMAR achieves new state-of-the-art on both NewsCLIPpings and VERITE benchmarks, highlighting the value of LVLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
中文: 研究人员推出了由大型视觉语言模型生成的错误标注图像数据集“MisCaption This!”以及基于重构的网络“LAMAR”,二者通过提升模型泛化能力显著改进了多模态虚假信息检测,并在权威基准测试中取得了最优性能。
English: Researchers introduce "MisCaption This!", a dataset of miscaptioned images generated by Large Vision-Language Models, and "LAMAR", a reconstruction-based network that together enhance multimodal misinformation detection by improving model generalization and achieving state-of-the-art results on benchmarks.

Authors:Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi
Title: An Empirical Study of GPT-4o Image Generation Capabilities
Abstract:
The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, yet their architectural designs remain mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling. For a high-definition version of the PDF, please refer to the link on GitHub: \href{https://github.com/Ephemeral182/Empirical-Study-of-GPT-4o-Image-Gen}{https://github.com/Ephemeral182/Empirical-Study-of-GPT-4o-Image-Gen}.
中文: 本研究对GPT-4o的图像生成能力进行了实证评估,通过与其他模型对比分析,揭示了其优势与局限,并为未来统一生成模型的架构设计指明了发展方向。
English: This study empirically evaluates GPT-4o's image generation capabilities across multiple tasks, benchmarking it against other models to identify strengths, limitations, and future directions for unified generative architectures.

Authors:Kuntian Zhang, Simin Yu, Yaoshu Wang, Makoto Onizuka, Chuan Xiao
Title: CKGAN: Training Generative Adversarial Networks Using Characteristic Kernel Integral Probability Metrics
Abstract:
In this paper, we propose CKGAN, a novel generative adversarial network (GAN) variant based on an integral probability metrics framework with characteristic kernel (CKIPM). CKIPM, as a distance between two probability distributions, is designed to optimize the lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space, and thus can be used to train GANs. CKGAN mitigates the notorious problem of mode collapse by mapping the generated images back to random noise. To save the effort of selecting the kernel function manually, we propose a soft selection method to automatically learn a characteristic kernel function. The experimental evaluation conducted on a set of synthetic and real image benchmarks (MNIST, CelebA, etc.) demonstrates that CKGAN generally outperforms other MMD-based GANs. The results also show that at the cost of moderately more training time, the automatically selected kernel function delivers performance very close to that of the best manually fine-tuned one on real image benchmarks and is able to improve the performance of other MMD-based GANs.
中文: 本文提出CKGAN,一种基于特征核积分概率度量的生成对抗网络变体,通过将生成图像映射回随机噪声缓解模式崩溃问题,并采用软选择方法自动学习最优核函数,在多个基准数据集上优于其他基于MMD的GAN模型。
English: This paper introduces CKGAN, a GAN variant using characteristic kernel-based integral probability metrics to mitigate mode collapse and automatically learn optimal kernels, demonstrating superior performance over other MMD-based GANs on benchmark datasets.
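For background, the quantity that MMD-based GANs such as CKGAN build on is the empirical maximum mean discrepancy between real and generated samples. A minimal biased estimator with a fixed RBF kernel is sketched below; CKGAN's learned characteristic kernel and its mapping of generated images back to noise are not shown.

```python
# Biased empirical MMD^2 estimate with a Gaussian/RBF kernel.
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """x: (n, d) generated samples, y: (m, d) real samples."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

real = torch.randn(256, 64)
fake = torch.randn(256, 64) + 0.5
print(rbf_mmd2(fake, real))   # grows as the two distributions drift apart
```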

Authors:Junxi Chen, Junhao Dong, Xiaohua Xie
Title: Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking
Abstract:
Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack named the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack massive benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public to discredit the service provider. Worse still, the IP-Adapter's dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome existing defenses' limitations. Our code is available at https://github.com/fhdnskfbeuv/attackIPA.
中文: 研究发现,将图像提示适配器(IP-Adapter)集成到文本到图像扩散模型中会引发劫持攻击,攻击者可通过难以察觉的对抗样本破坏图像生成服务并损害公众信任,同时探讨了结合对抗训练模型等防御措施来应对此威胁。
English: The study reveals that integrating the Image Prompt Adapter (IP-Adapter) into text-to-image diffusion models enables a hijacking attack, where adversaries can use imperceptible adversarial examples to compromise image generation services and undermine public trust, with defenses explored to mitigate this threat.

Authors:Shiao Wang, Xiao Wang, Bo Jiang, Lin Zhu, Guoqi Li, Yaowei Wang, Yonghong Tian, Jin Tang
Title: Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset
Abstract:
Human Activity Recognition (HAR) has primarily relied on traditional RGB cameras to achieve high-performance activity recognition. However, the challenging factors in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. To address these challenges, biologically inspired event cameras offer a promising solution to overcome the limitations of traditional RGB cameras. In this work, we rethink human activity recognition by combining the RGB and event cameras. The first contribution is the proposed large-scale multi-modal RGB-Event human activity recognition benchmark dataset, termed HARDVS 2.0, which bridges the dataset gaps. It contains 300 categories of everyday real-world actions with a total of 107,646 paired videos covering various challenging scenarios. Inspired by the physics-informed heat conduction model, we propose a novel multi-modal heat conduction operation framework for effective activity recognition, termed MMHCO-HAR. In more detail, given the RGB frames and event streams, we first extract the feature embeddings using a stem network. Then, multi-modal Heat Conduction blocks are designed to fuse the dual features, the key module of which is the multi-modal Heat Conduction Operation layer. We integrate RGB and event embeddings through a multi-modal DCT-IDCT layer while adaptively incorporating the thermal conductivity coefficient via FVEs into this module. After that, we propose an adaptive fusion module based on a policy routing strategy for high-performance classification. Comprehensive experiments demonstrate that our method consistently performs well, validating its effectiveness and robustness. The source code and benchmark dataset will be released at https://github.com/Event-AHU/HARDVS/tree/HARDVSv2
中文: 本研究提出了一种新颖的多模态框架MMHCO-HAR,通过结合RGB和事件相机来提升人类活动识别能力,并基于HARDVS 2.0数据集和全面实验验证了其有效性。
English: This study introduces a novel multi-modal framework, MMHCO-HAR, combining RGB and event cameras to enhance human activity recognition, supported by the HARDVS 2.0 dataset and validated through comprehensive experiments.

Authors:Sarosij Bose, Hannah Dela Cruz, Arindam Dutta, Elena Kokkoni, Konstantinos Karydis, Amit K. Roy-Chowdhury
Title: Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation
Abstract:
Human pose estimation is a critical tool across a variety of healthcare applications. Despite significant progress in pose estimation algorithms targeting adults, such developments for infants remain limited. Existing algorithms for infant pose estimation, despite achieving commendable performance, depend on fully supervised approaches that require large amounts of labeled data. These algorithms also struggle with poor generalizability under distribution shifts. To address these challenges, we introduce SHIFT: Leveraging SyntHetic Adult Datasets for Unsupervised InFanT Pose Estimation, which leverages the pseudo-labeling-based Mean-Teacher framework to compensate for the lack of labeled data and addresses distribution shifts by enforcing consistency between the student and the teacher pseudo-labels. Additionally, to penalize implausible predictions obtained from the mean-teacher framework, we incorporate an infant manifold pose prior. To enhance SHIFT's self-occlusion perception ability, we propose a novel visibility consistency module for improved alignment of the predicted poses with the original image. Extensive experiments on multiple benchmarks show that SHIFT significantly outperforms existing state-of-the-art unsupervised domain adaptation (UDA) pose estimation methods by 5% and supervised infant pose estimation methods by a margin of 16%. The project page is available at: https://sarosijbose.github.io/SHIFT.
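The Mean-Teacher backbone that SHIFT builds on can be sketched generically as an EMA teacher plus a consistency loss on unlabeled data. The toy regressor, momentum value, and MSE consistency below are illustrative assumptions; the paper's infant pose prior and visibility-consistency modules are omitted.

```python
# Generic Mean-Teacher sketch: EMA teacher update + pseudo-label consistency loss.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def consistency_loss(student, teacher, unlabeled_batch):
    with torch.no_grad():
        pseudo = teacher(unlabeled_batch)                 # teacher pseudo-labels
    return F.mse_loss(student(unlabeled_batch), pseudo)

# Toy usage with a small regressor standing in for the pose network.
student = torch.nn.Linear(34, 51)                         # e.g. 17 2D joints -> 17 3D joints
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

x_unlabeled = torch.randn(32, 34)
loss = consistency_loss(student, teacher, x_unlabeled)
loss.backward()
ema_update(teacher, student)
```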

Authors:Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You, Kaipeng Zhang
Title: MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
Abstract:
Multimodal reasoning, which integrates language and visual cues into problem solving and decision making, is a fundamental aspect of human intelligence and a crucial step toward artificial general intelligence. However, the evaluation of multimodal reasoning capabilities in Multimodal Large Language Models (MLLMs) remains inadequate. Most existing reasoning benchmarks are constrained by limited data size, narrow domain coverage, and unstructured knowledge distribution. To close these gaps, we introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Spanning six disciplines (math, physics, chemistry, biology, geography, and information science), our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels and cross-year partitions, providing a robust platform for comprehensive evaluation. Additionally, we present a novel dynamic evaluation framework to mitigate data contamination issues by bootstrapping question forms, question types, and image styles during evaluation. Extensive experiments on MDK12-Bench reveal the significant limitations of current MLLMs in multimodal reasoning. The findings on our benchmark provide insights into the development of next-generation models. Our data and codes are available at https://github.com/LanceZPF/MDK12.
中文: MDK12-Bench是一个基于K-12真实考题构建的多学科基准测试,通过大规模数据标注和动态评估框架揭示了当前多模态大语言模型在推理能力上的显著不足。
English: MDK12-Bench is a comprehensive multi-disciplinary benchmark using K-12 exam questions to evaluate multimodal reasoning in MLLMs, revealing significant limitations in current models through its large-scale dataset and dynamic evaluation framework.

Authors:Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, Chunhua Shen
Title: POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction
Abstract:
3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.
中文: POMATO 是一个将点云匹配与时间运动相结合的统一框架,通过确保尺度一致性和精确几何关系,显著提升了动态三维重建在深度估计和三维点跟踪等任务中的性能。
English: POMATO is a unified framework that integrates pointmap matching with temporal motion to enhance dynamic 3D reconstruction, improving performance in tasks like depth estimation and 3D point tracking by ensuring scale consistency and precise geometry.

Authors:Shunsuke Sakai, Shunsuke Tsuge, Tatsuhito Hasegawa
Title: Noisy Deep Ensemble: Accelerating Deep Ensemble Learning via Noise Injection
Abstract:
Neural network ensembling is a simple yet effective approach for enhancing generalization capabilities. The most common method involves independently training multiple neural networks initialized with different weights and then averaging their predictions during inference. However, this approach increases training time linearly with the number of ensemble members. To address this issue, we propose the novel ``\textbf{Noisy Deep Ensemble}'' method, significantly reducing the training time required for neural network ensembles. In this method, a \textit{parent model} is trained until convergence, and then the weights of the \textit{parent model} are perturbed in various ways to construct multiple \textit{child models}. This perturbation of the \textit{parent model} weights facilitates the exploration of different local minima while significantly reducing the training time for each ensemble member. We evaluated our method using diverse CNN architectures on CIFAR-10 and CIFAR-100 datasets, surpassing conventional efficient ensemble methods and achieving test accuracy comparable to standard ensembles. Code is available at \href{https://github.com/TSTB-dev/NoisyDeepEnsemble}{https://github.com/TSTB-dev/NoisyDeepEnsemble}
Chinese: 提出的“噪声深度集成”方法通过扰动已收敛父模型的权重来创建子模型,显著减少了神经网络集成的训练时间,在保持与标准集成相当精度的同时提高了效率。
English: The proposed "Noisy Deep Ensemble" method reduces training time for neural network ensembles by perturbing a converged parent model's weights to create child models, achieving comparable accuracy to standard ensembles with improved efficiency.
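The core recipe from the abstract, namely training one parent model, perturbing its weights to spawn children, briefly fine-tuning each child, and averaging predictions at inference, is simple enough to sketch. The snippet below is a minimal PyTorch illustration, not the authors' exact procedure; the noise scale, the `finetune_fn` hook, and softmax averaging are assumptions.

```python
import copy
import torch

def make_child(parent: torch.nn.Module, noise_std: float = 0.01) -> torch.nn.Module:
    """Spawn a child model by adding Gaussian noise to a converged parent's weights.
    (Illustrative; the paper explores several perturbation schemes.)"""
    child = copy.deepcopy(parent)
    with torch.no_grad():
        for p in child.parameters():
            p.add_(noise_std * torch.randn_like(p))
    return child

def noisy_deep_ensemble(parent, num_children, finetune_fn, noise_std=0.01):
    """Build an ensemble from one trained parent; each child is only briefly fine-tuned."""
    children = []
    for _ in range(num_children):
        child = make_child(parent, noise_std)
        finetune_fn(child)  # short fine-tuning, much cheaper than training from scratch
        children.append(child)
    return children

@torch.no_grad()
def ensemble_predict(children, x):
    """Average softmax predictions across ensemble members."""
    probs = [torch.softmax(m(x), dim=-1) for m in children]
    return torch.stack(probs).mean(dim=0)
```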

Authors:Shunsuke Sakai, Xiangteng He, Chunzhi Gu, Leonid Sigal, Tatsuhito Hasegawa
Title: Reconstruction-Free Anomaly Detection with Diffusion Models
Abstract:
Despite the remarkable success, recent reconstruction-based anomaly detection (AD) methods via diffusion modeling still involve fine-grained noise-strength tuning and computationally expensive multi-step denoising, leading to a fundamental tension between fidelity and efficiency. In this paper, we propose a novel inversion-based AD approach - detection via noising in latent space - which circumvents explicit reconstruction. Importantly, we contend that the limitations in prior reconstruction-based methods originate from the prevailing detection via denoising in RGB space paradigm. To address this, we model AD under a reconstruction-free formulation, which directly infers the final latent variable corresponding to the input image via DDIM inversion, and then measures the deviation based on the known prior distribution for anomaly scoring. Specifically, in approximating the original probability flow ODE using the Euler method, we only enforce very few inversion steps to noise the clean image to pursue inference efficiency. As the added noise is adaptively derived with the learned diffusion model, the original features for the clean testing image can still be leveraged to yield high detection accuracy. We perform extensive experiments and detailed analysis across three widely used image AD datasets under the unsupervised unified setting to demonstrate the effectiveness of our model, regarding state-of-the-art AD performance, and about 2 times inference time speedup without diffusion distillation.
中文: 本文提出了一种新颖的基于反转的异常检测方法,通过在潜在空间中对图像加噪来避免显式重建,在保持高精度的同时实现了约两倍的推理加速,并达到了最先进的检测性能。
English: This paper introduces a novel inversion-based anomaly detection method that avoids explicit reconstruction by noising images in latent space, achieving state-of-the-art performance with approximately twice the inference speed while maintaining high accuracy.
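A hedged sketch of the scoring idea described above: push a clean test image toward the latent prior with a handful of DDIM-style inversion (re-noising) steps using the learned noise predictor, then score anomalies by deviation from the standard normal prior. The `eps_model` interface, the Euler/DDIM update, and the squared-norm score are illustrative assumptions, not the paper's exact formulation.

```python
import torch

@torch.no_grad()
def invert_few_steps(eps_model, x0, alphas_cumprod, step_indices):
    """Approximate DDIM inversion with a few Euler steps: map a clean latent x0 toward noise.
    eps_model(x, t) is assumed to predict the noise; step_indices are increasing timesteps."""
    x = x0
    for i in range(len(step_indices) - 1):
        t, t_next = step_indices[i], step_indices[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        # predict the clean component, then re-noise it at the next (higher) noise level
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x

def anomaly_score(x_T):
    """Score by deviation from the known standard-normal prior (higher = more anomalous)."""
    return (x_T ** 2).flatten(1).mean(dim=1)
```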

Authors:Yan Zhang, Zhong Ji, Changxu Meng, Yanwei Pang, Jungong Han
Title: iEBAKER: Improved Remote Sensing Image-Text Retrieval Framework via Eliminate Before Align and Keyword Explicit Reasoning
Abstract:
Recent studies focus on the Remote Sensing Image-Text Retrieval (RSITR), which aims at searching for the corresponding targets based on the given query. Among these efforts, the application of Foundation Models (FMs), such as CLIP, to the domain of remote sensing has yielded encouraging outcomes. However, existing FM-based methodologies neglect the negative impact of weakly correlated sample pairs and fail to account for the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose an approach named iEBAKER (an Improved Eliminate Before Align strategy with Keyword Explicit Reasoning framework) for RSITR. Specifically, we propose an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs, thereby mitigating their deviations from optimal embedding space during alignment. Further, two specific schemes are introduced from the perspective of whether local similarity and global similarity affect each other. On this basis, we introduce an alternative Sort After Reversed Retrieval (SAR) strategy, which aims at optimizing the similarity matrix via reverse retrieval. Additionally, we incorporate a Keyword Explicit Reasoning (KER) module to facilitate the beneficial impact of subtle key concept distinctions. Without bells and whistles, our approach enables a direct transition from FMs to the RSITR task, eliminating the need for additional pretraining on remote sensing data. Extensive experiments conducted on three popular benchmark datasets demonstrate that our proposed iEBAKER method surpasses the state-of-the-art models while requiring less training data. Our source code will be released at https://github.com/zhangy0822/iEBAKER.
中文:提出的iEBAKER框架通过"先剔除后对齐"策略过滤弱相关样本对,并结合关键词显式推理模块增强文本区分度,在无需额外预训练的情况下实现了遥感图文检索的最优性能。
English: The proposed iEBAKER framework addresses limitations in Remote Sensing Image-Text Retrieval by introducing an Eliminate Before Align strategy to filter weakly correlated samples and a Keyword Explicit Reasoning module to enhance text distinction, achieving state-of-the-art performance without additional pretraining.

Authors:Xiao Zhang, Xiangyu Han, Xiwen Lai, Yao Sun, Pei Zhang, Konrad Kording
Title: Falcon: Fractional Alternating Cut with Overcoming Minima in Unsupervised Segmentation
Abstract:
Today's unsupervised image segmentation algorithms often segment suboptimally. Modern graph-cut based approaches rely on high-dimensional attention maps from Transformer-based foundation models, typically employing a relaxed Normalized Cut solved recursively via the Fiedler vector (the eigenvector of the second smallest eigenvalue). Consequently, they still lag behind supervised methods in both mask generation speed and segmentation accuracy. We present a regularized fractional alternating cut (Falcon), an optimization-based K-way Normalized Cut without relying on recursive eigenvector computations, achieving substantially improved speed and accuracy. Falcon operates in two stages: (1) a fast K-way Normalized Cut solved by extending into a fractional quadratic transformation, with an alternating iterative procedure and regularization to avoid local minima; and (2) refinement of the resulting masks using complementary low-level information, producing high-quality pixel-level segmentations. Experiments show that Falcon not only surpasses existing state-of-the-art methods by an average of 2.5% across six widely recognized benchmarks (reaching up to 4.3% improvement on Cityscapes), but also reduces runtime by around 30% compared to prior graph-based approaches. These findings demonstrate that the semantic information within foundation-model attention can be effectively harnessed by a highly parallelizable graph cut framework. Consequently, Falcon can narrow the gap between unsupervised and supervised segmentation, enhancing scalability in real-world applications and paving the way for dense prediction-based vision pre-training in various downstream tasks. The code is released at https://github.com/KordingLab/Falcon.
中文摘要:Falcon提出了一种基于正则化分数交替割的快速无监督图像分割方法,无需递归计算即可实现更优的速度和精度,显著缩小了与有监督方法之间的差距。
English Summary: Falcon introduces a faster and more accurate unsupervised image segmentation method using a regularized fractional alternating cut that avoids recursive computations, achieving significant improvements in both speed and accuracy over existing approaches.

Authors:Qi Mao, Lan Chen, Yuchao Gu, Mike Zheng Shou, Ming-Hsuan Yang
Title: Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
Abstract:
Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.
中文: UnifyEdit是一种免调优方法,通过自注意力和交叉注意力约束优化扩散潜变量,并采用自适应调度器动态平衡二者,在文本图像编辑中实现了保真度与可编辑性的卓越统一。
English: UnifyEdit is a tuning-free method that optimizes diffusion latents using self-attention and cross-attention constraints, dynamically balanced by an adaptive scheduler to achieve superior fidelity and editability in text-based image editing.
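Since the method optimizes the diffusion latent directly under two attention-based constraints whose balance is set by an adaptive time-step scheduler, one optimization step can be sketched in a bare-bones form. The `sa_loss_fn`/`ca_loss_fn` callables, the linear schedule, and the step size below are placeholders, not the paper's actual constraint or scheduler definitions.

```python
import torch

def unifyedit_style_step(latent, sa_loss_fn, ca_loss_fn, t, num_steps, lr=0.05):
    """One latent-optimization step trading off a self-attention (structure) constraint
    against a cross-attention (text-alignment) constraint with a time-dependent weight."""
    latent = latent.detach().requires_grad_(True)
    loss_sa = sa_loss_fn(latent)            # fidelity / structure preservation term
    loss_ca = ca_loss_fn(latent)            # editability / text alignment term
    lam = t / float(num_steps)              # assumed linear schedule over denoising steps
    loss = lam * loss_sa + (1.0 - lam) * loss_ca
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - lr * grad).detach()
```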

Authors:Long Ma, Yuxin Feng, Yan Zhang, Jinyuan Liu, Weimin Wang, Guang-Yong Chen, Chengpei Xu, Zhuo Su
Title: CoA: Towards Real Image Dehazing via Compression-and-Adaptation
Abstract:
Learning-based image dehazing algorithms have shown remarkable success in synthetic domains. However, real image dehazing remains largely unresolved due to computational resource constraints and the diversity of real-world scenes. Therefore, there is an urgent need for an algorithm that excels in both efficiency and adaptability to address real image dehazing effectively. This work proposes a Compression-and-Adaptation (CoA) computational flow to tackle these challenges from a divide-and-conquer perspective. First, model compression is performed in the synthetic domain to develop a compact dehazing parameter space, satisfying efficiency demands. Then, a bilevel adaptation in the real domain is introduced to cope robustly with unknown real environments by aggregating the synthetic dehazing capabilities during the learning process. Leveraging a succinct design free from additional constraints, our CoA exhibits domain-irrelevant stability and model-agnostic flexibility, effectively bridging the model chasm between synthetic and real domains to further improve its practical utility. Extensive evaluations and analyses underscore the approach's superiority and effectiveness. The code is publicly available at https://github.com/fyxnl/COA.
中文: 本文提出压缩与适应(CoA)计算流程,通过在合成域进行模型压缩并在真实域实施双层适应,有效解决真实图像去雾的效率与适应性难题。
English: This paper introduces a Compression-and-Adaptation (CoA) computational flow that combines model compression in synthetic domains with bilevel adaptation in real environments to efficiently and effectively address real image dehazing challenges.

Authors:Jiahang Li, Shibo Xue, Yong Su
Title: Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification
Abstract:
Inspired by human visual attention, deep neural networks have widely adopted attention mechanisms to learn locally discriminative attributes for challenging visual classification tasks. However, existing approaches primarily emphasize the representation of such features while neglecting their precise localization, which often leads to misclassification caused by shortcut biases. This limitation becomes even more pronounced when models are evaluated on transfer or out-of-distribution datasets. In contrast, humans are capable of leveraging prior object knowledge to quickly localize and compare fine-grained attributes, a capability that is especially crucial in complex and high-variance classification scenarios. Motivated by this, we introduce Gaze-CIFAR-10, a human gaze time-series dataset, along with a dual-sequence gaze encoder that models the precise sequential localization of human attention on distinct local attributes. In parallel, a Vision Transformer (ViT) is employed to learn the sequential representation of image content. Through cross-modal fusion, our framework integrates human gaze priors with machine-derived visual sequences, effectively correcting inaccurate localization in image feature representations. Extensive qualitative and quantitative experiments demonstrate that gaze-guided cognitive cues significantly enhance classification accuracy.

Authors:Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin
Title: TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Abstract:
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves a new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training. The TAPNext model and code can be found at https://tap-next.github.io/.

Authors:Yunlong Tang, Jing Bi, Chao Huang, Susan Liang, Daiki Shimada, Hang Hua, Yunzhong Xiao, Yizhi Song, Pinxin Liu, Mingqian Feng, Junjia Guo, Zhuo Liu, Luchuan Song, Ali Vosoughi, Jinxi He, Liu He, Zeliang Zhang, Jiebo Luo, Chenliang Xu
Title: Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Abstract:
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning that enables detailed descriptions of user-selected objects through time. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. CAT-V supports flexible user interactions through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Our approach addresses limitations of existing video captioning methods, which either produce overly abstract descriptions or lack object-level precision, enabling fine-grained, object-specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at https://github.com/yunlong10/CAT-V
Chinese: CAT-V是一个无需训练的细粒度对象中心视频描述框架,通过整合分割、时序分析和描述生成模块,利用时空视觉提示生成详细且时间敏感的对象描述,支持多种用户交互方式。
English: CAT-V is a training-free framework for fine-grained object-centric video captioning that integrates segmentation, temporal analysis, and description generation to produce detailed, temporally-aware object descriptions through flexible user prompts.

Authors:Marija Ivanovska, Leon Todorov, Naser Damer, Deepak Kumar Jain, Peter Peer, Vitomir Štruc
Title: SelfMAD: Enhancing Generalization and Robustness in Morphing Attack Detection via Self-Supervised Learning
Abstract:
With the continuous advancement of generative models, face morphing attacks have become a significant challenge for existing face verification systems due to their potential use in identity fraud and other malicious activities. Contemporary Morphing Attack Detection (MAD) approaches frequently rely on supervised, discriminative models trained on examples of bona fide and morphed images. These models typically perform well with morphs generated with techniques seen during training, but often lead to sub-optimal performance when subjected to novel unseen morphing techniques. While unsupervised models have been shown to perform better in terms of generalizability, they typically result in higher error rates, as they struggle to effectively capture features of subtle artifacts. To address these shortcomings, we present SelfMAD, a novel self-supervised approach that simulates general morphing attack artifacts, allowing classifiers to learn generic and robust decision boundaries without overfitting to the specific artifacts induced by particular face morphing methods. Through extensive experiments on widely used datasets, we demonstrate that SelfMAD significantly outperforms current state-of-the-art MADs, reducing the detection error by more than 64% in terms of EER when compared to the strongest unsupervised competitor, and by more than 66%, when compared to the best performing discriminative MAD model, tested in cross-morph settings. The source code for SelfMAD is available at https://github.com/LeonTodorov/SelfMAD.
Chinese: 本文提出SelfMAD,一种自监督方法,通过模拟通用的形态攻击伪影来提升对新型人脸形态攻击的检测能力,在跨形态测试中显著优于现有模型,将检测错误率降低了64%以上。
English: The paper introduces SelfMAD, a self-supervised method that simulates general morphing attack artifacts to enhance the detection of novel face morphing techniques, significantly outperforming existing models by reducing detection errors by over 64% in cross-morph settings.

Authors:Ruoyu Xue, Jingyi Xu, Sounak Mondal, Hieu Le, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
Title: Few-shot Personalized Scanpath Prediction
Abstract:
A personalized model for scanpath prediction provides insights into the visual preferences and attention patterns of individual subjects. However, existing methods for training scanpath prediction models are data-intensive and cannot be effectively personalized to new individuals with only a few available examples. In this paper, we propose few-shot personalized scanpath prediction task (FS-PSP) and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject's scanpath behavior. The key to our method's adaptability is the Subject-Embedding Network (SE-Net), specifically designed to capture unique, individualized representations for each subject's scanpaths. SE-Net generates subject embeddings that effectively distinguish between subjects while minimizing variability among scanpaths from the same individual. The personalized scanpath prediction model is then conditioned on these subject embeddings to produce accurate, personalized results. Experiments on multiple eye-tracking datasets demonstrate that our method excels in FS-PSP settings and does not require any fine-tuning steps at test time. Code is available at: https://github.com/cvlab-stonybrook/few-shot-scanpath
中文: 本文提出了一种基于主体嵌入网络(SE-Net)的小样本个性化扫描路径预测方法,通过生成个体化表征,仅需少量数据即可为新主体实现精准的扫描路径预测,且无需测试时微调。
English: This paper introduces a few-shot personalized scanpath prediction (FS-PSP) method using a Subject-Embedding Network (SE-Net) to generate individualized representations, enabling accurate scanpath predictions for new subjects with minimal data and no test-time fine-tuning.
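SE-Net is described as producing subject embeddings that distinguish subjects while keeping scanpaths from the same individual close together; a toy objective with that behavior is sketched below. The cosine-similarity formulation and margin are assumptions for illustration, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def subject_embedding_loss(emb, subject_ids, margin: float = 0.5):
    """emb: (N, D) embeddings of scanpaths; subject_ids: (N,) integer subject labels.
    Pull scanpaths of the same subject together and push different subjects apart."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t()                                   # (N, N) cosine similarities
    same = subject_ids.unsqueeze(0) == subject_ids.unsqueeze(1)
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos = sim[same & ~eye]                                # same subject, different scanpaths
    neg = sim[~same]                                      # pairs from different subjects
    pos_term = (1 - pos).mean() if pos.numel() else emb.new_zeros(())
    neg_term = F.relu(neg - (1 - margin)).mean() if neg.numel() else emb.new_zeros(())
    return pos_term + neg_term
```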

Authors:In-Hwan Jin, Haesoo Choo, Seong-Hun Jeong, Heemoon Park, Junghwan Kim, Oh-joon Kwon, Kyeongbo Kong
Title: Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images
Abstract:
To achieve realistic immersion in landscape images, fluids such as water and clouds need to move within the image while revealing new scenes from various camera perspectives. Recently, a field called dynamic scene video has emerged, which combines single image animation with 3D photography. These methods use pseudo 3D space, implicitly represented with Layered Depth Images (LDIs). LDIs separate a single image into depth-based layers, which enables elements like water and clouds to move within the image while revealing new scenes from different camera perspectives. However, as landscapes typically consist of continuous elements, including fluids, the representation of a 3D space separates a landscape image into discrete layers, and it can lead to diminished depth perception and potential distortions depending on camera movement. Furthermore, due to its implicit modeling of 3D space, the output may be limited to videos in the 2D domain, potentially reducing their versatility. In this paper, we propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework is focused on optimizing 3D Gaussians by generating multi-view images from a single image and creating 3D motion to optimize 4D Gaussians. The most important part of the proposed framework is consistent 3D motion estimation, which estimates common motion among multi-view images to bring the motion in 3D space closer to actual motions. As far as we know, this is the first attempt that considers animation while representing a complete 3D space from a single landscape image. Our model demonstrates the ability to provide realistic immersion in various landscape images through diverse experiments and metrics. Extensive experimental results are available at https://cvsp-lab.github.io/ICLR2025_3D-MOM/.

Authors:Ziad Kheil, Soleakhena Ken, Laurent Risser
Title: Biomechanical Constraints Assimilation in Deep-Learning Image Registration: Application to sliding and locally rigid deformations
Abstract:
Regularization strategies in medical image registration often take a one-size-fits-all approach by imposing uniform constraints across the entire image domain. Yet biological structures are anything but regular. Lacking structural awareness, these strategies may fail to consider a panoply of spatially inhomogeneous deformation properties, which would faithfully account for the biomechanics of soft and hard tissues, especially in poorly contrasted structures. To bridge this gap, we propose a learning-based image registration approach in which the inferred deformation properties can locally adapt themselves to trained biomechanical characteristics. Specifically, we first enforce in the training process local rigid displacements, shearing motions or pseudo-elastic deformations using regularization losses inspired from the field of solid-mechanics. We then show on synthetic and real 3D thoracic and abdominal images that these mechanical properties of different nature are well generalized when inferring the deformations between new image pairs. Our approach enables neural-networks to infer tissue-specific deformation patterns directly from input images, ensuring mechanically plausible motion. These networks preserve rigidity within hard tissues while allowing controlled sliding in regions where tissues naturally separate, more faithfully capturing physiological motion. The code is publicly available at https://github.com/Kheil-Z/biomechanical_DLIR .
中文摘要:现有医学图像配准方法采用统一约束而忽略组织间生物力学差异,为此我们提出一种基于学习的方法,能局部自适应变形属性以更真实地模拟组织运动特性。
English Summary: Current medical image registration methods apply uniform constraints that overlook the biomechanical differences between tissues, so we developed a learning-based approach that adapts deformation properties locally to better mimic real tissue behavior.
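One of the solid-mechanics-inspired regularizers mentioned above (local rigidity) can be sketched directly: inside a hard-tissue mask, penalize the deviation of the Jacobian of the transform x + u(x) from an orthogonal (rotation) matrix. The finite-difference gradients, voxel-unit displacements, and squared Frobenius penalty below are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def local_rigidity_loss(displacement, mask):
    """displacement: (B, 3, D, H, W) displacement field u(x) in voxel units.
    mask: (B, 1, D, H, W) soft mask of the region that should deform rigidly (e.g., bone).
    For a locally rigid deformation, the Jacobian J of x + u(x) satisfies J J^T = I."""
    pads = {4: (0, 1), 3: (0, 0, 0, 1), 2: (0, 0, 0, 0, 0, 1)}  # re-pad after torch.diff
    grads = []
    for dim in (2, 3, 4):                          # finite differences along D, H, W
        g = F.pad(torch.diff(displacement, dim=dim), pads[dim])
        grads.append(g)
    J = torch.stack(grads, dim=2)                  # (B, 3, 3, D, H, W): J[:, i, j] = du_i/dx_j
    eye = torch.eye(3, device=displacement.device).view(1, 3, 3, 1, 1, 1)
    J = J + eye                                    # Jacobian of the full transform x + u(x)
    JJt = torch.einsum('bikdhw,bjkdhw->bijdhw', J, J)
    dev = ((JJt - eye) ** 2).sum(dim=(1, 2))       # squared Frobenius norm of J J^T - I
    return (mask.squeeze(1) * dev).mean()
```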

Authors:Yue Yao, Mohamed-Khalil Bouzidi, Daniel Goehring, Joerg Reichardt
Title: EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations
Abstract:
As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints are available at: https://github.com/continental/EP-Diffuser.
Chinese: EP-Diffuser是一种参数高效的扩散模型,能够生成多样且合理的交通场景预测,尽管模型规模较小,却在准确性和鲁棒性上超越了现有最优模型。
English: EP-Diffuser is a parameter-efficient diffusion model that generates diverse and plausible traffic scene predictions, achieving superior accuracy and robustness compared to state-of-the-art models despite its smaller size.

Authors:Victor Fonte Chavez, Claudia Esteves, Jean-Bernard Hayet
Title: Time-adaptive Video Frame Interpolation based on Residual Diffusion
Abstract:
In this work, we propose a new diffusion-based method for video frame interpolation (VFI), in the context of traditional hand-made animation. We introduce three main contributions: The first is that we explicitly handle the interpolation time in our model, which we also re-estimate during the training process, to cope with the particularly large variations observed in the animation domain, compared to natural videos; The second is that we adapt and generalize a diffusion scheme called ResShift recently proposed in the super-resolution community to VFI, which allows us to perform a very low number of diffusion steps (in the order of 10) to produce our estimates; The third is that we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful to anticipate where the model may be wrong. We provide extensive comparisons with respect to state-of-the-art models and show that our model outperforms these models on animation videos. Our code is available at https://github.com/VicFonch/Multi-Input-Resshift-Diffusion-VFI.
Chinese: 本研究提出了一种针对手绘动画的扩散式视频帧插值新方法,具备时间感知插值、高效低步数扩散方案及像素级不确定性估计三大创新,显著提升了动画视频的插值性能。
English: This study introduces a novel diffusion-based video frame interpolation method tailored for hand-drawn animation, featuring time-aware interpolation, an efficient diffusion scheme requiring minimal steps, and pixel-level uncertainty estimation to enhance reliability.

Authors:Sihang Li, Zeyu Jiang, Grace Chen, Chenyang Xu, Siqi Tan, Xue Wang, Irving Fang, Kristof Zyskowski, Shannon P. McPherron, Radu Iovita, Chen Feng, Jing Zhang
Title: GARF: Learning Generalizable 3D Reassembly for Real-World Fractures
Abstract:
3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose GARF, a generalizable 3D reassembly framework for real-world fractures. GARF leverages fracture-aware pretraining to learn fracture features from individual fragments, with flow matching enabling precise 6-DoF alignments. At inference time, we introduce one-step preassembly, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate Fractura, a diverse dataset for vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments have shown our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy. This sheds light on training on synthetic data to advance real-world 3D puzzle solving, demonstrating its strong generalization across unseen object shapes and diverse fracture types.

Authors:Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim
Title: URECA: Unique Region Caption Anything
Abstract:
Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on URECA dataset and generalizes well to existing region-level captioning benchmarks.
Chinese: URECA数据集和模型通过多粒度方法解决了现有区域级描述方法的局限性,利用精细化的数据筛选流程和增强的空间编码技术确保独特且一致的描述,实现了最先进的性能表现。
English: The URECA dataset and model address the limitations of existing region-level captioning methods by introducing a multi-granularity approach that ensures unique and consistent captions through a refined data curation pipeline and enhanced spatial encoding techniques, achieving state-of-the-art performance.

Authors:Hansheng Chen, Kai Zhang, Hao Tan, Zexiang Xu, Fujun Luan, Leonidas Guibas, Gordon Wetzstein, Sai Bi
Title: Gaussian Mixture Flow Matching Models
Abstract:
Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, they underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models where a single Gaussian is learned with an L2 denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256×256.
中文摘要:提出的GMFlow模型通过预测高斯混合参数来更精确地捕捉速度分布,改进了扩散模型和流匹配方法,其新型概率引导方案实现了更优的少步采样效果并有效缓解了色彩过饱和问题。
English Summary: The proposed GMFlow model improves upon diffusion and flow matching methods by predicting Gaussian mixture parameters for more accurate velocity distributions, enabling superior few-step sampling and reducing color over-saturation through a novel probabilistic guidance scheme.
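The central modeling change, predicting Gaussian-mixture parameters for the flow velocity rather than a single mean, can be sketched with a small prediction head and a mixture likelihood. The head layout and the negative log-likelihood below are simplified stand-ins (the paper trains with a KL divergence loss and derives dedicated GM-SDE/ODE solvers); all names and dimensions are assumptions.

```python
import math
import torch
import torch.nn.functional as F

class GMVelocityHead(torch.nn.Module):
    """Predict K-component Gaussian-mixture parameters for the flow velocity (illustrative)."""
    def __init__(self, feat_dim: int, out_dim: int, num_components: int = 8):
        super().__init__()
        self.K, self.D = num_components, out_dim
        self.proj = torch.nn.Linear(feat_dim, num_components * (2 * out_dim + 1))

    def forward(self, h):                                # h: (..., feat_dim)
        p = self.proj(h).view(*h.shape[:-1], self.K, 2 * self.D + 1)
        logits = p[..., 0]                               # unnormalized mixture weights, (..., K)
        means = p[..., 1:1 + self.D]                     # component means, (..., K, D)
        log_var = p[..., 1 + self.D:]                    # component log-variances, (..., K, D)
        return logits, means, log_var

def gm_velocity_nll(logits, means, log_var, target_v):
    """Negative log-likelihood of the target velocity under the predicted mixture."""
    diff = target_v.unsqueeze(-2) - means                # (..., K, D)
    comp_logp = -0.5 * ((diff ** 2) / log_var.exp() + log_var + math.log(2 * math.pi)).sum(-1)
    log_w = F.log_softmax(logits, dim=-1)
    return -torch.logsumexp(log_w + comp_logp, dim=-1).mean()
```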

Authors:Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang
Title: One-Minute Video Generation with Test-Time Training
Abstract:
Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit
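The key property of TTT layers, a hidden state that is itself a small learnable model updated by a gradient step on a self-supervised loss at every token, can be illustrated in a few lines. This toy layer uses a per-sequence linear map as the state and an L2 reconstruction loss for the inner update; it mirrors only the general TTT-Linear idea and is not the 5B video architecture described above.

```python
import torch

class TinyTTTLayer(torch.nn.Module):
    """Toy Test-Time Training layer: the 'hidden state' is a per-sequence matrix W that is
    updated online with a gradient step on a reconstruction loss at each token."""
    def __init__(self, dim: int, inner_lr: float = 0.1):
        super().__init__()
        self.dim, self.inner_lr = dim, inner_lr
        self.key = torch.nn.Linear(dim, dim)     # "corrupted" view used as input to the state
        self.value = torch.nn.Linear(dim, dim)   # reconstruction target view
        self.query = torch.nn.Linear(dim, dim)   # view used to read out from the state

    def forward(self, x):                        # x: (B, T, D), processed causally
        B, T, D = x.shape
        W = torch.zeros(B, D, D, device=x.device)    # hidden state: one matrix per sequence
        outputs = []
        for t in range(T):
            k, v, q = self.key(x[:, t]), self.value(x[:, t]), self.query(x[:, t])
            pred = torch.bmm(k.unsqueeze(1), W).squeeze(1)   # current state's reconstruction
            err = pred - v
            # inner-loop gradient step on ||kW - v||^2 updates the hidden state
            W = W - self.inner_lr * torch.bmm(k.unsqueeze(-1), err.unsqueeze(1))
            outputs.append(torch.bmm(q.unsqueeze(1), W).squeeze(1))
        return torch.stack(outputs, dim=1)
```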

Authors:Gal Fiebelman, Hadar Averbuch-Elor, Sagie Benaim
Title: Let it Snow! Animating Static Gaussian Scenes With Dynamic Weather Effects
Abstract:
3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, introducing dynamic elements that interact naturally with such static scenes remains challenging. Accordingly, we present a novel hybrid framework that combines Gaussian-particle representations for incorporating physically-based global weather effects into static 3D Gaussian Splatting scenes, correctly handling the interactions of dynamic elements with the static scene. We follow a three-stage process: we first map static 3D Gaussians to a particle-based representation. We then introduce dynamic particles and simulate their motion using the Material Point Method (MPM). Finally, we map the simulated particles back to the Gaussian domain while introducing appearance parameters tailored for specific effects. To correctly handle the interactions of dynamic elements with the static scene, we introduce specialized collision handling techniques. Our approach supports a variety of weather effects, including snowfall, rainfall, fog, and sandstorms, and can also support falling objects, all with physically plausible motion and appearance. Experiments demonstrate that our method significantly outperforms existing approaches in both visual quality and physical realism.

Authors:German Barquero, Nadine Bertsch, Manojkumar Marramreddy, Carlos Chacón, Filippo Arcadu, Ferran Rigual, Nicky Sijia He, Cristina Palmero, Sergio Escalera, Yuting Ye, Robin Kips
Title: From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models
Abstract:
In extended reality (XR), generating full-body motion of the users is important to understand their actions, drive their virtual avatars for social interaction, and convey a realistic sense of presence. While prior works focused on spatially sparse and always-on input signals from motion controllers, many XR applications opt for vision-based hand tracking for reduced user friction and better immersion. Compared to controllers, hand tracking signals are less accurate and can even be missing for an extended period of time. To handle such unreliable inputs, we present Rolling Prediction Model (RPM), an online and real-time approach that generates smooth full-body motion from temporally and spatially sparse input signals. Our model generates 1) accurate motion that matches the inputs (i.e., tracking mode) and 2) plausible motion when inputs are missing (i.e., synthesis mode). More importantly, RPM generates seamless transitions from tracking to synthesis, and vice versa. To demonstrate the practical importance of handling noisy and missing inputs, we present GORP, the first dataset of realistic sparse inputs from a commercial virtual reality (VR) headset with paired high quality body motion ground truth. GORP provides >14 hours of VR gameplay data from 28 people using motion controllers (spatially sparse) and hand tracking (spatially and temporally sparse). We benchmark RPM against the state of the art on both synthetic data and GORP to highlight how we can bridge the gap for real-world applications with a realistic dataset and by handling unreliable input signals. Our code, pretrained models, and GORP dataset are available in the project webpage.

Authors:Wenzhao Tang, Weihang Li, Xiucheng Liang, Olaf Wysocki, Filip Biljecki, Christoph Holst, Boris Jutzi
Title: Texture2LoD3: Enabling LoD3 Building Reconstruction With Panoramic Images
Abstract:
Despite recent advancements in surface reconstruction, Level of Detail (LoD) 3 building reconstruction remains an unresolved challenge. The main issue pertains to the object-oriented modelling paradigm, which requires georeferencing, watertight geometry, facade semantics, and low-poly representation, in contrast to unstructured mesh-oriented models. In Texture2LoD3, we introduce a novel method leveraging the ubiquity of 3D building model priors and panoramic street-level images, enabling the reconstruction of LoD3 building models. We observe that prior low-detail building models can serve as valid planar targets for ortho-rectifying street-level panoramic images. Moreover, deploying segmentation on accurately textured low-level building surfaces supports maintaining essential georeferencing, watertight geometry, and low-poly representation for LoD3 reconstruction. In the absence of LoD3 validation data, we additionally introduce the ReLoD3 dataset, on which we experimentally demonstrate that our method leads to improved facade segmentation accuracy by 11% and can replace costly manual projections. We believe that Texture2LoD3 can scale the adoption of LoD3 models, opening applications in estimating building solar potential or enhancing autonomous driving simulations. The project website, code, and data are available here: https://wenzhaotang.github.io/Texture2LoD3/.

Authors:Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed
Title: A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?
Abstract:
Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision.
Chinese: 医学影像中的视觉-语言预训练因数据稀缺和概念精细而面临挑战,因此回归监督式单模态预训练,证明其更具竞争力且能更好地整合异构数据。
English: Vision-language pre-training in medical imaging faces challenges due to scarce datasets and fine-grained concepts, prompting a return to supervised unimodal pre-training that proves more competitive and better integrates heterogeneous data.

Authors:Rayan Merghani Ahmed, Adnan Iltaf, Mohamed Elmanna, Gang Zhao, Hongliang Li, Yue Du, Bin Li, Shoujun Zhou
Title: MSA-UNet3+: Multi-Scale Attention UNet3+ with New Supervised Prototypical Contrastive Loss for Coronary DSA Image Segmentation
Abstract:
Accurate segmentation of coronary Digital Subtraction Angiography images is essential to diagnose and treat coronary artery diseases. Despite advances in deep learning, challenges such as high intra-class variance and class imbalance limit precise vessel delineation. Most existing approaches for coronary DSA segmentation cannot address these issues. Also, existing segmentation networks' encoders do not directly generate semantic embeddings, which could enable the decoder to reconstruct segmentation masks effectively from these well-defined features. We propose a Supervised Prototypical Contrastive Loss that fuses supervised and prototypical contrastive learning to enhance coronary DSA image segmentation. The supervised contrastive loss enforces semantic embeddings in the encoder, improving feature differentiation. The prototypical contrastive loss allows the model to focus on the foreground class while alleviating the high intra-class variance and class imbalance problems by concentrating only on the hard-to-classify background samples. We implement the proposed SPCL loss within an MSA-UNet3+: a Multi-Scale Attention-Enhanced UNet3+ architecture. The architecture integrates key components: a Multi-Scale Attention Encoder and a Multi-Scale Dilated Bottleneck designed to enhance multi-scale feature extraction, and a Contextual Attention Fusion Module built to keep fine-grained details while improving contextual understanding. Experiments on a private coronary DSA dataset show that MSA-UNet3+ outperforms state-of-the-art methods, achieving the highest Dice coefficient and F1-score and significantly reducing ASD and ACD. The developed framework provides clinicians with precise vessel segmentation, enabling accurate identification of coronary stenosis and supporting informed diagnostic and therapeutic decisions. The code will be released at https://github.com/rayanmerghani/MSA-UNet3plus.
中文: 本研究提出了一种监督原型对比损失方法,结合MSA-UNet3+架构,通过解决类别不平衡和提升特征区分度来改进冠状动脉DSA图像分割,在私有数据集上实现了最优性能。
English: This study introduces a Supervised Prototypical Contrastive Loss integrated into an MSA-UNet3+ architecture to enhance coronary DSA image segmentation by addressing class imbalance and improving feature differentiation, achieving superior performance on a private dataset.
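A sketch of the prototypical part of the SPCL idea as the abstract describes it: contrast foreground (vessel) embeddings against a foreground prototype while using only hard-to-classify background pixels as negatives. The hard-sample mining rule, temperature, and cross-entropy form below are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(emb, labels, probs, temperature=0.1, num_hard_bg=256):
    """emb: (N, C) pixel embeddings; labels: (N,) {0: background, 1: vessel};
    probs: (N,) predicted vessel probability, used to mine hard background pixels."""
    emb = F.normalize(emb, dim=1)
    fg, bg = emb[labels == 1], emb[labels == 0]
    if fg.numel() == 0 or bg.numel() == 0:
        return emb.new_zeros(())
    proto = F.normalize(fg.mean(dim=0, keepdim=True), dim=1)        # foreground prototype
    # hard background pixels: highest predicted vessel probability among true background
    hard_idx = probs[labels == 0].topk(min(num_hard_bg, bg.shape[0])).indices
    hard_bg = bg[hard_idx]
    pos = (fg @ proto.t()) / temperature                             # (Nf, 1)
    neg = (fg @ hard_bg.t()) / temperature                           # (Nf, Nh)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(fg.shape[0], dtype=torch.long, device=emb.device)
    return F.cross_entropy(logits, target)
```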

Authors:Xingyu Hu, Junjun Jiang, Chenyang Wang, Kui Jiang, Xianming Liu, Jiayi Ma
Title: Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
Abstract:
Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. However, this dependence during inference restricts the model's generalization to unseen fusion tasks. To address these issues, we propose a novel unified image fusion framework named "TITA", which dynamically balances both Task-invariant Interaction and Task-specific Adaptation. For task-invariant interaction, we introduce the Interaction-enhanced Pixel Attention (IPA) module to enhance pixel-wise interactions for better multi-source complementary information extraction. For task-specific adaptation, the Operation-based Adaptive Fusion (OAF) module dynamically adjusts operation weights based on task properties. Additionally, we incorporate the Fast Adaptive Multitask Optimization (FAMO) strategy to mitigate the impact of gradient conflicts across tasks during joint training. Extensive experiments demonstrate that TITA not only achieves competitive performance compared to specialized methods across three image fusion scenarios but also exhibits strong generalization to unseen fusion tasks. The source codes are released at https://github.com/huxingyuabc/TITA.
中文: 提出的“TITA”框架通过交互增强像素注意力和操作自适应融合模块,动态平衡任务不变交互与任务特定适应,在保持多任务性能的同时显著提升了对未知融合任务的泛化能力。
English: The proposed "TITA" framework dynamically balances task-invariant interaction and task-specific adaptation through specialized modules to enhance unified image fusion performance and generalization across both known and unseen tasks.

Authors:Shenghao Ren, Yi Lu, Jiayi Huang, Jiayi Zhao, He Zhang, Tao Yu, Qiu Shen, Xun Cao
Title: MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond
Abstract:
Existing human Motion Capture (MoCap) methods mostly focus on the visual similarity while neglecting the physical plausibility. As a result, downstream tasks such as driving virtual humans in 3D scenes or humanoid robots in the real world suffer from issues such as timing drift and jitter, spatial problems like sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of interaction between the human body and the physical world by exploring the role of pressure. Firstly, we construct a large-scale human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion, encompassing a total of 12.4M pose frames. Secondly, we examine both the necessity and effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation based solely on pressure: We propose a network that incorporates a small kernel decoder and a long-short-term attention module, and prove that pressure can provide an accurate global trajectory and a plausible lower-body pose. (2) pose and trajectory estimation by fusing pressure and RGB: We impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy to fuse pressure and RGB feature maps. Experiments demonstrate that fusing pressure with RGB features not only significantly improves performance in terms of objective metrics, but also plausibly drives virtual humans (SMPL) in 3D scenes. Furthermore, we demonstrate that incorporating physical perception enables humanoid robots to perform more precise and stable actions, which is highly beneficial for the development of embodied artificial intelligence. Project page is available at: https://nju-cite-mocaphumanoid.github.io/MotionPRO/

Authors:Jay Kamat, Júlia Borràs, Carme Torras
Title: CloSE: A Compact Shape- and Orientation-Agnostic Cloth State Representation
Abstract:
Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation, based on topological indices computed for segments on the edges of the cloth mesh border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent across different shapes, sizes, and positions of the cloth, such as the corners and the fold locations. We then abstract these important features from the dGLI disk onto a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. Finally, we show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code, the dataset and the video can be accessed from: https://jaykamat99.github.io/close-representation

Authors:Wang Tang, Fethiye Irmak Dogan, Linbo Qing, Hatice Gunes
Title: AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification
Abstract:
Dyadic social relationships, which refer to relationships between two individuals who know each other through repeated interactions (or not), are shaped by shared spatial and temporal experiences. Current computational methods for modeling these relationships face three major challenges: (1) the failure to model asymmetric relationships, e.g., one individual may perceive the other as a friend while the other perceives them as an acquaintance, (2) the disruption of continuous interactions by discrete frame sampling, which segments the temporal continuity of interaction in real-world scenarios, and (3) the limitation to consider periodic behavioral cues, such as rhythmic vocalizations or recurrent gestures, which are crucial for inferring the evolution of dyadic relationships. To address these challenges, we propose AsyReC, a multimodal graph-based framework for asymmetric dyadic relationship classification, with three core innovations: (i) a triplet graph neural network with node-edge dual attention that dynamically weights multimodal cues to capture interaction asymmetries (addressing challenge 1); (ii) a clip-level relationship learning architecture that preserves temporal continuity, enabling fine-grained modeling of real-world interaction dynamics (addressing challenge 2); and (iii) a periodic temporal encoder that projects time indices onto sine/cosine waveforms to model recurrent behavioral patterns (addressing challenge 3). Extensive experiments on two public datasets demonstrate state-of-the-art performance, while ablation studies validate the critical role of asymmetric interaction modeling and periodic temporal encoding in improving the robustness of dyadic relationship classification in real-world scenarios. Our code is publicly available at: https://github.com/tw-repository/AsyReC.
中文: AsyReC框架通过多模态图神经网络和时序编码器,解决了二元关系建模中的三大挑战——关系不对称性、时间连续性中断和周期性行为整合,在公开数据集上实现了领先性能。
English: The AsyReC framework addresses three key challenges in dyadic relationship modeling—asymmetry, temporal continuity disruption, and periodic behavior integration—through a multimodal graph-based approach with triplet GNNs and temporal encoders, achieving state-of-the-art performance on public datasets.
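The periodic temporal encoder described above (projecting time indices onto sine/cosine waveforms) is straightforward to sketch; the number of frequencies and the base period here are illustrative choices, not values from the paper.

```python
import math
import torch

def periodic_time_encoding(t, num_freqs: int = 8, base_period: float = 64.0):
    """Map integer clip/frame indices t (shape (T,)) onto sine/cosine features at several
    periods, so recurrent behavioral cues with different rhythms become easy to pick up."""
    periods = base_period / (2.0 ** torch.arange(num_freqs, dtype=torch.float32))
    phase = 2 * math.pi * t.float().unsqueeze(-1) / periods          # (T, num_freqs)
    return torch.cat([torch.sin(phase), torch.cos(phase)], dim=-1)   # (T, 2 * num_freqs)
```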

Authors:Jihyun Lee, Weipeng Xu, Alexander Richard, Shih-En Wei, Shunsuke Saito, Shaojie Bai, Te-Li Wang, Minhyuk Sung, Tae-Kyun Kim, Jason Saragih
Title: REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning
Abstract:
We present REWIND (Real-Time Egocentric Whole-Body Motion Diffusion), a one-step diffusion model for real-time, high-fidelity human motion estimation from egocentric image inputs. While an existing method for egocentric whole-body (i.e., body and hands) motion estimation is non-real-time and acausal due to diffusion-based iterative motion refinement to capture correlations between body and hand poses, REWIND operates in a fully causal and real-time manner. To enable real-time inference, we introduce (1) cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions in a fast, feed-forward manner, and (2) diffusion distillation, which enables high-quality motion estimation with a single denoising step. Our denoising diffusion model is based on a modified Transformer architecture, designed to causally model output motions while enhancing generalizability to unseen motion lengths. Additionally, REWIND optionally supports identity-conditioned motion estimation when identity prior is available. To this end, we propose a novel identity conditioning method based on a small set of pose exemplars of the target identity, which further enhances motion estimation quality. Through extensive experiments, we demonstrate that REWIND significantly outperforms the existing baselines both with and without exemplar-based identity conditioning.

Authors:Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu
Title: FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
Abstract:
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: https://fantasy-amap.github.io/fantasy-talking/.

Authors:Pengju Sun, Banglei Guan, Zhenbao Yu, Yang Shang, Qifeng Yu, Daniel Barath
Title: Learning Affine Correspondences by Integrating Geometric Constraints
Abstract:
Affine correspondences have received significant attention due to their benefits in tasks like image matching and pose estimation. Existing methods for extracting affine correspondences still have many limitations in terms of performance; thus, exploring a new paradigm is crucial. In this paper, we present a new pipeline designed for extracting accurate affine correspondences by integrating dense matching and geometric constraints. Specifically, a novel extraction framework is introduced, with the aid of dense matching and a novel keypoint scale and orientation estimator. For this purpose, we propose loss functions based on geometric constraints, which can effectively improve accuracy by supervising neural networks to learn feature geometry. The experimental results show that the accuracy and robustness of our method outperform those of existing methods in image matching tasks. To further demonstrate the effectiveness of the proposed method, we applied it to relative pose estimation. Affine correspondences extracted by our method lead to more accurate poses than the baselines on a range of real-world datasets. The code is available at https://github.com/stilcrad/DenseAffine.
Chinese: 本文提出了一种新颖的流程,通过结合密集匹配和几何约束来改进仿射对应关系的提取,在图像匹配和姿态估计任务中相比现有方法展现出更高的准确性和鲁棒性。
English: This paper introduces a novel pipeline that enhances affine correspondence extraction by combining dense matching with geometric constraints, achieving superior accuracy and robustness in image matching and pose estimation tasks compared to existing methods.

Authors:Yuanpei Liu, Kai Han
Title: DebGCD: Debiased Learning with Distribution Guidance for Generalized Category Discovery
Abstract:
In this paper, we tackle the problem of Generalized Category Discovery (GCD). Given a dataset containing both labelled and unlabelled images, the objective is to categorize all images in the unlabelled subset, irrespective of whether they are from known or unknown classes. In GCD, an inherent label bias exists between known and unknown classes due to the lack of ground-truth labels for the latter. State-of-the-art methods in GCD leverage parametric classifiers trained through self-distillation with soft labels, leaving the bias issue unattended. Besides, they treat all unlabelled samples uniformly, neglecting variations in certainty levels and resulting in suboptimal learning. Moreover, the explicit identification of semantic distribution shifts between known and unknown classes, a vital aspect for effective GCD, has been neglected. To address these challenges, we introduce DebGCD, a Debiased learning with distribution guidance framework for GCD. Initially, DebGCD co-trains an auxiliary debiased classifier in the same feature space as the GCD classifier, progressively enhancing the GCD features. Moreover, we introduce a semantic distribution detector in a separate feature space to implicitly boost the learning efficacy of GCD. Additionally, we employ a curriculum learning strategy based on semantic distribution certainty to steer the debiased learning at an optimized pace. Thorough evaluations on GCD benchmarks demonstrate the consistent state-of-the-art performance of our framework, highlighting its superiority. Project page: https://visual-ai.github.io/debgcd/

Authors:Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Hongxia Xu, Danny Chen, Jintai Chen, Jian Wu
Title: OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM
Abstract:
Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a commonality way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that a LLaVA model with our OrderChain improves baseline LLaVA significantly on diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% in accuracy and 0.24 in MAE on the Adience dataset. To the best of our knowledge, our OrderChain is the first work that augments MLLMs for OR tasks, and its effectiveness is witnessed across a spectrum of OR datasets. Project Page: https://order-chain.github.io/.
中文: OrderChain是一种新颖的提示范式,通过特性与共性建模增强多模态大语言模型的序数理解能力,在多种序数回归任务上显著提升了性能表现。
English: OrderChain is a novel prompting paradigm that enhances multimodal large language models' ordinal understanding through specificity and commonality modeling, significantly improving performance across various ordinal regression tasks.

Authors:Inhwan Bae, Junoh Lee, Hae-Gon Jeon
Title: Continuous Locomotive Crowd Behavior Generation
Abstract:
Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. We demonstrate that our approach effectively models diverse crowd behavior patterns and generalizes well across different geographical environments. Code is publicly available at https://github.com/InhwanBae/CrowdES .
中文: 本文提出了一种新颖框架,通过交替使用人群发射器和模拟器来生成具有异质行为的连续逼真人群轨迹,该方法在不同地理环境中展现出优异泛化能力且支持全流程用户控制。
English: This paper introduces a novel framework for generating continuous, realistic crowd trajectories with heterogeneous behaviors using an alternating emitter-simulator approach, which demonstrates superior generalization across environments and offers full user control.

Authors:Xiongbo Lu, Yaxiong Chen, Shengwu Xiong
Title: AnyArtisticGlyph: Multilingual Controllable Artistic Glyph Generation
Abstract:
Artistic Glyph Image Generation (AGIG) differs from current creativity-focused generation models by offering finely controllable deterministic generation. It transfers the style of a reference image to a source while preserving its content. Although advanced and promising, current methods may reveal flaws when scrutinizing synthesized image details, often producing blurred or incorrect textures, posing a significant challenge. Hence, we introduce AnyArtisticGlyph, a diffusion-based, multilingual controllable artistic glyph generation model. It includes a font fusion and embedding module, which generates latent features for detailed structure creation, and a vision-text fusion and embedding module that uses the CLIP model to encode references and blends them with transformation caption embeddings for seamless global image generation. Moreover, we incorporate a coarse-grained feature-level loss to enhance generation accuracy. Experiments show that it produces natural, detailed artistic glyph images with state-of-the-art performance. Our project will be open-sourced on https://github.com/jiean001/AnyArtisticGlyph to advance text generation technology.
Chinese: AnyArtisticGlyph是一种基于扩散模型的多语言可控艺术字形生成方法,通过融合字体和视觉文本模块,能够生成细节丰富、自然逼真的图像,并达到顶尖性能水平。
English: AnyArtisticGlyph is a diffusion-based model that enables multilingual controllable artistic glyph generation by integrating font and vision-text fusion modules, producing high-quality, detailed images with state-of-the-art performance.

Authors:Samarth Mishra, Kate Saenko, Venkatesh Saligrama
Title: Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data
Abstract:
Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state-of-the-art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like "dog chasing cat" vs "cat chasing dog". While MLLMs have made significant progress on Winoground, a benchmark for measuring such reasoning, they are still far from human performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs' compositional reasoning capabilities, which we can see through significant improvements across multiple vision-language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, the SCRAMBLe-tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at https://github.com/samarth4149/SCRAMBLe.
中文: 多模态大语言模型在组合推理方面存在困难,但通过SCRAMBLe方法,利用合成偏好数据进行训练,可显著提升其在Winoground等基准测试和通用视觉问答任务中的表现。
English: Multimodal large language models struggle with compositional reasoning, but the SCRAMBLe method enhances this by training them on synthetic preference data, leading to significant improvements on benchmarks like Winoground and general visual question answering tasks.
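The core training signal described above is a binary preference objective over caption pairs. Below is a minimal sketch of such an objective, written as a generic Bradley-Terry/logistic preference loss over sequence log-probabilities; the function and tensor names are hypothetical, and SCRAMBLe's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def binary_preference_loss(logp_correct: torch.Tensor,
                           logp_incorrect: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    """Generic binary preference objective: push the model to assign a higher
    caption log-likelihood to the correct caption than to a close-but-wrong one."""
    return -F.logsigmoid(beta * (logp_correct - logp_incorrect)).mean()

# Toy usage with per-sample sequence log-probabilities for an image batch.
logp_correct = torch.tensor([-12.3, -9.8, -15.1])
logp_incorrect = torch.tensor([-13.0, -9.5, -16.4])
loss = binary_preference_loss(logp_correct, logp_incorrect)
```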

Authors:Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Title: Inverse++: Vision-Centric 3D Semantic Occupancy Prediction Assisted with 3D Object Detection
Abstract:
3D semantic occupancy prediction aims to forecast detailed geometric and semantic information of the surrounding environment for autonomous vehicles (AVs) using onboard surround-view cameras. Existing methods primarily focus on intricate inner structure module designs to improve model performance, such as efficient feature sampling and aggregation processes or intermediate feature representation formats. In this paper, we explore multitask learning by introducing an additional 3D supervision signal through a 3D object detection auxiliary branch. This extra 3D supervision signal enhances the model's overall performance by strengthening the capability of the intermediate features to capture small dynamic objects in the scene, and these small dynamic objects often include vulnerable road users, i.e. bicycles, motorcycles, and pedestrians, whose detection is crucial for ensuring driving safety in autonomous vehicles. Extensive experiments conducted on the nuScenes datasets, including challenging rainy and nighttime scenarios, showcase that our approach attains state-of-the-art results, achieving an IoU score of 31.73% and a mIoU score of 20.91%, and excels at detecting vulnerable road users (VRU). The code will be made available at: https://github.com/DanielMing123/Inverse++
中文: 本文通过引入3D物体检测辅助分支来增强自动驾驶车辆的3D语义占据预测,提升了中间特征对小尺寸动态物体(如易受伤害道路使用者)的捕捉能力,并在nuScenes数据集上取得了最优性能。
English: This paper enhances 3D semantic occupancy prediction for autonomous vehicles by incorporating a 3D object detection auxiliary branch, which improves feature representation for detecting small dynamic objects like vulnerable road users and achieves state-of-the-art performance on nuScenes datasets.

Authors:Haoren Zhao, Tianyi Chen, Zhen Wang
Title: On the Robustness of GUI Grounding Models Against Image Attacks
Abstract:
Graphical User Interface (GUI) grounding models are crucial for enabling intelligent agents to understand and interact with complex visual interfaces. However, these models face significant robustness challenges in real-world scenarios due to natural noise and adversarial perturbations, and their robustness remains underexplored. In this study, we systematically evaluate the robustness of state-of-the-art GUI grounding models, such as UGround, under three conditions: natural noise, untargeted adversarial attacks, and targeted adversarial attacks. Our experiments, which were conducted across a wide range of GUI environments, including mobile, desktop, and web interfaces, have clearly demonstrated that GUI grounding models exhibit a high degree of sensitivity to adversarial perturbations and low-resolution conditions. These findings provide valuable insights into the vulnerabilities of GUI grounding models and establish a strong benchmark for future research aimed at enhancing their robustness in practical applications. Our code is available at https://github.com/ZZZhr-1/Robust_GUI_Grounding.
Chinese: 本研究系统评估了先进GUI接地模型的鲁棒性,发现它们在多种界面环境下对对抗性扰动和低分辨率条件高度敏感。
English: This study systematically assesses the robustness of state-of-the-art GUI grounding models, revealing their high sensitivity to adversarial perturbations and low-resolution conditions across various interfaces.

Authors:Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li
Title: LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Abstract:
The increasing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost and task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leveraged attention weights to evict non-critical cache tokens. But there is a trade-off in those methods: they usually require major modification of the inference infrastructure and significant computation overhead. Based on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy relying only on straightforward comparisons among the KV entries themselves. It is an entirely attention-free method which offers easy integration into mainstream inference platforms and performance comparable to other, more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. Especially in the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over 50% at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
中文摘要:LagKV是一种创新的KV缓存压缩方法,通过直接比较KV条目而无需注意力权重,实现了便捷集成和在长上下文大语言模型推理中的卓越性能。
English Summary: LagKV is a novel KV cache compression method that eliminates the need for attention weights by comparing KV entries directly, offering easy integration and superior performance in long-context LLM inference.
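To make the "comparison among the KV entries themselves" idea concrete, here is a rough, attention-free sketch of lag-relative token scoring and eviction under our own assumptions; the function names, the z-score-style statistic, and the keep-the-last-`lag`-tokens rule are illustrative, not the paper's exact algorithm.

```python
import torch

def lag_relative_scores(keys: torch.Tensor, values: torch.Tensor, lag: int) -> torch.Tensor:
    """Score each cached token by comparing its K/V vectors against the mean/std
    of the segment that lags it. keys, values: [seq, dim]; returns [seq - lag] scores."""
    mu_k, sd_k = keys[lag:].mean(0), keys[lag:].std(0) + 1e-6
    mu_v, sd_v = values[lag:].mean(0), values[lag:].std(0) + 1e-6
    zk = ((keys[:-lag] - mu_k) / sd_k).abs().mean(-1)    # key deviation per token
    zv = ((values[:-lag] - mu_v) / sd_v).abs().mean(-1)  # value deviation per token
    return zk + zv

def evict(keys: torch.Tensor, values: torch.Tensor, lag: int = 64, keep_ratio: float = 0.5):
    """Drop the lowest-scoring older tokens; always keep the most recent `lag` tokens."""
    scores = lag_relative_scores(keys, values, lag)
    n_keep = int(keep_ratio * scores.numel())
    kept = scores.topk(n_keep).indices.sort().values
    idx = torch.cat([kept, torch.arange(keys.shape[0] - lag, keys.shape[0])])
    return keys[idx], values[idx]
```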

Authors:Bo-Wen Yin, Jiao-Long Cao, Ming-Ming Cheng, Qibin Hou
Title: DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation
Abstract:
Recent advances in scene understanding benefit greatly from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposure). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which will then be used as geometry priors to allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance in various RGBD semantic segmentation benchmarks. Code is available at: https://github.com/VCIP-RGBD/DFormer.
中文摘要:DFormerv2提出了一种创新的RGBD编码器,将深度图作为几何先验来引导自注意力机制中的权重分配,无需对深度信息进行神经网络显式编码,即在RGBD语义分割任务中实现了卓越性能。
English Summary: DFormerv2 introduces a novel RGBD encoder that treats depth maps as geometric priors to guide attention mechanisms in self-attention, achieving state-of-the-art performance in RGBD semantic segmentation without explicit neural network encoding of depth information.
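The abstract's key mechanism, using depth and spatial distances between patch tokens to allocate attention weights, can be illustrated with a single-head sketch that adds a geometric bias to the attention logits. All names and the exact form of the prior below are assumptions for illustration, not the released DFormerv2 code.

```python
import torch
import torch.nn.functional as F

def geometry_biased_attention(q, k, v, patch_depth, patch_xy, alpha=1.0, beta=1.0):
    """Self-attention whose logits are biased by geometric proximity: tokens close
    in depth and image-plane position attend to each other more strongly.
    Shapes: q, k, v [N, d]; patch_depth [N]; patch_xy [N, 2]."""
    d = q.shape[-1]
    logits = q @ k.t() / d ** 0.5                                  # content logits
    depth_dist = (patch_depth[:, None] - patch_depth[None, :]).abs()
    plane_dist = (patch_xy[:, None, :] - patch_xy[None, :, :]).norm(dim=-1)
    geo_prior = -(alpha * depth_dist + beta * plane_dist)          # closer => larger bias
    return F.softmax(logits + geo_prior, dim=-1) @ v
```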

Authors:Wanzhou Liu, Zhexiao Xiong, Xinyu Li, Nathan Jacobs
Title: DeclutterNeRF: Generative-Free 3D Scene Recovery for Occlusion Removal
Abstract:
Recent novel view synthesis (NVS) techniques, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly advanced 3D scene reconstruction with high-quality rendering and realistic detail recovery. Effectively removing occlusions while preserving scene details can further enhance the robustness and applicability of these techniques. However, existing approaches for object and occlusion removal predominantly rely on generative priors, which, despite filling the resulting holes, introduce new artifacts and blurriness. Moreover, existing benchmark datasets for evaluating occlusion removal methods lack realistic complexity and viewpoint variations. To address these issues, we introduce DeclutterSet, a novel dataset featuring diverse scenes with pronounced occlusions distributed across foreground, midground, and background, exhibiting substantial relative motion across viewpoints. We further introduce DeclutterNeRF, an occlusion removal method free from generative priors. DeclutterNeRF introduces joint multi-view optimization of learnable camera parameters, occlusion annealing regularization, and employs an explainable stochastic structural similarity loss, ensuring high-quality, artifact-free reconstructions from incomplete images. Experiments demonstrate that DeclutterNeRF significantly outperforms state-of-the-art methods on our proposed DeclutterSet, establishing a strong baseline for future research.
中文:近期的新视角合成技术如NeRF和3DGS显著提升了三维场景重建质量,但现有遮挡去除方法依赖生成先验导致伪影,且缺乏真实复杂数据集;为此提出的DeclutterSet数据集和DeclutterNeRF方法通过多视角联合优化实现了无伪影的高质量重建。
English: Recent advances in novel view synthesis techniques like NeRF and 3DGS have improved 3D scene reconstruction, but existing occlusion removal methods often introduce artifacts and lack realistic evaluation datasets, prompting the introduction of DeclutterSet and DeclutterNeRF for artifact-free, high-quality reconstructions.

Authors:Haebeom Jung, Namtae Kim, Jungwoo Kim, Jaesik Park
Title: Targetless LiDAR-Camera Calibration with Anchored 3D Gaussians
Abstract:
We present a targetless LiDAR-camera calibration method that jointly optimizes sensor poses and scene geometry from arbitrary scenes, without relying on traditional calibration targets such as checkerboards or spherical reflectors. Our approach leverages a 3D Gaussian-based scene representation. We first freeze reliable LiDAR points as anchors, then jointly optimize the poses and auxiliary Gaussian parameters in a fully differentiable manner using a photometric loss. This joint optimization significantly reduces sensor misalignment, resulting in higher rendering quality and consistently improved PSNR compared to the carefully calibrated poses provided in popular datasets. We validate our method through extensive experiments on two real-world autonomous driving datasets, KITTI-360 and Waymo, each featuring distinct sensor configurations. Additionally, we demonstrate the robustness of our approach using a custom LiDAR-camera setup, confirming strong performance across diverse hardware configurations.

Authors:Weikai Lin, Tianrui Ma, Adith Boloor, Yu Feng, Ruofan Xing, Xuan Zhang, Yuhao Zhu
Title: SnapPix: Efficient-Coding--Inspired In-Sensor Compression for Edge Vision
Abstract:
Energy-efficient image acquisition on the edge is crucial for enabling remote sensing applications where the sensor node has weak compute capabilities and must transmit data to a remote server/cloud for processing. To reduce the edge energy consumption, this paper proposes a sensor-algorithm co-designed system called SnapPix, which compresses raw pixels in the analog domain inside the sensor. We use coded exposure (CE) as the in-sensor compression strategy as it offers the flexibility to sample, i.e., selectively expose pixels, both spatially and temporally. SnapPix makes three contributions. First, we propose a task-agnostic strategy to learn the sampling/exposure pattern based on the classic theory of efficient coding. Second, we co-design the downstream vision model with the exposure pattern to address the pixel-level non-uniformity unique to CE-compressed images. Finally, we propose lightweight augmentations to the image sensor hardware to support our in-sensor CE compression. Evaluating on action recognition and video reconstruction, SnapPix outperforms state-of-the-art video-based methods at the same speed while reducing the energy by up to 15.4x. We have open-sourced the code at: https://github.com/horizon-research/SnapPix.
中文: SnapPix是一种传感器-算法协同设计的系统,通过编码曝光在模拟域进行压缩,在保持遥感应用性能的同时显著降低了边缘能耗。
English: SnapPix is a sensor-algorithm co-designed system that employs coded exposure for analog domain compression to significantly reduce edge energy consumption while maintaining performance in remote sensing applications.

Authors:Junjie Jiang, Zelin Wang, Manqi Zhao, Yin Li, DongSheng Jiang
Title: SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation
Abstract:
Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zero-shot generalization, allowing it to work across datasets without fine-tuning, and strong object association, inherited from SAM2. To further improve performance, we integrate a trajectory manager system for precise object addition and removal, and a cross-object interaction module to handle occlusions. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.
中文: SAM2MOT提出了一种新颖的基于分割的多目标跟踪范式,利用SAM2的分割能力实现零样本泛化和强目标关联,在多个数据集上的实验结果表明其达到了最先进的性能水平。
English: SAM2MOT introduces a novel Tracking by Segmentation paradigm for multi-object tracking, leveraging SAM2's segmentation capabilities to achieve zero-shot generalization and strong object association, with experimental results demonstrating state-of-the-art performance on multiple datasets.
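The "tracking boxes directly from segmentation masks" step reduces, in its simplest form, to taking a tight bounding box around each predicted mask. A minimal sketch of that conversion (a hypothetical helper, not the released implementation):

```python
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Convert a binary segmentation mask [H, W] into a tight (x1, y1, x2, y2) box.
    Returns None if the mask is empty (e.g., the object has left the frame)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())
```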

Authors:Jiancheng Pan, Yanxing Liu, Xiao He, Long Peng, Jiahao Li, Yuze Sun, Xiaomeng Huang
Title: Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection
Abstract:
Foundation models pretrained on extensive datasets, such as GroundingDINO and LAE-DINO, have performed remarkably well on the cross-domain few-shot object detection (CD-FSOD) task. Through rigorous few-shot training, we found that the integration of image-based data augmentation techniques and a grid-based sub-domain search strategy significantly enhances the performance of these foundation models. Building upon GroundingDINO, we employed several widely used image augmentation methods and established optimization objectives to effectively navigate the expansive domain space in search of optimal sub-domains. This strategy facilitates efficient few-shot object detection and offers a way to solve the CD-FSOD problem by efficiently searching for the optimal parameter configuration of the foundation model. Our findings substantially advance the practical deployment of vision-language models in data-scarce environments, offering critical insights into optimizing their cross-domain generalization capabilities without labor-intensive retraining. Code is available at https://github.com/jaychempan/ETS.
中文: 基于GroundingDINO和LAE-DINO等基础模型,通过图像增强和网格化子域搜索策略显著提升了跨域少样本检测性能,实现了无需重复训练的高效参数优化。
English: Foundation models like GroundingDINO and LAE-DINO achieve superior cross-domain few-shot object detection through image augmentation and grid-based sub-domain search, enabling efficient parameter optimization without extensive retraining.

Authors:Lei Cheng, Mahdi Saleh, Qing Cheng, Lu Sang, Hongli Xu, Daniel Cremers, Federico Tombari
Title: PRISM: Probabilistic Representation for Integrated Shape Modeling and Generation
Abstract:
Despite the advancements in 3D full-shape generation, accurately modeling complex geometries and semantics of shape parts remains a significant challenge, particularly for shapes with varying numbers of parts. Current methods struggle to effectively integrate the contextual and structural information of 3D shapes into their generative processes. We address these limitations with PRISM, a novel compositional approach for 3D shape generation that integrates categorical diffusion models with Statistical Shape Models (SSM) and Gaussian Mixture Models (GMM). Our method employs compositional SSMs to capture part-level geometric variations and uses GMM to represent part semantics in a continuous space. This integration enables both high fidelity and diversity in generated shapes while preserving structural coherence. Through extensive experiments on shape generation and manipulation tasks, we demonstrate that our approach significantly outperforms previous methods in both quality and controllability of part-level operations. Our code will be made publicly available.

Authors:Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Title: UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
Abstract:
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks. Unlike previous approaches that rely on unilateral visual representations, our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information that empowers heterogeneous tasks to selectively assimilate domain-specific knowledge based on their inherent characteristics. Through in-depth experiments, we uncover key principles for developing a unified model capable of both visual understanding and image generation. Extensive evaluations across a diverse range of prominent benchmarks demonstrate that UniToken achieves state-of-the-art performance, surpassing existing approaches. These results establish UniToken as a robust foundation for future research in this domain. The code and models are available at https://github.com/SxJyJay/UniToken.
中文: UniToken通过结合离散与连续视觉表征的统一模型,在视觉理解和图像生成任务中均实现了最先进的性能,经多基准测试验证其卓越表现。
English: UniToken is a unified model that combines discrete and continuous visual representations to achieve state-of-the-art performance in both visual understanding and image generation tasks, as validated by extensive experiments across multiple benchmarks.

Authors:Yiming Shi, Shaoshuai Yang, Xun Zhu, Haoyu Wang, Xiangling Fu, Miao Li, Ji Wu
Title: MedM-VL: What Makes a Good Medical LVLM?
Abstract:
Medical image analysis is essential in modern healthcare. Deep learning has redirected research focus toward complex medical multimodal tasks, including report generation and visual question answering. Traditional task-specific models often fall short in handling these challenges. Large vision-language models (LVLMs) offer new solutions for solving such tasks. In this study, we build on the popular LLaVA framework to systematically explore model architectures and training strategies for both 2D and 3D medical LVLMs. We present extensive empirical findings and practical guidance. To support reproducibility and future research, we release a modular codebase, MedM-VL, and two pre-trained models: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code is available at: https://github.com/MSIIP/MedM-VL
中文摘要:本研究基于LLaVA框架开发了大型视觉语言模型以推进医学图像分析,提供了实证研究成果并发布了包含2D和3D预训练模型的MedM-VL代码库。
English Summary: This study advances medical image analysis by developing large vision-language models based on the LLaVA framework, providing empirical insights and releasing the MedM-VL codebase with pre-trained models for both 2D and 3D applications.

Authors:Zhisheng Huang, Peng Wang, Jingdong Zhang, Yuan Liu, Xin Li, Wenping Wang
Title: 3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS
Abstract:
3D Gaussian Splatting (3DGS) has revolutionized neural rendering with its efficiency and quality, but like many novel view synthesis methods, it heavily depends on accurate camera poses from Structure-from-Motion (SfM) systems. Although recent SfM pipelines have made impressive progress, questions remain about how to further improve both their robust performance in challenging conditions (e.g., textureless scenes) and the precision of camera parameter estimation simultaneously. We present 3R-GS, a 3D Gaussian Splatting framework that bridges this gap by jointly optimizing 3D Gaussians and camera parameters from the large reconstruction prior MASt3R-SfM. We note that naively performing joint 3D Gaussian and camera optimization faces two challenges: the sensitivity to the quality of SfM initialization, and its limited capacity for global optimization, leading to suboptimal reconstruction results. Our 3R-GS overcomes these issues by incorporating optimized practices, enabling robust scene reconstruction even with imperfect camera registration. Extensive experiments demonstrate that 3R-GS delivers high-quality novel view synthesis and precise camera pose estimation while remaining computationally efficient. Project page: https://zsh523.github.io/3R-GS/
Chinese: 3R-GS是一种新型的3D高斯泼溅框架,通过联合优化3D高斯分布和相机参数,即使在相机初始化不完善的情况下也能实现鲁棒的场景重建和高质量的新视角合成。
English: 3R-GS is a novel 3D Gaussian Splatting framework that jointly optimizes 3D Gaussians and camera parameters, enabling robust scene reconstruction and high-quality novel view synthesis even with imperfect camera initialization.

Authors:Mete Ahishali, Anis Ur Rahman, Einari Heinaro, Samuli Junttila
Title: ADA-Net: Attention-Guided Domain Adaptation Network with Contrastive Learning for Standing Dead Tree Segmentation Using Aerial Imagery
Abstract:
Information on standing dead trees is important for understanding forest ecosystem functioning and resilience but has been lacking over large geographic regions. Climate change has caused large-scale tree mortality events that can remain undetected due to limited data. In this study, we propose a novel method for segmenting standing dead trees using aerial multispectral orthoimages. Because access to annotated datasets has been a significant problem in forest remote sensing due to the need for forest expertise, we introduce a method for domain transfer by leveraging domain adaptation to learn a transformation from a source domain X to target domain Y. In this Image-to-Image translation task, we aim to utilize available annotations in the target domain by pre-training a segmentation network. When images from a new study site without annotations are introduced (source domain X), these images are transformed into the target domain. Then, transfer learning is applied by inferring the pre-trained network on domain-adapted images. In addition to investigating the feasibility of current domain adaptation approaches for this objective, we propose a novel approach called the Attention-guided Domain Adaptation Network (ADA-Net) with enhanced contrastive learning. Accordingly, the ADA-Net approach provides new state-of-the-art domain adaptation performance levels outperforming existing approaches. We have evaluated the proposed approach using two datasets from Finland and the US. The USA images are converted to the Finland domain, and we show that the synthetic USA2Finland dataset exhibits similar characteristics to the Finland domain images. The software implementation is shared at https://github.com/meteahishali/ADA-Net. The data is publicly available at https://www.kaggle.com/datasets/meteahishali/aerial-imagery-for-standing-dead-tree-segmentation.
中文摘要:本研究提出ADA-Net这一新型域自适应方法,通过图像转换实现跨林区知识迁移,解决了枯立木航拍图像分割中标注数据匮乏的问题,在芬兰和美国数据集上取得了最优性能。
English Summary: This study introduces ADA-Net, a novel domain adaptation method for segmenting standing dead trees from aerial imagery, addressing data scarcity by transferring knowledge between forest regions and achieving state-of-the-art performance on datasets from Finland and the USA.

Authors:Yunlong Lin, Zixu Lin, Haoyu Chen, Panwang Pan, Chenxin Li, Sixiang Chen, Yeying Jin, Wenbo Li, Xinghao Ding
Title: JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Abstract:
Vision-centric perception systems struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages the VLM as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset consisting of high-quality and large-scale instruction-response pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real. Project page: https://cvpr2025-jarvisir.github.io/.

Authors:Kai Fang, Anqi Zhang, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei
Title: CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation
Abstract:
Effective Class Incremental Segmentation (CIS) requires simultaneously mitigating catastrophic forgetting and ensuring sufficient plasticity to integrate new classes. The inherent conflict above often leads to a back-and-forth, which turns the objective into finding the balance between the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization (CoMBO). Within this approach, we present the Query Conflict Reduction module, designed to explicitly refine queries for new classes through lightweight, class-specific adapters. This module provides an additional branch for the acquisition of new classes while preserving the original queries for distillation. Moreover, we develop two strategies to further mitigate the conflict following the branched structure, i.e., the Half-Learning Half-Distillation (HDHL) over classification probabilities, and the Importance-Based Knowledge Distillation (IKD) over query features. HDHL selectively engages in learning for classification probabilities of queries that match the ground truth of new classes, while aligning unmatched ones to the corresponding old probabilities, thus ensuring retention of old knowledge while absorbing new classes via learning negative samples. Meanwhile, IKD assesses the importance of queries based on their matching degree to old classes, prioritizing the distillation of important features and allowing less critical features to evolve. Extensive experiments in Class Incremental Panoptic and Semantic Segmentation settings have demonstrated the superior performance of CoMBO. Project page: https://guangyu-ryan.github.io/CoMBO.

Authors:Yikai Wang, Guangce Liu, Xinzhou Wang, Zilong Chen, Jiafang Li, Xin Liang, Fuchun Sun, Jun Zhu
Title: Video4DGen: Enhancing Video and 4D Generation through Mutual Optimization
Abstract:
The advancement of 4D (i.e., sequential 3D) generation opens up new possibilities for lifelike experiences in various applications, where users can explore dynamic objects or characters from any viewpoint. Meanwhile, video generative models are receiving particular attention given their ability to produce realistic and imaginative frames. These models are also observed to exhibit strong 3D consistency, indicating the potential to act as world simulators. In this work, we present Video4DGen, a novel framework that excels in generating 4D representations from single or multiple generated videos as well as generating 4D-guided videos. This framework is pivotal for creating high-fidelity virtual contents that maintain both spatial and temporal coherence. The 4D outputs generated by Video4DGen are represented using our proposed Dynamic Gaussian Surfels (DGS), which optimizes time-varying warping functions to transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. We design warped-state geometric regularization and refinements on Gaussian surfels, to preserve the structural integrity and fine-grained appearance details. To perform 4D generation from multiple videos and capture representation across spatial, temporal, and pose dimensions, we design multi-video alignment, root pose optimization, and pose-guided frame sampling strategies. The leveraging of continuous warping fields also enables a precise depiction of pose, motion, and deformation over per-video frames. Further, to improve the overall fidelity from the observation of all camera poses, Video4DGen performs novel-view video generation guided by the 4D content, with the proposed confidence-filtered DGS to enhance the quality of generated sequences. With the ability of 4D and video generation, Video4DGen offers a powerful tool for applications in virtual reality, animation, and beyond.
中文: Video4DGen是一种创新框架,能从单个或多个视频生成动态4D表示和视频,通过动态高斯表面元素技术保持时空一致性,为虚拟现实和动画等应用提供高保真内容。
English: Video4DGen is a novel framework that generates dynamic 4D representations and videos from single or multiple inputs, using Dynamic Gaussian Surfels to ensure spatial-temporal coherence and high fidelity for applications like virtual reality and animation.

Authors:Yongchuan Cui, Jinhe Zhang, Peng Liu, Weijing Song, Yi Zeng
Title: Overcoming the Identity Mapping Problem in Self-Supervised Hyperspectral Anomaly Detection
Abstract:
The surge of deep learning has catalyzed considerable progress in self-supervised Hyperspectral Anomaly Detection (HAD). The core premise for self-supervised HAD is that anomalous pixels are inherently more challenging to reconstruct, resulting in larger errors compared to the background. However, owing to the powerful nonlinear fitting capabilities of neural networks, self-supervised models often suffer from the Identity Mapping Problem (IMP). The IMP manifests as a tendency for the model to overfit to the entire image, particularly with increasing network complexity or prolonged training iterations. Consequently, the whole image can be precisely reconstructed, and even the anomalous pixels exhibit imperceptible errors, making them difficult to detect. Despite the proposal of several models aimed at addressing IMP-related issues, a unified descriptive framework and validation of solutions for the IMP remain lacking. In this paper, we conduct an in-depth exploration of the IMP and summarize a unified framework that describes it from the perspective of network optimization, encompassing three aspects: perturbation, reconstruction, and regularization. Correspondingly, we introduce three solutions: superpixel pooling and unpooling for perturbation, error-adaptive convolution for reconstruction, and online background pixel mining for regularization. With extensive experiments conducted to validate their effectiveness, we hope that our work will provide valuable insights and inspire further research on self-supervised HAD. Code: https://github.com/yc-cui/Super-AD.
Chinese: 本文针对自监督高光谱异常检测中的恒等映射问题,提出了一个包含扰动、重建和正则化三个方面的统一框架及相应解决方案,并通过大量实验验证了其有效性。
English: This paper addresses the Identity Mapping Problem in self-supervised Hyperspectral Anomaly Detection by proposing a unified framework with three solutions—perturbation, reconstruction, and regularization—validated through extensive experiments.

Authors:Xiao-Hui Li, Fei Yin, Cheng-Lin Liu
Title: DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
Abstract:
Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at https://github.com/xhli-git/DocSAM.
Chinese: DocSAM是一种基于Transformer的统一框架,通过结合实例分割和语义分割处理多种文档图像分割任务,在提升泛化能力和效率的同时减少了资源消耗。
English: DocSAM is a unified transformer-based framework that integrates instance and semantic segmentation to handle diverse document image segmentation tasks, improving generalization and efficiency while reducing resource usage.
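The category prediction step described above, dot products between instance and semantic queries followed by softmax normalization, can be sketched in a few lines; the tensor names are illustrative.

```python
import torch

def classify_instances(instance_queries: torch.Tensor,
                       semantic_queries: torch.Tensor) -> torch.Tensor:
    """Score each instance query against every semantic (category-name) query with a
    dot product, then softmax over categories.
    instance_queries: [num_instances, dim]; semantic_queries: [num_classes, dim]."""
    logits = instance_queries @ semantic_queries.t()   # [num_instances, num_classes]
    return logits.softmax(dim=-1)
```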

Authors:Conghao Xiong, Hao Chen, Joseph J. Y. Sung
Title: A Survey of Pathology Foundation Model: Progress and Future Directions
Abstract:
Computational pathology, which involves analyzing whole slide images for automated cancer diagnosis, relies on multiple instance learning, where performance depends heavily on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced both the extractor and aggregator, but they lack a systematic analysis framework. In this survey, we present a hierarchical taxonomy organizing PFMs through a top-down philosophy applicable to foundation model analysis in any domain: model scope, model pretraining, and model design. Additionally, we systematically categorize PFM evaluation tasks into slide-level, patch-level, multimodal, and biological tasks, providing comprehensive benchmarking criteria. Our analysis identifies critical challenges in both PFM development (pathology-specific methodology, end-to-end pretraining, data-model scalability) and utilization (effective adaptation, model maintenance), paving the way for future directions in this promising field. Resources referenced in this survey are available at https://github.com/BearCleverProud/AwesomeWSI.
中文摘要:本综述提出了一种层次化分类法,通过模型范围、预训练和设计三个维度系统分析病理学基础模型,并划分评估任务类别,指出了模型开发与应用中的关键挑战,为计算病理学发展指明方向。
English Summary: This survey introduces a hierarchical taxonomy for analyzing Pathology Foundation Models (PFMs) by examining model scope, pretraining, and design, while categorizing evaluation tasks and identifying key challenges in development and utilization to advance computational pathology.

Authors:Shintaro Shiba, Yoshimitsu Aoki, Guillermo Gallego
Title: Simultaneous Motion And Noise Estimation with Event Cameras
Abstract:
Event cameras are emerging vision sensors whose noise is challenging to characterize. Existing denoising methods for event cameras are often designed in isolation and thus consider other tasks, such as motion estimation, separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. We propose, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the one-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while demonstrating effectiveness across motion estimation and intensity reconstruction tasks. Our approach advances event-data denoising theory and expands practical denoising use-cases via open-source code. Project page: https://github.com/tub-rip/ESMD
中文: 本文提出了首个同时估计事件相机中运动与噪声的方法,在实现先进去噪性能的同时,提高了运动估计任务的灵活性。
English: This paper introduces the first method to jointly estimate motion and noise in event cameras, achieving state-of-the-art denoising performance while enhancing flexibility across motion estimation tasks.

Authors:Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong
Title: Window Token Concatenation for Efficient Visual Large Language Models
Abstract:
To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one, and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging tokens within the same window to exhibit similar features. To further enhance the performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. Such a design enjoys the merits of the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors. The code is available at: https://github.com/JackYFL/WiCo.
中文摘要:作者提出WiCo方法,通过滑动窗口拼接相邻视觉标记以减少VLLMs中的视觉标记数量,并进一步推出WiCo+在语言模型深层分解视觉标记以提升细粒度视觉理解能力,实验表明该方法在粗细粒度视觉任务上均优于现有标记缩减方案。
English Summary: The authors propose WiCo, a method that uses a sliding window to concatenate adjacent visual tokens in VLLMs, and enhance it with WiCo+ for fine-grained tasks by decomposing tokens in later layers, achieving superior performance in both coarse and fine-grained visual understanding tasks.
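A minimal sketch of window-based token concatenation for a square grid of visual tokens, assuming non-overlapping windows and a linear projection back to the original embedding size; WiCo's actual projector and window handling may differ.

```python
import torch
import torch.nn as nn

class WindowTokenConcat(nn.Module):
    """Reduce a grid of visual tokens by concatenating each non-overlapping w x w
    window of spatially adjacent tokens and projecting back to the embedding size."""
    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, H*W, C] laid out row-major on an H x W grid (H == W assumed)
        B, N, C = tokens.shape
        H = W = int(N ** 0.5)
        w = self.window
        x = tokens.view(B, H, W, C)
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, (H // w) * (W // w), w * w * C)  # concatenate tokens per window
        return self.proj(x)
```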

Authors:Houzhang Fang, Xiaolin Wang, Zengyang Li, Lu Wang, Qingshan Li, Yi Chang, Luxin Yan
Title: Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection
Abstract:
Infrared unmanned aerial vehicle (UAV) images captured using thermal detectors are often affected by temperature-dependent low-frequency nonuniformity, which significantly reduces the contrast of the images. Detecting UAV targets under nonuniform conditions is crucial in UAV surveillance applications. Existing methods typically treat infrared nonuniformity correction (NUC) as a preprocessing step for detection, which leads to suboptimal performance. Balancing the two tasks while enhancing detection-beneficial information remains challenging. In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses both infrared NUC and UAV target detection tasks in an end-to-end manner. We first model NUC as the estimation of a small number of parameters, jointly driven by priors and data, to generate detection-conducive images. Then, we incorporate a new auxiliary loss with target mask supervision into the backbone of the infrared UAV target detection network to strengthen target features while suppressing the background. To better balance correction and detection, we introduce a detection-guided self-supervised loss to reduce feature discrepancies between the two tasks, thereby enhancing detection robustness to varying nonuniformity levels. Additionally, we construct a new benchmark composed of 50,000 infrared images with various nonuniformity types, multi-scale UAV targets and rich backgrounds with target annotations, called IRBFD. Extensive experiments on IRBFD demonstrate that our UniCD is a robust union framework for NUC and UAV target detection while achieving real-time processing capabilities. The dataset is available at https://github.com/IVPLaboratory/UniCD.
中文摘要:本文提出UniCD端到端联合框架,通过参数估计和辅助损失同时实现红外非均匀性校正与无人机目标检测,在提升检测鲁棒性的同时保持实时处理能力。
English Summary: The paper introduces UniCD, an end-to-end framework that jointly performs infrared nonuniformity correction and UAV target detection through parameter estimation and auxiliary losses to enhance robustness and real-time processing.

Authors:Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani
Title: DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
Abstract:
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.

Authors:Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova
Title: VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Abstract:
We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
中文: 我们推出VideoComp基准和学习框架,旨在通过多事件视频中的时序对齐测试及分层成对偏好损失与预训练策略,提升视觉语言模型在视频文本组合性理解方面的细粒度能力。
English: We introduce VideoComp, a benchmark and framework for enhancing video-text compositionality in vision-language models by testing temporal alignment through challenging disruptions in multi-event videos and improving performance with a hierarchical pairwise loss and pretraining strategy.
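One plausible reading of the hierarchical pairwise preference loss is a ranking objective whose margin grows with the severity of the temporal disruption. Below is a sketch under that assumption; the names and the linear margin schedule are illustrative, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def hierarchical_pairwise_loss(pos_score: torch.Tensor,
                               neg_scores: torch.Tensor,
                               base_margin: float = 0.1) -> torch.Tensor:
    """Rank the temporally correct video-text pair above each negative, with a margin
    that grows with the severity of the disruption.
    pos_score: [B]; neg_scores: [B, L] ordered from mild to severe disruptions."""
    levels = torch.arange(1, neg_scores.shape[1] + 1, device=neg_scores.device)
    margins = base_margin * levels                        # larger margin for worse negatives
    return F.relu(margins + neg_scores - pos_score[:, None]).mean()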

Authors:Arash Sajjadi, Mark Eramian
Title: TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning
Abstract:
TGraphX presents a novel paradigm in deep learning by unifying convolutional neural networks (CNNs) with graph neural networks (GNNs) to enhance visual reasoning tasks. Traditional CNNs excel at extracting rich spatial features from images but lack the inherent capability to model inter-object relationships. Conversely, conventional GNNs typically rely on flattened node features, thereby discarding vital spatial details. TGraphX overcomes these limitations by employing CNNs to generate multi-dimensional node features (e.g., 3×128×128 tensors) that preserve local spatial semantics. These spatially aware nodes participate in a graph where message passing is performed using 1×1 convolutions, which fuse adjacent features while maintaining their structure. Furthermore, a deep CNN aggregator with residual connections is used to robustly refine the fused messages, ensuring stable gradient flow and end-to-end trainability. Our approach not only bridges the gap between spatial feature extraction and relational reasoning but also demonstrates significant improvements in object detection refinement and ensemble reasoning.
中文: TGraphX提出了一种将卷积神经网络与图神经网络相结合的新范式,通过保留空间语义的多维节点特征和结构化信息传递机制,显著提升了视觉推理任务中的物体检测和关系推理性能。
English: TGraphX introduces a unified framework combining CNNs and GNNs to enhance visual reasoning by preserving spatial semantics through multi-dimensional node features and structured message passing, achieving notable improvements in object detection and relational tasks.
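A minimal sketch of message passing over tensor-valued nodes with a 1×1 convolution, assuming each node carries a C×H×W feature map and a simple residual update; the module and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialMessagePassing(nn.Module):
    """One round of message passing over tensor-valued nodes: each node's C x H x W
    feature map is fused with its neighbors' maps through a 1x1 convolution,
    preserving the spatial layout of the features."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, node_feats: torch.Tensor, edges: list) -> torch.Tensor:
        # node_feats: [N, C, H, W]; edges: directed (src, dst) index pairs
        messages = torch.zeros_like(node_feats)
        for src, dst in edges:
            pair = torch.cat([node_feats[dst], node_feats[src]], dim=0).unsqueeze(0)
            messages[dst] = messages[dst] + self.fuse(pair).squeeze(0)
        return node_feats + messages  # residual update
```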

Authors:Tyler Ward, Abdullah-Al-Zubaer Imran
Title: Improving Brain Disorder Diagnosis with Advanced Brain Function Representation and Kolmogorov-Arnold Networks
Abstract:
Quantifying functional connectivity (FC), a vital metric for the diagnosis of various brain disorders, traditionally relies on the use of a pre-defined brain atlas. However, using such atlases can lead to issues regarding selection bias and lack of regard for specificity. Addressing this, we propose a novel transformer-based classification network (ABFR-KAN) with effective brain function representation to aid in diagnosing autism spectrum disorder (ASD). ABFR-KAN leverages Kolmogorov-Arnold Network (KAN) blocks replacing traditional multi-layer perceptron (MLP) components. Thorough experimentation reveals the effectiveness of ABFR-KAN in improving the diagnosis of ASD under various configurations of the model architecture. Our code is available at https://github.com/tbwa233/ABFR-KAN
中文: 本研究提出ABFR-KAN,一种基于Transformer的网络,采用Kolmogorov-Arnold网络模块来改进自闭症谱系障碍的诊断,通过优化功能连接表征并克服图谱选择偏差。
English: The study introduces ABFR-KAN, a transformer-based network using Kolmogorov-Arnold Network blocks to enhance autism spectrum disorder diagnosis by improving functional connectivity representation and overcoming atlas-related biases.

Authors:Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen, Khai Loong Aw, Daniel L. K. Yamins
Title: 3D Scene Understanding Through Local Random Access Sequence Modeling
Abstract:
3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling approaches have shown promise, they often struggle to maintain object and scene consistency, especially in complex real-world scenarios. To address these limitations, we propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling, which uses local patch quantization and randomly ordered sequence generation. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities. Furthermore, we show that our framework naturally extends to self-supervised depth estimation through a simple modification of the sequence design. By achieving strong performance on multiple 3D scene understanding tasks, LRAS provides a unified and effective framework for building the next generation of 3D vision models.

Authors:Ved Umrajkar, Aakash Kumar Singh
Title: Detection Limits and Statistical Separability of Tree Ring Watermarks in Rectified Flow-based Text-to-Image Generation Models
Abstract:
Tree-Ring Watermarking is a significant technique for authenticating AI-generated images. However, its effectiveness in rectified flow-based models remains unexplored, particularly given the inherent challenges of these models with noise latent inversion. Through extensive experimentation, we evaluated and compared the detection and separability of watermarks between SD 2.1 and FLUX.1-dev models. By analyzing various text guidance configurations and augmentation attacks, we demonstrate how inversion limitations affect both watermark recovery and the statistical separation between watermarked and unwatermarked images. Our findings provide valuable insights into the limitations of Tree-Ring Watermarking in current SOTA models and highlight the critical need for improved inversion methods to achieve reliable watermark detection and separability. The official implementation, dataset release, and all experimental results are available at https://github.com/dsgiitr/flux-watermarking.
Chinese: 树环水印技术在如FLUX.1-dev等整流流模型中因噪声潜在反转的挑战而存在局限性,影响了水印恢复以及水印与非水印图像间的统计可分离性。
English: Tree-Ring Watermarking faces limitations in rectified flow-based models like FLUX.1-dev due to noise latent inversion challenges, affecting both watermark recovery and statistical separability between watermarked and unwatermarked images.

Authors:Teodor Chiaburu, Felix Bießmann, Frank Haußer
Title: Uncertainty Propagation in XAI: A Comparison of Analytical and Empirical Estimators
Abstract:
Understanding uncertainty in Explainable AI (XAI) is crucial for building trust and ensuring reliable decision-making in Machine Learning models. This paper introduces a unified framework for quantifying and interpreting Uncertainty in XAI by defining a general explanation function $e_\theta(x, f)$ that captures the propagation of uncertainty from key sources: perturbations in input data and model parameters. By using both analytical and empirical estimates of explanation variance, we provide a systematic means of assessing the impact of uncertainty on explanations. We illustrate the approach using a first-order uncertainty propagation as the analytical estimator. In a comprehensive evaluation across heterogeneous datasets, we compare analytical and empirical estimates of uncertainty propagation and evaluate their robustness. Extending previous work on inconsistencies in explanations, our experiments identify XAI methods that do not reliably capture and propagate uncertainty. Our findings underscore the importance of uncertainty-aware explanations in high-stakes applications and offer new insights into the limitations of current XAI methods. The code for the experiments can be found in our repository at https://github.com/TeodorChiaburu/UXAI
中文: 本文提出了一个统一框架,通过分析输入数据和模型参数变化对解释的影响来量化可解释人工智能中的不确定性,揭示了现有方法的局限性,并强调了在关键应用中采用不确定性感知方法的重要性。
English: This paper presents a unified framework for quantifying uncertainty in Explainable AI by analyzing how input data and model parameter variations affect explanations, revealing limitations in current methods and emphasizing the need for uncertainty-aware approaches in critical applications.
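As a toy illustration of the analytical-versus-empirical comparison, the sketch below propagates input-perturbation covariance through a gradient explanation of a quadratic model using the first-order (delta-method) estimate Var[e] ≈ J Σ_x Jᵀ and checks it against a Monte Carlo estimate. The model, covariance, and sample size are arbitrary choices for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
x0 = np.array([1.0, -1.0])
Sigma_x = 0.01 * np.eye(2)                     # input-perturbation covariance

def explanation(x):
    return A @ x                               # gradient explanation of f(x) = 0.5 * x^T A x

# Analytical (first-order): Var[e] ~ J Sigma_x J^T, with Jacobian J = A (exact here, e is linear)
J = A
cov_analytical = J @ Sigma_x @ J.T

# Empirical: perturb the input and take the sample covariance of the explanations
samples = rng.multivariate_normal(x0, Sigma_x, size=5000)
expls = np.stack([explanation(s) for s in samples])
cov_empirical = np.cov(expls, rowvar=False)

print(np.round(cov_analytical, 4))
print(np.round(cov_empirical, 4))              # should closely match the analytical estimate
```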

Authors:Zhiqiang Wang, Pengbin Feng, Yanbin Lin, Shuzhang Cai, Zongao Bian, Jinghua Yan, Xingquan Zhu
Title: CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
Abstract:
We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL(3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2(90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
中文: 我们提出了FGRPR框架,将GRPO与模糊奖励函数结合以提高学习效率,通过提供细致激励促进精确输出,在多个数据集上超越了包括GPT4o和SFT在内的基线模型。
English: We propose FGRPR, a framework combining GRPO with a fuzzy reward function to improve learning efficiency, outperforming baseline models including GPT4o and SFT across datasets by providing nuanced incentives for precise outputs.
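The contrast between a binary accuracy reward and a fuzzy reward can be sketched in a few lines; the exact functional form used by FGRPR is not given here, so the relative-error ramp below is just one plausible choice for a counting task.

```python
def binary_reward(pred: float, target: float) -> float:
    """Conventional 0/1 accuracy reward: only exact matches are rewarded."""
    return 1.0 if round(pred) == round(target) else 0.0

def fuzzy_reward(pred: float, target: float, tol: float = 0.5) -> float:
    """Smooth reward that grows as the prediction approaches the target count."""
    rel_err = abs(pred - target) / max(target, 1.0)
    return max(0.0, 1.0 - rel_err / tol)   # 1 at an exact match, 0 once relative error >= tol

for p in (100, 95, 80, 40):
    print(p, binary_reward(p, 100), round(fuzzy_reward(p, 100), 2))
# e.g. a prediction of 95 gets 0 under the binary reward but 0.9 under the fuzzy reward
```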

Authors:Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan
Title: MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
Abstract:
Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.
中文: 本文提出一个综合评估框架,通过标准化传统任务和引入新型混合模态生成测试来解决现有MLLM基准的不足,在对12个领先统一多模态模型的系统测试中揭示了当前模型存在的显著性能差距。
English: This paper introduces a comprehensive evaluation framework addressing the limitations of existing MLLM benchmarks by standardizing traditional tasks and introducing novel mixed-modality generation assessments, revealing significant performance gaps in current unified multimodal models through systematic testing of 12 leading systems.

Authors:Niu Lian, Jun Li, Jinpeng Wang, Ruisheng Luo, Yaowei Wang, Shu-Tao Xia, Bin Chen
Title: AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing
Abstract:
Self-Supervised Video Hashing (SSVH) compresses videos into hash codes for efficient indexing and retrieval using unlabeled training videos. Existing approaches rely on random frame sampling to learn video features and treat all frames equally. This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. Our adversarial sampling strategy automatically identifies and selects challenging frames with richer information for reconstruction, enhancing encoding capability. Additionally, we introduce a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective, which help capture complex inter-video semantic relationships in the Hamming space and improve the discriminability of learned hash codes. Extensive experiments demonstrate that AutoSSVH achieves superior retrieval efficacy and efficiency compared to state-of-the-art approaches. Code is available at https://github.com/EliSpectre/CVPR25-AutoSSVH.
中文:AutoSSVH通过对抗性采样选择信息丰富的视频帧,并结合对比学习与投票策略,显著提升了哈希码的区分度和视频检索效果。
English: AutoSSVH enhances video hashing by using adversarial sampling to select information-rich frames and employs contrastive learning with a voting strategy to improve hash code discriminability and retrieval performance.

Authors:Runnan Fang, Xiaobin Wang, Yuan Liang, Shuofei Qiao, Jialong Wu, Zekun Xi, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Title: SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Abstract:
In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.
中文摘要:SynWorld框架通过合成多步动作场景和执行蒙特卡洛树搜索探索,使基于大语言模型的智能体能够自主优化工作流程并增强动作理解,实验证明其在新环境中学习动作知识的有效性。
English Summary: SynWorld is a framework that enables LLM-based agents to autonomously explore novel environments by synthesizing scenarios and using Monte Carlo Tree Search to refine their action knowledge, proving effective in experimental evaluations.

Authors:Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Title: Agentic Knowledgeable Self-awareness
Abstract:
Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making, that is, the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.
中文摘要:提出的KnowSelf范式通过情境感知使基于大语言模型的智能体能够自主调控知识运用,在不同任务中以最少的外部知识实现更优的规划效果。
English Summary: The proposed KnowSelf paradigm enables LLM-based agents to autonomously regulate knowledge usage through situational awareness, achieving superior planning performance with minimal external knowledge across various tasks.

Authors:Xi Wang, Ziqi He, Yang Zhou
Title: Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis
Abstract:
Traditional diffusion models typically employ a U-Net architecture. Previous studies have unveiled the roles of attention blocks in the U-Net. However, they overlook the dynamic evolution of their importance during the inference process, which hinders their further exploitation to improve image applications. In this study, we first theoretically prove that re-weighting the outputs of the Transformer blocks within the U-Net is a "free lunch" for improving the signal-to-noise ratio during the sampling process. Next, we propose Importance Probe to uncover and quantify the dynamic shifts in importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that our approach significantly improves the efficiency of the inference process and enhances the aesthetic quality of the samples with identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: https://github.com/Hytidel/UNetReweighting
中文: 本研究提出了一种针对基于U-Net的扩散模型中Transformer模块的自适应重加权方法,能在推理过程中动态调整其重要性,从而提升信噪比,并在图像生成与编辑任务中同时提高效率与美学质量。
English: This study introduces an adaptive re-weighting method for Transformer blocks in U-Net-based diffusion models, which dynamically adjusts their importance during inference to enhance signal-to-noise ratio and improve both efficiency and aesthetic quality in image generation and editing tasks.
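A minimal sketch of the re-weighting idea: each residual Transformer block's contribution is scaled by a per-step importance weight, with weight 1.0 recovering the original block. The toy blocks and placeholder schedule below are assumptions for illustration, not the paper's Importance Probe or its learned schedule.

```python
import torch
import torch.nn as nn

class ReweightedResidual(nn.Module):
    """Residual block whose learned update is scaled by an external importance weight."""
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, x: torch.Tensor, weight: float) -> torch.Tensor:
        return x + weight * self.inner(x)     # weight = 1.0 recovers the standard block

def importance_schedule(step: int, num_steps: int, num_blocks: int) -> list:
    # placeholder schedule: later denoising steps slightly damp deeper blocks
    progress = step / max(num_steps - 1, 1)
    return [1.0 - 0.1 * progress * i / max(num_blocks - 1, 1) for i in range(num_blocks)]

blocks = nn.ModuleList([
    ReweightedResidual(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)))
    for _ in range(4)
])

x = torch.randn(2, 64)                        # stand-in for a token sequence inside the U-Net
for step in range(3):                         # stand-in for denoising steps
    weights = importance_schedule(step, num_steps=3, num_blocks=len(blocks))
    for blk, w in zip(blocks, weights):
        x = blk(x, w)
print(x.shape)                                # torch.Size([2, 64])
```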

Authors:Nasar Iqbal, Niki Martinel
Title: Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection
Abstract:
Recent advances in convolutional neural networks (CNNs) and transformer-based methods have improved anomaly detection and localization, but challenges persist in precisely localizing small anomalies. While CNNs face limitations in capturing long-range dependencies, transformer architectures often suffer from substantial computational overheads. We introduce a state space model (SSM)-based Pyramidal Scanning Strategy (PSS) for multi-class anomaly detection and localization, a novel approach designed to address the challenge of small anomaly localization. Our method captures fine-grained details at multiple scales by integrating the PSS with a pre-trained encoder for multi-scale feature extraction and a feature-level synthetic anomaly generator. An improvement of +1% AP for multi-class anomaly localization and a +1% increase in AU-PRO on the MVTec benchmark demonstrate our method's superiority in precise anomaly localization across diverse industrial scenarios. The code is available at https://github.com/iqbalmlpuniud/Pyramid Mamba.
中文: 提出的金字塔扫描策略(基于状态空间模型)通过多尺度特征提取和合成异常生成,提升了多类别异常检测与定位的精度,在MVTec基准测试中AP和AU-PRO指标均提高1%,显著改善了小异常定位能力。
English: The proposed Pyramidal Scanning Strategy (SSM-based PSS) enhances multi-class anomaly detection and localization by capturing fine-grained details at multiple scales, achieving a +1% improvement in AP and AU-PRO on the MVTec benchmark for precise small anomaly identification.

Authors:Denis Coquenet
Title: Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition
Abstract:
Recent advances in text recognition led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling a better context modeling. It relies on two main components: windowed queries, to process several transformer queries altogether, enlarging the context modeling with near future; and multi-token predictions, whose goal is to predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at https://github.com/FactoDeepLearning/meta_dan.
中文摘要:Meta-DAN通过引入窗口化查询和多令牌预测技术,显著降低了页面文本识别的预测时间并增强了上下文建模能力,在手写数据集上取得了领先的性能表现。
English Summary: The Meta Document Attention Network (Meta-DAN) introduces windowed queries and multi-token predictions to significantly reduce prediction time and improve context modeling in page-level text recognition, achieving state-of-the-art results on handwritten datasets.

Authors:Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
Title: SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding
Abstract:
Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. However, their application to SAR images is severely constrained by the absence of SAR-specific knowledge in their training distributions, leading to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. SARLANG-1M comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs spanning seven applications and 1,012 question types. Extensive experiments on mainstream VLMs demonstrate that fine-tuning with SARLANG-1M significantly enhances their performance in SAR image interpretation, reaching performance comparable to human experts. The dataset and code will be made publicly available at https://github.com/Jimmyxichen/SARLANG-1M.
中文:SARLANG-1M是一个大规模多模态基准数据集,通过提供超过100万个SAR图像-文本对,有效弥补了SAR图像解译的不足,使得视觉语言模型在微调后能达到接近人类专家的性能水平。
English: SARLANG-1M is a large-scale multimodal benchmark designed to bridge the gap in SAR image interpretation by providing over 1 million image-text pairs, which significantly enhances the performance of Vision-Language Models to near-human expert levels after fine-tuning.

Authors:Jiaxin Guo, Wenzhen Dong, Tianyu Huang, Hao Ding, Ziyi Wang, Haomin Kuang, Qi Dou, Yun-Hui Liu
Title: Endo3R: Unified Online Reconstruction from Dynamic Monocular Endoscopic Video
Abstract:
Reconstructing 3D scenes from monocular surgical videos can enhance surgeons' perception and therefore plays a vital role in various computer-assisted surgery tasks. However, achieving scale-consistent reconstruction remains an open challenge due to inherent issues in endoscopic videos, such as dynamic deformations and textureless surfaces. Despite recent advances, current methods either rely on calibration or instrument priors to estimate scale, or employ SfM-like multi-stage pipelines, leading to error accumulation and requiring offline optimization. In this paper, we present Endo3R, a unified 3D foundation model for online scale-consistent reconstruction from monocular surgical video, without any priors or extra optimization. Our model unifies the tasks by predicting globally aligned pointmaps, scale-consistent video depths, and camera parameters without any offline optimization. The core contribution of our method is expanding the capability of the recent pairwise reconstruction model to long-term incremental dynamic reconstruction by an uncertainty-aware dual memory mechanism. The mechanism maintains history tokens of both short-term dynamics and long-term spatial consistency. Notably, to tackle the highly dynamic nature of surgical scenes, we measure the uncertainty of tokens via Sampson distance and filter out tokens with high uncertainty. Given the scarcity of endoscopic datasets with ground-truth depth and camera poses, we further devise a self-supervised mechanism with a novel dynamics-aware flow loss. Extensive experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in zero-shot surgical video depth prediction and camera pose estimation with online efficiency. Project page: https://wrld.github.io/Endo3R/.

Authors:Xin Zhang, Robby T. Tan
Title: Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Abstract:
Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at https://github.com/devinxzhang/MFuser.
中文: 提出的MFuser框架通过基于Mamba的融合机制,有效结合视觉基础模型和视觉语言模型的互补优势,在保持线性计算复杂度的同时实现了领域泛化语义分割的卓越性能。
English: The proposed MFuser framework effectively integrates Vision Foundation Models and Vision-Language Models through Mamba-based fusion to achieve superior domain generalization in semantic segmentation with linear computational scalability.

Authors:Zeyang Zheng, Arman Hosseini, Dong Chen, Omid Shoghli, Arsalan Heydarian
Title: Real-Time Roadway Obstacle Detection for Electric Scooters Using Deep Learning and Multi-Sensor Fusion
Abstract:
The increasing adoption of electric scooters (e-scooters) in urban areas has coincided with a rise in traffic accidents and injuries, largely due to their small wheels, lack of suspension, and sensitivity to uneven surfaces. While deep learning-based object detection has been widely used to improve automobile safety, its application for e-scooter obstacle detection remains unexplored. This study introduces a novel ground obstacle detection system for e-scooters, integrating an RGB camera and a depth camera to enhance real-time road hazard detection. Additionally, the Inertial Measurement Unit (IMU) measures linear vertical acceleration to identify surface vibrations, guiding the selection of six obstacle categories: tree branches, manhole covers, potholes, pine cones, non-directional cracks, and truncated domes. All sensors, including the RGB camera, depth camera, and IMU, are integrated within the Intel RealSense Camera D435i. A deep learning model powered by YOLO detects road hazards and utilizes depth data to estimate obstacle proximity. Evaluated on seven hours of naturalistic riding data, the system achieves a high mean average precision (mAP) of 0.827 and demonstrates excellent real-time performance. This approach provides an effective solution to enhance e-scooter safety through advanced computer vision and data fusion. The dataset is accessible at https://zenodo.org/records/14583718, and the project code is hosted on https://github.com/Zeyang-Zheng/Real-Time-Roadway-Obstacle-Detection-for-Electric-Scooters.
中文: 本研究为电动滑板车开发了一种集成RGB深度相机和IMU的实时地面障碍物检测系统,通过基于YOLO的深度学习模型实现了对道路危险的高精度识别。
English: This study develops a real-time ground obstacle detection system for electric scooters using an integrated RGB-depth camera and IMU, achieving high accuracy in identifying road hazards through a YOLO-based deep learning model.
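A small sketch of the depth-fusion step: given a detector box, the median depth inside the box gives a robust proximity estimate. The array shapes, the zero-depth convention for missing pixels, and the 2 m alert threshold are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np

def obstacle_distance(depth_m: np.ndarray, box_xyxy: tuple) -> float:
    """Median depth (meters) inside a detected box; robust to holes in the depth map."""
    x1, y1, x2, y2 = box_xyxy
    patch = depth_m[y1:y2, x1:x2]
    valid = patch[patch > 0]                 # many depth cameras report 0 for missing pixels
    return float(np.median(valid)) if valid.size else float("inf")

depth = np.full((480, 640), 3.0, dtype=np.float32)   # synthetic depth map, 3 m background
depth[200:260, 300:380] = 1.2                        # a pothole-like region 1.2 m ahead
dist = obstacle_distance(depth, (300, 200, 380, 260))
if dist < 2.0:                                       # example alert threshold
    print(f"obstacle ahead at {dist:.1f} m")
```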

Authors:Xin Jin, Simon Niklaus, Zhoutong Zhang, Zhihao Xia, Chunle Guo, Yuting Yang, Jiawen Chen, Chongyi Li
Title: Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable
Abstract:
Denoising is a crucial step in many video processing pipelines such as in interactive editing, where high quality, speed, and user control are essential. While recent approaches achieve significant improvements in denoising quality by leveraging deep learning, they are prone to unexpected failures due to discrepancies between training data distributions and the wide variety of noise patterns found in real-world videos. These methods also tend to be slow and lack user control. In contrast, traditional denoising methods perform reliably on in-the-wild videos and run relatively quickly on modern hardware. However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. A neural network is then trained to predict the optimal denoising parameters for each specific input, resulting in a robust and efficient approach that also supports user control.
中文: 本文提出了一种混合去噪方法,将传统方法的可靠性与神经网络相结合,自动预测最优参数,实现了鲁棒、高效且用户可控的视频去噪效果。
English: This paper introduces a hybrid denoising approach that combines traditional methods' reliability with a neural network to automatically predict optimal parameters, offering robust, efficient, and user-controllable video denoising.
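A minimal sketch of the hybrid idea under simplifying assumptions: a tiny network predicts the parameter of a differentiable classical filter (here a separable Gaussian blur standing in for a richer traditional denoiser), so gradients flow from the output back to the parameter predictor. The network, filter, and shapes below are illustrative, not the paper's pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_blur(x: torch.Tensor, sigma: torch.Tensor, ksize: int = 11) -> torch.Tensor:
    """Separable Gaussian blur whose strength is a differentiable tensor."""
    r = torch.arange(ksize, device=x.device, dtype=x.dtype) - ksize // 2
    k = torch.exp(-r**2 / (2 * sigma**2))
    k = (k / k.sum()).view(1, 1, 1, ksize)
    c = x.shape[1]
    x = F.conv2d(x, k.repeat(c, 1, 1, 1), padding=(0, ksize // 2), groups=c)                    # horizontal pass
    x = F.conv2d(x, k.transpose(-1, -2).repeat(c, 1, 1, 1), padding=(ksize // 2, 0), groups=c)  # vertical pass
    return x

# tiny parameter predictor: looks at the noisy frame and outputs a positive blur strength
param_net = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 1), nn.Softplus())

noisy = torch.rand(2, 3, 64, 64)
sigma = param_net(noisy).mean() + 0.1                 # predicted denoising strength
denoised = gaussian_blur(noisy, sigma)
loss = F.mse_loss(denoised, torch.rand_like(noisy))   # stand-in for a clean training target
loss.backward()                                       # gradients reach the parameter predictor
print(denoised.shape, float(sigma))
```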

Authors:Haozhan Tang, Tianyi Zhang, Oliver Kroemer, Matthew Johnson-Roberson, Weiming Zhi
Title: GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction
Abstract:
Robots operating in unstructured environments often require accurate and consistent object-level representations. This typically requires segmenting individual objects from the robot's surroundings. While recent large models such as Segment Anything (SAM) offer strong performance in 2D image segmentation, these advances do not translate directly to the physical 3D world, where they often over-segment objects and fail to produce consistent mask correspondences across views. In this paper, we present GraphSeg, a framework for generating consistent 3D object segmentations from a sparse set of 2D images of the environment without any depth information. GraphSeg adds edges to graphs and constructs dual correspondence graphs: one from 2D pixel-level similarities and one from inferred 3D structure. We formulate segmentation as a problem of edge addition, then subsequent graph contraction, which merges multiple 2D masks into unified object-level segmentations. We can then leverage 3D foundation models to produce segmented 3D representations. GraphSeg achieves robust segmentation with significantly fewer images and greater accuracy than prior methods. We demonstrate state-of-the-art performance on tabletop scenes and show that GraphSeg enables improved performance on downstream robotic manipulation tasks. Code available at https://github.com/tomtang502/graphseg.git.
中文: GraphSeg框架通过构建双重对应图并利用3D基础模型,仅从稀疏的2D图像即可生成一致的3D物体分割,在机器人操作任务中实现了超越现有方法的准确性和性能表现。
English: GraphSeg is a framework that generates consistent 3D object segmentations from sparse 2D images without depth information by constructing dual correspondence graphs and leveraging 3D foundation models, achieving superior accuracy and performance in robotic manipulation tasks.
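The edge-addition-then-contraction step can be illustrated with a union-find over per-view masks: adding a correspondence edge merges two masks' groups, and the surviving groups are object-level segments. This is a simplification; GraphSeg builds dual 2D/3D correspondence graphs rather than the single toy graph below.

```python
class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, a: int) -> int:
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]   # path compression
            a = self.parent[a]
        return a

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

# masks 0-4 come from different views; edges link masks judged to be the same object
num_masks = 5
correspondence_edges = [(0, 2), (2, 4), (1, 3)]

uf = UnionFind(num_masks)
for a, b in correspondence_edges:     # edge addition
    uf.union(a, b)                    # contraction: merge the two masks' groups

groups = {}
for m in range(num_masks):
    groups.setdefault(uf.find(m), []).append(m)
print(list(groups.values()))          # [[0, 2, 4], [1, 3]] -> two object-level segments
```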

Authors:Kahim Wong, Jicheng Zhou, Kemou Li, Yain-Whar Si, Xiaowei Wu, Jiantao Zhou
Title: FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge
Abstract:
The proliferation of AI-generated content raises significant forensic and security concerns, such as source tracing and copyright protection, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at https://github.com/KAHIMWONG/FontGuard.
中文摘要:FontGuard是一种创新的字体水印模型,通过修改隐藏样式特征嵌入信息,在保持字体高质量的同时显著提升了抗失真能力和解码准确率。
English Summary: FontGuard is a novel font watermarking model that enhances text security by embedding hidden style features, achieving superior visual quality and robustness against real-world distortions compared to existing methods.

Authors:Zhen Hao Sia, Yogesh Singh Rawat
Title: Scaling Open-Vocabulary Action Detection
Abstract:
In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work. Our code is available at https://siatheindochinese.github.io/sia_act_page/.

Authors:Junying Wang, Jingyuan Liu, Xin Sun, Krishna Kumar Singh, Zhixin Shu, He Zhang, Jimei Yang, Nanxuan Zhao, Tuanfeng Y. Wang, Simon S. Chen, Ulrich Neumann, Jae Shin Yoon
Title: Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization
Abstract:
This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of datasets, restricting existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in a coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns the lighting cycle consistency from many real-world videos without any ground truth. At inference time, our temporal lighting module is combined with the diffusion models through spatio-temporal feature blending without extra training, and we apply a new guided refinement as a post-processing step to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows strong generalizability and temporal lighting coherence, outperforming existing image-based human relighting and harmonization methods.

Authors:Xianwei Zhuang, Yuxin Xie, Yufan Deng, Dongchao Yang, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Title: VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Abstract:
In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds upon our previous framework VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy combining iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, demonstrating significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs), exhibiting promising scalability. The codebase and model weights are publicly available at https://github.com/VARGPT-family/VARGPT-v1.1.
中文:VARGPT-v1.1作为升级版统一视觉自回归模型,通过创新的训练策略和扩展数据集,在多模态理解和文本到图像生成任务中达到最优性能,并在保持架构不变的情况下展现出图像编辑的新能力。
English: VARGPT-v1.1 is an enhanced unified visual autoregressive model that achieves state-of-the-art performance in multimodal understanding and text-to-image generation through novel training strategies and expanded datasets, while demonstrating emergent image editing capabilities without architectural changes.

Authors:Rohit Agarwal, Aryan Dessai, Arif Ahmed Sekh, Krishna Agarwal, Alexander Horsch, Dilip K. Prasad
Title: Haphazard Inputs as Images in Online Learning
Abstract:
The field of varying feature space in online learning settings, also known as haphazard inputs, is very prominent nowadays due to its applicability in various fields. However, the current solutions to haphazard inputs are model-dependent and cannot benefit from the existing advanced deep-learning methods, which necessitate inputs of fixed dimensions. Therefore, we propose to transform the varying feature space in an online learning setting to a fixed-dimension image representation on the fly. This simple yet novel approach is model-agnostic, allowing any vision-based models to be applicable for haphazard inputs, as demonstrated using ResNet and ViT. The image representation handles the inconsistent input data seamlessly, making our proposed approach scalable and robust. We show the efficacy of our method on four publicly available datasets. The code is available at https://github.com/Rohit102497/HaphazardInputsAsImages.
中文: 本文提出了一种与模型无关的方法,将变化特征空间实时转换为固定维度的图像表示,使得ResNet和ViT等先进视觉模型能够处理随机输入,并在四个数据集上验证了其有效性。
English: This paper introduces a model-agnostic method that converts varying feature spaces into fixed-dimension image representations, enabling the use of advanced vision models like ResNet and ViT for haphazard inputs and demonstrating effectiveness across four datasets.
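A minimal sketch of the core transformation under assumed conventions: each feature name is hashed to a fixed cell of an image canvas and its value sets the cell intensity, so samples with different feature sets all become same-size images that a vision backbone can consume. The grid layout and CRC32 assignment are illustrative choices, not the paper's encoding.

```python
import zlib
import numpy as np

CANVAS, CELL = 224, 32                       # 7x7 grid of feature cells
GRID = CANVAS // CELL

def features_to_image(features: dict, max_abs: float = 10.0) -> np.ndarray:
    """Each feature name owns a fixed cell; its (clipped) value sets the cell intensity."""
    img = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    for name, value in features.items():
        cell = zlib.crc32(name.encode()) % (GRID * GRID)   # stable name -> cell assignment
        r, c = divmod(cell, GRID)
        intensity = np.clip(value / max_abs, -1.0, 1.0) * 0.5 + 0.5
        img[r * CELL:(r + 1) * CELL, c * CELL:(c + 1) * CELL] = intensity
    return img

# two online samples with different (haphazard) feature sets map to same-size images
print(features_to_image({"age": 3.2, "temp": -1.0}).shape)              # (224, 224)
print(features_to_image({"temp": 0.5, "pulse": 7.0, "bp": 2.0}).shape)  # (224, 224)
```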

Authors:Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang
Title: Multimodal Reference Visual Grounding
Abstract:
Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-72B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding, which has wide applications in robotics. Project page with our video, code, and dataset: https://irvlutd.github.io/MultiGrounding

Authors:Sudong Wang, Yunjian Zhang, Yao Zhu, Jianing Li, Zizhe Wang, Yanwei Liu, Xiangyang Ji
Title: Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits the further enhancement of their capabilities. In this paper, we seek to investigate how multimodal knowledge evolves and eventually induces natural languages in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and delve into the evolution of multimodal knowledge from three levels, including single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution: the critical layers and the mutation layers, dividing the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms. Our codes are available at https://github.com/XIAO4579/Vlm-interpretability.
中文摘要:本研究揭示了大型视觉语言模型中多模态知识的演化轨迹,通过识别关键层和突变层将演化过程划分为三个阶段,为理解其内在机制提供了新视角。
English Summary: This study investigates the evolution of multimodal knowledge in Large Vision-Language Models, identifying critical and mutation layers that divide the process into three distinct stages and offering new insights into their internal mechanisms.

Authors:Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Hancheng Min, Chris Callison-Burch, René Vidal
Title: Concept Lancet: Image Editing with Compositional Representation Transplant
Abstract:
Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
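The sparse decomposition step can be sketched with a small ISTA solver for the lasso: the latent is expressed as a sparse combination of concept vectors, and the recovered coefficients indicate which concepts are present and how strongly. The dictionary, penalty, and solver below are illustrative; CoLan's actual dictionary (CoLan-150K) and optimization may differ.

```python
import numpy as np

def sparse_decompose(z: np.ndarray, dictionary: np.ndarray, lam: float = 0.1, steps: int = 500):
    """Solve min_c 0.5*||z - D c||^2 + lam*||c||_1 with ISTA. D has shape (dim, n_concepts)."""
    D = dictionary
    step = 1.0 / np.linalg.norm(D.T @ D, 2)          # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = D.T @ (D @ c - z)
        c = c - step * grad
        c = np.sign(c) * np.maximum(np.abs(c) - step * lam, 0.0)   # soft thresholding
    return c

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 10))                        # 10 concept directions in a 64-dim latent
true_c = np.zeros(10)
true_c[[2, 7]] = [1.5, -0.8]                         # the latent uses only two concepts
z = D @ true_c + 0.01 * rng.normal(size=64)
c_hat = sparse_decompose(z, D)
print(np.round(c_hat, 2))                            # near zero except at concepts 2 and 7
```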

Authors:Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
Title: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Abstract:
Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To study this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose a robust evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and the LMM-as-a-judge approach. We conducted experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models. The evaluation results demonstrate that current models face significant challenges in reasoning-based editing tasks. Even the most powerful model evaluated, GPT-4o-Image, achieves an accuracy of merely 28.8%. RISEBench effectively highlights the limitations of contemporary editing models, provides valuable insights, and indicates potential future directions for the field of reasoning-aware visual editing. Our code and data have been released at https://github.com/PhoenixZ810/RISEBench.
中文: 大型多模态模型在推理式视觉编辑任务中表现不佳,新推出的RISEBench基准测试显示,即便是最优模型GPT-4o-Image的准确率也仅为28.8%,充分暴露了当前模型在指令推理和外观一致性方面存在的明显缺陷。
English: Large Multi-modality Models struggle with reasoning-based visual editing tasks, as demonstrated by the newly introduced RISEBench benchmark where even top-performing models like GPT-4o-Image achieve only 28.8% accuracy, highlighting significant limitations in instruction reasoning and appearance consistency.

Authors:Divya Velayudhan, Abdelfatah Ahmed, Mohamad Alansari, Neha Gour, Abderaouf Behouch, Taimur Hassan, Syed Talal Wasim, Nabil Maalej, Muzammal Naseer, Juergen Gall, Mohammed Bennamoun, Ernesto Damiani, Naoufel Werghi
Title: STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Abstract:
Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, leading to multi-modal instruction-following data for X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.

Authors:Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata
Title: Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Abstract:
Given that interpretability and steerability are crucial to AI safety, Sparse Autoencoders (SAEs) have emerged as a tool to enhance them in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron-level in vision representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Notably, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code is available at https://github.com/ExplainableML/sae-for-vlm.
Chinese: 稀疏自编码器(SAEs)应用于视觉语言模型(VLMs),通过增强神经元层面的单义性并直接引导模型输出,无需修改底层架构即可提升可解释性和控制能力。
English: Sparse Autoencoders (SAEs) are applied to Vision-Language Models (VLMs) to improve neuron-level monosemanticity and enable direct steering of model outputs, enhancing interpretability and control without modifying the underlying architecture.
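A minimal sparse autoencoder of the kind applied to vision embeddings: an overcomplete ReLU bottleneck trained with a reconstruction loss plus an L1 sparsity penalty. Dimensions and the penalty weight are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_latent: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))       # non-negative, ideally sparse codes
        return self.decoder(z), z

sae = SparseAutoencoder()
feats = torch.randn(16, 768)                  # stand-in for CLIP vision features
recon, z = sae(feats)
loss = nn.functional.mse_loss(recon, feats) + 1e-3 * z.abs().mean()   # reconstruction + sparsity
loss.backward()
print(recon.shape, float((z > 0).float().mean()))   # reconstruction shape and activation rate
```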

Authors:Yuexi Du, Jiazhen Zhang, Nicha C. Dvornek, John A. Onofrey
Title: GMR-Conv: An Efficient Rotation and Reflection Equivariant Convolution Kernel Using Gaussian Mixture Rings
Abstract:
Symmetry, where certain features remain invariant under geometric transformations, can often serve as a powerful prior in designing convolutional neural networks (CNNs). While conventional CNNs inherently support translational equivariance, extending this property to rotation and reflection has proven challenging, often forcing a compromise between equivariance, efficiency, and information loss. In this work, we introduce Gaussian Mixture Ring Convolution (GMR-Conv), an efficient convolution kernel that smooths radial symmetry using a mixture of Gaussian-weighted rings. This design mitigates discretization errors of circular kernels, thereby preserving robust rotation and reflection equivariance without incurring computational overhead. We further optimize both the space and speed efficiency of GMR-Conv via a novel parameterization and computation strategy, allowing larger kernels at an acceptable cost. Extensive experiments on eight classification and one segmentation datasets demonstrate that GMR-Conv not only matches conventional CNNs' performance but can also surpass it in applications with orientation-less data. GMR-Conv is also proven to be more robust and efficient than the state-of-the-art equivariant learning methods. Our work provides inspiring empirical evidence that carefully applied radial symmetry can alleviate the challenges of information loss, marking a promising advance in equivariant network architectures. The code is available at https://github.com/XYPB/GMR-Conv.
中文摘要:GMR-Conv通过高斯混合环形卷积核实现了高效且鲁棒的旋转与反射等变性,在多个数据集上超越传统卷积网络和先进等变学习方法,且无需额外计算成本。
English Summary: GMR-Conv introduces an efficient Gaussian mixture-based convolution kernel that achieves robust rotation and reflection equivariance without computational penalties, outperforming conventional CNNs and state-of-the-art methods in various tasks.
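A hedged sketch of a radially symmetric kernel built from Gaussian-weighted rings: each ring has a single learnable weight per channel, so the resulting depthwise kernel is rotation- and reflection-symmetric by construction. Ring count, sigma, and normalization are illustrative choices rather than GMR-Conv's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RingConv2d(nn.Module):
    """Depthwise conv whose kernel is a learned mixture of Gaussian-weighted rings."""
    def __init__(self, channels: int, kernel_size: int = 9, num_rings: int = 4, sigma: float = 0.6):
        super().__init__()
        self.ring_weights = nn.Parameter(torch.randn(channels, num_rings))
        r = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        yy, xx = torch.meshgrid(r, r, indexing="ij")
        radius = torch.sqrt(xx**2 + yy**2)
        radii = torch.linspace(0.0, radius.max().item(), num_rings)
        basis = torch.exp(-(radius[None] - radii[:, None, None]) ** 2 / (2 * sigma**2))
        self.register_buffer("basis", basis / basis.sum(dim=(1, 2), keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # kernel[c] = sum_k w[c, k] * ring_basis[k]; equal value at every angle -> symmetric
        kernel = torch.einsum("ck,khw->chw", self.ring_weights, self.basis).unsqueeze(1)
        return F.conv2d(x, kernel, padding=kernel.shape[-1] // 2, groups=x.shape[1])

x = torch.randn(1, 8, 32, 32)
print(RingConv2d(8)(x).shape)   # torch.Size([1, 8, 32, 32])
```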

Authors:Jay N. Paranjape, Celso de Melo, Vishal M. Patel
Title: F-ViTA: Foundation Model Guided Visible to Thermal Translation
Abstract:
Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.
中文: F-ViTA提出了一种创新的可见光到热成像转换方法,利用基础模型指导扩散过程,在多个数据集上表现卓越,并能有效泛化至分布外场景。
English: F-ViTA introduces a novel visible-to-thermal image translation method that utilizes foundation models to guide diffusion processes, achieving superior performance across multiple datasets and generalizing effectively to out-of-distribution scenarios.

Authors:Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan
Title: GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Abstract:
The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on the GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) combined with a diffusion-based head for image decoding, rather than the VAR-like architectures. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.
中文: 本报告推出首个评估基准GPT-ImgEval,系统评估GPT-4o在图像生成质量、编辑能力和语义合成上的卓越表现,揭示其自回归扩散架构特点与生成局限,同时提供对比研究和安全分析。
English: This report introduces GPT-ImgEval, the first benchmark to evaluate GPT-4o's superior image generation, editing, and semantic synthesis capabilities, revealing its auto-regressive diffusion architecture and limitations while providing comparative analysis and safety insights.

Authors:Vincent Gbouna Zakka, Luis J. Manso, Zhuangzhuang Dai
Title: Multi-Head Adaptive Graph Convolution Network for Sparse Point Cloud-Based Human Activity Recognition
Abstract:
Human activity recognition is increasingly vital for supporting independent living, particularly for the elderly and those in need of assistance. Domestic service robots with monitoring capabilities can enhance safety and provide essential support. Although image-based methods have advanced considerably in the past decade, their adoption remains limited by concerns over privacy and sensitivity to low-light or dark conditions. As an alternative, millimetre-wave (mmWave) radar can produce point cloud data which is privacy-preserving. However, processing the sparse and noisy point clouds remains a long-standing challenge. While graph-based methods and attention mechanisms show promise, they predominantly rely on "fixed" kernels; kernels that are applied uniformly across all neighbourhoods, highlighting the need for adaptive approaches that can dynamically adjust their kernels to the specific geometry of each local neighbourhood in point cloud data. To overcome this limitation, we introduce an adaptive approach within the graph convolutional framework. Instead of a single shared weight function, our Multi-Head Adaptive Kernel (MAK) module generates multiple dynamic kernels, each capturing different aspects of the local feature space. By progressively refining local features while maintaining global spatial context, our method enables convolution kernels to adapt to varying local features. Experimental results on benchmark datasets confirm the effectiveness of our approach, achieving state-of-the-art performance in human activity recognition. Our source code is made publicly available at: https://github.com/Gbouna/MAK-GCN
中文: 本研究提出了一种图卷积网络中的多头自适应核模块,能动态处理稀疏的毫米波雷达点云,在保护隐私的同时实现了最先进的人类活动识别性能。
English: This study introduces a Multi-Head Adaptive Kernel module within graph convolutional networks to dynamically process sparse mmWave radar point clouds, achieving state-of-the-art human activity recognition while preserving privacy.
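A minimal sketch of a multi-head adaptive kernel for point-cloud graph convolution: each head predicts its own edge weight from relative point positions, so the effective kernel varies with the local geometry of each neighborhood. Layer sizes, the softmax over heads, and sum aggregation are assumptions, not the MAK module itself.

```python
import torch
import torch.nn as nn

class MultiHeadAdaptiveConv(nn.Module):
    """Graph conv whose per-edge weights are generated per head from relative positions."""
    def __init__(self, in_dim: int, out_dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.kernel_gen = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, num_heads))
        self.proj = nn.Linear(in_dim * num_heads, out_dim)

    def forward(self, feats: torch.Tensor, pos: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        src, dst = edge_index                              # (E,), (E,)
        rel = pos[src] - pos[dst]                          # local geometry of each edge
        w = torch.softmax(self.kernel_gen(rel), dim=-1)    # (E, heads): adaptive kernel weights
        msgs = w.unsqueeze(-1) * feats[src].unsqueeze(1)   # (E, heads, in_dim)
        agg = torch.zeros(feats.size(0), self.num_heads, feats.size(1), device=feats.device)
        agg.index_add_(0, dst, msgs)                       # sum messages per target point
        return self.proj(agg.flatten(1))

pts, fdim = 64, 16
feats, pos = torch.randn(pts, fdim), torch.randn(pts, 3)   # sparse radar-like point cloud
edges = torch.randint(0, pts, (2, 256))                    # toy neighborhood edges
print(MultiHeadAdaptiveConv(fdim, 32)(feats, pos, edges).shape)   # torch.Size([64, 32])
```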

Authors:Benjy Friedmann, Michael Werman
Title: CanonNet: Canonical Ordering and Curvature Learning for Point Cloud Analysis
Abstract:
Point cloud processing poses two fundamental challenges: establishing consistent point ordering and effectively learning fine-grained geometric features. Current architectures rely on complex operations that limit expressivity while struggling to capture detailed surface geometry. We present CanonNet, a lightweight neural network composed of two complementary components: (1) a preprocessing pipeline that creates a canonical point ordering and orientation, and (2) a geometric learning framework where networks learn from synthetic surfaces with precise curvature values. This modular approach eliminates the need for complex transformation-invariant architectures while effectively capturing local geometric properties. Our experiments demonstrate state-of-the-art performance in curvature estimation and competitive results in geometric descriptor tasks with significantly fewer parameters (100X) than comparable methods. CanonNet's efficiency makes it particularly suitable for real-world applications where computational resources are limited, demonstrating that mathematical preprocessing can effectively complement neural architectures for point cloud analysis. The code for the project is publicly available at https://benjyfri.github.io/CanonNet/.
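The canonical ordering and orientation idea can be sketched by aligning a cloud with its PCA axes under a simple sign convention and sorting the points lexicographically; this conveys only the flavor of such preprocessing, not CanonNet's actual pipeline.

```python
import numpy as np

def canonicalize(points: np.ndarray) -> np.ndarray:
    """Rotate a cloud into a PCA-aligned frame with a sign convention, then sort points."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)    # rows of vt = principal axes
    signs = np.sign(((centered @ vt.T) ** 3).sum(axis=0))      # fix each axis sign via skewness
    signs[signs == 0] = 1.0
    aligned = centered @ (vt.T * signs)
    order = np.lexsort((aligned[:, 2], aligned[:, 1], aligned[:, 0]))   # consistent ordering
    return aligned[order]

cloud = np.random.default_rng(0).normal(size=(100, 3))
rot, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(3, 3)))    # random orthogonal matrix
a, b = canonicalize(cloud), canonicalize(cloud @ rot.T)
print(np.allclose(a, b, atol=1e-6))   # True when the principal axes are unambiguous
```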

Authors:Xingguang Zhang, Nicholas Chimitt, Xijun Wang, Yu Yuan, Stanley H. Chan
Title: Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation
Abstract:
Atmospheric turbulence is a major source of image degradation in long-range imaging systems. Although numerous deep learning-based turbulence mitigation (TM) methods have been proposed, many are slow, memory-hungry, and do not generalize well. In the spatial domain, methods based on convolutional operators have a limited receptive field, so they cannot handle a large spatial dependency required by turbulence. In the temporal domain, methods relying on self-attention can, in theory, leverage the lucky effects of turbulence, but their quadratic complexity makes it difficult to scale to many frames. Traditional recurrent aggregation methods face parallelization challenges. In this paper, we present a new TM method based on two concepts: (1) A turbulence mitigation network based on the Selective State Space Model (MambaTM). MambaTM provides a global receptive field in each layer across spatial and temporal dimensions while maintaining linear computational complexity. (2) Learned Latent Phase Distortion (LPD). LPD guides the state space model. Unlike classical Zernike-based representations of phase distortion, the new LPD map uniquely captures the actual effects of turbulence, significantly improving the model's capability to estimate degradation by reducing the ill-posedness. Our proposed method exceeds current state-of-the-art networks on various synthetic and real-world TM benchmarks with significantly faster inference speed.

Authors:Feng Gao, Miao Fu, Jingchao Cao, Junyu Dong, Qian Du
Title: Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation
Abstract:
Semantic segmentation of high-resolution remote sensing images plays a crucial role in land-use monitoring and urban planning. Recent remarkable progress in deep learning-based methods makes it possible to generate satisfactory segmentation results. However, existing methods still face challenges in adapting network parameters to various land cover distributions and enhancing the interaction between spatial and frequency domain features. To address these challenges, we propose the Adaptive Frequency Enhancement Network (AFENet), which integrates two key components: the Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and the Selective feature Fusion Module (SFM). AFSIM dynamically separates and modulates high- and low-frequency features according to the content of the input image. It adaptively generates two masks to separate high- and low-frequency components, therefore providing optimal details and contextual supplementary information for ground object feature representation. SFM selectively fuses global context and local detailed features to enhance the network's representation capability. Hence, the interactions between frequency and spatial features are further enhanced. Extensive experiments on three publicly available datasets demonstrate that the proposed AFENet outperforms state-of-the-art methods. In addition, we also validate the effectiveness of AFSIM and SFM in managing diverse land cover types and complex scenarios. Our codes are available at https://github.com/oucailab/AFENet.
中文:提出的AFENet模型通过自适应频率空间交互模块和选择性特征融合模块,有效整合频域与空间特征以提升遥感图像语义分割精度,在多个公开数据集上验证了其优越性能。
English: The proposed AFENet model enhances semantic segmentation of remote sensing images by adaptively integrating frequency and spatial features through its AFSIM and SFM modules, demonstrating superior performance over existing methods across multiple datasets.
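The content-adaptive separation of high- and low-frequency components can be illustrated with a minimal PyTorch sketch: a small head predicts a per-image radial cutoff, which defines a soft low-pass mask in the Fourier domain, and the high-frequency part is the residual. The cutoff head and mask sharpness are assumptions for illustration, not the AFSIM design.

```python
import torch
import torch.nn as nn

class FrequencySplit(nn.Module):
    """Sketch: predict a per-image radial cutoff, build a soft low-pass mask in
    the Fourier domain, and return complementary low/high frequency parts."""

    def __init__(self, channels):
        super().__init__()
        self.cutoff_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid())          # cutoff in (0, 1)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        yy, xx = torch.meshgrid(
            torch.linspace(-1, 1, H, device=x.device),
            torch.linspace(-1, 1, W, device=x.device), indexing="ij")
        radius = torch.sqrt(xx ** 2 + yy ** 2).view(1, 1, H, W)
        cutoff = self.cutoff_head(x).view(B, 1, 1, 1)
        low_mask = torch.sigmoid((cutoff - radius) * 25.0)  # soft mask keeps the cutoff learnable
        low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1)),
                              norm="ortho").real
        high = x - low                                      # complementary high-frequency part
        return low, high

low, high = FrequencySplit(channels=64)(torch.randn(2, 64, 32, 32))
```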

Authors:Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
Title: Efficient Model Editing with Task-Localized Sparse Fine-tuning
Abstract:
Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which allows building sparse task vectors with minimal interference without requiring explicit linearization and sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters allows for promoting weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
中文: TaLoS提出了一种构建稀疏任务向量的方法,通过仅更新梯度敏感性低的参数子集来提升训练和推理效率,无需线性化即可在任务算术中超越现有方法。
English: TaLoS introduces a method for creating sparse task vectors that enhance training and inference efficiency by updating only a subset of parameters with low gradient sensitivity, outperforming existing approaches in task arithmetic without requiring linearization.
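A rough sketch of the sensitivity-masked fine-tuning idea, assuming a plain |gradient| accumulation as the sensitivity score and a single keep-ratio threshold (both simplifications, not the paper's recipe): only the lowest-sensitivity parameters receive updates, and the sparse task vector is the resulting weight difference.

```python
import copy
import torch

def gradient_sensitivity(model, loss_fn, loader, device, num_batches=8):
    """Accumulate |gradient| per parameter as a crude sensitivity estimate."""
    sens = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i, (x, y) in enumerate(loader):
        if i >= num_batches:
            break
        model.zero_grad()
        loss_fn(model(x.to(device)), y.to(device)).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                sens[n] += p.grad.abs()
    return sens

def sparse_finetune(pretrained, loss_fn, loader, device, keep_ratio=0.1,
                    steps=100, lr=1e-4):
    """Fine-tune only the keep_ratio fraction of parameters with the LOWEST
    sensitivity, then return the resulting (sparse) task vector."""
    model = copy.deepcopy(pretrained).to(device)
    sens = gradient_sensitivity(model, loss_fn, loader, device)
    flat = torch.cat([s.flatten() for s in sens.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = flat.kthvalue(k).values                      # low-sensitivity cutoff
    masks = {n: (s <= threshold).float() for n, s in sens.items()}
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)
            x, y = next(it)
        opt.zero_grad()
        loss_fn(model(x.to(device)), y.to(device)).backward()
        for n, p in model.named_parameters():                # zero grads outside the mask
            if p.grad is not None:
                p.grad *= masks[n]
        opt.step()
    return {n: p.detach().cpu() - q.detach().cpu()
            for (n, p), (_, q) in zip(model.named_parameters(),
                                      pretrained.named_parameters())}
```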

Authors:Lihua Liu, Jiehong Lin, Zhenxin Liu, Kui Jia
Title: PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation
Abstract:
RGB-based novel object pose estimation is critical for rapid deployment in robotic applications, yet zero-shot generalization remains a key challenge. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects. Code and trained models are available at https://github.com/foollh/PicoPose.
中文: PicoPose采用三阶段像素级对应学习框架,通过逐步优化特征匹配,在机器人应用中实现了针对新物体的零样本姿态估计的最先进性能。
English: PicoPose is a three-stage pixel-to-pixel correspondence learning framework that progressively refines feature matching to achieve state-of-the-art zero-shot pose estimation for novel objects in robotic applications.
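The hand-off from the second to the third stage, applying a regressed in-plane rotation, scale, and translation to the best-matched template's feature map, can be sketched with torch's affine_grid/grid_sample. The function name and normalised-coordinate convention below are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def warp_template_features(template_feats, angle, scale, tx, ty):
    """Sketch of the stage-2 -> stage-3 handoff: warp the best-matched
    template's feature map by a 2D affine before local offset refinement.
    template_feats: (B, C, H, W); angle in radians; tx, ty in normalised coords."""
    cos, sin = torch.cos(angle), torch.sin(angle)
    # Rows of the 2x3 affine matrix expected by affine_grid (normalised coords).
    theta = torch.stack([
        torch.stack([scale * cos, -scale * sin, tx], dim=-1),
        torch.stack([scale * sin,  scale * cos, ty], dim=-1),
    ], dim=1)                                               # (B, 2, 3)
    grid = F.affine_grid(theta, template_feats.shape, align_corners=False)
    return F.grid_sample(template_feats, grid, align_corners=False)

feats = torch.randn(1, 64, 32, 32)
warped = warp_template_features(feats, torch.tensor([0.3]), torch.tensor([1.2]),
                                torch.tensor([0.1]), torch.tensor([-0.05]))
```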

Authors:Jiwoo Chung, Sangeek Hyun, Hyunjun Kim, Eunseo Koh, MinKyu Lee, Jae-Pil Heo
Title: Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation
Abstract:
Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naively fine-tuning VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on subject generation than the later stages, which merely synthesize minor details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions to encourage the model to focus on subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.

Authors:Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu
Title: Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
Abstract:
Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
中文摘要:本研究为视觉语言模型提出了一个透明的强化学习框架,并通过实验验证了关键发现,如强化学习在泛化能力上优于监督微调,以及响应长度对性能的影响。
English Summary: This study introduces a transparent reinforcement learning framework for vision-language models, validated through experiments that reveal key insights like RL's superior generalization over supervised fine-tuning and the influence of response length on performance.

Authors:Andrei Dumitriu, Florin Tatui, Florin Miron, Radu Tudor Ionescu, Radu Timofte
Title: Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results
Abstract:
Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, emphasizing the importance of automatically detecting these hazardous surface water currents. In this paper, we address a novel task: rip current instance segmentation. We introduce a comprehensive dataset containing 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. Additionally, we present a novel dataset comprising 17 drone videos (about 24K frames) captured at 30 FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection, employed for testing purposes. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8-nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and 81.21% macro average on the test dataset. The results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset, and training a deep learning model for instance segmentation of rip currents. The code, training details and the annotated dataset are made publicly available at https://github.com/Irikos/rip_currents.
中文摘要:本文针对离岸流实例分割这一新任务,发布了详细标注的数据集,通过YOLOv8-nano模型实现了高精度检测,为后续研究奠定基础并提升海滩安全防护能力。
English Summary: This paper introduces a novel rip current instance segmentation task, presenting comprehensive datasets and achieving high detection accuracy with the YOLOv8-nano model, providing a baseline for future research and enhancing beach safety.
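Training a YOLOv8 segmentation model on such a dataset follows the standard ultralytics workflow; the dataset YAML and image paths below are placeholders, not the repository's actual file names.

```python
from ultralytics import YOLO

# Hypothetical dataset config; the repo's actual file names and splits may differ.
model = YOLO("yolov8n-seg.pt")               # nano segmentation checkpoint
model.train(data="rip_currents.yaml",        # images + polygon labels in YOLO format
            imgsz=640, epochs=100, batch=16)
metrics = model.val()                        # reports mAP50 on the validation split
results = model.predict("drone_frame.jpg", conf=0.25)
```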

Authors:Hesong Li, Ziqi Wu, Ruiwen Shao, Tao Zhang, Ying Fu
Title: Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement
Abstract:
Scanning Transmission Electron Microscopy (STEM) enables the observation of atomic arrangements at sub-angstrom resolution, allowing for atomically resolved analysis of the physical and chemical properties of materials. However, due to the effects of noise, electron beam damage, sample thickness, etc., obtaining satisfactory atomic-level images is often challenging. Enhancing STEM images can reveal clearer structural details of materials. Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. We first present a STEM noise calibration method, which is used to synthesize more realistic STEM images. The parameters of background noise, scan noise, and pointwise noise are obtained by statistical analysis and fitting of real STEM images containing atoms. Then we use these parameters to develop a more general dataset that considers both regular and random atomic arrangements and includes both HAADF and BF mode images. Finally, we design a spatial-frequency interactive network for STEM image enhancement, which can explore the information in the frequency domain formed by the periodicity of atomic arrangement. Experimental results show that our data is closer to real STEM images and achieves better enhancement performances together with our network. Code will be available at https://github.com/HeasonLee/SFIN.
Chinese: 本文提出了噪声校准与数据合成方法以生成更逼真的STEM图像,并设计了一种空间-频率交互网络,利用频域信息显著提升了图像清晰度与增强效果。
English: This paper introduces noise calibration and data synthesis methods to create more realistic STEM images and develops a spatial-frequency interactive network that leverages frequency domain information for enhanced image clarity and performance.
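The three noise components named above can be imitated with a toy degradation model: a slowly varying background offset, per-scanline gain jitter for scan noise, and Poisson shot noise for the pointwise term. The parameter values below are arbitrary stand-ins for whatever the calibration procedure would estimate from real frames.

```python
import numpy as np

def synthesize_stem_noise(clean, rng, bg_sigma=0.02, scan_sigma=0.03, dose=200.0):
    """Toy STEM degradation with the three components named in the paper:
    background noise, scan (per-row) noise, and pointwise shot noise."""
    h, w = clean.shape
    background = rng.normal(0.0, bg_sigma, size=(h, w))        # additive background offset
    scan = rng.normal(0.0, scan_sigma, size=(h, 1))            # per-scanline gain jitter
    noisy = clean * (1.0 + scan) + background
    # Pointwise noise: Poisson counting statistics at a finite electron dose.
    noisy = rng.poisson(np.clip(noisy, 0, None) * dose) / dose
    return noisy.astype(np.float32)

# Clean toy lattice: a grid of Gaussian "atoms".
yy, xx = np.mgrid[0:256, 0:256]
clean = sum(np.exp(-(((xx - cx) ** 2 + (yy - cy) ** 2) / 18.0))
            for cx in range(16, 256, 32) for cy in range(16, 256, 32))
noisy = synthesize_stem_noise(clean / clean.max(), np.random.default_rng(0))
```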

Authors:Fa-Ting Hong, Zunnan Xu, Zixiang Zhou, Jun Zhou, Xiu Li, Qin Lin, Qinglin Lu, Dan Xu
Title: Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
Abstract:
Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict. The project website can be found at https://harlanhong.github.io/publications/actalker/index.html.

Authors:Mykola Lavreniuk, Nataliia Kussul, Andrii Shelestov, Bohdan Yailymov, Yevhenii Salii, Volodymyr Kuzin, Zoltan Szantoi
Title: Delineate Anything: Resolution-Agnostic Field Boundary Delineation on Satellite Imagery
Abstract:
The accurate delineation of agricultural field boundaries from satellite imagery is vital for land management and crop monitoring. However, current methods face challenges due to limited dataset sizes, resolution discrepancies, and diverse environmental conditions. We address this by reformulating the task as instance segmentation and introducing the Field Boundary Instance Segmentation - 22M dataset (FBIS-22M), a large-scale, multi-resolution dataset comprising 672,909 high-resolution satellite image patches (ranging from 0.25 m to 10 m) and 22,926,427 instance masks of individual fields, significantly narrowing the gap between agricultural datasets and those in other computer vision domains. We further propose Delineate Anything, an instance segmentation model trained on our new FBIS-22M dataset. Our proposed model sets a new state-of-the-art, achieving a substantial improvement of 88.5% in mAP@0.5 and 103% in mAP@0.5:0.95 over existing methods, while also demonstrating significantly faster inference and strong zero-shot generalization across diverse image resolutions and unseen geographic regions. Code, pre-trained models, and the FBIS-22M dataset are available at https://lavreniuk.github.io/Delineate-Anything.

Authors:Fatemeh Behrad, Tinne Tuytelaars, Johan Wagemans
Title: Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment
Abstract:
The capacity of Vision Transformers (ViTs) to handle variable-sized inputs is often constrained by computational complexity and batch processing limitations. Consequently, ViTs are typically trained on small, fixed-size images obtained through downscaling or cropping. While reducing computational burden, these methods result in significant information loss, negatively affecting tasks like image aesthetic assessment. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm prioritizes high-resolution details in specific regions while downscaling others, enabling shorter fixed-size input sequences for ViTs while incorporating essential information. Charm is designed to be compatible with pre-trained ViTs and their learned positional embeddings. By providing multiscale input and introducing variety to input tokens, Charm improves ViT performance and generalizability for image aesthetic assessment. We avoid cropping or changing the aspect ratio to further preserve information. Extensive experiments demonstrate significant performance improvements on various image aesthetic and quality assessment datasets (up to 8.1%) using a lightweight ViT backbone. Code and pre-trained models are available at https://github.com/FBehrad/Charm.
中文摘要:Charm提出了一种新颖的标记化方法,能同时保留图像的构图、高分辨率、宽高比和多尺度信息,使视觉变换器能够在不损失关键信息的前提下高效处理不同尺寸的输入,从而显著提升图像美学评估任务的性能表现。
English Summary: Charm introduces a novel tokenization method that preserves key image attributes like composition and high-resolution details, enabling Vision transformers to process variable-sized inputs efficiently without information loss, thereby significantly improving performance in image aesthetic assessment tasks.

Authors:Nedko Savov, Naser Kazemi, Mohammad Mahdi, Danda Pani Paudel, Xi Wang, Luc Van Gool
Title: Exploration-Driven Generative Interactive Environments
Abstract:
Modern world models require costly and time-consuming collection of large video datasets with action demonstrations by people or by environment-specific agents. To simplify training, we focus on using many virtual environments for inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates simulation abilities of many environments with shared behavior. Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. While the model trained in this manner exhibits good controls, it is limited by the random exploration possibilities. To address this limitation, we propose AutoExplore Agent - an exploration agent that relies entirely on the uncertainty of the world model, delivering diverse data from which the model can best learn. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments, achieving improved video fidelity and controllability. In order to automatically obtain large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. For building our model, we first create an open implementation of Genie - GenieRedux - and apply enhancements and adaptations in our version GenieRedux-G. Our code and data are available at https://github.com/insait-institute/GenieRedux.
Chinese: 该研究提出了一种在虚拟环境中使用随机代理的训练框架,并引入了AutoExplore代理,通过利用世界模型的不确定性生成多样化数据,以提升在新环境中的适应性和性能,同时基于RetroAct数据集和增强的GenieRedux-G模型实现。
English: The study introduces a training framework using a random agent in virtual environments and proposes the AutoExplore Agent, which leverages world model uncertainty to generate diverse data for improved adaptability and performance in new environments, supported by the RetroAct dataset and an enhanced GenieRedux-G model.

Authors:Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, Jinyang Guo, Di Huang, Yunhong Wang
Title: APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers
Abstract:
Vision Transformers (ViTs) have become one of the most commonly used backbones for vision tasks. Despite their remarkable performance, they often suffer significant accuracy drops when quantized for practical deployment, particularly by post-training quantization (PTQ) under ultra-low bits. Recently, reconstruction-based PTQ methods have shown promising performance in quantizing Convolutional Neural Networks (CNNs). However, they fail when applied to ViTs, primarily due to the inaccurate estimation of output importance and the substantial accuracy degradation in quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH). Specifically, we first thoroughly analyze the current approximation approaches with Hessian loss, and propose an improved average perturbation Hessian loss. To deal with the quantization of the post-GELU activations, we design an MLP Reconstruction (MR) method by replacing the GELU function in MLP with ReLU and reconstructing it by the APH loss on a small unlabeled calibration set. Extensive experiments demonstrate that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks. The source code is available at https://github.com/GoatWu/APHQ-ViT.
Chinese: 视觉Transformer(ViT)在超低位量化时存在显著精度下降问题,APHQ-ViT通过改进的基于Hessian的重要性评估和MLP重构方法,在3位和4位量化设置中实现了最先进的性能表现。
English: Vision Transformers (ViTs) face significant accuracy loss during ultra-low bit quantization, which APHQ-ViT addresses through improved Hessian-based importance estimation and MLP reconstruction to achieve state-of-the-art performance in 3-bit and 4-bit settings.

Authors:Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, Shibiao Xu
Title: Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Abstract:
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We systematically review the applications of multimodal fusion in key robotic vision tasks, including semantic scene understanding, simultaneous localization and mapping (SLAM), 3D object detection, navigation and localization, and robot manipulation. We compare VLMs based on large language models (LLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Furthermore, we identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation, and propose future research directions, including self-supervised learning for robust multimodal representations, transformer-based fusion architectures, and scalable multimodal frameworks. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.
中文: 本综述系统梳理了多模态融合在机器人视觉中的应用,对比视觉语言模型与传统方法,分析数据集并指出关键研究挑战,以推动多模态感知技术的发展。
English: This survey systematically reviews multimodal fusion applications in robotic vision, comparing vision-language models with traditional methods while analyzing datasets and identifying key research challenges to advance multimodal perception.

Authors:Rick van Essen, Eldert van Henten, Lammert Kooistra, Gert Kootstra
Title: Adaptive path planning for efficient object search by UAVs in agricultural fields
Abstract:
This paper presents an adaptive path planner for object search in agricultural fields using UAVs. The path planner uses a high-altitude coverage flight path and plans additional low-altitude inspections when the detection network is uncertain. The path planner was evaluated in an offline simulation environment containing real-world images. We trained a YOLOv8 detection network to detect artificial plants placed in grass fields to showcase the potential of our path planner. We evaluated the effect of different detection certainty measures, optimized the path planning parameters, and investigated the effects of localization errors and of different numbers of objects in the field. The YOLOv8 detection confidence worked best to differentiate between true and false positive detections and was therefore used in the adaptive planner. The optimal parameters of the path planner depended on the distribution of objects in the field. When the objects were uniformly distributed, more low-altitude inspections were needed compared to a non-uniform distribution of objects, resulting in a longer path length. The adaptive planner proved to be robust against localization uncertainty. When increasing the number of objects, the flight path length increased, especially when the objects were uniformly distributed. When the objects were non-uniformly distributed, the adaptive path planner yielded a shorter path than a low-altitude coverage path, even with a high number of objects. Overall, the presented adaptive path planner allowed finding non-uniformly distributed objects in a field faster than a coverage path planner and resulted in comparable detection accuracy. The path planner is made available at https://github.com/wur-abe/uav_adaptive_planner.
中文: 本文提出了一种用于农田目标搜索的自适应无人机路径规划器,通过结合高空覆盖飞行和检测不确定时的低空巡检,在非均匀分布目标场景下展现出比全覆盖路径更短的飞行距离,同时对定位误差具有良好鲁棒性。
English: This paper introduces an adaptive UAV path planner for agricultural object search that combines high-altitude coverage with targeted low-altitude inspections triggered by detection uncertainty, demonstrating superior efficiency for non-uniform object distributions while maintaining robust performance against localization errors.
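The core adaptive rule, flying extra low-altitude inspections only where the detector is uncertain, can be sketched in a few lines. The confidence thresholds and the nearest-neighbour routing heuristic below are illustrative choices, not the optimized parameters from the paper.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float          # field coordinates (m)
    y: float
    confidence: float

def plan_inspections(detections, low=0.3, high=0.8):
    """Detections whose confidence falls between the thresholds are uncertain
    and earn an extra low-altitude inspection (threshold values illustrative)."""
    return [(d.x, d.y) for d in detections if low <= d.confidence < high]

def greedy_route(start, targets):
    """Order inspection points with a simple nearest-neighbour heuristic."""
    route, current, remaining = [], start, list(targets)
    while remaining:
        nxt = min(remaining,
                  key=lambda p: (p[0] - current[0]) ** 2 + (p[1] - current[1]) ** 2)
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route

dets = [Detection(10, 5, 0.95), Detection(42, 17, 0.55),
        Detection(80, 60, 0.12), Detection(33, 71, 0.45)]
print(greedy_route((0.0, 0.0), plan_inspections(dets)))
```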

Authors:Changshuo Wang, Shuting He, Xiang Fang, Meiqing Wu, Siew-Kei Lam, Prayag Tiwari
Title: Taylor Series-Inspired Local Structure Fitting Network for Few-shot Point Cloud Semantic Segmentation
Abstract:
Few-shot point cloud semantic segmentation aims to accurately segment "unseen" new categories in point cloud scenes using limited labeled data. However, pretraining-based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining-free local structure fitting network for few-shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low-order basic information and high-order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variants of TaylorSeg: a non-parametric TaylorSeg-NN and a parametric TaylorSeg-PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push-Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2-way 1-shot setting, TaylorSeg-PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state-of-the-art methods. Our code is available at https://github.com/changshuowang/TaylorSeg.
中文: 提出的TaylorSeg网络采用无需预训练的方法,通过TaylorConv将不规则点云的局部结构建模为多项式拟合,在少样本点云语义分割中实现了领先性能,避免了预训练的时间开销。
English: The proposed TaylorSeg network introduces a pretraining-free approach for few-shot point cloud semantic segmentation, utilizing TaylorConv to model local structures as polynomial fittings, achieving state-of-the-art performance without pretraining overhead.
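One minimal reading of the Taylor-series idea is to expand each neighbour's relative position into low-order (constant, linear) and high-order (quadratic) polynomial terms and let a learned projection of that basis weight the neighbour features. The sketch below follows that reading only; it is not the published TaylorConv.

```python
import torch
import torch.nn as nn

class TaylorConvSketch(nn.Module):
    """Illustrative local structure fitting: weight neighbour features by a
    learned projection of a polynomial basis of the relative positions."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Basis: [1, dx, dy, dz, dx^2, dy^2, dz^2, dxdy, dxdz, dydz] -> 10 terms
        self.basis_proj = nn.Linear(10, in_dim)
        self.out_proj = nn.Linear(in_dim, out_dim)

    def forward(self, rel_pos, nbr_feats):
        # rel_pos: (B, N, K, 3) neighbour offsets, nbr_feats: (B, N, K, C)
        dx, dy, dz = rel_pos.unbind(-1)
        ones = torch.ones_like(dx)
        basis = torch.stack([ones, dx, dy, dz,              # zeroth/first order
                             dx * dx, dy * dy, dz * dz,     # second order
                             dx * dy, dx * dz, dy * dz], dim=-1)
        weights = self.basis_proj(basis)                     # (B, N, K, C)
        fitted = (weights * nbr_feats).sum(dim=2)            # aggregate neighbourhood
        return self.out_proj(fitted)                         # (B, N, out_dim)

layer = TaylorConvSketch(in_dim=32, out_dim=64)
out = layer(torch.randn(2, 128, 16, 3), torch.randn(2, 128, 16, 32))
```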

Authors:Jiayi Gao, Zijin Yin, Changcheng Hua, Yuxin Peng, Kongming Liang, Zhanyu Ma, Jun Guo, Yang Liu
Title: ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
Abstract:
The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling the control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos, failing to transfer specific subject motion; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at https://github.com/Andyplus1/ConMo.
中文: ConMo是一种零样本框架,能从源视频中分离并重组主体和相机运动,实现对多样化主体的精确运动控制,并在多主体场景中提升视频生成的保真度和语义一致性。
English: ConMo is a zero-shot framework that disentangles and recomposes subject and camera motions from source videos, enabling precise motion control across diverse subjects and improving multi-subject video generation with enhanced fidelity and semantic consistency.

Authors:Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang, Chaonan Ji, Sheng Xu, Bang Zhang, Liefeng Bo
Title: OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking
Abstract:
Although significant progress has been made in audio-driven talking head generation, text-driven methods remain underexplored. In this work, we present OmniTalker, a unified framework that jointly generates synchronized talking audio-video content from input text while emulating the speaking and facial movement styles of the target identity, including speech characteristics, head motion, and facial dynamics. Our framework adopts a dual-branch diffusion transformer (DiT) architecture, with one branch dedicated to audio generation and the other to video synthesis. At the shallow layers, cross-modal fusion modules are introduced to integrate information between the two modalities. In deeper layers, each modality is processed independently, with the generated audio decoded by a vocoder and the video rendered using a GAN-based high-quality visual renderer. Leveraging the in-context learning capability of DiT through a masked-infilling strategy, our model can simultaneously capture both audio and visual styles without requiring explicit style extraction modules. Thanks to the efficiency of the DiT backbone and the optimized visual renderer, OmniTalker achieves real-time inference at 25 FPS. To the best of our knowledge, OmniTalker is the first one-shot framework capable of jointly modeling speech and facial styles in real time. Extensive experiments demonstrate its superiority over existing methods in terms of generation quality, particularly in preserving style consistency and ensuring precise audio-video synchronization, all while maintaining efficient inference.

Authors:Peifu Liu, Huiyan Bai, Tingfa Xu, Jihui Wang, Huan Chen, Jianan Li
Title: Hyperspectral Remote Sensing Images Salient Object Detection: The First Benchmark Dataset and Baseline
Abstract:
The objective of hyperspectral remote sensing image salient object detection (HRSI-SOD) is to identify objects or regions that exhibit distinct spectrum contrasts with the background. This area holds significant promise for practical applications; however, progress has been limited by a notable scarcity of dedicated datasets and methodologies. To bridge this gap and stimulate further research, we introduce the first HRSI-SOD dataset, termed HRSSD, which includes 704 hyperspectral images and 5327 pixel-level annotated salient objects. The HRSSD dataset poses substantial challenges for salient object detection algorithms due to large scale variation, diverse foreground-background relations, and multi-salient objects. Additionally, we propose an innovative and efficient baseline model for HRSI-SOD, termed the Deep Spectral Saliency Network (DSSN). The core of DSSN is the Cross-level Saliency Assessment Block, which performs pixel-wise attention and evaluates the contributions of multi-scale similarity maps at each spatial location, effectively reducing erroneous responses in cluttered regions and emphasizing salient regions across scales. Furthermore, the High-resolution Fusion Module combines a bottom-up fusion strategy with learned spatial upsampling to leverage the strengths of multi-scale saliency maps, ensuring accurate localization of small objects. Experiments on the HRSSD dataset robustly validate the superiority of DSSN, underscoring the critical need for specialized datasets and methodologies in this domain. Further evaluations on the HSOD-BIT and HS-SOD datasets demonstrate the generalizability of the proposed method. The dataset and source code are publicly available at https://github.com/laprf/HRSSD.
中文: 本文提出了首个高光谱遥感图像显著目标检测数据集HRSSD,并开发了具有跨层级显著性评估模块和高分辨率融合模块的深度光谱显著性网络DSSN,有效解决了尺度变化和复杂背景等挑战,在多个数据集上验证了其优越性能和泛化能力。
English: This paper introduces the first hyperspectral remote sensing image salient object detection dataset HRSSD and proposes the Deep Spectral Saliency Network (DSSN) with innovative modules to address challenges like scale variation and cluttered backgrounds, demonstrating superior performance and generalizability across multiple datasets.

Authors:Takahiro Shirakawa, Tomoyuki Suzuki, Takuto Narumoto, Daichi Haraguchi
Title: MG-Gen: Single Image to Motion Graphics Generation
Abstract:
We introduce MG-Gen, a framework that generates motion graphics directly from a single raster image. MG-Gen decomposes a single raster image into layered structures represented as HTML, generates animation scripts for each layer, and then renders them into a video. Experiments confirm MG-Gen generates dynamic motion graphics while preserving text readability and fidelity to the input conditions, whereas state-of-the-art image-to-video generation methods struggle with them. The code is available at https://github.com/CyberAgentAILab/MG-GEN.
中文:MG-Gen框架能够将单个栅格图像分解为分层HTML结构并生成动画脚本,从而直接创建动态图形,在保持文本可读性和输入保真度方面优于现有方法。
English: MG-Gen is a framework that converts a single raster image into layered HTML structures and animation scripts to produce dynamic motion graphics, outperforming existing methods in preserving text and input fidelity.

Authors:Boris Sukhovilov
Title: Determining Sphere Radius through Pairwise Distances
Abstract:
We propose a novel method for determining the radius of a spherical surface based on the distances measured between points on this surface. We consider the most general case of determining the radius when the distances are measured with errors and the sphere has random deviations from its ideal shape. For the solution, we used the minimally necessary four points and an arbitrary number N of points. We provide a new closed form solution for the radius of the sphere through the matrix of pairwise distances. We also determine the standard deviation of the radius estimate caused by measurement errors and deviations of the sphere from its ideal shape. We found optimal configurations of points on the sphere that provide the minimum standard deviation of the radius estimate. This paper describes our solution and provides all the mathematical derivations. We share the implementation of our method as open source code at https://github.com/boris-sukhovilov/Sphere_Radius.
中文: 本文提出了一种基于点间距离测量估算球面半径的新方法,考虑了测量误差和球体形状偏差,并给出了最小化估计方差的最优点配置及开源代码实现。
English: This paper introduces a novel method for estimating the radius of a spherical surface using pairwise distance measurements, accounting for measurement errors and shape deviations, and provides an optimal point configuration for minimal estimation variance with open-source implementation.
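Without reproducing the paper's closed-form expression, the same quantity can be recovered numerically from the distance matrix: classical multidimensional scaling gives 3D coordinates, and a linear least-squares fit gives the sphere centre and radius. The sketch below assumes noiseless distances for the sanity check.

```python
import numpy as np

def radius_from_distances(D):
    """Generic numerical route (not the paper's closed form): embed the points
    in 3D via classical MDS, then fit a sphere by linear least squares."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J                       # Gram matrix of centred points
    eigvals, eigvecs = np.linalg.eigh(G)
    idx = np.argsort(eigvals)[::-1][:3]               # top-3 components -> 3D embedding
    P = eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0, None))
    # Sphere fit: |p|^2 = 2 c.p + (R^2 - |c|^2), linear in (c, rho).
    A = np.hstack([2 * P, np.ones((n, 1))])
    b = np.sum(P ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    centre, rho = sol[:3], sol[3]
    return np.sqrt(rho + centre @ centre)

# Sanity check on noiseless points sampled from a sphere of radius 2.
rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))
pts = 2.0 * pts / np.linalg.norm(pts, axis=1, keepdims=True)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(radius_from_distances(D))   # approximately 2.0
```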

Authors:Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide
Title: MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
Abstract:
Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area distributed settings, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this paper, we introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments, and also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. The proposed MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the proposed method integrates a human detection module to enhance spatial feature learning, guiding the model to prioritize frames with human activity to enhance action recognition accuracy. Experiments on the proposed MultiSensor-Home and the existing MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. Quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition. The source code is available at https://github.com/thanhhff/MultiTSF.
中文: 本文提出了针对多模态动作识别实际挑战的MultiSensor-Home数据集和基于Transformer的MultiTSF方法,该方法能动态建模视图间关系并增强空间特征学习,实验证明其性能优于现有最优方法。
English: This paper introduces the MultiSensor-Home dataset addressing real-world challenges in multi-modal action recognition and proposes MultiTSF, a Transformer-based method that dynamically models inter-view relationships and enhances spatial feature learning, demonstrating superior performance over existing approaches.

Authors:Amit Rand, Hadi Ibrahim
Title: Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation
Abstract:
Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 compared to our baseline model's AUC of 0.66, corresponding to a substantial approximate 233% relative improvement over random guessing (AUC = 0.5).
中文: MXA模块作为一种新型注意力机制,集成于EfficientViT架构中,通过捕捉局部细节和全局上下文,显著提升了多标签胸部X光分类性能,在CheXpert数据集上实现了0.85的AUC值。
English: The MXA block, a novel attention mechanism integrated into the EfficientViT architecture, significantly enhances multi-label chest X-ray classification by capturing both local details and global context, achieving a 0.85 AUC on the CheXpert dataset.
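The knowledge-distillation part can be illustrated with the standard soft-target recipe adapted to multi-label classification, where per-class sigmoids replace the usual softmax. The loss weighting and temperature below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def multilabel_distillation_loss(student_logits, teacher_logits, targets,
                                 alpha=0.5, temperature=2.0):
    """Soft-target distillation for multi-label classification: combine the
    hard-label BCE loss with a temperature-scaled BCE against teacher sigmoids."""
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft_teacher = torch.sigmoid(teacher_logits / temperature)
    soft = F.binary_cross_entropy_with_logits(student_logits / temperature,
                                              soft_teacher) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

logits_s = torch.randn(8, 14)               # e.g. 14 CheXpert-style findings
logits_t = torch.randn(8, 14)
labels = torch.randint(0, 2, (8, 14)).float()
loss = multilabel_distillation_loss(logits_s, logits_t, labels)
```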

Authors:Shaocong Long, Qianyu Zhou, Xiangtai Li, Chenhao Ying, Yunhai Tong, Lizhuang Ma, Yuan Luo, Dacheng Tao
Title: Generative Classifier for Domain Generalization
Abstract:
Domain generalization (DG) aims to improve the generalizability of computer vision models toward distribution shifts. The mainstream DG methods focus on learning domain invariance, however, such methods overlook the potential inherent in domain-specific information. While the prevailing practice of discriminative linear classifier has been tailored to domain-invariant features, it struggles when confronted with diverse domain-specific information, e.g., intra-class shifts, that exhibits multi-modality. To address these issues, we explore the theoretical implications of relying on domain invariance, revealing the crucial role of domain-specific information in mitigating the target risk for DG. Drawing from these insights, we propose Generative Classifier-driven Domain Generalization (GCDG), introducing a generative paradigm for the DG classifier based on Gaussian Mixture Models (GMMs) for each class across domains. GCDG consists of three key modules: Heterogeneity Learning Classifier (HLC), Spurious Correlation Blocking (SCB), and Diverse Component Balancing (DCB). Concretely, HLC attempts to model the feature distributions and thereby capture valuable domain-specific information via GMMs. SCB identifies the neural units containing spurious correlations and perturbs them, mitigating the risk of HLC learning spurious patterns. Meanwhile, DCB ensures a balanced contribution of components in HLC, preventing the underestimation or neglect of critical components. In this way, GCDG excels in capturing the nuances of domain-specific information characterized by diverse distributions. GCDG demonstrates the potential to reduce the target risk and encourage flat minima, improving the generalizability. Extensive experiments show GCDG's comparable performance on five DG benchmarks and one face anti-spoofing dataset, seamlessly integrating into existing DG methods with consistent improvements.
中文: 领域泛化旨在提升模型对分布变化的适应性,本研究提出GCDG方法,采用高斯混合模型的生成式分类器,通过异质性学习、伪相关阻断和组件平衡模块,有效捕捉领域特定信息并提升泛化性能。
English: Domain generalization (DG) aims to enhance model adaptability to distribution shifts, and this study introduces GCDG, a generative classifier using Gaussian Mixture Models to effectively capture domain-specific information and improve generalization through modules for heterogeneity learning, spurious correlation blocking, and component balancing.
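The generative-classifier component can be sketched independently of the SCB and DCB modules: fit one Gaussian mixture per class on backbone features and classify by the highest posterior. The component count and diagonal covariances below are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """Generative classifier sketch: one GMM per class models the (possibly
    multi-modal) feature distribution; test features go to the class with the
    highest log-likelihood plus log-prior."""

    def __init__(self, n_components=3):
        self.n_components = n_components
        self.gmms, self.log_priors, self.classes = {}, {}, None

    def fit(self, feats, labels):
        self.classes = np.unique(labels)
        for c in self.classes:
            fc = feats[labels == c]
            self.gmms[c] = GaussianMixture(n_components=self.n_components,
                                           covariance_type="diag").fit(fc)
            self.log_priors[c] = np.log(len(fc) / len(feats))
        return self

    def predict(self, feats):
        scores = np.stack([self.gmms[c].score_samples(feats) + self.log_priors[c]
                           for c in self.classes], axis=1)
        return self.classes[np.argmax(scores, axis=1)]

# Toy multi-modal features standing in for backbone embeddings from two domains.
rng = np.random.default_rng(0)
f0 = np.concatenate([rng.normal(0, 1, (200, 16)), rng.normal(4, 1, (200, 16))])
f1 = rng.normal(-3, 1, (400, 16))
X, y = np.concatenate([f0, f1]), np.array([0] * 400 + [1] * 400)
clf = GMMClassifier().fit(X, y)
print((clf.predict(X) == y).mean())
```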

Authors:Wenzhuo Liu, Wenshuo Wang, Yicheng Qiao, Qiannan Guo, Jiayin Zhu, Pengfei Li, Zilong Chen, Huiming Yang, Zhiwei Li, Lening Wang, Tiao Tan, Huaping Liu
Title: MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
Abstract:
Advanced driver assistance systems require a comprehensive understanding of the driver's mental/physical state and traffic context but existing works often neglect the potential benefits of joint learning between these tasks. This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). A key challenge is avoiding negative transfer between tasks, which can impair learning performance. To address this, we introduce two key components into the framework: one is the multi-axis region attention network to extract global context-sensitive features, and the other is the dual-branch multimodal embedding to learn multimodal embeddings from both task-shared and task-specific features. The former uses a multi-attention mechanism to extract task-relevant features, mitigating negative transfer caused by task-unrelated features. The latter employs a dual-branch structure to adaptively adjust task-shared and task-specific parameters, enhancing cross-task knowledge transfer while reducing task conflicts. We assess MMTL-UniAD on the AIDE dataset, using a series of ablation studies, and show that it outperforms state-of-the-art methods across all four tasks. The code is available on https://github.com/Wenzhuo-Liu/MMTL-UniAD.
中文: 本文提出了MMTL-UniAD框架,通过多轴注意力机制和双分支嵌入设计,在联合学习驾驶员行为、情绪、车辆行为与交通环境的同时有效抑制任务间负迁移,在AIDE数据集上实现了最优性能。
English: This paper introduces MMTL-UniAD, a unified multi-modal multi-task learning framework that jointly recognizes driver behavior, emotion, vehicle behavior, and traffic context while mitigating negative transfer through multi-axis attention and dual-branch embedding, achieving superior performance on the AIDE dataset.

Authors:Tae-Young Lee, Sundong Park, Minwoo Jeon, Hyoseok Hwang, Gyeong-Moon Park
Title: ESC: Erasing Space Concept for Knowledge Deletion
Abstract:
As concerns regarding privacy in deep learning continue to grow, individuals are increasingly apprehensive about the potential exploitation of their personal knowledge in trained models. Despite several research efforts to address this, they often fail to consider the real-world demand from users for complete knowledge erasure. Furthermore, our investigation reveals that existing methods have a risk of leaking personal knowledge through embedding features. To address these issues, we introduce a novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provides an appropriate metric, named Knowledge Retention score (KR), for assessing knowledge retention in feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature. In addition, we suggest ESC with Training (ESC-T), which uses a learnable mask to better balance the trade-off between forgetting and preserving knowledge in KD. Our extensive experiments on various datasets and models demonstrate that our proposed methods achieve the fastest and state-of-the-art performance. Notably, our methods are applicable to diverse forgetting scenarios, such as facial domain setting, demonstrating the generalizability of our methods. The code is available at http://github.com/KU-VGI/ESC.
Chinese: 本文提出知识删除(KD)概念,通过ESC和ESC-T方法在特征空间中彻底消除个人知识,在多种场景下实现了最先进的安全高效知识遗忘性能。
English: This paper introduces Knowledge Deletion (KD) to address privacy concerns by completely erasing personal knowledge from models, proposing the ESC and ESC-T methods that achieve state-of-the-art performance in secure and efficient knowledge removal across various scenarios.
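One way to picture a training-free erasing step is to estimate the subspace spanned by the forgetting class's features and project it out of every embedding. The SVD-based sketch below follows that intuition under an assumed rank; it is not the exact ESC procedure.

```python
import torch

def build_erasing_projection(forget_feats, rank=5):
    """Take the top singular directions of the features to be forgotten and
    build a projector that removes that subspace from any feature vector."""
    centred = forget_feats - forget_feats.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(centred, full_matrices=False)
    U = Vh[:rank].T                                     # (D, rank) basis of the forget subspace
    eye = torch.eye(forget_feats.shape[1])
    return eye - U @ U.T                                # projector onto the complement

def erase(features, projection):
    return features @ projection                        # apply to any embedding batch

forget = torch.randn(256, 128) + torch.randn(1, 128) * 3.0   # features of the class to delete
P = build_erasing_projection(forget, rank=5)
kept = erase(torch.randn(32, 128), P)
```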

Authors:Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He
Title: Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
Abstract:
Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
Chinese: 本研究提出了一种数据合成流程和UNO模型,以解决主题驱动图像生成中的数据扩展和主题扩展难题,在单主题与多主题生成中均实现了高度一致性和可控性。
English: This study introduces a data synthesis pipeline and the UNO model to overcome challenges in subject-driven image generation, achieving high consistency and controllability in both single- and multi-subject scenarios.

Authors:Chao Huang, Susan Liang, Yunlong Tang, Jing Bi, Li Ma, Yapeng Tian, Chenliang Xu
Title: FreSca: Scaling in Frequency Space Enhances Diffusion Models
Abstract:
Latent diffusion models (LDMs) have achieved remarkable success in a variety of image tasks, yet achieving fine-grained, disentangled control over global structures versus fine details remains challenging. This paper explores frequency-based control within latent diffusion models. We first systematically analyze frequency characteristics across pixel space, VAE latent space, and internal LDM representations. This reveals that the "noise difference" term, derived from classifier-free guidance at each step t, is a uniquely effective and semantically rich target for manipulation. Building on this insight, we introduce FreSca, a novel and plug-and-play framework that decomposes noise difference into low- and high-frequency components and applies independent scaling factors to them via spatial or energy-based cutoffs. Essentially, FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control. We demonstrate its versatility and effectiveness in improving generation quality and structural emphasis on multiple architectures (e.g., SD3, SDXL) and across applications including image generation, editing, depth estimation, and video synthesis, thereby unlocking a new dimension of expressive control within LDMs.
Chinese: 本文提出FreSca框架,通过独立调节噪声差异的低频与高频分量,实现了对隐扩散模型的细粒度控制,无需模型重训练即可提升多种应用的生成质量和结构表现。
English: This paper introduces FreSca, a plug-and-play framework that enables fine-grained control over latent diffusion models by independently scaling low- and high-frequency components of the noise difference, enhancing generation quality and structural emphasis across various applications without requiring model retraining.
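The frequency-space scaling of the classifier-free-guidance term can be sketched directly: split the noise difference into low and high bands with a radial cutoff in the FFT domain and scale each band independently before recombining with the guidance scale. The cutoff and scale values below are illustrative, not recommended settings.

```python
import torch

def frequency_scaled_guidance(noise_cond, noise_uncond, guidance_scale=7.5,
                              low_scale=1.0, high_scale=1.2, cutoff=0.25):
    """Scale low/high frequency bands of the guidance term independently."""
    diff = noise_cond - noise_uncond                        # (B, C, H, W)
    H, W = diff.shape[-2:]
    spec = torch.fft.fftshift(torch.fft.fft2(diff), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H, device=diff.device),
                            torch.linspace(-1, 1, W, device=diff.device),
                            indexing="ij")
    low_mask = (torch.sqrt(xx ** 2 + yy ** 2) <= cutoff).to(diff.dtype)
    scaled_spec = spec * (low_scale * low_mask + high_scale * (1.0 - low_mask))
    scaled_diff = torch.fft.ifft2(torch.fft.ifftshift(scaled_spec, dim=(-2, -1))).real
    # Drop-in replacement for the usual eps_uncond + s * (eps_cond - eps_uncond).
    return noise_uncond + guidance_scale * scaled_diff

guided = frequency_scaled_guidance(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```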

Authors:Zhaoyang Zhang, Yannick Hold-Geoffroy, Miloš Hašan, Ziwen Chen, Fujun Luan, Julie Dorsey, Yiwei Hu
Title: Generating 360° Video is What You Need For a 3D Scene
Abstract:
Generating 3D scenes is still a challenging task due to the lack of readily available scene data. Most existing methods only produce partial scenes and provide limited navigational freedom. We introduce a practical and scalable solution that uses 360° video as an intermediate scene representation, capturing the full-scene context and ensuring consistent visual content throughout the generation. We propose WorldPrompter, a generative pipeline that synthesizes traversable 3D scenes from text prompts. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a true walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model, trained with a mix of image and video data, achieves convincing spatial and temporal consistency for static scenes. This is validated by an average COLMAP matching rate of 94.6%, allowing for high-quality panoramic Gaussian splat reconstruction and improved navigation throughout the scene. Qualitative and quantitative results also show it outperforms the state-of-the-art 360° video generators and 3D scene generation models.

Authors:Ilir Tahiraj, Markus Edinger, Dominik Kulmer, Markus Lienkamp
Title: CaLiV: LiDAR-to-Vehicle Calibration of Arbitrary Sensor Setups
Abstract:
In autonomous systems, sensor calibration is essential for safe and efficient navigation in dynamic environments. Accurate calibration is a prerequisite for reliable perception and planning tasks such as object detection and obstacle avoidance. Many existing LiDAR calibration methods require overlapping fields of view, while others use external sensing devices or postulate a feature-rich environment. In addition, Sensor-to-Vehicle calibration is not supported by the vast majority of calibration algorithms. In this work, we propose a novel target-based technique for extrinsic Sensor-to-Sensor and Sensor-to-Vehicle calibration of multi-LiDAR systems called CaLiV. This algorithm works for non-overlapping fields of view and does not require any external sensing devices. First, we apply motion to produce field of view overlaps and utilize a simple Unscented Kalman Filter to obtain vehicle poses. Then, we use the Gaussian mixture model-based registration framework GMMCalib to align the point clouds in a common calibration frame. Finally, we reduce the task of recovering the sensor extrinsics to a minimization problem. We show that both translational and rotational Sensor-to-Sensor errors can be solved accurately by our method. In addition, all Sensor-to-Vehicle rotation angles can also be calibrated with high accuracy. We validate the simulation results in real-world experiments. The code is open-source and available on https://github.com/TUMFTM/CaLiV.
中文:CaLiV算法实现了多激光雷达系统的外参标定,无需重叠视场或外部设备即可完成传感器间及传感器与车辆间的校准,通过基于运动的点云配准方法在平移和旋转参数上均获得高精度结果。
English: The CaLiV algorithm enables extrinsic calibration of multi-LiDAR systems for both sensor-to-sensor and sensor-to-vehicle parameters without requiring overlapping fields of view or external devices, achieving high accuracy in both translation and rotation through motion-based point cloud alignment.

Authors:Shu-Wei Lu, Yi-Hsuan Tsai, Yi-Ting Chen
Title: Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting
Abstract:
Bird's-eye view (BEV) perception has gained significant attention because it provides a unified representation to fuse multiple view images and enables a wide range of downstream autonomous driving tasks, such as forecasting and planning. Recent state-of-the-art models utilize projection-based methods which formulate BEV perception as query learning to bypass explicit depth estimation. While we observe promising advancements in this paradigm, they still fall short of real-world applications because of the lack of uncertainty modeling and expensive computational requirements. In this work, we introduce GaussianLSS, a novel uncertainty-aware BEV perception framework that revisits unprojection-based methods, specifically the Lift-Splat-Shoot (LSS) paradigm, and enhances them with depth uncertainty modeling. GaussianLSS represents spatial dispersion by learning a soft depth mean and computing the variance of the depth distribution, which implicitly captures object extents. We then transform the depth distribution into 3D Gaussians and rasterize them to construct uncertainty-aware BEV features. We evaluate GaussianLSS on the nuScenes dataset, achieving state-of-the-art performance compared to unprojection-based methods. In particular, it provides significant advantages in speed, running 2.5x faster, and in memory efficiency, using 0.3x less memory compared to projection-based methods, while achieving competitive performance with only a 0.4% IoU difference.
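The soft depth mean and variance described above amount to taking moments of a per-pixel softmax over depth bins; a minimal sketch, with an assumed depth range and bin count, is given below.

```python
import torch

def depth_mean_and_variance(depth_logits, depth_min=1.0, depth_max=60.0):
    """Per-pixel softmax over depth bins -> expected depth and its variance,
    the latter serving as a proxy for uncertainty / object extent."""
    B, D, H, W = depth_logits.shape
    bins = torch.linspace(depth_min, depth_max, D, device=depth_logits.device)
    probs = depth_logits.softmax(dim=1)                       # (B, D, H, W)
    bins = bins.view(1, D, 1, 1)
    mean = (probs * bins).sum(dim=1)                          # expected depth per pixel
    var = (probs * (bins - mean.unsqueeze(1)) ** 2).sum(dim=1)
    return mean, var                                          # each (B, H, W)

mu, sigma2 = depth_mean_and_variance(torch.randn(2, 64, 28, 60))
```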

Authors:Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan
Title: VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Abstract:
Recovering 3D scenes from sparse views is a challenging task due to its inherently ill-posed nature. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views have minimal overlap and insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering research has started to explore the potential of video generative priors and to create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference time and the lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene

Authors:Oliver Hahn, Christoph Reich, Nikita Araslanov, Daniel Cremers, Christian Rupprecht, Stefan Roth
Title: Scene-Centric Unsupervised Panoptic Segmentation
Abstract:
Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.
Chinese: 本研究首次提出一种无监督全景分割方法,直接基于场景中心图像进行训练,无需依赖物体中心数据或人工标注,通过伪标签训练和全景自训练策略,显著提升了全景分割质量。
English: This study introduces the first unsupervised panoptic segmentation method that trains directly on scene-centric imagery, eliminating the need for object-centric data or human annotations by leveraging pseudo-label training and panoptic self-training to achieve significant improvements in panoptic quality.

Authors:Jing Liu, Wenxuan Wang, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang
Title: Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Abstract:
Referring expression segmentation (RES) aims at segmenting the masks of entities that match a descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single-object, or part-level references. This introduces great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the necessary data resources and unified frameworks for the more practical multi-grained RES. In this paper, we take a step further towards the visual-granularity-unified RES task. To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With the joint model architecture and parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset and model UniRES++ will be publicly available at https://github.com/Rubics-Xuan/MRES.
中文: 本文提出了多粒度指代表达分割(MRES)任务及RefCOCOm基准和MRES-32M数据集,开发了统一多模态模型UniRES++,在多个RES基准测试中实现了最优性能。
English: This paper introduces a multi-granularity referring expression segmentation (MRES) task with the new RefCOCOm benchmark and MRES-32M dataset, proposing UniRES++, a unified multimodal model that achieves state-of-the-art performance across various RES benchmarks.

Authors:Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, Zhaoxiang Zhang
Title: End-to-End Driving with Online Trajectory Evaluation via BEV World Model
Abstract:
End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. Code is released at https://github.com/liyingyanUCAS/WoTE.
中文:WoTE框架采用高效的BEV世界模型进行自动驾驶轨迹评估,通过低延迟实时预测未来状态,在基准测试中实现了最优性能。
English: The WoTE framework introduces an efficient BEV world model for trajectory evaluation in autonomous driving, achieving state-of-the-art performance on benchmarks by enabling real-time prediction of future states with minimal latency.

Authors:Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
Title: ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
Abstract:
We present ILLUME+, which leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to simultaneously handle the three fundamental capabilities in a unified model: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization; due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, the Janus series decouples the input and output image representations, limiting their ability to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.
中文: ILLUME+ 采用双重视觉标记器和扩散解码器,提升了深度语义理解与高保真图像生成能力,解决了现有统一模型在理解、生成和编辑任务中的局限性。
English: ILLUME+ introduces a dual visual tokenizer and diffusion decoder to enhance deep semantic understanding and high-fidelity image generation, overcoming limitations in unified models for understanding, generation, and editing tasks.

Authors:Yanzhou Su, Tianbin Li, Jiyao Liu, Chenglong Ma, Junzhi Ning, Cheng Tang, Sibo Ju, Jin Ye, Pengcheng Chen, Ming Hu, Shixiang Tang, Lihao Liu, Bin Fu, Wenqi Shao, Xiaowei Hu, Xiangwen Liao, Yuanfeng Ji, Junjun He
Title: GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning
Abstract:
Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boosting diagnostic accuracy and clinical support. We also develop a reasoning data synthesis method, generating step-by-step reasoning data via rejection sampling, which further enhances the model's generalization. Experimental results show that after RL training, GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question answering. While the model demonstrates basic memorization with supervised fine-tuning, RL is crucial for true generalization. Our work establishes new evaluation benchmarks and paves the way for future advancements in medical reasoning models. Code, data, and model will be released at https://github.com/uni-medical/GMAI-VL-R1.
中文: 本文提出GMAI-VL-R1,一种通过强化学习和新型推理数据合成方法增强的多模态医疗推理模型,显著提升了医疗任务中的诊断准确性和泛化能力。
English: This paper introduces GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning and a novel reasoning data synthesis method, which significantly improves diagnostic accuracy and generalization in medical tasks.

Authors:Nusrat Munia, Abdullah-Al-Zubaer Imran
Title: Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Abstract:
Artificial Intelligence (AI) in skin disease diagnosis has improved significantly, but a major concern is that these models frequently show biased performance across subgroups, especially regarding sensitive attributes such as skin color. To address these issues, we propose a novel generative AI-based framework, namely, Dermatology Diffusion Transformer (DermDiT), which leverages text prompts generated via Vision Language Models and multimodal text-image learning to generate new dermoscopic images. We utilize large vision-language models to generate accurate and proper prompts for each dermoscopic image, which helps to generate synthetic images that improve the representation of underrepresented groups (patient, disease, etc.) in highly imbalanced datasets for clinical diagnosis. Our extensive experimentation shows that large vision-language models provide much more insightful representations, enabling DermDiT to generate high-quality images. Our code is available at https://github.com/Munia03/DermDiT
中文:提出的DermDiT框架利用生成式AI和视觉语言模型创建合成皮肤镜图像,通过增强数据集中代表性不足群体的样本来解决皮肤疾病诊断中的偏差问题。
English: The proposed DermDiT framework uses generative AI and vision-language models to create synthetic dermoscopic images, addressing biases in skin disease diagnosis by improving representation of underrepresented groups in datasets.

Authors:Huayang Huang, Xiangye Jin, Jiaxu Miao, Yu Wu
Title: Implicit Bias Injection Attacks against Text-to-Image Diffusion Models
Abstract:
The proliferation of text-to-image diffusion models (T2I DMs) has led to an increased presence of AI-generated images in daily life. However, biased T2I models can generate content with specific tendencies, potentially influencing people's perceptions. Intentional exploitation of these biases risks conveying misleading information to the public. Current research on bias primarily addresses explicit biases with recognizable visual patterns, such as skin color and gender. This paper introduces a novel form of implicit bias that lacks explicit visual features but can manifest in diverse ways across various semantic contexts. This subtle and versatile nature makes this bias challenging to detect, easy to propagate, and adaptable to a wide range of scenarios. We further propose an implicit bias injection attack framework (IBI-Attacks) against T2I diffusion models by precomputing a general bias direction in the prompt embedding space and adaptively adjusting it based on different inputs. Our attack module can be seamlessly integrated into pre-trained diffusion models in a plug-and-play manner without direct manipulation of user input or model retraining. Extensive experiments validate the effectiveness of our scheme in introducing bias through subtle and diverse modifications while preserving the original semantics. The strong concealment and transferability of our attack across various scenarios further underscore the significance of our approach. Code is available at https://github.com/Hannah1102/IBI-attacks.
中文: 本文提出文本到图像扩散模型中一种缺乏明确视觉特征但能在不同语义背景下多样化表现的新型隐性偏见,并设计了一种即插即用的攻击框架(IBI-Attacks),能在保持原始语义的同时注入难以察觉的偏见。
English: This paper introduces a novel implicit bias in text-to-image diffusion models that lacks explicit visual patterns but manifests diversely across semantic contexts, proposing a plug-and-play attack framework (IBI-Attacks) that injects subtle biases while preserving original image semantics.
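
One plausible reading of the attack module, sketched below with made-up prompt sets and a hypothetical strength parameter (not the authors' exact procedure), is to precompute a unit bias direction from contrasting prompt embeddings and then shift each incoming prompt embedding along it with an input-dependent magnitude:

```python
import numpy as np

def bias_direction(biased_embs, neutral_embs):
    """Sketch: general bias direction as the normalized difference of mean
    embeddings from biased vs. neutral prompt sets, both shaped (N, D)."""
    d = biased_embs.mean(axis=0) - neutral_embs.mean(axis=0)
    return d / np.linalg.norm(d)

def inject_bias(prompt_emb, direction, strength=0.1):
    """Shift a prompt embedding along the precomputed direction; the shift is
    scaled by the embedding norm so it adapts to different inputs."""
    return prompt_emb + strength * np.linalg.norm(prompt_emb) * direction
```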

Authors:Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
Title: SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
Abstract:
Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the SpaceR framework. First, we present SpaceR-151k, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, and 60k samples for maintaining general multimodal understanding. Second, we propose Spatially-Guided RLVR (SG-RLVR), a novel reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts in the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing spatial reasoning ability of MLLMs. Code, model, and dataset are available at https://github.com/OuyangKun10/SpaceR.
中文: SpaceR框架通过包含15.1万样本的数据集和具有地图想象机制的新型强化学习方法,有效提升了多模态大语言模型的视频空间推理能力,在空间推理基准测试中达到最优性能,同时保持卓越的视频理解表现。
English: The SpaceR framework addresses video spatial reasoning limitations in MLLMs by introducing a 151k-sample dataset and a novel reinforcement learning method with map imagination, achieving state-of-the-art performance on spatial reasoning benchmarks while maintaining competitive video understanding capabilities.
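
For intuition on the RLVR side, a minimal sketch of GRPO-style group-relative advantages computed from verifiable rewards (illustrative only; SG-RLVR adds the map-imagination mechanism on top of this):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style sketch: normalize verifiable rewards (e.g. 1 if the sampled
    answer matches the ground truth, else 0) within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# five sampled responses to the same spatial question, three answered correctly
print(group_relative_advantages([1, 0, 0, 1, 1]))
```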

Authors:Kegang Wang, Jiankai Tang, Yuxuan Fan, Jiatong Ji, Yuanchun Shi, Yuntao Wang
Title: Memory-efficient Low-latency Remote Photoplethysmography through Temporal-Spatial State Space Duality
Abstract:
Remote photoplethysmography (rPPG), enabling non-contact physiological monitoring through facial light reflection analysis, faces critical computational bottlenecks as deep learning introduces performance gains at the cost of prohibitive resource demands. This paper proposes ME-rPPG, a memory-efficient algorithm built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time constraints. Leveraging a transferable state space, ME-rPPG efficiently captures subtle periodic variations across facial frames while maintaining minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. Achieving cross-dataset MAEs of 5.38 (MMPD), 0.70 (VitalVideo), and 0.25 (PURE), ME-rPPG outperforms all baselines with improvements ranging from 21.3% to 60.2%. Our solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency -- surpassing existing methods by 19.5%-49.7% accuracy and 43.2% user satisfaction gains in real-world deployments. The code and demos are released for reproducibility on https://health-hci-group.github.io/ME-rPPG-demo/.
中文: 本文提出ME-rPPG算法,通过时空状态空间对偶性解决了远程光电容积描记技术中的计算瓶颈,在模型可扩展性、跨数据集泛化能力和实时性之间取得平衡,在跨数据集测试和实际部署中实现了更高的精度与效率。
English: This paper introduces ME-rPPG, a memory-efficient algorithm that overcomes computational bottlenecks in remote photoplethysmography by balancing model scalability, generalization, and real-time performance through temporal-spatial state space duality, achieving superior accuracy and efficiency in cross-dataset evaluations and real-world deployments.

Authors:Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, Tianshu Hu, Yongming Zhu
Title: DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
Abstract:
While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which leads to their lower expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements, while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms the state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/.

Authors:Kaan Karaman, Yuchang Jiang, Damien Robert, Vivien Sainte Fare Garnot, Maria João Santos, Jan Dirk Wegner
Title: GSR4B: Biomass Map Super-Resolution with Sentinel-1/2 Guidance
Abstract:
Accurate Above-Ground Biomass (AGB) mapping at both large scale and high spatio-temporal resolution is essential for applications ranging from climate modeling to biodiversity assessment, and sustainable supply chain monitoring. At present, fine-grained AGB mapping relies on costly airborne laser scanning acquisition campaigns usually limited to regional scales. Initiatives such as the ESA CCI map attempt to generate global biomass products from diverse spaceborne sensors but at a coarser resolution. To enable global, high-resolution (HR) mapping, several works propose to regress AGB from HR satellite observations such as ESA Sentinel-1/2 images. We propose a novel way to address HR AGB estimation, by leveraging both HR satellite observations and existing low-resolution (LR) biomass products. We cast this problem as Guided Super-Resolution (GSR), aiming at upsampling LR biomass maps (sources) from 100 m to 10 m resolution, using auxiliary HR co-registered satellite images (guides). We compare super-resolving AGB maps with and without guidance, against direct regression from satellite images, on the public BioMassters dataset. We observe that Multi-Scale Guidance (MSG) outperforms direct regression both for regression (-780 t/ha RMSE) and perception (+2.0 dB PSNR) metrics, and better captures high-biomass values, without significant computational overhead. Interestingly, unlike the RGB+Depth setting they were originally designed for, our best-performing AGB GSR approaches are those that most preserve the guide image texture. Our results make a strong case for adopting the GSR framework for accurate HR biomass mapping at scale. Our code and model weights are made publicly available (https://github.com/kaankaramanofficial/GSR4B).
Chinese Summary: 该研究提出了一种引导式超分辨率方法,利用卫星图像将低分辨率生物量地图提升至高分辨率,相比直接回归方法在生物量估算精度上表现更优。
English Summary: The study introduces a Guided Super-Resolution method that enhances low-resolution biomass maps to high resolution using satellite imagery, demonstrating superior accuracy in biomass estimation compared to direct regression approaches.
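
To illustrate the guided super-resolution setup (a toy sketch, not the paper's MSG architecture; the channel count and residual design are assumptions), a coarse AGB map can be bilinearly upsampled to the guide's 10 m grid and refined with features from the co-registered Sentinel guide:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGuidedSR(nn.Module):
    """Toy guided-SR sketch: residual refinement of an upsampled LR biomass
    map conditioned on a high-resolution satellite guide."""
    def __init__(self, guide_ch=12, hidden=32):   # guide_ch: assumed stacked S1/S2 bands
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(guide_ch + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, lr_agb, hr_guide):
        # lr_agb: (B, 1, h, w) at 100 m; hr_guide: (B, guide_ch, H, W) at 10 m
        up = F.interpolate(lr_agb, size=hr_guide.shape[-2:], mode="bilinear",
                           align_corners=False)
        return up + self.refine(torch.cat([up, hr_guide], dim=1))
```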

Authors:Yiting Lu, Xin Li, Haoning Wu, Bingchen Li, Weisi Lin, Zhibo Chen
Title: Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning
Abstract:
The rapid advancement of Large Multi-modal Foundation Models (LMM) has paved the way for the possible Explainable Image Quality Assessment (EIQA) with instruction tuning from two perspectives: overall quality explanation, and attribute-wise perception answering. However, existing works usually overlook the conflicts between these two types of perception explanations during joint instruction tuning, leading to insufficient perception understanding. To mitigate this, we propose a new paradigm for perception-oriented instruction tuning, i.e., Q-Adapt, which aims to eliminate the conflicts and achieve the synergy between these two EIQA tasks when adapting LMM, resulting in enhanced multi-faceted explanations of IQA. Particularly, we propose a progressive instruction tuning strategy by dividing the adaptation process of LMM for EIQA into two stages, where the first stage empowers the LMM with universal perception knowledge tailored for two tasks using an efficient transfer learning strategy, i.e., LoRA, and the second stage introduces the instruction-adaptive visual prompt tuning to dynamically adapt visual features for the different instructions from two tasks. In this way, our proposed Q-Adapt can achieve a lightweight visual quality evaluator, demonstrating comparable performance and, in some instances, superior results across perceptual-related benchmarks and commonly-used IQA databases. The source code is publicly available at https://github.com/yeppp27/Q-Adapt.
中文:Q-Adapt方法通过渐进式指令调优解决可解释图像质量评估中整体与属性感知的冲突,在提升大模型多维度解释能力的同时,实现了轻量化且具有竞争力的评估效果。
English: The Q-Adapt method addresses conflicts between overall and attribute-wise explanations in Explainable Image Quality Assessment by using progressive instruction tuning to enhance LMM performance, achieving lightweight yet competitive results on benchmarks.

Authors:Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang, Nikhil Keetha, Lorenzo Porzi, Norman Müller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, Peter Kontschieder
Title: FlowR: Flowing from Sparse to Dense 3D Reconstructions
Abstract:
3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of applications like Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These models typically rely on a noise-to-data generative process conditioned only on a handful of reference input views, leading to hallucinations, inconsistent generation results, and subsequent reconstruction artifacts. Instead, we propose a multi-view, flow matching model that learns a flow to directly connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with consistent, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.
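
A minimal data-to-data flow matching objective, consistent with the idea of learning a flow from sparse-reconstruction renders to dense-reconstruction renders (a sketch with an assumed model signature, not FlowR's training code):

```python
import torch

def flow_matching_loss(model, x_sparse, x_dense):
    """Sketch: regress the velocity of a straight path from a render of the
    sparse reconstruction (x_sparse) to the target dense render (x_dense)."""
    t = torch.rand(x_sparse.size(0), 1, 1, 1, device=x_sparse.device)  # per-sample time
    x_t = (1 - t) * x_sparse + t * x_dense                             # point on the path
    v_pred = model(x_t, t)                                             # assumed model signature
    return ((v_pred - (x_dense - x_sparse)) ** 2).mean()
```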

Authors:Yuehui Qiu, Dandan Shan, Yining Wang, Pei Dong, Dijia Wu, Xinnian Yang, Qingqi Hong, Dinggang Shen
Title: A topology-preserving three-stage framework for fully-connected coronary artery extraction
Abstract:
Coronary artery extraction is a crucial prerequisite for computer-aided diagnosis of coronary artery disease. Accurately extracting the complete coronary tree remains challenging due to several factors, including the presence of thin distal vessels, tortuous topological structures, and insufficient contrast. These issues often result in over-segmentation and under-segmentation in current segmentation methods. To address these challenges, we propose a topology-preserving three-stage framework for fully-connected coronary artery extraction. This framework includes vessel segmentation, centerline reconnection, and missing vessel reconstruction. First, we introduce a new centerline enhanced loss in the segmentation process. Second, for the broken vessel segments, we further propose a regularized walk algorithm to integrate distance, probabilities predicted by a centerline classifier, and directional cosine similarity, for reconnecting the centerlines. Third, we apply implicit neural representation and implicit modeling to reconstruct the geometric model of the missing vessels. Experimental results show that our proposed framework outperforms existing methods, achieving Dice scores of 88.53% and 85.07%, with Hausdorff Distances (HD) of 1.07 mm and 1.63 mm on ASOCA and PDSCA datasets, respectively. Code will be available at https://github.com/YH-Qiu/CorSegRec.
Chinese: 本文提出了一种三阶段框架用于全连接冠状动脉提取,通过血管分割、中心线重连和缺失血管重建解决分割难题,在基准数据集上分别达到88.53%和85.07%的Dice分数,性能优于现有方法。
English: This paper presents a three-stage framework for fully-connected coronary artery extraction that addresses segmentation challenges through vessel segmentation, centerline reconnection, and missing vessel reconstruction, achieving superior performance with Dice scores of 88.53% and 85.07% on benchmark datasets.
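
As a rough sketch of how the regularized walk might combine its three cues (the weights, sign conventions, and candidate interface are assumptions, not the paper's exact formulation):

```python
import numpy as np

def step_score(prev_pt, cur_pt, cand_pt, centerline_prob, w=(1.0, 1.0, 1.0)):
    """Score one candidate step of the walk: penalize distance, reward the
    centerline classifier's probability, and reward direction consistency."""
    dist = np.linalg.norm(cand_pt - cur_pt)
    v_prev, v_new = cur_pt - prev_pt, cand_pt - cur_pt
    cos = np.dot(v_prev, v_new) / (np.linalg.norm(v_prev) * np.linalg.norm(v_new) + 1e-8)
    return -w[0] * dist + w[1] * centerline_prob + w[2] * cos  # walk to the argmax candidate
```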

Authors:Jijun Xiang, Xuan Zhu, Xianqi Wang, Yu Wang, Hong Zhang, Fei Guo, Xin Yang
Title: DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
Abstract:
Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our Code is available at https://github.com/ShadowBbBb/Depthor
Chinese: 提出的DEPTHOR方法通过模拟真实世界dToF数据进行鲁棒训练,并结合单目深度估计来增强深度补全,在基准数据集上以显著提升的精度指标实现了最先进的性能。
English: The proposed DEPTHOR method enhances depth completion by simulating real-world dToF data for robust training and integrating monocular depth estimation, achieving state-of-the-art performance with significant improvements in accuracy metrics on benchmark datasets.
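
A simplified sketch of simulating low-resolution, noisy dToF measurements from dense ground-truth depth (the zone layout, noise model, and dropout rate are assumptions used only for illustration):

```python
import numpy as np

def simulate_dtof(gt_depth, zones=(8, 8), rel_noise=0.05, dropout=0.1, rng=None):
    """Zone-average dense GT depth, add depth-proportional noise, and drop a
    fraction of zones to mimic anomalous / missing dToF returns."""
    rng = np.random.default_rng() if rng is None else rng
    zh, zw = gt_depth.shape[0] // zones[0], gt_depth.shape[1] // zones[1]
    z = gt_depth[: zh * zones[0], : zw * zones[1]]
    z = z.reshape(zones[0], zh, zones[1], zw).mean(axis=(1, 3))   # per-zone mean depth
    z = z * (1.0 + rel_noise * rng.standard_normal(z.shape))      # multiplicative noise
    z[rng.random(z.shape) < dropout] = 0.0                        # invalid zones
    return z
```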

Authors:Adriano Fragomeni, Dima Damen, Michael Wray
Title: Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
Abstract:
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags -- automatically extracted from foundation models -- to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, allowing concepts to be distinguished from one another. We conduct extensive experiments on six diverse datasets: two different splits of MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across others. Project Webpage: https://adrianofragomeni.github.io/MAC-VR/

Authors:Dandan Shan, Zihan Li, Yunxiang Li, Qingde Li, Jie Tian, Qingqi Hong
Title: STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation
Abstract:
Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code has been made publicly available at https://github.com/HUANGLIZI/STPNet.
中文: STPNet是一种尺度感知文本提示网络,通过多尺度文本描述和检索-分割联合学习增强医学图像分割效果,在无需推理阶段文本输入的情况下超越了现有最优方法。
English: STPNet is a scale-aware text prompt network that enhances medical image segmentation by leveraging multi-scale textual descriptions and retrieval-segmentation joint learning, outperforming state-of-the-art methods without requiring text input during inference.

Authors:Luca Ciampi, Gabriele Lagani, Giuseppe Amato, Fabrizio Falchi
Title: Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training
Abstract:
Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi-supervised learning is a well-established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi-supervised teacher-student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle-consistency constraint based on noise-corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling capabilities. Finally, to further enhance performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness in scenarios with limited annotated data. The code to replicate our experiments can be found at https://github.com/ciampluca/diffusion_semi_supervised_biomedical_image_segmentation
中文: 本文提出了一种基于去噪扩散模型的新型半监督师生框架,通过迭代伪标签优化有效利用有限标注数据进行生物医学图像分割,在多个基准测试中超越了现有最优方法。
English: This paper introduces a novel semi-supervised teacher-student framework using denoising diffusion models for biomedical image segmentation, which outperforms existing methods by effectively leveraging limited labeled data through iterative pseudo-label refinement.
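
A minimal sketch of the co-training target selection (assumed shapes and loss; the DDPM-based teacher and multi-round pseudo-label refinement are not shown): the student is supervised by ground truth where available and by the teacher's pseudo-label otherwise:

```python
import torch
import torch.nn.functional as F

def student_loss(student_logits, gt_mask, teacher_mask, has_label):
    """student_logits: (B, C, H, W); gt_mask, teacher_mask: (B, H, W) integer labels;
    has_label: (B,) bool flags marking which samples carry annotations."""
    target = torch.where(has_label.view(-1, 1, 1), gt_mask, teacher_mask)
    return F.cross_entropy(student_logits, target.long())
```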

Authors:Zixuan Wang, Duo Peng, Feng Chen, Yuwei Yang, Yinjie Lei
Title: Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis
Abstract:
Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce comprehensive geometric constraints that preserve the spatial configuration. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these alignment modules, our framework enhances the model's adaptability to diverse conditional generation tasks and greatly expands its application range. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual description, segmentation mask (bounding box), drag manipulation, and their combinations. Code is available at https://github.com/ZixuanWang0525/DADG.
中文摘要:本文提出了一种模块化条件图像合成框架,将条件划分为文本、布局和拖拽三个基本单元,通过专门设计的对齐模块增强模型对多样化生成任务的适应性和应用范围。
English Summary: This paper introduces a modular framework for conditional image synthesis that divides conditions into text, layout, and drag units, with specialized alignment modules for each to enhance adaptability and application scope across diverse generation tasks.

Authors:Ziteng Cui, Xuangeng Chu, Tatsuya Harada
Title: Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment
Abstract:
Capturing high-quality photographs under diverse real-world lighting conditions is challenging, as both natural lighting (e.g., low-light) and camera exposure settings (e.g., exposure time) significantly impact image quality. This challenge becomes more pronounced in multi-view scenarios, where variations in lighting and image signal processor (ISP) settings across viewpoints introduce photometric inconsistencies. Such lighting degradations and view-dependent variations pose substantial challenges to novel view synthesis (NVS) frameworks based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). To address this, we introduce Luminance-GS, a novel approach to achieving high-quality novel view synthesis results under diverse challenging lighting conditions using 3DGS. By adopting per-view color matrix mapping and view-adaptive curve adjustments, Luminance-GS achieves state-of-the-art (SOTA) results across various lighting conditions -- including low-light, overexposure, and varying exposure -- while not altering the original 3DGS explicit representation. Compared to previous NeRF- and 3DGS-based baselines, Luminance-GS provides real-time rendering speed with improved reconstruction quality.
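
For intuition, one simple way to realize a per-view color matrix plus tone-curve adjustment on a rendered image (a sketch with an assumed power-law curve parameterization; the paper's learned curves may differ):

```python
import torch

def view_adaptive_adjust(rgb, color_mat, gamma):
    """rgb: (H, W, 3) render in [0, 1]; color_mat: per-view (3, 3) matrix;
    gamma: per-view scalar controlling a simple power-law tone curve."""
    mapped = torch.einsum("hwc,dc->hwd", rgb, color_mat)   # per-pixel color mapping
    return mapped.clamp(0.0, 1.0) ** gamma                 # view-adaptive curve
```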

Authors:Siran Li, Chen Liu, Ruiyang Liu, Zhendong Wang, Gaofeng He, Yong-Lu Li, Xiaogang Jin, Huamin Wang
Title: GarmageNet: A Multimodal Generative Framework for Sewing Pattern Design and Generic Garment Modeling
Abstract:
Realistic digital garment modeling remains a labor-intensive task due to the intricate process of translating 2D sewing patterns into high-fidelity, simulation-ready 3D garments. We introduce GarmageNet, a unified generative framework that automates the creation of 2D sewing patterns, the construction of sewing relationships, and the synthesis of 3D garment initializations compatible with physics-based simulation. Central to our approach is Garmage, a novel garment representation that encodes each panel as a structured geometry image, effectively bridging the semantic and geometric gap between 2D structural patterns and 3D garment shapes. GarmageNet employs a latent diffusion transformer to synthesize panel-wise geometry images and integrates GarmageJigsaw, a neural module for predicting point-to-point sewing connections along panel contours. To support training and evaluation, we build GarmageSet, a large-scale dataset comprising over 10,000 professionally designed garments with detailed structural and style annotations. Our method demonstrates versatility and efficacy across multiple application scenarios, including scalable garment generation from multi-modal design concepts (text prompts, sketches, photographs), automatic modeling from raw flat sewing patterns, pattern recovery from unstructured point clouds, and progressive garment editing using conventional instructions, laying the foundation for fully automated, production-ready pipelines in digital fashion. Project page: https://style3d.github.io/garmagenet.

Authors:Soumyya Kanti Datta, Shan Jia, Siwei Lyu
Title: Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies
Abstract:
Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are one of the most challenging deepfakes to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2.
Chinese: 本文提出LIPINC-V2检测框架,通过识别嘴部区域的时空不一致性来有效检测细微的唇语同步深度伪造,在基准数据集上实现了最先进的检测性能。
English: This paper introduces LIPINC-V2, a novel detection framework that identifies spatiotemporal inconsistencies in the mouth region to effectively detect subtle lip-syncing deepfakes, achieving state-of-the-art performance on benchmark datasets.

Authors:Korbinian Moller, Luis Schwarzmeier, Johannes Betz
Title: From Shadows to Safety: Occlusion Tracking and Risk Mitigation for Urban Autonomous Driving
Abstract:
Autonomous vehicles (AVs) must navigate dynamic urban environments where occlusions and perception limitations introduce significant uncertainties. This research builds upon and extends existing approaches in risk-aware motion planning and occlusion tracking to address these challenges. While prior studies have developed individual methods for occlusion tracking and risk assessment, a comprehensive method integrating these techniques has not been fully explored. We, therefore, enhance a phantom agent-centric model by incorporating sequential reasoning to track occluded areas and predict potential hazards. Our model enables realistic scenario representation and context-aware risk evaluation by modeling diverse phantom agents, each with distinct behavior profiles. Simulations demonstrate that the proposed approach improves situational awareness and balances proactive safety with efficient traffic flow. While these results underline the potential of our method, validation in real-world scenarios is necessary to confirm its feasibility and generalizability. By utilizing and advancing established methodologies, this work contributes to safer and more reliable AV planning in complex urban environments. To support further research, our method is available as open-source software at: https://github.com/TUM-AVS/OcclusionAwareMotionPlanning
Chinese: 本研究通过将顺序推理融入虚拟智能体模型来增强自动驾驶车辆的安全性,改进了遮挡追踪和风险评估,仿真实验表明该方法能提升环境感知能力并平衡交通流效率。
English: This research enhances autonomous vehicle safety by integrating sequential reasoning into phantom agent modeling for improved occlusion tracking and risk assessment, with simulations showing better situational awareness and traffic flow balance.

Authors:Chang-Bin Zhang, Jinhong Ni, Yujie Zhong, Kai Han
Title: v-CLR: View-Consistent Learning for Open-World Instance Segmentation
Abstract:
In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail in detecting novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called view-Consistent LeaRning (v-CLR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation. In v-CLR, we first introduce additional views for each image, where the texture undergoes significant alterations while preserving the image's underlying structure. We then encourage the model to learn the appearance-invariant representation by enforcing the consistency between object features across different views, for which we obtain class-agnostic object proposals using off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing the object-awareness. We thoroughly evaluate our method on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance. Project page: https://visual-ai.github.io/vclr
中文: 本文提出视图一致性学习(v-CLR)框架,通过跨视图特征一致性训练模型学习外观不变的对象表示,在开放世界实例分割任务中取得领先性能。
English: This paper introduces view-Consistent LeaRning (v-CLR), a framework that enhances open-world instance segmentation by training models to recognize appearance-invariant object representations through cross-view feature consistency, achieving state-of-the-art results on benchmarks.
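
A bare-bones version of the cross-view consistency term (an assumed cosine form; the actual v-CLR objective and proposal matching are more involved):

```python
import torch
import torch.nn.functional as F

def view_consistency_loss(feats_a, feats_b):
    """feats_a, feats_b: (N, D) features of the same N object proposals taken
    from two views whose textures differ but whose structure is shared."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()   # push matched features to agree
```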

Authors:Zirui Wu, Jianteng Chen, Laijian Li, Shaoteng Wu, Zhikai Zhu, Kang Xu, Martin R. Oswald, Jie Song
Title: 3D Gaussian Inverse Rendering with Approximated Global Illumination
Abstract:
3D Gaussian Splatting shows great potential in reconstructing photo-realistic 3D scenes. However, these methods typically bake illumination into their representations, limiting their use for physically-based rendering and scene editing. Although recent inverse rendering approaches aim to decompose scenes into material and lighting components, they often rely on simplifying assumptions that fail when editing. We present a novel approach that enables efficient global illumination for 3D Gaussian Splatting through screen-space ray tracing. Our key insight is that a substantial amount of indirect light can be traced back to surfaces visible within the current view frustum. Leveraging this observation, we augment the direct shading computed by 3D Gaussians with Monte-Carlo screen-space ray-tracing to capture one-bounce indirect illumination. In this way, our method enables realistic global illumination without sacrificing the computational efficiency and editability benefits of 3D Gaussians. Through experiments, we show that the screen-space approximation we utilize allows for indirect illumination and supports real-time rendering and editing. Code, data, and models will be made available at our project page: https://wuzirui.github.io/gs-ssr.

Authors:Chunhui Zhang, Li Liu, Jialin Gao, Xin Sun, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang
Title: COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking
Abstract:
Transformer has recently demonstrated great potential in improving vision-language (VL) tracking algorithms. However, most of the existing VL trackers rely on carefully designed mechanisms to perform the multi-stage multi-modal fusion. Additionally, direct multi-modal fusion without alignment ignores distribution discrepancy between modalities in feature space, potentially leading to suboptimal representations. In this work, we propose COST, a contrastive one-stage transformer fusion framework for VL tracking, aiming to learn semantically consistent and unified VL representations. Specifically, we introduce a contrastive alignment strategy that maximizes mutual information (MI) between a video and its corresponding language description. This enables effective cross-modal alignment, yielding semantically consistent features in the representation space. By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism, empirically demonstrating that a simple stack of transformer encoders effectively enables unified VL representations. Moreover, we contribute a newly collected VL tracking benchmark dataset for small object tracking, named VL-SOT500, with bounding boxes and language descriptions. Our dataset comprises two challenging subsets, VL-SOT230 and VL-SOT270, dedicated to evaluating generic and high-speed small object tracking, respectively. Small object tracking is notoriously challenging due to weak appearance and limited features, and this dataset is, to the best of our knowledge, the first to explore the usage of language cues to enhance visual representation for small object tracking. Extensive experiments demonstrate that COST achieves state-of-the-art performance on five existing VL tracking datasets, as well as on our proposed VL-SOT500 dataset. Source codes and dataset will be made publicly available.
中文: 提出的COST框架采用对比式单阶段Transformer,通过最大化视频与语言间的互信息实现跨模态对齐,在包括新型小目标追踪数据集VL-SOT500在内的多个视觉语言追踪基准上取得了最优性能。
English: The proposed COST framework introduces a contrastive one-stage transformer that aligns video and language features through mutual information maximization, achieving state-of-the-art performance on multiple VL tracking benchmarks including the newly introduced VL-SOT500 dataset for small object tracking.
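
The contrastive alignment can be grounded with a standard symmetric InfoNCE loss, which lower-bounds the video-text mutual information (a generic sketch; COST's exact loss and temperature are not specified here):

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, tau=0.07):
    """video_emb, text_emb: (B, D) paired embeddings; row i of each matrix
    describes the same sample, while all other rows act as negatives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / tau
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```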

Authors:Jiawei Wang, Yushen Zuo, Yuanjun Chai, Zhendong Liu, Yicheng Fu, Yichun Feng, Kin-Man Lam
Title: Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Abstract:
Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned/misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving the functionality of the VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of the diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at https://github.com/JarvisUSTC/DiffPure-RobustVLM.
中文摘要:视觉语言模型易受噪声图像攻击,而提出的Robust-VLGuard通过噪声增强微调和DiffPure-VLM防御方法,在保持模型功能的同时显著提升了对抗扰动的防御能力。
English Summary: Vision-Language Models remain vulnerable to jailbreak attacks through noisy images, but the proposed Robust-VLGuard with noise-augmented fine-tuning and DiffPure-VLM defense effectively mitigates these security risks while maintaining model functionality.
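
The noise-augmented fine-tuning idea can be pictured with a simple augmentation step (a sketch; the noise range is an assumption, not the paper's setting):

```python
import torch

def add_gaussian_noise(images, sigma_max=0.1):
    """images: (B, C, H, W) in [0, 1]; perturb each sample with Gaussian noise
    of random strength so safety behavior is learned under corrupted inputs."""
    sigma = torch.rand(images.size(0), 1, 1, 1, device=images.device) * sigma_max
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)
```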

Authors:Salim Khazem, Jeremy Fix, Cédric Pradalier
Title: PolygoNet: Leveraging Simplified Polygonal Representation for Effective Image Classification
Abstract:
Deep learning models have achieved significant success in various image-related tasks. However, they often encounter challenges related to computational complexity and overfitting. In this paper, we propose an efficient approach that leverages polygonal representations of images using dominant points or contour coordinates. By transforming input images into these compact forms, our method significantly reduces computational requirements, accelerates training, and conserves resources, making it suitable for real-time and resource-constrained applications. These representations inherently capture essential image features while filtering noise, providing a natural regularization effect that mitigates overfitting. The resulting lightweight models achieve performance comparable to state-of-the-art methods using full-resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the effectiveness of our approach in reducing complexity, improving generalization, and facilitating edge computing applications. This work demonstrates the potential of polygonal representations in advancing efficient and scalable deep learning solutions for real-world scenarios. The code for the experiments of the paper is provided at https://github.com/salimkhazem/PolygoNet.
Chinese: 本文提出一种高效的深度学习方法,通过采用多边形图像表示来降低计算复杂度、防止过拟合,并能在保持与先进方法相当性能的同时,实现在边缘设备上的部署。
English: This paper introduces an efficient deep learning method that uses polygonal image representations to reduce computational complexity, prevent overfitting, and enable deployment on edge devices while maintaining performance comparable to state-of-the-art approaches.
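
A possible preprocessing sketch for the contour-based representation using OpenCV (the binarization choice, point budget, and padding are assumptions; the paper's dominant-point extraction may differ):

```python
import cv2
import numpy as np

def image_to_polygon(gray_uint8, epsilon_frac=0.01, max_points=64):
    """Binarize an image, keep its largest external contour, simplify it into
    a short polygon, and pad/truncate to a fixed-length point sequence."""
    _, mask = cv2.threshold(gray_uint8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(contours, key=cv2.contourArea)                   # largest object contour
    eps = epsilon_frac * cv2.arcLength(cnt, True)
    poly = cv2.approxPolyDP(cnt, eps, True).reshape(-1, 2).astype(np.float32)
    out = np.zeros((max_points, 2), dtype=np.float32)          # fixed-length point sequence
    out[: min(max_points, len(poly))] = poly[:max_points]
    return out   # compact polygonal input for a lightweight classifier
```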

Authors:Ilir Tahiraj, Jeremialie Swadiryus, Felix Fent, Markus Lienkamp
Title: Cal or No Cal? -- Real-Time Miscalibration Detection of LiDAR and Camera Sensors
Abstract:
The goal of extrinsic calibration is the alignment of sensor data to ensure an accurate representation of the surroundings and enable sensor fusion applications. From a safety perspective, sensor calibration is a key enabler of autonomous driving. In the current state of the art, a trend from target-based offline calibration towards targetless online calibration can be observed. However, online calibration is subject to strict real-time and resource constraints which are not met by state-of-the-art methods. This is mainly due to the high number of parameters to estimate, the reliance on geometric features, or the dependence on specific vehicle maneuvers. To meet these requirements and ensure the vehicle's safety at any time, we propose a miscalibration detection framework that shifts the focus from the direct regression of calibration parameters to a binary classification of the calibration state, i.e., calibrated or miscalibrated. Therefore, we propose a contrastive learning approach that compares embedded features in a latent space to classify the calibration state of two different sensor modalities. Moreover, we provide a comprehensive analysis of the feature embeddings and challenging calibration errors that highlight the performance of our approach. As a result, our method outperforms the current state-of-the-art in terms of detection performance, inference time, and resource demand. The code is open source and available on https://github.com/TUMFTM/MiscalibrationDetection.
Chinese: 本文提出了一种基于对比学习的失准检测框架,通过将传感器校准状态分类为已校准或未校准,在检测性能、速度和资源效率上均优于现有方法。
English: The paper introduces a miscalibration detection framework using contrastive learning to classify sensor calibration states as calibrated or miscalibrated, outperforming current methods in detection performance, speed, and efficiency.

Authors:Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong
Title: An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection
Abstract:
Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance the speed-accuracy trade-off. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at https://github.com/liu37972/Multirate-Location-on-OCT-X-Learning.git.
中文: 该研究提出了一种结合OCT-X算法与先进硬件的集成系统,实现了99.70%的胃癌早期诊断准确率,显著优于现有方法。
English: The study introduces an integrated system combining the OCT-X algorithm with advanced hardware to achieve 99.70% diagnostic accuracy for early gastric cancer detection, significantly outperforming existing methods.

Authors:Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan
Title: AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Abstract:
Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules, where players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Codes and checkpoints are available at https://github.com/TencentARC/AnimeGamer.
中文: 生成式游戏的最新进展允许玩家通过语言指令与动漫世界互动,而提出的AnimeGamer模型通过多模态表示生成动态且上下文一致的动画,解决了现有方法的一致性和静态画面限制问题。
English: Recent advances in generative games enable players to interact with anime worlds through language instructions, and the proposed AnimeGamer model addresses inconsistencies and static limitations by using multimodal representations to generate dynamic, contextually consistent animations.

Authors:Saarthak Kapse, Pushpak Pati, Srikar Yellapragada, Srijan Das, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras, Prateek Prasanna
Title: GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
Abstract:
Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at https://github.com/bmi-imaginelab/GECKO
中文摘要:GECKO提出了一种自监督预训练方法,通过将全切片图像嵌入与可解释的概念先验对齐,在多项任务中超越现有方法,同时提供具有临床意义的病理学解释。
English Summary: GECKO introduces a self-supervised pretraining method that aligns whole slide image embeddings with interpretable concept priors, outperforming existing approaches across multiple tasks while providing clinically meaningful insights.
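For intuition about the concept prior mentioned above, here is a minimal, hedged sketch of one plausible way to score a slide against predefined pathology concepts from patch embeddings; the softmax-then-mean pooling and the tensor names are assumptions, not the released GECKO recipe.

```python
import torch
import torch.nn.functional as F

def concept_prior(patch_embs: torch.Tensor, concept_text_embs: torch.Tensor) -> torch.Tensor:
    """
    patch_embs: (num_patches, d) embeddings from a patch encoder (e.g. a CLIP-style model).
    concept_text_embs: (num_concepts, d) text embeddings of pathology concept descriptions.
    Returns a (num_concepts,) slide-level concept prior.
    """
    patch_embs = F.normalize(patch_embs, dim=-1)
    concept_text_embs = F.normalize(concept_text_embs, dim=-1)
    # Cosine similarity of every patch to every concept: (num_patches, num_concepts).
    sim = patch_embs @ concept_text_embs.t()
    # Per-patch softmax turns similarities into a concept distribution, mean-pooled over the slide.
    return sim.softmax(dim=-1).mean(dim=0)
```

The dual-branch MIL model then aligns a learned aggregation of the patch embeddings with a learned aggregation of this concept prior via a contrastive objective.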

Authors:Peter Kocsis, Lukas Höllein, Matthias Nießner
Title: IntrinsiX: High-Quality PBR Generation using Image Priors
Abstract:
We introduce IntrinsiX, a novel method that generates high-quality intrinsic images from text description. In contrast to existing text-to-image models whose outputs contain baked-in scene lighting, our approach predicts physically-based rendering (PBR) maps. This enables the generated outputs to be used for content creation scenarios in core graphics applications that facilitate re-lighting, editing, and texture generation tasks. In order to train our generator, we exploit strong image priors, and pre-train separate models for each PBR material component (albedo, roughness, metallic, normals). We then align these models with a new cross-intrinsic attention formulation that concatenates key and value features in a consistent fashion. This allows us to exchange information between each output modality and to obtain semantically coherent PBR predictions. To ground each intrinsic component, we propose a rendering loss which provides image-space signals to constrain the model, thus facilitating sharp details also in the output BRDF properties. Our results demonstrate detailed intrinsic generation with strong generalization capabilities that outperforms existing intrinsic image decomposition methods used with generated images by a significant margin. Finally, we show a series of applications, including re-lighting, editing, and text-conditioned room-scale PBR texture generation.

Authors:Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei
Title: MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
Abstract:
Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to address the trade-off in shared latent space for generation quality vs. representation learning and efficiency. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
中文:MergeVQ通过将令牌合并技术与基于矢量量化的生成模型相结合,在统一架构中平衡了图像生成与视觉表征学习,在ImageNet上实现了优异的性能表现,同时保持了高效的令牌利用和推理速度。
English: MergeVQ integrates token merging with vector quantization in a unified architecture to enhance both image generation quality and representation learning efficiency, achieving competitive results on ImageNet while maintaining token efficiency and inference speed.
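The token-merging step can be pictured with the small sketch below: keep the k highest-scoring tokens after self-attention and fold the remaining tokens into their nearest kept token. This is a generic merging scheme under our own assumptions (the saliency scores and the averaging rule are illustrative), not the MergeVQ code.

```python
import torch

def merge_tokens(tokens: torch.Tensor, scores: torch.Tensor, k: int):
    """
    tokens: (n, d) token features after a self-attention block.
    scores: (n,) saliency per token (e.g. mean attention received).
    Returns (k, d) merged tokens and the indices of the kept tokens.
    """
    keep_idx = scores.topk(k).indices                     # top-k "semantic" tokens
    keep = tokens[keep_idx]
    drop_mask = torch.ones(tokens.size(0), dtype=torch.bool, device=tokens.device)
    drop_mask[keep_idx] = False
    dropped = tokens[drop_mask]
    if dropped.numel() == 0:
        return keep, keep_idx
    # Assign each dropped token to its most similar kept token and average it in.
    assign = (dropped @ keep.t()).argmax(dim=-1)          # (n - k,)
    merged = keep.clone()
    for j in range(k):
        members = dropped[assign == j]
        if members.size(0) > 0:
            merged[j] = (keep[j] + members.sum(dim=0)) / (1 + members.size(0))
    return merged, keep_idx
```

In the paper's pipeline the kept tokens go on to quantization and global alignment, while fine detail is recovered later by cross-attention in the decoder.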

Authors:Liangbin Xie, Daniil Pakhomov, Zhonghao Wang, Zongze Wu, Ziyan Chen, Yuqian Zhou, Haitian Zheng, Zhifei Zhang, Zhe Lin, Jiantao Zhou, Chao Dong
Title: TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting
Abstract:
This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: https://liangbinxie.github.io/projects/TurboFill/

Authors:Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, Jiajun Wu
Title: WorldScore: A Unified Evaluation Benchmark for World Generation
Abstract:
We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at https://haoyi-duan.github.io/WorldScore/
中文: WorldScore是首个统一的世界生成基准,通过将生成过程分解为具有明确布局规范的连续场景任务,评估了19个模型在可控性、质量和动态性方面的表现,揭示了各类模型的关键见解与挑战。
English: The WorldScore benchmark is the first unified framework for evaluating world generation models by decomposing the process into sequential scene generation tasks with explicit layout specifications, assessing 19 models across controllability, quality, and dynamics to reveal key insights and challenges.

Authors:Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu
Title: IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Abstract:
Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLMs for advanced retrieval applications. The full training dataset, code, and models, covering a wide range of sizes, are available at https://github.com/BwLiu01/IDMR.
中文: 研究者提出了实例驱动的多模态图像检索(IDMR)新任务,要求模型在文本描述的不同场景中检索包含相同实体对象的图像,并通过跨域合成方法生成训练数据,其基于多模态大语言模型的检索方法在传统和零样本基准测试中均优于现有模型。
English: The authors introduce Instance-Driven Multimodal Image Retrieval (IDMR), a novel task requiring models to retrieve images with the same object instance in varied text-described scenarios, and develop a cross-domain synthesis method to create training data, with their MLLM-based model outperforming existing approaches on both traditional and zero-shot benchmarks.

Authors:Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme
Title: WikiVideo: Article Generation from Multiple Videos
Abstract:
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
中文摘要:本文提出了WikiVideo基准,用于从多视频生成维基百科式文章,并开发了协作文章生成(CAG)方法,通过推理模型与视频大模型的交互提升事件语义理解能力,其性能显著优于现有技术。
English Summary: This paper introduces WikiVideo, a benchmark for generating Wikipedia-style articles from multiple videos, and proposes Collaborative Article Generation (CAG), an interactive method that enhances high-level event understanding by combining reasoning models with VideoLLMs, outperforming existing approaches.

Authors:Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin
Title: DBF-UNet: A Two-Stage Framework for Carotid Artery Segmentation with Pseudo-Label Generation
Abstract:
Medical image analysis faces significant challenges due to limited annotation data, particularly in three-dimensional carotid artery segmentation tasks, where existing datasets exhibit spatially discontinuous slice annotations with only a small portion of expert-labeled slices in complete 3D volumetric data. To address this challenge, we propose a two-stage segmentation framework. First, we construct continuous vessel centerlines by interpolating between annotated slice centroids and propagate labels along these centerlines to generate interpolated annotations for unlabeled slices. The slices with expert annotations are used for fine-tuning SAM-Med2D, while the interpolated labels on unlabeled slices serve as prompts to guide segmentation during inference. In the second stage, we propose a novel Dense Bidirectional Feature Fusion UNet (DBF-UNet). This lightweight architecture achieves precise segmentation of complete 3D vascular structures. The network incorporates bidirectional feature fusion in the encoder and integrates multi-scale feature aggregation with dense connectivity for effective feature reuse. Experimental validation on public datasets demonstrates that our proposed method effectively addresses the sparse annotation challenge in carotid artery segmentation while achieving superior performance compared to existing approaches. The source code is available at https://github.com/Haoxuanli-Thu/DBF-UNet.
中文: 本研究提出一个两阶段框架解决颈动脉三维分割中标注稀疏的问题,通过血管中心线插值生成标注并采用新型DBF-UNet架构实现双向特征融合,在获得精确血管分割的同时展现出优越性能。
English: This study introduces a two-stage framework for 3D carotid artery segmentation that addresses sparse annotations by generating interpolated labels via vessel centerlines and employs a novel DBF-UNet architecture with bidirectional feature fusion to achieve precise vascular segmentation with superior performance.
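A minimal sketch of the first-stage idea, under the assumption that the centerline is interpolated linearly between annotated slice centroids (the paper may use a different interpolation); the function name and example coordinates are our own.

```python
from typing import Dict, Tuple

def interpolate_centerline(annotated: Dict[int, Tuple[float, float]]) -> Dict[int, Tuple[float, float]]:
    """annotated maps slice index -> (y, x) centroid of an expert-labeled vessel cross-section."""
    zs = sorted(annotated)
    filled: Dict[int, Tuple[float, float]] = {}
    for z0, z1 in zip(zs[:-1], zs[1:]):
        (y0, x0), (y1, x1) = annotated[z0], annotated[z1]
        for z in range(z0, z1 + 1):
            t = (z - z0) / (z1 - z0)
            filled[z] = (y0 + t * (y1 - y0), x0 + t * (x1 - x0))
    return filled

# Example: labels on slices 10 and 14 yield approximate centroids for slices 11-13,
# which can serve as point prompts for a promptable segmenter on the unlabeled slices.
centers = interpolate_centerline({10: (120.0, 96.0), 14: (128.0, 100.0)})
```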

Authors:Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
Title: Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Abstract:
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular-ai/Agent-S.
Chinese: Agent S2通过组合式框架结合混合定位技术和主动分层规划,解决了图形界面交互中的核心难题,在多项计算机使用基准测试中创下了最优性能记录。
English: Agent S2 introduces a compositional framework with Mixture-of-Grounding and Proactive Hierarchical Planning to overcome GUI interaction challenges, achieving state-of-the-art performance across multiple computer use benchmarks.

Authors:Enzhe Sun, Yongchuan Cui, Peng Liu, Jining Yan
Title: A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities
Abstract:
Remote sensing spatiotemporal fusion (STF) addresses the fundamental trade-off between temporal and spatial resolution by combining high temporal-low spatial and high spatial-low temporal imagery. This paper presents the first comprehensive survey of deep learning advances in remote sensing STF over the past decade. We establish a systematic taxonomy of deep learning architectures including Convolutional Neural Networks (CNNs), Transformers, Generative Adversarial Networks (GANs), diffusion models, and sequence models, revealing significant growth in deep learning adoption for STF tasks. Our analysis reveals that CNN-based methods dominate spatial feature extraction, while Transformer architectures show superior performance in capturing long-range temporal dependencies. GAN and diffusion models demonstrate exceptional capability in detail reconstruction, substantially outperforming traditional methods in structural similarity and spectral fidelity. Through comprehensive experiments on seven benchmark datasets comparing ten representative methods, we validate these findings and quantify the performance trade-offs between different approaches. We identify five critical challenges: time-space conflicts, limited generalization across datasets, computational efficiency for large-scale processing, multi-source heterogeneous fusion, and insufficient benchmark diversity. The survey highlights promising opportunities in foundation models, hybrid architectures, and self-supervised learning approaches that could address current limitations and enable multimodal applications. The specific models, datasets, and other information mentioned in this article have been collected in: https://github.com/yc-cui/Deep-Learning-Spatiotemporal-Fusion-Survey.
中文摘要:本文首次系统综述了深度学习在遥感时空融合领域十年来的进展,通过分析不同架构的优势并指出当前五大挑战,同时展望了基础模型和混合架构等未来发展方向。
English Summary: This survey comprehensively reviews the past decade's advances in deep learning for remote sensing spatiotemporal fusion, analyzing various architectures' strengths and identifying key challenges alongside future opportunities in the field.

Authors:Fenglei Hao, Yuliang Yang, Ruiyuan Su, Zhengran Zhao, Yukun Qiao, Mengyu Zhu
Title: GISE-TTT: A Framework for Global Information Segmentation and Enhancement
Abstract:
This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS). Existing architectures often fail to effectively model these dependencies across extended temporal horizons. To overcome this limitation, we introduce GISE-TTT, a novel architecture that integrates Temporal Transformer (TTT) layers into transformer-based frameworks through a co-designed hierarchical approach. The TTT layer systematically condenses historical temporal information into hidden states that encode globally coherent contextual representations. By leveraging multi-stage contextual aggregation through hierarchical concatenation, our framework progressively refines spatiotemporal dependencies across network layers. This design represents the first systematic empirical evidence that distributing global information across multiple network layers is critical for optimal dependency utilization in video segmentation tasks. Ablation studies demonstrate that incorporating TTT modules at high-level feature stages significantly enhances global modeling capabilities, thereby improving the network's ability to capture long-range temporal relationships. Extensive experiments on DAVIS 2017 show that GISE-TTT achieves a 3.2% improvement in segmentation accuracy over the baseline model, providing comprehensive evidence that global information should be strategically leveraged throughout the network architecture. The code will be made available at: https://github.com/uuool/GISE-TTT.
中文: 本文提出GISE-TTT架构,通过分层设计的时序Transformer层有效捕捉长视频序列中的全局时间依赖关系,在DAVIS 2017数据集上实现了3.2%的精度提升。
English: This paper introduces GISE-TTT, a novel architecture that integrates Temporal Transformer layers through hierarchical design to effectively capture global temporal dependencies in long video sequences, achieving a 3.2% accuracy improvement on DAVIS 2017.

Authors:Xiaohua Qi, Renda Li, Long Peng, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao
Title: Data-free Knowledge Distillation with Diffusion Models
Abstract:
Recently, Data-Free Knowledge Distillation (DFKD) has garnered attention and can transfer knowledge from a teacher neural network to a student neural network without requiring any access to training data. Although diffusion models are adept at synthesizing high-fidelity photorealistic images across various domains, existing methods cannot be easily applied to DFKD. To bridge that gap, this paper proposes a novel approach based on diffusion models, DiffDFKD. Specifically, DiffDFKD involves targeted optimizations in two key areas. Firstly, DiffDFKD utilizes valuable information from teacher models to guide the pre-trained diffusion models' data synthesis, generating datasets that mirror the training data distribution and effectively bridge domain gaps. Secondly, to reduce computational burdens, DiffDFKD introduces Latent CutMix Augmentation, an efficient technique, to enhance the diversity of diffusion model-generated images for DFKD while preserving key attributes for effective knowledge transfer. Extensive experiments validate the efficacy of DiffDFKD, yielding state-of-the-art results exceeding existing DFKD approaches. We release our code at https://github.com/xhqi0109/DiffDFKD.
中文:本文提出DiffDFKD,一种基于扩散模型的无数据知识蒸馏新方法,通过教师模型指导生成类训练数据并采用潜在CutMix增强技术提升多样性,实现了领先的性能。
English: This paper introduces DiffDFKD, a novel data-free knowledge distillation method using diffusion models to generate training-like data guided by teacher models and employs Latent CutMix Augmentation to enhance diversity efficiently, achieving state-of-the-art results.
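For a concrete picture of a CutMix-style operation applied in latent space, here is a hedged sketch that follows the standard CutMix box-sampling recipe on diffusion latents; the actual Latent CutMix Augmentation in DiffDFKD may differ in detail, and the tensor shapes are illustrative.

```python
import torch

def latent_cutmix(latents: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """latents: (b, c, h, w) diffusion latents; mixes a random rectangle between shuffled pairs."""
    b, _, h, w = latents.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(b, device=latents.device)
    # Standard CutMix box: area proportional to (1 - lam).
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = latents.clone()
    mixed[:, :, y0:y1, x0:x1] = latents[perm][:, :, y0:y1, x0:x1]
    return mixed
```

Mixing in latent space rather than pixel space keeps the augmentation cheap, since the diffusion decoder only runs on the already-mixed latents.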

Authors:Yang Yang, Xijie Xu, Yixun Zhou, Jie Zheng
Title: CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification
Abstract:
Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at https://github.com/JieZheng-ShanghaiTech/CellVTA.
Chinese: CellVTA通过将基于CNN的适配器与视觉Transformer结合,保留高分辨率空间细节,显著提升了数字病理学中的细胞实例分割性能,在基准数据集上取得了领先成果。
English: CellVTA enhances cell instance segmentation in digital pathology by integrating a CNN-based adapter with Vision Transformers to preserve high-resolution spatial details, achieving state-of-the-art results on benchmark datasets.

Authors:Hyunwoo Park, Gun Ryu, Wonjun Kim
Title: DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting
Abstract:
Recently, 3D Gaussian splatting (3DGS) has gained considerable attention in the field of novel view synthesis due to its fast performance while yielding excellent image quality. However, 3DGS in sparse-view settings (e.g., three-view inputs) often faces the problem of overfitting to training views, which significantly drops the visual quality of novel view images. Many existing approaches have tackled this issue by using strong priors, such as 2D generative contextual information and external depth signals. In contrast, this paper introduces a prior-free method, so-called DropGaussian, with simple changes to 3D Gaussian splatting. Specifically, we randomly remove Gaussians during the training process in a manner similar to dropout, which allows non-excluded Gaussians to have larger gradients while improving their visibility. This makes the remaining Gaussians contribute more to the optimization process for rendering with sparse input views. Such a simple operation effectively alleviates the overfitting problem and enhances the quality of novel view synthesis. By simply applying DropGaussian to the original 3DGS framework, we can achieve performance competitive with existing prior-based 3DGS methods in sparse-view settings of benchmark datasets without any additional complexity. The code and model are publicly available at: https://github.com/DCVL-3D/DropGaussian release.
中文: DropGaussian是一种无先验方法,通过在训练中随机移除高斯函数来缓解稀疏视图三维高斯溅射中的过拟合问题,无需额外复杂度即可提升新视角合成的质量。
English: DropGaussian is a prior-free method that randomly removes Gaussians during training to mitigate overfitting in sparse-view 3D Gaussian splatting, enhancing novel view synthesis quality without added complexity.
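The dropout-style operation described above can be sketched in a few lines. This is our own rendering of the idea (randomly excluding Gaussians and rescaling the survivors, as in inverted dropout), not the released code; the choice of applying the mask to the opacities is an assumption for illustration.

```python
import torch

def drop_gaussians(opacities: torch.Tensor, drop_rate: float = 0.1, training: bool = True) -> torch.Tensor:
    """opacities: (num_gaussians,) per-Gaussian opacity passed to the rasterizer."""
    if not training or drop_rate == 0.0:
        return opacities
    keep = (torch.rand_like(opacities) > drop_rate).float()
    # Inverted-dropout scaling keeps the expected rendered contribution unchanged at test time.
    return opacities * keep / (1.0 - drop_rate)
```

Because a different random subset is excluded each iteration, every Gaussian must carry useful signal on its own, which is what counteracts overfitting to the few training views.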

Authors:Thomas E. Huber, Jules Lecomte, Borislav Polovnikov, Axel von Arnim
Title: Scaling Up Resonate-and-Fire Networks for Fast Deep Learning
Abstract:
Spiking neural networks (SNNs) present a promising computing paradigm for neuromorphic processing of event-based sensor data. The resonate-and-fire (RF) neuron, in particular, appeals through its biological plausibility, complex dynamics, yet computational simplicity. Despite theoretically predicted benefits, challenges in parameter initialization and efficient learning have inhibited the implementation of RF networks, constraining their use to a single layer. In this paper, we address these shortcomings by deriving the RF neuron as a structured state space model (SSM) from the HiPPO framework. We introduce S5-RF, a new SSM layer composed of RF neurons based on the S5 model, that features a generic initialization scheme and fast training within a deep architecture. S5-RF scales an RF network, for the first time, to a deep SNN with up to four layers and achieves a new state-of-the-art result of 78.8% for recurrent SNNs on the Spiking Speech Commands dataset in under three hours of training time. Moreover, compared to the reference SNNs that solve our benchmarking tasks, it achieves similar performance with far fewer spiking operations. Our code is publicly available at https://github.com/ThomasEHuber/s5-rf.
中文: 本文提出的S5-RF基于谐振发放神经元构建深度脉冲神经网络,在脉冲语音指令数据集上以更少训练时间和计算量实现了最优性能。
English: This paper introduces S5-RF, a deep spiking neural network based on resonate-and-fire neurons, which achieves state-of-the-art accuracy on the Spiking Speech Commands dataset with significantly reduced training time and computational operations.
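For readers unfamiliar with the resonate-and-fire neuron, the toy loop below simulates a single RF unit as a damped complex oscillator that spikes when its imaginary component crosses a threshold (Izhikevich's formulation, discretized with a simple exponential update). It is only a sketch: the paper's S5-RF layer derives its initialization and parallel training from the SSM/HiPPO view, which this loop does not capture, and all parameter values are arbitrary.

```python
import numpy as np

def rf_neuron(inputs: np.ndarray, omega: float = 6.0, damping: float = -0.5,
              dt: float = 1e-3, threshold: float = 1.0) -> np.ndarray:
    """inputs: (T,) input current; returns (T,) binary spike train of one RF neuron."""
    u = 0.0 + 0.0j                          # complex membrane state (a damped oscillator)
    decay = np.exp((damping + 1j * omega) * dt)
    spikes = np.zeros(len(inputs))
    for t, i_t in enumerate(inputs):
        u = decay * u + dt * i_t            # damped oscillation driven by the input
        if u.imag > threshold:              # fire when the oscillatory component crosses threshold
            spikes[t] = 1.0
            u = complex(u.real, 0.0)        # soft reset of the imaginary part
    return spikes
```

The complex decay factor is exactly the kind of diagonal state transition a structured SSM parameterizes, which is what makes the S5-style initialization and fast training applicable.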

Authors:Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem
Title: SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Abstract:
Masked video modeling methods, such as VideoMAE, are an effective paradigm for video self-supervised learning (SSL). However, they are primarily based on reconstructing pixel-level details of natural videos, which have substantial temporal redundancy, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available: https://github.com/fmthoker/SMILE
Chinese: 本文提出SMILE自监督视频学习方法,通过融合图像语言模型的空间语义和合成运动模式来增强语义表示与动态捕捉能力,在无需自然视频数据的情况下,于多个数据集上超越现有最优方法。
English: This paper introduces SMILE, a self-supervised video learning method that enhances semantic representation and motion dynamics by integrating spatial semantics from image-language models and synthetic motion patterns, outperforming existing approaches across multiple datasets without requiring natural video data.

Authors:Shuyi Zhou, Shuxiang Xie, Ryoichi Ishikawa, Takeshi Oishi
Title: Robust LiDAR-Camera Calibration with 2D Gaussian Splatting
Abstract:
LiDAR-camera systems have become increasingly popular in robotics recently. A critical and initial step in integrating the LiDAR and camera data is the calibration of the LiDAR-camera system. Most existing calibration methods rely on auxiliary target objects, which often involve complex manual operations, whereas targetless methods have yet to achieve practical effectiveness. Recognizing that 2D Gaussian Splatting (2DGS) can reconstruct geometric information from camera image sequences, we propose a calibration method that estimates LiDAR-camera extrinsic parameters using geometric constraints. The proposed method begins by reconstructing colorless 2DGS using LiDAR point clouds. Subsequently, we update the colors of the Gaussian splats by minimizing the photometric loss. The extrinsic parameters are optimized during this process. Additionally, we address the limitations of the photometric loss by incorporating the reprojection and triangulation losses, thereby enhancing the calibration robustness and accuracy.
中文: 本文提出一种无需标定物的激光雷达-相机系统校准方法,利用二维高斯泼溅重建几何信息,并通过光度、重投影和三角测量损失优化外参,从而提升校准的鲁棒性和精度。
English: This paper introduces a novel targetless calibration method for LiDAR-camera systems that leverages 2D Gaussian Splatting to reconstruct geometric information and optimizes extrinsic parameters through photometric, reprojection, and triangulation losses for improved accuracy and robustness.

Authors:Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Title: ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Abstract:
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
中文: 本文提出ShortV方法,无需训练即可通过识别并冻结多模态大语言模型中处理视觉令牌的无效层,在保持性能的同时显著降低计算成本,例如在LLaVA-NeXT-13B上实现50%的FLOPs减少。
English: This paper introduces ShortV, a training-free method that reduces computational costs in Multimodal Large Language Models by identifying and freezing ineffective layers during visual token processing, achieving up to 50% FLOPs reduction while maintaining performance.
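The Layer Contribution metric can be read as "how much does the output move if this layer stops transforming visual tokens". The sketch below makes that concrete with a KL divergence between the two output distributions; note that the model interface (a freeze_visual_layers argument and a .logits field) is purely an assumption for illustration, and a real implementation would patch the layer's forward with a hook instead.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_contribution_visual(model, inputs, visual_positions, layer_idx: int) -> float:
    """Divergence of the output when layer `layer_idx` no longer transforms visual tokens.

    NOTE: `freeze_visual_layers` / `visual_positions` are a hypothetical interface used
    only to keep this sketch short; they are not part of any real library API.
    """
    logits_full = model(**inputs).logits
    logits_skip = model(**inputs, freeze_visual_layers={layer_idx},
                        visual_positions=visual_positions).logits
    p = F.log_softmax(logits_full, dim=-1)   # reference distribution (all layers active)
    q = F.log_softmax(logits_skip, dim=-1)   # distribution with visual tokens frozen at this layer
    # KL(p || q): how far the output drifts when the layer is skipped for visual tokens.
    return F.kl_div(q, p, log_target=True, reduction="batchmean").item()

# Layers with near-zero contribution are candidates for freezing visual-token updates.
```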

Authors:Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su
Title: FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Abstract:
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
Chinese: 本文提出了FortisAVQA数据集,通过重构问题和引入分布偏移来解决视听问答中的过拟合与鲁棒性问题,并设计了MAVEN去偏网络,在性能上实现了7.81%的显著提升,达到最先进水平。
English: This paper introduces FortisAVQA, a novel dataset designed to address overfitting and robustness issues in Audio-Visual Question Answering by incorporating rephrased questions and distribution shifts, and proposes MAVEN, a debiasing network that achieves state-of-the-art performance with a 7.81% improvement.

Authors:Zhuohao Li, Zhicheng Huang, Wenchao Liu, Zhuxin Zhang, Jianming Miao
Title: FSSUWNet: Mitigating the Fragility of Pre-trained Models with Feature Enhancement for Few-Shot Semantic Segmentation in Underwater Images
Abstract:
Few-Shot Semantic Segmentation (FSS), which focuses on segmenting new classes in images using only a limited number of annotated examples, has recently progressed in data-scarce domains. However, in this work, we show that the existing FSS methods often struggle to generalize to underwater environments. Specifically, the prior features extracted by pre-trained models used as feature extractors are fragile due to the unique challenges of underwater images. To address this, we propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement. FSSUWNet exploits the integration of complementary features, emphasizing both low-level and high-level image characteristics. In addition to employing a pre-trained model as the primary encoder, we propose an auxiliary encoder called Feature Enhanced Encoder which extracts complementary features to better adapt to underwater scene characteristics. Furthermore, a simple and effective Feature Alignment Module aims to provide global prior knowledge and align low-level features with high-level features in dimensions. Given the scarcity of underwater images, we introduce a cross-validation dataset version based on the Segmentation of Underwater Imagery dataset. Extensive experiments on public underwater segmentation datasets demonstrate that our approach achieves state-of-the-art performance. For example, our method outperforms the previous best method by 2.8% and 2.6% in terms of the mean Intersection over Union metric for 1-shot and 5-shot scenarios in the datasets, respectively. Our implementation is available at https://github.com/lizhh268/FSSUWNet.
中文摘要:本文提出FSSUWNet,一种专为水下环境设计的少样本语义分割框架,通过增强特征集成与对齐来克服现有方法的局限性,在公开数据集上实现了最优性能。
English Summary: This paper introduces FSSUWNet, a novel few-shot semantic segmentation framework designed for underwater environments that enhances feature integration and alignment to overcome the limitations of existing methods, achieving state-of-the-art performance on public datasets.

Authors:Haobo Yuan, Tao Zhang, Xiangtai Li, Lu Qi, Zilong Huang, Shilin Xu, Jiashi Feng, Ming-Hsuan Yang
Title: 4th PVUW MeViS 3rd Place Report: Sa2VA
Abstract:
Referring video object segmentation (RVOS) is a challenging task that requires the model to segment the object in a video given the language description. MeViS is a recently proposed dataset that contains motion expressions of the target objects, leading to a more challenging benchmark compared with existing RVOS benchmarks. On the other hand, for referring expression tasks, a new trend is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.
中文摘要:通过增强多模态大语言模型的测试时推理方法,无需额外训练即可显著提升在具有挑战性的MeViS视频指代分割数据集上的性能,最终在第四届PVUW研讨会中获得第三名。
English Summary: A simple modification to the test-time inference method on stronger multi-modal large language models significantly improves performance on the challenging MeViS dataset for referring video object segmentation, achieving third place in the 4th PVUW workshop without additional training.

Authors:Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao
Title: Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection
Abstract:
Trustworthy AI systems require attribution methods that identify the input regions which most influence the model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to efficiently rank input sub-regions by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
中文: 本文提出LiMA这一高效黑盒归因方法,通过将输入区域重要性重构为子模优化问题,在识别影响AI决策的关键区域方面实现了更优的准确性和效率。
English: This paper introduces LiMA, an efficient black-box attribution method that reformulates input region importance as a submodular optimization problem, achieving superior accuracy and speed in identifying influential regions for AI decisions.
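To make the submodular-selection view concrete, here is a plain forward greedy pass that ranks regions by the marginal gain of a black-box score. The paper's bidirectional greedy search is more elaborate, so treat this as a baseline sketch; score_fn is an assumed callable wrapping the model (for example, the target-class confidence when only the selected regions are visible).

```python
from typing import Callable, List, Set

def greedy_attribution(num_regions: int, score_fn: Callable[[Set[int]], float], budget: int) -> List[int]:
    """Rank up to `budget` regions by how much each addition increases the set score."""
    selected: Set[int] = set()
    ranking: List[int] = []
    current = score_fn(selected)
    for _ in range(min(budget, num_regions)):
        best_gain, best_r = float("-inf"), None
        for r in range(num_regions):
            if r in selected:
                continue
            gain = score_fn(selected | {r}) - current
            if gain > best_gain:
                best_gain, best_r = gain, r
        selected.add(best_r)
        ranking.append(best_r)
        current += best_gain
    return ranking  # earlier regions are more important for the model's decision
```

With a submodular score, each greedy step comes with the usual approximation guarantee, which is what lets a small selected subset serve as a faithful attribution.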

Authors:Hao Qin, Luyuan Chen, Ming Kong, Mengxu Lu, Qiang Zhu
Title: Distilling Multi-view Diffusion Models into 3D Generators
Abstract:
We introduce DD3G, a formulation that Distills a multi-view Diffusion model (MV-DM) into a 3D Generator using gaussian splatting. DD3G compresses and integrates extensive visual and spatial geometric knowledge from the MV-DM by simulating its ordinary differential equation (ODE) trajectory, ensuring the distilled generator generalizes better than those trained solely on 3D data. Unlike previous amortized optimization approaches, we align the MV-DM and 3D generator representation spaces to transfer the teacher's probabilistic flow to the student, thus avoiding inconsistencies in optimization objectives caused by probabilistic sampling. The introduction of probabilistic flow and the coupling of various attributes in 3D Gaussians introduce challenges in the generation process. To tackle this, we propose PEPD, a generator consisting of Pattern Extraction and Progressive Decoding phases, which enables efficient fusion of probabilistic flow and converts a single image into 3D Gaussians within 0.06 seconds. Furthermore, to reduce knowledge loss and overcome sparse-view supervision, we design a joint optimization objective that ensures the quality of generated samples through explicit supervision and implicit verification. Leveraging existing 2D generation models, we compile 120k high-quality RGBA images for distillation. Experiments on synthetic and public datasets demonstrate the effectiveness of our method. Our project is available at: https://qinbaigao.github.io/DD3G_project/

Authors:Qi Song, Chenghong Li, Haotong Lin, Sida Peng, Rui Huang
Title: ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving with Multi-modal Inputs
Abstract:
We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric clue. Furthermore, we propose a multi-modal feature matching strategy coupled with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on two large-scale autonomous driving datasets, Waymo and KITTI, demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.

Authors:Chong Li, Jingyang Huo, Weikang Gong, Yanwei Fu, Xiangyang Xue, Jianfeng Feng
Title: DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
Abstract:
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: https://chongjg.github.io/DecoFuse/.
中文: DecoFuse提出了一种受大脑启发的框架,通过将fMRI信号分解为语义、空间和运动三个独立成分分别解码再融合,在视频重建任务中实现了卓越性能,并通过神经编码分析验证了其与双通路假设的生物一致性。
English: DecoFuse introduces a brain-inspired framework that decomposes fMRI signals into semantic, spatial, and motion components for separate decoding and fusion, achieving superior performance in video reconstruction and validating biological plausibility through neural alignment.

Authors:Xiaole Xian, Zhichao Liao, Qingyu Li, Wenyu Qin, Pengfei Wan, Weicheng Xie, Long Zeng, Linlin Shen, Pingfa Feng
Title: SPF-Portrait: Towards Pure Text-to-Portrait Customization with Semantic Pollution-Free Fine-Tuning
Abstract:
Fine-tuning a pre-trained Text-to-Image (T2I) model on a tailored portrait dataset is the mainstream method for text-to-portrait customization. However, existing methods often severely impact the original model's behavior (e.g., changes in ID, layout, etc.) while customizing portrait attributes. To address this issue, we propose SPF-Portrait, a pioneering work to purely understand customized target semantics and minimize disruption to the original model. In our SPF-Portrait, we design a dual-path contrastive learning pipeline, which introduces the original model as a behavioral alignment reference for the conventional fine-tuning path. During the contrastive learning, we propose a novel Semantic-Aware Fine Control Map that indicates the intensity of response regions of the target semantics, to spatially guide the alignment process between the contrastive paths. It adaptively balances the behavioral alignment across different regions and the responsiveness of the target semantics. Furthermore, we propose a novel response enhancement mechanism to reinforce the presentation of target semantics, while mitigating representation discrepancy inherent in direct cross-modal supervision. Through the above strategies, we achieve incremental learning of customized target semantics for pure text-to-portrait customization. Extensive experiments show that SPF-Portrait achieves state-of-the-art performance. Project page: https://spf-portrait.github.io/SPF-Portrait/

Authors:Zilong Huang, Jun He, Junyan Ye, Lihan Jiang, Weijia Li, Yiping Chen, Ting Han
Title: Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
Abstract:
The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. Current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a single panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be released at https://github.com/LongHZ140516/Scene4U.
中文: 本文提出Scene4U,一种基于全景图像的分层三维场景重建框架,通过分解场景层、修复遮挡区域并采用三维高斯溅射优化,解决了视觉不连续和遮挡问题,实现了具有语义和结构一致性的沉浸式三维场景,并在性能指标上显著优于现有方法。
English: This paper introduces Scene4U, a novel layered 3D scene reconstruction framework from panoramic images that addresses visual discontinuities and occlusions by decomposing scenes into layers, restoring occluded regions, and optimizing with 3D Gaussian Splatting to achieve immersive, consistent 3D scenes with superior performance metrics.

Authors:Wanjing Zhang, Chenxing Wang
Title: Intrinsic-feature-guided 3D Object Detection
Abstract:
LiDAR-based 3D object detection is essential for autonomous driving systems. However, LiDAR point clouds often exhibit sparsity, uneven distribution, and incomplete structures, significantly limiting detection performance. In road driving environments, target objects such as vehicles, pedestrians, and cyclists are well suited to representation enhancement through complete template guidance, considering their grid and topological structures. Therefore, this paper presents an intrinsic-feature-guided 3D object detection method based on a template-assisted feature enhancement module, which extracts intrinsic features from relatively generalized templates and provides rich structural information for foreground objects. Furthermore, a proposal-level contrastive learning mechanism is designed to enhance the feature differences between foreground and background objects. The proposed modules can act as plug-and-play components and improve the performance of multiple existing methods. Extensive experiments illustrate that the proposed method achieves highly competitive detection results. Code will be available at https://github.com/zhangwanjingjj/IfgNet.git.
中文: 本文提出了一种基于模板辅助特征增强的三维物体检测方法,通过提取内在特征和对比学习机制,显著提升了车辆、行人和骑车人的检测性能,在自动驾驶领域展现出强大竞争力。
English: This paper introduces a template-guided 3D object detection method that enhances feature representation for vehicles, pedestrians, and cyclists through intrinsic feature extraction and contrastive learning, achieving competitive results in autonomous driving applications.
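As a hedged sketch of what a proposal-level contrastive term could look like, the snippet below applies a standard supervised-contrastive loss that pulls proposals with the same foreground/background label together and pushes the two groups apart; the paper's exact formulation may differ, and the temperature and normalization choices are assumptions.

```python
import torch
import torch.nn.functional as F

def proposal_contrastive_loss(feats: torch.Tensor, is_fg: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """feats: (n, d) proposal embeddings; is_fg: (n,) bool, True for foreground proposals."""
    z = F.normalize(feats, dim=-1)
    n = z.size(0)
    sim = z @ z.t() / tau
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))             # drop self-similarity from the denominator
    same = (is_fg[:, None] == is_fg[None, :]) & ~eye      # positives: same fg/bg label, not self
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    pos = log_prob.masked_fill(~same, 0.0)                # keep only positive-pair terms
    loss = -pos.sum(dim=-1) / same.float().sum(dim=-1).clamp(min=1)
    return loss.mean()
```

Such a term is easy to add on top of an existing detector's proposal head, which matches the plug-and-play claim in the abstract.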

Authors:Ting Liu, Siyuan Li
Title: Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Abstract:
Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot RIS models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. Code is available at https://github.com/fhgyuanshen/HybridGL .
中文摘要:本文提出了一种无需训练的混合全局-局部特征提取方法和空间引导增强策略,显著提升了零样本指代图像分割中的掩码区域表示与文本对齐效果,在基准测试中表现卓越。
English Summary: This paper introduces a training-free hybrid global-local feature extraction method and a spatial guidance augmentation strategy to enhance mask region representation and alignment in zero-shot referring image segmentation, achieving superior performance on benchmarks.

Authors:Pooya Ashtari, Shahryar Noei, Fateme Nateghi Haredasht, Jonathan H. Chen, Giuseppe Jurman, Aleksandra Pizurica, Sabine Van Huffel
Title: Deconver: A Deconvolutional Network for Medical Image Segmentation
Abstract:
While convolutional neural networks (CNNs) and vision transformers (ViTs) have advanced medical image segmentation, they face inherent limitations such as local receptive fields in CNNs and high computational complexity in ViTs. This paper introduces Deconver, a novel network that integrates traditional deconvolution techniques from image restoration as a core learnable component within a U-shaped architecture. Deconver replaces computationally expensive attention mechanisms with efficient nonnegative deconvolution (NDC) operations, enabling the restoration of high-frequency details while suppressing artifacts. Key innovations include a backpropagation-friendly NDC layer based on a provably monotonic update rule and a parameter-efficient design. Evaluated across four datasets (ISLES'22, BraTS'23, GlaS, FIVES) covering both 2D and 3D segmentation tasks, Deconver achieves state-of-the-art performance in Dice scores and Hausdorff distance while reducing computational costs (FLOPs) by up to 90% compared to leading baselines. By bridging traditional image restoration with deep learning, this work offers a practical solution for high-precision segmentation in resource-constrained clinical workflows. The project is available at https://github.com/pashtari/deconver.
中文: Deconver提出了一种新型U型网络,通过整合高效的非负反卷积技术替代注意力机制,在多个医学影像数据集中以降低高达90%的计算成本实现了最先进的分割性能。
English: Deconver introduces a novel U-shaped network that integrates efficient nonnegative deconvolution to replace attention mechanisms, achieving state-of-the-art segmentation performance with up to 90% lower computational costs across multiple medical imaging datasets.
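
To make the core operation concrete, below is a minimal sketch of a nonnegative deconvolution step in NumPy, using a generic multiplicative (Richardson-Lucy-style) update that preserves nonnegativity. The kernel, iteration count, and epsilon are illustrative assumptions; Deconver's actual NDC layer is learnable and backpropagation-friendly, with its own provably monotonic rule.

```python
# Minimal sketch of a nonnegative deconvolution (NDC) update, in the spirit of
# using classical deconvolution as a building block. Generic multiplicative
# (Richardson-Lucy-style) rule on NumPy arrays, NOT the paper's exact layer.
import numpy as np
from scipy.signal import convolve2d

def ndc_deconvolve(y, h, n_iters=20, eps=1e-8):
    """Estimate a nonnegative x such that y ~= h * x (2D convolution)."""
    h = h / (h.sum() + eps)            # normalized nonnegative kernel
    h_flip = h[::-1, ::-1]             # correlation = convolution with flipped kernel
    x = np.full_like(y, y.mean())      # nonnegative initialization
    for _ in range(n_iters):
        blur = convolve2d(x, h, mode="same") + eps
        ratio = y / blur
        x = x * convolve2d(ratio, h_flip, mode="same")  # multiplicative, keeps x >= 0
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_true = rng.random((32, 32))
    h = np.ones((5, 5)) / 25.0
    y = convolve2d(x_true, h, mode="same")
    x_hat = ndc_deconvolve(y, h)
    print("reconstruction MSE:", float(np.mean((x_hat - x_true) ** 2)))
```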

Authors:Han Zhou, Wei Dong, Jun Chen
Title: LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors
Abstract:
Directly employing 3D Gaussian Splatting (3DGS) on images with adverse illumination conditions exhibits considerable difficulty in achieving high-quality, normally-exposed representations due to: (1) The limited Structure from Motion (SfM) points estimated in adverse illumination scenarios fail to capture sufficient scene details; (2) Without ground-truth references, the intensive information loss, significant noise, and color distortion pose substantial challenges for 3DGS to produce high-quality results; (3) Combining existing exposure correction methods with 3DGS does not achieve satisfactory performance due to their individual enhancement processes, which lead to illumination inconsistency between enhanced images from different viewpoints. To address these issues, we propose LITA-GS, a novel illumination-agnostic novel view synthesis method via reference-free 3DGS and physical priors. Firstly, we introduce an illumination-invariant physical prior extraction pipeline. Secondly, based on the extracted robust spatial structure prior, we develop the lighting-agnostic structure rendering strategy, which facilitates the optimization of the scene structure and object appearance. Moreover, a progressive denoising module is introduced to effectively mitigate the noise within the light-invariant representation. We adopt an unsupervised strategy for training LITA-GS, and extensive experiments demonstrate that LITA-GS surpasses the state-of-the-art (SOTA) NeRF-based method while enjoying faster inference and requiring less training time. The code is released at https://github.com/LowLevelAI/LITA-GS.
中文摘要:LITA-GS方法通过引入光照无关的物理先验和渐进式去噪模块,解决了三维高斯溅射在恶劣光照条件下难以生成高质量场景表示的难题,在性能和效率上均超越现有先进技术。
English Summary: The proposed LITA-GS method overcomes 3D Gaussian Splatting's limitations in adverse lighting by integrating illumination-invariant physical priors and progressive denoising, achieving superior performance and efficiency compared to existing approaches.

Authors:Suzanne Stathatos, Michael Hobley, Markus Marks, Pietro Perona
Title: SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance
Abstract:
Foundation models excel at vision tasks in natural images but fail in low signal-to-noise ratio (SNR) videos, such as underwater sonar, ultrasound, and microscopy. We introduce Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a self-supervised method that denoises low-SNR sensor videos and is trained using only the raw noisy data. By leveraging differences in foreground and background motion, SAVeD enhances object visibility using an encoder-decoder with a temporal bottleneck. Our approach improves classification, detection, tracking, and counting, outperforming state-of-the-art video denoising methods with lower resource requirements. Project page: https://suzanne-stathatos.github.io/SAVeD Code page: https://github.com/suzanne-stathatos/SAVeD
Chinese: SAVeD是一种仅使用原始噪声数据进行自监督训练的方法,通过利用前景与背景运动差异来增强低信噪比视频中的目标可见性,并以更少资源超越现有最优去噪技术。
English: SAVeD is a self-supervised method that denoises low-SNR videos using raw noisy data alone, enhancing object visibility through motion differences and outperforming state-of-the-art denoising techniques with fewer resources.

Authors:Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Title: Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Abstract:
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets. Our code is publicly available for research purpose at https://easi3r.github.io/
中文: Easi3R提出了一种无需训练的4D重建方法,通过解耦DUSt3R中的注意力机制实现动态场景建模,其性能显著优于依赖大量数据训练的现有方法。
English: Easi3R introduces a training-free 4D reconstruction method that adapts attention mechanisms in DUSt3R during inference, achieving superior dynamic scene modeling without pre-training or fine-tuning.

Authors:Chenyang Li, Wenxuan Liu, Guoqiang Gong, Xiaobo Ding, Xian Zhong
Title: SU-YOLO: Spiking Neural Network for Efficient Underwater Object Detection
Abstract:
Underwater object detection is critical for oceanic research and industrial safety inspections. However, the complex optical environment and the limited resources of underwater equipment pose significant challenges to achieving high accuracy and low power consumption. To address these issues, we propose Spiking Underwater YOLO (SU-YOLO), a Spiking Neural Network (SNN) model. Leveraging the lightweight and energy-efficient properties of SNNs, SU-YOLO incorporates a novel spike-based underwater image denoising method based solely on integer addition, which enhances the quality of feature maps with minimal computational overhead. In addition, we introduce Separated Batch Normalization (SeBN), a technique that normalizes feature maps independently across multiple time steps and is optimized for integration with residual structures to capture the temporal dynamics of SNNs more effectively. The redesigned spiking residual blocks integrate the Cross Stage Partial Network (CSPNet) with the YOLO architecture to mitigate spike degradation and enhance the model's feature extraction capabilities. Experimental results on the URPC2019 underwater dataset demonstrate that SU-YOLO achieves an mAP of 78.8% with 6.97M parameters and an energy consumption of 2.98 mJ, surpassing mainstream SNN models in both detection accuracy and computational efficiency. These results underscore the potential of SNNs for engineering applications. The code is available at https://github.com/lwxfight/snn-underwater.
中文:提出的脉冲水下YOLO(SU-YOLO)模型利用脉冲神经网络的高能效特性,结合创新的去噪和归一化方法,在低功耗条件下实现了卓越的水下目标检测性能。
English: The proposed Spiking Underwater YOLO (SU-YOLO) model leverages spiking neural networks' energy efficiency and introduces novel denoising and normalization techniques to achieve superior underwater object detection accuracy with minimal computational resources.
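
The Separated Batch Normalization (SeBN) idea, normalizing feature maps independently at each SNN time step, can be sketched in a few lines of PyTorch. The [T, B, C, H, W] layout and the use of a standard BatchNorm2d per step are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class SeparatedBatchNorm(nn.Module):
    """Applies an independent BatchNorm2d at each SNN time step.

    Input is assumed to be [T, B, C, H, W]; the per-time-step layout and the
    choice of BatchNorm2d are illustrative assumptions, not the paper's code.
    """
    def __init__(self, num_features: int, num_steps: int):
        super().__init__()
        self.bns = nn.ModuleList([nn.BatchNorm2d(num_features) for _ in range(num_steps)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each time step with its own statistics and affine parameters.
        return torch.stack([bn(x[t]) for t, bn in enumerate(self.bns)], dim=0)

if __name__ == "__main__":
    seq = torch.randn(4, 2, 16, 8, 8)   # T=4 time steps, batch of 2
    sebn = SeparatedBatchNorm(num_features=16, num_steps=4)
    print(sebn(seq).shape)              # torch.Size([4, 2, 16, 8, 8])
```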

Authors:Chong Bao, Xiyu Zhang, Zehao Yu, Jiale Shi, Guofeng Zhang, Songyou Peng, Zhaopeng Cui
Title: Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views
Abstract:
Neural rendering has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view synthesis with dense input views and accurate poses. However, applying it to extremely sparse, unposed views in unbounded 360° scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization to refine the noise and fill occluded regions in the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation alongside an uncertainty-aware training approach to facilitate mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy. Project page: https://zju3dv.github.io/free360/
中文: 本文提出了一种新颖的神经渲染框架,通过分层高斯表示和引导优化,在无边界360°场景中实现了从极稀疏、无位姿输入视图的高质量三维重建和新视角合成。
English: This paper introduces a novel neural rendering framework that achieves high-quality 3D reconstruction and novel view synthesis from extremely sparse, unposed views in unbounded 360° scenes through layered Gaussian representation and bootstrap optimization.

Authors:Yuping Wang, Xiangyu Huang, Xiaokang Sun, Mingxuan Yan, Shuo Xing, Zhengzhong Tu, Jiachen Li
Title: UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
Abstract:
We introduce UniOcc, a comprehensive, unified benchmark and toolkit for occupancy forecasting (i.e., predicting future occupancies based on historical information) and occupancy prediction (i.e., predicting current-frame occupancy from camera images). UniOcc unifies the data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D occupancy labels and innovative per-voxel flow annotations. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth labels, enabling robust assessment on additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. Our data and code are available at https://uniocc.github.io/.
中文:UniOcc是一个统一的占用预测基准与工具包,整合了多源真实数据和仿真数据集,通过创新的三维流标注和无真值标签评估指标,证明大规模训练数据与显式流信息能显著提升占用预测性能。
English: UniOcc is a unified benchmark and toolkit for occupancy forecasting and prediction, integrating diverse real-world and simulated datasets with innovative 3D flow annotations and label-free evaluation metrics to significantly improve model performance through large-scale training data and explicit flow information.

Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
Title: Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
Abstract:
Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
Chinese: 强化学习的最新进展提升了多模态大语言模型的推理能力,SEED-Bench-R1基准测试验证了其有效性,但在复杂视觉推理任务中保持逻辑连贯性仍是待解决的挑战。
English: Recent advancements in reinforcement learning have enhanced multimodal large language models' reasoning capabilities, as demonstrated by the SEED-Bench-R1 benchmark, though challenges remain in maintaining logical coherence during complex visual reasoning tasks.

Authors:Shuaizheng Liu, Jianqi Ma, Lingchen Sun, Xiangtao Kong, Lei Zhang
Title: InstructRestore: Region-Customized Image Restoration with Human Instructions
Abstract:
Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image details enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be found at https://github.com/shuaizhengliu/InstructRestore.git.
中文: 本文提出了InstructRestore新框架,通过类似ControlNet的模型实现基于用户指令的区域定制化图像修复,能够根据描述对特定区域进行差异化细节增强。
English: This paper introduces InstructRestore, a novel framework that enables region-specific image restoration guided by human instructions, utilizing a ControlNet-like model to adaptively enhance details in targeted areas based on user input.
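
One plausible way to realize the region-dependent integration scales described above is to blend the control features into the backbone with a mask-weighted scale, as in the hedged PyTorch sketch below; the two scale values and the soft-mask layout are illustrative assumptions, not the trained allocation from the paper.

```python
import torch

def region_weighted_injection(backbone_feat: torch.Tensor,
                              control_feat: torch.Tensor,
                              region_mask: torch.Tensor,
                              scale_target: float = 1.0,
                              scale_context: float = 0.4) -> torch.Tensor:
    """Inject ControlNet-style features with different scales inside and outside
    the user-described region.

    `region_mask` is a [B, 1, H, W] soft mask of the target region; the scale
    values are illustrative knobs for a sketch, not learned parameters.
    """
    scales = scale_target * region_mask + scale_context * (1.0 - region_mask)
    return backbone_feat + scales * control_feat

if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)
    ctrl = torch.randn(1, 64, 32, 32)
    mask = torch.zeros(1, 1, 32, 32)
    mask[..., 8:24, 8:24] = 1.0          # user-described target region
    print(region_weighted_injection(feat, ctrl, mask).shape)
```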

Authors:Wenyan Cong, Hanqing Zhu, Peihao Wang, Bangya Liu, Dejia Xu, Kevin Wang, David Z. Pan, Yan Wang, Zhiwen Fan, Zhangyang Wang
Title: Can Test-Time Scaling Improve World Foundation Model?
Abstract:
World foundation models, which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling exists even in a compute-optimal way. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. Project page: https://scalingwfm.github.io/.
中文: SWIFT是一种测试时扩展框架,通过优化策略在不重新训练或增大模型的情况下提升世界基础模型的推理效率,实证结果验证了其有效性。
English: SWIFT is a test-time scaling framework that enhances world foundation models' inference efficiency through optimized strategies without retraining or enlarging the model, as validated by empirical results.
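
As a rough illustration of process-level inference with probability-based Top-K pruning and beam search, the following generic Python sketch searches over discrete token continuations; the scoring function, beam width, and stopping rule are placeholders rather than SWIFT's actual world-foundation-model rollout machinery.

```python
import math
from typing import Callable, List, Sequence, Tuple

def beam_search(
    score_next: Callable[[Sequence[int]], List[Tuple[int, float]]],
    beam_width: int = 4,
    top_k: int = 8,
    max_steps: int = 16,
    eos_token: int = 0,
) -> List[int]:
    """Generic beam search with probability-based top-k pruning.

    `score_next(prefix)` returns (token, log_prob) candidates for the next token.
    """
    beams: List[Tuple[List[int], float]] = [([], 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[List[int], float]] = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos_token:
                candidates.append((prefix, score))
                continue
            # Top-k pruning: keep only the k most probable continuations per beam.
            ranked = sorted(score_next(prefix), key=lambda t: t[1], reverse=True)[:top_k]
            for token, logp in ranked:
                candidates.append((prefix + [token], score + logp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(b[0] and b[0][-1] == eos_token for b in beams):
            break
    return beams[0][0]

if __name__ == "__main__":
    # Toy scorer: prefers increasing tokens and occasionally stops at 0.
    def toy_scorer(prefix):
        last = prefix[-1] if prefix else 1
        return [(last + 1, math.log(0.6)), (last, math.log(0.3)), (0, math.log(0.1))]
    print(beam_search(toy_scorer, max_steps=5))
```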

Authors:Adam Schmidt, Mert Asim Karaoglu, Soham Sinha, Mingang Jang, Ho-Gun Ha, Kyungmin Jung, Kyeongmo Gu, Ihsan Ullah, Hyunki Lee, Jonáš Šerých, Michal Neoral, Jiří Matas, Rulin Zhou, Wenlong He, An Wang, Hongliang Ren, Bruno Silva, Sandro Queirós, Estêvão Lima, João L. Vilaça, Shunsuke Kikuchi, Atsushi Kouno, Hiroki Matsuzaki, Tongtong Li, Yulu Chen, Ling Li, Xiang Ma, Xiaojian Li, Mona Sheikh Zeinoddin, Xu Wang, Zafer Tandogdu, Greg Shaw, Evangelos Mazomenos, Danail Stoyanov, Yuxin Chen, Zijian Wu, Alexander Ladikos, Simon DiMaio, Septimiu E. Salcudean, Omid Mohareri
Title: Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge
Abstract:
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: https://zenodo.org/records/14803158 , and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics
中文: STIR挑战赛2024通过点追踪任务评估算法在STIR数据集上的准确性与效率,旨在推动手术空间理解技术的发展,本文详细记录了八支参赛队伍的设计方案与比赛结果。
English: The STIR Challenge 2024 was introduced to advance surgical spatial understanding by evaluating algorithm accuracy and efficiency through point tracking on the STIR dataset, with results from eight participating teams detailed in this paper.

Authors:Yuelei Li, Hyunjin Kim, Fangneng Zhan, Ri-Zhao Qiu, Mazeyu Ji, Xiaojun Shan, Xueyan Zou, Paul Liang, Hanspeter Pfister, Xiaolong Wang
Title: Visual Acoustic Fields
Abstract:
Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.

Authors:Karim Radouane, Hanane Azzag, Mustapha lebbah
Title: MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing
Abstract:
We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: \url{https://github.com/rd20karim/MB-ORES}.
中文摘要:本研究提出了一种统一框架,将目标检测与视觉定位相结合应用于遥感图像,通过微调开放集检测器和任务感知架构,在基准数据集上实现了优于现有方法的性能,同时保持了传统检测功能。
English Summary: This study introduces a unified framework that combines object detection and visual grounding for remote sensing imagery, utilizing a fine-tuned open-set detector and a task-aware architecture to achieve state-of-the-art performance on benchmark datasets while preserving traditional detection capabilities.

Authors:Dominik Schnaus, Nikita Araslanov, Daniel Cremers
Title: It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
Abstract:
The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e. without parallel data. We present the first feasibility study, and investigate conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can be indeed matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
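
The unsupervised matching setup can be illustrated with a toy solver: given only within-modality pairwise distance matrices, search for the permutation that makes them agree. The brute-force search below works only for small instances and stands in for the paper's more scalable QAP heuristic; the embedding sizes and the synthetic rotation are illustrative assumptions.

```python
import itertools
import numpy as np

def blind_match(vision_emb: np.ndarray, text_emb: np.ndarray):
    """Match items across modalities using only within-modality pairwise distances."""
    def pdist(x):
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        return d / (d.max() + 1e-12)    # scale-normalize each modality

    Dv, Dt = pdist(vision_emb), pdist(text_emb)
    n = Dv.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):    # exhaustive; fine for small n
        p = list(perm)
        cost = np.sum((Dv - Dt[np.ix_(p, p)]) ** 2)
        if cost < best_cost:
            best_perm, best_cost = p, cost
    return best_perm, best_cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(6, 16))
    true_perm = rng.permutation(6)
    # Simulate "homogeneous" modalities: vision = rotated, permuted text embeddings.
    rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
    vision = text[true_perm] @ rotation
    perm, _ = blind_match(vision, text)
    print("recovered:", perm, "ground truth:", list(true_perm))
```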

Authors:Xiangyuan Peng, Miao Tang, Huawei Sun, Kay Bierzynski, Lorenzo Servadei, Robert Wille
Title: 4D mmWave Radar for Sensing Enhancement in Adverse Environments: Advances and Challenges
Abstract:
Intelligent transportation systems require accurate and reliable sensing. However, adverse environments, such as rain, snow, and fog, can significantly degrade the performance of LiDAR and cameras. In contrast, 4D mmWave radar not only provides 3D point clouds and velocity measurements but also maintains robustness in challenging conditions. Recently, research on 4D mmWave radar under adverse environments has been growing, but a comprehensive review is still lacking. To bridge this gap, this work reviews the current research on 4D mmWave radar under adverse environments. First, we present an overview of existing 4D mmWave radar datasets encompassing diverse weather and lighting scenarios. Subsequently, we analyze existing learning-based methods leveraging 4D mmWave radar to enhance performance according to different adverse conditions. Finally, the challenges and potential future directions are discussed for advancing 4D mmWave radar applications in harsh environments. To the best of our knowledge, this is the first review specifically concentrating on 4D mmWave radar in adverse environments. The related studies are listed at: https://github.com/XiangyPeng/4D-mmWave-Radar-in-Adverse-Environments.
中文: 本综述首次专门聚焦于恶劣环境下的4D毫米波雷达研究,系统梳理了多场景数据集与基于学习的增强方法,并探讨了该技术在智能交通系统中面临的挑战与发展方向。
English: This first review of 4D mmWave radar in adverse environments surveys existing datasets and learning-based enhancement methods, highlights the sensor's robustness in conditions such as rain and fog, and discusses open challenges for intelligent transportation systems.

Authors:Yanbo Wang, Yongtao Chen, Chuan Cao, Tianchen Deng, Wentao Zhao, Jingchuan Wang, Weidong Chen
Title: SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency
Abstract:
We propose a flexible Semi-Automatic Labeling Tool (SALT) for general LiDAR point clouds with cross-scene adaptability and 4D consistency. Unlike recent approaches that rely on camera distillation, SALT operates directly on raw LiDAR data, automatically generating pre-segmentation results. To achieve this, we propose a novel zero-shot learning paradigm, termed data alignment, which transforms LiDAR data into pseudo-images by aligning with the training distribution of vision foundation models. Additionally, we design a 4D-consistent prompting strategy and 4D non-maximum suppression module to enhance SAM2, ensuring high-quality, temporally consistent presegmentation. SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and achieves nearly 40-50% of human annotator performance on our newly collected low-resolution LiDAR data and on combined data from three LiDAR types, significantly boosting annotation efficiency. We anticipate that SALT's open-sourcing will catalyze substantial expansion of current LiDAR datasets and lay the groundwork for the future development of LiDAR foundation models. Code is available at https://github.com/Cavendish518/SALT.
中文: 我们推出了SALT,一种半自动标注工具,采用零样本学习范式直接处理LiDAR数据,在多种LiDAR数据集上实现了卓越的分割效果,并显著提升了标注效率。
English: We introduce SALT, a semi-automatic labeling tool that uses a zero-shot learning approach to directly process LiDAR data, achieving superior segmentation results and significantly improving annotation efficiency across diverse LiDAR datasets.
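
The abstract does not spell out how LiDAR scans are turned into pseudo-images, so the sketch below shows one common building block, a spherical range-image projection, as an assumed illustration of the "data alignment" idea; the field-of-view limits and image size are arbitrary choices.

```python
import numpy as np

def range_image_projection(points: np.ndarray, h: int = 64, w: int = 1024,
                           fov_up: float = 3.0, fov_down: float = -25.0) -> np.ndarray:
    """Project an (N, 3) LiDAR point cloud to an (h, w) range image (pseudo-image)."""
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up_r - fov_down_r
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-12
    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                      # elevation
    u = 0.5 * (1.0 - yaw / np.pi) * w             # column from azimuth
    v = (1.0 - (pitch - fov_down_r) / fov) * h    # row from elevation
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)
    img = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-r)                        # write nearest points last
    img[v[order], u[order]] = r[order]
    return img

if __name__ == "__main__":
    pts = np.random.randn(5000, 3) * [20, 20, 2]  # synthetic scan for illustration
    print(range_image_projection(pts).shape)      # (64, 1024)
```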

Authors:Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Sepideh Hatamikia, Diana Mechtcheriakova, Amirreza Mahbod
Title: A Multi-Stage Auto-Context Deep Learning Framework for Tissue and Nuclei Segmentation and Classification in H&E-Stained Histological Images of Advanced Melanoma
Abstract:
Melanoma is the most lethal form of skin cancer, with an increasing incidence rate worldwide. Analyzing histological images of melanoma by localizing and classifying tissues and cell nuclei is considered the gold standard method for diagnosis and treatment options for patients. While many computerized approaches have been proposed for automatic analysis, most perform tissue-based analysis and nuclei (cell)-based analysis as separate tasks, which might be suboptimal. In this work, using the PUMA challenge dataset, we propose a novel multi-stage deep learning approach by combining tissue and nuclei information in a unified framework based on the auto-context concept to perform segmentation and classification in histological images of melanoma. Through pre-training and further post-processing, our approach achieved second and first place rankings in the PUMA challenge, with average micro Dice tissue score and summed nuclei F1-score of 73.40% for Track 1 and 63.48% for Track 2, respectively. Furthermore, through a comprehensive ablation study and additional evaluation on an external dataset, we demonstrated the effectiveness of the framework components as well as the generalization capabilities of the proposed approach. Our implementation for training and testing is available at: https://github.com/NimaTorbati/PumaSubmit
Chinese: 本研究提出了一种多阶段深度学习框架,通过整合组织和细胞核信息进行黑色素瘤诊断,在PUMA挑战赛中取得领先排名,并通过消融实验和外部验证证明了其优异的泛化能力。
English: This study introduces a multi-stage deep learning framework that integrates tissue and nuclei analysis for melanoma diagnosis, achieving top rankings in the PUMA challenge and demonstrating strong generalization through ablation studies and external validation.

Authors:Fangda Chen, Shanshan Zhao, Chuanfu Xu, Long Lan
Title: JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation
Abstract:
Recent advancements in customized video generation have led to significant improvements in the simultaneous adaptation of appearance and motion. While prior methods typically decouple appearance and motion training, the stage-wise strategy often introduces concept interference, resulting in inaccurate rendering of appearance features or motion patterns. Another challenge is appearance contamination, where background and foreground elements from reference videos distort the customized subject. In this work, we propose JointTuner, a novel framework that enables joint optimization of both appearance and motion components by leveraging two key innovations: Synaptic Low-Rank Adaptation (Synaptic LoRA) and Appearance-independent Temporal Loss (AiT Loss). Synaptic LoRA introduces a synaptic regulator, implemented as a context-aware linear activation layer, to dynamically guide LoRA modules to focus on either subject appearance or motion patterns, thereby enabling consistent optimization across spatial and temporal dimensions. AiT Loss disrupts the gradient flow of appearance-related components, guiding the model to focus exclusively on motion learning and minimizing appearance interference. JointTuner is compatible with both UNet-based models (e.g., ZeroScope) and Diffusion Transformer-based models (e.g., CogVideoX), supporting the generation of longer and higher-quality customized videos. Additionally, we present a systematic evaluation framework for appearance-motion combined customization, covering 90 combinations evaluated along four critical dimensions: semantic alignment, motion dynamism, temporal consistency, and perceptual quality. Our project homepage can be found at https://fdchen24.github.io/JointTuner-Website.

Authors:Ruisheng Han, Kanglei Zhou, Amir Atapour-Abarghouei, Xiaohui Liang, Hubert P. H. Shum
Title: FineCausal: A Causal-Based Framework for Interpretable Fine-Grained Action Quality Assessment
Abstract:
Action quality assessment (AQA) is critical for evaluating athletic performance, informing training strategies, and ensuring safety in competitive sports. However, existing deep learning approaches often operate as black boxes and are vulnerable to spurious correlations, limiting both their reliability and interpretability. In this paper, we introduce FineCausal, a novel causal-based framework that achieves state-of-the-art performance on the FineDiving-HM dataset. Our approach leverages a Graph Attention Network-based causal intervention module to disentangle human-centric foreground cues from background confounders, and incorporates a temporal causal attention module to capture fine-grained temporal dependencies across action stages. This dual-module strategy enables FineCausal to generate detailed spatio-temporal representations that not only achieve state-of-the-art scoring performance but also provide transparent, interpretable feedback on which features drive the assessment. Despite its strong performance, FineCausal requires extensive expert knowledge to define causal structures and depends on high-quality annotations, challenges that we discuss and address as future research directions. Code is available at https://github.com/Harrison21/FineCausal.
Chinese: FineCausal提出了一种新颖的因果框架,利用图注意力和时序模块分离前景线索与背景干扰,在提升可解释性的同时实现了动作质量评估的最优性能。
English: FineCausal introduces a novel causal framework using graph attention and temporal modules to enhance interpretability and achieve state-of-the-art action quality assessment by disentangling foreground cues from background confounders.

Authors:Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, Pingfa Feng
Title: HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment
Abstract:
Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108k high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50K human images are manually collected through a rigorous curation process and annotated leveraging our trailblazing 12-dimensional aesthetic standard, while the remaining 58K with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We innovatively design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression heads. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Project webpage: https://humanaesexpert.github.io/HumanAesExpert/
中文: 本研究首创了专门用于人体图像美学评估的HumanBeauty数据集,并开发了采用多头部架构的HumanAesExpert视觉语言模型,该模型在人体图像的整体与细粒度美学评估中均显著优于现有最优方法。
English: This study introduces HumanBeauty, the first specialized dataset for Human Image Aesthetic Assessment (HIAA), and proposes HumanAesExpert, a vision-language model with an innovative multi-head architecture that significantly outperforms existing methods in both overall and fine-grained aesthetic evaluation of human images.

Authors:Qihan Huang, Weilong Dai, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, Jie Song
Title: Boosting MLLM Reasoning with Text-Debiased Hint-GRPO
Abstract:
MLLM reasoning has drawn widespread research attention for its strong problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 challenged the traditional view that PRM outperforms ORM by demonstrating strong generalization with an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still struggle to handle challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on MLLMs: low data utilization and text-bias. Low data utilization means that GRPO cannot acquire positive rewards to update the MLLM on difficult samples, and text-bias is a phenomenon in which the MLLM bypasses the image condition and relies solely on the text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. Experimental results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of the original MLLMs by a large margin, exhibiting superior performance to existing MLLM reasoning methods. Our code is available at https://github.com/hqhQAQ/Hint-GRPO.
中文摘要:本文提出的Hint-GRPO方法通过自适应提示和文本偏差校准,有效解决了GRPO在数据利用和图像依赖方面的不足,大幅提升了多模态大语言模型的推理能力。
English Summary: This paper introduces Hint-GRPO and text-bias calibration to overcome GRPO's limitations in low data utilization and text-bias, significantly enhancing multimodal large language models' reasoning performance across diverse datasets.
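
Test-time text-bias calibration can be pictured as a contrastive-decoding-style adjustment that amplifies the difference between image-conditioned and text-only logits. The formula and the alpha knob below are assumptions made for illustration; the paper's exact calibration rule may differ.

```python
import torch

def calibrate_logits(logits_with_image: torch.Tensor,
                     logits_text_only: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """Amplify the contribution of the image condition to next-token logits.

    Pushes the distribution away from what the model would predict from text
    alone; `alpha` is an illustrative knob, not a value from the paper.
    """
    return logits_with_image + alpha * (logits_with_image - logits_text_only)

if __name__ == "__main__":
    torch.manual_seed(0)
    li = torch.randn(1, 32000)   # logits conditioned on (image, text)
    lt = torch.randn(1, 32000)   # logits conditioned on text only
    probs = torch.softmax(calibrate_logits(li, lt, alpha=0.5), dim=-1)
    print(probs.argmax(dim=-1))
```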

Authors:SeonYeong Lee, EonSeung Seong, DongEon Lee, SiYeoul Lee, Yubin Cho, Chunsu Park, Seonho Kim, MinKyung Seo, YoungSin Ko, MinWoo Kim
Title: Learned Image Compression and Restoration for Digital Pathology
Abstract:
Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems. Code and models are available at: https://github.com/pnu-amilab/CLERIC.
中文:CLERIC是一种基于深度学习的新型数字病理图像压缩框架,通过可学习的提升方案和先进卷积技术,在提升存储效率的同时保持了诊断所需的图像质量。
English: CLERIC is a novel deep learning-based compression framework for digital pathology images that enhances storage efficiency while preserving diagnostic quality through advanced lifting schemes and convolutional techniques.
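
A lifting-scheme transform splits a signal into even and odd samples, predicts the odd half from the even half (yielding the high-frequency band), and updates the even half (yielding the low-frequency band). The 1D Haar-like sketch below uses fixed predict/update operators for clarity; in a learned codec such as CLERIC these operators would be small neural networks.

```python
import numpy as np

def lifting_forward(signal: np.ndarray):
    """One lifting step: split into even/odd samples, then predict and update."""
    even, odd = signal[0::2].astype(float), signal[1::2].astype(float)
    high = odd - even            # predict: detail = odd - prediction(even)
    low = even + 0.5 * high      # update: coarse approximation
    return low, high

def lifting_inverse(low: np.ndarray, high: np.ndarray) -> np.ndarray:
    even = low - 0.5 * high
    odd = high + even
    out = np.empty(even.size + odd.size)
    out[0::2], out[1::2] = even, odd
    return out

if __name__ == "__main__":
    x = np.arange(16, dtype=float)
    lo, hi = lifting_forward(x)
    print(np.allclose(lifting_inverse(lo, hi), x))   # True: perfect reconstruction
```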

Authors:Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Title: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Abstract:
We present On-device Sora, the first model-training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository (https://github.com/eai-lab/On-device-Sora).
中文摘要:On-device Sora 提出首个无需训练的移动端文本到视频生成方案,通过线性比例跳跃、时间维度令牌合并和动态加载并发推理三项创新技术,在智能手机上实现与高端GPU相媲美的高质量视频生成,有效克服了移动设备计算和内存限制。
English Summary: On-device Sora introduces a training-free solution for efficient text-to-video generation on smartphones through three novel techniques—Linear Proportional Leap, Temporal Dimension Token Merging, and Concurrent Inference with Dynamic Loading—enabling high-quality video creation comparable to high-end GPUs on resource-constrained devices.
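
Temporal Dimension Token Merging can be approximated by pooling consecutive frame tokens, as in the sketch below; the [B, T, N, C] layout, window size, and plain mean pooling are illustrative assumptions rather than the released implementation.

```python
import torch

def temporal_token_merge(tokens: torch.Tensor, window: int = 2) -> torch.Tensor:
    """Merge consecutive tokens along the temporal axis by averaging.

    Assumes a [B, T, N, C] layout (batch, frames, spatial tokens, channels).
    """
    b, t, n, c = tokens.shape
    t_trim = (t // window) * window                   # drop a ragged tail, if any
    x = tokens[:, :t_trim].reshape(b, t_trim // window, window, n, c)
    return x.mean(dim=2)                              # [B, T/window, N, C]

if __name__ == "__main__":
    x = torch.randn(1, 8, 256, 64)
    print(temporal_token_merge(x).shape)              # torch.Size([1, 4, 256, 64])
```

Halving the temporal token count before attention roughly quarters the cost of the temporal attention term, which is the kind of saving the abstract attributes to TDTM.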

Authors:Haoran Shen, Peixian Zhuang, Jiahao Kou, Yuxin Zeng, Haoying Xu, Jiangyun Li
Title: MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation
Abstract:
Segment Anything Models (SAMs), as vision foundation models, have demonstrated remarkable performance across various image analysis tasks. Despite their strong generalization capabilities, SAMs encounter challenges in fine-grained detail segmentation for high-resolution class-independent segmentation (HRCS), due to the limitations in the direct processing of high-resolution inputs and low-resolution mask predictions, and the reliance on accurate manual prompts. To address these limitations, we propose MGD-SAM2 which integrates SAM2 with multi-view feature interaction between a global image and local patches to achieve precise segmentation. MGD-SAM2 incorporates the pre-trained SAM2 with four novel modules: the Multi-view Perception Adapter (MPAdapter), the Multi-view Complementary Enhancement Module (MCEM), the Hierarchical Multi-view Interaction Module (HMIM), and the Detail Refinement Module (DRM). Specifically, we first introduce MPAdapter to adapt the SAM2 encoder for enhanced extraction of local details and global semantics in HRCS images. Then, MCEM and HMIM are proposed to further exploit local texture and global context by aggregating multi-view features within and across multi-scales. Finally, DRM is designed to generate gradually restored high-resolution mask predictions, compensating for the loss of fine-grained details resulting from directly upsampling the low-resolution prediction maps. Experimental results demonstrate the superior performance and strong generalization of our model on multiple high-resolution and normal-resolution datasets. Code will be available at https://github.com/sevenshr/MGD-SAM2.
中文: 提出的MGD-SAM2模型通过集成多视角特征交互和细节优化模块,显著提升了SAM2在高分辨率图像上的分割精度,在多个数据集上展现出优越性能。
English: The proposed MGD-SAM2 model enhances SAM2's segmentation precision for high-resolution images by integrating multi-view feature interactions and a detail refinement module, achieving superior performance across diverse datasets.

Authors:Lingyu Liu, Yaxiong Wang, Li Zhu, Zhedong Zheng
Title: Every Painting Awakened: A Training-free Framework for Painting-to-Animation Generation
Abstract:
We introduce a training-free framework specifically designed to bring real-world static paintings to life through image-to-video (I2V) synthesis, addressing the persistent challenge of aligning these motions with textual guidance while preserving fidelity to the original artworks. Existing I2V methods, primarily trained on natural video datasets, often struggle to generate dynamic outputs from static paintings. It remains challenging to generate motion while maintaining visual consistency with real-world paintings. This results in two distinct failure modes: either static outputs due to limited text-based motion interpretation or distorted dynamics caused by inadequate alignment with real-world artistic styles. We leverage the advanced text-image alignment capabilities of pre-trained image models to guide the animation process. Our approach introduces synthetic proxy images through two key innovations: (1) Dual-path score distillation: We employ a dual-path architecture to distill motion priors from both real and synthetic data, preserving static details from the original painting while learning dynamic characteristics from synthetic frames. (2) Hybrid latent fusion: We integrate hybrid features extracted from real paintings and synthetic proxy images via spherical linear interpolation in the latent space, ensuring smooth transitions and enhancing temporal consistency. Experimental evaluations confirm that our approach significantly improves semantic alignment with text prompts while faithfully preserving the unique characteristics and integrity of the original paintings. Crucially, by achieving enhanced dynamic effects without requiring any model training or learnable parameters, our framework enables plug-and-play integration with existing I2V methods, making it an ideal solution for animating real-world paintings. More animated examples can be found on our project website.
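
Hybrid latent fusion via spherical linear interpolation can be written compactly; the sketch below slerps between a real-painting latent and a synthetic proxy latent, with the latent shape and mixing weight chosen purely for illustration.

```python
import torch

def slerp(z_real: torch.Tensor, z_proxy: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two latents, treated as flat vectors."""
    a, b = z_real.flatten(), z_proxy.flatten()
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.arccos(torch.clamp(torch.dot(a_n, b_n), -1 + eps, 1 - eps))
    so = torch.sin(omega)
    mixed = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return mixed.view_as(z_real)

if __name__ == "__main__":
    z_real = torch.randn(4, 64, 64)    # latent of the original painting (illustrative shape)
    z_proxy = torch.randn(4, 64, 64)   # latent of a synthetic proxy frame
    z_hybrid = slerp(z_real, z_proxy, t=0.3)
    print(z_hybrid.shape)
```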

Authors:Yoonshik Kim, Jaeyoon Jung
Title: KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Abstract:
The recent emergence of Large Vision-Language Models(VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA
中文: 现有视觉语言模型评估方法常限制开放性回答或依赖主观评判,因此我们推出KOFFVQA韩语开放式视觉问答基准,通过预设评分标准实现客观可靠的模型评估。
English: Current VLM evaluations often sacrifice open-endedness or rely on subjective judge models, so we introduce KOFFVQA—a Korean free-form visual question answering benchmark with objective grading criteria to ensure reliable evaluation.

Authors:Hongwei Ren, Xiaopeng Lin, Hongxiang Huang, Yue Zhou, Bojun Cheng
Title: Exploring Temporal Dynamics in Event-based Eye Tracker
Abstract:
Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vision systems, are capable of perceiving eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for achieving high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker utilizes 3D convolutional neural networks to capture implicit short-term temporal dynamics and employs a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics. Ultimately, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset and secured Third place in the CVPR event-based eye-tracking challenge 2025. Our code is available at https://github.com/rhwxmx/TDTracker.
Chinese: TDTracker是一种创新的眼动追踪框架,通过结合3D卷积神经网络和级联结构来捕捉短期与长期时间动态,利用事件相机实现了高速高精度的眼动追踪,并取得了最先进的性能表现。
English: TDTracker is an innovative eye-tracking framework that employs 3D CNNs and a cascaded structure to capture both short-term and long-term temporal dynamics, achieving state-of-the-art performance in high-speed, high-precision eye movement tracking with event cameras.
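
Regressing eye coordinates from a prediction heatmap is commonly done with a differentiable soft-argmax; the sketch below shows that operation under assumed tensor shapes and normalization, not necessarily TDTracker's exact head.

```python
import torch

def soft_argmax_2d(heatmap: torch.Tensor) -> torch.Tensor:
    """Differentiable coordinate regression from a [B, H, W] heatmap.

    Returns normalized (x, y) in [0, 1]; the softmax over all pixels and the
    output normalization are illustrative choices.
    """
    b, h, w = heatmap.shape
    probs = torch.softmax(heatmap.reshape(b, -1), dim=-1).reshape(b, h, w)
    ys = torch.linspace(0, 1, h, device=heatmap.device)
    xs = torch.linspace(0, 1, w, device=heatmap.device)
    y = (probs.sum(dim=2) * ys).sum(dim=1)    # expected row coordinate
    x = (probs.sum(dim=1) * xs).sum(dim=1)    # expected column coordinate
    return torch.stack([x, y], dim=-1)

if __name__ == "__main__":
    hm = torch.zeros(1, 60, 80)
    hm[0, 15, 40] = 50.0                      # a confident peak
    print(soft_argmax_2d(hm))                 # approx [[0.51, 0.25]]
```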

Authors:Yi Liu, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang
Title: Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space
Abstract:
Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at https://github.com/Ly403/EMRDM.
中文: 本文提出EMRDM模型,基于均值回复扩散方法,通过模块化框架和改进流程直接将多云图像转为无云图像,在实验中展现出卓越性能。
English: The paper introduces EMRDM, a cloud removal model based on mean-reverting diffusion, which directly transforms cloudy images into cloudless ones through a modular framework and enhanced processes, demonstrating superior performance in experiments.
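
A mean-reverting forward process can be pictured as an Ornstein-Uhlenbeck-style drift from the cloudless image toward the cloudy image with injected noise. The discretization below is a minimal sketch with illustrative theta/sigma values, not the paper's elucidated parameterization or its ODE-based backward process.

```python
import numpy as np

def mean_reverting_forward(x0: np.ndarray, mu: np.ndarray, theta: float = 2.0,
                           sigma: float = 0.1, n_steps: int = 100) -> np.ndarray:
    """Simulate a discretized mean-reverting (OU-style) forward process.

    The state drifts from the cloudless image `x0` toward the cloudy image `mu`
    while noise is injected; theta, sigma, and n_steps are illustrative only.
    """
    x, dt = x0.copy(), 1.0 / n_steps
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        drift = theta * (mu - x) * dt
        noise = sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + drift + noise
    return x

if __name__ == "__main__":
    cloudless = np.zeros((8, 8))
    cloudy = np.ones((8, 8))
    x_T = mean_reverting_forward(cloudless, cloudy)
    print(float(x_T.mean()))   # roughly 1 - e^(-2) ~= 0.86 after unit time with theta=2
```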

Authors:Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, Wu Liu
Title: HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
Abstract:
Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using the powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucinations of individual MLLMs. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. Project webpage is available at https://liuqi-creat.github.io/HOIGen.github.io.

Authors:Haitao Tian, Junyang Li, Chenxing Wang, Helong Jiang
Title: Detail-aware multi-view stereo network for depth estimation
Abstract:
Multi-view stereo methods have achieved great success in depth estimation with coarse-to-fine depth learning frameworks; however, existing methods perform poorly in recovering the depth of object boundaries and detail regions. To address these issues, we propose a detail-aware multi-view stereo network (DA-MVSNet) with a coarse-to-fine framework. The geometric depth clues hidden in the coarse stage are utilized to maintain the geometric structural relationships between object surfaces and enhance the expressive capability of image features. In addition, an image synthesis loss is employed to constrain the gradient flow for detailed regions and further strengthen the supervision of object boundaries and texture-rich areas. Finally, we propose an adaptive depth interval adjustment strategy to improve the accuracy of object reconstruction. Extensive experiments on the DTU and Tanks & Temples datasets demonstrate that our method achieves competitive results. The code is available at https://github.com/wsmtht520-/DAMVSNet.
中文: 提出的DA-MVSNet通过利用几何深度线索、采用图像合成损失和自适应深度间隔策略,有效改善了物体边界和细节区域的深度恢复效果,在基准数据集上取得了优异性能。
English: The proposed DA-MVSNet addresses poor depth recovery at object boundaries and detailed regions by leveraging geometric depth clues, employing an image synthesis loss, and implementing an adaptive depth interval strategy, achieving competitive results on benchmark datasets.

Authors:Takeshi Noda, Chao Chen, Junsheng Zhou, Weiqi Zhang, Yu-Shen Liu, Zhizhong Han
Title: Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation
Abstract:
Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods. Project page: https://takeshie.github.io/Bijective-SDF
Chinese: 本文提出了一种新颖方法,通过学习动态变形网络和双射表面参数化从稀疏点云预测有符号距离函数,在表面重建方面显著优于现有技术。
English: This paper introduces a novel method that learns a dynamic deformation network and bijective surface parameterization to predict signed distance functions from sparse point clouds, significantly outperforming existing techniques in surface reconstruction.

Authors:Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy S. Ren, Chun-Le Guo, Chongyi Li
Title: DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution
Abstract:
Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. Project Page: https://adam-duan.github.io/projects/dit4sr/.

Authors:Maximilian Augustin, Yannic Neuhaus, Matthias Hein
Title: DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Abstract:
Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the ''natural image manifold'' to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 19k clusters with 950k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at https://YanNeu.github.io/DASH.

Authors:Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu
Title: ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 \(\mathcal{J}\&\mathcal{F}\) on MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.
中文: ReferDINO-Plus通过融合SAM2的优质掩码和条件掩码融合策略,提升了视频对象分割性能,在CVPR 2025 MeViS挑战赛中以60.43分获得第二名。
English: ReferDINO-Plus enhances video object segmentation by integrating SAM2's superior mask quality and a conditional fusion strategy, achieving second place in the CVPR 2025 MeViS challenge with a score of 60.43.
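As a rough illustration of conditional mask fusion, the sketch below blends per-object mask probabilities from two models depending on whether the expression describes one object or several; the actual fusion condition and weighting in ReferDINO-Plus may differ, and all names here are hypothetical.

```python
import torch

def conditional_mask_fusion(masks_dino, masks_sam2, multi_object: bool, alpha=0.5):
    """Fuse mask probabilities from two models.

    masks_dino, masks_sam2: (N, H, W) float tensors of mask probabilities.
    A hypothetical rule: trust SAM2 in single-object scenes (better boundary
    quality), fall back to a weighted average in multi-object scenes."""
    if multi_object:
        fused = alpha * masks_dino + (1 - alpha) * masks_sam2
    else:
        fused = masks_sam2
    return (fused > 0.5).float()

masks_a = torch.rand(2, 128, 128)
masks_b = torch.rand(2, 128, 128)
print(conditional_mask_fusion(masks_a, masks_b, multi_object=True).shape)
```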

Authors:Jannik Endres, Oliver Hahn, Charles Corbière, Simone Schaub-Meyer, Stefan Roth, Alexandre Alahi
Title: Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model
Abstract:
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.

Authors:Ashim Dahal, Saydul Akbar Murad, Nick Rahimi
Title: Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning
Abstract:
Understanding the representation shift on Vision Language Models like CLIP under different augmentations provides valuable insights into mechanistic interpretability. In this study, we show the shift in CLIP's embeddings under 9 common augmentation techniques: noise, blur, color jitter, scale and rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We scrutinize the embedding shifts under similarity on attention map, patch, edge, detail preservation, cosine similarity, L2 distance, pairwise distance and dendrogram clusters and provide qualitative analysis on sample images. Our findings suggest that certain augmentations, such as noise, perspective transform, and shift scaling, have a more drastic impact on the embedding shift. This study provides a concrete foundation for future work on VLM's robustness for mechanistic interpretability and adversarial data defense. The code implementation for this study can be found on \href{https://github.com/ashimdahal/clip-shift-analysis}{https://github.com/ashimdahal/clip-shift-analysis}.
Chinese: 本研究分析了九种常见图像增强技术对CLIP嵌入表示的影响,发现噪声、透视变换和尺度缩放会导致最显著的嵌入偏移,为视觉语言模型的机制可解释性和鲁棒性研究奠定了基础。
English: This study analyzes how nine common image augmentations affect CLIP's embedding shifts, revealing that noise, perspective transforms, and scaling cause the most significant changes, thereby advancing mechanistic interpretability and robustness in vision-language models.
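A minimal sketch of how such an embedding-shift measurement can be set up: encode an image before and after an augmentation and compare the embeddings with cosine similarity and L2 distance, two of the metrics listed above. The encoder and augmentation below are toy stand-ins, not the CLIP model or augmentation suite used in the study.

```python
import torch
import torch.nn.functional as F

def embedding_shift(encode, img, augment):
    """Compare the embedding of an image before and after augmentation.

    encode: callable mapping (B, C, H, W) -> (B, D) embeddings (e.g. a CLIP
    image encoder); augment: callable on the image tensor.
    Returns per-image cosine similarity and L2 distance."""
    with torch.no_grad():
        e0 = F.normalize(encode(img), dim=-1)
        e1 = F.normalize(encode(augment(img)), dim=-1)
    cos = (e0 * e1).sum(-1)
    l2 = (e0 - e1).norm(dim=-1)
    return cos, l2

# toy stand-ins: a fixed random projection as "encoder", additive noise as augmentation
W = torch.randn(3 * 32 * 32, 64)
encode = lambda x: x.flatten(1) @ W
augment = lambda x: x + 0.1 * torch.randn_like(x)
cos, l2 = embedding_shift(encode, torch.rand(4, 3, 32, 32), augment)
print(cos.shape, l2.shape)
```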

Authors:Chenglong Lu, Shen Liang, Xuewei Wang, Wei Wang
Title: Reinforcement Learning-based Token Pruning in Vision Transformers: A Markov Game Approach
Abstract:
Vision Transformers (ViTs) have computational costs scaling quadratically with the number of tokens, calling for effective token pruning policies. Most existing policies are handcrafted, lacking adaptivity to varying inputs. Moreover, they fail to consider the sequential nature of token pruning across multiple layers. In this work, for the first time (as far as we know), we exploit Reinforcement Learning (RL) to data-adaptively learn a pruning policy. Formulating token pruning as a sequential decision-making problem, we model it as a Markov Game and utilize Multi-Agent Proximal Policy Optimization (MAPPO) where each agent makes an individualized pruning decision for a single token. We also develop reward functions that enable simultaneous collaboration and competition of these agents to balance efficiency and accuracy. On the well-known ImageNet-1k dataset, our method improves the inference speed by up to 44% while incurring only a negligible accuracy drop of 0.4%. The source code is available at https://github.com/daashuai/rl4evit.
中文: 视觉变换器的计算成本随令牌数量呈二次方增长,为此我们采用强化学习中的MAPPO方法自适应修剪令牌,在几乎不影响精度的情况下将推理速度提升44%。
English: Vision Transformers face high computational costs due to quadratic token scaling, so we introduce a reinforcement learning approach using MAPPO to adaptively prune tokens, boosting inference speed by 44% with minimal accuracy loss.
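The sketch below illustrates one plausible shape for a per-layer reward that such token agents could share, trading prediction fidelity against the fraction of tokens kept; the paper's MAPPO agents and its collaboration/competition reward terms are more elaborate, and the weights and names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pruning_reward(logits_pruned, logits_full, labels, keep_mask, lam=0.1):
    """Shared reward for the token agents in one layer (illustrative).

    Keeping decisions are rewarded when the pruned model stays accurate and
    agrees with the full model, and penalised for every token kept."""
    acc = (logits_pruned.argmax(-1) == labels).float()                    # (B,)
    fidelity = -F.kl_div(logits_pruned.log_softmax(-1),                   # agreement with full model
                         logits_full.softmax(-1), reduction="none").sum(-1)
    keep_ratio = keep_mask.float().mean(dim=1)                            # fraction of kept tokens
    return acc + fidelity - lam * keep_ratio

logits_p = torch.randn(8, 10)
logits_f = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
keep = torch.randint(0, 2, (8, 196), dtype=torch.bool)
print(pruning_reward(logits_p, logits_f, labels, keep).shape)
```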

Authors:Maofu Liu, Xin Jiang, Xiaokang Zhang
Title: CADFormer: Fine-Grained Cross-modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation
Abstract:
Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task, aiming to segment specific target objects in remote sensing (RS) images based on a given language expression. Existing RRSIS methods typically employ coarse-grained unidirectional alignment approaches to obtain multimodal features, and they often overlook the critical role of language features as contextual information during the decoding process. Consequently, these methods exhibit weak object-level correspondence between visual and language features, leading to incomplete or erroneous predicted masks, especially when handling complex expressions and intricate RS image scenes. To address these challenges, we propose a fine-grained cross-modal alignment and decoding Transformer, CADFormer, for RRSIS. Specifically, we design a semantic mutual guidance alignment module (SMGAM) to achieve both vision-to-language and language-to-vision alignment, enabling comprehensive integration of visual and textual features for fine-grained cross-modal alignment. Furthermore, a textual-enhanced cross-modal decoder (TCMD) is introduced to incorporate language features during decoding, using refined textual information as context to enhance the relationship between cross-modal features. To thoroughly evaluate the performance of CADFormer, especially for inconspicuous targets in complex scenes, we constructed a new RRSIS dataset, called RRSIS-HR, which includes larger high-resolution RS image patches and semantically richer language expressions. Extensive experiments on the RRSIS-HR dataset and the popular RRSIS-D dataset demonstrate the effectiveness and superiority of CADFormer. Datasets and source codes will be available at https://github.com/zxk688.
中文摘要:该研究提出CADFormer模型,通过细粒度跨模态对齐和解码技术,结合语言特征增强遥感图像分割的准确性,并在新的高分辨率数据集上验证了其优越性能。
English Summary: The study introduces CADFormer, a Transformer-based model with fine-grained cross-modal alignment and decoding to enhance Referring Remote Sensing Image Segmentation by improving object-level correspondence and integrating language features, validated on a new high-resolution dataset.

Authors:Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua
Title: Efficient Token Compression for Vision Transformer with Spatial Information Preserved
Abstract:
Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64$\times$ speed-up with only a 0.2\% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge.
Chinese: 本文提出了一种名为“剪枝与合并”的高效令牌压缩方法,通过可训练的合并与重构矩阵及梯度加权注意力机制,在ImageNet-1k和ADE20K数据集上实现了显著加速且精度损失极小。
English: This paper introduces Prune and Merge, an efficient token compression method for transformers that integrates pruning and merging with trainable matrices and a gradient-weighted attention mechanism, achieving significant speed-ups on ImageNet-1k and ADE20K with minimal accuracy loss.
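As a rough sketch of gradient-weighted attention scoring, the snippet below weights the attention each token receives by the gradient magnitude of the loss with respect to that token's features, yielding a training-time importance score; the exact scoring rule and the merge/reconstruct matrices of Prune and Merge are not reproduced here.

```python
import torch

def gradient_weighted_scores(attn, token_feats, loss):
    """Token importance = attention each token receives, weighted by the
    gradient of the loss w.r.t. its features (a rough proxy for the token's
    global impact). Illustrative sketch only.

    attn: (B, heads, N, N) attention weights; token_feats: (B, N, D) with
    requires_grad=True; loss: scalar already computed from token_feats."""
    grads, = torch.autograd.grad(loss, token_feats, retain_graph=True)
    received = attn.mean(dim=1).sum(dim=1)     # (B, N): attention received per token
    grad_mag = grads.norm(dim=-1)              # (B, N): gradient magnitude per token
    return received * grad_mag                 # higher score = more important to keep

B, H, N, D = 2, 4, 16, 32
feats = torch.randn(B, N, D, requires_grad=True)
attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
loss = (feats ** 2).mean()
print(gradient_weighted_scores(attn, feats, loss).shape)   # torch.Size([2, 16])
```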

Authors:Maofu Liu, Jiahui Liu, Xiaokang Zhang
Title: Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning
Abstract:
Remote sensing image captioning aims to generate semantically accurate descriptions that are closely linked to the visual features of remote sensing images. Existing approaches typically emphasize fine-grained extraction of visual features and capturing global information. However, they often overlook the complementary role of textual information in enhancing visual semantics and face challenges in precisely locating objects that are most relevant to the image context. To address these challenges, this paper presents a semantic-spatial feature fusion with dynamic graph refinement (SFDR) method, which integrates the semantic-spatial feature fusion (SSFF) and dynamic graph feature refinement (DGFR) modules. The SSFF module utilizes a multi-level feature representation strategy by leveraging pre-trained CLIP features, grid features, and ROI features to integrate rich semantic and spatial information. In the DGFR module, a graph attention network captures the relationships between feature nodes, while a dynamic weighting mechanism prioritizes objects that are most relevant to the current scene and suppresses less significant ones. Therefore, the proposed SFDR method significantly enhances the quality of the generated descriptions. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method. The source code will be available at https://github.com/zxk688.
中文: 本文提出SFDR方法,通过语义空间特征融合和动态图优化模块,有效结合视觉与文本信息,显著提升了遥感图像描述的准确性。
English: This paper introduces the SFDR method, which combines semantic-spatial feature fusion and dynamic graph refinement to enhance remote sensing image captioning by better integrating visual and textual information for more accurate descriptions.

Authors:Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha
Title: VideoGen-Eval: Agent-based System for Video Generation Evaluation
Abstract:
The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase the model's capabilities, fixed evaluation operators struggling with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve a dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models, among which 8 cutting-edge models are selected for quantitative evaluation by the agent and by human raters. Extensive experiments validate that our proposed agent-based evaluation system aligns strongly with human preferences and completes evaluations reliably, and they confirm the diversity and richness of the benchmark.
中文摘要:视频生成技术的快速发展凸显了现有评估系统的不足,为此我们提出了VideoGen-Eval评估系统,该系统整合了大语言模型和多模态大语言模型技术,通过结构化提示和专用工具实现动态灵活的视频生成评估,并构建包含丰富样本的基准数据集验证了其与人类评估的高度一致性。
English Summary: The rapid progress in video generation has exposed limitations in current evaluation systems, leading to the development of VideoGen-Eval, an agent-based system that integrates LLM and MLLM technologies with specialized tools to provide dynamic and human-aligned assessments, validated through extensive experiments on a comprehensive benchmark.

Authors:Aimira Baitieva, Yacine Bouaouni, Alexandre Briot, Dick Ameln, Souhaiel Khalfaoui, Samet Akcay
Title: Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection
Abstract:
Anomaly detection (AD) is essential for automating visual inspection in manufacturing. This field of computer vision is rapidly evolving, with increasing attention towards real-world applications. Meanwhile, popular datasets are typically produced in controlled lab environments with artificially created defects, unable to capture the diversity of real production conditions. New methods often fail in production settings, showing significant performance degradation or requiring impractical computational resources. This disconnect between academic results and industrial viability threatens to misdirect visual anomaly detection research. This paper makes three key contributions: (1) we demonstrate the importance of real-world datasets and establish benchmarks using actual production data, (2) we provide a fair comparison of existing SOTA methods across diverse tasks by utilizing metrics that are valuable for practical applications, and (3) we present a comprehensive analysis of recent advancements in this field by discussing important challenges and new perspectives for bridging the academia-industry gap. The code is publicly available at https://github.com/abc-125/viad-benchmark
中文摘要:本文通过建立真实场景基准、公平评估先进方法并提供全面分析,旨在弥合视觉异常检测研究中学术成果与工业应用之间的鸿沟。
English Summary: This paper addresses the disconnect between academic anomaly detection research and industrial needs by establishing real-world benchmarks, fairly evaluating state-of-the-art methods, and providing comprehensive analysis to bridge the gap between theory and practical applications.

Authors:Xin Zuo, Jiaran Jiang, Jifeng Shen, Wankou Yang
Title: Improving underwater semantic segmentation with underwater image quality attention and muti-scale aggregation attention
Abstract:
Underwater image understanding is crucial for both submarine navigation and seabed exploration. However, the low illumination in underwater environments degrades the imaging quality, which in turn seriously deteriorates the performance of underwater semantic segmentation, particularly for outlining the object region boundaries. To tackle this issue, we present UnderWater SegFormer (UWSegFormer), a transformer-based framework for semantic segmentation of low-quality underwater images. Firstly, we propose the Underwater Image Quality Attention (UIQA) module. This module enhances the representation of high-quality semantic information in underwater image feature channels through a channel self-attention mechanism. In order to address the issue of loss of imaging details due to the underwater environment, the Multi-scale Aggregation Attention (MAA) module is proposed. This module aggregates sets of semantic features at different scales by extracting discriminative information from high-level features, thus compensating for the semantic loss of detail in underwater objects. Finally, during training, we introduce Edge Learning Loss (ELL) in order to enhance the model's learning of underwater object edges and improve the model's prediction accuracy. Experiments conducted on the SUIM and DUT-USEG (DUT) datasets have demonstrated that the proposed method has advantages in terms of segmentation completeness, boundary clarity, and subjective perceptual details when compared to SOTA methods. In addition, the proposed method achieves the highest mIoU of 82.12 and 71.41 on the SUIM and DUT datasets, respectively. Code will be available at https://github.com/SAWRJJ/UWSegFormer.
中文: UWSegFormer框架通过引入水下图像质量注意力、多尺度聚合模块和边缘学习损失,有效解决了水下低照度导致的语义分割边界模糊问题,在多个数据集上取得了最优性能。
English: The proposed UWSegFormer framework enhances underwater semantic segmentation by integrating quality attention, multi-scale feature aggregation, and edge learning loss to address low illumination challenges, achieving state-of-the-art performance on benchmark datasets.

Authors:Haiduo Huang, Yadong Zhang, Pengju Ren
Title: KernelDNA: Dynamic Kernel Sharing via Decoupled Naive Adapters
Abstract:
Dynamic convolution enhances model capacity by adaptively combining multiple kernels, yet faces critical trade-offs: prior works either (1) incur significant parameter overhead by scaling kernel numbers linearly, (2) compromise inference speed through complex kernel interactions, or (3) struggle to jointly optimize dynamic attention and static kernels. We also observe that pre-trained Convolutional Neural Networks (CNNs) exhibit inter-layer redundancy akin to that in Large Language Models (LLMs). Specifically, dense convolutional layers can be efficiently replaced by derived ``child'' layers generated from a shared ``parent'' convolutional kernel through an adapter. To address these limitations and implement the weight-sharing mechanism, we propose a lightweight convolution kernel plug-in, named KernelDNA. It decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, ensuring both parameter efficiency and hardware-friendly inference. Unlike existing dynamic convolutions that expand parameters via multi-kernel ensembles, our method leverages cross-layer weight sharing and adapter-based modulation, enabling dynamic kernel specialization without altering the standard convolution structure. This design preserves the native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments. Experiments on image classification and dense prediction tasks demonstrate that KernelDNA achieves state-of-the-art accuracy-efficiency balance among dynamic convolution variants. Our codes are available at https://github.com/haiduo/KernelDNA.
中文: KernelDNA提出了一种轻量级卷积插件,通过跨层权重共享和基于适配器的调制实现动态核适应,在不改变标准卷积结构的情况下达到了最优的精度-效率平衡。
English: KernelDNA introduces a lightweight convolution plug-in that enables dynamic kernel adaptation through cross-layer weight sharing and adapter-based modulation, achieving state-of-the-art accuracy-efficiency balance without altering standard convolution structures.
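The sketch below illustrates the parent/child weight-sharing idea in a simplified form: several layers share one parent kernel, each modulated by a tiny static adapter and an input-dependent gate. It conveys only the spirit of the design; KernelDNA's actual adapter and routing differ, and all names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChildConv(nn.Module):
    """A 'child' conv layer derived from a shared 'parent' kernel via a
    lightweight static adapter and an input-dependent dynamic gate.
    Illustrative sketch of cross-layer weight sharing only."""
    def __init__(self, parent_weight: nn.Parameter, channels: int):
        super().__init__()
        self.parent = parent_weight                               # (C, C, k, k), shared across layers
        self.static_scale = nn.Parameter(torch.ones(channels, 1, 1, 1))
        self.router = nn.Linear(channels, channels)               # dynamic per-channel gate

    def forward(self, x):
        gate = torch.sigmoid(self.router(x.mean(dim=(2, 3))))     # (B, C) input-dependent routing
        w = self.parent * self.static_scale                       # adapted child kernel
        y = F.conv2d(x, w, padding=self.parent.shape[-1] // 2)
        return y * gate[:, :, None, None]                         # input-adaptive modulation

C = 16
parent = nn.Parameter(torch.randn(C, C, 3, 3) * 0.1)
layer1, layer2 = ChildConv(parent, C), ChildConv(parent, C)       # two layers share one parent kernel
x = torch.randn(2, C, 32, 32)
print(layer2(layer1(x)).shape)
```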

Authors:Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
Title: JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Abstract:
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. Further, we specifically devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
中文:JavisDiT是一种新颖的联合视听扩散变换器,通过分层对齐机制生成同步的高质量音视频内容,并凭借其JavisBench数据集设立了新基准。
English: JavisDiT is a novel Joint Audio-Video Diffusion Transformer that generates synchronized high-quality audio and video content using a hierarchical alignment mechanism and sets a new benchmark with its JavisBench dataset.

Authors:Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
Title: VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
Abstract:
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics with vision and language informed physical prior. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring the inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.
中文摘要:针对视频扩散模型常生成不符合物理规律内容的问题,本文提出两阶段生成框架,通过视觉语言模型进行物理感知推理来指导视频生成,实现更符合现实物理规律的运动效果。
English Summary: Video diffusion models often produce physically implausible content, so this paper introduces a two-stage framework that integrates physics-aware reasoning through vision-language models to guide video generation toward more realistic motion.

Authors:Hang Guo, Yawei Li, Taolin Zhang, Jiangshan Wang, Tao Dai, Shu-Tao Xia, Luca Benini
Title: FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Abstract:
Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, leading to the complexity and runtime scaling dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step where most tokens have already converged. Leveraging this observation, we develop the cached token pruning strategy that only forwards pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves the efficiency at larger resolutions. Experiments show the proposed FastVAR can further speedup FlashAttention-accelerated VAR by 2.7$\times$ with negligible performance drop of <1%. We further extend FastVAR to zero-shot generation of higher resolution images. In particular, FastVAR can generate one 2K image with 15GB memory footprints in 1.5s on a single NVIDIA 3090 GPU. Code is available at https://github.com/csguoh/FastVAR.
Chinese: FastVAR提出了一种训练后加速方法,通过缓存令牌剪枝策略仅在前向传播中处理关键令牌,显著降低了大规模步骤的延迟,实现了2.7倍加速且性能损失可忽略。
English: FastVAR introduces a post-training acceleration method that uses cached token pruning to significantly reduce latency by forwarding only pivotal tokens at large-scale steps, achieving a 2.7× speedup with minimal performance loss.
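A minimal sketch of cached token pruning: at a large-scale step, only the tokens that changed most relative to the cached previous-scale map are forwarded, and the pruned slots are restored from the cache. The selection criterion and keep ratio below are illustrative assumptions, not FastVAR's exact policy.

```python
import torch

def cached_token_pruning(tokens, cached_tokens, forward_fn, keep_ratio=0.3):
    """Forward only 'pivotal' tokens at a large-scale VAR step.

    tokens, cached_tokens: (B, N, D). Pivotal tokens are those that differ
    most from the cached tokens; pruned slots are restored from the cache."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    diff = (tokens - cached_tokens).norm(dim=-1)                 # (B, N) token change magnitude
    idx = diff.topk(k, dim=1).indices                            # indices of pivotal tokens
    gathered = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    updated = forward_fn(gathered)                               # run the model on pivotal tokens only
    out = cached_tokens.clone()                                  # restore pruned slots from cache
    out.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, D), updated)
    return out

toy_block = torch.nn.Linear(64, 64)
toks = torch.randn(2, 256, 64)
cache = torch.randn(2, 256, 64)
print(cached_token_pruning(toks, cache, toy_block).shape)
```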

Authors:Hongxiang Jiang, Jihao Yin, Qixiong Wang, Jiaqi Feng, Guo Chen
Title: EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing
Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated impressive results in various visual tasks. However, in remote sensing (RS), high resolution and small proportion of objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly in precise localization and fine-grained attribute description for each object. These RS MLLMs have not yet surpassed classical visual perception models, as they only provide coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with the Attribute Disentangle module, EagleVision learns disentanglement vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at https://github.com/XiangTodayEatsWhat/EagleVision.
Chinese: EagleVision是专为遥感领域设计的多模态大语言模型,通过引入属性解耦模块和新数据集,在目标检测和属性理解方面取得突破,实现了细粒度任务的最优性能。
English: EagleVision is a multimodal large language model specifically designed for remote sensing that overcomes challenges in object detection and attribute understanding by introducing an Attribute Disentangle module and a new dataset, achieving state-of-the-art performance in fine-grained tasks.

Authors:Hyunsik Jeon, Satoshi Koide, Yu Wang, Zhankui He, Julian McAuley
Title: LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation
Abstract:
Conversational recommender systems engage users in dialogues to refine their needs and provide more personalized suggestions. Although textual information suffices for many domains, visually driven categories such as fashion or home decor potentially require detailed visual information related to color, style, or design. To address this challenge, we propose LaViC (Large Vision-Language Conversational Recommendation Framework), a novel approach that integrates compact image representations into dialogue-based recommendation systems. LaViC leverages a large vision-language model in a two-stage process: (1) visual knowledge self-distillation, which condenses product images from hundreds of tokens into a small set of visual tokens in a self-distillation manner, significantly reducing computational overhead, and (2) recommendation prompt tuning, which enables the model to incorporate both dialogue context and distilled visual tokens, providing a unified mechanism for capturing textual and visual features. To support rigorous evaluation of visually-aware conversational recommendation, we construct a new dataset by aligning Reddit conversations with Amazon product listings across multiple visually oriented categories (e.g., fashion, beauty, and home). This dataset covers realistic user queries and product appearances in domains where visual details are crucial. Extensive experiments demonstrate that LaViC significantly outperforms text-only conversational recommendation methods and open-source vision-language baselines. Moreover, LaViC achieves competitive or superior accuracy compared to prominent proprietary baselines (e.g., GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), demonstrating the necessity of explicitly using visual data for capturing product attributes and showing the effectiveness of our vision-language integration. Our code and dataset are available at https://github.com/jeon185/LaViC.
中文:LaViC框架通过两阶段流程整合精炼视觉标记与对话上下文,在视觉导向领域显著优于纯文本推荐方法,并与主流专有模型达到同等或更高准确率。
English: The LaViC framework enhances conversational recommender systems by integrating distilled visual tokens and dialogue context through a two-stage process, significantly outperforming text-only methods and achieving competitive accuracy with proprietary models in visually-oriented domains.

Authors:Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen
Title: MoCha: Towards Movie-Grade Talking Character Synthesis
Abstract:
Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head generation, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.

Authors:Lu Yu, Haoyu Han, Zhe Tao, Hantao Yao, Changsheng Xu
Title: Language Guided Concept Bottleneck Models for Interpretable Continual Learning
Abstract:
Continual learning (CL) aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. CL faces the challenge of mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. However, as new information is introduced, the interpretability of the learning process becomes crucial for understanding the evolving decision-making process, yet it is rarely explored. In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. Our approach leverages the Concept Bottleneck Layer, aligning semantic consistency with CLIP models to learn human-understandable concepts that can generalize across tasks. By focusing on interpretable concepts, our method not only enhances the model's ability to retain knowledge over time but also provides transparent decision-making insights. We demonstrate the effectiveness of our approach by achieving superior performance on several datasets, outperforming state-of-the-art methods with an improvement of up to 3.06% in final average accuracy on ImageNet-subset. Additionally, we offer concept visualizations for model predictions, further advancing the understanding of interpretable continual learning.
中文: 本文提出了一种新颖的持续学习框架,通过整合语言引导的概念瓶颈模型来学习跨任务的人类可理解概念,不仅有效缓解灾难性遗忘并提升模型性能(在ImageNet子集上准确率最高提升3.06%),同时提供了透明的决策解释能力。
English: This paper introduces a novel continual learning framework that integrates language-guided Concept Bottleneck Models to simultaneously mitigate catastrophic forgetting and enhance interpretability by learning human-understandable concepts across tasks, achieving superior performance with up to 3.06% accuracy improvement on ImageNet-subset.
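A minimal sketch of a language-guided concept bottleneck layer: image features are projected into the space of frozen concept text embeddings, the resulting concept scores form the interpretable bottleneck, and a linear head maps them to class logits. The CLIP alignment, continual-learning machinery, and dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneck(nn.Module):
    """Image features -> concept scores (similarity to concept text
    embeddings) -> class logits. In the actual framework the concept
    embeddings would come from a CLIP text encoder; here they are random."""
    def __init__(self, concept_embeds: torch.Tensor, num_classes: int, feat_dim: int):
        super().__init__()
        self.register_buffer("concepts", F.normalize(concept_embeds, dim=-1))  # (K, D), frozen
        self.proj = nn.Linear(feat_dim, concept_embeds.shape[1])
        self.head = nn.Linear(concept_embeds.shape[0], num_classes)

    def forward(self, feats):
        z = F.normalize(self.proj(feats), dim=-1)          # (B, D)
        concept_scores = z @ self.concepts.t()             # (B, K), interpretable bottleneck
        return self.head(concept_scores), concept_scores

K, D, feat_dim, num_classes = 32, 512, 768, 10
cbm = ConceptBottleneck(torch.randn(K, D), num_classes, feat_dim)
logits, scores = cbm(torch.randn(4, feat_dim))
print(logits.shape, scores.shape)
```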

Authors:Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, Daniel Cremers
Title: AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
Abstract:
Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust towards dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera poses. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds.

Authors:Björn Möller, Lucas Görnhardt, Tim Fingscheidt
Title: A Lightweight Image Super-Resolution Transformer Trained on Low-Resolution Images Only
Abstract:
Transformer architectures prominently lead single-image super-resolution (SISR) benchmarks, reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. Their strong representative power, however, comes with a higher demand for training data compared to convolutional neural networks (CNNs). For many real-world SR applications, the availability of high-quality HR training images is not given, sparking interest in LR-only training methods. The LR-only SISR benchmark mimics this condition by allowing only low-resolution (LR) images for model training. For a 4x super-resolution, this effectively reduces the amount of available training data to 6.25% of the HR image pixels, which puts the employment of a data-hungry transformer model into question. In this work, we are the first to utilize a lightweight vision transformer model with LR-only training methods addressing the unsupervised SISR LR-only benchmark. We adopt and configure a recent LR-only training method from microscopy image super-resolution to macroscopic real-world data, resulting in our multi-scale training method for bicubic degradation (MSTbic). Furthermore, we compare it with reference methods and prove its effectiveness both for a transformer and a CNN model. We evaluate on the classic SR benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109, and show superior performance over state-of-the-art (so far: CNN-based) LR-only SISR methods. The code is available on GitHub: https://github.com/ifnspaml/SuperResolutionMultiscaleTraining.
中文: 本研究提出了一种轻量级视觉变换器及多尺度训练方法,用于无监督单图像超分辨率任务,在仅使用低分辨率图像训练的条件下超越了现有基于卷积神经网络的方法性能。
English: This study introduces a lightweight vision transformer with a multi-scale training method for unsupervised single-image super-resolution, achieving superior performance on LR-only benchmarks compared to existing CNN-based approaches.
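To illustrate the LR-only setting, the sketch below builds training pairs from LR images alone by bicubic-downscaling them at several scales and treating the original LR image as the target; this mirrors the spirit of multi-scale bicubic training but omits the exact schedule and losses of MSTbic.

```python
import torch
import torch.nn.functional as F

def lr_only_pairs(lr_batch, scales=(2, 3, 4)):
    """Build (input, target) pairs from LR images only: the LR image serves
    as the 'HR' target and a bicubic-downscaled copy as the input, at
    several scales. Sketch of the multi-scale idea, not the full MSTbic."""
    pairs = []
    for s in scales:
        h, w = lr_batch.shape[-2:]
        down = F.interpolate(lr_batch, size=(h // s, w // s),
                             mode="bicubic", align_corners=False)
        pairs.append((down, lr_batch))    # train the SR model to map down -> lr_batch
    return pairs

lr = torch.rand(4, 3, 96, 96)
for inp, tgt in lr_only_pairs(lr):
    print(inp.shape, tgt.shape)
```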

Authors:Nam Anh Dinh, Itai Lang, Hyunwoo Kim, Oded Stein, Rana Hanocka
Title: Geometry in Style: 3D Stylization via Surface Normal Deformation
Abstract:
We present Geometry in Style, a new method for identity-preserving mesh stylization. Existing techniques either adhere to the original shape through overly restrictive deformations such as bump maps or significantly modify the input shape using expressive deformations that may introduce artifacts or alter the identity of the source shape. In contrast, we represent a deformation of a triangle mesh as a target normal vector for each vertex neighborhood. The deformations we recover from target normals are expressive enough to enable detailed stylizations yet restrictive enough to preserve the shape's identity. We achieve such deformations using our novel differentiable As-Rigid-As-Possible (dARAP) layer, a neural-network-ready adaptation of the classical ARAP algorithm which we use to solve for per-vertex rotations and deformed vertices. As a differentiable layer, dARAP is paired with a visual loss from a text-to-image model to drive deformations toward style prompts, altogether giving us Geometry in Style. Our project page is at https://threedle.github.io/geometry-in-style.

Authors:Alessio Borgi, Luca Maiano, Irene Amerini
Title: Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation
Abstract:
We introduce Z-SASLM, a Zero-Shot Style-Aligned SLI (Spherical Linear Interpolation) Blending Latent Manipulation pipeline that overcomes the limitations of current multi-style blending methods. Conventional approaches rely on linear blending, assuming a flat latent space leading to suboptimal results when integrating multiple reference styles. In contrast, our framework leverages the non-linear geometry of the latent space by using SLI Blending to combine weighted style representations. By interpolating along the geodesic on the hypersphere, Z-SASLM preserves the intrinsic structure of the latent space, ensuring high-fidelity and coherent blending of diverse styles - all without the need for fine-tuning. We further propose a new metric, Weighted Multi-Style DINO ViT-B/8, designed to quantitatively evaluate the consistency of the blended styles. While our primary focus is on the theoretical and practical advantages of SLI Blending for style manipulation, we also demonstrate its effectiveness in a multi-modal content fusion setting through comprehensive experimental studies. Experimental results show that Z-SASLM achieves enhanced and robust style alignment. The implementation code can be found at: https://github.com/alessioborgi/Z-SASLM.
中文: Z-SASLM提出了一种基于球面线性插值的零样本风格混合方法,通过沿超球面测地线插值克服传统线性混合缺陷,无需微调即可实现高质量多风格融合,并设计了新的量化评估指标。
English: Z-SASLM introduces a zero-shot style blending pipeline using spherical linear interpolation to overcome linear blending limitations, achieving high-fidelity multi-style fusion without fine-tuning while proposing a new metric for quantitative evaluation.
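The core operation, spherical linear interpolation (slerp) between latent vectors, can be sketched as below. Weighted multi-style blending in Z-SASLM generalizes this geodesic interpolation on the hypersphere, for example by applying it iteratively with normalized weights; the latent dimension here is arbitrary.

```python
import torch

def slerp(z0, z1, t: float, eps=1e-7):
    """Spherical linear interpolation between two latent vectors.

    Interpolates along the geodesic on the hypersphere rather than a straight
    line, preserving the latent norm structure assumed by SLI blending."""
    z0n = z0 / z0.norm(dim=-1, keepdim=True)
    z1n = z1 / z1.norm(dim=-1, keepdim=True)
    omega = torch.acos((z0n * z1n).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * z0 + (torch.sin(t * omega) / so) * z1

a, b = torch.randn(1, 768), torch.randn(1, 768)
mid = slerp(a, b, 0.5)   # blend two style latents with equal weight
print(mid.shape)
```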

Authors:Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander
Title: Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection
Abstract:
The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundational models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code available at https://github.com/TRAILab/DINO_Teacher
中文: 作者提出DINO Teacher方法,利用大型冻结DINOv2骨干网络生成更准确的目标域标签,并将学生模型特征与DINO表征对齐,在多个域自适应目标检测数据集上实现了最先进的性能。
English: The authors propose DINO Teacher, a novel method for domain adaptive object detection that leverages a large frozen DINOv2 backbone to generate more accurate target domain labels and align student features with DINO representations, achieving state-of-the-art performance on multiple datasets.
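The feature-alignment component can be sketched as a simple loss that pulls projected student patch features toward frozen DINO patch features on source and target images alike; the cosine-distance form and projection below are illustrative assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dino_alignment_loss(student_feats, dino_feats, proj: nn.Module):
    """Pull student patch features toward frozen DINO patch features.

    student_feats: (B, N, Ds); dino_feats: (B, N, Dd) from a frozen DINOv2
    encoder on the same images (source or target domain). Cosine-distance
    form shown here for illustration."""
    s = F.normalize(proj(student_feats), dim=-1)
    d = F.normalize(dino_feats.detach(), dim=-1)
    return (1 - (s * d).sum(-1)).mean()

proj = nn.Linear(256, 768)
loss = dino_alignment_loss(torch.randn(2, 196, 256), torch.randn(2, 196, 768), proj)
print(loss.item())
```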

Authors:Vincent Gbouna Zakka, Zhuangzhuang Dai, Luis J. Manso
Title: Action Recognition in Real-World Ambient Assisted Living Environment
Abstract:
The growing ageing population and their preference to maintain independence by living in their own homes require proactive strategies to ensure safety and support. Ambient Assisted Living (AAL) technologies have emerged to facilitate ageing in place by offering continuous monitoring and assistance within the home. Within AAL technologies, action recognition plays a crucial role in interpreting human activities and detecting incidents like falls, mobility decline, or unusual behaviours that may signal worsening health conditions. However, action recognition in practical AAL applications presents challenges, including occlusions, noisy data, and the need for real-time performance. While advancements have been made in accuracy, robustness to noise, and computation efficiency, achieving a balance among them all remains a challenge. To address this challenge, this paper introduces the Robust and Efficient Temporal Convolution network (RE-TCN), which comprises three main elements: Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation techniques. These elements aim to enhance the model's accuracy, robustness against noise and occlusion, and computational efficiency within real-world AAL contexts. RE-TCN outperforms existing models in terms of accuracy, noise and occlusion robustness, and has been validated on four benchmark datasets: NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28. The code is publicly available at: https://github.com/Gbouna/RE-TCN
Chinese: 本文针对环境辅助生活(AAL)中行为识别的挑战,提出了鲁棒高效时序卷积网络(RE-TCN),通过自适应时序加权和深度可分离卷积等组件提升精度、抗噪性和计算效率,并在多个基准数据集上验证了其优越性能。
English: To address the challenges of action recognition in Ambient Assisted Living (AAL), this paper proposes the Robust and Efficient Temporal Convolution network (RE-TCN), which enhances accuracy, noise robustness, and computational efficiency through innovative components and has been validated on multiple benchmark datasets.

Authors:Pengyu Chen, Sicheng Wang, Cuizhen Wang, Senrong Wang, Beiao Huang, Lu Huang, Zhe Zang
Title: A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery
Abstract:
Precise detection of rooftops from historical aerial imagery is essential for analyzing long-term urban development and human settlement patterns. Nonetheless, black-and-white analog photographs present considerable challenges for modern object detection frameworks due to their limited spatial resolution, absence of color information, and archival degradation. To address these challenges, this research introduces a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): image colorization utilizing DeOldify, followed by super-resolution enhancement with Real-ESRGAN. The enhanced images were subsequently employed to train and evaluate rooftop detection models, including Faster R-CNN, DETReg, and YOLOv11n. The results demonstrate that the combination of colorization with super-resolution significantly enhances detection performance, with YOLOv11n achieving a mean Average Precision (mAP) exceeding 85\%. This signifies an enhancement of approximately 40\% over the original black-and-white images and 20\% over images enhanced solely through colorization. The proposed method effectively bridges the gap between archival imagery and contemporary deep learning techniques, facilitating more reliable extraction of building footprints from historical aerial photographs. Code and resources for reproducing our results are publicly available at \href{https://github.com/Pengyu-gis/Historical-Aerial-Photos}{github.com/Pengyu-gis/Historical-Aerial-Photos}.
中文: 本研究提出了一种基于生成对抗网络的两阶段图像增强方法,通过着色和超分辨率处理显著提升了历史航拍图像中屋顶检测的精度,使YOLOv11n模型的平均精度均值超过85%,有效弥合了档案影像与现代深度学习技术之间的差距。
English: This study develops a two-stage GAN-based enhancement method combining colorization and super-resolution to significantly improve rooftop detection in historical aerial images, achieving over 85% mAP with YOLOv11n and bridging archival imagery with modern deep learning techniques.

Authors:Shota Hirose, Kazuki Kotoyori, Kasidis Arunruangsirilert, Fangzheng Lin, Heming Sun, Jiro Katto
Title: Real-time Video Prediction With Fast Video Interpolation Model and Prediction Training
Abstract:
Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. As latency is principally inevitable, video prediction can be utilized to mitigate the latency and ultimately enable zero-latency transmission. However, most of the existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose real-time video prediction towards the zero-latency interaction over networks, called IFRVP (Intermediate Feature Refinement Video Prediction). Firstly, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Secondly, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among the existing video prediction methods. A demonstration movie is also provided at http://bit.ly/IFRVPDemo. The code will be released at https://github.com/FykAikawa/IFRVP.
中文摘要:本文提出IFRVP实时视频预测方法,通过改进帧插值技术和引入ELAN残差块,在预测精度与计算速度之间实现了最佳平衡,致力于实现网络零延迟交互。
English Summary: This paper introduces IFRVP, a real-time video prediction method that uses enhanced frame interpolation and ELAN-based residual blocks to achieve optimal balance between accuracy and computational efficiency for zero-latency network interactions.

Authors:Zhenyu Tang, Chaoran Feng, Xinhua Cheng, Wangbo Yu, Junwu Zhang, Yuan Liu, Xiaoxiao Long, Wenping Wang, Li Yuan
Title: NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations
Abstract:
3D Gaussian Splatting (3DGS) achieves impressive quality and rendering speed, but with millions of 3D Gaussians and significant storage and transmission costs. In this paper, we aim to develop a simple yet effective method called NeuralGS that compresses the original 3DGS into a compact representation. Our observation is that neural fields like NeRF can represent complex 3D scenes with Multi-Layer Perceptron (MLP) neural networks using only a few megabytes. Thus, NeuralGS effectively adopts the neural field representation to encode the attributes of 3D Gaussians with MLPs, only requiring a small storage size even for a large-scale scene. To achieve this, we adopt a clustering strategy and fit the Gaussians within each cluster using different tiny MLPs, based on importance scores of Gaussians as fitting weights. We experiment on multiple datasets, achieving a 91-times average model size reduction without harming the visual quality.
中文:NeuralGS通过MLP和聚类策略将3D高斯泼溅压缩为紧凑的神经表示,在保持视觉质量的同时实现了91倍的模型大小缩减。
English: NeuralGS compresses 3D Gaussian Splatting into a compact neural representation using MLPs and clustering, achieving 91-times size reduction while preserving visual quality.
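A minimal sketch of the per-cluster fitting step: one tiny MLP regresses each Gaussian's attributes from its position, with importance scores used as fitting weights. The attribute dimensionality, network size, and optimization schedule below are placeholders, not the configuration used in NeuralGS.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Maps a Gaussian's 3D position to its attributes (e.g. color, opacity,
    scale, rotation); one such MLP is fit per cluster. Sketch only."""
    def __init__(self, attr_dim=11, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, attr_dim))
    def forward(self, xyz):
        return self.net(xyz)

def fit_cluster(xyz, attrs, importance, steps=200, lr=1e-3):
    """Importance-weighted regression of Gaussian attributes for one cluster."""
    mlp = TinyMLP(attr_dim=attrs.shape[1])
    opt = torch.optim.Adam(mlp.parameters(), lr=lr)
    w = importance / importance.sum()
    for _ in range(steps):
        opt.zero_grad()
        loss = (w[:, None] * (mlp(xyz) - attrs) ** 2).sum()
        loss.backward()
        opt.step()
    return mlp

xyz = torch.rand(500, 3)        # positions of Gaussians in one cluster
attrs = torch.rand(500, 11)     # their attributes to be compressed into the MLP
imp = torch.rand(500)           # per-Gaussian importance scores
mlp = fit_cluster(xyz, attrs, imp, steps=10)
print(mlp(xyz).shape)
```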

Authors:Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Title: LSNet: See Large, Focus Small
Abstract:
Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a ``See Large, Focus Small'' strategy for lightweight vision network design. We introduce LS (\textbf{L}arge-\textbf{S}mall) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models are available at https://github.com/jameslahm/lsnet.
中文摘要:本文提出LSNet轻量视觉网络,采用"见大聚焦小"策略,通过LS卷积结合大核感知与小核聚合,在多种视觉任务中实现了优于现有轻量模型的性能与效率。
English Summary: The paper introduces LSNet, a lightweight vision network that employs a novel "See Large, Focus Small" strategy through LS convolution, combining large-kernel perception with small-kernel aggregation to achieve superior efficiency and performance across various vision tasks.
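The block below is only a rough PyTorch illustration of the "see large, focus small" split, pairing a large-kernel depthwise convolution (perception) with a small-kernel convolution (aggregation); the actual LS convolution in LSNet is more elaborate, so treat kernel sizes and layer choices here as assumptions.

```python
import torch
import torch.nn as nn

class LSConvSketch(nn.Module):
    """Illustrative 'see large, focus small' block: a large-kernel depthwise conv
    gathers wide-range context, and a small-kernel conv aggregates it locally."""

    def __init__(self, channels: int, large_k: int = 7, small_k: int = 3):
        super().__init__()
        self.perceive = nn.Conv2d(channels, channels, kernel_size=large_k,
                                  padding=large_k // 2, groups=channels)  # depthwise, large kernel
        self.aggregate = nn.Conv2d(channels, channels, kernel_size=small_k,
                                   padding=small_k // 2)                  # dense, small kernel
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.aggregate(self.act(self.perceive(x)))

if __name__ == "__main__":
    y = LSConvSketch(32)(torch.randn(2, 32, 56, 56))
    print(y.shape)  # torch.Size([2, 32, 56, 56])
```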

Authors:Alexander Vogel, Omar Moured, Yufan Chen, Jiaming Zhang, Rainer Stiefelhagen
Title: RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
Abstract:
Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, its application to visualizations such as charts remains under-explored due to the inherent complexity of interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer to elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning 5 state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%, reduces hallucinations, and improves model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced for community development and further advancements. All models and code will be publicly available at https://github.com/moured/RefChartQA.
Chinese: RefChartQA提出了一个结合图表问答与视觉定位的新基准,通过明确识别空间元素将模型准确率提升超过15%,解决了现有方法忽视图表视觉证据的不足。
English: RefChartQA introduces a novel benchmark combining chart question answering with visual grounding to enhance model accuracy by over 15% through explicit spatial element identification, addressing gaps in current methods that overlook visual evidence in charts.

Authors:Guohong Huang, Ling-An Zeng, Zexin Zheng, Shengbo Gu, Wei-Shi Zheng
Title: Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation
Abstract:
We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available at https://github.com/Huanggh531/EJIM.
Chinese: 我们提出了一种高效的显式关节级交互模型(EJIM),该方法能够以高计算效率对文本引导的人体-物体交互进行精细建模,在仅使用5%推理时间的情况下,大幅超越了现有方法的性能。
English: We introduce the Efficient Explicit Joint-level Interaction Model (EJIM), a novel method that effectively models text-guided human-object interactions at the joint level with high computational efficiency, significantly outperforming previous approaches while using only 5% of the inference time.

Authors:Xiaolu Liu, Ruizi Yang, Song Wang, Wentong Li, Junbo Chen, Jianke Zhu
Title: Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction
Abstract:
Reliable high-definition (HD) map construction is crucial for the driving safety of autonomous vehicles. Although recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which performs uncertainty-aware resampling over statistical distributions and employs explicit instance features to reduce excessive reliance on training data. Specifically, we introduce the perspective-view (PV) detection branch to obtain explicit structural features, in which the uncertainty-aware decoder is designed to dynamically sample probability distributions according to differences between scenes. With probabilistic embedding and selection, UI2DPrompt is proposed to construct PV-learnable prompts. These PV prompts are integrated into the map decoder by designed hybrid injection to compensate for neglected instance structures. To ensure real-time inference, a lightweight Mimic Query Distillation is designed to learn from PV prompts, which can serve as an efficient alternative to running the PV branch. Extensive experiments on challenging geographically disjoint (geo-based) data splits demonstrate that our UIGenMap achieves superior performance, with a +5.7 mAP improvement on the nuScenes dataset. Source code will be available at https://github.com/xiaolul2/UIGenMap.
Chinese: UIGenMap通过不确定性引导的结构注入方法,利用动态分布重采样和显式实例特征来提升高精地图的泛化能力,在nuScenes数据集上实现了mAP指标5.7的性能提升,有效保障自动驾驶的行车安全。
English: UIGenMap enhances autonomous vehicle safety by introducing an uncertainty-instructed structure injection method that improves generalization across diverse driving scenes through dynamic uncertainty resampling and explicit feature integration, achieving a +5.7 mAP boost on nuScenes.

Authors:Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, Junwei Han
Title: CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
Abstract:
Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH^2-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly-designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. Through extensive experiments, CityGS-X consistently outperforms existing methods in terms of faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4 * 4090 GPUs, a task on which alternative methods encounter Out-Of-Memory (OOM) issues and fail completely, indicating that CityGS-X operates well beyond the capacity of existing methods.
中文: CityGS-X通过创新的并行混合分层架构解决了3D高斯泼溅技术处理速度慢、计算成本高的核心问题,在大规模场景中实现了更快的训练速度、更高的几何精度和高效的多GPU渲染能力。
English: CityGS-X overcomes 3D Gaussian Splatting's limitations of slow processing and high computational costs through a novel parallelized hybrid hierarchical architecture, achieving faster training and superior geometric accuracy in large-scale scenes with efficient multi-GPU rendering.

Authors:Zijun Ding, Mingdie Xiong, Congcong Zhu, Jingrun Chen
Title: STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
Abstract:
Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that the semantic ambiguity between spatial and temporal domains significantly degrades the synthesis stability for the dynamic faces. We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize the mutual information at multiple scales, thereby reducing the manifold differences between spatial and temporal domains. The latter utilizes probabilistic heatmap as ambiguity-tolerant guidance to avoid the abnormal dynamics of the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at https://github.com/SCAILab-USTC/STSA.
中文: 提出的时空语义对齐(STSA)方法通过双路径对齐机制和概率热图来协调空间与时间语义特征,有效稳定了面部运动合成,显著提升了图像质量与合成稳定性。
English: The proposed Spatial-Temporal Semantic Alignment (STSA) method stabilizes facial motion synthesis by aligning spatial and temporal semantic features through dual-path alignment and probabilistic heatmaps, significantly improving image quality and stability.

Authors:Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
Title: From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Abstract:
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
中文: 针对现有LVLMs在空间感知上的不足,我们构建了新型2D空间数据生成流程,开发了SPAR-7M数据集和SPAR-Bench基准测试,使模型在空间推理任务中达到最优性能。
English: Recent LVLMs still face spatial perception challenges, so we developed a novel 2D spatial data pipeline to create the SPAR-7M dataset and SPAR-Bench benchmark, enabling models to achieve state-of-the-art performance on spatial reasoning tasks.

Authors:Peiyu Chen, Fuling Lin, Weipeng Guan, Peng Lu
Title: SuperEIO: Self-Supervised Event Feature Learning for Event Inertial Odometry
Abstract:
Event cameras asynchronously output low-latency event streams, promising for state estimation in high-speed motion and challenging lighting conditions. As opposed to frame-based cameras, the motion-dependent nature of event cameras presents persistent challenges in achieving robust event feature detection and matching. In recent years, learning-based approaches have demonstrated superior robustness over traditional handcrafted methods in feature detection and matching, particularly under aggressive motion and HDR scenarios. In this paper, we propose SuperEIO, a novel framework that leverages the learning-based event-only detection and IMU measurements to achieve event-inertial odometry. Our event-only feature detection employs a convolutional neural network under continuous event streams. Moreover, our system adopts the graph neural network to achieve event descriptor matching for loop closure. The proposed system utilizes TensorRT to accelerate the inference speed of deep networks, which ensures low-latency processing and robust real-time operation on resource-limited platforms. Besides, we evaluate our method extensively on multiple public datasets, demonstrating its superior accuracy and robustness compared to other state-of-the-art event-based methods. We have also open-sourced our pipeline to facilitate research in the field: https://github.com/arclab-hku/SuperEIO.
Chinese: 事件相机在高速运动和复杂光照条件下提供低延迟事件流,而SuperEIO框架通过基于学习的事件检测结合IMU数据,在资源受限平台上实现了实时、鲁棒的事件-惯性里程计。
English: Event cameras offer low-latency event streams ideal for state estimation in high-speed and challenging lighting conditions, and the proposed SuperEIO framework uses learning-based event-only detection with IMU data to achieve robust, real-time event-inertial odometry on resource-limited platforms.

Authors:Ke Zhang, Vishal M. Patel
Title: MedCL: Learning Consistent Anatomy Distribution for Scribble-supervised Medical Image Segmentation
Abstract:
Curating large-scale fully annotated datasets is expensive, laborious, and cumbersome, especially for medical images. Several methods have been proposed in the literature that make use of weak annotations in the form of scribbles. However, these approaches require large amounts of scribble annotations, and are only applied to the segmentation of regular organs, which are often unavailable for the disease species that fall in the long-tailed distribution. Motivated by the fact that the medical labels have anatomy distribution priors, we propose a scribble-supervised clustering-based framework, called MedCL, to learn the inherent anatomy distribution of medical labels. Our approach consists of two steps: i) Mix the features with intra- and inter-image mix operations, and ii) Perform feature clustering and regularize the anatomy distribution at both local and global levels. Combined with a small amount of weak supervision, the proposed MedCL is able to segment both regular organs and challenging irregular pathologies. We implement MedCL based on SAM and UNet backbones, and evaluate the performance on three open datasets of regular structure (MSCMRseg), multiple organs (BTCV) and irregular pathology (MyoPS). It is shown that even with less scribble supervision, MedCL substantially outperforms the conventional segmentation methods. Our code is available at https://github.com/BWGZK/MedCL.
中文摘要:MedCL框架通过特征混合与聚类技术,结合少量标注和医学先验知识,能够同时精准分割常规器官与不规则病灶,在有限监督下显著优于传统分割方法。
English Summary: The proposed MedCL framework leverages scribble annotations and anatomical priors through feature mixing and clustering to effectively segment both regular organs and irregular pathologies with minimal supervision, outperforming traditional methods.
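As one plausible reading of the intra- and inter-image mix operations, the sketch below mixes feature maps across images with a Beta-sampled weight and shuffles spatial patches within an image; MedCL's actual mix operations may differ, so this is an assumption-laden illustration only.

```python
import torch

def inter_image_mix(feat_a: torch.Tensor, feat_b: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Inter-image mix: convex combination of feature maps from two different images."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * feat_a + (1 - lam) * feat_b

def intra_image_mix(feat: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """Intra-image mix: shuffle spatial patches of one feature map so the clustering
    step sees anatomical parts in varied contexts (one possible instantiation)."""
    c, h, w = feat.shape
    gh, gw = h // patch, w // patch
    patches = feat.reshape(c, gh, patch, gw, patch).permute(1, 3, 0, 2, 4).reshape(gh * gw, c, patch, patch)
    patches = patches[torch.randperm(gh * gw)]
    return patches.reshape(gh, gw, c, patch, patch).permute(2, 0, 3, 1, 4).reshape(c, h, w)

if __name__ == "__main__":
    f1, f2 = torch.randn(32, 64, 64), torch.randn(32, 64, 64)
    print(inter_image_mix(f1, f2).shape, intra_image_mix(f1).shape)
```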

Authors:Lauren Shrack, Timm Haucke, Antoine Salaün, Arjun Subramonian, Sara Beery
Title: Pairwise Matching of Intermediate Representations for Fine-grained Explainability
Abstract:
The differences between images belonging to fine-grained categories are often subtle and highly localized, and existing explainability techniques for deep learning models are often too diffuse to provide useful and interpretable explanations. We propose a new explainability method (PAIR-X) that leverages both intermediate model activations and backpropagated relevance scores to generate fine-grained, highly-localized pairwise visual explanations. We use animal and building re-identification (re-ID) as a primary case study of our method, and we demonstrate qualitatively improved results over a diverse set of explainability baselines on 35 public re-ID datasets. In interviews, animal re-ID experts found PAIR-X to be a meaningful improvement over existing baselines for deep model explainability, and suggested that its visualizations would be directly applicable to their work. We also propose a novel quantitative evaluation metric for our method, and demonstrate that PAIR-X visualizations appear more plausible for correct image matches than incorrect ones even when the model similarity score for the pairs is the same. By improving interpretability, PAIR-X enables humans to better distinguish correct and incorrect matches. Our code is available at: https://github.com/pairx-explains/pairx
中文:提出的PAIR-X方法通过结合模型激活和相关性评分生成细粒度、局部化的视觉解释,在动物和建筑物重识别任务中展现出优于现有技术的可解释性。
English: The proposed PAIR-X method generates fine-grained, localized visual explanations by combining model activations and relevance scores, demonstrating superior interpretability over existing techniques in animal and building re-identification tasks.

Authors:Nina Weng, Aasa Feragen, Siavash Bigdeli
Title: Patronus: Bringing Transparency to Diffusion Models with Prototypes
Abstract:
Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce Patronus, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning of the generation process on their prototype activation vector. This design enhances interpretability by showing the learned prototypes and how they influence the generation process. Additionally, the model supports downstream tasks like image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus could reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without any annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at https://github.com/nina-weng/patronus.
中文: Patronus是一种可解释的扩散模型,它将原型网络集成到DDPM中,通过展示学习到的原型及其对生成过程的影响来提高透明度,同时支持图像操控和检测捷径学习,且无需任何标注。
English: Patronus is an interpretable diffusion model that integrates a prototypical network into DDPMs to enhance transparency by revealing learned prototypes and their influence on image generation, while enabling manipulation and detecting shortcut learning without requiring annotations.

Authors:Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, Jian Zhang
Title: Q-Insight: Understanding Image Quality via Visual Reinforcement Learning
Abstract:
Image quality assessment (IQA) focuses on the perceptual visual quality of images, playing a crucial role in downstream tasks such as image reconstruction, compression, and generation. The rapid advancement of multi-modal large language models (MLLMs) has significantly broadened the scope of IQA, moving toward comprehensive image quality understanding that incorporates content analysis, degradation perception, and comparison reasoning beyond mere numerical scoring. Previous MLLM-based methods typically either generate numerical scores lacking interpretability or heavily rely on supervised fine-tuning (SFT) using large-scale annotated datasets to provide descriptive assessments, limiting their flexibility and applicability. In this paper, we propose Q-Insight, a reinforcement learning-based model built upon group relative policy optimization (GRPO), which demonstrates strong visual reasoning capability for image quality understanding while requiring only a limited amount of rating scores and degradation labels. By jointly optimizing score regression and degradation perception tasks with carefully designed reward functions, our approach effectively exploits their mutual benefits for enhanced performance. Extensive experiments demonstrate that Q-Insight substantially outperforms existing state-of-the-art methods in both score regression and degradation perception tasks, while exhibiting impressive zero-shot generalization to comparison reasoning tasks. Code will be available at https://github.com/lwq20020127/Q-Insight.
Chinese: 本文提出Q-Insight模型,采用基于群体相对策略优化的强化学习方法,通过少量标注数据联合优化评分回归与退化感知任务,在图像质量评估中显著超越现有方法并展现出优秀的零样本推理能力。
English: This paper introduces Q-Insight, a reinforcement learning model using group relative policy optimization that advances image quality assessment by integrating score regression and degradation perception with minimal supervision, outperforming existing methods and showing strong zero-shot generalization.
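The snippet below sketches what GRPO-style rewards for the two tasks could look like: a tolerance-based reward for score regression, an exact-match reward for degradation perception, and group-relative advantage normalization. The reward shapes, tolerance, and weights are guesses, not the paper's actual formulas.

```python
import statistics

def score_reward(predicted_mos: float, gt_mos: float, tol: float = 0.5) -> float:
    """Score-regression reward: 1 at the ground-truth mean opinion score, decaying
    linearly to 0 outside the tolerance (an assumed shape)."""
    return max(0.0, 1.0 - abs(predicted_mos - gt_mos) / tol)

def degradation_reward(predicted_type: str, gt_type: str) -> float:
    """Degradation-perception reward: exact-match accuracy on the predicted degradation."""
    return 1.0 if predicted_type.strip().lower() == gt_type.strip().lower() else 0.0

def total_reward(pred_mos, gt_mos, pred_deg, gt_deg, w_score=0.5, w_deg=0.5) -> float:
    return w_score * score_reward(pred_mos, gt_mos) + w_deg * degradation_reward(pred_deg, gt_deg)

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward within its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

if __name__ == "__main__":
    r = total_reward(pred_mos=3.2, gt_mos=3.0, pred_deg="jpeg compression", gt_deg="JPEG Compression")
    print(r)  # 0.5 * 0.6 + 0.5 * 1.0 = 0.8
    print(group_relative_advantages([0.8, 0.3, 0.8, 0.1]))
```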

Authors:Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang
Title: Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Abstract:
Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training. Code is available at https://github.com/Yu-xm/Unicorn.git.
中文: 该研究提出三阶段框架,仅通过文本生成高质量多模态训练数据,无需真实图像即可构建用于预训练的Unicorn-1.2M和指令调优的Unicorn-471K-Instruction数据集,为视觉语言模型训练提供了经济高效的解决方案。
English: The proposed three-stage framework synthesizes high-quality multimodal training data from text alone, generating Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning without real images, offering a cost-effective solution for vision-language model training.

Authors:Shuai Shen, Wanhua Li, Yunpeng Zhang, Yap-Peng Tan, Jiwen Lu
Title: Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis
Abstract:
Talking head synthesis has emerged as a prominent research topic in computer graphics and multimedia, yet most existing methods often struggle to strike a balance between generation quality and computational efficiency, particularly under real-time constraints. In this paper, we propose a novel framework that integrates Gaussian Splatting with a structured Audio Factorization Plane (Audio-Plane) to enable high-quality, audio-synchronized, and real-time talking head generation. For modeling a dynamic talking head, a 4D volume representation, which consists of three axes in 3D space and one temporal axis aligned with audio progression, is typically required. However, directly storing and processing a dense 4D grid is impractical due to the high memory and computation cost, and lack of scalability for longer durations. We address this challenge by decomposing the 4D volume representation into a set of audio-independent spatial planes and audio-dependent planes, forming a compact and interpretable representation for talking head modeling that we refer to as the Audio-Plane. This factorized design allows for efficient and fine-grained audio-aware spatial encoding, and significantly enhances the model's ability to capture complex lip dynamics driven by speech signals. To further improve region-specific motion modeling, we introduce an audio-guided saliency splatting mechanism based on region-aware modulation, which adaptively emphasizes highly dynamic regions such as the mouth area. This allows the model to focus its learning capacity on where it matters most for accurate speech-driven animation. Extensive experiments on both the self-driven and the cross-driven settings demonstrate that our method achieves state-of-the-art visual quality, precise audio-lip synchronization, and real-time performance, outperforming prior approaches across both 2D- and 3D-based paradigms.
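A rough PyTorch sketch of such a factorized representation is shown below: three audio-independent spatial planes plus three audio-dependent planes, each bilinearly sampled and summed into a per-point feature. Plane resolution, channel count, and the way planes are combined are assumptions, not the paper's exact Audio-Plane design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_plane(plane: torch.Tensor, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample a (C, R, R) feature plane at normalized coords u, v in [-1, 1]."""
    grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)               # (1, P, 1, 2)
    feats = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)  # (1, C, P, 1)
    return feats.reshape(plane.shape[0], -1).t()                        # (P, C)

class AudioPlaneSketch(nn.Module):
    """Illustrative factorization: audio-independent spatial planes (xy, xz, yz) plus
    audio-dependent planes indexed by one spatial axis and an audio coordinate."""

    def __init__(self, channels: int = 16, res: int = 64):
        super().__init__()
        self.spatial = nn.ParameterList(
            [nn.Parameter(torch.randn(channels, res, res) * 0.1) for _ in range(3)])
        self.audio = nn.ParameterList(
            [nn.Parameter(torch.randn(channels, res, res) * 0.1) for _ in range(3)])

    def forward(self, xyz: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # xyz: (P, 3) in [-1, 1]; a: (P,) audio coordinate in [-1, 1].
        x, y, z = xyz.unbind(-1)
        return (sample_plane(self.spatial[0], x, y) +
                sample_plane(self.spatial[1], x, z) +
                sample_plane(self.spatial[2], y, z) +
                sample_plane(self.audio[0], x, a) +
                sample_plane(self.audio[1], y, a) +
                sample_plane(self.audio[2], z, a))   # (P, C) features for attribute decoding

if __name__ == "__main__":
    model = AudioPlaneSketch()
    out = model(torch.rand(100, 3) * 2 - 1, torch.rand(100) * 2 - 1)
    print(out.shape)  # torch.Size([100, 16])
```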

Authors:Zhendi Gong, Susan Francis, Eleanor Cox, Stamatios N. Sotiropoulos, Dorothee P. Auer, Guoping Qiu, Andrew P. French, Xin Chen
Title: MO-CTranS: A unified multi-organ segmentation model learning from multiple heterogeneously labelled datasets
Abstract:
Multi-organ segmentation holds paramount significance in many clinical tasks. In practice, compared to large fully annotated datasets, multiple small datasets are often more accessible and organs are not labelled consistently. Normally, an individual model is trained for each of these datasets, which is not an effective way of using data for model learning. It remains challenging to train a single model that can robustly learn from several partially labelled datasets due to label conflict and data imbalance problems. We propose MO-CTranS: a single model that can overcome such problems. MO-CTranS contains a CNN-based encoder and a Transformer-based decoder, which are connected in a multi-resolution manner. Task-specific tokens are introduced in the decoder to help differentiate label discrepancies. Our method was evaluated and compared to several baseline models and state-of-the-art (SOTA) solutions on abdominal MRI datasets that were acquired in different views (i.e. axial and coronal) and annotated for different organs (i.e. liver, kidney, spleen). Our method achieved better performance (most were statistically significant) than the compared methods. Github link: https://github.com/naisops/MO-CTranS.
中文: 多器官分割在临床任务中至关重要,但利用多个部分标注数据集训练单一模型常面临标签冲突和数据不平衡的挑战,而提出的MO-CTranS模型通过结合CNN与Transformer架构并引入任务特定标记,在腹部MRI数据上相比基线方法和现有最优方案取得了更优性能。
English: Multi-organ segmentation is crucial in clinical practice, yet training a single model on multiple partially labeled datasets faces challenges like label conflicts and data imbalance, which the proposed MO-CTranS model overcomes by integrating CNN and Transformer architectures with task-specific tokens, achieving superior performance on abdominal MRI datasets compared to baseline and SOTA methods.

Authors:David Fischinger, Martin Boyer
Title: DF2023: The Digital Forensics 2023 Dataset for Image Forgery Detection
Abstract:
The deliberate manipulation of public opinion, especially through altered images, which are frequently disseminated through online social networks, poses a significant danger to society. To fight this issue on a technical level we support the research community by releasing the Digital Forensics 2023 (DF2023) training and validation dataset, comprising one million images from four major forgery categories: splicing, copy-move, enhancement and removal. This dataset enables an objective comparison of network architectures and can significantly reduce the time and effort of researchers preparing datasets.

Authors:Wei-Jin Huang, Yuan-Ming Li, Zhi-Wei Xia, Yu-Ming Tang, Kun-Yu Lin, Jian-Fang Hu, Wei-Shi Zheng
Title: Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
Abstract:
Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions. However, these approaches typically overlook the common scenario where multiple, distinct actions are valid following a given sequence of executed actions. This leads to two issues: (1) the model cannot effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training; and (2) the model may also use the wrong prototypes to detect errors if the ongoing action label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection. The code is available at https://github.com/iSEE-Laboratory/AMNAR.
Chinese: 提出的自适应多正常动作表示(AMNAR)框架通过动态预测所有有效后续动作并与当前动作进行比较,解决了现有错误检测方法的局限性,大量实验证明其实现了最先进的性能。
English: The proposed Adaptive Multiple Normal Action Representation (AMNAR) framework addresses limitations in existing error detection methods by dynamically predicting all valid next actions and comparing them with ongoing actions, achieving state-of-the-art performance through extensive experiments.
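Conceptually, the detection step can be reduced to a nearest-representation test, sketched below: an error is flagged when the ongoing action's embedding is far from every reconstructed normal representation of the valid next actions. The distance metric and threshold here are placeholders, not AMNAR's learned components.

```python
import torch

def detect_error(ongoing: torch.Tensor, normal_reps: torch.Tensor, threshold: float) -> bool:
    """Flag an execution error if the ongoing action's embedding is far from every
    reconstructed 'normal' representation of the valid next actions.

    ongoing:     (D,) embedding of the action currently being executed
    normal_reps: (K, D) one reconstructed representation per valid next action
    """
    dists = torch.cdist(ongoing.unsqueeze(0), normal_reps).squeeze(0)  # (K,)
    return bool(dists.min() > threshold)

if __name__ == "__main__":
    torch.manual_seed(0)
    valid_next = torch.randn(5, 128)                     # 5 valid next actions
    ok_action = valid_next[2] + 0.05 * torch.randn(128)  # close to one valid action
    bad_action = torch.randn(128) * 3                    # far from all of them
    print(detect_error(ok_action, valid_next, threshold=2.0))   # expected: False
    print(detect_error(bad_action, valid_next, threshold=2.0))  # expected: True
```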

Authors:Ada Gorgun, Bernt Schiele, Jonas Fischer
Title: VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow
Abstract:
Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization (FV) is a powerful tool to decode what information neurons are responding to and hence to better understand the reasoning behind such networks. In particular, in FV we generate human-understandable images that reflect the information detected by neurons of interest. However, current methods often yield unrecognizable visualizations, exhibiting repetitive patterns and visual artifacts that are hard to understand for a human. To address these problems, we propose to guide FV through statistics of real image features combined with measures of relevant network flow to generate prototypical images. Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs across various architectures. As such, it can be used to decode which information the network uses, complementing mechanistic circuits that identify where it is encoded. Code is available at: https://github.com/adagorgun/VITAL
Chinese: 神经网络在复杂任务中至关重要,但其推理过程难以解释;我们的方法通过利用真实图像统计和网络流来改进特征可视化,生成更清晰、更易理解的图像,性能优于现有技术。
English: Neural networks are crucial for complex tasks, but their reasoning is hard to interpret; our method improves feature visualization by using real image statistics and network flow to generate clearer, more understandable images that outperform current techniques.

Authors:David Fischinger, Martin Boyer
Title: DF-Net: The Digital Forensics Network for Image Forgery Detection
Abstract:
The orchestrated manipulation of public opinion, particularly through manipulated images, often spread via online social networks (OSN), has become a serious threat to society. In this paper, we introduce the Digital Forensics Net (DF-Net), a deep neural network for pixel-wise image forgery detection. The released model outperforms several state-of-the-art methods on four established benchmark datasets. Most notably, DF-Net's detection is robust against lossy image operations (e.g., resizing, compression) as they are automatically performed by social networks.

Authors:Jiahao Xia, Min Xu, Wenjian Huang, Jianguo Zhang, Haimin Zhang, Chunxia Xiao
Title: Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment
Abstract:
Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transfer to a novel dataset and significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried out on seven benchmarks and the results demonstrate an impressive improvement in face alignment brought by mitigating knowledge discrepancies. The code is available at https://github.com/Jiahao-UTS/TUFA.
中文摘要:本文提出了一种任务无关的统一人脸对齐(TUFA)框架,通过将不同数据集的平均人脸形状在可解释平面上对齐来消除知识差异,从而提升模型的泛化能力、小样本学习效果,并具备零样本定位未训练关键点的能力。
English Summary: This paper introduces a Task-agnostic Unified Face Alignment (TUFA) framework that mitigates knowledge discrepancies across datasets by aligning mean face shapes on an interpretable plane, enabling improved generalization, few-shot learning, and zero-shot capability for unseen landmarks.

Authors:Byeongjun Kwon, Munchurl Kim
Title: One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation on High-Resolution Images
Abstract:
Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased estimation accuracy on depth with tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches, resulting in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the real-world sparse ground truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluations on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrate that our PRO can be seamlessly integrated into existing depth estimation models.
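The overlap-consistency idea can be illustrated with the toy loss below, which penalizes disagreement between the shared strip of two adjacent depth patches within a single backward pass; PRO jointly processes four overlapping patches, so this pairwise version is a simplification and the overlap width is an assumption.

```python
import torch
import torch.nn.functional as F

def overlap_consistency_loss(depth_a: torch.Tensor, depth_b: torch.Tensor,
                             overlap: int) -> torch.Tensor:
    """Consistency loss between two horizontally adjacent depth patches that share an
    `overlap`-pixel-wide strip: the right strip of patch A should match the left strip of patch B."""
    strip_a = depth_a[..., -overlap:]   # right edge of the left patch
    strip_b = depth_b[..., :overlap]    # left edge of the right patch
    return F.l1_loss(strip_a, strip_b)

if __name__ == "__main__":
    a = torch.rand(1, 1, 256, 256, requires_grad=True)
    b = torch.rand(1, 1, 256, 256, requires_grad=True)
    loss = overlap_consistency_loss(a, b, overlap=64)
    loss.backward()   # gradients flow to both patches in one backpropagation step
    print(loss.item())
```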

Authors:Huiang He, Minghui Hu, Chuanxia Zheng, Chaoyue Wang, Tat-Jen Cham
Title: Semantix: An Energy Guided Sampler for Semantic Style Transfer
Abstract:
Recent advances in style and appearance transfer are impressive, but most methods isolate global style and local appearance transfer, neglecting semantic correspondence. Additionally, image and video tasks are typically handled in isolation, with little focus on integrating them for video transfer. To address these limitations, we introduce a novel task, Semantic Style Transfer, which involves transferring style and appearance features from a reference image to a target visual content based on semantic correspondence. We subsequently propose a training-free method, Semantix, an energy-guided sampler designed for Semantic Style Transfer that simultaneously guides both style and appearance transfer based on the semantic understanding capacity of pre-trained diffusion models. Additionally, as a sampler, Semantix can be seamlessly applied to both image and video models, enabling semantic style transfer to be generic across various visual media. Specifically, after inverting both reference and context images or videos into noise space via SDEs, Semantix utilizes a meticulously crafted energy function to guide the sampling process, including three key components: Style Feature Guidance, Spatial Feature Guidance and Semantic Distance as a regularisation term. Experimental results demonstrate that Semantix not only effectively accomplishes the task of semantic style transfer across images and videos, but also surpasses existing state-of-the-art solutions in both fields. The project website is available at https://huiang-he.github.io/semantix/

Authors:Yancong Lin, Shiming Wang, Liangliang Nan, Julian Kooij, Holger Caesar
Title: VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow
Abstract:
Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently of others, especially for nearby points belonging to the same object, which often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, which is often addressed by post-processing or appending extra regularization. While these approaches are able to improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in neural network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn representative features for voting per pillar. We plug the Voting Module into popular model designs and evaluate its benefit on Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code is available at https://github.com/tudelft-iv/VoteFlow.
中文摘要: 本文提出一种轻量级投票模块,通过离散化投票空间和可微分投票机制在场景流估计中强化局部刚性约束,以微小计算代价在自动驾驶数据集上实现性能提升。
English Summary: This paper introduces a lightweight Voting Module that enforces local rigidity in scene flow estimation through a discretized voting space and differentiable voting, improving performance on autonomous driving datasets with minimal computational overhead.
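A minimal sketch of differentiable voting is given below: each pillar scores every candidate translation in a discretized (dx, dy) space, and a softmax turns the scores into an expected translation. The candidate grid and score source are placeholders; VoteFlow's actual voting module differs in detail.

```python
import torch
import torch.nn.functional as F

def soft_vote_translation(scores: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    """Differentiable voting: turn per-pillar scores over a discretized translation set
    into an expected translation.

    scores:     (P, K) unnormalized vote for each of K candidate translations per pillar
    candidates: (K, 2) the discretized (dx, dy) translation space shared by all pillars
    """
    weights = F.softmax(scores, dim=-1)   # (P, K) soft votes
    return weights @ candidates           # (P, 2) expected translation per pillar

if __name__ == "__main__":
    dx = torch.linspace(-2.0, 2.0, 9)
    candidates = torch.stack(torch.meshgrid(dx, dx, indexing="ij"), dim=-1).reshape(-1, 2)  # (81, 2)
    scores = torch.randn(16, candidates.shape[0], requires_grad=True)  # e.g. from pillar features
    flow = soft_vote_translation(scores, candidates)
    flow.sum().backward()   # voting stays differentiable end-to-end
    print(flow.shape)       # torch.Size([16, 2])
```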

Authors:Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
Title: Segment Any Motion in Videos
Abstract:
Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
Chinese: 该模型结合长程轨迹运动线索与基于DINO的语义特征,并利用SAM2通过迭代提示策略实现像素级掩码优化,在多种复杂场景下的运动物体分割任务中取得了领先性能。
English: The proposed model integrates long-range trajectory motion cues with DINO-based semantic features and utilizes SAM2 for iterative mask refinement, achieving state-of-the-art performance in moving object segmentation across diverse challenging scenarios.

Authors:Dongping Liao, Xitong Gao, Yabo Xu, Chengzhong Xu
Title: FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning
Abstract:
The increasing emphasis on privacy and data security has driven the adoption of federated learning, a decentralized approach to train machine learning models without sharing raw data. Prompt learning, which fine-tunes prompt embeddings of pretrained models, offers significant advantages in federated settings by reducing computational costs and communication overheads while leveraging the strong performance and generalization capabilities of vision-language models such as CLIP. This paper addresses the intersection of federated learning and prompt learning, particularly for vision-language models. In this work, we introduce a comprehensive framework, named FLIP, to evaluate federated prompt learning algorithms. FLIP assesses the performance of 8 state-of-the-art federated prompt learning methods across 4 federated learning protocols and 12 open datasets, considering 6 distinct evaluation scenarios. Our findings demonstrate that prompt learning maintains strong generalization performance in both in-distribution and out-of-distribution settings with minimal resource consumption. This work highlights the effectiveness of federated prompt learning in environments characterized by data scarcity, unseen classes, and cross-domain distributional shifts. We open-source the code for all implemented algorithms in FLIP to facilitate further research in this domain.
中文: 联邦提示学习通过FLIP框架展示了其在保护隐私的同时,利用预训练视觉语言模型以最少资源消耗实现高效训练,并在多种数据场景下保持强大的泛化能力。
English: Federated prompt learning, exemplified by the FLIP framework, enables efficient and privacy-preserving model training by leveraging pre-trained vision-language models with minimal resource usage while maintaining robust generalization across diverse data scenarios.

Authors:Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, Xiaoguang Han
Title: Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
Abstract:
With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples the low-high frequency image pattern with noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance 3D geometry generation fidelity; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness and superiority of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.

Authors:Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Title: Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance
Abstract:
Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often face challenges with temporal consistency, particularly in the talking head domain, where continuous changes in facial expressions intensify the level of difficulty. These issues stem from the independent editing of individual images and the inherent loss of temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Secondly, to maintain fine-grained expression temporal consistency in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing, relighting, and various other applications.
中文: 本文提出的Follow Your Motion (FYM)框架通过从首帧学习运动轨迹并采用动态加权注意力机制,有效解决了人像编辑中的时间一致性问题,尤其在保持精细面部表情连贯性方面表现优异。
English: The paper introduces Follow Your Motion (FYM), a framework that enhances temporal consistency in portrait editing by learning motion trajectories and employing a dynamic attention mechanism to maintain refined facial expressions across frames.

Authors:Wenjie Liu, Zhongliang Liu, Xiaoyan Yang, Man Sha, Yang Li
Title: ABC-GS: Alignment-Based Controllable Style Transfer for 3D Gaussian Splatting
Abstract:
3D scene stylization approaches based on Neural Radiance Fields (NeRF) achieve promising results by optimizing with Nearest Neighbor Feature Matching (NNFM) loss. However, NNFM loss does not consider global style information. In addition, the implicit representation of NeRF limits their fine-grained control over the resulting scenes. In this paper, we introduce ABC-GS, a novel framework based on 3D Gaussian Splatting to achieve high-quality 3D style transfer. To this end, a controllable matching stage is designed to achieve precise alignment between scene content and style features through segmentation masks. Moreover, a style transfer loss function based on feature alignment is proposed to ensure that the outcomes of style transfer accurately reflect the global style of the reference image. Furthermore, the original geometric information of the scene is preserved with the depth loss and Gaussian regularization terms. Extensive experiments show that our ABC-GS provides controllability of style transfer and achieves stylization results that are more faithfully aligned with the global style of the chosen artistic reference. Our homepage is available at https://vpx-ecnu.github.io/ABC-GS-website.

Authors:Yiren Lu, Yunlai Zhou, Yiran Qiao, Chaoda Song, Tuo Liang, Jing Ma, Yu Yin
Title: Segment then Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting
Abstract:
Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality. However, most existing methods rely on 2D pixel-level parsing, leading to multi-view inconsistencies and poor 3D object retrieval. Moreover, they are limited to static scenes and struggle with dynamic scenes due to the complexities of motion modeling. In this paper, we propose Segment then Splat, a 3D-aware open vocabulary segmentation approach for both static and dynamic scenes based on Gaussian Splatting. Segment then Splat reverses the long-established approach of "segmentation after reconstruction" by dividing Gaussians into distinct object sets before reconstruction. Once the reconstruction is complete, the scene is naturally segmented into individual objects, achieving true 3D segmentation. This approach not only eliminates Gaussian-object misalignment issues in dynamic scenes but also accelerates the optimization process, as it eliminates the need for learning a separate language field. After optimization, a CLIP embedding is assigned to each object to enable open-vocabulary querying. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method in both static and dynamic scenarios.
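The querying step can be illustrated with the small snippet below, which ranks per-object embeddings against a text embedding by cosine similarity; random vectors stand in for real CLIP features, so the embedding dimension and encoder are assumptions.

```python
import torch
import torch.nn.functional as F

def query_objects(object_embeddings: torch.Tensor, text_embedding: torch.Tensor,
                  top_k: int = 1) -> torch.Tensor:
    """Open-vocabulary querying: rank reconstructed objects by cosine similarity between
    their assigned CLIP embeddings and the CLIP embedding of the text query."""
    sims = F.cosine_similarity(object_embeddings, text_embedding.unsqueeze(0), dim=-1)
    return sims.topk(top_k).indices

if __name__ == "__main__":
    # Stand-ins for real CLIP features (512-d); in practice these come from a CLIP encoder.
    object_embeddings = F.normalize(torch.randn(10, 512), dim=-1)   # one per segmented object
    text_embedding = F.normalize(torch.randn(512), dim=-1)          # e.g. encoding of "a red chair"
    print(query_objects(object_embeddings, text_embedding, top_k=3))
```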

Authors:Jaewoo Jeong, Seohee Lee, Daehee Park, Giwon Lee, Kuk-Jin Yoon
Title: Multi-modal Knowledge Distillation-based Human Trajectory Forecasting
Abstract:
Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modality is distilled from a teacher model trained with full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as a sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving up to ~13%. The code is available at https://github.com/Jaewoo97/KDTF.
Chinese: 该研究提出了一种多模态知识蒸馏框架,通过将融合轨迹、姿态和文本的教师模型知识提炼到仅使用轨迹或姿态的学生模型中,在多个数据集上实现了行人轨迹预测性能高达13%的提升。
English: The study introduces a multi-modal knowledge distillation framework where a student model, limited to trajectory or human pose data, is trained by a teacher model that integrates trajectory, pose, and text, achieving up to 13% improvement in pedestrian trajectory forecasting across various datasets.
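One common way to realize such distillation is a weighted sum of feature-matching, soft-target, and ground-truth losses, sketched below; the loss terms, weights, and tensor shapes here are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, student_pred, teacher_pred, gt_future,
                      w_feat=1.0, w_pred=1.0, w_task=1.0):
    """Illustrative multi-modal KD objective: match the limited-modality student to the
    full-modality teacher in feature space and output space, plus the usual task loss."""
    l_feat = F.mse_loss(student_feat, teacher_feat.detach())   # intra-agent knowledge transfer
    l_pred = F.mse_loss(student_pred, teacher_pred.detach())   # soft-target trajectories
    l_task = F.mse_loss(student_pred, gt_future)               # ground-truth supervision
    return w_feat * l_feat + w_pred * l_pred + w_task * l_task

if __name__ == "__main__":
    B, T, D = 8, 12, 2   # batch, prediction horizon, (x, y)
    loss = distillation_loss(
        student_feat=torch.randn(B, 256, requires_grad=True),
        teacher_feat=torch.randn(B, 256),
        student_pred=torch.randn(B, T, D, requires_grad=True),
        teacher_pred=torch.randn(B, T, D),
        gt_future=torch.randn(B, T, D),
    )
    loss.backward()
    print(loss.item())
```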

Authors:Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong
Title: AdaRank: Adaptive Rank Pruning for Enhanced Model Merging
Abstract:
Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on such manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks during test-time via entropy minimization. Our analysis demonstrates that such a method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and numbers of tasks, reducing the performance gap with individually fine-tuned models to nearly 1%.
中文摘要:AdaRank是一种自适应模型融合框架,通过熵最小化动态选择任务向量的最优奇异分量,有效减少任务间干扰,以接近最优性能的表现将性能差距缩小至约1%。
English Summary: AdaRank is an adaptive model merging framework that dynamically selects optimal singular components from task vectors through entropy minimization, effectively reducing cross-task interference and achieving near state-of-the-art performance with minimal performance gap.
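A compact sketch of the underlying mechanics is shown below: a task vector is decomposed by SVD, a sigmoid gate softly keeps or drops each singular component, and the gates are optimized at test time by minimizing prediction entropy on unlabeled data. The gating form and toy forward pass are assumptions, not AdaRank's exact procedure.

```python
import torch
import torch.nn.functional as F

def masked_task_vector(task_vector: torch.Tensor, gate_logits: torch.Tensor) -> torch.Tensor:
    """Reconstruct a task vector (fine-tuned weight minus pretrained weight) keeping only
    the singular directions selected by a learnable soft gate."""
    U, S, Vh = torch.linalg.svd(task_vector, full_matrices=False)
    gate = torch.sigmoid(gate_logits)           # soft keep/drop per singular component
    return U @ torch.diag(S * gate) @ Vh

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Test-time objective: mean entropy of the merged model's predictions on unlabeled data."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(dim=-1).mean()

if __name__ == "__main__":
    pretrained = torch.randn(64, 64)
    finetuned = pretrained + 0.1 * torch.randn(64, 64)
    gate_logits = torch.zeros(64, requires_grad=True)          # one gate per singular value
    merged_weight = pretrained + masked_task_vector(finetuned - pretrained, gate_logits)
    logits = torch.randn(32, 64) @ merged_weight               # toy forward pass
    loss = prediction_entropy(logits)
    loss.backward()                                            # gates updated by entropy minimization
    print(loss.item())
```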

Authors:Ximing Wen, Mallika Mainali, Anik Sen
Title: How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark
Abstract:
Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; however, their ability to perform Theory of Mind (ToM) tasks, such as inferring human intentions, beliefs, and mental states, remains underexplored. We propose an open-ended question framework to evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset of 30 images and evaluated the performance of four VLMs of varying sizes. Our results show that the GPT-4 model outperformed all the others, with only one smaller model, GPT-4o-mini, achieving comparable performance. We observed that VLMs often struggle to infer intentions in complex scenarios such as bullying or cheating. Our findings reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues. The dataset is available at https://github.com/ximingwen/ToM-AAAI25-Multimodal.
Chinese: 视觉语言模型在视觉问答中表现出色,但在心智理论任务中仍有不足,GPT-4表现最佳,而小型模型有时能通过错误视觉线索推断正确意图。
English: Vision Language Models show promising reasoning in Visual Question Answering but face challenges in Theory of Mind tasks, with GPT-4 leading performance while smaller models sometimes succeed despite incorrect visual cues.

Authors:Ziyue Huang, Hongxi Yan, Qiqi Zhan, Shuai Yang, Mingming Zhang, Chenkai Zhang, YiMing Lei, Zeming Liu, Qingjie Liu, Yunhong Wang
Title: A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Abstract:
The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity in data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require significant resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state-of-the-art in vision and multimodal foundation models for remote sensing, focusing on their architecture, training methods, datasets and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications. The list of resources collected by the paper can be found in the https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.
中文:遥感基础模型的快速发展通过融合多模态数据提升了地理空间数据分析能力,但仍面临数据多样性和计算需求等挑战,本文综述了这些模型的现状并指出了未来研究方向。
English: The rapid development of remote sensing foundation models enhances geospatial data analysis by integrating multiple data modalities, yet faces challenges like data diversity and computational demands, which this review addresses while outlining future research directions.

Authors:Ukcheol Shin, Jinsun Park
Title: Deep Depth Estimation from Thermal Image: Dataset, Benchmark, and Challenges
Abstract:
Achieving robust and accurate spatial perception under adverse weather and lighting conditions is crucial for the high-level autonomy of self-driving vehicles and robots. However, existing perception algorithms relying on the visible spectrum are highly affected by weather and lighting conditions. A long-wave infrared camera (i.e., thermal imaging camera) can be a potential solution to achieve high-level robustness. However, the absence of large-scale datasets and standardized benchmarks remains a significant bottleneck to progress in active research for robust visual perception from thermal images. To this end, this manuscript provides a large-scale Multi-Spectral Stereo (MS$^2$) dataset that consists of stereo RGB, stereo NIR, stereo thermal, stereo LiDAR data, and GNSS/IMU information along with semi-dense depth ground truth. MS$^2$ dataset includes 162K synchronized multi-modal data pairs captured across diverse locations (e.g., urban city, residential area, campus, and high-way road) at different times (e.g., morning, daytime, and nighttime) and under various weather conditions (e.g., clear-sky, cloudy, and rainy). Secondly, we conduct a thorough evaluation of monocular and stereo depth estimation networks across RGB, NIR, and thermal modalities to establish standardized benchmark results on MS$^2$ depth test sets (e.g., day, night, and rainy). Lastly, we provide in-depth analyses and discuss the challenges revealed by the benchmark results, such as the performance variability for each modality under adverse conditions, domain shift between different sensor modalities, and potential research direction for thermal perception. Our dataset and source code are publicly available at https://sites.google.com/view/multi-spectral-stereo-dataset and https://github.com/UkcheolShin/SupDepth4Thermal.
中文: 本文介绍了MS²数据集,这是一个大规模多光谱立体数据集,旨在解决热成像视觉感知研究中缺乏标准化基准的问题,并对不同模态在恶劣条件下的深度估计进行了全面评估和分析。
English: This manuscript introduces the MS² dataset, a large-scale multi-spectral stereo collection designed to address the lack of standardized benchmarks for robust visual perception using thermal imaging, and it provides comprehensive evaluations and analyses of depth estimation across various modalities under adverse conditions.

Authors:Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng
Title: BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
Abstract:
In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervisions. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on Cityscapes and OPA datasets with notable improvements in IOU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.
中文: 本文提出BOOTPLACE这一新颖的检测式放置范式,通过在用物体剔除背景训练的特化检测变换器中识别合适放置区域,并根据互补特性将目标合成物体与检测区域语义关联,在Cityscapes和OPA等基准测试中实现了优越性能。
English: This paper introduces BOOTPLACE, a novel placement-by-detection paradigm that identifies suitable regions for object placement using a detection transformer trained on object-subtracted backgrounds and associates objects with regions based on complementary characteristics, achieving superior performance on benchmarks like Cityscapes and OPA datasets.

Authors:Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy
Title: Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Abstract:
Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder's representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR's potential for visual understanding tasks beyond its original generation role. Based on these insights, we present \emph{Harmon}, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be available at https://github.com/wusize/Harmon.
Chinese: 本研究提出了Harmon框架,通过共享掩码自回归编码器统一视觉理解与生成任务,在GenEval、MJHQ30K和WISE基准测试中实现图像生成的顶尖性能,同时在理解任务上达到专用语义编码器的同等水平。
English: This work introduces Harmon, a unified autoregressive framework that integrates visual understanding and generation using a shared masked autoregressive encoder, achieving state-of-the-art results in image generation on benchmarks like GenEval, MJHQ30K, and WISE while matching the performance of dedicated semantic encoders in understanding tasks.

Authors:Ishit Mehta, Manmohan Chandraker, Ravi Ramamoorthi
Title: Locally Orderless Images for Optimization in Differentiable Rendering
Abstract:
Problems in differentiable rendering often involve optimizing scene parameters that cause motion in image space. The gradients for such parameters tend to be sparse, leading to poor convergence. While existing methods address this sparsity through proxy gradients such as topological derivatives or lagrangian derivatives, they make simplifying assumptions about rendering. Multi-resolution image pyramids offer an alternative approach but prove unreliable in practice. We introduce a method that uses locally orderless images, where each pixel maps to a histogram of intensities that preserves local variations in appearance. Using an inverse rendering objective that minimizes histogram distance, our method extends support for sparsely defined image gradients and recovers optimal parameters. We validate our method on various inverse problems using both synthetic and real data.
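To make the histogram-matching objective concrete, below is a minimal PyTorch sketch of a differentiable "soft" local histogram and a histogram-distance loss between a rendered patch and a reference patch. The bin layout, Gaussian bandwidth, and L1 distance are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def soft_histogram(patch, num_bins=16, sigma=0.05):
    """Differentiable histogram of intensities in a patch (values assumed in [0, 1])."""
    centers = torch.linspace(0.0, 1.0, num_bins, device=patch.device)   # (B,)
    diff = patch.reshape(-1, 1) - centers.reshape(1, -1)                # (P, B)
    weights = torch.exp(-0.5 * (diff / sigma) ** 2)                     # soft bin assignment
    hist = weights.sum(dim=0)
    return hist / hist.sum()                                            # normalize to a distribution

def histogram_loss(rendered_patch, target_patch, num_bins=16, sigma=0.05):
    """L1 distance between soft histograms; gradients flow into the rendered patch."""
    h_r = soft_histogram(rendered_patch, num_bins, sigma)
    h_t = soft_histogram(target_patch, num_bins, sigma)
    return (h_r - h_t).abs().sum()

# Toy usage: a rendered patch with requires_grad receives dense gradients
rendered = torch.rand(8, 8, requires_grad=True)
target = torch.rand(8, 8)
loss = histogram_loss(rendered, target)
loss.backward()
```

Because every pixel contributes to every bin, the loss stays informative even when per-pixel gradients of the renderer are sparse.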

Authors:Johannes Seiffarth, Katharina Nöh
Title: PyUAT: Open-source Python framework for efficient and scalable cell tracking
Abstract:
Tracking individual cells in live-cell imaging provides fundamental insights that are indispensable for studying causes and consequences of phenotypic heterogeneity, responses to changing environmental conditions or stressors. Microbial cell tracking, characterized by stochastic cell movements and frequent cell divisions, remains a challenging task when imaging frame rates must be limited to avoid counterfactual results. A promising way to overcome this limitation is uncertainty-aware tracking (UAT), which uses statistical models, calibrated to empirically observed cell behavior, to predict likely cell associations. We present PyUAT, an efficient and modular Python implementation of UAT for tracking microbial cells in time-lapse imaging. We demonstrate its performance on a large 2D+t data set and investigate the influence of modular biological models and imaging intervals on the tracking performance. The open-source PyUAT software is available at https://github.com/JuBiotech/PyUAT, including example notebooks for immediate use in Google Colab.
Chinese: PyUAT是一种高效、模块化的Python工具,通过基于观测行为校准的统计模型预测细胞关联,实现了不确定性感知追踪,以解决微生物细胞追踪中的挑战。
English: PyUAT is an efficient, modular Python tool that implements uncertainty-aware tracking to overcome challenges in microbial cell tracking by predicting cell associations using statistical models calibrated to observed behavior.
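To illustrate the uncertainty-aware idea, here is a minimal sketch of frame-to-frame association: an isotropic Gaussian motion model scores each candidate pairing by negative log-likelihood and the assignment is solved with the Hungarian algorithm. The motion model, its variance, and the gating threshold are illustrative assumptions, not PyUAT's actual statistical models.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_centroids, curr_centroids, sigma=5.0, gate=3.0):
    """Assign current detections to previous cells under an isotropic Gaussian motion model."""
    prev = np.asarray(prev_centroids, dtype=float)   # (N, 2)
    curr = np.asarray(curr_centroids, dtype=float)   # (M, 2)
    d2 = ((prev[:, None, :] - curr[None, :, :]) ** 2).sum(-1)    # squared distances (N, M)
    cost = 0.5 * d2 / sigma**2                                    # negative log-likelihood up to a constant
    rows, cols = linear_sum_assignment(cost)
    # keep only assignments within the gating radius (gate * sigma)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if d2[r, c] <= (gate * sigma) ** 2]

matches = associate([(10, 10), (40, 12)], [(12, 11), (80, 80), (41, 14)])
print(matches)   # e.g. [(0, 0), (1, 2)]
```

A calibrated UAT model would replace the single Gaussian with empirically fitted distributions for movement, growth, and division.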

Authors:Taufiq Ahmed, Abhishek Kumar, Constantino Álvarez Casado, Anlan Zhang, Tuomo Hänninen, Lauri Loven, Miguel Bordallo López, Sasu Tarkoma
Title: Exponentially Weighted Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection Model Training in Unmanned Aerial Vehicles Surveillance Scenarios
Abstract:
Object detection models often struggle with class imbalance, where rare categories appear significantly less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods are based on linear adjustments, which limit their effectiveness in long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT Dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22% over the baseline and outperforms RFS and IRFS, particularly for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, as these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring. The code is available at: https://github.com/futurians/E-IRFS.
中文: 本文提出的E-IRFS方法通过指数加权采样策略,在线性方法RFS和IRFS基础上显著提升长尾分布中稀有类别的检测性能,在无人机应急监测任务中实现22%的性能提升。
English: This paper introduces E-IRFS, an exponentially weighted sampling method that enhances rare class detection in imbalanced datasets by outperforming linear approaches like RFS and IRFS, achieving a 22% improvement in emergency monitoring scenarios.
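For intuition, a minimal sketch of the repeat-factor computation is shown below: a linear IRFS-style factor built on the geometric mean of image- and instance-level frequencies, and an exponentially weighted variant in the spirit of E-IRFS. The exact exponential form and hyperparameters are assumptions for illustration, not the paper's definition.

```python
import math

def irfs_repeat_factor(image_freq, instance_freq, t=0.001):
    """Linear repeat factor on the geometric mean of image and instance frequencies (IRFS-style)."""
    f = math.sqrt(image_freq * instance_freq)
    return max(1.0, math.sqrt(t / f))

def e_irfs_repeat_factor(image_freq, instance_freq, t=0.001, lam=1.0):
    """Illustrative exponential re-weighting: rare classes get amplified more sharply."""
    f = math.sqrt(image_freq * instance_freq)
    return max(1.0, math.exp(lam * (math.sqrt(t / f) - 1.0)))

for img_f, inst_f in [(0.5, 0.6), (0.01, 0.02), (0.001, 0.001)]:
    print(irfs_repeat_factor(img_f, inst_f), e_irfs_repeat_factor(img_f, inst_f))
```

The exponential mapping leaves frequent classes nearly untouched while growing much faster than the square-root rule for the rarest ones.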

Authors:Hongyi Zeng, Wenxuan Liu, Tianhua Xia, Jinhui Chen, Ziyun Li, Sai Qian Zhang
Title: Foveated Instance Segmentation
Abstract:
Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing instances of interest, reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on instances of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at https://github.com/SAI-
中文:FovealSeg框架利用实时视线数据,仅对用户关注的区域进行实例分割,在显著降低计算负担的同时,在基准测试中保持了高精度表现。
English: FovealSeg is a framework that uses real-time gaze data to perform instance segmentation only on areas of interest, significantly reducing computational load while maintaining high accuracy on benchmarks.
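The gaze-conditioned idea can be sketched simply: crop a window around the current gaze point, run segmentation only on that crop, and paste the result back into a full-resolution mask. The `segment_fn` callable and the fixed crop size below are placeholders, not the paper's FSNet.

```python
import numpy as np

def foveated_segment(frame, gaze_xy, segment_fn, crop=256):
    """Run segmentation only on a crop centred at the gaze point (illustrative sketch)."""
    h, w = frame.shape[:2]
    cx, cy = int(gaze_xy[0]), int(gaze_xy[1])
    x0, y0 = max(0, cx - crop // 2), max(0, cy - crop // 2)
    x1, y1 = min(w, x0 + crop), min(h, y0 + crop)
    mask = np.zeros((h, w), dtype=np.int32)
    mask[y0:y1, x0:x1] = segment_fn(frame[y0:y1, x0:x1])   # heavy model runs on the crop only
    return mask

# Toy usage with a dummy "segmenter" that thresholds the red channel
frame = np.random.rand(720, 1280, 3)
mask = foveated_segment(frame, gaze_xy=(640, 360),
                        segment_fn=lambda c: (c[..., 0] > 0.5).astype(np.int32))
```

The compute saving comes directly from the crop-to-frame area ratio, which is why gaze-driven cropping helps on resource-constrained headsets.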

Authors:Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci
Title: On Large Multimodal Models as Open-World Image Classifiers
Abstract:
Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.
中文: 本研究在开放世界环境下评估大型多模态模型的图像分类能力,揭示了其在细粒度识别方面的挑战,并表明优化提示策略可有效提升分类效果。
English: This study evaluates Large Multimodal Models in an open-world image classification setting, revealing their challenges with granularity and fine-grained recognition while demonstrating how enhanced prompting can improve performance.
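A toy version of the kind of prediction/ground-truth alignment score such a protocol needs is sketched below: free-form answers are compared with class names by exact match, containment, and a soft string similarity. This is an illustrative stand-in with no external dependencies, not the metrics defined in the paper.

```python
from difflib import SequenceMatcher

def alignment_score(prediction, ground_truth):
    """Score how well an open-ended answer matches a class name (illustrative, not the paper's metric)."""
    p, g = prediction.strip().lower(), ground_truth.strip().lower()
    if p == g:
        return 1.0
    if g in p or p in g:            # e.g. "a golden retriever dog" vs "golden retriever"
        return 0.8
    return SequenceMatcher(None, p, g).ratio()   # soft fallback in [0, 1]

print(alignment_score("A golden retriever dog", "golden retriever"))
print(alignment_score("sports car", "limousine"))
```

A more faithful evaluator would compare predictions and labels in an embedding space or with an LLM judge, so that granularity mismatches ("dog" vs "golden retriever") can be graded rather than treated as string overlap.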

Authors:Mohammad Amin Khalafi, Seyed Amir Ahmad Safavi-Naini, Ameneh Salehi, Nariman Naderi, Dorsa Alijanzadeh, Pardis Ketabi Moghadam, Kaveh Kavosi, Negar Golestani, Shabnam Shahrokh, Soltanali Fallah, Jamil S Samaan, Nicholas P. Tatonetti, Nicholas Hoerter, Girish Nadkarni, Hamid Asadzadeh Aghdaei, Ali Soroush
Title: Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images
Abstract:
Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, 4 CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1]). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2]), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini 1.5 Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BioMedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.
中文: 本研究评估了视觉语言模型在结肠镜息肉检测与分类中相对于卷积神经网络和传统机器学习的表现,发现ResNet50等CNN模型仍最优,但BiomedCLIP和GPT-4等视觉语言模型在无法训练CNN时具有检测潜力。
English: This study evaluates vision-language models against CNNs and traditional machine learning for colonoscopy polyp detection and classification, finding CNNs like ResNet50 superior but noting VLMs such as BiomedCLIP and GPT-4 show promise for detection when CNN training is impractical.

Authors:Haolong Yan, Kaijun Tan, Yeqing Shen, Xin Huang, Zheng Ge, Xiangyu Zhang, Si Li, Daxin Jiang
Title: M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
Abstract:
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.
Chinese Summary: 本研究提出了新型多模态文档摘要基准(M-DocSum-Bench),用于评估大视觉语言模型对图文交织文档的理解能力,发现现有模型在复杂多模态场景中难以保持连贯性和准确整合信息。
English Summary: This study introduces a novel Multimodal Document Summarization Benchmark (M-DocSum-Bench) to evaluate Large Vision-Language Models' ability to understand interleaved image-text documents, revealing that current models struggle with coherence and information integration in complex multimodal contexts.

Authors:Ivan Diaz, Florin Scherer, Yanik Berli, Roland Wiest, Helly Hammer, Robert Hoepner, Alejandro Leon Betancourt, Piotr Radojewski, Richard McKinley
Title: Learning from spatially inhomogenous data: resolution-adaptive convolutions for multiple sclerosis lesion segmentation
Abstract:
In the setting of clinical imaging, differences between vendors, hospitals and sequences can yield highly inhomogeneous imaging data. In MRI in particular, voxel dimension, slice spacing and acquisition plane can vary substantially. For clinical applications, therefore, algorithms must be trained to handle data with various voxel resolutions. The usual strategy to deal with heterogeneity of resolution is harmonization: resampling imaging data to a common (usually isovoxel) resolution. This can lead to loss of fidelity arising from interpolation artifacts out-of-plane and downsampling in-plane. We present in this paper a network architecture designed to be able to learn directly from spatially heterogeneous data, without resampling: a segmentation network based on the e3nn framework that leverages a spherical harmonic, rather than voxel-grid, parameterization of convolutional kernels, with a fixed physical radius. Networks based on these kernels can be resampled to their input voxel dimensions. We trained and tested our network on a publicly available dataset assembled from three centres, and on an in-house dataset of Multiple Sclerosis cases with a high degree of spatial inhomogeneity. We compared our approach to a standard U-Net with two strategies for handling inhomogeneous data: training directly on the data without resampling, and resampling to a common resolution of 1mm isovoxels. We show that our network is able to learn from various combinations of voxel sizes and outperforms classical U-Nets on 2D testing cases and most 3D testing cases. This shows an ability to generalize well when tested on image resolutions not seen during training. Our code can be found at: http://github.com/SCAN-NRAD/e3nn_U-Net.
中文摘要:本文提出了一种新型神经网络架构,可直接处理空间异构的MRI数据而无需重采样,利用球谐卷积核在不同体素分辨率下表现优于传统方法。
English Summary: This paper introduces a novel neural network architecture that directly learns from spatially heterogeneous MRI data without resampling, using spherical harmonic convolutional kernels to outperform traditional methods in handling diverse voxel resolutions.

Authors:Haitong Liu, Kuofeng Gao, Yang Bai, Jinmin Li, Jinxiao Shan, Tao Dai, Shu-Tao Xia
Title: Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
Abstract:
Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. Concretely, Ramblings aim to mislead video-based LLMs into generating inaccurate captions for the videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. Mutes, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content. Our code is available at https://github.com/ttthhl/Protecting_Your_Video_Content.
Chinese: 研究人员开发了两种名为Ramblings和Mutes的隐形视频水印,通过误导性描述或过度简化的字幕来降低视频大语言模型的标注质量,从而有效保护个人视频内容。
English: Researchers have developed two imperceptible video watermarks, Ramblings and Mutes, that effectively protect personal videos by degrading the annotation quality of video-based large language models through misleading captions or excessive brevity.
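A minimal sketch of how such an imperceptible, caption-disrupting perturbation could be optimized is shown below, using a PGD-style loop that maximizes a captioning loss under an L-infinity budget. The `caption_loss` callable stands in for whatever objective (misleading or overly brief captions) the Ramblings or Mutes watermarks actually optimize; the budget and step count are illustrative.

```python
import torch

def craft_watermark(video, caption_loss, epsilon=4 / 255, alpha=1 / 255, steps=20):
    """PGD-style perturbation: maximize the captioning loss while staying within an L-inf ball."""
    delta = torch.zeros_like(video, requires_grad=True)
    for _ in range(steps):
        loss = caption_loss(video + delta)          # higher loss = more disrupted annotation
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()      # ascend the loss
            delta.clamp_(-epsilon, epsilon)         # keep the perturbation imperceptible
            delta.grad.zero_()
    return (video + delta).clamp(0, 1).detach()

# Toy usage with a dummy loss; a real setting would call a frozen video-based LLM here
video = torch.rand(8, 3, 224, 224)
protected = craft_watermark(video, caption_loss=lambda v: (v ** 2).mean())
```

The clamp to a small epsilon is what keeps the watermark invisible while the gradient ascent degrades the model's annotations.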

Authors:Yide Di, Yun Liao, Hao Zhou, Kaijun Zhu, Qing Duan, Junhui Liu, Mingyu Lu
Title: UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants
Abstract:
Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at: https://github.com/LiaoYun0x0/UFM.
Chinese: 本文提出了一种统一特征匹配预训练模型(UFM)及多模态图像助手变换器,通过数据增强和分阶段预训练策略,有效解决了跨模态图像特征匹配问题,展现出优异的泛化能力和性能表现。
English: This paper introduces a Unified Feature Matching pre-trained model (UFM) with Multimodal Image Assistant transformers, which effectively handles feature matching across various image modalities through data augmentation and staged pre-training, demonstrating strong generalization and performance.

Authors:Abdelrahman Shaker, Muhammad Maaz, Chenhui Gou, Hamid Rezatofighi, Salman Khan, Fahad Shahbaz Khan
Title: Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abstract:
Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
Chinese: Mobile-VideoGPT是一种高效的多模态框架,通过轻量级双视觉编码器、注意力帧评分机制和令牌剪枝技术,在减少40%参数的同时实现实时处理,并在六大视频理解基准测试中平均超越现有0.5B参数模型6个点。
English: Mobile-VideoGPT is an efficient multimodal framework that reduces computational demands and parameters while achieving higher throughput and outperforming existing 0.5B-parameter models by an average of 6 points across video understanding benchmarks.
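A small sketch of attention-based frame scoring follows: per-frame features are scored against a query embedding (for example, pooled prompt features) and only the top-k frames are kept for the language model. The scoring rule, scaling, and pooling here are illustrative assumptions rather than the model's exact mechanism.

```python
import torch
import torch.nn.functional as F

def select_key_frames(frame_feats, query, k=4):
    """Keep the k frames whose features attend most strongly to the query embedding.

    frame_feats: (T, D) per-frame visual features; query: (D,) pooled prompt embedding.
    """
    scores = F.softmax(frame_feats @ query / frame_feats.shape[-1] ** 0.5, dim=0)      # (T,)
    topk = torch.topk(scores, k=min(k, frame_feats.shape[0])).indices.sort().values    # keep temporal order
    return frame_feats[topk], topk

feats = torch.randn(16, 256)          # 16 sampled frames
query = torch.randn(256)
kept, idx = select_key_frames(feats, query, k=4)
print(idx.tolist())
```

Dropping low-scoring frames before token projection is what shrinks the visual token budget and, with it, the language model's prefill cost.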

Authors:Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, Yu-Chiang Frank Wang
Title: VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
Abstract:
Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.

Authors:Reza Qorbani, Gianluca Villani, Theodoros Panagiotakopoulos, Marc Botet Colomer, Linus Härenstam-Nielsen, Mattia Segu, Pier Luigi Dovesi, Jussi Karlgren, Daniel Cremers, Federico Tombari, Matteo Poggi
Title: Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
Abstract:
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. SemLA leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20-domain benchmark built over 10 standard datasets demonstrate SemLA's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
中文: 提出的语义库适应(SemLA)框架通过基于CLIP嵌入相似性动态融合相关LoRA适配器,实现了开放词汇语义分割的无训练域适应,在跨域性能、可扩展性、可解释性和数据隐私保护方面均表现出色。
English: The proposed Semantic Library Adaptation (SemLA) framework enables training-free domain adaptation for open-vocabulary semantic segmentation by dynamically merging relevant LoRA adapters based on CLIP embedding proximity, achieving superior performance across diverse domains while ensuring scalability, explainability, and data privacy.
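The retrieval-and-fusion step can be sketched as follows: embed the target image with CLIP, rank a library of LoRA adapters by cosine similarity to their stored index embeddings, and merge the top-k adapters with softmax weights. The adapter layout (a dict of weight tensors per adapter) and the temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fuse_adapters(image_emb, adapter_embs, adapter_weights, k=3, temperature=0.1):
    """Merge the k most relevant LoRA adapters for one test image (illustrative sketch).

    image_emb: (D,) CLIP embedding of the input.
    adapter_embs: (N, D) index embeddings, one per adapter in the library.
    adapter_weights: list of N dicts mapping parameter names to LoRA weight tensors.
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), adapter_embs, dim=-1)   # (N,)
    top = torch.topk(sims, k=min(k, len(adapter_weights)))
    weights = F.softmax(top.values / temperature, dim=0)                        # fusion coefficients
    merged = {}
    for w, idx in zip(weights, top.indices):
        for name, tensor in adapter_weights[int(idx)].items():
            merged[name] = merged.get(name, 0) + w * tensor                     # weighted sum of LoRA deltas
    return merged, top.indices

# Toy usage with a library of 5 adapters, each holding one LoRA matrix
lib = [{"lora_A": torch.randn(8, 64)} for _ in range(5)]
merged, used = fuse_adapters(torch.randn(512), torch.randn(5, 512), lib, k=3)
```

Because only adapter weights are merged at test time, no gradient step is needed, and the chosen indices double as an explanation of which source domains influenced the prediction.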

Authors:Weihao Yu, Yuanhao Cai, Ruyi Zha, Zhiwen Fan, Chenxin Li, Yixuan Yuan
Title: X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
Abstract:
Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Project website at: https://x2-gaussian.github.io/.
Chinese: X²-高斯框架通过将动态辐射高斯泼溅与自监督呼吸运动学习相结合,实现了连续时间四维CT重建,消除了相位离散化和外部门控设备的依赖,并在现有方法基础上取得了显著的性能提升。
English: The X²-Gaussian framework enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning, eliminating phase discretization and external gating devices while achieving significant performance improvements over existing methods.
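The gating-free periodicity idea can be sketched as a self-supervised loss: with a learnable breathing period T, the deformation predicted at time t should match the deformation at t + T. The deformation-network interface and the plain MSE below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

class PeriodicConsistency(torch.nn.Module):
    """Learnable period T; penalizes differences between deformations one period apart."""

    def __init__(self, init_period=4.0):
        super().__init__()
        self.log_period = torch.nn.Parameter(torch.log(torch.tensor(init_period)))

    def forward(self, deform_fn, t):
        period = self.log_period.exp()
        return ((deform_fn(t) - deform_fn(t + period)) ** 2).mean()

# Toy usage: a dummy deformation that is genuinely periodic with period ~4s
deform = lambda t: torch.sin(2 * torch.pi * t / 4.0).unsqueeze(-1).repeat(1, 3)
loss_fn = PeriodicConsistency(init_period=3.5)
loss = loss_fn(deform, torch.rand(16) * 10)
loss.backward()   # gradients flow into the learnable period
```

Parameterizing the period in log space keeps it positive while it is optimized jointly with the reconstruction, which is what removes the need for a respiratory gating device.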

Authors:Ziren Gong, Fabio Tosi, Youmin Zhang, Stefano Mattoccia, Matteo Poggi
Title: HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM
Abstract:
NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes that undergo significant movement or suffer from forgetting. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
中文:HS-SLAM通过融合混合编码网络提升场景表示能力,采用非局部像素块采样进行结构监督以优化结构捕捉,并实施主动全局光束法平差确保全局一致性,在保持效率的同时显著提高了跟踪与重建精度。
English: HS-SLAM enhances NeRF-based SLAM by integrating a hybrid encoding network for improved scene representation, structural supervision through patch sampling for better structural capture, and active global bundle adjustment to maintain global consistency, achieving superior tracking and reconstruction accuracy efficiently.
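Of the three encodings combined by the hybrid network, the one-blob component is the simplest to illustrate; a minimal sketch is below, with the hash-grid and tri-plane branches left as commented placeholders. The bin count and bandwidth are illustrative choices.

```python
import torch

def one_blob_encoding(x, num_bins=16, sigma=None):
    """One-blob encoding: each scalar in [0, 1] activates a Gaussian bump over the bins.

    x: (..., D) coordinates in [0, 1]; returns (..., D * num_bins).
    """
    sigma = sigma or 1.0 / num_bins
    centers = torch.linspace(0.0, 1.0, num_bins, device=x.device)
    act = torch.exp(-0.5 * ((x.unsqueeze(-1) - centers) / sigma) ** 2)   # (..., D, num_bins)
    return act.flatten(start_dim=-2)

points = torch.rand(1024, 3)                      # sampled 3D positions
blob = one_blob_encoding(points)                  # (1024, 48)
# In the hybrid scheme this would be concatenated with hash-grid and tri-plane features:
# feats = torch.cat([hash_grid(points), tri_planes(points), blob], dim=-1)
```

The smooth one-blob features complement the high-frequency hash-grid and the plane-factorized features, which is the intuition behind mixing the three.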

Authors:Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele
Title: Test-Time Visual In-Context Tuning
Abstract:
Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.
中文: 本文提出测试时视觉上下文调优(VICT)方法,通过单一样本自适应和循环一致性损失重构原始任务提示,有效提升了视觉上下文学习模型在未知领域的泛化能力。
English: This paper introduces test-time Visual In-Context Tuning (VICT), a method that enhances the generalizability of visual in-context learning models to unseen domains by adapting them with a single test sample and using cycle consistency loss to recover original task prompts.
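A schematic of the test-time loop is shown below: the test image takes the prompt's place, the model's own prediction on the test image serves as context, and a cycle-consistency loss asks the adapted model to reconstruct the known prompt output. The `model(context_in, context_out, query)` calling convention, step count, and plain MSE are illustrative assumptions about how a VICL model is invoked.

```python
import torch

def vict_adapt(model, prompt_img, prompt_out, test_img, steps=5, lr=1e-4):
    """Test-time adaptation sketch: flip prompt and test roles, enforce cycle consistency."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        pred_test = model(prompt_img, prompt_out, test_img)             # usual in-context prediction
        recon_prompt = model(test_img, pred_test, prompt_img)           # roles flipped
        loss = torch.nn.functional.mse_loss(recon_prompt, prompt_out)   # should recover the known prompt output
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(prompt_img, prompt_out, test_img).detach()

# Toy usage with a dummy "in-context" model that blends its inputs
class ToyVICL(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.tensor(0.5))
    def forward(self, ctx_in, ctx_out, query):
        return self.w * query + (1 - self.w) * ctx_out

out = vict_adapt(ToyVICL(), torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32))
```

The prompt output is the only label available at test time, so recovering it after the role flip is the signal that the model has adapted to the new distribution.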

Authors:Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, Xiangyu Yue
Title: Video-R1: Reinforcing Video Reasoning in MLLMs
Abstract:
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 37.1% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released in: https://github.com/tulerfeng/Video-R1.
中文:基于DeepSeek-R1的成功,Video-R1率先将R1范式引入多模态大模型的视频推理领域,通过提出T-GRPO算法加强时序建模并融合图像推理数据,在多个基准测试中实现突破性进展,尤其在VSI-Bench视频空间推理基准上以37.1%的准确率超越商用模型GPT-4o。
English: Building on DeepSeek-R1's success, Video-R1 pioneers the R1 paradigm for video reasoning in MLLMs by introducing the T-GRPO algorithm to enhance temporal modeling and incorporating image-reasoning data, achieving state-of-the-art performance on multiple benchmarks including surpassing GPT-4o on VSI-Bench with 37.1% accuracy.
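The sketch below shows the group-relative advantage computation at the heart of GRPO, plus a hypothetical temporal bonus that boosts correct responses only when answers from temporally ordered frames beat those from shuffled frames. The bonus is a loose illustration of the temporal incentive described above, not T-GRPO's exact rule.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled response's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def apply_temporal_bonus(rewards, correct_mask, acc_ordered, acc_shuffled, bonus=0.1):
    """Hypothetical temporal incentive: boost correct answers only when ordered frames beat shuffled ones."""
    if acc_ordered > acc_shuffled:
        return rewards + bonus * correct_mask
    return rewards

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])                    # correctness of 5 sampled answers
rewards = apply_temporal_bonus(rewards, correct_mask=(rewards > 0).float(),
                               acc_ordered=0.6, acc_shuffled=0.4)
print(group_relative_advantages(rewards))
```

Because advantages are normalized within the group, the bonus only matters when it changes the relative standing of responses, nudging the policy toward answers that genuinely exploit temporal order.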

Authors:Jianning Pei, Han Hu, Shuyang Gu
Title: Optimal Stepsize for Diffusion Sampling
Abstract:
Diffusion models achieve remarkable generation quality but suffer from computationally intensive sampling due to suboptimal step discretization. While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation. Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval. Our code is available at https://github.com/bebebe666/OptimalSteps.
Chinese: 本文提出最优步长蒸馏法,通过动态编程框架优化扩散模型的步长调度,在保持99.4%性能的同时实现文本到图像生成速度提升10倍。
English: This paper introduces Optimal Stepsize Distillation, a dynamic programming framework that accelerates diffusion model sampling by optimizing stepsize schedules, achieving 10x faster text-to-image generation with minimal performance loss.
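The dynamic-programming view can be illustrated with a small sketch: given a dense reference grid of timesteps and a per-jump cost (how much error one large step incurs relative to the reference trajectory), pick a fixed budget of steps minimizing the total cost. The `jump_cost` function here is a toy stand-in; in the paper it is distilled from reference trajectories.

```python
def optimal_schedule(num_ref_steps, budget, jump_cost):
    """DP over reference timestep indices 0..num_ref_steps-1: choose `budget` downward jumps
    from the last index to index 0 minimizing total cost. jump_cost(i, j) scores one jump i -> j < i."""
    INF = float("inf")
    start, end = num_ref_steps - 1, 0
    # best[k][j] = minimal cost to reach index j from `start` using exactly k jumps
    best = [[INF] * num_ref_steps for _ in range(budget + 1)]
    parent = [[None] * num_ref_steps for _ in range(budget + 1)]
    best[0][start] = 0.0
    for k in range(1, budget + 1):
        for j in range(num_ref_steps):
            for i in range(j + 1, num_ref_steps):            # only downward jumps i -> j
                if best[k - 1][i] < INF:
                    c = best[k - 1][i] + jump_cost(i, j)
                    if c < best[k][j]:
                        best[k][j], parent[k][j] = c, i
    # backtrack the schedule from index 0
    schedule, j = [end], end
    for k in range(budget, 0, -1):
        j = parent[k][j]
        schedule.append(j)
    return list(reversed(schedule))

# Toy cost: larger jumps cost quadratically more (purely illustrative)
print(optimal_schedule(num_ref_steps=50, budget=5, jump_cost=lambda i, j: (i - j) ** 2))
```

The optimal-substructure property mentioned in the abstract is exactly what makes this table-filling recursion valid: the best way to reach a timestep with k jumps only depends on the best ways to reach intermediate timesteps with k-1 jumps.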

Authors:Hongkai Lin, Dingkang Liang, Zhenghao Qi, Xiang Bai
Title: A Unified Image-Dense Annotation Generation Model for Underwater Scenes
Abstract:
Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE
中文: 本文提出TIDE方法,通过文本输入统一生成水下图像与密集标注,有效缓解水下密集预测任务中的数据匮乏问题,显著提升了现有模型的性能表现。
English: This paper introduces TIDE, a unified text-to-image and dense annotation generation method that creates realistic underwater images and consistent annotations to address data scarcity in underwater dense prediction tasks, significantly improving model performance.

Authors:Haolin Liu, Xiaohang Zhan, Zizheng Yan, Zhongjin Luo, Yuxin Wen, Xiaoguang Han
Title: Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence
Abstract:
Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response, we revisit registration-for-correspondence methods and tap their potential for more stable shape correspondence estimation. To overcome their common issues including unstable deformations and the necessity for careful pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. We first re-purpose a foundation model for 2D character correspondence that ensures reliable and stable 2D mappings. Crucially, we propose a novel Semantic Flow Guided Registration approach that leverages 2D correspondence to guide mesh deformations. Our framework significantly surpasses existing methods in challenging scenarios, and brings possibilities for a wide array of real applications, as demonstrated in our results.

Authors:Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang, Donglin Wang
Title: Exploring the Evolution of Physics Cognition in Video Generation: A Survey
Abstract:
Recent advancements in video generation have witnessed significant progress, especially with the rapid advancement of diffusion models. Despite this, their deficiencies in physical cognition have gradually received widespread attention - generated content often violates the fundamental laws of physics, falling into the dilemma of "visual realism but physical absurdity". Researchers began to increasingly recognize the importance of physical fidelity in video generation and attempted to integrate heuristic physical cognition such as motion representations and physical knowledge into generative systems to simulate real-world dynamic scenarios. Considering the lack of a systematic overview in this field, this survey aims to provide a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolutionary process of physical cognition in video generation from a cognitive science perspective, while proposing a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, encompassing state-of-the-art methods, classical paradigms, and benchmarks. Subsequently, we emphasize the inherent key challenges in this domain and delineate potential pathways for future research, contributing to advancing the frontiers of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby propelling generative models from the stage of "visual mimicry" towards a new phase of "human-like physical comprehension".
中文摘要:近期视频生成技术虽在扩散模型推动下取得进展,但存在物理认知缺陷,导致生成内容违反物理规律;本综述通过建立三层认知分类体系,系统梳理了物理认知融合方法,旨在推动视频生成从“视觉模仿”迈向“类人物理理解”的新阶段。
English Summary: Recent video generation advancements using diffusion models face challenges in physical realism, prompting research into integrating physical cognition for more accurate simulations, with this survey providing a systematic taxonomy and future directions to bridge the gap between visual and physical fidelity.

Authors:David Yifan Yao, Albert J. Zhai, Shenlong Wang
Title: Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
Abstract:
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

Authors:Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, Peng Gao
Title: Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Abstract:
We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performances even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
Chinese: Lumina-Image 2.0 采用统一架构和高效训练策略,通过跨模态交互优化实现了卓越的文本到图像生成性能,仅用少量参数即可展现强大效果。
English: Lumina-Image 2.0 introduces a unified text-to-image generation framework with enhanced cross-modal interaction and efficiency, achieving superior performance with minimal parameters.

Authors:Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, Ziwei Liu
Title: VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Abstract:
Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists such as SOTA VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive human annotations to ensure evaluation alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.
中文: 视频生成已从生成不真实内容发展到视觉逼真,但现有基准如VBench仅关注表面真实性,因此推出VBench-2.0评估物理规律和常识等内在真实性,以实现真正的现实模拟。
English: Video generation has progressed from unrealistic outputs to visually convincing videos, yet current benchmarks like VBench focus on superficial aspects, prompting the introduction of VBench-2.0 to evaluate intrinsic faithfulness in areas like physics and commonsense for true realism.

Authors:Yan Xia, Xiaowei Zhou, Etienne Vouga, Qixing Huang, Georgios Pavlakos
Title: Reconstructing Humans with a Biomechanically Accurate Skeleton
Abstract:
In this paper, we introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to produce pseudo ground truth model parameters for single images and implement a training procedure that iteratively refines these pseudo labels. Compared to state-of-the-art methods for 3D human mesh recovery, our model achieves competitive performance on standard benchmarks, while it significantly outperforms them in settings with extreme 3D poses and viewpoints. Additionally, we show that previous reconstruction methods frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom making more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We make the code, models and data available at: https://isshikihugh.github.io/HSMR/

Authors:Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Peng Gao, Bin Fu, Zhen Li
Title: LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
Abstract:
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
中文: LeX-Art提出了一套全面的文本到图像合成方案,通过构建高质量数据集、开发提示增强模型和训练先进生成模型,在文本渲染保真度和准确性上取得显著提升,并建立了系统化评估基准。
English: LeX-Art introduces a comprehensive text-image synthesis suite featuring high-quality dataset construction, prompt enhancement models, and state-of-the-art text-to-image models, achieving significant improvements in text rendering fidelity and accuracy through systematic evaluation benchmarks.
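To give a feel for what a pairwise, normalized edit-distance metric can look like, here is a small sketch that pairs predicted words with ground-truth words via optimal matching on length-normalized Levenshtein distances and penalizes unmatched words. This is an illustrative construction; the paper's exact PNED definition may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def levenshtein(a, b):
    """Classic edit distance between two strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def pairwise_normalized_edit_distance(pred_words, gt_words):
    """Illustrative PNED-style score: optimally pair words, average length-normalized edit distances."""
    if not pred_words or not gt_words:
        return 1.0
    cost = np.array([[levenshtein(p, g) / max(len(p), len(g)) for g in gt_words] for p in pred_words])
    rows, cols = linear_sum_assignment(cost)
    matched = cost[rows, cols].sum()
    unmatched = abs(len(pred_words) - len(gt_words))       # penalize missing / extra words
    return (matched + unmatched) / max(len(pred_words), len(gt_words))

print(pairwise_normalized_edit_distance(["hello", "world"], ["hello", "word"]))
```

Pairing before scoring makes the metric robust to word reordering in the rendered text, which a plain sequence-level edit distance would punish.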

Authors:Yuhan Zhang, Mengchen Zhang, Tong Wu, Tengfei Wang, Gordon Wetzstein, Dahua Lin, Ziwei Liu
Title: 3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models
Abstract:
3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated platform in a battle manner. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimension human preference dataset 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and a MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications. Project page is available at https://zyh482.github.io/3DGen-Bench/.
中文: 为解决三维生成领域缺乏人类偏好数据的问题,研究者开发了3DGen-Arena平台和3DGen-Bench数据集,并基于此构建了3DGen-Score与3DGen-Eval自动评估系统,实验证明其评估结果比现有指标更符合人类判断标准。
English: The authors introduce 3DGen-Arena and 3DGen-Bench to address the lack of human preference data in 3D generation, developing automated evaluators 3DGen-Score and 3DGen-Eval that demonstrate superior alignment with human judgment compared to existing metrics.

Authors:Xianglong He, Zi-Xin Zou, Chia-Hao Chen, Yuan-Chen Guo, Ding Liang, Chun Yuan, Wanli Ouyang, Yan-Pei Cao, Yangguang Li
Title: SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling
Abstract:
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^3$ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
中文: 本文提出SparseFlex稀疏结构等值面表示法,可直接通过渲染损失实现高分辨率可微分网格重建,在显著降低计算成本和内存消耗的同时达到了最先进的精度水平。
English: This paper introduces SparseFlex, a sparse-structured isosurface representation that enables high-resolution differentiable mesh reconstruction directly from rendering losses, achieving state-of-the-art accuracy with significantly reduced computational costs and memory usage.

Authors:Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang
Title: Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Abstract:
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models: it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, 24%, and +13%, respectively. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Experiments in real-world environments also confirm our superiority, again with fewer repeated searches and logical inconsistency cases.
中文摘要:Embodied Reasoner模型通过合成数千条观察-思考-行动轨迹并采用三阶段训练流程,将深度推理能力扩展到交互式具身任务中,在评估中显著超越了先进的视觉推理模型。
English Summary: The Embodied Reasoner model extends deep reasoning to interactive embodied tasks by synthesizing thousands of Observation-Thought-Action trajectories and employing a three-stage training pipeline, significantly outperforming advanced visual reasoning models in evaluations.

Authors:Zhiyuan Ma, Xinyue Liang, Rongyuan Wu, Xiangyu Zhu, Zhen Lei, Lei Zhang
Title: Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data
Abstract:
It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground-truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can accelerate the inference of the generation model to just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well for challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.
Chinese: 提出的渐进式渲染蒸馏(PRD)方法通过融合多视角扩散模型实现无需3D真值数据的文本到3D生成,其训练的TriplaneTurbo模型仅需1.2秒即可生成高质量3D网格且泛化能力强。
English: The proposed Progressive Rendering Distillation (PRD) method enables efficient text-to-3D generation by distilling multi-view diffusion models without requiring 3D ground-truth data, resulting in TriplaneTurbo that produces high-quality 3D meshes in just 1.2 seconds.

Authors:Yassir Lairgi
Title: When Astronomy Meets AI: Manazel For Crescent Visibility Prediction in Morocco
Abstract:
The accurate determination of the beginning of each Hijri month is essential for religious, cultural, and administrative purposes. Manazel (The code and datasets are available at https://github.com/lairgiyassir/manazel) addresses this challenge in Morocco by leveraging 13 years of crescent visibility data to refine the ODEH criterion, a widely used standard for lunar crescent visibility prediction. The study integrates two key features, the Arc of Vision (ARCV) and the total width of the crescent (W), to enhance the accuracy of lunar visibility assessments. A machine learning approach utilizing the Logistic Regression algorithm is employed to classify crescent visibility conditions, achieving a predictive accuracy of 98.83%. This data-driven methodology offers a robust and reliable framework for determining the start of the Hijri month, comparing different data classification tools, and improving the consistency of lunar calendar calculations in Morocco. The findings demonstrate the effectiveness of machine learning in astronomical applications and highlight the potential for further enhancements in the modeling of crescent visibility.
Chinese: Manazel项目通过机器学习分析13年新月可见性数据,优化了ODEH准则,以98.83%的预测准确率提升了摩洛哥伊斯兰历月初判定的精确度。
English: The Manazel project enhances the accuracy of determining Hijri months in Morocco by applying machine learning to 13 years of lunar data, achieving 98.83% predictive accuracy through refined visibility criteria.
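As an illustration only (not the Manazel code): a minimal scikit-learn sketch of the two-feature logistic-regression classifier the abstract describes, assuming a hypothetical CSV of observation records with ARCV, W, and a binary visibility label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical file layout: ARCV (arc of vision), W (crescent width), visible (0/1).
data = np.loadtxt("crescent_records.csv", delimiter=",", skiprows=1)
X, y = data[:, :2], data[:, 2]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.4f}")
```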

Authors:Jonathan Lee, Bolivar Solarte, Chin-Hsuan Wu, Jin-Cheng Jhang, Fu-En Wang, Yi-Hsuan Tsai, Min Sun
Title: uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images
Abstract:
We present uLayout, a unified model for estimating room layout geometries from both perspective and panoramic images, whereas traditional solutions require different model designs for each image type. The key idea of our solution is to unify both domains into the equirectangular projection, in particular by allocating perspective images to the most suitable latitude coordinate to effectively exploit both domains seamlessly. To address the Field-of-View (FoV) difference between the input domains, we design uLayout with a shared feature extractor and an extra 1D-convolution layer to condition each domain's input differently. This conditioning allows us to efficiently formulate a column-wise feature regression problem regardless of the input FoV. This simple yet effective approach achieves competitive performance with current state-of-the-art solutions and shows, for the first time, a single end-to-end model for both domains. Extensive experiments on the real-world datasets LSUN, Matterport3D, PanoContext, and Stanford 2D-3D evidence the contribution of our approach. Code is available at https://github.com/JonathanLee112/uLayout.
中文总结:uLayout是一种统一模型,通过将透视和全景图像转换为等距柱状投影,并采用共享特征提取器与领域特定调节,实现了对两种图像类型的房间布局几何的精确估计。
English Summary: uLayout is a unified model that accurately estimates room layouts from both perspective and panoramic images by converting them into equirectangular projections and using a shared feature extractor with domain-specific conditioning.

Authors:Hyunjun Lee, Hyunsoo Lee, Sookwan Han
Title: SyncSDE: A Probabilistic Framework for Diffusion Synchronization
Abstract:
There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. A prominent approach involves synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing methods rely on naive heuristics, such as averaging, without considering task specificity. These approaches do not clarify why such methods work and often produce suboptimal results when a heuristic suitable for one task is blindly applied to others. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works and reveal where heuristics should be focused: modeling correlations between multiple trajectories and adapting them to each specific task. We further identify optimal correlation models per task, achieving better results than previous approaches that apply a single heuristic across all tasks without justification.
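As an illustration only (not the paper's probabilistic derivation): a minimal sketch of the score-mixing step discussed above, where per-trajectory noise estimates are combined with task-specific correlation weights rather than a plain average; the weighting scheme shown is a placeholder.

```python
import torch

def mix_scores(eps_list, weights):
    """Combine per-trajectory noise/score estimates with task-specific weights.

    eps_list: list of (B, C, H, W) estimates, one per diffusion trajectory.
    weights: (K,) nonnegative tensor; a uniform vector recovers naive averaging.
    """
    eps = torch.stack(eps_list)                      # (K, B, C, H, W)
    w = weights / weights.sum()
    return (w.view(-1, 1, 1, 1, 1) * eps).sum(dim=0)

# Example: bias the synchronized update toward trajectory 0.
e0, e1 = torch.randn(2, 1, 4, 64, 64)
mixed = mix_scores([e0, e1], torch.tensor([0.8, 0.2]))
```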

Authors:Achint Soni, Meet Soni, Sirisha Rambhatla
Title: LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
Abstract:
Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and the background fidelity. Existing methods utilize masks derived from cross-attention maps generated from diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks. Code can be found at https://github.com/LOCATEdit/LOCATEdit/
中文: 本文提出LOCATEdit方法,通过基于图的自注意力关系增强交叉注意力图,在文本引导图像编辑中保持空间一致性并减少伪影,在PIE-Bench基准测试中实现了最先进的性能。
English: This paper introduces LOCATEdit, a novel text-guided image editing method that improves cross-attention maps using graph-based self-attention relationships to maintain spatial consistency and minimize artifacts, achieving state-of-the-art performance on PIE-Bench.
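As an illustration only (not the authors' exact objective): a small NumPy sketch of graph-Laplacian smoothing of a cross-attention map using self-attention-derived patch affinities, the general mechanism the abstract describes; the regularization weight and the clipping are assumptions.

```python
import numpy as np

def smooth_attention(cross_attn, self_attn, lam=10.0):
    """cross_attn: (N,) per-patch mass for the edited token; self_attn: (N, N) affinities."""
    W = 0.5 * (self_attn + self_attn.T)            # symmetrize patch affinities
    L = np.diag(W.sum(axis=1)) - W                  # unnormalized graph Laplacian
    # Tikhonov-style smoothing: stay close to the original map while varying
    # smoothly over patches that strongly attend to each other.
    m = np.linalg.solve(np.eye(len(cross_attn)) + lam * L, cross_attn)
    return np.clip(m, 0.0, 1.0)
```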

Authors:Shuming Liu, Chen Zhao, Tianqi Xu, Bernard Ghanem
Title: BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Abstract:
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies. First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and MLVU benchmark from 58.9% to 63.4%. Our code is available at https://github.com/sming256/BOLT.
中文: BOLT方法通过逆向变换采样策略优化视频语言模型的长视频分析能力,无需额外训练即可显著提升多个基准测试的准确率。
English: BOLT enhances large video-language models for long-form video analysis by employing inverse transform sampling for frame selection, boosting accuracy on benchmarks without requiring additional training.
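As an illustration only (not the released code): a minimal NumPy sketch of inverse transform sampling over query-frame similarity scores, the frame-selection strategy the abstract reports as most effective; the temperature and quantile placement are assumptions.

```python
import numpy as np

def select_frames(similarity, k, temperature=0.1):
    """similarity: (T,) query-frame scores; returns k frame indices."""
    probs = np.exp(similarity / temperature)
    probs /= probs.sum()
    cdf = np.cumsum(probs)
    # Evenly spaced quantiles pushed through the inverse CDF concentrate picks
    # where similarity mass is high while still covering the whole video.
    quantiles = (np.arange(k) + 0.5) / k
    idx = np.searchsorted(cdf, quantiles)
    return np.minimum(idx, len(similarity) - 1)
```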

Authors:Yuxiao Sun, Yao Zhao, Meiqin Liu, Chao Yao, Weisi Lin
Title: Embedding Compression Distortion in Video Coding for Machines
Abstract:
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our codes and supplementary materials are released in https://github.com/Ws-Syx/CDRE/.
中文: 提出的CDRE框架通过将压缩失真表征嵌入下游模型来增强机器视觉任务性能,有效弥补压缩过程中的信息损失,并以极低开销实现性能提升。
English: The proposed CDRE framework enhances machine vision tasks by embedding compression distortion representations into downstream models, effectively compensating for information loss and improving performance with minimal overhead.

Authors:Xiaoqin Wang, Xusen Ma, Xianxu Hou, Meidan Ding, Yudong Li, Junliang Chen, Wenting Chen, Xiaoyang Peng, Linlin Shen
Title: FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Abstract:
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on the structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we further develop a robust face perception MLLM baseline, Face-LLaVA, by training with our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that the existing MLLMs are far from satisfactory in understanding the fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.
中文: FaceBench是一个专门设计用于评估多模态大语言模型面部感知能力的数据集,通过分层属性和视觉问答对进行测试,所开发的Face-LLaVA模型在性能上显著优于现有开源模型,并与商业模型相媲美。
English: FaceBench is a specialized dataset introduced to evaluate multimodal large language models' face perception abilities through hierarchical attributes and visual question-answering pairs, with the developed Face-LLaVA model showing superior performance compared to existing models.

Authors:Lucas Nunes, Rodrigo Marcuzzi, Jens Behley, Cyrill Stachniss
Title: Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving
Abstract:
Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data remains a bottleneck in this development. To overcome this data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still, however, a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. These generative models have recently been applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or on decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels leads to an improvement in semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to provide additional training data that extends existing datasets, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.
Chinese: 本研究提出了一种无需依赖投影或解耦模型的新方法,能够生成逼真的三维语义场景尺度数据,不仅提升了数据合成质量,还通过将其作为训练数据有效提高了语义分割模型的性能。
English: This study introduces a novel method for generating realistic 3D semantic scene-scale data without relying on projections or decoupled models, enhancing both data synthesis quality and the performance of semantic segmentation models when used as training data.

Authors:Qiyu Dai, Xingyu Ni, Qianfan Shen, Wenzheng Chen, Baoquan Chen, Mengyu Chu
Title: RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting
Abstract:
We consider the problem of adding dynamic rain effects to in-the-wild scenes in a physically-correct manner. Recent advances in scene modeling have made significant progress, with NeRF and 3DGS techniques emerging as powerful tools for reconstructing complex scenes. However, while effective for novel view synthesis, these methods typically struggle with challenging scene editing tasks, such as physics-based rain simulation. In contrast, traditional physics-based simulations can generate realistic rain effects, such as raindrops and splashes, but they often rely on skilled artists to carefully set up high-fidelity scenes. This process lacks flexibility and scalability, limiting its applicability to broader, open-world environments. In this work, we introduce RainyGS, a novel approach that leverages the strengths of both physics-based modeling and 3DGS to generate photorealistic, dynamic rain effects in open-world scenes with physical accuracy. At the core of our method is the integration of physically-based raindrop and shallow water simulation techniques within the fast 3DGS rendering framework, enabling realistic and efficient simulations of raindrop behavior, splashes, and reflections. Our method supports synthesizing rain effects at over 30 fps, offering users flexible control over rain intensity -- from light drizzles to heavy downpours. We demonstrate that RainyGS performs effectively for both real-world outdoor scenes and large-scale driving scenarios, delivering more photorealistic and physically-accurate rain effects compared to state-of-the-art methods. Project page can be found at https://pku-vcl-geometry.github.io/RainyGS/

Authors:Erik Wallin, Fredrik Kahl, Lars Hammarstrand
Title: ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks
Abstract:
Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at https://github.com/walline/prohoc.
中文摘要:本文提出了一种在类别层次结构中检测并分类分布外样本的框架,能够将未知样本归类至层次结构的内部节点,同时保持对已知类别的叶节点准确预测,并在多个数据集上验证了方法的有效性。
English Summary: This paper introduces a framework that detects out-of-distribution (OOD) samples by classifying them into appropriate internal nodes of a class hierarchy, while maintaining accurate leaf-level predictions for in-distribution data, demonstrating effectiveness across multiple datasets.

Authors:Zhenxiang Ma, Zhenyu Yang, Miao Tao, Yuanzhen Zhou, Zeyu He, Yuchang Zhang, Rong Fu, Hengjie Li
Title: LandMarkSystem Technical Report
Abstract:
3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative algorithms. This comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction tasks. To facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository at: https://github.com/InternLandMark/LandMarkSystem.
Chinese: 本文提出LandMarkSystem这一新型计算框架,通过分布式计算和优化算子支持多种神经辐射场与3D高斯泼溅结构,有效提升多尺度三维场景重建与渲染效率,解决了现有深度学习框架在场景质量与规模方面的局限性。
English: This paper introduces LandMarkSystem, a novel computing framework that enhances multi-scale 3D scene reconstruction and rendering by supporting various Neural Radiance Fields and 3D Gaussian Splatting structures through distributed computing and optimized operators, addressing limitations in existing deep learning frameworks.

Authors:Yehui Shen, Lei Zhang, Qingqiu Li, Xiongwei Zhao, Yue Wang, Huimin Lu, Xieyuanli Chen
Title: UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation
Abstract:
Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at https://github.com/nubot-nudt/UGNA-VPR.
中文: 本文提出一种训练范式,通过不确定性估计和基于NeRF的数据增强从现有数据集中生成合成视图,显著提升了视觉位置识别在不同环境中的性能表现。
English: This paper introduces a training paradigm that enhances visual place recognition by using uncertainty estimation and NeRF-based data augmentation to generate synthetic views from existing datasets, significantly improving performance across various environments.

Authors:Zerui Chen, Rolandos Alexandros Potamias, Shizhe Chen, Cordelia Schmid
Title: HORT: Monocular Hand-held Objects Reconstruction with Transformers
Abstract:
Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming when converted into explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.

Authors:Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, Liqiang Nie
Title: FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.
Chinese: FineCIR提出了一种新颖的框架和细粒度数据集,解决了组合图像检索中粗粒度修改文本的局限性,通过精确捕捉详细修改意图显著提升了检索精度。
English: FineCIR introduces a novel framework and fine-grained datasets to address the limitations of coarse modification text in composed image retrieval, significantly improving retrieval accuracy by precisely capturing detailed modification intents.

Authors:Hanyue Tu, Siqi Wu, Li Li, Wengang Zhou, Houqiang Li
Title: Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression
Abstract:
Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at https://github.com/hytu99/MSINN-VRLIC.
中文摘要:本文提出了一种基于多尺度可逆神经网络的变速率图像压缩模型,通过双射映射克服了自编码器的局限性,在广泛码率范围内以单一模型实现了优于现有方法的性能,尤其在高速率下首次超越VVC标准。
English Summary: This paper introduces a variable-rate image compression model using a multi-scale invertible neural network that overcomes autoencoder limitations by bijectively mapping images into latent representations, achieving state-of-the-art performance across a wide bit rate range with a single model.
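As an illustration only (not the paper's multi-scale network): a generic affine coupling layer in PyTorch, the standard invertible building block behind bijective mappings of the kind described above; the channel split, hidden width, and single-scale setup are assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible step: half the channels predict a scale/shift for the other half."""
    def __init__(self, channels, hidden=64):   # channels assumed even
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        log_s, t = self.net(a).chunk(2, dim=1)
        return torch.cat([a, b * log_s.exp() + t], dim=1)

    def inverse(self, y):                       # exact inverse, no information loss
        a, b = y.chunk(2, dim=1)
        log_s, t = self.net(a).chunk(2, dim=1)
        return torch.cat([a, (b - t) * (-log_s).exp()], dim=1)
```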

Authors:Wencheng Han, Dongqian Guo, Xiao Chen, Pang Lyu, Yi Jin, Jianbing Shen
Title: Reducing CT Metal Artifacts by Learning Latent Space Alignment with Gemstone Spectral Imaging Data
Abstract:
Metal artifacts in CT slices have long posed challenges in medical diagnostics. These artifacts degrade image quality, resulting in suboptimal visualization and complicating the accurate interpretation of tissues adjacent to metal implants. To address these issues, we introduce the Latent Gemstone Spectral Imaging (GSI) Alignment Framework, which effectively reduces metal artifacts while avoiding the introduction of noise information. Our work is based on a key finding that even artifact-affected ordinary CT sequences contain sufficient information to discern detailed structures. The challenge lies in the inability to clearly represent this information. To address this issue, we developed an Alignment Framework that adjusts the representation of ordinary CT images to match GSI CT sequences. GSI is an advanced imaging technique using multiple energy levels to mitigate artifacts caused by metal implants. By aligning the representation to GSI data, we can effectively suppress metal artifacts while clearly revealing detailed structure, without introducing extraneous information into CT sequences. To facilitate the application, we propose a new dataset, Artifacts-GSI, captured from real patients with metal implants, and establish a new benchmark based on this dataset. Experimental results show that our method significantly reduces metal artifacts and greatly enhances the readability of CT slices. All our code and data are available at: https://um-lab.github.io/GSI-MAR/
中文摘要:潜在宝石光谱成像对齐框架通过将普通CT图像表征与先进GSI序列对齐,有效减少CT扫描中的金属伪影,显著提升图像清晰度且不引入额外噪声。
English Summary: The Latent Gemstone Spectral Imaging Alignment Framework effectively reduces metal artifacts in CT scans by aligning ordinary CT image representations with advanced GSI sequences, significantly improving image clarity without introducing noise.

Authors:Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao
Title: Vision-to-Music Generation: A Survey
Abstract:
Vision-to-music Generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion on vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow latest works and foster further innovation in this field, we are continuously maintaining a GitHub repository at https://github.com/wzk1015/Awesome-Vision-to-Music-Generation.
中文: 本文系统综述了视觉到音乐的生成领域,通过分析技术特征、方法体系、数据集与评估指标,指出当前挑战与未来方向,旨在推动多模态人工智能应用的创新发展。
English: This paper systematically reviews the vision-to-music generation field, analyzing technical characteristics, methodologies, datasets, and evaluation metrics while identifying current challenges and future directions to inspire innovation in multimodal AI applications.

Authors:Haoyu Zhao, Zhongang Qi, Cong Wang, Qingping Zheng, Guansong Lu, Fei Chen, Hang Xu, Zuxuan Wu
Title: DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
Abstract:
With diffusion transformer (DiT) excelling in video generation, its use in specific tasks has drawn increasing attention. However, adapting DiT for pose-guided human image animation faces two core challenges: (a) existing U-Net-based pose control methods may be suboptimal for the DiT backbone; and (b) removing text guidance, as in previous approaches, often leads to semantic loss and model degradation. To address these issues, we propose DynamiCtrl, a novel framework for human animation in video DiT architecture. Specifically, we use a shared VAE encoder for human images and driving poses, unifying them into a common latent space, maintaining pose fidelity, and eliminating the need for an expert pose encoder during video denoising. To integrate pose control into the DiT backbone effectively, we propose a novel Pose-adaptive Layer Norm model. It injects normalized pose features into the denoising process via conditioning on visual tokens, enabling seamless and scalable pose control across DiT blocks. Furthermore, to overcome the shortcomings of text removal, we introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context. Through full-attention blocks, image and pose features are aligned with text features, enhancing semantic consistency, leveraging pretrained knowledge, and enabling multi-level control. Experiments verify the superiority of DynamiCtrl on benchmark and self-collected data (e.g., achieving the best LPIPS of 0.166), demonstrating strong character control and high-quality synthesis. The project page is available at https://gulucaptain.github.io/DynamiCtrl/.
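As an illustration only (not the paper's implementation): a sketch of an adaptive layer norm whose scale and shift are predicted from pose features, the general conditioning pattern that a Pose-adaptive Layer Norm follows; the module name, zero initialization, and shapes are assumptions.

```python
import torch
import torch.nn as nn

class PoseAdaptiveLayerNorm(nn.Module):
    def __init__(self, dim, pose_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(pose_dim, 2 * dim)
        nn.init.zeros_(self.to_scale_shift.weight)   # start as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, tokens, pose_feat):
        # tokens: (B, N, dim) visual tokens; pose_feat: (B, pose_dim) pooled pose features
        scale, shift = self.to_scale_shift(pose_feat).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```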

Authors:Bashar Tahir, Philipp Svoboda, Markus Rupp
Title: PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication
Abstract:
Integrated sensing and communication (ISAC) is envisioned to be one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of the various objects within it. This is typically of a high-dimensional nature, which leads to significant computational complexity if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high-dimensional input is converted into lower dimensionality without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; and an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality while operating with low complexity and maintaining super-resolution.
中文: PLAIN是一种基于张量的估计架构,通过数据压缩、解耦参数估计和结果融合,高效处理ISAC中的高维感知问题,实现了低复杂度和超分辨率性能。
English: PLAIN is a tensor-based estimation architecture that efficiently handles high-dimensional sensing in ISAC by compressing data, decoupling parameter estimation, and fusing results, achieving low complexity and super-resolution performance.

Authors:Yimin Xu, Fan Yang, Bin Xu
Title: DSU-Net:An Improved U-Net Model Based on DINOv2 and SAM2 with Multi-scale Cross-model Feature Enhancement
Abstract:
Despite the significant advancements in general image segmentation achieved by large-scale pre-trained foundation models (such as Meta's Segment Anything Model (SAM) series and DINOv2), their performance in specialized fields remains limited by two critical issues: the excessive training costs due to large model parameters, and the insufficient ability to represent specific domain characteristics. This paper proposes a multi-scale feature collaboration framework guided by DINOv2 for SAM2, with core innovations in three aspects: (1) Establishing a feature collaboration mechanism between DINOv2 and SAM2 backbones, where high-dimensional semantic features extracted by the self-supervised model guide multi-scale feature fusion; (2) Designing lightweight adapter modules and cross-modal, cross-layer feature fusion units to inject cross-domain knowledge while freezing the base model parameters; (3) Constructing a U-shaped network structure based on U-Net, which utilizes attention mechanisms to achieve adaptive aggregation decoding of multi-granularity features. This framework surpasses existing state-of-the-art methods in downstream tasks such as camouflage target detection and salient object detection, without requiring costly training processes. It provides a technical pathway for efficient deployment of visual image segmentation, demonstrating significant application value in a wide range of downstream tasks and specialized fields within image segmentation. Project page: https://github.com/CheneyXuYiMin/SAM2DINO-Seg
中文: 本文提出了一种由DINOv2引导的SAM2多尺度特征协作框架,通过轻量级适配器和注意力机制注入跨领域知识,无需昂贵训练即可在专业图像分割任务中实现领先性能。
English: This paper introduces a multi-scale feature collaboration framework guided by DINOv2 for SAM2, which enhances specialized image segmentation by integrating cross-domain knowledge through lightweight adapters and attention mechanisms, achieving state-of-the-art performance without costly training.

Authors:Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Yifei Chen, Shuangli Du
Title: VADMamba: Exploring State Space Models for Fast Video Anomaly Detection
Abstract:
Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based, achieving impressive results, but the focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and showcases the great potential for long-range modeling. Our study pioneers the application of Mamba to VAD, dubbed VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct corresponding optical flows, further boosting accuracy through a clip-level fusion evaluation strategy. Experimental results validate the efficacy of the proposed VADMamba across three benchmark datasets, demonstrating superior performance in inference speed compared to previous work. Code is available at https://github.com/jLooo/VADMamba.
中文: 本研究提出VADMamba这一新型视频异常检测方法,通过基于Mamba模型的帧预测与光流重建双任务架构,在多个基准测试中实现了推理速度与检测精度的双重提升。
English: The study introduces VADMamba, a novel video anomaly detection method that utilizes Mamba-based models for frame prediction and optical flow reconstruction, achieving enhanced inference speed and accuracy across multiple benchmarks.

Authors:Jinwei Qi, Chaonan Ji, Sheng Xu, Peng Zhang, Bang Zhang, Liefeng Bo
Title: ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
Abstract:
Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of the following two stages. The first stage involves efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio inputs, which can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage aims to generate portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video at up to 512x768 resolution and 30 fps on a 4090 GPU, enabling interactive video chat in real time. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.
Chinese: 本文提出了一种新颖的实时风格化人像视频生成框架,通过采用包含分层运动扩散模型和显式手势控制信号的两阶段方法,解决了头部与身体运动同步及表情控制的难题,实现了30fps下富有表现力的上半身视频聊天。
English: This paper introduces a novel framework for stylized real-time portrait video generation that overcomes limitations in synchronizing head and body movements and controlling facial expressions by employing a two-stage process with hierarchical motion diffusion models and explicit hand control signals, enabling expressive upper-body video chat at 30fps.

Authors:Junjie Chen, Weilong Chen, Yifan Zuo, Yuming Fang
Title: Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation
Abstract:
Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design an FGSA mining module based on the deformable attention mechanism. On the one hand, we mine fine-grained features by applying a deformable attention head over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of the above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which provides richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on the large-scale MP-100 dataset, and outperform the SOTA method dramatically (+3.2% PCK@0.05). Code is available at https://github.com/chenbys/FMMP.
中文: 本文提出了一种新颖框架,通过可变形注意力机制从支持和查询图像中循环挖掘细粒度结构感知特征,在姿态估计任务上显著超越了现有方法。
English: This paper introduces a novel framework that recurrently mines fine-grained and structure-aware features from both support and query images using a deformable attention mechanism, significantly outperforming existing methods on pose estimation tasks.
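As an illustration only (not the authors' code): a small NumPy sketch of mixup-style keypoint padding, replacing zero padding with convex combinations of real keypoints as proposed above; the Beta parameter and array layout are assumptions.

```python
import numpy as np

def mixup_pad(keypoints, target_k, alpha=1.0, rng=np.random.default_rng(0)):
    """Pad a (K, 2) keypoint array to target_k rows with mixup keypoints (K >= 2)."""
    pads = []
    for _ in range(target_k - len(keypoints)):
        i, j = rng.choice(len(keypoints), size=2, replace=False)
        lam = rng.beta(alpha, alpha)
        # A convex combination of two real keypoints still lies on the object,
        # so the padded slot carries plausible supervision instead of zeros.
        pads.append(lam * keypoints[i] + (1 - lam) * keypoints[j])
    return np.vstack([keypoints] + pads) if pads else keypoints
```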

Authors:Jiajie Quan, Ao Tong, Yuxuan Cai, Xinwei He, Yulong Wang, Yang Zhou
Title: Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection
Abstract:
In multi-class unsupervised anomaly detection (MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known "learning shortcut" issue when decoders fail to capture normal patterns and reconstruct both normal and abnormal samples naively. To address this, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at https://github.com/easyoo/Omni-AD.git
中文摘要:提出的Omni-AD框架通过采用双分支解码器,分别使用序列化自注意力机制和深度可分离卷积来全面学习全局与局部正常模式,有效解决了多类无监督异常检测中的学习捷径问题,在基准测试中实现了最优性能。
English summary: The proposed Omni-AD framework addresses learning shortcuts in multi-class unsupervised anomaly detection by employing a two-branch decoder that comprehensively captures both global and local normal patterns through serialized self-attention and depth-separable convolutions, achieving state-of-the-art performance on benchmarks.
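As an illustration only (not the released Omni-block): a rough PyTorch sketch of a two-branch block pairing a learnable-token attention branch with a depthwise-separable convolution branch, mirroring the global/local split described above; the token count, weight sharing between the two attention calls, and fusion by addition are assumptions.

```python
import torch
import torch.nn as nn

class OmniBlockSketch(nn.Module):
    def __init__(self, dim, num_tokens=64, heads=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise
            nn.Conv2d(dim, dim, 1),                          # pointwise
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)                   # (B, HW, C)
        tokens = self.tokens.expand(b, -1, -1)
        # Global branch: learnable tokens query the features, then the features
        # read the compact summary back (key/value replaced by tokens).
        summary, _ = self.attn(tokens, seq, seq)
        glob, _ = self.attn(seq, summary, summary)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return glob + self.local(x)                          # global + local fusion
```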

Authors:Yun Zhu, Le Hui, Hang Yang, Jianjun Qian, Jin Xie, Jian Yang
Title: Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection
Abstract:
Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at https://github.com/zyrant/CPDet3D.
中文摘要:本文提出了一种统一的稀疏监督三维物体检测方法,通过类别原型学习有效利用未标记物体,在室内外场景中仅需每场景单个标注即可实现接近全监督方法的性能。
English Summary: This paper introduces a unified sparse supervised 3D object detection method that utilizes class prototypes to effectively mine unlabeled objects for both indoor and outdoor scenes, achieving state-of-the-art performance with minimal labeled data.
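As an illustration only (not the paper's mining module): a simplified Sinkhorn-style optimal transport matching between unlabeled features and class prototypes, keeping only high-confidence assignments as pseudo labels; the cosine cost, entropic regularization, uniform marginals, and threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def sinkhorn_assign(features, prototypes, eps=0.05, iters=50, conf_thresh=0.8):
    """features: (N, D) unlabeled object features; prototypes: (C, D) class prototypes."""
    cost = 1.0 - F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).T
    K = torch.exp(-cost / eps)
    u = torch.full((K.shape[0],), 1.0 / K.shape[0])          # uniform marginals
    v = torch.full((K.shape[1],), 1.0 / K.shape[1])
    r, c = u.clone(), v.clone()
    for _ in range(iters):                                    # Sinkhorn iterations
        r = u / (K @ c)
        c = v / (K.T @ r)
    plan = r[:, None] * K * c[None, :]                        # (N, C) transport plan
    conf, labels = (plan / plan.sum(dim=1, keepdim=True)).max(dim=1)
    keep = conf > conf_thresh                                 # mine confident objects only
    return labels[keep], keep
```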

Authors:Haoming Xu, Shuxun Wang, Yanqiu Zhao, Yi Zhong, Ziyan Jiang, Ningyuan Zhao, Shumin Deng, Huajun Chen, Ningyu Zhang
Title: ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
Abstract:
This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
中文: ZJUKLAB团队针对SemEval-2025任务四提出了基于模型融合的遗忘系统,在26支队伍中荣获第二名,其方法在有效消除敏感内容的同时,揭示了当前评估指标的不足,并呼吁建立更全面的评估体系。
English: The ZJUKLAB team introduced a model merging-based unlearning system for SemEval-2025 Task 4, achieving second place by effectively removing sensitive content while highlighting limitations in current evaluation metrics and calling for improved assessment methods.
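As an illustration only (not the team's submission): a simplified sketch of the TIES-Merging idea the system builds on, which trims task vectors, elects a per-parameter sign, and averages only the agreeing entries; the density value, float-only state dicts, and helper name are assumptions.

```python
import torch

def ties_merge(base, finetuned, density=0.2):
    """base: state dict; finetuned: list of state dicts fine-tuned from base."""
    merged = {}
    for name, base_p in base.items():
        # Task vectors: what each specialized model changed relative to the base.
        deltas = torch.stack([ft[name] - base_p for ft in finetuned])
        k = max(1, int(density * base_p.numel()))
        for d in deltas:                                  # trim small-magnitude entries
            thresh = d.abs().flatten().kthvalue(d.numel() - k + 1).values
            d[d.abs() < thresh] = 0.0
        sign = torch.sign(deltas.sum(dim=0))              # elect a per-parameter sign
        agree = (torch.sign(deltas) == sign) & (deltas != 0)
        merged_delta = (deltas * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
        merged[name] = base_p + merged_delta              # merge only agreeing entries
    return merged
```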

Authors:Yusong Hu, Zichen Liang, Fei Yang, Qibin Hou, Xialei Liu, Ming-Ming Cheng
Title: KAC: Kolmogorov-Arnold Classifier for Continual Learning
Abstract:
Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning. The code is available at https://github.com/Ethanhuhuhu/KAC.
中文: 本文提出基于KAN结构的Kolmogorov-Arnold分类器(KAC),通过采用样条函数和径向基函数增强稳定性与兼容性,在持续学习中替代线性分类器,在多项基准测试中均表现出性能提升和鲁棒性优势。
English: This paper introduces the Kolmogorov-Arnold Classifier (KAC), a novel classifier based on KAN structure that replaces linear classifiers in continual learning, demonstrating improved performance and robustness across various benchmarks through enhanced stability and compatibility with spline and radial basis functions.
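As an illustration only (not the KAC implementation): a sketch of an RBF-based, KAN-style classification head in PyTorch, where each class score sums learnable univariate functions of the input features; the shared centers, single bandwidth, and initialization are assumptions.

```python
import torch
import torch.nn as nn

class RBFKANHead(nn.Module):
    def __init__(self, in_dim, num_classes, num_centers=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-2, 2, num_centers))
        self.log_gamma = nn.Parameter(torch.zeros(1))
        self.weight = nn.Parameter(0.02 * torch.randn(num_classes, in_dim, num_centers))

    def forward(self, x):                                   # x: (B, in_dim)
        gamma = self.log_gamma.exp()
        # Radial basis responses of every feature at every center: (B, in_dim, num_centers)
        phi = torch.exp(-gamma * (x.unsqueeze(-1) - self.centers) ** 2)
        # Each class score is a sum of learned univariate functions of the features.
        return torch.einsum("bic,oic->bo", phi, self.weight)
```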

Authors:Ooha Lakkadi Reddy
Title: Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems
Abstract:
This thesis employs a hybrid CNN-Transformer architecture, alongside a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (0.635) than to the Bronze Age Proto-Cuneiform (0.102) or Proto-Elamite (0.078). Contrary to expectations, when measured through direct script-to-script embedding comparisons, the Indus script maps closer to Tibetan-Yi Corridor scripts with a mean cosine similarity of 0.930 (CI: [0.917, 0.942]) than to contemporaneous West Asian signaries, which recorded mean similarities of 0.887 (CI: [0.863, 0.911]) and 0.855 (CI: [0.818, 0.891]). Across dimensionality reduction and clustering methods, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. These computational findings align with observed pictorial parallels in numeral systems, gender markers, and iconographic elements. Archaeological evidence of contact networks along the ancient Shu-Shendu road, coinciding with the Indus Civilization's decline, provides a plausible transmission pathway. While alternate explanations cannot be ruled out, the specificity and consistency of similarities suggest more complex cultural transmission networks between South and East Asia than previously recognized.
中文: 本研究采用混合CNN-Transformer模型,发现印度河文字与藏彝走廊文字在视觉形态上存在显著相似性,其关联强度远超同期西亚文字,暗示了沿古代通道可能存在文化传播网络。
English: This study uses a hybrid CNN-Transformer model to reveal that the Indus Valley script shares significantly stronger visual and structural similarities with Tibetan-Yi Corridor scripts than with contemporaneous West Asian scripts, suggesting potential historical cultural transmission along ancient pathways.
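The similarity figures quoted above are mean cosine similarities between learned script embeddings. The snippet below shows, with random placeholder embeddings, how such a mean pairwise cosine similarity can be computed; the actual embeddings in the thesis come from the trained CNN-Transformer ensemble.

```python
import numpy as np

def mean_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean pairwise cosine similarity between two sets of embeddings (rows = vectors)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

rng = np.random.default_rng(0)
indus = rng.normal(size=(50, 256))        # placeholder script embeddings
tibetan_yi = rng.normal(size=(60, 256))   # placeholder script embeddings
print(mean_cosine_similarity(indus, tibetan_yi))
```

A confidence interval like those reported can then be obtained by bootstrapping over the embedding pairs.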

Authors:Judy X Yang, Jing Wang, Zhuanfeng Li, Chenhong Sui, Zekun Long, Jun Zhou
Title: HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR Fusion
Abstract:
The integration of hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR) data provides complementary spectral and spatial information for remote sensing applications. While previous studies have explored the role of band selection and grouping in HSI classification, little attention has been given to how the spectral sequence or band order affects classification outcomes when fused with LiDAR. In this work, we systematically investigate the influence of band order on HSI-LiDAR fusion performance. Through extensive experiments, we demonstrate that band order significantly impacts classification accuracy, revealing a previously overlooked factor in fusion-based models. Motivated by this observation, we propose a novel fusion architecture that not only integrates HSI and LiDAR data but also learns from multiple band order configurations. The proposed method enhances feature representation by adaptively fusing different spectral sequences, leading to improved classification accuracy. Experimental results on the Houston 2013 and Trento datasets show that our approach outperforms state-of-the-art fusion models. Data and code are available at https://github.com/Judyxyang/HSLiNets.
中文: 本研究揭示了在高光谱与激光雷达数据融合中,光谱波段顺序对分类精度具有显著影响,并提出了一种新型融合架构,通过自适应整合多种波段序列来增强特征表征能力,从而超越现有模型的性能。
English: This study reveals that the order of spectral bands significantly influences classification accuracy in hyperspectral and LiDAR data fusion, leading to the development of a novel architecture that adaptively integrates multiple band sequences to enhance feature representation and outperform existing models.
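Since the paper's central observation is that the spectral band order itself matters, a minimal way to construct the multiple band-order views it fuses is to reindex the spectral axis of the HSI cube. The sketch below uses placeholder data and arbitrary orderings; the specific orderings studied in the paper may differ.

```python
import numpy as np

def reorder_bands(hsi: np.ndarray, order: np.ndarray) -> np.ndarray:
    """Reindex the spectral dimension of an HSI cube shaped (H, W, Bands)."""
    return hsi[..., order]

rng = np.random.default_rng(0)
hsi = rng.random((64, 64, 144))                            # placeholder Houston-like cube
orders = [np.arange(144), np.arange(144)[::-1], rng.permutation(144)]
views = np.stack([reorder_bands(hsi, o) for o in orders])  # (num_orders, H, W, Bands)
```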

Authors:Venkat Adithya Amula, Sunayana Samavedam, Saurabh Saini, Avani Gupta, Narayanan P J
Title: Prototype Guided Backdoor Defense
Abstract:
Deep learning models are susceptible to backdoor attacks, in which malicious attackers perturb a small subset of training data with a trigger to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements toward the trigger. This is done using a novel sanitization loss in a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images. Project page: https://venkatadithya9.github.io/pgbd.github.io/.
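The abstract describes penalizing activation displacements that move toward the trigger. As a rough, illustrative sketch (not the paper's actual sanitization loss), one can project the fine-tuning displacement of activations onto an estimated trigger direction and penalize only the positive component; how the trigger direction is estimated is an assumption left out here.

```python
import torch
import torch.nn.functional as F

def sanitization_loss(act_current: torch.Tensor,
                      act_reference: torch.Tensor,
                      trigger_direction: torch.Tensor) -> torch.Tensor:
    """Penalize activation movement along an (assumed, pre-estimated) trigger direction.
    act_*: (batch, dim); trigger_direction: (dim,)."""
    displacement = act_current - act_reference        # movement during post-hoc fine-tuning
    direction = F.normalize(trigger_direction, dim=0)
    projection = displacement @ direction             # signed movement toward the trigger
    return F.relu(projection).mean()                  # penalize only movement toward it

loss = sanitization_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(512))
```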

Authors:Amaya Gallagher-Syed, Henry Senior, Omnia Alwazzan, Elena Pontarini, Michele Bombardieri, Costantino Pitzalis, Myles J. Lewis, Michael R. Barnes, Luca Rossi, Gregory Slabaugh
Title: BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology
Abstract:
The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry (IHC) analysis. We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis and Sjogren's Disease multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores, which permit measuring model alignment with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Source code and documentation can be found at: https://github.com/AmayaGS/BioX-CPath.
Chinese: BioX-CPath是一种可解释的图神经网络,通过引入染色感知注意力池化模块,在全切片图像分类中实现最优性能,同时提供生物学可解释性见解,适用于临床诊断。
English: BioX-CPath is an explainable graph neural network for whole slide image classification that introduces a stain-aware attention pooling module, achieving state-of-the-art performance while providing biologically interpretable insights for clinical applications.
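To make the pooling idea concrete, the sketch below is an illustrative (not the released) stain-aware attention pooling: patch features are attention-pooled separately per stain and the per-stain embeddings are concatenated into a patient embedding. Dimensions and the per-stain linear scorers are placeholder assumptions.

```python
import torch
import torch.nn as nn

class StainAwareAttentionPooling(nn.Module):
    """Illustrative stain-aware pooling: attention-pool patches within each stain,
    then concatenate the per-stain embeddings."""
    def __init__(self, dim: int, num_stains: int):
        super().__init__()
        self.attn = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_stains)])
        self.num_stains = num_stains

    def forward(self, feats: torch.Tensor, stain_ids: torch.Tensor) -> torch.Tensor:
        # feats: (num_patches, dim); stain_ids: (num_patches,) with values in [0, num_stains)
        pooled = []
        for s in range(self.num_stains):
            f = feats[stain_ids == s]
            if f.numel() == 0:                          # stain absent for this patient
                pooled.append(feats.new_zeros(feats.size(1)))
                continue
            w = torch.softmax(self.attn[s](f), dim=0)   # per-stain attention over patches
            pooled.append((w * f).sum(dim=0))
        return torch.cat(pooled)                        # stain-aware patient embedding

pool = StainAwareAttentionPooling(dim=128, num_stains=3)
embedding = pool(torch.randn(500, 128), torch.randint(0, 3, (500,)))   # (384,)
```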

Authors:Xiaoming Qi, Jingyang Zhang, Huazhu Fu, Guanyu Yang, Shuo Li, Yueming Jin
Title: Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning
Abstract:
Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all-involved tasks. However, they are challenged by: (1) Catastrophic forgetting for previously learned tasks, leading to error accumulation in the server model, making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It facilitates collaborative learning under the distinct and dynamic task streams across clients. To alleviate the catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on the similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at https://github.com/jinlab-imvr/FedDAH.
中文: 联邦持续学习在动态医疗场景下面临灾难性遗忘和优化偏差的挑战,FedDAH方法通过动态分配超网络和自适应模型重校准,有效提升了跨客户端协同学习能力。
English: Federated continual learning (FCL) faces challenges of catastrophic forgetting and biased optimization in dynamic medical scenarios, which the proposed FedDAH method addresses through a dynamic allocation hypernetwork and adaptive model recalibration to enhance collaborative learning across clients.
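The core of DAHyper is a hypernetwork that maps task identities to model parameters. The toy sketch below shows the general shape of such a mapping (task embedding in, flat parameter vector out); the architecture, sizes, and how the generated parameters are reshaped into a client model are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TaskHypernetwork(nn.Module):
    """Illustrative hypernetwork: a learned task embedding is mapped to the flat
    parameter vector of a small target model."""
    def __init__(self, num_tasks: int, task_dim: int, target_param_count: int):
        super().__init__()
        self.task_embeddings = nn.Embedding(num_tasks, task_dim)
        self.generator = nn.Sequential(
            nn.Linear(task_dim, 256), nn.ReLU(),
            nn.Linear(256, target_param_count),
        )

    def forward(self, task_id: torch.Tensor) -> torch.Tensor:
        return self.generator(self.task_embeddings(task_id))

# e.g. generate parameters for a 512->10 linear head (512 * 10 weights + 10 biases)
hyper = TaskHypernetwork(num_tasks=5, task_dim=64, target_param_count=512 * 10 + 10)
flat_params = hyper(torch.tensor([2]))   # (1, 5130) parameters allocated for task 2
```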

Authors:Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
Title: Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Abstract:
We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
中文摘要:Free4D提出了一种无需调优的框架,通过提取预训练基础模型从单张图像生成一致的4D场景,实现了无需昂贵训练的高效通用实时渲染能力。
English Summary: Free4D is a tuning-free framework that generates consistent 4D scenes from single images by distilling pre-trained foundation models, enabling efficient and generalizable real-time rendering without expensive training.

Authors:Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang
Title: Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Abstract:
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html
Chinese: 本文提出了零样本音视频编辑任务,构建了AvED-Bench评估数据集,并开发了AvED框架来实现无需训练的跨模态同步编辑。
English: This paper introduces zero-shot audio-video editing, creates the AvED-Bench evaluation dataset, and proposes the AvED framework to achieve synchronized cross-modal edits without training.

Authors:Yulu Pan, Ce Zhang, Gedas Bertasius
Title: BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
Abstract:
We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains 4,477 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. Dataset and code are available at https://github.com/yulupan00/BASKET.
中文:BASKET是一个大规模篮球视频数据集,包含来自全球32,232名球员的4,477小时视频,旨在通过20项精细篮球技能评估推动视频识别模型在长序列细粒度分析能力的发展。
English: BASKET is a comprehensive basketball video dataset featuring 4,477 hours of footage from 32,232 globally diverse players, designed to challenge video recognition models with fine-grained skill estimation across 20 specific basketball abilities.

Authors:Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang
Title: ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
Abstract:
Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available at https://github.com/zjunlp/EasyEdit/blob/main/examples/ADSEdit.md.
中文摘要:本研究提出知识编辑方法和ADS-Edit数据集,通过针对性修正模型行为来解决自动驾驶中交通知识误解与复杂路况问题,无需完整重新训练即可提升多模态大模型性能。
English Summary: This study introduces Knowledge Editing and the ADS-Edit dataset to enhance Large Multimodal Models for autonomous driving by addressing traffic knowledge gaps and complex road scenarios without full retraining.

Authors:Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Title: Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning
Abstract:
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks. Reason-RFT introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs), followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation. Experimental results demonstrate Reason-RFT's three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines. Project website: https://tanhuajie.github.io/ReasonRFT
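The second training phase uses GRPO, whose defining step is normalizing each sampled response's reward against the other responses in its group. The snippet below sketches that group-relative advantage computation with placeholder rewards; the actual reward design and policy update in Reason-RFT are not shown.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each response's reward by the mean and std
    of its group of responses to the same prompt. rewards: (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],   # placeholder rewards, one row per prompt
                        [0.0, 0.0, 1.0, 0.5]])
advantages = group_relative_advantages(rewards)
```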

Authors:Chen Tang, Xinzhu Ma, Encheng Su, Xiufeng Song, Xiaohong Liu, Wei-Hong Li, Lei Bai, Wanli Ouyang, Xiangyu Yue
Title: UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Abstract:
Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaption paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-expert adaptation by using fractional interpolation to relax the discrete variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
中文: UniSTD是一种基于Transformer的统一框架,通过任务无关的预训练和专业化联合训练,实现了跨任务、可扩展的时空学习,支持多领域应用。
English: UniSTD is a unified Transformer-based framework that leverages task-agnostic pretraining and specialized joint training to enable scalable, cross-task spatiotemporal learning across multiple domains.

Authors:Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, Shenlong Wang
Title: PhysGen3D: Crafting a Miniature Interactive World from a Single Image
Abstract:
Envisioning physically plausible outcomes from a single image requires a deep understanding of the world's dynamics. To address this, we introduce PhysGen3D, a novel framework that transforms a single image into an amodal, camera-centric, interactive 3D scene. By combining advanced image-based geometric and semantic understanding with physics-based simulation, PhysGen3D creates an interactive 3D world from a static image, enabling us to "imagine" and simulate future scenarios based on user input. At its core, PhysGen3D estimates 3D shapes, poses, physical and lighting properties of objects, thereby capturing essential physical attributes that drive realistic object interactions. This framework allows users to specify precise initial conditions, such as object speed or material properties, for enhanced control over generated video outcomes. We evaluate PhysGen3D's performance against closed-source state-of-the-art (SOTA) image-to-video models, including Pika, Kling, and Gen-3, showing PhysGen3D's capacity to generate videos with realistic physics while offering greater flexibility and fine-grained control. Our results show that PhysGen3D achieves a unique balance of photorealism, physical plausibility, and user-driven interactivity, opening new possibilities for generating dynamic, physics-grounded video from an image.

Authors:Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, Can Huang
Title: Vision as LoRA
Abstract:
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs. All training data, codes, and model weights will be released at https://github.com/Hon-Wong/VoRA.
Chinese: VoRA是一种创新方法,通过将视觉专用LoRA层直接集成到大型语言模型中,使其转变为多模态模型,实现了推理过程中参数的无缝融合和任意分辨率输入的处理能力。
English: VoRA is a novel approach that transforms a large language model into a multimodal model by integrating vision-specific LoRA layers directly into the LLM, enabling seamless parameter merging and flexible input resolution processing during inference.
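The "seamless merge at inference" property follows from the standard LoRA formulation, where the low-rank update can be folded into the base weight. The sketch below shows that generic merge for a single linear layer; the rank, scaling, and which layers VoRA actually adapts are not taken from the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(linear: nn.Linear, lora_A: torch.Tensor, lora_B: torch.Tensor,
               alpha: float, rank: int) -> None:
    """Fold a LoRA update into the base weight: W <- W + (alpha / rank) * B @ A.
    lora_A: (rank, in_features), lora_B: (out_features, rank)."""
    linear.weight.add_((alpha / rank) * (lora_B @ lora_A))

layer = nn.Linear(1024, 1024, bias=False)
rank = 16
lora_A = torch.randn(rank, 1024) * 0.01
lora_B = torch.zeros(1024, rank)    # conventional zero-init, so the merge starts as a no-op
merge_lora(layer, lora_A, lora_B, alpha=32.0, rank=rank)
```

After merging, the layer runs at the original cost, which is why the added vision parameters introduce no structural overhead at inference time.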

Authors:Jiepeng Wang, Zhaoqing Wang, Hao Pan, Yuan Liu, Dongdong Yu, Changhu Wang, Wenping Wang
Title: MMGen: Unified Multi-modal Image Generation and Understanding in One Go
Abstract:
A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.
Chinese: MMGen是一个统一的扩散框架,集成了多模态生成与理解任务,通过创新的扩散变换器和模态解耦策略,实现了无缝可控的跨模态应用。
English: MMGen is a unified diffusion framework that integrates multi-modal generation and understanding tasks, enabling seamless and controllable cross-modal applications through a novel diffusion transformer and modality-decoupling strategy.

Authors:Hao Fu, Hanbin Zhao, Jiahua Dong, Henghui Ding, Chao Zhang, Hui Qian
Title: IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting
Abstract:
Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Task Incremental Learning (MTIL) scenario in practice, where several classes and domains of multi-modal tasks arrive incrementally. Without access to previously seen tasks and unseen tasks, memory-constrained MTIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning techniques (PEFT), such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter settings (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MTIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) strategy enhances adaptation to new tasks while mitigating forgetting by adaptively assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. The source codes are available at https://github.com/FerdinandZJU/IAP.
Chinese: 本文提出了一种实例感知提示框架,通过自适应实例级策略优化视觉语言模型中的多领域任务增量学习提示设计,在增强新任务适应性的同时有效缓解遗忘问题。
English: This paper introduces an Instance-Aware Prompting (IAP) framework to optimize prompt designs for multi-domain task incremental learning in vision-language models, enhancing new task adaptation while mitigating forgetting through adaptive instance-level strategies.

Authors:Trung Duc Ha, Sidney Bender
Title: Diffusion Counterfactuals for Image Regressors
Abstract:
Counterfactual explanations have been successfully applied to create human interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel-space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and reveal spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel space counterfactuals are more sparse while latent space counterfactuals are of higher quality and allow bigger semantic changes.
中文摘要:针对图像回归任务,本文采用基于扩散的像素空间和潜在空间模型生成反事实解释,虽能提供对模型决策的可解释性洞察并揭示伪相关性,但在稀疏性和质量方面相比分类任务更具挑战性。
English Summary: Counterfactual explanations for image regression tasks are generated using diffusion-based models in pixel and latent spaces, producing realistic insights into model decisions while revealing challenges in achieving sparsity and quality compared to classification tasks.

Authors:Carlos Gomes, Benedikt Blumenstiel, Joao Lucas de Sousa Almeida, Pedro Henrique de Oliveira, Paolo Fraccaro, Francesc Marti Escofet, Daniela Szwarcman, Naomi Simumba, Romeo Kienzler, Bianca Zadrozny
Title: TerraTorch: The Geospatial Foundation Models Toolkit
Abstract:
TerraTorch is a fine-tuning and benchmarking toolkit for Geospatial Foundation Models built on PyTorch Lightning and tailored for satellite, weather, and climate data. It integrates domain-specific data modules, pre-defined tasks, and a modular model factory that pairs any backbone with diverse decoder heads. These components allow researchers and practitioners to fine-tune supported models in a no-code fashion by simply editing a training configuration. By consolidating best practices for model development and incorporating the automated hyperparameter optimization extension Iterate, TerraTorch reduces the expertise and time required to fine-tune or benchmark models on new Earth Observation use cases. Furthermore, TerraTorch directly integrates with GEO-Bench, allowing for systematic and reproducible benchmarking of Geospatial Foundation Models. TerraTorch is open sourced under Apache 2.0, available at https://github.com/IBM/terratorch, and can be installed via pip install terratorch.
中文: TerraTorch 是一个基于 PyTorch Lightning 的工具包,专为卫星、气象和气候数据设计,用于微调和基准测试地理空间基础模型,支持无代码定制和高效模型优化。
English: TerraTorch is a PyTorch Lightning-based toolkit designed for fine-tuning and benchmarking geospatial foundation models using satellite, weather, and climate data, enabling no-code customization and efficient model optimization.

Authors:Jinnan Chen, Lingting Zhu, Zeyu Hu, Shengju Qian, Yugang Chen, Xin Wang, Gim Hee Lee
Title: MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation
Abstract:
Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and there is a lack of efficient scaling strategies for higher resolution latent prediction. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in the continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).

Authors:Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Title: VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
Abstract:
Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.
中文: VPO是一个基于无害性、准确性和有益性三原则的提示优化框架,能忠实保留用户意图并显著提升生成视频的安全性和质量。
English: VPO is a principled framework that optimizes text prompts for video generation by ensuring harmlessness, accuracy, and helpfulness, significantly improving safety, alignment, and video quality across models.

Authors:Yingdong Shi, Changming Li, Yifan Wang, Yongxiang Zhao, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Title: Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
Abstract:
Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.

Authors:Ji Woo Hong, Tri Ton, Trung X. Pham, Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo
Title: ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On
Abstract:
This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex regions of the garment and provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances the preservation of fine details in highly salient garment regions, optimizing computational resources by avoiding unnecessary processing of the entire garment image. Comparative evaluations confirm that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.
中文: ITA-MDT框架采用自适应掩码扩散变换器进行虚拟试穿,通过动态特征聚合和针对性调节,在提升细节保留的同时优化了计算效率。
English: The ITA-MDT framework introduces an adaptive masked diffusion transformer for virtual try-on, enhancing detail preservation and efficiency through dynamic feature aggregation and targeted conditioning.

Authors:Hongda Liu, Longguang Wang, Weijun Guan, Ye Zhang, Yulan Guo
Title: Pluggable Style Representation Learning for Multi-Style Transfer
Abstract:
Due to the high diversity of image styles, scalability to various styles plays a critical role in real-world applications. To accommodate a large number of styles, previous multi-style transfer approaches rely on enlarging the model size, while arbitrary-style transfer methods utilize heavy backbones. However, the additional computational cost introduced by more model parameters hinders the deployment of these methods on resource-limited devices. To address this challenge, in this paper, we develop a style transfer framework by decoupling style modeling and transferring. Specifically, for style modeling, we propose a style representation learning scheme to encode the style information into a compact representation. Then, for style transferring, we develop a style-aware multi-style transfer network (SaMST) to adapt to diverse styles using pluggable style representations. In this way, our framework is able to accommodate diverse image styles in the learned style representations without introducing additional overhead during inference, thereby maintaining efficiency. Experiments show that our style representation can extract accurate style information. Moreover, qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance in terms of both accuracy and efficiency. The codes are available at https://github.com/The-Learning-And-Vision-Atelier-LAVA/SaMST.
Chinese: 本文提出了一种风格迁移框架,通过解耦风格建模与迁移过程,利用紧凑的风格表示和风格感知网络,在推理时不增加计算开销的前提下高效处理多样化的图像风格。
English: This paper introduces a style transfer framework that separates style modeling and transferring, using a compact style representation and a style-aware network to efficiently handle diverse image styles without increasing computational costs during inference.

Authors:Hao Ai, Kunyi Wang, Zezhou Wang, Hao Lu, Jin Tian, Yaxin Luo, Peng Xing, Jen-Yuan Huang, Huaxia Li, Gen Luo
Title: Dynamic Pyramid Network for Efficient Multimodal Large Language Model
Abstract:
Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit the real-world application. To address this issue, recent efforts aim to compress the visual features to save the computational costs of MLLMs. However, direct visual compression methods, e.g. efficient projectors, inevitably destroy the visual semantics in MLLM, especially in difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples will be assigned larger computations, thus preserving the model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Besides, the generalization ability of DPN is also validated on the existing high-resolution MLLM called LLaVA-HR. The source code will be released at https://github.com/aihao2000/DPN-LLaVA.
中文: 本文提出了一种动态金字塔网络(DPN),通过分层压缩多模态大语言模型中的视觉特征,在节省高达56%计算量的同时提升性能,并辅以动态池化专家机制根据输入难度自适应调整压缩率。
English: This paper introduces a dynamic pyramid network (DPN) that hierarchically compresses visual features in multimodal large language models to reduce computational costs by up to 56% while improving performance, complemented by a dynamic pooling experts mechanism that adapts compression rates based on input difficulty.
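The Dynamic Pooling Experts idea, choosing a visual compression rate per input, can be illustrated with a small gating module that scores a few candidate pooling rates and routes each sample accordingly. This is a loose sketch under assumed candidate rates and hard argmax routing, not the DPE module from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicPoolingGate(nn.Module):
    """Illustrative gate: pick a token-compression rate per sample from a candidate set,
    based on a pooled summary of the visual tokens."""
    def __init__(self, dim: int, candidate_rates=(1, 2, 4)):
        super().__init__()
        self.candidate_rates = candidate_rates
        self.scorer = nn.Linear(dim, len(candidate_rates))

    def forward(self, tokens: torch.Tensor) -> list:
        # tokens: (batch, num_tokens, dim); each sample may end up with a different length.
        choice = self.scorer(tokens.mean(dim=1)).argmax(dim=-1)
        outputs = []
        for i in range(tokens.size(0)):
            rate = self.candidate_rates[choice[i].item()]   # harder samples can keep rate 1
            pooled = F.avg_pool1d(tokens[i].t().unsqueeze(0), kernel_size=rate, stride=rate)
            outputs.append(pooled.squeeze(0).t())           # (num_tokens // rate, dim)
        return outputs

gate = DynamicPoolingGate(dim=256)
compressed = gate(torch.randn(2, 64, 256))
```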

Authors:Qian Wang, Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka
Title: EditCLIP: Representation Learning for Image Editing
Abstract:
We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.

Authors:Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
Title: Wan: Open and Advanced Large-Scale Video Generative Models
Abstract:
This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.
中文: Wan是一套开源视频生成模型套件,提供1.3B和14B两种参数规模的模型,在保持消费级效率的同时实现了业界领先的性能表现,并支持涵盖八大下游任务的全面应用场景。
English: Wan is an open-source video generation suite featuring 1.3B and 14B parameter models that achieve leading performance across multiple benchmarks while maintaining consumer-grade efficiency and supporting eight downstream applications.

Authors:Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Kim Sung-Bin, Suekyeong Nam, Tae-Hyun Oh
Title: Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
Abstract:
Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria -- Temporal Synchronization, Lip Readability, and Expressiveness -- are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We found that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements to the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Codes and datasets are available at https://perceptual-3d-talking-head.github.io/.
中文总结:本研究提出了一种语音-网格同步表示法,通过感知损失函数提升三维说话头生成在时间同步性、唇部可读性和表现力三个关键维度的性能。
English Summary: This research introduces a speech-mesh synchronized representation to enhance 3D talking head generation by improving temporal synchronization, lip readability, and expressiveness through a perceptual loss function.

Authors:Jianyang Zhang, Qianli Luo, Guowu Yang, Wenjing Yang, Weide Liu, Guosheng Lin, Fengmao Lv
Title: Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability
Abstract:
Language Bottleneck Models (LBMs) are proposed to achieve interpretable image recognition by classifying images based on textual concept bottlenecks. However, current LBMs simply list all concepts together as the bottleneck layer, leading to the spurious cue inference problem and an inability to generalize to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. In this way, ALBM can avoid the spurious cue inference problem by classifying solely based on the essential concepts of each class. In addition, the cross-class unified attribute set also ensures that the concept spaces of different classes have strong correlations, so the learned concept classifier can be easily generalized to unseen classes. Moreover, to further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes. Furthermore, to avoid labor-intensive concept annotation, we propose the Description, Summary, and Supplement (DSS) strategy to automatically generate high-quality concept sets with complete and precise attributes. Extensive experiments on 9 widely used few-shot benchmarks demonstrate the interpretability, transferability, and performance of our approach. The code and collected concept sets are available at https://github.com/tiggers23/ALBM.
Chinese: 提出的属性形成语言瓶颈模型(ALBM)将概念组织在特定类别的属性空间中,以避免虚假线索推断并提升对未见类别的泛化能力,同时引入视觉属性提示学习和自动概念生成策略,以增强可解释性并减少标注工作量。
English: The proposed Attribute-formed Language Bottleneck Model (ALBM) organizes concepts in a class-specific attribute space to avoid spurious cue inference and improve generalization to unseen classes, while also introducing Visual Attribute Prompt Learning and an automated concept generation strategy for enhanced interpretability and reduced annotation effort.

Authors:Chenwei Zhang, Khanh Dao Duc
Title: CryoSAMU: Enhancing 3D Cryo-EM Density Maps of Protein Structures at Intermediate Resolution with Structure-Aware Multimodal U-Nets
Abstract:
Enhancing cryogenic electron microscopy (cryo-EM) 3D density maps at intermediate resolution (4-8 Å) is crucial in protein structure determination. Recent advances in deep learning have led to the development of automated approaches for enhancing experimental cryo-EM density maps. Yet, these methods are not optimized for intermediate-resolution maps and rely on map density features alone. To address this, we propose CryoSAMU, a novel method designed to enhance 3D cryo-EM density maps of protein structures using structure-aware multimodal U-Nets and trained on curated intermediate-resolution density maps. We comprehensively evaluate CryoSAMU across various metrics and demonstrate its competitive performance compared to state-of-the-art methods. Notably, CryoSAMU achieves significantly faster processing speed, showing promise for future practical applications. Our code is available at https://github.com/chenwei-zhang/CryoSAMU.
中文: CryoSAMU是一种新型深度学习方法,通过结构感知多模态U-Net增强中分辨率冷冻电镜密度图,相比现有方法具有更优性能和显著更快的处理速度。
English: CryoSAMU is a novel deep learning method that enhances intermediate-resolution cryo-EM density maps using structure-aware multimodal U-Nets, achieving competitive performance and significantly faster processing speeds compared to existing approaches.

Authors:Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, Lei Zhang
Title: InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
Abstract:
Instruction-based video editing allows effective and interactive editing of videos using only instructions without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of low-resolution, short duration, and limited amount of source videos with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works. Codes are available at https://github.com/langmanbusi/InsViE.
中文:本研究提出了InsViE-1M高质量指令视频编辑数据集,通过精心设计的编辑过滤流程和多阶段训练策略,有效解决了现有数据集质量不足的问题,显著提升了模型的编辑性能。
English: This work introduces InsViE-1M, a high-quality dataset for instruction-based video editing that overcomes limitations of existing datasets through a rigorous editing-filtering pipeline and multi-stage training, significantly improving model performance.

Authors:Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
Title: ViLBench: A Suite for Vision-Language Process Reward Modeling
Abstract:
Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite its advantages, evaluation of PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, revealing that neither ORMs nor PRMs consistently outperform across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models -- by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
Chinese: 本文对视觉语言模型作为奖励模型进行了基准测试,发现其在不同任务中表现不一,并引入ViLBench评估过程奖励,表明即使先进模型也面临挑战,同时证明利用过程数据进行针对性训练可有效提升性能。
English: This paper benchmarks vision-language models as reward models, revealing inconsistent performance across tasks, and introduces ViLBench to evaluate process rewards, showing that even advanced models struggle while demonstrating that targeted training with process data can enhance performance.

Authors:Ziran Zhang, Xiaohui Li, Yihao Liu, Yujin Wang, Yueting Chen, Tianfan Xue, Shi Guo
Title: EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation
Abstract:
Video frame interpolation (VFI) in scenarios with large motion remains challenging due to motion ambiguity between frames. While event cameras can capture high temporal resolution motion information, existing event-based VFI methods struggle with limited training data and complex motion patterns. In this paper, we introduce Event-Guided Video Diffusion Model (EGVD), a novel framework that leverages the powerful priors of pre-trained stable video diffusion models alongside the precise temporal information from event cameras. Our approach features a Multi-modal Motion Condition Generator (MMCG) that effectively integrates RGB frames and event signals to guide the diffusion process, producing physically realistic intermediate frames. We employ a selective fine-tuning strategy that preserves spatial modeling capabilities while efficiently incorporating event-guided temporal information. We incorporate input-output normalization techniques inspired by recent advances in diffusion modeling to enhance training stability across varying noise levels. To improve generalization, we construct a comprehensive dataset combining both real and simulated event data across diverse scenarios. Extensive experiments on both real and simulated datasets demonstrate that EGVD significantly outperforms existing methods in handling large motion and challenging lighting conditions, achieving substantial improvements in perceptual quality metrics (27.4% better LPIPS on Prophesee and 24.1% on BSRGB) while maintaining competitive fidelity measures. Code and datasets available at: https://github.com/OpenImagingLab/EGVD.
中文: 事件引导视频扩散模型(EGVD)通过融合预训练视频扩散模型与事件相机数据,有效解决了大运动场景下的视频帧插值难题,在多种场景中实现了卓越的感知质量和泛化能力。
English: The Event-Guided Video Diffusion Model (EGVD) effectively integrates pre-trained video diffusion models with event camera data to address large motion challenges in video frame interpolation, achieving superior perceptual quality and generalization across diverse scenarios.

Authors:Taorui Wang, Zitong Yu, Yong Xu
Title: TC-GS: Tri-plane based compression for 3D Gaussian Splatting
Abstract:
Recently, 3D Gaussian Splatting (3DGS) has emerged as a prominent framework for novel view synthesis, providing high fidelity and rapid rendering speed. However, the substantial data volume of 3DGS and its attributes impede its practical utility, requiring compression techniques to reduce memory cost. Nevertheless, the unorganized shape of 3DGS leads to difficulties in compression. To formulate the unstructured attributes into a normative distribution, we propose a well-structured tri-plane to encode Gaussian attributes, leveraging the distribution of attributes for compression. To exploit the correlations among adjacent Gaussians, K-Nearest Neighbors (KNN) is used when decoding Gaussian distributions from the tri-plane. We also introduce Gaussian position information as a prior for the position-sensitive decoder. Additionally, we incorporate an adaptive wavelet loss, aiming to focus on the high-frequency details as iterations increase. Our approach achieves results that are comparable to or surpass those of state-of-the-art 3D Gaussian Splatting compression methods in extensive experiments across multiple datasets. The codes are released at https://github.com/timwang2001/TC-GS.
中文摘要:该研究提出了一种三平面编码方法,通过结构化高斯属性并利用KNN解码和位置先验来压缩3D高斯泼溅数据,在多个数据集上实现了领先的压缩性能。
English Summary: The study introduces a tri-plane encoding method for compressing 3D Gaussian Splatting data by structuring attributes and utilizing KNN decoding with positional priors, achieving state-of-the-art compression performance across multiple datasets.
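The following is a small sketch of the two ideas named in the abstract, tri-plane feature sampling and KNN aggregation over neighboring Gaussians; plane resolution, feature width, summation fusion, and the averaging rule are assumptions for illustration, not TC-GS's actual decoder.

```python
# Illustrative tri-plane sampling + KNN aggregation for Gaussian attributes (assumed design).
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """planes: three (1, C, R, R) feature maps (XY, XZ, YZ); xyz: (N, 3) in [-1, 1]."""
    feats = []
    for axes, plane in zip([[0, 1], [0, 2], [1, 2]], planes):
        grid = xyz[:, axes].view(1, -1, 1, 2)               # (1, N, 1, 2) sample locations
        f = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1) bilinear lookup
        feats.append(f.squeeze(-1).squeeze(0).t())          # (N, C)
    return sum(feats)                                        # fuse planes by summation (assumption)

def knn_aggregate(xyz, feats, k=8):
    """Average each Gaussian's features with its k nearest neighbours."""
    d = torch.cdist(xyz, xyz)                                # (N, N) pairwise distances
    idx = d.topk(k + 1, largest=False).indices[:, 1:]        # drop the point itself
    return feats[idx].mean(dim=1)

planes = [torch.randn(1, 32, 64, 64) for _ in range(3)]     # toy XY/XZ/YZ planes
xyz = torch.rand(1000, 3) * 2 - 1                            # Gaussian centres in [-1, 1]
attr = knn_aggregate(xyz, sample_triplane(planes, xyz))
print(attr.shape)  # torch.Size([1000, 32])
```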

Authors:Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Title: DINeMo: Learning Neural Mesh Models with no 3D Annotations
Abstract:
Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods depend heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondences by utilizing both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, which demonstrates its advantages over supervised learning methods that rely on 3D annotations. Our project page is available at https://analysis-by-synthesis.github.io/DINeMo/.
中文: DINeMo是一种新型神经网格模型,通过利用大型视觉基础模型生成的伪对应关系,无需3D标注即可实现卓越的零样本和少样本3D姿态估计性能,并能有效利用未标注数据进行扩展。
English: DINeMo is a novel neural mesh model that eliminates the need for 3D annotations by leveraging pseudo-correspondence from large visual foundation models, achieving superior performance in zero- and few-shot 3D pose estimation while effectively scaling with unlabeled data.
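As a rough illustration of bidirectional pseudo-correspondence, the sketch below keeps only mutual nearest-neighbour matches between two sets of foundation-model patch features; the cosine similarity, cycle-consistency filter, and shapes are assumptions and not DINeMo's exact procedure.

```python
# Mutual nearest-neighbour pseudo-correspondence between two patch-feature sets (sketch).
import torch
import torch.nn.functional as F

def mutual_nn_matches(feat_a, feat_b):
    """feat_a: (Na, D), feat_b: (Nb, D) patch features."""
    sim = F.normalize(feat_a, dim=-1) @ F.normalize(feat_b, dim=-1).t()   # (Na, Nb)
    a2b = sim.argmax(dim=1)                      # best match in B for each A patch
    b2a = sim.argmax(dim=0)                      # best match in A for each B patch
    keep = b2a[a2b] == torch.arange(feat_a.shape[0])  # cycle-consistent pairs only
    idx_a = torch.nonzero(keep).squeeze(1)
    return idx_a, a2b[idx_a]                     # (A index, B index) pseudo-correspondences

fa, fb = torch.randn(196, 768), torch.randn(196, 768)  # e.g. ViT patch tokens
ia, ib = mutual_nn_matches(fa, fb)
print(ia.shape, ib.shape)
```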

Authors:Haiyang Liu, Zhan Xu, Fa-Ting Hong, Hsin-Ping Huang, Yi Zhou, Yang Zhou
Title: Video Motion Graphs
Abstract:
We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and ii) adopts condition-progressive training to effectively leverage strong and weak identity conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectory. Results show that our Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. Project page can be found at https://h-liu1997.github.io/Video-Motion-Graphs/

Authors:Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, Changxing Ding
Title: Guiding Human-Object Interactions with Rich Geometry and Relations
Abstract:
Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs.

Authors:Naitik Jain, Amogh Joshi, Mason Earles
Title: iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2,959 Crop and Weed Species
Abstract:
Accurate identification of crop and weed species is critical for precision agriculture and sustainable farming. However, it remains a challenging task due to a variety of factors -- a high degree of visual similarity among species, environmental variability, and a continued lack of large, agriculture-specific image data. We introduce iNatAg, a large-scale image dataset which contains over 4.7 million images of 2,959 distinct crop and weed species, with precise annotations along the taxonomic hierarchy from binary crop/weed labels to specific species labels. Curated from the broader iNaturalist database, iNatAg contains data from every continent and accurately reflects the variability of natural image captures and environments. Enabled by this data, we train benchmark models built upon the Swin Transformer architecture and evaluate the impact of various modifications such as the incorporation of geospatial data and LoRA finetuning. Our best models achieve state-of-the-art performance across all taxonomic classification tasks, achieving 92.38% on crop and weed classification. Furthermore, the scale of our dataset enables us to analyze misclassifications and unlock new analytic possibilities for plant species. By combining large-scale species coverage, multi-task labels, and geographic diversity, iNatAg provides a new foundation for building robust, geolocation-aware agricultural classification systems. We release the iNatAg dataset publicly through AgML (https://github.com/Project-AgML/AgML), enabling direct access and integration into agricultural machine learning workflows.
中文: iNatAg数据集包含超过470万张图像,涵盖2959种作物和杂草,通过多层次标注和地理多样性,为精准农业提供了强大的分类基础,解决了物种视觉相似性和环境多变性的挑战。
English: The iNatAg dataset, comprising over 4.7 million images of 2,959 crop and weed species with multi-level annotations, enables state-of-the-art classification models and provides a robust foundation for precision agriculture by addressing challenges like visual similarity and environmental variability.

Authors:Yu Xin, Gorkem Can Ates, Kuang Gong, Wei Shao
Title: Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis
Abstract:
Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM's ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
中文:Med3DVLM通过高效编码器、对比学习策略和多模态融合三大创新,解决了三维医学影像与文本对齐的难题,在多项基准测试中实现了领先性能。
English: Med3DVLM overcomes the computational and alignment challenges of 3D vision-language models through innovations in efficient encoding, contrastive learning, and multi-modal fusion, achieving state-of-the-art performance across multiple medical imaging benchmarks.
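The pairwise sigmoid image-text loss mentioned in the abstract (SigLIP-style) can be sketched in a few lines; the temperature and bias values below are illustrative constants rather than Med3DVLM's learned parameters.

```python
# Minimal SigLIP-style pairwise sigmoid contrastive loss (sketch, illustrative constants).
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img, txt, t=10.0, b=-10.0):
    """img, txt: (B, D) embeddings; positive pairs sit on the diagonal."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() * t + b                  # (B, B) pairwise logits
    labels = 2 * torch.eye(img.shape[0]) - 1        # +1 on diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()    # no large negative batch required

loss = sigmoid_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Unlike a softmax-based InfoNCE objective, each image-text pair contributes an independent binary term, which is why the alignment does not rely on large negative batches.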

Authors:Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Venkatesh, Kevin Feigelis, Jiajun Wu, Daniel LK Yamins
Title: Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals
Abstract:
Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

Authors:Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang
Title: Audio-centric Video Understanding Benchmark without Text Shortcut
Abstract:
Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. Moreover, this work points out the text shortcut problem that largely exists in other benchmarks, where the correct answer can be found from the question text alone without needing the video. AVUT addresses this problem by proposing an answer permutation-based filtering mechanism. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by analyses of deficiencies in audio-visual LLMs. Demos and data are available at https://github.com/lark-png/AVUT.
中文摘要:本文提出以音频为中心的视频理解基准(AVUT),通过设计音频导向任务和答案重排过滤机制,专门评估多模态大语言模型对听觉信息及音视频交互的理解能力。
English Summary: This paper introduces an audio-centric video understanding benchmark (AVUT) to evaluate multimodal LLMs' video comprehension by focusing on auditory information and addressing text shortcut issues through answer permutation filtering.

Authors:Daniel G. P. Petrini, Hae Yong Kim
Title: Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification
Abstract:
Mammography, an X-ray-based imaging technique, remains central to the early detection of breast cancer. Recent advances in artificial intelligence have enabled increasingly sophisticated computer-aided diagnostic methods, evolving from patch-based classifiers to whole-image approaches and then to multi-view architectures that jointly analyze complementary projections. Despite this progress, several critical questions remain unanswered. In this study, we systematically investigate these issues by addressing five key research questions: (1) the role of patch classifiers in performance, (2) the transferability of natural-image-trained backbones, (3) the advantages of learn-to-resize over conventional downscaling, (4) the contribution of multi-view integration, and (5) the robustness of findings across varying image quality. Beyond benchmarking, our experiments demonstrate clear performance gains over prior work. For the CBIS-DDSM dataset, we improved single-view AUC from 0.8153 to 0.8343, and multiple-view AUC from 0.8483 to 0.8658. Using a new comparative method, we also observed a 0.0217 AUC increase when extending from single to multiple-view analysis. On the complete VinDr-Mammo dataset, the multiple-view approach further improved results, achieving a 0.0492 AUC increase over single view and reaching 0.8511 AUC overall. These results establish new state-of-the-art benchmarks, providing clear evidence of the advantages of multi-view architectures for mammogram interpretation. Beyond performance, our analysis offers principled insights into model design and transfer learning strategies, contributing to the development of more accurate and reliable breast cancer screening tools. The inference code and trained models are publicly available at https://github.com/dpetrini/multiple-view.
中文: 本研究通过系统评估人工智能驱动的乳腺摄影分析,证明了多视角架构显著优于单视角方法,并通过原则性模型设计确立了新的最先进基准,从而推进了乳腺癌筛查技术。
English: This study advances breast cancer screening by systematically evaluating AI-driven mammography analysis, demonstrating that multi-view architectures significantly outperform single-view methods and establishing new state-of-the-art benchmarks through principled model design.

Authors:Sangwon Baik, Hyeonwoo Kim, Hanbyul Joo
Title: Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models
Abstract:
We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture realistic OOR cues, enabling efficient collection of a 3D dataset to learn OOR for various unbounded object categories. Our approach synthesizes diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to 3D scene arrangement tasks and human motion synthesis using our OOR diffusion model.

Authors:Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu
Title: SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining
Abstract:
LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow
中文摘要:SuperFlow++ 是一种新颖的激光雷达表示学习框架,通过整合连续激光雷达-摄像头对的时空线索,在11个数据集上超越现有最优方法,为自动驾驶感知建立了新基准。
English Summary: SuperFlow++ is a novel LiDAR representation learning framework that integrates spatiotemporal cues using consecutive LiDAR-camera pairs, outperforming state-of-the-art methods across 11 datasets while establishing new benchmarks for autonomous driving perception.

Authors:Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, Qifeng Chen
Title: AvatarArtist: Open-Domain 4D Avatarization
Abstract:
This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies.
中文:本研究提出AvatarArtist方法,通过结合生成对抗网络和扩散模型,能够从任意风格的肖像图像生成高质量4D虚拟形象,展现出强大的跨域适应能力。
English: This research introduces AvatarArtist, a method for creating 4D avatars from portrait images by combining GANs and diffusion models to handle diverse styles effectively.

Authors:Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
Title: Scaling Vision Pre-Training to 4K Resolution
Abstract:
High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S^2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, PS3 and VILA-HD outperform previous vision encoders (e.g., SigLIP2 and Perception Encoder) and MLLMs (e.g., NVILA and Qwen2.5-VL) respectively across multiple benchmarks and achieve better efficiency than the latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 16.1% improvement over GPT-4o and a 7.5% improvement and 1.67x speedup over Qwen2.5-VL.
Chinese: PS3通过选择性处理局部区域并与详细描述对比,实现了高效的4K分辨率视觉预训练,显著提升了多模态模型VILA-HD的高清感知能力,同时大幅降低了计算开销。
English: PS3 enables efficient 4K-resolution vision pre-training by selectively processing local regions with detailed captions, significantly enhancing high-resolution perception in multi-modal models like VILA-HD while reducing computational costs.
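A toy version of the selection step, picking which regions to re-encode at high resolution based on text relevance, might look as follows; the scoring by cosine similarity, the patch grid, and the top-k budget are illustrative assumptions, not PS3's actual selection mechanism.

```python
# Toy relevance-based region selection before detailed high-resolution encoding (sketch).
import torch
import torch.nn.functional as F

def select_regions(patch_feats, text_feat, k=4):
    """patch_feats: (P, D) coarse patch embeddings; text_feat: (D,) prompt embedding."""
    scores = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=0)
    return scores.topk(k).indices        # indices of patches to re-encode at high resolution

idx = select_regions(torch.randn(64, 256), torch.randn(256), k=4)
print(idx)
```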

Authors:Fernando Julio Cendra, Kai Han
Title: ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models
Abstract:
The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as the diffusion-based Text-to-Image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilises a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: https://visual-ai.github.io/ice

Authors:Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, Jingbo Wang
Title: TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
Abstract:
Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/
中文:TokenHSI是一种基于Transformer的统一策略,通过令牌机制整合多种人-场景交互技能,显著提升了复杂任务中的多功能性、适应性和可扩展性。
English: TokenHSI is a unified transformer-based policy that integrates multiple human-scene interaction skills through a token-based mechanism, enhancing versatility, adaptability, and extensibility in complex tasks.
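Below is a loose sketch of the tokenization idea: a shared proprioception token is concatenated with per-task tokens, and inactive tasks are masked out of the transformer's attention. All dimensions, the padding-mask mechanism, and the action head are assumptions made for illustration and do not reproduce TokenHSI's policy.

```python
# Hypothetical token-based multi-task policy with masked task tokens (illustrative only).
import torch
import torch.nn as nn

class TokenPolicy(nn.Module):
    def __init__(self, dim=128, num_tasks=3, action_dim=32):
        super().__init__()
        self.proprio_proj = nn.Linear(64, dim)                 # humanoid proprioception -> shared token
        self.task_proj = nn.ModuleList([nn.Linear(16, dim) for _ in range(num_tasks)])
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, proprio, task_obs, task_mask):
        """proprio: (B, 64); task_obs: list of (B, 16); task_mask: (B, num_tasks) bool (True = active)."""
        tokens = [self.proprio_proj(proprio).unsqueeze(1)]
        tokens += [p(o).unsqueeze(1) for p, o in zip(self.task_proj, task_obs)]
        x = torch.cat(tokens, dim=1)                           # (B, 1 + num_tasks, dim)
        pad = torch.cat([torch.zeros_like(task_mask[:, :1]), ~task_mask], dim=1)
        x = self.encoder(x, src_key_padding_mask=pad)          # inactive task tokens are ignored
        return self.head(x[:, 0])                              # act from the shared token

policy = TokenPolicy()
a = policy(torch.randn(2, 64), [torch.randn(2, 16) for _ in range(3)],
           torch.tensor([[True, False, False], [False, True, True]]))
print(a.shape)  # torch.Size([2, 32])
```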

Authors:Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang
Title: Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
Abstract:
Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
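To illustrate the symmetric binary mask described above, the sketch below builds a block-structured attention mask that restricts each text annotation to its own video segment while keeping all visual tokens mutually visible; token counts and the exact masking policy are assumptions, not the paper's implementation.

```python
# Illustrative segment-aligned attention mask (True = attention allowed).
import torch

def segment_attention_mask(text_lens, video_lens):
    """text_lens[i], video_lens[i]: token counts of the i-th annotation and video segment."""
    groups, is_video = [], []
    for i, (tl, vl) in enumerate(zip(text_lens, video_lens)):
        groups += [i] * (tl + vl)
        is_video += [False] * tl + [True] * vl
    g = torch.tensor(groups)
    v = torch.tensor(is_video)
    same_segment = g.unsqueeze(0) == g.unsqueeze(1)      # text <-> its own segment only
    video_pair = v.unsqueeze(0) & v.unsqueeze(1)         # visual tokens stay globally linked
    return same_segment | video_pair                     # symmetric by construction

m = segment_attention_mask(text_lens=[4, 4], video_lens=[16, 16])
print(m.shape)  # torch.Size([40, 40])
```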

Authors:Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M. Rehg, Yapeng Tian
Title: Towards Online Multi-Modal Social Interaction Understanding
Abstract:
Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which hinders their application to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenge of missing useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.
Chinese: 本研究提出了Online-MMSI-VLM框架,通过预测未来对话和整合社交感知视觉线索,在无需未来上下文的情况下实现了实时多模态社交交互理解的卓越性能。
English: The study introduces Online-MMSI-VLM, a framework that enhances real-time multimodal social interaction understanding by forecasting future conversations and integrating social-aware visual cues, achieving superior performance without relying on future context.

Authors:Aaron Serianni, Tyler Zhu, Olga Russakovsky, Vikram V. Ramaswamy
Title: Attention IoU: Examining Biases in CelebA using Attention Maps
Abstract:
Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes with respect to the protected attribute Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
Chinese: Attention-IoU指标通过注意力图揭示模型内部表征中的偏见并识别导致偏见的图像特征,在Waterbirds和CelebA数据集上的验证表明,它能发现超出准确率差异的相关性和数据标签中未体现的潜在混杂变量。
English: The Attention-IoU metric uses attention maps to uncover biases in a model's internal representations and identify image features causing them, validated on datasets like Waterbirds and CelebA to reveal correlations and confounding variables beyond accuracy disparities.
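As a rough illustration of comparing attention maps with an IoU-style score, the snippet below computes a soft IoU between two normalised heatmaps; the min/max formulation is one common soft-IoU choice assumed here and is not necessarily the paper's exact definition.

```python
# Soft IoU between two normalised attention heatmaps (illustrative formulation).
import torch

def attention_iou(a, b, eps=1e-8):
    """a, b: (H, W) non-negative maps; each is normalised to sum to 1 before comparison."""
    a = a / (a.sum() + eps)
    b = b / (b.sum() + eps)
    inter = torch.minimum(a, b).sum()
    union = torch.maximum(a, b).sum()
    return (inter / (union + eps)).item()

heat = torch.rand(14, 14)              # e.g. a model's attention for one attribute
mask = torch.zeros(14, 14)
mask[4:10, 4:10] = 1.0                 # reference region (e.g. a face-part mask)
print(attention_iou(heat, mask))
```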

Authors:Jun Zhou, Jiahao Li, Zunnan Xu, Hanhui Li, Yiji Cheng, Fa-Ting Hong, Qin Lin, Qinglin Lu, Xiaodan Liang
Title: FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model
Abstract:
Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Specifically, we enhance the fine-grained visual perception capabilities of the VLM by introducing additional region tokens. Relying solely on the output of the LLM to guide the diffusion model may lead to suboptimal editing results. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses the state-of-the-art instruction-based image editing methods. Our project is available at https://zjgans.github.io/fireedit.github.io.

Authors:Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, Errui Ding, Jingdong Wang, Youjian Zhao, Hang Zhou, Ziwei Liu
Title: AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
Abstract:
Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. 1) Firstly, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance hand and face details that are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as the bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details. Resources can be found at https://guanjz20.github.io/projects/AudCast.
中文摘要:本文提出AudCast框架,通过级联扩散变换器实现从音频和参考图像生成具有准确口型同步和自然身体动作的整体人体视频。
English Summary: This paper introduces AudCast, a novel framework using cascade Diffusion-Transformers to generate holistic human videos with synchronized lip movements and natural body gestures from audio inputs and reference images.

Authors:Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
Title: PAVE: Patching and Adapting Video Large Language Models
Abstract:
Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.
Chinese: PAVE是一种灵活的框架,通过为预训练的视频大语言模型添加轻量级适配器,使其能够处理包含音频和3D线索等辅助信号的任务,在显著提升性能的同时仅增加极少的计算开销。
English: PAVE is a flexible framework that enhances pre-trained Video LLMs by adding lightweight adapters to handle tasks with side-channel signals like audio and 3D cues, significantly improving performance with minimal computational overhead.
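A minimal sketch of what a lightweight "patch"-style adapter could look like is given below: a small gated cross-attention block injects a side-channel signal (e.g. audio features) into frozen hidden states. Dimensions, the gating scheme, and the placement are assumptions for illustration, not PAVE's actual adapter design.

```python
# Hypothetical side-channel adapter added alongside a frozen video LLM (illustrative only).
import torch
import torch.nn as nn

class SideChannelPatch(nn.Module):
    def __init__(self, dim=1024, side_dim=256, heads=4):
        super().__init__()
        self.proj = nn.Linear(side_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))        # starts as identity (no disturbance)

    def forward(self, hidden, side):
        """hidden: (B, T, dim) frozen-model states; side: (B, S, side_dim) side-channel tokens."""
        kv = self.proj(side)
        out, _ = self.attn(hidden, kv, kv)              # hidden states attend to side channel
        return hidden + torch.tanh(self.gate) * out     # gated residual injection

patch = SideChannelPatch()
h = patch(torch.randn(2, 32, 1024), torch.randn(2, 10, 256))
print(h.shape)  # torch.Size([2, 32, 1024])
```

Because the gate starts at zero, the base model's behaviour is initially unchanged, which is one way such patches can add few parameters without modifying pre-trained weights.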

Authors:Jingdan Kang, Haoxin Yang, Yan Cai, Huaidong Zhang, Xuemiao Xu, Yong Du, Shengfeng He
Title: SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation
Abstract:
Image generation technology has brought significant advancements across various fields but has also raised concerns about data misuse and potential rights infringements, particularly with respect to creating visual artworks. Current methods aimed at safeguarding artworks often employ adversarial attacks. However, these methods face challenges such as poor transferability, high computational costs, and the introduction of noticeable noise, which compromises the aesthetic quality of the original artwork. To address these limitations, we propose Structurally Imperceptible and Transferable Adversarial (SITA) attacks. SITA leverages a CLIP-based destylization loss, which decouples and disrupts the robust style representation of the image. This disruption hinders style extraction during stylized image generation, thereby impairing the overall stylization process. Importantly, SITA eliminates the need for a surrogate diffusion model, leading to significantly reduced computational overhead. The method's robust style feature disruption ensures high transferability across diverse models. Moreover, SITA introduces perturbations by embedding noise within the imperceptible structural details of the image. This approach effectively protects against style extraction without compromising the visual quality of the artwork. Extensive experiments demonstrate that SITA offers superior protection for artworks against unauthorized use in stylized generation. It significantly outperforms existing methods in terms of transferability, computational efficiency, and noise imperceptibility. Code is available at https://github.com/A-raniy-day/SITA.
中文: 提出的SITA方法通过破坏图像风格表征并在结构细节中嵌入噪声,有效保护艺术作品免遭未经授权的风格化生成使用,相比现有方法具有更高的可迁移性、计算效率和视觉不可感知性。
English: The proposed SITA method protects artworks from unauthorized use in stylized image generation by disrupting style representations through structurally embedded noise, offering enhanced transferability, computational efficiency, and visual imperceptibility compared to existing approaches.

Authors:Vladan Stojnić, Yannis Kalantidis, Jiří Matas, Giorgos Tolias
Title: LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
Abstract:
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
Chinese: LPOSS+ 提出了一种无需训练的开集词汇语义分割方法,通过像素级标签传播优化视觉语言模型的预测结果,并利用视觉模型捕捉模态内关系,在不依赖窗口处理的情况下实现了最先进的性能。
English: LPOSS+ introduces a training-free method for open-vocabulary semantic segmentation by refining Vision-and-Language Model predictions through pixel-level label propagation and leveraging Vision Models to capture intra-modal relationships, achieving state-of-the-art accuracy without window-based processing.
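The core refinement step, smoothing initial per-patch class scores with an affinity graph built from vision-model features, can be sketched as standard label propagation; the symmetric kNN affinity, the alpha mixing weight, and the iteration count below are textbook choices assumed for illustration, not the exact LPOSS formulation.

```python
# Toy label propagation over patches using a feature-based affinity matrix (sketch).
import torch
import torch.nn.functional as F

def propagate(scores, feats, alpha=0.8, iters=20, k=10):
    """scores: (N, C) initial VLM predictions; feats: (N, D) vision-model patch features."""
    sim = F.normalize(feats, dim=-1) @ F.normalize(feats, dim=-1).t()
    topk = sim.topk(k, dim=-1)
    W = torch.zeros_like(sim).scatter_(1, topk.indices, topk.values.clamp(min=0))
    W = (W + W.t()) / 2                                  # symmetric kNN affinity
    d = W.sum(dim=1, keepdim=True).clamp(min=1e-8)
    S = W / d.sqrt() / d.sqrt().t()                      # D^{-1/2} W D^{-1/2} normalisation
    Z = scores.clone()
    for _ in range(iters):
        Z = alpha * S @ Z + (1 - alpha) * scores         # diffuse labels, keep anchor to the prior
    return Z

refined = propagate(torch.rand(196, 20), torch.randn(196, 384))
print(refined.shape)  # torch.Size([196, 20])
```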

Authors:Christina Kassab, Sacha Morin, Martin Büchner, Matías Mattamala, Kumaraditya Gupta, Abhinav Valada, Liam Paull, Maurice Fallon
Title: OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations
Abstract:
3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is limited to closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark to evaluate 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for 23 scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we provide insights on feature precision, segmentation, and downstream capabilities. We evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. The benchmark is publicly available at: https://openlex3d.github.io/.

Authors:Pihai Sun, Junjun Jiang, Yuanqi Yao, Youyu Chen, Wenbo Zhao, Kui Jiang, Xianming Liu
Title: FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion
Abstract:
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs.Rel on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE
中文摘要:FUSE框架通过频率解耦融合的自监督编码器,解决了图像-事件联合深度估计中数据稀缺和模态失配的双重挑战,在实现最优性能的同时展现出卓越的零样本适应能力。
English Summary: The FUSE framework introduces a self-supervised encoder with frequency-decoupled fusion to overcome data scarcity and modality mismatch in image-event depth estimation, achieving state-of-the-art performance and strong zero-shot adaptability.
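To make frequency decoupling concrete, here is a simple blur-based split of a feature map into low-frequency structure and high-frequency residual; the averaging filter is a stand-in for illustration and is not the FreDFuse module.

```python
# Illustrative low/high frequency split of a feature map with a depthwise average filter.
import torch
import torch.nn.functional as F

def split_frequencies(x, k=5):
    """x: (B, C, H, W) -> (low, high) with x = low + high."""
    kernel = torch.ones(x.shape[1], 1, k, k, device=x.device) / (k * k)
    low = F.conv2d(x, kernel, padding=k // 2, groups=x.shape[1])  # depthwise low-pass
    return low, x - low                                           # residual carries the edges

low, high = split_frequencies(torch.randn(2, 16, 64, 64))
print(low.shape, high.shape)
```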

Authors:Yuli Zhou, Guolei Sun, Yawei Li, Yuqian Fu, Luca Benini, Ender Konukoglu
Title: CamSAM2: Segment Anything Accurately in Camouflaged Videos
Abstract:
Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code will be available at https://github.com/zhoustan/CamSAM2.
中文: 本文提出Camouflaged SAM2 (CamSAM2)方法,通过引入少量可学习参数增强SAM2在视频中分割伪装物体的能力,在多个VCOS数据集上实现了显著性能提升。
English: This paper introduces Camouflaged SAM2 (CamSAM2), a method that enhances SAM2's ability to segment camouflaged objects in videos by adding minimal learnable parameters, achieving significant performance improvements on VCOS datasets.

Authors:Yusen Xie, Zhengmin Huang, Shaojie Shen, Jun Ma
Title: Semi-SMD: Semi-Supervised Metric Depth Estimation via Surrounding Cameras for Autonomous Driving
Abstract:
In this paper, we introduce Semi-SMD, a novel metric depth estimation framework tailored for surround-view camera setups in autonomous driving. The input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct the visual fused features. Cross-attention components for surrounding cameras and adjacent frames are utilized to focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively addresses the scale ambiguity in multi-camera setups. Moreover, a semantic world model and a monocular depth estimation world model are integrated to supervise the depth estimation, which improves the quality of the estimated depth. We evaluate our algorithm on the DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of surrounding-camera-based depth estimation quality. The source code will be available at https://github.com/xieyuser/Semi-SMD.
Chinese: 本文提出Semi-SMD,一种面向自动驾驶环视相机的创新深度估计框架,通过时空语义融合和跨注意力机制,在基准数据集上实现了最先进的性能表现。
English: This paper presents Semi-SMD, a novel framework for metric depth estimation in autonomous driving that integrates spatial-temporal-semantic fusion and cross-attention mechanisms to achieve state-of-the-art performance on benchmark datasets.

Authors:Ilias Stogiannidis, Steven McDonagh, Sotirios A. Tsaftaris
Title: Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing benchmarks for VLMs include spatial components, which often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these deficiencies with a multi-faceted approach towards understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we present a detailed analysis that first delineates the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization, and then assesses the performance of these models in both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art Vision-Language Models, uncovering pivotal insights into their spatial reasoning performance. Our results reveal profound shortcomings in current VLMs, with average accuracy across the 13 models approximating random chance, highlighting spatial reasoning as a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning within VLMs but also establishes a solid platform for future exploration. Code available on GitHub (https://github.com/stogiannidis/srbench) and dataset available on HuggingFace (https://huggingface.co/datasets/stogiannidis/srbench).
Chinese: 视觉语言模型在空间推理方面存在显著不足,其表现接近随机猜测水平,凸显了推进该领域发展的迫切需求。
English: Vision-Language Models show significant limitations in spatial reasoning, with their performance averaging near random chance, highlighting an urgent need for advancements in this area.

Authors:Jungin Park, Jiyoung Lee, Kwanghoon Sohn
Title: Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
Abstract:
View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple viewpoints. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, self-view masking and cross-view masking predictions are designed to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at https://github.com/park-jungin/byov.
Chinese: 提出的BYOV方法通过掩码式自-他视角建模,从未配对的第一人称和第三人称视频中学习细粒度的视角不变视频表征,在多项下游任务中均显著超越现有方法。
English: The proposed Bootstrap Your Own Views (BYOV) method uses masked ego-exo modeling to learn fine-grained view-invariant video representations from unpaired first- and third-person videos, achieving superior performance across multiple downstream tasks.

Authors:Andrii Yermakov, Jan Cech, Jiri Matas
Title: Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection
Abstract:
This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
中文: 本文提出了一种基于CLIP视觉编码器的深度伪造检测方法,通过最小化模型调整实现了跨数据集的优异检测性能,为可泛化检测领域建立了简单而强大的基准。
English: This paper introduces a robust deepfake detection method using CLIP's visual encoder with minimal modifications, achieving competitive accuracy across diverse datasets and establishing a simple yet effective baseline for generalizable detection.
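A minimal sketch of the LN-tuning idea, freezing everything in CLIP's visual encoder except LayerNorm parameters plus a small head, is shown below. It assumes the open_clip package and its ViT-L-14 weights; the preprocessing pipeline, metric-learning regularisation, and training loop from the paper are not reproduced.

```python
# Minimal LN-tuning sketch on CLIP ViT-L/14 (assumes the open_clip package is available).
import torch
import torch.nn as nn
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
visual = model.visual

for p in visual.parameters():
    p.requires_grad = False                          # freeze the pre-trained encoder
for m in visual.modules():
    if isinstance(m, nn.LayerNorm):                  # LN-tuning: only norm params are trainable
        for p in m.parameters():
            p.requires_grad = True

head = nn.Linear(visual.output_dim, 1)               # real/fake logit (assumed head)
params = [p for p in visual.parameters() if p.requires_grad] + list(head.parameters())
optim = torch.optim.AdamW(params, lr=1e-4)
print(sum(p.numel() for p in params), "trainable parameters")
```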

Authors:Rupak Bose, Chinedu Innocent Nwoye, Aditya Bhat, Nicolas Padoy
Title: CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation
Abstract:
The acquisition of annotated datasets with paired images and segmentation masks is a critical challenge in domains such as medical imaging, remote sensing, and computer vision. Manual annotation demands significant resources, faces ethical constraints, and depends heavily on domain expertise. Existing generative models often target single-modality outputs, either images or segmentation masks, failing to address the need for high-quality, simultaneous image-mask generation. Additionally, these models frequently lack adaptable conditioning mechanisms, restricting control over the generated outputs and limiting their applicability for dataset augmentation and rare scenario simulation. We propose CoSimGen, a diffusion-based framework for controllable simultaneous image and mask generation. Conditioning is intuitively achieved through (1) text prompts grounded in class semantics, (2) spatial embedding of context prompts to provide spatial coherence, and (3) spectral embedding of timestep information to model noise levels during diffusion. To enhance controllability and training efficiency, the framework incorporates a contrastive triplet loss between text and class embeddings, alongside diffusion and adversarial losses. Initial low-resolution outputs (128 x 128) are super-resolved to 512 x 512, producing high-fidelity images and masks with strict adherence to conditions. We evaluate CoSimGen on metrics such as FID, KID, LPIPS, Class FID, and positive predictive value for image fidelity and semantic alignment of generated samples over 4 diverse datasets. CoSimGen achieves state-of-the-art performance across all datasets, achieving the lowest KID of 0.11 and LPIPS of 0.53 across datasets.

Authors:Jan Kohút, Martin Dočekal, Michal Hradiš, Marek Vaško
Title: BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction
Abstract:
Manual digitization of bibliographic metadata is time-consuming and labor-intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we evaluated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including Llama 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are available at: https://github.com/DCGM/biblio-dataset
Chinese: BiblioPage数据集通过提供2000份带注释的专著扉页,解决了书目元数据提取领域专用资源匮乏的问题,评估显示基于Transformer的OCR和视觉大模型最高F1值达67%,为文档理解任务提供了基准。
English: The BiblioPage dataset addresses the lack of dedicated resources for bibliographic metadata extraction by providing 2,000 annotated monograph title pages, with evaluations showing transformer-based OCR and visual language models achieving up to 67 F1 score, serving as a benchmark for document understanding tasks.

Authors:Yabin Wang, Zhiwu Huang, Xiaopeng Hong
Title: OpenSDI: Spotting Diffusion-Generated Images in the Open World
Abstract:
This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at https://github.com/iamwangyabin/OpenSDI.
中文摘要:本文提出了检测扩散生成图像的OpenSDI挑战,并创建了OpenSDID基准数据集,通过协同预训练模型(SPM)方案开发出MaskCLIP模型,在定位和检测任务中均取得了最先进的性能表现。
English Summary: This paper introduces the OpenSDI challenge for detecting diffusion-generated images and proposes the OpenSDID benchmark with a novel Synergizing Pretrained Models (SPM) scheme, leading to the MaskCLIP model that achieves state-of-the-art performance in both localization and detection tasks.

Authors:Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, Qingming Huang
Title: Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Abstract:
The hallucination of large multimodal models (LMMs), providing responses that appear correct but are actually incorrect, limits their reliability and applicability. This paper aims to study the hallucination problem of LMMs in the video modality, which is dynamic and more challenging compared to static modalities like images and text. From this motivation, we first present a comprehensive benchmark termed HAVEN for evaluating hallucinations of LMMs in video understanding tasks. It is built upon three dimensions, i.e., hallucination causes, hallucination aspects, and question formats, resulting in 6K questions. Then, we quantitatively study 7 influential factors on hallucinations, e.g., duration time of videos, model sizes, and model reasoning, via experiments with 16 LMMs on the presented benchmark. In addition, inspired by recent thinking models like OpenAI o1, we propose a video-thinking model to mitigate the hallucinations of LMMs via supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO), where SRFT enhances reasoning capabilities while TDPO reduces hallucinations in the thinking process. Extensive experiments and analyses demonstrate its effectiveness. Remarkably, it improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. The code and data are public at https://github.com/Hongcheng-Gao/HAVEN.
中文: 本文提出了HAVEN基准来评估大模型在视频理解中的幻觉问题,并通过监督推理微调和直接偏好优化的视频思维模型有效降低了幻觉,显著提升了准确率并减少了偏差。
English: This paper introduces HAVEN, a benchmark to evaluate hallucinations in large multimodal models for video understanding, and proposes a video-thinking model that significantly reduces hallucinations through supervised reasoning fine-tuning and direct preference optimization.
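
The abstract describes a direct preference optimization stage (TDPO) for suppressing hallucinated reasoning. Below is a minimal sketch of the generic DPO objective on sequence log-probabilities; the paper's TDPO variant may differ in how preference pairs over thinking traces are constructed, so treat this as an assumed baseline form only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective.

    Each argument is a tensor of shape (B,) holding the total log-probability
    that the policy / frozen reference model assigns to the chosen (less
    hallucinated) or rejected (more hallucinated) response.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with random log-probabilities.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```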

Authors:Xinxing Cheng, Tianyang Zhang, Wenqi Lu, Qingjie Meng, Alejandro F. Frangi, Jinming Duan
Title: SACB-Net: Spatial-awareness Convolutions for Medical Image Registration
Abstract:
Deep learning-based image registration methods have shown state-of-the-art performance and rapid inference speeds. Despite these advances, many existing approaches fall short in capturing spatially varying information in non-local regions of feature maps due to the reliance on spatially-shared convolution kernels. This limitation leads to suboptimal estimation of deformation fields. In this paper, we propose a 3D Spatial-Awareness Convolution Block (SACB) to enhance the spatial information within feature representations. Our SACB estimates the spatial clusters within feature maps by leveraging feature similarity and subsequently parameterizes the adaptive convolution kernels across diverse regions. This adaptive mechanism generates the convolution kernels (weights and biases) tailored to spatial variations, thereby enabling the network to effectively capture spatially varying information. Building on SACB, we introduce a pyramid flow estimator (named SACB-Net) that integrates SACBs to facilitate multi-scale flow composition, particularly addressing large deformations. Experimental results on the brain IXI and LPBA datasets as well as Abdomen CT datasets demonstrate the effectiveness of SACB and the superiority of SACB-Net over the state-of-the-art learning-based registration methods. The code is available at https://github.com/x-xc/SACB_Net .
Chinese: 本文提出3D空间感知卷积块(SACB)和SACB-Net,通过自适应生成不同区域的卷积核来解决图像配准中空间变化信息捕捉不足的问题,在脑部和腹部数据集上展现了优越性能。
English: This paper introduces a 3D Spatial-Awareness Convolution Block (SACB) and SACB-Net to address limitations in capturing spatially varying information in image registration by adaptively generating convolution kernels for different regions, demonstrating superior performance on brain and abdomen datasets.
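
The abstract describes estimating spatial clusters within a feature map and parameterizing a separate convolution kernel per cluster. The sketch below is a simplified 2D toy stand-in (the paper's SACB is 3D and its clustering and kernel-generation details are not given in the abstract): soft-assign each location to learnable centroids and blend cluster-specific convolutions by the assignment weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareConv2d(nn.Module):
    """Toy spatially adaptive convolution: soft-cluster the feature map,
    give each cluster its own kernel, and blend the cluster-wise outputs."""

    def __init__(self, channels, num_clusters=4, kernel_size=3):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, channels))
        # One full convolution kernel (and bias) per cluster.
        self.weights = nn.Parameter(
            torch.randn(num_clusters, channels, channels, kernel_size, kernel_size) * 0.02)
        self.biases = nn.Parameter(torch.zeros(num_clusters, channels))
        self.padding = kernel_size // 2

    def forward(self, x):                          # x: (B, C, H, W)
        sim = torch.einsum('bchw,kc->bkhw',
                           F.normalize(x, dim=1),
                           F.normalize(self.centroids, dim=1))
        assign = sim.softmax(dim=1)                # soft cluster assignment (B, K, H, W)
        out = 0
        for k in range(self.weights.shape[0]):     # blend cluster-specific convolutions
            yk = F.conv2d(x, self.weights[k], self.biases[k], padding=self.padding)
            out = out + assign[:, k:k + 1] * yk
        return out

y = SpatialAwareConv2d(16)(torch.randn(2, 16, 32, 32))
```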

Authors:Haim Sawdayee, Chuan Guo, Guy Tevet, Bing Zhou, Jian Wang, Amit H. Bermano
Title: Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion
Abstract:
Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a "Chicken" style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution low quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability. Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing. We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.
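
LoRA-MDM adapts the generative prior with low-rank updates learned from a few style samples. The sketch below shows a generic LoRA-wrapped linear layer of the kind such adaptation typically uses; the rank, scaling, and which layers of the motion diffusion model are wrapped are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank style update."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # keep the generative prior frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

# Only the low-rank "style" parameters would be trained on the few reference motions.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
```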

Authors:Junwei Zheng, Ruiping Liu, Yufan Chen, Zhenfang Chen, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Title: Scene-agnostic Pose Regression for Visual Localization
Abstract:
Absolute Pose Regression (APR) predicts 6D camera poses but lacks the adaptability to unknown environments without retraining, while Relative Pose Regression (RPR) generalizes better yet requires a large image retrieval database. Visual Odometry (VO) generalizes well in unseen environments but suffers from accumulated error in open trajectories. To address this dilemma, we introduce a new task, Scene-agnostic Pose Regression (SPR), which can achieve accurate pose regression in a flexible way while eliminating the need for retraining or databases. To benchmark SPR, we created a large-scale dataset, 360SPR, with over 200K photorealistic panoramas, 3.6M pinhole images and camera poses in 270 scenes at three different sensor heights. Furthermore, an initial SPR-Mamba model is proposed to address SPR in a dual-branch manner. Extensive experiments and studies demonstrate the effectiveness of our SPR paradigm, dataset, and model. In the unknown scenes of both 360SPR and 360Loc datasets, our method consistently outperforms APR, RPR and VO. The dataset and code are available at https://junweizheng93.github.io/publications/SPR/SPR.html.
中文: 本文提出场景无关姿态回归(SPR),通过无需重新训练或数据库的灵活方法实现精确相机姿态估计,并基于新构建的360SPR数据集和双分支SPR-Mamba模型,在未知场景中持续优于现有技术。
English: The paper introduces Scene-agnostic Pose Regression (SPR), a flexible method that achieves accurate camera pose estimation without retraining or databases, supported by the new 360SPR dataset and a dual-branch SPR-Mamba model, outperforming existing techniques in unknown environments.

Authors:Zhengwentai Sun, Chenghong Li, Hongjie Liao, Xihe Yang, Keru Zheng, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
Title: Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage
Abstract:
Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional data to train the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: https://taited.github.io/discohuman-project/.

Authors:Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan
Title: GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
Abstract:
The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior arts on the MMVP-VLM benchmark, e.g., 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.
中文: 本研究提出GenHancer方法,通过优化条件机制、去噪策略和生成范式来利用生成模型增强CLIP的细粒度视觉感知能力,在MMVP-VLM等基准测试中取得了显著优于现有技术的性能表现。
English: This study introduces GenHancer, a method that enhances CLIP's fine-grained visual perception by strategically leveraging generative models through optimized conditioning, denoising, and generation techniques, achieving superior performance on benchmarks like MMVP-VLM.

Authors:Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao
Title: AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
Abstract:
Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, to reduce the inference steps for accelerating video diffusion models with a synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of uninformative data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves an 8.5x improvement in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous acceleration methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-second videos at 720x1280 and 24 fps.
中文: 本文提出AccVideo方法,通过利用去噪轨迹构建合成数据集并结合对抗训练,在保持高质量视频生成的同时,将视频扩散模型的推理速度提升8.5倍,能生成5秒时长、720x1280分辨率的高清视频。
English: This paper introduces AccVideo, an efficient distillation method that accelerates video diffusion models by using synthetic datasets from denoising trajectories and adversarial training, achieving 8.5x faster generation while maintaining high-quality, high-resolution video output.

Authors:Shujuan Li, Yu-Shen Liu, Zhizhong Han
Title: GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting
Abstract:
Reconstructing open surfaces from multi-view images is vital in digitizing complex objects in daily life. A widely used strategy is to learn unsigned distance functions (UDFs) by checking if their appearance conforms to the image observations through neural rendering. However, it is still hard to learn continuous and implicit UDF representations through 3D Gaussian splatting (3DGS) due to the discrete and explicit scene representation, i.e., 3D Gaussians. To resolve this issue, we propose a novel approach to bridge the gap between 3D Gaussians and UDFs. Our key idea is to overfit thin and flat 2D Gaussian planes on surfaces, and then leverage self-supervision and gradient-based inference to supervise unsigned distances in areas both near and far from surfaces. To this end, we introduce novel constraints and strategies to constrain the learning of 2D Gaussians to pursue more stable optimization and more reliable self-supervision, addressing the challenges brought by the complicated gradient field on or near the zero level set of UDFs. We report numerical and visual comparisons with the state-of-the-art on widely used benchmarks and real data to show our advantages in terms of accuracy, efficiency, completeness, and sharpness of reconstructed open surfaces with boundaries.
中文摘要:本研究提出了一种创新方法,通过优化表面上的二维高斯平面来弥合三维高斯泼溅与无符号距离函数之间的差距,利用新型约束和自监督策略实现了开放表面重建在精度和完整性上的显著提升。
English Summary: This study introduces a novel method that bridges 3D Gaussian splatting and unsigned distance functions by optimizing 2D Gaussian planes on surfaces, achieving superior reconstruction of open surfaces through innovative constraints and self-supervision.

Authors:Jiaxin Zhang, Junjun Jiang, Youyu Chen, Kui Jiang, Xianming Liu
Title: COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting
Abstract:
Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB-GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results show that COB-GS substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained model, yielding clear boundaries while preserving high visual quality. Code is available at https://github.com/ZestfulJX/COB-GS.
中文: COB-GS通过联合优化语义与视觉信息,采用边界自适应高斯分割和纹理校正技术,在保持高质量视觉效果的同时,显著提升了3D高斯溅射分割中对象边界的精确度。
English: COB-GS enhances 3D Gaussian Splatting segmentation by jointly optimizing semantic and visual information, using boundary-adaptive Gaussian splitting and texture rectification to achieve precise object boundaries and maintain high visual quality.

Authors:Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Qi Zhao, Changyu Zeng, Wenpei Bai, Guangliang Cheng
Title: ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation
Abstract:
Skin lesion segmentation is a critical challenge in computer vision, as it is essential to accurately separate pathological features from healthy skin for diagnosis. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Parallel Vision Mamba UNet (ASP-VMUNet), which integrates the efficient and scalable Mamba architecture to overcome limitations in traditional CNNs and computationally demanding Transformers. The framework introduces an atrous scan technique that minimizes background interference and expands the receptive field, enhancing Mamba's scanning capabilities. Additionally, the inclusion of a Parallel Vision Mamba (PVM) layer and a shift round operation optimizes feature segmentation and fosters rich inter-segment information exchange. A supplementary CNN branch with a Selective-Kernel (SK) Block further refines the segmentation by blending local and global contextual information. Tested on four benchmark datasets (ISIC16/17/18 and PH2), ASP-VMUNet demonstrates superior performance in skin lesion segmentation, validated by comprehensive ablation studies. This approach not only advances medical image segmentation but also highlights the benefits of hybrid architectures in medical imaging technology. Our code is available at https://github.com/BaoBao0926/ASP-VMUNet/tree/main.
中文: 本文提出ASP-VMUNet新型皮肤病变分割框架,通过结合高效Mamba架构与空洞扫描及并行视觉层,克服传统CNN和Transformer的局限性,在多个基准数据集上实现了卓越性能。
English: This paper introduces ASP-VMUNet, a novel skin lesion segmentation framework that combines the efficient Mamba architecture with atrous scanning and parallel vision layers to overcome limitations of CNNs and Transformers, achieving superior performance on benchmark datasets.

Authors:Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng
Title: LangBridge: Interpreting Image as a Combination of Language Embeddings
Abstract:
Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocab embedding, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at https://jiaqiliao77.github.io/LangBridge.github.io/
中文摘要:近年来大型视觉语言模型虽取得显著进展,但其跨模态对齐机制尚不明确且需针对不同骨干网络重新训练,为此提出的LangBridge适配器通过将视觉表征映射至语言模型词嵌入空间,实现了无需预训练的跨模型迁移,在保持性能的同时增强了可解释性。
English Summary: Recent advances in Large Vision-Language Models (LVLMs) have achieved human-level performance, but their underlying alignment mechanisms remain poorly understood and require retraining for different backbones, leading to the development of LangBridge, a novel adapter that enables pretraining-free transfer across LLMs while maintaining performance through interpretable visual-to-text embedding mapping.
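
LangBridge's central idea is to express each visual token as a linear combination of the LLM's vocabulary embeddings. The sketch below illustrates that mapping with a convex (softmax-weighted) combination; the projection head, the use of softmax, and the toy dimensions are assumptions, not details from the released implementation.

```python
import torch
import torch.nn as nn

class VocabCombinationAdapter(nn.Module):
    """Map visual tokens to weighted combinations of a frozen LLM embedding table."""

    def __init__(self, visual_dim, vocab_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = vocab_embeddings.shape
        self.register_buffer('vocab', vocab_embeddings)    # frozen LLM vocabulary embeddings
        self.to_logits = nn.Linear(visual_dim, vocab_size)

    def forward(self, visual_tokens):                       # (B, N, visual_dim)
        weights = self.to_logits(visual_tokens).softmax(dim=-1)  # (B, N, vocab_size)
        return weights @ self.vocab                          # (B, N, llm_dim)

# Toy usage: a 1000-word vocabulary with 768-d embeddings and 1024-d visual tokens.
adapter = VocabCombinationAdapter(1024, torch.randn(1000, 768))
llm_inputs = adapter(torch.randn(2, 196, 1024))
```

Because the output lives directly in the vocabulary embedding space, the same adapter can in principle be reused with a different LLM by swapping the embedding table, which is the transferability the abstract emphasizes.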

Authors:Akshay Kulkarni, Ge Yan, Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng
Title: Interpretable Generative Models through Post-hoc Concept Bottlenecks
Abstract:
Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to design interpretable generative models based on CBMs are not yet efficient and scalable, as they require expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc techniques and we name our approaches: concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our proposed approaches enable efficient and scalable training without the need of real data and require only minimal to no concept supervision. Additionally, our methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train. Finally, a large-scale user study is performed to validate the interpretability and steerability of our methods.

Authors:Hyeongjin Nam, Donghwan Kim, Jeongtaek Oh, Kyoung Mu Lee
Title: DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image
Abstract:
Most existing methods of 3D clothed human reconstruction from a single image treat the clothed human as a single object without distinguishing between cloth and human body. In this regard, we present DeClotH, which separately reconstructs 3D cloth and human body from a single image. This task remains largely unexplored due to the extreme occlusion between cloth and the human body, making it challenging to infer accurate geometries and textures. Moreover, while recent 3D human reconstruction methods have achieved impressive results using text-to-image diffusion models, directly applying such an approach to this problem often leads to incorrect guidance, particularly in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of cloth and human body as regularizations, which provide strong geometric priors to prevent erroneous reconstruction by the occlusion. Second, we introduce a cloth diffusion model specifically designed to provide contextual information about cloth appearance, thereby enhancing the reconstruction of 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approach is highly effective in reconstructing both 3D cloth and the human body. More qualitative results are provided at https://hygenie1228.github.io/DeClotH/.

Authors:Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen
Title: EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
Abstract:
The progress on generative models has led to significant advances on text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategy, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available https://github.com/PrototypeNx/EfficientMT.
中文:提出的EfficientMT框架通过利用合成配对样本和新颖模块,高效地将预训练文本到视频模型适配为通用视频运动迁移框架,无需测试时优化,在效率和运动可控性方面优于现有方法。
English: The proposed EfficientMT framework efficiently adapts a pretrained text-to-video model for general video motion transfer by leveraging synthetic paired samples and novel modules, eliminating the need for test-time optimization while outperforming existing methods in efficiency and motion controllability.

Authors:Zizhi Chen, Minghao Han, Xukun Zhang, Shuwei Ma, Tao Liu, Xing Wei, Lihua Zhang
Title: VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction
Abstract:
Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA's text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code link is https://github.com/CZZZZZZZZZZZZZZZZZ/VGAT.
中文总结:VGAT框架通过视觉问答技术重构基因组信息并增强病理图像的判别区域,仅利用全切片图像即可实现精准的癌症生存预测,为资源受限环境提供了无需基因组测序的临床解决方案。
English Summary: The VGAT framework enables accurate cancer survival prediction using only pathology images by reconstructing genomic information through visual question answering techniques and enhancing discriminative image regions, eliminating the need for genomic sequencing in resource-limited settings.

Authors:Farzad Beizaee, Gregory A. Lodygensky, Christian Desrosiers, Jose Dolz
Title: Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection
Abstract:
Recent advances in diffusion models have spurred research into their application for Reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions of an image while preserving normal ones. This leads to potential degradation of normal regions during reconstruction, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. By modeling anomalies as noise in the latent space, our proposed Deviation correction diffusion (DeCo-Diff) model preserves the normal regions and encourages transformations exclusively on anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14% over state-of-the-art models on well known anomaly detection datasets. The code is available at https://github.com/farzad-bz/DeCo-Diff
中文摘要:本文提出的DeCo-Diff模型通过将异常建模为潜在空间噪声,实现了对异常区域的精准选择性重建,在无监督异常检测与定位任务中取得了显著性能提升。
English Summary: This paper introduces DeCo-Diff, a novel diffusion model that selectively reconstructs only anomalous image regions by treating anomalies as latent noise, achieving significant improvements in unsupervised anomaly detection and localization.
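
Once a DeCo-Diff-style model produces an anomaly-free reconstruction that leaves normal regions untouched, anomalies can be localized from the per-pixel difference. The sketch below shows only that generic reconstruction-based scoring step, with the reconstruction treated as given; the smoothing kernel size is an arbitrary illustrative choice.

```python
import torch
import torch.nn.functional as F

def anomaly_map(image: torch.Tensor, reconstruction: torch.Tensor, blur_radius=4):
    """Per-pixel anomaly score from an image and its anomaly-free reconstruction.

    image, reconstruction: tensors of shape (B, C, H, W) in the same value range.
    """
    diff = (image - reconstruction).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    # Light box-filter smoothing so isolated noisy pixels do not dominate the map.
    k = blur_radius * 2 + 1
    kernel = torch.ones(1, 1, k, k, device=image.device) / (k * k)
    return F.conv2d(diff, kernel, padding=k // 2)

scores = anomaly_map(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```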

Authors:Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim
Title: ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
Abstract:
Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.

Authors:Zhiying Yan, Yiyuan Liang, Shilv Cai, Tao Zhang, Sheng Zhong, Luxin Yan, Xu Zou
Title: Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting
Abstract:
Semantic 4D Gaussians can be used for reconstructing and understanding dynamic scenes, which exhibit temporal variations absent in static scenes. Directly applying static methods to understand dynamic scenes will fail to capture the temporal features. Few works focus on dynamic scene understanding based on Gaussian Splatting, since applying the same update strategy to both dynamic and static parts, without regard for the distinction and interaction between Gaussians, produces significant artifacts and noise. We propose Dual-Hierarchical Optimization (DHO), which consists of Hierarchical Gaussian Flow and Hierarchical Gaussian Guidance in a divide-and-conquer manner. The former implements effective division of static and dynamic rendering and features. The latter helps to mitigate the issue of dynamic foreground rendering distortion in textured complex scenes. Extensive experiments show that our method consistently outperforms the baselines on both synthetic and real-world datasets, and supports various downstream tasks. Project Page: https://sweety-yan.github.io/DHO.

Authors:Yongting Hu, Yuxin Lin, Chengliang Liu, Xiaoling Luo, Xiaoyan Dou, Qihao Xu, Yong Xu
Title: Wavelet-based Global-Local Interaction Network with Cross-Attention for Multi-View Diabetic Retinopathy Detection
Abstract:
Multi-view diabetic retinopathy (DR) detection has recently emerged as a promising method to address the issue of incomplete lesions faced by single-view DR. However, it is still challenging due to the variable sizes and scattered locations of lesions. Furthermore, existing multi-view DR methods typically merge multiple views without considering the correlations and redundancies of lesion information across them. Therefore, we propose a novel method to overcome the challenges of difficult lesion information learning and inadequate multi-view fusion. Specifically, we introduce a two-branch network to obtain both local lesion features and their global dependencies. The high-frequency component of the wavelet transform is used to exploit lesion edge information, which is then enhanced by global semantics to facilitate learning of difficult lesions. Additionally, we present a cross-view fusion module to improve multi-view fusion and reduce redundancy. Experimental results on large public datasets demonstrate the effectiveness of our method. The code is open-sourced at https://github.com/HuYongting/WGLIN.
中文: 本研究提出了一种新颖的双分支网络,结合小波变换和跨视图融合技术,以增强多视角糖尿病视网膜病变检测中的病灶特征学习并减少冗余,在公开数据集上验证了其优越性能。
English: This study introduces a novel two-branch network with wavelet transform and cross-view fusion to enhance lesion feature learning and reduce redundancy in multi-view diabetic retinopathy detection, demonstrating superior performance on public datasets.
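
The abstract relies on the high-frequency sub-bands of a wavelet transform to expose lesion edges. A minimal sketch of that extraction step is shown below using PyWavelets; the choice of the Haar wavelet and the way the detail sub-bands are combined are assumptions, and the paper's network consumes such maps inside a larger two-branch architecture.

```python
import numpy as np
import pywt

def high_frequency_edges(gray_image: np.ndarray, wavelet="haar"):
    """Extract high-frequency wavelet detail coefficients that carry edge information."""
    cA, (cH, cV, cD) = pywt.dwt2(gray_image, wavelet)
    # Combine horizontal, vertical and diagonal detail sub-bands into one edge map.
    return np.sqrt(cH ** 2 + cV ** 2 + cD ** 2)

edge_map = high_frequency_edges(np.random.rand(256, 256).astype(np.float32))
```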

Authors:Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, Lijuan Wang
Title: ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
Abstract:
In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code and model weights will be open-sourced.
中文: 本研究提出ImageGen-CoT框架,通过在图像生成前引入结构化思维过程来增强多模态大语言模型在文本到图像上下文学习中的推理能力,通过微调和混合扩展策略使SEED-X模型性能提升80%。
English: This study introduces ImageGen-CoT, a framework that enhances multimodal LLMs' contextual reasoning in Text-to-Image In-Context Learning by generating structured thought processes before image creation, achieving an 80% performance gain on SEED-X through fine-tuning and hybrid scaling strategies.
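
The hybrid test-time scaling described in the abstract amounts to a nested sampling loop: several reasoning chains, several images per chain, keep the best under some scorer. The sketch below captures that control flow only; `generate_cot`, `generate_image`, and `scorer` are hypothetical placeholders, not the released API, and how the best candidate is actually selected is not specified in the abstract.

```python
def hybrid_scaling(prompt, context, model, scorer, num_chains=4, images_per_chain=2):
    """Sample several ImageGen-CoT chains, several images per chain, keep the best."""
    best_image, best_score = None, float("-inf")
    for _ in range(num_chains):
        chain = model.generate_cot(prompt, context)               # reasoning text
        for _ in range(images_per_chain):
            image = model.generate_image(prompt, context, chain)  # conditioned sample
            score = scorer(prompt, context, image)
            if score > best_score:
                best_image, best_score = image, score
    return best_image
```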

Authors:Weizhi Chen, Jingbo Chen, Yupeng Deng, Jiansheng Chen, Yuman Feng, Zhihao Xi, Diyou Liu, Kai Li, Yu Meng
Title: LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text
Abstract:
This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs, providing both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) The design of the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17%, 0.67%, and 0.92% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy=75.75%) and semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate the dual advantages of fine-grained semantic understanding and global feature matching in LRSCLIP. This work provides a new benchmark model and data support for remote sensing multimodal learning. The related code has been open source and is available at https://github.com/MitsuiChen14/LRSCLIP.
中文: 本研究提出LRSCLIP新型视觉语言基础模型和LRS2M多模态数据集,解决了遥感领域长文本处理瓶颈和语义粒度限制问题,在跨模态检索与分类任务中实现了最优性能。
English: This study introduces LRSCLIP, a novel vision-language foundation model, and the LRS2M dataset to overcome long text processing bottlenecks and reduce hallucinations in remote sensing, achieving state-of-the-art performance in cross-modal retrieval and classification tasks.
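
The abstract mentions a dual-text loss weighting mechanism over short and long captions. The sketch below combines two standard CLIP-style contrastive losses with a fixed weight; the weighting value and the use of plain InfoNCE are assumptions, since the abstract does not give the exact weights.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over an image/text batch (standard CLIP objective)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    labels = torch.arange(image_emb.shape[0], device=image_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def dual_text_loss(image_emb, long_text_emb, short_text_emb, w_long=0.8):
    """Weighted sum of long-text and short-text alignment losses (weight is illustrative)."""
    return (w_long * clip_contrastive_loss(image_emb, long_text_emb)
            + (1.0 - w_long) * clip_contrastive_loss(image_emb, short_text_emb))

loss = dual_text_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```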

Authors:Zhuoran Zhao, Linlin Yang, Pengzhan Sun, Pan Hui, Angela Yao
Title: Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation
Abstract:
Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Code and data are available at: https://github.com/delaprada/HandSynthesis.git.
Chinese: 本文系统研究了3D手部姿态估计中的合成与真实数据差距,识别关键因素并提出数据合成流程,使合成数据达到真实数据精度,为仅使用合成数据铺平道路。
English: This paper systematically investigates the synthetic-to-real gap in 3D hand pose estimation, identifying key factors and proposing a data synthesis pipeline that enables synthetic data to match real data accuracy, paving the way for using synthetic data alone.

Authors:Yang Ren, Hai Jiang, Menglong Yang, Wei Li, Shuaicheng Liu
Title: ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency
Abstract:
RAW-to-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually. The code is available at https://github.com/RenYangSCU/ISPDiffuser.
中文: 本文提出ISPDiffuser框架,通过将RAW到sRGB转换解耦为灰度空间细节重建和色彩一致性映射,采用纹理感知扩散模型和直方图引导的色彩模块,在细节恢复和色彩保真度上均优于现有方法。
English: This paper introduces ISPDiffuser, a diffusion-based framework that decouples RAW-to-sRGB conversion into grayscale detail reconstruction and color mapping, utilizing a texture-aware diffusion model and histogram-guided color module to outperform existing methods in both detail recovery and color accuracy.
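
To give a concrete feel for histogram-based color guidance, the sketch below performs classical per-channel histogram matching with NumPy. This illustrates the guidance signal in spirit only: ISPDiffuser uses the color histogram to guide a learned grayscale-to-sRGB mapping rather than applying direct matching like this.

```python
import numpy as np

def match_histogram(source: np.ndarray, reference: np.ndarray):
    """Per-channel histogram matching: push the colour distribution of `source`
    toward that of `reference`. Both arrays have shape (H, W, C), float values."""
    out = np.empty_like(source)
    for c in range(source.shape[-1]):
        src = source[..., c].ravel()
        ref = reference[..., c].ravel()
        src_sorted_idx = np.argsort(src)
        # Empirical CDF positions of the sorted source pixels.
        quantiles = np.linspace(0, 1, src.size)
        matched = np.interp(quantiles, np.linspace(0, 1, ref.size), np.sort(ref))
        out_c = np.empty_like(src)
        out_c[src_sorted_idx] = matched        # i-th smallest source pixel gets i-th reference quantile
        out[..., c] = out_c.reshape(source.shape[:-1])
    return out

result = match_histogram(np.random.rand(64, 64, 3), np.random.rand(64, 64, 3))
```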

Authors:Ruiyi Wang, Yushuo Zheng, Zicheng Zhang, Chunyi Li, Shuaicheng Liu, Guangtao Zhai, Xiaohong Liu
Title: Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing
Abstract:
Existing real-world image dehazing methods primarily attempt to fine-tune pre-trained models or adapt their inference procedures, thus heavily relying on the pre-trained models and associated training data. Moreover, restoring heavily distorted information under dense haze requires generative diffusion models, whose potential in dehazing remains underutilized partly due to their lengthy sampling processes. To address these limitations, we introduce a novel hazing-dehazing pipeline consisting of a Realistic Hazy Image Generation framework (HazeGen) and a Diffusion-based Dehazing framework (DiffDehaze). Specifically, HazeGen harnesses robust generative diffusion priors of real-world hazy images embedded in a pre-trained text-to-image diffusion model. By employing specialized hybrid training and blended sampling strategies, HazeGen produces realistic and diverse hazy images as high-quality training data for DiffDehaze. To alleviate the inefficiency and fidelity concerns associated with diffusion-based methods, DiffDehaze adopts an Accelerated Fidelity-Preserving Sampling process (AccSamp). The core of AccSamp is the Tiled Statistical Alignment Operation (AlignOp), which can provide a clean and faithful dehazing estimate within a small fraction of sampling steps to reduce complexity and enable effective fidelity guidance. Extensive experiments demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The code is available at https://github.com/ruiyi-w/Learning-Hazing-to-Dehazing.
中文摘要:作者提出了一种新颖的雾化-去雾化流程,包含真实雾图生成框架和加速扩散去雾方法,在去雾性能和视觉质量上均优于现有方法。
English Summary: The authors propose a novel hazing-dehazing pipeline featuring a realistic hazy image generation framework and an accelerated diffusion-based dehazing method that outperforms existing approaches in both performance and visual quality.
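
The abstract's Tiled Statistical Alignment Operation (AlignOp) aligns local statistics between a dehazing estimate and a guidance image. The sketch below shows one plausible reading of tile-wise mean/std alignment; the actual AlignOp runs inside the diffusion sampling loop and may differ in which image provides the statistics and how tiles overlap, so treat this as an illustrative assumption.

```python
import numpy as np

def tiled_statistical_alignment(estimate: np.ndarray, guide: np.ndarray, tile=32, eps=1e-6):
    """Renormalise each tile of `estimate` to the mean/std of the matching tile in `guide`.

    estimate, guide: arrays of shape (H, W, C) with the same spatial size.
    """
    out = estimate.copy()
    H, W = estimate.shape[:2]
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            est = estimate[y:y + tile, x:x + tile]
            ref = guide[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] = (est - est.mean()) / (est.std() + eps) * ref.std() + ref.mean()
    return out

aligned = tiled_statistical_alignment(np.random.rand(128, 128, 3), np.random.rand(128, 128, 3))
```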

Authors:Xinpeng Liu, Zeyi Huang, Fumio Okura, Yasuyuki Matsushita
Title: HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting
Abstract:
Novel view synthesis has demonstrated impressive progress recently, with 3D Gaussian splatting (3DGS) offering efficient training time and photorealistic real-time rendering. However, reliance on Cartesian coordinates limits 3DGS's performance on distant objects, which is important for reconstructing unbounded outdoor environments. We found that, despite its simplicity, using homogeneous coordinates, a concept from projective geometry, in the 3DGS pipeline markedly improves the rendering accuracy of distant objects. We therefore propose Homogeneous Gaussian Splatting (HoGS), incorporating homogeneous coordinates into the 3DGS framework and providing a unified representation for enhancing near and distant objects. HoGS effectively manages both expansive spatial positions and scales, particularly in unbounded outdoor environments, by adopting projective geometry principles. Experiments show that HoGS significantly enhances accuracy in reconstructing distant objects while maintaining high-quality rendering of nearby objects, along with fast training speed and real-time rendering capability. Our implementations are available on our project page https://kh129.github.io/hogs/.
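
The core representational idea is that homogeneous coordinates let far-away content be encoded with bounded numbers by shrinking the w component. The snippet below only illustrates that coordinate convention, not the HoGS training or rendering pipeline.

```python
import numpy as np

def to_homogeneous(xyz: np.ndarray):
    """Cartesian -> homogeneous coordinates (append w = 1)."""
    w = np.ones((*xyz.shape[:-1], 1), dtype=xyz.dtype)
    return np.concatenate([xyz, w], axis=-1)

def to_cartesian(xyzw: np.ndarray, eps=1e-8):
    """Homogeneous -> Cartesian; small |w| encodes points far from the origin."""
    return xyzw[..., :3] / (xyzw[..., 3:4] + eps)

# A distant point can be represented with bounded numbers by shrinking w:
near = np.array([[1.0, 2.0, 3.0, 1.0]])    # ordinary point
far = np.array([[1.0, 2.0, 3.0, 1e-3]])    # same direction, roughly 1000x farther away
print(to_cartesian(near), to_cartesian(far))
```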

Authors:Rong Wang, Fabian Prada, Ziyan Wang, Zhongshi Jiang, Chengxiang Yin, Junxuan Li, Shunsuke Saito, Igor Santesteban, Javier Romero, Rohan Joshi, Hongdong Li, Jason Saragih, Yaser Sheikh
Title: FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images
Abstract:
We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar preserving person-specific identities. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-art methods, and can be directly generalized to inputs from casually taken phone photos. Project page and code are available at https://github.com/rongakowang/FRESA.
中文: 本文提出了一种从少量图像重建个性化3D人体化身的创新方法,通过从多样化人体数据学习通用先验,实现即时生成和零样本泛化,并通过联合推断化身形状、蒙皮权重和变形来提高几何保真度并减少伪影。
English: This paper introduces a novel method for reconstructing personalized 3D human avatars from few images using a universal prior learned from diverse human data, enabling instant generation and zero-shot generalization while improving geometric fidelity and reducing artifacts through joint inference of avatar shape, skinning weights, and deformations.

Authors:Sara Al-Emadi, Yin Yang, Ferda Ofli
Title: Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery
Abstract:
Object detectors have achieved remarkable performance in many applications; however, these deep learning models are typically designed under the i.i.d. assumption, meaning they are trained and evaluated on data sampled from the same (source) distribution. In real-world deployment, however, target distributions often differ from source data, leading to substantial performance degradation. Domain Generalisation (DG) seeks to bridge this gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training, enhancing robustness to unseen conditions. In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. Despite the need, a standardised benchmark dataset specifically designed for assessing object detection under realistic DG scenarios is currently lacking. To address this, we introduce Real-World Distribution Shifts (RWDS), a suite of three novel DG benchmarking datasets that focus on humanitarian and climate change applications. These datasets enable the investigation of domain shifts across (i) climate zones and (ii) various disasters and geographic regions. To our knowledge, these are the first DG benchmarking datasets tailored for object detection in real-world, high-impact contexts. We aim for these datasets to serve as valuable resources for evaluating the robustness and generalisation of future object detection models. Our datasets and code are available at https://github.com/RWGAI/RWDS.
中文: 本文提出了Real-World Distribution Shifts (RWDS)基准数据集,用于评估物体检测器在空间域偏移下的泛化能力,填补了在现实世界人道主义和气候变化场景中缺乏标准化领域泛化评估资源的空白。
English: This paper introduces the Real-World Distribution Shifts (RWDS) benchmark datasets to evaluate object detectors' generalization under spatial domain shifts, addressing the lack of standardized resources for real-world domain generalization scenarios in humanitarian and climate change contexts.

Authors:Maria Larchenko, Alexander Lobashev, Dmitry Guskov, Vladimir Vladimirovich Palyulin
Title: Color Transfer with Modulated Flows
Abstract:
In this work, we introduce Modulated Flows (ModFlows), a novel approach for color transfer between images based on rectified flows. The primary goal of color transfer is to adjust the colors of a target image to match the color distribution of a reference image. Our technique is based on optimal transport and executes color transfer as an invertible transformation within the RGB color space. ModFlows utilizes the bijective property of flows, enabling us to introduce a common intermediate color distribution and build a dataset of rectified flows. We train an encoder on this dataset to predict the weights of a rectified model for new images. After training on a set of optimal transport plans, our approach can generate plans for new pairs of distributions without additional fine-tuning. We additionally show that the trained encoder provides an image embedding associated only with its color style. The presented method is capable of processing 4K images and achieves state-of-the-art performance in terms of content and style similarity. Our source code is available at https://github.com/maria-larchenko/modflows
中文: 本文提出调制流(ModFlows),一种基于整流流和最优传输的可逆色彩迁移方法,能匹配图像间的色彩分布,无需对新图像对微调即可实现最先进的性能。
English: This paper presents Modulated Flows (ModFlows), an invertible color transfer method using rectified flows and optimal transport to match color distributions between images, achieving state-of-the-art performance without fine-tuning for new image pairs.

Authors:Alexander Lobashev, Maria Larchenko, Dmitry Guskov
Title: Color Conditional Generation with Sliced Wasserstein Guidance
Abstract:
We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at https://github.com/alobashev/sw-guidance/.
中文: SW-Guidance是一种无需训练的生成方法,通过改进扩散模型的采样过程来匹配参考图像的色彩分布,在保持语义连贯性的同时实现了比现有技术更优的色彩还原效果。
English: SW-Guidance is a training-free method that enhances image generation by aligning the color distribution of generated images with a reference palette through modified diffusion sampling, achieving superior color fidelity and semantic coherence compared to existing techniques.
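
The guidance signal in the abstract is a differentiable Sliced 1-Wasserstein distance between the generated image's colors and the reference palette. Below is a minimal sketch of that distance on two color point clouds; the number of projections is an arbitrary illustrative choice, and the full method additionally folds this term into the diffusion sampling update, which is not shown.

```python
import torch

def sliced_wasserstein_1(colors_a: torch.Tensor, colors_b: torch.Tensor, num_projections=64):
    """Sliced 1-Wasserstein distance between colour point clouds of shape (N, 3) and (M, 3)."""
    directions = torch.randn(num_projections, 3, device=colors_a.device)
    directions = directions / directions.norm(dim=1, keepdim=True)
    proj_a = colors_a @ directions.T                      # (N, P) 1-D projections
    proj_b = colors_b @ directions.T                      # (M, P)
    # 1-D Wasserstein-1 via matched quantiles (handles N != M).
    n = min(proj_a.shape[0], proj_b.shape[0])
    qs = torch.linspace(0, 1, n, device=colors_a.device)
    qa = torch.quantile(proj_a, qs, dim=0)                # (n, P)
    qb = torch.quantile(proj_b, qs, dim=0)
    return (qa - qb).abs().mean()

d = sliced_wasserstein_1(torch.rand(4096, 3), torch.rand(2048, 3))
```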

Authors:Lingyan Ran, Lidong Wang, Guangcong Wang, Peng Wang, Yanning Zhang
Title: DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding
Abstract:
The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis challenge, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to transition from full-range to target-wavelength infrared imagery. To improve V2IR translation, VLUM incorporates unified Vision-Language Understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled from various scenes and objects under diverse environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves the performance of V2IR. Experiments validate DiffV2IR's excellence in producing high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at https://github.com/LidongWang-26/DiffV2IR.
中文:DiffV2IR框架通过渐进学习模块和视觉语言理解模块解决了可见光到红外图像转换的难题,并借助IR-500K数据集显著提升了转换质量和应用范围。
English: The DiffV2IR framework introduces a Progressive Learning Module and Vision-Language Understanding Module to overcome challenges in visible-to-infrared image translation, supported by the new IR-500K dataset, significantly enhancing translation quality and applicability.

Authors:Kangwei Liu, Junwu Liu, Yun Cao, Jinlin Guo, Xiaowei Yi
Title: DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model
Abstract:
Recent advances in talking face generation have significantly improved facial animation synthesis. However, existing approaches face fundamental limitations: 3DMM-based methods maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based methods enable spatial manipulation but suffer from temporal inconsistencies. The integration of these approaches is hindered by incompatible control mechanisms and semantic entanglement of facial representations. This paper presents DisentTalk, introducing a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. Building upon this disentangled representation, we develop a hierarchical latent diffusion architecture that operates in 3DMM parameter space, integrating region-aware attention mechanisms to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset. Extensive experiments show superior performance over existing methods across multiple metrics, including lip synchronization, expression quality, and temporal consistency. Project Page: https://kangweiiliu.github.io/DisentTalk.

Authors:Haoliang Shang, Hanyu Wu, Guangyao Zhai, Boyang Sun, Fangjinhua Wang, Federico Tombari, Marc Pollefeys
Title: SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation
Abstract:
Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs -- whether by adding nodes or modifying edges -- remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node's relationships with all others are computationally intractable, as even a single edge modification can trigger conflicts due to the intricate interdependencies within the graph. To address these challenges, we introduce SG-Tailor, an autoregressive model that predicts the conflict-free relationship between any two nodes. SG-Tailor not only infers inter-object relationships, including generating commonsense edges for newly added nodes but also resolves conflicts arising from edge modifications to produce coherent, manipulated graphs for downstream tasks. For node addition, the model queries the target node and other nodes from the graph to predict the appropriate relationships. For edge modification, SG-Tailor employs a Cut-And-Stitch strategy to solve the conflicts and globally adjust the graph. Extensive experiments demonstrate that SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation tasks.
中文: SG-Tailor是一种自回归模型,能预测场景图中节点间无冲突的关系,通过节点添加和边修改实现连贯的图操作,并在性能上大幅超越现有方法。
English: SG-Tailor is an autoregressive model that predicts conflict-free relationships between nodes in scene graphs, enabling coherent graph manipulation through node addition and edge modification while outperforming other methods significantly.

Authors:Taeksoo Kim, Hanbyul Joo
Title: Target-Aware Video Diffusion Models
Abstract:
We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.

Authors:Ruixiao Dong, Mengde Xu, Zigang Geng, Li Li, Han Hu, Shuyang Gu
Title: Equivariant Image Modeling
Abstract:
Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.
中文摘要:本文提出了一种等变图像建模框架,通过利用平移不变性解决生成模型中子任务的优化冲突,在ImageNet图像生成任务上以更少计算资源实现了与先进自回归模型相当的性能。
English Summary: This paper introduces an equivariant image modeling framework that resolves optimization conflicts in generative models by leveraging translational symmetry, achieving state-of-the-art performance on ImageNet generation with greater computational efficiency.

Authors:Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
Title: Video-T1: Test-Time Scaling for Video Generation
Abstract:
With the scale capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem to sample better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the search process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising of all frames simultaneously requires heavy test-time computation, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: https://liuff19.github.io/Video-T1
中文: 本研究探索了视频生成中的测试时扩展方法,通过线性搜索和树状帧结构等策略优化推理计算,实验证明增加测试时计算能显著提升视频生成质量。
English: This study explores Test-Time Scaling (TTS) for video generation, introducing methods like linear search and Tree-of-Frames (ToF) to enhance video quality by optimizing inference-time computation, with experiments confirming significant improvements from increased test-time compute.
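For intuition, here is a minimal sketch of the linear search variant of test-time scaling described above: several initial noises are denoised independently and a verifier picks the best trajectory. `generate_video` and `verifier_score` are hypothetical stand-ins for a video diffusion sampler and a test-time verifier; the tensor shape is illustrative.

```python
# Minimal sketch of linear-search test-time scaling for video generation.
import torch

def linear_search_tts(prompt: str, n_candidates: int, generate_video, verifier_score):
    best_video, best_score = None, float("-inf")
    for _ in range(n_candidates):
        # Each candidate starts from a fresh Gaussian noise sample.
        noise = torch.randn(1, 16, 3, 64, 64)   # (batch, frames, C, H, W), illustrative shape
        video = generate_video(prompt, noise)   # full denoising trajectory from this seed
        score = verifier_score(prompt, video)   # test-time feedback signal guiding the search
        if score > best_score:
            best_video, best_score = video, score
    return best_video
```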

Authors:Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, Bin Cui
Title: Training-free Diffusion Acceleration with Bottleneck Sampling
Abstract:
Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics.
中文摘要:Bottleneck Sampling是一种无需重新训练的高效框架,通过采用“高-低-高”去噪流程利用低分辨率先验,在保持图像和视频生成质量的同时,将推理速度分别提升至3倍和2.5倍。
English Summary: Bottleneck Sampling is a training-free framework that accelerates diffusion model inference by leveraging low-resolution priors in a high-low-high denoising workflow, achieving up to 3× speedup for images and 2.5× for videos without compromising output quality.
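A toy scheduler below illustrates the high-low-high workflow: high-resolution denoising at the start and end, low-resolution in between. The fractions and resolutions are placeholders; the actual method also refines the transition points and adaptively shifts denoising timesteps, which this sketch omits.

```python
# Minimal sketch of a high-low-high resolution schedule (values are illustrative).
def bottleneck_schedule(total_steps: int, high_res=(1024, 1024), low_res=(512, 512),
                        head: float = 0.2, tail: float = 0.2):
    """Return the working resolution for each denoising step."""
    schedule = []
    for t in range(total_steps):
        frac = t / max(total_steps - 1, 1)
        # High resolution at the start (layout) and end (detail refinement),
        # low resolution in the middle where coarse structure is denoised.
        res = high_res if (frac < head or frac > 1.0 - tail) else low_res
        schedule.append(res)
    return schedule

# Example: bottleneck_schedule(10) -> [(1024, 1024), (1024, 1024), (512, 512), ..., (1024, 1024)]
```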

Authors:Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, Chuang Gan
Title: AdaWorld: Learning Adaptable World Models with Latent Actions
Abstract:
World models aim to learn action-controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.
中文摘要:AdaWorld是一种创新的世界模型,通过自监督方式从视频中提取潜在动作,能够在有限数据和微调下高效适应新环境。
English Summary: AdaWorld is an innovative world model that learns latent actions from videos in a self-supervised way, enabling efficient adaptation to new environments with limited data and fine-tuning.

Authors:Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
Title: CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Abstract:
Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks. Remarkably, CoMP-AIMv2 achieves scores of 64.9 on ChartQA with a 0.5B LLM, while maintaining an 87.3% accuracy on ImageNet-1K and a 51.8 mIoU on ADE20K under frozen chunk evaluation.
Chinese: 本文提出CoMP多模态持续预训练框架,通过持续旋转位置嵌入和跨模态对齐损失增强视觉基础模型,使其能够灵活处理不同分辨率输入并实现视觉-语言表征对齐,在多类任务中取得显著性能提升。
English: This paper introduces CoMP, a multimodal continual pre-training pipeline that enhances Vision Foundation Models by enabling flexible input resolutions and better visual-language alignment, leading to significant performance gains across various tasks.

Authors:Moussa Kassem Sbeyti, Nadja Klein, Azarm Nowzad, Fikret Sivrikaya, Sahin Albayrak
Title: Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection
Abstract:
Semi-supervised object detection (SSOD) based on pseudo-labeling significantly reduces dependence on large labeled datasets by effectively leveraging both labeled and unlabeled data. However, real-world applications of SSOD often face critical challenges, including class imbalance, label noise, and labeling errors. We present an in-depth analysis of SSOD under real-world conditions, uncovering causes of suboptimal pseudo-labeling and key trade-offs between label quality and quantity. Based on our findings, we propose four building blocks that can be seamlessly integrated into an SSOD framework. Rare Class Collage (RCC): a data augmentation method that enhances the representation of rare classes by creating collages of rare objects. Rare Class Focus (RCF): a stratified batch sampling strategy that ensures a more balanced representation of all classes during training. Ground Truth Label Correction (GLC): a label refinement method that identifies and corrects false, missing, and noisy ground truth labels by leveraging the consistency of teacher model predictions. Pseudo-Label Selection (PLS): a selection method for removing low-quality pseudo-labeled images, guided by a novel metric estimating the missing detection rate while accounting for class rarity. We validate our methods through comprehensive experiments on autonomous driving datasets, resulting in up to a 6% increase in SSOD performance. Overall, our investigation and novel, data-centric, and broadly applicable building blocks enable robust and effective SSOD in complex, real-world scenarios. Code is available at https://mos-ks.github.io/publications.

Authors:Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, Ziwei Liu
Title: CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models
Abstract:
Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)
中文: 本研究分析了流匹配模型中的无分类器引导技术,发现其在训练初期会误导样本轨迹,并提出CFG-Zero*方法,通过优化缩放和零初始化来修正估计误差,在图像和视频生成任务中均优于原方法。
English: This study analyzes Classifier-Free Guidance (CFG) in flow matching models, revealing its tendency to misdirect samples during early training and proposing CFG-Zero* with optimized scaling and zero-initialization to correct inaccuracies, consistently outperforming CFG in image and video generation tasks.
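The sketch below illustrates one plausible reading of the two contributions, hedged: the optimized scale is taken as the least-squares projection coefficient of the conditional velocity onto the unconditional one, and zero-init is realized by returning a zero velocity for the first solver step(s). Names and defaults are illustrative rather than the authors' code.

```python
# Minimal sketch of CFG with an optimized scale and zero-init (assumptions noted above).
import torch

def cfg_zero_star(v_cond, v_uncond, step, guidance_scale=5.0, zero_init_steps=1):
    if step < zero_init_steps:
        # (b) zero-init: skip guidance entirely on the first ODE step(s).
        return torch.zeros_like(v_cond)
    # (a) optimized scale: rescale the unconditional velocity so it best matches
    # the conditional one in the least-squares sense before applying guidance.
    s_opt = (v_cond * v_uncond).sum() / (v_uncond * v_uncond).sum().clamp_min(1e-8)
    return s_opt * v_uncond + guidance_scale * (v_cond - s_opt * v_uncond)
```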

Authors:Yanda Chen, Gongwei Chen, Miao Zhang, Weili Guan, Liqiang Nie
Title: Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
Abstract:
Dataset distillation (DD) excels in synthesizing a small number of images per class (IPC) but struggles to maintain its effectiveness in high-IPC settings. Recent works on dataset distillation demonstrate that combining distilled and real data can mitigate the effectiveness decay. However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum selection framework for real data selection, where we leverage a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum. Extensive experiments validate CCFS, surpassing the state-of-the-art by +6.6\% on CIFAR-10, +5.8\% on CIFAR-100, and +3.4\% on Tiny-ImageNet under high-IPC settings. Notably, CCFS achieves 60.2\% test accuracy on ResNet-18 with a 20\% compression ratio of Tiny-ImageNet, closely matching full-dataset training with only 0.3\% degradation. Code: https://github.com/CYDaaa30/CCFS.
中文摘要:数据集蒸馏在高IPC设置下面临合成与真实图像不兼容的问题,而提出的CCFS方法通过课程式由粗到细的选择策略有效解决了该问题,在多个数据集上实现了最先进的性能。
English Summary: Dataset distillation faces challenges in high-IPC settings due to incompatibility between distilled and real images, which the proposed CCFS method addresses through a curriculum coarse-to-fine selection strategy, achieving state-of-the-art performance across multiple datasets.

Authors:Zunnan Xu, Zhentao Yu, Zixiang Zhou, Jun Zhou, Xiaoyu Jin, Fa-Ting Hong, Xiaozhong Ji, Junwei Zhu, Chengfei Cai, Shiyu Tang, Qin Lin, Xiu Li, Qinglin Lu
Title: HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation
Abstract:
We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. Given a single portrait image as an appearance reference and video clips as driving templates, HunyuanPortrait can animate the character in the reference image with the facial expressions and head poses of the driving videos. In our framework, we utilize pre-trained encoders to achieve the decoupling of portrait motion information and identity in videos. To do so, an implicit representation is adopted to encode motion information and is employed as the control signal in the animation phase. By leveraging the power of stable video diffusion as the main building block, we carefully design adapter layers to inject control signals into the denoising UNet through attention mechanisms. This design brings rich spatial detail and temporal consistency. HunyuanPortrait also exhibits strong generalization performance, effectively disentangling appearance and motion across different image styles. Our framework outperforms existing methods, demonstrating superior temporal consistency and controllability. Our project is available at https://kkakkkka.github.io/HunyuanPortrait.

Authors:Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
Title: MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Abstract:
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.
中文: 本文提出首个多概念个性化框架MC-LLaVA,通过多概念指令微调和个性化提示策略有效整合多个用户概念,并构建高质量数据集显著提升了视觉语言模型在现实场景中的适用性。
English: This paper introduces MC-LLaVA, the first multi-concept personalization framework for vision-language models that integrates multiple user concepts through innovative instruction tuning and personalized prompts, significantly enhancing real-world applicability with a new high-quality dataset.

Authors:Yulong Zheng, Zicheng Jiang, Shengfeng He, Yandu Sun, Junyu Dong, Huaidong Zhang, Yong Du
Title: NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting
Abstract:
Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have noticeably advanced photo-realistic novel view synthesis using images from densely spaced camera viewpoints. However, these methods struggle in few-shot scenarios due to limited supervision. In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations. Exploiting the inherent epipolar geometry of 3DGS, our method introduces a novel point cloud densification strategy that initializes 3DGS with a dense point cloud, reducing randomness in point placement while preventing over-smoothing and overfitting. Specifically, NexusGS comprises three key steps: Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning. These steps leverage optical flow and camera poses to compute accurate depth maps, while mitigating the inaccuracies often associated with optical flow. By incorporating epipolar depth priors, NexusGS ensures reliable dense point cloud coverage and supports stable 3DGS training under sparse-view conditions. Experiments demonstrate that NexusGS significantly enhances depth accuracy and rendering quality, surpassing state-of-the-art methods by a considerable margin. Furthermore, we validate the superiority of our generated point clouds by substantially boosting the performance of competing methods. Project page: https://usmizuki.github.io/NexusGS/.

Authors:Linwei Chen, Lin Gu, Liang Li, Chenggang Yan, Ying Fu
Title: Frequency Dynamic Convolution for Dense Image Prediction
Abstract:
While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt, Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is made publicly available at https://github.com/Linwei-Chen/FDConv.
中文: FDConv通过频域分组和动态调制技术,以少量参数增长实现了卷积权重的频率多样性,在多种视觉任务中性能卓越,显著优于需大幅增加参数的传统方法。
English: FDConv introduces frequency-based grouping and dynamic modulation techniques to create diverse convolutional weights efficiently, achieving superior performance in vision tasks with minimal parameter increase compared to previous methods.

Authors:Nathan Darjana, Ryo Fujii, Hideo Saito, Hiroki Kajita
Title: EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos
Abstract:
Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.
Chinese: EgoSurgery-HTS数据集为开放手术视频中的手术工具和手部提供了像素级标注,显著提升了分割精度,有助于深入分析手术操作。
English: The EgoSurgery-HTS dataset provides pixel-level annotations for surgical tools and hands in egocentric open-surgery videos, significantly improving segmentation accuracy and enabling detailed analysis of surgical actions.

Authors:Sebastian Tewes, Yufan Chen, Omar Moured, Jiaming Zhang, Rainer Stiefelhagen
Title: SFDLA: Source-Free Document Layout Analysis
Abstract:
Document Layout Analysis (DLA) is a fundamental task in document understanding. However, existing DLA and adaptation methods often require access to large-scale source data and target labels. These requirements severely limit their real-world applicability, particularly in privacy-sensitive and resource-constrained domains, such as financial statements, medical records, and proprietary business documents. According to our observation, directly transferring source-domain fine-tuned models to target domains often results in a significant performance drop (Avg. -32.64%). In this work, we introduce Source-Free Document Layout Analysis (SFDLA), which aims to adapt pre-trained source DLA models to an unlabeled target domain without access to any source data. To address this challenge, we establish the first SFDLA benchmark, covering three major DLA datasets for geometric- and content-aware adaptation. Furthermore, we propose Document Layout Analysis Adapter (DLAdapter), a novel framework designed to improve source-free adaptation across document domains. Our method achieves a +4.21% improvement over the source-only baseline and a +2.26% gain over existing source-free methods from PubLayNet to DocLayNet. We believe this work will inspire the DLA community to further investigate source-free document understanding. To support future research of the community, the benchmark, models, and code will be publicly available at https://github.com/s3setewe/sfdla-DLAdapter.
Chinese: 本文提出了源自由文档布局分析(SFDLA)方法,能够在无需源数据的情况下将预训练模型适配到未标注目标域,相比现有方法取得了性能提升。
English: This paper introduces Source-Free Document Layout Analysis (SFDLA), a method that adapts pre-trained models to unlabeled target domains without requiring source data, achieving performance improvements over existing approaches.

Authors:Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis
Title: LLaVAction: evaluating and training multi-modal large language models for action recognition
Abstract:
Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
Chinese: 本研究通过将EPIC-KITCHENS-100数据集重构为视频多选题形式,提出了一系列改进多模态大语言模型的方法,在多个行为理解基准测试中实现了最先进的性能表现。
English: This study enhances multi-modal large language models (MLLMs) for action recognition by reformulating the EPIC-KITCHENS-100 dataset into a video multiple-choice format and introducing methods that achieve state-of-the-art performance across multiple benchmarks.

Authors:Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li
Title: Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark
Abstract:
The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-Wise Multi-Dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can choose better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The project is available at https://github.com/antgroup/Similar.
中文:提出的Similar模型通过提供细粒度、多维度的奖励信号,改进了通用虚拟代理的训练与推理,利用自动数据标注和新评估基准克服了现有方法的局限。
English: The proposed Similar model provides fine-grained, multi-dimensional rewards to enhance Generalist Virtual Agents' training and inference, overcoming limitations of current methods through automated data annotation and a new benchmark for evaluation.

Authors:Arne Grobrügge, Niklas Kühl, Gerhard Satzger, Philipp Spitzer
Title: Towards Human-Understandable Multi-Dimensional Concept Discovery
Abstract:
Concept-based eXplainable AI (C-XAI) aims to overcome the limitations of traditional saliency maps by converting pixels into human-understandable concepts that are consistent across an entire dataset. A crucial aspect of C-XAI is completeness, which measures how well a set of concepts explains a model's decisions. Among C-XAI methods, Multi-Dimensional Concept Discovery (MCD) effectively improves completeness by breaking down the CNN latent space into distinct and interpretable concept subspaces. However, MCD's explanations can be difficult for humans to understand, raising concerns about their practical utility. To address this, we propose Human-Understandable Multi-dimensional Concept Discovery (HU-MCD). HU-MCD uses the Segment Anything Model for concept identification and implements a CNN-specific input masking technique to reduce noise introduced by traditional masking methods. These changes to MCD, paired with the completeness relation, enable HU-MCD to enhance concept understandability while maintaining explanation faithfulness. Our experiments, including human subject studies, show that HU-MCD provides more precise and reliable explanations than existing C-XAI methods. The code is available at https://github.com/grobruegge/hu-mcd.
中文:HU-MCD通过基于SAM的概念识别和针对CNN的掩码技术,在保持解释忠实度的同时提升了概念可理解性,实验证明其比现有方法提供更精确可靠的可解释AI方案。
English: HU-MCD enhances concept-based explainable AI by improving human understandability through SAM-based concept identification and CNN-specific masking, while maintaining faithfulness and outperforming existing methods in precision and reliability.

Authors:Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
Title: AMD-Hummingbird: Towards an Efficient Text-to-Video Model
Abstract:
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
中文:提出的Hummingbird框架通过精简模型参数和引入视觉反馈学习,显著提升了文本到视频生成的效率,在保持高质量输出的同时实现31倍加速,并能支持更长视频序列的生成。
English: The proposed Hummingbird framework significantly enhances text-to-video generation efficiency by reducing model parameters and incorporating visual feedback learning, achieving 31x faster processing while maintaining high-quality output and supporting longer video sequences.

Authors:Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang
Title: Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
Abstract:
Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
中文摘要:提出的指令对齐视觉注意力(IAVA)方法通过对比解码动态调整对无关图像标记的关注,有效缓解大型视觉语言模型中的物体幻觉问题,在多项基准测试中均展现出优越性能。
English Summary: The proposed Instruction-Aligned Visual Attention (IAVA) method mitigates object hallucinations in Large Vision-Language Models by dynamically adjusting attention to irrelevant image tokens through contrastive decoding, demonstrating superior performance across multiple benchmarks.
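The contrastive-decoding step can be pictured with a one-line adjustment of logits, sketched below under the assumption of a simple linear contrast; how the irrelevant-token logits are obtained (by comparing attention weights under two instructions) is left abstract here, and `alpha` is an illustrative hyperparameter.

```python
# Minimal sketch of a contrastive logit adjustment (assumptions noted above).
import torch

def contrastive_logits(logits_full: torch.Tensor,
                       logits_irrelevant: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    # Push probability mass away from what the irrelevant image tokens alone support.
    return (1 + alpha) * logits_full - alpha * logits_irrelevant
```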

Authors:Zihao Chen, Hsuanyu Wu, Chi-Hsi Kung, Yi-Ting Chen, Yan-Tsung Peng
Title: ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset
Abstract:
Traffic Atomic Activity, which describes traffic patterns for topological intersection dynamics, is a crucial topic for the advancement of intelligent driving systems. However, existing atomic activity datasets are collected from an egocentric view, which cannot support scenarios where traffic activities across an entire intersection are required. Moreover, existing datasets only provide video-level atomic activity annotations, which require exhaustive effort to manually trim the videos for recognition and limit their applications to untrimmed videos. To bridge this gap, we introduce the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis. We offer atomic activity labels for each frame, which accurately record the intervals for traffic activities. Moreover, we propose a novel task, Multi-label Temporal Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activities and easing the burden of manual video trimming for recognition. We conduct extensive experiments to evaluate existing state-of-the-art models on both atomic activity recognition and temporal atomic activity segmentation. The results highlight the unique challenges of our ATARS dataset, such as recognizing the activities of extremely small objects. We further provide a comprehensive discussion analyzing these challenges and offer valuable insights for future directions in improving atomic activity recognition from an aerial view. Our source code and dataset are available at https://github.com/magecliff96/ATARS/
中文摘要:ATARS数据集作为首个用于多标签原子活动分析的航拍数据集,通过提供逐帧标注实现了精确的时间定位并减少了人工视频剪辑负担,实验揭示了识别微小物体活动等独特挑战。
English Summary: The ATARS dataset is introduced as the first aerial dataset for multi-label atomic activity analysis, providing frame-level annotations to enable precise temporal localization and reduce manual video trimming, while experiments reveal challenges like recognizing small object activities.

Authors:Soulaimene Turki, Daniel Panangian, Houda Chaabouni-Chouayakh, Ksenia Bittner
Title: AIM2PC: Aerial Image to 3D Building Point Cloud Reconstruction
Abstract:
Three-dimensional urban reconstruction of buildings from single-view images has attracted significant attention over the past two decades. However, recent methods primarily focus on rooftops from aerial images, often overlooking essential geometrical details. Additionally, there is a notable lack of datasets containing complete 3D point clouds for entire buildings, along with challenges in obtaining reliable camera pose information for aerial images. This paper addresses these challenges by presenting a novel methodology, AIM2PC, which utilizes our generated dataset that includes complete 3D point clouds and determined camera poses. Our approach takes features from a single aerial image as input and concatenates them with essential additional conditions, such as binary masks and Sobel edge maps, to enable more edge-aware reconstruction. By incorporating a point cloud diffusion model based on Centered denoising Diffusion Probabilistic Models (CDPM), we project these concatenated features onto the partially denoised point cloud using our camera poses at each diffusion step. The proposed method is able to reconstruct the complete 3D building point cloud, including wall information, and demonstrates superior performance compared to existing baseline techniques. To allow further comparisons with our methodology, the dataset has been made available at https://github.com/Soulaimene/AIM2PCDataset
中文: 本文提出AIM2PC新方法,通过结合图像特征与边缘感知输入,并利用配备相机位姿的扩散模型,从单张航拍图像重建完整的三维建筑点云,其性能优于现有技术且附带公开数据集。
English: This paper introduces AIM2PC, a novel method that reconstructs complete 3D building point clouds from single aerial images by integrating image features with edge-aware inputs and utilizing a diffusion model with provided camera poses, outperforming existing techniques and accompanied by a publicly available dataset.

Authors:Xiaoyu Zhang, Weihong Pan, Chong Bao, Xiyu Zhang, Xiaojun Xiang, Hanqing Jiang, Hujun Bao
Title: LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene
Abstract:
Humans perceive and comprehend their surroundings through information spanning multiple frequencies. In immersive scenes, people naturally scan their environment to grasp its overall structure while examining fine details of objects that capture their attention. However, current NeRF frameworks primarily focus on modeling either high-frequency local views or the broad structure of scenes with low-frequency information, and struggle to balance both. We introduce FA-NeRF, a novel frequency-aware framework for view synthesis that simultaneously captures the overall scene structure and high-definition details within a single NeRF model. To achieve this, we propose a 3D frequency quantification method that analyzes the scene's frequency distribution, enabling frequency-aware rendering. Our framework incorporates a frequency grid for fast convergence and querying, and a frequency-aware feature re-weighting strategy to balance features across different frequency content. Extensive experiments show that our method significantly outperforms existing approaches in modeling entire scenes while preserving fine details. Project page: https://coscatter.github.io/LookCloser/

Authors:Luca Zanella, Massimiliano Mancini, Willi Menapace, Sergey Tulyakov, Yiming Wang, Elisa Ricci
Title: Can Text-to-Video Generation help Video-Language Alignment?
Abstract:
Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for those. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first promising step for using synthetic videos when learning video-language models.
English Summary: Recent video-language models face linguistic bias from negative captions, which SynViTA addresses by using dynamically weighted synthetic videos and a semantic consistency loss to improve alignment without relying on scarce real video data.
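A minimal sketch of the dynamic weighting idea follows: each synthetic video's loss contribution is scaled by the similarity between its target caption and the real counterpart caption. The cosine-similarity weighting and clamping are assumptions for illustration, not the paper's exact scheme.

```python
# Minimal sketch of similarity-weighted loss aggregation for synthetic videos.
import torch
import torch.nn.functional as F

def weighted_alignment_loss(per_video_losses: torch.Tensor,
                            target_caption_embs: torch.Tensor,
                            real_caption_embs: torch.Tensor) -> torch.Tensor:
    """per_video_losses: (N,); caption embeddings: (N, D) each."""
    # How close is each synthetic video's target caption to the real caption?
    sims = F.cosine_similarity(target_caption_embs, real_caption_embs, dim=-1)  # (N,)
    weights = sims.clamp(min=0.0)          # down-weight (or ignore) dissimilar captions
    return (weights * per_video_losses).sum() / weights.sum().clamp_min(1e-8)
```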

Authors:Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
Title: PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model
Abstract:
Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench.
中文: PM4Bench作为首个并行多语言多模态基准,通过包含10种语言的平行语料库、融合视觉与文本的任务及安全性评估,解决了现有大型视觉语言模型评测的不足,并揭示了与OCR能力相关的性能差异。
English: PM4Bench is introduced as the first parallel multilingual multi-modal benchmark addressing limitations in existing LVLM evaluations by featuring a 10-language parallel corpus, integrated vision-text tasks, and safety assessments, revealing performance disparities tied to OCR capabilities.

Authors:Zequn Zeng, Yudi Su, Jianqiao Sun, Tiansheng Wen, Hao Zhang, Zhengjue Wang, Bo Chen, Hongwei Liu, Jiawei Ma
Title: Explaining Domain Shifts in Language: Concept erasing for Interpretable Image Classification
Abstract:
Concept-based models can map black-box representations to human-understandable concepts, which makes the decision-making process more transparent and allows users to understand the reasons behind predictions. However, domain-specific concepts often impact the final predictions, which subsequently undermines the model's generalization capability and prevents it from being used in high-stakes applications. In this paper, we propose a novel Language-guided Concept-Erasing (LanCE) framework. In particular, we empirically demonstrate that pre-trained vision-language models (VLMs) can approximate distinct visual domain shifts via domain descriptors, while prompting large language models (LLMs) can easily simulate a wide range of descriptors of unseen visual domains. Then, we introduce a novel plug-in domain descriptor orthogonality (DDO) regularizer to mitigate the impact of these domain-specific concepts on the final predictions. Notably, the DDO regularizer is agnostic to the design of concept-based models, and we integrate it into several prevailing models. Through evaluation of domain generalization on four standard benchmarks and three newly introduced benchmarks, we demonstrate that DDO can significantly improve out-of-distribution (OOD) generalization over previous state-of-the-art concept-based models. Our code is available at https://github.com/joeyz0z/LanCE.
中文:提出的LanCE框架利用视觉语言模型和大语言模型识别领域特定概念,并引入描述符正交性正则化器,通过减少这些概念对预测的影响来提升模型的泛化能力。
English: The proposed LanCE framework uses vision-language models and large language models to identify domain-specific concepts and introduces a descriptor orthogonality regularizer to enhance model generalization by reducing their impact on predictions.

Authors:Wei Deng, Mengshi Qi, Huadong Ma
Title: Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Abstract:
Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e., room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of the problem space. To leverage the VLM to produce positions of objects, we discretize the top-down view space as a dense grid and fill each cell with a distinct emoji to make the cells distinguishable. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with the name of an emoji. Quantitative and qualitative experimental results illustrate that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen.
中文: 本文提出一种新颖的全局-局部树搜索算法,通过将空间规划分解为层次化结构并利用表情符号网格提示,借助视觉语言模型生成合理的3D室内场景。
English: This paper introduces a novel global-local tree search algorithm that leverages Vision-Language Models to generate plausible 3D indoor scenes by decomposing spatial planning into hierarchical levels and using emoji-grid prompts for object placement.
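To make the emoji-grid prompting concrete, the sketch below builds a discretized top-down grid where every cell is named by a distinct emoji, so the VLM can answer with a single emoji that maps back to a cell coordinate. The emoji list and prompt wording are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of an emoji-grid prompt for object placement.
EMOJIS = ["😀", "🐶", "🍎", "🚗", "🌳", "⚽", "🎲", "🎈", "🔧",
          "🐱", "🍋", "🚀", "🌵", "🏀", "🎹", "🧊"]

def build_emoji_grid(rows: int, cols: int):
    assert rows * cols <= len(EMOJIS), "need one distinct emoji per cell"
    grid_lines, cell_of = [], {}
    for r in range(rows):
        row = []
        for c in range(cols):
            e = EMOJIS[r * cols + c]
            row.append(e)
            cell_of[e] = (r, c)            # emoji -> top-down cell coordinate
        grid_lines.append(" ".join(row))
    prompt = ("Top-down view of the region, one emoji per cell:\n"
              + "\n".join(grid_lines)
              + "\nReply with the single emoji of the cell where the object should go.")
    return prompt, cell_of

# Usage: prompt, cell_of = build_emoji_grid(3, 3); the VLM's emoji answer is
# looked up in cell_of to recover a grid position.
```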

Authors:Zhenyu Pan, Han Liu
Title: MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
Abstract:
We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process, where the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, improving scene coherence progressively. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of various scale models. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications. Our code, data, and training pipeline are publicly available at https://github.com/PzySeere/MetaSpatial.
Chinese: MetaSpatial提出了首个基于强化学习的框架,通过多轮迭代优化增强视觉语言模型的3D空间推理能力,无需真实标注即可实时生成物理合理且美观的3D场景布局。
English: MetaSpatial introduces the first reinforcement learning framework that enhances 3D spatial reasoning in vision-language models, enabling real-time generation of physically coherent 3D scenes through iterative optimization without requiring ground truth annotations.

Authors:Sixian Ding, Xu Jiang, Zhongjing Du, Jiaqi Cui, Xinyi Zeng, Yan Wang
Title: SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition
Abstract:
Semi-supervised deep facial expression recognition (SS-DFER) has gained increasing research interest due to the difficulty of accessing sufficient labeled data in practical settings. However, existing SS-DFER methods mainly utilize generated semantic-level pseudo-labels for supervised learning, whose unreliability compromises their performance and undermines their practical utility. In this paper, we propose a novel SS-DFER framework that simultaneously incorporates semantic-, instance-, and text-level information to generate high-quality pseudo-labels. Specifically, for the unlabeled data, considering the comprehensive knowledge within the textual descriptions and instance representations, we respectively calculate the similarities between the facial vision features and the corresponding textual and instance features to obtain the probabilities at the text and instance levels. Combined with the semantic-level probability, these three-level probabilities are elaborately aggregated to obtain the final pseudo-labels. Furthermore, to enhance the utilization of one-hot labels for the labeled data, we also incorporate text embeddings excavated from textual descriptions to co-supervise model training, enabling facial visual features to exhibit semantic correlations in the text space. Experiments on three datasets demonstrate that our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines. The code will be available at https://github.com/PatrickStarL/SIT-FER.
中文: 本文提出了一种新颖的半监督深度面部表情识别框架,通过融合语义、实例和文本级信息生成高质量伪标签,并利用文本嵌入增强训练效果,在多个数据集上显著优于现有方法甚至全监督基线。
English: This paper introduces a novel semi-supervised deep facial expression recognition framework that integrates semantic, instance, and text-level information to generate reliable pseudo-labels and leverages text embeddings to enhance training, achieving superior performance over existing methods and fully supervised baselines.
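A hedged sketch of the three-level pseudo-label aggregation described above: text- and instance-level probabilities come from cosine similarities against class text embeddings and instance prototypes, and are mixed with the classifier's semantic-level probability. The weights and temperature are illustrative placeholders, not the paper's values.

```python
# Illustrative three-level pseudo-label aggregation; tensors are assumed precomputed.
import torch
import torch.nn.functional as F

def pseudo_labels(vis_feat, text_emb, inst_proto, sem_logits, tau=0.07, w=(1/3, 1/3, 1/3)):
    """vis_feat: (N, D); text_emb, inst_proto: (C, D); sem_logits: (N, C)."""
    v = F.normalize(vis_feat, dim=-1)
    p_text = F.softmax(v @ F.normalize(text_emb, dim=-1).T / tau, dim=-1)    # text-level
    p_inst = F.softmax(v @ F.normalize(inst_proto, dim=-1).T / tau, dim=-1)  # instance-level
    p_sem  = F.softmax(sem_logits, dim=-1)                                   # semantic-level
    p = w[0] * p_sem + w[1] * p_inst + w[2] * p_text                         # aggregate
    conf, labels = p.max(dim=-1)
    return labels, conf   # keep only sufficiently confident pseudo-labels downstream
```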

Authors:Jinho Jeong, Sangmin Han, Jinwoo Kim, Seon Joo Kim
Title: Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
Abstract:
In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address the issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness. The code is available at https://github.com/3587jjh/LSRNA.
中文: LSRNA提出了一种结合潜在空间超分辨率和区域噪声添加的新框架,能够在超过1K分辨率下生成高质量图像,同时保持结构完整性并增强细节。
English: LSRNA introduces a latent space super-resolution framework combined with region-wise noise addition to enable high-quality image generation beyond 1K resolution while preserving structural integrity and enhancing details.
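A hedged sketch of the region-wise noise idea: inject noise more strongly in high-frequency regions of the upsampled latent to restore detail. The Laplacian edge heuristic and the strength value are illustrative assumptions, not LSRNA's exact formulation.

```python
# Detail-weighted noise injection on an upsampled latent (illustrative only).
import torch
import torch.nn.functional as F

def region_wise_noise(latent, strength=0.3):
    """latent: (B, C, H, W) upsampled latent; returns latent with detail-weighted noise."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                     device=latent.device, dtype=latent.dtype)
    k = k.view(1, 1, 3, 3).repeat(latent.shape[1], 1, 1, 1)          # depthwise Laplacian
    edges = F.conv2d(latent, k, padding=1, groups=latent.shape[1]).abs()
    detail = edges.mean(dim=1, keepdim=True)                          # per-pixel "detail" map
    detail = detail / (detail.amax(dim=(-2, -1), keepdim=True) + 1e-8)
    return latent + strength * detail * torch.randn_like(latent)
```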

Authors:Chenfei Liao, Kaiyu Lei, Xu Zheng, Junha Moon, Zhixiong Wang, Yixuan Wang, Danda Pani Paudel, Luc Van Gool, Xuming Hu
Title: Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness
Abstract:
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics, $mIoU^{Avg}_{EMM}$, $mIoU^{E}_{EMM}$, $mIoU^{Avg}_{RMM}$, and $mIoU^{E}_{RMM}$, to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at https://github.com/Chenfei-Liao/Multi-Modal-Semantic-Segmentation-Robustness-Benchmark.
Chinese: 本文首次为多模态语义分割(MMSS)建立了专门的鲁棒性基准,通过评估模型在模态缺失和噪声情况下的表现,旨在缩小研究与实际应用之间的差距。
English: This paper introduces the first dedicated robustness benchmark for multi-modal semantic segmentation (MMSS), evaluating models under missing and noisy modality scenarios to bridge the gap between research and real-world deployment.
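A sketch of the two failure models described in the abstract, assuming a hypothetical `evaluate_miou(model, dataset, available)` that returns mIoU given the surviving modalities. The metric names follow the abstract; the code itself is illustrative, not the benchmark's implementation.

```python
# Illustrative EMM/RMM robustness aggregation; `evaluate_miou` is a hypothetical callback.
import itertools
import random

def miou_emm_avg(evaluate_miou, model, dataset, modalities):
    """Entire-Missing Modality: average mIoU over all damaged combinations,
    treating each non-empty combination of missing modalities as equally probable."""
    scores = []
    for r in range(1, len(modalities)):                    # drop 1 .. M-1 modalities
        for missing in itertools.combinations(modalities, r):
            available = [m for m in modalities if m not in missing]
            scores.append(evaluate_miou(model, dataset, available))
    return sum(scores) / len(scores)

def miou_rmm_expect(evaluate_miou, model, dataset, modalities, p_fail=0.3, trials=100, seed=0):
    """Random-Missing Modality: each modality fails independently ~ Bernoulli(p_fail);
    estimate the expected mIoU by Monte Carlo sampling."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        available = [m for m in modalities if rng.random() >= p_fail]
        if not available:                                  # keep at least one modality
            available = [rng.choice(modalities)]
        scores.append(evaluate_miou(model, dataset, available))
    return sum(scores) / len(scores)
```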

Authors:Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He
Title: On the Perception Bottleneck of VLMs for Chart Understanding
Abstract:
Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) While instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.
Chinese: 本研究揭示了大型视觉语言模型在图表理解中的感知瓶颈,归因于视觉编码器和信息提取的双重限制,并通过对比学习增强视觉编码器的方法,显著提升了模型性能。
English: This study identifies the perception bottleneck in large vision-language models for chart understanding, attributing it to limitations in both the vision encoder and information extraction, and proposes a contrastive learning-enhanced visual encoder that significantly improves model performance.
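A minimal InfoNCE-style objective of the kind used to strengthen a vision encoder under a contrastive framework. The pairing of chart images with their textual data is assumed here; this is a generic sketch rather than the authors' exact training recipe.

```python
# Symmetric InfoNCE loss for matched (chart image, text) embedding pairs.
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of matched chart/text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                       # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)   # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```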

Authors:Zhichao Sun, Huazhang Hu, Yidong Ma, Gang Liu, Nemo Chen, Xu Tang, Yao Hu, Yongchao Xu
Title: CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection
Abstract:
With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness in COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The code is publicly available at https://github.com/RedAIGC/CQ-DINO.
中文摘要:CQ-DINO提出了一种基于类别查询的目标检测框架,通过将分类重构为对象查询与类别查询间的对比任务,有效解决了传统分类器中梯度稀释问题,在大型词汇数据集上实现了最优性能,并在通用数据集上保持竞争力。
English Summary: CQ-DINO is a novel object detection framework that addresses gradient dilution issues in traditional classifiers by reformulating classification as a contrastive task between object queries and category queries, achieving state-of-the-art performance on large-vocabulary datasets while maintaining competitive results on standard benchmarks.
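A hedged sketch of image-guided category query selection: score all learnable category queries against pooled image features and keep only the top-K per image, shrinking the negative space before the contrastive classification step. The pooling and scoring here are simplified stand-ins for the paper's cross-attention.

```python
# Simplified top-K category query selection (not CQ-DINO's exact cross-attention).
import torch
import torch.nn.functional as F

def select_category_queries(image_feats, category_queries, k=100):
    """image_feats: (B, N, D) image tokens; category_queries: (C, D) learnable queries.
    Returns (B, k, D) selected queries and (B, k) their indices."""
    pooled = image_feats.mean(dim=1)                                            # (B, D)
    scores = F.normalize(pooled, dim=-1) @ F.normalize(category_queries, dim=-1).T  # (B, C)
    top_scores, top_idx = scores.topk(k, dim=-1)                                # (B, k)
    selected = category_queries[top_idx]                                        # (B, k, D)
    return selected, top_idx
```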

Authors:Dian Zheng, Cheng Zhang, Xiao-Ming Wu, Cao Li, Chengfei Lv, Jian-Fang Hu, Wei-Shi Zheng
Title: Panorama Generation From NFoV Image Done Right
Abstract:
Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metrics, which tend to reflect image quality and are not suitable for evaluating distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate panorama distortion, and we discover the "visual cheating" phenomenon in previous works (i.e., tending to improve the visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address this phenomenon, we propose PanoDecouple, a decoupled diffusion model framework, which decouples panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing a panorama-specific distortion prior and a modified condition registration mechanism, and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function based on Distort-CLIP is introduced to constrain the distortion explicitly. Extensive experiments validate that PanoDecouple surpasses existing methods in both distortion and visual metrics.
中文摘要:本文提出Distort-CLIP评估全景图畸变,并设计PanoDecouple解耦扩散框架,通过分离畸变引导与内容补全来生成兼具视觉吸引力和几何准确性的全景图像。
English Summary: This paper introduces Distort-CLIP for evaluating panorama distortion and PanoDecouple, a decoupled diffusion framework that separates distortion guidance from content completion to generate visually appealing panoramas with accurate geometry.

Authors:Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, Yunhe Wang
Title: U-REPA: Aligning Diffusion U-Nets to ViTs
Abstract:
Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which shows faster convergence than DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) gaps between the U-Net and ViT feature spaces hinder the effectiveness of tokenwise alignment. To address these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows. Firstly, we observe that, due to skip connections, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment and further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With a CFG guidance interval, U-REPA reaches $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA. Codes are available at https://github.com/YuchuanTian/U-REPA.
中文:U-REPA是一种创新的表征对齐范式,通过解决空间维度不一致性并引入流形损失,成功将REPA适配到U-Net架构,在图像生成质量和收敛速度上均显著优于现有方法。
English: U-REPA is a novel representation alignment paradigm that adapts REPA to U-Net architectures by addressing spatial inconsistencies and introducing manifold loss, achieving superior image generation quality and significantly faster convergence than previous methods.
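An illustrative manifold-style alignment loss in the spirit of the abstract: instead of forcing tokenwise similarity, match the sample-to-sample similarity structure of the (upsampled, MLP-projected) U-Net features to that of the ViT features. The pooling to per-sample vectors and the MSE formulation are assumptions, not the paper's exact loss.

```python
# Relative-similarity (manifold) alignment between per-sample representations.
import torch
import torch.nn.functional as F

def manifold_loss(unet_feats, vit_feats):
    """unet_feats, vit_feats: (B, D) per-sample (e.g., pooled) representations."""
    u = F.normalize(unet_feats, dim=-1)
    v = F.normalize(vit_feats, dim=-1)
    sim_u = u @ u.T       # (B, B) relative similarities among U-Net samples
    sim_v = v @ v.T       # (B, B) relative similarities among ViT samples
    return F.mse_loss(sim_u, sim_v)
```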

Authors:Wencheng Zhu, Yuexin Wang, Hongxuan Li, Pengfei Zhu, Qinghua Hu
Title: VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
Abstract:
Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these, we propose a simple yet effective video-to-text discretization framework. Our method repurposes the frozen text encoder to construct a visual codebook from video class labels due to the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook effectively transforms temporal visual data into textual tokens via feature lookups and offers interpretable video representations through explicit video modeling. Then, to enhance robustness against irrelevant or noisy frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by assessing their semantic relevance via the codebook. Furthermore, our method incorporates learnable text prompts to conduct adaptive codebook updates. Extensive experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 have validated the superiority of our approach, achieving more competitive improvements over state-of-the-art methods. The code will be publicly available at https://github.com/isxinxin/VTD-CLIP.
Chinese: 本文提出了一种视频到文本的离散化框架,通过利用类别标签构建的视觉码本将时序视觉数据转化为文本标记,结合置信度感知融合模块和可学习文本提示,在多个基准测试中实现了更强的鲁棒性和性能提升。
English: This paper introduces a video-to-text discretization framework that enhances temporal modeling and interpretability by converting visual data into textual tokens using a codebook derived from class labels, incorporating confidence-aware fusion and adaptive prompts to improve robustness and performance across multiple benchmarks.
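A sketch of the video-to-text discretization and confidence-aware fusion described above: each frame feature is snapped to its nearest text-codebook entry, and frames are fused with weights derived from their lookup confidence. The softmax weighting and temperature are illustrative assumptions.

```python
# Frame-to-codebook lookup with confidence-aware fusion (illustrative sketch).
import torch
import torch.nn.functional as F

def discretize_and_fuse(frame_feats, codebook, tau=0.01):
    """frame_feats: (T, D) frame embeddings; codebook: (C, D) text embeddings of class labels."""
    f = F.normalize(frame_feats, dim=-1)
    c = F.normalize(codebook, dim=-1)
    sim = f @ c.T                                   # (T, C) frame-to-codeword similarity
    conf, idx = sim.max(dim=-1)                     # per-frame best codeword and confidence
    tokens = codebook[idx]                          # (T, D) discretized "textual" tokens
    weights = F.softmax(conf / tau, dim=0)          # confidence-aware frame weights
    video_repr = (weights.unsqueeze(-1) * tokens).sum(dim=0)   # fused video representation
    return video_repr, idx, conf
```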

Authors:Sherry X. Chen, Misha Sra, Pradeep Sen
Title: Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Abstract:
Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to the difficulty of creating large, high-quality training datasets. To create such datasets, previous approaches have typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP (I-CLIP), a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and obtain over 120K refined samples, which we then use to fine-tune their model, guided by our novel I-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
中文: Instruct-CLIP是一种自监督方法,通过优化图像编辑数据集中指令与图像变化的对齐,利用校正后的数据集和新颖的损失函数提升如InstructPix2Pix等模型的编辑效果。
English: Instruct-CLIP is a self-supervised method that refines image-editing datasets by aligning instructions with image changes, enhancing the performance of models like InstructPix2Pix through a corrected dataset and novel loss function.

Authors:Hongen Liu, Cheng Cui, Yuning Du, Yi Liu, Gang Pan
Title: PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition
Abstract:
Formula recognition is an important task in document intelligence. It involves converting mathematical expressions from document images into structured symbolic formats that computers can easily work with. LaTeX is the most common format used for this purpose. In this work, we present PP-FormulaNet, a state-of-the-art formula recognition model that excels in both accuracy and efficiency. To meet the diverse needs of applications, we have developed two specialized models: PP-FormulaNet-L, tailored for high-accuracy scenarios, and PP-FormulaNet-S, optimized for high-efficiency contexts. Our extensive evaluations reveal that PP-FormulaNet-L attains accuracy levels that surpass those of prominent models such as UniMERNet by a significant 6%. Conversely, PP-FormulaNet-S operates at speeds that are over 16 times faster. These advancements facilitate seamless integration of PP-FormulaNet into a broad spectrum of document processing environments that involve intricate mathematical formulas. Furthermore, we introduce a Formula Mining System, which is capable of extracting a vast amount of high-quality formula data. This system further enhances the robustness and applicability of our formula recognition model. Code and models are publicly available at PaddleOCR(https://github.com/PaddlePaddle/PaddleOCR) and PaddleX(https://github.com/PaddlePaddle/PaddleX).
中文: PP-FormulaNet推出了两个专用公式识别模型,其中PP-FormulaNet-L的准确率比UniMERNet高6%,PP-FormulaNet-S的运行速度快16倍以上,并配有公式挖掘系统以提升数据质量和模型鲁棒性。
English: PP-FormulaNet introduces two specialized models for formula recognition, with PP-FormulaNet-L achieving 6% higher accuracy than UniMERNet and PP-FormulaNet-S running over 16 times faster, alongside a Formula Mining System to enhance data quality and model robustness.

Authors:Tianpei Zhang, Yiming Zhu, Jufeng Zhao, Guangmang Cui, Yuchen Zheng
Title: Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model
Abstract:
Deep learning techniques have revolutionized infrared and visible image fusion (IVIF), showing remarkable efficacy in complex scenarios. However, current methods do not fully combine frequency-domain features with global semantic information, which results in suboptimal extraction of global features across modalities and insufficient preservation of local texture details. To address these issues, we propose Wavelet-Mamba (W-Mamba), which integrates the wavelet transform with the state-space model (SSM). Specifically, we introduce the Wavelet-SSM module, which incorporates wavelet-based frequency-domain feature extraction and global information extraction through SSM, thereby effectively capturing both global and local features. Additionally, we propose a cross-modal feature attention modulation, which facilitates efficient interaction and fusion between different modalities. The experimental results indicate that our method achieves both visually compelling results and superior performance compared to current state-of-the-art methods. Our code is available at https://github.com/Lmmh058/W-Mamba.
中文: 提出的Wavelet-Mamba(W-Mamba)模型将小波变换与状态空间模型相结合,在红外与可见光图像融合中有效捕捉全局和局部特征,相比现有方法获得了更优的视觉效果和性能表现。
English: The proposed Wavelet-Mamba (W-Mamba) model integrates wavelet transform with state-space models to effectively capture global and local features in infrared and visible image fusion, achieving superior visual and performance results compared to existing methods.

Authors:Hankyul Kang, Gregor Seifer, Donghyun Lee, Jongbin Ryu
Title: Do Your Best and Get Enough Rest for Continual Learning
Abstract:
According to the forgetting curve theory, we can enhance memory retention by learning extensive data and taking adequate rest. This means that in order to effectively retain new knowledge, it is essential to learn it thoroughly and ensure sufficient rest so that our brain can memorize without forgetting. The main takeaway from this theory is that learning extensive data at once necessitates sufficient rest before learning the same data again. This aspect of human long-term memory retention can be effectively utilized to address the continual learning of neural networks. Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus' theory, we introduce the view-batch model, which adjusts the learning schedule to optimize the recall interval between retrainings of the same samples. The proposed view-batch model allows the network to get enough rest to learn extensive knowledge from the same samples with a recall interval of sufficient length. To this end, we specifically present two approaches: 1) a replay method that guarantees the optimal recall interval, and 2) a self-supervised learning approach that acquires extensive knowledge from a single training sample at a time. We empirically show that these components of our method align with the forgetting curve theory and can enhance long-term memory. In our experiments, we also demonstrate that our method significantly improves many state-of-the-art continual learning methods in various protocols and scenarios. We open-source this project at https://github.com/hankyul2/ViewBatchModel.
中文: 本研究基于遗忘曲线理论,提出视图批次模型,通过优化回忆间隔并结合回放方法与自监督学习,有效增强持续学习系统中的长期记忆保持能力。
English: Based on the forgetting curve theory, this study proposes a view-batch model that optimizes recall intervals and incorporates replay methods with self-supervised learning to enhance long-term memory retention in continual learning systems.
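A toy scheduler in the spirit of the view-batch idea: samples are grouped so that the same sample is not revisited before a minimum recall interval (measured in batches) has elapsed. The interval and batching logic are illustrative assumptions, not the paper's replay method.

```python
# Toy replay schedule that enforces a minimum recall interval between revisits.
from collections import deque
import random

def view_batch_schedule(num_samples, batch_size, recall_interval, num_batches, seed=0):
    """Yield lists of sample indices; each index reappears only after >= recall_interval batches."""
    rng = random.Random(seed)
    cooldown = deque()                       # (sample_id, batch index when available again)
    available = list(range(num_samples))
    for b in range(num_batches):
        # Return samples whose rest period has ended (deque stays sorted by release time).
        while cooldown and cooldown[0][1] <= b:
            available.append(cooldown.popleft()[0])
        rng.shuffle(available)
        batch = available[:batch_size]       # batches may shrink while many samples rest
        available = available[batch_size:]
        for s in batch:
            cooldown.append((s, b + recall_interval))
        yield batch
```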

Authors:Xu Han, Yuan Tang, Jinfeng Xu, Xianzhi Li
Title: MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning
Abstract:
We introduce Monarch Sparse Tuning (MoST), the first reparameterization-based parameter-efficient fine-tuning (PEFT) method tailored for 3D representation learning. Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST introduces no additional inference overhead and is compatible with many 3D representation learning backbones. At its core, we present a new family of structured matrices for 3D point clouds, Point Monarch, which can capture local geometric features of irregular points while offering high expressiveness. MoST reparameterizes the dense update weight matrices as our sparse Point Monarch matrices, significantly reducing parameters while retaining strong performance. Experiments on various backbones show that MoST is simple, effective, and highly generalizable. It captures local features in point clouds, achieving state-of-the-art results on multiple benchmarks, e.g., 97.5% acc. on ScanObjectNN (PB_50_RS) and 96.2% on ModelNet40 classification, while it can also combine with other matrix decompositions (e.g., Low-rank, Kronecker) to further reduce parameters.
中文: MoST是一种专为三维表示学习设计的参数高效微调方法,通过稀疏Point Monarch矩阵捕捉不规则点的局部几何特征,无需额外推理开销,在ScanObjectNN和ModelNet40等基准测试中达到了最先进性能。
English: MoST is a novel parameter-efficient fine-tuning method for 3D representation learning that uses sparse Point Monarch matrices to capture local geometric features without inference overhead, achieving state-of-the-art performance on benchmarks like ScanObjectNN and ModelNet40.
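For intuition, a simplified Monarch-style structured product y = P^T (R (P (L x))), where L and R are block-diagonal and P is a fixed stride permutation. This captures why the parameter count drops from d^2 toward O(d * sqrt(d)); it is a generic sketch, not the Point Monarch variant proposed in the paper.

```python
# Simplified Monarch-style matrix-vector product (assumes d is a perfect square, b == m).
import torch

def monarch_matvec(x, L_blocks, R_blocks):
    """x: (d,); L_blocks, R_blocks: (b, m, m) with d = b * m and b == m."""
    b, m, _ = L_blocks.shape
    d = b * m
    z = torch.bmm(L_blocks, x.view(b, m, 1)).view(d)   # block-diagonal multiply by L
    z = z.view(b, m).t().reshape(d)                    # stride permutation P
    z = torch.bmm(R_blocks, z.view(b, m, 1)).view(d)   # block-diagonal multiply by R
    return z.view(m, b).t().reshape(d)                 # inverse permutation P^T

# Example: d = 16 uses 2 * 4 * 4 * 4 = 128 parameters instead of 16 * 16 = 256.
x = torch.randn(16)
L = torch.randn(4, 4, 4)
R = torch.randn(4, 4, 4)
y = monarch_matvec(x, L, R)
```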

Authors:Chenxi Xie, Minghan Li, Hui Zeng, Jun Luo, Lei Zhang
Title: MaSS13K: A Matting-level Semantic Segmentation Benchmark
Abstract:
High-resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, AR/VR, etc. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations of a number of objects, which are categorized into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20-50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high-resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high-quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate the research of high-resolution and high-quality semantic segmentation. Datasets and codes can be found at https://github.com/xiechenxi99/MaSS13K.
中文:本研究提出了MaSS13K高分辨率精确标注数据集和MaSSFormer高效分割模型,能够将其精确分割能力迁移至新物体类别。
English: This work introduces MaSS13K, a high-resolution dataset with precise mask annotations, and MaSSFormer, a model designed for efficient semantic segmentation that can transfer its capabilities to new object classes.

Authors:Wenyuan Zhang, Yixiao Yang, Han Huang, Liang Han, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Title: MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction
Abstract:
Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. However, due to the inconsistent prediction on each view, how to more effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision, while ignoring the inherent inaccuracy and cross-view inconsistency in monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning each segmented instance depths from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with corresponding instance masks on nearby views. MonoInstance is a versatile strategy which can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves the performance in both reconstruction and novel view synthesis under various benchmarks.
中文:MonoInstance通过在多视角下对齐实例深度并引入不确定性约束,有效解决了单眼深度先验的不一致性问题,从而显著提升了神经渲染与三维重建的性能。
English: MonoInstance enhances neural rendering and 3D reconstruction by addressing inconsistencies in monocular depth priors through uncertainty-based alignment of instance depths across multiple views and integrating a constraint for unreliable areas.

Authors:Wenyuan Zhang, Emily Yue-ting Jia, Junsheng Zhou, Baorui Ma, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Title: NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction
Abstract:
Recently, it has been shown that priors are vital for neural implicit functions to reconstruct high-quality surfaces from multi-view RGB images. However, current priors require large-scale pre-training and merely provide geometric clues without considering the importance of color. In this paper, we present NeRFPrior, which adopts a neural radiance field as a prior to learn signed distance fields using volume rendering for surface reconstruction. Our NeRF prior provides both geometric and color clues and can be trained quickly on the same scene without additional data. Based on the NeRF prior, we learn a signed distance function (SDF) by explicitly imposing a multi-view consistency constraint on each ray intersection for surface inference. Specifically, at each ray intersection, we use the density in the prior as a coarse geometry estimate, while using the color near the surface as a clue to check its visibility from another view angle. For textureless areas where the multi-view consistency constraint does not work well, we further introduce a depth consistency loss with confidence weights to infer the SDF. Our experimental results outperform state-of-the-art methods on widely used benchmarks.
中文摘要:NeRFPrior采用神经辐射场作为先验,通过体渲染学习符号距离场进行表面重建,提供几何和颜色线索,无需额外数据即可快速训练,并在多视角一致性和深度一致性损失下实现卓越性能。
English Summary: NeRFPrior introduces a neural radiance field as a prior that provides both geometric and color clues for surface reconstruction, enabling fast training and superior performance through multi-view consistency and depth consistency loss.

Authors:Zekai Deng, Ye Shi, Kaiyang Ji, Lan Xu, Shaoli Huang, Jingya Wang
Title: Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy
Abstract:
Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.
中文摘要:本文提出首个基于物理的人-物交互统一框架,利用视觉语言模型自动生成运动指导和奖励函数,无需人工调整即可在不同交互场景中实现自然拟人化运动。
English Summary: This paper introduces a unified physics-based human-object interaction framework using Vision-Language Models to automatically generate motion guidance and reward functions, enabling natural human-like motions across diverse interaction scenarios without manual tuning.

Authors:Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Kijung Lee, Sanghyun Park
Title: GranQ: Granular Zero-Shot Quantization with Channel-Wise Activation Scaling in QAT
Abstract:
Zero-shot quantization (ZSQ) enables neural network compression without original training data, making it a promising solution for restricted data access scenarios. To compensate for the lack of data, recent ZSQ methods typically rely on synthetic inputs generated from the full-precision model. However, these synthetic inputs often lead to activation distortion, especially under low-bit settings. To mitigate this, existing methods typically employ per-channel scaling, but they still struggle due to the severe computational overhead during the accumulation process. To overcome this critical bottleneck, we propose GranQ, a novel activation quantization framework that introduces an efficient pre-scaling strategy. Unlike conventional channel-wise methods that repeatedly perform scaling operations during accumulation, GranQ applies scaling factors in a pre-scaling step through fully vectorized computation, eliminating runtime scaling overhead. This design enables GranQ to maintain fine-grained quantization accuracy while significantly reducing computational burden, particularly in low-bit quantization settings. Extensive experiments under quantization-aware training (QAT) settings demonstrate that GranQ consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet. In particular, our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10. Furthermore, GranQ achieves significant speedup in quantization latency over conventional per-channel methods, demonstrating improved efficiency. With these findings, we anticipate that GranQ will inspire future research beyond conventional ZSQ approaches centered on data generation and model fine-tuning. The official code is available at https://github.com/anonymus-orange/GranQ.
中文: GranQ提出了一种高效的预缩放策略,通过全向量化计算消除运行时缩放开销,在低比特量化中实现了更高的精度和速度,显著优于现有零样本量化方法。
English: GranQ introduces an efficient pre-scaling strategy for zero-shot quantization, eliminating runtime scaling overhead to achieve superior accuracy and speedup in low-bit settings while outperforming state-of-the-art methods.
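A hedged sketch of the pre-scaling idea: compute per-channel scales once, scale the whole activation tensor in a single vectorized step, quantize, then fold the scales back, instead of rescaling channel by channel during accumulation. The bit-width handling and symmetric range are simplified assumptions.

```python
# Vectorized per-channel fake quantization with a one-shot pre-scaling step.
import torch

def granular_quantize(x, n_bits=3, eps=1e-8):
    """x: (B, C, H, W) activations; returns fake-quantized activations."""
    qmax = 2 ** n_bits - 1
    # Per-channel max magnitude, computed once over all other dimensions.
    scale = x.abs().amax(dim=(0, 2, 3), keepdim=True).clamp_min(eps) / qmax   # (1, C, 1, 1)
    x_scaled = x / scale                       # pre-scaling: a single fused elementwise op
    x_q = x_scaled.round().clamp(-qmax, qmax)  # snap to the integer grid (simplified range)
    return x_q * scale                         # dequantize for QAT (straight-through in practice)
```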

Authors:Wenrui Cai, Qingjie Liu, Yunhong Wang
Title: SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking
Abstract:
Most state-of-the-art trackers adopt a one-stream paradigm, using a single Vision Transformer for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background regions dominated by target-irrelevant information require reduced attention allocation, while foreground regions, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack, based on a mixture-of-experts design tailored for the visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to the spatio-temporal context, further improving tracking accuracy with a minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to efficiently train SPMTrack at varying scales and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and the results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code is available at https://github.com/WenRuiCai/SPMTrack.
中文: SPMTrack提出了一种专为视觉跟踪设计的专家混合模型,能灵活处理多样化的关系建模,并将其扩展至时空上下文,以最小的参数增长提升跟踪精度,在多个数据集上显著优于当前最先进的跟踪器。
English: SPMTrack introduces a mixture-of-experts model tailored for visual tracking to flexibly handle diverse relation modeling, extending it to spatio-temporal context for improved accuracy with minimal parameter increase, and outperforms state-of-the-art trackers across multiple datasets.

Authors:Chun Gu, Xiaofei Wei, Li Zhang, Xiatian Zhu
Title: TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering
Abstract:
Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically rely on manually pre-defined, non-learnable importance samplers, which struggle to effectively match the spatially and directionally varying integrand, resulting in high variance and suboptimal performance. To address this limitation, we propose the concept of learning a spatially and directionally aware importance sampler for the rendering equation to accurately and flexibly capture the unconstrained complexity of a typical scene. We further formulate TensoFlow, a generic approach for sampler learning in inverse rendering, enabling it to closely match the integrand of the rendering equation spatially and directionally. Concretely, our sampler is parameterized by normalizing flows, allowing both directional sampling of incident light and probability density function (PDF) inference. To capture the characteristics of the sampler spatially, we learn a tensorial representation of the scene space, which imposes spatial conditions, together with the reflected direction, leading to spatially and directionally aware sampling distributions. Our model can be optimized by minimizing the difference between the integrand and our normalizing flow. Extensive experiments validate the superiority of TensoFlow over prior alternatives on both synthetic and real-world benchmarks.
Chinese: 本文提出TensoFlow方法,通过归一化流学习具有空间和方向感知的重要性采样器,能更精准匹配渲染方程的积分项,从而在逆向渲染中有效降低方差并提升计算效率。
English: This paper introduces TensoFlow, a novel method that learns a spatially and directionally aware importance sampler using normalizing flows to reduce variance and improve efficiency in inverse rendering by better matching the rendering equation's integrand.

Authors:Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang
Title: Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Abstract:
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
Chinese: LogSAD是一种无需训练的新型多模态框架,通过思维匹配架构和多粒度检测,能同时识别逻辑和结构异常,并取得了最先进的性能。
English: LogSAD is a novel multi-modal framework that detects both logical and structural anomalies without training, achieving state-of-the-art results through a match-of-thought architecture and multi-granularity detection.

Authors:Christoforos N. Spartalis, Theodoros Semertzidis, Efstratios Gavves, Petros Daras
Title: LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
Abstract:
We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.
中文: LoTUS是一种新颖的机器遗忘方法,无需重新训练即可高效消除预训练模型中训练数据的影响,在多个基准测试中效率和效果均优于现有方法。
English: LoTUS is a novel Machine Unlearning method that efficiently removes the influence of training data from pre-trained models without retraining, outperforming existing methods in both efficiency and effectiveness across multiple benchmarks.
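A small sketch of the Jensen-Shannon divergence underlying the RF-JSD metric, computed between two sets of softmax predictions on the same samples. How the two prediction sets are chosen to make the metric retrain-free is defined by the paper, not by this snippet.

```python
# Mean Jensen-Shannon divergence between two models' predictive distributions.
import torch
import torch.nn.functional as F

def jensen_shannon(p_logits, q_logits, eps=1e-12):
    """p_logits, q_logits: (N, C) logits from two models on the same samples."""
    p = F.softmax(p_logits, dim=-1).clamp_min(eps)
    q = F.softmax(q_logits, dim=-1).clamp_min(eps)
    m = 0.5 * (p + q)
    kl_pm = (p * (p / m).log()).sum(dim=-1)   # KL(p || m)
    kl_qm = (q * (q / m).log()).sum(dim=-1)   # KL(q || m)
    return (0.5 * (kl_pm + kl_qm)).mean()
```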

Authors:Fiseha B. Tesema, Alejandro Guerra Manzanares, Tianxiang Cui, Qian Zhang, Moses Solomon, Sean He
Title: LGPS: A Lightweight GAN-Based Approach for Polyp Segmentation in Colonoscopy Images
Abstract:
Colorectal cancer (CRC) is a major global cause of cancer-related deaths, with early polyp detection and removal during colonoscopy being crucial for prevention. While deep learning methods have shown promise in polyp segmentation, challenges such as high computational costs, difficulty in segmenting small or low-contrast polyps, and limited generalizability across datasets persist. To address these issues, we propose LGPS, a lightweight GAN-based framework for polyp segmentation. LGPS incorporates three key innovations: (1) a MobileNetV2 backbone enhanced with modified residual blocks and Squeeze-and-Excitation (ResE) modules for efficient feature extraction; (2) Convolutional Conditional Random Fields (ConvCRF) for precise boundary refinement; and (3) a hybrid loss function combining Binary Cross-Entropy, Weighted IoU Loss, and Dice Loss to address class imbalance and enhance segmentation accuracy. LGPS is validated on five benchmark datasets and compared with state-of-the-art (SOTA) methods. On the largest and most challenging PolypGen test dataset, LGPS achieves a Dice of 0.7299 and an IoU of 0.7867, outperforming all SOTA methods and demonstrating robust generalization. With only 1.07 million parameters, LGPS is 17 times smaller than the smallest existing model, making it highly suitable for real-time clinical applications. Its lightweight design and strong performance underscore its potential for improving early CRC diagnosis. Code is available at https://github.com/Falmi/LGPS/.
Chinese: 提出的LGPS框架采用基于GAN的轻量级架构,结合MobileNetV2增强模块、ConvCRF边界优化和混合损失函数,在五个基准数据集上实现了最优的息肉分割性能,其模型参数量仅为107万,比现有最小模型小17倍,非常适合实时临床诊断应用。
English: The proposed LGPS framework utilizes a lightweight GAN-based architecture with MobileNetV2 enhancements, ConvCRF boundary refinement, and a hybrid loss function to achieve superior polyp segmentation performance, outperforming state-of-the-art methods while being 17 times smaller for real-time clinical use.
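A sketch of a hybrid segmentation loss combining BCE, an IoU term, and Dice, as named above. The relative weights are placeholders rather than the paper's values, and the IoU term here is unweighted for simplicity.

```python
# Hybrid BCE + IoU + Dice loss for binary segmentation.
import torch
import torch.nn.functional as F

def hybrid_loss(logits, target, w_bce=1.0, w_iou=1.0, w_dice=1.0, eps=1e-6):
    """logits, target: (B, 1, H, W); target is a float tensor with values in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    iou = 1.0 - (inter + eps) / (union + eps)
    dice = 1.0 - (2 * inter + eps) / (prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps)
    return w_bce * bce + w_iou * iou.mean() + w_dice * dice.mean()
```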

Authors:Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, Vikash Sehwag
Title: CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI
Abstract:
With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, Co-Spy, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at https://github.com/Megum1/Co-Spy.
中文:提出的Co-Spy框架通过增强语义和伪影特征,实现了对AI生成图像的鲁棒检测,在全面评估中验证其以显著准确率提升超越现有方法。
English: The proposed Co-Spy framework enhances semantic and artifact features to achieve robust detection of AI-generated images, outperforming existing methods with significant accuracy improvements as validated through comprehensive evaluations.

Authors:Kazuhiro Yamada, Li Yin, Qingrui Hu, Ning Ding, Shunsuke Iwashita, Jun Ichikawa, Kiwamu Kotani, Calvin Yeung, Keisuke Fujii
Title: TrackID3x3: A Dataset and Algorithm for Multi-Player Tracking with Identification and Pose Estimation in 3x3 Basketball Full-court Videos
Abstract:
Multi-object tracking, player identification, and pose estimation are fundamental components of sports analytics, essential for analyzing player movements, performance, and tactical strategies. However, existing datasets and methodologies primarily target mainstream team sports such as soccer and conventional 5-on-5 basketball, often overlooking scenarios involving fixed-camera setups commonly used at amateur levels, less mainstream sports, or datasets that explicitly incorporate pose annotations. In this paper, we propose the TrackID3x3 dataset, the first publicly available comprehensive dataset specifically designed for multi-player tracking, player identification, and pose estimation in 3x3 basketball scenarios. The dataset comprises three distinct subsets (Indoor fixed-camera, Outdoor fixed-camera, and Drone camera footage), capturing diverse full-court camera perspectives and environments. We also introduce the Track-ID task, a simplified variant of the game state reconstruction task that excludes field detection and focuses exclusively on fixed-camera scenarios. To evaluate performance, we propose a baseline algorithm called Track-ID algorithm, tailored to assess tracking and identification quality. Furthermore, our benchmark experiments, utilizing recent multi-object tracking algorithms (e.g., BoT-SORT-ReID) and top-down pose estimation methods (HRNet, RTMPose, and SwinPose), demonstrate robust results and highlight remaining challenges. Our dataset and evaluation benchmarks provide a solid foundation for advancing automated analytics in 3x3 basketball. Dataset and code will be available at https://github.com/open-starlab/TrackID3x3.
中文: 本文提出了首个针对3x3篮球场景的公开综合数据集TrackID3x3,涵盖多视角拍摄环境并引入专用基线算法,填补了非主流体育项目中多目标追踪与姿态分析的研究空白。
English: This paper introduces the TrackID3x3 dataset, the first comprehensive public dataset for multi-player tracking, identification, and pose estimation in 3x3 basketball, featuring diverse camera setups and a baseline algorithm to address gaps in sports analytics for non-mainstream scenarios.

Authors:Minh-Tuan Tran, Trung Le, Xuan-May Le, Thanh-Toan Do, Dinh Phung
Title: Enhancing Dataset Distillation via Non-Critical Region Refinement
Abstract:
Dataset distillation has become a popular method for compressing large datasets into smaller, more efficient representations while preserving critical information for model training. Data features are broadly categorized into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these features: some focus solely on class-general patterns, neglecting finer instance details, while others prioritize instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we introduce the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves instance-specific details and fine-grained regions in synthetic data while enriching non-critical regions with class-general information. This approach enables models to leverage all pixel information, capturing both feature types and enhancing overall performance. Additionally, we present Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying on the distance between synthetic data predictions and one-hot encoded labels. Experimental results show that NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets. Furthermore, by storing only two distances per instance, our method delivers comparable results across various settings. The code is available at https://github.com/tmtuan1307/NRR-DD.
中文: 提出的NRR-DD方法通过在非关键区域融入类别共性信息同时保留细粒度细节,有效平衡了数据集蒸馏中的实例特定特征和类别通用特征,无需依赖软标签即可通过基于距离的知识迁移实现最优性能。
English: The proposed NRR-DD method effectively balances instance-specific and class-general features in dataset distillation by refining non-critical regions with shared class information while preserving fine-grained details, achieving state-of-the-art performance without requiring soft labels through distance-based knowledge transfer.

Authors:Feiran Wang, Bin Duan, Jiachen Tao, Nikhil Sharma, Dawen Cai, Yan Yan
Title: ZECO: ZeroFusion Guided 3D MRI Conditional Generation
Abstract:
Medical image segmentation is crucial for enhancing diagnostic accuracy and treatment planning in Magnetic Resonance Imaging (MRI). However, acquiring precise lesion masks for segmentation model training demands specialized expertise and significant time investment, leading to a small dataset scale in clinical practice. In this paper, we present ZECO, a ZeroFusion guided 3D MRI conditional generation framework that extracts, compresses, and generates high-fidelity MRI images with corresponding 3D segmentation masks to mitigate data scarcity. To effectively capture inter-slice relationships within volumes, we introduce a Spatial Transformation Module that encodes MRI images into a compact latent space for the diffusion process. Moving beyond unconditional generation, our novel ZeroFusion method progressively maps 3D masks to MRI images in latent space, enabling robust training on limited datasets while avoiding overfitting. ZECO outperforms state-of-the-art models in both quantitative and qualitative evaluations on Brain MRI datasets across various modalities, showcasing its exceptional capability in synthesizing high-quality MRI images conditioned on segmentation masks.
中文: ZECO是一种创新的三维MRI生成框架,通过ZeroFusion和空间变换模块合成具有对应分割掩模的高保真医学图像以缓解数据稀缺问题,在各项评估中均优于现有模型。
English: ZECO is a novel 3D MRI generation framework that synthesizes high-fidelity medical images with corresponding segmentation masks to address data scarcity, outperforming existing models through its innovative ZeroFusion and Spatial Transformation Module.

Authors:Yiheng Zhong, Zihong Luo, Chengzhi Liu, Feilong Tang, Zelin Peng, Ming Hu, Yingzhen Hu, Jionglong Su, Zongyuan Ge, Imran Razzak
Title: PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation
Abstract:
Segment Anything Model (SAM) demonstrates powerful zero-shot capabilities; however, its accuracy and robustness significantly decrease when applied to medical image segmentation. Existing methods address this issue through modality fusion, integrating textual and image information to provide more detailed priors. In this study, we argue that the granularity of text and the domain gap affect the accuracy of the priors. Furthermore, the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion process. To address this, we propose Prior-Guided SAM (PG-SAM), which employs a fine-grained modality prior aligner to leverage specialized medical knowledge for better modality alignment. The core of our method lies in efficiently addressing the domain gap with fine-grained text from a medical LLM. Meanwhile, it also enhances the priors' quality after modality alignment, ensuring more accurate segmentation. In addition, our decoder enhances the model's expressive capabilities through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. We also propose a unified pipeline that effectively supplies high-quality semantic information to SAM. Extensive experiments on the Synapse dataset demonstrate that the proposed PG-SAM achieves state-of-the-art performance. Our code is released at https://github.com/logan-0623/PG-SAM.
Chinese: 针对SAM模型在医学图像分割中因领域差异和语义鸿沟导致的精度不足问题,PG-SAM通过细粒度医学文本先验对齐和多级特征融合解码器,在Synapse数据集上实现了最先进的性能。
English: The Segment Anything Model (SAM) struggles with accuracy and robustness in medical image segmentation due to domain gaps and semantic discrepancies, prompting the development of PG-SAM, which integrates fine-grained medical text priors and a multi-feature decoder to achieve state-of-the-art performance on the Synapse dataset.

Authors:Massimo Bini, Leander Girrbach, Zeynep Akata
Title: DeLoRA: Decoupling Angles and Strength in Low-rank Adaptation
Abstract:
Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at https://github.com/ExplainableML/DeLoRA.
中文: DeLoRA是一种新颖的参数高效微调方法,通过将角度学习与适应强度解耦来增强鲁棒性,在多项任务中实现优于或持平现有方法的性能,同时保持更高的稳定性。
English: DeLoRA is a novel parameter-efficient fine-tuning method that enhances robustness by decoupling angular learning from adaptation strength, achieving superior or comparable performance across multiple tasks while maintaining greater stability than existing approaches.
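A hedged sketch of decoupling angle from strength in a low-rank update: the BA product is normalized (fixing its "direction") and multiplied by a separately learned scale, so the adaptation strength is bounded independently of the learned subspace. The normalization choice and initialization are illustrative, not DeLoRA's exact parameterization.

```python
# Normalized low-rank update with a separately learned strength (illustrative).
import torch
import torch.nn as nn

class DecoupledLowRank(nn.Module):
    def __init__(self, d_out, d_in, rank=8, init_scale=1.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = nn.Parameter(torch.tensor(init_scale))    # learnable adaptation strength

    def delta(self):
        update = self.B @ self.A                                # (d_out, d_in)
        norm = update.norm().clamp_min(1e-8)
        return self.scale * update / norm                       # unit-norm direction * strength

    def forward(self, x, weight):
        """x: (..., d_in); weight: frozen (d_out, d_in) pretrained weight."""
        return x @ (weight + self.delta()).T
```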

Authors:Valentin Gabeff, Haozhe Qi, Brendan Flaherty, Gencer Sumbül, Alexander Mathis, Devis Tuia
Title: MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps
Abstract:
Monitoring wildlife is essential for ecology and ethology, especially in light of the increasing human impact on ecosystems. Camera traps have emerged as habitat-centric sensors enabling the study of wildlife populations at scale with minimal disturbance. However, the lack of annotated video datasets limits the development of powerful video understanding models needed to process the vast amount of fieldwork data collected. To advance research in wild animal behavior monitoring we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. MammAlps contains over 14 hours of video with audio, 2D segmentation maps and 8.5 hours of individual tracks densely labeled for species and behavior. Based on 6135 single animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark using audio, video and reference scene segmentation maps as inputs. Furthermore, we also propose a second ecology-oriented benchmark aiming at identifying activities, species, number of individuals and meteorological conditions from 397 multi-view and long-term ecological events, including false positive triggers. We advocate that both tasks are complementary and contribute to bridging the gap between machine learning and ecology. Code and data are available at: https://github.com/eceo-epfl/MammAlps
中文:MammAlps数据集通过提供来自瑞士国家公园相机陷阱的多模态、多视角野生动物视频,解决了标注数据匮乏的问题,并提出了行为识别和生态监测的双重基准,以弥合机器学习与生态学之间的鸿沟。
English: The MammAlps dataset addresses the scarcity of annotated wildlife videos by providing multimodal, multi-view footage from Swiss camera traps, enabling hierarchical behavior recognition and ecological monitoring benchmarks to bridge machine learning and ecology.

Authors:Zhengyuan Li, Kai Cheng, Anindita Ghosh, Uttaran Bhattacharya, Liangyan Gui, Aniket Bera
Title: SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
Abstract:
Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often leading to misalignment between motion semantics and language instructions. In this paper, we introduce a related task, motion similarity prediction, and propose a multi-task training paradigm, where we train the model jointly on motion editing and motion similarity prediction to foster the learning of semantically meaningful representations. To complement this task, we design an advanced Diffusion-Transformer-based architecture that separately handles motion similarity prediction and motion editing. Extensive experiments demonstrate the state-of-the-art performance of our approach in both editing alignment and fidelity.
Chinese: 本文提出了一种结合运动编辑与运动相似性预测的多任务训练范式,并采用先进的扩散变换器架构,在编辑对齐度和保真度方面实现了最先进的性能。
English: This paper introduces a multi-task training paradigm combining motion editing and motion similarity prediction, supported by an advanced Diffusion-Transformer architecture, achieving state-of-the-art performance in editing alignment and fidelity.

Authors:Haoyang Li, Siyu Zhou, Liang Wang, Guodong Long
Title: MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models
Abstract:
Though CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on reconstructing the model architecture, e.g., additional loss calculation and meta-networks. These approaches generally lead to increased complexity and extended training cost. To maintain the efficiency of the tuning process, we propose plug-and-play Model-Agnostic Optimization (MAO) for prompt tuning. Without altering any components of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and incorporate an Alterable Regularization module to boost the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments on MAO demonstrate its outstanding performance and efficiency. The code of MAO is available at: https://github.com/JREion/M.A.O .
中文: 本研究提出即插即用的模型无关优化方法(MAO),通过数据驱动增强和可变正则化模块优化初始数据分布与特征处理流程,在保持低计算成本的同时显著提升了CLIP提示调优的性能。
English: To enhance CLIP-based prompt tuning without increasing complexity, this study introduces a plug-and-play Model-Agnostic Optimization (MAO) that improves data distribution and feature processing while maintaining low computational costs.

Authors:Peng Chen, Xiaobao Wei, Ming Lu, Hui Chen, Feng Tian
Title: DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation
Abstract:
Real-time speech-driven 3D facial animation has been attractive in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the non-deterministic nature of speech-driven 3D face animation and employ diffusion models for the task. Existing diffusion-based methods can improve the diversity of facial animation. However, personalized speaking styles that convey accurate lip movements are still lacking, and efficiency and compactness also need to be improved. In this work, we propose DiffusionTalker to address the above limitations via personalizer-guided distillation. In terms of personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to enhance the influence of embeddings on facial animation. For efficiency, we use iterative distillation to reduce the steps required for animation generation and achieve more than 8x speedup in inference. To achieve compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released at: https://github.com/ChenVoid/DiffusionTalker.
中文: 本研究提出DiffusionTalker方法,通过个性化引导蒸馏技术捕捉说话风格并提升效率和紧凑性,显著改进了实时语音驱动的3D面部动画效果,优于现有方法。
English: This study introduces DiffusionTalker, a method that enhances real-time speech-driven 3D facial animation by using personalizer-guided distillation to capture personalized speaking styles and improve efficiency and compactness, outperforming existing approaches.

Authors:Alexander Gielisse, Jan van Gemert
Title: End-to-End Implicit Neural Representations for Classification
Abstract:
Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is however not straightforward. Inherent symmetries in the parameters pose challenges and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly under-performs compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme, to yield representations that improve classification accuracy. We show that a simple, straightforward, Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art without augmentations from 38.8% to 59.6%, and from 63.4% to 64.7% with augmentations. We demonstrate scalability on the high-resolution Imagenette dataset, achieving reasonable reconstruction quality with a classification accuracy of 60.8%, and are the first to do INR classification on the full ImageNet-1K dataset, where we achieve a SIREN classification accuracy of 23.6%. To the best of our knowledge, no other SIREN classification approach has managed to set a classification baseline for any high-resolution image dataset. Our code is available at https://github.com/SanderGielisse/MWT
Chinese: 本研究提出了一种端到端的策略,通过初始化SIREN并结合学习率优化方案来提高分类精度,在多个数据集上无需引入显式对称等变性即实现了最先进的性能。
English: This work introduces an end-to-end strategy that initializes SIRENs with a learned learning-rate scheme to enhance classification accuracy, achieving state-of-the-art performance on multiple datasets without incorporating explicit symmetry equivariances.
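Since the method builds on SIRENs, a minimal sketch of a SIREN (a sine-activated MLP mapping pixel coordinates to colors) may help; the meta-learned initialization and the learned learning-rate scheme from the paper are not reproduced here, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """One SIREN layer: a linear map followed by sin(omega_0 * x),
    using the standard SIREN initialization scheme."""
    def __init__(self, in_f, out_f, omega_0=30.0, first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_f, out_f)
        with torch.no_grad():
            bound = 1 / in_f if first else (6 / in_f) ** 0.5 / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# A tiny SIREN mapping (x, y) pixel coordinates to RGB values.
siren = nn.Sequential(
    SineLayer(2, 64, first=True),
    SineLayer(64, 64),
    nn.Linear(64, 3),
)
coords = torch.rand(1024, 2) * 2 - 1   # coordinates in [-1, 1]
rgb = siren(coords)
print(rgb.shape)  # torch.Size([1024, 3])
```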

Authors:Hongjia Zhai, Hai Li, Zhenzhe Li, Xiaokun Pan, Yijia He, Guofeng Zhang
Title: PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding
Abstract:
Recently, 3D Gaussian Splatting (3DGS) has shown encouraging performance for open vocabulary scene understanding tasks. However, previous methods cannot distinguish 3D instance-level information, which usually predicts a heatmap between the scene feature and text query. In this paper, we propose PanoGS, a novel and effective 3D panoptic open vocabulary scene understanding approach. Technically, to learn accurate 3D language features that can scale to large indoor scenarios, we adopt the pyramid tri-plane to model the latent continuous parametric feature space and use a 3D feature decoder to regress the multi-view fused 2D feature cloud. Besides, we propose language-guided graph cuts that synergistically leverage reconstructed geometry and learned language cues to group 3D Gaussian primitives into a set of super-primitives. To obtain 3D-consistent instances, we perform graph-clustering-based segmentation with SAM-guided edge affinity computation between different super-primitives. Extensive experiments on widely used datasets show better or more competitive performance on 3D panoptic open vocabulary scene understanding. Project page: https://zju3dv.github.io/panogs.

Authors:Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
Title: Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Abstract:
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
中文摘要:本文提出了一种重写驱动的增强范式(RAM),通过改写人类标注的训练数据直接生成未见过的观察-指令对,以无模拟器和省人工的方式解决了视觉语言导航领域的数据稀缺问题,在多个数据集上展现出卓越的泛化性能。
English Summary: This paper introduces a Rewriting-driven Augmentation (RAM) paradigm that addresses data scarcity in Vision-Language Navigation by generating unseen observation-instruction pairs through rewriting mechanisms, achieving superior generalization across multiple datasets without simulators or manual data cleaning.

Authors:Xiaoming Qi, Jingyang Zhang, Huazhu Fu, Guanyu Yang, Shuo Li, Yueming Jin
Title: Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for FCL
Abstract:
Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenarios. Existing server-side FCL methods in the natural domain construct a continually learnable server model by client aggregation on all involved tasks. However, they are challenged by: (1) Catastrophic forgetting of previously learned tasks, leading to error accumulation in the server model, making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step toward a novel server-side FCL pattern in the medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH), designed to facilitate collaborative learning under distinct and dynamic task streams across clients. To alleviate catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH over other FCL methods on sites with different task streams. The code is available at https://github.com/jinlab-imvr/FedDAH.
中文: 联邦持续学习在动态医疗场景下面临灾难性遗忘和优化偏差的挑战,FedDAH方法通过动态分配超网络和自适应模型重校准技术,有效提升了跨客户端的协同学习能力。
English: Federated continual learning (FCL) faces challenges of catastrophic forgetting and biased optimization in dynamic medical scenarios, which the proposed FedDAH method addresses through a dynamic allocation hypernetwork and adaptive model recalibration to enhance collaborative learning across clients.
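A toy sketch of the hypernetwork idea behind DAHyper is given below: a task identity is embedded and mapped to the parameters of a small target layer. All sizes and the single-layer target are hypothetical; FedDAH's actual mapping and client aggregation are more involved.

```python
import torch
import torch.nn as nn

class TaskHypernetwork(nn.Module):
    """Toy hypernetwork mapping a task identity to the weights of a small
    target linear layer, illustrating the 'task id -> model parameters'
    mapping a DAHyper-style module maintains. Sizes are hypothetical."""
    def __init__(self, num_tasks, target_in=32, target_out=10, emb_dim=16):
        super().__init__()
        self.target_in, self.target_out = target_in, target_out
        self.task_emb = nn.Embedding(num_tasks, emb_dim)
        n_params = target_out * target_in + target_out  # weight + bias
        self.generator = nn.Sequential(
            nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, n_params)
        )

    def forward(self, task_id, x):
        params = self.generator(self.task_emb(task_id))
        w = params[: self.target_out * self.target_in].view(self.target_out, self.target_in)
        b = params[self.target_out * self.target_in:]
        return x @ w.t() + b

hyper = TaskHypernetwork(num_tasks=5)
logits = hyper(torch.tensor(2), torch.randn(8, 32))
print(logits.shape)  # torch.Size([8, 10])
```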

Authors:Mingde Yao, Menglu Wang, King-Man Tam, Lingen Li, Tianfan Xue, Jinwei Gu
Title: PolarFree: Polarization-based Reflection-free Imaging
Abstract:
Reflection removal is challenging due to complex light interactions, where reflections obscure important details and hinder scene understanding. Polarization naturally provides a powerful cue to distinguish between reflected and transmitted light, enabling more accurate reflection removal. However, existing methods often rely on small-scale or synthetic datasets, which fail to capture the diversity and complexity of real-world scenarios. To this end, we construct a large-scale dataset, PolaRGB, for Polarization-based reflection removal of RGB images, which enables us to train models that generalize effectively across a wide range of real-world scenarios. The PolaRGB dataset contains 6,500 well-aligned mixed-transmission image pairs, 8x larger than existing polarization datasets, and is the first to include both RGB and polarization images captured across diverse indoor and outdoor environments with varying lighting conditions. Besides, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages a diffusion process to generate reflection-free cues for accurate reflection removal. Extensive experiments show that PolarFree significantly enhances image clarity in challenging reflective scenarios, setting a new benchmark for polarized imaging and reflection removal. Code and dataset are available at https://github.com/mdyao/PolarFree.
中文摘要:PolaRGB数据集和PolarFree方法通过利用偏振线索和扩散过程,有效解决了复杂场景下的反射去除难题,显著提升了各类现实环境中的图像清晰度。
English Summary: The PolaRGB dataset and PolarFree method address reflection removal challenges by leveraging polarization cues and diffusion processes, significantly improving image clarity in diverse real-world scenarios.

Authors:Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Martin R. Oswald, Danda Pani Paudel
Title: SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Abstract:
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable manner remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 7916 scenes derived from seven established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 150 GPU days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed method over the established baselines.
中文:SceneSplat提出了首个基于3D高斯泼溅的大规模室内场景理解方法,通过自监督学习方案和SceneSplat-7K数据集实现了标准化基准测试,显著优于现有基线。
English: SceneSplat introduces the first large-scale 3D indoor scene understanding method that operates natively on 3D Gaussian Splatting, along with a self-supervised learning scheme and the SceneSplat-7K dataset to enable standardized benchmarking.

Authors:Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari
Title: OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models
Abstract:
In Omnimatte, one aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. These are accomplished by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. To overcome this, we introduce temporal and spatial attention guidance modules that steer the diffusion process for accurate object removal and temporally consistent background reconstruction. We further show that self-attention maps capture information about the object and its footprints and use them to inpaint the object's effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.

Authors:Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
Title: Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
Abstract:
Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It only leverages curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning the 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, even surpassing a state-of-the-art model 10x its size.
Chinese: Vision-R1是一种新颖的大规模视觉语言模型强化学习算法,通过视觉引导反馈来提升模型性能,无需人工标注的偏好数据或专用奖励模型,在基准测试中实现了显著改进。
English: Vision-R1 is a novel reinforcement learning algorithm for Large Vision-Language Models that utilizes vision-guided feedback to enhance performance without requiring human-annotated preference data or specialized reward models, achieving significant improvements in benchmarks.
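To illustrate what a criterion-driven, rule-based reward might look like for a vision task, the sketch below scores a grounding-style completion by combining a format criterion with an IoU criterion. The criteria, weights, and box format are assumptions for illustration, not Vision-R1's actual reward definition.

```python
def vision_reward(pred_box, gt_box, answer_formatted, w_iou=0.8, w_fmt=0.2):
    """Sketch of a rule-based reward for a grounding-style completion:
    a weighted combination of a format check and box IoU. The criteria
    and weights are hypothetical, not Vision-R1's exact reward."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(pred_box) + area(gt_box) - inter + 1e-8)
    return w_iou * iou + w_fmt * float(answer_formatted)

print(vision_reward([10, 10, 50, 50], [12, 8, 55, 48], answer_formatted=True))
```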

Authors:Hongyu Yan, Zijun Li, Kunming Luo, Li Lu, Ping Tan
Title: SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance
Abstract:
Point cloud completion aims to recover a complete shape from a partial point cloud. Although existing methods can produce point clouds with satisfactory global completeness, they often lose the original geometry details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformations to transform key geometries of partial inputs into missing regions, thereby generating geometry-aligned partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as the explicit symmetric guidance that can constrain the refinement process for initial point clouds. As a result, SGFormer can exploit provided priors to form high-fidelity and geometry-consistent final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.
中文: SymmCompletion提出了一种基于对称性指导的点云补全方法,通过局部对称变换网络和对称指导变换器,将部分输入的关键几何特征转换并优化,生成高保真且几何一致的完整点云。
English: SymmCompletion introduces a novel point cloud completion method using symmetry guidance through LSTNet and SGFormer to generate high-fidelity, geometry-consistent results by transforming and refining partial inputs.

Authors:Maochen Yang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi
Title: Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting
Abstract:
Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to effectively and accurately utilize unlabeled data. In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. Firstly, we explore a data augmentation technique well-suited for the crowd counting task. By inpainting the background regions, this technique can effectively enhance data diversity while preserving the fidelity of entire scenes. Secondly, we introduce the Visual State Space Model as the backbone to capture the global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide less exact but more accurate supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. Code is publicly available on https://github.com/syhien/taste_more_taste_better.
Chinese: TMTB框架通过背景修复的数据增强和视觉状态空间模型捕捉全局上下文,结合抗噪分类头减少标注噪声影响,在半监督人群计数中显著超越现有方法。
English: The TMTB framework enhances semi-supervised crowd counting by employing background inpainting for data augmentation and a Visual State Space Model for global context, achieving superior performance with an Anti-Noise classification head to mitigate annotation noise.
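As a rough illustration of pairing an exact regression head with a coarser, noise-tolerant classification head, the sketch below bins ground-truth counts into intervals for the classification target. The binning scheme, feature dimension, and equal loss weighting are hypothetical stand-ins, not TMTB's exact Anti-Noise head.

```python
import torch
import torch.nn as nn

class DualHeadCounter(nn.Module):
    """Sketch of combining an exact count regression head with a coarser
    classification head over count intervals; the binning is a hypothetical
    stand-in for an anti-noise style of supervision."""
    def __init__(self, feat_dim=256, bin_edges=(0, 10, 50, 100, 500)):
        super().__init__()
        self.bin_edges = torch.tensor(bin_edges, dtype=torch.float)
        self.reg_head = nn.Linear(feat_dim, 1)               # exact count
        self.cls_head = nn.Linear(feat_dim, len(bin_edges))  # count interval

    def forward(self, feat):
        return self.reg_head(feat).squeeze(-1), self.cls_head(feat)

    def loss(self, feat, gt_count):
        pred_count, logits = self(feat)
        reg = nn.functional.l1_loss(pred_count, gt_count)
        bins = torch.bucketize(gt_count, self.bin_edges) - 1  # coarse label
        bins = bins.clamp(min=0, max=len(self.bin_edges) - 1)
        cls = nn.functional.cross_entropy(logits, bins)
        return reg + cls

model = DualHeadCounter()
feat, gt = torch.randn(4, 256), torch.tensor([3.0, 42.0, 180.0, 7.0])
print(model.loss(feat, gt).item())
```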

Authors:Baizhi Wang, Rui Yan, Wenxin Ma, Xu Zhang, Yuhao Wang, Xiaolong Li, Yunjie Gu, Zihang Jiang, S. Kevin Zhou
Title: Histomorphology-driven multi-instance learning for breast cancer WSI classification
Abstract:
Histomorphology is crucial in breast cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key and fine-grained pathological features. To address this limitation, we propose a novel framework that explicitly incorporates histomorphology (tumor cellularity, cellular morphology, and tissue architecture) into WSI classification. Specifically, our approach consists of three key components: (1) estimating the importance of tumor-related histomorphology information at the patch level based on medical prior knowledge; (2) generating representative cluster-level features through histomorphology-driven cluster pooling; and (3) enabling WSI-level classification through histomorphology-driven multi-instance aggregation. With the incorporation of histomorphological information, our framework strengthens the model's ability to capture key and fine-grained pathological patterns, thereby enhancing WSI classification performance. Experimental results demonstrate its effectiveness, achieving high diagnostic accuracy for molecular subtyping and cancer subtyping. The code will be made available at https://github.com/Badgewho/HMDMIL.
中文摘要:本文提出了一种新颖框架,通过显式整合组织形态学特征到全切片图像分类中,增强了模型捕捉关键病理模式的能力,在乳腺癌分子分型和亚型分类方面展现出卓越的诊断准确性。
English Summary: This paper introduces a novel framework that explicitly integrates histomorphological features into whole slide image classification, enhancing the model's ability to capture fine-grained pathological patterns and demonstrating superior diagnostic accuracy for breast cancer subtyping.

Authors:Yara AlaaEldin, Francesca Odone
Title: Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images
Abstract:
Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment, and for practical use in autonomous navigation, the procedure must be performed as close to real time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves competitive or superior to other single and joint architecture methods while running fast, predicting at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, and it has a low memory footprint. All codes for training and prediction can be found at this link: https://github.com/Malga-Vision/Co-SemDepth
中文: 本文提出一种联合深度学习架构,利用无人机单目相机在非结构化环境中高效预测深度和语义地图,在基准数据集上展现出优越性能并实现实时运算。
English: This paper introduces a joint deep-learning architecture using monocular cameras on UAVs to efficiently predict depth and semantic maps in unstructured environments, demonstrating competitive performance and real-time speed on benchmark datasets.

Authors:Yuzhi Li, Haojun Xu, Feng Tian
Title: Shot Sequence Ordering for Video Editing: Benchmarks, Metrics, and Cinematology-Inspired Computing Methods
Abstract:
With the rising popularity of short video platforms, the demand for video production has increased substantially. However, high-quality video creation continues to rely heavily on professional editing skills and a nuanced understanding of visual language. To address this challenge, the Shot Sequence Ordering (SSO) task in AI-assisted video editing has emerged as a pivotal approach for enhancing video storytelling and the overall viewing experience. Nevertheless, the progress in this field has been impeded by a lack of publicly available benchmark datasets. In response, this paper introduces two novel benchmark datasets, AVE-Order and ActivityNet-Order. Additionally, we employ the Kendall Tau distance as an evaluation metric for the SSO task and propose the Kendall Tau Distance-Cross Entropy Loss. We further introduce the concept of Cinematology Embedding, which incorporates movie metadata and shot labels as prior knowledge into the SSO model, and constructs the AVE-Meta dataset to validate the method's effectiveness. Experimental results indicate that the proposed loss function and method substantially enhance SSO task accuracy. All datasets are publicly accessible at https://github.com/litchiar/ShotSeqBench.
中文摘要:本文针对AI辅助视频编辑中镜头序列排序任务缺乏公开基准数据集的问题,提出了两个新数据集和新型评估指标,显著提升了排序准确性。
English Summary: This paper addresses the shortage of benchmark datasets for AI-assisted video editing's Shot Sequence Ordering task by introducing two new datasets and a novel evaluation metric that significantly improves ordering accuracy.
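Since the paper adopts Kendall Tau distance as the SSO evaluation metric, a small reference implementation of the metric itself (counting discordant shot pairs) is sketched below; the paper's Kendall Tau Distance-Cross Entropy Loss additionally couples this with a cross-entropy term and is not reproduced here.

```python
from itertools import combinations

def kendall_tau_distance(pred_order, gt_order):
    """Number of discordant pairs between a predicted shot ordering and the
    ground-truth ordering (0 = identical, n*(n-1)/2 = fully reversed)."""
    pos = {shot: i for i, shot in enumerate(gt_order)}
    discordant = 0
    for a, b in combinations(pred_order, 2):
        # pair (a, b) is discordant if its relative order flips in the ground truth
        if pos[a] > pos[b]:
            discordant += 1
    return discordant

print(kendall_tau_distance([0, 2, 1, 3], [0, 1, 2, 3]))  # 1
```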

Authors:Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, Yunzhu Li
Title: PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
Abstract:
Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning.

Authors:Yang Luo, Shiru Wang, Jun Liu, Jiaxuan Xiao, Rundong Xue, Zeyu Zhang, Hao Zhang, Yu Lu, Yang Zhao, Yutong Xie
Title: PathoHR: Breast Cancer Survival Prediction on High-Resolution Pathological Images
Abstract:
Breast cancer survival prediction in computational pathology presents a remarkable challenge due to tumor heterogeneity. For instance, different regions of the same tumor in the pathology image can show distinct morphological and molecular characteristics. This makes it difficult to extract representative features from whole slide images (WSIs) that truly reflect the tumor's aggressive potential and likely survival outcomes. In this paper, we present PathoHR, a novel pipeline for accurate breast cancer survival prediction that enhances any size of pathological images to enable more effective feature learning. Our approach entails (1) the incorporation of a plug-and-play high-resolution Vision Transformer (ViT) to enhance patch-wise WSI representation, enabling more detailed and comprehensive feature extraction, (2) the systematic evaluation of multiple advanced similarity metrics for comparing WSI-extracted features, optimizing the representation learning process to better capture tumor characteristics, and (3) the demonstration that smaller image patches, enhanced following the proposed pipeline, can achieve equivalent or superior prediction accuracy compared to raw larger patches, while significantly reducing computational overhead. Experimental findings validate that PathoHR provides a viable way of integrating enhanced image resolution with optimized feature learning to advance computational pathology, offering a promising direction for more accurate and efficient breast cancer survival prediction. Code will be available at https://github.com/AIGeeksGroup/PathoHR.
中文:PathoHR通过集成视觉Transformer增强病理图像分辨率,并采用相似性度量优化特征学习,提出了一种乳腺癌生存预测新方法,在保证高精度的同时显著降低了计算开销。
English: PathoHR introduces a novel pipeline for breast cancer survival prediction by enhancing pathological image resolution with a Vision Transformer and optimizing feature learning through similarity metrics, achieving high accuracy with reduced computational costs.

Authors:Zeng-Hui Zhu, Wei Lu, Si-Bao Chen, Chris H. Q. Ding, Jin Tang, Bin Luo
Title: Real-World Remote Sensing Image Dehazing: Benchmark and Baseline
Abstract:
Remote Sensing Image Dehazing (RSID) poses significant challenges in real-world scenarios due to the complex atmospheric conditions and severe color distortions that degrade image quality. The scarcity of real-world remote sensing hazy image pairs has compelled existing methods to rely primarily on synthetic datasets. However, these methods struggle with real-world applications due to the inherent domain gap between synthetic and real data. To address this, we introduce Real-World Remote Sensing Hazy Image Dataset (RRSHID), the first large-scale dataset featuring real-world hazy and dehazed image pairs across diverse atmospheric conditions. Based on this, we propose MCAF-Net, a novel framework tailored for real-world RSID. Its effectiveness arises from three innovative components: Multi-branch Feature Integration Block Aggregator (MFIBA), which enables robust feature extraction through cascaded integration blocks and parallel multi-branch processing; Color-Calibrated Self-Supervised Attention Module (CSAM), which mitigates complex color distortions via self-supervised learning and attention-guided refinement; and Multi-Scale Feature Adaptive Fusion Module (MFAFM), which integrates features effectively while preserving local details and global context. Extensive experiments validate that MCAF-Net demonstrates state-of-the-art performance in real-world RSID, while maintaining competitive performance on synthetic datasets. The introduction of RRSHID and MCAF-Net sets new benchmarks for real-world RSID research, advancing practical solutions for this complex task. The code and dataset are publicly available at https://github.com/lwCVer/RRSHID.
中文摘要:作者提出了首个大规模真实世界遥感雾霾图像数据集RRSHID,并开发了MCAF-Net框架,通过三个创新模块在真实场景遥感图像去雾中实现最优性能,同时在合成数据上保持竞争力。
English Summary: The authors introduce RRSHID, the first large-scale real-world hazy/dehazed remote sensing image dataset, and propose MCAF-Net with three novel modules that achieves state-of-the-art performance in real-world remote sensing image dehazing while maintaining competitive results on synthetic data.

Authors:Jianjian Yin, Tao Chen, Gensheng Pei, Yazhou Yao, Liqiang Nie, Xiansheng Hua
Title: Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning
Abstract:
Consistency regularization has prevailed in semi-supervised semantic segmentation and achieved promising performance. However, existing methods typically concentrate on enhancing the Image-augmentation based Prediction consistency and optimizing the segmentation network as a whole, resulting in insufficient utilization of potential supervisory information. In this paper, we propose a Multi-Constraint Consistency Learning (MCCL) approach to facilitate the staged enhancement of the encoder and decoder. Specifically, we first design a feature knowledge alignment (FKA) strategy to promote the feature consistency learning of the encoder from image-augmentation. Our FKA encourages the encoder to derive consistent features for strongly and weakly augmented views from the perspectives of point-to-point alignment and prototype-based intra-class compactness. Moreover, we propose a self-adaptive intervention (SAI) module to increase the discrepancy of aligned intermediate feature representations, promoting Feature-perturbation based Prediction consistency learning. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance. The source code and models are made available at https://github.com/NUST-Machine-Intelligence-Laboratory/MCCL.
中文: 本文提出多约束一致性学习方法,通过特征知识对齐策略和自适应干预模块分别增强编码器和解码器的性能,在标准数据集上实现了最先进的半监督语义分割效果。
English: The paper introduces a Multi-Constraint Consistency Learning (MCCL) approach that enhances semi-supervised semantic segmentation by separately improving encoder and decoder performance through feature knowledge alignment and self-adaptive intervention, achieving state-of-the-art results on benchmark datasets.

Authors:Dongheng Lin, Han Hu, Jianbo Jiao
Title: What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images
Abstract:
Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: what time tells us? To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning. We find that the proposed TICL 1) achieves state-of-the-art performance on the timestamp estimation task across various benchmark metrics, and 2) interestingly, although it only sees static images, the time-aware embeddings learned from TICL show strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context. Project page: https://rathgrith.github.io/timetells/.
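The cross-modal objective can be sketched as a CLIP-style symmetric InfoNCE between image and timestamp embeddings, as below. The encoders producing the embeddings, the temperature, and the batch construction are assumptions; this is the general recipe rather than TICL's exact implementation.

```python
import torch
import torch.nn.functional as F

def time_image_contrastive_loss(img_emb, time_emb, temperature=0.07):
    """Symmetric InfoNCE between image embeddings and timestamp embeddings,
    a CLIP-style sketch of cross-modal contrastive learning; the encoders
    that produce the embeddings are assumed, not shown."""
    img_emb = F.normalize(img_emb, dim=-1)
    time_emb = F.normalize(time_emb, dim=-1)
    logits = img_emb @ time_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))         # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

loss = time_image_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```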

Authors:Radu Beche, Sergiu Nedevschi
Title: ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling
Abstract:
The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032x3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. The data and code are available on the project page at https://rdbch.github.io/claravid/

Authors:Wen Li, Chen Liu, Shangshu Yu, Dunqiang Liu, Yin Zhou, Siqi Shen, Chenglu Wen, Cheng Wang
Title: LightLoc: Learning Outdoor LiDAR Localization at Light Speed
Abstract:
Scene coordinate regression achieves impressive results in outdoor LiDAR localization but requires days of training. Since training needs to be repeated for each new scene, long training times make these methods impractical for time-sensitive applications, such as autonomous driving, drones, and robotics. We identify large coverage areas and vast data in large-scale outdoor scenes as key challenges that limit fast training. In this paper, we propose LightLoc, the first method capable of efficiently learning localization in a new scene at light speed. LightLoc introduces two novel techniques to address these challenges. First, we introduce sample classification guidance to assist regression learning, reducing ambiguity from similar samples and improving training efficiency. Second, we propose redundant sample downsampling to remove well-learned frames during training, reducing training time without compromising accuracy. Additionally, the fast training and confidence estimation capabilities of sample classification enable its integration into SLAM, effectively eliminating error accumulation. Extensive experiments on large-scale outdoor datasets demonstrate that LightLoc achieves state-of-the-art performance with a 50x reduction in training time compared to existing methods. Our code is available at https://github.com/liw95/LightLoc.
Chinese: LightLoc通过引入样本分类指导和冗余样本下采样技术,实现了50倍训练速度提升的户外激光雷达定位,为自动驾驶等时效性应用提供了实用解决方案。
English: LightLoc introduces sample classification guidance and redundant sample downsampling to achieve state-of-the-art outdoor LiDAR localization with 50x faster training, making it practical for time-sensitive applications like autonomous driving.
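A rough sketch of redundant sample downsampling follows: frames whose current loss is already low are dropped so later iterations focus on poorly learned samples. The loss-based criterion and the keep ratio are illustrative assumptions, not LightLoc's exact rule.

```python
import torch

def drop_well_learned(frame_ids, losses, keep_ratio=0.5):
    """Redundant-sample downsampling sketch: keep only the frames with the
    highest current loss so subsequent epochs focus on poorly learned samples.
    The 50% ratio and loss-based criterion are illustrative assumptions."""
    k = max(1, int(len(frame_ids) * keep_ratio))
    hardest = torch.topk(losses, k).indices
    return [frame_ids[i] for i in hardest.tolist()]

ids = list(range(10))
losses = torch.tensor([0.9, 0.1, 0.5, 0.05, 0.8, 0.2, 0.3, 0.7, 0.02, 0.6])
print(drop_well_learned(ids, losses))  # [0, 4, 7, 9, 2]
```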

Authors:Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang
Title: CODA: Repurposing Continuous VAEs for Discrete Tokenization
Abstract:
Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs -- already optimized for perceptual compression -- into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with 6x less training budget than standard VQGAN, our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of 0.43 and 1.34 for 8x and 16x compression on the ImageNet 256x256 benchmark.
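A minimal sketch of the continuous-to-discrete step is shown below: latents from a frozen continuous VAE are snapped to their nearest codebook entries with a straight-through gradient. The codebook size, latent dimension, and plain nearest-neighbor assignment are simplifying assumptions; CODA's actual discretization design is more elaborate.

```python
import torch
import torch.nn as nn

class LatentQuantizer(nn.Module):
    """Sketch of discretizing a frozen continuous-VAE latent: each latent
    vector is snapped to its nearest codebook entry, with a straight-through
    gradient. CODA's actual discretization module is more involved."""
    def __init__(self, num_codes=1024, dim=16):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                            # z: (N, dim) continuous latents
        d = torch.cdist(z, self.codebook.weight)     # pairwise distances to codes
        idx = d.argmin(dim=-1)                       # nearest code per latent
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                 # straight-through estimator
        return z_q, idx

quant = LatentQuantizer()
z_q, codes = quant(torch.randn(64, 16))
print(z_q.shape, codes.shape)  # torch.Size([64, 16]) torch.Size([64])
```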

Authors:R. D. Lin, Pengcheng Weng, Yinqiao Wang, Han Ding, Jinsong Han, Fei Wang
Title: HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving
Abstract:
LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlooking the rich long-term temporal properties inherent in autonomous driving scenarios. In driving experience, we observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high-temporal sensitivity and low-temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state-of-the-art semi-supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches. Code is available on https://github.com/rdlin118/HiLoTs
中文: 提出的HiLoTs方法通过交叉注意力和师生框架学习并融合激光雷达数据中的高低时间敏感性表征,利用长期时序特性,在自动驾驶数据集上超越了当前最先进的半监督方法。
English: The proposed HiLoTs method leverages long-term temporal properties in LiDAR data by learning and fusing high- and low-temporal sensitivity representations through cross-attention and a teacher-student framework, outperforming state-of-the-art semi-supervised approaches on autonomous driving datasets.
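A minimal sketch of fusing the two branches with cross-attention is given below, where each branch attends to the other and the results are concatenated and projected. The dimensions, the single attention layer, and the symmetric design are assumptions for illustration, not HiLoTs' exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of fusing high- and low-temporal-sensitivity point features via
    cross-attention, where each branch queries the other; dimensions and the
    single-layer design are illustrative assumptions."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.high_to_low = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.low_to_high = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f_high, f_low):                    # (B, N, dim) each
        h, _ = self.high_to_low(f_high, f_low, f_low)    # high branch queries low
        l, _ = self.low_to_high(f_low, f_high, f_high)   # low branch queries high
        return self.proj(torch.cat([h, l], dim=-1))

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
print(out.shape)  # torch.Size([2, 100, 128])
```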

Authors:Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Lin Chen, Zehui Chen, Xikun Bao, Jie Zhao, Feng Zhao
Title: V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Abstract:
Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.
中文摘要:V2P-Bench基准通过多模态提示评估大视觉语言模型的视频理解能力,弥补了纯文本评估的不足,结果显示现有模型与人类专家表现存在显著差距。
English Summary: The V2P-Bench is introduced to address the limitations of text-only evaluations by assessing Large Vision-Language Models' video understanding through multimodal prompts, revealing significant performance gaps compared to human experts.

Authors:Bin Fu, Jialin Li, Bin Zhang, Ruiping Wang, Xilin Chen
Title: GS-LTS: 3D Gaussian Splatting-Based Adaptive Modeling for Long-Term Service Robots
Abstract:
3D Gaussian Splatting (3DGS) has garnered significant attention in robotics for its explicit, high-fidelity dense scene representation, demonstrating strong potential for robotic applications. However, 3DGS-based methods in robotics primarily focus on static scenes, with limited attention to the dynamic scene changes essential for long-term service robots. These robots demand sustained task execution and efficient scene updates, challenges that current approaches fail to meet. To address these limitations, we propose GS-LTS (Gaussian Splatting for Long-Term Service), a 3DGS-based system enabling indoor robots to manage diverse tasks in dynamic environments over time. GS-LTS detects scene changes (e.g., object addition or removal) via single-image change detection, employs a rule-based policy to autonomously collect multi-view observations, and efficiently updates the scene representation through Gaussian editing. Additionally, we propose a simulation-based benchmark that automatically generates scene change data as compact configuration scripts, providing a standardized, user-friendly evaluation benchmark. Experimental results demonstrate GS-LTS's advantages in reconstruction, navigation, and scene updates, which are faster and of higher quality than the image training baseline, advancing 3DGS for long-term robotic operations. Code and benchmark are available at: https://vipl-vsu.github.io/3DGS-LTS.

Authors:Jie Zhang, Zhongqi Wang, Shiguang Shan, Xilin Chen
Title: Trigger without Trace: Towards Stealthy Backdoor Attack on Text-to-Image Diffusion Models
Abstract:
Backdoor attacks targeting text-to-image diffusion models have advanced rapidly. However, current backdoor samples often exhibit two key abnormalities compared to benign samples: 1) Semantic Consistency, where backdoor prompts tend to generate images with similar semantic content even with significant textual variations to the prompts; 2) Attention Consistency, where the trigger induces consistent structural responses in the cross-attention maps. These consistencies leave detectable traces for defenders, making backdoors easier to identify. In this paper, to craft stealthier backdoor samples, we propose Trigger without Trace (TwT), which explicitly mitigates these consistencies. Specifically, our approach leverages syntactic structures as backdoor triggers to amplify the sensitivity to textual variations, effectively breaking down the semantic consistency. Besides, a regularization method based on Kernel Maximum Mean Discrepancy (KMMD) is proposed to align the distribution of cross-attention responses between backdoor and benign samples, thereby disrupting attention consistency. Extensive experiments demonstrate that our method achieves a 97.5% attack success rate while exhibiting stronger resistance to defenses. On average, over 98% of backdoor samples bypass three state-of-the-art detection mechanisms, revealing the vulnerabilities of current backdoor defense methods. The code is available at https://github.com/Robin-WZQ/TwT.
中文: 本文提出无痕触发器(TwT)方法,通过破坏语义一致性和注意力一致性,在保持高攻击成功率的同时实现更隐蔽的文本到图像扩散模型后门攻击,有效规避现有防御机制。
English: This paper introduces Trigger without Trace (TwT), a stealthy backdoor attack method for text-to-image diffusion models that disrupts semantic and attention consistencies to evade detection while maintaining high attack success rates.
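The KMMD regularizer can be sketched as a standard kernel MMD with an RBF kernel between flattened cross-attention responses of backdoor and benign samples, as below; the kernel choice, bandwidth, and the simple (biased) estimator are generic assumptions rather than TwT's exact formulation.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Simple (biased) kernel MMD estimate with an RBF kernel between two
    sets of flattened cross-attention responses (backdoor vs. benign).
    A generic MMD sketch; TwT's exact kernel and weighting may differ."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

backdoor_attn = torch.randn(32, 256)   # flattened attention maps (assumed shape)
benign_attn = torch.randn(32, 256)
loss_align = rbf_mmd(backdoor_attn, benign_attn)
print(loss_align.item())
```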

Authors:Yu Wang, Junxian Mu, Hongzhi Huang, Qilong Wang, Pengfei Zhu, Qinghua Hu
Title: BackMix: Regularizing Open Set Recognition by Removing Underlying Fore-Background Priors
Abstract:
Open set recognition (OSR) requires models to classify known samples while detecting unknown samples for real-world applications. Existing studies show impressive progress using unknown samples from auxiliary datasets to regularize OSR models, but they have proved to be sensitive to selecting such known outliers. In this paper, we discuss the aforementioned problem from a new perspective: Can we regularize OSR models without elaborately selecting auxiliary known outliers? We first empirically and theoretically explore the role of foregrounds and backgrounds in open set recognition and disclose that: 1) backgrounds that correlate with foregrounds would mislead the model and cause failures when it encounters 'partially' known images; 2) backgrounds unrelated to foregrounds can serve as auxiliary known outliers and provide regularization via global average pooling. Based on the above insights, we propose a new method, Background Mix (BackMix), that mixes the foreground of an image with different backgrounds to remove the underlying fore-background priors. Specifically, BackMix first estimates the foreground with class activation maps (CAMs), then randomly replaces image patches with backgrounds from other images to obtain mixed images for training. With backgrounds de-correlated from foregrounds, the open set recognition performance is significantly improved. The proposed method is quite simple to implement, requires no extra operations at inference, and can be seamlessly integrated into almost all of the existing frameworks. The code is released on https://github.com/Vanixxz/BackMix.
中文总结:本文提出BackMix方法,通过将图像前景与随机背景混合来消除前景与背景之间的关联性,从而无需精心选择辅助异常样本即可显著提升开放集识别的性能。
English Summary: This paper introduces BackMix, a method that enhances open set recognition by mixing image foregrounds with random backgrounds to eliminate fore-background correlations, improving model performance without needing carefully selected auxiliary outliers.
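A rough sketch of the mixing step follows: patches whose class-activation response falls below a threshold are treated as background and replaced with the corresponding patches of another image. The CAM is assumed to be given, and the patch size and threshold are illustrative choices rather than the paper's exact settings.

```python
import torch

def backmix(img, cam, bg_img, patch=16, thresh=0.4):
    """BackMix-style augmentation sketch: patches with low CAM response
    (treated as background) are swapped for the corresponding patches of
    another image. Patch size and threshold are illustrative; the CAM
    computation itself is assumed to happen elsewhere."""
    c, h, w = img.shape
    mixed = img.clone()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if cam[y:y + patch, x:x + patch].mean() < thresh:   # background patch
                mixed[:, y:y + patch, x:x + patch] = bg_img[:, y:y + patch, x:x + patch]
    return mixed

img, bg = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
cam = torch.rand(64, 64)            # class activation map in [0, 1]
print(backmix(img, cam, bg).shape)  # torch.Size([3, 64, 64])
```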

Authors:Heng Gao, Zhuolin He, Shoumeng Qiu, Xiangyang Xue, Jian Pu
Title: Multi-modality Anomaly Segmentation on the Road
Abstract:
Semantic segmentation allows autonomous driving cars to understand the surroundings of the vehicle comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. Based on our experiments, we find that current uni-modal anomaly segmentation frameworks tend to produce high anomaly scores for non-anomalous regions in images. Motivated by this empirical finding, we develop a multi-modal uncertainty-based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non-anomalous classes by introducing a text modality via the CLIP text encoder. Indeed, MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on the RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available at https://github.com/HengGao12/MMRAS_plus.
中文:MMRAS+是一种创新的多模态异常分割框架,通过引入CLIP文本模态有效降低非异常区域的高异常输出,在多个验证数据集上展现出卓越的自动驾驶障碍物检测性能。
English: MMRAS+ is a novel multi-modal anomaly segmentation framework that enhances autonomous driving safety by integrating text modality via CLIP to accurately identify obstacles while reducing false positives in non-anomalous regions.

Authors:Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, Jianan Li
Title: MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking
Abstract:
UAV tracking faces significant challenges in real-world scenarios, such as small-size targets and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this field has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), which includes 250 video sequences spanning diverse environments and challenges, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which encodes unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background elimination mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that our proposed UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area. The dataset is available at https://github.com/q2479036243/MUST-Multispectral-UAV-Single-Object-Tracking.
中文摘要:本文提出了首个大规模多光谱无人机单目标跟踪数据集MUST,并开发了UNTrack追踪框架,通过融合光谱、空间和时间特征,有效提升了复杂场景下的追踪性能。
English Summary: This paper introduces MUST, the first large-scale multispectral UAV tracking dataset, and proposes UNTrack, a novel framework that integrates spectral, spatial, and temporal features to significantly enhance tracking performance in challenging scenarios.

Authors:Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jingyuan Chen, Jiacheng Sun, Jieming Zhu, Zhou Zhao
Title: Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
Abstract:
We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
中文: 本文提出了一种无需训练的"自相干引导"方法,通过动态优化生成过程中的交叉注意力图,显著提升了基于Transformer的文本引导扩散模型的对齐效果,在多种属性和风格绑定任务上均超越了现有最优方法。
English: This paper presents a training-free method called Self-Coherence Guidance that enhances alignment in Transformer-based Text-Guided Diffusion Models by dynamically optimizing cross-attention maps during generation, achieving superior performance across various attribute and style binding tasks.

Authors:Jinyuan Liu, Bowei Zhang, Qingyun Mei, Xingyuan Li, Yang Zou, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan
Title: DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion
Abstract:
Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality by leveraging the strengths and mitigating the limitations of each modality. Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Framework, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. Leveraging the robust search capabilities of Evolutionary Learning, our approach formulates the optimization of dual tasks as a multi-objective problem by employing an Evolutionary Algorithm (EA) to dynamically balance loss function parameters. Inspired by visual neuroscience, we integrate a Discriminative Enhancer (DE) within both the encoder and decoder, enabling the effective learning of complementary features from different modalities. Additionally, our Cross-Dimensional Embedding (CDE) block facilitates mutual enhancement between high-dimensional task features and low-dimensional fusion features, ensuring a cohesive and efficient feature integration process. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving an average improvement of 9.32% in visual quality while also enhancing subsequent high-level tasks. The code is available at https://github.com/Beate-Suy-Zhang/DCEvo.
中文摘要:提出的DCEvo框架通过进化学习和跨维度特征增强,将红外与可见光图像融合与高级任务相结合,在视觉质量和任务性能上显著优于现有方法。
English Summary: The proposed DCEvo framework integrates infrared and visible image fusion with high-level tasks through evolutionary learning and cross-dimensional feature enhancement, achieving superior visual quality and task performance improvements over existing methods.
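As a rough illustration of the evolutionary balancing idea described above (not the DCEvo implementation), the sketch below runs a tiny (mu+lambda)-style search over loss-weight vectors; evaluate() is a hypothetical stand-in for a short train/validate round that scores a candidate weighting of the fusion and task losses.

import numpy as np

rng = np.random.default_rng(0)

def evaluate(weights):
    # Hypothetical proxy fitness: in practice this would train/validate the
    # fusion network with the given loss weights and combine visual-quality
    # and downstream-task metrics into a single score.
    w_fuse, w_task = weights
    return -(w_fuse - 0.7) ** 2 - (w_task - 0.3) ** 2

def evolve(pop_size=8, generations=20, sigma=0.1):
    pop = rng.random((pop_size, 2))                               # candidate loss-weight vectors
    for _ in range(generations):
        fitness = np.array([evaluate(w) for w in pop])
        parents = pop[np.argsort(fitness)[-pop_size // 2:]]       # keep the best half
        children = parents + rng.normal(0.0, sigma, parents.shape)  # Gaussian mutation
        pop = np.clip(np.vstack([parents, children]), 0.0, 1.0)
    fitness = np.array([evaluate(w) for w in pop])
    return pop[np.argmax(fitness)]

print("balanced loss weights:", evolve())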

Authors:Chi Zhang, Chengjian Feng, Feng Yan, Qiming Zhang, Mingjin Zhang, Yujie Zhong, Jing Zhang, Lin Ma
Title: InstructVEdit: A Holistic Approach for Instructional Video Editing
Abstract:
Video editing according to instructions is a highly challenging task due to the difficulty in collecting large-scale, high-quality edited video pair data. This scarcity not only limits the availability of training data but also hinders the systematic exploration of model architectures and training strategies. While prior work has improved specific aspects of video editing (e.g., synthesizing a video dataset using image editing techniques or decomposed video editing training), a holistic framework addressing the above challenges remains underexplored. In this study, we introduce InstructVEdit, a full-cycle instructional video editing approach that: (1) establishes a reliable dataset curation workflow to initialize training, (2) incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies. Extensive experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing, demonstrating robust adaptability to diverse real-world scenarios. Project page: https://o937-blip.github.io/InstructVEdit.
中文: 本研究提出的InstructVEdit框架通过构建高质量数据集、改进模型架构和迭代优化策略,有效解决了教学视频编辑中的数据稀缺问题,在保持时序一致性的同时实现了最优性能,并展现出强大的实际应用适应性。
English: This study introduces InstructVEdit, a comprehensive framework that addresses data scarcity in instructional video editing through a curated dataset, enhanced model architecture, and iterative refinement, achieving state-of-the-art performance and robust real-world adaptability.

Authors:Yan Zhang, Yao Feng, Alpár Cseke, Nitin Saini, Nathan Bajandas, Nicolas Heron, Michael J. Black
Title: PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
Abstract:
We formulate the motor system of an interactive avatar as a generative motion model that can drive the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although human motion generation has been extensively studied, many existing methods lack the responsiveness and realism of real human movements. Inspired by recent advances in foundation models, we propose PRIMAL, which is learned with a two-stage paradigm. In the pretraining stage, the model learns body movements from a large number of sub-second motion segments, providing a generative foundation from which more complex motions are built. This training is fully unsupervised without annotations. Given a single-frame initial state during inference, the pretrained model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In the adaptation phase, we employ a novel ControlNet-like adaptor to fine-tune the base model efficiently, adapting it to new tasks such as few-shot personalized action generation and spatial target reaching. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural. Code, models, and more results are available at: https://yz-cnsdqz.github.io/eigenmotion/PRIMAL

Authors:Moein Heidari, Afshin Bozorgpour, AmirHossein Zarif-Fakharnia, Dorit Merhof, Ilker Hacihaliloglu
Title: Echo-E$^3$Net: Efficient Endo-Epi Spatio-Temporal Network for Ejection Fraction Estimation
Abstract:
Left ventricular ejection fraction (LVEF) is a critical metric for assessing cardiac function, widely used in diagnosing heart failure and guiding clinical decisions. Despite its importance, conventional LVEF estimation remains time-consuming and operator-dependent. Recent deep learning advancements have enhanced automation, yet many existing models are computationally demanding, hindering their feasibility for real-time clinical applications. Additionally, the interplay between spatial and temporal features is crucial for accurate estimation but is often overlooked. In this work, we propose Echo-E$^3$Net, an efficient Endo-Epi spatio-temporal network tailored for LVEF estimation. Our method introduces the Endo-Epi Cardial Border Detector (E$^2$CBD) module, which enhances feature extraction by leveraging spatial and temporal landmark cues. Complementing this, the Endo-Epi Feature Aggregator (E$^2$FA) distills statistical descriptors from backbone feature maps, refining the final EF prediction. These modules, along with a multi-component loss function tailored to align with the clinical definition of EF, collectively enhance spatial-temporal representation learning, ensuring robust and efficient EF estimation. We evaluate Echo-E$^3$Net on the EchoNet-Dynamic dataset, achieving an RMSE of 5.15 and an R$^2$ score of 0.82, setting a new benchmark in efficiency with 6.8 million parameters and only 8.49 GFLOPs. Our model operates without pre-training, data augmentation, or ensemble methods, making it well-suited for real-time point-of-care ultrasound (PoCUS) applications. Our code is publicly available at https://github.com/moeinheidari7829/Echo-E3Net.
Chinese: Echo-E$^3$Net是一种高效的深度学习模型,通过整合心脏的空间和时间特征来准确估算左心室射血分数,以低计算成本实现高性能,适用于实时临床诊断。
English: Echo-E$^3$Net is an efficient deep learning model that accurately estimates left ventricular ejection fraction by integrating spatial and temporal cardiac features, achieving high performance with minimal computational demands for real-time clinical use.

Authors:Nusrat Munia, Abdullah-Al-Zubaer Imran
Title: DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis
Abstract:
Skin diseases, such as skin cancer, are a significant public health issue, and early diagnosis is crucial for effective treatment. Artificial intelligence (AI) algorithms have the potential to assist in triaging benign vs malignant skin lesions and improve diagnostic accuracy. However, existing AI models for skin disease diagnosis are often developed and tested on limited and biased datasets, leading to poor performance on certain skin tones. To address this problem, we propose a novel generative model, named DermDiff, that can generate diverse and representative dermoscopic image data for skin disease diagnosis. Leveraging text prompting and multimodal image-text learning, DermDiff improves the representation of underrepresented groups (patients, diseases, etc.) in highly imbalanced datasets. Our extensive experimentation showcases the effectiveness of DermDiff in terms of high fidelity and diversity. Furthermore, downstream evaluation suggests the potential of DermDiff in mitigating racial biases for dermatology diagnosis. Our code is available at https://github.com/Munia03/DermDiff
中文: 针对皮肤疾病AI诊断中因数据有限导致的偏见问题,DermDiff模型通过文本提示和多模态学习生成多样化皮肤镜图像,有效提升数据代表性并减少种族偏差,同时保证图像的高保真度和多样性。
English: To address biases in AI-based skin disease diagnosis from limited datasets, DermDiff, a novel generative model, creates diverse dermoscopic images through text prompting and multimodal learning, enhancing representation and reducing racial bias while maintaining high fidelity and diversity.
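To make the text-prompted generation idea concrete, here is a minimal sketch using a generic Stable Diffusion checkpoint from the diffusers library as a stand-in; the checkpoint name, prompts, and attribute lists are assumptions and differ from the actual DermDiff model and training setup.

import torch
from diffusers import StableDiffusionPipeline

# Generic checkpoint as a stand-in for the paper's trained model (assumption).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

skin_tones = ["light", "medium", "dark"]        # attributes to balance (assumed)
diagnoses = ["benign nevus", "melanoma"]        # illustrative class prompts

for tone in skin_tones:
    for dx in diagnoses:
        prompt = f"dermoscopic image of a {dx} on {tone} skin"  # hypothetical prompt
        image = pipe(prompt, num_inference_steps=30).images[0]
        image.save(f"synth_{dx.replace(' ', '_')}_{tone}.png")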

Authors:Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
Title: Judge Anything: MLLM as a Judge Across Any Modality
Abstract:
Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
中文: 本文提出TaskAnything和JudgeAnything基准,用于评估多模态大语言模型在任意模态任务中的表现与评判能力,发现其在多模态理解方面表现良好但在生成任务中存在显著挑战,揭示了跨模态偏见问题。
English: This paper introduces TaskAnything and JudgeAnything benchmarks to evaluate multimodal LLMs' performance and judging capabilities across any-to-any modality tasks, revealing their strengths in multimodal understanding but significant challenges in generation tasks due to cross-modality biases.

Authors:Tianwen Zhou, Jing Wang, Songtao Wu, Kuanhong Xu
Title: ProDehaze: Prompting Diffusion Models Toward Faithful Image Dehazing
Abstract:
Recent approaches using large-scale pretrained diffusion models for image dehazing improve perceptual quality but often suffer from hallucination issues, producing dehazed images that are unfaithful to the original. To mitigate this, we propose ProDehaze, a framework that employs internal image priors to direct external priors encoded in pretrained models. We introduce two types of selective internal priors that prompt the model to concentrate on critical image areas: a Structure-Prompted Restorer in the latent space that emphasizes structure-rich regions, and a Haze-Aware Self-Correcting Refiner in the decoding process to align distributions between clearer input regions and the output. Extensive experiments on real-world datasets demonstrate that ProDehaze achieves high-fidelity results in image dehazing, particularly in reducing color shifts. Our code is at https://github.com/TianwenZhou/ProDehaze.
中文: ProDehaze是一种创新框架,通过选择性内部图像先验引导预训练扩散模型,在图像去雾中有效减少幻觉和色彩偏移,同时保持高保真度。
English: ProDehaze is a novel framework that leverages selective internal image priors to guide pretrained diffusion models, effectively reducing hallucination and color shifts in image dehazing while maintaining high fidelity.

Authors:Ran Liu, Fengyu Zhang, Cong Yu, Longjiang Yang, Zhuofan Wen, Siyuan Zhang, Hailiang Yao, Shun Chen, Zheng Lian, Bin Liu
Title: Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition
Abstract:
This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition. Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater issues of uncertainty and modal conflicts. For the Compound Expression (CE) Recognition Challenge, this paper proposes a multimodal emotion recognition method that fuses the features of Vision Transformer (ViT) and Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses the features of ViT and ResNet exhibits superior performance. Our code is available at https://github.com/MyGitHub-ax/8th_ABAW
中文: 本文提出融合视觉变换器和残差网络特征的多模态情绪识别方法,在C-EXPR-DB等复杂场景数据集上展现出优越性能。
English: This paper introduces a multimodal emotion recognition method combining Vision Transformer and ResNet features, demonstrating superior performance on complex datasets like C-EXPR-DB in the ABAW competition.

Authors:Haochen Zhang, Nader Zantout, Pujith Kachana, Ji Zhang, Wenshan Wang
Title: IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
Abstract:
With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code is publicly released at https://github.com/HaochenZ11/IRef-VLA.
中文: IRef-VLA基准数据集通过提供包含不完善参考信息的海量3D场景数据,旨在解决自然语言室内导航的挑战,推动稳健交互式导航系统的发展。
English: The IRef-VLA benchmark dataset is introduced to address challenges in indoor navigation using natural language by providing extensive 3D scene data with imperfect references, aiming to advance robust, interactive navigation systems.

Authors:Jerred Chen, Ronald Clark
Title: Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
Abstract:
In many robotics and VR/AR applications, fast camera motions cause a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
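The least-squares step described above follows the classical instantaneous motion-field model: for a pixel at normalized coordinates (x, y) with depth Z, the flow equals (1/Z) A(x, y) v + B(x, y) w, which is linear in the linear velocity v and angular velocity w. The sketch below solves that stacked system with numpy; it covers only the linear solve, not the paper's flow/depth network, and the data at the end is synthetic.

import numpy as np

def solve_velocity(xs, ys, flows, depths):
    """xs, ys: (N,) normalized image coords; flows: per-pixel (u, v); depths: (N,)."""
    rows, rhs = [], []
    for x, y, (fu, fv), Z in zip(xs, ys, flows, depths):
        A = np.array([[-1.0, 0.0, x],
                      [0.0, -1.0, y]]) / Z               # translational part
        B = np.array([[x * y, -(1.0 + x * x), y],
                      [1.0 + y * y, -x * y, -x]])         # rotational part
        rows.append(np.hstack([A, B]))                    # one (2, 6) block per pixel
        rhs.extend([fu, fv])
    sol, *_ = np.linalg.lstsq(np.vstack(rows), np.asarray(rhs), rcond=None)
    return sol[:3], sol[3:]                               # linear, angular velocity

# Synthetic sanity check: generate flow from known velocities and recover them.
rng = np.random.default_rng(0)
N = 200
xs, ys = rng.uniform(-0.5, 0.5, N), rng.uniform(-0.5, 0.5, N)
Z = rng.uniform(2.0, 6.0, N)
v_true, w_true = np.array([0.1, -0.05, 0.3]), np.array([0.01, 0.02, -0.01])
flows = [np.array([[-1, 0, x], [0, -1, y]]) / z @ v_true
         + np.array([[x * y, -(1 + x * x), y], [1 + y * y, -x * y, -x]]) @ w_true
         for x, y, z in zip(xs, ys, Z)]
v_est, w_est = solve_velocity(xs, ys, flows, Z)
print(np.allclose(v_est, v_true), np.allclose(w_est, w_true))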

Authors:Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Title: OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Abstract:
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model's reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
中文摘要:OpenVLThinker作为首个开源视觉语言大模型,通过交替使用监督微调与强化学习实现了卓越的链式推理能力,在多项视觉推理基准测试中取得显著性能突破。
English Summary: OpenVLThinker is an open-source vision-language model that achieves superior visual reasoning through alternating supervised fine-tuning and reinforcement learning, significantly advancing performance across multiple benchmarks.

Authors:Xianghan Meng, Zhiyuan Huang, Wei He, Xianbiao Qi, Rong Xiao, Chun-Guang Li
Title: Exploring a Principled Framework for Deep Subspace Clustering
Abstract:
Subspace clustering is a classical unsupervised learning task, built on a basic assumption that high-dimensional data can be approximated by a union of subspaces (UoS). Nevertheless, real-world data often deviate from the UoS assumption. To address this challenge, state-of-the-art deep subspace clustering algorithms attempt to jointly learn UoS representations and self-expressive coefficients. However, the general framework of the existing algorithms suffers from a catastrophic feature collapse and lacks a theoretical guarantee of learning the desired UoS representation. In this paper, we present a Principled fRamewOrk for Deep Subspace Clustering (PRO-DSC), which is designed to learn structured representations and self-expressive coefficients in a unified manner. Specifically, in PRO-DSC, we incorporate an effective regularization on the learned representations into the self-expressive model, prove that the regularized self-expressive model is able to prevent feature space collapse, and demonstrate that the learned optimal representations under certain conditions lie on a union of orthogonal subspaces. Moreover, we provide a scalable and efficient approach to implement our PRO-DSC and conduct extensive experiments to verify our theoretical findings and demonstrate the superior performance of our proposed deep subspace clustering approach. The code is available at https://github.com/mengxianghan123/PRO-DSC.
中文:本文提出了PRO-DSC原则性深度子空间聚类框架,通过有效正则化防止特征坍缩,理论上保证学习到的表示位于正交子空间上,实验证明其具有优越性能。
English: This paper introduces PRO-DSC, a principled deep subspace clustering framework that prevents feature collapse through effective regularization and theoretically guarantees learned representations lie on orthogonal subspaces, demonstrating superior performance in experiments.
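For readers unfamiliar with this line of work, a generic form of the joint objective that deep self-expressive methods optimize is sketched below in LaTeX; the specific representation regularizer $R(\cdot)$ and the conditions under which PRO-DSC provably prevents collapse are given in the paper.

\min_{\theta,\, C}\ \big\lVert Z_\theta - Z_\theta C \big\rVert_F^2 \;+\; \lambda\, \lVert C \rVert \;+\; \gamma\, R(Z_\theta)
\quad \text{s.t.}\quad \operatorname{diag}(C) = 0, \qquad Z_\theta = f_\theta(X),

where $f_\theta$ is the feature encoder, $C$ collects the self-expressive coefficients used to build the affinity for spectral clustering, and $R(\cdot)$ regularizes the learned representations so that the trivial collapsed solution is no longer optimal.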

Authors:Hiromu Taketsugu, Takeru Oba, Takahiro Maeda, Shohei Nobuhara, Norimichi Ukita
Title: Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment
Abstract:
Humans can predict future human trajectories even from momentary observations by using human pose-related cues. However, previous Human Trajectory Prediction (HTP) methods leverage the pose cues implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of the predicted trajectory by locomotion generation under the laws of physics. While the plausibility of locomotion is learned with an indifferentiable physics simulator, it is replaced by our differentiable Locomotion Value function to train an HTP network in a data-driven manner. In particular, our proposed Embodied Locomotion loss is beneficial for efficiently training a stochastic HTP network using multiple heads. Furthermore, the Locomotion Value filter is proposed to filter out implausible trajectories at inference. Experiments demonstrate that our method enhances even the state-of-the-art HTP methods across diverse datasets and problem settings. Our code is available at: https://github.com/ImIntheMiddle/EmLoco.
中文摘要:提出的运动体现框架通过基于物理定律的运动生成来显式评估轨迹合理性,借助可微分运动值和过滤机制有效提升了多种数据集中的轨迹预测精度。
English Summary: The proposed Locomotion Embodiment framework explicitly evaluates trajectory plausibility through physics-based locomotion generation, enhancing prediction accuracy across diverse datasets by integrating differentiable locomotion values and filtering mechanisms.

Authors:Shuang Guo, Friedhelm Hamann, Guillermo Gallego
Title: Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
Abstract:
Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are present and recorded in the event data, or neither is captured. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit with the above-mentioned nature of event cameras and overlooks the inherent relations between them. We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. From the data generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity. This error is further combined with the contrast maximization framework to form a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show our method's state-of-the-art performance: in optical flow estimation, it reduces EPE by 20% and AE by 25% compared to unsupervised approaches, while delivering competitive intensity estimation results, particularly in high dynamic range scenarios. Our method also achieves shorter inference time than all other optical flow methods and many of the image reconstruction methods, while they output only one quantity. Project page: https://github.com/tub-rip/E2FAI
Chinese: 我们提出了一种无监督学习框架,通过单一网络联合估计光流和图像强度,在精度和效率上均实现了显著提升,达到了领先性能。
English: We introduce an unsupervised learning framework that jointly estimates optical flow and image intensity using a single network, achieving state-of-the-art performance with significant improvements in accuracy and efficiency.
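The coupling of flow and intensity rests on the standard event generation model; the sketch below states that background in LaTeX (the paper's exact photometric error and its combination with contrast maximization are derived in the paper). A pixel $\mathbf{x}$ fires an event of polarity $p_k \in \{+1, -1\}$ whenever the log intensity changes by the contrast threshold $C$:

L(\mathbf{x}, t) - L(\mathbf{x}, t - \Delta t) = p_k\, C .

Assuming brightness constancy along the optical flow $\mathbf{v}$, i.e. $L(\mathbf{x}, t) \approx L(\mathbf{x} - \mathbf{v}\Delta t,\, t - \Delta t)$, the intensity change predicted by warping with $\mathbf{v}$ must match the accumulated event polarities, giving an event-based photometric residual of the form

E(\mathbf{v}, L) \;=\; \sum_{\mathbf{x}} \Big( L(\mathbf{x}, t) - L(\mathbf{x} - \mathbf{v}\Delta t,\, t - \Delta t) \;-\; C \sum_{e_k \,\text{at}\, \mathbf{x}} p_k \Big)^{2},

which depends jointly on the flow and the intensity image and can therefore constrain both quantities at once.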

Authors:Jie Mei, Chenyu Lin, Yu Qiu, Yaonan Wang, Hui Zhang, Ziyang Wang, Dong Dai
Title: Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images
Abstract:
Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, while it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based models are expected to address these problems, however, existing small-scale and private datasets limit significant performance improvements for these methods. Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. Furthermore, we propose a cross-modal interactive perception network with Mamba (CIPA) for lung tumor segmentation in PET-CT images. Specifically, we design a channel-wise rectification module (CRM) that implements a channel state space block across multi-modal features to learn correlated representations and helps filter out modality-specific noise. A dynamic cross-modality interaction module (DCIM) is designed to effectively integrate position and context information, which employs PET images to learn regional position information and serves as a bridge to assist in modeling the relationships between local features of CT images. Extensive experiments on a comprehensive benchmark demonstrate the effectiveness of our CIPA compared to the current state-of-the-art segmentation methods. We hope our research can provide more exploration opportunities for medical image segmentation. The dataset and code are available at https://github.com/mj129/CIPA.
Chinese: 本研究提出了用于肺肿瘤分割的大规模PET-CT数据集PCLT20K,并设计了一种跨模态交互感知网络(CIPA),通过有效整合多模态特征来提升分割精度,其性能优于现有最先进方法。
English: The study introduces PCLT20K, a large-scale PET-CT dataset for lung tumor segmentation, and proposes a cross-modal interactive perception network (CIPA) that effectively integrates multi-modal features to improve segmentation accuracy, outperforming current state-of-the-art methods.

Authors:Devavrat Tomar, Guillaume Vray, Dwarikanath Mahapatra, Sudipta Roy, Jean-Philippe Thiran, Behzad Bozorgtabar
Title: Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology
Abstract:
In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method's superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.
中文: 本研究提出一种新颖的病理切片图像少样本分类方法,通过将语言模型的病理先验知识融入视觉语言模型框架,并利用提示学习仅需少量标注数据即可实现优于现有方法的性能。
English: This study introduces a novel method for few-shot classification of histopathology whole slide images by integrating pathological prior knowledge from language models into a vision-language model framework, achieving superior performance with minimal labeled data through prompt learning.
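A minimal sketch of the underlying VLM-based MIL idea (not the SLIP implementation): score each WSI patch against tissue-type text prompts with an off-the-shelf CLIP model and aggregate the top-scoring patches into a slide-level prediction. The tissue prompts, checkpoint, and top-k mean pooling are illustrative assumptions.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

tissue_prompts = ["a patch of tumor tissue", "a patch of normal tissue"]  # assumed prompts

def slide_score(patch_images, k=8):
    """patch_images: list of PIL patches from one slide; returns a slide-level tumor score."""
    inputs = processor(text=tissue_prompts, images=patch_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    probs = out.logits_per_image.softmax(dim=-1)      # (num_patches, num_prompts)
    topk = probs[:, 0].topk(min(k, probs.shape[0])).values
    return topk.mean().item()                         # top-k mean as the bag aggregation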

Authors:Yu-Hsi Chen
Title: Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID
Abstract:
Detecting and tracking multiple unmanned aerial vehicles (UAVs) in thermal infrared video is inherently challenging due to low contrast, environmental noise, and small target sizes. This paper provides a straightforward approach to address multi-UAV tracking in thermal infrared video, leveraging recent advances in detection and tracking. Instead of relying on the well-established YOLOv5 with DeepSORT combination, we present a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. We evaluate our approach following the 4th Anti-UAV Challenge metrics and reach competitive performance. Notably, we achieved strong results without using contrast enhancement or temporal information fusion to enrich UAV features, highlighting our approach as a "Strong Baseline" for multi-UAV tracking tasks. We provide implementation details, in-depth experimental analysis, and a discussion of potential improvements. The code is available at https://github.com/wish44165/YOLOv12-BoT-SORT-ReID .
Chinese: 本文提出了一种基于YOLOv12和BoT-SORT的鲁棒跟踪框架,通过针对性训练策略,在不使用对比度增强或时序融合的情况下,实现了热红外视频中多无人机跟踪的优异性能。
English: This paper introduces a robust tracking framework using YOLOv12 and BoT-SORT with specialized training strategies, achieving competitive multi-UAV tracking in thermal infrared video without relying on contrast enhancement or temporal fusion.
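A minimal sketch of the detector-plus-tracker pairing, using the ultralytics API with its built-in BoT-SORT tracker; the checkpoint name, video path, and thresholds below are assumptions, and the authors' tailored training and ReID settings live in their repository.

from ultralytics import YOLO

model = YOLO("yolo12n.pt")                   # assumed YOLOv12 checkpoint name
results = model.track(
    source="thermal_uav_clip.mp4",           # hypothetical thermal infrared video
    tracker="botsort.yaml",                  # ultralytics' BoT-SORT configuration
    conf=0.25,
    persist=True,
)
for r in results:
    print(r.boxes.id, r.boxes.xyxy)          # per-frame track IDs and boxes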

Authors:Aryan Yazdan Parast, Basim Azam, Naveed Akhtar
Title: DDB: Diffusion Driven Balancing to Address Spurious Correlations
Abstract:
Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a Diffusion Driven Balancing (DDB) technique to generate training samples with text-to-image diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model's reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves better worst-group accuracy than the existing state-of-the-art methods. Our code is available at https://github.com/ArianYp/DDB.
中文: 提出的扩散驱动平衡技术利用文本到图像扩散模型生成打破伪相关性的训练样本,从而提升模型泛化能力,并在多个基准测试中实现了最优的最差组准确率。
English: The proposed Diffusion Driven Balancing technique uses text-to-image diffusion models to generate training samples that break spurious correlations, improving model generalization and achieving superior worst-group accuracy across benchmarks.

Authors:Ting Sun, Cheng Cui, Yuning Du, Yi Liu
Title: PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
Abstract:
Document layout analysis is a critical preprocessing step in document intelligence, enabling the detection and localization of structural elements such as titles, text blocks, tables, and formulas. Despite its importance, existing layout detection models face significant challenges in generalizing across diverse document types, handling complex layouts, and achieving real-time performance for large-scale data processing. To address these limitations, we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. To meet different needs, we offer three models of varying scales. PP-DocLayout-L is a high-precision model based on the RT-DETR-L detector, achieving 90.4% mAP@0.5 and an end-to-end inference time of 13.4 ms per page on a T4 GPU. PP-DocLayout-M is a balanced model, offering 75.2% mAP@0.5 with an inference time of 12.7 ms per page on a T4 GPU. PP-DocLayout-S is a high-efficiency model designed for resource-constrained environments and real-time applications, with an inference time of 8.1 ms per page on a T4 GPU and 14.5 ms on a CPU. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data, enabling advancements in document intelligence and multimodal AI systems. Code and models are available at https://github.com/PaddlePaddle/PaddleX .
中文: PP-DocLayout通过三种不同规模的模型解决了文档布局分析中的泛化与效率难题,在多种格式下实现了高精度和实时处理能力。
English: PP-DocLayout introduces three scalable models that overcome generalization and efficiency challenges in document layout analysis, achieving high precision and real-time performance across diverse formats.

Authors:Yongli Xiang, Ziming Hong, Lina Yao, Dadong Wang, Tongliang Liu
Title: Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
Abstract:
Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a "non-transferable barrier" to restrict generalization from authorized to unauthorized domains. Recently, a well-designed attack, which restores unauthorized-domain performance by fine-tuning NTL models on a few authorized samples, has highlighted the security risks of NTL-based applications. However, such an attack requires modifying model weights and is thus invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) to jailbreak the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising at two levels, including: (i) data-intrinsic disguising (DID) for eliminating domain discrepancy and preserving class-related content at the input level, and (ii) model-guided disguising (MGD) for mitigating output-level statistics differences of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 55.7% in the unauthorized domain by using only 1% of authorized samples, largely exceeding existing SOTA white-box attacks.
中文: 本研究提出JailNTL攻击方法,通过数据伪装在测试阶段绕过黑盒非可转移学习(NTL)的安全屏障,无需修改模型权重即可将未授权域准确率最高提升55.7%,显著超越现有白盒攻击效果。
English: This study introduces JailNTL, a novel black-box attack that bypasses non-transferable learning (NTL) security by disguising unauthorized data through input-level and output-level transformations, achieving up to 55.7% accuracy improvement without modifying model weights.
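To illustrate only the input-level (data-intrinsic) part of the disguising idea, the sketch below nudges unauthorized images so their per-channel statistics match those of a few authorized samples while staying close to the original content; JailNTL itself uses a learned generator and an additional model-guided term, so this is a simplified stand-in.

import torch

def disguise(unauth, auth_ref, steps=200, lr=0.05, w_content=1.0):
    """unauth: (B,3,H,W) unauthorized images in [0,1]; auth_ref: (M,3,H,W) authorized refs."""
    delta = torch.zeros_like(unauth, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    ref_mean = auth_ref.mean(dim=(0, 2, 3))
    ref_std = auth_ref.std(dim=(0, 2, 3))
    for _ in range(steps):
        x = (unauth + delta).clamp(0.0, 1.0)
        stat_loss = ((x.mean(dim=(0, 2, 3)) - ref_mean) ** 2).sum() \
                  + ((x.std(dim=(0, 2, 3)) - ref_std) ** 2).sum()
        content_loss = (delta ** 2).mean()           # keep class-related content intact
        loss = stat_loss + w_content * content_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (unauth + delta).clamp(0.0, 1.0).detach()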

Authors:Xingchao Yang, Takafumi Taketomi, Yuki Endo, Yoshihiro Kanamori
Title: FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy
Abstract:
Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages pre-trained stable diffusion model alongside a Cross-Assembly inference strategy to fulfill this objective. In FreeUV, separate networks are trained independently to focus on realistic appearance and structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and demonstrates robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV's effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative metrics. Additionally, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios.

Authors:Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci, Nicola Strisciuglio
Title: Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.
中文摘要:视觉语言模型通过提出的"测地可分解嵌入"框架,在视觉表征中自动形成了组合推理能力,并在组合分类和群体鲁棒性任务中展现出卓越性能。
English Summary: Vision-Language Models automatically develop compositional reasoning in visual representations through the proposed Geodesically Decomposable Embeddings framework, which demonstrates superior performance in compositional classification and group robustness tasks.

Authors:Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Björn Þór Jónsson, Onanong Kongmeesub, Hoang-Bao Le, Stevan Rudinac, Klaus Schöffmann, Florian Spiess, Allie Tran, Minh-Triet Tran, Quang-Linh Tran, Cathal Gurrin
Title: The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
Abstract:
Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.

Authors:Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
Title: Beyond Accuracy: What Matters in Designing Well-Behaved Models?
Abstract:
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights, such as: (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm to improve almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
中文: 本研究通过分析326个模型的九个质量维度填补了深度神经网络综合评估的空白,发现视觉语言模型和自监督学习能提升鲁棒性与公平性,并推出QUBA评分体系实现定制化模型推荐。
English: This study addresses the gap in evaluating deep neural networks' overall quality by analyzing nine dimensions across 326 models, revealing that vision-language models and self-supervised learning enhance robustness and fairness, and introduces the QUBA score for tailored model recommendations.

Authors:Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Gaopeng Gou, Qi Wu
Title: Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval
Abstract:
Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating text at the latent space. The two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/predicir.
中文: PrediCIR提出了一种基于预测的映射网络,通过自适应推断参考图像中缺失的视觉内容来提升零样本组合图像检索效果,在六项任务中均实现了最先进的性能。
English: PrediCIR introduces a prediction-based mapping network that adaptively infers missing visual content in reference images to enhance zero-shot composed image retrieval, achieving state-of-the-art performance across six tasks.

Authors:Ruiyang Ha, Songyi Jiang, Bin Li, Bikang Pan, Yihang Zhu, Junjie Zhang, Xiatian Zhu, Shaogang Gong, Jingya Wang
Title: Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
Abstract:
Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce Uni-Prompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Our dataset is available at: https://mp-reid.github.io/.

Authors:Kwan Yun, Chaelin Kim, Hangyeul Shin, Junyong Noh
Title: FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields
Abstract:
Recent 3D face editing methods using masks have produced high-quality edited images by leveraging Neural Radiance Fields (NeRF). Despite their impressive performance, existing methods often provide limited user control due to the use of pre-trained segmentation masks. To utilize masks with a desired layout, an extensive training dataset is required, which is challenging to gather. We present FFaceNeRF, a NeRF-based face editing technique that can overcome the challenge of limited user control due to the use of fixed mask layouts. Our method employs a geometry adapter with feature injection, allowing for effective manipulation of geometry attributes. Additionally, we adopt latent mixing for tri-plane augmentation, which enables training with a few samples. This facilitates rapid model adaptation to desired mask layouts, crucial for applications in fields like personalized medical imaging or creative face editing. Our comparative evaluations demonstrate that FFaceNeRF surpasses existing mask based face editing methods in terms of flexibility, control, and generated image quality, paving the way for future advancements in customized and high-fidelity 3D face editing. The code is available on the project page: https://kwanyun.github.io/FFaceNeRF_page/.

Authors:Johan Edstedt, André Mateus, Alberto Jaenal
Title: ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
Abstract:
Structure-from-Motion (SfM) is the task of estimating 3D structure and camera poses from images. We define Collaborative SfM (ColabSfM) as sharing distributed SfM reconstructions. Sharing maps requires estimating a joint reference frame, which is typically referred to as registration. However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. We find that current registration methods cannot register SfM point clouds when trained on existing datasets. To this end, we propose a SfM registration dataset generation pipeline, leveraging partial reconstructions from synthetically generated camera trajectories for each scene. Finally, we propose a simple but impactful neural refiner on top of the SotA registration method RoITr that yields significant improvements, which we call RefineRoITr. Our extensive experimental evaluation shows that our proposed pipeline and model enables ColabSfM. Code is available at https://github.com/EricssonResearch/ColabSfM
中文: 本文提出ColabSfM解决分布式运动恢复结构重建的配准难题,通过建立可扩展的点云配准任务、开发新型数据集生成流程以及设计名为RefineRoITr的神经优化器,显著提升了配准性能。
English: This paper introduces ColabSfM to address the challenge of registering distributed SfM reconstructions by proposing a scalable point cloud registration task, a novel dataset generation pipeline, and a neural refiner called RefineRoITr that significantly enhances registration performance.
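Once correspondences between two SfM point clouds are available, merging them into a joint reference frame reduces to a closed-form similarity fit; the sketch below shows that final alignment step with the Umeyama solution. Finding reliable correspondences on SfM point clouds is the hard part that RefineRoITr addresses, so this sketch assumes matched points are given.

import numpy as np

def umeyama(src, dst):
    """src, dst: (N, 3) corresponding points; returns s, R, t with dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                                # reflection correction
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) * len(src) / (xs ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t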

Authors:Victor Besnier, Mickael Chen, David Hurych, Eduardo Valle, Matthieu Cord
Title: Halton Scheduler For Masked Generative Image Transformer
Abstract:
Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects the token's position according to a quasi-random, low-discrepancy Halton sequence. Intuitively, that method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it allows reducing non-recoverable sampling errors, leading to simpler hyper-parameters tuning and better quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation of both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively by reducing the FID and qualitatively by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.
中文: MaskGIT的原始置信度调度器被基于准随机哈尔顿序列的调度器取代,通过均匀选择标记位置来减少采样误差,无需重新训练即可提升图像质量与多样性并简化超参数调整。
English: MaskGIT's original Confidence scheduler is replaced by a Halton scheduler that uses quasi-random sequences to uniformly select tokens, improving image quality and diversity while simplifying hyper-parameter tuning without retraining.
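The position-selection idea is easy to reproduce: order the token grid with a 2D Halton sequence (bases 2 and 3) so that tokens unmasked early are spread uniformly over the image. The sketch below covers only that ordering, not the full MaskGIT sampling loop; the per-step chunk sizes and token values are handled elsewhere.

def halton(index, base):
    f, result = 1.0, 0.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def halton_token_order(grid_h, grid_w):
    """All (row, col) token positions, ordered by a 2D Halton sequence, without repeats."""
    order, seen, i = [], set(), 1
    while len(order) < grid_h * grid_w:
        r = int(halton(i, 2) * grid_h)
        c = int(halton(i, 3) * grid_w)
        if (r, c) not in seen:
            seen.add((r, c))
            order.append((r, c))
        i += 1
    return order

print(halton_token_order(4, 4)[:8])   # first 8 positions to unmask on a 4x4 token grid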

Authors:Pablo Garcia-Fernandez, Lorenzo Vaquero, Mingxuan Liu, Feng Xue, Daniel Cores, Nicu Sebe, Manuel Mucientes, Elisa Ricci
Title: Superpowering Open-Vocabulary Object Detectors for X-ray Vision
Abstract:
Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.
中文:RAXO是一种无需训练的框架,通过构建增强的视觉描述符将RGB物体检测器应用于X射线安检,无需标注数据即可显著提升检测性能。
English: RAXO is a training-free framework that adapts RGB object detectors for X-ray security screening by creating enriched visual descriptors, achieving significant performance improvements without labeled data.
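A minimal sketch of the classification swap described above (not the RAXO pipeline): detected-region features are matched to visual class prototypes built from retrieved exemplar images instead of to text embeddings; feature extraction and the X-ray material-transfer enrichment are abstracted away, and all names are illustrative.

import torch
import torch.nn.functional as F

def build_prototypes(exemplar_feats):
    """exemplar_feats: dict class_name -> (N_c, D) features of retrieved exemplar images."""
    return {c: F.normalize(f.mean(dim=0), dim=0) for c, f in exemplar_feats.items()}

def classify_regions(region_feats, prototypes):
    """region_feats: (R, D) detector region features; returns (class, score) per region."""
    names = list(prototypes)
    proto = torch.stack([prototypes[c] for c in names])      # (C, D)
    sims = F.normalize(region_feats, dim=1) @ proto.T        # intra-modal cosine similarity
    scores, idx = sims.max(dim=1)
    return [(names[i], s.item()) for i, s in zip(idx, scores)]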

Authors:Yongkang Cheng, Shaoli Huang, Xuelin Chen, Jifeng Ning, Mingming Gong
Title: DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech
Abstract:
Diffusion models have demonstrated remarkable synthesis quality and diversity in generating co-speech gestures. However, the computationally intensive sampling steps associated with diffusion models hinder their practicality in real-world applications. Hence, we present DIDiffGes, a Decoupled Semi-Implicit Diffusion model-based framework that can synthesize high-quality, expressive gestures from speech using only a few sampling steps. Our approach leverages Generative Adversarial Networks (GANs) to enable large-step sampling for the diffusion model. We decouple gesture data into body and hands distributions and further decompose them into marginal and conditional distributions. GANs model the marginal distribution implicitly, while an L2 reconstruction loss learns the conditional distributions explicitly. This strategy enhances GAN training stability and ensures expressiveness of the generated full-body gestures. Our framework also learns to denoise root noise conditioned on local body representation, guaranteeing stability and realism. DIDiffGes can generate gestures from speech with just 10 sampling steps, without compromising quality and expressiveness, reducing the number of sampling steps by a factor of 100 compared to existing methods. Our user study reveals that our method outperforms state-of-the-art approaches in human likeness, appropriateness, and style correctness. The project page is https://cyk990422.github.io/DIDiffGes.

Authors:Fangyijie Wang, Kathleen M. Curran, Guénolé Silvestre
Title: Semi-supervised Cervical Segmentation on Ultrasound by A Dual Framework for Neural Networks
Abstract:
Accurate segmentation of ultrasound (US) images of the cervical muscles is crucial for precision healthcare. The demand for automatic computer-assisted methods is high. However, the scarcity of labeled data hinders the development of these methods. Advanced semi-supervised learning approaches have displayed promise in overcoming this challenge by utilizing labeled and unlabeled data. This study introduces a novel semi-supervised learning (SSL) framework that integrates dual neural networks. This SSL framework utilizes both networks to generate pseudo-labels and cross-supervise each other at the pixel level. Additionally, a self-supervised contrastive learning strategy is introduced, which employs a pair of deep representations to enhance feature learning capabilities, particularly on unlabeled data. Our framework demonstrates competitive performance in cervical segmentation tasks. Our code is publicly available at https://github.com/13204942/SSL_Cervical_Segmentation.
Chinese: 本研究提出了一种新颖的半监督学习框架,通过整合双神经网络进行像素级交叉监督和自监督对比学习,在宫颈肌肉超声图像分割中展现出优异性能,有效解决了标注数据稀缺的问题。
English: This study introduces a novel semi-supervised learning framework that combines dual neural networks for pixel-level cross-supervision and self-supervised contrastive learning to achieve competitive performance in cervical muscle ultrasound image segmentation, addressing the challenge of limited labeled data.
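A minimal sketch of pixel-level cross-supervision between two networks, as described above: each model's hard pseudo-labels on unlabeled images supervise the other. Tensor shapes and the equal loss weighting are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_supervision_loss(logits_a, logits_b):
    """logits_*: (B, C, H, W) outputs of the two networks on the same unlabeled batch."""
    pseudo_a = logits_a.argmax(dim=1).detach()    # hard labels produced by network A
    pseudo_b = logits_b.argmax(dim=1).detach()    # hard labels produced by network B
    loss_a = F.cross_entropy(logits_a, pseudo_b)  # A learns from B's pseudo-labels
    loss_b = F.cross_entropy(logits_b, pseudo_a)  # B learns from A's pseudo-labels
    return loss_a + loss_b

# toy check with random logits
la, lb = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
print(cross_supervision_loss(la, lb).item())
```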

Authors:Chandan Yeshwanth, David Rozenberszki, Angela Dai
Title: ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Abstract:
Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods do this by describing objects at a single level of detail, which often does not capture fine-grained details such as varying textures, materials, and shapes of the parts of objects. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level of detail captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with a Cider score improvement of 17% and 124% for object- and part-level details respectively. Our code, dataset and models will be made publicly available.

Authors:Jianchuan Chen, Jingchuan Hu, Gaige Wang, Zhonghua Jiang, Tiansong Zhou, Zhiwen Chen, Chengfei Lv
Title: TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
Abstract:
Realistic 3D full-body talking avatars hold great potential in AR, with applications ranging from e-commerce live streaming to holographic communication. Despite advances in 3D Gaussian Splatting (3DGS) for lifelike avatar creation, existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks. Additionally, they often lack sufficient details and cannot run in real-time on mobile devices. We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. Our approach starts by creating a personalized clothed human parametric template that binds Gaussians to represent appearances. We then pre-train a StyleUnet-based network to handle complex pose-dependent non-rigid deformation, which can capture high-frequency appearance details but is too resource-intensive for mobile devices. To overcome this, we "bake" the non-rigid deformations into a lightweight MLP-based network using a distillation technique and develop blend shapes to compensate for details. Extensive experiments show that TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.

Authors:Qinghe Ma, Jian Zhang, Zekun Li, Lei Qi, Qian Yu, Yinghuan Shi
Title: Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation
Abstract:
Large pretrained visual foundation models exhibit impressive general capabilities. However, the extensive prior knowledge inherent in these models can sometimes be a double-edged sword when adapting them to downstream tasks in specific domains. In the context of semi-supervised medical image segmentation with domain shift, foundation models like MedSAM tend to make overconfident predictions, some of which are incorrect. The error accumulation hinders the effective utilization of unlabeled data and limits further improvements. In this paper, we introduce a Synergistic training framework for Foundation and Conventional models (SynFoC) to address the issue. We observe that a conventional model trained from scratch has the ability to correct the high-confidence mispredictions of the foundation model, while the foundation model can supervise it with high-quality pseudo-labels in the early training stages. Furthermore, to enhance the collaborative training effectiveness of both models and promote reliable convergence towards optimization, the consensus-divergence consistency regularization is proposed. We demonstrate the superiority of our method across four public multi-domain datasets. In particular, our method improves the Dice score by 10.31% on the Prostate dataset. Our code is available at https://github.com/MQinghe/SynFoC.
中文:SynFoC框架通过协同训练基础模型和传统模型,纠正过度自信的错误预测,并利用一致性-分歧正则化提升半监督医学图像分割的效果,在多个数据集上实现了显著性能提升。
English: The SynFoC framework synergistically combines a foundation model and a conventional model to correct overconfident mispredictions and enhance semi-supervised medical image segmentation, achieving significant performance improvements across multiple datasets.
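One plausible reading of the consensus-divergence consistency regularization is sketched below: pixels where the foundation and conventional models agree are supervised with the shared hard label, while disagreeing pixels receive a soft consistency penalty. This is an illustrative interpretation with assumed tensor shapes, not the exact regularizer from the paper.

```python
import torch
import torch.nn.functional as F

def consensus_divergence_loss(logits_f, logits_c):
    """logits_f: foundation-model logits, logits_c: conventional-model logits, both (B, C, H, W).
    Agreed pixels give a hard pseudo-label for the conventional model; disagreed pixels are
    pushed toward each other with a soft consistency term."""
    pred_f = logits_f.argmax(dim=1)
    pred_c = logits_c.argmax(dim=1)
    agree = (pred_f == pred_c)                                         # (B, H, W) consensus mask

    ce = F.cross_entropy(logits_c, pred_f.detach(), reduction="none")  # per-pixel CE
    consensus_term = (ce * agree).sum() / agree.sum().clamp(min=1)

    p_f = F.softmax(logits_f, dim=1)
    p_c = F.softmax(logits_c, dim=1)
    mse = ((p_f - p_c) ** 2).mean(dim=1)                               # per-pixel divergence
    divergence_term = (mse * (~agree)).sum() / (~agree).sum().clamp(min=1)
    return consensus_term + divergence_term

print(consensus_divergence_loss(torch.randn(1, 4, 16, 16), torch.randn(1, 4, 16, 16)).item())
```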

Authors:Xu Zhang, Hao Zhou, Haoming Qin, Xiaobin Lu, Jiaxing Yan, Guanzhong Wang, Zeyu Chen, Yi Liu
Title: Enabling Versatile Controls for Video Diffusion Models
Abstract:
Despite substantial progress in text-to-video generation, achieving precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address these limitations, we introduce VCtrl (also termed PP-VCtrl), a novel framework designed to enable fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, into pretrained video diffusion models via a generalizable conditional module capable of uniformly encoding multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to efficiently incorporate control representations. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality. The source code and pre-trained models are publicly available and implemented using the PaddlePaddle framework at http://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers/examples/ppvctrl.
中文摘要:VCtrl是一种新颖框架,通过通用条件模块整合多种控制信号,实现对预训练视频扩散模型的细粒度控制,有效提升了生成质量与可控性。
English Summary: VCtrl is a novel framework that enables fine-grained control over pre-trained video diffusion models by integrating diverse control signals through a generalizable conditional module, enhancing both controllability and generation quality.

Authors:Xiaofeng Mao, Yuefeng Chen, Rong Zhang, Hui Xue, Zhao Li, Hang Su
Title: EasyRobust: A Comprehensive and Easy-to-use Toolkit for Robust and Generalized Vision
Abstract:
Deep neural networks (DNNs) have shown great promise in computer vision tasks. However, machine vision achieved by DNNs is not as robust as human perception. Adversarial attacks and data distribution shifts are two major scenarios that degrade machine performance and obstruct the wide deployment of machines "in the wild". To remove these obstacles and facilitate research on model robustness, we develop EasyRobust, a comprehensive and easy-to-use toolkit for training, evaluating and analyzing robust vision models. EasyRobust targets two types of robustness: 1) adversarial robustness, which enables the model to defend against malicious inputs crafted by worst-case perturbations, also known as adversarial examples; 2) non-adversarial robustness, which enhances model performance on natural test images with corruptions or distribution shifts. Thorough benchmarks on image classification enable EasyRobust to provide an accurate robustness evaluation of vision models. We hope EasyRobust can help train practically robust models and promote academic and industrial progress in closing the gap between human and machine vision. Code and models for EasyRobust have been open-sourced at https://github.com/alibaba/easyrobust.
Chinese: 深度神经网络在计算机视觉任务中表现出色,但在对抗性攻击和数据分布变化下缺乏人类感知的鲁棒性,为此我们开发了EasyRobust开源工具包,用于训练和评估稳健视觉模型以缩小人机视觉差距。
English: Deep neural networks excel in computer vision but lack human-level robustness against adversarial attacks and data shifts, prompting the development of EasyRobust, an open-source toolkit for training and evaluating robust vision models to bridge this gap.

Authors:Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu
Title: Distilling Monocular Foundation Model for Fine-grained Depth Completion
Abstract:
Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images, which distills geometric knowledge to depth completion. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Besides, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking first place on the KITTI benchmark. Code is available at https://github.com/Sharpiless/DMD3C
Chinese: 本文提出了一种两阶段知识蒸馏框架,利用单目基础模型为深度补全提供密集监督,无需真实深度数据即可在KITTI基准测试中实现最优性能。
English: This paper introduces a two-stage knowledge distillation framework that leverages monocular foundation models to provide dense supervision for depth completion, achieving state-of-the-art performance on the KITTI benchmark without requiring ground-truth depth data.
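The scale- and shift-invariant loss used in the second stage is commonly implemented by solving a per-image least-squares alignment before measuring the error; the sketch below shows that standard formulation (the paper's exact variant may differ).

```python
import torch

def ssi_loss(pred, target, mask):
    """Scale- and shift-invariant depth loss. pred/target/mask: (B, H, W), where mask marks
    pixels with valid ground truth. For each sample, solve least squares for a scale s and
    shift t aligning pred to target, then penalize the residual."""
    losses = []
    for i in range(pred.shape[0]):
        p = pred[i][mask[i] > 0]
        t = target[i][mask[i] > 0]
        A = torch.stack([p, torch.ones_like(p)], dim=1)        # (N, 2) design matrix
        sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution   # optimal [scale, shift]
        aligned = A @ sol                                      # (N, 1) aligned prediction
        losses.append(((aligned.squeeze(1) - t) ** 2).mean())
    return torch.stack(losses).mean()

pred, gt = torch.rand(2, 8, 8), torch.rand(2, 8, 8)
print(ssi_loss(pred, gt, torch.ones(2, 8, 8)).item())
```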

Authors:Wei Zhang, Mengting Ma, Yizhen Jiang, Rongrong Lian, Zhenkai Wu, Kangning Cui, Xiaowen Ma
Title: Center-guided Classifier for Semantic Segmentation of Remote Sensing Images
Abstract:
Compared with natural images, remote sensing images (RSIs) have a unique characteristic, i.e., larger intraclass variance, which makes semantic segmentation for remote sensing images more challenging. Moreover, existing semantic segmentation models for remote sensing images usually employ a vanilla softmax classifier, which has three drawbacks: (1) non-direct supervision for the pixel representations during training; (2) inadequate modeling ability of parametric softmax classifiers under large intraclass variance; and (3) opaque process of classification decision. In this paper, we propose a novel classifier (called CenterSeg) customized for RSI semantic segmentation, which solves the abovementioned problems with multiple prototypes, direct supervision under the Grassmann manifold, and an interpretability strategy. Specifically, for each class, our CenterSeg obtains local class centers by aggregating corresponding pixel features based on ground-truth masks, and generates multiple prototypes through hard attention assignment and momentum updating. In addition, we introduce the Grassmann manifold and constrain the joint embedding space of pixel features and prototypes based on two additional regularization terms. In particular, during inference, CenterSeg can further provide interpretability by restricting the prototypes to samples from the training set. Experimental results on three remote sensing segmentation datasets validate the effectiveness of the model. Besides the superior performance, CenterSeg has the advantages of simplicity, light weight, compatibility, and interpretability. Code is available at https://github.com/xwmaxwma/rssegmentation.
中文摘要:CenterSeg模型通过多原型、格拉斯曼流形监督和可解释性策略,解决了遥感图像语义分割中类内方差大的难题,在实现优越性能的同时兼具简洁性和兼容性。
English Summary: The CenterSeg model addresses the challenges of large intraclass variance in remote sensing image segmentation by introducing multiple prototypes, Grassmann manifold supervision, and interpretability strategies, achieving superior performance with simplicity and compatibility.
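A minimal sketch of the center-guided prototype idea: local class centers are computed by averaging pixel features under the ground-truth mask and folded into running prototypes with momentum. Tensor shapes and the single-prototype-per-class simplification are assumptions, not the CenterSeg release.

```python
import torch

def update_prototypes(features, labels, prototypes, momentum=0.9):
    """features: (B, D, H, W) pixel embeddings, labels: (B, H, W) ground-truth class ids,
    prototypes: (C, D) running class centers updated with momentum."""
    B, D, H, W = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, D)   # (B*H*W, D)
    labs = labels.reshape(-1)                             # (B*H*W,)
    new_protos = prototypes.clone()
    for c in range(prototypes.shape[0]):
        sel = labs == c
        if sel.any():
            center = feats[sel].mean(dim=0)               # local class center for this batch
            new_protos[c] = momentum * prototypes[c] + (1 - momentum) * center
    return new_protos

protos = torch.zeros(4, 16)
protos = update_prototypes(torch.randn(2, 16, 8, 8), torch.randint(0, 4, (2, 8, 8)), protos)
print(protos.shape)
```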

Authors:Yinhan Zhang, Yue Ma, Bingyuan Wang, Qifeng Chen, Zeyu Wang
Title: Follow-Your-Color: Multi-Instance Sketch Colorization
Abstract:
We present Follow-Your-Color, a diffusion-based framework for multi-instance sketch colorization. The production of multi-instance 2D line art colorization adheres to an industry-standard workflow, which consists of three crucial stages: the design of line art characters, the coloring of individual objects, and the refinement process. Artists are required to repeat the process of coloring each instance one by one, which is inaccurate and inefficient. Meanwhile, current generative methods fail to solve this task due to the challenge of multi-instance pair data collection. To tackle these challenges, we incorporate three technical designs to ensure precise character detail transcription and achieve multi-instance sketch colorization in a single forward pass. Specifically, we first propose a self-play training strategy to address the lack of training data. Then we introduce an instance guider to provide the color of each instance. To achieve accurate color matching, we present fine-grained color matching with an edge loss to enhance visual quality. Equipped with the proposed modules, Follow-Your-Color enables automatically transforming sketches into vividly-colored images with accurate consistency and multi-instance control. Experiments on our collected datasets show that our model outperforms existing methods regarding chromatic precision. Notably, our model fully automates the colorization process with zero manual adjustments, so novice users can produce stylistically consistent artwork by providing reference instances and the original line art. Our code and additional details are available at https://yinhan-zhang.github.io/color.
中文: Follow-Your-Color 是一种基于扩散的框架,通过自博弈训练和实例引导实现多实例线稿的自动上色,确保色彩精确匹配且无需手动调整。
English: Follow-Your-Color is a diffusion-based framework that automates multi-instance sketch colorization in a single pass, using self-play training and instance guidance to ensure precise color matching and eliminate manual adjustments.

Authors:Ibtissam Saadi, Abdenour Hadid, Douglas W. Cunningham, Abdelmalik Taleb-Ahmed, Yassin El Hillali
Title: PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition
Abstract:
Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Additionally, existing methods struggle with ineffective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER, significantly reducing trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, enhancing semantic alignment between modalities and enabling efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP.
中文:提出的PE-CLIP框架通过参数高效的适配器和多模态提示,解决了视觉语言模型在动态表情识别中的效率问题,以更少计算资源实现了优越性能。
English: The proposed PE-CLIP framework addresses inefficiencies in adapting Vision-Language Models for Dynamic Facial Expression Recognition by introducing parameter-efficient adapters and multi-modal prompts, achieving competitive accuracy with reduced computational resources.
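A small sketch of what a GRU-based temporal adapter with dynamic scaling could look like, loosely following the TDA description; the bottleneck width, gating form, and residual placement are assumptions rather than the released PE-CLIP code.

```python
import torch
import torch.nn as nn

class TemporalDynamicAdapter(nn.Module):
    """GRU adapter with a learned, input-dependent scaling gate over frame features."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.down = nn.Linear(dim, hidden)          # bottleneck down-projection
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.up = nn.Linear(hidden, dim)            # back to the encoder width
        self.gate = nn.Linear(dim, 1)               # dynamic per-frame scaling

    def forward(self, x):                           # x: (B, T, dim) frame features
        h, _ = self.gru(self.down(x))               # model temporal dependencies
        scale = torch.sigmoid(self.gate(x))         # (B, T, 1), emphasizes informative frames
        return x + scale * self.up(h)               # residual adapter output

tda = TemporalDynamicAdapter(dim=512)
print(tda(torch.randn(2, 16, 512)).shape)           # torch.Size([2, 16, 512])
```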

Authors:Yingying Fan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Yingying Li, Haocheng Feng, Errui Ding, Yu Wu, Jingdong Wang
Title: Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
Abstract:
Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To tackle these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout adjustment strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed framework significantly outperforms existing methods. Project page: https://fyycs.github.io/Re-HOLD.
中文:提出的Re-HOLD框架通过自适应布局表征和记忆增强纹理模块,有效解决了人-物交互视频生成的难题,在性能上显著超越现有方法。
English: The proposed Re-HOLD framework addresses the underexplored challenge of generating realistic human-object interaction videos by employing adaptive layout representations and memory-enhanced texture modules, significantly outperforming existing methods.

Authors:Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun
Title: TEMPLE:Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment
Abstract:
Video Large Language Models (Video LLMs) have achieved significant success by leveraging a two-stage paradigm: pretraining on large-scale video-text data for vision-language alignment, followed by supervised fine-tuning (SFT) for task-specific capabilities. However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and reliance on the next-token prediction paradigm during training. To address these limitations, we propose TEMPLE (TEMporal Preference Learning), a systematic framework that enhances Video LLMs' temporal reasoning capabilities through Direct Preference Optimization (DPO). To facilitate this, we introduce an automated preference data generation pipeline that systematically constructs preference pairs by selecting videos that are rich in temporal information, designing video-specific perturbation strategies, and finally evaluating model responses on clean and perturbed video inputs. Our temporal alignment features two key innovations: curriculum learning, which progressively increases perturbation difficulty to improve model robustness and adaptability; and "Pre-SFT Alignment", applying preference optimization before instruction tuning to prioritize fine-grained temporal comprehension. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. We further analyze the transferability of DPO data across architectures and the role of difficulty scheduling in optimization. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs. Code is available at https://github.com/lscpku/TEMPLE.
Chinese: 提出的TEMPLE框架通过自动生成偏好数据和基于课程学习的对齐方法,有效增强了视频大语言模型的时序推理能力,仅用少量数据即可在多个基准测试中显著提升性能。
English: The proposed TEMPLE framework enhances Video LLMs' temporal reasoning through automated preference data generation and curriculum-based alignment, significantly improving performance across benchmarks with minimal data.
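The preference-optimization step relies on the standard DPO objective over (chosen, rejected) response pairs; the sketch below shows that generic formulation, independent of TEMPLE's data pipeline and scheduling.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective. All inputs are per-sample sequence log-probabilities, shape (B,),
    from the policy model and a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref for preferred answers
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref for dispreferred answers
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
               torch.tensor([-5.5]), torch.tensor([-6.5])).item())
```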

Authors:Joo Chan Lee, Jong Hwan Ko, Eunbyung Park
Title: Optimized Minimal 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for real-time, high-performance rendering, enabling a wide range of applications. However, representing 3D scenes with numerous explicit Gaussian primitives imposes significant storage and memory overhead. Recent studies have shown that high-quality rendering can be achieved with a substantially reduced number of Gaussians when represented with high-precision attributes. Nevertheless, existing 3DGS compression methods still rely on a relatively large number of Gaussians, focusing primarily on attribute compression. This is because a smaller set of Gaussians becomes increasingly sensitive to lossy attribute compression, leading to severe quality degradation. Since the number of Gaussians is directly tied to computational costs, it is essential to reduce the number of Gaussians effectively rather than only optimizing storage. In this paper, we propose the Optimized Minimal Gaussians (OMG) representation, which significantly reduces storage while using a minimal number of primitives. First, we identify distinct Gaussians among nearby ones, minimizing redundancy without sacrificing quality. Second, we propose a compact and precise attribute representation that efficiently captures both continuity and irregularity among primitives. Additionally, we propose a sub-vector quantization technique for improved irregularity representation, maintaining fast training with a negligible codebook size. Extensive experiments demonstrate that OMG reduces storage requirements by nearly 50% compared to the previous state-of-the-art and enables 600+ FPS rendering while maintaining high rendering quality. Our source code is available at https://maincold2.github.io/omg/.

Authors:Dongseob Kim, Hyunjung Shim
Title: Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
Abstract:
Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code is available at https://github.com/k0u-id/CCD.
Chinese: 本研究提出了一种名为分类器引导的CLIP蒸馏(CCD)的新方法,通过利用类激活映射引导的多个局部视图并消除CLIP预测偏差,在无需额外标注的情况下显著提升了多标签图像分类性能,在多个数据集上验证了其优越性。
English: This study introduces Classifier-guided CLIP Distillation (CCD), a novel method that enhances multi-label image classification by leveraging multiple local views guided by Class Activation Mapping and debiasing CLIP predictions, achieving superior performance across diverse datasets without requiring additional labels.

Authors:Hou In Derek Pun, Hou In Ivan Tam, Austin T. Wang, Xiaoliang Huo, Angel X. Chang, Manolis Savva
Title: HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation
Abstract:
Despite advances in indoor 3D scene layout generation, synthesizing scenes with dense object arrangements remains challenging. Existing methods focus on large furniture while neglecting smaller objects, resulting in unrealistically empty scenes. Those that place small objects typically do not honor arrangement specifications, resulting in largely random placement not following the text description. We present Hierarchical Scene Motifs (HSM): a hierarchical framework for indoor scene generation with dense object arrangements across spatial scales. Indoor scenes are inherently hierarchical, with surfaces supporting objects at different scales, from large furniture on floors to smaller objects on tables and shelves. HSM embraces this hierarchy and exploits recurring cross-scale spatial patterns to generate complex and realistic scenes in a unified manner. Our experiments show that HSM outperforms existing methods by generating scenes that better conform to user input across room types and spatial configurations. Project website is available at https://3dlg-hcvc.github.io/hsm .

Authors:Xiyue Guo, Jiarui Hu, Junjie Hu, Hujun Bao, Guofeng Zhang
Title: SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion
Abstract:
Recently, camera-based solutions have been extensively explored for scene semantic completion (SSC). Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. Specifically, we propose a dual-branch architecture that encodes orthogonal satellite and ground views in parallel, unifying them into a common domain. Additionally, we design a ground-view guidance strategy that corrects satellite image biases during feature encoding, addressing misalignment between satellite and ground views. Moreover, we develop an adaptive weighting strategy that balances contributions from satellite and ground views. Experiments demonstrate that SGFormer outperforms the state of the art on SemanticKITTI and SSCBench-KITTI-360 datasets. Our code is available on https://github.com/gxytcrc/SGFormer.
中文:本文提出首个卫星-地面协同的场景语义补全框架SGFormer,通过双分支架构和自适应策略整合卫星与地面视角,在基准数据集上实现了最优性能。
English: This paper introduces SGFormer, the first satellite-ground cooperative framework for scene semantic completion, which utilizes dual-branch architecture and adaptive strategies to unify satellite and ground views, achieving state-of-the-art performance on benchmark datasets.

Authors:Xuan Shen, Weize Ma, Jing Liu, Changdi Yang, Rui Ding, Quanyi Wang, Henghui Ding, Wei Niu, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
Title: QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge
Abstract:
Monocular Depth Estimation (MDE) has emerged as a pivotal task in computer vision, supporting numerous real-world applications. However, deploying accurate depth estimation models on resource-limited edge devices, especially Application-Specific Integrated Circuits (ASICs), is challenging due to the high computational and memory demands. Recent advancements in foundational depth estimation deliver impressive results but further amplify the difficulty of deployment on ASICs. To address this, we propose QuartDepth, which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. To mitigate the performance degradation, we introduce an activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. Furthermore, we design a flexible and programmable hardware accelerator that supports kernel fusion and customized instruction programmability, enhancing throughput and efficiency. Experimental results demonstrate that our framework achieves competitive accuracy while enabling fast inference and higher energy efficiency on ASICs, bridging the gap between high-performance depth estimation and practical edge-device applicability. Code: https://github.com/shawnricecake/quart-depth
中文摘要:QuartDepth框架通过后训练量化将单目深度估计模型压缩至4位精度,结合激活优化和权重重建技术,在保持精度的同时显著提升ASIC芯片上的推理速度和能效,弥合了高性能深度估计与边缘设备实际应用之间的差距。
English Summary: QuartDepth is a framework that employs post-training quantization to reduce Monocular Depth Estimation models to 4-bit precision for efficient deployment on ASICs, incorporating activation polishing and weight reconstruction techniques to maintain accuracy while enhancing computational speed and energy efficiency.
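As a point of reference for the 4-bit setting, the sketch below shows generic per-channel symmetric post-training weight quantization; QuartDepth's activation polishing, compensation, and weight reconstruction are not reproduced here.

```python
import torch

def quantize_weight_4bit(w):
    """Per-channel symmetric uniform quantization to 4 bits. w: (out_channels, ...) weights.
    Returns dequantized weights and integer codes in [-8, 7]."""
    flat = w.reshape(w.shape[0], -1)
    scale = flat.abs().max(dim=1).values / 7.0            # one scale per output channel
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(flat / scale[:, None]), -8, 7)
    deq = (q * scale[:, None]).reshape(w.shape)
    return deq, q.to(torch.int8).reshape(w.shape)

w = torch.randn(16, 3, 3, 3)
w_hat, codes = quantize_weight_4bit(w)
print((w - w_hat).abs().mean().item(), codes.unique().numel())
```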

Authors:Jinlong Li, Cristiano Saltori, Fabio Poiesi, Nicu Sebe
Title: Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Abstract:
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. The code will be available at: https://github.com/TyroneLi/CUA_O3D.
中文: 现有开放词汇3D场景理解方法因依赖单一视觉语言模型而受限,本文提出的CUA-O3D模型通过集成多基础模型并引入不确定性感知特征蒸馏技术,有效融合语义先验与几何知识,在场景分割和跨域对齐任务中实现了突破性性能。
English: Recent methods for open-vocabulary 3D scene understanding have been limited by relying on a single vision-language model, but our proposed CUA-O3D model integrates multiple foundation models and introduces uncertainty-aware feature distillation to enhance semantic and geometric capabilities, achieving superior performance in segmentation and cross-domain alignment.
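A hedged sketch of uncertainty-weighted multi-teacher distillation in the spirit described above, using a standard heteroscedastic weighting with learnable per-teacher log-variances; the paper's deterministic uncertainty estimator may differ, and the shared feature dimension is an assumption.

```python
import torch

def uncertainty_weighted_distillation(student_feats, teacher_feats, log_vars):
    """student_feats: (N, D); teacher_feats: list of (N, D) features from K 2D teachers;
    log_vars: (K,) learnable per-teacher log-variances. Noisier teachers get down-weighted."""
    loss = 0.0
    for k, t in enumerate(teacher_feats):
        residual = ((student_feats - t) ** 2).mean()
        loss = loss + torch.exp(-log_vars[k]) * residual + log_vars[k]
    return loss

s = torch.randn(100, 32)
teachers = [s + 0.1 * torch.randn(100, 32), s + 0.5 * torch.randn(100, 32)]
print(uncertainty_weighted_distillation(s, teachers, torch.zeros(2, requires_grad=True)).item())
```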

Authors:Martin Kostelník, Karel Beneš, Michal Hradiš
Title: TextBite: A Historical Czech Document Dataset for Logical Page Segmentation
Abstract:
Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.
中文: 本研究提出了一种纯图像域的逻辑页面分割方法,无需依赖OCR,并发布了包含历史捷克文献的TextBite数据集、基线方法及评估框架。
English: This study introduces a purely image-based approach to logical page segmentation, avoiding OCR dependency, and presents the TextBite dataset of historical Czech documents with baseline methods and an evaluation framework.
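A small illustration of evaluating segmentation agreement on foreground text pixels only, as proposed above; the matching rule and the IoU aggregation are simplifications, not necessarily the official TextBite protocol.

```python
import numpy as np

def foreground_iou(pred_labels, gt_labels, foreground):
    """pred_labels/gt_labels: (H, W) integer segment ids, foreground: (H, W) boolean mask of
    text pixels. Background pixels are ignored; returns mean IoU over ground-truth segments."""
    ious = []
    for seg in np.unique(gt_labels[foreground]):
        gt_seg = (gt_labels == seg) & foreground
        # match the prediction segment with the largest overlap on foreground pixels
        overlap_ids, counts = np.unique(pred_labels[gt_seg], return_counts=True)
        best = overlap_ids[counts.argmax()]
        pred_seg = (pred_labels == best) & foreground
        inter = (gt_seg & pred_seg).sum()
        union = (gt_seg | pred_seg).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

gt = np.zeros((8, 8), dtype=int)
gt[4:] = 1
print(foreground_iou(gt, gt, np.ones((8, 8), dtype=bool)))  # perfect prediction gives 1.0
```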

Authors:Hanxiao Wang, Biao Zhang, Weize Quan, Dong-Ming Yan, Peter Wonka
Title: iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation
Abstract:
This paper proposes iFlame, a novel transformer-based network architecture for mesh generation. While attention-based models have demonstrated remarkable performance in mesh generation, their quadratic computational complexity limits scalability, particularly for high-resolution 3D data. Conversely, linear attention mechanisms offer lower computational costs but often struggle to capture long-range dependencies, resulting in suboptimal outcomes. To address this trade-off, we propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms. To further enhance efficiency and leverage the inherent structure of mesh representations, we integrate this interleaving approach into an hourglass architecture, which significantly boosts efficiency. Our approach reduces training time while achieving performance comparable to pure attention-based models. To improve inference efficiency, we implement a caching algorithm that almost doubles the speed and reduces the KV cache size by seven-eighths compared to the original Transformer. We evaluate our framework on ShapeNet and Objaverse, demonstrating its ability to generate high-quality 3D meshes efficiently. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance, making it a practical solution for mesh generation. Training takes only 2 days with 4 GPUs on 39k Objaverse samples with a maximum of 4k faces.

Authors:Dana Cohen-Bar, Daniel Cohen-Or, Gal Chechik, Yoni Kasten
Title: TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features
Abstract:
As 3D content creation continues to grow, transferring semantic textures between 3D meshes remains a significant challenge in computer graphics. While recent methods leverage text-to-image diffusion models for texturing, they often struggle to preserve the appearance of the source texture during texture transfer. We present TriTex, a novel approach that learns a volumetric texture field from a single textured mesh by mapping semantic features to surface colors. Using an efficient triplane-based architecture, our method enables semantic-aware texture transfer to a novel target mesh. Despite training on just one example, it generalizes effectively to diverse shapes within the same category. Extensive evaluation on our newly created benchmark dataset shows that TriTex achieves superior texture transfer quality and fast inference times compared to existing methods. Our approach advances single-example texture transfer, providing a practical solution for maintaining visual coherence across related 3D models in applications like game development and simulation.

Authors:Moshiur Rahman Tonmoy, Md. Mithun Hossain, Nilanjan Dey, M. F. Mridha
Title: MobilePlantViT: A Mobile-friendly Hybrid ViT for Generalized Plant Disease Image Classification
Abstract:
Plant diseases significantly threaten global food security by reducing crop yields and undermining agricultural sustainability. AI-driven automated classification has emerged as a promising solution, with deep learning models demonstrating impressive performance in plant disease identification. However, deploying these models on mobile and edge devices remains challenging due to high computational demands and resource constraints, highlighting the need for lightweight, accurate solutions for accessible smart agriculture systems. To address this, we propose MobilePlantViT, a novel hybrid Vision Transformer (ViT) architecture designed for generalized plant disease classification, which optimizes resource efficiency while maintaining high performance. Extensive experiments across diverse plant disease datasets of varying scales show our model's effectiveness and strong generalizability, achieving test accuracies ranging from 80% to over 99%. Notably, with only 0.69 million parameters, our architecture outperforms the smallest versions of MobileViTv1 and MobileViTv2, despite their higher parameter counts. These results underscore the potential of our approach for real-world, AI-powered automated plant disease classification in sustainable and resource-efficient smart agriculture systems. All codes will be available in the GitHub repository: https://github.com/moshiurtonmoy/MobilePlantViT
中文摘要:提出的MobilePlantViT模型通过轻量级视觉变换器架构解决了植物病害分类中的计算难题,仅用69万参数即可实现高达99%的准确率,为可持续智慧农业提供了高效解决方案。
English Summary: The proposed MobilePlantViT model addresses computational challenges in plant disease classification by combining a lightweight Vision Transformer architecture with high accuracy, achieving up to 99% performance using only 0.69 million parameters for sustainable smart agriculture.

Authors:Katja Schwarz, Denys Rozumnyi, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder
Title: A Recipe for Generating 3D Worlds From a Single Image
Abstract:
We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics. Project Page: https://katjaschwarz.github.io/worlds/
中文: 该方法通过将单张图像生成沉浸式3D场景构建为2D修复模型的情境学习任务,仅需少量训练即可超越现有视频合成方法,在多项图像质量指标上表现优异。
English: This method generates immersive 3D worlds from a single image using in-context learning with 2D inpainting models, requiring minimal training and outperforming existing video-based techniques in quality metrics.

Authors:Felix Chen, Hangjie Yuan, Yunqiu Xu, Tao Feng, Jun Cen, Pengwei Liu, Zeying Huang, Yi Yang
Title: MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
Abstract:
Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by typical human problem-solving processes, we hypothesize that the perception capability to extract meaningful information from diagrams is crucial, as it directly impacts subsequent inference processes. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and relevant properties from diagrams and performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages, thereby optimizing each independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.
中文: 多模态大语言模型在视觉数学解题中因图表感知能力不足而表现不佳,为此提出的MathFlow模块化流程将感知与推理分离,有效提升了各类推理模型的性能。
English: Multimodal Large Language Models struggle with visual mathematical problem-solving due to poor diagram perception, prompting the development of MathFlow, a modular pipeline that separates perception and inference stages to significantly enhance performance.

Authors:Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu
Title: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
Abstract:
Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
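A minimal sketch of dimension-wise post-training quantization of continuous latents: each feature dimension is discretized independently into uniform bins after tokenizer training. The uniform binning is an assumption; the paper's binning scheme may differ.

```python
import numpy as np

def dimensionwise_quantize(z, num_bins=16):
    """z: (N, D) continuous tokens. Returns integer codes per dimension and a reconstruction
    from the bin centers."""
    lo = z.min(axis=0, keepdims=True)
    hi = z.max(axis=0, keepdims=True)
    step = (hi - lo) / num_bins
    step = np.where(step == 0, 1.0, step)                              # guard constant dims
    codes = np.clip(((z - lo) / step).astype(int), 0, num_bins - 1)    # (N, D) integer tokens
    recon = lo + (codes + 0.5) * step                                  # bin centers
    return codes, recon

z = np.random.randn(1000, 8)
codes, recon = dimensionwise_quantize(z)
print(codes.shape, np.abs(z - recon).mean())
```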

Authors:Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han
Title: XAttention: Block Sparse Attention with Antidiagonal Scoring
Abstract:
Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
中文: XAttention是一种即插即用框架,通过使用注意力矩阵中反对角线值之和作为块重要性的代理,显著加速长上下文Transformer推理,在保持接近全注意力精度的同时实现高达13.5倍的计算加速。
English: XAttention is a plug-and-play framework that accelerates long-context Transformer inference by using the sum of antidiagonal values in attention matrices as a proxy for block importance, achieving near-full accuracy with up to 13.5x computational speedup.
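A toy illustration of antidiagonal scoring for block selection: each tile of the attention score matrix is scored by summing entries on strided antidiagonals, and only the highest-scoring key blocks per query block are kept. Block size, stride, and the selection rule are assumptions, not the released XAttention kernels.

```python
import numpy as np

def antidiagonal_block_scores(scores, block=8, stride=4):
    """scores: (L, L) pre-softmax attention scores with L divisible by block. Each tile is
    scored by the sum of its entries lying on strided antidiagonals."""
    L = scores.shape[0]
    n = L // block
    tile_scores = np.zeros((n, n))
    i, j = np.indices((block, block))
    antidiag_mask = ((i + j) % stride) == (block - 1) % stride   # strided antidiagonals
    for bi in range(n):
        for bj in range(n):
            tile = scores[bi*block:(bi+1)*block, bj*block:(bj+1)*block]
            tile_scores[bi, bj] = tile[antidiag_mask].sum()
    return tile_scores

def select_blocks(tile_scores, keep=2):
    """Keep the top-'keep' key blocks per query-block row; returns a boolean block mask."""
    order = np.argsort(-tile_scores, axis=1)[:, :keep]
    mask = np.zeros_like(tile_scores, dtype=bool)
    np.put_along_axis(mask, order, True, axis=1)
    return mask

S = np.random.rand(32, 32)
print(select_blocks(antidiagonal_block_scores(S)).sum(axis=1))  # 2 kept key blocks per row
```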

Authors:Zigang Geng, Mengde Xu, Han Hu, Shuyang Gu
Title: Tokenize Image as a Set
Abstract:
This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation to dynamically allocate coding capacity based on regional semantic complexity. This TokenSet enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Further, we propose Fixed-Sum Discrete Diffusion--the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance--enabling effective set distribution modeling. Experiments demonstrate our method's superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at https://github.com/Gengzigang/TokenSet.
中文摘要:本文提出TokenSet这一创新图像生成方法,通过无序令牌集合和带固定和约束的双重转换机制,动态分配编码容量并提升生成质量。
English Summary: This paper introduces TokenSet, a novel image generation method using unordered token sets and a dual transformation mechanism with Fixed-Sum Discrete Diffusion to dynamically allocate coding capacity and enhance generation quality.
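One plausible instance of a set-to-sequence transformation with a summation constraint is a vocabulary-sized count vector, shown below; the bijection used in the paper may differ, so this is only meant to make the constraint concrete.

```python
import numpy as np

def set_to_counts(token_ids, vocab_size):
    """Map an unordered multiset of token ids to a fixed-length count vector whose entries
    sum to the set size."""
    return np.bincount(np.asarray(token_ids), minlength=vocab_size)

def counts_to_set(counts):
    """Inverse transformation: expand the counts back into an (unordered) multiset."""
    return np.repeat(np.arange(len(counts)), counts)

ids = [3, 7, 7, 1, 3, 3]
c = set_to_counts(ids, vocab_size=10)
print(c, c.sum(), sorted(counts_to_set(c)))        # the sum equals the number of tokens
```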

Authors:Xi Liu, Chaoyi Zhou, Nanxuan Zhao, Siyu Huang
Title: Bézier Splatting for Fast and Differentiable Vector Graphics Rendering
Abstract:
Differentiable vector graphics (VGs) are widely used in image vectorization and vector synthesis, while existing representations are costly to optimize and struggle to achieve high-quality rendering results for high-resolution images. This work introduces a new differentiable VG representation, dubbed Bézier Splatting, that enables fast yet high-fidelity VG rasterization. Bézier Splatting samples 2D Gaussians along Bézier curves, which naturally provide positional gradients at object boundaries. Thanks to the efficient splatting-based differentiable rasterizer, Bézier Splatting achieves 30x and 150x speedups for the forward and backward rasterization steps, respectively, on open curves compared to DiffVG. Additionally, we introduce an adaptive pruning and densification strategy that dynamically adjusts the spatial distribution of curves to escape local minima, further improving VG quality. Furthermore, our new VG representation supports conversion to the standard XML-based SVG format, enhancing interoperability with existing VG tools and pipelines. Experimental results show that Bézier Splatting significantly outperforms existing methods with better visual fidelity and significant optimization speedup.
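The curve sampling at the heart of the representation is standard cubic Bézier evaluation; the sketch below samples points along a curve that would seed the 2D Gaussians, with the splatting-based rasterization itself omitted.

```python
import numpy as np

def cubic_bezier(control_points, num_samples=32):
    """Sample points uniformly in parameter t along a cubic Bezier curve.
    control_points: (4, 2) array of P0..P3; returns (num_samples, 2) positions."""
    p0, p1, p2, p3 = control_points
    t = np.linspace(0.0, 1.0, num_samples)[:, None]          # (num_samples, 1)
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

pts = cubic_bezier(np.array([[0.0, 0.0], [0.3, 1.0], [0.7, -1.0], [1.0, 0.0]]))
print(pts.shape, pts[0], pts[-1])                             # endpoints match P0 and P3
```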

Authors:Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aizan Zafar, Aritra Dutta, Mubarak Shah
Title: GAEA: A Geolocation Aware Conversational Assistant
Abstract:
Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, these issues remain unaddressed; while LMMs handle general tasks well, they struggle with more specialized downstream tasks such as geolocalization. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%. Our dataset, model, and code are available.

Authors:Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
Title: MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Abstract:
Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at https://quanhaol.github.io/magicmotion-site.
中文摘要:MagicMotion是一种新型图像转视频框架,通过三种轨迹格式实现精确的多目标运动控制并保持视觉质量,同时配备了专用数据集和评估基准。
English Summary: MagicMotion is a novel image-to-video framework that enables precise multi-object motion control through three trajectory formats while maintaining visual quality, supported by a dedicated dataset and benchmark.

Authors:Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu
Title: InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Abstract:
Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
中文摘要:InfiniteYou(InfU)框架通过引入InfuseNet进行身份特征注入和多阶段训练策略,解决了扩散变换器中身份保持与生成质量的难题,实现了顶尖性能并具备即插即用的兼容性。
English Summary: The InfiniteYou (InfU) framework addresses identity preservation and quality issues in Diffusion Transformers by introducing InfuseNet for feature injection and a multi-stage training strategy, achieving state-of-the-art performance with plug-and-play compatibility.

Authors:Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
Title: M3: 3D-Spatial MultiModal Memory
Abstract:
We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.
中文摘要:本文提出M3多模态记忆系统,通过结合3D高斯泼溅技术与基础模型构建可跨粒度渲染特征的多模态记忆,并针对特征存储计算瓶颈和特征对齐问题提出核心场景组件与高斯记忆注意力机制,在定量评估与机器人部署中验证了其有效性。
English Summary: The paper introduces M3, a multimodal memory system that uses 3D Gaussian Splatting and foundation models to create detailed scene representations while overcoming computational and feature alignment challenges through novel components like principal scene components and Gaussian memory attention.
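To make the compression idea concrete, the sketch below shows one way per-Gaussian low-dimensional codes could attend over a small shared bank of principal scene components to reconstruct high-dimensional foundation features, instead of storing a full feature per Gaussian. All names and sizes are illustrative assumptions, not the released M3 implementation.

import torch
import torch.nn.functional as F

# Toy dimensions (assumptions): 10k Gaussians, 64-dim per-Gaussian codes,
# 128 principal components of 768-dim foundation-model features.
num_gaussians, code_dim, num_pc, feat_dim = 10_000, 64, 128, 768

gaussian_codes = torch.randn(num_gaussians, code_dim)  # stored per Gaussian (cheap)
pc_keys = torch.randn(num_pc, code_dim)                # keys for the component bank
pc_values = torch.randn(num_pc, feat_dim)              # principal scene components (shared)

# Attention from per-Gaussian codes to the shared bank reconstructs high-dimensional
# features without storing feat_dim floats for every Gaussian primitive.
attn = F.softmax(gaussian_codes @ pc_keys.T / code_dim ** 0.5, dim=-1)  # (N, num_pc)
rendered_features = attn @ pc_values                                    # (N, feat_dim)
print(rendered_features.shape)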

Authors:Ananta R. Bhattarai, Xingzhe He, Alla Sheffer, Helge Rhodin
Title: DreamTexture: Shape from Virtual Texture with Analysis by Augmentation
Abstract:
DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which alleviates memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues -- a novel monocular reconstruction paradigm that we call Analysis by Augmentation.

Authors:Yiran Qin, Li Kang, Xiufeng Song, Zhenfei Yin, Xiaohong Liu, Xihui Liu, Ruimao Zhang, Lei Bai
Title: RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
Abstract:
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
中文: 本文提出组合约束和专用接口,为具身多智能体系统实现自动化数据采集,创建了RoboFactory基准,并通过模仿学习方法评估以提升系统的安全性和效率。
English: This paper introduces compositional constraints and specialized interfaces to automate data collection for embodied multi-agent systems, establishing the RoboFactory benchmark and evaluating imitation learning methods to enhance system safety and efficiency.

Authors:SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, Dong-Jin Kim
Title: VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
Abstract:
Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.
中文摘要:VerbDiff是一种新型文生图模型,通过解耦交互动词与高频锚点词并利用局部交互区域增强语义理解,无需额外条件即可生成更精准的人物与物体交互图像,在HICO-DET数据集上的实验验证了其有效性。
English Summary: VerbDiff is a novel text-to-image model that enhances interaction depiction by disentangling interaction verbs from common anchors and using localized regions to improve semantic understanding without additional constraints, producing more accurate human-object interaction images as validated on HICO-DET.

Authors:Chen Chen, Zhirui Wang, Taowei Sheng, Yi Jiang, Yundu Li, Peirui Cheng, Luning Zhang, Kaiqiang Chen, Yanfeng Hu, Xue Yang, Xian Sun
Title: SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World
Abstract:
Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perception, such as occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset are available at https://github.com/chenchen235/SA-Occ.
Chinese: SA-Occ首次提出卫星辅助的3D占据预测模型,通过动态解耦融合等创新技术将历史卫星图像与实时街景相结合,以39.05% mIoU的最优性能实现感知突破,且仅增加6.93毫秒延迟。
English: SA-Occ introduces the first satellite-assisted 3D occupancy prediction model that integrates historical satellite imagery with real-time street views through innovative fusion techniques, achieving state-of-the-art performance with a 39.05% mIoU and minimal latency.

Authors:Han-Hung Lee, Qinghong Han, Angel X. Chang
Title: NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
Abstract:
In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.

Authors:Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
Title: JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
Abstract:
Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at https://craftjarvis.github.io/JarvisVLA.
Chinese Summary: 本研究提出了一种自监督的后训练方法,通过增强视觉语言模型在开放世界环境中的世界知识、视觉识别和空间定位能力,在《我的世界》中实现了40%的性能提升和最优任务表现。
English Summary: The study introduces a self-supervised post-training method that enhances Visual Language Models for improved decision-making in open-world environments, achieving a 40% performance boost and state-of-the-art results in Minecraft tasks.

Authors:Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng
Title: CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners
Abstract:
Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits (the neural pathways LLMs use for knowledge-based inference), we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By only leveraging a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data at https://github.com/zjunlp/CaKE.
中文摘要:CaKE是一种新颖的知识编辑方法,通过基于推理电路的分析,有效提升大语言模型对新知识的整合能力,在多跳推理任务中实现更高的准确性和一致性。
English Summary: CaKE is a novel knowledge editing method that improves the integration of updated knowledge into large language models by leveraging circuit-based analysis, resulting in enhanced multi-hop reasoning accuracy and efficiency.

Authors:Ruonan Yu, Songhua Liu, Zhenxiong Tan, Xinchao Wang
Title: Ultra-Resolution Adaptation with Ease
Abstract:
Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed URAE. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Codes are available at https://github.com/Huage001/URAE.
中文: URAE框架通过利用合成数据加速训练和优化参数调整,显著提升了高分辨率图像生成效果,仅用少量数据和计算资源就实现了业界领先的2K和4K生成性能。
English: The URAE framework enhances high-resolution image generation by using synthetic data for faster training and optimized parameter tuning, achieving state-of-the-art 2K and 4K results with minimal data and computational resources.

Authors:Vivek Gopalakrishnan, Neel Dey, David-Dimitris Chlorogiannis, Andrew Abumoussa, Anna M. Larson, Darren B. Orbach, Sarah Frisken, Polina Golland
Title: Rapid patient-specific neural networks for intraoperative X-ray to volume registration
Abstract:
The integration of artificial intelligence in image-guided interventions holds transformative potential, promising to extract 3D geometric and quantitative information from conventional 2D imaging modalities during complex procedures. Achieving this requires the rapid and precise alignment of 2D intraoperative images (e.g., X-ray) with 3D preoperative volumes (e.g., CT, MRI). However, current 2D/3D registration methods fail across the broad spectrum of procedures dependent on X-ray guidance: traditional optimization techniques require custom parameter tuning for each subject, whereas neural networks trained on small datasets do not generalize to new patients or require labor-intensive manual annotations, increasing clinical burden and precluding application to new anatomical targets. To address these challenges, we present xvr, a fully automated framework for training patient-specific neural networks for 2D/3D registration. xvr uses physics-based simulation to generate abundant high-quality training data from a patient's own preoperative volumetric imaging, thereby overcoming the inherently limited ability of supervised models to generalize to new patients and procedures. Furthermore, xvr requires only 5 minutes of training per patient, making it suitable for emergency interventions as well as planned procedures. We perform the largest evaluation of a 2D/3D registration algorithm on real X-ray data to date and find that xvr robustly generalizes across a diverse dataset comprising multiple anatomical structures, imaging modalities, and hospitals. Across surgical tasks, xvr achieves submillimeter-accurate registration at intraoperative speeds, improving upon existing methods by an order of magnitude. xvr is released as open-source software freely available at https://github.com/eigenvivek/xvr.
中文: xvr框架通过基于物理的模拟训练患者特定的神经网络,实现了快速精确的2D/3D医学图像配准,无需人工标注即可在不同手术中达到亚毫米级精度。
English: The xvr framework enables rapid and precise 2D/3D medical image registration by training patient-specific neural networks through physics-based simulation, achieving submillimeter accuracy across diverse procedures without manual annotations.

Authors:Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Fuyun Wang, Huiwen Shi, Xianghui Yang, Qingxiang Lin, Jingwei Huang, Yuhong Liu, Jie Jiang, Chunchao Guo, Xiangyu Yue
Title: Unleashing Vecset Diffusion Model for Fast Shape Generation
Abstract:
3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges exist because of difficulties not only in accelerating diffusion sampling but also VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps and comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of shape surface in the volume, our decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Through systematic evaluation, we show that our model significantly outperforms existing fast 3D generation methods, achieving comparable performance to the state-of-the-art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at https://github.com/Tencent/FlashVDM.
Chinese: FlashVDM通过优化扩散采样和VAE解码来加速3D形状生成,采用渐进流蒸馏和高效向量集解码器等技术,在保持相当质量的同时实现高达45倍的推理速度提升。
English: FlashVDM accelerates 3D shape generation by optimizing both diffusion sampling and VAE decoding, achieving comparable quality with up to 45x faster inference through techniques like Progressive Flow Distillation and an efficient vecset decoder.

Authors:Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Junlin Han, Ender Konukoglu, Serge Belongie
Title: Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model
Abstract:
Generalized few-shot 3D point cloud segmentation (GFS-PCS) adapts models to new classes with few support samples while retaining base class segmentation. Existing GFS-PCS methods enhance prototypes via interacting with support or query features but remain limited by sparse knowledge from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs), generalizing across open-world novel classes, contain rich but noisy novel class knowledge. In this work, we introduce a GFS-PCS framework that synergizes dense but noisy pseudo-labels from 3D VLMs with precise yet sparse few-shot samples to maximize the strengths of both, named GFS-VL. Specifically, we present a prototype-guided pseudo-label selection to filter low-quality regions, followed by an adaptive infilling strategy that combines knowledge from pseudo-label contexts and few-shot samples to adaptively label the filtered, unlabeled areas. Additionally, we design a novel-base mix strategy to embed few-shot samples into training scenes, preserving essential context for improved novel class learning. Moreover, recognizing the limited diversity in current GFS-PCS benchmarks, we introduce two challenging benchmarks with diverse novel classes for comprehensive generalization evaluation. Experiments validate the effectiveness of our framework across models and datasets. Our approach and benchmarks provide a solid foundation for advancing GFS-PCS in the real world. The code is at https://github.com/ZhaochongAn/GFS-VL
中文: 本文提出GFS-VL框架,通过原型引导的伪标签筛选和自适应填充策略,将3D视觉语言模型的密集伪标签与少量样本的精确信息相融合,并建立新基准推进三维点云分割在实际场景中的应用。
English: This paper introduces GFS-VL, a generalized few-shot 3D point cloud segmentation framework that synergizes dense pseudo-labels from 3D vision-language models with precise few-shot samples through prototype-guided selection and adaptive infilling, while also proposing new benchmarks to advance real-world applications.
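A minimal sketch of prototype-guided pseudo-label selection, assuming cosine similarity to few-shot class prototypes as the quality criterion and -1 as an ignore index; the threshold and shapes are placeholders, not values from the paper.

import torch
import torch.nn.functional as F

def filter_pseudo_labels(point_feats, pseudo_labels, prototypes, threshold=0.6):
    """Keep a VLM pseudo-label only if the point feature is close to that class prototype.
    point_feats: (N, D), pseudo_labels: (N,), prototypes: (C, D).
    Returns labels with rejected points set to -1 (ignore index)."""
    sims = F.cosine_similarity(
        point_feats[:, None, :], prototypes[None, :, :], dim=-1
    )                                                          # (N, C)
    sim_to_assigned = sims.gather(1, pseudo_labels[:, None]).squeeze(1)
    filtered = pseudo_labels.clone()
    filtered[sim_to_assigned < threshold] = -1
    return filtered

feats = torch.randn(1000, 32)
labels = torch.randint(0, 5, (1000,))
protos = torch.randn(5, 32)
print((filter_pseudo_labels(feats, labels, protos) == -1).float().mean())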

Authors:Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm
Title: OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
Abstract:
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
中文: 本文提出了OpenMIBOOD这一专门评估医学影像中分布外检测方法的框架,发现自然图像基准不适用于医疗领域,并通过标准化基准推动可靠人工智能系统的发展。
English: This paper introduces OpenMIBOOD, a specialized framework for evaluating out-of-distribution detection methods in medical imaging, revealing that natural image benchmarks are inadequate for healthcare applications and providing standardized benchmarks to advance reliable AI systems.
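As a concrete example of the kind of post-hoc detector such a benchmark evaluates, the snippet below computes the standard energy-based OOD score from classifier logits; whether this particular scoring rule is among the 24 evaluated methods is not stated here, so treat it purely as an illustration.

import torch

def energy_ood_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Negative free energy; higher values indicate more in-distribution-like inputs."""
    return temperature * torch.logsumexp(logits / temperature, dim=-1)

logits_id = torch.randn(8, 10) + torch.eye(10)[:8] * 5.0   # confident, peaked logits
logits_ood = torch.randn(8, 10)                            # flat, uncertain logits
print(energy_ood_score(logits_id).mean(), energy_ood_score(logits_ood).mean())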

Authors:Peihao Wu, Yongxiang Yao, Wenfei Zhang, Dong Wei, Yi Wan, Yansheng Li, Yongjun Zhang
Title: MapGlue: Multimodal Remote Sensing Image Matching
Abstract:
Multimodal remote sensing image (MRSI) matching is pivotal for cross-modal fusion, localization, and object detection, but it faces severe challenges due to geometric, radiometric, and viewpoint discrepancies across imaging modalities. Existing unimodal datasets lack scale and diversity, limiting deep learning solutions. This paper proposes MapGlue, a universal MRSI matching framework, and MapData, a large-scale multimodal dataset addressing these gaps. Our contributions are twofold. MapData, a globally diverse dataset spanning 233 sampling points, offers original images (7,000x5,000 to 20,000x15,000 pixels). After rigorous cleaning, it provides 121,781 aligned electronic map-visible image pairs (512x512 pixels) with hybrid manual-automated ground truth, addressing the scarcity of scalable multimodal benchmarks. MapGlue integrates semantic context with a dual graph-guided mechanism to extract cross-modal invariant features. This structure enables global-to-local interaction, enhancing descriptor robustness against modality-specific distortions. Extensive evaluations on MapData and five public datasets demonstrate MapGlue's superiority in matching accuracy under complex conditions, outperforming state-of-the-art methods. Notably, MapGlue generalizes effectively to unseen modalities without retraining, highlighting its adaptability. This work addresses longstanding challenges in MRSI matching by combining scalable dataset construction with a robust, semantics-driven framework. Furthermore, MapGlue shows strong generalization capabilities on other modality matching tasks for which it was not specifically trained. The dataset and code are available at https://github.com/PeihaoWu/MapGlue.
Chinese: 本文提出了MapGlue通用多模态遥感图像匹配框架,通过语义上下文和双图引导机制提取跨模态不变特征,并创建了MapData大规模多样化数据集,解决了该领域缺乏可扩展基准的难题。
English: This paper introduces MapGlue, a universal multimodal remote sensing image matching framework that integrates semantic context with a dual graph-guided mechanism for robust cross-modal feature extraction, and MapData, a large-scale, globally diverse dataset designed to address the scarcity of scalable benchmarks in the field.

Authors:Soham Roy, Abhishek Mishra, Shirish Karande, Murari Mandal
Title: Guardians of Generation: Dynamic Inference-Time Copyright Shielding with Adaptive Guidance for AI Image Generation
Abstract:
Modern text-to-image generative models can inadvertently reproduce copyrighted content memorized in their training data, raising serious concerns about potential copyright infringement. We introduce Guardians of Generation, a model agnostic inference time framework for dynamic copyright shielding in AI image generation. Our approach requires no retraining or modification of the generative model weights, instead integrating seamlessly with existing diffusion pipelines. It augments the generation process with an adaptive guidance mechanism comprising three components: a detection module, a prompt rewriting module, and a guidance adjustment module. The detection module monitors user prompts and intermediate generation steps to identify features indicative of copyrighted content before they manifest in the final output. If such content is detected, the prompt rewriting mechanism dynamically transforms the user's prompt by sanitizing or replacing references that could trigger copyrighted material while preserving the prompt's intended semantics. The adaptive guidance module adaptively steers the diffusion process away from flagged content by modulating the model's sampling trajectory. Together, these components form a robust shield that enables a tunable balance between preserving creative fidelity and ensuring copyright compliance. We validate our method on a variety of generative models such as Stable Diffusion, SDXL, and Flux, demonstrating substantial reductions in copyrighted content generation with negligible impact on output fidelity or alignment with user intent. This work provides a practical, plug-and-play safeguard for generative image models, enabling more responsible deployment under real-world copyright constraints. Source code is available at: https://respailab.github.io/gog

Authors:Tianyi Wei, Yifan Zhou, Dongdong Chen, Xingang Pan
Title: FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
Abstract:
The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, the fundamental reliance of self-attention layers on positional embedding versus query-key similarity during generation remains an intriguing question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional information versus content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.

Authors:Dong Chen, Boyue Zhao, Yi Zhang, Meng Zhao
Title: Selective Complementary Feature Fusion and Modal Feature Compression Interaction for Brain Tumor Segmentation
Abstract:
An efficient modal feature fusion strategy is key to accurate segmentation of brain glioma. However, because different MRI modalities have distinct characteristics, cross-modal fusion of features with large discrepancies is difficult, and models tend to ignore rich feature information. In addition, multi-modal feature redundancy arises in parallel networks as feature dimensions proliferate, further increasing the difficulty of multi-modal feature fusion at the lower layers. To solve these problems, we propose a novel complementary feature compression interaction network (CFCI-Net), which realizes complementary fusion and compression interaction of multi-modal feature information with an efficient modal fusion strategy. First, we propose a selective complementary feature fusion (SCFF) module, which adaptively fuses rich cross-modal feature information through complementary soft selection weights. Second, a modal feature compression interaction (MFCI) transformer is proposed to handle the multi-modal fusion redundancy that arises as the feature dimension surges. The MFCI transformer is composed of modal feature compression (MFC) and modal feature interaction (MFI) components that perform redundant-feature compression and multi-modal interactive learning. In MFI, we propose a hierarchical interactive attention mechanism based on multi-head attention. Evaluations on the BraTS2019 and BraTS2020 datasets demonstrate that CFCI-Net achieves superior results compared to state-of-the-art models. Code: https://github.com/CDmm0/CFCI-Net
中文: 提出的CFCI-Net通过选择性互补特征融合和模态特征压缩交互变换器,有效解决了脑胶质瘤分割中的跨模态融合难题,在BraTS数据集上取得了最优性能。
English: The proposed CFCI-Net with its selective complementary feature fusion and modal feature compression interaction transformer effectively addresses cross-modal fusion challenges in brain glioma segmentation, achieving state-of-the-art performance on BraTS datasets.
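A minimal sketch of complementary soft-selection fusion between two modality feature maps, assuming a per-channel sigmoid gate so the weights for the two inputs sum to one; the gating network and 3D tensor shapes are illustrative assumptions, not the SCFF module as released.

import torch
import torch.nn as nn

class SelectiveComplementaryFusion(nn.Module):
    """Fuses two modality feature maps with per-channel complementary soft weights."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, C, D, H, W) features from two MRI modalities.
        w = self.gate(torch.cat([feat_a, feat_b], dim=1))   # (B, C, 1, 1, 1) in [0, 1]
        return w * feat_a + (1.0 - w) * feat_b              # complementary weighting

fusion = SelectiveComplementaryFusion(channels=16)
a, b = torch.randn(1, 16, 8, 32, 32), torch.randn(1, 16, 8, 32, 32)
print(fusion(a, b).shape)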

Authors:Shiyang Zhou, Haijin Zeng, Yunfan Lu, Tong Shao, Ke Tang, Yongyong Chen, Jie Liu, Jingyong Su
Title: Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing
Abstract:
Quad Bayer demosaicing is the central challenge for enabling the widespread application of Hybrid Event-based Vision Sensors (HybridEVS). Although existing learning-based methods that leverage long-range dependency modeling have achieved promising results, their complexity severely limits deployment on mobile devices for real-world applications. To address these limitations, we propose a lightweight Mamba-based binary neural network designed for efficient and high-performing demosaicing of HybridEVS RAW images. First, to effectively capture both global and local dependencies, we introduce a hybrid Binarized Mamba-Transformer architecture that combines the strengths of the Mamba and Swin Transformer architectures. Next, to significantly reduce computational complexity, we propose a binarized Mamba (Bi-Mamba), which binarizes all projections while retaining the core Selective Scan in full precision. Bi-Mamba also incorporates additional global visual information to enhance global context and mitigate precision loss. We conduct quantitative and qualitative experiments to demonstrate the effectiveness of BMTNet in both performance and computational efficiency, providing a lightweight demosaicing solution suited for real-world edge devices. Our codes and models are available at https://github.com/Clausy9/BMTNet.
中文: 本文提出了一种轻量级的基于Mamba的二元神经网络BMTNet,通过结合Mamba和Swin Transformer架构的优势,在保持高性能的同时显著降低计算复杂度,为混合事件视觉传感器提供了一种适用于边缘设备的实时去马赛克解决方案。
English: This paper introduces a lightweight Mamba-based binary neural network called BMTNet, which combines Mamba and Swin Transformer architectures to efficiently perform Quad Bayer demosaicing for Hybrid Event-based Vision Sensors, achieving high performance with reduced computational complexity suitable for edge devices.
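The snippet below sketches the generic ingredient behind such binarized projections: sign-binarized weights with a straight-through gradient estimator and a scalar scale to recover magnitude. It is a standard construction offered for intuition, not the Bi-Mamba code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through (clipped) gradient estimator."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Pass gradients through only where |w| <= 1.
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

class BinaryLinear(nn.Module):
    """Linear projection with binarized weights and a per-layer scaling factor."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_bin = BinarizeSTE.apply(self.weight)
        scale = self.weight.abs().mean()          # restores magnitude lost by sign()
        return F.linear(x, w_bin * scale)

layer = BinaryLinear(64, 128)
out = layer(torch.randn(4, 64))
out.sum().backward()
print(out.shape, layer.weight.grad is not None)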

Authors:Jiyong Rao, Brian Nlong Zhao, Yu Wang
Title: Probabilistic Prompt Distribution Learning for Animal Pose Estimation
Abstract:
Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper addresses the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, e.g., CLIP, aiming to resolve the cross-species generalization problem. The core of the solution lies in prompt design, probabilistic prompt modeling, and cross-modal adaptation, enabling prompts to compensate for cross-modal information and effectively overcome large data variance under unbalanced data distributions. To this end, we propose a novel probabilistic prompting approach that fully exploits textual descriptions, alleviating the diversity issues caused by the long-tail property and increasing the adaptability of prompts to unseen category instances. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, thus representing diverse image attributes. Diverse textual probabilistic representations are sampled and used as guidance for pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impacts of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings. The code is available at https://github.com/Raojiyong/PPAP.
中文摘要:本文提出一种新颖的概率提示方法,通过优化提示设计和跨模态融合策略,利用视觉语言预训练模型解决多物种动物姿态估计中的跨物种泛化难题。
English Summary: This paper introduces a novel probabilistic prompting approach for multi-species animal pose estimation that leverages vision-language pretrained models to overcome cross-species generalization challenges through optimized prompt design and cross-modal fusion strategies.
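The diversity loss that keeps learnable prompts distinct can be illustrated with a simple pairwise cosine-similarity penalty, as sketched below; the exact form used in the paper may differ.

import torch
import torch.nn.functional as F

def prompt_diversity_loss(prompts: torch.Tensor) -> torch.Tensor:
    """Penalizes pairwise cosine similarity among learnable prompt embeddings
    so that each prompt captures a distinct image attribute.
    prompts: (num_prompts, dim)."""
    normed = F.normalize(prompts, dim=-1)
    sim = normed @ normed.T                                        # (P, P) cosine similarities
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)     # ignore self-similarity
    return off_diag.clamp(min=0).mean()

prompts = torch.nn.Parameter(torch.randn(8, 512))
loss = prompt_diversity_loss(prompts)
loss.backward()
print(float(loss))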

Authors:Longbin Ji, Lei Zhong, Pengfei Wei, Changjian Li
Title: PoseTraj: Pose-Aware Trajectory Control in Video Diffusion
Abstract:
Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.

Authors:Tim Seizinger, Florin-Alexandru Vasluianu, Marcos V. Conde, Zongwei Wu, Radu Timofte
Title: Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures
Abstract:
Bokeh rendering methods play a key role in creating the visually appealing, softly blurred backgrounds seen in professional photography. While recent learning-based approaches show promising results, generating realistic Bokeh with variable strength remains challenging. Existing methods require additional inputs and suffer from unrealistic Bokeh reproduction due to reliance on synthetic data. In this work, we propose Bokehlicious, a highly efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, mimicking the physical lens aperture. To further address the lack of high-quality real-world data, we present RealBokeh, a novel dataset featuring 23,000 high-resolution (24-MP) images captured by professional photographers, covering diverse scenes with varied aperture and focal length settings. Evaluations on both our new RealBokeh and established Bokeh rendering benchmarks show that Bokehlicious consistently outperforms SOTA methods while significantly reducing computational cost and exhibiting strong zero-shot generalization. Our method and dataset further extend to defocus deblurring, achieving competitive results on the RealDOF benchmark. Our code and data can be found at https://github.com/TimSeizinger/Bokehlicious
Chinese: Bokehlicious 提出了一种高效网络,通过光圈感知注意力机制实现可调节的虚化效果,并提供了高质量的RealBokeh数据集以克服合成数据局限,在真实性和效率上均优于现有方法。
English: Bokehlicious introduces an efficient network with an Aperture-Aware Attention mechanism for adjustable Bokeh effects and a high-quality RealBokeh dataset to overcome synthetic data limitations, outperforming existing methods in realism and efficiency.

Authors:Qiang Zou, Shuli Cheng, Jiayi Chen
Title: PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval
Abstract:
Cross-modal hashing is a promising approach for efficient data retrieval and storage optimization. However, contemporary methods exhibit significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrains retrieval efficacy. We present PromptHash, an innovative framework leveraging affinity prompt-aware collaborative learning for adaptive cross-modal hashing. We propose an end-to-end framework for affinity-prompted collaborative hashing, with the following fundamental technical contributions: (i) a text affinity prompt learning mechanism that preserves contextual information while maintaining parameter efficiency, (ii) an adaptive gated selection fusion architecture that synthesizes State Space Model with Transformer network for precise cross-modal feature integration, and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. To the best of our knowledge, this study presents the first investigation into affinity prompt awareness within collaborative cross-modal adaptive hash learning, establishing a paradigm for enhanced semantic consistency across modalities. Through comprehensive evaluation on three benchmark multi-label datasets, PromptHash demonstrates substantial performance improvements over existing approaches. Notably, on the NUS-WIDE dataset, our method achieves significant gains of 18.22% and 18.65% in image-to-text and text-to-image retrieval tasks, respectively. The code is publicly available at https://github.com/ShiShuMo/PromptHash.
中文摘要:PromptHash提出了一种利用亲和性提示感知协同学习的创新框架,通过自适应跨模态哈希技术显著提升检索性能,有效解决了语义保持和模态异质性问题。
English Summary: PromptHash introduces an innovative framework using affinity prompt-aware collaborative learning to enhance cross-modal hashing, achieving significant performance improvements in retrieval tasks by addressing semantic preservation and modal heterogeneity.

Authors:Wanshu Fan, Yue Wang, Cong Wang, Yunzhe Zhang, Wei Wang, Dongsheng Zhou
Title: Semantic-Guided Global-Local Collaborative Networks for Lightweight Image Super-Resolution
Abstract:
Single-Image Super-Resolution (SISR) plays a pivotal role in enhancing the accuracy and reliability of measurement systems, which are integral to various vision-based instrumentation and measurement applications. These systems often require clear and detailed images for precise object detection and recognition. However, images captured by visual measurement tools frequently suffer from degradation, including blurring and loss of detail, which can impede measurement accuracy. As a potential remedy, we propose in this paper a Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) for lightweight SISR. Our SGGLC-Net leverages semantic priors extracted from a pre-trained model to guide the super-resolution process, enhancing image detail quality effectively. Specifically, we propose a Semantic Guidance Module that seamlessly integrates the semantic priors into the super-resolution network, enabling the network to more adeptly capture and utilize semantic priors, thereby enhancing image details. To further explore both local and non-local interactions for improved detail rendition, we propose a Global-Local Collaborative Module, which features three Global and Local Detail Enhancement Modules, as well as a Hybrid Attention Mechanism, that work together to efficiently learn more useful features. Our extensive experiments show that SGGLC-Net achieves competitive PSNR and SSIM values across multiple benchmark datasets, demonstrating higher performance with a multi-adds reduction of 12.81G compared to state-of-the-art lightweight super-resolution approaches. These improvements underscore the potential of our approach to enhance the precision and effectiveness of visual measurement systems. Codes are at https://github.com/fanamber831/SGGLC-Net.
中文: 本文提出SGGLC-Net轻量级单图像超分辨率网络,通过语义引导和全局-局部协作增强图像细节,在降低计算量的同时实现了优越性能,可有效提升视觉测量系统的精度。
English: This paper introduces SGGLC-Net, a lightweight single-image super-resolution network that uses semantic guidance and global-local collaboration to enhance image details, achieving competitive performance with reduced computational costs for visual measurement systems.

Authors:Abdelrahman Elsayed, Sarim Hashmi, Mohammed Elseiagy, Hu Wang, Mohammad Yaqub, Ibrahim Almakky
Title: SALT: Parameter-Efficient Fine-Tuning via Singular Value Adaptation with Low-Rank Transformation
Abstract:
The complex nature of medical image segmentation calls for models that are specifically designed to capture detailed, domain-specific features. Large foundation models offer considerable flexibility, yet the cost of fine-tuning these models remains a significant barrier. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), efficiently update model weights with low-rank matrices but may suffer from underfitting when the chosen rank is insufficient to capture domain-specific nuances. Conversely, full-rank Singular Value Decomposition (SVD) based methods provide comprehensive updates by modifying all singular values, yet they often lack flexibility and exhibit variable performance across datasets. We propose SALT (Singular Value Adaptation with Low-Rank Transformation), a method that selectively adapts the most influential singular values using trainable scale and shift parameters while complementing this with a low-rank update for the remaining subspace. This hybrid approach harnesses the advantages of both LoRA and SVD, enabling effective adaptation without relying on increasing model size or depth. Evaluated on 5 challenging medical datasets, ranging from as few as 20 samples to 1000, SALT outperforms state-of-the-art PEFT (LoRA and SVD) by 2% to 5% in Dice with only 3.9% trainable parameters, demonstrating robust adaptation even in low-resource settings. The code for SALT is available at: https://github.com/BioMedIA-MBZUAI/SALT
中文: 提出的SALT方法结合选择性奇异值适应与低秩更新,在医学图像分割任务中以仅3.9%可训练参数实现比现有高效调参技术高出2-5%的性能表现。
English: The proposed SALT method combines selective singular value adaptation with low-rank updates to outperform existing parameter-efficient fine-tuning techniques by 2-5% on medical image segmentation tasks while using only 3.9% trainable parameters.
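A rough sketch of the hybrid adaptation idea, assuming a frozen SVD of the weight, trainable scale and shift on the top-k singular values, and a zero-initialized low-rank residual for the remaining subspace; the hyperparameters and initialization are placeholders, not the released SALT code.

import torch
import torch.nn as nn

class SALTLinear(nn.Module):
    """Frozen weight W = U S V^T; trains a scale and shift on the top-k singular
    values plus a rank-r update covering the remaining subspace."""
    def __init__(self, weight: torch.Tensor, k: int = 8, r: int = 4):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.k = k
        self.scale = nn.Parameter(torch.ones(k))       # multiplicative update on top-k
        self.shift = nn.Parameter(torch.zeros(k))      # additive update on top-k
        out_f, in_f = weight.shape
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))   # zero-init: starts as no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        top = self.S[: self.k] * self.scale + self.shift
        S_adapted = torch.cat([top, self.S[self.k:]])
        W = self.U @ torch.diag(S_adapted) @ self.Vh        # adapted frozen weight
        W = W + self.lora_B @ self.lora_A                   # low-rank residual update
        return x @ W.T

layer = SALTLinear(torch.randn(256, 128), k=8, r=4)
print(layer(torch.randn(2, 128)).shape)   # torch.Size([2, 256])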

Authors:Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie
Title: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Abstract:
Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, not all visual content contributes equally to the user instruction, and existing strategies (e.g., average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and an attention mechanism performs the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved, making the result easier for LLMs to understand. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that HICom achieves distinguished video understanding ability with fewer tokens, increasing performance by 2.43% on average across three multiple-choice QA benchmarks and saving 78.8% of tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.
Chinese Summary: 提出的HICom方法通过指令引导混合层级令牌压缩,在多模态大语言模型中优化视频压缩,聚焦用户相关信息以提升性能并显著降低计算负担。
English Summary: The proposed HICom method enhances video compression in Multi-modal Large Language Models by using instructions to guide hybrid-level token compression, improving performance and reducing computational load by focusing on user-relevant information.
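One way to picture instruction-conditioned token compression is a small set of learnable queries, shifted by the instruction embedding, cross-attending over the visual tokens; the sketch below uses standard multi-head attention and made-up sizes, and is not HICom's actual hybrid local/global design.

import torch
import torch.nn as nn

class ConditionalTokenCompressor(nn.Module):
    """Compresses many visual tokens into a few, with the user instruction injected
    into the learnable queries that drive the attention-based pooling."""
    def __init__(self, dim: int, num_compressed: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_compressed, dim) * 0.02)
        self.instr_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor, instr_embed: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D); instr_embed: (B, D) pooled instruction embedding.
        B = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + self.instr_proj(instr_embed).unsqueeze(1)
        compressed, _ = self.attn(q, visual_tokens, visual_tokens)
        return compressed                                   # (B, num_compressed, D)

comp = ConditionalTokenCompressor(dim=256)
out = comp(torch.randn(2, 1024, 256), torch.randn(2, 256))
print(out.shape)   # torch.Size([2, 16, 256])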

Authors:Sunqi Fan, Meng-Hao Guo, Shuojin Yang
Title: Agentic Keyframe Search for Video Question Answering
Abstract:
Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, which is a critical step towards achieving intelligence. However, the demand for a thorough understanding of videos and high computational costs still limit the widespread applications of VideoQA. To address it, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It can effectively distinguish key information from redundant, irrelevant content by leveraging modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent determines if sufficient keyframes have been collected based on termination conditions and provides answers. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe searching efficiency, which means it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy while processing only 43.5% of the frames compared to VideoTree. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at https://github.com/fansunqi/AKeyS.
Chinese: 提出的Agentic Keyframe Search (AKeyS)算法通过语言智能体引导搜索策略,在VideoQA任务中高效识别关键帧,在EgoSchema和NExT-QA数据集上的实验表明其能以最低计算成本实现最优准确率。
English: The proposed Agentic Keyframe Search (AKeyS) algorithm efficiently identifies keyframes in VideoQA tasks by leveraging language agents to guide search algorithms, achieving superior accuracy with minimal computational overhead as demonstrated on EgoSchema and NExT-QA datasets.
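The search skeleton can be pictured as best-first expansion over a tree of frame intervals, with the language agent supplying the heuristic. The toy sketch below replaces the agent with a stub scoring function and uses a fixed keyframe budget as the termination condition, both simplifications of the actual algorithm.

import heapq

def agentic_keyframe_search(num_frames, score_segment, max_keyframes=8):
    """Best-first search over a binary tree of frame intervals. `score_segment`
    stands in for the language agent's heuristic (lower = more promising)."""
    keyframes, frontier = [], [(score_segment(0, num_frames), 0, num_frames)]
    while frontier and len(keyframes) < max_keyframes:
        _, lo, hi = heapq.heappop(frontier)
        mid = (lo + hi) // 2
        keyframes.append(mid)                      # sample the segment's center frame
        for a, b in ((lo, mid), (mid + 1, hi)):    # expand child segments
            if b - a > 1:
                heapq.heappush(frontier, (score_segment(a, b), a, b))
    return sorted(keyframes)

# Toy heuristic: pretend frames near index 120 are most relevant to the question.
toy_heuristic = lambda lo, hi: abs((lo + hi) // 2 - 120)
print(agentic_keyframe_search(300, toy_heuristic))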

Authors:Xiaomeng Chu, Jiajun Deng, Guoliang You, Wei Liu, Xingchen Li, Jianmin Ji, Yanyong Zhang
Title: GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Abstract:
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. The code is available at https://github.com/cxmomo/GraspCoT.
中文摘要:本文提出GraspCoT框架,通过面向物理属性的思维链推理机制改进机器人抓取,并在新基准IntentGrasp和实际应用中验证了其优越性。
English Summary: This paper introduces GraspCoT, a 6-DoF grasp detection framework that enhances robotic grasping through Chain-of-Thought reasoning about physical properties, validated on the new IntentGrasp benchmark and real-world applications.

Authors:Zeqi Zheng, Yanchen Huang, Yingchao Yu, Zizheng Zhu, Junfeng Tang, Zhaofei Yu, Yaochu Jin
Title: SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition
Abstract:
Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46% using only 39% of the parameters and half the time steps. The code and model checkpoints are publicly available at https://github.com/KirinZheng/SpiLiFormer.
中文摘要:SpiLiFormer模型通过引入侧向抑制机制,解决了基于Transformer的脉冲神经网络中注意力过度分配的问题,在多个数据集上以更高效率实现了最先进的性能。
English Summary: The SpiLiFormer model introduces a lateral inhibition mechanism to address the issue of over-allocating attention in Transformer-based Spiking Neural Networks, achieving state-of-the-art performance across multiple datasets with significantly improved efficiency.
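As a loose analogy for lateral inhibition in attention (not the paper's spiking formulation), the snippet below suppresses below-average attention weights and renormalizes, which sharpens focus on the strongest tokens; the inhibition strength is an arbitrary placeholder.

import torch
import torch.nn.functional as F

def lateral_inhibition_attention(q, k, v, inhibition=0.5):
    """Scaled dot-product attention followed by a simple inhibition step that
    zeroes out weights well below the uniform baseline. q, k, v: (B, N, D)."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)                       # (B, N, N)
    baseline = weights.mean(dim=-1, keepdim=True)             # uniform level 1/N
    inhibited = torch.relu(weights - inhibition * baseline)   # suppress weak responses
    inhibited = inhibited / inhibited.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return inhibited @ v

q = k = v = torch.randn(2, 16, 64)
print(lateral_inhibition_attention(q, k, v).shape)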

Authors:Pengyu Liu, Guohua Dong, Dan Guo, Kun Li, Fengling Li, Xun Yang, Meng Wang, Xiaomin Ying
Title: A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli
Abstract:
In daily life, we encounter diverse external stimuli, such as images, sounds, and videos. As research in multimodal stimuli and neuroscience advances, fMRI-based brain decoding has become a key tool for understanding brain perception and its complex cognitive processes. Decoding brain signals to reconstruct stimuli not only reveals intricate neural mechanisms but also drives progress in AI, disease treatment, and brain-computer interfaces. Recent advancements in neuroimaging and image generation models have significantly improved fMRI-based decoding. While fMRI offers high spatial resolution for precise brain activity mapping, its low temporal resolution and signal noise pose challenges. Meanwhile, techniques like GANs, VAEs, and Diffusion Models have enhanced reconstructed image quality, and multimodal pre-trained models have boosted cross-modal decoding tasks. This survey systematically reviews recent progress in fMRI-based brain decoding, focusing on stimulus reconstruction from passive brain signals. It summarizes datasets, relevant brain regions, and categorizes existing methods by model structure. Additionally, it evaluates model performance and discusses their effectiveness. Finally, it identifies key challenges and proposes future research directions, offering valuable insights for the field. For more information and resources related to this survey, visit https://github.com/LpyNow/BrainDecodingImage.
中文: 本综述系统回顾了基于fMRI的大脑解码在刺激重建方面的最新进展,重点介绍了利用生成模型和多模态预训练的技术提升,同时探讨了时间分辨率等挑战并提出了未来研究方向。
English: This survey reviews recent advances in fMRI-based brain decoding for stimulus reconstruction, highlighting improved techniques using generative models and multimodal pre-training while addressing challenges like temporal resolution and proposing future research directions.

Authors:Zichen Liu, Kunlun Xu, Bing Su, Xu Zou, Yuxin Peng, Jiahuan Zhou
Title: STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Abstract:
Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model's ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently achieves superior performance against state-of-the-art methods. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.
中文:提出的STOP模型通过动态生成空间提示来突出帧内关键区域,以及时间提示来优先处理重要帧,从而提升视频理解能力,在多个视频基准测试中取得了优越性能。
English: The proposed STOP model enhances video understanding by dynamically generating spatial prompts to highlight key regions within frames and temporal prompts to prioritize important frames, achieving superior performance on video benchmarks.
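The frame-selection rule for inter-frame prompts, picking the gaps with the lowest consecutive-frame similarity (highest temporal variance), can be sketched as follows; the feature source and number of prompts are placeholders.

import torch
import torch.nn.functional as F

def temporal_prompt_positions(frame_feats: torch.Tensor, num_prompts: int = 4):
    """Pick the frame gaps with the lowest consecutive-frame cosine similarity as
    positions for inserting inter-frame temporal prompts.
    frame_feats: (T, D) pooled per-frame features."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)   # (T-1,)
    return torch.topk(-sims, k=num_prompts).indices.sort().values           # gap indices

feats = torch.randn(32, 512)
print(temporal_prompt_positions(feats))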

Authors:Clive Tinashe Marimo, Benedikt Blumenstiel, Maximilian Nitsche, Johannes Jakubik, Thomas Brunschwiler
Title: Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation
Abstract:
Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated using Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by +6.77% on average and retrieval performance by +4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. The image-caption dataset, code, and model weights are available at https://github.com/IBM/MS-CLIP.
中文摘要:Llama3-MS-CLIP是首个基于多光谱卫星数据预训练的视觉语言模型,在分类任务中平均准确率提升6.77%,检索任务中mAP提高4.63%,显著优于传统RGB模型。
English Summary: Llama3-MS-CLIP is the first vision-language model pre-trained on multispectral satellite data, significantly outperforming RGB-based models with a 6.77% average accuracy improvement in classification and 4.63% mAP boost in retrieval tasks.

Authors:Yaxiong Chen, Minghong Wei, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention
Abstract:
Referring medical image segmentation targets delineating lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Despite not being trained on medical data, we enforce CLIP's rich semantic space onto the medical domain by a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
中文摘要:CausalCLIPSeg是一种端到端框架,通过定制跨模态解码将CLIP的语义空间应用于医学领域,并引入因果干预模块消除混杂偏差,从而在医学图像分割中实现文本与像素的精准对齐。
English Summary: CausalCLIPSeg is an end-to-end framework that adapts CLIP's vision-language capabilities for medical image segmentation by aligning text descriptions with pixel-level features and incorporating causal intervention to eliminate spurious correlations.

Authors:Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation
Abstract:
Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing state-of-the-art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.
中文摘要:本研究提出UniCrossAdapter轻量适配器,通过将其集成到CLIP模型中弥合自然图像与医学图像的领域差异,在有限标注数据条件下显著提升放射学报告生成的跨模态语义对齐效果。
English Summary: This study introduces UniCrossAdapter, lightweight adapters integrated into the CLIP model to bridge the domain gap between natural and medical images, enhancing cross-modal alignment for improved radiology report generation with limited labeled data.
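
The abstract states only that lightweight adapters are inserted into CLIP while the base parameters stay frozen, so a generic residual bottleneck adapter conveys the idea. The module name, bottleneck ratio, zero-initialization trick, and the number of adapters below are assumptions for illustration, not the published UniCrossAdapter design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as identity so the frozen CLIP is unchanged
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def add_adapters(clip_model: nn.Module, dim: int = 512) -> nn.ModuleList:
    """Freeze CLIP and create adapter modules (e.g., one per modality); wiring them
    into the CLIP forward pass is omitted. Only adapter parameters remain trainable."""
    for p in clip_model.parameters():
        p.requires_grad = False
    return nn.ModuleList([BottleneckAdapter(dim) for _ in range(2)])
```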

Authors:Sidi Yang, Binxiao Huang, Yulun Zhang, Dahai Yu, Yujiu Yang, Ngai Wong
Title: DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables
Abstract:
While deep neural networks have revolutionized image denoising capabilities, their deployment on edge devices remains challenging due to substantial computational and memory requirements. To this end, we present DnLUT, an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption. Our key innovation lies in two complementary components: a Pairwise Channel Mixer (PCM) that effectively captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables post-training, DnLUT achieves remarkable efficiency, requiring only 500KB storage and 0.1% energy consumption compared to its CNN counterpart DnCNN, while delivering 20X faster inference. Extensive experiments demonstrate that DnLUT outperforms all existing LUT-based methods by over 1dB in PSNR, establishing a new state-of-the-art in resource-efficient color image denoising. The project is available at https://github.com/Stephen0808/DnLUT.
中文: DnLUT是一种基于查找表的超高效框架,通过创新的并行通道混合器和L形卷积设计,仅需500KB存储和DnCNN 0.1%的能耗即可实现高质量彩色图像去噪,推理速度提升20倍,并在PSNR指标上超越现有查找表方法超过1dB。
English: DnLUT is an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption, requiring only 500KB storage and 0.1% energy compared to DnCNN while delivering 20X faster inference and outperforming existing LUT methods by over 1dB in PSNR.
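
The core trick of converting small learned operators into lookup tables after training can be shown with a toy example: evaluate a tiny network once over its entire quantized input space, cache the outputs, and replace inference with indexing. The two-pixel input and network below are illustrative assumptions; the actual PCM and L-shaped kernels index much larger tables over pixel neighborhoods.

```python
import torch
import torch.nn as nn

# Toy "network" acting on a pair of quantized 8-bit values (e.g., two neighboring pixels).
net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

# Post-training LUT conversion: evaluate the net once on every possible input pair.
levels = torch.arange(256, dtype=torch.float32)
grid = torch.cartesian_prod(levels, levels) / 255.0          # (65536, 2) all input combinations
with torch.no_grad():
    lut = net(grid).reshape(256, 256)                        # cached outputs

def lut_inference(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Replace the forward pass with a table lookup on integer pixel values."""
    return lut[a.long(), b.long()]

a = torch.randint(0, 256, (4,))
b = torch.randint(0, 256, (4,))
print(lut_inference(a, b))
```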

Authors:Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, Yong-Lu Li
Title: Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions
Abstract:
Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty in acquiring 3D object assets. However, with the development of 3D reconstruction from single images, recently it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides the 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work. Data and code will be publicly available at https://wenboran2002.github.io/3dhoi.

Authors:Junho Kim, Gwangtak Bae, Eun Sun Lee, Young Min Kim
Title: Learning 3D Scene Analogies with Neural Contextual Scene Maps
Abstract:
Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. Since data-driven learning cannot tractably encapsulate the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications. The project page, including the code, is available at https://82magnolia.github.io/3d_scene_analogies/.

Authors:Jiawei Wang, Kai Hu, Qiang Huo
Title: UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis
Abstract:
Document structure analysis, also known as document layout analysis, is crucial for understanding both the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. Hierarchical Document Structure Analysis (HDSA) specifically aims to restore the hierarchical structure of documents created using authoring software with hierarchical schemas. Previous research has primarily followed two approaches: one focuses on tackling specific subtasks of HDSA in isolation, such as table detection or reading order prediction, while the other adopts a unified framework that uses multiple branches or modules, each designed to address a distinct task. In this work, we propose a unified relation prediction approach for HDSA, called UniHDSA, which treats various HDSA sub-tasks as relation prediction problems and consolidates relation prediction labels into a unified label space. This allows a single relation prediction module to handle multiple tasks simultaneously, whether for page-level or document-level structure analysis. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on a hierarchical document structure analysis benchmark, Comp-HRDoc, and competitive results on a large-scale document layout analysis dataset, DocLayNet, effectively illustrating the superiority of our method across all sub-tasks. The Comp-HRDoc benchmark and UniHDSA's configurations are publicly available at https://github.com/microsoft/CompHRDoc.
Chinese: 本文提出UniHDSA方法,通过将层次化文档结构分析任务统一为关系预测问题,基于多模态Transformer架构实现了最优性能,在多个基准测试中验证了其全面优势。
English: This paper introduces UniHDSA, a unified relation prediction approach that consolidates various hierarchical document structure analysis tasks into a single module, achieving state-of-the-art performance on benchmarks through a multimodal Transformer-based system.

Authors:Haiguang Wang, Daqi Liu, Hongwei Xie, Haisong Liu, Enhui Ma, Kaicheng Yu, Limin Wang, Bing Wang
Title: MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving
Abstract:
In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos of up to one minute. MiLA utilizes a coarse-to-refine approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art performance in video generation quality. For more information, visit the project website: https://github.com/xiaomi-mlab/mila.github.io.
Chinese: MiLA框架通过粗到精的方法和专门模块,解决了自动驾驶训练中生成长时、一致性视频的难题,在nuScenes数据集上实现了最先进的性能。
English: The MiLA framework addresses the challenge of generating long, consistent videos for autonomous driving training by using a coarse-to-refine approach and specialized modules, achieving state-of-the-art performance on the nuScenes dataset.

Authors:Debabrata Mandal, Soumitri Chattopadhyay, Guansen Tong, Praneeth Chakravarthula
Title: UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations
Abstract:
Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images in guiding a controllable diffusion model for real-world image restoration and we design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a smartly designed curriculum learning recipe. Additionally, we also introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations. Project page: https://codejaeger.github.io/unicorn-gh

Authors:Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
Title: VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling
Abstract:
We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.

Authors:Zhenglin Zhou, Fan Ma, Hehe Fan, Tat-Seng Chua
Title: Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
Abstract:
Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.
Chinese: Zero-1-to-A提出了一种渐进式学习方法,通过视频扩散模型构建时空一致性数据集,实现了高保真4D虚拟人生成,显著提升了动画质量和渲染效率。
English: Zero-1-to-A introduces a progressive learning method that constructs spatial and temporal consistency datasets through video diffusion models, enabling high-fidelity 4D avatar generation with improved animation quality and rendering efficiency.

Authors:Yiren Lu, Yunlai Zhou, Disheng Liu, Tuo Liang, Yu Yin
Title: BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has shown remarkable potential for static scene reconstruction, and recent advancements have extended its application to dynamic scenes. However, the quality of reconstructions depends heavily on high-quality input images and precise camera poses, which are difficult to satisfy in real-world scenarios. Capturing dynamic scenes with handheld monocular cameras, for instance, typically involves simultaneous movement of both the camera and objects within a single exposure. This combined motion frequently results in image blur that existing methods cannot adequately handle. To address these challenges, we introduce BARD-GS, a novel approach for robust dynamic scene reconstruction that effectively handles blurry inputs and imprecise camera poses. Our method comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. By explicitly decomposing motion blur into camera motion blur and object motion blur and modeling them separately, we achieve significantly improved rendering results in dynamic regions. In addition, we collect a real-world motion blur dataset of dynamic scenes to evaluate our approach. Extensive experiments demonstrate that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods.

Authors:Xuan Gao, Jingtao Zhou, Dongyu Liu, Yuqi Zhou, Juyong Zhang
Title: Controlling Avatar Diffusion with Learnable Gaussian Embedding
Abstract:
Recent advances in diffusion models have made significant progress in digital human generation. However, most existing models still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. A key reason for these shortcomings is the limited representation ability of commonly used control signals (e.g., landmarks and depth maps). In addition, the lack of diversity in identity and pose variations in public datasets further hinders progress in this area. In this paper, we analyze the shortcomings of current control signals and introduce a novel control signal representation that is optimizable, dense, expressive, and 3D consistent. Our method embeds a learnable neural Gaussian onto a parametric head surface, which greatly enhances the consistency and expressiveness of diffusion-based head models. Regarding the dataset, we synthesize a large-scale dataset with multiple poses and identities. In addition, we use real/synthetic labels to effectively distinguish real and synthetic data, minimizing the impact of imperfections in synthetic data on the generated head images. Extensive experiments show that our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency. Our code, synthetic datasets, and pre-trained models will be released on our project page: https://ustc3dv.github.io/Learn2Control/

Authors:Jingyun Liu, Daiqin Yang, Zhenzhong Chen
Title: Frequency Enhancement for Image Demosaicking
Abstract:
Recovering high-frequency textures in image demosaicking remains a challenging issue. While existing methods have introduced elaborate spatial learning schemes, their performance remains limited. To address this issue, a frequency enhancement approach is proposed. Based on the frequency analysis of color filter array (CFA)/demosaicked/ground truth images, we propose the Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through Fourier-domain frequency selection. In DFENet, two frequency selectors are employed, each selecting a set of frequency components for processing along separate paths. One path focuses on generating missing information through detail refinement in the spatial domain, while the other aims at suppressing undesirable frequencies with the guidance of CFA images in the frequency domain. Multi-level frequency supervision with a stagewise training strategy is employed to further improve the reconstruction performance. With these designs, the proposed DFENet outperforms other state-of-the-art algorithms on different datasets and demonstrates significant advantages on hard cases. Moreover, to better assess algorithms' ability to reconstruct high-frequency textures, a new dataset, LineSet37, is introduced, which consists of 37 artificially designed and generated images. These images feature complex line patterns and are prone to severe visual artifacts like color moiré after demosaicking. Experiments on LineSet37 offer a more targeted evaluation of performance on challenging cases. The code and dataset are available at https://github.com/VelvetReverie/DFENet-demosaicking.
中文摘要:提出的双路径频率增强网络(DFENet)通过空间域细节优化和频域干扰抑制的双路径设计,有效提升了图像去马赛克中高频纹理的恢复能力,在多个数据集上超越现有方法,并专门构建了针对高频重建评估的新数据集。
English Summary: The proposed Dual-path Frequency Enhancement Network (DFENet) improves image demosaicking by processing frequency components through separate spatial refinement and frequency suppression paths, outperforming existing methods and introducing a specialized dataset for evaluating high-frequency texture reconstruction.
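
The Fourier-domain frequency selection described above can be pictured as a mask applied to the 2D FFT of an image or feature map, with one mask keeping low frequencies for the detail-refinement path and its complement feeding the suppression path. The hard radial mask below is a simplification for illustration; DFENet learns its frequency selectors.

```python
import torch

def split_frequencies(x: torch.Tensor, radius: float = 0.25):
    """Split a tensor (B, C, H, W) into low- and high-frequency parts via a
    radial mask in the Fourier domain. The fixed radius is an assumption."""
    B, C, H, W = x.shape
    freq = torch.fft.fft2(x, norm="ortho")
    fy = torch.fft.fftfreq(H, device=x.device).view(H, 1)
    fx = torch.fft.fftfreq(W, device=x.device).view(1, W)
    mask = ((fy ** 2 + fx ** 2).sqrt() <= radius).to(x.dtype)   # (H, W) low-pass mask
    low = torch.fft.ifft2(freq * mask, norm="ortho").real
    high = torch.fft.ifft2(freq * (1 - mask), norm="ortho").real
    return low, high

img = torch.randn(1, 3, 64, 64)
low, high = split_frequencies(img)
print(low.shape, high.shape)
```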

Authors:Parham Saremi, Amar Kumar, Mohamed Mohamed, Zahra TehraniNasab, Tal Arbel
Title: RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models
Abstract:
Vision-Language Foundation Models (VLFMs) have shown tremendous performance gains in generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture where a pre-trained VLFM (e.g. Stable Diffusion) provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with synthesized images. Experiments on the public ISIC2019 skin lesion dataset demonstrate that the proposed method improves (a) the quality of the generated images, and (b) the alignment with the text prompt over the original fine-tuned Stable Diffusion baseline. We also show that the synthesized samples could be used to improve disease classifier performance for underrepresented subgroups through augmentation. Our code is accessible through the project website: https://parhamsaremi.github.io/rl4med-ddpo

Authors:Panagiota Moraiti, Efstathios Karypidis
Title: Technical Report for the 5th CLVision Challenge at CVPR: Addressing the Class-Incremental with Repetition using Unlabeled Data -- 4th Place Solution
Abstract:
This paper outlines our approach to the 5th CLVision challenge at CVPR, which addresses the Class-Incremental with Repetition (CIR) scenario. In contrast to traditional class incremental learning, this novel setting introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into the training process. In the CIR scenario, encountered classes may reappear in later learning experiences, and each experience may involve only a subset of the overall class distribution. Additionally, the unlabeled data provided during training may include instances of unseen classes, or irrelevant classes which should be ignored. Our approach focuses on retaining previously learned knowledge by utilizing knowledge distillation and pseudo-labeling techniques. The key characteristic of our method is the exploitation of unlabeled data during training, in order to maintain optimal performance on instances of previously encountered categories and reduce the detrimental effects of catastrophic forgetting. Our method achieves an average accuracy of 16.68% during the pre-selection phase and 21.19% during the final evaluation phase, outperforming the baseline accuracy of 9.39%. We provide the implementation code at https://github.com/panagiotamoraiti/continual-learning-challenge-2024.
中文: 本文针对重复类增量学习场景,提出了一种结合知识蒸馏和伪标签技术的方法,利用未标记数据缓解灾难性遗忘,显著超越了基线准确率。
English: This paper presents a method for the Class-Incremental with Repetition challenge, utilizing knowledge distillation and pseudo-labeling on unlabeled data to mitigate catastrophic forgetting and improve accuracy over the baseline.
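
The two key ingredients, knowledge distillation from the previous-experience model and pseudo-labels on the unlabeled stream, combine into a loss that is easy to sketch. The confidence threshold, temperature, and weighting below are illustrative assumptions, not the values used in the submission.

```python
import torch
import torch.nn.functional as F

def cir_loss(student, teacher, x_lab, y_lab, x_unlab,
             tau: float = 2.0, conf_thresh: float = 0.8, alpha: float = 1.0):
    """Supervised CE on labeled data + KD and confidence-filtered pseudo-labels on the
    unlabeled stream; the teacher is a frozen copy of the model from the previous experience."""
    ce = F.cross_entropy(student(x_lab), y_lab)

    with torch.no_grad():
        t_logits = teacher(x_unlab)
        conf, pseudo = t_logits.softmax(dim=-1).max(dim=-1)
        keep = conf > conf_thresh                     # drop low-confidence / unseen-class samples

    s_logits = student(x_unlab)
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                  F.softmax(t_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    pl = F.cross_entropy(s_logits[keep], pseudo[keep]) if keep.any() else s_logits.sum() * 0
    return ce + alpha * (kd + pl)
```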

Authors:Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe
Title: Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
Abstract:
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance-realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at https://github.com/jqliu09/mcld.
Chinese: 潜在扩散模型在生成图像时易丢失面部和服装等敏感区域的细节,因此提出的多焦点条件潜在扩散方法通过分离这些区域的特征来增强图像真实感,在DeepFashion等数据集上提升了身份一致性和编辑灵活性。
English: The Latent Diffusion Model often loses fine details in sensitive areas like faces and clothing, so the proposed Multi-focal Conditioned Latent Diffusion method enhances image realism by focusing on disentangled features from these regions, improving identity consistency and editing flexibility on datasets like DeepFashion.

Authors:Yanis Benidir, Nicolas Gonthier, Clement Mallet
Title: The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation
Abstract:
Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This remains poorly addressed so far: methods either require large volumes of annotated data (semantic case), or are limited to restricted datasets (binary set-ups). Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are the key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to FSC-180k, a comprehensive and hybrid transfer-proof dataset. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low data regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All codes, models, and data are available here: https://yb23.github.io/projects/cywd/

Authors:Cédric Vincent, Taehyoung Kim, Henri Meeß
Title: High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight
Abstract:
Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach, suited to onboard real-time inference, that achieves high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation using weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase of 12.5% and 6.7% TC over the base image segmentation model on UAVid and RuralScapes, respectively, with higher accuracy and comparable inference speed. On these aerial datasets, KD-SSP provides a better segmentation quality versus inference speed trade-off than other video methods proposed for general applications and shows considerably higher consistency. Project page: https://github.com/FraunhoferIVI/SSP.
Chinese Summary: 本文提出KD-SSP轻量级视频语义分割方法,通过语义相似性传播和一致性感知知识蒸馏提升航拍数据的时序稳定性,在分割质量和速度上实现更优平衡。
English Summary: This paper introduces KD-SSP, a lightweight video semantic segmentation method that enhances temporal consistency in aerial data through Semantic Similarity Propagation and consistency-aware Knowledge Distillation, achieving superior segmentation quality and speed.
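
The propagation step itself is simple to sketch: align the previous prediction to the current frame (assumed already registered here), compute per-pixel feature similarity between the two frames, and linearly blend prior and current predictions with that similarity as the weight. The cosine-similarity weighting and clamping below are one reading of the abstract, not the released code.

```python
import torch
import torch.nn.functional as F

def propagate_prediction(feat_prev, feat_curr, pred_prev_aligned, pred_curr):
    """Blend the registration-aligned previous prediction with the current one using
    per-pixel cosine similarity of frame features as the interpolation weight.
    feats: (B, C, H, W); preds: (B, K, H, W) class probabilities."""
    w = F.cosine_similarity(feat_prev, feat_curr, dim=1, eps=1e-6)   # (B, H, W) in [-1, 1]
    w = w.clamp(min=0.0).unsqueeze(1)                                # trust the prior only where frames agree
    return w * pred_prev_aligned + (1.0 - w) * pred_curr

B, C, K, H, W = 1, 64, 8, 32, 32
out = propagate_prediction(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                           torch.softmax(torch.randn(B, K, H, W), 1),
                           torch.softmax(torch.randn(B, K, H, W), 1))
print(out.shape)
```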

Authors:Yuming Gu, Phong Tran, Yujian Zheng, Hongyi Xu, Heyuan Li, Adilbek Karmanov, Hao Li
Title: DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
Abstract:
Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.
中文: 该创新方法通过改进DiffPortrait3D框架,结合定制ControlNet和双重外观模块,可为不同类型头部及配饰生成一致的360度视图,并能创建高质量神经辐射场实现实时自由视角渲染,性能超越现有技术。
English: This novel method generates consistent 360-degree head views for various forms and accessories by enhancing the DiffPortrait3D framework with a custom ControlNet and dual appearance module, enabling high-quality neural radiance fields for real-time free-viewpoint rendering that surpasses current techniques.

Authors:Matthew Massey, Abdullah-Al-Zubaer Imran
Title: EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis
Abstract:
Surficial geologic mapping is essential for understanding Earth surface processes, addressing modern challenges such as climate change and national security, and supporting common applications in engineering and resource management. However, traditional mapping methods are labor-intensive, limiting spatial coverage and introducing potential biases. To address these limitations, we introduce EarthScape, a novel, AI-ready multimodal dataset specifically designed for surficial geologic mapping and Earth surface analysis. EarthScape integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data. The dataset provides detailed annotations for seven distinct surficial geologic classes encompassing various geological processes. We present a comprehensive data processing pipeline using open-sourced raw data and establish baseline benchmarks using different spatial modalities to demonstrate the utility of EarthScape. As a living dataset with a vision for expansion, EarthScape bridges the gap between computer vision and Earth sciences, offering a valuable resource for advancing research in multimodal learning, geospatial analysis, and geological mapping. Our code is available at https://github.com/masseygeo/earthscape.
中文: EarthScape是一个专为地表地质测绘设计的多模态AI数据集,它通过整合多种地理空间数据和七类详细地质标注,有效解决了传统测绘方法的局限性,搭建了计算机视觉与地球科学之间的研究桥梁。
English: EarthScape is an AI-ready multimodal dataset designed to overcome the limitations of traditional surficial geologic mapping by integrating diverse geospatial data and providing detailed annotations for seven geologic classes, serving as a bridge between computer vision and Earth sciences.

Authors:Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Title: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Abstract:
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.
Chinese: 本文介绍了LLaVA-MORE系列多模态大语言模型,通过统一训练协议系统评估语言模型与视觉骨干网络的相互作用,为设计更有效的MLLM提供了见解。
English: The paper introduces LLaVA-MORE, a family of multimodal large language models that systematically evaluates the interplay between language models and visual backbones using a unified training protocol to provide insights for designing more effective MLLMs.

Authors:Masud Ahmed, Zahid Hasan, Syed Arefinul Haque, Abu Zaher Md Faridee, Sanjay Purushotham, Suya You, Nirmalya Roy
Title: CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation
Abstract:
Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g. VQ-VAE) is 8% lower than continuous-valued embeddings (e.g. KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (approximately 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (approximately 90% AP compared to baseline) from 50% salt-and-pepper noise, saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git
Chinese: 本文提出了一种用于语义分割的连续值嵌入框架,通过扩散引导的自回归变换器替代传统量化嵌入,在多个数据集上展现出对领域偏移和噪声的卓越鲁棒性。
English: This paper introduces a continuous-valued embedding framework for semantic segmentation that replaces traditional quantized embeddings with a diffusion-guided autoregressive transformer, achieving superior robustness to domain shifts and noise across various datasets.

Authors:Alba Márquez-Rodríguez, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, Irene Mendoza
Title: A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana
Abstract:
Passive Acoustic Monitoring is a key tool for biodiversity conservation, but the large volumes of unsupervised audio it generates present major challenges for extracting meaningful information. Deep Learning offers promising solutions. BirdNET, a widely used bird identification model, has shown success in many study systems but is limited at local scale due to biases in its training data, which focus on specific locations and target sounds rather than entire soundscapes. A key challenge in bird species identification is that many recordings either lack target species or contain overlapping vocalizations, complicating automatic identification. To address these problems, we developed a multi-stage pipeline for automatic bird vocalization identification in Doñana National Park (SW Spain), a wetland of high conservation concern. We deployed AudioMoth recorders in three main habitats across nine locations and manually annotated 461 minutes of audio, resulting in 3749 labeled segments spanning 34 classes. We first applied a Bird Song Detector to isolate bird vocalizations using spectrogram-based image processing. Then, species were classified using custom models trained at the local scale. Applying the Bird Song Detector before classification improved species identification, as all models performed better when analyzing only the segments where birds were detected. Specifically, the combination of detector and fine-tuned BirdNET outperformed the baseline without detection. This approach demonstrates the effectiveness of integrating a Bird Song Detector with local classification models. These findings highlight the need to adapt general-purpose tools to specific ecological challenges. Automatically detecting bird species helps track the health of this threatened ecosystem, given birds' sensitivity to environmental change, and supports conservation planning to reduce biodiversity loss.
中文: 针对多尼亚纳国家公园开发的多阶段识别流程,通过结合鸟类鸣叫检测器与本地优化的分类模型,显著提升了鸟类自动识别的准确性,凸显了针对具体生态挑战定制化通用工具对于有效保护生物多样性的重要性。
English: A multi-stage pipeline integrating a Bird Song Detector with locally fine-tuned classification models was developed to enhance automatic bird species identification in Doñana National Park, demonstrating improved performance over general tools by adapting to specific ecological challenges for effective biodiversity conservation.
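
The two-stage structure, detect vocalizations on the spectrogram first and classify only the detected segments, can be sketched with a simple band-limited energy detector. The band limits, threshold, and the placeholder classify_segment call are hypothetical, and the actual Bird Song Detector is a trained model rather than a fixed threshold; this is only meant to show the pipeline shape.

```python
import numpy as np
from scipy.signal import spectrogram

def detect_song_segments(audio: np.ndarray, sr: int, fmin=1000, fmax=8000,
                         thresh_db: float = -35.0, min_len_s: float = 0.3):
    """Toy energy-based detector: flag time frames whose band-limited energy exceeds a
    threshold and merge them into (start_s, end_s) segments."""
    f, t, sxx = spectrogram(audio, fs=sr, nperseg=1024, noverlap=512)
    band = (f >= fmin) & (f <= fmax)
    energy_db = 10 * np.log10(sxx[band].mean(axis=0) + 1e-12)
    active = energy_db > thresh_db
    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = t[i]
        elif not on and start is not None:
            if t[i] - start >= min_len_s:
                segments.append((start, t[i]))
            start = None
    if start is not None and t[-1] - start >= min_len_s:
        segments.append((start, t[-1]))
    return segments

# Detected segments would then be cropped and passed to a local classifier, e.g.:
# for s, e in detect_song_segments(audio, sr): label = classify_segment(audio, sr, s, e)
```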

Authors:NVIDIA, :, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Maosheng Liao, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Xiangyu Lu, Alice Luo, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Dinghao Yang, Xiaodong Yang, Zhuolin Yang, Jingxu Zhang, Xiaohui Zeng, Zhe Zhang
Title: Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Abstract:
Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
中文:Cosmos-Reason1模型通过长链思维推理过程理解物理世界并生成具身决策,利用分层和二维本体表示物理常识和具身推理,经过监督微调和强化学习训练后展现出显著性能提升。
English: The Cosmos-Reason1 models are designed to understand the physical world and generate embodied decisions through long reasoning processes, utilizing hierarchical and two-dimensional ontologies for physical common sense and embodied reasoning, with significant improvements shown through supervised fine-tuning and reinforcement learning.

Authors:Foundation AI Team, Kiran Bhat, Nishchaie Khanna, Karun Channa, Tinghui Zhou, Yiheng Zhu, Xiaoxia Sun, Charles Shang, Anirudh Sudarshan, Maurice Chu, Daiqing Li, Kangle Deng, Jean-Philippe Fauconnier, Tijmen Verhulsdonck, Maneesh Agrawala, Kayvon Fatahalian, Alexander Weiss, Christian Reiser, Ravi Kiran Chirravuri, Ravali Kandur, Alejandro Pelaez, Akash Garg, Michael Palleschi, Jessica Wang, Skylar Litz, Leon Liu, Anying Li, David Harmon, Derek Liu, Liangjun Feng, Denis Goupil, Lukas Kuczynski, Jihyun Yoon, Naveen Marri, Peiye Zhuang, Yinan Zhang, Brian Yin, Haomiao Jiang, Marcel van Workum, Thomas Lane, Bryce Erickson, Salil Pathare, Kyle Price, Steve Han, Yiqing Wang, Anupam Singh, David Baszucki
Title: Cube: A Roblox View of 3D Intelligence
Abstract:
Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.
Chinese: Roblox正在构建一个3D智能基础模型,旨在帮助开发者生成3D对象、场景和脚本,目前重点开发了3D形状分词器,支持文本到形状及场景的生成应用。
English: Roblox is developing a foundation model for 3D intelligence to assist developers in generating 3D objects, scenes, and scripts, with a current focus on a 3D shape tokenizer that enables text-to-shape and scene generation applications.

Authors:Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, Qin Jin
Title: EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Abstract:
Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Our code will be released at https://github.com/xuboshen/EgoDTM.
中文: EgoDTM提出了一种新型以自我为中心的视频-语言预训练模型,通过结合伪深度图和增强字幕来提升三维感知视觉理解能力,在多项任务中表现出卓越性能。
English: EgoDTM introduces a novel egocentric video-language pretraining model that enhances 3D-aware visual understanding by integrating pseudo depth maps and enriched captions, achieving superior performance across various tasks.

Authors:Ruichen Chen, Keith G. Mills, Di Niu
Title: FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers
Abstract:
Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinders practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn't align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-α, PixArt-Σ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.
中文摘要:FP4DiT提出了一种采用浮点量化的后训练量化方法,在扩散变换器上实现W4A6精度,优于基于整数的量化方案,并在PixArt和Hunyuan等模型上生成高质量的图像。
English Summary: FP4DiT introduces a post-training quantization method using floating-point quantization to achieve W4A6 precision for Diffusion Transformers, outperforming integer-based approaches and generating high-quality images on models like PixArt and Hunyuan.
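
A 4-bit floating-point weight quantizer can be sketched as nearest-neighbor rounding onto the small set of representable values after per-channel scaling. The grid below is one common E2M1 convention; FP4DiT additionally calibrates rounding with an adaptive-rounding scheme and quantizes activations online, which this toy fake-quantizer does not do.

```python
import torch

# One common FP4 (E2M1) magnitude grid; the sign bit is handled separately.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w: torch.Tensor) -> torch.Tensor:
    """Per-output-channel FP4 fake-quantization of a weight matrix (out, in):
    scale each row so its max magnitude maps to 6, then round to the nearest grid value."""
    scale = (w.abs().amax(dim=1, keepdim=True) / FP4_GRID.max()).clamp(min=1e-8)  # (out, 1)
    mag = (w.abs() / scale).unsqueeze(-1)                                          # (out, in, 1)
    idx = (mag - FP4_GRID).abs().argmin(dim=-1)                                    # nearest grid point
    return w.sign() * FP4_GRID[idx] * scale

w = torch.randn(4, 16)
print(quantize_fp4(w))
```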

Authors:Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang
Title: MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
Abstract:
This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/
Chinese: 本文提出MotionStreamer框架,通过将连续因果潜在空间融入概率自回归模型,解决了现有方法在文本条件流式动作生成中的延迟响应和误差累积问题,实现了精确的在线动作解码。
English: This paper introduces MotionStreamer, a novel framework that overcomes limitations of existing methods by using a continuous causal latent space in a probabilistic autoregressive model to enable accurate, real-time text-conditioned streaming motion generation without error accumulation.

Authors:Wei Tang, Yanpeng Sun, Qinying Gu, Zechao Li
Title: Visual Position Prompt for MLLM based Visual Grounding
Abstract:
Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address these issues, we introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets. The code and dataset are available at https://github.com/WayneTomas/VPP-LLaVA.
中文: VPP-LLaVA通过引入视觉位置提示机制,有效解决了多模态大语言模型在图像空间定位中的不足,利用精简数据集训练后在视觉 grounding 任务中达到领先水平并展现强大泛化能力。
English: VPP-LLaVA enhances multimodal large language models by integrating Visual Position Prompts to address spatial alignment challenges, achieving state-of-the-art performance in visual grounding tasks through efficient training with a compact dataset.
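
The global Visual Position Prompt is described as a learnable, axis-like tensor overlaid on the input image; a minimal sketch is a learnable image-sized bias blended onto every input before the vision encoder. The blending weight, resolution, and initialization below are assumptions, and the local (query-based) VPP is omitted.

```python
import torch
import torch.nn as nn

class GlobalVisualPositionPrompt(nn.Module):
    """Sketch of a global VPP: a learnable image-sized tensor added to the input to give
    the vision encoder explicit spatial reference cues."""
    def __init__(self, channels: int = 3, size: int = 336, alpha: float = 0.1):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(1, channels, size, size))
        self.alpha = alpha

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W), assumed already resized to the encoder resolution
        return images + self.alpha * self.prompt

vpp = GlobalVisualPositionPrompt()
print(vpp(torch.randn(2, 3, 336, 336)).shape)
```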

Authors:Harold Haodong Chen, Haojian Huang, Xianfeng Wu, Yexin Liu, Yajing Bai, Wen-Jie Shu, Harry Yang, Ser-Nam Lim
Title: Temporal Regularization Makes Your Video Generator Stronger
Abstract:
Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time and, as an initial investigation, introduce FluxFlow, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
中文摘要:FluxFlow首次在视频生成中引入时间增强技术,通过在数据层面施加受控扰动,无需修改模型结构即可显著提升多种生成模型的时间连贯性与多样性。
English Summary: FluxFlow introduces temporal augmentation to enhance video generation by applying controlled perturbations at the data level, significantly improving temporal coherence and diversity across multiple models without architectural changes.
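
Because FluxFlow operates purely at the data level, one plausible sketch of "controlled temporal perturbations" is a clip transform that swaps a few adjacent frames and locally shuffles a short window before the clip is fed to the generator. The specific perturbations, counts, and window size below are illustrative assumptions, not the published recipe.

```python
import torch

def temporal_perturb(clip: torch.Tensor, num_swaps: int = 2, window: int = 3) -> torch.Tensor:
    """Data-level temporal augmentation sketch for a clip of shape (T, C, H, W):
    swap a few adjacent frames and shuffle one short local window."""
    clip = clip.clone()
    T = clip.shape[0]
    for _ in range(num_swaps):                                   # adjacent-frame swaps
        t = torch.randint(0, T - 1, (1,)).item()
        clip[[t, t + 1]] = clip[[t + 1, t]]
    start = torch.randint(0, max(T - window, 1), (1,)).item()    # local window shuffle
    perm = torch.randperm(window) + start
    clip[start:start + window] = clip[perm]
    return clip

clip = torch.randn(16, 3, 32, 32)
print(temporal_perturb(clip).shape)
```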

Authors:Jisu Nam, Soowon Son, Zhan Xu, Jing Shi, Difan Liu, Feng Liu, Aashish Misraa, Seungryong Kim, Yang Zhou
Title: Visual Persona: Foundation Model for Full-Body Human Customization
Abstract:
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.

Authors:Yuchen Ren, Zhengyu Zhao, Chenhao Lin, Bo Yang, Lu Zhou, Zhe Liu, Chao Shen
Title: Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
Abstract:
Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.
Chinese: 本文提出针对视觉Transformer的前向传播优化方法,通过多样化注意力图和稳定令牌嵌入来提升对抗样本的可迁移性,平均性能优于现有反向优化方法达7.0%。
English: This paper introduces Forward Propagation Refinement (FPR) for Vision Transformers, which enhances adversarial transferability by diversifying attention maps and stabilizing token embeddings, outperforming existing backward-based methods by up to 7.0% on average.
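A minimal sketch of the momentum idea behind MTE, assuming a simple exponential moving average over token embeddings and an equal-weight blend back into the forward pass; this is an illustrative simplification, not the authors' exact formulation.

```python
import torch

class MomentumTokenEmbedding:
    """Hypothetical sketch: accumulate historical token embeddings with momentum
    and blend them back in to stabilize the forward updates."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.history = None  # running average of token embeddings

    def __call__(self, tokens):  # tokens: (B, N, D) output of an Attention or MLP block
        if self.history is None:
            self.history = tokens.detach()
        else:
            self.history = self.momentum * self.history + (1 - self.momentum) * tokens.detach()
        # Equal-weight blend (assumed) of current tokens and accumulated history.
        return 0.5 * tokens + 0.5 * self.history

mte = MomentumTokenEmbedding()
out = mte(torch.randn(4, 197, 768))
print(out.shape)  # torch.Size([4, 197, 768])
```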

Authors:Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, Zhen Lei
Title: Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
Abstract:
Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
Chinese: RAM框架通过LLA恢复局部语义和KCOT优化区域与标签匹配,有效解决了开放词汇多标签识别中的关键难题,在多个数据集上实现了最先进的性能。
English: The RAM framework addresses open-vocabulary multi-label recognition challenges by introducing LLA to restore local semantics and KCOT to optimize region-label matching, achieving state-of-the-art results across multiple datasets.
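Since KCOT casts region-label matching as an optimal transport problem, a generic entropic OT solver (Sinkhorn iterations with uniform marginals) illustrates the basic machinery; the cost definition and any knowledge constraints here are placeholders, not the paper's formulation.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropic optimal transport via Sinkhorn iterations with uniform marginals.

    cost: (R, L) matching cost between R image regions and L candidate labels.
    Returns a transport plan of the same shape.
    """
    R, L = cost.shape
    K = torch.exp(-cost / eps)
    a = torch.full((R,), 1.0 / R)  # row marginal (regions)
    b = torch.full((L,), 1.0 / L)  # column marginal (labels)
    u = torch.full((R,), 1.0 / R)
    v = torch.full((L,), 1.0 / L)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]

# Toy example: 4 regions, 3 candidate labels; cost could be 1 - cosine similarity (assumed).
cost = torch.rand(4, 3)
plan = sinkhorn(cost)
print(plan.sum().item())  # ~1.0, a normalized assignment of regions to labels
```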

Authors:Weixiao Gao, Liangliang Nan, Hugo Ledoux
Title: SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes
Abstract:
Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes - offering richer spatial representation - remain underexplored. This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km² with 21 classes. The dataset was created using our own annotation tool, which supports both face- and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. Our project page is available at https://tudelft3d.github.io/SUMParts/.

Authors:Yuanchao Yue, Hui Yuan, Zhengxin Li, Shuai Li, Wei Zhang
Title: EEPNet-V2: Patch-to-Pixel Solution for Efficient Cross-Modal Registration between LiDAR Point Cloud and Camera Image
Abstract:
The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and needs an external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristics of LiDAR point clouds effectively but also bridges the domain gap between the point cloud and the image. Moreover, to tackle the challenges of cross-modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of the LiDAR point cloud. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve high accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Experimental results demonstrate that the proposed method achieves real-time performance and extremely high registration accuracy. Specifically, on the KITTI dataset, our model achieves a registration accuracy of over 99%. Our code is released at: https://github.com/ESRSchao/EEPNet-V2.
中文: 该框架通过将激光雷达点云投影为二维表示与相机图像匹配,采用多尺度特征提取和像素块匹配网络,在KITTI数据集上实现了实时处理性能并达到超过99%的配准精度。
English: The proposed framework addresses cross-modal registration challenges by projecting LiDAR point clouds into 2D representations for matching with camera images, achieving real-time performance and over 99% accuracy on the KITTI dataset through multi-scale feature extraction and patch-to-pixel matching networks.
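One common way to project a LiDAR point cloud into a 2D representation, as the framework above requires, is a spherical (range-image) projection; the sketch below uses typical 64-beam parameters as assumptions and is not the authors' projection scheme.

```python
import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Hypothetical sketch: project a LiDAR point cloud (N, 3) onto a 2D range image.

    FOV parameters are illustrative values typical of a 64-beam sensor.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)            # horizontal angle
    pitch = np.arcsin(z / depth)      # vertical angle
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W                                # column index
    v = (1.0 - (pitch - fov_down_r) / (fov_up_r - fov_down_r)) * H   # row index
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)
    range_image = np.zeros((H, W), dtype=np.float32)
    range_image[v, u] = depth          # later points simply overwrite earlier ones
    return range_image

pts = np.random.randn(2048, 3) * 10.0
img = spherical_projection(pts)
print(img.shape)  # (64, 1024)
```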

Authors:Ruowen Zhao, Junliang Ye, Zhengyi Wang, Guangce Liu, Yiwen Chen, Yikai Wang, Jun Zhu
Title: DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
Abstract:
Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: https://zhaorw02.github.io/DeepMesh/
中文: DeepMesh提出了一种创新框架,通过结合高效预训练与强化学习优化,实现了基于点云和图像的精细三维网格生成,在几何精度和视觉质量上均超越现有先进方法。
English: DeepMesh introduces a novel framework that enhances 3D mesh generation through efficient pre-training with advanced tokenization and reinforcement learning for human preference alignment, producing detailed and topologically accurate meshes from point clouds and images.
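For reference, the standard Direct Preference Optimization objective that DeepMesh adopts for preference alignment can be written compactly as below; the sequence log-likelihoods are placeholder tensors here, and how they are computed over mesh tokens is the paper's, not this sketch's, concern.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss over preference pairs.

    Each argument is the summed log-likelihood of a token sequence under the policy
    or the frozen reference model; beta is an assumed temperature.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of 8 preference pairs.
loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
print(loss.item())
```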

Authors:Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, Conghui He
Title: LEGION: Learning to Ground and Explain for Synthetic Image Detection
Abstract:
The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.
中文: 本文提出了包含精细标注的合成图像数据集SynthScars,并开发了多模态框架LEGION,该框架不仅能出色地检测和解释图像伪影,还能通过优化流程指导生成更逼真的图像。
English: This paper introduces SynthScars, a comprehensive dataset with fine-grained annotations of synthetic images, and proposes LEGION, a multimodal framework that excels in detecting and explaining image artifacts while also guiding the generation of more realistic images through refinement pipelines.

Authors:Zechuan Li, Hongshan Yu, Yihao Ding, Jinhao Qiao, Basim Azam, Naveed Akhtar
Title: GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector
Abstract:
We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Code will be available at https://github.com/ZechuanLi/GO-N3RDet.
中文: GO-N3RDet是一种创新的多视角3D物体检测器,通过神经辐射场与场景几何优化相结合,并采用独特的体素增强和采样机制,实现了最先进的检测性能。
English: GO-N3RDet is a novel multi-view 3D object detector that integrates neural radiance fields with scene geometry optimization, achieving state-of-the-art performance through specialized voxel enhancement and sampling mechanisms.

Authors:Àlex Pujol Vidal, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund
Title: Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU
Abstract:
Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC
中文: 本文提出一种针对双曲对比学习模型的机器遗忘方法,通过双曲空间特有的技术重组语义层次结构,在保持保留概念性能的同时实现了更优越的概念消除效果。
English: This paper introduces a machine unlearning method for hyperbolic contrastive learning models, demonstrating superior concept removal through hyperbolic-specific techniques that reorganize semantic hierarchies while maintaining performance on retained concepts.

Authors:Yang Li, Soumya Snigdha Kundu, Maxence Boels, Toktam Mahmoodi, Sebastien Ourselin, Tom Vercauteren, Prokar Dasgupta, Jonathan Shapey, Alejandro Granados
Title: UltraFlwr -- An Efficient Federated Medical and Surgical Object Detection Framework
Abstract:
Object detection shows promise for medical and surgical applications such as cell counting and tool tracking. However, it faces multiple real-world edge deployment challenges, including limited high-quality annotated data, data sharing restrictions, and computational constraints. In this work, we introduce UltraFlwr, a framework for federated medical and surgical object detection. By leveraging Federated Learning (FL), UltraFlwr enables decentralized model training across multiple sites without sharing raw data. To further enhance UltraFlwr's efficiency, we propose YOLO-PA, a set of novel Partial Aggregation (PA) strategies specifically designed for YOLO models in FL. YOLO-PA significantly reduces communication overhead by up to 83% per round while maintaining performance comparable to Full Aggregation (FA) strategies. Our extensive experiments on BCCD and m2cai16-tool-locations datasets demonstrate that YOLO-PA not only provides better client models compared to client-wise centralized training and FA strategies, but also facilitates efficient training and deployment across resource-constrained edge devices. Further, we also establish one of the first benchmarks in federated medical and surgical object detection. This paper advances the feasibility of training and deploying detection models on the edge, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. UltraFlwr is publicly available at https://github.com/KCL-BMEIS/UltraFlwr.
中文:UltraFlwr提出了一种用于医疗目标检测的联邦学习框架,可在不共享数据的情况下实现分布式训练,其YOLO-PA策略将通信成本降低83%的同时保持与完全聚合相当的性能。
English: UltraFlwr introduces a federated learning framework for medical object detection that enables decentralized training without data sharing, while its YOLO-PA strategy reduces communication costs by 83% and maintains performance comparable to full aggregation.
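A minimal sketch of partial aggregation in federated learning, assuming the server averages only a named subset of parameter tensors (for example, detection-head weights) across clients; the key selection and uniform weighting are illustrative assumptions, not the YOLO-PA strategies themselves.

```python
import copy
import torch

def partial_aggregate(global_state, client_states, keys_to_aggregate):
    """Hypothetical sketch: average only a chosen subset of parameter tensors
    across clients and keep the remaining parameters unchanged.

    global_state: dict of name -> tensor; client_states: list of such dicts.
    """
    new_state = copy.deepcopy(global_state)
    for name in keys_to_aggregate:
        stacked = torch.stack([c[name].float() for c in client_states], dim=0)
        new_state[name] = stacked.mean(dim=0)
    return new_state

# Toy example with two clients; only the (assumed) "head.weight" tensor is aggregated.
g = {"backbone.weight": torch.zeros(4, 4), "head.weight": torch.zeros(2, 4)}
clients = [{"backbone.weight": torch.randn(4, 4), "head.weight": torch.randn(2, 4)} for _ in range(2)]
agg = partial_aggregate(g, clients, ["head.weight"])
print(agg["head.weight"].shape)  # torch.Size([2, 4])
```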

Authors:Joost Luijmes, Alexander Gielisse, Roman Knyazhitskiy, Jan van Gemert
Title: ARC: Anchored Representation Clouds for High-Resolution INR Classification
Abstract:
Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP architecture used by current INRs, which lacks mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data, which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at https://github.com/JLuij/anchored_representation_clouds.
中文: 提出的ARC架构通过在图像空间中引入局部锚定潜在向量,克服了现有隐式神经表示的局限,实现了跨分辨率的最先进图像分类和更强的变换鲁棒性。
English: The proposed ARC architecture introduces locally anchored latent vectors in image-space to overcome the limitations of current implicit neural representations, achieving state-of-the-art image classification across resolutions and enhanced robustness to transformations.
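A minimal sketch of locally anchored latents, assuming a coarse learnable grid interpolated at query coordinates and decoded by a small MLP; the grid size, latent dimension, and decoder are illustrative placeholders, not the ARC architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchoredINR(nn.Module):
    """Hypothetical sketch: a coarse grid of latent vectors is anchored in image space
    and bilinearly interpolated at each query coordinate, giving the decoder a local
    code instead of relying on one global MLP."""
    def __init__(self, grid=16, dim=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, dim, grid, grid) * 0.01)
        self.decoder = nn.Sequential(nn.Linear(dim + 2, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, coords):  # coords: (N, 2) in [-1, 1]
        grid = coords.view(1, -1, 1, 2)                                 # (1, N, 1, 2)
        local = F.grid_sample(self.latents, grid, align_corners=True)   # (1, dim, N, 1)
        local = local.squeeze(0).squeeze(-1).t()                        # (N, dim)
        return self.decoder(torch.cat([local, coords], dim=1))          # RGB per coordinate

inr = AnchoredINR()
xy = torch.rand(1024, 2) * 2 - 1
rgb = inr(xy)
print(rgb.shape)  # torch.Size([1024, 3])
```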

Authors:Xing He, Zhe Zhu, Liangliang Nan, Honghua Chen, Jing Qin, Mingqiang Wei
Title: PointSFDA: Source-free Domain Adaptation for Point Cloud Completion
Abstract:
Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed PointSFDA. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at https://github.com/Starak-x/PointSFDA.
中文摘要:本文提出PointSFDA框架,通过粗到精的知识蒸馏迁移全局几何信息,并采用自监督局部掩码一致性训练学习目标域局部几何,有效提升了点云补全在跨域场景中的性能。
English Summary: The paper introduces PointSFDA, a source-free domain adaptation framework that enhances point cloud completion by transferring global geometry knowledge through coarse-to-fine distillation and learning local geometry via self-supervised consistency training, significantly improving cross-domain performance.

Authors:Nikola Đukić, Tim Lebailly, Tinne Tuytelaars
Title: Object-Centric Pretraining via Target Encoder Bootstrapping
Abstract:
Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, targets must remain frozen throughout the training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model's encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping, a self-distillation setup for training object-centric models from scratch, on real-world data, for the first time ever. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, thus explicitly being enriched with object-centric inductive biases introduced by slot attention while removing the upper bound on performance present in other models. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo.
Chinese: OCEBO方法提出了一种自蒸馏框架,通过将目标编码器更新为以目标为中心的模型的指数移动平均值,首次实现了在真实世界数据上从头训练以目标为中心的模型,克服了固定目标带来的性能限制同时保持了目标为中心的归纳偏差。
English: The proposed OCEBO method introduces a self-distillation framework that enables training object-centric models from scratch on real-world data by updating the target encoder as an exponential moving average of the object-centric model, overcoming performance limitations of frozen targets while maintaining object-centric inductive biases.
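The EMA target-encoder update described above is a standard operation; a minimal PyTorch sketch follows, with the momentum value assumed for illustration.

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, online_encoder, momentum=0.996):
    """Update the target encoder as an exponential moving average of the online
    (object-centric) encoder, as the abstract above describes."""
    for t_param, o_param in zip(target_encoder.parameters(), online_encoder.parameters()):
        t_param.mul_(momentum).add_(o_param.detach(), alpha=1.0 - momentum)

# Toy usage with two identical small encoders.
online = torch.nn.Linear(16, 16)
target = torch.nn.Linear(16, 16)
target.load_state_dict(online.state_dict())
ema_update(target, online)
```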

Authors:Zinqin Huang, Gu Wang, Chenyangguang Zhang, Ruida Zhang, Xiu Li, Xiangyang Ji
Title: GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation
Abstract:
Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression method, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.
中文: 本文提出GIVEPose框架,通过引入类内无差异共识坐标表示解决NOCS映射在类别级物体姿态估计中的不足,在合成与真实数据集上显著优于当前最先进的RGB方法。
English: This paper introduces GIVEPose, a framework that overcomes the limitations of NOCS maps in category-level object pose estimation by proposing an Intra-class Variation-Free Consensus map and achieves superior performance compared to existing RGB-based methods.

Authors:Amir Hamza, Andrea Caraffa, Davide Boscaini, Fabio Poiesi
Title: Distilling 3D distinctive local descriptors for 6D pose estimation
Abstract:
Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: https://tev-fbk.github.io/dGeDi/
中文摘要:本文提出一种知识蒸馏框架,通过训练高效学生模型来复现GeDi的局部描述符,在保持6D姿态估计竞争力的同时大幅降低推理时间,使零样本姿态估计更接近实时应用。
English Summary: This paper introduces a knowledge distillation framework that trains an efficient student model to replicate GeDi's local descriptors, achieving competitive 6D pose estimation performance while drastically reducing inference time for real-time applications.

Authors:Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang
Title: When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
Abstract:
The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments of T-CoRe consistently present superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available at https://github.com/yafeng19/T-CORE.
中文: T-CoRe框架通过三明治采样策略降低时序不确定性,并采用自蒸馏架构在潜在空间恢复表征,有效解决了视频自监督学习中的关键挑战,在多项下游任务中表现优异。
English: The proposed T-CoRe framework tackles self-supervised video learning challenges by introducing sandwich sampling to reduce temporal uncertainty and a self-distillation branch for latent-space representation recovery, achieving superior performance across multiple downstream tasks.
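A toy sketch of a sandwich-style sampling rule, assuming two auxiliary frames that bracket the target clip from both sides; the indices and margins are illustrative only, not the paper's sampling parameters.

```python
import torch

def sandwich_sample(num_frames, clip_len=4, margin=2, generator=None):
    """Hypothetical sketch of sandwich sampling: pick a target clip and two auxiliary
    frames that bracket it from both sides, squeezing reconstruction uncertainty."""
    start = int(torch.randint(margin, num_frames - clip_len - margin, (1,), generator=generator))
    target = list(range(start, start + clip_len))
    aux_before, aux_after = start - margin, start + clip_len + margin - 1
    return aux_before, target, aux_after

print(sandwich_sample(32))  # e.g., (4, [6, 7, 8, 9], 11)
```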

Authors:Imanol G. Estepa, Jesús M. Rodríguez-de-Vera, Ignacio Sarasúa, Bhalaji Nagarajan, Petia Radeva
Title: Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
Abstract:
While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.
中文: Sorcen提出了一种结合对比与重建目标的统一自监督学习框架,无需外部标记器或额外图像裁剪,在多个基准测试中实现了更高的效率和性能突破。
English: Sorcen introduces a unified self-supervised learning framework with a synergic contrastive-reconstruction objective, eliminating the need for external tokenizers and additional image crops while outperforming previous methods in efficiency and performance across multiple benchmarks.

Authors:Suhyeon Lee, Kwanyoung Kim, Jong Chul Ye
Title: Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation
Abstract:
Unpaired image-to-image translation has seen significant progress since the introduction of CycleGAN. However, methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications due to their iterative sampling nature. To address this challenge, we propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), which enables single-step bidirectional unpaired translation without using adversarial loss. IBCD extends consistency distillation by using a diffusion implicit bridge model that connects PF-ODE trajectories between distributions. Additionally, we introduce two key improvements: 1) distribution matching for consistency distillation and 2) adaptive weighting method based on distillation difficulty. Experimental results demonstrate that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Project page available at https://hyn2028.github.io/project_page/IBCD/index.html

Authors:Jianbo Zhao, Taiyu Ban, Zhihao Liu, Hangning Zhou, Xiyang Wang, Qibin Zhou, Hailong Qin, Mu Yang, Lei Liu, Bin Li
Title: DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling
Abstract:
Accurate and efficient modeling of agent interactions is essential for trajectory generation, the core of autonomous driving systems. Existing methods (scene-centric, agent-centric, and query-centric frameworks) each present distinct advantages and drawbacks, creating an impossible triangle among accuracy, computational time, and memory efficiency. To break this limitation, we propose Directional Rotary Position Embedding (DRoPE), a novel adaptation of Rotary Position Embedding (RoPE), originally developed in natural language processing. Unlike traditional relative position embedding (RPE), which introduces significant space complexity, RoPE efficiently encodes relative positions without explicitly increasing complexity but faces inherent limitations in handling angular information due to periodicity. DRoPE overcomes this limitation by introducing a uniform identity scalar into RoPE's 2D rotary transformation, aligning rotation angles with realistic agent headings to naturally encode relative angular information. We theoretically analyze DRoPE's correctness and efficiency, demonstrating its capability to simultaneously optimize trajectory generation accuracy, time complexity, and space complexity. Empirical comparisons with various state-of-the-art trajectory generation models confirm DRoPE's strong performance and significantly reduced space complexity, indicating both theoretical soundness and practical effectiveness. The video documentation is available at https://drope-traj.github.io/.
Chinese: 提出的方向性旋转位置嵌入(DRoPE)通过改进角度编码优化了自动驾驶中的轨迹生成,在精度、计算时间和内存效率上均优于现有方法,理论和实证评估均验证了其优越性能。
English: The proposed Directional Rotary Position Embedding (DRoPE) enhances trajectory generation in autonomous driving by optimizing accuracy, computational time, and memory efficiency through improved angular encoding, outperforming existing methods in both theoretical analysis and empirical evaluations.
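To illustrate the rotary mechanism DRoPE builds on, the sketch below applies a RoPE-style 2D rotation whose angle is each agent's heading, so that dot-product attention between two agents depends only on their relative heading; the uniform identity scalar and the full DRoPE construction are not reproduced here, and the shared rotation frequency is an assumption.

```python
import torch

def rotate_by_heading(x, headings):
    """Hypothetical sketch: rotate each 2D pair of feature dimensions by the agent's
    heading angle, RoPE-style. x: (A, D) with even D; headings: (A,) in radians."""
    cos = torch.cos(headings)[:, None]  # (A, 1)
    sin = torch.sin(headings)[:, None]
    x1, x2 = x[:, 0::2], x[:, 1::2]     # split feature dims into 2D pairs
    rot = torch.empty_like(x)
    rot[:, 0::2] = x1 * cos - x2 * sin
    rot[:, 1::2] = x1 * sin + x2 * cos
    return rot

q = torch.randn(5, 8)                   # 5 agents, 8-dim queries
theta = torch.rand(5) * 2 * torch.pi    # agent headings
q_rot = rotate_by_heading(q, theta)
print(q_rot.shape)  # torch.Size([5, 8])
```

Because each pair is rotated by the agent's own heading, the dot product between a rotated query and a rotated key depends only on the heading difference, which is the relative angular information the abstract refers to.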

Authors:Jin Wang, Chenghui Lv, Xian Li, Shichao Dong, Huadong Li, Kelu Yao, Chao Li, Wenqi Shao, Ping Luo
Title: Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Abstract:
Recently, the rapid development of AIGC has significantly boosted the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, and beyond. To detect the increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to their impressive performance on a wide range of multimodal tasks. However, the field still lacks a benchmark designed to comprehensively assess LVLMs' discerning capabilities on forgery media. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, location and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models: GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at https://Forensics-Bench.github.io/.
中文: AIGC的快速发展加剧了虚假媒体的传播,为此推出了Forensics-Bench基准测试,通过多维度选择题全面评估大型视觉语言模型在伪造检测中的识别、定位和推理能力。
English: The rapid advancement of AIGC has escalated the spread of diverse fake media, prompting the development of Forensics-Bench, a comprehensive benchmark to evaluate Large Vision Language Models' capabilities in detecting forgeries through multi-choice questions across various dimensions.

Authors:Saad Lahlali, Sandra Kara, Hejer Ammar, Florian Chabot, Nicolas Granger, Hervé Le Borgne, Quoc-Cuong Pham
Title: xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion
Abstract:
Object discovery, which refers to the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, it remains under-explored in 3D data, where existing approaches rely exclusively on 3D motion despite its challenges. In this paper, we present a novel framework that leverages 2D-motion-based advances in 2D object discovery, exploiting the flexibility and generalizability of such motion cues to bridge the gap between the 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). Notably, our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets with gains ranging from +8.7 to +15.1 in F1@50 score. The code is available at https://github.com/CEA-LIST/xMOD
中文: 本文提出了一种新颖框架,通过利用2D运动线索来弥合2D与3D物体发现之间的鸿沟,借助跨模态训练和场景补全技术实现了显著的性能提升。
English: This paper introduces a novel framework that bridges 2D and 3D object discovery by leveraging 2D motion cues, achieving significant performance improvements through cross-modal training and scene completion techniques.

Authors:Yunwei Lan, Zhigao Cui, Chang Liu, Jialun Peng, Nian Wang, Xin Luo, Dong Liu
Title: Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training
Abstract:
Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world priors. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit the diffusion prior for real-world image dehazing and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion priors as bijective mapping learners within CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistical information about real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code is available at https://github.com/ywxjm/Diff-Dehazer.
中文总结:Diff-Dehazer框架通过将扩散模型先验与物理知识融入CycleGAN,利用图像和文本模态的退化去除机制,在非配对图像去雾任务中实现了卓越性能。
English Summary: The proposed Diff-Dehazer framework leverages diffusion model priors integrated with physical knowledge within CycleGAN to achieve superior unpaired image dehazing by effectively removing degradation across image and text modalities.

Authors:Michael Neri, Federica Battisti
Title: Low-Complexity Patch-based No-Reference Point Cloud Quality Metric exploiting Weighted Structure and Texture Features
Abstract:
During the compression, transmission, and rendering of point clouds, various artifacts are introduced, affecting the quality perceived by the end user. However, evaluating the impact of these distortions on the overall quality is a challenging task. This study introduces PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework. It evaluates point cloud quality by analyzing individual patches, integrating local and global features to predict the Mean Opinion Score. In summary, the process involves extracting features from patches, combining them, and using correlation weights to predict the overall quality. This approach allows us to assess point cloud quality without relying on a reference point cloud, making it particularly useful in scenarios where reference data is unavailable. Experimental tests on three state-of-the-art datasets show good prediction capabilities of PST-PCQA, through the analysis of different feature pooling strategies and its ability to generalize across different datasets. The ablation study confirms the benefits of evaluating quality on a patch-by-patch basis. Additionally, PST-PCQA's light-weight structure, with a small number of parameters to learn, makes it well-suited for real-time applications and devices with limited computational capacity. For reproducibility purposes, we made code, model, and pretrained weights available at https://github.com/michaelneri/PST-PCQA.
中文: 本研究提出PST-PCQA无参考质量评估方法,通过分析点云块并结合局部与全局特征来预测质量,在多个数据集上表现优异,其轻量级结构适用于实时应用场景。
English: This study presents PST-PCQA, a no-reference quality metric that evaluates point clouds by analyzing patches with local and global features, demonstrating effective performance across datasets with a lightweight structure suitable for real-time applications.

Authors:Yaxiong Chen, Junjian Hu, Chunlei Li, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks
Abstract:
Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at https://github.com/MedAITech/TCMN.
中文:研究者提出了一种时序对比记忆网络,用于一次性医学视频对象分割,仅需首帧标注即可完成全视频分割,在多样化医学数据集上实现了领先性能,显著减轻了标注负担。
English: The authors propose a temporal contrastive memory network for one-shot medical video object segmentation, which utilizes a single annotated frame to segment entire videos and demonstrates state-of-the-art performance on a diverse medical dataset, effectively reducing annotation burdens.
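A minimal sketch of a temporal contrastive objective in the spirit of the memory bank described above, where each frame's next neighbor is treated as the positive and more distant frames as negatives; the exact alignment scheme and memory design are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(frame_embs, tau=0.1):
    """Hypothetical sketch: for each frame, the next frame is the positive and all
    other frames act as negatives. frame_embs: (T, D)."""
    z = F.normalize(frame_embs, dim=1)
    sim = z @ z.t() / tau                        # (T, T) scaled cosine similarities
    T = z.shape[0]
    logits = sim[:-1]                            # anchors are frames 0..T-2
    mask = torch.eye(T, dtype=torch.bool)[:-1]   # exclude each frame's self-similarity
    logits = logits.masked_fill(mask, float("-inf"))
    targets = torch.arange(1, T)                 # positive index: the adjacent next frame
    return F.cross_entropy(logits, targets)

loss = temporal_contrastive_loss(torch.randn(8, 128))
print(loss.item())
```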

Authors:Zihan Cao, Yu Zhong, Liang-Jian Deng
Title: Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening
Abstract:
Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Codes are available at https://github.com/294coder/PAN-OTFM.
Chinese: 本文提出的最优传输流匹配(OTFM)框架利用非平衡最优传输实现单步高质量全色锐化,在保持计算效率的同时超越了现有方法的性能。
English: This paper introduces the Optimal Transport Flow Matching (OTFM) framework, which leverages unbalanced optimal transport to achieve high-quality pansharpening in a single sampling step, outperforming existing methods while maintaining computational efficiency.
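For orientation, a generic rectified-flow-matching sketch is shown below: a velocity field is trained on straight interpolation paths and sampling reduces to a single Euler step. The UOT dual formulation and task-specific regularization of OTFM are not included, and all dimensions and the toy network are assumptions.

```python
import torch
import torch.nn as nn

# Minimal flow-matching sketch (not the paper's UOT formulation):
# train a velocity field on straight paths x_t = (1 - t) * x0 + t * x1,
# then produce an output in a single Euler step at inference.
velocity = nn.Sequential(nn.Linear(33, 128), nn.SiLU(), nn.Linear(128, 32))

def fm_loss(x0, x1):
    t = torch.rand(x0.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    target_v = x1 - x0                                # constant velocity along the straight path
    pred_v = velocity(torch.cat([x_t, t], dim=1))
    return ((pred_v - target_v) ** 2).mean()

def one_step_sample(x0):
    t0 = torch.zeros(x0.shape[0], 1)
    return x0 + velocity(torch.cat([x0, t0], dim=1))  # single Euler step from t=0 to t=1

x0 = torch.randn(16, 32)   # source features, e.g., upsampled multispectral input (toy dims)
x1 = torch.randn(16, 32)   # target pansharpened features (toy dims)
print(fm_loss(x0, x1).item(), one_step_sample(x0).shape)
```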

Authors:Yifan Li, Shuai Yang, Jiaying Liu
Title: Language-based Image Colorization: A Benchmark and Beyond
Abstract:
Image colorization aims to bring colors back to grayscale images. Automatic image colorization methods, which require no additional guidance, struggle to generate high-quality images due to color ambiguity and provide limited user controllability. Thanks to the emergence of cross-modality datasets and models, language-based colorization methods have been proposed to fully utilize the efficiency and flexibility of text descriptions to guide colorization. In view of the lack of a comprehensive review of language-based colorization literature, we conduct a thorough analysis and benchmarking. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and point out their core challenge of cross-modal alignment. We further divide these methods into two categories: one attempts to train a cross-modality network from scratch, while the other utilizes a pre-trained cross-modality model to establish textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on a distilled diffusion model. Extensive experiments demonstrate that our simple baseline produces better results than previous, more complex methods with a 14x speedup. To the best of our knowledge, this is the first comprehensive review and benchmark of the language-based image colorization field, providing meaningful insights for the community. The code is available at https://github.com/lyf1212/Color-Turbo.
中文摘要:本文首次对基于语言的图像上色方法进行全面综述与基准测试,提出了一种简单高效的蒸馏扩散模型,在实现14倍加速的同时超越了先前复杂方法的性能表现。
English Summary: This paper presents the first comprehensive review and benchmark of language-based image colorization methods, proposing a simple yet effective distilled diffusion model that outperforms previous approaches with significantly faster processing.

Authors:Tingxiu Chen, Yilei Shi, Zixuan Zheng, Bingcong Yan, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models
Abstract:
Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at https://github.com/MedAITech/U_I2V.
中文摘要:本研究提出一种潜在动态扩散模型(LDDM),通过将静态超声图像转化为动态视频序列来解决数据集稀缺问题,结合真实数据训练能显著提升视频分类模型的诊断性能。
English Summary: The study introduces a latent dynamic diffusion model (LDDM) that synthesizes realistic ultrasound videos from static images to address dataset scarcity, significantly enhancing video classification performance when combined with real data.

Authors:Zixuan Zheng, Yilei Shi, Chunlei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning
Abstract:
Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at https://github.com/MedAITech/RAB.
中文: 本研究提出一个两阶段框架用于少样本医学视频目标分割,先利用标注图像训练模型,再通过时空一致性再学习和跨模型约束,在极少量视频标注下实现优越的分割性能。
English: This study introduces a two-phase framework for few-shot medical video object segmentation that first trains on labeled images and then applies spatiotemporal consistency relearning with cross-model constraints to achieve superior performance with minimal video annotations.

Authors:Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, Basura Fernando
Title: Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
Abstract:
We introduce PKR-QA (Procedural Knowledge Reasoning Question Answering), a new benchmark for question answering over procedural tasks that require structured reasoning. PKR-QA is constructed semi-automatically using a procedural knowledge graph (PKG), which encodes task-specific knowledge across diverse domains. The PKG is built by curating and linking information from the COIN instructional video dataset and the ontology, enriched with commonsense knowledge from ConceptNet and structured outputs from Large Language Models (LLMs), followed by manual verification. To generate question-answer pairs, we design graph traversal templates where each template is applied systematically over the PKG. To enable interpretable reasoning, we propose a neurosymbolic approach called Knowledge Module Learning (KML), which learns procedural relations via neural modules and composes them for structured reasoning with LLMs. Experiments demonstrate that this paradigm improves reasoning performance on PKR-QA and enables step-by-step reasoning traces that facilitate interpretability. Code and dataset will be released soon at https://github.com/LUNAProject22/KML.
中文: 我们推出了PKR-QA,这是一个基于融合常识与大型语言模型数据的知识图谱半自动构建的程序性问答基准,并提出了KML这一神经符号方法,有效提升了推理性能与可解释性。
English: We introduce PKR-QA, a benchmark for procedural question answering built semi-automatically from a knowledge graph enriched with commonsense and LLM data, and propose KML, a neurosymbolic method that enhances reasoning performance and interpretability.

Authors:Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang
Title: Generating Multimodal Driving Scenes via Next-Scene Prediction
Abstract:
Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action to maintain coherence between these modalities. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements. Project page: https://yanhaowu.github.io/UMGen/

Authors:Zihan Cao, Yu Zhong, Ziqi Wang, Liang-Jian Deng
Title: MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
Abstract:
Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at https://github.com/294coder/MMAIF.
中文: 本文提出了一种统一的图像融合框架,通过引入模拟真实退化的流程和潜在空间中的一体化扩散变换器,解决了现有方法的关键局限,实现了多任务、多退化条件下的语言引导融合,并展现出卓越性能。
English: This paper introduces a unified framework for image fusion that addresses key limitations of existing methods by incorporating a degradation pipeline and an all-in-one Diffusion Transformer in latent space, enabling multi-task, multi-degradation processing with language guidance and demonstrating superior performance.

Authors:Tengjin Weng, Jingyi Wang, Wenhao Jiang, Zhong Ming
Title: VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
Abstract:
Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. Code and dataset are available at https://wwwtttjjj.github.io/VisNumBench/.

Authors:Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen
Title: FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/.
中文: 研究者提出了FAVOR-Bench基准,揭示了多模态大语言模型在动作理解上的局限,并通过FAVOR-Train数据集进行微调,有效提升了模型在动作相关任务上的表现。
English: Researchers introduce FAVOR-Bench, a benchmark revealing limitations in multimodal large language models' motion understanding, and FAVOR-Train, a dataset that improves model performance when fine-tuned.

Authors:Gahye Lee, Hyejeong Yoon, Jungeon Kim, Seungyong Lee
Title: Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes
Abstract:
This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.

Authors:Siyuan Yan, Ming Hu, Yiwen Jiang, Xieji Li, Hao Fei, Philipp Tschandl, Harald Kittler, Zongyuan Ge
Title: Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
Abstract:
The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M's potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be publicly available at https://github.com/SiyuanYan1/Derm1M upon acceptance.
Chinese: Derm1M数据集通过提供超过一百万张带有丰富临床背景的图像-文本对,解决了皮肤病学中缺乏大规模视觉-语言数据的问题,并基于此开发的DermLIP模型在多项诊断任务中实现了最先进的性能。
English: The Derm1M dataset addresses the lack of large-scale vision-language data in dermatology by providing over one million image-text pairs with rich clinical context, enabling the development of DermLIP models that achieve state-of-the-art performance across multiple diagnostic tasks.

Authors:Yaofei Duan, Tao Tan, Zhiyuan Zhu, Yuhao Huang, Yuanji Zhang, Rui Gao, Patrick Cheong-Iao Pang, Xinru Gao, Guowei Tao, Xiang Cong, Zhou Li, Lianying Liang, Guangzhi He, Linliang Yin, Xuedong Deng, Xin Yang, Dong Ni
Title: FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis
Abstract:
Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screen for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieved state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.
中文摘要:本研究提出的FetalFlex框架无需异常数据即可生成多平面胎儿超声图像,通过可控合成技术显著提升AI模型训练效果与异常检测能力,在多项评估中表现优异。
English Summary: The study introduces FetalFlex, a novel framework that generates high-quality fetal ultrasound images across multiple planes without abnormal data, enhancing AI model training and diagnostic capabilities through controllable synthesis.

Authors:Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, Weijia Li
Title: Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
Abstract:
With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The dataset and code will be released in: https://github.com/opendatalab/FakeVLM.
Chinese: 针对现有合成图像检测方法的不足,FakeVLM作为多模态模型不仅能有效检测伪造图像,还能提供可解释的自然语言说明,并得到FakeClue数据集的全面支持。
English: To address the limitations of existing synthetic image detection methods, FakeVLM is introduced as a multimodal model that excels in detecting fakes and providing interpretable natural language explanations, supported by the comprehensive FakeClue dataset.

Authors:Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar, Xiangyang Ji, Xu-Cheng Yin
Title: DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework
Abstract:
Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.
中文: 本文提出了DPFlow,一种仅用低分辨率样本训练即可泛化至8K输入的自适应光流架构,并引入Kubric-NK新基准来评估1K至8K分辨率的光流方法,在多个基准测试中取得了最优结果。
English: This paper introduces DPFlow, an adaptive optical flow architecture that generalizes to 8K resolution using only low-resolution training data, and Kubric-NK, a new benchmark for evaluating optical flow methods across resolutions from 1K to 8K, achieving state-of-the-art results on multiple benchmarks.

Authors:Yuhang Liu, Wenjie Zhao, Yunhui Guo
Title: H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection
Abstract:
Task Incremental Learning (TIL) is a specialized form of Continual Learning (CL) in which a model incrementally learns from non-stationary data streams. Existing TIL methodologies operate under the closed-world assumption, presuming that incoming data remains in-distribution (ID). However, in an open-world setting, incoming samples may originate from out-of-distribution (OOD) sources, with their task identities inherently unknown. Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance, selecting suitable thresholds is difficult, hindering real-world deployment, and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). H2ST eliminates the need for threshold selection through hypothesis testing and utilizes feature maps to better exploit model capabilities without excessive dependence on model performance. The proposed hierarchical architecture enables task-level detection with superior performance and lower overhead compared to non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority to the existing methods. Code is available at https://github.com/YuhangLiuu/H2ST.
中文: 本文提出分层双样本检验(H2ST)这一新型持续离群检测方法,通过假设检验避免阈值选择,利用特征映射实现任务级识别,在开放世界任务增量学习场景中展现出优于现有方法的性能与更低开销。
English: This paper introduces Hierarchical Two-sample Tests (H2ST), a novel continual out-of-distribution detection method that addresses challenges in open-world Task Incremental Learning by eliminating threshold selection through hypothesis testing and enabling task-level identification with superior performance and lower overhead.
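The abstract's key point is that hypothesis testing on feature maps removes the need for a hand-tuned OOD score threshold. As a generic illustration (not the released H2ST code), the sketch below runs an RBF-kernel MMD two-sample test with a permutation p-value on pooled features, so the ID/OOD decision reduces to a significance level.

```python
# Illustrative stand-in for a feature-space two-sample test: an RBF-kernel MMD
# statistic with a permutation p-value, so ID/OOD decisions come from a
# significance level rather than a hand-tuned score threshold.
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, gamma: float) -> float:
    """Biased squared MMD between samples x (n, d) and y (m, d) with an RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def two_sample_test(x, y, n_perm: int = 200, seed: int = 0):
    """Permutation test: p-value for 'x and y come from the same distribution'."""
    rng = np.random.default_rng(seed)
    gamma = 1.0 / x.shape[1]                       # simple bandwidth heuristic
    observed = rbf_mmd2(x, y, gamma)
    pooled = np.concatenate([x, y], axis=0)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        px, py = pooled[perm[: len(x)]], pooled[perm[len(x):]]
        count += rbf_mmd2(px, py, gamma) >= observed
    return observed, (count + 1) / (n_perm + 1)

# Toy usage: pooled feature vectors for an ID batch vs. a shifted batch.
rng = np.random.default_rng(0)
id_feats = rng.normal(0.0, 1.0, size=(64, 32))
ood_feats = rng.normal(0.8, 1.0, size=(64, 32))
mmd, p = two_sample_test(id_feats, ood_feats)
print(f"MMD^2={mmd:.4f}, p={p:.3f}")   # a small p-value flags the incoming batch as OOD
```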

Authors:Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, Siyuan Huang
Title: Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
Abstract:
Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic Visual effects (VFX) editing. The project page is available at https://dp-recon.github.io/.
中文: 提出的DP-Recon方法通过引入扩散先验的分数蒸馏采样和可见性引导损失调整,显著提升了稀疏视角下的三维场景重建质量,在改善几何与纹理恢复的同时实现了先进的编辑功能。
English: The proposed DP-Recon method enhances 3D scene reconstruction from sparse views by integrating diffusion priors through Score Distillation Sampling and a visibility-guided loss adjustment, significantly improving geometry and texture recovery while enabling advanced editing capabilities.

Authors:Akram Khatami-Rizi, Ahmad Mahmoudi-Aznaveh
Title: Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution
Abstract:
Single-image super-resolution (SISR) is a fundamental problem in computer vision that aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Although convolutional neural networks (CNNs) have achieved substantial advancements, deeper architectures often introduce excessive parameters, higher memory usage, and computational cost, limiting their applicability on resource-constrained devices. Recent research has thus focused on lightweight architectures that preserve accuracy while reducing complexity. This paper presents the Involution and BSConv Multi-Depth Distillation Network (IBMDN), a lightweight and effective architecture for SISR. The proposed IBMDN comprises Involution and BSConv Multi-Depth Distillation Blocks (IBMDB) and a Contrast and High-Frequency Attention Block (CHFAB). IBMDB employs varying combinations of Involution and BSConv at multiple depths to perform efficient feature extraction while minimizing computational complexity. CHFAB, a lightweight self-attention mechanism, focuses on extracting high-frequency and contrast information to enhance perceptual quality in the reconstructed images. The flexible design of IBMDB enables it to be seamlessly integrated into diverse SISR frameworks, including information distillation, transformer-based, and GAN-based models. Extensive experiments demonstrate that incorporating IBMDB significantly reduces memory usage, parameters, and floating-point operations (FLOPs), while achieving improvements in both pixel-wise accuracy and visual quality. The source code is available at: https://github.com/akramkhatami/IBMDN.
中文: 本文提出IBMDN轻量级单图像超分辨率网络,通过多深度蒸馏块和注意力机制在降低计算资源消耗的同时,有效提升重建图像的像素精度与视觉质量。
English: The paper introduces IBMDN, a lightweight single-image super-resolution network that uses innovative blocks to reduce computational costs while enhancing image quality through efficient feature extraction and attention mechanisms.
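For readers unfamiliar with the building blocks, a BSConv-style unit (a pointwise 1x1 convolution followed by a depthwise convolution) is one of the two operators IBMDB mixes at multiple depths. The sketch below shows only this unit with illustrative sizes; the involution branch and the CHFAB attention block are omitted.

```python
# A minimal BSConv-style unit: pointwise 1x1 convolution followed by a
# depthwise k x k convolution. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class BSConvU(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2, groups=out_ch, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.depthwise(self.pointwise(x))

x = torch.randn(1, 48, 64, 64)          # (batch, channels, height, width)
block = BSConvU(48, 48)
print(block(x).shape)                    # torch.Size([1, 48, 64, 64])
# Compared with a dense 3x3 conv (48*48*9 weights), this uses 48*48 + 48*9 weights.
```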

Authors:Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, Tiancheng Sun
Title: SplatVoxel: History-Aware Novel View Streaming without Temporal Training
Abstract:
We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal artifacts and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: https://19reborn.github.io/SplatVoxel/

Authors:Yi Liao, Yongsheng Gao, Weichuan Zhang
Title: Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer
Abstract:
Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods can not display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wisely accumulating each attention map. The contribution of this work focuses on visualizing the evolution dynamic of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analysis consistently validate the effectiveness and superior capacity of the proposed DAAM for not only interpreting ViT models with the fully-connected layers as the classifier but also self-supervised ViT models. The code is available at https://github.com/ly9802/DynamicAccumulatedAttentionMap.
中文: 本文提出了一种名为DAAM的新型视觉解释方法,通过分解空间特征和计算通道重要性系数,动态可视化Vision Transformer模型内部的注意力流动,从而能够解释监督式和非监督式ViT模型的决策过程。
English: The paper introduces DAAM, a novel visual explanation method that dynamically visualizes attention flow within Vision Transformer models by decomposing spatial features and computing channel importance coefficients, enabling interpretation of both supervised and self-supervised ViTs.
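A loose sketch of the accumulation mechanism: each ViT block contributes an attention map formed by linearly combining its spatial token features with channel importance coefficients, and maps are accumulated block by block to reveal how the final attention region forms. The shapes and the way the coefficients are obtained below are placeholders, not the released DAAM code.

```python
# Hedged sketch of block-wise accumulation: per-block spatial token features are
# weighted by channel importance coefficients and summed across blocks.
import torch

def accumulated_attention_maps(spatial_feats, channel_weights, grid=(14, 14)):
    """
    spatial_feats: list of (num_patches, dim) tensors, one per ViT block.
    channel_weights: list of (dim,) importance coefficients, one per block
        (e.g. from decomposing the classification score in the supervised case).
    Returns a list of accumulated (H, W) maps, one per prefix of blocks.
    """
    h, w = grid
    accumulated, running = [], torch.zeros(h, w)
    for feats, weights in zip(spatial_feats, channel_weights):
        block_map = (feats * weights).sum(dim=-1).reshape(h, w)   # linear combination
        block_map = torch.relu(block_map)
        running = running + block_map
        accumulated.append(running / running.max().clamp(min=1e-8))
    return accumulated

# Toy usage with random stand-ins for 12 blocks of a ViT-B/16 on a 224x224 input.
blocks = 12
feats = [torch.randn(196, 768) for _ in range(blocks)]
weights = [torch.rand(768) for _ in range(blocks)]
maps = accumulated_attention_maps(feats, weights)
print(len(maps), maps[-1].shape)   # 12 torch.Size([14, 14])
```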

Authors:Merkourios Simos, Alberto Silvio Chiappa, Alexander Mathis
Title: Reinforcement learning-based motion imitation for physiologically plausible musculoskeletal motor control
Abstract:
How do humans move? The quest to understand human motion has broad applications in numerous fields, ranging from computer animation and motion synthesis to neuroscience, human prosthetics and rehabilitation. Although advances in reinforcement learning (RL) have produced impressive results in capturing human motion using simplified humanoids, controlling physiologically accurate models of the body remains an open challenge. In this work, we present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control. Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 DoF, we demonstrate that KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, is controllable by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks such as target goal reaching. Importantly, KINESIS generates muscle activity patterns that correlate well with human EMG activity. The physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control theory, which we highlight by investigating Bernstein's redundancy problem in the context of locomotion. Code, videos and benchmarks will be available at https://github.com/amathislab/Kinesis.
中文: 本研究提出的KINESIS无模型框架通过精确的肌肉骨骼系统成功模拟人体运动,不仅能通过自然语言控制生成与肌电信号高度吻合的肌肉活动模式,还为探索运动控制理论提供了新途径。
English: This study introduces KINESIS, a model-free framework that effectively imitates human motion using a detailed musculoskeletal model, demonstrates controllability via natural language, and generates physiologically plausible muscle activity patterns for advancing motor control research.

Authors:Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, Zhengzhong Tu
Title: Can Large Vision Language Models Read Maps Like a Human?
Abstract:
In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based map-based outdoor navigation, curated from complex path finding scenarios. MapBench comprises over 1600 pixel space map path finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and dataset at https://github.com/taco-group/MapBench.
中文:MapBench是首个专为人类可读地图导航设计的数据集,包含1600多个路径规划问题,通过直接和结构化提示方法显著挑战了大型视觉语言模型的空间推理能力。
English: MapBench is the first dataset for human-readable map navigation, featuring over 1600 pathfinding problems that challenge LVLMs' spatial reasoning through both direct and structured prompting methods.

Authors:Sara Sarto, Marcella Cornia, Rita Cucchiara
Title: Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
Abstract:
The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
中文: 本文综述了图像描述评估指标的发展历程与局限性,重点分析了多模态大语言模型生成详细描述带来的挑战,并提出了未来研究方向。
English: This survey comprehensively reviews the evolution and limitations of image captioning evaluation metrics, emphasizing the challenges posed by MLLMs' detailed outputs and proposing future research directions.

Authors:Guowei Wang, Changxing Ding
Title: Effortless Active Labeling for Long-Term Test-Time Adaptation
Abstract:
Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs.
中文: 本文提出了一种轻松主动标注策略,通过特征扰动识别每个批次中最有价值的样本进行标注,并利用动态权重平衡标注与未标注样本对模型优化的影响,在多个数据集上以极低标注成本实现了优于现有方法的性能。
English: This paper introduces an effortless active labeling strategy for long-term test-time adaptation that selects at most one sample per batch for annotation based on feature perturbation and balances optimization impact with dynamic weights, achieving superior performance with minimal annotation costs across multiple datasets.
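One concrete way to read the selection criterion: annotate the single sample in each batch whose prediction is least stable under a small feature perturbation, then balance the labeled and unlabeled losses with dynamic weights. The perturbation form, KL-based instability score, and weighting below are illustrative assumptions rather than the paper's exact formulation.

```python
# Rough sketch of selecting at most one sample per test batch for annotation:
# samples whose predictions shift most under a small feature perturbation are
# treated as lying near the source/target decision boundary.
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_sample_to_label(features: torch.Tensor, classifier, sigma: float = 0.05):
    """features: (B, D) backbone features for one test batch."""
    logits = classifier(features)
    noisy_logits = classifier(features + sigma * torch.randn_like(features))
    p = F.log_softmax(logits, dim=-1)
    q = F.log_softmax(noisy_logits, dim=-1)
    instability = F.kl_div(q, p, log_target=True, reduction="none").sum(-1)  # per sample
    return int(instability.argmax())       # index of the single sample to annotate

def weighted_adaptation_loss(sup_loss, unsup_loss, w_sup: float, w_unsup: float):
    """Dynamic weights balance the one labeled sample against the rest of the batch."""
    return w_sup * sup_loss + w_unsup * unsup_loss

# Toy usage.
classifier = torch.nn.Linear(128, 10)
batch_features = torch.randn(64, 128)
idx = select_sample_to_label(batch_features, classifier)
print(f"annotate sample {idx} in this batch")
```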

Authors:Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour
Title: Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions
Abstract:
Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
中文摘要:本文挑战了联邦学习中标签分布偏斜在捕捉计算机视觉任务中真实数据异质性方面的不足,提出了基于嵌入的异质性概念和新基准,揭示了现有文献对联邦学习性能的高估。
English Summary: This paper challenges the adequacy of label distribution skew for capturing real-world data heterogeneity in federated learning for computer vision tasks, proposing embedding-based heterogeneity and new benchmarks that reveal performance overestimations in existing literature.
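The partitioning recipe in the abstract (cluster task-specific embeddings, then split each cluster across clients with a Dirichlet distribution) can be sketched directly; the embedding source, cluster count, and concentration parameter below are placeholder choices.

```python
# Minimal sketch of embedding-based non-IID partitioning: cluster pre-trained
# embeddings, then allot each cluster's samples across clients with a Dirichlet
# distribution (alpha controls how skewed the split is).
import numpy as np
from sklearn.cluster import KMeans

def embedding_dirichlet_partition(embeddings, num_clients=10, num_clusters=20,
                                  alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    cluster_ids = KMeans(n_clusters=num_clusters, random_state=seed,
                         n_init=10).fit_predict(embeddings)
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_clusters):
        idx = np.where(cluster_ids == c)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = np.split(idx, (np.cumsum(proportions)[:-1] * len(idx)).astype(int))
        for client, part in enumerate(splits):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy usage with random stand-ins for task-specific embeddings of 5,000 images.
embeddings = np.random.default_rng(0).normal(size=(5000, 512)).astype(np.float32)
parts = embedding_dirichlet_partition(embeddings)
print([len(p) for p in parts])   # per-client sample counts, skewed by alpha
```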

Authors:Huaqiu Li, Xiaowan Hu, Haoqian Wang
Title: Interpretable Unsupervised Joint Denoising and Enhancement for Real-World low-light Scenarios
Abstract:
Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop a retinal decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.
中文: 本文提出了一种无需参考图像的无监督、可解释性低光增强与去噪联合框架,通过频域分解和隐式引导表示策略,有效解决真实场景中复杂的图像退化问题。
English: This paper introduces an unsupervised, interpretable framework for joint denoising and low-light enhancement in real-world images, utilizing frequency domain decomposition and implicit-guided representation to effectively address complex degradations without reference images.
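The DCT-based frequency decomposition in sRGB space can be illustrated with a per-channel 2D DCT that keeps a low-frequency block and treats the residual as the high-frequency component; the cutoff below is an arbitrary illustrative choice, not the paper's.

```python
# Sketch of DCT-based frequency decomposition in sRGB space: a 2D DCT per
# channel, a low-frequency crop kept, and the remainder treated as the
# high-frequency component.
import numpy as np
from scipy.fft import dctn, idctn

def dct_decompose(image: np.ndarray, keep: int = 32):
    """image: (H, W, 3) float array in [0, 1]. Returns (low_freq, high_freq)."""
    low = np.zeros_like(image)
    for c in range(image.shape[2]):
        coeffs = dctn(image[..., c], norm="ortho")
        mask = np.zeros_like(coeffs)
        mask[:keep, :keep] = 1.0                 # keep only low-frequency coefficients
        low[..., c] = idctn(coeffs * mask, norm="ortho")
    high = image - low                           # residual carries edges, texture, noise
    return low, high

# Toy usage on a random stand-in for a low-light sRGB image.
img = np.random.default_rng(0).random((256, 256, 3))
low, high = dct_decompose(img)
print(low.shape, high.shape, float(np.abs(img - (low + high)).max()))  # exact split
```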

Authors:Yu Fang, Yue Yang, Xinghao Zhu, Kaiyuan Zheng, Gedas Bertasius, Daniel Szafir, Mingyu Ding
Title: ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
Abstract:
Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: https://yuffish.github.io/rebot/

Authors:Hejia Chen, Haoxian Zhang, Shoulong Zhang, Xiaoqiang Liu, Sisi Zhuang, Yuan Zhang, Pengfei Wan, Di Zhang, Shuai Li
Title: Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control
Abstract:
Speech-driven 3D talking face methods should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences while limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions makes it challenging to achieve satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), preventing unfavorable speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance in fine-grained control in user studies. Project page: https://harryxd2018.github.io/cafe-talk/
中文:Cafe-Talk是一种基于扩散变换器的3D说话人脸模型,通过两阶段训练流程和多模态控制条件,实现了卓越的唇形同步与细腻表情调控,支持从粗粒度到细粒度的面部动画精确控制。
English: Cafe-Talk is a diffusion-transformer-based 3D talking face model that achieves superior lip synchronization and expressive control through a two-stage training pipeline and multimodal conditions, enabling both coarse and fine-grained facial animation adjustments.
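The abstract mentions a mask-based CFG technique for regulating where and how strongly the fine-grained control applies. A generic masked classifier-free guidance combination, written here as an assumed formulation rather than the Cafe-Talk release, could look like this:

```python
# Generic sketch of mask-based classifier-free guidance: the fine-grained (AU)
# condition is only amplified where its mask is active, and its intensity is
# controlled by a separate guidance scale.
import torch

def masked_cfg(eps_uncond, eps_coarse, eps_fine, mask, scale_coarse=2.0, scale_fine=4.0):
    """
    eps_*: denoiser outputs of shape (B, T, D) for unconditional, coarse-only,
    and coarse+fine conditioning. mask: (B, T, 1) in [0, 1], marking the frames
    where the fine-grained action-unit control should apply.
    """
    guided = eps_uncond + scale_coarse * (eps_coarse - eps_uncond)
    guided = guided + mask * scale_fine * (eps_fine - eps_coarse)   # fine control only under the mask
    return guided

# Toy usage: apply an AU instruction only to the middle third of a sequence.
B, T, D = 1, 90, 256
eps_u, eps_c, eps_f = (torch.randn(B, T, D) for _ in range(3))
mask = torch.zeros(B, T, 1)
mask[:, 30:60] = 1.0
print(masked_cfg(eps_u, eps_c, eps_f, mask).shape)   # torch.Size([1, 90, 256])
```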

Authors:Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz
Title: MusicInfuser: Making Video Diffusion Listen and Dance
Abstract:
We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at https://susunghong.github.io/MusicInfuser.
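The adaptation recipe, lightweight music-video cross-attention plus a low-rank adapter on a frozen video diffusion model, can be sketched as follows. Dimensions, LoRA rank, and module placement are illustrative assumptions, not the project's released code.

```python
# Rough sketch: a music-to-video cross-attention whose query projection is a
# frozen pretrained linear plus a low-rank (LoRA-style) trainable update.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, dim_in, dim_out, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out)
        self.base.weight.requires_grad_(False)           # frozen pretrained weight
        self.down = nn.Linear(dim_in, rank, bias=False)  # trainable low-rank update
        self.up = nn.Linear(rank, dim_out, bias=False)
        nn.init.zeros_(self.up.weight)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class MusicVideoCrossAttention(nn.Module):
    """Video tokens attend to music tokens; inserted alongside frozen video blocks."""
    def __init__(self, video_dim=1024, music_dim=768, heads=8):
        super().__init__()
        self.q = LoRALinear(video_dim, video_dim)
        self.k = nn.Linear(music_dim, video_dim)
        self.v = nn.Linear(music_dim, video_dim)
        self.heads = heads

    def forward(self, video_tokens, music_tokens):
        B, N, D = video_tokens.shape
        h = self.heads
        q = self.q(video_tokens).view(B, N, h, D // h).transpose(1, 2)
        k = self.k(music_tokens).view(B, -1, h, D // h).transpose(1, 2)
        v = self.v(music_tokens).view(B, -1, h, D // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return video_tokens + out.transpose(1, 2).reshape(B, N, D)   # residual injection

video = torch.randn(1, 256, 1024)   # latent video tokens
music = torch.randn(1, 128, 768)    # music features (e.g. from an audio encoder)
print(MusicVideoCrossAttention()(video, music).shape)   # torch.Size([1, 256, 1024])
```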

Authors:Tao Yu, Yi-Fan Zhang, Chaoyou Fu, Junkang Wu, Jinda Lu, Kun Wang, Xingyu Lu, Yunhang Shen, Guibin Zhang, Dingjie Song, Yibo Yan, Tianlong Xu, Qingsong Wen, Zhang Zhang, Yan Huang, Liang Wang, Tieniu Tan
Title: Aligning Multimodal LLM with Human Preference: A Survey
Abstract:
Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.
中文: 本文系统综述了多模态大语言模型的校准算法,涵盖其应用场景、数据集构建、评估基准及未来方向,旨在提升模型的真实性和安全性。
English: This paper provides a comprehensive review of alignment algorithms for multimodal large language models, addressing their applications, dataset construction, evaluation benchmarks, and future directions to enhance truthfulness and safety.

Authors:Ayesha Ishaq, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer
Title: Tracking Meets Large Multimodal Models for Driving Scenario Understanding
Abstract:
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and 9.4% increase in the overall score over baseline models on DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM
中文摘要:大型多模态模型在自动驾驶中常忽略三维时空信息,为此我们通过创新的跟踪编码器整合追踪数据,显著提升了多个基准测试中的性能表现。
English Summary: Large Multimodal Models in autonomous driving often underutilize 3D spatiotemporal data, so we enhance them by integrating tracking information through a novel encoder, achieving significant performance improvements across benchmarks.

Authors:Chuxin Wang, Wenfei Yang, Xiang Liu, Tianzhu Zhang
Title: State Space Model Meets Transformer: A New Paradigm for 3D Object Detection
Abstract:
DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSM) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point cloud and SSM: The serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM. The inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new SOTA on the ScanNetV2 and SUN RGB-D datasets.
中文: DEST方法提出了一种交互式状态空间模型,以线性复杂度同时更新场景点特征和查询特征,在三维室内物体检测基准上实现了最先进的性能。
English: The DEST method introduces an interactive state space model that dynamically updates both scene point features and query features with linear complexity, achieving state-of-the-art performance on 3D indoor object detection benchmarks.

Authors:NVIDIA, :, Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, Tianchang Shen, Xinglong Sun, Shitao Tang, Ting-Chun Wang, Jay Wu, Jiashu Xu, Stella Xu, Kevin Xie, Yuchong Ye, Xiaodong Yang, Xiaohui Zeng, Yu Zeng
Title: Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
Abstract:
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
中文:Cosmos-Transfer 是一种条件性世界生成模型,通过可自定义的空间输入(如分割和深度)实现高度可控的模拟,应用于机器人和自动驾驶领域,并已开源供研究使用。
English: Cosmos-Transfer is a conditional world generation model that uses customizable spatial inputs like segmentation and depth to enable highly controllable simulations, with applications in robotics and autonomous vehicles, and it is open-sourced for research.
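The adaptive spatial conditioning scheme (weighting different control inputs differently at different locations) can be pictured as a spatially varying weighted average of per-modality control features; the tensor shapes and normalization below are assumptions for illustration only.

```python
# Sketch of adaptive spatial conditioning: per-modality control features are
# blended with spatial weight maps before conditioning the generator, so e.g.
# depth can dominate one region while edges dominate another.
import torch

def fuse_controls(control_feats: dict, weight_maps: dict) -> torch.Tensor:
    """
    control_feats: modality -> (B, C, H, W) features from per-modality encoders.
    weight_maps:   modality -> (B, 1, H, W) non-negative spatial weights.
    """
    total_weight = sum(weight_maps.values()).clamp(min=1e-6)
    fused = sum(control_feats[m] * weight_maps[m] for m in control_feats)
    return fused / total_weight                   # spatially varying weighted average

B, C, H, W = 1, 64, 32, 32
feats = {m: torch.randn(B, C, H, W) for m in ["segmentation", "depth", "edge"]}
weights = {m: torch.rand(B, 1, H, W) for m in feats}
weights["depth"][..., :, :16] = 0.0               # e.g. ignore depth on the left half
print(fuse_controls(feats, weights).shape)        # torch.Size([1, 64, 32, 32])
```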

Authors:Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, Varun Jampani
Title: Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Abstract:
We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. Project page with code and model: https://stable-virtual-camera.github.io/.

Authors:Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, Kun Gai
Title: DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Abstract:
Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/
中文: DiffMoE通过引入批量级全局令牌池和容量预测器动态分配计算资源,在ImageNet上实现顶尖性能,并在文本到图像生成等任务中表现卓越。
English: DiffMoE introduces a batch-level global token pool and a capacity predictor to dynamically allocate computational resources, achieving state-of-the-art performance on ImageNet and excelling in text-to-image generation.

Authors:Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, Hujun Bao
Title: Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
Abstract:
In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have rapidly developed, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, which have been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling their direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at https://zju3dv.github.io/murre/ .
中文: 本文提出一种将运动恢复结构信息融入深度估计的新方法,显著提升多视角几何重建质量,在多种场景下超越现有先进技术。
English: This paper introduces a method that enhances multi-view geometric reconstruction by integrating Structure from Motion (SfM) information into depth estimation, improving accuracy and outperforming existing techniques across diverse scenes.
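To make the role of the SfM prior concrete, the sketch below shows a common baseline-style use of sparse SfM depth: fitting a global scale and shift that aligns a relative monocular depth map to the triangulated points by least squares. This is only an illustration of how multi-view information can anchor monocular predictions, not the paper's actual mechanism for injecting SfM cues into the depth network.

```python
# Baseline-style alignment of a relative monocular depth map to sparse SfM depth
# via least-squares scale and shift.
import numpy as np

def align_depth_to_sfm(mono_depth, sfm_uv, sfm_depth):
    """
    mono_depth: (H, W) relative depth prediction.
    sfm_uv:     (N, 2) integer pixel coordinates of triangulated SfM points.
    sfm_depth:  (N,) metric depths of those points in the same camera.
    Returns the aligned depth map and the fitted (scale, shift).
    """
    d = mono_depth[sfm_uv[:, 1], sfm_uv[:, 0]]          # predictions at SfM pixels
    A = np.stack([d, np.ones_like(d)], axis=1)          # solve min ||A [s, t]^T - sfm_depth||
    (scale, shift), *_ = np.linalg.lstsq(A, sfm_depth, rcond=None)
    return scale * mono_depth + shift, (scale, shift)

# Toy usage: a synthetic "relative" depth that is off by scale 2 and shift 0.5.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 5.0, size=(120, 160))
mono = (true_depth - 0.5) / 2.0
uv = np.stack([rng.integers(0, 160, 50), rng.integers(0, 120, 50)], axis=1)
aligned, (s, t) = align_depth_to_sfm(mono, uv, true_depth[uv[:, 1], uv[:, 0]])
print(round(s, 3), round(t, 3))    # ~2.0 and ~0.5 recovered
```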

Authors:Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin
Title: Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Abstract:
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code is released on https://github.com/open-compass/Creation-MMBench.
Chinese: Creation-MMBench是一个专门评估多模态大语言模型在图像任务中创造力的新基准,研究发现开源模型表现远逊于专有模型,且视觉微调可能削弱模型的创新能力。
English: Creation-MMBench is a new multimodal benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models (MLLMs) in image-based tasks, revealing that current open-source models significantly lag behind proprietary ones and that visual fine-tuning can impair creativity.

Authors:Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler
Title: Bolt3D: Generating 3D Scenes in Seconds
Abstract:
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

Authors:Hongyu Zhang, Yufan Deng, Shenghai Yuan, Peng Jin, Zesen Cheng, Yian Zhao, Chang Liu, Jie Chen
Title: MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation
Abstract:
Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the conditioning stage, we introduce Semantic Anchor Disambiguation to reinforce subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) during the denoising stage, we propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
中文: MagicComp是一种无需训练的方法,通过双阶段优化解决多主体间的语义模糊性并动态绑定时空区域,显著提升了文本到视频生成的组合能力,性能优于现有技术。
English: MagicComp is a training-free method that enhances text-to-video generation by resolving subject ambiguity and dynamically binding subjects to spatiotemporal regions through dual-phase refinement, outperforming existing methods.
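
The abstract does not give the exact form of the masked attention modulation, so the following is only a minimal sketch of the general idea: a grounding layout mask biases cross-attention logits so that each subject's text tokens attend mainly to their assigned spatiotemporal region. The function name, shapes, and bias scheme are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, region_mask, bias_scale=5.0):
    """Hypothetical sketch of masked attention modulation.

    q: (tokens, d) spatial query features for one frame.
    k, v: (words, d) text-token keys / values.
    region_mask: (tokens, words) binary layout mask; 1 where a subject's text
        token is allowed to attend to a spatial location, 0 elsewhere.
    """
    logits = q @ k.t() / q.shape[-1] ** 0.5             # (tokens, words)
    logits = logits + bias_scale * (region_mask - 1.0)   # penalize out-of-region pairs
    attn = F.softmax(logits, dim=-1)
    return attn @ v                                      # (tokens, d)
```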

Authors:Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu
Title: ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Abstract:
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
Chinese: 随着深度伪造视频的真实感日益增强,人类和自动化检测系统都面临挑战,为此我们推出了首个可解释深度伪造检测数据集ExDDV,通过文本和点击标注提升模型对伪造痕迹的定位与描述能力,确保检测结果既可靠又可解释。
English: The increasing realism of deepfake videos challenges both human detection and automated systems, which often lack explainability, prompting the introduction of ExDDV—the first dataset and benchmark for explainable deepfake detection, using text and click annotations to enhance model robustness and localization of artifacts.

Authors:Yali Bi, Enyu Che, Yinan Chen, Yuanpeng He, Jingwei Qu
Title: Multi-Prototype Embedding Refinement for Semi-Supervised Medical Image Segmentation
Abstract:
Medical image segmentation aims to identify anatomical structures at the voxel-level. Segmentation accuracy relies on distinguishing voxel differences. Compared to advancements achieved in studies of the inter-class variance, the intra-class variance receives less attention. Moreover, traditional linear classifiers, limited by a single learnable weight per class, struggle to capture this finer distinction. To address the above challenges, we propose a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation. Specifically, we design a multi-prototype-based classification strategy, rethinking the segmentation from the perspective of structural relationships between voxel embeddings. The intra-class variations are explored by clustering voxels along the distribution of multiple prototypes in each class. Next, we introduce a consistency constraint to alleviate the limitation of linear classifiers. This constraint integrates different classification granularities from a linear classifier and the proposed prototype-based classifier. In the thorough evaluation on two popular benchmarks, our method achieves superior performance compared with state-of-the-art methods. Code is available at https://github.com/Briley-byl123/MPER.
Chinese: 本研究提出了一种基于多原型嵌入优化的半监督医学图像分割方法,通过多原型聚类和一致性约束改进类内差异建模,在基准测试中取得了优于现有技术的性能。
English: This study introduces a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation, which enhances intra-class variance modeling through multi-prototype clustering and consistency constraints, achieving state-of-the-art performance on benchmark datasets.
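
As a rough illustration of the multi-prototype classification idea (not the paper's exact formulation), each class keeps several prototypes and a voxel's class score is its best similarity over that class's prototypes, so intra-class variation is captured by multiple modes. Tensor shapes and the cosine-similarity choice are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_logits(voxel_emb, prototypes):
    """voxel_emb: (N, D) voxel embeddings; prototypes: (C, K, D), K prototypes
    per class. A voxel's class score is its best cosine similarity over that
    class's prototypes."""
    v = F.normalize(voxel_emb, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    sim = torch.einsum('nd,ckd->nck', v, p)   # (N, C, K)
    return sim.max(dim=-1).values             # (N, C) class logits
```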

Authors:Yu Cheng, Fajie Yuan
Title: LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
Abstract:
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE
中文: 提出的LeanVAE框架采用轻量级架构和基于小波的增强技术,在保持高质量视频重建的同时,显著降低了计算成本并加快了推理速度。
English: The proposed LeanVAE framework introduces a lightweight architecture and wavelet-based enhancements to significantly reduce computational costs and accelerate inference in video generation, while maintaining high reconstruction quality.

Authors:Jiang Qin, Senmao Li, Alexandra Gomez-Villa, Shiqi Yang, Yaxing Wang, Kai Wang, Joost van de Weijer
Title: Free-Lunch Color-Texture Disentanglement for Stylized Image Generation
Abstract:
Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task. Code will be released at https://deepffff.github.io/sadis.github.io/.
中文: 本文提出无需调优的SADis方法,通过利用CLIP嵌入空间和正则化变换实现风格图像生成中的色彩-纹理解耦,在保持色彩一致性和纹理保真度方面优于现有最先进方法。
English: This paper presents SADis, a tuning-free method that achieves color-texture disentanglement in stylized text-to-image generation by leveraging CLIP embeddings and regularized transformations to independently control style attributes, outperforming existing approaches.
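
The whitening and coloring transformation used for color consistency is a standard statistical operation; below is a minimal, generic WCT sketch on feature rows. The paper's RegWCT additionally injects a noise term to preserve texture, which is not reproduced here, and the function and shape conventions are assumptions.

```python
import torch

def whitening_coloring(content, color_ref, eps=1e-5):
    """Generic WCT sketch: replace the channel statistics of `content`
    (N, D feature rows) with those of `color_ref` (M, D feature rows)."""
    def stats(x):
        mu = x.mean(0, keepdim=True)
        xc = x - mu
        cov = xc.t() @ xc / (x.shape[0] - 1)
        return mu, cov + eps * torch.eye(x.shape[1], device=x.device, dtype=x.dtype)

    def mat_pow(cov, p):                      # symmetric matrix power via eigh
        e, v = torch.linalg.eigh(cov)
        return v @ torch.diag(e.clamp_min(eps) ** p) @ v.t()

    mu_c, cov_c = stats(content)
    mu_r, cov_r = stats(color_ref)
    whitened = (content - mu_c) @ mat_pow(cov_c, -0.5)    # identity covariance
    return whitened @ mat_pow(cov_r, 0.5) + mu_r           # adopt reference statistics
```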

Authors:Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
Title: Make Your Training Flexible: Towards Deployment-Efficient Video Models
Abstract:
Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with 1/4 tokens only, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.
中文摘要:提出的Flux方法通过灵活采样视频并优化令牌选择,以极低成本显著提升了视频训练的精度与计算效率,同时增强了模型在不同计算预算下的适应能力。
English Summary: The proposed Flux method enhances video training by dynamically optimizing token selection from flexibly sampled videos, achieving superior accuracy-computation efficiency and adaptability across budgets with minimal cost.
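
Token Optimization, as described, amounts to picking an informative subset of tokens from a more densely sampled video under a fixed budget. A minimal sketch follows; the scoring criterion and sampler actually used by Flux are not specified in the abstract, so the score input is a placeholder.

```python
import torch

def select_tokens(tokens, scores, budget):
    """tokens: (N, D) candidate tokens from a densely sampled video;
    scores: (N,) per-token importance (placeholder for whatever criterion
    the trained model provides); budget: token count the model can afford."""
    k = min(budget, tokens.shape[0])
    idx = scores.topk(k).indices.sort().values   # keep the original token order
    return tokens[idx]
```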

Authors:Junjin Xiao, Qing Zhang, Yongwei Nie, Lei Zhu, Wei-Shi Zheng
Title: RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images
Abstract:
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen human from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with few overlappings and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at https://github.com/iSEE-Laboratory/RoGSplat.
中文: RoGSplat提出了一种无需逐对象优化的新方法,通过利用基于SMPL的3D先验点和多级特征,从稀疏多视角图像合成未见人体的高保真新视图,在挑战性条件下实现鲁棒重建和卓越性能。
English: RoGSplat introduces a novel method for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images without per-subject optimization, leveraging SMPL-based 3D prior points and multi-level features to achieve robust reconstruction and superior performance in challenging conditions.

Authors:Rui Cao, Wei Tu, Dongsheng Chen, Wenyu Zhang
Title: Mapping Urban Villages in China: Progress and Challenges
Abstract:
The shift toward high-quality urbanization has brought increased attention to the issue of "urban villages", which has become a prominent social problem in China. However, there is a lack of available geospatial data on urban villages, making it crucial to prioritize urban village mapping. In order to assess the current progress in urban village mapping and identify challenges and future directions, we have conducted a comprehensive review, which to the best of our knowledge is the first of its kind in this field. Our review begins by providing a clear context for urban villages and elaborating the method for literature review, then summarizes the study areas, data sources, and approaches used for urban village mapping in China. We also address the challenges and future directions for further research. Through thorough investigation, we find that current studies only cover very limited study areas and periods and lack sufficient investigation into the scalability, transferability, and interpretability of identification approaches due to the challenges in concept fuzziness and variances, spatial heterogeneity and variances of urban villages, and data availability. Future research can complement and further the current research in the following potential directions in order to achieve large-area mapping across the whole nation...
中文: 该综述指出中国城中村测绘存在研究范围有限、方法可扩展性不足等问题,并提出未来需通过解决概念模糊性、空间异质性等挑战来实现全国范围的高效测绘。
English: The review highlights the critical need for comprehensive urban village mapping in China, revealing current limitations in study scope and methodology while suggesting future directions to overcome challenges like spatial heterogeneity and data constraints.

Authors:Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie
Title: Towards Harmless Multimodal Assistants with Blind Preference Optimization
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at https://lu-yang666.github.io/MMsafe-PO-Web/.
中文摘要:本研究提出MMSafe-PO数据集和盲偏好优化方法,显著提升多模态大语言模型的安全性能,在基准测试中实现45%的安全率提升并展现卓越鲁棒性。
English Summary: The study introduces the MMSafe-PO dataset and Blind Preference Optimization (BPO) method to enhance multimodal large language models' safety, achieving a 45% safety improvement and demonstrating robustness across benchmarks.
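
The abstract frames BPO on top of preference optimization over ranked paired responses; the sketch below shows only the standard DPO objective that such methods build on, with summed response log-probabilities as inputs. BPO's specific modification is not reproduced here.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective over ranked response pairs; inputs are summed
    log-probabilities of whole responses under the policy and a frozen
    reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) -
                     (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```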

Authors:Zining Wang, Tongkun Guan, Pei Fu, Chen Duan, Qianyi Jiang, Zhentao Guo, Shan Guo, Junfeng Luo, Wei Shen, Xiaokang Yang
Title: Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
Abstract:
Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at https://github.com/PriNing/Marten.
中文: 本研究提出了一种新颖的视觉问答掩码生成方法,通过语义和空间层面的双重对齐增强多模态文档理解,由此开发的Marten模型在文档任务中展现出卓越性能。
English: This study introduces a novel Visual Question Answering with Mask generation (VQAMask) method that enhances multi-modal document understanding by simultaneously aligning images and text at semantic and spatial levels, resulting in the development of Marten, an efficient MLLM that demonstrates superior performance in document-centric tasks.
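
VQAMask optimizes text parsing and mask generation jointly, with the mask generator discarded at inference. Below is a minimal sketch of such a joint objective; the balance weight, the binary-cross-entropy mask loss, and the shapes are assumptions rather than Marten's actual loss.

```python
import torch.nn.functional as F

def vqamask_loss(text_logits, text_targets, pred_masks, gt_masks, lam=1.0):
    """text_logits: (B, L, V) next-token logits; text_targets: (B, L) token ids;
    pred_masks / gt_masks: (B, 1, H, W) outputs of the auxiliary mask generator.
    Returns the combined text-parsing + mask-generation objective."""
    parse = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    mask = F.binary_cross_entropy_with_logits(pred_masks, gt_masks.float())
    return parse + lam * mask
```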

Authors:Subhadeep Koley, Tapas Kumar Dutta, Aneeshan Sain, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Yi-Zhe Song
Title: SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models
Abstract:
While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
中文: 本研究通过将稳定扩散模型与CLIP进行协同整合,无需重新训练即可在多项草图任务中实现最优性能,从而克服了基础模型在草图理解方面的局限性。
English: This study overcomes the limitations of foundation models in sketch understanding by synergistically integrating Stable Diffusion with CLIP, achieving state-of-the-art performance across multiple sketch tasks without retraining.

Authors:Weihong Chen, Xuemiao Xu, Haoxin Yang, Yi Xie, Peng Xiao, Cheng Xu, Huaidong Zhang, Pheng-Ann Heng
Title: SCJD: Sparse Correlation and Joint Distillation for Efficient 3D Human Pose Estimation
Abstract:
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at https://github.com/wileychan/SCJD.
中文:提出的SCJD框架通过稀疏输入降采样和动态关节注意力蒸馏技术,在提升三维人体姿态估计效率的同时,通过优化时空建模保持了精度优势。
English: The proposed SCJD framework enhances 3D human pose estimation by introducing sparse input downsampling and dynamic joint attention distillation to improve efficiency while maintaining accuracy through better spatial and temporal modeling.

Authors:Shengping Zhang, Xiaoyu Han, Weigang Zhang, Xiangyuan Lan, Hongxun Yao, Qingming Huang
Title: Limb-Aware Virtual Try-On Network with Progressive Clothing Warping
Abstract:
Image-based virtual try-on aims to transfer an in-shop clothing image to a person image. Most existing methods adopt a single global deformation to perform clothing warping directly, which lacks fine-grained modeling of in-shop clothing and leads to distorted clothing appearance. In addition, existing methods usually fail to generate limb details well because they are limited by the used clothing-agnostic person representation without referring to the limb textures of the person image. To address these problems, we propose a Limb-aware Virtual Try-on Network named PL-VTON, which performs fine-grained clothing warping progressively and generates high-quality try-on results with realistic limb details. Specifically, we present Progressive Clothing Warping (PCW) that explicitly models the location and size of in-shop clothing and utilizes a two-stage alignment strategy to progressively align the in-shop clothing with the human body. Moreover, a novel gravity-aware loss that considers the fit of the person wearing clothing is adopted to better handle the clothing edges. Then, we design a Person Parsing Estimator (PPE) with a non-limb target parsing map to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and body regions. Finally, we introduce Limb-aware Texture Fusion (LTF) that focuses on generating realistic details in limb regions, where a coarse try-on result is first generated by fusing the warped clothing image with the person image, then limb textures are further fused with the coarse result under limb-aware guidance to refine limb details. Extensive experiments demonstrate that our PL-VTON outperforms the state-of-the-art methods both qualitatively and quantitatively.
中文: PL-VTON模型通过渐进式服装变形和肢体感知纹理融合技术,解决了虚拟试衣中的服装变形和肢体细节缺失问题,在效果上超越了现有最优方法。
English: The PL-VTON model addresses clothing distortion and poor limb detail in virtual try-on by introducing progressive clothing warping and limb-aware texture fusion, achieving superior results over existing methods.

Authors:Runsong Zhu, Shi Qiu, Zhengzhe Liu, Ka-Hei Hui, Qianyi Wu, Pheng-Ann Heng, Chi-Wing Fu
Title: Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting
Abstract:
Lifting multi-view 2D instance segmentation to a radiance field has proven to be effective to enhance 3D understanding. Existing methods rely on direct matching for end-to-end lifting, yielding inferior results; or employ a two-stage solution constrained by complex pre- or post-processing. In this work, we design a new end-to-end object-aware lifting approach, named Unified-Lift that provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. While promising, achieving effective codebook learning is non-trivial and a naive solution leads to degraded performance. Therefore, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results manifest that our Unified-Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at https://github.com/Runsong123/Unified-Lift.
Chinese: 本文提出Unified-Lift方法,通过结合高斯点特征与可学习的对象级码本实现端到端的对象感知三维分割,在精度和效率上均优于现有方法。
English: This paper introduces Unified-Lift, an end-to-end object-aware lifting method that enhances 3D segmentation by integrating Gaussian-level features with a learnable object-level codebook, outperforming existing approaches in accuracy and efficiency.

Authors:Wei Lu, Si-Bao Chen, Hui-Dong Li, Qing-Ling Shu, Chris H. Q. Ding, Jin Tang, Bin Luo
Title: LEGNet: Lightweight Edge-Gaussian Driven Network for Low-Quality Remote Sensing Image Object Detection
Abstract:
Remote sensing object detection (RSOD) often suffers from degradations such as low spatial resolution, sensor noise, motion blur, and adverse illumination. These factors diminish feature distinctiveness, leading to ambiguous object representations and inadequate foreground-background separation. Existing RSOD methods exhibit limitations in robust detection of low-quality objects. To address these pressing challenges, we introduce LEGNet, a lightweight backbone network featuring a novel Edge-Gaussian Aggregation (EGA) module specifically engineered to enhance feature representation derived from low-quality remote sensing images. The EGA module integrates: (a) orientation-aware Scharr filters to sharpen crucial edge details often lost in low-contrast or blurred objects, and (b) Gaussian-prior-based feature refinement to suppress noise and regularize ambiguous feature responses, enhancing foreground saliency under challenging conditions. The EGA module alleviates the reduced contrast, structural discontinuities, and ambiguous feature responses prevalent in degraded images, effectively improving model robustness while maintaining computational efficiency. Comprehensive evaluations across five benchmarks (DOTA-v1.0, v1.5, DIOR-R, FAIR1M-v1.0, and VisDrone2019) demonstrate that LEGNet achieves state-of-the-art performance, particularly in detecting low-quality objects. The code is available at https://github.com/lwCVer/LEGNet.
中文: LEGNet提出了一种轻量级主干网络,其边缘-高斯聚合模块通过锐化边缘和抑制噪声来增强低质量遥感图像中的特征表示,在多个基准测试中实现了最先进的检测性能。
English: LEGNet introduces a lightweight backbone with an Edge-Gaussian Aggregation module that enhances feature representation in low-quality remote sensing images by sharpening edges and suppressing noise, achieving state-of-the-art detection performance across multiple benchmarks.
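
The orientation-aware Scharr filtering in the EGA module can be illustrated with fixed depthwise convolutions; the sketch below computes a simple per-channel Scharr gradient magnitude and is not the module's full design (the Gaussian-prior refinement is omitted).

```python
import torch
import torch.nn.functional as F

def scharr_edge_magnitude(x, eps=1e-6):
    """x: (B, C, H, W). Depthwise horizontal / vertical Scharr filtering and a
    per-channel gradient magnitude, a simple way to sharpen edge cues."""
    kx = torch.tensor([[3., 0., -3.], [10., 0., -10.], [3., 0., -3.]])
    ky = kx.t()
    c = x.shape[1]
    weight = torch.stack([kx, ky]).unsqueeze(1).repeat(c, 1, 1, 1)   # (2C, 1, 3, 3)
    edges = F.conv2d(x, weight.to(x), padding=1, groups=c)           # (B, 2C, H, W)
    gx, gy = edges[:, 0::2], edges[:, 1::2]
    return torch.sqrt(gx ** 2 + gy ** 2 + eps)                       # (B, C, H, W)
```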

Authors:Zixuan Zheng, Yilei Shi, Chunlei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Rethinking Cell Counting Methods: Decoupling Counting and Localization
Abstract:
Cell counting in microscopy images is vital in medicine and biology but extremely tedious and time-consuming to perform manually. While automated methods have advanced in recent years, state-of-the-art approaches tend toward increasingly complex model designs. In this paper, we propose a conceptually simple yet effective decoupled learning scheme for automated cell counting, consisting of separate counter and localizer networks. In contrast to jointly learning counting and density map estimation, we show that decoupling these objectives surprisingly improves results. The counter operates on intermediate feature maps rather than pixel space to leverage global context and produce count estimates, while also generating coarse density maps. The localizer then reconstructs high-resolution density maps that precisely localize individual cells, conditional on the original images and coarse density maps from the counter. Besides, to boost counting accuracy, we further introduce a global message passing module to integrate cross-region patterns. Extensive experiments on four datasets demonstrate that our approach, despite its simplicity, challenges common practice and achieves state-of-the-art performance by significant margins. Our key insight is that decoupled learning alleviates the need to learn counting on high-resolution density maps directly, allowing the model to focus on global features critical for accurate estimates. Code is available at https://github.com/MedAITech/DCL.
中文: 本文提出解耦学习方案,通过分离计数与定位网络,利用全局上下文生成粗密度图并重构高分辨率定位,以更简单的模型实现了最优性能。
English: This paper introduces a decoupled learning approach that separates cell counting and localization into distinct networks, achieving superior accuracy by focusing on global context and simplifying high-resolution density map reconstruction.
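
A minimal sketch of the counter/localizer interface implied by the abstract: the counter works on pooled intermediate features to predict a count and a coarse density map, and the localizer reconstructs a full-resolution map conditioned on the image and the coarse map. The backbone, heads, and channel sizes are placeholders (an RGB input is assumed), and the global message passing module is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCounting(nn.Module):
    """Sketch of a decoupled counter + localizer pipeline (not the paper's design)."""
    def __init__(self, backbone, feat_dim=256):
        super().__init__()
        self.backbone = backbone                        # image -> (B, feat_dim, h, w)
        self.coarse_head = nn.Conv2d(feat_dim, 1, 1)    # coarse density map
        self.count_head = nn.Linear(feat_dim, 1)        # global count estimate
        self.localizer = nn.Sequential(                 # RGB image + upsampled coarse map
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, image):
        feat = self.backbone(image)
        coarse = self.coarse_head(feat).relu()
        count = self.count_head(feat.mean(dim=(2, 3)))
        up = F.interpolate(coarse, size=image.shape[-2:], mode='bilinear',
                           align_corners=False)
        fine = self.localizer(torch.cat([image, up], dim=1)).relu()
        return count, coarse, fine
```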

Authors:Yaxiong Chen, Yujie Wang, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation
Abstract:
Medical ultrasound imaging is ubiquitous, but manual analysis struggles to keep pace. Automated segmentation can help but requires large labeled datasets, which are scarce. Semi-supervised learning leveraging both unlabeled and limited labeled data is a promising approach. State-of-the-art methods use consistency regularization or pseudo-labeling but grow increasingly complex. Without sufficient labels, these models often latch onto artifacts or allow anatomically implausible segmentations. In this paper, we present a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize segmentations. Specifically, we devise an encoder-twin-decoder network where the shape prior acts as an implicit shape model, penalizing anatomically implausible but not ground-truth-deviating predictions. Without bells and whistles, our simple approach achieves state-of-the-art performance on two benchmarks under different partition protocols. We provide a strong baseline for future semi-supervised medical image segmentation. Code is available at https://github.com/WUTCM-Lab/Shape-Prior-Semi-Seg.
中文摘要:本文提出了一种简单有效的伪标签方法,通过对抗学习形状先验来规范医学图像分割,采用编码器-双解码器网络防止解剖学上不合理的预测,在两个基准测试中实现了最先进的性能。
English Summary: This paper introduces a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize medical image segmentations, achieving state-of-the-art performance on two benchmarks through an encoder-twin-decoder network that prevents anatomically implausible predictions.

Authors:Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, Yongdong Zhang
Title: SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
Abstract:
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. Firstly, it is difficult to extract accurate spatio-temporal information of each frame in the video. Secondly, the substantial number of visual tokens makes it challenging to precisely map visual tokens of each frame to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder to establish a corresponding connection between the queries and spatial coordinates. Additionally, due to the lack of spatio-temporal datasets, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLM to simultaneously facilitate localization in both temporal and spatial dimensions. Extensive experiments demonstrate that SpaceVLLM achieves the state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets and model will be released at https://github.com/Jayce1kk/SpaceVLLM.
Chinese: SpaceVLLM通过引入时空感知查询和查询引导的空间解码器,解决了多模态模型在视频时空定位中的难题,并利用新构建的Uni-STG数据集在11个基准测试中取得了最优性能。
English: SpaceVLLM introduces spatio-temporal aware queries and a query-guided decoder to overcome multimodal models' limitations in video grounding, achieving state-of-the-art results across 11 benchmarks with its newly created Uni-STG dataset.

Authors:Huy-Hoang Bui, Bach-Thuan Bui, Quang-Vinh Tran, Yasuyuki Fujii, Joo-Ho Lee
Title: A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios
Abstract:
Visual localization is considered one of the crucial components of many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven accurate for visual localization, their storage and compute requirements are a burden. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which we argue misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture, called A-SCoRe, an attention-based model that leverages attention at the descriptor-map level to produce meaningful, high-semantic 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-SCoRe to operate in different kinds of environments and conditions and achieve the level of flexibility that is important for mobile robots. Results show our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are publicly available in our repository: https://github.com/ais-lab/A-SCoRe.
中文摘要:A-SCoRe模型采用基于注意力的视觉定位架构,通过描述符层面的注意力机制有效捕捉像素空间关系,在保持轻量化的同时实现了与主流方法相当的跨场景适应能力。
English Summary: The A-SCoRe model introduces an attention-based architecture for visual localization that overcomes storage limitations and captures spatial relationships between pixels, achieving lightweight yet competitive performance across diverse environments.

Authors:Mu Chen, Liulei Li, Wenguan Wang, Yi Yang
Title: DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation
Abstract:
Top-leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs) which generate images via denoising a latent feature embedding, we unify the decoding of three tasks, i.e., object classification, bounding box regression, and graph generation, using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct a step-wise denoising on it within LDMs, so as to deliver a clean embedding which clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.
中文: DIFFVSGG是一种在线视频场景图生成方法,它将任务重构为迭代图更新,利用统一的潜在扩散模型高效解码物体与关系特征,实现实时处理、连续时序推理并降低GPU内存消耗。
English: DIFFVSGG is an online Video Scene Graph Generation method that reframes the task as iterative graph updates using a unified latent diffusion model to efficiently decode object and relationship features, enabling real-time processing and continuous temporal reasoning with reduced GPU memory.

Authors:Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Title: Improving LLM Video Understanding with 16 Frames Per Second
Abstract:
Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of at most 2 frames per second (FPS), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (e.g., basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. We will release the source code, model checkpoints, and data at https://github.com/bytedance/F-16.
中文: F-16是一种多模态大语言模型,通过处理16帧每秒并压缩视觉标记,在多项基准测试和复杂任务中实现最优性能,无需扩大模型规模或训练数据。
English: F-16 is a multimodal LLM that enhances video understanding by processing 16 FPS with compressed visual tokens, achieving state-of-the-art results in various benchmarks and complex tasks without increasing model size or data.
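
The core recipe is sampling at 16 FPS and compressing the visual tokens of each 1-second clip before they reach the LLM. Below is a generic sketch that uses average pooling as the compressor; the actual compression operator in F-16 and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def compress_clip_tokens(frame_tokens, fps=16, keep=64):
    """frame_tokens: (T, N, D) visual tokens for T frames sampled at `fps`.
    Groups each one-second clip and pools its fps*N tokens down to `keep`
    tokens by average pooling."""
    T, N, D = frame_tokens.shape
    clips = frame_tokens[: T // fps * fps].reshape(-1, fps * N, D)   # (S, fps*N, D)
    pooled = F.adaptive_avg_pool1d(clips.transpose(1, 2), keep)      # (S, D, keep)
    return pooled.transpose(1, 2)                                    # (S, keep, D)
```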

Authors:Xinqing Li, Ruiqi Song, Qingyu Xie, Ye Wu, Nanxin Zeng, Yunfeng Ai
Title: SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model
Abstract:
With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real-world application scenes to achieve large-scale data generation for challenging scenes. In this paper, a simulator-conditioned scene generation engine based on a world model is proposed. By constructing a simulation system consistent with real-world scenes, simulation data and labels for any scene, which serve as the conditions for data generation in the world model, can be collected. The result is a novel data generation pipeline that combines the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data is provided for exploring the capabilities of world models in real-world scenes. Quantitative results show that the generated images significantly improve downstream perception model performance. Finally, we explore the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at https://github.com/Li-Zn-H/SimWorld.
中文: 本文提出了一种基于世界模型的仿真条件场景生成引擎,通过结合仿真系统与生成模型构建新型数据生成流程,有效提升自动驾驶感知模型在真实场景中的性能表现。
English: This paper introduces a simulator-conditioned scene generation engine that integrates simulation capabilities with world models to produce diverse datasets, significantly enhancing perception model performance in autonomous driving applications.

Authors:Kang Yang, Tianci Bu, Lantao Li, Chunxu Li, Yongcai Wang, Deying Li
Title: Is Discretization Fusion All You Need for Collaborative Perception?
Abstract:
Collaborative perception in multi-agent systems enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which, however, lacks flexibility in extracting and transmitting the informative features and can hardly focus on the informative features during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed of three main components: (1) Anchor featuring block (AFB), which generates anchor proposals and projects prepared anchor queries to image features. (2) Anchor confidence generator (ACG), which is designed to minimize communication by selecting only the features in the confident anchors to transmit. (3) A local-global fusion module, in which local fusion is anchor alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run in multi-layers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments are conducted to evaluate ACCO on OPV2V and Dair-V2X datasets, which demonstrate ACCO's superiority in reducing the communication volume, and in improving the perception range and detection performances. Code can be found at: https://github.com/sidiangongyuan/ACCO.
中文: 本文提出ACCO这一以锚点为中心的协同目标检测新范式,通过灵活的锚点通信与融合机制,在显著降低通信量的同时有效提升了感知范围与检测性能。
English: This paper introduces ACCO, a novel anchor-centric collaborative object detection method that enhances perception by enabling flexible anchor-based communication and fusion, significantly reducing communication volume while improving detection performance and range.
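
The anchor confidence generator's role, as described, is to transmit only the features of confident anchors. A minimal sketch of that selection step follows; the threshold and shapes are assumptions.

```python
import torch

def select_confident_anchors(anchor_feats, confidences, tau=0.5):
    """anchor_feats: (A, D) per-anchor features; confidences: (A,) scores from
    the confidence generator. Only features of anchors above the threshold are
    transmitted, together with their indices so the receiver can align them."""
    keep = confidences > tau
    return anchor_feats[keep], keep.nonzero(as_tuple=True)[0]
```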

Authors:Huan Ren, Wenfei Yang, Xiang Liu, Shifeng Zhang, Tianzhu Zhang
Title: Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation
Abstract:
Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformation via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. Firstly, we endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformation. Secondly, the spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, thus mitigating the interference of noise and incomplete point clouds. Lastly, a hyperbolic correspondence loss function is designed to distinguish subtle distinctions, which can promote the precision of correspondence prediction. Experimental results on CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.
中文: SpherePose创新性地采用球形表示作为共享代理形状,通过SO(3)不变特征、球形注意力机制和双曲对应损失函数,有效解决了类别级物体姿态估计中的语义不一致问题,在多个基准测试中展现出优越性能。
English: SpherePose introduces a novel approach to category-level object pose estimation by using spherical representations as a shared proxy shape to overcome semantic incoherence in existing methods, achieving superior performance through SO(3)-invariant features, spherical attention, and hyperbolic correspondence loss.

Authors:Dongkwan Lee, Kyomin Hwang, Nojun Kwak
Title: Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization
Abstract:
We address the problem of semi-supervised domain generalization (SSDG), where the distributions of train and test data differ, and only a small amount of labeled data along with a larger amount of unlabeled data are available during training. Existing SSDG methods that leverage only the unlabeled samples for which the model's predictions are highly confident (confident-unlabeled samples), limit the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in the SSDG setting. To this end, we propose UPCSC to utilize these unconfident-unlabeled samples in SSDG that consists of two modules: 1) Unlabeled Proxy-based Contrastive learning (UPC) module, treating unconfident-unlabeled samples as additional negative pairs and 2) Surrogate Class learning (SC) module, generating positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play and do not require any domain labels, which can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability and mitigates domain gaps. The code is available at https://github.com/dongkwani/UPCSC.
中文: 本文提出UPCSC方法,通过利用未置信的未标记样本进行对比学习和代理类别学习,无需领域标签即可在半监督领域泛化中提升模型性能。
English: This paper introduces UPCSC, a novel method for semi-supervised domain generalization that leverages unconfident-unlabeled samples through contrastive learning and surrogate class modules to enhance model performance without requiring domain labels.
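
The UPC module treats unconfident-unlabeled samples as additional negative pairs. The sketch below shows a generic InfoNCE-style loss with such extra negatives appended; it illustrates the idea but is not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_with_extra_negatives(anchor, positive, extra_negatives, temp=0.1):
    """anchor / positive: (B, D) L2-normalized rows; extra_negatives: (M, D)
    L2-normalized embeddings of unconfident-unlabeled samples used as negatives."""
    pos = (anchor * positive).sum(-1, keepdim=True) / temp     # (B, 1)
    neg = anchor @ extra_negatives.t() / temp                  # (B, M)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(anchor.shape[0], dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, target)                     # positive is class 0
```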

Authors:Barza Nisar, Steven L. Waslander
Title: PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds
Abstract:
Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware (PSA) features. Our approach defines a self-supervised bounding box regression pretext task, which retains object pose and size information. Furthermore, we incorporate LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pretrained model, our light-weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels across popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation (using up to 10 times less labels), as well as on 3D object detection. Our code will be released on https://github.com/TRAILab/PSA-SSL.
中文: 提出的PSA-SSL方法通过边界框回归任务和激光雷达束增强,在三维点云自监督学习中融入物体姿态与尺寸感知,在多个自动驾驶数据集上以少量标注数据实现了三维语义分割和物体检测的显著性能提升。
English: The proposed PSA-SSL method enhances self-supervised learning on 3D point clouds by incorporating object pose and size awareness through a bounding box regression task and LiDAR beam augmentation, achieving superior performance in 3D semantic segmentation and object detection with limited labeled data across autonomous driving datasets.
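
LiDAR beam pattern augmentation can be approximated by randomly dropping beams so that learned features become less tied to one sensor's scan pattern. The sketch below is one simple variant under assumed inputs, not necessarily the augmentation used in the paper.

```python
import numpy as np

def beam_dropout(points, beam_ids, keep_ratio=0.5, rng=None):
    """points: (N, 3+) LiDAR points; beam_ids: (N,) integer beam / ring index.
    Randomly keeps a subset of beams to mimic a sparser scan pattern."""
    rng = rng or np.random.default_rng()
    beams = np.unique(beam_ids)
    n_keep = max(1, int(len(beams) * keep_ratio))
    kept = rng.choice(beams, size=n_keep, replace=False)
    return points[np.isin(beam_ids, kept)]
```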

Authors:Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu
Title: Where do Large Vision-Language Models Look at when Answering Questions?
Abstract:
Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
Chinese: 本研究扩展了热力图可视化方法,以解释大型视觉语言模型在开放式问答中如何利用视觉输入,揭示了其关注区域、架构差异及语言模型规模对视觉理解影响的重要发现。
English: This study extends heatmap visualization methods to interpret how Large Vision-Language Models (LVLMs) utilize visual inputs for open-ended question answering, revealing key insights into their focus regions, architectural differences, and the impact of language model scale on visual understanding.

Authors:Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim
Title: MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
Abstract:
The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to "turn on the TV", there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on single-target, object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.
Chinese: 大型语言模型与视觉模型的融合正在推动视觉语言任务的发展,尤其在推理分割领域,但现有数据集缺乏多目标和部件级理解,为此构建了MMR数据集并提出新框架,在复杂场景中展现出更优的推理能力。
English: The integration of Large Language Models with vision models is advancing vision-language tasks, particularly in reasoning segmentation, but current datasets lack multi-target and part-level understanding, leading to the creation of the MMR dataset and a new framework that demonstrates improved performance in these complex scenarios.

Authors:Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, Junyong Noh
Title: SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing
Abstract:
Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce a skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available on the project page.

Authors:Chunlei Li, Yilei Shi, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection
Abstract:
Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. Code is available at https://github.com/MedAITech/SCRD4AD.
Chinese Summary: 本文提出了一种新颖的尺度感知对比反向蒸馏模型,通过对比学习增强特征区分度,并采用自适应加权机制处理异常尺度变化,在基准数据集上实现了最先进的性能。
English Summary: This paper introduces a novel scale-aware contrastive reverse distillation model that enhances anomaly detection by improving feature discriminability through contrastive learning and addressing scale variations with adaptive weighting, achieving state-of-the-art results on benchmark datasets.
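
In reverse distillation, anomalies are scored by the discrepancy between teacher and student features. The sketch below forms a multi-scale cosine-distance anomaly map with soft per-scale weights standing in for the paper's scale adaptation mechanism; shapes and the weighting are assumptions.

```python
import torch.nn.functional as F

def multi_scale_anomaly_map(teacher_feats, student_feats, scale_weights, size):
    """teacher_feats / student_feats: lists of (B, C_i, H_i, W_i) feature maps;
    scale_weights: per-scale weights; size: (H, W) of the output anomaly map."""
    amap = 0.0
    for t, s, w in zip(teacher_feats, student_feats, scale_weights):
        d = 1.0 - F.cosine_similarity(t, s, dim=1, eps=1e-6)    # (B, H_i, W_i)
        amap = amap + w * F.interpolate(d.unsqueeze(1), size=size,
                                        mode='bilinear', align_corners=False)
    return amap                                                 # (B, 1, H, W)
```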

Authors:Jinping Wang, Weiwei Song, Hao Chen, Jinchang Ren, Huimin Zhao
Title: FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification
Abstract:
World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at https://github.com/Cimy-wang/FusDreamer.
中文: 本文提出FusDreamer,一种标签高效的遥感世界模型,通过潜在扩散融合和知识引导的一致性整合多模态数据,以弥合领域差距并用有限样本提升学习效果。
English: This paper introduces FusDreamer, a label-efficient remote sensing world model that integrates multimodal data through latent diffusion fusion and knowledge-guided consistency to bridge domain gaps and enhance learning with limited samples.

Authors:Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour
Title: FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution
Abstract:
Video super-resolution aims to enhance low-resolution videos by leveraging both spatial and temporal information. While deep learning has led to impressive progress, it typically requires centralized data, which raises privacy concerns. Federated learning offers a privacy-friendly solution, but general FL frameworks often struggle with low-level vision tasks, resulting in blurry, low-quality outputs. To address this, we introduce FedVSR, the first FL framework specifically designed for VSR. It is model-agnostic and stateless, and introduces a lightweight loss function based on the DWT to better preserve high-frequency details during local training. Additionally, a loss-aware aggregation strategy combines both DWT-based and task-specific losses to guide global updates effectively. Extensive experiments across multiple VSR models and datasets demonstrate that FedVSR consistently outperforms existing FL methods, achieving up to 0.82 dB higher PSNR, 0.0327 higher SSIM, and 0.0251 lower LPIPS. These results underscore FedVSR's ability to bridge the gap between privacy and performance, setting a new benchmark for federated learning in low-level vision tasks. The code is available at: https://github.com/alimd94/FedVSR
Chinese: FedVSR是首个专为视频超分辨率设计的联邦学习框架,通过基于离散小波变换的轻量损失函数和损失感知聚合策略,在保护隐私的同时显著提升了图像质量,性能优于现有方法。
English: FedVSR is a novel federated learning framework tailored for video super-resolution that enhances privacy and performance by using a DWT-based loss function and loss-aware aggregation, achieving superior results over existing methods.
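The DWT-based local loss described in the FedVSR abstract can be illustrated with a one-level Haar decomposition that penalizes errors in the high-frequency sub-bands. The sketch below is an assumption-laden approximation (Charbonnier pixel term, equal band weights), not FedVSR's actual loss.

```python
import torch

def haar_dwt2(x):
    """One-level 2D Haar DWT on a (B, C, H, W) tensor (H, W even).

    Returns (LL, LH, HL, HH) sub-bands, each of spatial size (H/2, W/2).
    """
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def high_frequency_loss(sr, hr, hf_weight=1.0):
    """Charbonnier-style pixel loss plus an extra penalty on the Haar
    high-frequency sub-bands. Sketch only: FedVSR's exact loss terms and
    weighting are assumptions here."""
    eps = 1e-6
    pixel = torch.sqrt((sr - hr) ** 2 + eps).mean()
    hf = 0.0
    for band_sr, band_hr in zip(haar_dwt2(sr)[1:], haar_dwt2(hr)[1:]):
        hf = hf + torch.sqrt((band_sr - band_hr) ** 2 + eps).mean()
    return pixel + hf_weight * hf

sr = torch.rand(1, 3, 64, 64)   # super-resolved frame
hr = torch.rand(1, 3, 64, 64)   # ground-truth frame
print(high_frequency_loss(sr, hr).item())
```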

Authors:Keqi Chen, Vinkle Srivastav, Didier Mutter, Nicolas Padoy
Title: Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes
Abstract:
Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at https://github.com/CAMMA-public/Self-MVA.
中文摘要:本文提出Self-MVA方法,通过跨视角图像同步的自监督学习实现无需标注或相机标定的多视角行人关联,在多个基准数据集上达到最优性能。
English Summary: This paper introduces Self-MVA, a self-supervised approach for multi-view person association that learns from cross-view synchronization without annotations or camera calibration, achieving state-of-the-art performance on benchmark datasets.
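The abstract mentions Hungarian matching to bridge instance-wise and image-wise distances for the synchronization pretext task. A minimal sketch of that bridging step, assuming cosine costs over per-person features (the real encoder and cost design are the paper's):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def image_level_distance(feats_view_a, feats_view_b):
    """Turn instance-wise distances into a single image-pair distance.

    feats_view_a: (Na, D) person features from view A; feats_view_b: (Nb, D)
    from view B. Returns the mean cost of the optimal person-to-person
    assignment, usable as an image-level distance for the cross-view
    synchronization pretext task.
    """
    a = feats_view_a / np.linalg.norm(feats_view_a, axis=1, keepdims=True)
    b = feats_view_b / np.linalg.norm(feats_view_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # (Na, Nb) cosine distances
    rows, cols = linear_sum_assignment(cost)  # optimal bipartite matching
    return cost[rows, cols].mean()

# Two views with 3 and 4 detected persons, 128-d features each.
d_sync = image_level_distance(np.random.rand(3, 128), np.random.rand(4, 128))
```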

Authors:Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
Title: Web Artifact Attacks Disrupt Vision Language Models
Abstract:
Vision-language models (VLMs) (e.g. CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a "typographic" attack. These attacks succeed due to VLMs' text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce "artifact-based" attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them simultaneously harder to defend against and more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact-aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness. Code: https://github.com/mqraitem/Web-Artifact-Attacks
中文: 视觉语言模型易受新型“基于伪影”攻击的影响,这些攻击利用网络数据中的意外关联,通过不匹配文本和图形元素误导预测,并在不同模型间具有高迁移性。
English: Vision-language models are vulnerable to novel "artifact-based" attacks that exploit unintended correlations from web data, using both non-matching text and graphical elements to mislead predictions with high transferability across models.

Authors:Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari
Title: Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
Abstract:
Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at https://github.com/google-research-datasets/egotempo.git.
中文: 现有第一人称视频问答数据集和模型常忽略真实时序推理,因此作者提出EgoTempo新数据集,要求整合全视频信息以评估和改进多模态大语言模型的时序理解能力。
English: Current egocentric video QA datasets and models often bypass true temporal reasoning, so the authors introduce EgoTempo, a new dataset requiring full-video integration to assess and advance temporal understanding in MLLMs.

Authors:Shiran Yuan, Hao Zhao
Title: Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers
Abstract:
Methods based on diffusion backbones have recently revolutionized novel view synthesis (NVS). However, those models require pretrained 2D diffusion checkpoints (e.g., Stable Diffusion) as the basis for geometrical priors. Since such checkpoints require exorbitant amounts of data and compute to train, this greatly limits the scalability of diffusion-based NVS models. We present Next-Scale Autoregression Conditioned by View (ArchonView), a method that significantly exceeds state-of-the-art methods despite being trained from scratch with 3D rendering data only and no 2D pretraining. We achieve this by incorporating both global (pose-augmented semantics) and local (multi-scale hierarchical encodings) conditioning into a backbone based on the next-scale autoregression paradigm. Our model also exhibits robust performance even for difficult camera poses where previous methods fail, and is several times faster in inference speed compared to diffusion. We experimentally verify that performance scales with model and dataset size, and conduct extensive demonstration of our method's synthesis quality across several tasks. Our code is open-sourced at https://github.com/Shiran-Yuan/ArchonView.
中文: ArchonView提出了一种无需二维预训练的新视角合成方法,通过结合全局与局部条件的下一尺度自回归技术,在性能和推理速度上均超越了现有最佳方法。
English: ArchonView introduces a novel view synthesis method that surpasses state-of-the-art techniques by using next-scale autoregression with global and local conditioning, achieving superior performance without 2D pretraining and offering faster inference.

Authors:Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai
Title: Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception
Abstract:
We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during the training, we first introduce a Dual-Latent Sharing scheme, which transfers image and depth sequence in a shared latent space, allowing both modalities to benefit from shared feature learning. Additionally, we propose a Multi-scale Latent Interaction mechanism, which facilitates bidirectional refinement between image and depth features at multiple spatial scales, effectively enhancing geometry consistency and perceptual alignment. During testing, our UniFuture can easily predict high-consistency future image-depth pairs by only using the current image as input. Extensive experiments on the nuScenes dataset demonstrate that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model. The project page is at https://github.com/dk-liang/UniFuture.
中文: UniFuture是一种统一的驾驶世界模型,通过双潜在共享和多尺度交互机制联合预测未来RGB图像和深度图,在nuScenes数据集上实现了优于专用模型的生成与感知性能。
English: UniFuture is a unified driving world model that jointly predicts future RGB images and depth maps through dual-latent sharing and multi-scale interaction, achieving superior performance in generation and perception tasks on the nuScenes dataset.

Authors:Zhaodong Wu, Qiaochu Zhao, Ming Hu, Yulong Li, Haochen Xue, Kang Dang, Zhengyong Jiang, Angelos Stefanidis, Qiufeng Wang, Imran Razzak, Zongyuan Ge, Junjun He, Yu Qiao, Zhong Zheng, Feilong Tang, Jionglong Su
Title: MSWAL: 3D Multi-class Segmentation of Whole Abdominal Lesions Dataset
Abstract:
With the significantly increasing incidence and prevalence of abdominal diseases, there is a need to embrace greater use of new innovations and technology for the diagnosis and treatment of patients. Although deep-learning methods have notably been developed to assist radiologists in diagnosing abdominal diseases, existing models have the restricted ability to segment common lesions in the abdomen due to missing annotations for typical abdominal pathologies in their training datasets. To address the limitation, we introduce MSWAL, the first 3D Multi-class Segmentation of the Whole Abdominal Lesions dataset, which broadens the coverage of various common lesion types, such as gallstones, kidney stones, liver tumors, kidney tumors, pancreatic cancer, liver cysts, and kidney cysts. With CT scans collected from 694 patients (191,417 slices) of different genders across various scanning phases, MSWAL demonstrates strong robustness and generalizability. The transfer learning experiment from MSWAL to two public datasets, LiTS and KiTS, effectively demonstrates consistent improvements, with Dice Similarity Coefficient (DSC) increase of 3.00% for liver tumors and 0.89% for kidney tumors, demonstrating that the comprehensive annotations and diverse lesion types in MSWAL facilitate effective learning across different domains and data distributions. Furthermore, we propose Inception nnU-Net, a novel segmentation framework that effectively integrates an Inception module with the nnU-Net architecture to extract information from different receptive fields, achieving significant enhancement in both voxel-level DSC and region-level F1 compared to the cutting-edge public algorithms on MSWAL. Our dataset will be released after being accepted, and the code is publicly released at https://github.com/tiuxuxsh76075/MSWAL-.
中文: MSWAL数据集通过提供多种常见腹部病变的全面3D标注,解决了现有模型因训练数据标注缺失导致的病灶分割能力受限问题,同时提出的Inception nnU-Net框架通过多尺度特征整合实现了更优的分割性能。
English: The MSWAL dataset addresses the limitation of existing models in segmenting abdominal lesions by providing comprehensive 3D annotations for various common pathologies, while the proposed Inception nnU-Net framework demonstrates superior segmentation performance through multi-scale feature integration.
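The Inception nnU-Net described above combines parallel convolutions with different receptive fields inside the nnU-Net backbone. A generic Inception-style 3D block of that kind might look as follows; the branch layout and normalization choices are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class InceptionConvBlock(nn.Module):
    """Parallel 3D convolutions with different receptive fields, concatenated
    and projected back to `out_ch`. A generic Inception-style block; the
    exact branch design used in Inception nnU-Net is not claimed here."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 4
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, branch_ch, kernel_size=1),
            nn.Conv3d(in_ch, branch_ch, kernel_size=3, padding=1),
            nn.Conv3d(in_ch, branch_ch, kernel_size=5, padding=2),
            nn.Sequential(nn.MaxPool3d(3, stride=1, padding=1),
                          nn.Conv3d(in_ch, branch_ch, kernel_size=1)),
        ])
        self.project = nn.Sequential(
            nn.Conv3d(branch_ch * 4, out_ch, kernel_size=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

block = InceptionConvBlock(in_ch=32, out_ch=64)
y = block(torch.randn(1, 32, 16, 64, 64))   # (B, C, D, H, W) CT patch
```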

Authors:Mohammod N. I. Suvon, Shuo Zhou, Prasun C. Tripathi, Wenrui Fan, Samer Alabed, Bishesh Khanal, Venet Osmani, Andrew J. Swift, Chen, Chen, Haiping Lu
Title: Multimodal Latent Fusion of ECG Leads for Early Assessment of Pulmonary Hypertension
Abstract:
Recent advancements in early assessment of pulmonary hypertension (PH) primarily focus on applying machine learning methods to centralized diagnostic modalities, such as 12-lead electrocardiogram (12L-ECG). Despite their potential, these approaches fall short in decentralized clinical settings, e.g., point-of-care and general practice, where handheld 6-lead ECG (6L-ECG) can offer an alternative but is limited by the scarcity of labeled data for developing reliable models. To address this, we propose a lead-specific electrocardiogram multimodal variational autoencoder (LS-EMVAE), which incorporates a hierarchical modality expert (HiME) fusion mechanism and a latent representation alignment loss. HiME combines mixture-of-experts and product-of-experts to enable flexible, adaptive latent fusion, while the alignment loss improves coherence among lead-specific and shared representations. To alleviate data scarcity and enhance representation learning, we adopt a transfer learning strategy: the model is first pre-trained on a large unlabeled 12L-ECG dataset and then fine-tuned on smaller task-specific labeled 6L-ECG datasets. We validate LS-EMVAE across two retrospective cohorts in a 6L-ECG setting: 892 subjects from the ASPIRE registry for (1) PH detection and (2) phenotyping pre-/post-capillary PH, and 16,416 subjects from UK Biobank for (3) predicting elevated pulmonary atrial wedge pressure, where it consistently outperforms unimodal and multimodal baseline methods and demonstrates strong generalizability and interpretability. The code is available at https://github.com/Shef-AIRE/LS-EMVAE.
中文: 针对分散式临床环境中6导联心电图标记数据稀缺的问题,本研究提出LS-EMVAE模型,通过分层专家融合机制和迁移学习策略,在肺动脉高压检测与分型任务中显著优于现有方法,并展现出优异的泛化能力。
English: Recent machine learning methods for pulmonary hypertension assessment using 12-lead ECG face limitations in decentralized settings with 6-lead ECG due to scarce labeled data, prompting the development of LS-EMVAE, a multimodal variational autoencoder that employs transfer learning and hierarchical fusion to enhance performance and generalizability across clinical tasks.
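HiME mixes mixture-of-experts and product-of-experts fusion over lead-specific Gaussian posteriors. The product-of-experts half has a standard closed form (precision-weighted fusion), sketched below; the surrounding mixture weighting and alignment loss of LS-EMVAE are not reproduced.

```python
import torch

def product_of_experts(mus, logvars):
    """Fuse per-lead Gaussian posteriors q_i = N(mu_i, sigma_i^2) into one
    Gaussian via a product of experts: summed precisions and a
    precision-weighted mean. Standard PoE sketch; how LS-EMVAE combines it
    with its mixture-of-experts branch (the HiME design) is not shown.

    mus, logvars: tensors of shape (num_experts, B, D).
    """
    precision = torch.exp(-logvars)             # 1 / sigma_i^2 per expert
    joint_precision = precision.sum(dim=0)
    joint_var = 1.0 / joint_precision
    joint_mu = joint_var * (precision * mus).sum(dim=0)
    return joint_mu, torch.log(joint_var)

mus = torch.randn(6, 8, 32)       # 6 ECG-lead experts, batch 8, 32-d latent
logvars = torch.zeros(6, 8, 32)
mu, logvar = product_of_experts(mus, logvars)
```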

Authors:Zhenyu Wu, Yuheng Zhou, Xiuwei Xu, Ziwei Wang, Haibin Yan
Title: MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
Abstract:
Mobile manipulation is the fundamental challenge for robotics to assist humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework named MoManipVLA to transfer pre-trained VLA models of fixed-base manipulation to mobile manipulation, so that high generalization ability across tasks and environments can be achieved in the mobile manipulation policy. Specifically, we utilize pre-trained VLA models to generate waypoints of the end-effector with high generalization ability. We design motion planning objectives for the mobile base and the robot arm, which aim at maximizing the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enhance the manipulator policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. In this way, MoManipVLA can adjust the position of the robot base in a zero-shot manner, thus making the waypoints predicted from the fixed-base VLA models feasible. Extensive experimental results on OVMM and the real world demonstrate that MoManipVLA achieves a 4.2% higher success rate than state-of-the-art mobile manipulation methods, and only requires 50 training cost for real world deployment due to the strong generalization ability of the pre-trained VLA models.
中文摘要:MoManipVLA通过双级优化框架将预训练的视觉-语言-动作模型从固定基座操作迁移到移动操作,在提升任务成功率的同时显著降低了实际部署的训练成本。
English Summary: MoManipVLA is an efficient framework that adapts pre-trained vision-language-action models from fixed-base to mobile manipulation by optimizing both base movement and end-effector trajectories, achieving higher success rates with minimal training costs.

Authors:Haoyang Li, Liang Wang, Chao Wang, Jing Jiang, Yan Peng, Guodong Long
Title: DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models
Abstract:
The Base-New Trade-off (BNT) problem universally exists during the optimization of CLIP-based prompt tuning, where continuous fine-tuning on base (target) classes leads to a simultaneous decrease of generalization ability on new (unseen) classes. Existing approaches attempt to regulate the prompt tuning process to balance BNT by appending constraints. However, imposed on the same target prompt, these constraints fail to fully avert the mutual exclusivity between the optimization directions for base and new. As a novel solution to this challenge, we propose the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first to decouple the optimization processes of base and new tasks at the prompt level. Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of dual prompts specific to base or new tasks, thus avoiding the conflict in generalization. Meanwhile, we propose a Dynamic Hard Negative Optimizer, utilizing dual prompts to construct a more challenging optimization task on base classes for enhancement. For interpretability, we prove the feature channel invariance of the prompt vector during the optimization process, providing theoretical support for the Weighting-Decoupling of DPC. Extensive experiments on multiple backbones demonstrate that DPC can significantly improve base performance without introducing any external knowledge beyond the base classes, while maintaining generalization to new classes. Code is available at: https://github.com/JREion/DPC.
Chinese: 提出的双提示协作(DPC)框架通过解耦基础任务和新任务的优化过程,解决了CLIP提示调优中的基础-新任务权衡问题,在无需外部知识的情况下提升基础类性能的同时保持了对未知类的泛化能力。
English: The proposed Dual-Prompt Collaboration (DPC) framework addresses the Base-New Trade-off problem in CLIP-based prompt tuning by decoupling optimization processes for base and new tasks, enhancing base performance while preserving generalization to unseen classes without external knowledge.
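The Weighting-Decoupling idea can be pictured as mixing a frozen backbone prompt with a learnable parallel clone using different weights for base-class optimization and new-class inference. A minimal sketch under that reading; the mixing weights and injection point are assumptions, not DPC's exact formulation.

```python
import torch
import torch.nn as nn

class DualPromptWeighting(nn.Module):
    """Keep a frozen backbone prompt plus a learnable parallel clone, then
    mix them with different weights for base-class tuning vs. new-class
    inference. Sketch only, not DPC's exact design."""
    def __init__(self, backbone_prompt: torch.Tensor, w_base=0.8, w_new=0.2):
        super().__init__()
        self.register_buffer("backbone_prompt", backbone_prompt.clone())
        self.parallel_prompt = nn.Parameter(backbone_prompt.clone())
        self.w_base, self.w_new = w_base, w_new

    def forward(self, mode: str = "base"):
        w = self.w_base if mode == "base" else self.w_new
        return w * self.parallel_prompt + (1.0 - w) * self.backbone_prompt

ctx = torch.randn(16, 512)          # 16 context tokens, 512-d, e.g. from CoOp-style tuning
dpc = DualPromptWeighting(ctx)
prompt_for_base = dpc("base")       # used when optimizing on base classes
prompt_for_new = dpc("new")         # used when evaluating on new classes
```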

Authors:Yingyue Li, Bencheng Liao, Wenyu Liu, Xinggang Wang
Title: MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
Abstract:
With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. In this work, we present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework. We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance. Code and models are released at http://github.com/hustvl/MaTVLM.
中文:MaTVLM混合模型将Mamba-2层融入预训练视觉语言Transformer中,通过权重初始化和知识蒸馏技术,在保持性能的同时实现了更快的推理速度和更低的内存消耗。
English: The MaTVLM hybrid model integrates Mamba-2 layers into a pre-trained visual-language transformer, achieving competitive performance with faster inference and lower memory usage through weight initialization and knowledge distillation.

Authors:Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham
Title: Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
Abstract:
Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.

Authors:Ling Yang, Kaixin Zhu, Juanxi Tian, Bohan Zeng, Mingbao Lin, Hongjuan Pei, Wentao Zhang, Shuicheng Yan
Title: WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
Abstract:
With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing, and existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges in acquiring multi-view video data, the current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, many scenes involve wide-range spatial movements, highlighting the limitations of existing 4D reconstruction datasets. Additionally, existing 4D reconstruction methods rely on deformation fields to estimate the dynamics of 3D objects, but deformation fields struggle with wide-range spatial movements, which limits the ability to achieve high-quality 4D scene reconstruction with wide-range spatial movements. In this paper, we focus on 4D scene reconstruction with significant object spatial movements and propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. Furthermore, we introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks. We conduct both quantitative and qualitative comparison experiments on WideRange4D, showing that our Progress4D outperforms existing state-of-the-art 4D reconstruction methods. Project: https://github.com/Gen-Verse/WideRange4D
中文摘要:本文针对现有4D重建方法难以处理大范围空间运动的局限性,提出了包含丰富空间变化数据的WideRange4D新基准和Progress4D新方法,在复杂4D场景重建任务中实现了更优性能。
English Summary: This paper introduces WideRange4D, a novel benchmark addressing the limitations of existing 4D reconstruction methods in handling large spatial movements, and proposes Progress4D, a new method that achieves superior performance in complex 4D scene reconstruction tasks.

Authors:Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
Title: BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
Abstract:
Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: https://liyaowei-stu.github.io/project/BlobCtrl/
中文: BlobCtrl提出了一种基于斑点的框架,通过双分支扩散模型和自监督训练解耦布局与外观,实现了精细的元素级图像编辑,在物体操作等任务中达到领先性能并保持高效性。
English: BlobCtrl introduces a blob-based framework that enables fine-grained, element-level image editing by decoupling layout and appearance through a dual-branch diffusion model and self-supervised training, achieving state-of-the-art performance in tasks like object manipulation while maintaining efficiency.

Authors:Johan Edstedt
Title: Less Biased Noise Scale Estimation for Threshold-Robust RANSAC
Abstract:
The gold-standard for robustly estimating relative pose through image matching is RANSAC. While RANSAC is powerful, it requires setting the inlier threshold that determines whether the error of a correspondence under an estimated model is sufficiently small to be included in its consensus set. Setting this threshold is typically done by hand, and is difficult to tune without access to ground-truth data. Thus, a method capable of automatically determining the optimal threshold would be desirable. In this paper we revisit inlier noise scale estimation, which is an attractive approach as the inlier noise scale is linear to the optimal threshold. We revisit the noise scale estimation method SIMFIT and find bias in the estimate of the noise scale. In particular, we fix underestimates from using the same data for fitting the model as for estimating the inlier noise, and from not taking the threshold itself into account. Secondly, since the optimal threshold within a scene is approximately constant, we propose a multi-pair extension of SIMFIT++ that filters estimates, which improves results. Our approach yields robust performance across a range of thresholds, as shown in Figure 1. Code is available at https://github.com/Parskatt/simfitpp
中文: 本文重新审视并改进了SIMFIT方法,通过修正噪声尺度估计偏差和引入多对扩展,实现了无需手动调参的鲁棒相对位姿估计。
English: This paper revisits and improves SIMFIT for automatic inlier noise scale estimation, addressing bias and underestimation to enhance robust relative pose estimation without manual threshold tuning.
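One of the two bias fixes, not reusing the fitting data when estimating the inlier noise scale, can be sketched with a robust MAD-style estimator on held-out residuals, as below. The estimator and fallback are illustrative assumptions; the paper's threshold-aware correction is not reproduced here.

```python
import numpy as np

def noise_scale_from_held_out(held_out_residuals, threshold):
    """Estimate the inlier noise scale from residuals of correspondences that
    were *not* used to fit the model, avoiding the underestimation caused by
    reusing the fitting data. The MAD-style estimator is illustrative;
    SIMFIT++'s threshold-aware bias correction is not reproduced here."""
    inliers = held_out_residuals[held_out_residuals < threshold]
    if inliers.size == 0:
        return threshold / 3.0                      # fallback guess
    return 1.4826 * np.median(np.abs(inliers))      # robust sigma estimate

# Synthetic reprojection errors for correspondences held out from fitting.
residuals = np.abs(np.random.normal(0.0, 1.0, 500))
sigma_hat = noise_scale_from_held_out(residuals, threshold=3.0)
next_threshold = 3.0 * sigma_hat                    # threshold is linear in sigma
```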

Authors:Xinyu Lian, Zichao Yu, Ruiming Liang, Yitong Wang, Li Ray Luo, Kaixu Chen, Yuanzhen Zhou, Qihong Tang, Xudong Xu, Zhaoyang Lyu, Bo Dai, Jiangmiao Pang
Title: Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
Abstract:
Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation-based, which are limited by the scale and quality of the training data or the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synthesizing high-fidelity articulated objects through procedural generation. User study and quantitative evaluation demonstrate that our method can produce results that surpass current state-of-the-art methods and are comparable to human-annotated datasets in both physical properties and mesh quality. Furthermore, we show that our synthetic data can be used as training data for generative models, enabling next-step scaling up. Code is available at https://github.com/Intern-Nexus/Infinite-Mobility
中文: "无限机动性"方法通过程序化生成高保真铰接物体,在物理属性和网格质量上超越现有最优方法并媲美人工作数据集,同时可作为生成模型的扩展训练数据。
English: The proposed "Infinite Mobility" method procedurally generates high-fidelity articulated objects that surpass current state-of-the-art techniques and match human-annotated datasets in physics and mesh quality, while also serving as scalable training data for generative models.

Authors:Qi Zhang, Xiuyuan Chen, Ziyi He, Kun Wang, Lianming Wu, Hongxing Shen, Jianqi Sun
Title: U2AD: Uncertainty-based Unsupervised Anomaly Detection Framework for Detecting T2 Hyperintensity in MRI Spinal Cord
Abstract:
T2 hyperintensities in spinal cord MR images are crucial biomarkers for conditions such as degenerative cervical myelopathy. However, current clinical diagnoses primarily rely on manual evaluation. Deep learning methods have shown promise in lesion detection, but most supervised approaches are heavily dependent on large, annotated datasets. Unsupervised anomaly detection (UAD) offers a compelling alternative by eliminating the need for abnormal data annotations. However, existing UAD methods rely on curated normal datasets and their performance frequently deteriorates when applied to clinical datasets due to domain shifts. We propose an Uncertainty-based Unsupervised Anomaly Detection framework, termed U2AD, to address these limitations. Unlike traditional methods, U2AD is designed to be trained and tested within the same clinical dataset, following a "mask-and-reconstruction" paradigm built on a Vision Transformer-based architecture. We introduce an uncertainty-guided masking strategy to resolve task conflicts between normal reconstruction and anomaly detection to achieve an optimal balance. Specifically, we employ a Monte-Carlo sampling technique to estimate reconstruction uncertainty mappings during training. By iteratively optimizing reconstruction training under the guidance of both epistemic and aleatoric uncertainty, U2AD reduces overall reconstruction variance while emphasizing regions. Experimental results demonstrate that U2AD outperforms existing supervised and unsupervised methods in patient-level identification and segment-level localization tasks. This framework establishes a new benchmark for incorporating uncertainty guidance into UAD, highlighting its clinical utility in addressing domain shifts and task conflicts in medical image anomaly detection. Our code is available: https://github.com/zhibaishouheilab/U2AD
中文摘要:提出的U2AD框架通过基于视觉变换器的不确定性引导掩码重建方法,在无需异常标注数据的情况下实现了优于现有方法的脊髓异常检测性能。
English Summary: The proposed U2AD framework leverages uncertainty-guided mask-and-reconstruction with Vision Transformers to outperform existing methods in spinal cord anomaly detection while eliminating dependency on annotated abnormal data.
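The uncertainty-guided masking can be approximated by running several stochastic reconstruction passes and converting the per-pixel variance into masking probabilities. The sketch below collapses epistemic and aleatoric uncertainty into a single variance term, which is a simplification of what U2AD actually estimates.

```python
import torch

@torch.no_grad()
def uncertainty_mask_probs(model, image, n_mc=8, temperature=2.0):
    """Run several stochastic reconstruction passes (e.g. with random masking
    or dropout active) and turn the per-pixel variance into masking
    probabilities, so uncertain regions are masked and re-learned more often.
    Sketch only: U2AD separates epistemic and aleatoric terms, which this
    single variance estimate does not."""
    recons = torch.stack([model(image) for _ in range(n_mc)])   # (M, B, C, H, W)
    variance = recons.var(dim=0).mean(dim=1, keepdim=True)      # (B, 1, H, W)
    flat = variance.flatten(1)
    probs = torch.softmax(flat * temperature, dim=1)
    return probs.view_as(variance)

# Dummy "model": identity plus noise stands in for a stochastic ViT reconstructor.
model = lambda x: x + 0.05 * torch.randn_like(x)
probs = uncertainty_mask_probs(model, torch.rand(2, 1, 64, 64))
```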

Authors:James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Title: MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
Abstract:
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and the project page is at https://jmhb0.github.io/microvqa.
中文: MicroVQA是一个专为生物学研究设计的视觉问答基准,用于评估多模态推理能力,填补了现有基准的不足,并通过测试先进模型揭示了感知错误是科学推理中的主要挑战。
English: MicroVQA is a research-level visual question answering benchmark developed to evaluate critical multimodal reasoning skills in biology, addressing gaps in existing benchmarks and revealing key challenges through testing state-of-the-art models.

Authors:Qing Zhou, Junyu Gao, Qi Wang
Title: Scale Efficient Training for Large Datasets
Abstract:
The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on real datasets of various scales across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
中文: 提出的规模高效训练(SeTa)方法通过动态剪除大规模数据集中的低价值样本,可在保持性能的同时将训练时间减少高达50%,并在多种数据集和架构中验证了其有效性。
English: The proposed Scale Efficient Training (SeTa) method dynamically prunes low-value samples from large datasets to reduce training time by up to 50% without compromising performance, demonstrating effectiveness across multiple datasets and architectures.
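A rough sketch of the pruning pipeline the abstract describes, random pruning, loss-based clustering, and an easy-to-hard sliding window, is given below. All hyper-parameters (keep ratio, cluster count, window size, stage count) are illustrative assumptions, not SeTa's defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

def seta_select(losses, keep_ratio=0.7, n_clusters=10, window=6, stage=0, n_stages=4):
    """Pick sample indices for one training stage, loosely following the
    abstract: random pruning, loss-based clustering, then a sliding window
    over difficulty-ordered clusters."""
    n = len(losses)
    rng = np.random.default_rng(0)
    kept = rng.choice(n, size=int(n * keep_ratio), replace=False)  # random pruning

    # Cluster remaining samples by learning difficulty (their loss).
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        losses[kept].reshape(-1, 1))
    order = np.argsort([losses[kept][labels == c].mean() for c in range(n_clusters)])

    # Slide an easy-to-hard window over the difficulty-ordered clusters.
    start = int(round(stage * (n_clusters - window) / max(n_stages - 1, 1)))
    active = list(order[start:start + window])
    return kept[np.isin(labels, active)]

losses = np.random.rand(10_000)             # per-sample losses from a probe pass
idx_stage0 = seta_select(losses, stage=0)   # easier clusters first
idx_stage3 = seta_select(losses, stage=3)   # harder clusters later
```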

Authors:Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin
Title: Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Abstract:
Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.
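The verifiable reward for temporal grounding is naturally a function of the overlap between the predicted and ground-truth segments. A minimal sketch, assuming a temporal-IoU reward plus a small format bonus; the exact reward shaping used in Time-R1 is not claimed here.

```python
def tvg_reward(pred_start, pred_end, gt_start, gt_end, fmt_ok=True):
    """Verifiable reward for temporal video grounding: temporal IoU between
    the predicted and ground-truth segments, plus a small bonus for emitting
    a well-formed answer. Sketch only."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    iou = inter / union if union > 0 else 0.0
    return iou + (0.1 if fmt_ok else 0.0)

print(tvg_reward(12.0, 30.0, 15.0, 28.0))   # overlapping prediction
print(tvg_reward(0.0, 5.0, 40.0, 50.0))     # disjoint prediction -> IoU 0
```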

Authors:Hai-Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye
Title: Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
Abstract:
Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information; in other words, MLLMs suffer from a gradual decline in attention to visual information as reasoning progresses, causing outputs that over-rely on text. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning process with the input image removed. We observe only a ~2% accuracy drop on MathVista's test-hard subset, revealing that the model's textual outputs dominate the following reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4 points vs. the previous SOTA), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.
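The dynamic pruning of redundant visual tokens can be sketched as keeping only the tokens that currently receive the most attention, then re-feeding that compressed set at a later reasoning stage. Both the scoring rule and the re-injection trigger below are assumptions, not TVC's exact mechanism.

```python
import torch

def prune_visual_tokens(visual_tokens, attn_to_visual, keep_ratio=0.25):
    """Keep only the visual tokens the model currently attends to most, so a
    compressed copy of the image can be re-injected at a later reasoning
    stage. Sketch only.

    visual_tokens: (B, N, D); attn_to_visual: (B, N) attention mass per token.
    """
    b, n, _ = visual_tokens.shape
    k = max(1, int(n * keep_ratio))
    idx = attn_to_visual.topk(k, dim=1).indices               # (B, k)
    return torch.gather(visual_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)))

tokens = torch.randn(2, 576, 1024)          # e.g. 24x24 visual tokens per image
attn = torch.rand(2, 576)                   # attention received from text tokens
compressed = prune_visual_tokens(tokens, attn)   # (2, 144, 1024)
```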

Authors:Jiaming Kang, Keyan Chen, Zhengxia Zou, Zhenwei Shi
Title: TriDF: Triplane-Accelerated Density Fields for Few-Shot Remote Sensing Novel View Synthesis
Abstract:
Remote sensing novel view synthesis (NVS) offers significant potential for 3D interpretation of remote sensing scenes, with important applications in urban planning and environmental monitoring. However, remote sensing scenes frequently lack sufficient multi-view images due to acquisition constraints. While existing NVS methods tend to overfit when processing limited input views, advanced few-shot NVS methods are computationally intensive and perform sub-optimally in remote sensing scenes. This paper presents TriDF, an efficient hybrid 3D representation for fast remote sensing NVS from as few as 3 input views. Our approach decouples color and volume density information, modeling them independently to reduce the computational burden on implicit radiance fields and accelerate reconstruction. We explore the potential of the triplane representation in few-shot NVS tasks by mapping high-frequency color information onto this compact structure, and the direct optimization of feature planes significantly speeds up convergence. Volume density is modeled as continuous density fields, incorporating reference features from neighboring views through image-based rendering to compensate for limited input data. Additionally, we introduce depth-guided optimization based on point clouds, which effectively mitigates the overfitting problem in few-shot NVS. Comprehensive experiments across multiple remote sensing scenes demonstrate that our hybrid representation achieves a 30x speed increase compared to NeRF-based methods, while simultaneously improving rendering quality metrics over advanced few-shot methods (7.4% increase in PSNR, 12.2% in SSIM, and 18.7% in LPIPS). The code is publicly available at https://github.com/kanehub/TriDF
中文: 本文提出的TriDF混合三维表示方法,仅需三个输入视图即可实现快速遥感新视角合成,相比基于NeRF的方法提速30倍,同时显著提升了多项渲染质量指标。
English: This paper introduces TriDF, an efficient hybrid 3D representation that enables fast remote sensing novel view synthesis from just three input views, achieving a 30x speed increase over NeRF-based methods while improving rendering quality metrics.
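Triplane lookup, the core of the color branch described above, projects each 3D point onto three axis-aligned feature planes and sums the bilinearly sampled features. A generic sketch follows; TriDF's density field and image-based rendering branches are not shown.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Query a triplane: project each 3D point onto the XY, XZ and YZ planes,
    bilinearly sample features, and sum them. Generic lookup, not TriDF's
    exact fusion of color and density branches.

    planes: (3, C, H, W) feature planes; xyz: (N, 3) points in [-1, 1]^3.
    """
    coords = torch.stack([xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]])  # (3, N, 2)
    grid = coords.unsqueeze(2)                        # (3, N, 1, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)   # (N, C)

planes = torch.randn(3, 32, 128, 128)
points = torch.rand(4096, 3) * 2 - 1
features = sample_triplane(planes, points)            # (4096, 32)
```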

Authors:Shitong Shao, Hongwei Yi, Hanzhong Guo, Tian Ye, Daquan Zhou, Michael Lingelbach, Zhiqiang Xu, Zeke Xie
Title: MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Few-Step Synthesis
Abstract:
Recently, open-source video diffusion models (VDMs), such as WanX, Magic141 and HunyuanVideo, have been scaled to over 10 billion parameters. These large-scale VDMs have demonstrated significant improvements over smaller-scale VDMs across multiple dimensions, including enhanced visual quality and more natural motion dynamics. However, these models face two major limitations: (1) High inference overhead: Large-scale VDMs require approximately 10 minutes to synthesize a 28-step video on a single H100 GPU. (2) Limited in portrait video synthesis: Models like WanX-I2V and HunyuanVideo-I2V often produce unnatural facial expressions and movements in portrait videos. To address these challenges, we propose MagicDistillation, a novel framework designed to reduce inference overhead while ensuring the generalization of VDMs for portrait video synthesis. Specifically, we primarily use sufficiently high-quality talking video to fine-tune Magic141, which is dedicated to portrait video synthesis. We then employ LoRA to effectively and efficiently fine-tune the fake DiT within the step distillation framework known as distribution matching distillation (DMD). Following this, we apply weak-to-strong (W2S) distribution matching and minimize the discrepancy between the fake data distribution and the ground truth distribution, thereby improving the visual fidelity and motion dynamics of the synthesized videos. Experimental results on portrait video synthesis demonstrate the effectiveness of MagicDistillation, as our method surpasses Euler, LCM, and DMD baselines in both FID/FVD metrics and VBench. Moreover, MagicDistillation, requiring only 4 steps, also outperforms WanX-I2V (14B) and HunyuanVideo-I2V (13B) on visualization and VBench. Our project page is https://magicdistillation.github.io/MagicDistillation/.

Authors:Yinqiao Wang, Hao Xu, Pheng-Ann Heng, Chi-Wing Fu
Title: UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation
Abstract:
Estimating the 3D pose of the hand and a potential hand-held object from monocular images is a longstanding challenge. Yet, existing methods are specialized, focusing on either bare hands or hands interacting with objects. No method can flexibly handle both scenarios, and performance degrades when a method is applied to the scenario it was not designed for. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation, flexibly adapting to both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features with an object switcher to dynamically control the hand-object pose estimation according to grasping status. Further, to uplift the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly-used benchmarks demonstrate UniHOPE's SOTA performance in addressing hand-only and hand-object scenarios. Code will be released on https://github.com/JoyboyWang/UniHOPE_Pytorch.
Chinese: UniHOPE提出了一种统一的三维手-物姿态估计方法,通过抓取感知特征融合和遮挡不变学习,灵活适应空手和手持物体两种场景,实现了最先进的性能。
English: UniHOPE is a unified method for 3D hand-object pose estimation that adapts to both bare-hand and hand-object scenarios through grasp-aware feature fusion and occlusion-invariant learning, achieving state-of-the-art performance.

Authors:Ling-An Zeng, Gaojie Wu, Ancong Wu, Jian-Fang Hu, Wei-Shi Zheng
Title: Progressive Human Motion Generation Based on Text and Few Motion Frames
Abstract:
Although existing text-to-motion (T2M) methods can produce realistic human motion from text description, it is still difficult to align the generated motion with the desired postures since using text alone is insufficient for precisely describing diverse postures. To achieve more controllable generation, an intuitive way is to allow the user to input a few motion frames describing precise desired postures. Thus, we explore a new Text-Frame-to-Motion (TF2M) generation task that aims to generate motions from text and very few given frames. Intuitively, the closer a frame is to a given frame, the lower the uncertainty of this frame is when conditioned on this given frame. Hence, we propose a novel Progressive Motion Generation (PMG) method to progressively generate a motion from the frames with low uncertainty to those with high uncertainty in multiple stages. During each stage, new frames are generated by a Text-Frame Guided Generator conditioned on frame-aware semantics of the text, given frames, and frames generated in previous stages. Additionally, to alleviate the train-test gap caused by multi-stage accumulation of incorrectly generated frames during testing, we propose a Pseudo-frame Replacement Strategy for training. Experimental results show that our PMG outperforms existing T2M generation methods by a large margin with even one given frame, validating the effectiveness of our PMG. Code is available at https://github.com/qinghuannn/PMG.
中文: 本文提出的渐进式运动生成方法通过结合少量用户提供的帧来增强文本到运动的生成,实现了更精确的姿态控制,并显著超越了现有方法。
English: This paper introduces a Progressive Motion Generation method that enhances text-to-motion synthesis by incorporating a few user-provided frames, enabling more precise posture control and significantly outperforming existing methods.

Authors:Katja Schwarz, Norman Mueller, Peter Kontschieder
Title: Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors
Abstract:
Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) -- a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet+. Project page: https://katjaschwarz.github.io/ggs/

Authors:Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li
Title: FlexWorld: Progressively Expanding 3D Scenes for Flexible-View Synthesis
Abstract:
Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.
Chinese Summary: FlexWorld是一种创新框架,通过视频到视频扩散模型和渐进扩展过程,从单张图像生成高质量灵活视角的3D场景,在视觉质量上超越了现有最优方法。
English Summary: FlexWorld is a novel framework that uses a video-to-video diffusion model and progressive expansion to generate high-quality flexible-view 3D scenes from single images, outperforming existing methods in visual quality.

Authors:Zhifu Tian, Tao Hu, Chaoyang Niu, Di Wu, Shu Wang
Title: Sampling Innovation-Based Adaptive Compressive Sensing
Abstract:
Scene-aware Adaptive Compressive Sensing (ACS) has attracted significant interest due to its promising capability for efficient and high-fidelity acquisition of scene images. ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples in the absence of ground truth. However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, thus limiting the high-fidelity sensing of the scene. In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples towards regions where the reconstruction error diminishes significantly. A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process. For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios. Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms the state-of-the-art methods in terms of image reconstruction fidelity and visual effects. Codes are available at https://github.com/giant-pandada/SIB-ACS_CVPR2025.
中文摘要:提出的SIB-ACS方法通过创新准则和多阶段自适应采样框架,能有效识别并分配采样至图像重建困难区域,在重建保真度方面显著优于现有方法。
English Summary: The proposed SIB-ACS method introduces an innovation criterion and multi-stage adaptive sampling framework to effectively allocate samples to challenging reconstruction areas, significantly outperforming existing methods in image reconstruction fidelity.
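The innovation criterion allocates extra measurements to blocks where the predicted drop in reconstruction error is largest. A toy allocation step under that reading is shown below; how SIB-ACS actually predicts the error drop is the paper's contribution and is not modeled here.

```python
import numpy as np

def allocate_samples(pred_error_drop, total_budget):
    """Distribute a sampling budget across image blocks in proportion to the
    predicted decrease in reconstruction error per extra sample (the
    "innovation"). Sketch only.

    pred_error_drop: (num_blocks,) predicted error reduction per added sample.
    Returns an integer number of measurements for each block.
    """
    innovation = np.clip(pred_error_drop, a_min=0.0, a_max=None)
    if innovation.sum() == 0:
        return np.full(len(innovation), total_budget // len(innovation))
    share = innovation / innovation.sum()
    return np.floor(share * total_budget).astype(int)

drops = np.array([0.02, 0.30, 0.05, 0.50, 0.01])   # hard blocks get more samples
print(allocate_samples(drops, total_budget=1000))
```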

Authors:Ihab Asaad, Maha Shadaydeh, Joachim Denzler
Title: Gradient Extrapolation for Debiased Representation Learning
Abstract:
Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations and defines the target gradient as a linear extrapolation of the gradients computed from each batch's loss. Our analysis shows that when the extrapolated gradient points toward the batch gradient with fewer spurious correlations, it effectively guides training toward learning a debiased model. GERNE serves as a general framework for debiasing, encompassing ERM and Resampling methods as special cases. We derive the theoretical upper and lower bounds of the extrapolation factor employed by GERNE. By tuning this factor, GERNE can adapt to maximize either Group-Balanced Accuracy (GBA) or Worst-Group Accuracy (WGA). We validate GERNE on five vision and one NLP benchmarks, demonstrating competitive and often superior performance compared to state-of-the-art baselines. The project page is available at: https://gerne-debias.github.io/.
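The target gradient in GERNE is described as a linear extrapolation of the gradients from two batches with different amounts of spurious correlation. A minimal PyTorch sketch of that extrapolation follows; the sign convention and the alpha value are assumptions, and alpha=0 reduces to training on the less-biased batch alone.

```python
import torch

def extrapolated_gradients(model, loss_biased, loss_balanced, alpha=1.0):
    """Compute a target gradient as a linear extrapolation
    g = g_balanced + alpha * (g_balanced - g_biased), where g_biased comes
    from a batch with more spurious correlation and g_balanced from one with
    less. Sketch based on the abstract's description of GERNE."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_biased = torch.autograd.grad(loss_biased, params, retain_graph=True)
    g_balanced = torch.autograd.grad(loss_balanced, params)
    return [gb + alpha * (gb - gbi) for gb, gbi in zip(g_balanced, g_biased)]

# Usage: write the extrapolated gradients into .grad before optimizer.step().
model = torch.nn.Linear(16, 2)
ce = torch.nn.CrossEntropyLoss()
x_biased, y_biased = torch.randn(32, 16), torch.randint(0, 2, (32,))
x_bal, y_bal = torch.randn(32, 16), torch.randint(0, 2, (32,))
grads = extrapolated_gradients(model, ce(model(x_biased), y_biased),
                               ce(model(x_bal), y_bal), alpha=1.0)
for p, g in zip([p for p in model.parameters() if p.requires_grad], grads):
    p.grad = g
```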

Authors:Yongkang Cheng, Shaoli Huang
Title: HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures
Abstract:
Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems have primarily focused on the weak correlation between audio and gestures, leading to physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoloGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture-transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth, providing an immersive user experience. Our code, model, and demo are available at https://cyk990422.github.io/HoloGest.github.io/.

Authors:Yijie Liu, Xinyi Shang, Yiqun Zhang, Yang Lu, Chen Gong, Jing-Hao Xue, Hanzi Wang
Title: Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch
Abstract:
Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-labels is largely degraded by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models' predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called Semi-supervised Aggregation for Globally-Enhanced Ensemble (SAGE), which can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at https://github.com/Jay-Codeman/SAGE
中文: 联邦半监督学习(FSSL)面临数据异构性导致伪标签质量下降的问题,而提出的SAGE方法通过置信度差异灵活修正伪标签,有效提升了模型性能和收敛速度。
English: Federated Semi-Supervised Learning (FSSL) faces challenges from data heterogeneity that degrade pseudo-label quality, but the proposed SAGE method effectively corrects pseudo-labels using confidence discrepancies to enhance model performance and convergence.
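To make the confidence-discrepancy idea concrete, here is a rough illustration of correcting pseudo-labels when the local and global models disagree. This is not SAGE's exact rule; the threshold, the hard/soft switch, and the simple averaging are assumptions for the sketch.
```python
import torch
import torch.nn.functional as F

def corrected_pseudo_labels(logits_local, logits_global, disc_threshold=0.2):
    """Illustrative pseudo-label correction guided by the confidence gap between
    a client's local model and the global model: where they disagree strongly,
    fall back to a softer ensemble target instead of a hard local pseudo-label."""
    p_local = F.softmax(logits_local, dim=1)
    p_global = F.softmax(logits_global, dim=1)

    conf_local, _ = p_local.max(dim=1)
    conf_global, _ = p_global.max(dim=1)
    discrepancy = (conf_local - conf_global).abs()

    # Default: hard pseudo-label from the local model
    hard = F.one_hot(p_local.argmax(dim=1), num_classes=p_local.size(1)).float()
    # Where the discrepancy is large, use the averaged (ensemble) prediction instead
    ensemble = 0.5 * (p_local + p_global)
    use_ensemble = (discrepancy > disc_threshold).unsqueeze(1)
    return torch.where(use_ensemble, ensemble, hard)
```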

Authors:Jie Huang, Haorui Chen, Jiaxuan Ren, Siran Peng, Liangjian Deng
Title: A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening
Abstract:
Currently, deep learning-based methods for remote sensing pansharpening have advanced rapidly. However, many existing methods struggle to fully leverage feature heterogeneity and redundancy, thereby limiting their effectiveness. We use the covariance matrix to model the feature heterogeneity and redundancy and propose Correlation-Aware Covariance Weighting (CACW) to adjust them. CACW captures these correlations through the covariance matrix, which is then processed by a nonlinear function to generate weights for adjustment. Building upon CACW, we introduce a general adaptive dual-level weighting mechanism (ADWM) to address these challenges from two key perspectives, enhancing a wide range of existing deep-learning methods. First, Intra-Feature Weighting (IFW) evaluates correlations among channels within each feature to reduce redundancy and enhance unique information. Second, Cross-Feature Weighting (CFW) adjusts contributions across layers based on inter-layer correlations, refining the final output. Extensive experiments demonstrate the superior performance of ADWM compared to recent state-of-the-art (SOTA) methods. Furthermore, we validate the effectiveness of our approach through generality experiments, redundancy visualization, comparison experiments, key variables and complexity analysis, and ablation studies. Our code is available at https://github.com/Jie-1203/ADWM.
中文摘要:本文提出相关性感知协方差加权(CACW)方法和自适应双级加权机制(ADWM),通过特征内加权和跨特征加权解决遥感图像融合中的特征异构性与冗余问题,实验证明该方法优于现有先进技术。
English Summary: This paper introduces a Correlation-Aware Covariance Weighting (CACW) method and an Adaptive Dual-level Weighting Mechanism (ADWM) to address feature heterogeneity and redundancy in remote sensing pansharpening, demonstrating superior performance over existing methods through comprehensive experiments.
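A minimal sketch of the covariance-weighting idea: compute the channel covariance of a feature map, normalize it to correlations, and squash it into per-channel weights so redundant channels are down-weighted. The row-mean aggregation and the sigmoid nonlinearity are assumptions; the paper defines its own nonlinear function.
```python
import torch

def cacw_channel_weights(feat, eps=1e-6):
    """Illustrative correlation-aware channel weighting.
    feat: (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    x = feat.flatten(2)                           # (B, C, HW)
    x = x - x.mean(dim=2, keepdim=True)           # center each channel
    cov = x @ x.transpose(1, 2) / (h * w - 1)     # (B, C, C) channel covariance

    # Normalize to correlations so per-channel scale differences cancel
    std = torch.sqrt(torch.diagonal(cov, dim1=1, dim2=2).clamp_min(eps))
    corr = cov / (std.unsqueeze(2) * std.unsqueeze(1))

    # One weight per channel from its average correlation with the others:
    # highly redundant channels receive lower weights
    weights = torch.sigmoid(-corr.mean(dim=2))    # (B, C)
    return feat * weights.view(b, c, 1, 1)

# Example
x = torch.randn(2, 8, 16, 16)
print(cacw_channel_weights(x).shape)  # torch.Size([2, 8, 16, 16])
```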

Authors:Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
Title: Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training
Abstract:
Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state-of-the-art approaches typically rely on end-to-end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large-scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method, Alpine, outperforms state-of-the-art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop-in instance head replacement, while running in real-time on a single-threaded CPU and requiring no instance labels. It is fully explainable and requires no learning or parameter tuning. Combined with state-of-the-art semantic segmentation, Alpine ranks first on the official panoptic segmentation leaderboard of SemanticKITTI. Code is available at https://github.com/valeoai/Alpine/
Chinese: 本研究提出了一种仅使用语义标签即可实现激光雷达点云全景分割的方法,无需实例标注即可达到先进性能,并能在CPU上实时高效运行。
English: This study presents a method for panoptic segmentation of LiDAR point clouds that achieves competitive performance using only semantic labels, eliminating the need for instance annotations while running efficiently in real-time on a CPU.
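The training-free idea boils down to spatially clustering the points of each "thing" class and treating every cluster as one instance. The sketch below uses DBSCAN as a generic stand-in for the clustering step; the algorithm choice and the eps/min_pts values are assumptions, not the authors' Alpine procedure.
```python
import numpy as np
from sklearn.cluster import DBSCAN

def instances_from_semantics(points, sem_labels, thing_classes, eps=0.8, min_pts=10):
    """Training-free instance extraction, sketched: cluster the points of each
    'thing' semantic class in 3D space and assign one instance id per cluster.
    points: (N, 3) xyz coordinates; sem_labels: (N,) semantic class ids."""
    instance_ids = np.zeros(len(points), dtype=np.int64)   # 0 = no instance (stuff)
    next_id = 1
    for cls in thing_classes:
        mask = sem_labels == cls
        if mask.sum() == 0:
            continue
        clusters = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(points[mask])
        ids = np.where(clusters >= 0, clusters + next_id, 0)  # DBSCAN noise stays 0
        instance_ids[mask] = ids
        next_id += clusters.max() + 1 if clusters.max() >= 0 else 0
    return instance_ids
```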

Authors:Matteo Sodano, Federico Magistri, Elias Marks, Fares Hosn, Aibek Zurbayev, Rodrigo Marcuzzi, Meher V. R. Malladi, Jens Behley, Cyrill Stachniss
Title: 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors
Abstract:
Crop yield estimation is a relevant problem in agriculture, because an accurate yield estimate can support farmers' decisions on harvesting or precision intervention. Robots can help to automate this process. To do so, they need to be able to perceive the surrounding environment to identify target objects such as trees and plants. In this paper, we introduce a novel approach to address the problem of hierarchical panoptic segmentation of apple orchards on 3D data from different sensors. Our approach is able to simultaneously provide semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of trees (a trunk with its fruits). This allows us to identify relevant information such as individual plants, fruits, and trunks, and capture the relationships among them, such as precisely estimating the number of fruits associated with each tree in an orchard. To efficiently evaluate our approach for hierarchical panoptic segmentation, we provide a dataset designed specifically for this task. Our dataset is recorded in Bonn, Germany, in a real apple orchard with a variety of sensors, spanning from a terrestrial laser scanner to an RGB-D camera mounted on different robot platforms. The experiments show that our approach surpasses state-of-the-art approaches in 3D panoptic segmentation in the agricultural domain, while also providing full hierarchical panoptic segmentation. Our dataset is publicly available at https://www.ipb.uni-bonn.de/data/hops/. The open-source implementation of our approach is available at https://github.com/PRBonn/hapt3D.
中文: 本文提出了一种新颖的农业三维数据分层全景分割方法,能同时识别树木、树干和果实并建立其关联关系,在苹果园数据集上的表现优于现有方法,并提供了开源实现。
English: This paper introduces a novel hierarchical panoptic segmentation method for 3D agricultural data that simultaneously identifies trees, trunks, and fruits while capturing their relationships, outperforming existing approaches and providing an open-source dataset from apple orchards.

Authors:Yuanze Li, Shihao Yuan, Haolin Wang, Qizhang Li, Ming Liu, Chen Xu, Guangming Shi, Wangmeng Zuo
Title: Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process
Abstract:
Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at https://github.com/tzjtatata/Triad.
中文摘要:本研究提出Triad新型大模型,通过结合专家引导的缺陷区域识别与制造工艺认知,显著提升了工业异常检测性能,实验证明其准确率优于现有方法。
English Summary: The study introduces Triad, a novel large multimodal model that enhances industrial anomaly detection by integrating expert-guided defect area identification and manufacturing process insights, achieving superior accuracy over existing methods.

Authors:Rui Wang, Quentin Lohmeyer, Mirko Meboldt, Siyu Tang
Title: DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
Abstract:
Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstruction in highly dynamic, interaction-rich environments. Project page: https://batfacewayne.github.io/DeGauss.io/
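The composition step described above reduces to a per-pixel probabilistic blend of two renders. A minimal sketch, assuming separately rendered foreground/background images and a sigmoid parameterization of the mask (the parameterization is an assumption, not the authors' exact formulation):
```python
import torch

def composite_dynamic_static(rgb_fg, rgb_bg, mask_logits):
    """Blend a dynamic (foreground-Gaussian) render with a static
    (background-Gaussian) render using a per-pixel probabilistic mask.
    rgb_fg, rgb_bg: (H, W, 3) renders; mask_logits: (H, W)."""
    m = torch.sigmoid(mask_logits).unsqueeze(-1)   # probability a pixel is dynamic
    return m * rgb_fg + (1.0 - m) * rgb_bg

# The same mask can gate the photometric loss so each branch is optimized mostly
# on the pixels it explains, which is one way to realize the "independent yet
# complementary optimization" the abstract mentions.
```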

Authors:Chen Zhao, Zhizhou Chen, Yunzhe Xu, Enxuan Gu, Jian Li, Zili Yi, Qian Wang, Jian Yang, Ying Tai
Title: From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective
Abstract:
Ultra-high-definition (UHD) image restoration faces significant challenges due to its high resolution, complex content, and intricate details. To cope with these challenges, we analyze the restoration process in depth through a progressive spectral perspective, and deconstruct the complex UHD restoration problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Building on this insight, we propose a novel framework, ERR, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). Specifically, the ZFE integrates global priors to learn global mapping, while the LFR restores low-frequency information, emphasizing reconstruction of coarse-grained content. Finally, the HFR employs our designed frequency-windowed Kolmogorov-Arnold networks (FW-KAN) to refine textures and details, producing high-quality image restoration. Our approach significantly outperforms previous UHD methods across various tasks, with extensive ablation studies validating the effectiveness of each component. The code is available at https://github.com/NJU-PCALab/ERR.
中文: 提出的ERR框架将超高清图像修复分解为零频增强、低频修复和高频细化三个渐进阶段,通过专门设计的子网络协同工作,在多项任务中显著超越了现有方法。
English: The proposed ERR framework addresses ultra-high-definition image restoration by decomposing it into three progressive stages—zero-frequency enhancement, low-frequency restoration, and high-frequency refinement—using specialized sub-networks, significantly outperforming existing methods across multiple tasks.
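The zero/low/high-frequency split that motivates the three stages can be illustrated with a plain FFT decomposition. This is only a didactic sketch of the spectral view; the radial cutoff and the use of a hard circular mask are assumptions, not ERR's internal design.
```python
import torch
import torch.fft as fft

def spectral_split(img, low_ratio=0.1):
    """Split an image into zero-, low-, and high-frequency parts with an FFT,
    mirroring the progressive zero -> low -> high view of restoration.
    img: (B, C, H, W)."""
    b, c, h, w = img.shape
    spec = fft.fftshift(fft.fft2(img), dim=(-2, -1))

    # Zero frequency = the global mean (DC component)
    zero_freq = img.mean(dim=(-2, -1), keepdim=True).expand_as(img)

    # Circular low-pass mask around the spectrum center
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_mask = (radius <= low_ratio * min(h, w)).to(img.dtype)

    low = fft.ifft2(fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    low_freq = low - zero_freq        # coarse content without the global mean
    high_freq = img - low             # textures and fine details
    return zero_freq, low_freq, high_freq
```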

Authors:Yaxi Chen, Simin Ni, Aleksandra Ivanova, Shaheer U. Saeed, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
Title: Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images
Abstract:
Classical radiomic features have been designed to describe image appearance and intensity patterns. These features are directly interpretable and readily understood by radiologists. Compared with end-to-end deep learning (DL) models, lower dimensional parametric models that use such radiomic features offer enhanced interpretability but lower comparative performance in clinical tasks. In this study, we propose an approach in which the performance of a standard logistic regression model is substantially improved by learning to select radiomic features for individual patients, from a pool of candidate features. This approach has the potential to maintain the interpretability of such approaches while offering comparable performance to DL. We also propose to expand the feature pool by generating a patient-specific healthy persona via mask-inpainting using a denoising diffusion model trained on healthy subjects. Such a pathology-free baseline feature set allows further opportunity in novel feature discovery and improved condition classification. We demonstrate our method on multiple clinical tasks of classifying general abnormalities, anterior cruciate ligament tears, and meniscus tears. Experimental results demonstrate that our approach achieved comparable or even superior performance to state-of-the-art DL approaches while offering added interpretability by using radiomic features extracted from images and supplemented by generating healthy personas. Example clinical cases are discussed in-depth to demonstrate the interpretability-enabled utilities such as human-explainable feature discovery and patient-specific location/view selection. These findings highlight the potential of combining subject-specific feature selection with generative models to augment radiomic analysis for more interpretable decision-making. The codes are available at: https://github.com/YaxiiC/RadiomicsPersona.git
中文摘要:本研究通过逻辑回归模型改进针对个体的影像组学特征选择,并利用扩散模型生成健康基线特征扩展特征库,在临床分类任务中实现了与深度学习相当的性能,同时保持了方法的可解释性。
English Summary: This study enhances radiomic feature selection for individual patients using a logistic regression model and expands the feature pool with healthy personas generated via diffusion models, achieving performance comparable to deep learning while maintaining interpretability in clinical classification tasks.

Authors:Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, Wei-Shi Zheng
Title: ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation
Abstract:
We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic and semantically consistent HOIs. Code is available at https://github.com/qinghuannn/ChainHOI.
中文: ChainHOI提出了一种新颖的文本驱动人-物交互生成方法,通过在关节和运动链层面显式建模交互,相比现有方法能生成更真实且生物力学一致的动作。
English: ChainHOI introduces a novel approach for text-driven human-object interaction generation by explicitly modeling interactions at both joint and kinetic chain levels, outperforming existing methods in producing realistic and biomechanically coherent motions.

Authors:Jing Li, Yihang Fu, Falai Chen
Title: DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry
Abstract:
Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps. Our code is publicly available at https://github.com/jinli99/DTGBrepGen.
中文: DTGBrepGen提出了一种拓扑-几何解耦框架,首先生成有效的拓扑结构,然后使用基于Transformer的扩散模型进行顺序几何生成,在有效性和准确性上显著优于现有方法。
English: DTGBrepGen introduces a topology-geometry decoupled framework that first generates valid topological structures and then uses Transformer-based diffusion models for sequential geometry generation, significantly outperforming existing methods in validity and accuracy.

Authors:Runyu Jiao, Alice Fasoli, Francesco Giuliari, Matteo Bortolon, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi
Title: Free-form language-based robotic reasoning and grasping
Abstract:
Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs' world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or if other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses with both FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: https://tev-fbk.github.io/FreeGrasp/.

Authors:Zhicheng Zhao, Jinquan Yan, Chenglong Li, Xiao Wang, Jin Tang
Title: DehazeMamba: SAR-guided Optical Remote Sensing Image Dehazing with Adaptive State Space Model
Abstract:
Optical remote sensing image dehazing presents significant challenges due to its extensive spatial scale and highly non-uniform haze distribution, which traditional single-image dehazing methods struggle to address effectively. While Synthetic Aperture Radar (SAR) imagery offers inherently haze-free reference information for large-scale scenes, existing SAR-guided dehazing approaches face two critical limitations: the integration of SAR information often diminishes the quality of haze-free regions, and the instability of feature quality further exacerbates cross-modal domain shift. To overcome these challenges, we introduce DehazeMamba, a novel SAR-guided dehazing network built on a progressive haze decoupling fusion strategy. Our approach incorporates two key innovations: a Haze Perception and Decoupling Module (HPDM) that dynamically identifies haze-affected regions through optical-SAR difference analysis, and a Progressive Fusion Module (PFM) that mitigates domain shift through a two-stage fusion process based on feature quality assessment. To facilitate research in this domain, we present MRSHaze, a large-scale benchmark dataset comprising 8,000 pairs of temporally synchronized, precisely geo-registered SAR-optical images with high resolution and diverse haze conditions. Extensive experiments demonstrate that DehazeMamba significantly outperforms state-of-the-art methods, achieving a 0.73 dB improvement in PSNR and substantial enhancements in downstream tasks such as semantic segmentation. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
中文摘要:本文提出DehazeMamba网络,通过渐进式雾霾解耦融合策略解决大范围光学遥感图像去雾难题,并发布MRSHaze基准数据集,实验证明其性能显著优于现有先进方法。
English Summary: This paper introduces DehazeMamba, a SAR-guided dehazing network using progressive haze decoupling fusion to address challenges in large-scale optical remote sensing image dehazing, along with the MRSHaze benchmark dataset, demonstrating superior performance over existing methods.
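The HPDM idea of locating haze via optical-SAR disagreement can be illustrated with a simple feature-difference map. The cosine-distance formulation and the threshold below are assumptions for the sketch, not the module's actual design.
```python
import torch
import torch.nn.functional as F

def haze_perception_map(optical_feat, sar_feat, tau=0.5):
    """Illustrative haze-region perception via optical-SAR difference analysis:
    where the (haze-free) SAR features disagree strongly with the optical
    features, the region is likely haze-affected and SAR guidance can be
    weighted more heavily. Inputs: (B, C, H, W) feature maps."""
    cos = F.cosine_similarity(optical_feat, sar_feat, dim=1)   # (B, H, W)
    haze_score = (1.0 - cos).clamp(0, 2) / 2.0   # 0 = consistent, 1 = very different
    haze_mask = (haze_score > tau).float()       # hard decision, if one is needed
    return haze_score, haze_mask
```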

Authors:Henghui Du, Guangyao Li, Chang Zhou, Chunjie Zhang, Alan Zhao, Di Hu
Title: Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Abstract:
In recent years, numerous tasks have been proposed to encourage models to develop specific capabilities in understanding audio-visual scenes, primarily categorized into temporal localization, spatial localization, spatio-temporal reasoning, and pixel-level understanding. In contrast, humans possess a unified understanding ability across diverse tasks. Therefore, designing an audio-visual model with the general capability to unify these tasks is of great value. However, simply joint training for all tasks can lead to interference due to the heterogeneity of audio-visual data and the complex relationships among tasks. We argue that this problem can be solved through explicit cooperation among tasks. To achieve this goal, we propose a unified learning method which achieves explicit inter-task cooperation from the perspectives of both data and model. Specifically, considering the labels of existing datasets are simple words, we carefully refine these datasets and construct an Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process (AV-UIE), which clarifies the cooperative relationship among tasks. Subsequently, to facilitate concrete cooperation in the learning stage, an interaction-aware LoRA structure with multiple LoRA heads is designed to learn different aspects of audio-visual data interaction. By unifying the explicit cooperation across the data and model aspects, our method not only surpasses existing unified audio-visual models on multiple tasks, but also outperforms most specialized models for certain tasks. Furthermore, we also visualize the process of explicit cooperation and surprisingly find that each LoRA head has certain audio-visual understanding ability. Code and dataset: https://github.com/GeWu-Lab/Crab
中文: 本文提出了一种统一学习方法,通过精炼数据集和交互感知的LoRA结构实现任务间显式协作,在多项视听任务上超越了现有模型。
English: This paper introduces a unified learning method that achieves explicit inter-task cooperation through a refined dataset and an interaction-aware LoRA structure, outperforming existing models across multiple audio-visual tasks.

Authors:Ruiqi Song, Xianda Guo, Hangbin Wu, Qinggong Wei, Long Chen
Title: InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving
Abstract:
Directly generating planning results from raw sensors has become increasingly prevalent due to its adaptability and robustness in complex scenarios. Scene representation, as a key module in the pipeline, has traditionally relied on conventional perception, which focuses on the global scene. However, in driving scenarios, human drivers typically focus only on regions that directly impact driving, which often coincide with those required for end-to-end autonomous driving. In this paper, a novel end-to-end autonomous driving method called InsightDrive is proposed, which organizes perception by language-guided scene representation. We introduce an instance-centric scene tokenizer that transforms the surrounding environment into map- and object-aware instance tokens. Scene attention language descriptions, which highlight key regions and obstacles affecting the ego vehicle's movement, are generated by a vision-language model that leverages the cognitive reasoning capabilities of foundation models. We then align scene descriptions with visual features using the vision-language model, guiding visual attention through these descriptions to yield an effective scene representation. Furthermore, we employ self-attention and cross-attention mechanisms to model the ego-agent and ego-map relationships, comprehensively building the topological relationships of the scene. Finally, based on scene understanding, we jointly perform motion prediction and planning. Extensive experiments on the widely used nuScenes benchmark demonstrate that the proposed InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving. The code is available at https://github.com/songruiqi/InsightDrive
中文摘要:InsightDrive提出了一种新颖的端到端自动驾驶方法,通过语言引导的场景表征和以实例为中心的标记来聚焦关键驾驶区域,借助集成运动预测与规划实现了最优性能。
English Summary: InsightDrive introduces a novel end-to-end autonomous driving method that uses language-guided scene representation and instance-centric tokens to focus on key driving regions, achieving state-of-the-art performance through integrated motion prediction and planning.

Authors:Gabriele Berton, Kevin Musgrave, Carlo Masone
Title: All You Need to Know About Training Image Retrieval Models
Abstract:
Image retrieval is the task of finding images in a database that are most similar to a given query image. The performance of an image retrieval pipeline depends on many training-time factors, including the embedding model architecture, loss function, data sampler, mining function, learning rate(s), and batch size. In this work, we run tens of thousands of training runs to understand the effect each of these factors has on retrieval accuracy. We also discover best practices that hold across multiple datasets. The code is available at https://github.com/gmberton/image-retrieval
中文: 图像检索性能受多种训练因素影响,本研究通过大量实验分析这些因素的作用并建立了跨数据集的最佳实践方案。
English: Image retrieval performance is influenced by multiple training factors, and this study conducts extensive experiments to analyze their effects and establish cross-dataset best practices.

Authors:Tao Wang, Changxu Cheng, Lingfeng Wang, Senda Chen, Wuyue Zhao
Title: HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Abstract:
The remarkable performance of large multimodal models (LMMs) has attracted significant interest from the image segmentation community. To align with the next-token-prediction paradigm, current LMM-driven segmentation methods either use object boundary points to represent masks or introduce special segmentation tokens, whose hidden states are decoded by a segmentation model requiring the original image as input. However, these approaches often suffer from inadequate mask representation and complex architectures, limiting the potential of LMMs. In this work, we propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens and eliminates the need for the original image during mask de-tokenization. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the LLM next-token-prediction paradigm and facilitating the direct acquisition of segmentation capabilities. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning. Additionally, we enable bidirectional information flow, allowing conversion between bounding boxes and mask tokens to fully leverage multi-task training potential. Extensive experiments demonstrate that our method achieves state-of-the-art performance across various segmentation tasks, while also enhancing visual grounding and maintaining overall visual understanding.
中文: 本文提出的HiMTok分层掩码标记器使用最多32个标记表示分割掩码,无需原始图像即可解码,实现了与LLM范式对齐的紧凑型由粗到精掩码表示,在取得最先进分割性能的同时增强了视觉定位能力。
English: This paper introduces HiMTok, a hierarchical mask tokenizer that represents segmentation masks with up to 32 tokens, eliminating the need for original images during decoding and enabling compact coarse-to-fine mask representation aligned with LLM paradigms, achieving state-of-the-art segmentation performance while enhancing visual grounding.

Authors:Xingguo Lv, Xingbo Dong, Liwen Wang, Jiewen Yang, Lei Zhao, Bin Pu, Zhe Jin, Xuejun Li
Title: Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
Abstract:
Although domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at https://github.com/Yore0/TTDG-MGM.
English Summary: The proposed framework integrates morphological priors through multi-graph matching and learnable universe embeddings to enhance test-time adaptation for medical image segmentation, effectively addressing domain shifts and outperforming existing methods on benchmark tasks.

Authors:Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Qizhe Zhang, Yulin Luo, Hao Liang, Wentao Zhang
Title: Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization
Abstract:
Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for fine-tuning. To reveal the relationship between sample and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at https://github.com/zengkaiya/CaT.
Chinese: 本研究提出概念树框架,通过生成高质量正负样本增强视觉语言模型的个性化能力,以可控合成数据管道有效解决样本稀缺和质量问题。
English: This study introduces Concept-as-Tree (CaT), a framework that generates high-quality positive and negative samples to enhance Vision-Language Model personalization, effectively addressing data scarcity and quality issues through a controllable synthetic data pipeline.

Authors:Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Ding Wang, Botian Shi
Title: Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Abstract:
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we developed a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKGs construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at https://github.com/Wings-Of-Disaster/VaLiK.
中文:VaLiK框架通过跨模态对齐和噪声过滤构建多模态知识图谱,无需人工标注即可增强大语言模型的多模态推理能力,实现更优性能与存储效率。
English: The VaLiK framework enhances multimodal reasoning in LLMs by constructing MMKGs through cross-modal alignment and noise filtering, achieving superior performance and storage efficiency without manual annotations.
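The cross-modal similarity verification step amounts to keeping only those generated descriptions whose embeddings stay close to the image embedding. A minimal sketch assuming CLIP-style unit-normalized embeddings; the threshold value and function names are assumptions, not the paper's exact mechanism.
```python
import torch
import torch.nn.functional as F

def filter_descriptions(image_embs, text_embs, descriptions, sim_threshold=0.25):
    """Illustrative cross-modal similarity verification: drop a generated
    description when its embedding drifts too far from the image embedding.
    image_embs, text_embs: (N, D) paired embeddings; descriptions: N strings."""
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    sims = (img * txt).sum(dim=-1)            # cosine similarity per image-text pair
    keep = sims >= sim_threshold
    return [d for d, k in zip(descriptions, keep.tolist()) if k]
```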

Authors:Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
Title: Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
Abstract:
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. Our code is available at https://github.com/chaolongy/KDTalker.
中文摘要:KDTalker首次将无监督隐式3D关键点与时空扩散模型相结合,能生成唇部同步精准、头部姿态多样且计算高效的高质量动态人像。
English Summary: KDTalker introduces a novel framework combining unsupervised implicit 3D keypoints with a spatiotemporal diffusion model to generate high-quality talking portraits with accurate lip synchronization, diverse head poses, and enhanced computational efficiency.

Authors:Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, Shiguang Shan
Title: HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
Abstract:
We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models. The codes and data: https://github.com/ZJHTerry18/HumanInScene.
中文摘要:本文提出了人类场景问答(HIS-QA)新任务以评估具身智能体在三维场景中对人类行为的理解能力,并开发了首个基础模型HIS-GPT,通过融合三维场景上下文与人体运动动态实现了最先进的性能表现。
English Summary: This paper introduces Human-In-Scene Question Answering (HIS-QA), a novel task for evaluating embodied agents' understanding of human behavior in 3D environments, and proposes HIS-GPT, a foundation model that achieves state-of-the-art performance by integrating 3D scene context with human motion dynamics.

Authors:Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel
Title: Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction
Abstract:
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at https://github.com/Cuberick-Orion/FCA .
中文: 本文提出帧级条件自适应(FCA)方法,通过生成帧级文本嵌入来微调文生视频模型,有效提升文本视频预测任务中的运动连续性,实现了最先进的性能。
English: This paper introduces Frame-wise Conditioning Adaptation (FCA), a novel method that fine-tunes text-to-video models by generating frame-wise text embeddings to enhance motion continuity in text-video prediction tasks, achieving state-of-the-art results.

Authors:Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Shaohui Jiao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng
Title: AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction
Abstract:
Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.
中文摘要:提出的AR-1-to-3方法通过渐进式视图合成和专门设计的编码策略,相比现有方法显著提升了视图一致性并生成高质量3D资源。
English Summary: The proposed AR-1-to-3 method progressively synthesizes novel views using diffusion models and specialized encoding strategies, significantly improving view consistency and 3D asset quality compared to existing approaches.

Authors:Huangwei Chen, Yifei Chen, Zhenyu Yan, Mingyang Ding, Chenlei Li, Zhu Zhu, Feiwei Qin
Title: MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation
Abstract:
Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. The approach follows a two-stage process. First, we fine-tune a Vision-Language Model (VLM) to enhance pathology-aware text generation. Second, the fine-tuned VLM generates textual descriptions, using a dual-branch architecture to independently extract visual and textual features. These features are fused via Progressive Robust Multi-Modal Fusion (PRMF) Block for stable training. Experimental results show that the MMLNB model is more accurate than the single modal model. Ablation studies demonstrate the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in NB subtyping classification. Our source code is available at https://github.com/HovChen/MMLNB.
中文:MMLNB模型通过双分支架构和渐进式融合,整合病理图像与生成文本描述,显著提高了神经母细胞瘤分型的准确性和可解释性,优于单模态方法。
English: The MMLNB model integrates pathological images with generated text descriptions using a dual-branch architecture and progressive fusion to enhance accuracy and interpretability in neuroblastoma subtyping, outperforming single-modal approaches.

Authors:Zhuoqun Su, Huimin Lu, Shuaifeng Jiao, Junhao Xiao, Yaonan Wang, Xieyuanli Chen
Title: Efficient Multimodal 3D Object Detector via Instance-Level Contrastive Distillation
Abstract:
Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, poses significant challenges. Meanwhile, the large size of the detection-oriented feature also constrains existing fusion strategies to capture long-range dependencies for the 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving superior efficiency. The implementation of our method has been released as open-source at: https://github.com/nubot-nudt/ICD-Fusion.
中文摘要:本文提出了一种快速高效的多模态3D目标检测器,通过实例级对比蒸馏框架和交叉线性注意力融合模块解决模态差异并增强全局特征交互,在基准测试中实现最优性能且已开源。
English Summary: This paper introduces a fast and effective multimodal 3D object detector using an Instance-level Contrastive Distillation framework and Cross Linear Attention Fusion Module to address cross-modal heterogeneity and enhance global feature interactions, achieving state-of-the-art performance on benchmarks with open-source implementation.
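Instance-level contrastive distillation typically aligns the image and LiDAR features of the same object while pushing apart features of different objects. The sketch below uses a standard symmetric InfoNCE form as a stand-in; the temperature and the exact loss shape are assumptions, not necessarily the paper's formulation.
```python
import torch
import torch.nn.functional as F

def instance_contrastive_distillation(img_inst_feats, lidar_inst_feats, temperature=0.07):
    """Illustrative instance-level contrastive distillation: row i of each tensor
    is the same object, so the diagonal pairs are positives and all other pairs
    in the batch are negatives. Inputs: (N, D) per-instance features."""
    img = F.normalize(img_inst_feats, dim=-1)
    pts = F.normalize(lidar_inst_feats, dim=-1)
    logits = img @ pts.t() / temperature                       # (N, N) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric InfoNCE over both alignment directions (image->LiDAR, LiDAR->image)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```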

Authors:Siyuan Yao, Yang Guo, Yanyang Yan, Wenqi Ren, Xiaochun Cao
Title: UncTrack: Reliable Visual Object Tracking with Uncertainty-Aware Prototype Memory Network
Abstract:
Transformer-based trackers have achieved promising success and become the dominant tracking paradigm due to their accuracy and efficiency. Despite the substantial progress, most of the existing approaches tackle object tracking as a deterministic coordinate regression problem, while the target localization uncertainty has been greatly overlooked, which hampers trackers' ability to maintain reliable target state prediction in challenging scenarios. To address this issue, we propose UncTrack, a novel uncertainty-aware transformer tracker that predicts the target localization uncertainty and incorporates this uncertainty information for accurate target state inference. Specifically, UncTrack utilizes a transformer encoder to perform feature interaction between template and search images. The output features are passed into an uncertainty-aware localization decoder (ULD) to coarsely predict the corner-based localization and the corresponding localization uncertainty. Then the localization uncertainty is sent into a prototype memory network (PMN) to excavate valuable historical information to identify whether the target state prediction is reliable or not. To enhance the template representation, the samples with high confidence are fed back into the prototype memory bank for memory updating, making the tracker more robust to challenging appearance variations. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. Our code is available at https://github.com/ManOfStory/UncTrack.
中文: 基于Transformer的跟踪器虽在精度和效率上表现出色,但常忽略目标定位的不确定性,因此UncTrack提出一种不确定性感知方法,通过预测并利用该不确定性来实现更可靠的状态推断,并在复杂场景中展现更强的鲁棒性。
English: Transformer-based trackers excel in accuracy and efficiency but often overlook target localization uncertainty, so UncTrack introduces an uncertainty-aware approach that predicts and utilizes this uncertainty for more reliable state inference and robust performance in challenging scenarios.

Authors:Linzhou Li, Yumeng Li, Yanlin Weng, Youyi Zheng, Kun Zhou
Title: RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars
Abstract:
We present Reduced Gaussian Blendshapes Avatar (RGBAvatar), a method for reconstructing photorealistic, animatable head avatars at speeds sufficient for on-the-fly reconstruction. Unlike prior approaches that utilize linear bases from 3D morphable models (3DMM) to model Gaussian blendshapes, our method maps tracked 3DMM parameters into reduced blendshape weights with an MLP, leading to a compact set of blendshape bases. The learned compact base composition effectively captures essential facial details for specific individuals, and does not rely on the fixed base composition weights of 3DMM, leading to enhanced reconstruction quality and higher efficiency. To further expedite the reconstruction process, we develop a novel color initialization estimation method and a batch-parallel Gaussian rasterization process, achieving state-of-the-art quality with training throughput of about 630 images per second. Moreover, we propose a local-global sampling strategy that enables direct on-the-fly reconstruction, immediately reconstructing the model as video streams in real time while achieving quality comparable to offline settings. Our source code is available at https://github.com/gapszju/RGBAvatar.
中文: RGBAvatar提出了一种实时重建逼真头部化身的方法,通过MLP将3DMM参数映射为紧凑形变基,结合创新的初始化与光栅化技术,实现了高效高质量的实时重建。
English: RGBAvatar introduces a method for real-time photorealistic head avatar reconstruction by mapping 3DMM parameters to compact blendshape bases via MLP, achieving high efficiency and quality with innovative initialization and rasterization techniques.
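The reduced-blendshape idea, an MLP that maps tracked 3DMM parameters to a small set of weights over learned per-Gaussian bases, can be sketched as below. All dimensions, the two-layer MLP, and restricting the bases to position offsets are illustrative assumptions; the actual model handles full Gaussian attributes.
```python
import torch
import torch.nn as nn

class ReducedBlendshapeHead(nn.Module):
    """Illustrative reduced-blendshape head: 3DMM parameters -> compact
    blendshape weights -> linear combination of per-Gaussian offsets."""

    def __init__(self, n_3dmm_params=64, n_bases=20, n_gaussians=50_000):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_3dmm_params, 128), nn.ReLU(),
            nn.Linear(128, n_bases),
        )
        # Neutral Gaussian positions and per-basis offsets (both learned)
        self.neutral = nn.Parameter(torch.zeros(n_gaussians, 3))
        self.bases = nn.Parameter(torch.zeros(n_bases, n_gaussians, 3))

    def forward(self, params_3dmm):              # (B, n_3dmm_params)
        w = self.mlp(params_3dmm)                 # (B, n_bases) reduced weights
        offsets = torch.einsum("bk,kng->bng", w, self.bases)
        return self.neutral + offsets             # per-frame Gaussian positions
```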

Authors:Dewei Zhou, Mingwei Li, Zongxin Yang, Yi Yang
Title: DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
Abstract:
Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and enhances the performance of layout-to-image models like GLIGEN and 3DIS by up to 26.8%. Project Page: https://limuloo.github.io/DreamRenderer/.
中文:DreamRenderer提出了一种无需训练的方法,通过桥接图像令牌和针对性硬属性绑定技术,有效提升了图像合成中对多实例内容的精确控制能力,显著优于现有模型的性能表现。
English: DreamRenderer introduces a training-free method that enhances multi-instance content control in image synthesis by using bridge image tokens and targeted hard attribute binding, significantly improving precision and performance over existing models.

Authors:Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, Xiaochun Cao
Title: Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models
Abstract:
Large pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive generalization but remain highly vulnerable to adversarial examples (AEs). Previous work has explored robust text prompts through adversarial training, achieving some improvement in both robustness and generalization. However, they primarily rely on single-gradient direction perturbations (e.g., PGD) to generate AEs, which lack diversity, resulting in limited improvement in adversarial robustness. To address these limitations, we propose an evolution-based region adversarial prompt tuning method called ER-APT, which combines gradient methods with genetic evolution to generate more diverse and challenging AEs. In each training iteration, we first generate AEs using traditional gradient-based methods. Subsequently, a genetic evolution mechanism incorporating selection, mutation, and crossover is applied to optimize the AEs, ensuring a broader and more aggressive perturbation distribution. The final evolved AEs are used for prompt tuning, achieving region-based adversarial optimization instead of conventional single-point adversarial prompt tuning. We also propose a dynamic loss weighting method to adjust prompt learning efficiency for accuracy and robustness. Experimental evaluations on various benchmark datasets demonstrate the superiority of our proposed method, outperforming state-of-the-art APT methods. The code is released at https://github.com/jiaxiaojunQAQ/ER-APT.
中文: 本文提出ER-APT方法,通过结合梯度技术与遗传进化生成更多样化的对抗样本,实现了基于区域的对抗提示优化,在多个基准测试中显著提升了对抗鲁棒性并优于现有先进方法。
English: This paper introduces ER-APT, an evolution-based region adversarial prompt tuning method that combines gradient techniques with genetic evolution to generate more diverse adversarial examples, significantly enhancing adversarial robustness and outperforming existing methods across multiple benchmarks.
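A toy sketch of the evolutionary refinement step under simple assumptions (numpy perturbations, a user-supplied loss as fitness; not the released ER-APT code): selection keeps the most aggressive perturbations, crossover mixes parents, mutation adds noise, and everything is projected back into the L-infinity ball.

import numpy as np

def evolve_perturbations(deltas, loss_fn, eps=8/255, n_iters=5, mut_std=1/255, rng=None):
    # Illustrative sketch only; hyperparameters are assumptions.
    rng = rng or np.random.default_rng(0)
    pop = np.clip(deltas, -eps, eps)
    for _ in range(n_iters):
        fitness = np.array([loss_fn(d) for d in pop])       # higher loss = more aggressive AE
        order = np.argsort(-fitness)
        parents = pop[order[: len(pop) // 2]]                # selection: keep the strongest half
        i = rng.integers(0, len(parents), size=(len(pop), 2))
        alpha = rng.random((len(pop),) + (1,) * (pop.ndim - 1))
        children = alpha * parents[i[:, 0]] + (1 - alpha) * parents[i[:, 1]]   # crossover
        children += rng.normal(0.0, mut_std, size=children.shape)              # mutation
        pop = np.clip(children, -eps, eps)                                     # stay in the ball
    return pop

# usage with a dummy fitness that prefers large positive perturbation mass
deltas0 = np.zeros((8, 3, 32, 32))
evolved = evolve_perturbations(deltas0, loss_fn=lambda d: float(d.sum()))
print(evolved.shape)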

Authors:Chenyu Zhang, Kunlun Xu, Zichen Liu, Yuxin Peng, Jiahuan Zhou
Title: SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting
Abstract:
Vision-language models (VLMs) encounter considerable challenges when adapting to domain shifts stemming from changes in data distribution. Test-time adaptation (TTA) has emerged as a promising approach to enhance VLM performance under such conditions. In practice, test data often arrives in batches, leading to increasing interest in the transductive TTA setting. However, existing TTA methods primarily focus on individual test samples, overlooking crucial cross-sample correlations within a batch. While recent ViT-based TTA methods have introduced batch-level adaptation, they remain suboptimal for VLMs due to inadequate integration of the text modality. To address these limitations, we propose a novel transductive TTA framework, Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. SCAP first forms supportive cliques of test samples in an unsupervised manner based on visual similarity and learns an attribute prompt for each clique, capturing shared attributes critical for adaptation. For each test sample, SCAP aggregates attribute prompts from its associated cliques, providing enriched contextual information. To ensure adaptability over time, we incorporate a retention module that dynamically updates attribute prompts and their associated attributes as new data arrives. Comprehensive experiments across multiple benchmarks demonstrate that SCAP outperforms existing state-of-the-art methods, significantly advancing VLM generalization under domain shifts. Our code is available at https://github.com/zhoujiahuan1991/CVPR2025-SCAP.
Chinese: SCAP提出了一种新颖的归纳式测试时适应框架,通过基于视觉相似性构建支持性样本团并生成细粒度属性提示,有效整合跨测试批次的视觉与文本信息,显著提升了视觉语言模型在领域偏移下的泛化性能。
English: SCAP introduces a novel transductive test-time adaptation framework that leverages supportive cliques and fine-grained attribute prompts to enhance vision-language model performance under domain shifts by effectively integrating visual and textual information across test batches.
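A simplified illustration of the clique-forming step (the threshold and grouping rule are assumptions; the actual SCAP construction may differ): test samples are grouped so that every pair inside a group exceeds a cosine-similarity threshold, and one attribute prompt would then be learned per group.

import numpy as np

def supportive_groups(feats, tau=0.8):
    # Illustrative grouping heuristic, not the paper's algorithm.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    unassigned, groups = set(range(len(f))), []
    while unassigned:
        seed = unassigned.pop()
        group = [seed]
        for j in sorted(unassigned):
            if all(sim[j, m] >= tau for m in group):   # require similarity to every member
                group.append(j)
        unassigned -= set(group)
        groups.append(group)
    return groups

feats = np.random.randn(16, 512)
print(supportive_groups(feats, tau=0.2))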

Authors:Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
Title: DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Abstract:
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08\% accuracy improvements on KVG-Bench and exhibiting +4.60\% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at https://github.com/thunlp/DeepPerception.
Chinese Summary: 本文提出DeepPerception模型,通过知识密集型视觉定位任务,结合认知推理框架与强化学习,显著提升了多模态大语言模型在细粒度视觉感知与领域知识融合方面的性能。
English Summary: This paper introduces DeepPerception, a Multimodal Large Language Model enhanced with cognitive visual perception to bridge the gap between expert knowledge and fine-grained visual discrimination through knowledge-intensive visual grounding.

Authors:Jianan Li, Huan Chen, Wangcai Zhao, Rui Chen, Tingfa Xu
Title: Mixed-granularity Implicit Representation for Continuous Hyperspectral Compressive Reconstruction
Abstract:
Hyperspectral Images (HSIs) are crucial across numerous fields but are hindered by the long acquisition times associated with traditional spectrometers. The Coded Aperture Snapshot Spectral Imaging (CASSI) system mitigates this issue through a compression technique that accelerates the acquisition process. However, reconstructing HSIs from compressed data presents challenges due to fixed spatial and spectral resolution constraints. This study introduces a novel method using implicit neural representation for continuous hyperspectral image reconstruction. We propose the Mixed Granularity Implicit Representation (MGIR) framework, which includes a Hierarchical Spectral-Spatial Implicit Encoder for efficient multi-scale implicit feature extraction. This is complemented by a Mixed-Granularity Local Feature Aggregator that adaptively integrates local features across scales, combined with a decoder that merges coordinate information for precise reconstruction. By leveraging implicit neural representations, the MGIR framework enables reconstruction at any desired spatial-spectral resolution, significantly enhancing the flexibility and adaptability of the CASSI system. Extensive experimental evaluations confirm that our model produces reconstructed images at arbitrary resolutions and matches state-of-the-art methods across varying spectral-spatial compression ratios. The code will be released at https://github.com/chh11/MGIR.
中文摘要:本研究提出混合粒度隐式表示(MGIR)框架,利用隐式神经表示实现任意空间-光谱分辨率的连续高光谱图像重建,克服了传统CASSI系统的固定分辨率限制,并在不同压缩比下保持了最先进的性能表现。
English Summary: This study introduces the Mixed Granularity Implicit Representation (MGIR) framework using implicit neural representation to enable continuous hyperspectral image reconstruction at arbitrary spatial-spectral resolutions, overcoming the fixed resolution limitations of traditional CASSI systems while maintaining state-of-the-art performance across various compression ratios.
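A minimal sketch of the implicit-decoder idea that enables arbitrary-resolution queries (layer sizes and the plain MLP are assumptions, not the MGIR architecture): a continuous (x, y, wavelength) coordinate plus an interpolated local feature vector is mapped to an intensity.

import torch, torch.nn as nn

class ImplicitHSIDecoder(nn.Module):
    """Illustrative sketch of a coordinate-conditioned decoder, not the paper's network."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords, feats):          # coords: (N, 3) in [-1, 1], feats: (N, feat_dim)
        return self.net(torch.cat([coords, feats], dim=-1)).squeeze(-1)

dec = ImplicitHSIDecoder()
coords = torch.rand(1024, 3) * 2 - 1           # arbitrary (x, y, lambda) sample positions
feats = torch.randn(1024, 64)                  # stand-in for interpolated encoder features
print(dec(coords, feats).shape)                # torch.Size([1024])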

Authors:Haoxiao Wang, Kaichen Zhou, Binrui Gu, Zhiyuan Feng, Weijie Wang, Peilin Sun, Yicheng Xiao, Jianhua Zhang, Hao Dong
Title: TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image
Abstract:
Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop scenarios. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we utilize an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines in both synthetic and real-world benchmarks with acceptable inference time. A demo of our method can be found at https://wang-haoxiao.github.io/TransDiff/

Authors:Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang
Title: NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Abstract:
Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at https://github.com/sungyeonparkk/NuPlanQA.
中文: 本文提出了用于驾驶场景理解的多模态评估基准NuPlanQA及其大规模数据集,并开发了集成鸟瞰特征的BEV-LLM模型,该模型在多数子任务中表现优异,同时揭示了该领域仍需解决的关键问题。
English: This paper introduces NuPlanQA, a multi-modal evaluation benchmark and dataset for driving scene understanding, and proposes BEV-LLM, a model integrating Bird's-Eye-View features that outperforms existing models in most subtasks while highlighting remaining challenges.

Authors:Kewei Sui, Anindita Ghosh, Inwoo Hwang, Bing Zhou, Jian Wang, Chuan Guo
Title: A Survey on Human Interaction Motion Generation
Abstract:
Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.
Chinese Summary: 本综述首次系统梳理了人体交互动作生成领域,涵盖基础概念、解决方案与数据集,重点分析人-人、人-物、人-场景三类交互任务及评估标准,并展望了未来研究方向。
English Summary: This survey provides a comprehensive overview of human interaction motion generation, covering foundational concepts, existing solutions, datasets, and evaluation metrics across human-human, human-object, and human-scene interactions, while identifying future research directions.

Authors:Zibin Liu, Banglei Guan, Yang Shang, Yifei Bian, Pengju Sun, Qifeng Yu
Title: Stereo Event-based, 6-DOF Pose Tracking for Uncooperative Spacecraft
Abstract:
Pose tracking of uncooperative spacecraft is an essential technology for space exploration and on-orbit servicing, which remains an open problem. Event cameras possess numerous advantages, such as high dynamic range, high temporal resolution, and low power consumption. These attributes hold the promise of overcoming challenges encountered by conventional cameras, including motion blur and extreme illumination, among others. To address the standard on-orbit observation missions, we propose a line-based pose tracking method for uncooperative spacecraft utilizing a stereo event camera. To begin with, we estimate the wireframe model of uncooperative spacecraft, leveraging the spatio-temporal consistency of stereo event streams for line-based reconstruction. Then, we develop an effective strategy to establish correspondences between events and projected lines of uncooperative spacecraft. Using these correspondences, we formulate the pose tracking as a continuous optimization process over 6-DOF motion parameters, achieved by minimizing event-line distances. Moreover, we construct a stereo event-based uncooperative spacecraft motion dataset, encompassing both simulated and real events. The proposed method is quantitatively evaluated through experiments conducted on our self-collected dataset, demonstrating an improvement in terms of effectiveness and accuracy over competing methods. The code will be open-sourced at https://github.com/Zibin6/SE6PT.
中文摘要:本研究提出了一种基于立体事件相机的非合作航天器线型姿态跟踪方法,通过连续优化六自由度运动参数,在精度和效果上优于现有方法。
English Summary: This study introduces a stereo event camera-based line pose tracking method for uncooperative spacecraft, which uses continuous optimization of 6-DOF motion parameters to achieve enhanced accuracy and effectiveness over existing approaches.
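The core residual being minimized can be pictured as a point-to-segment distance between event locations and projected wireframe lines; the sketch below (plain numpy, not the authors' optimizer) computes it for already-matched event/line pairs.

import numpy as np

def event_line_distances(events_uv, line_p0, line_p1):
    """Illustrative residual only. events_uv: (N, 2) event pixels;
    line_p0/line_p1: (N, 2) endpoints of the matched projected line."""
    d = line_p1 - line_p0
    t = np.sum((events_uv - line_p0) * d, axis=1) / np.maximum(np.sum(d * d, axis=1), 1e-9)
    t = np.clip(t, 0.0, 1.0)                          # clamp to the segment
    closest = line_p0 + t[:, None] * d
    return np.linalg.norm(events_uv - closest, axis=1)

ev = np.array([[10.0, 5.0], [3.0, 3.0]])
print(event_line_distances(ev, np.zeros((2, 2)), np.array([[10.0, 0.0], [0.0, 10.0]])))  # [5. 3.]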

Authors:Feng Qiao, Zhexiao Xiong, Eric Xing, Nathan Jacobs
Title: Towards Open-World Generation of Stereo Images and Unsupervised Matching
Abstract:
Stereo images are fundamental to numerous applications, including extended reality (XR) devices, autonomous driving, and robotics. Unfortunately, acquiring high-quality stereo images remains challenging due to the precise calibration requirements of dual-camera setups and the complexity of obtaining accurate, dense disparity maps. Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. The method includes two primary innovations: (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, allowing for more precise stereo alignment than previous methods, and (2) an adaptive fusion mechanism that intelligently combines the diffusion-generated image with a warped image, improving both realism and disparity consistency. Through extensive training on 11 diverse stereo datasets, GenStereo demonstrates strong generalization ability. GenStereo achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching tasks. Project page is available at https://qjizhi.github.io/genstereo.
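A rough sketch of the disparity-based warping that produces the conditioning image (it assumes rectified views with positive disparities shifting content leftward; occluded pixels are simply left empty, which the real pipeline handles differently):

import numpy as np

def warp_with_disparity(left, disparity):
    """Illustrative warp, not the paper's implementation.
    left: (H, W, 3) image; disparity: (H, W) in pixels. Returns a crude right-view warp."""
    H, W, _ = left.shape
    right = np.zeros_like(left)
    xs = np.arange(W)
    for y in range(H):
        xt = np.round(xs - disparity[y]).astype(int)     # right x = left x - disparity
        valid = (xt >= 0) & (xt < W)
        right[y, xt[valid]] = left[y, xs[valid]]
    return right

img = np.random.rand(4, 8, 3)
disp = np.full((4, 8), 2.0)
print(warp_with_disparity(img, disp).shape)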

Authors:Javier Tirado-Garín, Javier Civera
Title: AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
Abstract:
We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to: pinhole, Brown-Conrady and Kannala-Brandt. Our approach also applies to edited -- cropped and stretched -- images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. Code is available at https://github.com/javrtg/AnyCalib.
中文: AnyCalib是一种新颖方法,仅通过单张图像中的透视和畸变线索即可校准相机内参,无需依赖特定相机模型或外部参照,在多种相机模型上均优于现有方法且训练数据需求极少。
English: AnyCalib is a novel method that calibrates camera intrinsic parameters from a single image using inherent perspective and distortion cues, outperforming existing approaches across various camera models with minimal training data.
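A worked toy example of the closed-form recovery for the pinhole case only (the paper covers many more camera models): once a ray direction is known per pixel, u = fx*(X/Z) + cx and v = fy*(Y/Z) + cy reduce intrinsic estimation to linear least squares.

import numpy as np

def pinhole_from_rays(pixels, rays):
    # Illustrative pinhole-only solver, not the AnyCalib code.
    x, y = rays[:, 0] / rays[:, 2], rays[:, 1] / rays[:, 2]   # normalized image coordinates
    ones = np.ones_like(x)
    fx, cx = np.linalg.lstsq(np.stack([x, ones], 1), pixels[:, 0], rcond=None)[0]
    fy, cy = np.linalg.lstsq(np.stack([y, ones], 1), pixels[:, 1], rcond=None)[0]
    return fx, fy, cx, cy

# synthetic check: generate rays from known intrinsics and recover them
fx, fy, cx, cy = 500.0, 480.0, 320.0, 240.0
u, v = np.meshgrid(np.arange(0, 640, 40, dtype=float), np.arange(0, 480, 40, dtype=float))
u, v = u.ravel(), v.ravel()
rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=1)
print(pinhole_from_rays(np.stack([u, v], 1), rays))    # ~ (500, 480, 320, 240)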

Authors:Imran Kabir, Md Alimoor Reza, Syed Billah
Title: Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding
Abstract:
Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at https://github.com/Imran2205/LogicRAG.
Chinese: Logic-RAG是一种新颖的检索增强生成框架,通过一阶逻辑构建动态知识库,显著提升多模态大模型在自动驾驶中的空间推理能力,有效解决了现有模型在细粒度空间理解上的不足。
English: Logic-RAG is a novel Retrieval-Augmented Generation framework that enhances large multimodal models' spatial reasoning in autonomous driving by constructing a dynamic knowledge base with first-order logic, significantly improving accuracy on visual-spatial queries.
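A hypothetical miniature of the idea (the predicates and rules here are invented for illustration, not the released KB schema): perception facts are stored as first-order-logic-style triples, and simple rules derive answers that are then handed to the LMM as grounded context.

# Illustrative sketch of a tiny FOL-style fact base with two inverse rules.
facts = {("left_of", "cyclist", "truck"), ("in_front_of", "truck", "ego")}

def holds(pred, a, b):
    if (pred, a, b) in facts:
        return True
    if pred == "right_of":                      # rule: right_of(x, y) <- left_of(y, x)
        return ("left_of", b, a) in facts
    if pred == "behind":                        # rule: behind(x, y) <- in_front_of(y, x)
        return ("in_front_of", b, a) in facts
    return False

print(holds("right_of", "truck", "cyclist"))    # True, derived from left_of(cyclist, truck)
print(holds("behind", "ego", "truck"))          # True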

Authors:Vrushank Ahire, Kunal Shah, Mudasir Nazir Khan, Nikhil Pakhale, Lownish Rai Sookha, M. A. Ganaie, Abhinav Dhall
Title: MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network
Abstract:
Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal but often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, predicting emotions in polar coordinates following Russell's circumplex model. Evaluation on the Aff-Wild2 dataset shows that MAVEN achieves a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline's CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW
Chinese: 提出的MAVEN模型通过跨模态注意力机制整合多模态线索,在Aff-Wild2数据集上实现了优于传统方法的动态情绪识别性能。
English: The proposed MAVEN model enhances dynamic emotion recognition in the wild by integrating multi-modal cues through a cross-modal attention mechanism, achieving superior performance on the Aff-Wild2 dataset compared to traditional methods.
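A small helper sketch for the polar-coordinate output and the reported metric (the angle convention is an assumption): polar predictions are converted to valence/arousal on Russell's circumplex, and CCC is the concordance correlation coefficient quoted in the abstract.

import numpy as np

def polar_to_va(radius, angle):
    # Assumed convention: angle 0 = positive valence axis, pi/2 = high arousal axis.
    return radius * np.cos(angle), radius * np.sin(angle)   # (valence, arousal)

def ccc(pred, target):
    # Concordance correlation coefficient with population statistics.
    pm, tm = pred.mean(), target.mean()
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pred.var() + target.var() + (pm - tm) ** 2)

v, a = polar_to_va(np.array([0.5, 0.9]), np.array([0.0, np.pi / 2]))
print(v, a)                                    # valence [0.5, ~0], arousal [0, 0.9]
print(ccc(np.array([0.1, 0.4, 0.8]), np.array([0.0, 0.5, 0.9])))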

Authors:Yitian Shi, Di Wen, Guanqi Chen, Edgar Welte, Sheng Liu, Kunyu Peng, Rainer Stiefelhagen, Rania Rayyes
Title: VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility
Abstract:
We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of $87.5\%$ in target-oriented grasping with the fewest grasp attempts outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints. Code is available at: https://github.com/YitianShi/vMF-Contact
中文: VISO-Grasp提出了一种新颖的视觉语言系统,利用基础模型进行空间推理和主动视角规划,在严重遮挡环境下通过实时优化抓取策略实现了87.5%的成功率。
English: VISO-Grasp introduces a vision-language system using Foundation Models for spatial reasoning and active view planning to achieve robust grasping in highly occluded scenes, achieving an 87.5% success rate with optimized strategies.

Authors:Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, Hao Fei
Title: Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Abstract:
By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further attention to ensure the field continues to thrive, yet an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.
中文摘要:多模态思维链(MCoT)推理将分步推理扩展至多种数据类型,在各类应用中取得显著成果,但仍需系统性综述以应对当前挑战并推动多模态通用人工智能的发展。
English Summary: Multimodal chain-of-thought (MCoT) reasoning extends step-by-step reasoning to diverse data types, achieving success in various applications while facing challenges that require systematic review and future innovation toward multimodal AGI.

Authors:Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng
Title: Personalize Anything for Free with Diffusion Transformer
Abstract:
Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibiting higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose Personalize Anything, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.
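A conceptual sketch of timestep-adaptive token replacement (tensor names and the switch threshold are assumptions, not the released code): early denoising steps hard-inject the reference subject's tokens inside the subject mask, while later steps leave them free so the layout can still adapt.

import torch

def timestep_adaptive_replace(gen_tokens, ref_tokens, subject_mask, t, t_switch=0.7):
    """Illustrative sketch. gen_tokens/ref_tokens: (N, D) DiT token sequences;
    subject_mask: (N,) bool; t in [0, 1] is normalized denoising progress (1 = pure noise)."""
    if t > t_switch:                                   # early stage: hard injection
        gen_tokens = torch.where(subject_mask[:, None], ref_tokens, gen_tokens)
    return gen_tokens

tokens = torch.randn(16, 8)
ref = torch.zeros(16, 8)
mask = torch.zeros(16, dtype=torch.bool); mask[:4] = True
print(timestep_adaptive_replace(tokens, ref, mask, t=0.9)[:4].abs().sum())   # 0 -> replaced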

Authors:Xiaoyu Han, Shengping Zhang, Qinglin Liu, Zonglin Li, Chenyang Wang
Title: Progressive Limb-Aware Virtual Try-On
Abstract:
Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (i.e., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art virtual try-on methods both qualitatively and quantitatively. The code is available at https://github.com/xyhanHIT/PL-VTON.
中文: 现有虚拟试衣方法因未利用服装属性优化几何纹理且忽略肢体区域预测,导致衣物模糊和肢体不准确,而PL-VTON通过多属性渐进式变形和显式肢体特征融合,实现了更高质量的逼真试衣效果。
English: Current virtual try-on methods often produce incomplete and blurred clothing appearances due to neglecting clothing attributes for refinement and inaccurate limb predictions when switching sleeve lengths, which PL-VTON addresses through progressive pixel-level warping and limb-aware feature integration for superior photo-realistic results.

Authors:Ruopeng Gao, Yuyao Wang, Chunxu Liu, Limin Wang
Title: History-Aware Transformation of ReID Features for Multiple Object Tracking
Abstract:
The aim of multiple object tracking (MOT) is to detect all objects in a video and bind them into multiple trajectories. Generally, this process is carried out in two steps: detecting objects and associating them across frames based on various cues and metrics. Many studies and applications adopt object appearance, also known as re-identification (ReID) features, for target matching through straightforward similarity calculation. However, we argue that this practice is overly naive and thus overlooks the unique characteristics of MOT tasks. Unlike regular re-identification tasks that strive to distinguish all potential targets in a general representation, multi-object tracking typically immerses itself in differentiating similar targets within the same video sequence. Therefore, we believe that seeking a more suitable feature representation space based on the different sample distributions of each sequence will enhance tracking performance. In this paper, we propose using history-aware transformations on ReID features to achieve more discriminative appearance representations. Specifically, we treat historical trajectory features as conditions and employ a tailored Fisher Linear Discriminant (FLD) to find a spatial projection matrix that maximizes the differentiation between different trajectories. Our extensive experiments reveal that this training-free projection can significantly boost feature-only trackers to achieve competitive, even superior tracking performance compared to state-of-the-art methods while also demonstrating impressive zero-shot transfer capabilities. This demonstrates the effectiveness of our proposal and further encourages future investigation into the importance and customization of ReID models in multiple object tracking. The code will be released at https://github.com/HELLORPG/HATReID-MOT.
中文: 本文提出了一种多目标跟踪中重识别特征的历史感知变换方法,通过定制化的Fisher线性判别分析增强轨迹间的特征区分度,无需训练即可实现具有竞争力的跟踪性能。
English: This paper proposes a history-aware transformation method for re-identification features in multiple object tracking, using a tailored Fisher Linear Discriminant to enhance feature discrimination between trajectories and achieve competitive tracking performance without training.
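A standalone sketch of the underlying Fisher Linear Discriminant projection (the regularizer and output dimensionality are assumptions): historical trajectory features act as labeled classes, and new detections are projected into the space that best separates those trajectories before similarity matching.

import numpy as np

def fld_projection(traj_feats, traj_ids, n_dims=8, reg=1e-3):
    # Illustrative, training-free FLD fit over per-trajectory feature history.
    mu = traj_feats.mean(0)
    d = traj_feats.shape[1]
    Sw, Sb = np.eye(d) * reg, np.zeros((d, d))
    for tid in np.unique(traj_ids):
        X = traj_feats[traj_ids == tid]
        mu_c = X.mean(0)
        Sw += (X - mu_c).T @ (X - mu_c)                  # within-trajectory scatter
        Sb += len(X) * np.outer(mu_c - mu, mu_c - mu)    # between-trajectory scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:n_dims]]                  # columns = projection directions

hist = np.random.randn(60, 32)
ids = np.repeat(np.arange(6), 10)
W = fld_projection(hist, ids)
projected_detections = np.random.randn(5, 32) @ W        # match detections in this space
print(projected_detections.shape)                        # (5, 8)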

Authors:Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Title: AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at https://github.com/SCZwangxiao/video-FlexReduc.git.
Chinese: AdaReTaKe是一种无需训练的方法,通过自适应地减少时间和模型层间的视觉冗余,使多模态大语言模型能够处理多达2048帧的视频,并在基准数据集上以高达6.0%的优势超越现有方法。
English: AdaReTaKe is a training-free method that adaptively reduces visual redundancy across time and model layers, enabling multimodal large language models to process up to 2048 frames while outperforming existing methods by up to 6.0% on benchmark datasets.

Authors:Weiguang Zhao, Rui Zhang, Qiufeng Wang, Guangliang Cheng, Kaizhu Huang
Title: BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis
Abstract:
3D semantic segmentation plays a fundamental and crucial role in understanding 3D scenes. While contemporary state-of-the-art techniques predominantly concentrate on elevating the overall performance of 3D semantic segmentation based on general metrics (e.g. mIoU, mAcc, and oAcc), they unfortunately leave the exploration of challenging regions for segmentation mostly neglected. In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. Concretely, we have delineated 3D semantic segmentation errors into four comprehensive categories as well as corresponding evaluation metrics tailored to each. Building upon this categorical framework, we introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features. First, we design the boundary-semantic module to decouple point cloud features into semantic and boundary features, and fuse their query queue to enhance semantic features with attention. Second, we introduce a more concise and accelerated boundary pseudo-label calculation algorithm, which is 3.9 times faster than the state-of-the-art, offering compatibility with data augmentation and enabling efficient computation in training. Extensive experiments on benchmark data indicate the superiority of our BFANet model, confirming the significance of emphasizing the four uniquely designed metrics. Code is available at https://github.com/weiguangzhao/BFANet.
中文: 本文提出BFANet这一新型3D语义分割网络,通过分析语义边界特征并设计四种针对性误差指标,有效解决了以往被忽视的困难区域分割问题,实验证明其优越性能。
English: This paper introduces BFANet, a novel 3D semantic segmentation network that addresses overlooked challenging regions by analyzing semantic boundary features and proposing four new error-specific metrics, demonstrating superior performance through extensive experiments.
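A simple baseline sketch of boundary pseudo-labeling (not the accelerated algorithm from the paper): a point is flagged as a boundary when any neighbor within a radius carries a different semantic label.

import numpy as np
from scipy.spatial import cKDTree

def boundary_pseudo_labels(points, labels, radius=0.05):
    # Illustrative radius-neighbor test; the paper's version is faster and augmentation-friendly.
    tree = cKDTree(points)
    boundary = np.zeros(len(points), dtype=bool)
    for i, nbrs in enumerate(tree.query_ball_point(points, r=radius)):
        boundary[i] = np.any(labels[nbrs] != labels[i])
    return boundary

pts = np.random.rand(1000, 3)
lbl = (pts[:, 0] > 0.5).astype(int)              # two half-spaces -> boundary near x = 0.5
print(boundary_pseudo_labels(pts, lbl).sum())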

Authors:Guibiao Liao, Qing Li, Zhenyu Bao, Guoping Qiu, Kanglin Liu
Title: SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs
Abstract:
3D Gaussian Splatting-based indoor open-world free-view synthesis approaches have shown significant performance with dense input images. However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To relieve these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free view synthesis with sparse inputs. Specifically, SGI provides a dense, scene-layout-based Gaussian distribution by utilizing view-changed images generated from the video generation model and view-constraint Gaussian points densification. Additionally, SPC mitigates limited view supervision by employing semantic-prompt-based consistency constraints developed by SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC-GS across Replica and ScanNet benchmarks. Notably, our SPC-GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open-world semantic segmentation.

Authors:Fanbin Lu, Zhisheng Zhong, Ziqin Wei, Shu Liu, Chi-Wing Fu, Jiaya Jia
Title: STEVE: A Step Verification Pipeline for Computer-use Agent Training
Abstract:
Developing AI agents to autonomously manipulate graphical user interfaces is a long-standing, challenging task. Recent advances in data scaling laws inspire us to train computer-use agents with a scaled instruction set, yet using behavior cloning to train agents still requires immense high-quality trajectories. To meet the scalability need, we designed STEVE, a step verification pipeline for computer-use agent training. First, we establish a large instruction set for computer-use agents and collect trajectory data with some suboptimal agents. GPT-4o is used to verify the correctness of each step in the trajectories based on the screens before and after the action execution, assigning each step a binary label. Finally, we adopt Kahneman-Tversky Optimization to optimize the agent from the binary stepwise labels. Extensive experiments demonstrate that our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory. Also, STEVE enables us to train a 7B vision-language model as a computer-use agent, achieving leading performance in the challenging live desktop environment WinAgentArena with great efficiency at a reduced cost. Code and data: https://github.com/FanbinLu/STEVE.
中文: 为解决图形用户界面操作中AI代理训练的可扩展性难题,STEVE流程采用GPT-4o进行轨迹步骤验证,并通过卡尼曼-特沃斯基优化方法训练代理,在实时桌面环境中实现了领先的性能与成本效益。
English: To address the challenges of training scalable AI agents for graphical user interface manipulation, the STEVE pipeline employs a step verification method using GPT-4o to label trajectory steps and optimizes agents with Kahneman and Tversky Optimization, achieving superior performance and efficiency in live desktop environments.

Authors:Mehmet Kerem Turkcan, Mattia Ballo, Filippo Filicori, Zoran Kostic
Title: Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks
Abstract:
We introduce specialized diffusion-based generative models that capture the spatiotemporal dynamics of fine-grained robotic surgical sub-stitch actions through supervised learning on annotated laparoscopic surgery footage. The proposed models form a foundation for data-driven world models capable of simulating the biomechanical interactions and procedural dynamics of surgical suturing with high temporal fidelity. Annotating a dataset of $\sim2K$ clips extracted from simulation videos, we categorize surgical actions into fine-grained sub-stitch classes including ideal and non-ideal executions of needle positioning, targeting, driving, and withdrawal. We fine-tune two state-of-the-art video diffusion models, LTX-Video and HunyuanVideo, to generate high-fidelity surgical action sequences at $\ge$768x512 resolution and $\ge$49 frames. For training our models, we explore both Low-Rank Adaptation (LoRA) and full-model fine-tuning approaches. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training simulators, surgical skill assessment tools, and autonomous surgical systems. The models also display the capability to differentiate between ideal and non-ideal technique execution, providing a foundation for building surgical training and evaluation systems. We release our models for testing and as a foundation for future research. Project Page: https://mkturkcan.github.io/suturingmodels/

Authors:Yang Yi, Kunqing Wang, Jinpu Zhang, Zhen Tan, Xiangke Wang, Hui Shen, Dewen Hu
Title: A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry
Abstract:
The bias of low-cost Inertial Measurement Units (IMU) is a critical factor affecting the performance of Visual-Inertial Odometry (VIO). In particular, when visual tracking encounters errors, the optimized bias results may deviate significantly from the true values, adversely impacting the system's stability and localization precision. In this paper, we propose a novel plug-and-play framework featuring the Inertial Prior Network (IPNet), which is designed to accurately estimate IMU bias. Recognizing the substantial impact of initial bias errors in low-cost inertial devices on system performance, our network directly leverages raw IMU data to estimate the mean bias, eliminating the dependency on historical estimates in traditional recursive predictions and effectively preventing error propagation. Furthermore, we introduce an iterative approach to calculate the mean value of the bias for network training, addressing the lack of bias labels in many visual-inertial datasets. The framework is evaluated on two public datasets and one self-collected dataset. Extensive experiments demonstrate that our method significantly enhances both localization precision and robustness, with the ATE-RMSE metric improving on average by 46\%. The source code and video will be available at https://github.com/yiyscut/VIO-IPNet.git.
中文: 本文提出了一种带有惯性先验网络(IPNet)的即插即用框架,可直接从原始数据估计IMU偏差,将视觉惯性里程计的定位精度平均提升46%,并显著增强系统鲁棒性。
English: This paper introduces a plug-and-play framework with an Inertial Prior Network (IPNet) that directly estimates IMU bias from raw data, enhancing VIO localization precision by 46% on average and improving system robustness.
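For reference, the reported metric can be sketched as follows (no trajectory alignment is performed here, whereas real evaluations typically align estimate and ground truth first): ATE-RMSE is the root-mean-square of per-frame translation errors.

import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    # Illustrative metric sketch; alignment omitted for brevity.
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt((err ** 2).mean()))

gt = np.cumsum(np.random.randn(500, 3) * 0.01, axis=0)
est = gt + np.random.randn(500, 3) * 0.02
print(ate_rmse(est, gt))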

Authors:Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia
Title: Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Abstract:
The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the "Sampling Dilemma": low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain. Our benchmark and evaluation code have been released at: https://github.com/dvlab-research/LSDBench
中文摘要:LSDBench作为首个针对长视频任务评估大型视觉语言模型的基准,通过构建高必要采样密度问题来应对采样困境,同时提出的推理驱动分层采样框架通过结合全局定位与局部密集采样,实现了对长视频的高效处理。
English Summary: LSDBench is introduced as the first benchmark to evaluate Large Vision-Language Models on long-video tasks by creating high Necessary Sampling Density questions, while the Reasoning-Driven Hierarchical Sampling framework is proposed to efficiently process these videos by combining global localization with local dense sampling.
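A schematic sketch of the two-stage sampling idea (the frame budgets are arbitrary choices, and the relevant window is assumed to come from reasoning over the coarse pass): a sparse global scan localizes the question-relevant span, which is then sampled densely.

import numpy as np

def hierarchical_frame_indices(num_frames, relevant_span, coarse=32, dense=64):
    # Illustrative two-stage sampler, not the RHS implementation.
    coarse_idx = np.linspace(0, num_frames - 1, coarse).astype(int)      # stage 1: global scan
    start, end = relevant_span                                           # assumed to come from
    dense_idx = np.linspace(start, end - 1, dense).astype(int)           # reasoning over stage 1
    return coarse_idx, dense_idx

coarse_idx, dense_idx = hierarchical_frame_indices(90000, relevant_span=(41200, 41500))
print(len(coarse_idx), dense_idx[:5])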

Authors:Wenbo Dai, Lijing Lu, Zhihang Li
Title: Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification
Abstract:
The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-consuming, costly, and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. However, a specific data synthesis technique tailored for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtains massive identity-preserving RGB-IR paired images by decoupling identity and modality to improve the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning Stable Diffusion (SD) on modality-specific data. DiVE extends text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments have demonstrated that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method, CAJ, trained with synthetic images, achieves an improvement of about $9\%$ in mAP over the baseline on the LLCM dataset. Code: https://github.com/BorgDiven/DiVE
中文: DiVE框架通过解耦身份与模态生成合成的RGB-IR配对图像,在降低数据收集成本的同时显著提升了可见光-红外行人重识别模型的性能。
English: The DiVE framework generates synthetic RGB-IR paired images by decoupling identity and modality, significantly enhancing VI-ReID model performance while reducing data collection costs.

Authors:Han Mei, Kunqian Li, Shuaixin Liu, Chengzhi Ma, Qianli Jiang
Title: DPF-Net: Physical Imaging Model Embedded Data-Driven Underwater Image Enhancement
Abstract:
Due to the complex interplay of light absorption and scattering in the underwater environment, underwater images experience significant degradation. This research presents a two-stage underwater image enhancement network called the Data-Driven and Physical Parameters Fusion Network (DPF-Net), which harnesses the robustness of physical imaging models alongside the generality and efficiency of data-driven methods. We first train a physical parameter estimation module using synthetic datasets to guarantee the trustworthiness of the physical parameters, rather than solely learning the fitting relationship between raw and reference images through the imaging equation, as is common in prior studies. This module is subsequently trained in conjunction with an enhancement network, where the estimated physical parameters are integrated into a data-driven model within the embedding space. To maintain the uniformity of the restoration process amid underwater imaging degradation, we propose a physics-based degradation consistency loss. Additionally, we suggest an innovative weak reference loss term utilizing the entire dataset, which alleviates our model's reliance on the quality of individual reference images. Our proposed DPF-Net demonstrates superior performance compared to other benchmark methods across multiple test sets, achieving state-of-the-art results. The source code and pre-trained models are available on the project home page: https://github.com/OUCVisionGroup/DPF-Net.
Chinese: 本研究提出DPF-Net双阶段水下图像增强网络,通过融合物理成像模型与数据驱动方法,采用基于物理的退化一致性损失和数据集级弱参考损失,实现了最优性能表现。
English: This study introduces DPF-Net, a two-stage underwater image enhancement network that integrates physical imaging models with data-driven methods, achieving state-of-the-art performance by employing physics-based consistency loss and dataset-wide weak reference loss.
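A reference-style sketch of the underwater imaging equation the physical branch relies on, I(x) = J(x)t(x) + A(1 - t(x)), with J the clean radiance, t the transmission, and A the background light; given estimated t and A, restoration inverts it (the values below are synthetic).

import numpy as np

def degrade(J, t, A):
    # Standard underwater/haze imaging equation: I = J*t + A*(1 - t).
    return J * t + A * (1.0 - t)

def restore(I, t, A, t_min=0.1):
    # Inverse of the equation, with the usual lower bound on t for stability.
    return (I - A) / np.maximum(t, t_min) + A

J = np.random.rand(64, 64, 3)
t = np.full((64, 64, 1), 0.6)
A = np.array([0.1, 0.4, 0.5])              # blue-green cast typical of water (synthetic choice)
I = degrade(J, t, A)
print(np.abs(restore(I, t, A) - J).max())  # ~0 when t and A are known exactly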

Authors:Jiahang Cao, Qiang Zhang, Hanzhong Guo, Jiaxu Wang, Hao Cheng, Renjing Xu
Title: Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition
Abstract:
Diffusion Policy (DP) has attracted significant attention as an effective method for policy representation due to its capacity to model multi-distribution dynamics. However, current DPs are often based on a single visual modality (e.g., RGB or point cloud), limiting their accuracy and generalization potential. Although training a generalized DP capable of handling heterogeneous multimodal data would enhance performance, it entails substantial computational and data-related costs. To address these challenges, we propose a novel policy composition method: by leveraging multiple pre-trained DPs based on individual visual modalities, we can combine their distributional scores to form a more expressive Modality-Composable Diffusion Policy (MCDP), without the need for additional training. Through extensive empirical experiments on the RoboTwin dataset, we demonstrate the potential of MCDP to improve both adaptability and performance. This exploration aims to provide valuable insights into the flexible composition of existing DPs, facilitating the development of generalizable cross-modality, cross-domain, and even cross-embodiment policies. Our code is open-sourced at https://github.com/AndyCao1125/MCDP.
Chinese: 提出的模态可组合扩散策略(MCDP)无需重新训练即可融合多个预训练的视觉模态扩散策略,在RoboTwin数据集上验证了其提升异构数据适应性与性能的有效性。
English: The proposed Modality-Composable Diffusion Policy (MCDP) combines pre-trained diffusion policies from individual visual modalities without retraining, enhancing adaptability and performance across heterogeneous data as validated on the RoboTwin dataset.
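A minimal sketch of distribution-level composition (the weights are illustrative): the noise/score predictions of per-modality diffusion policies are mixed at every inference step, with no additional training.

import torch

def composed_score(scores, weights):
    """Illustrative composition sketch. scores: list of (B, action_dim) predictions
    from per-modality policies."""
    w = torch.tensor(weights) / sum(weights)
    return sum(wi * s for wi, s in zip(w, scores))

rgb_score = torch.randn(4, 7)              # stand-in for an RGB-conditioned diffusion policy
pcd_score = torch.randn(4, 7)              # stand-in for a point-cloud-conditioned policy
print(composed_score([rgb_score, pcd_score], weights=[0.5, 0.5]).shape)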

Authors:Alessio Xompero, Andrea Cavallaro
Title: Learning Privacy from Visual Entities
Abstract:
Subjective interpretation and content diversity make predicting whether an image is private or public a challenging task. Graph neural networks combined with convolutional neural networks (CNNs), which consist of 14,000 to 500 million parameters, generate features for visual entities (e.g., scene and object types) and identify the entities that contribute to the decision. In this paper, we show that using a simpler combination of transfer learning and a CNN to relate privacy with scene types optimises only 732 parameters while achieving comparable performance to that of graph-based methods. In contrast, end-to-end training of graph-based methods can mask the contribution of individual components to the classification performance. Furthermore, we show that a high-dimensional feature vector, extracted with CNNs for each visual entity, is unnecessary and complexifies the model. The graph component also has negligible impact on performance, which is driven by fine-tuning the CNN to optimise image features for privacy nodes.
中文: 研究表明,仅用732个参数的迁移学习与CNN结合方法即可实现与复杂图神经网络相当的隐私分类性能,同时揭示高维特征和图结构组件对优化结果并非必要。
English: This study demonstrates that a simpler approach using transfer learning and CNNs with only 732 parameters achieves privacy classification performance comparable to complex graph-based methods, while revealing that high-dimensional features and graph components are unnecessary for optimal results.
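One hedged reading of the 732-parameter figure (an assumption on our part, not a statement of the authors' exact head): a 2-way linear classifier over 365 scene-type scores, as produced by a Places365-style backbone, has exactly 365 x 2 + 2 = 732 trainable parameters.

import torch, torch.nn as nn

# Illustrative parameter count only; whether this matches the paper's head is an assumption.
head = nn.Linear(365, 2)                   # private vs. public over 365 scene scores
n_params = sum(p.numel() for p in head.parameters())
print(n_params)                            # 732

scene_scores = torch.softmax(torch.randn(8, 365), dim=-1)   # from a frozen scene classifier
print(head(scene_scores).shape)            # (8, 2) privacy logits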

Authors:Fanhu Zeng, Hao Tang, Yihua Shao, Siyu Chen, Ling Shao, Yan Wang
Title: MambaIC: State Space Models for High-Performance Learned Image Compression
Abstract:
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we integrate the advantages of SSMs for better efficiency-performance trade-off and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code is released at https://github.com/AuroraZengfh/MambaIC.
中文摘要:本文提出MambaIC这一增强型图像压缩方法,通过状态空间模型优化上下文建模并引入局部注意力机制,在提升计算效率的同时有效减少空间冗余。
English Summary: The paper introduces MambaIC, an enhanced image compression method that leverages state space models to improve computational efficiency and reduce redundancy through refined context modeling and window-based local attention.

Authors:Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang
Title: LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
Abstract:
Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy: Token Redundancy indicates that a large portion of tokens have very similar representations in the adjacent decoding steps, which allows us to first cache them in previous steps and then reuse them in the later steps. Condition Redundancy indicates that the difference between conditional and unconditional output in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83 times acceleration with almost no drop in generation quality. Our code will be released at https://github.com/feihongyan1/LazyMAR.
中文: 掩码自回归(MAR)模型因双向自注意力与KV缓存机制冲突而产生计算瓶颈,本文提出LazyMAR方法,利用标记冗余和条件冗余实现免训练的即插即用优化,在保持生成质量的同时将速度提升2.83倍。
English: Masked Autoregressive (MAR) models face computational bottlenecks due to their reliance on bidirectional self-attention conflicting with KV caching, prompting the development of LazyMAR, a training-free solution that leverages token and condition redundancies to achieve a 2.83x speedup without compromising generation quality.
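A toy illustration of the condition-redundancy caching (the refresh interval and stand-in networks are assumptions): the classifier-free-guidance difference is recomputed only occasionally and reused in between, cutting conditional forward passes.

import torch

def cfg_with_cached_delta(model_uncond, model_cond, x, step, cache, scale=3.0, refresh=4):
    # Illustrative sketch of reusing the conditional-minus-unconditional offset across steps.
    u = model_uncond(x)
    if step % refresh == 0 or "delta" not in cache:
        cache["delta"] = model_cond(x) - u        # refresh the cached conditional offset
    return u + scale * cache["delta"]

cache = {}
f_u = lambda x: 0.5 * x                           # stand-ins for the real networks
f_c = lambda x: 0.5 * x + 1.0
x = torch.randn(2, 8)
for step in range(8):
    out = cfg_with_cached_delta(f_u, f_c, x, step, cache)
print(out.shape)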

Authors:Luming Wang, Hao Shi, Xiaoting Yin, Kailun Yang, Kaiwei Wang, Jian Bai
Title: EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera
Abstract:
Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that includes events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BTSM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further establish the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy tested on unseen subjects with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 97.0% on the DVS128 Gesture, demonstrating the effectiveness and generalization capability of our method on public datasets. The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture.
中文: 本文提出了一种专为事件相机设计的轻量级网络架构,通过非对称卷积和状态空间模型分离头部运动噪声,在自定义和公共数据集上以较少参数实现了更高的手势识别准确率。
English: This paper introduces a lightweight network architecture for egocentric gesture recognition using event cameras, featuring asymmetric convolutions and a state-space model to decouple head movement noise, achieving superior accuracy with minimal parameters on both custom and public datasets.
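A parameter-free shift in the spirit of the described Bins-Temporal Shift Module (the 1/8 split ratio is an assumption): one slice of channels is rolled along the temporal axis and another along the event-bin axis, mixing information at zero parameter cost.

import torch

def bins_temporal_shift(x):
    """Illustrative sketch, not the released BTSM. x: (batch, time, bins, channels)."""
    c = x.shape[-1]
    s = max(c // 8, 1)
    out = x.clone()
    out[..., :s] = torch.roll(x[..., :s], shifts=1, dims=1)            # shift along time
    out[..., s:2 * s] = torch.roll(x[..., s:2 * s], shifts=1, dims=2)  # shift along bins
    return out

print(bins_temporal_shift(torch.randn(2, 10, 5, 32)).shape)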

Authors:Shuo Gao, Jingyang Zhang, Jun Xue, Meng Yang, Yang Chen, Guangquan Zhou
Title: A Causality-Inspired Model for Intima-Media Thickening Assessment in Ultrasound Videos
Abstract:
Carotid atherosclerosis represents a significant health risk, with its early diagnosis primarily dependent on ultrasound-based assessments of carotid intima-media thickening. However, during carotid ultrasound screening, significant view variations cause style shifts, impairing content cues related to thickening, such as lumen anatomy, which introduces spurious correlations that hinder assessment. Therefore, we propose a novel causal-inspired method for assessing carotid intima-media thickening in frame-wise ultrasound videos, which focuses on two aspects: eliminating spurious correlations caused by style and enhancing causal content correlations. Specifically, we introduce a novel Spurious Correlation Elimination (SCE) module to remove non-causal style effects by enforcing prediction invariance with style perturbations. Simultaneously, we propose a Causal Equivalence Consolidation (CEC) module to strengthen causal content correlation through adversarial optimization during content randomization. In addition, we design a Causal Transition Augmentation (CTA) module to ensure smooth causal flow by integrating an auxiliary pathway with text prompts and connecting it through contrastive learning. The experimental results on our in-house carotid ultrasound video dataset achieved an accuracy of 86.93%, demonstrating the superior performance of the proposed method. Code is available at https://github.com/xielaobanyy/causal-imt.
中文: 本研究提出了一种新颖的因果启发方法,通过消除超声图像风格变化导致的伪相关并增强因果内容关联来评估颈动脉内膜中层厚度,在内部数据集上达到了86.93%的准确率。
English: This study introduces a novel causal-inspired method for assessing carotid intima-media thickening in ultrasound videos, which eliminates spurious correlations from style variations and enhances causal content cues through specialized modules, achieving 86.93% accuracy on an in-house dataset.

Authors:Jiangdong Cai, Yan Chen, Zhenrong Shen, Haotian Jiang, Honglin Xiong, Kai Xuan, Lichi Zhang, Qian Wang
Title: Pathology Image Restoration via Mixture of Prompts
Abstract:
In digital pathology, acquiring all-in-focus images is essential to high-quality imaging and an efficient clinical workflow. Traditional scanners achieve this by scanning at multiple focal planes of varying depths and then merging them, which is relatively slow and often struggles with complex tissue defocus. Recent image restoration techniques provide a means to restore high-quality pathology images from scans of single focal planes. However, existing image restoration methods are inadequate, due to intricate defocus patterns in pathology images and their domain-specific semantic complexities. In this work, we devise a two-stage restoration solution cascading a transformer and a diffusion model, to benefit from their respective strengths in preserving image fidelity and perceptual quality. We particularly propose a novel mixture of prompts for the two-stage solution. Given an initial prompt that models defocus in microscopic imaging, we design two prompts that describe the high-level image semantics from a pathology foundation model and the fine-grained tissue structures via edge extraction. We demonstrate that, by feeding the prompt mixture to our method, we can restore high-quality pathology images from single-focal-plane scans, suggesting the high potential of the mixture of prompts for clinical use. Code will be publicly available at https://github.com/caijd2000/MoP.
中文摘要:本研究提出了一种结合Transformer和扩散模型的两阶段方法,通过创新的提示混合技术,从单焦平面扫描中恢复高质量病理图像,有效解决散焦和语义复杂性,具有临床应用潜力。
English Summary: This study introduces a two-stage method combining a transformer and a diffusion model, enhanced by a novel mixture of prompts, to restore high-quality pathology images from single-focal-plane scans, addressing defocus and semantic complexities for clinical applications.

Authors:Yutao Hu, Sen Li, Jincheng Yan, Wenqi Shao, Xiaoyan Luo
Title: Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset
Abstract:
Fine-grained visual categorization (FGVC) is a challenging but significant task in computer vision, which aims to recognize different sub-categories of birds, cars, airplanes, etc. Among them, recognizing models of different cars has significant application value in autonomous driving, traffic surveillance and scene understanding, and has received considerable attention in the past few years. However, Stanford-Car, the most widely used fine-grained dataset for car recognition, has only 196 categories and includes only vehicle models produced earlier than 2013. Due to the rapid advancements in the automotive industry during recent years, the appearances of various car models have become increasingly intricate and sophisticated. Consequently, the previous Stanford-Car dataset fails to capture this evolving landscape and cannot satisfy the requirements of the automotive industry. To address these challenges, in our paper, we introduce Car-1000, a large-scale dataset designed specifically for fine-grained visual categorization of diverse car models. Car-1000 encompasses vehicles from 165 different automakers, spanning a wide range of 1000 distinct car models. Additionally, we have reproduced several state-of-the-art FGVC methods on the Car-1000 dataset, establishing a new benchmark for research in this field. We hope that our work will offer a fresh perspective for future FGVC researchers. Our dataset is available at https://github.com/toggle1995/Car-1000.
Chinese: 为解决斯坦福汽车数据集过时的问题,作者推出了包含1000种车型的大规模数据集Car-1000,为细粒度视觉分类研究建立了新基准。
English: The authors introduce Car-1000, a large-scale dataset with 1000 car models to address the limitations of the outdated Stanford-Car dataset and advance fine-grained visual categorization research, establishing new benchmarks for the field.

Authors:Kang You, Tong Chen, Dandan Ding, M. Salman Asif, Zhan Ma
Title: RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds
Abstract:
Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression, an indispensable criterion for numerous industrial applications, remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight model. RENO skips the octree construction and directly builds upon the multiscale sparse tensor representation. Instead of multi-stage inference, RENO devises sparse occupancy codes, which exploit cross-scale correlation and derive voxels' occupancy in a one-shot manner, greatly saving processing time. Experimental results demonstrate that the proposed RENO achieves real-time coding speed, 10 fps at 14-bit depth on a desktop platform (e.g., one RTX 3090 GPU) for both encoding and decoding processes, while providing 12.25% and 48.34% bit-rate savings compared to G-PCCv23 and Draco, respectively, at a similar quality. The RENO model size is merely 1 MB, making it attractive for practical applications. The source code is available at https://github.com/NJUVISION/RENO.
中文: RENO是首个实时3D激光雷达点云神经编解码器,通过1MB轻量模型在实现10 fps实时处理的同时,显著降低了比特率。
English: RENO is the first real-time neural codec for 3D LiDAR point clouds, achieving significant bit-rate savings with a lightweight 1MB model while enabling real-time processing at 10 fps.

Authors:Ruoyu Wang, Yukai Ma, Yi Yao, Sheng Tao, Haoang Li, Zongzhi Zhu, Yong Liu, Xingxing Zuo
Title: L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model
Abstract:
Semantic Scene Completion (SSC) constitutes a pivotal element in autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has implemented various computationally demanding and memory-intensive 3D operations, imposing significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge modules, including feature similarity distillation (FSD), TPV distillation (TPVD) and prediction alignment distillation (PAD), our method substantially reduces the computational burden while maintaining high accuracy. The experimental evaluations demonstrate that our proposed method surpasses the current state-of-the-art vision-based SSC methods in accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks. Additionally, our method is more lightweight, exhibiting a reduction in both memory consumption and inference time of over 23% compared to current state-of-the-art methods. Code is available at our project page: https://studyingfufu.github.io/L2COcc/.
中文: 本文提出L2COcc轻量级相机中心语义场景补全框架,通过高效体素变换器和跨模态知识蒸馏模块,在保持高精度的同时显著降低计算负担,在多个基准测试中超越现有最优方法。
English: This paper introduces L2COcc, a lightweight camera-centric framework for Semantic Scene Completion that uses efficient voxel transformers and cross-modal knowledge distillation to significantly reduce computational costs while achieving superior accuracy on benchmark datasets.
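A hedged sketch of what a feature-similarity distillation term can look like is given below: the camera-branch (student) feature map is pulled toward the LiDAR-branch (teacher) feature map via per-location cosine similarity. The function name and the absence of any projection layer are simplifying assumptions, not the paper's FSD module.

```python
import torch
import torch.nn.functional as F

def feature_similarity_distill(student_feat, teacher_feat):
    """student_feat, teacher_feat: (batch, channels, H, W). Returns a scalar loss."""
    s = F.normalize(student_feat.flatten(2), dim=1)   # unit-norm channel vectors per location
    t = F.normalize(teacher_feat.flatten(2), dim=1)
    return (1.0 - (s * t).sum(dim=1)).mean()          # 1 - cosine similarity, averaged

cam = torch.randn(2, 64, 16, 16)     # student (camera) features
lidar = torch.randn(2, 64, 16, 16)   # teacher (LiDAR) features
print(feature_similarity_distill(cam, lidar).item())
```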

Authors:Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala
Title: Atlas: Multi-Scale Attention Improves Long Context Image Modeling
Abstract:
Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling in a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%) while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
中文: 摘要介绍了多尺度注意力机制及其Atlas架构,该架构在高分辨率图像建模中显著提升了计算效率与准确性,性能超越了ConvNext-B和FasterViT等现有模型。
English: The abstract introduces Multi-Scale Attention (MSA) and the Atlas architecture, which significantly enhances computational efficiency and accuracy in high-resolution image modeling, outperforming existing models like ConvNext-B and FasterViT.
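As a minimal sketch of the multi-scale side of MSA, the snippet below builds O(log N) progressively coarser feature maps by repeated 2x average pooling; the bi-directional cross-scale attention that operates over these levels is omitted. Function and parameter names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def build_scale_pyramid(x, min_size=8):
    """x: (batch, channels, H, W) -> list of feature maps, finest first, coarsest last."""
    scales = [x]
    while min(x.shape[-2:]) > min_size:
        x = F.avg_pool2d(x, kernel_size=2)   # halve resolution at each level
        scales.append(x)
    return scales

feats = torch.randn(1, 64, 256, 256)
print([tuple(s.shape[-2:]) for s in build_scale_pyramid(feats)])
# [(256, 256), (128, 128), (64, 64), (32, 32), (16, 16), (8, 8)]
```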

Authors:Wenqing Kuang, Xiongwei Zhao, Yehui Shen, Congcong Wen, Huimin Lu, Zongtan Zhou, Xieyuanli Chen
Title: ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions
Abstract:
LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: https://github.com/nubot-nudt/ResLPR.
中文: ResLPRNet是一种基于小波变换的轻量级激光雷达修复网络,可显著提升恶劣天气下的地点识别性能,并通过新基准在多个数据集上验证了其显著效果。
English: ResLPRNet is a lightweight LiDAR restoration network using wavelet transforms to enhance place recognition performance under adverse weather, with a new benchmark showing significant gains across multiple datasets.

Authors:Tengfei Wang, Xin Wang, Yongmao Hou, Zhaoning Zhang, Yiwei Xu, Zongqian Zhan
Title: GS-I$^{3}$: Gaussian Splatting for Surface Reconstruction from Illumination-Inconsistent Images
Abstract:
Accurate geometric surface reconstruction, providing essential environmental information for navigation and manipulation tasks, is critical for enabling robotic self-exploration and interaction. Recently, 3D Gaussian Splatting (3DGS) has gained significant attention in the field of surface reconstruction due to its impressive geometric quality and computational efficiency. While recent advancements in novel view synthesis under inconsistent illumination using 3DGS have shown promise, the challenge of robust surface reconstruction under such conditions is still being explored. To address this challenge, we propose a method called GS-3I. Specifically, to mitigate 3D Gaussian optimization bias caused by underexposed regions in single-view images, we introduce a tone mapping correction framework based on a Convolutional Neural Network (CNN). Furthermore, inconsistent lighting across multi-view images, resulting from variations in camera settings and complex scene illumination, often leads to geometric constraint mismatches and deviations in the reconstructed surface. To overcome this, we propose a normal compensation mechanism that integrates reference normals extracted from a single-view image with normals computed from multi-view observations to effectively constrain geometric inconsistencies. Extensive experimental evaluations demonstrate that GS-3I can achieve robust and accurate surface reconstruction across complex illumination scenarios, highlighting its effectiveness and versatility in this critical challenge. https://github.com/TFwang-9527/GS-3I
中文: 提出的GS-3I方法通过引入基于CNN的色调映射校正和法线补偿机制,增强了3D高斯泼溅技术,实现在不一致光照条件下的鲁棒表面重建。
English: The proposed GS-3I method enhances 3D Gaussian Splatting by incorporating a CNN-based tone mapping correction and a normal compensation mechanism to achieve robust surface reconstruction under inconsistent illumination conditions.

Authors:Yunze Liu, Peiran Wu, Cheng Liang, Junxiao Shen, Limin Wang, Li Yi
Title: VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
Abstract:
Recent Mamba-based architectures for video understanding demonstrate promising computational efficiency and competitive performance, yet struggle with overfitting issues that hinder their scalability. To overcome this challenge, we introduce VideoMAP, a Hybrid Mamba-Transformer framework featuring a novel pre-training approach. VideoMAP uses a 4:1 Mamba-to-Transformer ratio, effectively balancing computational cost and model capacity. This architecture, combined with our proposed frame-wise masked autoregressive pre-training strategy, delivers significant performance gains when scaling to larger models. Additionally, VideoMAP exhibits impressive sample efficiency, significantly outperforming existing methods with less training data. Experiments show that VideoMAP outperforms existing models across various datasets, including Kinetics-400, Something-Something V2, Breakfast, and COIN. Furthermore, we demonstrate the potential of VideoMAP as a visual encoder for multimodal large language models, highlighting its ability to reduce memory usage and enable the processing of longer video sequences. The code is open-source at https://github.com/yunzeliu/MAP
中文: VideoMAP是一种混合Mamba-Transformer框架,通过创新的预训练策略解决视频理解中的过拟合问题,在多个数据集上实现卓越性能,同时显著提升计算效率和扩展性。
English: VideoMAP is a hybrid Mamba-Transformer framework that addresses overfitting in video understanding through a novel pre-training strategy, achieving superior performance across multiple datasets while enhancing computational efficiency and scalability.

Authors:Jiahao Wu, Rui Peng, Zhiyan Wang, Lu Xiao, Luyang Tang, Jinbo Yan, Kaiqiang Xiong, Ronggang Wang
Title: Swift4D: Adaptive divide-and-conquer Gaussian Splatting for compact and efficient reconstruction of dynamic scene
Abstract:
Novel view synthesis has long been a practical but challenging task. Despite the introduction of numerous methods to solve this problem, even those combining advanced representations like 3D Gaussian Splatting, existing approaches still struggle to recover high-quality results and often consume excessive storage and training time. In this paper we propose Swift4D, a divide-and-conquer 3D Gaussian Splatting method that handles static and dynamic primitives separately, achieving a good trade-off between rendering quality and efficiency, motivated by the fact that most of the scene consists of static primitives that do not require additional dynamic properties. Concretely, we model dynamic transformations only for the dynamic primitives, which benefits both efficiency and quality. We first employ a learnable decomposition strategy to separate the primitives, which relies on an additional parameter to classify primitives as static or dynamic. For the dynamic primitives, we employ a compact multi-resolution 4D Hash mapper to transform these primitives from canonical space into deformation space at each timestamp, and then mix the static and dynamic primitives to produce the final output. This divide-and-conquer method facilitates efficient training and reduces storage redundancy. Our method achieves state-of-the-art rendering quality while training 20x faster than previous SOTA methods, with a minimum storage requirement of only 30 MB on real-world datasets. Code is available at https://github.com/WuJH2001/swift4d.
Chinese Summary: Swift4D提出了一种分治式的3D高斯溅射方法,通过分别处理静态和动态元素,在实现卓越渲染质量的同时,将训练速度提升20倍并大幅降低存储需求。
English Summary: Swift4D is a novel divide-and-conquer 3D Gaussian Splatting method that separately handles static and dynamic elements, achieving superior rendering quality with 20x faster training and minimal storage requirements.
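The toy module below sketches the learnable static/dynamic decomposition: each Gaussian primitive carries an extra scalar, and its sigmoid gates a time-dependent offset so static primitives stay in canonical space. The placeholder MLP stands in for the paper's multi-resolution 4D hash mapper; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    def __init__(self, num_primitives, feat_dim=16):
        super().__init__()
        self.dyn_logit = nn.Parameter(torch.zeros(num_primitives))  # learnable static/dynamic score
        self.deform = nn.Sequential(nn.Linear(4, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 3))

    def forward(self, positions, t):
        """positions: (N, 3) canonical positions; t: scalar timestamp in [0, 1]."""
        gate = torch.sigmoid(self.dyn_logit).unsqueeze(-1)           # (N, 1), near 0 => static
        t_col = torch.full((positions.shape[0], 1), float(t))
        offset = self.deform(torch.cat([positions, t_col], dim=-1))  # canonical -> deformed offset
        return positions + gate * offset                             # static primitives barely move

gate = DynamicGate(num_primitives=1000)
print(gate(torch.randn(1000, 3), t=0.5).shape)  # torch.Size([1000, 3])
```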

Authors:Negar Shahamiri, Moritz Rempe, Lukas Heine, Jens Kleesiek, Fabian Hörst
Title: Cracking the PUMA Challenge in 24 Hours with CellViT++ and nnU-Net
Abstract:
Automatic tissue segmentation and nuclei detection is an important task in pathology, aiding in biomarker extraction and discovery. The panoptic segmentation of nuclei and tissue in advanced melanoma (PUMA) challenge aims to improve tissue segmentation and nuclei detection in melanoma histopathology. Unlike many challenge submissions focusing on extensive model tuning, our approach emphasizes delivering a deployable solution within a 24-hour development timeframe, using out-of-the-box frameworks. The pipeline combines two models, namely CellViT++ for nuclei detection and nnU-Net for tissue segmentation. Our results demonstrate a significant improvement in tissue segmentation, achieving a Dice score of 0.750, surpassing the baseline score of 0.629. For nuclei detection, we obtained results comparable to the baseline in both challenge tracks. The code is publicly available at https://github.com/TIO-IKIM/PUMA.
中文: 我们通过整合CellViT++细胞核检测与nnU-Net组织分割的可部署方案,在PUMA挑战赛中显著提升组织分割性能(Dice评分0.750),同时保持与基线相当的细胞核检测效果。
English: Our deployable pipeline, combining CellViT++ for nuclei detection and nnU-Net for tissue segmentation, significantly improves tissue segmentation with a Dice score of 0.750 while maintaining comparable nuclei detection results in the PUMA challenge.

Authors:Boyu Chen, Ameenat L. Solebo, Daqian Shi, Jinge Wu, Paul Taylor
Title: Minuscule Cell Detection in AS-OCT Images with Progressive Field-of-View Focusing
Abstract:
Anterior Segment Optical Coherence Tomography (AS-OCT) is an emerging imaging technique with great potential for diagnosing anterior uveitis, a vision-threatening ocular inflammatory condition. A hallmark of this condition is the presence of inflammatory cells in the eye's anterior chamber, and detecting these cells using AS-OCT images has attracted research interest. While recent efforts aim to replace manual cell detection with automated computer vision approaches, detecting extremely small (minuscule) objects in high-resolution images, such as AS-OCT, poses substantial challenges: (1) each cell appears as a minuscule particle, representing less than 0.005% of the image, making the detection difficult, and (2) OCT imaging introduces pixel-level noise that can be mistaken for cells, leading to false positive detections. To overcome these challenges, we propose a minuscule cell detection framework through a progressive field-of-view focusing strategy. This strategy systematically refines the detection scope from the whole image to a target region where cells are likely to be present, and further to minuscule regions potentially containing individual cells. Our framework consists of two modules. First, a Field-of-Focus module uses a vision foundation model to segment the target region. Subsequently, a Fine-grained Object Detection module introduces a specialized Minuscule Region Proposal followed by a Spatial Attention Network to distinguish individual cells from noise within the segmented region. Experimental results demonstrate that our framework outperforms state-of-the-art methods for cell detection, providing enhanced efficacy for clinical applications. Our code is publicly available at: https://github.com/joeybyc/MCD.
中文: 该框架通过渐进式视野聚焦策略,解决了前段光学相干断层扫描图像中微小炎症细胞检测的难题,借助目标区域分割和抗干扰识别技术,显著提升了细胞检测的临床效能。
English: The proposed framework utilizes a progressive field-of-view focusing strategy to overcome challenges in detecting minuscule inflammatory cells in AS-OCT images, significantly outperforming existing methods through targeted region segmentation and noise-resistant cell identification.

Authors:Yuheng Jiang, Zhehao Shen, Chengcheng Guo, Yu Hong, Zhuo Su, Yingliang Zhang, Marc Habermann, Lan Xu
Title: RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance
Abstract:
Human-centric volumetric videos offer immersive free-viewpoint experiences, yet existing methods focus either on replaying general dynamic scenes or animating human avatars, limiting their ability to re-perform general dynamic scenes. In this paper, we present RePerformer, a novel Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. Specifically, we hierarchically disentangle the dynamic scenes into motion Gaussians and appearance Gaussians which are associated in the canonical space. We further employ a Morton-based parameterization to efficiently encode the appearance Gaussians into 2D position and attribute maps. For enhanced generalization, we adopt 2D CNNs to map position maps to attribute maps, which can be assembled into appearance Gaussians for high-fidelity rendering of the dynamic scenes. For re-performance, we develop a semantic-aware alignment module and apply deformation transfer on motion Gaussians, enabling photo-real rendering under novel motions. Extensive experiments validate the robustness and effectiveness of RePerformer, setting a new benchmark for playback-then-reperformance paradigm in human-centric volumetric videos.

Authors:Yan Jiang, Hao Yu, Mengting Wei, Zhaodong Sun, Haoyu Chen, Xu Cheng, Guoying Zhao
Title: L2RW+: A Comprehensive Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification
Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging task that aims to match pedestrian images captured under varying lighting conditions, and it has drawn intensive research attention and achieved promising results. However, existing methods adopt centralized training, ignoring the potential privacy concerns that arise when data is distributed across multiple devices or entities in reality. In this paper, we propose L2RW+, a benchmark that brings VI-ReID closer to real-world applications. The core rationale behind L2RW+ is that incorporating decentralized training into VI-ReID can address privacy concerns in scenarios where data sharing is restricted. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we simulate training under real-world data conditions in which: 1) data from each camera is completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share data. In this way, we simulate scenarios with strict privacy restrictions, which is closer to real-world conditions. Comprehensive experiments show the feasibility and potential of decentralized VI-ReID training at both the image and video levels. In particular, with increasing data scales, the performance gap between decentralized and centralized training decreases, especially in video-level VI-ReID. In unseen domains, decentralized training even achieves performance comparable to SOTA centralized methods. This work offers a novel research entry point for deploying VI-ReID in real-world scenarios and can benefit the community. Code is available at: https://github.com/Joey623/L2RW.
中文: 本文提出L2RW+基准,通过模拟现实世界中数据隔离场景下的分散式训练方法解决可见光-红外行人重识别中的隐私问题,实验表明该方法在大规模数据下性能接近集中式训练。
English: This paper introduces L2RW+, a decentralized training benchmark for visible-infrared person re-identification that addresses privacy concerns by simulating real-world data isolation scenarios, demonstrating competitive performance with centralized methods especially at larger data scales.
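Decentralized training of the kind simulated here typically shares only model parameters between entities. The sketch below shows a generic FedAvg-style weight average as one plausible aggregation step; it is not the paper's protocol, and the function name and uniform weighting are assumptions.

```python
import copy
import torch

def average_state_dicts(state_dicts, weights=None):
    """Average a list of model state dicts, e.g. one per camera or data entity."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return avg

# Toy usage: three locally trained models aggregated into one global model.
models = [torch.nn.Linear(4, 2) for _ in range(3)]
global_weights = average_state_dicts([m.state_dict() for m in models])
print(global_weights["weight"].shape)  # torch.Size([2, 4])
```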

Authors:Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali
Title: TLAC: Two-stage LMM Augmented CLIP for Zero-Shot Classification
Abstract:
Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques like prompt learning and adapter-based tuning to optimize CLIP's performance. The necessity for fine-tuning significantly limits CLIP's adaptability to novel datasets and domains, mandating substantial time and computational resources for each new dataset. To overcome this limitation, we introduce simple yet effective training-free approaches, Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC), that leverage powerful Large Multimodal Models (LMMs), such as Gemini, for image classification. The proposed methods leverage the capabilities of pre-trained LMMs, allowing for seamless adaptation to diverse datasets and domains without the need for additional training. Our approaches involve prompting the LMM to identify objects within an image. Subsequently, the CLIP text encoder determines the image class by identifying the dataset class with the highest semantic similarity to the LMM-predicted object. Our models achieved superior accuracy on 9 of 11 base-to-novel datasets, including ImageNet, SUN397, and Caltech101, while maintaining a strictly training-free paradigm. Our TLAC model achieved an overall accuracy of 83.44%, surpassing the previous state-of-the-art few-shot methods by a margin of 6.75%. Compared to other training-free approaches, our TLAC method achieved 83.6% average accuracy across 13 datasets, a 9.7% improvement over previous methods. Our code is available at https://github.com/ans92/TLAC
中文: 本文提出无需训练的SLAC和TLAC方法,通过大型多模态模型识别图像对象并利用CLIP文本编码器进行语义匹配,无需微调即可在多个数据集上实现最先进的零样本分类精度。
English: This paper introduces training-free methods SLAC and TLAC that enhance CLIP's zero-shot image classification by leveraging Large Multimodal Models to identify objects and CLIP's text encoder for semantic matching, achieving state-of-the-art accuracy without additional training.
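A hedged sketch of the training-free matching step is shown below: the object string returned by an LMM is embedded with a CLIP text encoder and compared against dataset class names, and the most similar class is returned. The checkpoint name, prompt template, and function name are placeholder assumptions, not the authors' exact configuration.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def classify_from_lmm_prediction(lmm_object, class_names):
    """Match an LMM-predicted object name to the most similar dataset class in CLIP text space."""
    texts = [f"a photo of a {lmm_object}"] + [f"a photo of a {c}" for c in class_names]
    tokens = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sims = feats[0] @ feats[1:].T          # object embedding vs. each class embedding
    return class_names[sims.argmax().item()]

print(classify_from_lmm_prediction("golden retriever", ["dog", "cat", "car"]))  # "dog"
```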

Authors:Yuchen Deng, Haibin Ling, Bingyao Huang
Title: LAPIG: Language Guided Projector Image Generation with Surface Adaptation and Stylization
Abstract:
We propose LAPIG, a language guided projector image generation method with surface adaptation and stylization. LAPIG consists of a projector-camera system and a target textured projection surface. LAPIG takes a user text prompt as input and aims to transform the surface style using the projector. LAPIG's key challenge is that, due to the projector's physical brightness limitation and the surface texture, the viewer's perceived projection may suffer from color saturation and artifacts in both dark and bright regions, such that even with state-of-the-art projector compensation techniques, the viewer may see clear surface texture-related artifacts. Therefore, how to generate a projector image that follows the user's instruction while also displaying minimal surface artifacts is an open problem. To address this issue, we propose projection surface adaptation (PSA), which can generate compensable surface stylization. We first train two networks to simulate the projector compensation and project-and-capture processes; this allows us to find a satisfactory projector image without real project-and-capture and to utilize gradient descent for fast convergence. Then, we design content and saturation losses to guide the projector image generation, such that the generated image shows no clearly perceivable artifacts when projected. Finally, the generated image is projected for visually pleasing surface style morphing effects. The source code and video are available on the project page: https://Yu-chen-Deng.github.io/LAPIG/.

Authors:Zijian He, Yuwei Ning, Yipeng Qin, Guangrun Wang, Sibei Yang, Liang Lin, Guanbin Li
Title: VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction
Abstract:
Virtual Try-On (VTON) is a transformative technology in e-commerce and fashion design, enabling realistic digital visualization of clothing on individuals. In this work, we propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering. Specifically, we leverage the equivalence between a 3D model and its rendered multi-view 2D images, and reformulate 3D VTON as an extension of 2D VTON that ensures 3D consistent results across multiple views. To achieve this, we extend 2D VTON models to include multi-view garments and clothing-agnostic human body images as input, and propose several novel techniques to enhance them, including: i) a pseudo-3D pose representation using normal maps derived from the SMPL-X 3D human model, ii) a multi-view spatial attention mechanism that models the correlations between features from different viewing angles, and iii) a multi-view CLIP embedding that enhances the garment CLIP features used in 2D VTON with camera information. Extensive experiments on large-scale real datasets and clothing images from e-commerce platforms demonstrate the effectiveness of our approach. Project page: https://scnuhealthy.github.io/VTON360.
中文:VTON 360提出了一种创新的3D虚拟试穿方法,通过扩展2D模型为多视图输入,并结合伪3D姿态表征、多视角注意力机制和增强CLIP嵌入等技术,实现了支持任意视角的高保真渲染。
English: VTON 360 introduces a novel 3D virtual try-on method that achieves high-fidelity, any-view rendering by extending 2D VTON models with multi-view inputs and innovative techniques including pseudo-3D pose representation, multi-view attention mechanisms, and enhanced CLIP embeddings.

Authors:Hongyu Sun, Qiuhong Ke, Ming Cheng, Yongcai Wang, Deying Li, Chenhui Gou, Jianfei Cai
Title: Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis
Abstract:
This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. Unlike prior methods, which rely heavily on training data (often inaccessible during online inference) and are limited to recognizing a fixed set of point cloud classes predefined during training, we explore a more practical and challenging scenario: adapting the model solely based on online test data to recognize both previously seen classes and novel, unseen classes at test time. To this end, we develop Point-Cache, a hierarchical cache model that captures essential clues of online test samples, particularly focusing on the global structure of point clouds and their local-part details. Point-Cache, which serves as a rich 3D knowledge base, is dynamically managed to prioritize the inclusion of high-quality samples. Designed as a plug-and-play module, our method can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. Notably, our solution operates with efficiency comparable to zero-shot inference, as it is entirely training-free. Point-Cache demonstrates substantial gains across 8 challenging benchmarks and 4 representative large 3D models, highlighting its effectiveness. Code is available at https://github.com/auniquesun/Point-Cache.
中文: 本文提出Point-Cache,一种无需训练的层次化缓存模型,使点云识别模型能够仅基于在线测试数据适应分布变化并识别已知和未知类别,在多个基准测试中取得了显著性能提升。
English: This paper introduces Point-Cache, a training-free hierarchical cache model that enables point cloud recognition models to adapt to distribution shifts and recognize both seen and unseen classes using only online test data, achieving significant performance gains across multiple benchmarks.
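The toy cache below illustrates the general idea of a dynamically managed test-time knowledge base: confident online samples are stored as feature-label pairs, and a query is classified by a similarity-weighted vote over its nearest cached entries. The class name, capacity policy, and voting rule are illustrative assumptions rather than Point-Cache's hierarchical design.

```python
import torch

class FeatureCache:
    def __init__(self, capacity=64):
        self.capacity, self.feats, self.labels = capacity, [], []

    def add(self, feat, label, confidence, min_conf=0.8):
        if confidence < min_conf:
            return                                  # only keep high-quality samples
        if len(self.feats) >= self.capacity:
            self.feats.pop(0); self.labels.pop(0)   # drop the oldest entry
        self.feats.append(feat); self.labels.append(label)

    def lookup(self, query, num_classes, topk=3):
        if not self.feats:
            return torch.zeros(num_classes)
        bank = torch.stack(self.feats)
        sims = torch.nn.functional.cosine_similarity(query[None], bank)
        top = sims.topk(min(topk, len(self.feats)))
        logits = torch.zeros(num_classes)
        for s, i in zip(top.values, top.indices):
            logits[self.labels[i]] += s             # similarity-weighted vote per class
        return logits

cache = FeatureCache()
cache.add(torch.randn(32), label=3, confidence=0.95)
print(cache.lookup(torch.randn(32), num_classes=10))
```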

Authors:Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, Rita Cucchiara
Title: Hyperbolic Safety-Aware Vision-Language Models
Abstract:
Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model's knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling, ineffective in standard vision-language models due to their reliance on Euclidean embeddings, endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models. Our source code is available at https://github.com/aimagelab/HySAC.
中文: 本文提出HySAC模型,通过双曲空间分层编码安全与不安全内容,使视觉语言模型既能有效识别不安全内容,又能灵活控制检索结果,在提升安全性的同时保持模型适应性和可解释性。
English: This paper introduces HySAC, a novel safety-aware CLIP model that uses hyperbolic space to hierarchically encode safe and unsafe content, enabling both effective unsafe content classification and flexible retrieval while maintaining adaptability and interpretability.

Authors:Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, Shohreh Kasaei
Title: A Comprehensive Survey on Knowledge Distillation
Abstract:
Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and transformer models with a tremendous number of parameters, deploying these large models on edge devices causes serious issues such as high runtime and memory consumption. This is especially concerning with the recent large-scale foundation models, Vision-Language Models (VLMs), and Large Language Models (LLMs). Knowledge Distillation (KD) is one of the prominent techniques proposed to address the aforementioned problems using a teacher-student architecture. More specifically, a lightweight student model is trained using additional knowledge from a cumbersome teacher model. In this work, a comprehensive survey of knowledge distillation methods is proposed. This includes reviewing KD from different aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications of distillation, and comparison among existing methods. In contrast to most existing surveys, which are either outdated or simply update former surveys, this work proposes a comprehensive survey with a new point of view and representation structure that categorizes and investigates the most recent methods in knowledge distillation. This survey considers various critically important subcategories, including KD for diffusion models, 3D inputs, foundational models, transformers, and LLMs. Furthermore, existing challenges in KD and possible future research directions are discussed. GitHub page of the project: https://github.com/IPL-Sharif/KD_Survey
中文: 本文从蒸馏来源、算法机制、模态应用等维度系统综述了知识蒸馏方法,针对大模型部署难题提出创新分类体系,并探讨了扩散模型、大语言模型等前沿领域的应用挑战与发展方向。
English: This comprehensive survey reviews knowledge distillation methods from multiple perspectives, addressing deployment challenges of large models while introducing novel categorization frameworks and examining emerging applications like diffusion models and LLMs.
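As a reference point for the teacher-student framing the survey reviews, the snippet below implements the classic soft-target distillation loss (temperature-scaled KL blended with cross-entropy); it is a generic baseline, not any specific method from the survey, and the temperature and blending weight are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend soft-target KL distillation with the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients to match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(kd_loss(s, t, y).item())
```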

Authors:Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, Siyuan Huang
Title: TACO: Taming Diffusion for in-the-wild Video Amodal Completion
Abstract:
Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at https://jason-aplp.github.io/TACO.
中文: 本文提出TACO条件扩散模型,通过利用预训练视频扩散模型的丰富流形,在非结构化真实场景视频中实现一致的视频全貌补全,并在多种数据集与下游任务中展现出卓越的泛化能力。
English: This paper introduces TACO, a conditional diffusion model that leverages pre-trained video diffusion models to achieve consistent video amodal completion for objects in unstructured, in-the-wild videos, demonstrating strong generalization across diverse datasets and downstream applications.

Authors:Zhedong Zhang, Liang Li, Chenggang Yan, Chunshan Liu, Anton van den Hengel, Yuankai Qi
Title: Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
Abstract:
Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinder the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose a prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design a disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against the state-of-the-art models on two primary benchmarks. The demos are available at https://zzdoog.github.io/ProDubber/.

Authors:Zhengyuan Peng, Jinpeng Ma, Zhimin Sun, Ran Yi, Haichuan Song, Xin Tan, Lizhuang Ma
Title: MOS: Modeling Object-Scene Associations in Generalized Category Discovery
Abstract:
Generalized Category Discovery (GCD) is a classification task that aims to classify both base and novel classes in unlabeled images, using knowledge from a labeled dataset. In GCD, previous research overlooks scene information or treats it as noise, reducing its impact during model training. However, in this paper, we argue that scene information should be viewed as a strong prior for inferring novel classes. We attribute the misinterpretation of scene information to a key factor: the Ambiguity Challenge inherent in GCD. Specifically, novel objects in base scenes might be wrongly classified into base categories, while base objects in novel scenes might be mistakenly recognized as novel categories. Once the ambiguity challenge is addressed, scene information can reach its full potential, significantly enhancing the performance of GCD models. To more effectively leverage scene information, we propose the Modeling Object-Scene Associations (MOS) framework, which utilizes a simple MLP-based scene-awareness module to enhance GCD performance. It achieves an exceptional average accuracy improvement of 4% on the challenging fine-grained datasets compared to state-of-the-art methods, emphasizing its superior performance in fine-grained GCD. The code is publicly available at https://github.com/JethroPeng/MOS
中文摘要:本文提出建模对象-场景关联(MOS)框架,通过将场景信息作为先验知识来解决广义类别发现中的模糊性挑战,在细粒度数据集上实现了4%的准确率提升。
English Summary: This paper introduces the Modeling Object-Scene Associations (MOS) framework to address the Ambiguity Challenge in Generalized Category Discovery by leveraging scene information as a prior, achieving a 4% accuracy improvement on fine-grained datasets.

Authors:Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez
Title: Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
Abstract:
End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment, struggle to react quickly to other agents in closed-loop environments, and risk generating kinematically infeasible plans due to the gap between open-loop training and closed-loop driving. In this paper, we introduce Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network in one model. Unlike current open-loop trajectory prediction models that only handle general-case planning, Hydra-NeXt further utilizes a control decoder to focus on short-term actions, which enables faster responses to dynamic situations and reactive agents. Moreover, we propose the Trajectory Refinement module to augment and refine the planning decisions by effectively adhering to kinematic constraints in closed-loop environments. This unified approach bridges the gap between open-loop training and closed-loop driving, demonstrating superior performance of 65.89 Driving Score (DS) and a 48.20% Success Rate (SR) on the Bench2Drive dataset without relying on external experts for data collection. Hydra-NeXt surpasses the previous state-of-the-art by 22.98 DS and 17.49 SR, marking a significant advancement in autonomous driving. Code will be available at https://github.com/woxihuanjiangguo/Hydra-NeXt.
Chinese: Hydra-NeXt是一种新颖的多分支规划框架,通过统一轨迹预测、控制预测和轨迹优化,弥合了开环训练与闭环部署之间的差距,在自动驾驶中实现了卓越性能。
English: Hydra-NeXt is a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and trajectory refinement to bridge the gap between open-loop training and closed-loop deployment, achieving superior performance in autonomous driving.

Authors:Byeongjun Park, Hyojun Go, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
Title: SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
Abstract:
Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.

Authors:Donglin Yang, Paul Vicol, Xiaojuan Qi, Renjie Liao, Xiaofan Zhang
Title: QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution
Abstract:
Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions, represented by leaf nodes, where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM's effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at https://github.com/linYDTHU/QDM.
中文: 四叉树扩散模型(QDM)提出了一种区域自适应扩散框架,利用四叉树结构选择性增强细节丰富区域,在保持高保真超分辨率性能的同时显著降低计算冗余,特别适用于医疗影像等大均匀区域图像处理。
English: The Quadtree Diffusion Model (QDM) introduces a region-adaptive diffusion framework that selectively enhances detail-rich areas using a quadtree structure, significantly reducing computational redundancy while maintaining high-fidelity super-resolution performance across diverse image types.
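The recursive sketch below shows one simple way a quadtree can be driven by local detail: blocks whose variance exceeds a threshold are subdivided and homogeneous blocks become leaves. The variance criterion and thresholds are illustrative assumptions; QDM derives its mask from the low-quality input differently.

```python
import torch

def quadtree_leaves(img, x=0, y=0, size=None, var_thresh=0.01, min_size=8):
    """img: square 2D tensor in [0, 1]; returns a list of (x, y, size) leaf blocks."""
    size = size or img.shape[0]
    block = img[y:y + size, x:x + size]
    if size <= min_size or block.var() < var_thresh:
        return [(x, y, size)]                       # homogeneous or small enough: leaf node
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):                        # subdivide detail-rich block into 4 children
            leaves += quadtree_leaves(img, x + dx, y + dy, half, var_thresh, min_size)
    return leaves

img = torch.rand(64, 64)
print(len(quadtree_leaves(img)))   # many small leaves for a high-variance image
```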

Authors:Shun Zou, Yi Zou, Mingya Zhang, Shipeng Luo, Guangwei Gao, Guojun Qi
Title: Learning Dual-Domain Multi-Scale Representations for Single Image Deraining
Abstract:
Existing image deraining methods typically rely on single-input, single-output, and single-scale architectures, which overlook the joint multi-scale information between external and internal features. Furthermore, single-domain representations are often too restrictive, limiting their ability to handle the complexities of real-world rain scenarios. To address these challenges, we propose a novel Dual-Domain Multi-Scale Representation Network (DMSR). The key idea is to exploit joint multi-scale representations from both external and internal domains in parallel while leveraging the strengths of both spatial and frequency domains to capture more comprehensive properties. Specifically, our method consists of two main components: the Multi-Scale Progressive Spatial Refinement Module (MPSRM) and the Frequency Domain Scale Mixer (FDSM). The MPSRM enables the interaction and coupling of multi-scale expert information within the internal domain using a hierarchical modulation and fusion strategy. The FDSM extracts multi-scale local information in the spatial domain, while also modeling global dependencies in the frequency domain. Extensive experiments show that our model achieves state-of-the-art performance across six benchmark datasets.

Authors:Zhe Shan, Yang Liu, Lei Zhou, Cheng Yan, Heng Wang, Xia Xie
Title: ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object
Abstract:
The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. The ROS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM's generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we design the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while ROS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, ROS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm ROS-SAM as a powerful tool for fine-grained segmentation in remote sensing applications. Code is available at https://github.com/ShanZard/ROS-SAM.
中文: 本文提出ROS-SAM方法,通过领域自适应、增强特征区分度和上下文感知掩码生成,实现遥感视频的高质量交互式分割,IoU提升13%并展现出色的零样本能力。
English: This paper introduces ROS-SAM, a method that achieves high-quality interactive segmentation for remote sensing videos through domain adaptation, enhanced feature discriminability, and context-aware mask generation, improving IoU by 13% and demonstrating strong zero-shot performance.
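For context on the LoRA-based fine-tuning mentioned above, the sketch below wraps a frozen linear projection with a trainable low-rank update, which is the usual way LoRA is attached to attention layers. The rank, scaling, and class name are placeholder assumptions, not the ROS-SAM code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep the pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base projection plus a trainable low-rank correction A @ B.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

layer = LoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])
```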

Authors:Shun Zou, Yi Zou, Mingya Zhang, Shipeng Luo, Zhihao Chen, Guangwei Gao
Title: Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition
Abstract:
In recent years, Transformers have driven significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) the quadratic complexity and redundant feature representation from interactions with irrelevant tokens; and (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and a Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs a gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. Code is available at https://zs1314.github.io/Fraesormer.
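A minimal top-k sparse attention sketch in the spirit of ATK-SPA is given below: only the k largest query-key scores per query survive the softmax. The learnable gated top-k operator is simplified to a fixed k, and all names are illustrative.

```python
import torch

def topk_sparse_attention(q, k, v, top_k=8):
    """q, k, v: (batch, heads, seq, dim); keeps only the top-k scores per query."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    kth = scores.topk(top_k, dim=-1).values[..., -1:]          # per-query k-th largest score
    scores = scores.masked_fill(scores < kth, float("-inf"))   # drop weak query-key matches
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 32, 16)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([1, 4, 32, 16])
```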

Authors:Wonwoong Cho, Yan-Ying Chen, Matthew Klenk, David I. Inouye, Yanxia Zhang
Title: Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
Abstract:
Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.

Authors:Dhruv Kudale, Badri Vishal Kasuba, Venkatapathy Subramanian, Parag Chaudhuri, Ganesh Ramakrishnan
Title: SPRINT: Script-agnostic Structure Recognition in Tables
Abstract:
Table Structure Recognition (TSR) is vital for various downstream tasks like information retrieval, table reconstruction, and document understanding. While most state-of-the-art (SOTA) research predominantly focuses on TSR in English documents, the need for similar capabilities in other languages is evident, considering the global diversity of data. Moreover, creating substantial labeled data in non-English languages and training these SOTA models from scratch is costly and time-consuming. We reframe TSR as language-agnostic cell arrangement prediction and introduce SPRINT, Script-agnostic Structure Recognition in Tables. SPRINT uses recently introduced Optimized Table Structure Language (OTSL) sequences to predict table structures. We show that when coupled with a pre-trained table grid estimator, SPRINT can improve the overall tree edit distance-based similarity structure scores of tables even for non-English documents. We experimentally evaluate our performance across benchmark TSR datasets including PubTabNet, FinTabNet, and PubTables-1M. Our findings reveal that SPRINT not only matches SOTA models in performance on standard datasets but also demonstrates lower latency. Additionally, SPRINT excels in accurately identifying table structures in non-English documents, surpassing current leading models with an absolute average improvement of 11.12%. We also present an algorithm for converting valid OTSL predictions into a widely used HTML-based table representation. To encourage further research, we release our code and the Multilingual Scanned and Scene Table Structure Recognition Dataset (MUSTARD), labeled with OTSL sequences for 1428 tables in thirteen languages encompassing several scripts, at https://github.com/IITB-LEAP-OCR/SPRINT
中文:SPRINT框架将表格结构识别视为与语言无关的单元格布局预测,不仅性能超越现有最优模型且延迟更低,在非英文文档处理上平均提升11.12%,同时发布了多语言数据集以推动相关研究。
English: The proposed SPRINT framework treats table structure recognition as language-agnostic cell arrangement prediction, demonstrating superior performance over state-of-the-art models with lower latency and achieving an 11.12% average improvement for non-English documents while releasing a multilingual dataset to advance research.
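
The abstract mentions an algorithm that converts valid OTSL predictions into HTML tables. The sketch below is a simplified, hypothetical converter that assumes a minimal OTSL vocabulary of C (new cell), L (merge with the cell to the left), U (merge with the cell above), X (merge both ways), and NL (row break); the released implementation presumably covers the full vocabulary and malformed sequences.

```python
def otsl_to_html(tokens, cell_texts=None):
    """Convert a (simplified) OTSL token sequence into an HTML <table> string.

    Assumed vocabulary: "C" new cell, "L" merged with left neighbour,
    "U" merged with the cell above, "X" merged both ways, "NL" row break.
    `cell_texts` optionally supplies the contents of the "C" cells, in order.
    """
    # Split the flat token stream into grid rows.
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        rows.append(current)

    texts = iter(cell_texts or [])
    html = ["<table>"]
    for r, row in enumerate(rows):
        html.append("  <tr>")
        for c, tok in enumerate(row):
            if tok != "C":          # L / U / X cells are covered by a span
                continue
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "L":
                colspan += 1
            rowspan = 1
            while (r + rowspan < len(rows) and c < len(rows[r + rowspan])
                   and rows[r + rowspan][c] in ("U", "X")):
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            html.append(f"    <td{attrs}>{next(texts, '')}</td>")
        html.append("  </tr>")
    html.append("</table>")
    return "\n".join(html)

# A 2x2 cell spanning both rows and columns, followed by two normal cells.
print(otsl_to_html(["C", "L", "C", "NL", "U", "X", "C", "NL"],
                   ["span", "a", "b"]))
```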

Authors:Md Abu Bakr Siddique, Vaishnav Ramesh, Junliang Liu, Piyush Singh, Md Jahidul Islam
Title: UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis
Abstract:
The concept of waterbody style transfer remains largely unexplored in the underwater imaging and vision literature. Traditional image style transfer (STx) methods primarily focus on artistic and photorealistic blending, often failing to preserve object and scene geometry in images captured in high-scattering mediums such as underwater. The wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts further complicate learning underwater image STx from unpaired data. This paper introduces UStyle, the first data-driven learning framework for transferring waterbody styles across underwater images without requiring prior reference images or scene information. We propose a novel depth-aware whitening and coloring transform (DA-WCT) mechanism that integrates physics-based waterbody synthesis to ensure perceptually consistent stylization while preserving scene structure. To enhance style transfer quality, we incorporate carefully designed loss functions that guide UStyle to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content in VGG and CLIP (contrastive language-image pretraining) feature spaces. By addressing domain-specific challenges, UStyle provides a robust framework for no-reference underwater image STx, surpassing state-of-the-art (SOTA) methods that rely solely on end-to-end reconstruction loss. Furthermore, we introduce the UF7D dataset, a curated collection of high-resolution underwater images spanning seven distinct waterbody styles, establishing a benchmark to support future research in underwater image STx. The UStyle inference pipeline and UF7D dataset are released at: https://github.com/uf-robopi/UStyle.
中文摘要:本文提出首个数据驱动的水体风格迁移框架UStyle,通过结合基于物理的水体合成和专门设计的损失函数,在无需参考图像的情况下克服传统方法的局限,既能保持场景几何结构又能实现卓越的风格化效果。
English Summary: This paper introduces UStyle, the first data-driven framework for underwater image style transfer that overcomes traditional methods' limitations by integrating physics-based waterbody synthesis and specialized loss functions to preserve scene geometry while achieving superior stylization without requiring reference images.
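
The DA-WCT mechanism builds on the classical whitening-and-coloring transform. The sketch below shows plain WCT on a single feature map: whiten the content features to identity covariance, then re-color them with the style statistics. The depth-aware weighting and physics-based waterbody synthesis described above are omitted, so this is only an illustrative baseline, not the UStyle pipeline.

```python
import torch

def wct(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5):
    """Plain whitening-and-coloring transform on (C, H, W) feature maps."""
    c, h, w = content_feat.shape
    cf = content_feat.reshape(c, -1)
    sf = style_feat.reshape(c, -1)

    # Whiten content features: zero mean, identity covariance.
    c_mean = cf.mean(dim=1, keepdim=True)
    cf = cf - c_mean
    c_cov = cf @ cf.t() / (cf.shape[1] - 1) + eps * torch.eye(c)
    c_vals, c_vecs = torch.linalg.eigh(c_cov)
    whitened = c_vecs @ torch.diag(c_vals.clamp_min(eps).rsqrt()) @ c_vecs.t() @ cf

    # Re-color with the style covariance and add back the style mean.
    s_mean = sf.mean(dim=1, keepdim=True)
    sf = sf - s_mean
    s_cov = sf @ sf.t() / (sf.shape[1] - 1) + eps * torch.eye(c)
    s_vals, s_vecs = torch.linalg.eigh(s_cov)
    colored = s_vecs @ torch.diag(s_vals.clamp_min(eps).sqrt()) @ s_vecs.t() @ whitened

    return (colored + s_mean).reshape(c, h, w)

if __name__ == "__main__":
    content = torch.randn(64, 32, 32)
    style = torch.randn(64, 32, 32) * 2.0 + 1.0
    print(wct(content, style).shape)  # torch.Size([64, 32, 32])
```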

Authors:Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, Zhengzhong Tu
Title: DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
Abstract:
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges to achieve effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by aligning latent distribution matching with Maximum Mean Discrepancy regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, thereby further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in enhancing superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning scenarios. Our project page is at https://taco-group.github.io/DecAlign and the code is available at https://github.com/taco-group/DecAlign.
中文摘要:DecAlign框架通过将多模态表征解耦为独有与共享特征,采用原型引导的最优传输对齐策略,在保留模态特性的同时显著提升了跨模态语义一致性,并在多项基准测试中超越现有最优方法。
English Summary: The DecAlign framework effectively separates multimodal representations into unique and shared features, using advanced alignment strategies to enhance cross-modal collaboration and achieve state-of-the-art performance across multiple benchmarks.
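
One ingredient named above is Maximum Mean Discrepancy (MMD) regularization on the modality-common features. Below is a minimal, self-contained Gaussian-kernel MMD estimator that could serve as such a regularizer; the kernel bandwidths and the way it would be attached to the homogeneous branch are assumptions, not the DecAlign implementation.

```python
import torch

def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, bandwidths=(1.0, 2.0, 4.0)):
    """Biased MMD^2 estimate between two feature sets x: (n, d), y: (m, d),
    using a mixture of Gaussian (RBF) kernels."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return sum(torch.exp(-d2 / (2 * bw ** 2)) for bw in bandwidths)

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

if __name__ == "__main__":
    text_common = torch.randn(128, 256)
    vision_common = torch.randn(128, 256) + 0.5
    # Positive value when the two "modality-common" distributions differ.
    print(float(gaussian_mmd(text_common, vision_common)))
```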

Authors:Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, Xiao Xiang Zhu
Title: Towards a Unified Copernicus Foundation Model for Earth Vision
Abstract:
Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research. Codes, datasets and models are available at https://github.com/zhu-xlab/Copernicus-FM.
中文摘要:哥白尼项目通过整合多传感器卫星数据并采用统一架构,推出了新一代地球观测基础模型,大幅提升了可扩展性和跨领域应用能力。
English Summary: The Copernicus project introduces a next-generation Earth observation foundation model that integrates multi-sensor satellite data through a unified architecture, significantly enhancing scalability and cross-domain applications.

Authors:Alexander Weers, Alexander H. Berger, Laurin Lux, Peter Schüffler, Daniel Rueckert, Johannes C. Paetzold
Title: From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis
Abstract:
The histopathological classification of whole-slide images (WSIs) is a fundamental task in digital pathology; yet it requires extensive time and expertise from specialists. While deep learning methods show promising results, they typically process WSIs by dividing them into artificial patches, which inherently prevents a network from learning from the entire image context, disregards natural tissue structures and compromises interpretability. Our method overcomes this limitation through a novel graph-based framework that constructs WSI graph representations. The WSI-graph efficiently captures essential histopathological information in a compact form. We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches all while providing interpretable features for explainability. Through adaptive graph coarsening guided by learned embeddings, we progressively merge regions while maintaining discriminative local features and enabling efficient global information exchange. In our method's final step, we solve the diagnostic task through a graph attention network. We empirically demonstrate strong performance on multiple challenging tasks such as cancer stage classification and survival prediction, while also identifying predictive factors using Integrated Gradients. Our implementation is publicly available at https://github.com/HistoGraph31/pix2pathology
中文摘要:本研究提出了一种基于生物信息的图框架,将全切片图像转化为可解释的表示形式用于癌症诊断,在显著减少资源使用的同时保持竞争力,并具备完全可解释性。
English Summary: This study introduces a biologically-informed graph framework that transforms whole-slide images into interpretable representations for cancer diagnosis, achieving competitive performance with significantly fewer resources while maintaining full interpretability.

Authors:Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Shangqing Xu, Shiyu Wang, Qingsong Wen, Tom Hartvigsen, Fei Wang, B. Aditya Prakash
Title: How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook
Abstract:
Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to "richer" modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We notice that many recent TSA works have formed a new research field, i.e., Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works follow a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. We further group the works by the introduced modality type, including text, images, audio, tables, and others, within each perspective. Finally, we identify the gaps with future opportunities, including the reused modalities selections, heterogeneous modality combinations, and unseen tasks generalizations, corresponding to the three benefits. We release an up-to-date GitHub repository that includes key papers and resources.
中文摘要:本调查首次系统综述了多模态时间序列分析这一新兴领域,详细阐述了利用其他模态提升时间序列分析的三大优势,并指出了未来研究方向。
English Summary: This survey introduces MM4TSA as an emerging field exploring how time series analysis can benefit from integrating multiple modalities, systematically reviewing three key benefits and outlining future research opportunities.

Authors:Peizhi Yan, Rabab K. Ward, Dan Wang, Qiang Tang, Shan Du
Title: StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model
Abstract:
For 3D face modeling, the recently developed 3D-aware neural rendering methods are able to render photorealistic face images with arbitrary viewing directions. The training of the parametric controllable 3D-aware face models, however, still relies on a large-scale dataset that is lab-collected. To address this issue, this paper introduces "StyleMorpheus", the first style-based neural 3D Morphable Face Model (3DMM) that is trained on in-the-wild images. It inherits 3DMM's disentangled controllability (over face identity, expression, and appearance) but without the need for accurately reconstructed explicit 3D shapes. StyleMorpheus employs an auto-encoder structure. The encoder aims at learning a representative disentangled parametric code space and the decoder improves the disentanglement using shape and appearance-related style codes in the different sub-modules of the network. Furthermore, we fine-tune the decoder through style-based generative adversarial learning to achieve photorealistic 3D rendering quality. The proposed style-based design enables StyleMorpheus to achieve state-of-the-art 3D-aware face reconstruction results, while also allowing disentangled control of the reconstructed face. Our model achieves real-time rendering speed, allowing its use in virtual reality applications. We also demonstrate the capability of the proposed style-based design in face editing applications such as style mixing and color editing. Project homepage: https://github.com/ubc-3d-vision-lab/StyleMorpheus.
中文: 本文提出StyleMorpheus,首个基于风格且能在非受控图像上训练的3D可变形人脸模型,无需精确三维重建即可实现照片级真实感渲染,并保持对人脸属性的解耦控制。
English: This paper introduces StyleMorpheus, a style-based 3D Morphable Face Model trained on in-the-wild images that achieves photorealistic rendering with disentangled control over facial attributes without requiring explicit 3D shapes.

Authors:Artem Nikonorov, Georgy Perevozchikov, Andrei Korepanov, Nancy Mehta, Mahmoud Afifi, Egor Ershov, Radu Timofte
Title: Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks
Abstract:
We present cmKAN, a versatile framework for color matching. Given an input image with colors from a source color distribution, our method effectively and accurately maps these colors to match a target color distribution in both supervised and unsupervised settings. Our framework leverages the spline capabilities of Kolmogorov-Arnold Networks (KANs) to model the color matching between source and target distributions. Specifically, we developed a hypernetwork that generates spatially varying weight maps to control the nonlinear splines of a KAN, enabling accurate color matching. As part of this work, we introduce a first large-scale dataset of paired images captured by two distinct cameras and evaluate the efficacy of our and existing methods in matching colors. We evaluated our approach across various color-matching tasks, including: (1) raw-to-raw mapping, where the source color distribution is in one camera's raw color space and the target in another camera's raw space; (2) raw-to-sRGB mapping, where the source color distribution is in a camera's raw space and the target is in the display sRGB space, emulating the color rendering of a camera ISP; and (3) sRGB-to-sRGB mapping, where the goal is to transfer colors from a source sRGB space (e.g., produced by a source camera ISP) to a target sRGB space (e.g., from a different camera ISP). The results show that our method outperforms existing approaches by 37.3% on average for supervised and unsupervised cases while remaining lightweight compared to other methods. The codes, dataset, and pre-trained models are available at: https://github.com/gosha20777/cmKAN
中文:cmKAN框架利用基于超网络生成样条的Kolmogorov-Arnold网络,在多种色彩匹配场景中实现卓越性能,以轻量化设计超越现有方法37.3%的显著优势。
English: The cmKAN framework utilizes Kolmogorov-Arnold Networks with hypernetwork-generated splines to achieve superior color matching across multiple scenarios, outperforming existing methods by 37.3% while maintaining a lightweight design.

Authors:Tianyi Zhao, Boyang Liu, Yanglei Gao, Yiming Sun, Maoxun Yuan, Xingxing Wei
Title: Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning
Abstract:
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.
中文: 本文针对RGB-IR多模态目标检测中的融合退化问题,提出了M²D-LIF框架,通过增强单模态学习和实现有效特征融合来提升检测性能。
English: This paper addresses the Fusion Degradation issue in RGB-IR multi-modal object detection by proposing the M²D-LIF framework, which enhances mono-modality learning and enables effective feature fusion to improve detection performance.

Authors:Rashik Shrestha, Madhav Rijal, Trevor Smith, Yu Gu
Title: FloPE: Flower Pose Estimation for Precision Pollination
Abstract:
This study presents Flower Pose Estimation (FloPE), a real-time flower pose estimation framework for computationally constrained robotic pollination systems. Robotic pollination has been proposed to supplement natural pollination to ensure global food security due to the decreased population of natural pollinators. However, flower pose estimation for pollination is challenging due to natural variability, flower clusters, and high accuracy demands due to the flowers' fragility when pollinating. This method leverages 3D Gaussian Splatting to generate photorealistic synthetic datasets with precise pose annotations, enabling effective knowledge distillation from a high-capacity teacher model to a lightweight student model for efficient inference. The approach was evaluated on both single and multi-arm robotic platforms, achieving a mean pose estimation error of 0.6 cm and 19.14 degrees at a low computational cost. Our experiments validate the effectiveness of FloPE, achieving up to 78.75% pollination success rate and outperforming prior robotic pollination techniques.

Authors:Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Title: VGGT: Visual Geometry Grounded Transformer
Abstract:
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
Chinese: VGGT是一种前馈神经网络,能够从单张或多张图像中高效推断完整的3D场景属性,无需复杂后处理即可在多项任务中实现最先进的性能。
English: VGGT is a feed-forward neural network that efficiently infers comprehensive 3D scene attributes from varying numbers of views, achieving state-of-the-art performance across multiple tasks without complex post-processing.

Authors:Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang
Title: ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Abstract:
Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a simple yet powerful video conditioning mechanism--its capability is often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset are publicly available at: https://github.com/KwaiVGI/ReCamMaster.
Chinese: ReCamMaster 是一种创新的视频重渲染框架,通过利用预训练文本-视频模型和精心构建的合成数据集,能够为输入视频生成新的摄像机轨迹,在视频生成任务中显著优于现有方法。
English: ReCamMaster is a novel framework that re-renders input videos with new camera trajectories by leveraging pre-trained text-to-video models and a curated synthetic dataset, significantly outperforming existing methods in video generation tasks.

Authors:Stefan Lionar, Jiabin Liang, Gim Hee Lee
Title: TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing
Abstract:
We introduce TreeMeshGPT, an autoregressive Transformer designed to generate high-quality artistic meshes aligned with input point clouds. Instead of the conventional next-token prediction in autoregressive Transformer, we propose a novel Autoregressive Tree Sequencing where the next input token is retrieved from a dynamically growing tree structure that is built upon the triangle adjacency of faces within the mesh. Our sequencing enables the mesh to extend locally from the last generated triangular face at each step, and therefore reduces training difficulty and improves mesh quality. Our approach represents each triangular face with two tokens, achieving a compression rate of approximately 22% compared to the naive face tokenization. This efficient tokenization enables our model to generate highly detailed artistic meshes with strong point cloud conditioning, surpassing previous methods in both capacity and fidelity. Furthermore, our method generates mesh with strong normal orientation constraints, minimizing flipped normals commonly encountered in previous methods. Our experiments show that TreeMeshGPT enhances the mesh generation quality with refined details and normal orientation consistency.
中文: TreeMeshGPT是一种自回归Transformer,通过创新的自回归树序列和高效面片标记化,从输入点云生成高质量艺术网格,在细节、保真度和法向一致性方面优于现有方法。
English: TreeMeshGPT is an autoregressive Transformer that generates high-quality artistic meshes from input point clouds using a novel Autoregressive Tree Sequencing and efficient face tokenization, improving detail, fidelity, and normal orientation consistency over previous methods.

Authors:Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
Title: Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Abstract:
State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.
中文:VAMBA混合Mamba-Transformer模型通过采用线性复杂度的Mamba-2模块,有效解决了传统Transformer模型处理长视频时的计算瓶颈,在保持所有视频令牌的同时大幅提升了训练推理效率,并在长视频理解任务上实现了准确率的显著提升。
English: VAMBA, a hybrid Mamba-Transformer model, overcomes the computational limitations of transformer-based large multimodal models by using Mamba-2 blocks to efficiently process hour-long videos with linear complexity, achieving significant improvements in GPU memory usage, speed, and accuracy on long video understanding benchmarks.

Authors:Tianrui Pan, Lin Liu, Jie Liu, Xiaopeng Zhang, Jie Tang, Gangshan Wu, Qi Tian
Title: RASA: Replace Anyone, Say Anything -- A Training-Free Framework for Audio-Driven and Universal Portrait Video Editing
Abstract:
Portrait video editing focuses on modifying specific attributes of portrait videos, guided by audio or video streams. Previous methods typically either concentrate on lip-region reenactment or require training specialized models to extract keypoints for motion transfer to a new identity. In this paper, we introduce a training-free universal portrait video editing framework that provides a versatile and adaptable editing strategy. This framework supports portrait appearance editing conditioned on the changed first reference frame, as well as lip editing conditioned on varied speech, or a combination of both. It is based on a Unified Animation Control (UAC) mechanism with source inversion latents to edit the entire portrait, including visual-driven shape control, audio-driven speaking control, and inter-frame temporal control. Furthermore, our method can be adapted to different scenarios by adjusting the initial reference frame, enabling detailed editing of portrait videos with specific head rotations and facial expressions. This comprehensive approach ensures a holistic and flexible solution for portrait video editing. The experimental results show that our model can achieve more accurate and synchronized lip movements for the lip editing task, as well as more flexible motion transfer for the appearance editing task. Demo is available at https://alice01010101.github.io/RASA/.
中文: 本文提出了一种无需训练的通用人像视频编辑框架,通过统一动画控制机制支持外观与唇形编辑,实现了精确的唇语同步和灵活的运动迁移。
English: This paper presents a training-free universal portrait video editing framework that supports appearance and lip editing through a Unified Animation Control mechanism, achieving accurate lip synchronization and flexible motion transfer.

Authors:Parsa Rahimi, Damien Teney, Sebastien Marcel
Title: AugGen: Synthetic Augmentation Can Improve Discriminative Models
Abstract:
The increasing reliance on large-scale datasets in machine learning poses significant privacy and ethical challenges, particularly in sensitive domains such as face recognition (FR). Synthetic data generation offers a promising alternative; however, most existing methods depend heavily on external datasets or pre-trained models, increasing complexity and resource demands. In this paper, we introduce AugGen, a self-contained synthetic augmentation technique. AugGen strategically samples from a class-conditional generative model trained exclusively on the target FR dataset, eliminating the need for external resources. Evaluated across 8 FR benchmarks, including IJB-C and IJB-B, our method achieves 1-12% performance improvements, outperforming models trained solely on real data and surpassing state-of-the-art synthetic data generation approaches, while using less real data. Notably, these gains often exceed those from architectural modifications, underscoring the value of synthetic augmentation in data-limited scenarios. Our findings demonstrate that carefully integrated synthetic data can both mitigate privacy constraints and substantially enhance discriminative performance in face recognition. Paper website: https://parsa-ra.github.io/auggen/.
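
A minimal sketch of the augmentation recipe described above, assuming a class-conditional generator `G(z, labels)` has already been trained on the target dataset: synthetic samples are drawn per class and concatenated with the real training set. The generator interface, latent size, and mixing ratio are illustrative assumptions rather than the AugGen pipeline.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

def build_augmented_dataset(real_images, real_labels, generator, num_classes,
                            synth_per_class=4, latent_dim=128):
    """Mix real samples with synthetic samples drawn from a class-conditional
    generator trained only on the target dataset (no external resources)."""
    synth_images, synth_labels = [], []
    with torch.no_grad():
        for cls in range(num_classes):
            z = torch.randn(synth_per_class, latent_dim)
            y = torch.full((synth_per_class,), cls, dtype=torch.long)
            synth_images.append(generator(z, y))
            synth_labels.append(y)
    synthetic = TensorDataset(torch.cat(synth_images), torch.cat(synth_labels))
    real = TensorDataset(real_images, real_labels)
    return ConcatDataset([real, synthetic])

if __name__ == "__main__":
    fake_gen = lambda z, y: torch.randn(z.shape[0], 3, 32, 32)  # stand-in generator
    real_x, real_y = torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,))
    mixed = build_augmented_dataset(real_x, real_y, fake_gen, num_classes=10)
    loader = DataLoader(mixed, batch_size=16, shuffle=True)
    print(len(mixed))  # 100 real + 40 synthetic samples
```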

Authors:Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, Qi Dai, Lili Qiu, Chong Luo, Lingqiao Liu
Title: HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models
Abstract:
Text-to-video generation poses significant challenges due to the inherent complexity of video data, which spans both temporal and spatial dimensions. It introduces additional redundancy, abrupt variations, and a domain gap between language and vision tokens during generation. Addressing these challenges requires an effective video tokenizer that can efficiently encode video data while preserving essential semantic and spatiotemporal information, serving as a critical bridge between text and vision. Inspired by the observation in VQ-VAE-2 and workflows of traditional animation, we propose HiTVideo for text-to-video generation with hierarchical tokenizers. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks. Higher layers capture semantic information with higher compression, while lower layers focus on fine-grained spatiotemporal details, striking a balance between compression efficiency and reconstruction quality. Our approach efficiently encodes longer video sequences (e.g., 8 seconds, 64 frames), reducing bits per pixel (bpp) by approximately 70% compared to baseline tokenizers, while maintaining competitive reconstruction quality. We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of high-compressed semantic tokens in text-to-video tasks. HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks, striving for higher compression ratios, simplifying LLM modeling under language guidance, and offering a scalable and promising framework for advancing text-to-video generation. Demo page: https://ziqinzhou66.github.io/project/HiTVideo.
中文: HiTVideo采用分层标记器和3D因果VAE,将视频高效编码为结构化码本,在保持质量的同时显著降低比特率约70%,有效弥合文本与视觉的鸿沟,推动文本到视频生成的发展。
English: HiTVideo introduces a hierarchical tokenizer using 3D causal VAE to efficiently encode videos into structured codebooks, achieving 70% lower bpp while maintaining quality and bridging text-vision gaps in generation.

Authors:Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
Title: TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Abstract:
Automatically synthesizing figures from text captions is a compelling capability. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.
中文摘要:TikZero通过图像表示将文本理解与图形程序生成解耦,实现了零样本文本到图形程序的合成,其性能超越基线模型,并在补充对齐数据时达到或超过大型商业系统的水平。
English Summary: TikZero enables zero-shot text-to-graphics program synthesis by decoupling text understanding from program generation through image representations, outperforming baseline models and matching larger commercial systems when supplemented with aligned data.

Authors:Seyed Mohammad Hadi Hosseini, Amir Mohammad Izadi, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Title: T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation
Abstract:
Although recent text-to-image generative models have achieved impressive performance, they still often struggle with capturing the compositional complexities of prompts, including attribute binding and spatial relationships between different entities. This misalignment is not revealed by common evaluation metrics such as CLIPScore. Recent works have proposed evaluation metrics that utilize Visual Question Answering (VQA) by decomposing prompts into questions about the generated image for more robust compositional evaluation. Although these methods align better with human evaluations, they still fail to fully cover the compositionality within the image. To address this, we propose a novel metric that breaks down images into components, and texts into fine-grained questions about the generated image for evaluation. Our method outperforms previous state-of-the-art metrics, demonstrating its effectiveness in evaluating text-to-image generative models. Code is available at https://github.com/hadi-hosseini/T2I-FineEval.
中文: 当前文本到图像生成模型常难以准确呈现复杂组合提示,现有评估指标亦不足以衡量这些方面,为此我们提出了一种新方法,通过分解图像和文本来进行更精确的评估,其性能优于现有技术。
English: Current text-to-image models often fail to accurately represent complex compositional prompts, and existing metrics inadequately assess these aspects, prompting the development of a new evaluation method that decomposes images and texts for more precise analysis, which outperforms previous approaches.

Authors:Sanghyun Jo, Seo Jin Lee, Seungwoo Lee, Seohyung Hong, Hyungseok Seo, Kyungsu Kim
Title: COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation
Abstract:
Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor-intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error-free instances as a key limitation, we present COIN (COnfidence score-guided INstance distillation), a novel annotation-free framework with three key steps: (1) Increasing the sensitivity for the presence of error-free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance-level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self-distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi- and weakly-supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code is available at https://github.com/shjo-april/COIN.
中文摘要:COIN是一种无需标注的细胞实例分割框架,通过置信度引导的实例蒸馏技术逐步识别无错误实例,有效克服了无监督模型的局限性,在多个数据集上实现了超越现有方法的性能表现。
English Summary: COIN is an annotation-free cell instance segmentation framework that uses confidence-guided instance distillation to overcome the limitations of unsupervised models by progressively identifying error-free instances, achieving superior performance across multiple datasets.
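
Step (2) above scores each predicted instance by its consistency with a refined mask. A minimal, hypothetical version of such a score is the IoU between the two masks, keeping only instances above a threshold as pseudo ground truth for the next self-distillation round; COIN's actual consistency measure and threshold schedule may differ.

```python
import numpy as np

def instance_confidence(pred_mask: np.ndarray, refined_mask: np.ndarray) -> float:
    """Confidence of one instance = IoU between the model's predicted mask
    and its refined counterpart (both boolean arrays of the same shape)."""
    inter = np.logical_and(pred_mask, refined_mask).sum()
    union = np.logical_or(pred_mask, refined_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def select_confident_instances(pred_masks, refined_masks, threshold=0.8):
    """Keep instances whose prediction/refinement agreement exceeds the threshold,
    to be reused as pseudo ground truth in the next self-distillation round."""
    return [rm for pm, rm in zip(pred_masks, refined_masks)
            if instance_confidence(pm, rm) >= threshold]

if __name__ == "__main__":
    a = np.zeros((64, 64), bool); a[10:30, 10:30] = True   # predicted instance
    b = np.zeros((64, 64), bool); b[12:30, 10:32] = True   # refined instance
    print(round(instance_confidence(a, b), 3))              # ~0.826
    print(len(select_confident_instances([a], [b], threshold=0.5)))  # 1
```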

Authors:Shiyuan Yang, Zheng Gu, Liang Hou, Xin Tao, Pengfei Wan, Xiaodong Chen, Jing Liao
Title: MTV-Inpaint: Multi-Task Long Video Inpainting
Abstract:
Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in unifying completion and insertion tasks, lacks input controllability, and struggles with long videos, thereby restricting their applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. Additionally, we propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multi-modal inpainting, object editing, removal, image object brush, and the ability to handle long videos. Project page: https://mtv-inpaint.github.io/.
中文:MTV-Inpaint是一个统一的视频修复框架,通过双分支空间注意力机制整合场景补全和对象插入,支持多模态控制并能高效处理长视频。
English: MTV-Inpaint is a unified video inpainting framework that integrates scene completion and object insertion through a dual-branch spatial attention mechanism, supporting multimodal control and efficient long-video processing.

Authors:Shuaifeng Jiao, Zhiwen Zeng, Zhuoqun Su, Xieyuanli Chen, Zongtan Zhou, Huimin Lu
Title: LuSeg: Efficient Negative and Positive Obstacles Segmentation via Contrast-Driven Multi-Modal Feature Fusion on the Lunar
Abstract:
As lunar exploration missions grow increasingly complex, ensuring safe and autonomous rover-based surface exploration has become one of the key challenges in lunar exploration tasks. In this work, we have developed a lunar surface simulation system called the Lunar Exploration Simulator System (LESS) and the LunarSeg dataset, which provides RGB-D data for lunar obstacle segmentation that includes both positive and negative obstacles. Additionally, we propose a novel two-stage segmentation network called LuSeg. Through contrastive learning, it enforces semantic consistency between the RGB encoder from Stage I and the depth encoder from Stage II. Experimental results on our proposed LunarSeg dataset and additional public real-world NPO road obstacle dataset demonstrate that LuSeg achieves state-of-the-art segmentation performance for both positive and negative obstacles while maintaining a high inference speed of approximately 57 Hz. We have released the implementation of our LESS system, LunarSeg dataset, and the code of LuSeg at: https://github.com/nubot-nudt/LuSeg.
中文: 本研究开发了月球表面模拟系统、LunarSeg数据集及新型两阶段分割网络LuSeg,在保持高推理速度的同时,实现了障碍物分割的最先进性能。
English: This study introduces a lunar surface simulation system, the LunarSeg dataset, and a novel two-stage segmentation network, LuSeg, which achieves state-of-the-art performance in obstacle segmentation while maintaining high inference speed.
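
The two-stage design enforces semantic consistency between the Stage-I RGB encoder and the Stage-II depth encoder through contrastive learning. Below is a generic InfoNCE-style consistency loss between pooled features of the two encoders; the temperature, pooling, and pairing scheme are assumptions for illustration, not the LuSeg loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(rgb_feats: torch.Tensor, depth_feats: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss pulling together RGB/depth features of the same scene
    (matching batch indices) and pushing apart features of different scenes.
    rgb_feats, depth_feats: (B, D) pooled embeddings from the two encoders."""
    rgb = F.normalize(rgb_feats, dim=1)
    depth = F.normalize(depth_feats, dim=1)
    logits = rgb @ depth.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(rgb.shape[0], device=rgb.device)
    # Symmetric loss: RGB -> depth and depth -> RGB retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    rgb = torch.randn(8, 256, requires_grad=True)
    depth = torch.randn(8, 256, requires_grad=True)
    loss = cross_modal_infonce(rgb, depth)
    loss.backward()
    print(float(loss))
```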

Authors:Ziyue Wang, Chenghao Shi, Neng Wang, Qinghua Yu, Xieyuanli Chen, Huimin Lu
Title: BEVDiffLoc: End-to-End LiDAR Global Localization in BEV View based on Diffusion Model
Abstract:
Localization is one of the core parts of modern robotics. Classic localization methods typically follow the retrieve-then-register paradigm, achieving remarkable success. Recently, the emergence of end-to-end localization approaches has offered distinct advantages, including a streamlined system architecture and the elimination of the need to store extensive map data. Although these methods have demonstrated promising results, current end-to-end localization approaches still face limitations in robustness and accuracy. Bird's-Eye-View (BEV) image is one of the most widely adopted data representations in autonomous driving. It significantly reduces data complexity while preserving spatial structure and scale consistency, making it an ideal representation for localization tasks. However, research on BEV-based end-to-end localization remains notably insufficient. To fill this gap, we propose BEVDiffLoc, a novel framework that formulates LiDAR localization as a conditional generation of poses. Leveraging the properties of BEV, we first introduce a specific data augmentation method to significantly enhance the diversity of input data. Then, the Maximum Feature Aggregation Module and Vision Transformer are employed to learn robust features while maintaining robustness against significant rotational view variations. Finally, we incorporate a diffusion model that iteratively refines the learned features to recover the absolute pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate that BEVDiffLoc outperforms the baseline methods. Our code is available at https://github.com/nubot-nudt/BEVDiffLoc.
中文: BEVDiffLoc是一种新颖的端到端激光雷达定位框架,它利用鸟瞰图表示和扩散模型迭代优化位姿估计,在基准数据集上展现出优于基线方法的性能。
English: BEVDiffLoc is a novel end-to-end LiDAR localization framework that uses Bird's-Eye-View representation and a diffusion model to iteratively refine pose estimation, demonstrating superior performance over baseline methods on benchmark datasets.

Authors:Xiaokang Wei, Bowen Zhang, Xianghui Yang, Yuxuan Wang, Chunchao Guo, Xi Zhao, Yan Luximon
Title: PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture
Abstract:
Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in the downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially varying properties of metallicity and roughness. In this work, we present PBR3DGen, a two-stage mesh generation method with high-quality PBR materials that integrates the novel multi-view PBR material estimation model and a 3D PBR mesh reconstruction model. Specifically, PBR3DGen leverages vision language models (VLM) to guide multi-view diffusion, precisely capturing the spatial distribution and inherent attributes of reflective-metalness material. Additionally, we incorporate view-dependent illumination-aware conditions as pixel-aware priors to enhance spatially varying material properties. Furthermore, our reconstruction model reconstructs high-quality mesh with PBR materials. Experimental results demonstrate that PBR3DGen significantly outperforms existing methods, achieving new state-of-the-art results for PBR estimation and mesh generation. More results and visualization can be found on our project page: https://pbr3dgen1218.github.io/.
中文: PBR3DGen提出了一种两阶段网格生成方法,通过结合多视角材质估计和三维重建技术,实现了高质量的物理渲染材质生成,性能显著优于现有方法。
English: PBR3DGen introduces a two-stage mesh generation method that integrates multi-view material estimation and 3D reconstruction to produce high-quality physically based rendering materials, significantly outperforming existing approaches.

Authors:M. Akın Yılmaz, Ahmet Bilican, A. Murat Tekalp
Title: FG-DFPN: Flow Guided Deformable Frame Prediction Network
Abstract:
Video frame prediction remains a fundamental challenge in computer vision with direct implications for autonomous systems, video compression, and media synthesis. We present FG-DFPN, a novel architecture that harnesses the synergy between optical flow estimation and deformable convolutions to model complex spatio-temporal dynamics. By guiding deformable sampling with motion cues, our approach addresses the limitations of fixed-kernel networks when handling diverse motion patterns. The multi-scale design enables FG-DFPN to simultaneously capture global scene transformations and local object movements with remarkable precision. Our experiments demonstrate that FG-DFPN achieves state-of-the-art performance on eight diverse MPEG test sequences, outperforming existing methods by 1 dB PSNR while maintaining competitive inference speeds. The integration of motion cues with adaptive geometric transformations makes FG-DFPN a promising solution for next-generation video processing systems that require high-fidelity temporal predictions. The model and instructions to reproduce our results will be released at: https://github.com/KUIS-AI-Tekalp-Research-Group/frame-prediction
中文: FG-DFPN提出了一种新颖的视频帧预测架构,通过将光流与可变形卷积相结合来有效建模复杂时空动态,在多个测试序列上实现了最优性能,同时保持了高效的推理速度。
English: FG-DFPN introduces a novel video frame prediction architecture that integrates optical flow with deformable convolutions to effectively model complex spatio-temporal dynamics, achieving state-of-the-art performance on multiple test sequences while maintaining efficient inference speeds.
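
The core idea above is to let estimated optical flow steer the sampling locations of a deformable convolution. The sketch below, built on torchvision's `deform_conv2d`, simply broadcasts a flow field to all nine kernel taps as their offset; the (dy, dx) channel ordering and the way flow is turned into offsets are assumptions, and FG-DFPN itself predicts richer, learned offsets at multiple scales.

```python
import torch
from torchvision.ops import deform_conv2d

def flow_guided_deform_conv(x: torch.Tensor, flow: torch.Tensor,
                            weight: torch.Tensor, bias: torch.Tensor = None):
    """Apply a 3x3 deformable convolution whose sampling offsets are the optical
    flow, repeated for all 9 kernel positions.
    x: (N, C, H, W) features, flow: (N, 2, H, W) assumed (dy, dx) displacements,
    weight: (C_out, C, 3, 3)."""
    offsets = flow.repeat(1, 9, 1, 1)   # (N, 2*3*3, H, W): same shift for every tap
    return deform_conv2d(x, offsets, weight, bias=bias, padding=1)

if __name__ == "__main__":
    x = torch.randn(1, 16, 64, 64)
    flow = torch.randn(1, 2, 64, 64)            # e.g. output of a flow estimator
    weight = torch.randn(32, 16, 3, 3) * 0.1
    y = flow_guided_deform_conv(x, flow, weight)
    print(y.shape)  # torch.Size([1, 32, 64, 64])
```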

Authors:Moein Sorkhei, Emir Konuk, Kevin Smith, Christos Matsoukas
Title: APLA: A Simple Adaptation Method for Vision Transformers
Abstract:
Existing adaptation techniques typically require architectural modifications or added parameters, leading to high computational costs and complexity. We introduce Attention Projection Layer Adaptation (APLA), a simple approach to adapt vision transformers (ViTs) without altering the architecture or adding parameters. Through a systematic analysis, we find that the layer immediately after the attention mechanism is crucial for adaptation. By updating only this projection layer, or even just a random subset of this layer's weights, APLA achieves state-of-the-art performance while reducing GPU memory usage by up to 52.63% and training time by up to 43.0%, with no extra cost at inference. Across 46 datasets covering a variety of tasks including scene classification, medical imaging, satellite imaging, and fine-grained classification, APLA consistently outperforms 17 other leading adaptation methods, including full fine-tuning, on classification, segmentation, and detection tasks. The code is available at https://github.com/MoeinSorkhei/APLA.
中文: APLA提出了一种新颖的视觉变换器适配方法,仅更新注意力机制后的投影层,无需改变架构或增加参数即可显著降低计算成本和内存使用,同时获得卓越性能。
English: APLA introduces a novel adaptation method for vision transformers by updating only the post-attention projection layer, achieving superior performance while significantly reducing computational costs and memory usage without architectural changes or added parameters.
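
A minimal sketch of the adaptation recipe described above, applied to torchvision's ViT-B/16: every parameter is frozen except the projection that immediately follows self-attention, assumed here to be the `out_proj` sub-module of each encoder block (plus the classification head). This is an illustrative reading of the paper's idea, not the released APLA code.

```python
import torch
from torchvision.models import vit_b_16

def prepare_apla_style(model: torch.nn.Module, head_name: str = "heads"):
    """Freeze everything except the attention output projection (and the head)."""
    for name, param in model.named_parameters():
        trainable = ("out_proj" in name) or name.startswith(head_name)
        param.requires_grad = trainable
    return [p for p in model.parameters() if p.requires_grad]

if __name__ == "__main__":
    model = vit_b_16(weights=None)                  # weights=None avoids a download
    trainable = prepare_apla_style(model)
    total = sum(p.numel() for p in model.parameters())
    kept = sum(p.numel() for p in trainable)
    print(f"training {kept / total:.1%} of parameters")
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```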

Authors:Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro
Title: MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Abstract:
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor to adjust token allocation according to speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.72% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%.
中文: 本研究提出了一种高效的多模态语音大模型框架,通过早期融合和自适应令牌分配策略,在LRS3数据集上实现了0.72%的词错误率,同时将令牌使用量减少86%并降低35.7%的计算量。
English: This study introduces an efficient multimodal speech LLM framework that reduces token usage by 86% and computational load by 35.7% while achieving state-of-the-art 0.72% WER on LRS3 through early fusion and adaptive token allocation strategies.

Authors:Jonas Utz, Stefan Vocht, Anne Tjorven Buessen, Dennis Possart, Fabian Wagner, Mareike Thies, Mingxuan Gu, Stefan Uderhardt, Katharina Breininger
Title: CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy
Abstract:
In recent years, numerous neural network architectures specifically designed for the instance segmentation of nuclei in microscopic images have been released. These models embed nuclei-specific priors to outperform generic architectures like U-Nets; however, they require large annotated datasets, which are often not available. Generative models (GANs, diffusion models) have been used to compensate for this by synthesizing training data. These two-stage approaches are computationally expensive, as first a generative model and then a segmentation model have to be trained. We propose CyclePose, a hybrid framework integrating synthetic data generation and segmentation training. CyclePose builds on a CycleGAN architecture, which allows unpaired translation between microscopy images and segmentation masks. We embed a segmentation model into CycleGAN and leverage a cycle consistency loss for self-supervision. Without annotated data, CyclePose outperforms other weakly or unsupervised methods on two public datasets. Code is available at https://github.com/jonasutz/CyclePose
中文摘要:近年来,针对显微图像中细胞核实例分割的神经网络需大量标注数据,而生成模型虽能合成数据但计算成本高;CyclePose作为集成合成数据生成与分割训练的混合框架,在无需标注数据的情况下优于其他弱监督或无监督方法。
English Summary: Recent neural networks for nuclei segmentation require large annotated datasets, but generative models can compensate by synthesizing data, though they are computationally expensive; CyclePose, a hybrid framework integrating synthetic data generation and segmentation training, outperforms other weakly or unsupervised methods without annotated data.
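
The self-supervision in CyclePose comes from a CycleGAN-style cycle: an image translated to a mask and back should reproduce the original, and vice versa. A generic cycle-consistency loss with L1 reconstruction is sketched below using placeholder generators (`img2mask`, `mask2img`); the embedded segmentation model and adversarial terms are omitted.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(images, masks, img2mask, mask2img,
                           lambda_img=10.0, lambda_mask=10.0):
    """CycleGAN-style self-supervision: image -> mask -> image and
    mask -> image -> mask should each reconstruct their starting point."""
    rec_images = mask2img(img2mask(images))
    rec_masks = img2mask(mask2img(masks))
    return (lambda_img * F.l1_loss(rec_images, images) +
            lambda_mask * F.l1_loss(rec_masks, masks))

if __name__ == "__main__":
    # Stand-in "generators": tiny conv layers, just to show the wiring.
    img2mask = torch.nn.Conv2d(1, 1, 3, padding=1)
    mask2img = torch.nn.Conv2d(1, 1, 3, padding=1)
    images = torch.rand(2, 1, 64, 64)
    masks = (torch.rand(2, 1, 64, 64) > 0.5).float()
    loss = cycle_consistency_loss(images, masks, img2mask, mask2img)
    loss.backward()
    print(float(loss))
```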

Authors:Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong, Jiaxin He, Jianchang Wu, Jianlong Yuan, Jie Wu, Jiashuai Liu, Junjing Guo, Kaijun Tan, Liangyu Chen, Qiaohui Chen, Ran Sun, Shanshan Yuan, Shengming Yin, Sitong Liu, Wei Chen, Yaqi Dai, Yuchu Luo, Zheng Ge, Zhisheng Guan, Xiaoniu Song, Yu Zhou, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Yi Xiu, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
Title: Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Abstract:
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
中文摘要:Step-Video-TI2V作为先进的300亿参数模型,能够根据文本和图像输入生成视频,在新基准测试中相比现有方案展现了最优性能。
English Summary: Step-Video-TI2V is a state-of-the-art 30B-parameter model that generates videos from text and image inputs, achieving top performance on a new benchmark compared to existing solutions.

Authors:Ziwei Shi, Xiaoran Zhang, Yan Xia, Yu Zang, Siqi Shen, Cheng Wang
Title: L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery
Abstract:
We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct XA-L&RSI dataset, which encompasses approximately $110,000$ remote sensing submaps and $13,000$ LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on XA-L&RSI demonstrate that, within a $100km^2$ retrieval range, L2RSI accurately localizes $83.27\%$ of point cloud submaps within a $30m$ radius for top-$1$ retrieved location. We provide a video to more vividly display the place recognition results of L2RSI at https://shizw695.github.io/L2RSI/.

Authors:Du Chen, Tianhe Wu, Kede Ma, Lei Zhang
Title: Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption
Abstract:
Full-reference image quality assessment (FR-IQA) generally assumes that reference images are of perfect quality. However, this assumption is flawed due to the sensor and optical limitations of modern imaging systems. Moreover, recent generative enhancement methods are capable of producing images of higher quality than their original. All of these challenge the effectiveness and applicability of current FR-IQA models. To relax the assumption of perfect reference image quality, we build a large-scale IQA database, namely DiffIQA, containing approximately 180,000 images generated by a diffusion-based image enhancer with adjustable hyper-parameters. Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely Adaptive Fidelity-Naturalness Evaluator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of a test image. A-FINE aligns well with standard FR-IQA when the reference image is much more natural than the test image. We demonstrate by extensive experiments that A-FINE surpasses standard FR-IQA models on well-established IQA datasets and our newly created DiffIQA. To further validate A-FINE, we additionally construct a super-resolution IQA benchmark (SRIQA-Bench), encompassing test images derived from ten state-of-the-art SR methods with reliable human quality annotations. Tests on SRIQA-Bench re-affirm the advantages of A-FINE. The code and dataset are available at https://tianhewu.github.io/A-FINE-page.github.io/.

Authors:Yansheng Li, Yuning Wu, Gong Cheng, Chao Tao, Bo Dang, Yu Wang, Jiahao Zhang, Chuge Zhang, Yiting Liu, Xu Tang, Jiayi Ma, Yongjun Zhang
Title: MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery
Abstract:
Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples. This approach fails to adequately support the fixed-resolution image interpretation requirements in real-world scenarios. To address this limitation, we introduce the Million-scale finE-grained geospatial scEne classification dataseT (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-in-scene layout, where the central scene serves as the reference, and auxiliary scenes provide crucial spatial context for fine-grained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the Context-Aware Transformer (CAT), a model specifically designed for this task, which adaptively fuses spatial context to accurately classify the scene samples by learning attentional features that capture the relationships between the center and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving a 1.88% higher balanced accuracy (BA) with the Swin-Large backbone, and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show the practical applicability of CAT in urban functional zone mapping. The source code and dataset will be publicly available at https://jerrywyn.github.io/project/MEET.html.
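
A minimal sketch of the scene-in-scene fusion idea: the center-scene embedding queries the auxiliary-scene embeddings through cross-attention, and the fused feature feeds a classifier over the 80 categories. Module names and dimensions are illustrative assumptions; the actual Context-Aware Transformer is more elaborate.

```python
import torch
import torch.nn as nn

class SceneInSceneFusion(nn.Module):
    """Center scene (1 token) attends over auxiliary context scenes (K tokens)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 80):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, center_feat, aux_feats):
        # center_feat: (B, D) reference scene; aux_feats: (B, K, D) neighbours.
        query = center_feat.unsqueeze(1)                  # (B, 1, D)
        context, _ = self.cross_attn(query, aux_feats, aux_feats)
        fused = self.norm(query + context).squeeze(1)     # residual fusion
        return self.classifier(fused)

if __name__ == "__main__":
    model = SceneInSceneFusion()
    center = torch.randn(2, 256)
    aux = torch.randn(2, 8, 256)     # 8 surrounding scenes as spatial context
    print(model(center, aux).shape)  # torch.Size([2, 80])
```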

Authors:Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding
Title: FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Abstract:
Video Large Language Models have demonstrated strong video understanding capabilities, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to fully exploit the spatiotemporal redundancy inherent in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging these insights, we propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision and LLaVA-Video. Notably, on LLaVA-OneVision-7B, FastVID effectively prunes $\textbf{90.3%}$ of video tokens, reduces FLOPs to $\textbf{8.3%}$, and accelerates the prefilling stage by $\textbf{7.1}\times$, while maintaining $\textbf{98.0%}$ of the original accuracy. The code is available at https://github.com/LunarShen/FastVID.
中文摘要:FastVID提出动态密度剪枝方法,通过保留视频时空结构的关键信息,在减少90.3%视频令牌的同时保持98%原始准确率,大幅提升视频大语言模型的推理效率。
English Summary: FastVID introduces a dynamic density pruning method that significantly reduces video token redundancy, cutting computational costs by over 90% while maintaining near-original accuracy in Video LLMs.
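The density-based pruning idea can be illustrated with a short sketch that drops the most mutually redundant tokens inside one temporal segment; the similarity-based density score and the keep ratio below are assumptions, not FastVID's exact pruning rule.

```python
import torch
import torch.nn.functional as F

def prune_segment_tokens(tokens: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Drop redundant visual tokens within one temporal segment (illustrative only).

    tokens: (N, D) visual tokens from the frames of a segment. Tokens that are
    highly similar to many others are treated as redundant and removed first.
    """
    feats = F.normalize(tokens, dim=-1)
    density = (feats @ feats.t()).mean(dim=-1)      # average similarity to the rest of the segment
    keep = max(1, int(keep_ratio * tokens.shape[0]))
    idx = density.argsort()[:keep]                  # keep the least redundant tokens
    return tokens[idx]

pruned = prune_segment_tokens(torch.randn(8 * 196, 1024), keep_ratio=0.097)  # ~90% of tokens removed
```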

Authors:Leideng Shi, Juan Zhang
Title: Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation
Abstract:
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation, which aims to segment objects based on a given text description and is of great significance in practical applications. Previous studies fuse visual and linguistic modalities through explicit feature interaction, which fails to effectively excavate useful multimodal information from the dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) to enhance multi-scale visual features by adaptively introducing noise in the transformer and integrating cross-modal aware features. In addition, MAFN employs multi-scale refinement convolution (MSRC) to adapt to the various orientations of objects at different scales, boosting their representation ability and enhancing segmentation accuracy. Extensive experiments show that MAFN is significantly more effective than the state of the art on the RRSIS-D dataset. The source code is available at https://github.com/Roaxy/MAFN.
中文摘要:本文提出了一种多模态感知融合网络(MAFN),通过自适应噪声注入和多尺度优化有效对齐并融合视觉与语言特征,显著提升了遥感图像参照分割的精度,在基准数据集上实现了最优性能。
English Summary: This paper introduces a Multimodal-Aware Fusion Network (MAFN) that enhances remote sensing image segmentation by effectively aligning and fusing visual and linguistic features through adaptive noise injection and multi-scale refinement, achieving state-of-the-art performance on benchmark datasets.

Authors:Haonan Wang, Qixiang Zhang, Lehan Wang, Xuanqi Huang, Xiaomeng Li
Title: Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction
Abstract:
Decoding visual stimuli from neural activity is essential for understanding the human brain. While fMRI methods have successfully reconstructed static images, fMRI-to-video reconstruction faces challenges due to the need for capturing spatiotemporal dynamics like motion and scene transitions. Recent approaches have improved semantic and perceptual alignment but struggle to integrate coarse fMRI data with detailed visual features. Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub-tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. This approach simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. In the inference stage, NEURONS generates robust conditioning signals for a pre-trained text-to-video diffusion model to reconstruct the videos. Extensive experiments demonstrate that NEURONS outperforms state-of-the-art baselines, achieving solid improvements in video consistency (26.6%) and semantic-level accuracy (19.1%). Notably, NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain-computer interfaces and clinical applications. Code and model weights are available at: https://github.com/xmed-lab/NEURONS.
中文摘要:NEURONS框架受视觉皮层启发,将fMRI到视频的重建解耦为四个层次化子任务,通过与扩散模型结合显著提升视频连贯性26.6%和语义准确率19.1%。
English Summary: The NEURONS framework decouples fMRI-to-video reconstruction into four hierarchical sub-tasks inspired by the visual cortex, significantly improving video consistency by 26.6% and semantic accuracy by 19.1% through integration with a diffusion model.

Authors:Rachel S. Y. Teo, Tan M. Nguyen
Title: MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling
Abstract:
Large-scale pre-training of deep models, followed by fine-tuning them, has become the cornerstone of natural language processing (NLP). The prevalence of data coupled with computational resources has led to large models with a considerable number of parameters. While the massive size of these models has led to remarkable success in many NLP tasks, a detriment is the expense required to retrain all the base model's parameters for the adaptation to each task or domain. Parameter Efficient Fine-Tuning (PEFT) provides an effective solution for this challenge by minimizing the number of parameters required to be fine-tuned while maintaining the quality of the model. While existing methods have achieved impressive results, they mainly focus on adapting a subset of parameters, weight reparameterization, and prompt engineering. In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction. We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model. It performs a conditional computation of a mixture of layers during fine-tuning to provide the model with more structural knowledge about the data. By providing an avenue for information exchange between layers, MoLEx enables the model to make a more well-informed prediction for the downstream task, leading to better fine-tuning results with the same number of effective parameters. As experts can be processed in parallel, MoLEx introduces minimal additional computational overhead. We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark as well as the End-to-End Challenge (E2E). The code is publicly available at https://github.com/rachtsy/molex.
中文: 大规模预训练模型在任务适应时面临高昂的重训练成本,而提出的层专家混合方法通过将模型层作为专家组合,以最小计算开销增强结构知识交换,实现了高效微调。
English: Large-scale pre-trained models face high retraining costs for task adaptation, but the proposed Mixture of Layer Experts (MoLEx) method enables efficient fine-tuning by combining layers as experts to enhance structural knowledge exchange with minimal computational overhead.
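A rough sketch of the "layers as experts" idea follows: a router picks one other pre-trained layer as an auxiliary expert and mixes its output with the current layer's output. The top-1 routing, the equal 0.5/0.5 mixing, and the per-batch expert choice are simplifying assumptions and not MoLEx's actual gating.

```python
import torch
import torch.nn as nn

class MoLExBlock(nn.Module):
    """Toy mixture-of-layer-experts step (illustrative, not the MoLEx implementation)."""
    def __init__(self, layers: nn.ModuleList, dim: int):
        super().__init__()
        self.layers = layers                       # pre-trained layers act as the expert pool
        self.gate = nn.Linear(dim, len(layers))    # router over layer experts

    def forward(self, x, layer_idx):
        own = self.layers[layer_idx](x)            # the layer the input would normally pass through
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)   # (B, num_layers)
        expert_idx = int(weights.argmax(dim=-1)[0])                 # simplistic per-batch top-1 choice
        aux = self.layers[expert_idx](x)           # auxiliary layer expert, computable in parallel
        return 0.5 * (own + aux)                   # mix; MoLEx's actual mixing coefficients differ

experts = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])
y = MoLExBlock(experts, dim=64)(torch.randn(2, 10, 64), layer_idx=1)
```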

Authors:Zichen Tang, Yuan Yao, Miaomiao Cui, Liefeng Bo, Hongyu Yang
Title: GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior
Abstract:
Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it iteratively produces detail-enhanced versions of the multi-view images from stage 1, ensuring 3D texture consistency across views via mutual attention and distance-guided attention fusion. A polished version of the 3D human can then be achieved by directly performing reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results. Our code is available at: https://github.com/silence-tang/GaussianIP.
中文: GaussianIP提出了一种两阶段框架,通过自适应人体蒸馏采样和视角一致优化,从文本和图像高效生成保留身份特征的逼真3D人体,显著提升了训练速度和面部服装细节表现。
English: GaussianIP introduces a two-stage framework using Adaptive Human Distillation Sampling and View-Consistent Refinement to efficiently generate detailed, identity-preserving 3D humans from text and images, significantly improving training speed and visual quality.

Authors:Hao Liu, Pengyu Guo, Siyuan Yang, Zeqing Jiang, Qinglei Hu, Dongyu Li
Title: SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets
Abstract:
With the continuous advancement of human exploration into deep space, intelligent perception and high-precision segmentation technology for on-orbit multi-spacecraft targets have become critical factors for ensuring the success of modern space missions. However, the complex deep space environment, diverse imaging conditions, and high variability in spacecraft morphology pose significant challenges to traditional segmentation methods. This paper proposes SpaceSeg, an innovative vision foundation model-based segmentation framework with four core technical innovations: First, the Multi-Scale Hierarchical Attention Refinement Decoder (MSHARD) achieves high-precision feature decoding through cross-resolution feature fusion via hierarchical attention. Second, the Multi-spacecraft Connected Component Analysis (MS-CCA) effectively resolves topological structure confusion in dense targets. Third, the Spatial Domain Adaptation Transform framework (SDAT) eliminates cross-domain disparities and resists spatial sensor perturbations through composite enhancement strategies. Finally, a custom Multi-Spacecraft Segmentation Task Loss Function is created to significantly improve segmentation robustness in deep space scenarios. To support algorithm validation, we construct the first multi-scale on-orbit multi-spacecraft semantic segmentation dataset SpaceES, which covers four types of spatial backgrounds and 17 typical spacecraft targets. In testing, SpaceSeg achieves state-of-the-art performance with 89.87$\%$ mIoU and 99.98$\%$ mAcc, surpassing existing best methods by 5.71 percentage points. The dataset and code are open-sourced at https://github.com/Akibaru/SpaceSeg to provide critical technical support for next-generation space situational awareness systems.
中文: 本文提出基于视觉基础模型的SpaceSeg分割框架,通过四项核心技术突破实现了多航天器分割的最优性能,并开源了SpaceES数据集和代码为空间态势感知提供技术支持。
English: This paper introduces SpaceSeg, a vision foundation model-based framework with four key innovations that achieves state-of-the-art performance in multi-spacecraft segmentation, supported by the open-sourced SpaceES dataset and code.

Authors:Hongbin Lin, Zilu Guo, Yifan Zhang, Shuaicheng Niu, Yafeng Li, Ruimao Zhang, Shuguang Cui, Zhen Li
Title: DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation
Abstract:
In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations, consisting of 2 stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection. The code is available at https://github.com/Hongbin98/DriveGEN.
中文: DriveGEN通过无需训练的可控文本到图像扩散方法,在多样化分布外场景中保持精确三维几何特征,有效提升了自动驾驶三维检测的鲁棒性。
English: DriveGEN enhances autonomous driving 3D detection robustness through training-free controllable text-to-image diffusion that preserves precise 3D geometry across diverse out-of-distribution scenarios.

Authors:Wenbang Deng, Xieyuanli Chen, Qinghua Yu, Yunze He, Junhao Xiao, Huimin Lu
Title: A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data
Abstract:
Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects, overlooking the identification of unknown classes, which is common in real-world applications. In this paper, we propose a feature-oriented framework for open-set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual-decoder network to simultaneously perform closed-set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi-objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of closed-set semantic segmentation and anomaly detection, we achieve effective feature-driven LiDAR open-set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods. The source code will be made publicly available at https://github.com/nubot-nudt/DOSS.
中文: 本文提出了一种面向特征的激光雷达开放集语义分割框架,通过双解码器网络和异常检测机制,在识别已知类别的同时有效检测未知物体,在多个数据集上验证了其优越性能。
English: This paper introduces a feature-oriented framework for LiDAR open-set semantic segmentation, which identifies unknown objects while maintaining known class recognition through a dual-decoder network and anomaly detection, demonstrating superior performance on benchmark datasets.

Authors:Weichen Zhang, Zile Zhou, Zhiheng Zheng, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, Xiao-Ping Zhang
Title: Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Abstract:
Spatial reasoning is a fundamental capability of embodied agents and has garnered widespread attention in the field of multimodal large language models (MLLMs). In this work, we propose a novel benchmark, Open3DVQA, to comprehensively evaluate the spatial reasoning capacities of current state-of-the-art (SOTA) foundation models in open 3D space. Open3DVQA consists of 9k VQA samples, collected using an efficient semi-automated tool in a high-fidelity urban simulator. We evaluate several SOTA MLLMs across various aspects of spatial reasoning, such as relative and absolute spatial relationships, situational reasoning, and object-centric spatial attributes. Our results reveal that: 1) MLLMs perform better at answering questions regarding relative spatial relationships than absolute spatial relationships, 2) MLLMs demonstrate similar spatial reasoning abilities for both egocentric and allocentric perspectives, and 3) Fine-tuning large models significantly improves their performance across different spatial reasoning tasks. We believe that our open-source data collection tools and in-depth analyses will inspire further research on MLLM spatial reasoning capabilities. The benchmark is available at https://github.com/WeichenZh/Open3DVQA.
中文:本文提出Open3DVQA基准,包含9千个视觉问答样本,用于评估多模态大语言模型在开放3D空间中的空间推理能力,发现模型在相对空间关系方面表现更佳且微调能显著提升性能。
English: This paper introduces Open3DVQA, a benchmark with 9k VQA samples to evaluate multimodal large language models' spatial reasoning in open 3D environments, revealing their strengths in relative spatial relationships and the benefits of fine-tuning.

Authors:Pingrui Zhang, Xianqiang Gao, Yuhan Wu, Kehui Liu, Dong Wang, Zhigang Wang, Bin Zhao, Yan Ding, Xuelong Li
Title: MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
Abstract:
In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce MoMa-Kitchen, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, NavAff, for navigation affordance grounding that demonstrates promising performance on the MoMa-Kitchen benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in embodied AI. Project page: \href{https://momakitchen.github.io/}{https://momakitchen.github.io/}.
Chinese: MoMa-Kitchen基准数据集通过提供超过10万个样本,训练模型学习最优导航位置以实现与操作的顺畅衔接,采用自动化功能标签生成和NavAff基线模型,解决了导航与操作之间的脱节问题。
English: The MoMa-Kitchen benchmark dataset addresses the disconnect between navigation and manipulation by providing over 100,000 samples to train models for optimal positioning, enabling seamless transitions to object interaction through automated affordance labeling and a baseline model called NavAff.

Authors:Hongyang Wei, Shuaizheng Liu, Chun Yuan, Lei Zhang
Title: Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models
Abstract:
By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt a pre-trained autoregressive multimodal model, such as Lumina-mGPT, into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present an entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at https://github.com/nonwhy/PURE.
中文摘要:PURE模型通过指令调优改造Lumina-mGPT多模态框架,创新性地感知图像退化程度、理解语义内容,并采用基于熵的采样策略自回归生成高质量图像,在复杂场景中实现了保真度与细节还原的突破。
English Summary: The proposed PURE model adapts the Lumina-mGPT autoregressive multimodal framework to enhance real-world image super-resolution by perceiving degradation levels, understanding image semantics, and restoring high-quality details through entropy-optimized token generation.
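The entropy-based Top-k idea lends itself to a compact sketch: the next-token distribution's entropy controls how many candidates are kept before sampling. The k range and the linear entropy-to-k mapping below are assumptions, not the strategy used in PURE.

```python
import torch

def entropy_topk_sample(logits: torch.Tensor, k_low: int = 5, k_high: int = 50) -> torch.Tensor:
    """Entropy-adaptive top-k sampling over next-token logits (names and thresholds illustrative).

    High-entropy distributions (texture-like regions) get a larger k; low-entropy,
    highly structured predictions get a smaller k.
    """
    probs = torch.softmax(logits, dim=-1)                       # (B, V)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)    # (B,)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    k = (k_low + (k_high - k_low) * (entropy / max_entropy)).round().long()

    samples = []
    for b in range(logits.shape[0]):
        topv, topi = probs[b].topk(int(k[b]))
        idx = topi[torch.multinomial(topv / topv.sum(), 1)]     # sample within the adaptive top-k
        samples.append(idx)
    return torch.cat(samples)

tokens = entropy_topk_sample(torch.randn(2, 1024))
```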

Authors:Kelu Yao, Nuo Xu, Rong Yang, Yingying Xu, Zhuoyan Gao, Titinunt Kitrungrotsakul, Yi Ren, Pu Zhang, Jin Wang, Ning Wei, Chao Li
Title: Falcon: A Remote Sensing Vision-Language Foundation Model
Abstract:
This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, e.g., image classification, object detection, segmentation, and image captioning. To facilitate Falcon's training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments are conducted, which verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at https://github.com/TianHuiLab/Falcon, hoping to help further develop the open-source community.
中文: 本文提出了Falcon,一种面向遥感领域的统一视觉语言基础模型,通过简单自然语言指令在14项任务中表现卓越,并基于包含约7800万样本的大规模数据集进行训练。
English: This paper introduces Falcon, a unified vision-language foundation model for remote sensing that achieves impressive performance across 14 tasks using simple natural language instructions, supported by a large-scale dataset of approximately 78 million samples.

Authors:Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu
Title: Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
Abstract:
Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo .
Chinese: FlowMo提出了一种基于Transformer的扩散自编码器,通过创新的两阶段训练方法,在多种压缩率下实现了图像标记化的新突破,无需依赖卷积、对抗损失等传统组件。
English: FlowMo introduces a transformer-based diffusion autoencoder that sets a new benchmark for image tokenization across multiple compression rates, utilizing a novel two-stage training approach without traditional components like convolutions or adversarial losses.

Authors:Worameth Chinchuthakun, Tossaporn Saengja, Nontawat Tritrong, Pitchaporn Rewatbowornwong, Pramook Khungurn, Supasorn Suwajanakorn
Title: LUSD: Localized Update Score Distillation for Text-Guided Image Editing
Abstract:
While diffusion models show promising results in image editing given a target prompt, achieving both prompt fidelity and background preservation remains difficult. Recent works have introduced score distillation techniques that leverage the rich generative prior of text-to-image diffusion models to solve this task without additional fine-tuning. However, these methods often struggle with tasks such as object insertion. Our investigation of these failures reveals significant variations in gradient magnitude and spatial distribution, making hyperparameter tuning highly input-specific or unsuccessful. To address this, we propose two simple yet effective modifications: attention-based spatial regularization and gradient filtering-normalization, both aimed at reducing these variations during gradient updates. Experimental results show our method outperforms state-of-the-art score distillation techniques in prompt fidelity, improving successful edits while preserving the background. Users also preferred our method over state-of-the-art techniques across three metrics, and by 58-64% overall.
中文摘要:本研究通过引入基于注意力的空间正则化和梯度过滤归一化两项改进,有效解决了分数蒸馏技术中的梯度变化问题,在图像编辑任务中实现了比现有最优方法更优的提示匹配度和背景保持效果。
English Summary: The proposed method introduces attention-based spatial regularization and gradient filtering-normalization to address gradient variations in score distillation techniques, achieving superior prompt fidelity and background preservation in image editing compared to state-of-the-art approaches.
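The two stabilisation ideas (attention-based spatial regularization and gradient filtering-normalization) can be sketched as a post-processing step on the score-distillation gradient; the exact masking and normalisation used by LUSD differ, and the tensor shapes below are assumptions.

```python
import torch

def filter_and_normalize_grad(grad: torch.Tensor, attn_mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Restrict the update to attention-highlighted regions, then rescale its magnitude.

    grad: (B, C, H, W) score-distillation gradient on the edited latent/image.
    attn_mask: (B, 1, H, W) soft spatial mask derived from cross-attention maps.
    """
    g = grad * attn_mask                                          # attention-based spatial regularization
    norm = g.flatten(1).norm(dim=1).clamp_min(eps).view(-1, 1, 1, 1)
    return g / norm                                               # comparable update magnitude across inputs

g = filter_and_normalize_grad(torch.randn(2, 4, 64, 64), torch.rand(2, 1, 64, 64))
```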

Authors:Mingjie Wei, Xuemei Xie, Guangming Shi
Title: ACMo: Attribute Controllable Motion Generation
Abstract:
Attributes such as style, fine-grained text, and trajectory are specific conditions for describing motion. However, existing methods often lack precise user control over motion attributes and suffer from limited generalizability to unseen motions. This work introduces an Attribute Controllable Motion generation architecture to address these challenges by decoupling arbitrary conditions and controlling them separately. First, we explore the Attribute Diffusion Model to improve text-to-motion performance by decoupling text and motion learning, as the controllable model relies heavily on the pre-trained model. Then, we introduce a Motion Adapter to quickly fine-tune previously unseen motion patterns. Its motion prompt inputs achieve multimodal text-to-motion generation that captures user-specified styles. Finally, we propose an LLM Planner to bridge the gap between unseen attributes and dataset-specific texts via local knowledge for user-friendly interaction. Our approach introduces the capability of motion prompts for stylized generation, enabling fine-grained and user-friendly attribute control while providing performance comparable to state-of-the-art methods. Project page: https://mjwei3d.github.io/ACMo/

Authors:Lilin Zhang, Chengpei Wu, Ning Yang
Title: Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data
Abstract:
Existing adversarial training (AT) methods often suffer from incomplete perturbation, meaning that not all non-robust features are perturbed when generating adversarial examples (AEs). This results in residual correlations between non-robust features and labels, leading to suboptimal learning of robust features. However, achieving complete perturbation, i.e., perturbing as many non-robust features as possible, is challenging due to the difficulty in distinguishing robust and non-robust features and the sparsity of labeled data. To address these challenges, we propose a novel approach called Weakly Supervised Contrastive Adversarial Training (WSCAT). WSCAT ensures complete perturbation for improved learning of robust features by disrupting correlations between non-robust features and labels through complete AE generation over partially labeled data, grounded in information theory. Extensive theoretical analysis and comprehensive experiments on widely adopted benchmarks validate the superiority of WSCAT. Our code is available at https://github.com/zhang-lilin/WSCAT.
中文: 现有对抗训练方法常因无法扰动所有非鲁棒特征而导致鲁棒性不足,而提出的弱监督对比对抗训练(WSCAT)通过弱监督对比学习确保完全扰动,从而提升鲁棒特征学习效果。
English: Existing adversarial training methods often fail to perturb all non-robust features, leading to suboptimal robustness, but the proposed WSCAT approach ensures complete perturbation through weakly supervised contrastive learning to enhance robust feature learning.
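For context, adversarial examples in AT pipelines are commonly generated with PGD, as in the standard sketch below; WSCAT's "complete perturbation" objective over partially labeled data builds on such AEs but adds a weakly supervised contrastive term that is not shown here.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard PGD adversarial-example generation (a building block, not WSCAT itself)."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)          # random start inside the eps-ball
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()                      # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)                 # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_adv = pgd_attack(model, torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,)))
```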

Authors:Ming Deng, Sijin Sun, Zihao Li, Xiaochuan Hu, Xing Wu
Title: FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection
Abstract:
Camouflaged Object Detection (COD) is challenging due to the strong similarity between camouflaged objects and their surroundings, which complicates identification. Existing methods mainly rely on spatial local features, failing to capture global information, while Transformers increase computational costs. To address this, the Frequency-Assisted Mamba-Like Linear Attention Network (FMNet) is proposed, which leverages frequency-domain learning to efficiently capture global features and mitigate ambiguity between objects and the background. FMNet introduces the Multi-Scale Frequency-Assisted Mamba-Like Linear Attention (MFM) module, integrating frequency and spatial features through a multi-scale structure to handle scale variations while reducing computational complexity. Additionally, the Pyramidal Frequency Attention Extraction (PFAE) module and the Frequency Reverse Decoder (FRD) enhance semantics and reconstruct features. Experimental results demonstrate that FMNet outperforms existing methods on multiple COD datasets, showcasing its advantages in both performance and efficiency. Code available at https://github.com/Chranos/FMNet.
中文: 提出的FMNet通过融合频域学习和多尺度空间特征,有效解决了伪装物体检测的难题,在多个数据集上实现了优越的性能和效率。
English: The proposed FMNet effectively addresses camouflaged object detection challenges by integrating frequency-domain learning with multi-scale spatial features, achieving superior performance and efficiency across multiple datasets.
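A minimal example of frequency-domain feature learning is a learnable filter applied to the 2D spectrum of a feature map, as sketched below; this illustrates the general idea only and is not the MFM, PFAE, or FRD module from FMNet.

```python
import torch
import torch.nn as nn

class FrequencyGate(nn.Module):
    """Learnable complex-valued filter in the frequency domain (illustrative only)."""
    def __init__(self, channels: int, h: int, w: int):
        super().__init__()
        # One complex weight per (channel, frequency) location of the rFFT spectrum.
        self.weight = nn.Parameter(torch.ones(channels, h, w // 2 + 1, 2))

    def forward(self, x):                                          # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")                    # global receptive field in one step
        filt = torch.view_as_complex(self.weight)
        out = torch.fft.irfft2(spec * filt, s=x.shape[-2:], norm="ortho")
        return out + x                                             # residual connection

y = FrequencyGate(32, 64, 64)(torch.randn(1, 32, 64, 64))
```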

Authors:Sungwoo Cho, Jeongsoo Choi, Sungnyun Kim, Se-Young Yun
Title: MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
Abstract:
Despite recent advances in text-to-speech (TTS) models, audio-visual-to-audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multimodal guidance with CFM, our model robustly preserves speaker-specific characteristics and enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements in LSE and FID score. Our code is available at https://github.com/Peter-SungwooCho/MAVFlow.
中文摘要:本研究提出一种条件流匹配零样本视听渲染器,通过双模态引导保持说话人特征一致性,有效提升跨语言视听翻译性能。
English Summary: This study introduces a conditional flow matching zero-shot audio-visual renderer that uses dual audio-visual guidance to maintain speaker consistency and enhance translation capabilities across languages.

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: RONA: Pragmatically Diverse Image Captioning with Coherence Relations
Abstract:
Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance caption diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. We propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLM) that leverages Coherence Relations as a controllable axis for pragmatic variations. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment, compared to MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA
中文: 传统写作助手通过句法和语义变化生成多样的图像描述,而人类描述则注重利用语用线索传达核心信息与视觉细节;为此提出的RONA策略,通过连贯关系作为可控轴,使多模态大语言模型生成的描述在多样性和真实性上优于基线模型。
English: Traditional writing assistants create diverse image captions through syntactic and semantic variations, but human captions emphasize conveying a central message with visual details using pragmatic cues, leading to the development of RONA, a prompting strategy for MLLMs that uses coherence relations to improve caption diversity and alignment with ground truth.

Authors:Yanjie Xu, Handing Xu, Tianmu Wang, Yaguan Li, Yunzhi Chen, Zhenguo Nie
Title: Rethinking Rotation-Invariant Recognition of Fine-grained Shapes from the Perspective of Contour Points
Abstract:
Rotation-invariant recognition of shapes is a common challenge in computer vision. Recent approaches have significantly improved the accuracy of rotation-invariant recognition by encoding the rotational invariance of shapes as hand-crafted image features and introducing deep neural networks. However, pixel-based methods carry substantial redundant information, and the critical geometric information is prone to early leakage, resulting in weak rotation-invariant recognition of fine-grained shapes. In this paper, we reconsider the shape recognition problem from the perspective of contour points rather than pixels. We propose an anti-noise rotation-invariant convolution module based on contour geometric awareness for fine-grained shape recognition. The module divides the shape contour into multiple local geometric regions (LGAs), where we implement finer-grained rotation-invariant coding in terms of point topological relations. We provide a deep network composed of five such cascaded modules for classification and retrieval experiments. The results show that our method exhibits excellent performance in rotation-invariant recognition of fine-grained shapes. In addition, we demonstrate that our method is robust to contour noise and to the choice of rotation center. The source code is available at https://github.com/zhenguonie/ANRICN_CGA.
Chinese: 本文提出了一种基于轮廓的抗噪声旋转不变卷积模块,通过局部几何区域中的点拓扑关系实现更精细的旋转不变编码,在细粒度形状识别中展现出卓越性能,并对轮廓噪声和旋转中心具有强鲁棒性。
English: This paper introduces a contour-based anti-noise rotation-invariant convolution module that enhances fine-grained shape recognition by encoding topological relations in local geometric regions, demonstrating superior performance and robustness to noise and rotation centers.
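The benefit of working with contour points is that descriptors built from relative distances within a local region are unchanged by rotation, as the toy sketch below shows; the neighbourhood size and the distance-only descriptor are assumptions and are far simpler than the paper's LGA-based coding.

```python
import torch

def local_rotation_invariant_features(points: torch.Tensor, region: int = 8) -> torch.Tensor:
    """Toy rotation-invariant descriptor per contour point (illustrative, not ANRICN-CGA).

    points: (N, 2) ordered contour points. For each point we take a local region of
    neighbours along the contour and keep only pairwise distances, which are unchanged
    by rotating the shape or by shifting the rotation centre.
    """
    n = points.shape[0]
    feats = []
    for i in range(n):
        idx = [(i + j) % n for j in range(region)]
        local = points[idx] - points[i]          # translate so the current point is the origin
        feats.append(local.norm(dim=-1))         # rotation-invariant magnitudes
    return torch.stack(feats)                    # (N, region)

desc = local_rotation_invariant_features(torch.randn(100, 2))
```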

Authors:Zhicheng Feng, Xieyuanli Chen, Chenghao Shi, Lun Luo, Zhichao Chen, Yun-Hui Liu, Huimin Lu
Title: Image-Goal Navigation Using Refined Feature Guidance and Scene Graph Enhancement
Abstract:
In this paper, we introduce a novel image-goal navigation approach, named RFSG. Our focus lies in leveraging the fine-grained connections between goals, observations, and the environment within limited image data, all the while keeping the navigation architecture simple and lightweight. To this end, we propose the spatial-channel attention mechanism, enabling the network to learn the importance of multi-dimensional features to fuse the goal and observation features. In addition, a self-distillation mechanism is incorporated to further enhance the feature representation capabilities. Given that the navigation task needs surrounding environmental information for more efficient navigation, we propose an image scene graph to establish feature associations at both the image and object levels, effectively encoding the surrounding scene information. Cross-scene performance validation was conducted on the Gibson and HM3D datasets, and the proposed method achieved state-of-the-art results among mainstream methods, with a speed of up to 53.5 frames per second on an RTX3080. This contributes to the realization of end-to-end image-goal navigation in real-world scenarios. The implementation and model of our method have been released at: https://github.com/nubot-nudt/RFSG.
中文: 本文提出了一种名为RFSG的新型图像目标导航方法,通过空间-通道注意力机制和自蒸馏机制增强特征学习能力,并利用图像场景图进行环境信息编码,在Gibson和HM3D数据集上取得了最优性能,同时保持了实时运行效率。
English: This paper presents RFSG, a novel image-goal navigation method that employs spatial-channel attention and self-distillation mechanisms to enhance feature learning while using image scene graphs for environmental encoding, achieving state-of-the-art performance on Gibson and HM3D datasets with real-time efficiency.

Authors:Zhuoyan Xu, Khoi Duc Nguyen, Preeti Mukherjee, Saurabh Bagchi, Somali Chaterji, Yingyu Liang, Yin Li
Title: Learning to Inference Adaptively for Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in visual reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent effort on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.
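The adaptive-inference idea of skipping work under a latency budget can be sketched with a per-block scheduler that decides, from the input and the remaining budget, whether to execute an expensive sub-module; the scheduler design and hard threshold below are assumptions, and end-to-end training would need a differentiable relaxation (e.g., Gumbel-Softmax), which AdaLLaVA's learned scheduler handles differently.

```python
import torch
import torch.nn as nn

class BudgetedBlock(nn.Module):
    """Toy latency-aware block: execute or skip the expensive op based on input and budget."""
    def __init__(self, dim: int):
        super().__init__()
        self.op = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.scheduler = nn.Linear(dim + 1, 1)     # sees pooled features plus the remaining budget

    def forward(self, x, budget: float):
        ctx = torch.cat([x.mean(dim=1), torch.full((x.shape[0], 1), budget)], dim=-1)
        keep = torch.sigmoid(self.scheduler(ctx)) > 0.5           # (B, 1) execute-or-skip decision
        return torch.where(keep.unsqueeze(1), x + self.op(x), x)  # skipped samples pass through unchanged

y = BudgetedBlock(64)(torch.randn(2, 10, 64), budget=0.3)
```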

Authors:Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, Nima Khademi Kalantari
Title: RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors
Abstract:
In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.
Chinese: 本文提出RI3D方法,通过将视图合成拆分为可见区域修复和缺失区域生成两个任务,并采用两个定制化扩散模型分别处理,结合两阶段优化策略,实现了从稀疏图像生成高质量新视图的效果。
English: This paper introduces RI3D, a 3DGS-based method that uses two specialized diffusion models—one for repairing visible areas and another for inpainting missing regions—to achieve high-quality novel view synthesis from sparse images through a two-stage optimization process.

Authors:Kai Zhang, Jianwei Yang, Jeevana Priya Inala, Chandan Singh, Jianfeng Gao, Yu Su, Chenglong Wang
Title: Towards Understanding Graphical Perception in Large Multimodal Models
Abstract:
Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly found that these models struggle with simple tasks on infographics that require perception only. As existing benchmarks primarily focus on end tasks that require various abilities, they provide limited, fine-grained insights into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded on charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross reference values within a chart. These insights provide guidance for future improvements in perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.
English Summary: Large multimodal models surprisingly struggle with simple perception tasks on infographics, prompting the development of an evaluation framework based on graphical perception theory that reveals critical limitations in their ability to generalize across chart types and understand visual elements.

Authors:Evangelos Kazakos, Cordelia Schmid, Josef Sivic
Title: Large-scale Pre-training for Grounded Video Caption Generation
Abstract:
We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We introduce the following contributions. First, we present a large-scale automatic annotation method that aggregates frame-level captions grounded with bounding boxes into temporally dense and consistent annotations. We apply this approach on the HowTo100M dataset to construct a large-scale pre-training dataset, named HowToGround1M. We also introduce a Grounded Video Caption Generation model, dubbed GROVE, and pre-train the model on HowToGround1M. Second, we introduce iGround--a dataset of 3513 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes. This allows us to measure progress on this challenging problem, as well as to fine-tune our model on this small-scale but high-quality data. Third, we demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset, as well as on the VidSTG, ActivityNet-Entities, GroundingYouTube, and YouCook-Interactions datasets. Our ablations demonstrate the importance of pre-training on our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model. The dataset and code are available at https://ekazakos.github.io/grounded_video_caption_generation/.

Authors:Ju He, Qihang Yu, Qihao Liu, Liang-Chieh Chen
Title: FlowTok: Flowing Seamlessly Across Text and Image Tokens
Abstract:
Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm-directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds-all while delivering performance comparable to state-of-the-art models. Code will be available at https://github.com/bytedance/1d-tokenizer.
Chinese: 本文提出FlowTok框架,通过将图像编码为紧凑的一维标记并在共享空间中直接进行流匹配来实现文本与图像间的跨模态转换,无需复杂条件机制即可实现高效生成并保持优异性能。
English: This paper introduces FlowTok, a novel framework that simplifies cross-modality generation by directly transforming between text and images through flow matching in a shared 1D token space, achieving high efficiency and performance without complex conditioning mechanisms.
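The "directly evolving between modalities" formulation corresponds to a flow-matching objective between two token sets of the same shape, sketched below with a toy velocity network; the linear interpolation path, the network, and the shapes are assumptions rather than FlowTok's training recipe.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy velocity field over 1D token sequences (stand-in for the real model)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, t):
        t = t.expand(x.shape[0], x.shape[1], 1)            # broadcast time to every token
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(velocity_net, text_tokens, image_tokens):
    """Rectified-flow style objective between text and image token sets of equal shape."""
    b = text_tokens.shape[0]
    t = torch.rand(b, 1, 1)                                # per-sample time in [0, 1]
    x_t = (1 - t) * text_tokens + t * image_tokens         # point on the linear path
    target_velocity = image_tokens - text_tokens           # constant velocity along that path
    return ((velocity_net(x_t, t) - target_velocity) ** 2).mean()

loss = flow_matching_loss(TinyVelocityNet(64), torch.randn(8, 32, 64), torch.randn(8, 32, 64))
```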

Authors:Tsan-Tsung Yang, I-Wei Chen, Kuan-Ting Chen, Shang-Hsuan Chiang, Wen-Chih Peng
Title: Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models
Abstract:
With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles the detection of AI-generated images and the identification of their source models using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we leverage EfficientNet-B0 as the backbone and feed it with RGB channels, frequency features, and reconstruction errors, while for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and an SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines like AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked Top-3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found at https://github.com/uuugaga/Defactify_4
Chinese: 本文采用CNN和CLIP-ViT分类器检测AI生成图像并识别其来源模型,在Defactify 4竞赛中表现优异,总体排名前三。
English: This paper presents a method using CNN and CLIP-ViT classifiers to detect AI-generated images and identify their source models, achieving strong performance and ranking Top-3 in the Defactify 4 competition.
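The CLIP-feature plus SVM branch maps naturally to a scikit-learn pipeline, sketched below; clip_features stands in for embeddings extracted offline with a pretrained CLIP image encoder, and the scaler, kernel, and C value are assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_source_classifier(clip_features: np.ndarray, source_labels: np.ndarray):
    """Fit an SVM on precomputed CLIP image embeddings to predict the generating model."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(clip_features, source_labels)
    return clf

# Dummy data only; real usage would pass CLIP embeddings and source-model labels.
clf = train_source_classifier(np.random.randn(200, 512), np.random.randint(0, 6, 200))
pred = clf.predict(np.random.randn(5, 512))
```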

Authors:Fan Lyu, Tianle Liu, Zhang Zhang, Fuyuan Hu, Liang Wang
Title: Test-Time Discovery via Hashing Memory
Abstract:
We introduce Test-Time Discovery (TTD) as a novel task that addresses class shifts during testing, requiring models to simultaneously identify emerging categories while preserving previously learned ones. A key challenge in TTD is distinguishing newly discovered classes from those already identified. To address this, we propose a training-free, hash-based memory mechanism that enhances class discovery through fine-grained comparisons with past test samples. Leveraging the characteristics of unknown classes, our approach introduces hash representation based on feature scale and directions, utilizing Locality-Sensitive Hashing (LSH) for efficient grouping of similar samples. This enables test samples to be easily and quickly compared with relevant past instances. Furthermore, we design a collaborative classification strategy, combining a prototype classifier for known classes with an LSH-based classifier for novel ones. To enhance reliability, we incorporate a self-correction mechanism that refines memory labels through hash-based neighbor retrieval, ensuring more stable and accurate class assignments. Experimental results demonstrate that our method achieves good discovery of novel categories while maintaining performance on known classes, establishing a new paradigm in model testing. Our code is available at https://github.com/fanlyu/ttd.
中文摘要:本文提出测试时发现(TTD)方法,通过基于哈希的内存机制和协同分类策略,在测试阶段有效识别新类别同时保持已知类别性能,无需重新训练模型。
English Summary: The paper introduces Test-Time Discovery (TTD), a training-free method using hash-based memory and collaborative classification to identify new classes during testing while maintaining known class performance.
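A minimal version of an LSH memory for grouping similar test samples is sketched below using random-hyperplane hashing; the paper's hash additionally encodes feature scale and applies a self-correction step, so the class and parameters here are assumptions for illustration only.

```python
import torch

class LSHMemory:
    """Toy random-hyperplane LSH memory over test-time features (illustrative only)."""
    def __init__(self, dim: int, n_bits: int = 16, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.planes = torch.randn(dim, n_bits, generator=g)   # random projection directions
        self.buckets = {}                                      # hash key -> list of (feature, label)

    def _key(self, feat):
        bits = (feat @ self.planes > 0).long()                 # sign pattern of the projections
        return int((bits * (2 ** torch.arange(bits.numel()))).sum())

    def add(self, feat, label):
        self.buckets.setdefault(self._key(feat), []).append((feat, label))

    def query(self, feat):
        # Returns past samples in the same bucket, i.e. likely fine-grained neighbours.
        return self.buckets.get(self._key(feat), [])

mem = LSHMemory(dim=128)
mem.add(torch.randn(128), label=3)
neighbors = mem.query(torch.randn(128))
```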

Authors:Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
Title: Neighboring Autoregressive Modeling for Efficient Visual Generation
Abstract:
Visual autoregressive models typically adhere to a raster-order ``next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far ``next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the model forward steps for generation. Experiments on ImageNet $256\times 256$ and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluating on text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4 of the training data. Code is available at https://github.com/ThisisBillhe/NAR.
Chinese: 本文提出邻近自回归建模(NAR),通过将视觉生成重构为基于时空邻近性的渐进式外延过程,在图像和视频生成任务中实现了比现有方法更高的吞吐量和更优的性能指标。
English: The paper introduces Neighboring Autoregressive Modeling (NAR), a novel visual generation paradigm that replaces raster-order token prediction with a progressive outpainting approach based on spatial-temporal proximity, achieving higher throughput and superior performance in image and video generation tasks compared to existing methods.
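The near-to-far schedule for the image case can be made concrete by grouping token positions by their Manhattan distance from the starting token, with each group decodable in parallel; the sketch below covers only the 2D spatial case and leaves out the dimension-oriented decoding heads.

```python
import torch

def neighbor_decoding_order(h: int, w: int, start=(0, 0)):
    """Group spatial token positions by Manhattan distance from the starting token.

    Each group is one decoding step in which all its positions are predicted in parallel
    (the paper extends the same idea to the temporal axis for video).
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = (ys - start[0]).abs() + (xs - start[1]).abs()
    order = []
    for d in range(int(dist.max()) + 1):
        order.append(torch.nonzero(dist == d, as_tuple=False))   # (n_d, 2) positions decoded together
    return order

steps = neighbor_decoding_order(16, 16)
print(len(steps), steps[0].shape)   # 31 decoding steps instead of 256 for a 16x16 token grid
```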

Authors:Yibin Ye, Xichao Teng, Shuo Chen, Zhang Li, Leqi Liu, Qifeng Yu, Tao Tan
Title: Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark
Abstract:
Absolute Visual Localization (AVL) enables Unmanned Aerial Vehicles (UAVs) to determine their position in GNSS-denied environments by establishing geometric relationships between UAV images and geo-tagged reference maps. While many previous works have achieved AVL with image retrieval and matching techniques, research in low-altitude multi-view scenarios still remains limited. The Low-altitude Multi-view condition presents greater challenges due to extreme viewpoint changes. To explore the best UAV AVL approach under such conditions, we propose this benchmark. Firstly, a large-scale Low-altitude Multi-view dataset called AnyVisLoc was constructed. This dataset includes 18,000 images captured at multiple scenes and altitudes, along with 2.5D reference maps containing aerial photogrammetry maps and historical satellite maps. Secondly, a unified framework was proposed to integrate the state-of-the-art AVL approaches and comprehensively test their performance. The best combined method was chosen as the baseline, and the key factors influencing localization accuracy are thoroughly analyzed based on it. This baseline achieved a 74.1% localization accuracy within 5m under Low-altitude, Multi-view conditions. In addition, a novel retrieval metric called PDM@K was introduced to better align with the characteristics of the UAV AVL task. Overall, this benchmark revealed the challenges of Low-altitude, Multi-view UAV AVL and provided valuable guidance for future research. The dataset and codes are available at https://github.com/UAV-AVL/Benchmark
中文: 该基准针对低空多视角无人机绝对视觉定位任务,构建了大规模数据集并提出了统一评估框架,在5米误差范围内达到74.1%定位精度,同时揭示了该场景的特殊挑战并为后续研究提供了方向。
English: This benchmark introduces a large-scale dataset and unified framework for absolute visual localization in low-altitude multi-view UAV scenarios, achieving 74.1% accuracy within 5 meters while highlighting unique challenges and providing future research guidance.

Authors:Qiji Zhou, Yifan Gong, Guangsheng Bao, Hongjie Qiu, Jinqiang Li, Xiangrong Zhu, Huajian Zhang, Yue Zhang
Title: Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation
Abstract:
Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce \textbf{COVER} (\textbf{\underline{CO}}unterfactual \textbf{\underline{V}}id\textbf{\underline{E}}o \textbf{\underline{R}}easoning), a multidimensional multimodal benchmark that systematically evaluates MLLMs across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments. Our work is available at https://github.com/gongyifan-hash/COVER-Benchmark.
中文: 本文提出的COVER基准通过结构化子问题分解系统评估多模态大模型在视频理解中的反事实推理能力,实验表明提升模型推理能力对增强视频理解鲁棒性至关重要。
English: The COVER benchmark is introduced to systematically evaluate multimodal large language models' counterfactual reasoning abilities in video understanding through structured sub-question decomposition, revealing that enhanced reasoning capabilities are crucial for robust performance.

Authors:Khawar Islam, Naveed Akhtar
Title: Context-guided Responsible Data Augmentation with Diffusion Models
Abstract:
Generative diffusion models offer a natural choice for data augmentation when training complex vision models. However, ensuring reliability of their generative content as augmentation samples remains an open challenge. Despite a number of techniques utilizing generative images to strengthen model training, it remains unclear how to utilize the combination of natural and generative images as a rich supervisory signal for effective model induction. In this regard, we propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample with an explicitly constrained diffusion model that leverages sample-based context and negative prompting for a reliable augmentation sample generation. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. To that end, we propose a hard-cosine filtration in the embedding space of CLIP. Our approach systematically mixes the natural and generative images at pixel and patch levels. We extensively evaluate our technique on ImageNet-1K, Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets, demonstrating a notable increase in performance across the board, achieving up to $\sim 3\%$ absolute gain for top-1 accuracy over the state-of-the-art methods, while showing comparable computational overhead. Our code is publicly available at https://github.com/khawar-islam/DiffCoRe-Mix
中文:DiffCoRe-Mix方法通过约束性扩散和语义过滤系统性地融合自然与生成图像,在保持计算效率的同时将视觉模型准确率最高提升3%。
English: The proposed DiffCoRe-Mix method enhances vision model training by systematically combining natural and generative images through constrained diffusion and semantic filtering, achieving up to 3% accuracy improvement with minimal computational overhead.
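A minimal sketch of the filtering-and-mixing idea from the abstract, under stated assumptions: the embedding model (e.g., CLIP), the similarity threshold, and the mixing coefficient are placeholders, and the released DiffCoRe-Mix pipeline may differ.

```python
# Sketch under assumptions (not the released DiffCoRe-Mix pipeline): keep only
# generated counterparts whose embedding stays close to the natural image, then
# mix an accepted counterpart with the natural image at the pixel level.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hard_cosine_filter(natural_emb, generated_embs, threshold=0.7):
    """Indices of generated samples passing the hard cosine threshold."""
    return [i for i, g in enumerate(generated_embs) if cosine(natural_emb, g) >= threshold]

def pixel_mix(natural_img, generated_img, lam=0.5):
    """Convex pixel-level mix of the natural image and one accepted counterpart."""
    return lam * natural_img + (1.0 - lam) * generated_img
```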

Authors:Jingwen Deng, Zihao Wang, Shaofei Cai, Anji Liu, Yitao Liang
Title: Open-World Skill Discovery from Unsegmented Demonstrations
Abstract:
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage the diverse YouTube videos to train instruction-following agents. The project page can be found in https://craftjarvis.github.io/SkillDiscovery.
中文: 我们开发了一种自监督的技能边界检测算法,通过检测预测误差峰值将未标记视频分割为技能一致的片段,在《我的世界》中显著提升了智能体在原子技能和分层任务中的表现。
English: We developed a self-supervised Skill Boundary Detection algorithm that segments unlabeled videos into skill-consistent segments by detecting prediction error spikes, significantly improving agent performance in both atomic and hierarchical tasks in Minecraft.
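A minimal sketch of the boundary rule the abstract describes, assuming a simple running-average threshold; the released SBD implementation may use a different criterion.

```python
# Minimal sketch (assumed thresholding rule, not the released implementation):
# mark a skill boundary wherever the action-prediction error of a pretrained,
# unconditional model jumps well above its recent running average.
import numpy as np

def detect_skill_boundaries(pred_errors, window=16, ratio=1.5):
    """Return frame indices where the error exceeds `ratio` times the recent mean."""
    errors = np.asarray(pred_errors, dtype=float)
    boundaries = []
    for t in range(window, len(errors)):
        recent_mean = errors[t - window:t].mean()
        if errors[t] > ratio * recent_mean:
            boundaries.append(t)
    return boundaries

# Example: a spike at frame 40 in otherwise flat errors is flagged as a boundary.
errs = np.ones(80) * 0.1
errs[40] = 0.5
print(detect_skill_boundaries(errs))  # -> [40]
```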

Authors:Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li
Title: VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
Abstract:
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
中文: 本文提出了视频参考抠图新任务,通过利用视频扩散模型和新型潜在构造损失,根据参考描述生成时间一致的指定实例阿尔法蒙版,并发布了包含1万个视频的大规模数据集。
English: This paper introduces video referring matting, a novel task that generates temporally consistent alpha mattes for specified instances using referring captions by leveraging video diffusion models and a new Latent-Constructive loss, supported by a large-scale dataset of 10,000 videos.

Authors:Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
Title: GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Abstract:
Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
中文: GoT框架通过语言推理链分析语义空间关系,将图像生成与编辑转化为推理引导的过程,在实现精确用户控制的同时显著超越了基线方法的性能。
English: The GoT framework introduces a reasoning-guided approach to image generation and editing by analyzing semantic-spatial relationships through explicit language reasoning chains before producing images, significantly outperforming baseline methods and enabling precise user control.

Authors:Ho Kei Cheng, Alexander Schwing
Title: The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation
Abstract:
Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior, and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to a subpar performance. To bridge this gap, we propose conditional optimal transport C^2OT that adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions in 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code is available at https://hkchengrex.github.io/C2OT
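A minimal sketch of the minibatch optimal-transport coupling that the abstract builds on; the conditional weighting term that defines C^2OT is only indicated by a comment, since its exact form is not given in the abstract.

```python
# Minimal sketch of minibatch OT coupling for flow matching: pair noise and data
# samples within a batch by solving an assignment problem on squared distances.
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairs(x0, x1):
    """x0: (B, D) prior/noise samples, x1: (B, D) data samples -> coupled index pairs."""
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # squared L2 cost matrix
    # C^2OT additionally adds a condition-dependent weighting term to `cost` here.
    rows, cols = linear_sum_assignment(cost)
    return rows, cols  # x0[rows[i]] is coupled with x1[cols[i]]

rng = np.random.default_rng(0)
print(minibatch_ot_pairs(rng.normal(size=(4, 2)), rng.normal(size=(4, 2))))
```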

Authors:Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
Title: A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Abstract:
Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at https://github.com/VILA-Lab/M-Attack.
中文: 现有针对黑盒商业大视觉语言模型的定向攻击常因扰动缺乏语义细节而失败,而我们的方法通过局部裁剪和对齐将对抗性修改集中于语义丰富区域,显著提升了迁移性,在GPT-4.5和Claude-3.7等模型上成功率超90%。
English: Current targeted attacks on black-box commercial large vision-language models often fail due to their uniform perturbations lacking semantic clarity, but our method enhances transferability by focusing adversarial modifications on semantically rich regions through local cropping and alignment, achieving over 90% success on models like GPT-4.5 and Claude-3.7.
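A minimal sketch of the crop-resize-align loop the abstract describes, under stated assumptions: the surrogate encoder, the square crops, and all hyper-parameters (epsilon, step size, iteration count) are placeholders rather than the released M-Attack settings.

```python
# Sketch under assumptions (not the released M-Attack code): at every step the
# adversarial image is randomly cropped with a controlled scale, resized back,
# and pushed toward the target image in the encoder's embedding space; the
# perturbation stays inside an L_inf ball.
import torch
import torch.nn.functional as F

def random_crop_resize(x, min_scale=0.5):
    """Differentiable random square crop (random scale) followed by resize."""
    _, _, h, w = x.shape
    s = torch.empty(1).uniform_(min_scale, 1.0).item()
    ch, cw = int(h * s), int(w * s)
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    crop = x[:, :, top:top + ch, left:left + cw]
    return F.interpolate(crop, size=(h, w), mode="bilinear", align_corners=False)

def m_attack_sketch(encoder, src, target, eps=8 / 255, alpha=1 / 255, steps=300):
    with torch.no_grad():
        target_emb = F.normalize(encoder(target), dim=-1)
    delta = torch.zeros_like(src, requires_grad=True)
    for _ in range(steps):
        adv_view = random_crop_resize(src + delta)      # local region view
        emb = F.normalize(encoder(adv_view), dim=-1)
        loss = (emb * target_emb).sum()                 # cosine alignment with target
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()          # ascend the similarity
            delta.clamp_(-eps, eps)
            delta.copy_((src + delta).clamp(0, 1) - src)
        delta.grad = None
    return (src + delta).detach()
```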

Authors:Yanming Zhang, Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang
Title: V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes
Abstract:
This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V$^2$Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.

Authors:Hang Yin, Xiuwei Xu, Lingqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
Title: UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Abstract:
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference frameworks upon large language models (LLMs) for specific tasks, which differ substantially in their overall pipelines and fail to generalize across different types of goals. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the agent's observations into an online-maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage an LLM for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and goal graph at each time instant and propose different strategies to generate the long-term exploration goal according to different matching states. The agent first iteratively searches for a subgraph of the goal when there is zero matching. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally, scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switching between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
中文: 本文提出UniGoal通用零样本导航框架,通过统一图表示整合不同目标,并利用大语言模型进行显式图推理,以单一模型在多项导航任务中实现了最先进的零样本性能。
English: This paper introduces UniGoal, a universal zero-shot navigation framework that unifies diverse goals through uniform graph representations and leverages LLMs for explicit graph-based reasoning, achieving state-of-the-art performance across multiple navigation tasks with a single model.

Authors:Hashmat Shadab Malik, Shahina Kunhimon, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
Title: Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology
Abstract:
Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrate it into adversarial training for enhanced robustness. We evaluate HSAT on multiclass histopathology dataset OpenSRH and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our Code for training and evaluation is available at https://github.com/HashmatShadab/HSAT.
中文: HSAT提出了一种分层自监督对抗训练方法,通过利用多级对比学习来增强组织病理学图像的鲁棒性,在白盒和黑盒场景下均取得了显著性能提升。
English: HSAT introduces hierarchical self-supervised adversarial training to enhance robustness in histopathology images by leveraging multi-level contrastive learning, achieving significant gains in white-box and black-box settings.

Authors:Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, Liefeng Bo
Title: LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
Abstract:
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost the face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
中文摘要:提出的LHM模型通过多模态变换器和3D高斯溅射技术,无需后处理即可高效重建高保真可动画三维人体,在重建精度和泛化能力上均优于现有方法。
English Summary: The proposed LHM model efficiently reconstructs high-fidelity animatable 3D human avatars using 3D Gaussian splatting and a multimodal transformer, outperforming existing methods in accuracy and generalization without post-processing.

Authors:Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu
Title: ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
Abstract:
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at https://boqian-li.github.io/ETCH/.
中文: ETCH通过等变紧密度拟合将衣物表面映射至人体,在多样姿态和服装类型下显著提升了人体拟合精度与泛化能力。
English: ETCH is a novel pipeline that uses equivariant tightness fitting to map cloth surfaces to underlying bodies, significantly improving body fitting accuracy and generalization across diverse poses and clothing types.

Authors:Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Title: Transformers without Normalization
Abstract:
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
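A minimal sketch of Dynamic Tanh as a drop-in replacement for a normalization layer. Only the element-wise tanh(alpha * x) operation is taken from the abstract; the per-channel affine parameters and the initial value of alpha are assumptions mirroring common LayerNorm usage.

```python
# Minimal sketch of DyT(x) = tanh(alpha * x) with an assumed per-channel affine,
# used where a Transformer block would otherwise place nn.LayerNorm(dim).
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # scalar steepness
        self.weight = nn.Parameter(torch.ones(dim))               # per-channel scale
        self.bias = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x):
        return torch.tanh(self.alpha * x) * self.weight + self.bias

# Drop-in usage: replace nn.LayerNorm(dim) with DyT(dim) inside a Transformer block.
x = torch.randn(2, 16, 768)
print(DyT(768)(x).shape)  # torch.Size([2, 16, 768])
```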

Authors:Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan
Title: DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
Abstract:
While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.
Chinese: 大型多模态模型在视觉问答任务中表现出色,但在自动驾驶的复杂多步推理上存在不足,为此我们提出了DriveLMM-o1数据集和模型,以提升自动驾驶场景中的推理能力和准确性。
English: Large multimodal models excel in Visual Question Answering but struggle with complex multi-step reasoning in autonomous driving, prompting the introduction of DriveLMM-o1, a dataset and model that enhances reasoning and accuracy in driving scenarios.

Authors:Jinyang Li, En Yu, Sijia Chen, Wenbing Tao
Title: OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer
Abstract:
Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, the experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at https://github.com/jinyanglii/OVTR.
中文摘要:OVTR是首个端到端的开放词汇多目标跟踪器,通过类别信息传播策略和深度多模态交互设计,在保持高效推理的同时显著提升了未知类别的跟踪性能。
English Summary: OVTR is the first end-to-end open-vocabulary multiple object tracker that simultaneously models motion, appearance, and categories, achieving superior performance and faster inference through innovative strategies like category information propagation and deep multimodal interaction.

Authors:Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen
Title: R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Abstract:
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
Chinese: R1-Onevision多模态推理模型通过跨模态管道将图像转化为文本表示,弥合了视觉感知与深度推理之间的鸿沟,在多个复杂基准测试中超越GPT-4o等模型,实现了最先进的性能。
English: The R1-Onevision multimodal reasoning model bridges visual and textual analysis through a cross-modal pipeline that converts images into textual representations, achieving state-of-the-art performance on complex benchmarks by outperforming models like GPT-4o.

Authors:Severin Heidrich, Till Beemelmanns, Alexey Nekrasov, Bastian Leibe, Lutz Eckstein
Title: OCCUQ: Exploring Efficient Uncertainty Quantification for 3D Occupancy Prediction
Abstract:
Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits. Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training. Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability. We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction. Our method dynamically calibrates model confidence using epistemic uncertainty estimates. Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data. We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments. Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout. Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios. Code and dataset are available at https://github.com/ika-rwth-aachen/OCCUQ .
中文: 本研究提出了一种用于自动驾驶中3D占据预测的高效不确定性估计方法,通过动态校准模型置信度,在相机损坏等恶劣条件下展现出优异的异常数据检测能力和系统鲁棒性提升。
English: This study introduces an efficient uncertainty estimation method for 3D occupancy prediction in autonomous driving, which dynamically calibrates model confidence and demonstrates superior performance in detecting out-of-distribution data and improving system robustness under adverse conditions like camera corruptions.

Authors:Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Title: TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Abstract:
Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the "overall truthfulness" of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as "per-token" hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at https://github.com/jinhaoduan/TruthPrInt.
Chinese: 物体幻觉是大视觉语言模型中的主要可信挑战,但内部状态可作为逐令牌指示器并揭示通用模式,由此提出的TruthPrInt方法通过真实性引导干预显著优于现有方法。
English: Object Hallucination is a major trust issue in Large Vision-Language Models, but internal states can serve as per-token indicators and reveal universal patterns, leading to the proposed TruthPrInt method that significantly outperforms existing approaches through truthful-guided intervention.
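A minimal sketch of the general idea of a latent "truthful direction" with inference-time intervention, not the TruthPrInt algorithm itself; the layer choice and the intervention strength are illustrative assumptions.

```python
# Sketch under assumptions: estimate a "truthful direction" as the mean difference
# between hidden states collected from truthful and hallucinated tokens, then nudge
# hidden states along that direction at decoding time.
import torch

def truthful_direction(truthful_hidden, hallucinated_hidden):
    """Both inputs: (num_tokens, hidden_dim) tensors of collected hidden states."""
    d = truthful_hidden.mean(0) - hallucinated_hidden.mean(0)
    return d / d.norm()

def intervene(hidden_states, direction, beta=1.0):
    """Shift per-token hidden states toward the truthful direction before the LM head."""
    return hidden_states + beta * direction

# Usage: collect hidden states offline, compute the direction once, and apply
# `intervene` to a chosen layer's output at every decoding step.
```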

Authors:Yang Zheng, Menglei Chai, Delio Vicini, Yuxiao Zhou, Yinghao Xu, Leonidas Guibas, Gordon Wetzstein, Thabo Beeler
Title: GroomLight: Hybrid Inverse Rendering for Relightable Human Hair Appearance Modeling
Abstract:
We present GroomLight, a novel method for relightable hair appearance modeling from multi-view images. Existing hair capture methods struggle to balance photorealistic rendering with relighting capabilities. Analytical material models, while physically grounded, often fail to fully capture appearance details. Conversely, neural rendering approaches excel at view synthesis but generalize poorly to novel lighting conditions. GroomLight addresses this challenge by combining the strengths of both paradigms. It employs an extended hair BSDF model to capture primary light transport and a light-aware residual model to reconstruct the remaining details. We further propose a hybrid inverse rendering pipeline to optimize both components, enabling high-fidelity relighting, view synthesis, and material editing. Extensive evaluations on real-world hair data demonstrate state-of-the-art performance of our method.
Chinese: GroomLight提出了一种混合方法,结合扩展头发BSDF模型处理主要光线传输和光感知残差模型重建细节,通过优化的逆向渲染流程实现了高保真的重光照与视角合成。
English: GroomLight introduces a hybrid approach combining an extended hair BSDF model for primary light transport and a light-aware residual model for detail reconstruction, enabling high-fidelity relighting and view synthesis through an optimized inverse rendering pipeline.

Authors:Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Title: GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Abstract:
Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results, specifically a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., $4.5 \times$ faster than GLaMM.
中文: 针对像素接地任务中现有数据集在对象类别、文本多样性和标注质量上的局限,本研究提出了GroundingSuite自动化数据标注框架,构建了大规模训练数据集和精选评估基准,不仅显著提升了模型性能达到最优水平,其标注效率也比现有领先方法快4.5倍。
English: To address the limitations of existing datasets in pixel grounding, such as restricted object categories and insufficient annotations, this study introduces GroundingSuite—an automated data annotation framework that generates a large-scale training dataset and a curated evaluation benchmark, achieving state-of-the-art performance and significantly faster annotation efficiency compared to current methods.

Authors:Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, Hongsheng Li
Title: CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Abstract:
This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within individual video clips, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve the dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.
中文: CameraCtrl II 框架通过先增强单个视频片段的动态内容,再扩展至大范围视角的无缝探索,实现了基于相机控制的动态场景合成,显著拓宽了空间探索能力。
English: CameraCtrl II is a framework that progressively enhances dynamic scene generation by first improving video clip dynamics and then enabling seamless, wide-ranging camera-controlled explorations through iterative trajectory inputs.

Authors:Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, Lu Jiang
Title: Long Context Tuning for Video Generation
Abstract:
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.

Authors:Quoc-Tien Nguyen, Hong-Hai Nguyen, Van-Thong Huynh
Title: Lightweight Models for Emotional Analysis in Video
Abstract:
In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.
中文: 本研究提出一种高效框架,结合MobileNetV4进行空间特征提取和多尺度3D MLP-Mixer进行时序建模,在ABAW第八届竞赛中验证了其在情感行为分析中兼顾计算效率与预测精度的优势。
English: This study introduces an efficient framework combining MobileNetV4 for spatial feature extraction with a multi-scale 3D MLP-Mixer for temporal modeling, achieving balanced computational efficiency and accuracy in affective behavior analysis as validated on the ABAW 8th competition.

Authors:Zengrong Lin, Zheng Wang, Tianwen Qian, Pan Mu, Sixian Chan, Cong Bai
Title: NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Abstract:
Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominate as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications. We make our code publicly available at: https://github.com/zzezze/NeighborRetr .
中文: NeighborRetr是一种新颖的训练方法,通过自适应平衡中心点学习和邻居关系,直接缓解跨模态检索中的中心点问题,实现了最优性能并展现出强大的跨领域泛化能力。
English: NeighborRetr is a novel training method that directly mitigates the hubness problem in cross-modal retrieval by adaptively balancing hub learning and neighbor relations, achieving state-of-the-art performance and robust generalization across domains.
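A minimal sketch of how the hubness problem can be measured (a diagnostic, not the NeighborRetr training procedure): count how often each gallery item lands in the queries' top-k neighbor lists; a handful of items with very large counts are hubs.

```python
# Minimal diagnostic sketch: compute the k-occurrence of every gallery item.
# Heavily skewed counts indicate a hubness problem in the retrieval embedding space.
import numpy as np

def k_occurrence(query_emb, gallery_emb, k=10):
    """For each gallery item, count how many queries rank it in their top-k."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=len(gallery_emb))

rng = np.random.default_rng(0)
print(k_occurrence(rng.normal(size=(100, 64)), rng.normal(size=(50, 64)))[:10])
```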

Authors:Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
Title: AudioX: Diffusion Transformer for Anything-to-Audio Generation
Abstract:
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/
中文: AudioX作为一种统一的扩散变换器模型,通过多模态掩码训练策略和构建大规模数据集,实现了任意内容到音频的高质量生成,既能处理文本、视频等多种输入,又超越了专业模型的性能表现。
English: AudioX is a unified Diffusion Transformer model that overcomes limitations in audio and music generation by enabling anything-to-audio conversion with flexible natural language control and robust cross-modal learning, while newly curated datasets address data scarcity issues.
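A minimal sketch of a multi-modal masked training step in the spirit of the abstract; the masking probability and the zeroing of dropped conditions are assumptions, and AudioX's actual masking strategy may differ.

```python
# Sketch under assumptions: randomly drop whole modality conditions during training
# so the model cannot rely on any single input and must learn unified cross-modal
# representations.
import torch

def mask_modalities(cond: dict, p_drop=0.5):
    """cond maps modality name -> embedding tensor; dropped modalities become zeros."""
    masked = {}
    for name, emb in cond.items():
        keep = torch.rand(()).item() > p_drop
        masked[name] = emb if keep else torch.zeros_like(emb)
    return masked

# Usage inside a diffusion training loop (names are placeholders):
# cond = {"text": text_emb, "video": video_emb, "image": image_emb}
# loss = diffusion_loss(model, audio_latents, mask_modalities(cond))
```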

Authors:Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen
Title: TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLM closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at https://github.com/ShawnTan86/TokenCarve.
Chinese: TokenCarve是一种无需训练的令牌压缩框架,通过减少注意力输出中的信息损失来保持性能,在显著提升效率的同时仅造成微小的精度下降。
English: TokenCarve is a training-free token compression framework that preserves performance by minimizing information loss in attention outputs, achieving significant efficiency gains with minimal accuracy drop.
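A minimal sketch of a training-free, two-stage prune-then-merge compression in the spirit of TokenCarve; scoring tokens by the norm of their attention output is an assumed stand-in for the paper's IPGS criterion, and the ratios below are illustrative.

```python
# Sketch under assumptions: stage 1 prunes low-score visual tokens, stage 2 merges
# a further fraction of the survivors into their most similar kept neighbor.
import torch
import torch.nn.functional as F

def carve_tokens(tokens, attn_out, keep_ratio=0.25, merge_ratio=0.5):
    """tokens, attn_out: (N, d) tensors for the visual tokens of one image."""
    n = tokens.size(0)
    scores = attn_out.norm(dim=-1)                        # information proxy (assumed)
    n_keep = max(1, int(n * keep_ratio))
    keep_idx = scores.topk(n_keep).indices                # stage 1: prune
    kept = tokens[keep_idx]

    n_final = max(1, int(n_keep * merge_ratio))
    order = scores[keep_idx].argsort(descending=True)
    anchors, extras = kept[order[:n_final]], kept[order[n_final:]]
    if len(extras) > 0:                                   # stage 2: merge
        sim = F.normalize(extras, dim=-1) @ F.normalize(anchors, dim=-1).T
        target = sim.argmax(dim=-1)
        anchors = anchors.index_add(0, target, extras)
        counts = torch.ones(n_final).index_add(0, target, torch.ones(len(extras)))
        anchors = anchors / counts.unsqueeze(-1)           # average merged tokens
    return anchors

print(carve_tokens(torch.randn(576, 64), torch.randn(576, 64)).shape)  # (72, 64)
```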

Authors:Jiali Yao, Xinran Deng, Xin Gu, Mengrui Dai, Bing Fan, Zhipeng Zhang, Yan Huang, Heng Fan, Libo Zhang
Title: OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
Abstract:
In this paper, we propose spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims at localizing spatially and temporally all targets mentioned in the textual query from videos. Compared to classic STVG locating only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query from the video, making it more flexible and practical in real scenarios for comprehensive understanding. In order to facilitate exploration of OmniSTVG, we introduce BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, our BOSTVG consists of 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence in BOSTVG, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To our best knowledge, BOSTVG is to date the first and the largest benchmark for OmniSTVG. To encourage future research, we introduce a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark, model, and results will be released at https://github.com/JellyYao3000/OmniSTVG.
中文: 本文提出OmniSTVG,一种新颖的时空视频定位任务,可在视频中同时定位文本查询中的所有目标对象,并发布了首个大规模基准BOSTVG和简单有效的OmniTube方法,为视频理解开辟了新方向。
English: This paper introduces OmniSTVG, a novel spatio-temporal video grounding task that localizes all objects mentioned in text queries across both space and time in videos, along with BOSTVG, the first large-scale benchmark for this task, and a simple yet effective method called OmniTube that shows promising results.

Authors:Xunzhi Zheng, Dan Xu
Title: Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations
Abstract:
Learning accurate scene reconstruction without pose priors in neural radiance fields is challenging due to inherent geometric ambiguity. Recent developments either rely on correspondence priors for regularization or use off-the-shelf flow estimators to derive analytical poses. However, the potential for jointly learning scene geometry, camera poses, and dense flow within a unified neural representation remains largely unexplored. In this paper, we present Flow-NeRF, a unified framework that simultaneously optimizes scene geometry, camera poses, and dense optical flow all on-the-fly. To enable the learning of dense flow within the neural radiance field, we design and build a bijective mapping for flow estimation, conditioned on pose. To make the scene reconstruction benefit from the flow estimation, we develop an effective feature enhancement mechanism to pass canonical space features to world space representations, significantly enhancing scene geometry. We validate our model across four important tasks, i.e., novel view synthesis, depth estimation, camera pose prediction, and dense optical flow estimation, using several datasets. Our approach surpasses previous methods in almost all metrics for novel view synthesis and depth estimation and yields both qualitatively sound and quantitatively accurate novel-view flow. Our project page is https://zhengxunzhi.github.io/flownerf/.

Authors:Yuwen Du, Anning Hu, Zichen Chao, Yifan Lu, Junhao Ge, Genjia Liu, Weitao Wu, Lanjun Wang, Siheng Chen
Title: RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation
Abstract:
Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recent published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present the first simulation framework RoCo-Sim for road-side collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D to 2D projection for roadside cameras; (2) A novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of foreground; and (4) Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon: https://github.com/duyuwen-duen/RoCo-Sim
中文: RoCo-Sim是首个用于路边协同感知的仿真框架,通过动态前景编辑和全场景风格迁移生成多样且多视角一致的数据,显著提升了三维物体检测性能。
English: RoCo-Sim is the first simulation framework designed to enhance roadside collaborative perception by generating diverse, multi-view consistent data through dynamic foreground editing and full-scene style transfer, significantly improving 3D object detection performance.

Authors:Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao
Title: RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Abstract:
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image task. Project page: https://lyne1.github.io/realgeneral_web/; GitHub Link: https://github.com/Lyne1/RealGeneral
中文: RealGeneral提出了一种新颖框架,通过将图像生成任务重构为基于视频模型的帧预测任务,实现了多任务统一,并在主体相似性和图像质量方面取得显著提升。
English: RealGeneral introduces a novel framework that unifies diverse image generation tasks by reformulating them as conditional frame prediction using video models, achieving significant improvements in subject similarity and image quality.

Authors:Matteo Gambella, Fabrizio Pittorino, Manuel Roveri
Title: Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search
Abstract:
Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A$^2$M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A$^2$M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, A$^2$M is able to increase the test accuracy, on average across different differentiable NAS methods, by +3.60\% on CIFAR-10, +4.60\% on CIFAR-100, and +3.64\% on ImageNet16-120, demonstrating its superior effectiveness in practice. A$^2$M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at https://github.com/AI-Tech-Research-Lab/AsquaredM.
中文: 本文提出架构感知最小化(A²M)这一创新框架,通过引导梯度朝向架构空间中的平坦最小值来增强神经架构搜索,在多个基准数据集上显著提升了泛化性能。
English: This paper introduces Architecture-Aware Minimization (A²M), a novel framework that enhances neural architecture search by guiding gradients toward flat minima in architecture space, significantly improving generalization across benchmark datasets.
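To make the flat-minima idea above concrete, the following is a minimal numpy sketch, not the paper's A$^2$M update rule, of biasing a gradient toward flat regions by averaging it over a small neighborhood of the architecture parameters; the toy loss, perturbation radius, and sample count are illustrative assumptions only.

import numpy as np

def toy_val_loss(alpha):
    # hypothetical validation loss over continuous architecture weights
    return np.sum(np.sin(3 * alpha) + 0.1 * alpha ** 2)

def numerical_grad(f, x, eps=1e-4):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def flatness_aware_grad(f, alpha, radius=0.05, n_samples=8, rng=None):
    # average the gradient over random points in a small ball around alpha,
    # which favors descent directions that stay good in a whole neighborhood
    rng = rng or np.random.default_rng(0)
    grads = []
    for _ in range(n_samples):
        delta = rng.normal(size=alpha.shape)
        delta *= radius / (np.linalg.norm(delta) + 1e-12)
        grads.append(numerical_grad(f, alpha + delta))
    return np.mean(grads, axis=0)

alpha = np.zeros(6)
for step in range(100):
    alpha -= 0.05 * flatness_aware_grad(toy_val_loss, alpha)
print("final loss:", toy_val_loss(alpha))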

Authors:Fengxiang Wang, Hongzhen Wang, Yulin Wang, Di Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Long Lan, Wenjing Yang, Jing Zhang
Title: RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing
Abstract:
Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.
中文:RoMA框架通过旋转感知预训练机制和多尺度标记预测,实现了基于Mamba的遥感基础模型的可扩展自监督预训练,在精度和计算效率上持续优于基于ViT的模型。
English: The RoMA framework enables scalable self-supervised pretraining of Mamba-based remote sensing foundation models, overcoming ViT's quadratic complexity through rotation-aware mechanisms and multi-scale token prediction, consistently outperforming ViT counterparts in accuracy and efficiency.

Authors:Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or
Title: Piece it Together: Part-Based Concepting with IP-Priors
Abstract:
Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.

Authors:Maxim Popov, Regina Kurkova, Mikhail Iumanov, Jaafar Mahmoud, Sergey Kolyubin
Title: OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Abstract:
Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, forming future research directions for developing resilient and adaptable robotic systems. Project page is available at https://be2rlab.github.io/OSMa-Bench/.

Authors:Georgy Ponimatkin, Martin Cífka, Tomáš Souček, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
Title: 6D Object Pose Tracking in Internet Videos for Robotic Manipulation
Abstract:
We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, subtle but dynamic object motions, and the fact that the exact mesh of the manipulated object is not known. To address these challenges, we present the following contributions. First, we develop a new method that estimates the 6D pose of any object in the input image without prior knowledge of the object itself. The method proceeds by (i) retrieving a CAD model similar to the depicted object from a large-scale model database, (ii) 6D aligning the retrieved CAD model with the input image, and (iii) grounding the absolute scale of the object with respect to the scene. Second, we extract smooth 6D object trajectories from Internet videos by carefully tracking the detected objects across video frames. The extracted object trajectories are then retargeted via trajectory optimization into the configuration space of a robotic manipulator. Third, we thoroughly evaluate and ablate our 6D pose estimation method on YCB-V and HOPE-Video datasets as well as a new dataset of instructional videos manually annotated with approximate 6D object trajectories. We demonstrate significant improvements over existing state-of-the-art RGB 6D pose estimation methods. Finally, we show that the 6D object motion estimated from Internet videos can be transferred to a 7-axis robotic manipulator both in a virtual simulator as well as in a real world set-up. We also successfully apply our method to egocentric videos taken from the EPIC-KITCHENS dataset, demonstrating potential for Embodied AI applications.

Authors:Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang
Title: VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Abstract:
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.
中文: VisualPRM是一个拥有80亿参数的多模态过程奖励模型,通过最佳N选评估策略提升不同多模态大语言模型的推理能力,并构建了新数据集和基准以推动该领域发展。
English: VisualPRM is an 8-billion-parameter multimodal Process Reward Model that enhances reasoning across various MLLMs through Best-of-N evaluation, achieving significant performance gains and introducing a new dataset and benchmark for future research.

Authors:Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, Peidong Liu
Title: VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames
Abstract:
We present VicaSplat, a novel framework for joint 3D Gaussians reconstruction and camera pose estimation from a sequence of unposed video frames, which is a critical yet underexplored task in real-world 3D applications. The core of our method lies in a novel transformer-based network architecture. In particular, our model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional inserted learnable camera tokens. The obtained tokens then fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from visual tokens of different views, and further modulate them frame-wisely to inject view-dependent features. 3D Gaussian splats and camera pose parameters can then be estimated via different prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs, and achieves comparable performance to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization capability on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: https://lizhiqi49.github.io/VicaSplat.

Authors:Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, Linfeng Zhang
Title: EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
Abstract:
Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we observe that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion process. To tackle these challenges, we propose a practical framework, named EEdit, to achieve efficient image editing. Specifically, we introduce three techniques to solve them one by one. For spatial redundancy, spatial locality caching is introduced to compute the edited region and its neighboring regions while skipping the unedited regions, and token indexing preprocessing is designed to further accelerate the caching. For temporal redundancy, inversion step skipping is proposed to reuse the latent for efficient editing. Our experiments demonstrate an average of 2.46 $\times$ acceleration without performance drop in a wide range of editing tasks including prompt-guided image editing, dragging and image composition. Our codes are available at https://github.com/yuriYanZeXuan/EEdit
中文:EEdit框架通过空间局部缓存和反转步骤跳跃技术,解决了基于反转的图像编辑中的空间与时间冗余问题,在多种编辑任务中实现了2.46倍加速且性能无损。
English: The EEdit framework addresses spatial and temporal redundancies in inversion-based image editing through spatial locality caching and inversion step skipping, achieving a 2.46× speedup without performance loss across various editing tasks.

Authors:Yunpeng Qu, Kun Yuan, Qizhi Xie, Ming Sun, Chao Zhou, Jian Wang
Title: KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception
Abstract:
Video Quality Assessment (VQA), which intends to predict the perceptual quality of videos, has attracted increasing attention. Due to factors like motion blur or specific distortions, the quality of different regions in a video varies. Recognizing the region-wise local quality within a video is beneficial for assessing global quality and can guide us in adopting fine-grained enhancement or transcoding strategies. Due to the heavy cost of annotating region-wise quality, the lack of ground truth constraints from relevant datasets further complicates the utilization of local perception. Inspired by the Human Visual System (HVS) that links global quality to the local texture of different regions and their visual saliency, we propose a Kaleidoscope Video Quality Assessment (KVQ) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of global quality. Our framework extracts visual saliency and allocates attention using Fusion-Window Attention (FWA) while incorporating a Local Perception Constraint (LPC) to mitigate the reliance of regional texture perception on neighboring areas. KVQ obtains significant improvements across multiple scenarios on five VQA benchmarks compared to SOTA methods. Furthermore, to assess local perception, we establish a new Local Perception Visual Quality (LPVQ) dataset with region-wise annotations. Experimental results demonstrate the capability of KVQ in perceiving local distortions. KVQ models and the LPVQ dataset will be available at https://github.com/qyp2000/KVQ.
中文: KVQ框架通过融合窗口注意力和局部感知约束,结合视觉显著性与局部纹理分析来提升视频质量评估,在多个基准测试中表现优异,并发布了用于局部感知评估的新数据集。
English: The KVQ framework enhances video quality assessment by integrating visual saliency and local texture analysis through Fusion-Window Attention and Local Perception Constraint, achieving superior performance on benchmarks and introducing a new dataset for local distortion evaluation.

Authors:Yeonjin Chang, Erqun Dong, Seunghyeon Seo, Nojun Kwak, Kwang Moo Yi
Title: ROODI: Reconstructing Occluded Objects with Denoising Inpainters
Abstract:
While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) object-centric reconstruction through removal of irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we propose to remove irrelevant Gaussians by looking at how close they are to their K-nearest neighbors and removing those that are statistical outliers. Importantly, these distances must take into account the actual spatial extent they cover; we thus propose to use Wasserstein distances. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between proper pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.
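The pruning principle described above (K-nearest-neighbor distances between Gaussians under a Wasserstein metric, with statistical outliers removed) can be sketched as follows. This is an illustrative numpy toy restricted to isotropic Gaussians, for which the 2-Wasserstein distance has the closed form W2^2 = ||mu1 - mu2||^2 + d (s1 - s2)^2; the neighbor count, outlier threshold, and data are assumptions, not the paper's settings.

import numpy as np

def w2_isotropic(mu1, s1, mu2, s2):
    # squared 2-Wasserstein distance between N(mu1, s1^2 I) and N(mu2, s2^2 I)
    d = mu1.shape[-1]
    return np.sum((mu1 - mu2) ** 2) + d * (s1 - s2) ** 2

def prune_outliers(means, scales, k=8, n_sigma=2.0):
    n = means.shape[0]
    knn_dist = np.empty(n)
    for i in range(n):
        dists = np.array([w2_isotropic(means[i], scales[i], means[j], scales[j])
                          for j in range(n) if j != i])
        knn_dist[i] = np.mean(np.sort(dists)[:k])   # mean distance to K nearest Gaussians
    keep = knn_dist < knn_dist.mean() + n_sigma * knn_dist.std()
    return keep                                      # boolean mask of Gaussians to retain

rng = np.random.default_rng(0)
means = np.vstack([rng.normal(0, 0.2, (200, 3)), rng.normal(5, 0.2, (5, 3))])
scales = rng.uniform(0.05, 0.1, means.shape[0])
print("kept:", prune_outliers(means, scales).sum(), "of", means.shape[0])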

Authors:Zhi Chen, Zecheng Zhao, Jingcai Guo, Jingjing Li, Zi Huang
Title: SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning
Abstract:
Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information involved in visual features introduces ambiguity to visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with the supervision from aggregated attention scores across all transformer layers, which estimate each patch's semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance while providing more interpretable and semantically rich feature representations. Code is available at https://github.com/uqzhichen/SVIP.
中文: SVIP是一种基于Transformer的框架,通过在输入阶段预先识别并替换语义无关的图像块来增强零样本学习中的视觉-语义对齐,实现了最先进的性能并提供了更具解释性的特征表示。
English: SVIP is a transformer-based framework that enhances zero-shot learning by preemptively identifying and replacing semantic-unrelated patches in the input stage, achieving state-of-the-art performance with more interpretable feature representations.
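A minimal sketch of the input-stage patch replacement described above, with assumed shapes (196 ViT patch tokens of dimension 768) and a random stand-in for the aggregated attention scores; in SVIP the selection is learned self-supervised rather than obtained by the fixed top-k threshold used here.

import torch

patches = torch.randn(1, 196, 768)                  # ViT patch tokens for one image
attn_score = torch.rand(1, 196)                     # stand-in for aggregated attention scores
learnable = torch.nn.Parameter(torch.randn(768))    # replacement token (word-embedding-initialized in SVIP)

k = 147                                             # keep the 75% most semantic patches
thresh = attn_score.topk(k, dim=1).values[:, -1:]   # score of the k-th best patch
mask = (attn_score >= thresh)[..., None]            # (1, 196, 1) keep-mask
patches = torch.where(mask, patches, learnable)     # swap low-scoring patches for the learnable token
print(patches.shape)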

Authors:Zhijie Zhu, Lei Fan, Maurice Pagnucco, Yang Song
Title: Interpretable Image Classification via Non-parametric Part Prototype Learning
Abstract:
Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks have gained traction as an approach for self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class, and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, through clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce Distinctiveness Score and Comprehensiveness Score. Through evaluation on CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability. Code is available at: https://github.com/zijizhu/proto-non-param.
中文: 本文提出了一种基于部件可解释图像分类框架,通过深度特征的非参数聚类学习具有语义区分度的物体部件,在提升解释多样性的同时,在可解释性评估指标上优于现有方法。
English: This paper introduces a framework for interpretable image classification that learns distinctive object parts through non-parametric clustering of deep features, providing diverse explanations while outperforming existing methods in interpretability metrics.
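Since the core step above is non-parametric clustering of foundation-model features into part prototypes, a small numpy sketch of that step is given below; the feature dimension, number of parts, and random features are placeholders rather than the paper's actual setup.

import numpy as np

def kmeans(feats, k=5, iters=50, rng=None):
    # plain k-means: each resulting center acts as one non-parametric part prototype
    rng = rng or np.random.default_rng(0)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((feats[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = feats[assign == j].mean(axis=0)
    return centers

# feats: patch embeddings of all training images of one class, e.g. frozen foundation-model tokens
feats = np.random.default_rng(1).normal(size=(4096, 768)).astype(np.float32)
part_prototypes = kmeans(feats, k=5)
print(part_prototypes.shape)   # (5, 768): five part prototypes for this class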

Authors:Kaixiang Yang, Xin Li, Qiang Li, Zhiwei Wang
Title: CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition
Abstract:
Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world surgeries. In this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument usage. Theoretically, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing accuracy. Experiments on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and weights are available at https://github.com/kk42yy/CoStoDet-DDPM.
Chinese: 本文提出了一种结合去噪扩散概率模型(DDPM)随机建模与确定性学习的协同训练框架,显著提升了手术工作流分析的预测和识别性能,在基准数据集上实现了关键指标突破并保持实时推理能力。
English: This paper introduces a collaborative co-training framework that integrates stochastic modeling via a denoising diffusion probabilistic model (DDPM) with deterministic learning to enhance surgical workflow analysis, achieving significant improvements in anticipation and recognition tasks on benchmark datasets while ensuring real-time inference.

Authors:Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang
Title: LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Abstract:
Existing MLLMs encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our method consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos to improve the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long video questions and exchange reasons. 4) Reflection: We evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers by multi-round dynamical collaboration of MLLM agents. LVAgent is the first agent system method that outperforms all closed-source models (like GPT-4o) and open-source models (like InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80\% on four mainstream long video understanding tasks. Notably, LVAgent improves accuracy by 13.3\% on LongVideoBench. Code is available at https://github.com/64327069/LVAgent.
中文: 现有MLLM在长视频时序建模上存在局限,而LVAgent通过多智能体动态协作框架,在四大主流任务中实现80%准确率,并在LongVideoBench上提升13.3%,性能超越所有开源与闭源模型。
English: Current MLLMs struggle with temporal context in long videos, but LVAgent introduces a multi-agent collaboration framework that outperforms existing models by 13.3% on benchmarks through dynamic team optimization and iterative reasoning.
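The four-step loop in the abstract (Selection, Perception, Action, Reflection) can be summarized schematically as below; every class, retrieval rule, and scoring heuristic here is a hypothetical placeholder rather than the released LVAgent code, and a real agent would call an MLLM inside answer().

from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    score: float = 1.0
    def answer(self, question, segments):
        # placeholder: a real agent would run an MLLM over the retrieved segments
        return f"{self.name} answer", f"{self.name} reasoning over {len(segments)} segments"

def lvagent(library, question, video, rounds=3, team_size=3):
    team = sorted(library, key=lambda a: -a.score)[:team_size]           # Selection
    segments = video[::max(1, len(video) // 16)]                         # Perception (coarse retrieval)
    answers = {}
    for _ in range(rounds):
        answers = {a.name: a.answer(question, segments) for a in team}   # Action: answer and exchange reasons
        for a in team:                                                   # Reflection: toy per-round scoring
            a.score = 0.5 * a.score + 0.5 * len(answers[a.name][1]) / 100
        team = sorted(team, key=lambda a: -a.score)[:team_size]          # re-optimize the team
    return answers

library = [Agent("mllm_a"), Agent("mllm_b"), Agent("mllm_c"), Agent("mllm_d")]
print(lvagent(library, "What happens at the end?", list(range(600))))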

Authors:Jianheng Liu, Yunfei Wan, Bowen Wang, Chunran Zheng, Jiarong Lin, Fu Zhang
Title: GS-SDF: LiDAR-Augmented Gaussian Splatting and Neural SDF for Geometrically Consistent Rendering and Reconstruction
Abstract:
Digital twins are fundamental to the development of autonomous driving and embodied artificial intelligence. However, achieving high-granularity surface reconstruction and high-fidelity rendering remains a challenge. Gaussian splatting offers efficient photorealistic rendering but struggles with geometric inconsistencies due to fragmented primitives and sparse observational data in robotics applications. Existing regularization methods, which rely on render-derived constraints, often fail in complex environments. Moreover, effectively integrating sparse LiDAR data with Gaussian splatting remains challenging. We propose a unified LiDAR-visual system that synergizes Gaussian splatting with a neural signed distance field. The accurate LiDAR point clouds enable a trained neural signed distance field to offer a manifold geometry field. This motivates us to offer an SDF-based Gaussian initialization for physically grounded primitive placement and a comprehensive geometric regularization for geometrically consistent rendering and reconstruction. Experiments demonstrate superior reconstruction accuracy and rendering quality across diverse trajectories. To benefit the community, the codes are released at https://github.com/hku-mars/GS-SDF.
中文摘要:该研究提出的统一激光雷达视觉系统将高斯泼溅与神经符号距离场相结合,通过基于SDF的初始化和几何正则化方法,有效解决了自动驾驶应用中几何不一致的问题,实现了卓越的重建精度与渲染质量。
English Summary: The proposed unified LiDAR-visual system integrates Gaussian splatting with neural signed distance fields to overcome geometric inconsistencies in autonomous driving applications, achieving superior reconstruction and rendering through SDF-based initialization and geometric regularization.

Authors:Thomas Sanchez, Vladyslav Zalevskyi, Angeline Mihailov, Gerard Martí-Juan, Elisenda Eixarch, Andras Jakab, Vincent Dunet, Mériam Koob, Guillaume Auzias, Meritxell Bach Cuadra
Title: Automatic quality control in multi-centric fetal brain MRI super-resolution reconstruction
Abstract:
Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where acquisitions and image processing techniques are less standardized than in adult imaging. In this work, we focus on automated quality control of super-resolution reconstruction (SRR) volumes of fetal brain MRI, an important processing step where multiple stacks of thick 2D slices are registered together and combined to build a single, isotropic and artifact-free T2 weighted volume. We propose FetMRQC$_{SR}$, a machine-learning method that extracts more than 100 image quality metrics to predict image quality scores using a random forest model. This approach is well suited to a problem that is high dimensional, with highly heterogeneous data and small datasets. We validate FetMRQC$_{SR}$ in an out-of-domain (OOD) setting and report high performance (ROC AUC = 0.89), even when faced with data from an unknown site or SRR method. We also investigate failure cases and show that they occur in $45\%$ of the images due to ambiguous configurations for which the rating from the expert is arguable. These results are encouraging and illustrate how a non-deep-learning-based method like FetMRQC$_{SR}$ is well suited to this multifaceted problem. Our tool, along with all the code used to generate, train and evaluate the model, is available at https://github.com/Medical-Image-Analysis-Laboratory/fetmrqc_sr/ .
中文: 本研究提出了FetMRQC$_{SR}$,一种利用100多个图像指标的机器学习工具,用于自动评估胎儿脑部超分辨率MRI重建的质量,在跨域测试中表现出色,有效应对了这一复杂领域的挑战。
English: This study introduces FetMRQC$_{SR}$, a machine-learning tool using over 100 image metrics to automatically assess the quality of super-resolution fetal brain MRI reconstructions, achieving high out-of-domain performance and addressing challenges in this complex field.
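The modelling recipe stated in the abstract, a random forest over roughly one hundred image-quality metrics, maps directly onto a few lines of scikit-learn; the synthetic features and quality scores below only stand in for the real FetMRQC$_{SR}$ metrics and expert ratings.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 100))                           # one row per SRR volume, 100 IQA metrics
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=400)     # synthetic stand-in for expert quality scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out volumes:", round(model.score(X_te, y_te), 3))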

Authors:Zhenxuan Zeng, Qiao Wu, Xiyu Zhang, Lin Yuanbo Wu, Pei An, Jiaqi Yang, Ji Wang, Peng Wang
Title: Unlocking Generalization Power in LiDAR Point Cloud Registration
Abstract:
In real-world environments, a LiDAR point cloud registration method with robust generalization capabilities (across varying distances and datasets) is crucial for ensuring safety in autonomous driving and other LiDAR-based applications. However, current methods fall short in achieving this level of generalization. To address these limitations, we propose UGP, a pruned framework designed to enhance generalization power for LiDAR point cloud registration. The core insight in UGP is the elimination of cross-attention mechanisms to improve generalization, allowing the network to concentrate on intra-frame feature extraction. Additionally, we introduce a progressive self-attention module to reduce ambiguity in large-scale scenes and integrate Bird's Eye View (BEV) features to incorporate semantic information about scene elements. Together, these enhancements significantly boost the network's generalization performance. We validated our approach through various generalization experiments in multiple outdoor scenes. In cross-distance generalization experiments on KITTI and nuScenes, UGP achieved state-of-the-art mean Registration Recall rates of 94.5% and 91.4%, respectively. In cross-dataset generalization from nuScenes to KITTI, UGP achieved a state-of-the-art mean Registration Recall of 90.9%. Code will be available at https://github.com/peakpang/UGP.
中文: 提出的UGP框架通过消除交叉注意力机制并引入渐进式自注意力与鸟瞰图特征,显著提升了激光雷达点云配准的泛化能力,在跨距离和跨数据集实验中均取得了最优的召回率。
English: The proposed UGP framework enhances LiDAR point cloud registration generalization by removing cross-attention and incorporating progressive self-attention with BEV features, achieving state-of-the-art recall rates in cross-distance and cross-dataset experiments.

Authors:Jinfeng Liu, Lingtong Kong, Bo Li, Dan Xu
Title: GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping
Abstract:
High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes by leveraging multi-view low dynamic range (LDR) images captured at different exposure levels. Current training paradigms with 3D tone mapping often result in unstable HDR reconstruction, while training with 2D tone mapping reduces the model's capacity to fit LDR images. Additionally, the global tone mapper used in existing methods can impede the learning of both HDR and LDR representations. To address these challenges, we present GaussHDR, which unifies 3D and 2D local tone mapping through 3D Gaussian splatting. Specifically, we design a residual local tone mapper for both 3D and 2D tone mapping that accepts an additional context feature as input. We then propose combining the dual LDR rendering results from both 3D and 2D local tone mapping at the loss level. Finally, recognizing that different scenes may exhibit varying balances between the dual results, we introduce uncertainty learning and use the uncertainties for adaptive modulation. Extensive experiments demonstrate that GaussHDR significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.

Authors:Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin
Title: PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
Abstract:
In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential. Code is available at: https://360cvgroup.github.io/PlanGen.

Authors:Yanfeng Li, Kahou Chan, Yue Sun, Chantong Lam, Tong Tong, Zitong Yu, Keren Fu, Xiaohong Liu, Tao Tan
Title: MoEdit: On Learning Quantity Perception for Multi-object Image Editing
Abstract:
Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and as part of the whole image during editing, both of which are crucial for ensuring consistent quantity perception, resulting in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which ensures the distinction and separability of each object attribute by minimizing the in-between interlacing. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency by effective control in editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that our MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and codes will be available at https://github.com/Tear-kitty/MoEdit.
中文摘要:MoEdit是一种无需辅助工具的多对象图像编辑框架,通过特征补偿模块和数量注意力模块确保对象间的区分度与数量一致性,在多对象风格迁移和背景重建等任务中实现了最先进的编辑效果。
English Summary: MoEdit is a novel framework that enables high-quality multi-object image editing by introducing Feature Compensation and Quantity Attention modules to maintain object distinction and quantity consistency without auxiliary tools, achieving state-of-the-art performance.

Authors:Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen
Title: Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing
Abstract:
Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across tasks and the stored video features. Thus, FrameFusionMoE enables effective adaptation to new video content while preserving historical text-video relevance to mitigate catastrophic forgetting. We comprehensively evaluate FrameFusionMoE on two benchmark datasets under various task settings. Results demonstrate that FrameFusionMoE outperforms existing CL and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks when handling continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR.
中文摘要:本文提出了一种名为FrameFusionMoE的新型连续文本-视频检索框架,通过其帧融合适配器和任务感知专家混合组件,在适应新视频内容的同时有效防止灾难性遗忘,相比现有方法展现出更优越的检索性能。
English Summary: This paper introduces a novel Continual Text-to-Video Retrieval framework called FrameFusionMoE, which effectively adapts to new video content while preventing catastrophic forgetting through its Frame Fusion Adapter and Task-Aware Mixture-of-Experts components, demonstrating superior performance over existing methods.

Authors:Yuheng Liang, Zheyu Wang, Feng Liu, Mingzhou Liu, Yu Yao
Title: Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space
Abstract:
Continuous Emotion Recognition (CER) plays a crucial role in intelligent human-computer interaction, mental health monitoring, and autonomous driving. Emotion modeling based on the Valence-Arousal (VA) space enables a more nuanced representation of emotional states. However, existing methods still face challenges in handling long-term dependencies and capturing complex temporal dynamics. To address these issues, this paper proposes a novel emotion recognition model, Mamba-VA, which leverages the Mamba architecture to efficiently model sequential emotional variations in video frames. First, the model employs a Masked Autoencoder (MAE) to extract deep visual features from video frames, enhancing the robustness of temporal information. Then, a Temporal Convolutional Network (TCN) is utilized for temporal modeling to capture local temporal dependencies. Subsequently, Mamba is applied for long-sequence modeling, enabling the learning of global emotional trends. Finally, a fully connected (FC) layer performs regression analysis to predict continuous valence and arousal values. Experimental results on the Valence-Arousal (VA) Estimation task of the 8th competition on Affective Behavior Analysis in-the-wild (ABAW) demonstrate that the proposed model achieves valence and arousal scores of 0.5362 (0.5036) and 0.4310 (0.4119) on the validation (test) set, respectively, outperforming the baseline. The source code is available on GitHub:https://github.com/FreedomPuppy77/Charon.
中文: 本文提出Mamba-VA模型,通过结合掩码自编码器、时序卷积网络和Mamba架构,有效捕捉视频中情感识别的局部与全局时序依赖关系,在效价-唤醒度估计任务上取得了优于基准的性能表现。
English: This paper introduces Mamba-VA, a novel model that combines Masked Autoencoder, Temporal Convolutional Network, and Mamba architecture to effectively capture both local and global temporal dependencies for continuous emotion recognition in videos, achieving superior performance on valence-arousal estimation tasks.

Authors:Jiawei Zhang, Ziyuan Liu, Leon Yan, Gen Li, Yuantao Gu
Title: Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation
Abstract:
Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, with performance degradation occurring under fewer steps which limits their practical applicability. While high-order diffusion ODE solvers have been extensively explored for efficient diffusion sampling without observations, their application to inverse problems remains underexplored due to the diverse forms of inverse algorithms and their need for repeated trajectory correction based on observations. To address this gap, we first introduce a canonical form that decomposes existing diffusion-based inverse algorithms into three modules to unify their analysis. Inspired by the linear subspace search strategy in the design of high-order diffusion ODE solvers, we propose the Learnable Linear Extrapolation (LLE) method, a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm that fits the proposed canonical form. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
中文摘要:本文提出了一种规范形式来统一基于扩散的逆问题算法,并设计了可学习线性外推(LLE)方法,能够普遍提升这些算法在有限步数下的性能表现。
English Summary: This paper introduces a canonical form to unify diffusion-based inverse algorithms and proposes the Learnable Linear Extrapolation (LLE) method to universally enhance their performance with limited computational steps.
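A toy sketch of the core idea, learning linear combination coefficients over the states produced along a short sampling trajectory, is shown below; the "trajectory states" and target are random stand-ins, whereas in the paper the coefficients are fit for a given inverse algorithm and task.

import torch

class LinearExtrapolation(torch.nn.Module):
    def __init__(self, n_states):
        super().__init__()
        w0 = torch.zeros(n_states)
        w0[-1] = 1.0                     # start from "just use the last estimate"
        self.w = torch.nn.Parameter(w0)  # learned combination coefficients

    def forward(self, states):           # states: list of (B, D) tensors from a few-step trajectory
        return sum(c * s for c, s in zip(self.w, states))

states = [torch.randn(4, 16) for _ in range(3)]    # stand-ins for intermediate estimates
target = states[-1] + 0.1 * torch.randn(4, 16)     # stand-in for the desired refined estimate

lle = LinearExtrapolation(n_states=3)
opt = torch.optim.Adam(lle.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = torch.mean((lle(states) - target) ** 2)
    loss.backward()
    opt.step()
print("learned coefficients:", lle.w.detach().tolist())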

Authors:Zhen Qu, Xian Tao, Xinyi Gong, Shichen Qu, Qiyu Chen, Zhengtao Zhang, Xingang Wang, Guiguang Ding
Title: Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
Abstract:
Recently, vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). By leveraging auxiliary data during training, these models can directly perform cross-category anomaly detection on target datasets, such as detecting defects on industrial product surfaces or identifying tumors in organ tissues. Existing approaches typically construct text prompts through either manual design or the optimization of learnable prompt vectors. However, these methods face several challenges: 1) handcrafted prompts require extensive expert knowledge and trial-and-error; 2) single-form learnable prompts struggle to capture complex anomaly semantics; and 3) an unconstrained prompt space limits generalization to unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. Specifically, a prompt flow module is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and improve the model's generalization on unseen categories. These learned distributions are then sampled to generate diverse text prompts, effectively covering the prompt space. Additionally, a residual cross-modal attention (RCA) module is introduced to better align dynamic text embeddings with fine-grained image features. Extensive experiments on 15 industrial and medical datasets demonstrate our method's superior performance. The code is available at https://github.com/xiaozhen228/Bayes-PFL.
中文:近期视觉语言模型在零样本异常检测中表现突出,但面临提示设计的挑战;贝叶斯提示流学习通过将提示空间建模为可学习概率分布并引入残差跨模态注意力模块,在工业和医疗数据集上实现了卓越性能。
English: Recent vision-language models like CLIP show strong zero-shot anomaly detection capabilities but face challenges in prompt design, which Bayes-PFL addresses by modeling prompts as learnable distributions and introducing a residual attention module for better cross-modal alignment, achieving superior results across industrial and medical datasets.
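One way to picture the prompt-distribution idea is a reparameterized Gaussian over prompt context tokens, sampled several times per image; this is a conceptual sketch with assumed context length and embedding size, not the Bayes-PFL prompt flow module itself.

import torch

class PromptDistribution(torch.nn.Module):
    def __init__(self, n_ctx=8, dim=512):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(n_ctx, dim))               # mean prompt context
        self.log_sigma = torch.nn.Parameter(torch.full((n_ctx, dim), -2.0)) # learned spread

    def sample(self, n_samples=4):
        # reparameterization trick: differentiable samples from the prompt distribution
        eps = torch.randn(n_samples, *self.mu.shape)
        return self.mu + eps * self.log_sigma.exp()   # (n_samples, n_ctx, dim)

prompts = PromptDistribution().sample(n_samples=4)
# Each sample would be passed through the text encoder; anomaly scores are then
# averaged over samples so that diverse prompts cover the prompt space.
print(prompts.shape)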

Authors:Chunyi Li, Yuan Tian, Xiaoyue Ling, Zicheng Zhang, Haodong Duan, Haoning Wu, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Guo Lu, Weisi Lin, Guangtao Zhai
Title: Image Quality Assessment: From Human to Machine Preference
Abstract:
Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: Image Quality Assessment for Machine Vision for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences. Project page is on: https://github.com/lcysyzxdxc/MPD.
中文摘要:本文首次提出面向机器视觉的图像质量评估主题,建立了包含大量标注的机器偏好数据库(MPD),验证了现有以人为中心的图像质量评估标准无法准确表征机器偏好,旨在推动该领域从人类偏好向机器偏好的演进。
English Summary: This paper introduces the concept of Image Quality Assessment for Machine Vision, establishing a Machine Preference Database (MPD) to evaluate IQA algorithms based on machine task performance rather than human visual appeal, revealing that current human-centric metrics fail to accurately capture machine preferences.

Authors:Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
Title: VMBench: A Benchmark for Perception-Aligned Video Motion Generation
Abstract:
Video generation has advanced rapidly, improving evaluation methods, yet assessing video's motion remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench--a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: 1) Perception-Driven Motion Evaluation Metrics, we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. 2) Meta-Guided Motion Prompt Generation, a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. 3) Human-Aligned Validation Mechanism, we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench at https://github.com/GD-AIGC/VMBench, setting a new standard for evaluating and advancing motion generation models.
中文:VMBench作为首个从人类感知角度评估视频运动质量的综合性基准,提出了感知对齐的运动指标和多样化运动提示生成方法,通过人工验证显著提升了评估相关性。
English: VMBench is introduced as a comprehensive video motion benchmark featuring perception-aligned metrics and diverse motion prompts to address current limitations in motion evaluation, validated by human preferences with significantly improved correlation.
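The human-aligned validation the abstract reports (Spearman correlation between metric scores and human preferences) can be reproduced in miniature with scipy; the ratings below are synthetic placeholders rather than VMBench annotations.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human = rng.uniform(1, 5, 200)               # human preference ratings per generated video
metric = human + rng.normal(0, 0.8, 200)     # an imperfect automatic motion metric
rho, p = spearmanr(metric, human)            # rank correlation used to validate the metric
print(f"Spearman rho = {rho:.3f} (p = {p:.1e})")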

Authors:Xiangyu Shi, Zerui Li, Wenqi Lyu, Jiatong Xia, Feras Dayoub, Yanyuan Qiao, Qi Wu
Title: SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation
Abstract:
Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. However, current waypoint predictors struggle with spatial awareness, while navigators lack historical reasoning and backtracking capabilities, limiting adaptability. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder, masked cross-attention fusion, and an occupancy-aware loss for better waypoint quality. The navigator incorporates history-aware reasoning and adaptive path planning with backtracking, improving robustness. Experiments on R2R-CE and MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in zero-shot settings, demonstrating competitive results compared to fully supervised methods. Real-world validation on Turtlebot 4 further highlights its adaptability.

Authors:Han Liu, Riqiang Gao, Sasa Grbic
Title: AI-assisted Early Detection of Pancreatic Ductal Adenocarcinoma on Contrast-enhanced CT
Abstract:
Pancreatic ductal adenocarcinoma (PDAC) is one of the most common and aggressive types of pancreatic cancer. However, due to the lack of early and disease-specific symptoms, most patients with PDAC are diagnosed at an advanced disease stage. Consequently, early PDAC detection is crucial for improving patients' quality of life and expanding treatment options. In this work, we develop a coarse-to-fine approach to detect PDAC on contrast-enhanced CT scans. First, we localize and crop the region of interest from the low-resolution images, and then segment the PDAC-related structures at a finer scale. Additionally, we introduce two strategies to further boost detection performance: (1) a data-splitting strategy for model ensembling, and (2) a customized post-processing function. We participated in the PANORAMA challenge and ranked 1st place for PDAC detection with an AUROC of 0.9263 and an AP of 0.7243. Our code and models are publicly available at https://github.com/han-liu/PDAC_detection.
Chinese: 本研究提出一种从粗到精的方法,通过模型集成和定制后处理策略,在CT扫描中检测胰腺导管腺癌(PDAC),并在PANORAMA挑战赛中取得最佳性能。
English: This study presents a coarse-to-fine approach for detecting pancreatic ductal adenocarcinoma (PDAC) on CT scans, achieving top performance in the PANORAMA challenge through model ensembling and customized post-processing.

Authors:Minje Kim, Minjun Kim, Xu Yang
Title: DTA: Dual Temporal-channel-wise Attention for Spiking Neural Networks
Abstract:
Spiking Neural Networks (SNNs) present a more energy-efficient alternative to Artificial Neural Networks (ANNs) by harnessing spatio-temporal dynamics and event-driven spikes. Effective utilization of temporal information is crucial for SNNs, leading to the exploration of attention mechanisms to enhance this capability. Conventional attention operations either apply identical operations or employ non-identical operations across target dimensions. We identify that these approaches provide distinct perspectives on temporal information. To leverage the strengths of both operations, we propose a novel Dual Temporal-channel-wise Attention (DTA) mechanism that integrates both identical/non-identical attention strategies. To the best of our knowledge, this is the first attempt to concentrate on both the correlation and dependency of temporal-channel using both identical and non-identical attention operations. Experimental results demonstrate that the DTA mechanism achieves state-of-the-art performance on both static datasets (CIFAR10, CIFAR100, ImageNet-1k) and dynamic dataset (CIFAR10-DVS), elevating spike representation and capturing complex temporal-channel relationship. We open-source our code: https://github.com/MnJnKIM/DTA-SNN.
中文: 提出的双时序通道注意力机制通过融合相同与非相同注意力操作,有效提升了脉冲表征能力并捕获复杂时序通道关系,在多个数据集上实现了最优性能。
English: The proposed Dual Temporal-channel-wise Attention (DTA) mechanism synergistically combines identical and non-identical attention operations to enhance spike representation and capture complex temporal-channel relationships, achieving state-of-the-art performance across multiple datasets.
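A conceptual PyTorch sketch of combining an identical (timestep-shared) and a non-identical (per-timestep) channel attention for spiking features shaped (T, B, C, H, W) is given below; the module layout, pooling, and fusion by elementwise product are assumptions rather than the released DTA-SNN design.

import torch
import torch.nn as nn

class DualTemporalChannelAttention(nn.Module):
    def __init__(self, t, c, reduction=4):
        super().__init__()
        self.identical = nn.Sequential(                       # same operation applied at every timestep
            nn.Linear(c, c // reduction), nn.ReLU(), nn.Linear(c // reduction, c))
        self.non_identical = nn.Parameter(torch.zeros(t, c))  # separate weighting per timestep

    def forward(self, x):                                     # x: (T, B, C, H, W)
        pooled = x.mean(dim=(-1, -2))                         # (T, B, C) spatially pooled spikes
        a_id = torch.sigmoid(self.identical(pooled))          # (T, B, C) shared channel attention
        a_ni = torch.sigmoid(self.non_identical)[:, None, :]  # (T, 1, C) per-timestep attention
        attn = (a_id * a_ni)[..., None, None]                 # fuse both views, broadcast over H, W
        return x * attn

x = torch.rand(4, 2, 32, 8, 8)                                # T=4 timesteps, batch of 2
print(DualTemporalChannelAttention(t=4, c=32)(x).shape)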

Authors:Wenjie Li, Heng Guo, Yuefeng Hou, Guangwei Gao, Zhanyu Ma
Title: Dual-domain Modulation Network for Lightweight Image Super-Resolution
Abstract:
Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images under limited computational costs. We find that existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show that introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a Dual-domain Modulation Network that integrates both wavelet and Fourier information for enhanced frequency modeling. Unlike existing methods that rely on a single frequency representation, our design combines wavelet-domain modulation via a Wavelet-domain Modulation Transformer (WMT) with global Fourier supervision, enabling complementary spectral learning well-suited for lightweight SR. Experimental results show that our method achieves a comparable PSNR to SRFormer and MambaIR while using less than 50\% and 60\% of their FLOPs and achieving inference speeds 15.4x and 5.4x faster, respectively, demonstrating the effectiveness of our method in terms of both SR quality and lightweight design. Code link: https://github.com/24wenjie-li/DMNet
中文摘要:该研究提出的双域调制网络通过整合小波和傅里叶信息,在轻量化图像超分辨率任务中实现了高频细节与整体结构的平衡重建,在显著降低计算成本的同时获得了更优的恢复质量与推理速度。
English Summary: The proposed Dual-domain Modulation Network integrates wavelet and Fourier information to achieve superior lightweight image super-resolution, balancing high-frequency detail reconstruction with computational efficiency while outperforming existing methods in speed and resource usage.
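The global Fourier supervision mentioned above can be illustrated with a simple frequency-domain term added to a spatial L1 loss; the amplitude/phase formulation and weighting below are assumptions (phase wrap-around is ignored for simplicity), not DMNet's exact objective.

import torch

def fourier_loss(sr, hr):
    # compare super-resolved and ground-truth images in the global Fourier domain
    sr_f = torch.fft.fft2(sr, norm="ortho")
    hr_f = torch.fft.fft2(hr, norm="ortho")
    amp = torch.mean(torch.abs(torch.abs(sr_f) - torch.abs(hr_f)))
    pha = torch.mean(torch.abs(torch.angle(sr_f) - torch.angle(hr_f)))  # wrap-around ignored in this sketch
    return amp + pha

def total_loss(sr, hr, lam=0.05):
    return torch.mean(torch.abs(sr - hr)) + lam * fourier_loss(sr, hr)

sr = torch.rand(1, 3, 64, 64, requires_grad=True)
hr = torch.rand(1, 3, 64, 64)
total_loss(sr, hr).backward()
print(sr.grad.shape)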

Authors:Shu Wang, Yanbo Gao, Shuai Li, Chong Lv, Xun Cai, Chuankun Li, Hui Yuan, Jinglin Zhang
Title: MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation
Abstract:
This paper presents MetricGrids, a novel grid-based neural representation that combines elementary metric grids in various metric spaces to approximate complex nonlinear signals. While grid-based representations are widely adopted for their efficiency and scalability, the existing feature grids with linear indexing for continuous-space points can only provide degenerate linear latent space representations, and such representations cannot be adequately compensated to represent complex nonlinear signals by the following compact decoder. To address this problem while keeping the simplicity of a regular grid structure, our approach builds upon the standard grid-based paradigm by constructing multiple elementary metric grids as high-order terms to approximate complex nonlinearities, following the Taylor expansion principle. Furthermore, we enhance model compactness with hash encoding based on different sparsities of the grids to prevent detrimental hash collisions, and a high-order extrapolation decoder to reduce explicit grid storage requirements. Experimental results on both 2D and 3D reconstructions demonstrate the superior fitting and rendering accuracy of the proposed method across diverse signal types, validating its robustness and generalizability. Code is available at https://github.com/wangshu31/MetricGrids.
Chinese: 本文提出MetricGrids,一种基于网格的神经表示方法,通过组合多个基本度量网格来近似复杂非线性信号,并利用优化的哈希编码和高阶解码器在2D和3D重建中实现了卓越的精度。
English: This paper introduces MetricGrids, a grid-based neural representation that combines multiple elementary metric grids to approximate complex nonlinear signals, achieving superior accuracy in 2D and 3D reconstructions through enhanced hash encoding and a high-order decoder.
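As a loose analogy for combining elementary grids as higher-order correction terms, the toy below fits a stack of 1D linear-interpolation grids to a nonlinear signal, each grid absorbing the residual left by the previous ones; this is only meant to convey the Taylor-expansion intuition and is not the MetricGrids construction.

import numpy as np

def fit_grid(x, y, n_nodes):
    # least-squares node values for a uniform grid under linear (hat-function) interpolation
    nodes = np.linspace(0, 1, n_nodes)
    basis = np.maximum(0, 1 - np.abs(x[:, None] - nodes[None]) * (n_nodes - 1))
    vals, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return nodes, vals

def eval_grid(x, nodes, vals):
    return np.interp(x, nodes, vals)

x = np.linspace(0, 1, 512)
signal = np.sin(8 * np.pi * x) * np.exp(-2 * x)
approx, residual = np.zeros_like(x), signal.copy()
for n_nodes in (8, 16, 32):                      # coarse-to-fine elementary grids
    nodes, vals = fit_grid(x, residual, n_nodes)
    approx += eval_grid(x, nodes, vals)
    residual = signal - approx
print("final MSE:", float(np.mean(residual ** 2)))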

Authors:Lin Tian, Sean I. Young, Jonathan Williams Ramirez, Dina Zemlyanker, Lucas Jacob Deden Binder, Rogeny Herisse, Theresa R. Connors, Derek H. Oakley, Bradley T. Hyman, Oula Puonti, Matthew S. Rosen, Juan Eugenio Iglesias
Title: Reference-Free 3D Reconstruction of Brain Dissection Photographs with Machine Learning
Abstract:
Correlation of neuropathology with MRI has the potential to transfer microscopic signatures of pathology to in vivo scans. Recently, a classical registration method has been proposed, to build these correlations from 3D reconstructed stacks of dissection photographs, which are routinely taken at brain banks. These photographs bypass the need for ex vivo MRI, which is not widely accessible. However, this method requires a full stack of brain slabs and a reference mask (e.g., acquired with a surface scanner), which severely limits the applicability of the technique. Here we propose RefFree, a dissection photograph reconstruction method without external reference. RefFree is a learning approach that estimates the 3D coordinates in the atlas space for every pixel in every photograph; simple least-squares fitting can then be used to compute the 3D reconstruction. As a by-product, RefFree also produces an atlas-based segmentation of the reconstructed stack. RefFree is trained on synthetic photographs generated from digitally sliced 3D MRI data, with randomized appearance for enhanced generalization ability. Experiments on simulated and real data show that RefFree achieves performance comparable to the baseline method without an explicit reference while also enabling reconstruction of partial stacks. Our code is available at https://github.com/lintian-a/reffree.
中文:RefFree方法无需外部参考即可实现脑解剖照片的三维重建,它通过基于合成MRI数据训练的学习方法达到相近性能,并能处理不完整的图像序列。
English: The proposed RefFree method enables 3D reconstruction of brain dissection photographs without requiring external references, using a learning approach trained on synthetic MRI data to achieve comparable performance while supporting partial stacks.
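To make the least-squares reconstruction step above concrete, here is a minimal sketch that fits one affine map per photograph from pixel coordinates to the network-predicted atlas coordinates. The per-photograph affine parameterization and the synthetic data are assumptions for illustration only, not the authors' exact formulation.

```python
import numpy as np

def fit_affine_to_atlas(pixel_uv: np.ndarray, atlas_xyz: np.ndarray) -> np.ndarray:
    """Least-squares fit of a 2D->3D affine map [u, v, 1] -> [x, y, z].

    pixel_uv:  (N, 2) pixel coordinates in one dissection photograph.
    atlas_xyz: (N, 3) atlas-space coordinates predicted for those pixels.
    Returns a (3, 3) matrix A such that atlas_xyz ~= [u, v, 1] @ A.
    """
    ones = np.ones((pixel_uv.shape[0], 1))
    P = np.hstack([pixel_uv, ones])            # homogeneous pixel coordinates
    A, *_ = np.linalg.lstsq(P, atlas_xyz, rcond=None)
    return A

# Toy usage with synthetic predictions (stand-ins for network outputs).
rng = np.random.default_rng(0)
uv = rng.uniform(0, 512, size=(1000, 2))
true_A = np.array([[0.1, 0.0, 0.0],
                   [0.0, 0.1, 0.0],
                   [5.0, 5.0, 30.0]])          # slice position/offset in atlas space
xyz = np.hstack([uv, np.ones((1000, 1))]) @ true_A + rng.normal(0, 0.05, (1000, 3))
A_hat = fit_affine_to_atlas(uv, xyz)
print(np.round(A_hat, 2))                      # close to true_A
```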

Authors:Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, Xiangmin Xu
Title: Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Abstract:
Text-to-image person re-identification (ReID) aims to retrieve the images of an interested person based on textual descriptions. One main challenge for this task is the high cost in manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.
中文: 本文提出了一种人类标注者建模方法,通过聚类风格特征并应用提示学习,使多模态大语言模型能够模仿多样的人类描述风格,从而通过自动大规模数据库标注显著提升了文本到图像行人重识别模型的泛化能力。
English: This paper introduces a Human Annotator Modeling (HAM) approach that enables Multi-modal Large Language Models to mimic diverse human description styles by clustering style features and using prompt learning, thereby enhancing the generalization ability of text-to-image person re-identification models through automated large-scale database annotation.
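A minimal sketch of the clustering and uniform-sampling steps described above: style features from human captions are grouped with k-means (one prompt per cluster), and extra prototypes are drawn uniformly inside the bounding box of the style space to enrich diversity. The feature dimensionality, cluster counts, and random features are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Stand-in for style features extracted from human-written descriptions.
style_feats = rng.normal(size=(5000, 64))

# Group descriptions with similar styles into clusters; one prompt per cluster.
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(style_feats)
cluster_prototypes = kmeans.cluster_centers_            # (32, 64)

# Enrich diversity: sample additional prototypes uniformly in the style space.
lo, hi = style_feats.min(axis=0), style_feats.max(axis=0)
extra_prototypes = rng.uniform(lo, hi, size=(16, 64))

all_prototypes = np.vstack([cluster_prototypes, extra_prototypes])
print(all_prototypes.shape)  # (48, 64): prototypes that condition caption generation
```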

Authors:Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
Title: UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Abstract:
With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at https://github.com/bytedance/UVE.
中文: 本研究提出采用多模态大语言模型作为AI生成视频的统一评估器,通过UVE-Bench基准测试证明其虽不及人工评估,但已显著超越现有专业评估方法。
English: This study proposes using multimodal large language models (MLLMs) as unified evaluators for AI-generated videos, introducing UVE-Bench benchmark to demonstrate their superior performance over specialized methods while still trailing human assessment.

Authors:Yasheng Sun, Zhiliang Xu, Hang Zhou, Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Borong Liang, Yingying Li, Haocheng Feng, Jingdong Wang, Ziwei Liu, Koike Hideki
Title: Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers
Abstract:
Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.
中文:Cosh-DiT系统采用混合扩散变换器,通过音频生成同步动作再转化为逼真视频,实现了语音与生动手势的无缝对齐,能合成具有自然流畅动作和丰富表情的讲话手势视频。
English: The proposed Cosh-DiT system employs hybrid Diffusion Transformers to synthesize lifelike co-speech gesture videos by first generating synchronized motion from audio and then converting it into realistic video, achieving seamless alignment between speech and expressive gestures.

Authors:Daniel Syomichev, Padmini Gopinath, Guang-Lin Wei, Eric Chang, Ian Gordon, Amanuel Seifu, Rahul Pemmaraju, Neehar Peri, James Purtilo
Title: QuickDraw: Fast Visualization, Analysis and Active Learning for Medical Image Segmentation
Abstract:
Analyzing CT scans, MRIs and X-rays is pivotal in diagnosing and treating diseases. However, detecting and identifying abnormalities from such medical images is a time-intensive process that requires expert analysis and is prone to interobserver variability. To mitigate such issues, machine learning-based models have been introduced to automate and significantly reduce the cost of image segmentation. Despite significant advances in medical image analysis in recent years, many of the latest models are never applied in clinical settings because state-of-the-art models do not easily interface with existing medical image viewers. To address these limitations, we propose QuickDraw, an open-source framework for medical image visualization and analysis that allows users to upload DICOM images and run off-the-shelf models to generate 3D segmentation masks. In addition, our tool allows users to edit, export, and evaluate segmentation masks to iteratively improve state-of-the-art models through active learning. In this paper, we detail the design of our tool and present survey results that highlight the usability of our software. Notably, we find that QuickDraw reduces the time to manually segment a CT scan from four hours to six minutes and reduces machine learning-assisted segmentation time by 10% compared to prior work. Our code and documentation are available at https://github.com/qd-seg/quickdraw
中文:QuickDraw 是一个开源框架,支持医学影像可视化和分析,用户可上传 DICOM 图像运行模型生成 3D 分割掩码,并通过主动学习迭代优化结果,将 CT 扫描分割时间从四小时缩短至六分钟。
English: QuickDraw is an open-source framework that enables medical image visualization and analysis by allowing users to upload DICOM images, run models for 3D segmentation, and iteratively improve results through active learning, reducing CT scan segmentation time from four hours to six minutes.

Authors:Héctor Laria, Alexandra Gomez-Villa, Jiang Qin, Muhammad Atif Butt, Bogdan Raducanu, Javier Vazquez-Corral, Joost van de Weijer, Kai Wang
Title: Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models
Abstract:
Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.
中文摘要:ColorWave是一种无需训练的新方法,通过重构交叉注意力机制在扩散模型中实现精确的RGB色彩控制,无需微调即可超越现有方法的准确性和适用性。
English Summary: ColorWave is a training-free method that enables precise RGB color control in diffusion models by rewiring cross-attention mechanisms, outperforming previous approaches in accuracy and versatility without requiring fine-tuning.

Authors:Nahid Ul Islam, DongAo Ma, Jiaxuan Pang, Shivasakthi Senthil Velan, Michael Gotway, Jianming Liang
Title: Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis
Abstract:
Developing robust and versatile deep-learning models is essential for enhancing diagnostic accuracy and guiding clinical interventions in medical imaging, but it requires a large amount of annotated data. The advancement of deep learning has facilitated the creation of numerous medical datasets with diverse expert-level annotations. Aggregating these datasets can maximize data utilization and address the inadequacy of labeled data. However, the heterogeneity of expert-level annotations across tasks such as classification, localization, and segmentation presents a significant challenge for learning from these datasets. To this end, we introduce Foundation X, an end-to-end framework that utilizes diverse expert-level annotations from numerous public datasets to train a foundation model capable of multiple tasks including classification, localization, and segmentation. To address the challenges of annotation and task heterogeneity, we propose a Lock-Release pretraining strategy to enhance the cyclic learning from multiple datasets, combined with the student-teacher learning paradigm, ensuring the model retains general knowledge for all tasks while preventing overfitting to any single task. To demonstrate the effectiveness of Foundation X, we trained a model using 11 chest X-ray datasets, covering annotations for classification, localization, and segmentation tasks. Our experimental results show that Foundation X achieves notable performance gains through extensive annotation utilization, excels in cross-dataset and cross-task learning, and further enhances performance in organ localization and segmentation tasks. All code and pretrained models are publicly accessible at https://github.com/jlianglab/Foundation_X.
中文: Foundation X框架采用锁定-释放预训练策略和师生学习模式,整合多个医学数据集的专家标注,训练出能同时胜任分类、定位和分割任务的基础模型,有效避免过拟合并提升跨数据集学习能力。
English: The Foundation X framework leverages diverse expert annotations from multiple medical datasets through a Lock-Release pretraining strategy and student-teacher learning to train a versatile foundation model that excels in classification, localization, and segmentation tasks while preventing overfitting.

Authors:Benjamin Towle, Xin Chen, Ke Zhou
Title: SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM
Abstract:
Pre-trained segmentation models are a powerful and flexible tool for segmenting images. Recently, this trend has extended to medical imaging. Yet, often these methods only produce a single prediction for a given image, neglecting inherent uncertainty in medical images, due to unclear object boundaries and errors caused by the annotation tool. Multiple Choice Learning is a technique for generating multiple masks, through multiple learned prediction heads. However, this cannot readily be extended to producing more outputs than its initial pre-training hyperparameters, as the sparse, winner-takes-all loss function makes it easy for one prediction head to become overly dominant, thus not guaranteeing the clinical relevancy of each mask produced. We introduce SeqSAM, a sequential, RNN-inspired approach to generating multiple masks, which uses a bipartite matching loss for ensuring the clinical relevancy of each mask, and can produce an arbitrary number of masks. We show notable improvements in quality of each mask produced across two publicly available datasets. Our code is available at https://github.com/BenjaminTowle/SeqSAM.
中文: SeqSAM提出了一种基于循环神经网络启发的顺序方法,通过双向匹配损失生成多个临床相关的医学图像分割掩码,在两个公开数据集上显著提升了掩码质量。
English: SeqSAM introduces a sequential, RNN-inspired method that generates multiple clinically relevant masks for medical image segmentation using a bipartite matching loss, demonstrating improved mask quality across datasets.
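As a concrete illustration of bipartite matching between predictions and ground truth, the sketch below pairs masks with the Hungarian algorithm on a (1 - Dice) cost matrix; the Dice-based cost and toy masks are assumptions, not SeqSAM's exact loss terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dice(a: np.ndarray, b: np.ndarray, eps: float = 1e-6) -> float:
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def match_masks(preds: np.ndarray, gts: np.ndarray):
    """Hungarian matching of predicted masks to ground-truth masks.

    preds: (P, H, W) binary predictions, gts: (G, H, W) binary ground truths.
    Returns index pairs (pred_idx, gt_idx) minimizing the total (1 - Dice) cost.
    """
    cost = np.array([[1.0 - dice(p, g) for g in gts] for p in preds])
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx)), cost[pred_idx, gt_idx].mean()

# Toy example: three predicted masks matched against two ground-truth masks.
rng = np.random.default_rng(0)
preds = rng.random((3, 64, 64)) > 0.5
gts = rng.random((2, 64, 64)) > 0.5
pairs, mean_cost = match_masks(preds, gts)
print(pairs, round(float(mean_cost), 3))
```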

Authors:Zhiyuan Zhang, Dongdong Chen, Jing Liao
Title: I2V3D: Controllable image-to-video generation with 3D guidance
Abstract:
We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.

Authors:Shitong Shao, Zikai Zhou, Dian Xie, Yuetong Fang, Tian Ye, Lichen Bai, Zeke Xie
Title: CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
Abstract:
Making text-to-image (T2I) generative models sample both fast and well represents a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods have not been able to ensure stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then uses the collected data to train a weak model that reflects the easy-to-learn contents while reducing the number of function evaluations during inference by half. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, while achieving a 5.64s time saving using SD3.5. Code is released at https://github.com/xie-lab-ml/CoRe/tree/main.
中文: 本文提出CoRe^2这一即插即用推理方法,通过减少函数评估次数并采用弱到强指导来提升文本到图像模型的输出质量,在多种扩散模型和自回归模型上均实现了高效与高性能的统一。
English: This paper introduces CoRe^2, a plug-and-play inference method that enhances text-to-image models by reducing function evaluations and improving output quality through weak-to-strong guidance, achieving efficiency and effectiveness across various diffusion and autoregressive models.
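The weak-to-strong refinement can be sketched as a guidance-style extrapolation from the weak model's prediction toward the base model's conditional prediction; the guidance scale and the stand-in model outputs below are hypothetical placeholders, not the released implementation.

```python
import numpy as np

def weak_to_strong_guidance(weak_pred: np.ndarray,
                            strong_cond_pred: np.ndarray,
                            scale: float = 1.5) -> np.ndarray:
    """Refine the conditional output by extrapolating from the weak model
    toward the strong (base) model, analogous to classifier-free guidance."""
    return weak_pred + scale * (strong_cond_pred - weak_pred)

# Toy denoiser outputs at one sampling step (stand-ins for real model calls).
rng = np.random.default_rng(0)
weak = rng.normal(size=(4, 64, 64))        # cheap weak model learned from CFG trajectories
strong = rng.normal(size=(4, 64, 64))      # base model, conditional branch only
refined = weak_to_strong_guidance(weak, strong, scale=1.5)
print(refined.shape)
```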

Authors:Itay Chachy, Guy Yariv, Sagie Benaim
Title: RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling
Abstract:
Score Distillation Sampling (SDS) has emerged as an effective technique for leveraging 2D diffusion priors for tasks such as text-to-3D generation. While powerful, SDS struggles with achieving fine-grained alignment to user intent. To overcome this, we introduce RewardSDS, a novel approach that weights noise samples based on alignment scores from a reward model, producing a weighted SDS loss. This loss prioritizes gradients from noise samples that yield aligned high-reward output. Our approach is broadly applicable and can extend SDS-based methods. In particular, we demonstrate its applicability to Variational Score Distillation (VSD) by introducing RewardVSD. We evaluate RewardSDS and RewardVSD on text-to-image, 2D editing, and text-to-3D generation tasks, showing significant improvements over SDS and VSD on a diverse set of metrics measuring generation quality and alignment to desired reward models, enabling state-of-the-art performance. Project page is available at https://itaychachy.github.io/reward-sds/.
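A minimal sketch of the reward-weighting idea, assuming a softmax over reward-model scores is used to weight per-noise-sample loss contributions; the temperature and the toy losses/rewards are illustrative assumptions.

```python
import numpy as np

def reward_weighted_loss(per_sample_losses: np.ndarray,
                         rewards: np.ndarray,
                         temperature: float = 0.5) -> float:
    """Weight per-noise-sample SDS losses by reward-model alignment scores."""
    logits = rewards / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax over noise samples
    return float((weights * per_sample_losses).sum())

# Toy example: 8 noise samples, each with an SDS loss and a reward score.
rng = np.random.default_rng(0)
losses = rng.uniform(0.5, 2.0, size=8)            # stand-ins for per-sample SDS losses
rewards = rng.uniform(-1.0, 1.0, size=8)          # stand-ins for reward-model scores
print(round(reward_weighted_loss(losses, rewards), 4))
```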

Authors:Nannan Wu, Zhuo Kuang, Zengqiang Yan, Ping Wang, Li Yu
Title: Fair Federated Medical Image Classification Against Quality Shift via Inter-Client Progressive State Matching
Abstract:
Despite the potential of federated learning in medical applications, inconsistent imaging quality across institutions-stemming from lower-quality data from a minority of clients-biases federated models toward more common high-quality images. This raises significant fairness concerns. Existing fair federated learning methods have demonstrated some effectiveness in solving this problem by aligning a single 0th- or 1st-order state of convergence (e.g., training loss or sharpness). However, we argue in this work that fairness based on such a single state is still not an adequate surrogate for fairness during testing, as these single metrics fail to fully capture the convergence characteristics, making them suboptimal for guiding fair learning. To address this limitation, we develop a generalized framework. Specifically, we propose assessing convergence using multiple states, defined as sharpness or perturbed loss computed at varying search distances. Building on this comprehensive assessment, we propose promoting fairness for these states across clients to achieve our ultimate fairness objective. This is accomplished through the proposed method, FedISM+. In FedISM+, the search distance evolves over time, progressively focusing on different states. We then incorporate two components in local training and global aggregation to ensure cross-client fairness for each state. This gradually makes convergence equitable for all states, thereby improving fairness during testing. Our empirical evaluations, performed on the well-known RSNA ICH and ISIC 2019 datasets, demonstrate the superiority of FedISM+ over existing state-of-the-art methods for fair federated learning. The code is available at https://github.com/wnn2000/FFL4MIA.
中文: 提出的FedISM+框架通过评估多个收敛状态并动态调整搜索距离来解决联邦学习中的公平性问题,在医学影像数据集上相比现有方法展现出更优性能。
English: The proposed FedISM+ framework addresses fairness limitations in federated learning by evaluating multiple convergence states across clients and dynamically adjusting search distances during training, demonstrating superior performance on medical imaging datasets compared to existing methods.

Authors:Lingmin Ran, Mike Zheng Shou
Title: TPDiff: Temporal Pyramid Video Diffusion Model
Abstract:
The development of video diffusion models unveils a significant challenge: the substantial computational demands. To mitigate this challenge, we note that the reverse process of diffusion exhibits an inherent entropy-reducing nature. Given the inter-frame redundancy in video modality, maintaining full frame rates in high-entropy stages is unnecessary. Based on this insight, we propose TPDiff, a unified framework to enhance training and inference efficiency. By dividing diffusion into several stages, our framework progressively increases frame rate along the diffusion process with only the last stage operating on full frame rate, thereby optimizing computational efficiency. To train the multi-stage diffusion model, we introduce a dedicated training framework: stage-wise diffusion. By solving the partitioned probability flow ordinary differential equations (ODE) of diffusion under aligned data and noise, our training strategy is applicable to various diffusion forms and further enhances training efficiency. Comprehensive experimental evaluations validate the generality of our method, demonstrating 50% reduction in training cost and 1.5x improvement in inference efficiency.
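To illustrate the progressive frame-rate schedule, the sketch below maps a diffusion timestep to the number of frames the model operates on, reaching the full frame rate only in the final low-noise stage; the stage boundaries and frame counts are assumptions for illustration.

```python
def frames_for_timestep(t: int,
                        total_steps: int = 1000,
                        full_frames: int = 16,
                        stage_boundaries=(0.75, 0.5, 0.25),
                        stage_frames=(2, 4, 8)) -> int:
    """Return how many frames the model operates on at diffusion timestep t.

    High-noise (high-entropy) stages use fewer frames; only the last stage
    runs at the full frame rate.
    """
    frac = t / total_steps                      # 1.0 = pure noise, 0.0 = clean
    for boundary, n_frames in zip(stage_boundaries, stage_frames):
        if frac >= boundary:
            return n_frames
    return full_frames

# Example schedule over the reverse process.
for t in (999, 700, 400, 100):
    print(t, frames_for_timestep(t))
# -> 2 frames at t=999, 4 at t=700, 8 at t=400, 16 (full rate) at t=100
```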

Authors:Peng Chen, Pi Bu, Yingyao Wang, Xinyi Wang, Ziming Wang, Jie Guo, Yingxiu Zhao, Qi Zhu, Jun Song, Siran Yang, Jiamang Wang, Bo Zheng
Title: CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
Abstract:
Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at https://combatvla.github.io/.
中文:CombatVLA是一种专为3D游戏战斗任务优化的高效视觉语言动作模型,在基准测试中不仅性能超越现有模型,还实现了50倍加速和高于人类玩家的任务成功率。
English: CombatVLA is an efficient 3B Vision-Language-Action model optimized for combat tasks in 3D games, achieving superior performance on benchmarks with 50x faster combat speed and higher success rates than humans.

Authors:Nazanin Moradinasab, Saurav Sengupta, Jiebei Liu, Sana Syed, Donald E. Brown
Title: Towards Robust Multimodal Representation: A Unified Approach with Adaptive Experts and Alignment
Abstract:
Healthcare relies on multiple types of data, such as medical images, genetic information, and clinical records, to improve diagnosis and treatment. However, missing data is a common challenge due to privacy restrictions, cost, and technical issues, making many existing multi-modal models unreliable. To address this, we propose a new multi-modal model called Mixture of Experts, Symmetric Aligning, and Reconstruction (MoSARe), a deep learning framework that handles incomplete multimodal data while maintaining high accuracy. MoSARe integrates expert selection, cross-modal attention, and contrastive learning to improve feature representation and decision-making. Our results show that MoSARe outperforms existing models when the data is complete. Furthermore, it provides reliable predictions even when some data are missing. This makes it especially useful in real-world healthcare settings, including resource-limited environments. Our code is publicly available at https://github.com/NazaninMn/MoSARe.
中文: 提出的MoSARe模型通过专家选择和跨模态学习,能有效处理不完整的多模态医疗数据,在数据缺失时仍保持高精度并优于现有方法。
English: The proposed MoSARe model effectively handles incomplete multimodal healthcare data through expert selection and cross-modal learning, maintaining high accuracy and outperforming existing methods even with missing information.

Authors:Zhihua Tian, Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Ruoxi Jia, Jiaheng Zhang
Title: Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models
Abstract:
Text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images but also raise people's concerns about generating harmful or misleading content. While extensive approaches have been proposed to erase unwanted concepts without requiring retraining from scratch, they inadvertently degrade performance on normal generation tasks. In this work, we propose Interpret then Deactivate (ItD), a novel framework to enable precise concept removal in T2I diffusion models while preserving overall performance. ItD first employs a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. By permanently deactivating the specific features associated with target concepts, we repurpose SAE as a zero-shot classifier that identifies whether the input prompt includes target concepts, allowing selective concept erasure in diffusion models. Moreover, we demonstrate that ItD can be easily extended to erase multiple concepts without requiring further training. Comprehensive experiments across celebrity identities, artistic styles, and explicit content demonstrate ItD's effectiveness in eliminating targeted concepts without interfering with normal concept generation. Additionally, ItD is also robust against adversarial prompts designed to circumvent content filters. Code is available at: https://github.com/NANSirun/Interpret-then-deactivate.
中文: 本文提出"解释后停用"(ItD)新框架,通过将概念解析为特征组合并停用目标特征,能在文本到图像生成模型中精确移除特定概念,同时保持整体生成质量并有效抵御对抗性提示攻击。
English: This paper introduces Interpret then Deactivate (ItD), a novel framework that precisely removes unwanted concepts from text-to-image diffusion models by interpreting them as feature combinations and deactivating specific features, effectively preserving overall generation quality and resisting adversarial prompts.
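A minimal sketch of repurposing a sparse autoencoder as a zero-shot concept filter: an embedding is encoded into sparse features, the features tied to a target concept are used for detection and are permanently zeroed for erasure before decoding. The SAE weights, feature indices, and threshold are hypothetical placeholders, not the released model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 128, 1024

# Hypothetical pretrained SAE parameters (stand-ins for learned weights).
W_enc = rng.normal(0, 0.05, size=(d_model, d_sae))
W_dec = rng.normal(0, 0.05, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

TARGET_FEATURES = np.array([3, 57, 412])   # features tied to the concept to erase
THRESHOLD = 0.1                            # activation threshold for detection

def encode(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W_enc + b_enc, 0.0)        # ReLU sparse features

def contains_target_concept(x: np.ndarray) -> bool:
    """Zero-shot check: does the embedding activate the target-concept features?"""
    return bool(encode(x)[TARGET_FEATURES].sum() > THRESHOLD)

def erase_and_decode(x: np.ndarray) -> np.ndarray:
    """Deactivate target-concept features, then reconstruct the embedding."""
    h = encode(x)
    h[TARGET_FEATURES] = 0.0
    return h @ W_dec

prompt_embedding = rng.normal(size=d_model)           # stand-in prompt embedding
print(contains_target_concept(prompt_embedding))
print(erase_and_decode(prompt_embedding).shape)       # (128,)
```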

Authors:Kechun Xu, Xunlong Xia, Kaixuan Wang, Yifei Yang, Yunxuan Mao, Bing Deng, Jieping Ye, Rong Xiong, Yue Wang
Title: Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
Abstract:
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A^2, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and codes are available at https://xukechun.github.io/papers/A2.
中文: 本文提出A²方法,通过整合视觉、语言和动作先验,实现了在杂乱环境中高效执行语言引导的抓取放置任务,仅需少量数据即可泛化至未见过的物体和指令。
English: This paper introduces A², an action prior alignment method that integrates vision, language, and action priors to enable efficient language-conditioned pick-and-place tasks in clutter with minimal data and strong generalization to unseen objects and instructions.

Authors:Yifan Zhou, Zeqi Xiao, Shuai Yang, Xingang Pan
Title: Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
Abstract:
Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation.
中文摘要:本研究重新设计隐扩散模型以实现平移等变性,通过改进注意力模块和引入等变性损失解决不稳定性与混叠问题,提出的无混叠隐扩散模型在视频编辑等应用中能稳定生成一致结果。
English Summary: The study redesigns Latent Diffusion Models to achieve shift-equivariance, addressing instability and aliasing issues through modified attention modules and an equivariance loss, resulting in the alias-free LDM that ensures consistent outputs for applications like video editing.
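To make the shift-equivariance objective concrete, this sketch measures how far an operator is from commuting with a fractional circular shift (implemented with an FFT phase ramp); the toy blur operator and shift amount are assumptions, not the AF-LDM modules.

```python
import numpy as np

def fractional_shift(x: np.ndarray, shift: float, axis: int = -1) -> np.ndarray:
    """Circularly shift a signal by a (possibly fractional) amount via an FFT phase ramp."""
    n = x.shape[axis]
    freqs = np.fft.fftfreq(n)
    phase = np.exp(-2j * np.pi * freqs * shift)
    shape = [1] * x.ndim
    shape[axis] = n
    return np.fft.ifft(np.fft.fft(x, axis=axis) * phase.reshape(shape), axis=axis).real

def equivariance_loss(f, x: np.ndarray, shift: float) -> float:
    """Mean squared discrepancy between f(shift(x)) and shift(f(x))."""
    return float(np.mean((f(fractional_shift(x, shift)) - fractional_shift(f(x), shift)) ** 2))

# Toy operator: a circular, band-limited blur (itself shift-equivariant, so the loss is ~0).
def blur(x: np.ndarray) -> np.ndarray:
    k = np.array([0.25, 0.5, 0.25])
    return np.apply_along_axis(
        lambda r: np.convolve(np.pad(r, 1, mode="wrap"), k, mode="valid"), -1, x)

img = np.random.default_rng(0).normal(size=(32, 32))
print(equivariance_loss(blur, img, shift=0.37))   # close to zero for this operator
```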

Authors:Kevin Qinghong Lin, Mike Zheng Shou
Title: VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Abstract:
Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations: (i) A generative retrieval model, marrying the language model's complex reasoning capabilities with contrastive retrieval's flexible updating of the narration vocabulary. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Codes are released at https://github.com/showlab/VLog.
中文摘要:VLog是一种创新的视频理解框架,将视频叙述定义为词汇表,结合生成推理与对比检索,通过分层事件索引生成简洁准确的视频描述。
English Summary: VLog is a novel video understanding framework that treats video narrations as a vocabulary, integrating generative reasoning with contrastive retrieval and hierarchical event indexing to produce concise and accurate narrations.

Authors:Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel
Title: ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
Abstract:
Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.
Chinese: ForAug是一种新颖的数据增强方法,通过分离和重组前景与背景来提升图像多样性并减少Vision Transformer的偏差,显著提高了其在ImageNet及下游任务中的准确性和鲁棒性。
English: ForAug is a novel data augmentation method that enhances image diversity and mitigates biases in Vision Transformers, significantly boosting their accuracy and robustness on tasks like ImageNet and downstream applications.
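A minimal sketch of the foreground/background recombination step: a segmented foreground is pasted onto a different background at a chosen location using its binary mask. The random crops, mask, and placement policy are simplifications; the actual pipeline uses pretrained foundation models for the separation step.

```python
import numpy as np

def recombine(foreground: np.ndarray, fg_mask: np.ndarray,
              background: np.ndarray, top: int, left: int) -> np.ndarray:
    """Paste a masked foreground patch onto a background image at (top, left)."""
    out = background.copy()
    h, w = fg_mask.shape
    region = out[top:top + h, left:left + w]
    mask = fg_mask[..., None].astype(out.dtype)
    out[top:top + h, left:left + w] = mask * foreground + (1.0 - mask) * region
    return out

rng = np.random.default_rng(0)
background = rng.uniform(size=(224, 224, 3))           # stand-in background image
foreground = rng.uniform(size=(64, 64, 3))             # stand-in segmented object crop
fg_mask = np.zeros((64, 64)); fg_mask[8:56, 8:56] = 1  # stand-in segmentation mask

top = int(rng.integers(0, 224 - 64))
left = int(rng.integers(0, 224 - 64))
augmented = recombine(foreground, fg_mask, background, top, left)
print(augmented.shape)  # (224, 224, 3): new training sample with controlled composition
```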

Authors:Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller
Title: PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling
Abstract:
We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we conduct a comprehensive comparison of recent autoregressive methods (VAR and MaskGIT) for entropy modeling and evaluate our approach on the large-scale MSCOCO-30k benchmark. Compared to previous work, PerCoV2 (i) achieves higher image fidelity at even lower bit-rates while maintaining competitive perceptual quality, (ii) features a hybrid generation mode for further bit-rate savings, and (iii) is built solely on public components. Code and trained models will be released at https://github.com/Nikolai10/PerCoV2.
Chinese: PerCoV2 是一种先进的超低码率图像压缩系统,它基于 Stable Diffusion 3 生态系统改进了熵编码,在保持感知质量的同时实现了更高的图像保真度和更低的码率,且完全采用公开组件构建。
English: PerCoV2 is an advanced open-source ultra-low bit-rate image compression system that improves upon previous models by integrating with Stable Diffusion 3 and enhancing entropy coding, achieving higher fidelity and lower bit-rates while using only public components.

Authors:Zhoutong Ye, Mingze Sun, Huan-ang Gao, Chun Yu, Yuanchun Shi
Title: MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
Abstract:
Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, there remains a significant gap between state-of-the-art LMMs and human performance when it comes to complex tasks that require a combination of fundamental VL capabilities, as well as tasks involving the grounding of complex instructions. To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark with complex real-world VL tasks that are challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential to many real-world applications. We evaluate over 20 proprietary and open source LMMs, as well as humans, on MOAT, and found that humans achieved 82.7% accuracy while the best performing LMM (OpenAI o1) achieved only 38.8%. To guide future model development, we analyze common trends in our results and discuss the underlying causes of observed performance gaps between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test time scaling improves performance on MOAT, and how tiling harms LMMs' capability to count. Code and data are available at https://cambrian-yzt.github.io/MOAT.
Chinese: 大型多模态模型在视觉语言任务中展现出潜力,但在需要整合多种基础能力的复杂任务上仍与人类表现存在显著差距,MOAT基准测试显示人类准确率达82.7%而最佳模型仅为38.8%。
English: Large multimodal models show promise in vision-language tasks but still lag significantly behind human performance on complex tasks requiring integrated capabilities, as demonstrated by the MOAT benchmark where humans achieved 82.7% accuracy versus the top model's 38.8%.

Authors:Gorjan Radevski, Teodora Popordanoska, Matthew B. Blaschko, Tinne Tuytelaars
Title: DAVE: Diagnostic benchmark for Audio Visual Evaluation
Abstract:
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: https://github.com/gorjanradevski/dave
中文:DAVE基准数据集通过要求两种模态共同参与正确回答,并分解评估至原子子类别,解决了视听理解中的视觉偏见和综合评分问题,从而精准识别模型缺陷。
English: The DAVE benchmark dataset addresses visual bias and aggregate scoring issues in audio-visual understanding by requiring both modalities for correct answers and providing decoupled evaluation across subcategories to identify specific model weaknesses.

Authors:Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
Title: Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Abstract:
Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capability towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training for improving VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset to select synthetic captions that perform well on certain fine-grained video-caption alignment aspects and are preferred by humans, while disregarding others. Then, we train Cockatiel-13B, using this curated dataset to infuse it with assembled model strengths and human preferences. Finally, we further distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments reflect the effectiveness of our method, as we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin as depicted by the human evaluation results.
中文摘要:本文提出Cockatiel三阶段训练框架,通过整合合成训练与人类偏好对齐来改进视频精细描述任务,有效解决了现有模型的能力偏差和人类偏好失准问题,在维度平衡性能和人类评估方面均实现了最优表现。
English Summary: This paper introduces Cockatiel, a three-stage training pipeline that enhances Video Detailed Captioning by combining synthetic and human-aligned training to overcome model biases and misalignment with human preferences, achieving state-of-the-art results in both balanced performance and human evaluation.

Authors:Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
Title: DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Abstract:
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
Chinese: DitHub采用模块化框架,将专家模块作为分支进行灵活适配,在开放词汇目标检测中实现了ODinW-13和ODinW-O等基准测试的最先进性能。
English: DitHub introduces a modular framework using expert modules as branches for flexible adaptation in open-vocabulary object detection, achieving state-of-the-art results on benchmarks like ODinW-13 and ODinW-O.
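The branch-and-merge idea can be sketched as maintaining a library of per-domain adaptation modules (parameter deltas) and merging the fetched ones by weighted averaging before applying them to the base detector; the module contents, names, and the simple averaging rule are assumptions for illustration.

```python
import numpy as np

# Library of adaptation modules ("branches"): name -> {param_name: weight delta}.
library = {
    "aquarium": {"head.weight": np.full((4, 4), 0.02)},
    "thermal":  {"head.weight": np.full((4, 4), -0.01)},
}

def fetch_and_merge(module_names, weights=None):
    """Merge selected modules by (weighted) averaging of their parameter deltas."""
    if weights is None:
        weights = [1.0 / len(module_names)] * len(module_names)
    merged = {}
    for name, w in zip(module_names, weights):
        for param, delta in library[name].items():
            merged[param] = merged.get(param, 0.0) + w * delta
    return merged

base_params = {"head.weight": np.zeros((4, 4))}
merged = fetch_and_merge(["aquarium", "thermal"])
adapted = {k: base_params[k] + merged[k] for k in base_params}
print(adapted["head.weight"][0, 0])   # 0.005 = average of the two module deltas
```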

Authors:Ruihai Wu, Ziyu Zhu, Yuran Wang, Yue Chen, Jiarui Wang, Hao Dong
Title: GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
Abstract:
Cluttered garments manipulation poses significant challenges due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, while being aware of garment geometry, structure, and inter-object relations. Additionally, as it is difficult to directly retrieve a garment in some extremely entangled clutters, we introduce an adaptation module, guided by learned affordance, to reorganize highly-entangled garments into states plausible for manipulation. Our framework demonstrates effectiveness over environments featuring diverse garment types and pile configurations in both simulation and the real world. Project page: https://garmentpile.github.io/.
中文: 本研究提出一种学习点级可供性框架,通过感知衣物几何与相互关系,在杂乱环境中有效重组纠缠衣物,实现稳定操作,适用于多种衣物类型与堆叠配置。
English: This study introduces a framework that learns point-level affordance to manage complex garment interactions in cluttered environments, effectively reorganizing entangled garments for stable manipulation across diverse scenarios.

Authors:Xiuwen Fang, Mang Ye, Bo Du
Title: Robust Asymmetric Heterogeneous Federated Learning with Corrupted Clients
Abstract:
This paper studies a challenging robust federated learning task with model-heterogeneous and data-corrupted clients, where the clients have different local model structures. Data corruption is unavoidable due to factors such as random noise, compression artifacts, or environmental conditions in real-world deployment, drastically crippling the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models on various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by the mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during the collaborative learning phase, enabling clients to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments. Our code and models are publicly available at https://github.com/FangXiuwen/RAHFL.
中文摘要:本文提出了一种新颖的鲁棒非对称异构联邦学习框架,通过多样性增强的对比学习和非对称协作策略,有效解决了联邦系统中模型异构和数据损坏的挑战。
English Summary: This paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework that employs diversity-enhanced contrastive learning and asymmetric collaboration strategies to address model heterogeneity and data corruption in federated systems.
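A minimal sketch of the supervised contrastive component: mixed-augmentation features are L2-normalized and each anchor is pulled toward same-class samples via the standard supervised contrastive loss. The feature dimensions, labels, and temperature are illustrative; the paper's full method adds the asymmetric collaboration strategy on top.

```python
import numpy as np

def supervised_contrastive_loss(features: np.ndarray, labels: np.ndarray,
                                temperature: float = 0.1) -> float:
    """Standard supervised contrastive loss over a batch of embeddings."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    eye = np.eye(n, dtype=bool)
    # Log-softmax over all other samples for each anchor (self excluded).
    sim_masked = np.where(eye, -np.inf, sim)
    log_prob = sim_masked - np.log(np.exp(sim_masked).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~eye          # same-class pairs
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(per_anchor.mean())

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 32))       # stand-ins for features of mixed-augmented samples
labels = rng.integers(0, 4, size=16)    # class labels
print(round(supervised_contrastive_loss(feats, labels), 4))
```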

Authors:Zicheng Zhang, Haoning Wu, Ziheng Jia, Weisi Lin, Guangtao Zhai
Title: Teaching LMMs for Image Quality Scoring and Interpreting
Abstract:
Image quality scoring and interpreting are two fundamental components of Image Quality Assessment (IQA). The former quantifies image quality, while the latter enables descriptive question answering about image quality. Traditionally, these two tasks have been addressed independently. However, from the perspective of the Human Visual System (HVS) and the Perception-Decision Integration Model, they are inherently interconnected: interpreting serves as the foundation for scoring, while scoring provides an abstract summary of interpreting. Thus, unifying these capabilities within a single model is both intuitive and logically coherent. In this paper, we propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables large multimodal models (LMMs) to learn both image quality scoring and interpreting simultaneously. We achieve this by transforming conventional IQA datasets into learnable question-answering datasets and incorporating human-annotated quality interpreting data for training. Furthermore, we introduce an efficient scoring & interpreting balance strategy, which first determines the optimal data mix ratio on lightweight LMMs and then maps this ratio to primary LMMs for fine-tuning adjustment. This strategy not only mitigates task interference and enhances cross-task knowledge transfer but also significantly reduces computational costs compared to direct optimization on full-scale LMMs. With this joint learning framework and corresponding training strategy, we develop Q-SiT, the first model capable of simultaneously performing image quality scoring and interpreting tasks, along with its lightweight variant, Q-SiT-mini. Experimental results demonstrate that Q-SiT achieves strong performance in both tasks with superior IQA generalization abilities. Project page at https://github.com/Q-Future/Q-SiT.
中文摘要:本文提出Q-SiT统一框架,通过将图像质量评估数据集转化为问答格式并采用高效平衡策略,使大型多模态模型能够同时学习图像质量评分与解释,在保证双任务性能的同时显著降低计算成本。
English Summary: The paper introduces Q-SiT, a unified framework that enables large multimodal models to jointly learn image quality scoring and interpreting by transforming IQA datasets into question-answering formats and implementing an efficient balance strategy to optimize performance while reducing computational costs.

Authors:Yuanyang Zhang, Yijie Lin, Weiqing Yan, Li Yao, Xinhang Wan, Guangyuan Li, Chao Zhang, Guanzhou Ke, Jie Xu
Title: Incomplete Multi-view Clustering via Diffusion Contrastive Generation
Abstract:
Incomplete multi-view clustering (IMVC) has garnered increasing attention in recent years due to the common issue of missing data in multi-view datasets. The primary approach to address this challenge involves recovering the missing views before applying conventional multi-view clustering methods. Although imputation-based IMVC methods have achieved significant improvements, they still encounter notable limitations: 1) heavy reliance on paired data for training the data recovery module, which is impractical in real scenarios with high missing data rates; 2) the generated data often lacks diversity and discriminability, resulting in suboptimal clustering results. To address these shortcomings, we propose a novel IMVC method called Diffusion Contrastive Generation (DCG). Motivated by the consistency between the diffusion and clustering processes, DCG learns the distribution characteristics to enhance clustering by applying forward diffusion and reverse denoising processes to intra-view data. By performing contrastive learning on a limited set of paired multi-view samples, DCG can align the generated views with the real views, facilitating accurate recovery of views across arbitrary missing view scenarios. Additionally, DCG integrates instance-level and category-level interactive learning to exploit the consistent and complementary information available in multi-view data, achieving robust and end-to-end clustering. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches. The code is available at https://github.com/zhangyuanyang21/2025-AAAI-DCG.
中文: 提出的扩散对比生成方法通过结合扩散过程和对比学习,有效恢复多视图数据中的缺失视图,并利用实例级和类别级交互学习实现稳健的端到端聚类,在任意缺失场景下均优于现有方法。
English: The proposed Diffusion Contrastive Generation (DCG) method addresses incomplete multi-view clustering by leveraging diffusion processes and contrastive learning to accurately recover missing views and integrate multi-level interactive learning, achieving superior clustering performance without heavy reliance on paired data.

Authors:Chengshu Zhao, Yunyang Ge, Xinhua Cheng, Bin Zhu, Yatian Pang, Bin Lin, Fan Yang, Feng Gao, Li Yuan
Title: SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video
Abstract:
Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks instead of an independent task and typically rely on various models to achieve video body-swapping sequentially. However, these methods fail to achieve end-to-end optimization for video body-swapping, which causes issues such as variations in luminance among frames, disorganized occlusion relationships, and the noticeable separation between bodies and background. In this work, we define video body-swapping as an independent task and propose three critical consistencies: identity consistency, motion consistency, and environment consistency. We introduce an end-to-end model named SwapAnyone, treating video body-swapping as a video inpainting task with reference fidelity and motion control. To improve the ability to maintain environmental harmony, particularly luminance harmony in the resulting video, we introduce a novel EnvHarmony strategy for training our model progressively. Additionally, we provide a dataset named HumanAction-32K covering various videos about human actions. Extensive experiments demonstrate that our method achieves State-Of-The-Art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions. All code, model weights, and the HumanAction-32K dataset will be open-sourced at https://github.com/PKU-YuanGroup/SwapAnyone.
中文: 本文提出SwapAnyone模型,将视频换体重新定义为具有身份、动作和环境三个关键一致性的独立视频修复任务,通过创新的EnvHarmony策略和HumanAction-32K数据集实现了最先进的性能。
English: This paper introduces SwapAnyone, an end-to-end model that redefines video body-swapping as an independent video inpainting task with three key consistencies, achieving state-of-the-art performance through a novel EnvHarmony strategy and a comprehensive HumanAction-32K dataset.

Authors:Hyeonho Jeong, Suhyeon Lee, Jong Chul Ye
Title: Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
Abstract:
We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Unlike mainstream approaches that train multi-view video diffusion models on large-scale 4D datasets, our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors. In essence, Reangle-A-Video operates in two stages. (1) Multi-View Motion Learning: An image-to-video diffusion transformer is synchronously fine-tuned in a self-supervised manner to distill view-invariant motion from a set of warped videos. (2) Multi-View Consistent Image-to-Images Translation: The first frame of the input video is warped and inpainted into various camera perspectives under an inference-time cross-view consistency guidance using DUSt3R, generating multi-view consistent starting images. Extensive experiments on static view transport and dynamic camera control show that Reangle-A-Video surpasses existing methods, establishing a new solution for multi-view video generation. We will publicly release our code and data. Project page: https://hyeonho99.github.io/reangle-a-video/

Authors:Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, Junnan Li
Title: Generative Frame Sampler for Long Video Understanding
Abstract:
Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient lengthy video perception. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, VILA-40B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve impressive state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo surpassing the Gemini-1.5-pro by 1.9 points. We will release all datasets and models at https://generative-sampler.github.io.
中文: 本文提出生成式帧采样器(GenS),作为一种即插即用模块,通过识别问题相关帧来提升视频大语言模型处理长视频的效率,并借助新构建的大规模数据集在多项基准测试中取得了领先性能。
English: This paper introduces Generative Frame Sampler (GenS), a plug-and-play module that enhances VideoLLMs' efficiency in processing long-form videos by identifying question-relevant frames, achieving state-of-the-art results on benchmarks through a newly constructed large-scale dataset.
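
As a high-level illustration of the sampling step the abstract describes, the sketch below keeps the top-k most relevant frames while preserving their temporal order. Function and variable names are hypothetical, and the relevance scorer itself (which GenS implements with a lightweight VideoLLM) is simply assumed as an input.

```python
import numpy as np

def select_relevant_frames(relevance_scores, k):
    # relevance_scores: one question-relevance score per frame (assumed to be given here).
    scores = np.asarray(relevance_scores, dtype=float)
    return np.sort(np.argsort(-scores)[:k])   # top-k frames, restored to temporal order

# Example: keep the 4 most relevant of 16 frames.
print(select_relevant_frames(np.random.rand(16), k=4))
```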

Authors:Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, Liqiang Nie
Title: Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
Abstract:
AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. Current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D. Our approach features a progressive training pipeline with three stages: Teacher Self-Preparation, Teacher-Student Guidance, and Student Self-Practice. Additionally, we propose an instruction-tuning dataset, EgoIT, drawn from multiple sources to strengthen the model's instruction-following capabilities, along with the EgoBench benchmark comprising eight different tasks for thorough evaluation. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.
Chinese: 为提升AI个人助理的具身理解能力,我们提出通过Ego-ExoClip数据集和渐进式训练流程将外中心视觉知识迁移至自我中心视觉的方法,在多项自我中心任务中显著优于现有模型。
English: To enhance AI personal assistants' embodied understanding, we propose a method that transfers knowledge from exocentric to egocentric vision using our Ego-ExoClip dataset and progressive training pipeline, significantly outperforming existing models in egocentric tasks.

Authors:Zhehui Wu, Yong Chen, Naoto Yokoya, Wei He
Title: MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration
Abstract:
Hyperspectral images (HSIs) often suffer from diverse and unknown degradations during imaging, leading to severe spectral and spatial distortions. Existing HSI restoration methods typically rely on specific degradation assumptions, limiting their effectiveness in complex scenarios. In this paper, we propose MP-HSIR, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration across diverse degradation types and intensities. Specifically, we develop a prompt-guided spatial-spectral transformer, which incorporates spatial self-attention and a prompt-guided dual-branch spectral self-attention. Since degradations affect spectral features differently, we introduce spectral prompts in the local spectral branch to provide universal low-rank spectral patterns as prior knowledge for enhancing spectral reconstruction. Furthermore, the text-visual synergistic prompt fuses high-level semantic representations with fine-grained visual features to encode degradation information, thereby guiding the restoration process. Extensive experiments on 9 HSI restoration tasks, including all-in-one scenarios, generalization tests, and real-world cases, demonstrate that MP-HSIR not only consistently outperforms existing all-in-one methods but also surpasses state-of-the-art task-specific approaches across multiple tasks. The code and models are available at https://github.com/ZhehuiWu/MP-HSIR.
中文: 本文提出MP-HSIR多提示框架,通过融合光谱、文本和视觉提示,实现对多种退化类型高光谱图像的通用恢复,在广泛实验中性能超越现有方法。
English: This paper introduces MP-HSIR, a multi-prompt framework that integrates spectral, textual, and visual prompts to universally restore hyperspectral images across various degradation types, outperforming existing methods in extensive experiments.

Authors:Jiun Tian Hoe, Weipeng Hu, Wei Zhou, Chao Xie, Ziwei Wang, Chee Seng Chan, Xudong Jiang, Yap-Peng Tan
Title: InteractEdit: Zero-Shot Editing of Human-Object Interactions in Images
Abstract:
This paper presents InteractEdit, a novel framework for zero-shot Human-Object Interaction (HOI) editing, addressing the challenging task of transforming an existing interaction in an image into a new, desired interaction while preserving the identities of the subject and object. Unlike simpler image editing scenarios such as attribute manipulation, object replacement or style transfer, HOI editing involves complex spatial, contextual, and relational dependencies inherent in humans-objects interactions. Existing methods often overfit to the source image structure, limiting their ability to adapt to the substantial structural modifications demanded by new interactions. To address this, InteractEdit decomposes each scene into subject, object, and background components, then employs Low-Rank Adaptation (LoRA) and selective fine-tuning to preserve pretrained interaction priors while learning the visual identity of the source image. This regularization strategy effectively balances interaction edits with identity consistency. We further introduce IEBench, the most comprehensive benchmark for HOI editing, which evaluates both interaction editing and identity preservation. Our extensive experiments show that InteractEdit significantly outperforms existing methods, establishing a strong baseline for future HOI editing research and unlocking new possibilities for creative and practical applications. Code will be released upon publication.

Authors:Jin Li, Ziqiang He, Anwei Luo, Jian-Fang Hu, Z. Jane Wang, Xiangui Kang
Title: AdvAD: Exploring Non-Parametric Diffusion for Imperceptible Adversarial Attacks
Abstract:
Imperceptible adversarial attacks aim to fool DNNs by adding imperceptible perturbation to the input data. Previous methods typically improve the imperceptibility of attacks by integrating common attack paradigms with specifically designed perception-based losses or the capabilities of generative models. In this paper, we propose Adversarial Attacks in Diffusion (AdvAD), a novel modeling framework distinct from existing attack paradigms. AdvAD innovatively conceptualizes attacking as a non-parametric diffusion process by theoretically exploring the basic modeling approach rather than using the denoising or generation abilities of regular diffusion models requiring neural networks. At each step, much subtler yet effective adversarial guidance is crafted using only the attacked model without any additional network, which gradually leads the end of the diffusion process from the original image to a desired imperceptible adversarial example. Grounded in a solid theoretical foundation of the proposed non-parametric diffusion process, AdvAD achieves high attack efficacy and imperceptibility with intrinsically lower overall perturbation strength. Additionally, an enhanced version, AdvAD-X, is proposed to evaluate the extreme of our novel framework under an ideal scenario. Extensive experiments demonstrate the effectiveness of the proposed AdvAD and AdvAD-X. Compared with state-of-the-art imperceptible attacks, AdvAD achieves an average of 99.9% (+17.3%) ASR with 1.34 (-0.97) $l_2$ distance, 49.74 (+4.76) PSNR and 0.9971 (+0.0043) SSIM against four prevalent DNNs with three different architectures on the ImageNet-compatible dataset. Code is available at https://github.com/XianguiKang/AdvAD.
Chinese: 本文提出AdvAD,一种新颖的非参数扩散框架,通过理论建模而非神经网络实现难以察觉的对抗攻击,在更低扰动强度下获得更高攻击成功率和视觉隐蔽性。
English: The paper introduces AdvAD, a novel non-parametric diffusion framework for imperceptible adversarial attacks that crafts subtle perturbations without neural networks, achieving superior attack success and imperceptibility with lower perturbation strength.

Authors:Yuechen Xie, Jie Song, Huiqiong Wang, Mingli Song
Title: Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?
Abstract:
High-quality open-source text-to-image models have lowered the threshold for obtaining photorealistic images significantly, but also face potential risks of misuse. Specifically, suspects may use synthetic data generated by these generative models to train models for specific tasks without permission, especially when real data resources are lacking. Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method for this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). The rationale behind TrainProVe is grounded in the principle of generalization error bound, which suggests that, for two models with the same task, if the distance between their training data distributions is smaller, their generalization ability will be closer. We validate the efficacy of TrainProVe across four text-to-image models (Stable Diffusion v1.4, latent consistency model, PixArt-α, and Stable Cascade). The results show that TrainProVe achieves a verification accuracy of over 99% in determining the provenance of suspicious model training data, surpassing all previous methods. Code is available at https://github.com/xieyc99/TrainProVe.
中文摘要:本文提出TrainProVe方法,首次通过验证训练数据来源防止文本到图像模型的未经授权使用,在多个模型中实现了超过99%的验证准确率。
English Summary: This paper introduces TrainProVe, a novel method for verifying the provenance of training data in text-to-image models to prevent unauthorized use, achieving over 99% accuracy across multiple models.

Authors:Byeongchan Lee, Sehyun Lee
Title: Implicit Contrastive Representation Learning with Guided Stop-gradient
Abstract:
In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation-invariance by bringing representations of positive pairs closer together. But it is prone to collapse into a degenerate solution. To address the issue, in contrastive learning, a contrastive loss is used to prevent collapse by moving representations of negative pairs away from each other. But it is known that algorithms with negative sampling are not robust to a reduction in the number of negative samples. So, on the other hand, there are algorithms that do not use negative pairs. Many positive-only algorithms adopt asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting the asymmetric architecture, we introduce a methodology to implicitly incorporate the idea of contrastive learning. As its implementation, we present a novel method guided stop-gradient. We apply our method to benchmark algorithms SimSiam and BYOL and show that our method stabilizes training and boosts performance. We also show that the algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available at https://github.com/bych-lee/gsg.
Chinese: 本文提出了一种引导式停止梯度方法,通过非对称网络架构隐式融入对比学习思想,无需负样本或大批量即可提升SimSiam和BYOL等自监督学习算法的训练稳定性与性能表现。
English: This paper introduces a guided stop-gradient method that enhances self-supervised learning by implicitly incorporating contrastive principles through asymmetric network architectures, improving training stability and performance in algorithms like SimSiam and BYOL without requiring negative samples or large batch sizes.
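
The abstract builds on the stop-gradient mechanism of SimSiam/BYOL; the guided variant it proposes is not spelled out here, so the following is only a minimal sketch of the baseline symmetric stop-gradient objective that the method modifies. The toy encoder and predictor modules are placeholders added to make the sketch runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def negative_cosine(p, z):
    z = z.detach()                      # stop-gradient: the target branch receives no gradient
    return -F.cosine_similarity(p, z, dim=-1).mean()

def simsiam_loss(encoder, predictor, x1, x2):
    # x1, x2: two augmented views of the same image batch.
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)
    return 0.5 * (negative_cosine(p1, z2) + negative_cosine(p2, z1))

# Toy modules just to make the sketch executable.
encoder, predictor = nn.Linear(32, 16), nn.Linear(16, 16)
loss = simsiam_loss(encoder, predictor, torch.randn(8, 32), torch.randn(8, 32))
```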

Authors:Yifan Wang, Yifei Liu, Yingdong Shi, Changming Li, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Title: Discovering Influential Neuron Path in Vision Transformers
Abstract:
Vision Transformer models exhibit immense power yet remain opaque to human understanding, posing challenges and risks for practical applications. While prior research has attempted to demystify these models through input attribution and neuron role analysis, there has been a notable gap in considering layer-level information and the holistic path of information flow across layers. In this paper, we investigate the significance of influential neuron paths within vision Transformers, that is, a path of neurons from the model input to output that impacts the model inference most significantly. We first propose a joint influence measure to assess the contribution of a set of neurons to the model outcome. We then provide a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer, aiming to discover the crucial neuron path from input to output within the target model. Our experiments demonstrate the superiority of our method in finding the most influential neuron path along which the information flows, compared to existing baseline solutions. Additionally, the discovered neuron paths illustrate that vision Transformers exhibit specific inner working mechanisms for processing the visual information within the same image category. We further analyze the key effects of these neurons on the image classification task, showcasing that the found neuron paths preserve the model's capability on downstream tasks, which may also shed some light on real-world applications like model pruning. The project website including implementation code is available at https://foundation-model-research.github.io/NeuronPath/.
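
A minimal sketch of the layer-progressive greedy search described above, with a placeholder influence function standing in for the paper's joint influence measure; all names are hypothetical.

```python
def find_neuron_path(n_layers, neurons_per_layer, joint_influence):
    # joint_influence(path) -> float, where path is a list of (layer, neuron) pairs.
    path = []
    for layer in range(n_layers):
        best = max(range(neurons_per_layer),
                   key=lambda n: joint_influence(path + [(layer, n)]))
        path.append((layer, best))     # keep the neuron that most increases the joint influence
    return path

# Toy example with a dummy influence function (sum of chosen neuron indices).
print(find_neuron_path(3, 8, joint_influence=lambda p: sum(n for _, n in p)))
```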

Authors:Xinyu Zhang, Haonan Chang, Yuhan Liu, Abdeslam Boularias
Title: Motion Blender Gaussian Splatting for Dynamic Scene Reconstruction
Abstract:
Gaussian splatting has emerged as a powerful tool for high-fidelity reconstruction of dynamic scenes. However, existing methods primarily rely on implicit motion representations, such as encoding motions into neural networks or per-Gaussian parameters, which makes it difficult to further manipulate the reconstructed motions. This lack of explicit controllability limits existing methods to replaying recorded motions only, which hinders a wider application in robotics. To address this, we propose Motion Blender Gaussian Splatting (MBGS), a novel framework that uses motion graphs as an explicit and sparse motion representation. The motion of a graph's links is propagated to individual Gaussians via dual quaternion skinning, with learnable weight painting functions that determine the influence of each link. The motion graphs and 3D Gaussians are jointly optimized from input videos via differentiable rendering. Experiments show that MBGS achieves state-of-the-art performance on the highly challenging iPhone dataset while being competitive on HyperNeRF. We demonstrate the application potential of our method in animating novel object poses, synthesizing real robot demonstrations, and predicting robot actions through visual planning. The source code, models, video demonstrations can be found at http://mlzxy.github.io/motion-blender-gs.

Authors:Rongxin Liao, Feng Li, Yanyan Wei, Zenglin Shi, Le Zhang, Huihui Bai, Meng Wang
Title: Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Abstract:
Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic "Prompt-Restore-Prompt" pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt.
中文: 本文提出CyclicPrompt,一种创新的循环提示方法,通过组合上下文提示和擦除-粘贴机制,有效利用天气特定知识与可靠纹理,显著提升了通用恶劣天气去除的适应性和恢复质量。
English: This paper introduces CyclicPrompt, a novel cyclic prompt approach for universal adverse weather removal that integrates composite context prompts and an erase-and-paste mechanism to enhance restoration effectiveness and adaptability across various weather degradations.

Authors:Joao D. S. Marques, Arlindo L. Oliveira
Title: Are ECGs enough? Deep learning classification of pulmonary embolism using electrocardiograms
Abstract:
Pulmonary embolism is a leading cause of out-of-hospital cardiac arrest that requires fast diagnosis. While computed tomography pulmonary angiography is the standard diagnostic tool, it is not always accessible. Electrocardiography is an essential tool for diagnosing multiple cardiac anomalies, as it is affordable, fast and available in many settings. However, the availability of public ECG datasets, especially for PE, is limited and, in practice, these datasets tend to be small, making it essential to optimize learning strategies. In this study, we investigate the performance of multiple neural networks in order to assess the impact of various approaches. Moreover, we check whether these practices enhance model generalization when transfer learning is used to translate information learned in larger ECG datasets, such as PTB-XL, CPSC18 and MedalCare-XL, to a smaller, more challenging dataset for PE. By leveraging transfer learning, we analyze the extent to which we can improve learning efficiency and predictive performance on limited data. Code available at https://github.com/joaodsmarques/Are-ECGs-enough-Deep-Learning-Classifiers .
中文: 本研究通过从大型心电图数据集进行迁移学习,评估神经网络在有限数据条件下对肺栓塞的诊断性能,旨在提升学习效率与预测表现。
English: This study evaluates neural networks using transfer learning from large ECG datasets to improve pulmonary embolism diagnosis on limited data, aiming to enhance learning efficiency and predictive performance.
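
A minimal sketch of the transfer-learning recipe the study investigates, under assumed details: a toy 1D-CNN, a hypothetical pretrained checkpoint path, and per-group learning rates. It is not the authors' code.

```python
import torch
import torch.nn as nn

class ECGNet(nn.Module):
    def __init__(self, n_leads=12, n_classes=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(n_leads, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                       # x: [batch, leads, samples]
        return self.head(self.backbone(x))

model = ECGNet(n_classes=5)
# model.load_state_dict(torch.load("ptbxl_pretrained.pt"))   # hypothetical checkpoint from a large ECG corpus
model.head = nn.Linear(64, 2)                   # new head for the small PE vs. no-PE task
optimizer = torch.optim.Adam([
    {"params": model.backbone.parameters(), "lr": 1e-4},     # small LR for transferred layers
    {"params": model.head.parameters(), "lr": 1e-3},         # larger LR for the fresh head
])
logits = model(torch.randn(4, 12, 1000))        # 4 ECGs, 12 leads, 1000 samples each
```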

Authors:Hrishikesh Viswanath, Md Ashiqur Rahman, Chi Lin, Damon Conover, Aniket Bera
Title: HessianForge: Scalable LiDAR reconstruction with Physics-Informed Neural Representation and Smoothness Energy Constraints
Abstract:
Accurate and efficient 3D mapping of large-scale outdoor environments from LiDAR measurements is a fundamental challenge in robotics, particularly towards ensuring smooth and artifact-free surface reconstructions. Although the state-of-the-art methods focus on memory-efficient neural representations for high-fidelity surface generation, they often fail to produce artifact-free manifolds, with artifacts arising due to noisy and sparse inputs. To address this issue, we frame surface mapping as a physics-informed energy optimization problem, enforcing surface smoothness by optimizing an energy functional that penalizes sharp surface ridges. Specifically, we propose a deep learning based approach that learns the signed distance field (SDF) of the surface manifold from raw LiDAR point clouds using a physics-informed loss function that optimizes the $L_2$-Hessian energy of the surface. Our learning framework includes a hierarchical octree based input feature encoding and a multi-scale neural network to iteratively refine the signed distance field at different scales of resolution. Lastly, we introduce a test-time refinement strategy to correct topological inconsistencies and edge distortions that can arise in the generated mesh. We propose a CUDA-accelerated least-squares optimization that locally adjusts vertex positions to enforce feature-preserving smoothing. We evaluate our approach on large-scale outdoor datasets and demonstrate that our approach outperforms current state-of-the-art methods in terms of improved accuracy and smoothness. Our code is available at https://github.com/HrishikeshVish/HessianForge/
Chinese: 本研究提出了一种基于物理信息的深度学习方法,通过优化L2-Hessian能量从LiDAR点云重建平滑无伪影的3D表面,在精度和网格质量上超越了现有技术。
English: This study introduces a physics-informed deep learning method that optimizes the L2-Hessian energy to reconstruct smooth, artifact-free 3D surfaces from LiDAR point clouds, outperforming existing techniques in accuracy and mesh quality.
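
A minimal sketch of an L2-Hessian smoothness penalty on a neural SDF, computed with double backpropagation; the network, sampling, and weighting below are placeholders rather than the paper's implementation.

```python
import torch
import torch.nn as nn

def hessian_energy(sdf_net, pts):
    # pts: [N, 3] query points; differentiate the predicted SDF twice w.r.t. them.
    pts = pts.detach().requires_grad_(True)
    f = sdf_net(pts)                                                  # [N, 1] signed distances
    grad = torch.autograd.grad(f.sum(), pts, create_graph=True)[0]    # first derivatives, [N, 3]
    energy = 0.0
    for i in range(3):                                                # rows of the Hessian
        row = torch.autograd.grad(grad[:, i].sum(), pts, create_graph=True)[0]
        energy = energy + (row ** 2).sum(dim=1)
    return energy.mean()                                              # mean squared Frobenius norm

# Toy SDF network; a training loss would add this term to the data-fitting loss, e.g.
# loss = sdf_data_loss + lambda_hess * hessian_energy(sdf_net, sample_points)
sdf_net = nn.Sequential(nn.Linear(3, 64), nn.Softplus(), nn.Linear(64, 1))
print(hessian_energy(sdf_net, torch.randn(128, 3)))
```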

Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Haiyu Wu, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Title: Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation
Abstract:
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at https://github.com/ChongQingNoSubway/Prompt-OT
中文摘要:本文提出了一种基于最优传输的提示学习框架,通过保持特征分布一致性来增强视觉语言模型的适应能力,无需额外技术即可在多项任务中实现卓越的泛化性能。
English Summary: This paper introduces an optimal transport-guided prompt learning framework to enhance vision-language model adaptation by preserving feature distribution consistency, achieving superior generalization across various tasks without extra techniques.
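
A minimal sketch (not the paper's exact objective) of how an entropic-OT regularizer between pretrained and tuned feature sets could be added to a prompt-tuning loss. The Sinkhorn iteration below is the plain fixed-point form, so a log-domain version would be preferable for small regularization values; names and weights are assumptions.

```python
import torch
import torch.nn.functional as F

def sinkhorn_distance(x, y, eps=0.5, n_iter=50):
    # x: [n, d], y: [m, d] feature sets; uniform marginals over samples.
    cost = torch.cdist(x, y, p=2) ** 2                     # [n, m] squared Euclidean cost
    K = torch.exp(-cost / eps)                              # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)
    b = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)
    u = torch.ones_like(a)
    for _ in range(n_iter):                                 # Sinkhorn fixed-point updates
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)              # entropic transport plan
    return (plan * cost).sum()                              # transport cost under the plan

# Hypothetical usage: penalize drift of tuned features away from the frozen encoder's features.
tuned = F.normalize(torch.randn(64, 128), dim=-1)
frozen = F.normalize(torch.randn(64, 128), dim=-1)
reg = sinkhorn_distance(tuned, frozen.detach())
# total_loss = task_loss + lambda_ot * reg
```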

Authors:Rajitha de Silva, Jonathan Cox, Marija Popovic, Cesar Cadena, Cyrill Stachniss, Riccardo Polvara
Title: Keypoint Semantic Integration for Improved Feature Matching in Outdoor Agricultural Environments
Abstract:
Robust robot navigation in outdoor environments requires accurate perception systems capable of handling visual challenges such as repetitive structures and changing appearances. Visual feature matching is crucial to vision-based pipelines but remains particularly challenging in natural outdoor settings due to perceptual aliasing. We address this issue in vineyards, where repetitive vine trunks and other natural elements generate ambiguous descriptors that hinder reliable feature matching. We hypothesise that semantic information tied to keypoint positions can alleviate perceptual aliasing by enhancing keypoint descriptor distinctiveness. To this end, we introduce a keypoint semantic integration technique that improves the descriptors in semantically meaningful regions within the image, enabling more accurate differentiation even among visually similar local features. We validate this approach in two vineyard perception tasks: (i) relative pose estimation and (ii) visual localisation. Across all tested keypoint types and descriptors, our method improves matching accuracy by 12.6%, demonstrating its effectiveness over multiple months in challenging vineyard conditions.
中文: 本文提出一种关键点语义集成技术,通过结合语义信息增强葡萄园环境中特征描述符的区分度,在姿态估计和视觉定位任务中将特征匹配准确率提升12.6%,有效应对户外复杂场景的视觉挑战。
English: This paper introduces a keypoint semantic integration technique that enhances descriptor distinctiveness in vineyards by incorporating semantic information, improving feature matching accuracy by 12.6% for tasks like pose estimation and visual localization under challenging outdoor conditions.
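
One simple way to realize the idea of tying semantics to keypoint descriptors is to append a semantic encoding, sampled at each keypoint, to the visual descriptor before matching. The sketch below is illustrative only; the function and the weighting are hypothetical, not the paper's integration scheme.

```python
import numpy as np

def integrate_semantics(descriptors, keypoints, semantic_map, n_classes, weight=0.5):
    # descriptors:  [N, D] L2-normalized visual descriptors
    # keypoints:    [N, 2] (x, y) pixel coordinates
    # semantic_map: [H, W] integer class labels from a segmentation network
    labels = semantic_map[keypoints[:, 1].astype(int), keypoints[:, 0].astype(int)]
    onehot = np.eye(n_classes)[labels] * weight        # semantic part of the descriptor
    fused = np.concatenate([descriptors, onehot], axis=1)
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)   # renormalize for cosine matching

desc = np.random.randn(5, 128)
desc /= np.linalg.norm(desc, axis=1, keepdims=True)
kps = np.random.randint(0, 100, size=(5, 2))
sem = np.random.randint(0, 4, size=(100, 100))
print(integrate_semantics(desc, kps, sem, n_classes=4).shape)     # (5, 132)
```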

Authors:Letian Zhang, Quan Cui, Bingchen Zhao, Cheng Yang
Title: Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Abstract:
The success of multi-modal large language models (MLLMs) has been largely attributed to the large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis departs from traditional methods by prompting the MLLMs with only images, thus extending data diversity by a large margin. Our method features a careful quality control procedure which ensures data quality. We collected over 500k data samples and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to focus on the domain-specific abilities of MLLMs. Code and dataset are publicly available at https://github.com/Letian2003/MM_INF.
中文:Oasis方法仅利用图像即可自动合成高质量多模态训练数据,在保证质量的同时大幅提升数据多样性,显著增强多模态大语言模型的性能。
English: The Oasis method enables automatic synthesis of high-quality multi-modal training data using only images, significantly enhancing data diversity and MLLM performance without compromising quality.

Authors:In Cho, Youngbeom Yoo, Subin Jeon, Seon Joo Kim
Title: Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models
Abstract:
Constructing a compressed latent space through a variational autoencoder (VAE) is the key for efficient 3D diffusion models. This paper introduces COD-VAE that encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing computational overhead of neural fields decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves the decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16x compression compared to the baseline while maintaining quality. This enables 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation. The code is available at https://github.com/join16/COD-VAE.
中文: 本文提出的COD-VAE通过两阶段变分自编码器将3D形状压缩为紧凑的一维潜向量,在保持质量的同时实现了16倍压缩比和20.8倍生成加速,其核心在于渐进式编码、三平面解码和自适应令牌剪枝技术。
English: This paper introduces COD-VAE, a two-stage variational autoencoder that compresses 3D shapes into compact 1D latent vectors while maintaining quality, achieving 16x compression and 20.8x faster generation through progressive encoding, triplane-based decoding, and adaptive token pruning.

Authors:Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
Title: Is CLIP ideal? No. Can we fix it? Yes!
Abstract:
Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
中文: 研究发现CLIP的潜在空间存在固有几何限制,无法同时处理多种语义任务,并提出密集余弦相似度映射作为可解释的解决方案,通过保持图像-文本拓扑结构提升了各项基准测试的性能。
English: The study reveals that CLIP's latent space has inherent geometric limitations preventing it from simultaneously handling multiple semantic tasks, and proposes Dense Cosine Similarity Maps as an interpretable solution that improves performance across benchmarks by preserving image-text topology.
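
A minimal sketch of a dense cosine-similarity map between image patch embeddings and text token embeddings, the kind of patch-token scoring the abstract describes; how the paper aggregates this map into a final score is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dense_cosine_similarity_map(patch_feats, token_feats):
    # patch_feats: [N_patches, D] image patch embeddings; token_feats: [N_tokens, D] text token embeddings.
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(token_feats, dim=-1)
    return p @ t.t()                               # [N_patches, N_tokens] cosine similarity map

# Example: a 14x14 patch grid (196 patches) against a 12-token caption.
sim_map = dense_cosine_similarity_map(torch.randn(196, 512), torch.randn(12, 512))
print(sim_map.shape)
```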

Authors:Yan Hu, Ahmad Chaddad
Title: SHAP-Integrated Convolutional Diagnostic Networks for Feature-Selective Medical Analysis
Abstract:
This study introduces the SHAP-integrated convolutional diagnostic network (SICDN), an interpretable feature selection method designed for limited datasets, to address the challenge posed by data privacy regulations that restrict access to medical datasets. The SICDN model was tested on classification tasks using pneumonia and breast cancer datasets, demonstrating over 97% accuracy and surpassing four popular CNN models. We also integrated a historical weighted moving average technique to enhance feature selection. The SICDN shows potential in medical image prediction, with the code available on https://github.com/AIPMLab/SICDN.
中文摘要:SICDN模型作为一种可解释的特征选择方法,在有限医疗数据集上实现了超过97%的分类准确率,有效应对了数据隐私保护带来的挑战。
English Summary: The SICDN model is an interpretable feature selection method that achieves over 97% accuracy on medical datasets while addressing data privacy constraints through effective performance with limited data.

Authors:Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji
Title: QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
Abstract:
Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and instructions (query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) for visual token assignment based on query-oriented frame-level importance assessment. The query-oriented token selection is crucial as it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers a plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline. Codes are open-sourced at https://github.com/MAC-AutoML/QuoTA.
中文摘要:本文提出QuoTA,一种无需训练即可基于查询相关性在跨模态交互前筛选视觉标记的模块,在相同标记预算下将六个基准测试的平均性能提升了3.2%。
English Summary: The paper introduces QuoTA, a training-free module that enhances long video understanding by selecting visual tokens based on query relevance before cross-modal interactions, improving performance by 3.2% on benchmarks within the same token budget.
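
As an illustration of ante-hoc, query-oriented token assignment, the sketch below distributes a fixed visual-token budget across frames in proportion to per-frame relevance scores. The scores themselves would come from the LVLM-based scoring the abstract describes; names and the allocation rule are hypothetical.

```python
import numpy as np

def assign_token_budget(frame_scores, total_tokens, min_per_frame=1):
    # frame_scores: per-frame query-relevance scores; total_tokens: overall visual token budget.
    scores = np.asarray(frame_scores, dtype=float)
    probs = scores / scores.sum()
    budget = np.maximum(min_per_frame, np.floor(probs * total_tokens).astype(int))
    order = np.argsort(-scores)                  # hand out any remaining tokens by relevance
    i = 0
    while budget.sum() < total_tokens:
        budget[order[i % len(order)]] += 1
        i += 1
    return budget

# Example: 8 frames sharing a 256-token budget.
print(assign_token_budget([0.9, 0.1, 0.3, 0.8, 0.05, 0.2, 0.7, 0.4], 256))
```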

Authors:Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
Title: OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
Abstract:
Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba
中文:OmniMamba是一种基于线性架构的多模态生成模型,通过统一的下一词预测范式高效生成文本和图像,在显著降低计算复杂度和训练数据需求的同时实现了优越性能。
English: OmniMamba is a linear-architecture multimodal generation model that efficiently generates text and images using next-token prediction, achieving competitive performance with significantly reduced computational complexity and training data requirements.

Authors:Changxing Liu, Genjia Liu, Zijun Wang, Jinchang Yang, Siheng Chen
Title: CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving
Abstract:
Vehicle-to-vehicle (V2V) cooperative autonomous driving holds great promise for improving safety by addressing the perception and prediction uncertainties inherent in single-agent systems. However, traditional cooperative methods are constrained by rigid collaboration protocols and limited generalization to unseen interactive scenarios. While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. CoLMDriver features a parallel driving pipeline with two key components: (i) an LLM-based negotiation module under an actor-critic paradigm, which continuously refines cooperation policies through feedback from previous decisions of all vehicles; and (ii) an intention-guided waypoint generator, which translates negotiation outcomes into executable waypoints. Additionally, we introduce InterDrive, a CARLA-based simulation benchmark comprising 10 challenging interactive driving scenarios for evaluating V2V cooperation. Experimental results demonstrate that CoLMDriver significantly outperforms existing approaches, achieving an 11% higher success rate across diverse highly interactive V2V driving scenarios. Code will be released on https://github.com/cxliu0314/CoLMDriver.
Chinese: CoLMDriver是首个基于大语言模型的全流程协同驾驶系统,通过基于语言的协商和实时控制提升车辆间安全性,在多样化交互场景中相比现有方法成功率提高了11%。
English: CoLMDriver is a pioneering full-pipeline LLM-based cooperative driving system that enhances vehicle-to-vehicle safety through language-based negotiation and real-time control, achieving an 11% higher success rate in diverse interactive scenarios compared to existing methods.

Authors:Yuanhao Wang, Cheng Zhang, Gonçalo Frazão, Jinlong Yang, Alexandru-Eugen Ichim, Thabo Beeler, Fernando De la Torre
Title: GarmentCrafter: Progressive Novel View Synthesis for Single-View 3D Garment Reconstruction and Editing
Abstract:
We introduce GarmentCrafter, a new approach that enables non-professional users to create and modify 3D garments from a single-view image. While recent advances in image generation have facilitated 2D garment design, creating and editing 3D garments remains challenging for non-professional users. Existing methods for single-view 3D reconstruction often rely on pre-trained generative models to synthesize novel views conditioning on the reference image and camera pose, yet they lack cross-view consistency, failing to capture the internal relationships across different views. In this paper, we tackle this challenge through progressive depth prediction and image warping to approximate novel views. Subsequently, we train a multi-view diffusion model to complete occluded and unknown clothing regions, informed by the evolving camera pose. By jointly inferring RGB and depth, GarmentCrafter enforces inter-view coherence and reconstructs precise geometries and fine details. Extensive experiments demonstrate that our method achieves superior visual fidelity and inter-view coherence compared to state-of-the-art single-view 3D garment reconstruction methods.

Authors:Yongsheng Yu, Ziyun Zeng, Haitian Zheng, Jiebo Luo
Title: OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
Abstract:
Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: https://yeates.github.io/OmniPaint-Page/

Authors:Shuaiting Li, Juncan Deng, Chenxuan Wang, Kedong Xu, Rongtao Deng, Hong Gu, Haibin Shen, Kejie Huang
Title: SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
Abstract:
Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3× speedup over the 8-bit compressed model by reducing memory access. Our code is available at https://github.com/list0830/SSVQ.
中文: 本文提出符号分离向量量化(SSVQ)方法,通过将权重符号位与码本解耦来改进传统向量量化在微调中的局限性,在保持更高压缩精度的同时实现了硬件加速器上3倍的运算速度提升。
English: This paper introduces Sign-Splitting Vector Quantization (SSVQ), a novel weight compression method that decouples sign bits from codebooks to overcome conventional VQ's fine-tuning limitations, achieving superior compression-accuracy trade-offs and 3× speedup on hardware accelerators.
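
A minimal sketch of the sign-splitting idea, not the full SSVQ training loop: sign bits are stored separately and vector quantization (plain k-means here) runs on the magnitudes only. Vector dimension and codebook size are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def sign_split_quantize(weights, vec_dim=4, n_codewords=256):
    # Store signs separately (1 bit per weight) and cluster only the magnitudes.
    flat = weights.reshape(-1)
    signs = np.sign(flat)
    mags = np.abs(flat).reshape(-1, vec_dim)               # all-positive vectors to quantize
    km = KMeans(n_clusters=n_codewords, n_init=4, random_state=0).fit(mags)
    codebook, codes = km.cluster_centers_, km.labels_
    recon = (codebook[codes].reshape(-1) * signs).reshape(weights.shape)
    return recon, signs, codebook, codes

# Example on a random weight matrix whose element count is divisible by vec_dim.
w = np.random.randn(256, 256).astype(np.float32)
recon, *_ = sign_split_quantize(w)
print(np.abs(recon - w).mean())                             # reconstruction error of the sketch
```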

Authors:Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, Chen Change Loy
Title: MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention
Abstract:
Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.
中文: 本文提出MEAT这一针对人体的多视角扩散模型,通过网格注意力机制利用着装人体网格建立跨视角坐标对应,实现了1024x1024分辨率的高清训练,在保证视角一致性的同时有效解决了数据稀缺问题,其性能优于现有多视角扩散方法。
English: This paper introduces MEAT, a human-specific multiview diffusion model that employs mesh attention to enable high-resolution (1024x1024) training by establishing direct cross-view correspondences through clothed human mesh representations, effectively addressing consistency and data scarcity issues while outperforming existing methods.

Authors:Zhenchen Wan, Yanwu xu, Dongting Hu, Weilun Cheng, Tianxi Chen, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Title: MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input
Abstract:
Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in Fig.1(a). To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance regarding garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark and demonstrating a substantial lead over previous approaches. For more details, visit our project page: https://zhenchenwan.github.io/MF-VITON/.
中文: 提出的无掩码虚拟试穿(MF-VITON)框架通过两阶段流程,合成真实数据集并微调预训练模型,无需用户提供掩码,在服装转移精度和视觉真实感上达到领先水平。
English: The proposed Mask-Free VITON (MF-VITON) framework eliminates the need for user-provided masks by using a two-stage pipeline that synthesizes a realistic dataset and fine-tunes a pre-trained model, achieving state-of-the-art performance in garment transfer accuracy and visual realism.

Authors:Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen
Title: SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Abstract:
While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of policy improvement method StaR and PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs.
中文: 摘要提出了人类式掩码标注任务(HLMAT),该方法通过多步迭代生成基于文本的点击点,使多模态大语言模型能够进行像素级分割,无需改变架构即可获得高质量结果,并提升细粒度视觉理解能力。
English: The abstract introduces the Human-Like Mask Annotation Task (HLMAT), a method enabling multimodal large language models to perform pixel-level segmentation by generating text-based click points iteratively, achieving high-quality results without architectural changes and enhancing fine-grained visual understanding.

Authors:Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, Ser-Nam Lim
Title: LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization
Abstract:
Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce LightGen, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal Large Language Models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only 0.7B parameters. Using a compact synthetic dataset of just 2M high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically reduces computational demands and reduces pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate the DPO technique that refines image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at https://github.com/XianfengWu01/LightGen
中文摘要:LightGen通过知识蒸馏和直接偏好优化技术,仅用7亿参数和少量计算资源就实现了与顶尖模型相当的图像生成质量。
English Summary: LightGen introduces an efficient training paradigm using knowledge distillation and Direct Preference Optimization to achieve state-of-the-art image generation quality with only 0.7B parameters and minimal computational resources.

Authors:Feiran Wang, Jiachen Tao, Junyi Wu, Haoxuan Wang, Bin Duan, Kai Wang, Zongxin Yang, Yan Yan
Title: X-Field: A Physically Grounded Representation for 3D X-ray Reconstruction
Abstract:
X-ray imaging is indispensable in medical diagnostics, yet its use is tightly regulated due to potential health risks. To mitigate radiation exposure, recent research focuses on generating novel views from sparse inputs and reconstructing Computed Tomography (CT) volumes, borrowing representations from the 3D reconstruction area. However, these representations originally target visible light imaging that emphasizes reflection and scattering effects, while neglecting penetration and attenuation properties of X-ray imaging. In this paper, we introduce X-Field, the first 3D representation specifically designed for X-ray imaging, rooted in the energy absorption rates across different materials. To accurately model diverse materials within internal structures, we employ 3D ellipsoids with distinct attenuation coefficients. To estimate each material's energy absorption of X-rays, we devise an efficient path partitioning algorithm accounting for complex ellipsoid intersections. We further propose hybrid progressive initialization to refine the geometric accuracy of X-Field and incorporate material-based optimization to enhance model fitting along material boundaries. Experiments show that X-Field achieves superior visual fidelity on both real-world human organ and synthetic object datasets, outperforming state-of-the-art methods in X-ray Novel View Synthesis and CT Reconstruction.
中文摘要:X-Field首次提出专为X射线成像设计的3D表征方法,通过模拟材料特异性能量吸收,结合创新的几何优化和材质边界处理技术,在X射线新视角合成和CT重建任务中实现了超越现有方法的视觉保真度。
English Summary: X-Field introduces the first 3D representation tailored for X-ray imaging by modeling material-specific energy absorption, achieving superior performance in novel view synthesis and CT reconstruction through innovative geometric and material optimization techniques.
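
The representation is grounded in X-ray attenuation physics; as a worked example, the Beer-Lambert law gives the transmitted intensity of a ray crossing several material segments. The coefficients and lengths below are illustrative values, not numbers from the paper.

```python
import numpy as np

def transmitted_intensity(i0, mu, lengths):
    # Beer-Lambert: I = I0 * exp(-sum_i mu_i * l_i) over the segments the ray crosses.
    return i0 * np.exp(-np.sum(np.asarray(mu) * np.asarray(lengths)))

# Illustrative ray through soft tissue, bone, then soft tissue again (assumed units: 1/cm and cm).
print(transmitted_intensity(1.0, mu=[0.2, 0.5, 0.2], lengths=[3.0, 1.5, 4.0]))
```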

Authors:Justus Karlsson, Yonghao Xu, Amanda Berg, Leif Haglund
Title: Comparing Next-Day Wildfire Predictability of MODIS and VIIRS Satellite Data
Abstract:
Multiple studies have performed next-day fire prediction using satellite imagery. Two main satellites are used to detect wildfires: MODIS and VIIRS. Both satellites provide fire mask products, called MOD14 and VNP14, respectively. Studies have used one or the other, but there has been no comparison between them to determine which might be more suitable for next-day fire prediction. In this paper, we first evaluate how well VIIRS and MODIS data can be used to forecast wildfire spread one day ahead. We find that the model using VIIRS as input and VNP14 as target achieves the best results. Interestingly, the model using MODIS as input and VNP14 as target performs significantly better than using VNP14 as input and MOD14 as target. Next, we discuss why MOD14 might be harder to use for predicting next-day fires. We find that the MOD14 fire mask is highly stochastic and does not correlate with reasonable fire spread patterns. This is detrimental to machine learning tasks, as the model learns irrational patterns. Therefore, we conclude that MOD14 is unsuitable for next-day fire prediction and that VNP14 is a much better option. However, using MODIS input and VNP14 as target, we achieve a significant improvement in predictability. This indicates that an improved fire detection model is possible for MODIS. The full code and dataset are available online: https://github.com/justuskarlsson/wildfire-mod14-vnp14
中文摘要:本研究比较了MODIS和VIIRS卫星数据在次日野火预测中的表现,发现VIIRS(VNP14)因MOD14的随机性而更优,但将MODIS输入与VNP14目标结合可显著提升预测效果。
English Summary: This study compares MODIS and VIIRS satellite data for next-day wildfire prediction, finding that VIIRS (VNP14) outperforms MODIS (MOD14) due to MOD14's unpredictable patterns, though combining MODIS input with VNP14 targets yields better results.

Authors:Yihang Chen, Mengyao Li, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, Jianfei Cai
Title: PCGS: Progressive Compression of 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) achieves impressive rendering fidelity and speed for novel view synthesis. However, its substantial data size poses a significant challenge for practical applications. While many compression techniques have been proposed, they fail to efficiently utilize existing bitstreams in on-demand applications due to their lack of progressivity, leading to a waste of resources. To address this issue, we propose PCGS (Progressive Compression of 3D Gaussian Splatting), which adaptively controls both the quantity and quality of Gaussians (or anchors) to enable effective progressivity for on-demand applications. Specifically, for quantity, we introduce a progressive masking strategy that incrementally incorporates new anchors while refining existing ones to enhance fidelity. For quality, we propose a progressive quantization approach that gradually reduces quantization step sizes to achieve finer modeling of Gaussian attributes. Furthermore, to compact the incremental bitstreams, we leverage existing quantization results to refine probability prediction, improving entropy coding efficiency across progressive levels. Overall, PCGS achieves progressivity while maintaining compression performance comparable to SoTA non-progressive methods. Code available at: github.com/YihangChen-ee/PCGS.
中文: 3D高斯泼溅技术虽渲染效果出色但数据量庞大,PCGS通过渐进式掩码策略和量化方法动态调控高斯点数量与质量,在保持压缩性能的同时实现了适用于按需应用的可扩展传输。
English: 3D Gaussian Splatting delivers high-quality rendering but faces data size challenges, so PCGS introduces progressive compression through adaptive control of Gaussian quantity and quality to enable efficient on-demand streaming while maintaining competitive compression rates.
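
A minimal sketch of progressive quantization in the spirit described above (not PCGS itself): each level re-quantizes the remaining residual with a halved step size, so a decoder can refine attribute values as more of the bitstream arrives. Step sizes and level counts are illustrative.

```python
import numpy as np

def progressive_quantize(values, base_step=0.1, n_levels=3):
    original = np.asarray(values, dtype=float)
    recon = np.zeros_like(original)
    levels, step = [], base_step
    for _ in range(n_levels):
        residual = original - recon              # what is still uncoded at this level
        symbols = np.round(residual / step)      # integer symbols sent for this level
        levels.append((step, symbols))
        recon = recon + symbols * step           # decoder-side refinement after this level
        step /= 2.0                              # finer quantization at the next level
    return levels, recon

levels, recon = progressive_quantize(np.random.randn(5))
print(recon)
```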

Authors:Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, De-Chuan Zhan
Title: External Knowledge Injection for CLIP-Based Class-Incremental Learning
Abstract:
Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of ``cat'' can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE. Code is available at: https://github.com/LAMDA-CL/ICCV25-ENGINE
Chinese: 本文提出ENGINE方法,通过双分支调优和重排序机制将外部知识注入CLIP模型,在类增量学习中实现了最先进的性能,使模型能够更好地捕捉演化数据中的上下文特征。
English: This paper introduces ENGINE, a method that enhances Class-Incremental Learning (CIL) by injecting external knowledge into CLIP through dual-branch tuning and re-ranking, achieving state-of-the-art performance by capturing richer contextual features for evolving data streams.
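One plausible reading of the post-tuning re-ranking step is sketched below (not the authors' implementation; the blending rule and alpha are illustrative): class-name logits are blended with the mean similarity to externally rewritten descriptor embeddings for each class.

```python
import torch
import torch.nn.functional as F

def rerank_with_descriptors(image_feat, name_feats, desc_feats, alpha=0.5):
    """Toy re-ranking: blend class-name similarity with the mean similarity
    to external textual descriptors for each class.
    image_feat: (D,), name_feats: (C, D), desc_feats: (C, K, D)."""
    image_feat = F.normalize(image_feat, dim=-1)
    name_feats = F.normalize(name_feats, dim=-1)
    desc_feats = F.normalize(desc_feats, dim=-1)
    name_logits = name_feats @ image_feat                 # (C,)
    desc_logits = (desc_feats @ image_feat).mean(dim=-1)  # (C,) averaged over K descriptors
    return alpha * name_logits + (1 - alpha) * desc_logits

C, K, D = 10, 4, 512
scores = rerank_with_descriptors(torch.randn(D), torch.randn(C, D), torch.randn(C, K, D))
print(scores.argmax().item())
```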

Authors:Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, Lei Zhang
Title: Referring to Any Person
Abstract:
Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek
Chinese: 本研究提出了HumanRef数据集,以解决现有模型在通过自然语言描述检测多个人物方面的不足,并开发了RexSeek模型,该模型在人物指代任务中表现出色,并能有效泛化至物体检测应用。
English: This study introduces HumanRef, a new dataset addressing the limitations of existing models in detecting multiple individuals through natural language descriptions, and proposes RexSeek, a robust model that excels in human referring tasks and generalizes effectively to object detection.

Authors:Fan Wu, Sijun Dong, Xiaoliang Meng
Title: CFNet: Optimizing Remote Sensing Change Detection through Content-Aware Enhancement
Abstract:
Change detection is a crucial and widely applied task in remote sensing, aimed at identifying and analyzing changes occurring in the same geographical area over time. Due to variability in acquisition conditions, bi-temporal remote sensing images often exhibit significant differences in image style. Even with the powerful generalization capabilities of DNNs, these unpredictable style variations between bi-temporal images inevitably affect the model's ability to accurately detect changed areas. To address the issue above, we propose the Content Focuser Network (CFNet), which takes a content-aware strategy as its key insight. CFNet employs EfficientNet-B5 as the backbone for feature extraction. To enhance the model's focus on the content features of images while mitigating the misleading effects of style features, we develop a constraint strategy that prioritizes the content features of bi-temporal images, termed Content-Aware. Furthermore, to enable the model to flexibly focus on changed and unchanged areas according to the requirements of different stages, we design a reweighting module based on the cosine distance between bi-temporal image features, termed Focuser. CFNet achieves outstanding performance across three well-known change detection datasets: CLCD (F1: 81.41%, IoU: 68.65%), LEVIR-CD (F1: 92.18%, IoU: 85.49%), and SYSU-CD (F1: 82.89%, IoU: 70.78%). The code and pretrained models of CFNet are publicly released at https://github.com/wifiBlack/CFNet.
中文摘要:提出的内容聚焦网络(CFNet)通过优先处理图像内容特征而非误导性风格差异,有效提升了遥感变化检测的准确性,并在三个权威数据集上取得了领先性能。
English Summary: The Content Focuser Network (CFNet) is proposed to enhance change detection in remote sensing by prioritizing content features over misleading style variations, achieving state-of-the-art performance on three benchmark datasets.
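A minimal sketch of a Focuser-style reweighting, assuming per-pixel cosine distance between bi-temporal feature maps drives the weights (the stage-dependent behavior described in the abstract is reduced to a boolean flag here, and this is not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def focuser_reweight(feat_t1, feat_t2, focus_changed=True):
    """Toy 'Focuser': reweight bi-temporal features by the per-pixel cosine
    distance between them. feat_*: (B, C, H, W)."""
    cos_sim = F.cosine_similarity(feat_t1, feat_t2, dim=1, eps=1e-6)  # (B, H, W)
    dist = (1.0 - cos_sim).unsqueeze(1)                               # 0 (identical) .. 2 (opposite)
    weight = dist / 2.0 if focus_changed else 1.0 - dist / 2.0        # in [0, 1]
    return feat_t1 * weight, feat_t2 * weight

f1, f2 = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
w1, w2 = focuser_reweight(f1, f2)
print(w1.shape)
```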

Authors:Yuncheng Guo, Xiaodong Gu
Title: MMRL: Multi-Modal Representation Learning for Vision-Language Models
Abstract:
Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders--where dataset-specific features are more prominent--while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
中文摘要:提出的多模态表征学习(MMRL)框架通过构建共享表征空间,在优化表征和类别特征的同时与冻结预训练知识对齐,有效解决了小样本视觉语言任务中的过拟合问题,实现了任务适应性与泛化能力的平衡。
English Summary: The proposed Multi-Modal Representation Learning (MMRL) framework addresses overfitting in few-shot vision-language tasks by creating a shared representation space that optimizes both representation and class features while preserving generalization through alignment with frozen pre-trained knowledge.
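The zero-shot alignment regularizer mentioned in the abstract can be sketched as a cosine-distance penalty between the tuned features and the frozen VLM's zero-shot features; the exact formulation and weighting in the paper may differ, so treat the names and lambda below as placeholders.

```python
import torch
import torch.nn.functional as F

def zero_shot_alignment_loss(class_feat, text_feat, zs_class_feat, zs_text_feat, lam=1.0):
    """Toy regularizer: keep tuned class/text features close to the frozen
    zero-shot features via cosine distance."""
    def cos_dist(a, b):
        return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()
    return lam * (cos_dist(class_feat, zs_class_feat) + cos_dist(text_feat, zs_text_feat))

d = 512
loss = zero_shot_alignment_loss(torch.randn(8, d), torch.randn(8, d),
                                torch.randn(8, d), torch.randn(8, d))
print(float(loss))
```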

Authors:Fengyi Zhang, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo
Title: TT-Occ: Test-Time Compute for Self-Supervised Occupancy via Spatio-Temporal Gaussian Splatting
Abstract:
Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VLMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VLMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in a lift-track-voxelize symphony: We first lift the geometry and semantics of surrounding-view extracted from VLMs to instantiate Gaussians at 3D space; Next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency; Finally, we voxelize the optimized Gaussians to generate occupancy prediction. Optionally, inherent noise in VLM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants: one LiDAR-based and one vision-centric, and conduct extensive experiments on Occ3D and nuCraft benchmarks with varying voxel resolutions. Code will be available at https://github.com/Xian-Bei/TT-Occ.
自监督3D占据预测无需昂贵标注即可理解复杂驾驶场景,但存在训练效率低和适应性差的问题;我们提出的TT-Occ框架通过融合视觉基础模型与3D高斯方法,实现了无需训练、支持任意分辨率且具备开放词汇识别能力的实时占据预测。
Self-supervised 3D occupancy prediction enables complex scene understanding without expensive annotations, but faces challenges in training efficiency and adaptability, which our proposed TT-Occ framework overcomes by integrating vision foundation models with 3D Gaussians for flexible, training-free real-time occupancy prediction at arbitrary resolutions.

Authors:Han-Wei Kung, Tuomas Varanka, Terence Sim, Nicu Sebe
Title: NullFace: Training-Free Localized Face Anonymization
Abstract:
Privacy concerns around ever increasing number of cameras are increasing in today's digital age. Although existing anonymization methods are able to obscure identity information, they often struggle to preserve the utility of the images. In this work, we introduce a training-free method for face anonymization that preserves key non-identity-related attributes. Our approach utilizes a pre-trained text-to-image diffusion model without requiring optimization or training. It begins by inverting the input image to recover its initial noise. The noise is then denoised through an identity-conditioned diffusion process, where modified identity embeddings ensure the anonymized face is distinct from the original identity. Our approach also supports localized anonymization, giving users control over which facial regions are anonymized or kept intact. Comprehensive evaluations against state-of-the-art methods show our approach excels in anonymization, attribute preservation, and image quality. Its flexibility, robustness, and practicality make it well-suited for real-world applications. Code and data can be found at https://github.com/hanweikung/nullface .
中文: 本文提出了一种无需训练的人脸匿名化方法,利用预训练的扩散模型在保护非身份属性和图像质量的同时有效隐藏身份,并支持局部匿名化,适用于实际应用。
English: This paper introduces a training-free face anonymization method using a pre-trained diffusion model that effectively obscures identity while preserving non-identity attributes and image quality, supporting localized anonymization for practical use.

Authors:Zhuoguang Chen, Kenan Li, Xiuyu Yang, Tao Jiang, Yiming Li, Hang Zhao
Title: TrackOcc: Camera-based 4D Panoptic Occupancy Tracking
Abstract:
Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at https://github.com/Tsinghua-MARS-Lab/TrackOcc.
中文: 本文提出了基于摄像头的4D全景占用跟踪这一新任务,将全景占用分割与目标跟踪相结合,并开发了TrackOcc这一端到端方法,在Waymo数据集上实现了最先进的性能。
English: This paper introduces Camera-based 4D Panoptic Occupancy Tracking, a novel task combining panoptic occupancy segmentation and object tracking from camera-only input, and proposes TrackOcc, an end-to-end method that achieves state-of-the-art performance on the Waymo dataset.

Authors:Armando Fortes, Tianyi Wei, Shangchen Zhou, Xingang Pan
Title: Bokeh Diffusion: Defocus Blur Control in Text-to-Image Diffusion Models
Abstract:
Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts; however, while traditional photography offers precise control over camera settings to shape visual aesthetics - such as depth-of-field via aperture - current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently alters the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations, providing diverse scenes and subjects as well as supervision to learn the separation of image content from lens blur. Central to our framework is our grounded self-attention mechanism, trained on image pairs with different bokeh levels of the same scene, which enables blur strength to be adjusted in both directions while preserving the underlying scene. Extensive experiments demonstrate that our approach enables flexible, lens-like blur control, supports downstream applications such as real image editing via inversion, and generalizes effectively across both Stable Diffusion and FLUX architectures.

Authors:Chen Liao, Yan Shen, Dan Li, Zhongli Wang
Title: Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing
Abstract:
Recently, Deep Unfolding Networks (DUNs) have achieved impressive reconstruction quality in the field of image Compressive Sensing (CS) by unfolding iterative optimization algorithms into neural networks. The reconstruction quality of DUNs depends on the learned prior knowledge, so introducing stronger prior knowledge can further improve reconstruction quality. On the other hand, pre-trained diffusion models contain powerful prior knowledge and have a solid theoretical foundation and strong scalability, but they require a large number of iterative steps to achieve reconstruction. In this paper, we propose to use the powerful prior knowledge of a pre-trained diffusion model in DUNs to achieve high-quality reconstruction with fewer steps for image CS. Specifically, we first design an iterative optimization algorithm named Diffusion Message Passing (DMP), which embeds a pre-trained diffusion model into each iteration of DMP. Then, we deeply unfold the DMP algorithm into a neural network named DMP-DUN. The proposed DMP-DUN can use lightweight neural networks to achieve mapping from measurement data to the intermediate steps of the reverse diffusion process and directly approximate the divergence of the diffusion model, thereby further improving reconstruction efficiency. Extensive experiments show that our proposed DMP-DUN achieves state-of-the-art performance and requires as few as 2 steps to reconstruct the image. Codes are available at https://github.com/FengodChen/DMP-DUN-CVPR2025.
Chinese: 深度展开网络(DUNs)利用预训练扩散模型的强大先验知识,在图像压缩感知中以更少的步骤实现高质量重建,所提出的DMP-DUN方法仅需两步即可达到最先进的性能。
English: Deep Unfolding Networks (DUNs) leverage the powerful prior knowledge of pre-trained diffusion models to achieve high-quality image reconstruction in compressive sensing with significantly fewer steps, as demonstrated by the proposed DMP-DUN method requiring only two steps for state-of-the-art results.

Authors:Qiming Xia, Wenkai Lin, Haoen Xiang, Xun Huang, Siheng Chen, Zhen Dong, Cheng Wang, Chenglu Wen
Title: Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels
Abstract:
Unsupervised 3D object detection serves as an important solution for offline 3D object annotation. However, due to the data sparsity and limited views, the clustering-based label fitting in unsupervised object detection often generates low-quality pseudo-labels. Multi-agent collaborative datasets, which involve the sharing of complementary observations among agents, hold the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, termed DOtA, without using external labels. DOtA first uses the internally shared ego-pose and ego-shape of collaborative agents to initialize the detector, leveraging the generalization performance of neural networks to infer preliminary labels. Subsequently, DOtA uses the complementary observations between agents to perform multi-scale encoding on preliminary labels, then decodes high-quality and low-quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby enhancing the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that our DOtA outperforms state-of-the-art unsupervised 3D object detection methods. Additionally, we also validate the effectiveness of the DOtA labels under various collaborative perception frameworks. The code is available at https://github.com/xmuqimingxia/DOtA.
Chinese: DOtA是一种无监督3D物体检测方法,通过多智能体激光雷达数据,利用多尺度编码和特征学习生成高质量伪标签,在无需外部标注的情况下实现了最先进的性能。
English: DOtA is an unsupervised 3D object detection method that leverages multi-agent LiDAR data to generate high-quality pseudo-labels through multi-scale encoding and feature learning, achieving state-of-the-art performance without external annotations.

Authors:Zhanjie Zhang, Quanwei Zhang, Guangyuan Li, Junsheng Luan, Mengyuan Yang, Yun Wang, Lei Zhao
Title: DyArtbank: Diverse Artistic Style Transfer via Pre-trained Stable Diffusion and Dynamic Style Prompt Artbank
Abstract:
Artistic style transfer aims to transfer the learned style onto an arbitrary content image. However, most existing style transfer methods can only render consistent artistic stylized images, making it difficult for users to get enough stylized images to enjoy. To solve this issue, we propose a novel artistic style transfer framework called DyArtbank, which can generate diverse and highly realistic artistic stylized images. Specifically, we introduce a Dynamic Style Prompt ArtBank (DSPA), a set of learnable parameters. It can learn and store the style information from the collection of artworks, dynamically guiding pre-trained stable diffusion to generate diverse and highly realistic artistic stylized images. DSPA can also generate random artistic image samples with the learned style information, providing a new idea for data augmentation. Besides, a Key Content Feature Prompt (KCFP) module is proposed to provide sufficient content prompts for pre-trained stable diffusion to preserve the detailed structure of the input content image. Extensive qualitative and quantitative experiments verify the effectiveness of our proposed method. Code is available: https://github.com/Jamie-Cheung/DyArtbank
中文:DyArtbank框架通过动态风格提示艺术库和关键内容特征提示模块,能够生成多样化且高度逼真的艺术风格化图像,同时保持内容结构,解决了传统风格迁移方法的局限性。
English: The DyArtbank framework introduces a Dynamic Style Prompt ArtBank and Key Content Feature Prompt module to generate diverse, highly realistic artistic stylized images while preserving content structure, addressing limitations of conventional style transfer methods.

Authors:Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, Wenjie Pei
Title: Recognition-Synergistic Scene Text Editing
Abstract:
Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at https://github.com/ZhengyaoFang/RS-STE.
中文: RS-STE是一种创新的场景文本编辑方法,它将文本识别与编辑集成在统一框架中,利用识别模型隐式分离样式与内容,在基准测试中达到最优性能,并能提升下游识别任务的效果。
English: RS-STE is a novel scene text editing method that integrates text recognition and editing in a unified framework, leveraging the recognition model to implicitly disentangle style and content, achieving state-of-the-art performance on benchmarks and enhancing downstream recognition tasks.

Authors:Susu Sun, Dominique van Midden, Geert Litjens, Christian F. Baumgartner
Title: Prototype-Based Multiple Instance Learning for Gigapixel Whole Slide Image Classification
Abstract:
Multiple Instance Learning (MIL) methods have succeeded remarkably in histopathology whole slide image (WSI) analysis. However, most MIL models only offer attention-based explanations that do not faithfully capture the model's decision mechanism and do not allow human-model interaction. To address these limitations, we introduce ProtoMIL, an inherently interpretable MIL model for WSI analysis that offers user-friendly explanations and supports human intervention. Our approach employs a sparse autoencoder to discover human-interpretable concepts from the image feature space, which are then used to train ProtoMIL. The model represents predictions as linear combinations of concepts, making the decision process transparent. Furthermore, ProtoMIL allows users to perform model interventions by altering the input concepts. Experiments on two widely used pathology datasets demonstrate that ProtoMIL achieves a classification performance comparable to state-of-the-art MIL models while offering intuitively understandable explanations. Moreover, we demonstrate that our method can eliminate reliance on diagnostically irrelevant information via human intervention, guiding the model toward being right for the right reason. Code will be publicly available at https://github.com/ss-sun/ProtoMIL.
中文: ProtoMIL提出了一种可解释的多示例学习模型,通过人类可理解的概念进行透明预测并支持用户干预,在保持与先进方法相当性能的同时提供直观解释,用于组织病理学全切片图像分析。
English: ProtoMIL introduces an interpretable multiple instance learning model for histopathology whole slide image analysis that uses human-understandable concepts for transparent predictions and allows user intervention, achieving performance comparable to state-of-the-art methods while providing intuitive explanations.
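A rough sketch of the interpretable prediction head described above: slide-level logits are a linear combination of pooled concept activations, and intervention amounts to masking concepts. The mean pooling and the class/head names below are simplifications for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ConceptLinearHead(nn.Module):
    """Toy ProtoMIL-style head: the prediction is a linear combination of pooled
    concept activations, so each concept's contribution is directly readable."""
    def __init__(self, n_concepts, n_classes):
        super().__init__()
        self.fc = nn.Linear(n_concepts, n_classes, bias=True)

    def forward(self, concept_acts, keep_mask=None):
        # concept_acts: (n_patches, n_concepts), e.g. from a sparse autoencoder
        if keep_mask is not None:                  # human intervention: zero out concepts
            concept_acts = concept_acts * keep_mask
        bag = concept_acts.mean(dim=0)             # simple mean pooling over patches
        return self.fc(bag), bag * self.fc.weight  # logits, per-class per-concept contributions

head = ConceptLinearHead(n_concepts=32, n_classes=2)
acts = torch.rand(100, 32)
mask = torch.ones(32); mask[5] = 0.0               # drop a diagnostically irrelevant concept
logits, contrib = head(acts, keep_mask=mask)
print(logits.shape, contrib.shape)
```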

Authors:Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang
Title: Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens
Abstract:
Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (Layton) that bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens-a 16 times compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of LDM as the decoder results in color and brightness discrepancies. Thus, we convert it to latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with 10.8 reconstruction Frechet Inception Distance on MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. We also extend Layton to a text-to-image generation model, LaytonGen, working in autoregression. It achieves 0.73 score on GenEval benchmark, surpassing current state-of-the-art methods. Project homepage: https://github.com/OPPO-Mente-Lab/Layton
中文摘要:Latent Consistency Tokenizer (Layton) 通过集成变换器编码器、量化码本和潜在一致性解码器,仅用256个标记高效表示高分辨率图像,在图像重建保真度和文本到图像生成方面均超越现有最优方法。
English Summary: The Latent Consistency Tokenizer (Layton) efficiently represents high-resolution images using only 256 tokens by integrating a transformer encoder, quantized codebook, and latent consistency decoder, achieving superior reconstruction fidelity and outperforming state-of-the-art methods in text-to-image generation.

Authors:Fabian Isensee, Maximilian Rokuss, Lars Krämer, Stefan Dinkelacker, Ashis Ravindran, Florian Stritzke, Benjamin Hamm, Tassilo Wald, Moritz Langenberg, Constantin Ulrich, Jonathan Deissler, Ralf Floca, Klaus Maier-Hein
Title: nnInteractive: Redefining 3D Promptable Segmentation
Abstract:
Accurate and efficient 3D segmentation is essential for both clinical and research applications. While foundation models like SAM have revolutionized interactive segmentation, their 2D design and domain shift limitations make them ill-suited for 3D medical images. Current adaptations address some of these challenges but remain limited, either lacking volumetric awareness, offering restricted interactivity, or supporting only a small set of structures and modalities. Usability also remains a challenge, as current tools are rarely integrated into established imaging platforms and often rely on cumbersome web-based interfaces with restricted functionality. We introduce nnInteractive, the first comprehensive 3D interactive open-set segmentation method. It supports diverse prompts, including points, scribbles, boxes, and a novel lasso prompt, while leveraging intuitive 2D interactions to generate full 3D segmentations. Trained on 120+ diverse volumetric 3D datasets (CT, MRI, PET, 3D Microscopy, etc.), nnInteractive sets a new state-of-the-art in accuracy, adaptability, and usability. Crucially, it is the first method integrated into widely used image viewers (e.g., Napari, MITK), ensuring broad accessibility for real-world clinical and research applications. Extensive benchmarking demonstrates that nnInteractive far surpasses existing methods, setting a new standard for AI-driven interactive 3D segmentation. nnInteractive is publicly available: https://github.com/MIC-DKFZ/napari-nninteractive (Napari plugin), https://www.mitk.org/MITK-nnInteractive (MITK integration), https://github.com/MIC-DKFZ/nnInteractive (Python backend).
Chinese: nnInteractive是首个全面的3D交互式开放集分割方法,支持多种提示并利用直观的2D交互生成完整3D分割,在精度和平台集成度上达到新标杆,已整合至主流医学影像平台。
English: nnInteractive is the first comprehensive 3D interactive open-set segmentation method that supports diverse prompts and leverages 2D interactions to generate full 3D segmentations, achieving state-of-the-art accuracy and broad integration into popular imaging platforms.

Authors:Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Title: Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis
Abstract:
Recent image generation schemes typically capture the image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g. rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g. gFID). In this paper, we comprehensively analyze the reason for the discrepancy between reconstruction and generation qualities in a discrete latent space, and, from this analysis, propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., the unexpected tokens sampled, from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a ~400M generator. Code: https://github.com/lxa9867/ImageFolder.
中文: 本文提出一种潜在扰动方法模拟生成过程中的采样噪声,开发了新的分词器评估指标pFID和即插即用训练方案,显著提升了分词器鲁棒性和图像生成质量。
English: This paper introduces a latent perturbation method to simulate generative sampling noise, proposing a new tokenizer evaluation metric (pFID) and a plug-and-play training scheme that significantly improves tokenizer robustness and generation quality.
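One simple way to simulate the sampling errors the abstract describes is to randomly replace a fraction of the discrete token indices with other codebook entries during tokenizer training; the sketch below illustrates that idea only, and the paper's perturbation scheme and rates may differ.

```python
import torch

def perturb_tokens(token_ids, codebook_size, p=0.1, generator=None):
    """Toy latent perturbation: replace a fraction p of discrete tokens with
    random codebook entries to mimic sampling errors of the generator."""
    noise_mask = torch.rand(token_ids.shape, generator=generator) < p
    random_ids = torch.randint(0, codebook_size, token_ids.shape, generator=generator)
    return torch.where(noise_mask, random_ids, token_ids)

ids = torch.randint(0, 1024, (2, 256))           # (batch, tokens per image)
perturbed = perturb_tokens(ids, codebook_size=1024, p=0.1)
print((perturbed != ids).float().mean().item())  # roughly 0.1, minus chance collisions
```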

Authors:Bin Huang, Binzhong He, Yanhan Chen, Zhili Liu, Xinyue Wang, Binxuan Li, Qiegen Liu
Title: Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework
Abstract:
Deep learning has significantly advanced PET image reconstruction, achieving remarkable improvements in image quality through direct training on sinogram or image data. Traditional methods often utilize masks for inpainting tasks, but their incorporation into PET reconstruction frameworks introduces transformative potential. In this study, we propose an advanced PET reconstruction framework called Diffusion tRansformer mEets rAndom Masks (DREAM). To the best of our knowledge, this is the first work to integrate mask mechanisms into both the sinogram domain and the latent space, pioneering their role in PET reconstruction and demonstrating their ability to enhance reconstruction fidelity and efficiency. The framework employs a high-dimensional stacking approach, transforming masked data from two to three dimensions to expand the solution space and enable the model to capture richer spatial relationships. Additionally, a mask-driven latent space is designed to accelerate the diffusion process by leveraging sinogram-driven and mask-driven compact priors, which reduce computational complexity while preserving essential data characteristics. A hierarchical masking strategy is also introduced, guiding the model from focusing on fine-grained local details in the early stages to capturing broader global patterns over time. This progressive approach ensures a balance between detailed feature preservation and comprehensive context understanding. Experimental results demonstrate that DREAM not only improves the overall quality of reconstructed PET images but also preserves critical clinical details, highlighting its potential to advance PET imaging technology. By integrating compact priors and hierarchical masking, DREAM offers a promising and efficient avenue for future research and application in PET imaging. The open-source code is available at: https://github.com/yqx7150/DREAM.
中文摘要:DREAM框架通过在正弦图和潜在空间中引入掩码机制,结合分层掩码策略和紧凑先验,显著提升了PET图像重建质量与效率,同时保留了关键临床细节。
English Summary: The DREAM framework introduces a novel PET image reconstruction method by integrating mask mechanisms in both sinogram and latent domains, employing hierarchical masking and compact priors to enhance image quality while preserving clinical details efficiently.

Authors:Runwei Guan, Jianan Liu, Ningwei Ouyang, Shaofeng Liang, Daizong Liu, Xiaolou Sun, Lianqing Zheng, Ming Xu, Yutao Yue, Guoqiang Mao, Hui Xiong
Title: Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving
Abstract:
Embodied outdoor scene understanding forms the foundation for autonomous agents to perceive, analyze, and react to dynamic driving environments. However, existing 3D understanding is predominantly based on 2D Vision-Language Models (VLMs), which collect and process limited scene-aware contexts. In contrast, compared to the 2D planar visual information, point cloud sensors such as LiDAR provide rich depth and fine-grained 3D representations of objects. Even better the emerging 4D millimeter-wave radar detects the motion trend, velocity, and reflection intensity of each object. The integration of these two modalities provides more flexible querying conditions for natural language, thereby supporting more accurate 3D visual grounding. To this end, we propose a novel method called TPCNet, the first outdoor 3D visual grounding model upon the paradigm of prompt-guided point cloud sensor combination, including both LiDAR and radar sensors. To optimally combine the features of these two sensors required by the prompt, we design a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion. Specifically, this paradigm initially employs Bidirectional Agent Cross-Attention (BACA), which feeds both-sensor features, characterized by global receptive fields, to the text features for querying. Moreover, we design a Dynamic Gated Graph Fusion (DGGF) module to locate the regions of interest identified by the queries. To further enhance accuracy, we devise an C3D-RECHead, based on the nearest object edge to the ego-vehicle. Experimental results demonstrate that our TPCNet, along with its individual modules, achieves the state-of-the-art performance on both the Talk2Radar and Talk2Car datasets. We release the code at https://github.com/GuanRunwei/TPCNet.
中文: 具身户外场景理解对自主智能体至关重要,提出的TPCNet模型通过融合激光雷达与雷达传感器,采用创新的多模态融合范式,实现了最先进的3D视觉定位性能。
English: Embodied outdoor scene understanding is crucial for autonomous agents, and the proposed TPCNet model integrates LiDAR and radar sensors with a novel fusion paradigm to achieve state-of-the-art 3D visual grounding accuracy.

Authors:Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Title: $^R$FLAV: Rolling Flow matching for infinite Audio Video generation
Abstract:
Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples, seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa, and limitless video duration. In this paper, we present $^R$-FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
中文: 本文提出$^R$-FLAV模型,采用基于Transformer的架构,通过高效的跨模态融合模块解决了音视频生成中的质量、同步性和时序一致性难题,在性能上超越了现有最优方法。
English: This paper introduces $^R$-FLAV, a transformer-based model that overcomes key challenges in joint audio-video generation by ensuring high-quality output, seamless synchronization, and temporal coherence through an efficient cross-modality fusion module, outperforming existing state-of-the-art methods.

Authors:Chengjun Yu, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
Title: HERO: Human Reaction Generation from Videos
Abstract:
Human reaction generation represents a significant research domain for interactive AI, as humans constantly interact with their surroundings. Previous works focus mainly on synthesizing the reactive motion given a human motion sequence. This paradigm limits interaction categories to human-human interactions and ignores emotions that may influence reaction generation. In this work, we propose to generate 3D human reactions from RGB videos, which involves a wider range of interaction categories and naturally provides information about expressions that may reflect the subject's emotions. To cope with this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted interaction intention to guide the synthesis of the reaction. Besides, local visual representations are continuously injected into the model to maximize the exploitation of the dynamic properties inherent in videos. Furthermore, the ViMo dataset containing paired Video-Motion data is collected to support the task. In addition to human-human interactions, these video-motion pairs also cover animal-human interactions and scene-human interactions. Extensive experiments demonstrate the superiority of our methodology. The code and dataset will be publicly available at https://jackyu6.github.io/HERO.
中文: 本研究提出HERO框架,通过整合视频的全局和局部表征,从RGB视频生成涵盖更广交互类别和情感线索的3D人体反应,并在新收集的ViMo数据集上验证了其优越性。
English: This study introduces HERO, a framework for generating 3D human reactions from RGB videos that captures a broader range of interactions and emotional cues by integrating global and local video representations, validated on the new ViMo dataset.

Authors:Saad Sohail, Muhammad Usama, Usman Ghous, Manuel Mazzara, Salvatore Distefano, Muhammad Ahmad
Title: EnergyFormer: Energy Attention with Fourier Embedding for Hyperspectral Image Classification
Abstract:
Hyperspectral imaging (HSI) provides rich spectral-spatial information across hundreds of contiguous bands, enabling precise material discrimination in applications such as environmental monitoring, agriculture, and urban analysis. However, the high dimensionality and spectral variability of HSI data pose significant challenges for feature extraction and classification. This paper presents EnergyFormer, a transformer-based framework designed to address these challenges through three key innovations: (1) Multi-Head Energy Attention (MHEA), which optimizes an energy function to selectively enhance critical spectral-spatial features, improving feature discrimination; (2) Fourier Position Embedding (FoPE), which adaptively encodes spectral and spatial dependencies to reinforce long-range interactions; and (3) Enhanced Convolutional Block Attention Module (ECBAM), which selectively amplifies informative wavelength bands and spatial structures, enhancing representation learning. Extensive experiments on the WHU-Hi-HanChuan, Salinas, and Pavia University datasets demonstrate that EnergyFormer achieves exceptional overall accuracies of 99.28%, 98.63%, and 98.72%, respectively, outperforming state-of-the-art CNN, transformer, and Mamba-based models. The source code will be made available at https://github.com/mahmad000.
中文:EnergyFormer是一种基于Transformer的框架,通过多头部能量注意力、傅里叶位置编码和增强卷积注意力模块三大创新,有效应对高光谱影像的高维度和光谱变异性挑战,在多个基准数据集上实现了卓越的分类精度。
English: EnergyFormer is a transformer-based framework that introduces three innovations—Multi-Head Energy Attention, Fourier Position Embedding, and Enhanced Convolutional Block Attention Module—to effectively handle the high dimensionality and spectral variability of hyperspectral imaging data, achieving superior classification accuracy on benchmark datasets.
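For context, a standard Fourier-feature position encoding looks like the sketch below; the paper's FoPE additionally adapts to spectral and spatial dependencies, which this minimal example does not capture, so treat it as background rather than the method itself.

```python
import math
import torch

def fourier_position_embedding(positions, num_bands=8, max_freq=10.0):
    """Toy Fourier embedding: encode normalized positions with sin/cos at
    log-spaced frequencies; output shape is (N, 2 * num_bands * dims)."""
    freqs = torch.logspace(0.0, math.log10(max_freq), num_bands)  # (num_bands,)
    angles = positions.unsqueeze(-1) * freqs * math.pi            # (N, dims, num_bands)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)         # (N, dims, 2*num_bands)
    return emb.flatten(start_dim=1)

pos = torch.rand(16, 2)                       # e.g. normalized (row, col) patch positions
print(fourier_position_embedding(pos).shape)  # torch.Size([16, 32])
```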

Authors:Ao Li, Zongfang Liu, Xinhua Li, Jinghui Zhang, Pengwei Wang, Hu Wang
Title: Modeling Variants of Prompts for Vision-Language Models
Abstract:
Large pre-trained vision-language models (VLMs) offer a promising approach to leveraging human language for enhancing downstream tasks. However, VLMs such as CLIP face a significant limitation: their performance is highly sensitive to prompt template design. Although prompt learning methods can address the sensitivity issue by replacing natural language prompts with learnable ones, they are incomprehensible to humans. Ensuring consistent performance across various prompt templates enables models to adapt seamlessly to diverse phrasings, enhancing their ability to handle downstream tasks without requiring extensive prompt engineering. In this work, we introduce the RobustPrompt Benchmark, a systematic benchmark to evaluate robustness to different prompt templates for VLMs. It includes a dataset with hundreds of carefully designed prompt templates, divided into six types, covering a wide variety of commonly used templates. Besides the benchmark, we propose Modeling Variants of Prompts (MVP), a simple yet effective method that mitigates sensitivity by modeling variants of prompt structures. The innovation of MVP lies in decoupling prompts into templates and class names, and using Variational Autoencoders (VAE) to model the distribution of diverse prompt structures. Experiments across 11 datasets demonstrate that MVP can greatly enhance model robustness to variations in input prompts without a drop in performance. The code is available at https://github.com/liaolea/MVP.
中文摘要:大型视觉语言模型对提示模板设计敏感,而提出的MVP方法通过变分自编码器建模提示结构变体,在保持性能的同时显著提升了模型对不同提示模板的鲁棒性。
English Summary: Large vision-language models are sensitive to prompt design, but the proposed MVP method enhances robustness by modeling prompt variations with variational autoencoders, maintaining performance across diverse templates.
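A minimal VAE-style head for modeling template variation is sketched below, assuming templates and class names are embedded separately as the abstract describes; the dimensions, module names, and training objective are placeholders and not the MVP architecture itself.

```python
import torch
import torch.nn as nn

class TemplateVariantModel(nn.Module):
    """Toy sketch: sample prompt-template embeddings from a learned Gaussian,
    keeping class-name embeddings decoupled from the template distribution."""
    def __init__(self, dim=512, latent=64):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * latent)   # -> mean and log-variance
        self.dec = nn.Linear(latent, dim)

    def forward(self, template_emb):
        mu, logvar = self.enc(template_emb).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

model = TemplateVariantModel()
recon, mu, logvar = model(torch.randn(4, 512))
print(recon.shape)
```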

Authors:Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Kangjie Chen, Minghan Qin, Yu Li, Haoqian Wang
Title: HRAvatar: High-Quality and Relightable Gaussian Head Avatar
Abstract:
Reconstructing animatable and high-quality 3D head avatars from monocular videos, especially with realistic relighting, is a valuable task. However, the limited information from single-view input, combined with the complex head poses and facial movements, makes this challenging. Previous methods achieve real-time performance by combining 3D Gaussian Splatting with a parametric head model, but the resulting head quality suffers from inaccurate face tracking and limited expressiveness of the deformation model. These methods also fail to produce realistic effects under novel lighting conditions. To address these issues, we propose HRAvatar, a 3DGS-based method that reconstructs high-fidelity, relightable 3D head avatars. HRAvatar reduces tracking errors through end-to-end optimization and better captures individual facial deformations using learnable blendshapes and learnable linear blend skinning. Additionally, it decomposes head appearance into several physical properties and incorporates physically-based shading to account for environmental lighting. Extensive experiments demonstrate that HRAvatar not only reconstructs superior-quality heads but also achieves realistic visual effects under varying lighting conditions.

Authors:Junbin Xiao, Nanxin Huang, Hao Qiu, Zhulin Tao, Xun Yang, Richang Hong, Meng Wang, Angela Yao
Title: EgoBlind: Towards Egocentric Visual Assistance for the Blind
Abstract:
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 videos that record the daily lives of real blind users from a first-person perspective. It also features 5,311 questions directly posed or generated and verified by blind individuals to reflect their in-situation needs for visual assistance under various scenarios. We provide each question with an average of 3 reference answers to alleviate subjective evaluation. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle, with the best performers achieving accuracy near 60%, far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope EgoBlind can serve as a valuable foundation for developing more effective AI assistants to enhance the independence of the blind individuals' lives. Data and evaluation code are available at https://github.com/doc-doc/EgoBlind.
中文: EgoBlind是首个基于盲人视角的视觉问答数据集,研究表明现有MLLMs在视觉辅助任务中表现远逊于人类,最优模型准确率仅达60%,而人类表现高达87.4%。
English: EgoBlind is the first egocentric VideoQA dataset from blind individuals, revealing that current MLLMs perform significantly below human levels in visual assistance tasks, with the best model achieving only 60% accuracy compared to humans' 87.4%.

Authors:Yuan Tian, Kaiyuan Ji, Rongzhao Zhang, Yankai Jiang, Chunyi Li, Xiaosong Wang, Guangtao Zhai
Title: Towards All-in-One Medical Image Re-Identification
Abstract:
Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models, in terms of the differential features. Compared to single-image features, modeling the inter-image difference better fits the re-identification problem, which involves discriminating multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique to two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Code and models are available at https://github.com/tianyuan168326/All-in-One-MedReID-Pytorch.
中文摘要:本文针对医学图像重识别提出了一个全面基准和统一模型,通过新型连续模态参数适配器实现跨医学模态的自适应学习,并利用预训练医学基础模型的差异特征对齐整合医学先验知识,在多个数据集上展现出优越性能。
English Summary: This paper introduces a comprehensive benchmark and a unified model for medical image re-identification, featuring a novel Continuous Modality-based Parameter Adapter that enables adaptive learning across diverse medical modalities and integrates medical priors through differential feature alignment with pre-trained foundation models.

Authors:Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, Xiaodan Liang
Title: WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
Abstract:
Recent rapid advancements in text-to-video (T2V) generation, such as SoRA and Kling, have shown great potential for building world simulators. However, current T2V models struggle to grasp abstract physical principles and generate videos that adhere to physical laws. This challenge arises primarily from a lack of clear guidance on physical information due to a significant gap between abstract physical principles and generation models. To this end, we introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. Specifically, WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. To effectively embed these physical attributes into the generation process, WISA incorporates several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier, enhancing the model's physics awareness. Furthermore, most existing datasets feature videos where physical phenomena are either weakly represented or entangled with multiple co-occurring processes, limiting their suitability as dedicated resources for learning explicit physical principles. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories. It consists of 32,000 videos, representing 17 physical laws across three domains of physics: dynamics, thermodynamics, and optics. Experimental results demonstrate that WISA can effectively enhance the compatibility of T2V models with real-world physical laws, achieving a considerable improvement on the VideoPhy benchmark. The visual exhibitions of WISA and WISA-32K are available in the https://360cvgroup.github.io/WISA/.

Authors:Chengzhi Ma, Kunqian Li, Shuaixin Liu, Han Mei
Title: Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding
Abstract:
Indiscernible marine object counting encounters numerous challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. These factors significantly complicate the counting process. To address the scarcity of video-based indiscernible object counting datasets, we have developed a novel dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset accurately represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experimental results demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three additional video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at https://github.com/OUCVisionGroup/VIMOC-Net.
Chinese: 本文通过提出新型视频数据集和具备自适应运动特征编码的深度辅助网络,解决了难以辨识海洋物体计数的挑战,在自建及现有数据集上均实现了最优性能。
English: This paper addresses the challenges of indiscernible marine object counting by introducing a novel video dataset and a depth-assisted network with adaptive motion-differentiated feature encoding, achieving state-of-the-art performance on both the proposed and existing datasets.

Authors:Zhifeng Xie, Qile He, Youjia Zhu, Qiwei He, Mengtian Li
Title: FilmComposer: LLM-Driven Music Production for Silent Film Clips
Abstract:
In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, simulating the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film (audio quality, musicality, and musical development) and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mix. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, we propose MusicPro-7k, which includes 7,418 film clips, music, descriptions, rhythm spots, and main melodies, addressing the lack of a professional, high-quality film music dataset. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: https://apple-jun.github.io/FilmComposer.github.io/

Authors:Ethan Griffiths, Maryam Haghighat, Simon Denman, Clinton Fookes, Milad Ramezani
Title: HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views
Abstract:
We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based TransFormer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. We propose an octree-based multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from spinning lidar, we present cylindrical octree attention windows to reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce CS-Wild-Places, a novel 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in CS-Wild-Places contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. HOTFormerLoc achieves a top-1 average recall improvement of 5.5% - 11.5% on the CS-Wild-Places benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 4.9% on well-established urban and forest datasets. The code and CS-Wild-Places benchmark is available at https://csiro-robotics.github.io/HOTFormerLoc.
中文: HOTFormerLoc是一种基于分层八叉树的变换器,通过多尺度注意力机制和新型跨源数据集CS-Wild-Places,在多种环境下的地面与空中三维场景识别中实现了最先进的性能。
English: HOTFormerLoc is a hierarchical octree-based transformer that achieves state-of-the-art 3D place recognition across ground-to-ground and ground-to-aerial scenarios by leveraging multi-scale attention mechanisms and a novel cross-source dataset, CS-Wild-Places.

Authors:Huy Nguyen, Kien Nguyen, Akila Pemasiri, Feng Liu, Sridha Sridharan, Clinton Fookes
Title: AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification
Abstract:
We introduce AG-VPReID, a new large-scale dataset for aerial-ground video-based person re-identification (ReID) that comprises 6,632 subjects, 32,321 tracklets and over 9.6 million frames captured by drones (altitudes ranging from 15-120m), CCTV, and wearable cameras. This dataset offers a real-world benchmark for evaluating the robustness to significant viewpoint changes, scale variations, and resolution differences in cross-platform aerial-ground settings. In addition, to address these challenges, we propose AG-VPReID-Net, an end-to-end framework composed of three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and facilitating temporal feature learning, (2) a Normalized Appearance Stream leveraging physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. We integrate visual-semantic cues from all streams to form a robust, viewpoint-invariant whole-body representation. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. Nevertheless, the performance gap observed on AG-VPReID across all methods underscores the dataset's challenging nature. The dataset, code and trained models are available at https://github.com/agvpreid25/AG-VPReID-Net.
中文: 我们提出了AG-VPReID大规模空地视频行人重识别数据集,并开发了AG-VPReID-Net端到端框架,该框架通过三个互补流集成视觉语义线索,在应对视角和尺度变化方面展现出优越性能。
English: We introduce AG-VPReID, a large-scale aerial-ground video dataset for person re-identification, and propose AG-VPReID-Net, an end-to-end framework with three complementary streams that integrates visual-semantic cues to achieve robust performance across challenging viewpoint and scale variations.

Authors:Ruipeng Wang, Junfeng Fang, Jiaqi Li, Hao Chen, Jie Shi, Kun Wang, Xiang Wang
Title: ACE: Concept Editing in Diffusion Models without Performance Degradation
Abstract:
Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing has been proposed to address these issues, existing methods often struggle to balance the removal of unsafe concepts with maintaining the model's general generative capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concepts while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms advanced baselines, improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is available at https://github.com/littlelittlenine/ACE-zero.git
中文: ACE提出了一种交叉零空间投影方法,能在扩散模型中精确消除不安全概念的同时保持图像生成质量与语义一致性,以极低的时间成本显著提升了模型性能。
English: ACE introduces a cross null-space projection method to effectively remove unsafe concepts from diffusion models while preserving image quality and semantic consistency, achieving significant improvements in performance with minimal time cost.
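
For intuition, here is a minimal sketch of a generic null-space projection, the linear-algebra operation underlying this family of edits; the matrix K of concept directions, the dimensions, and the way the projector is applied to a weight matrix are illustrative assumptions, not the paper's exact cross null-space formulation:

import numpy as np

def null_space_projection(W, K):
    """Project a weight matrix W onto the null space of concept directions K.

    W: (d, d) weight matrix (a hypothetical edit to be applied).
    K: (k, d) rows are feature directions of the concept to be erased.
    Returns W' such that W' @ k_i = 0 for every row k_i of K.
    """
    # Orthogonal projector onto the null space of K: P = I - K^T (K K^T)^{-1} K
    P = np.eye(K.shape[1]) - K.T @ np.linalg.pinv(K @ K.T) @ K
    return W @ P

# toy usage (illustrative only)
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 64))   # 5 concept directions in a 64-dim feature space
W = rng.normal(size=(64, 64))
W_proj = null_space_projection(W, K)
print(np.abs(W_proj @ K.T).max())  # ~0: edited weights nullify the concept directions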

Authors:Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, Shanmin Pang
Title: Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
Abstract:
Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient running on edge devices. Existing pruning and distillation methods either need retraining or are designed for ViT models, which are hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification score to multiply it with the attention map to get the importance score of each key and then prune certain keys after each transformer layer according to their importance scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code can be found at https://github.com/iseri27/tg_gbc.
中文: 提出的零样本剪枝方法tgGBC通过基于分类得分的注意力机制逐步修剪transformer模块中的关键键值,在3D目标检测中实现了近两倍的解码器加速,性能损失小于1%且部分模型性能反而提升,有效验证了边缘设备部署的可行性。
English: The proposed zero-shot pruning method, tgGBC, efficiently accelerates transformer decoders in 3D object detection by trimming keys based on classification-guided importance scores, achieving near 2x speedup with minimal performance loss and even enhancing some models when deployed on edge devices.
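
A rough sketch of classification-guided key pruning under stated assumptions (the tensor shapes, the way classification scores weight the attention map, and the keep ratio are illustrative; this is not the released tgGBC code):

import torch

def prune_keys_by_importance(keys, attn_map, cls_scores, keep_ratio=0.5):
    """Zero-shot key pruning sketch for one transformer layer.

    keys:       (N, d)  key tokens
    attn_map:   (Q, N)  attention weights from queries to keys
    cls_scores: (Q,)    max classification score per query
    """
    # importance of each key: classification-weighted attention mass it receives
    importance = (cls_scores.unsqueeze(1) * attn_map).sum(dim=0)   # (N,)
    k = max(1, int(keep_ratio * keys.shape[0]))
    keep_idx = importance.topk(k).indices.sort().values
    return keys[keep_idx], keep_idx

# toy usage
keys = torch.randn(100, 256)
attn = torch.softmax(torch.randn(10, 100), dim=-1)
scores = torch.rand(10)
pruned, idx = prune_keys_by_importance(keys, attn, scores, keep_ratio=0.3)
print(pruned.shape)  # torch.Size([30, 256])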

Authors:Kai Deng, Yigong Zhang, Jian Yang, Jin Xie
Title: GigaSLAM: Large-Scale Monocular SLAM with Hierarchical Gaussian Splats
Abstract:
Tracking and mapping in large-scale, unbounded outdoor environments using only monocular RGB input presents substantial challenges for existing SLAM systems. Traditional Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) SLAM methods are typically limited to small, bounded indoor settings. To overcome these challenges, we introduce GigaSLAM, the first RGB NeRF / 3DGS-based SLAM framework for kilometer-scale outdoor environments, as demonstrated on the KITTI, KITTI 360, 4 Seasons and A2D2 datasets. Our approach employs a hierarchical sparse voxel map representation, where Gaussians are decoded by neural networks at multiple levels of detail. This design enables efficient, scalable mapping and high-fidelity viewpoint rendering across expansive, unbounded scenes. For front-end tracking, GigaSLAM utilizes a metric depth model combined with epipolar geometry and PnP algorithms to accurately estimate poses, while incorporating a Bag-of-Words-based loop closure mechanism to maintain robust alignment over long trajectories. Consequently, GigaSLAM delivers high-precision tracking and visually faithful rendering on urban outdoor benchmarks, establishing a robust SLAM solution for large-scale, long-term scenarios, and significantly extending the applicability of Gaussian Splatting SLAM systems to unbounded outdoor environments. GitHub: https://github.com/DengKaiCQ/GigaSLAM.
中文: GigaSLAM是首个基于神经辐射场和3D高斯泼溅的RGB SLAM框架,通过分层地图表示和鲁棒的姿态估计,实现了公里级户外环境的高精度追踪与逼真渲染。
English: GigaSLAM is the first RGB-based SLAM framework using Neural Radiance Fields and 3D Gaussian Splatting that enables high-precision tracking and realistic rendering for kilometer-scale outdoor environments through hierarchical mapping and robust pose estimation.
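
As a hedged illustration of the kind of front-end step described (RANSAC PnP from 3D-2D matches), here is a short sketch using OpenCV; the helper name and synthetic data are assumptions, not the GigaSLAM implementation:

import cv2
import numpy as np

def estimate_pose_pnp(pts3d, pts2d, K):
    """Estimate a camera pose from 3D-2D correspondences with RANSAC PnP.

    pts3d: (N, 3) map points (e.g. back-projected with a metric depth model)
    pts2d: (N, 2) matched pixel locations in the current frame
    K:     (3, 3) camera intrinsics
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> rotation matrix
    return R, tvec, inliers

# toy usage with synthetic, noise-free correspondences
K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
pts3d = np.random.rand(50, 3) * 5 + np.array([0, 0, 8])
proj = (K @ pts3d.T).T
pts2d = proj[:, :2] / proj[:, 2:3]
R, t, inl = estimate_pose_pnp(pts3d, pts2d, K)
print(np.round(t.ravel(), 3))   # translation close to zero for this synthetic camera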

Authors:Nadarasar Bahavan, Sachith Seneviratne, Saman Halgamuge
Title: SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models
Abstract:
The widespread use of deep learning classifiers necessitates Open-set recognition (OSR), which enables the identification of input data not only from classes known during training but also from unknown classes that might be present in test data. Many existing OSR methods are computationally expensive due to the reliance on complex generative models or suffer from high training costs. We investigate OSR from a representation-learning perspective, specifically through spherical embeddings. We introduce SphOR, a computationally efficient representation learning method that models the feature space as a mixture of von Mises-Fisher distributions. This approach enables the use of semantically ambiguous samples during training to improve the detection of samples from unknown classes. We further explore the relationship between OSR performance and key representation learning properties which influence how well features are structured in high-dimensional space. Extensive experiments on multiple OSR benchmarks demonstrate the effectiveness of our method, producing state-of-the-art results with improvements of up to 6%. Code at https://github.com/nadarasarbahavan/SpHOR
中文摘要:SphOR是一种基于球形嵌入和冯·米塞斯-费希尔分布的高效开放集识别方法,在多个基准测试中实现了最高达6%的性能提升,达到领先水平。
English Summary: SphOR is an efficient open-set recognition method using spherical embeddings with von Mises-Fisher distributions, achieving state-of-the-art performance improvements up to 6% on benchmarks.
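
For readers unfamiliar with von Mises-Fisher mixtures, a minimal sketch of the class posterior under a shared concentration is below; the shared-kappa assumption and the value 16.0 are illustrative and not taken from the paper:

import torch
import torch.nn.functional as F

def vmf_class_posterior(features, class_means, kappa=16.0):
    """Posterior over classes under a vMF mixture with shared concentration kappa.

    With equal class priors and a shared kappa, the vMF normalisers cancel and
    the posterior reduces to a softmax over scaled cosine similarities.
    features:    (B, d) raw embeddings
    class_means: (C, d) raw class direction parameters
    """
    z = F.normalize(features, dim=-1)      # embeddings on the unit sphere
    mu = F.normalize(class_means, dim=-1)  # vMF mean directions
    logits = kappa * z @ mu.t()            # kappa * cosine similarity
    return logits.softmax(dim=-1)

# toy usage
probs = vmf_class_posterior(torch.randn(4, 128), torch.randn(10, 128))
print(probs.sum(dim=-1))  # each row sums to 1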

Authors:Sanghyuk Chun, Sangdoo Yun
Title: LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
Abstract:
Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite its success in probabilistic representation learning at scale, ProLIP cannot handle texts longer than 64 tokens, which limits its ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning. We also observe a trade-off between long context understanding (measured by Urban-1k) and general zero-shot capability (measured by the DataComp evaluation datasets). Code is available at https://github.com/naver-ai/prolip
中文: 本文提出了一种针对概率语言-图像预训练(ProLIP)的微调策略,使其能处理长达256个标记的文本序列,在Urban-1k和DataComp基准测试中验证了该方法在提升长文本理解能力的同时,尽可能保持了模型的通用零样本能力。
English: This paper introduces a fine-tuning strategy for Probabilistic Language-Image Pre-Training (ProLIP) to handle longer text sequences up to 256 tokens, improving long-context understanding while maintaining general zero-shot capabilities, as validated on Urban-1k and DataComp benchmarks.

Authors:Jiequan Cui, Beier Zhu, Qingshan Xu, Zhuotao Tian, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong
Title: Generalized Kullback-Leibler Divergence Loss
Abstract:
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property along with a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Secondly, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training, and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public leaderboard -- RobustBench and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating the substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.
中文: 本文从数学上证明了KL散度损失与解耦KL(DKL)损失的等价性,提出了一种广义KL(GKL)损失,通过改进非对称优化特性和引入类间全局信息,在多个数据集上实现了最先进的对抗鲁棒性和有竞争力的知识蒸馏性能。
English: This paper mathematically demonstrates the equivalence between KL Divergence loss and Decoupled KL (DKL) loss, proposing a Generalized KL (GKL) loss with two enhancements—addressing asymmetric optimization and incorporating class-wise global information—which achieves state-of-the-art adversarial robustness and competitive knowledge distillation performance across multiple datasets.
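
As background, the temperature-scaled KL distillation loss that GKL refines can be written in a few lines; this sketch deliberately omits the paper's decoupling, smoother weight function, and class-wise global statistics:

import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    """Standard temperature-scaled KL distillation loss (the baseline GKL builds on)."""
    p_t = F.softmax(teacher_logits / T, dim=-1)            # soft labels from the teacher
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

# toy usage
loss = kd_kl_loss(torch.randn(8, 100), torch.randn(8, 100))
print(loss.item())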

Authors:Xin Yu, Tianyu Wang, Soo Ye Kim, Paul Guerrero, Xi Chen, Qing Liu, Zhe Lin, Xiaojuan Qi
Title: ObjectMover: Generative Object Movement with Video Prior
Abstract:
Simple as it seems, moving an object to another location within an image is, in fact, a challenging image-editing task that requires re-harmonizing the lighting, adjusting the pose based on perspective, accurately filling occluded regions, and ensuring coherent synchronization of shadows and reflections while maintaining the object identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is that we model this task as a sequence-to-sequence problem and fine-tune a video generation model to leverage its knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.

Authors:S M A Sharif, Rizwan Ali Naqvi, Mithun Biswas, Woong-Kee Loh
Title: Deep Perceptual Enhancement for Medical Image Analysis
Abstract:
Due to numerous hardware shortcomings, medical image acquisition devices are susceptible to producing low-quality (i.e., low contrast, inappropriate brightness, noisy, etc.) images. Regrettably, perceptually degraded images directly impact the diagnosis process and make the decision-making manoeuvre of medical practitioners notably complicated. This study proposes to enhance such low-quality images by incorporating end-to-end learning strategies for accelerating medical image analysis tasks. To the best of our knowledge, this is the first work in medical imaging which comprehensively tackles perceptual enhancement, including contrast correction, luminance correction, denoising, etc., with a fully convolutional deep network. The proposed network leverages residual blocks and a residual gating mechanism for diminishing visual artefacts and is guided by a multi-term objective function to produce perceptually plausible enhanced images. The practicability of the deep medical image enhancement method has been extensively investigated with sophisticated experiments. The experimental outcomes illustrate that the proposed method could outperform the existing enhancement methods for different medical image modalities by 5.00 to 7.00 dB in peak signal-to-noise ratio (PSNR) metrics and 4.00 to 6.00 in DeltaE metrics. Additionally, the proposed method can drastically improve the medical image analysis tasks' performance and reveals the potential of such an enhancement method in real-world applications. Code Available: https://github.com/sharif-apu/DPE_JBHI
Chinese: 本研究提出了一种开创性的全卷积深度网络,通过端到端学习全面提升低质量医学图像的感知质量,在图像增强指标和诊断任务性能上均显著优于现有方法。
English: This study introduces a pioneering fully convolutional deep network that comprehensively enhances low-quality medical images through end-to-end learning, significantly outperforming existing methods in both perceptual quality and diagnostic task performance.
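
A minimal sketch of a gated residual block, assuming a sigmoid gate on the residual branch; the paper's exact block design may differ:

import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block with a learned gate that modulates the residual branch."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        r = self.body(x)
        return x + self.gate(x) * r   # the gate can suppress residual features that cause artefacts

# toy usage
y = GatedResidualBlock(32)(torch.randn(1, 32, 64, 64))
print(y.shape)  # torch.Size([1, 32, 64, 64])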

Authors:Sudarshan Regmi
Title: AdaSCALE: Adaptive Scaling for OOD Detection
Abstract:
The ability of the deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent state-of-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature, resulting in suboptimal ID-OOD separability. In this work, we propose AdaSCALE, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample's estimated OOD likelihood. This estimation leverages our key observation: OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94 in near-OOD and 21.67 in far-OOD datasets in average FPR@95 metric on the ImageNet-1k benchmark across eight diverse architectures. The code is available at: https://github.com/sudarshanregmi/AdaSCALE/
中文: AdaSCALE提出了一种自适应缩放方法,根据样本的分布外似然度动态调整百分位阈值,通过扰动下激活偏移增强分布内外样本的可分性,实现了最先进的异常检测性能。
English: AdaSCALE introduces an adaptive scaling method that dynamically adjusts percentile thresholds based on estimated out-of-distribution likelihood, achieving state-of-the-art OOD detection performance by enhancing ID-OOD separability through activation shifts under perturbation.
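
A hedged sketch of the adaptive-threshold idea: estimate a per-sample percentile from how much the top activations move under a small perturbation. The random perturbation, the top-10% selection, and the percentile range are illustrative assumptions, not the released AdaSCALE procedure:

import torch

def adaptive_percentile(act, eps=0.05, p_lo=85.0, p_hi=97.0):
    """Pick a per-sample scaling percentile from activation stability.

    act: (B, d) non-negative penultimate activations. Samples whose top
    activations shift a lot under a small perturbation (treated as likely OOD)
    get the weaker threshold p_lo; stable samples get the stronger p_hi.
    """
    perturbed = act * (1.0 + eps * torch.randn_like(act))
    topk = act.topk(k=max(1, act.shape[1] // 10), dim=1).indices
    shift = (perturbed.gather(1, topk) - act.gather(1, topk)).abs().mean(dim=1)
    shift = (shift - shift.min()) / (shift.max() - shift.min() + 1e-8)  # scale to 0..1
    return p_hi - (p_hi - p_lo) * shift   # (B,) per-sample percentile

# toy usage
print(adaptive_percentile(torch.randn(4, 512).abs()))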

Authors:Bozhi Luan, Wengang Zhou, Hao Feng, Zhe Wang, Xiaosong Li, Houqiang Li
Title: Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
Abstract:
As the computational needs of Large Vision-Language Models (LVLMs) increase, visual token pruning has proven effective in improving inference speed and memory efficiency. Traditional pruning methods in LVLMs predominantly focus on attention scores to determine token relevance, overlooking critical aspects such as spatial position and token similarity. To this end, we introduce AdaptPrune, a novel plug-and-play training-free pruning method that builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our method is based on several observed phenomena in large models: the positional bias in the model's image attention and the redundancy of token information ignored by previous approaches. By integrating attention, spatial, and similarity information, our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions. Our method has been extensively tested across various LVLMs and benchmarks, confirming its robustness and adaptability. The results demonstrate that AdaptPrune consistently outperforms existing methods across various pruning ratios. Code is available at https://github.com/bzluan/AdaptPrune.
Chinese: AdaptPrune是一种无需训练的可视化令牌剪枝方法,通过结合注意力分数、空间距离和令牌相似性与自适应NMS,有效提升大型视觉语言模型的效率,并在多种剪枝比例下持续优于现有方法。
English: AdaptPrune is a training-free visual token pruning method that enhances LVLM efficiency by integrating attention scores, spatial distance, and token similarity with adaptive NMS, consistently outperforming existing methods across various pruning ratios.
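
To make the multi-cue scoring concrete, here is a hand-rolled sketch that combines received attention, spatial position, and redundancy into one token score; the weights, signs, and normalisations are assumptions rather than the released AdaptPrune code, and the adaptive NMS step is omitted:

import torch
import torch.nn.functional as F

def multi_cue_token_scores(attn, coords, feats, w_attn=1.0, w_dist=0.5, w_sim=0.5):
    """Score visual tokens from attention, spatial distance, and similarity cues.

    attn:   (N,)   attention received by each visual token
    coords: (N, 2) token grid positions
    feats:  (N, d) token features
    """
    center = coords.float().mean(dim=0)
    dist = (coords.float() - center).norm(dim=1)
    dist = dist / (dist.max() + 1e-8)                          # normalised spatial distance
    sim = F.normalize(feats, dim=-1)
    redundancy = (sim @ sim.t()).topk(2, dim=1).values[:, 1]   # max similarity to another token
    return w_attn * attn - w_dist * dist - w_sim * redundancy  # higher = keep

# toy usage on a 14x14 token grid
scores = multi_cue_token_scores(torch.rand(196), torch.randint(0, 14, (196, 2)), torch.randn(196, 256))
print(scores.shape)  # torch.Size([196])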

Authors:Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Kyungsu Kim
Title: DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation
Abstract:
Achieving precise panoptic segmentation relies on pixel-wise instance annotations, but obtaining such datasets is costly. Unsupervised instance segmentation (UIS) eliminates annotation requirements but struggles with adjacent instance merging and single-instance fragmentation, largely due to the limitations of DINO-based backbones which lack strong instance separation cues. Weakly-supervised panoptic segmentation (WPS) reduces annotation costs using sparse labels (e.g., points, boxes), yet these annotations remain expensive and introduce human bias and boundary errors. To address these challenges, we propose DiffEGG (Diffusion-Driven EdGe Generation), a fully annotation-free method that extracts instance-aware features from pretrained diffusion models to generate precise instance edge maps. Unlike DINO-based UIS methods, diffusion models inherently capture fine-grained, instance-aware features, enabling more precise boundary delineation. For WPS, DiffEGG eliminates annotation costs and human bias by operating without any form of manual supervision, addressing the key limitations of prior best methods. Additionally, we introduce RIP, a post-processing technique that fuses DiffEGG's edge maps with segmentation masks in a task-agnostic manner. RIP allows DiffEGG to be seamlessly integrated into various segmentation frameworks. When applied to UIS, DiffEGG and RIP achieve an average +4.4 AP improvement over prior best UIS methods. When combined with weakly-supervised semantic segmentation (WSS), DiffEGG enables WPS without instance annotations, outperforming prior best point-supervised WPS methods by +1.7 PQ. These results demonstrate that DiffEGG's edge maps serve as a cost-effective, annotation-free alternative to instance annotations, significantly improving segmentation without human intervention. Code is available at https://github.com/shjo-april/DiffEGG.
Chinese: DiffEGG提出了一种无需标注的方法,利用预训练扩散模型生成精确的实例边缘图,克服了现有无监督和弱监督分割方法的局限性,并实现了显著的性能提升。
English: DiffEGG introduces an annotation-free method using pretrained diffusion models to generate precise instance edge maps, overcoming limitations of existing unsupervised and weakly-supervised segmentation methods while achieving significant performance improvements.

Authors:Chen Liu, Feng Qiu, Wei Zhang, Lincheng Li, Dadong Wang, Xin Yu
Title: 7ABAW-Compound Expression Recognition via Curriculum Learning
Abstract:
With the advent of deep learning, expression recognition has made significant advancements. However, due to the limited availability of annotated compound expression datasets and the subtle variations of compound expressions, Compound Emotion Recognition (CE) still holds considerable potential for exploration. To advance this task, the 7th Affective Behavior Analysis in-the-wild (ABAW) competition introduces the Compound Expression Challenge based on C-EXPR-DB, a limited dataset without labels. In this paper, we present a curriculum learning-based framework that initially trains the model on single-expression tasks and subsequently incorporates multi-expression data. This design ensures that our model first masters the fundamental features of basic expressions before being exposed to the complexities of compound emotions. Specifically, our designs can be summarized as follows: 1) Single-Expression Pre-training: The model is first trained on datasets containing single expressions to learn the foundational facial features associated with basic emotions. 2) Dynamic Compound Expression Generation: Given the scarcity of annotated compound expression datasets, we employ CutMix and Mixup techniques on the original single-expression images to create hybrid images exhibiting characteristics of multiple basic emotions. 3) Incremental Multi-Expression Integration: After performing well on single-expression tasks, the model is progressively exposed to multi-expression data, allowing the model to adapt to the complexity and variability of compound expressions. The official results indicate that our method achieves the best performance in this competition track with an F-score of 0.6063. Our code is released at https://github.com/YenanLiu/ABAW7th.
中文: 本文提出一种课程学习框架,先在单一表情数据上预训练模型,再逐步引入复合表情数据,最终在ABAW7竞赛中以0.6063的F值取得了最佳性能。
English: This paper introduces a curriculum learning framework that first trains on single-expression data and then incrementally incorporates multi-expression data, achieving the best performance in the ABAW7 competition with an F-score of 0.6063.
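
A minimal sketch of the dynamic compound-expression generation step using Mixup, assuming basic-expression class indices and soft multi-label targets; the alpha value is illustrative (CutMix would swap an image patch instead):

import torch
from torch.distributions import Beta

def mixup_compound(img_a, img_b, label_a, label_b, num_classes, alpha=0.4):
    """Create a pseudo compound-expression sample from two basic expressions."""
    lam = Beta(alpha, alpha).sample().item()
    mixed = lam * img_a + (1.0 - lam) * img_b
    target = torch.zeros(num_classes)
    target[label_a] += lam           # soft weight for the first basic emotion
    target[label_b] += 1.0 - lam     # soft weight for the second basic emotion
    return mixed, target

# toy usage with 7 basic expression classes
img, tgt = mixup_compound(torch.rand(3, 224, 224), torch.rand(3, 224, 224), 0, 3, num_classes=7)
print(tgt)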

Authors:Zhongpai Gao, Benjamin Planche, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Ziyan Wu
Title: 7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting
Abstract:
Real-time rendering of dynamic scenes with view-dependent effects remains a fundamental challenge in computer graphics. While recent advances in Gaussian Splatting have shown promising results separately handling dynamic scenes (4DGS) and view-dependent effects (6DGS), no existing method unifies these capabilities while maintaining real-time performance. We present 7D Gaussian Splatting (7DGS), a unified framework representing scene elements as seven-dimensional Gaussians spanning position (3D), time (1D), and viewing direction (3D). Our key contribution is an efficient conditional slicing mechanism that transforms 7D Gaussians into view- and time-conditioned 3D Gaussians, maintaining compatibility with existing 3D Gaussian Splatting pipelines while enabling joint optimization. Experiments demonstrate that 7DGS outperforms prior methods by up to 7.36 dB in PSNR while achieving real-time rendering (401 FPS) on challenging dynamic scenes with complex view-dependent effects. The project page is: https://gaozhongpai.github.io/7dgs/.
Chinese: 7D高斯泼溅(7DGS)提出了一个统一框架,通过七维高斯表示和高效条件切片机制,在保持实时渲染(401 FPS)的同时,对具有复杂视角相关效果的动态场景实现了比现有方法最高提升7.36 dB的PSNR性能。
English: 7D Gaussian Splatting (7DGS) introduces a unified framework that efficiently renders dynamic scenes with view-dependent effects in real-time, outperforming previous methods by up to 7.36 dB in PSNR while achieving 401 FPS.
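
The conditional slicing can be pictured as ordinary Gaussian conditioning: keep the positional dimensions and condition on the observed time and view direction. The sketch below is the generic formula with an arbitrary 7x7 covariance, not the optimised 7DGS mechanism:

import torch

def condition_gaussian(mu, cov, idx_keep, idx_cond, x_cond):
    """Slice a joint Gaussian: condition the kept dims on observed dims.

    For 7DGS-style slicing, idx_keep would be the 3 position dims and idx_cond
    the 1 time + 3 view-direction dims; here the split is generic.
    """
    mu_a, mu_b = mu[idx_keep], mu[idx_cond]
    S_aa = cov[idx_keep][:, idx_keep]
    S_ab = cov[idx_keep][:, idx_cond]
    S_bb = cov[idx_cond][:, idx_cond]
    gain = S_ab @ torch.linalg.inv(S_bb)
    mu_c = mu_a + gain @ (x_cond - mu_b)   # conditional mean
    S_c = S_aa - gain @ S_ab.t()           # conditional covariance
    return mu_c, S_c

# toy usage on a random positive-definite 7x7 covariance
A = torch.randn(7, 7)
cov = A @ A.t() + torch.eye(7)
mu_c, S_c = condition_gaussian(torch.zeros(7), cov, torch.arange(3), torch.arange(3, 7), torch.randn(4))
print(mu_c.shape, S_c.shape)  # torch.Size([3]) torch.Size([3, 3])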

Authors:Andrew Gao, Jun Liu
Title: STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications
Abstract:
This paper presents a new method for anomaly detection in automated systems with time and compute sensitive requirements, such as autonomous driving, with unparalleled efficiency. As systems like autonomous driving become increasingly popular, ensuring their safety has become more important than ever. Therefore, this paper focuses on how to quickly and effectively detect various anomalies in the aforementioned systems, with the goal of making them safer and more effective. Many detection systems have been developed with great success under spatial contexts; however, there is still significant room for improvement when it comes to temporal context. While there is substantial work regarding this task, there is minimal work done regarding the efficiency of models and their ability to be applied to scenarios that require real-time inference, i.e., autonomous driving where anomalies need to be detected the moment they are within view. To address this gap, we propose STEAD (Spatio-Temporal Efficient Anomaly Detection), whose backbone is developed using (2+1)D Convolutions and Performer Linear Attention, which ensures computational efficiency without sacrificing performance. When tested on the UCF-Crime benchmark, our base model achieves an AUC of 91.34%, outperforming the previous state-of-the-art, and our fast version achieves an AUC of 88.87%, while having 99.70% less parameters and outperforming the previous state-of-the-art as well. The code and pretrained models are made publicly available at https://github.com/agao8/STEAD
Chinese: 本文提出STEAD方法,通过(2+1)D卷积和Performer线性注意力实现高效异常检测,在UCF-Crime基准测试中以极少参数量达到最优性能。
English: This paper introduces STEAD, a computationally efficient anomaly detection method using (2+1)D Convolutions and Performer Linear Attention, achieving state-of-the-art performance on the UCF-Crime benchmark with significantly reduced parameters.
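
For reference, a generic (2+1)D convolution block (a spatial 2D convolution followed by a temporal 1D convolution), the kind of factorised operator the backbone builds on; channel sizes are illustrative and this is not the released model:

import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """(2+1)D convolution: spatial conv over (H, W), then temporal conv over T."""
    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):          # x: (B, C, T, H, W)
        return self.act(self.temporal(self.act(self.spatial(x))))

# toy usage on a 16-frame clip
y = Conv2Plus1D(3, 64)(torch.randn(1, 3, 16, 56, 56))
print(y.shape)  # torch.Size([1, 64, 16, 56, 56])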

Authors:Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park
Title: BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
Abstract:
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.
Chinese: 当前基于深度学习的点云配准方法仍存在泛化性问题,需针对新环境重新训练或手动调参,但提出的BUFFER-X流程通过自适应参数设置、最远点采样和局部尺度归一化,在多种数据集上实现了无需先验知识或手动调参的零样本鲁棒性能。
English: Recent deep learning point cloud registration methods still face generalization issues, requiring retraining or manual tuning for new environments, but the proposed BUFFER-X pipeline overcomes these by adaptively setting parameters, using farthest point sampling, and applying patch-wise normalization to achieve robust zero-shot performance across diverse datasets.
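
The detector-free keypoint step can be illustrated with plain farthest point sampling; this is the textbook algorithm, not the BUFFER-X implementation:

import numpy as np

def farthest_point_sampling(points, k):
    """Select k well-spread keypoints without a learned detector.

    points: (N, 3) array; returns indices of the k sampled points.
    """
    n = points.shape[0]
    sel = np.zeros(k, dtype=np.int64)
    dist = np.full(n, np.inf)
    sel[0] = np.random.randint(n)
    for i in range(1, k):
        # keep, for every point, its distance to the nearest already-selected point
        dist = np.minimum(dist, np.linalg.norm(points - points[sel[i - 1]], axis=1))
        sel[i] = int(dist.argmax())   # pick the point farthest from the current set
    return sel

# toy usage
pts = np.random.rand(1000, 3)
print(farthest_point_sampling(pts, 16))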

Authors:Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M. Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Lester James V. Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, Peerat Limkonchotiwat
Title: Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Abstract:
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
中文摘要:SEA-VL是一个开源项目,旨在通过众包和自动化方法收集东南亚文化相关数据,以弥补该地区在视觉语言研究中的代表性不足,最终汇集了128万张图像,推动构建更具包容性的人工智能系统。
English Summary: SEA-VL is an open-source initiative addressing the underrepresentation of Southeast Asian cultures in vision-language research by creating culturally relevant datasets through crowdsourcing and automated methods, ultimately gathering 1.28M images to foster more inclusive AI systems.

Authors:Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, Andrea Nascetti
Title: Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
Abstract:
Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily employ objectives like contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models, which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation, remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. The source code is available at: https://github.com/yurujaja/SatDiFuser.
中文: SatDiFuser将生成式扩散模型转化为强大的判别式遥感预训练工具,通过融合多阶段扩散特征,在语义分割和分类任务中超越现有最优地理空间基础模型。
English: SatDiFuser transforms generative diffusion models into effective discriminative tools for remote sensing, achieving state-of-the-art performance in tasks like semantic segmentation and classification through innovative fusion of multi-stage diffusion features.

Authors:James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
Title: Video Action Differencing
Abstract:
How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
中文: 本文提出了VidDiff任务,用于识别相同动作视频间的细微差异,并创建了VidDiffBench基准数据集,该数据集对现有先进模型构成挑战;同时设计了三阶段智能工作流方法,以解决动作定位和细粒度帧比较两大核心难题。
English: This paper introduces VidDiff, a novel task for identifying subtle differences between videos of the same action, and proposes VidDiffBench, a benchmark dataset that challenges state-of-the-art models, along with a three-stage agentic workflow method to address localization and fine-grained comparison difficulties.

Authors:Anh-Kiet Duong
Title: Elderly Activity Recognition in the Wild: Results from the EAR Challenge
Abstract:
This paper presents our solution for the Elderly Action Recognition (EAR) Challenge, part of the Computer Vision for Smalls Workshop at WACV 2025. The competition focuses on recognizing Activities of Daily Living (ADLs) performed by the elderly, covering six action categories with a diverse dataset. Our approach builds upon a state-of-the-art action recognition model, fine-tuned through transfer learning on elderly-specific datasets to enhance adaptability. To improve generalization and mitigate dataset bias, we carefully curated training data from multiple publicly available sources and applied targeted pre-processing techniques. Our solution currently achieves 0.81455 accuracy on the public leaderboard, highlighting its effectiveness in classifying elderly activities. Source codes are publicly available at https://github.com/ffyyytt/EAR-WACV25-DAKiet-TSM.
Chinese: 本文介绍了老年人行为识别挑战的解决方案,通过优化先进模型并整合多源数据,在老年日常活动分类中取得了0.81455的准确率。
English: This paper details a solution for the Elderly Action Recognition Challenge, utilizing a fine-tuned state-of-the-art model and curated multi-source data to achieve 0.81455 accuracy in classifying daily activities of the elderly.

Authors:Elvis Kimara, Mozhgan Hadadi, Jackson Godbersen, Aditya Balu, Talukder Jubery, Yawei Li, Adarsh Krishnamurthy, Patrick S. Schnable, Baskar Ganapathysubramanian
Title: MaizeField3D: A Curated 3D Point Cloud and Procedural Model Dataset of Field-Grown Maize from a Diversity Panel
Abstract:
The development of artificial intelligence (AI) and machine learning (ML) based tools for 3D phenotyping, especially for maize, has been limited due to the lack of large and diverse 3D datasets. 2D image datasets fail to capture essential structural details such as leaf architecture, plant volume, and spatial arrangements that 3D data provide. To address this limitation, we present MaizeField3D (https://baskargroup.github.io/MaizeField3D/), a curated dataset of 3D point clouds of field-grown maize plants from a diverse genetic panel, designed to be AI-ready for advancing agricultural research. Our dataset includes 1,045 high-quality point clouds of field-grown maize collected using a terrestrial laser scanner (TLS). Point clouds of 520 plants from this dataset were segmented and annotated using a graph-based segmentation method to isolate individual leaves and stalks, ensuring consistent labeling across all samples. This labeled data was then used for fitting procedural models that provide a structured parametric representation of the maize plants. The leaves of the maize plants in the procedural models are represented using Non-Uniform Rational B-Spline (NURBS) surfaces that were generated using a two-step optimization process combining gradient-free and gradient-based methods. We conducted rigorous manual quality control on all datasets, correcting errors in segmentation, ensuring accurate leaf ordering, and validating metadata annotations. The dataset also includes metadata detailing plant morphology and quality, alongside multi-resolution subsampled point cloud data (100k, 50k, 10k points), which can be readily used for different downstream computational tasks. MaizeField3D will serve as a comprehensive foundational dataset for AI-driven phenotyping, plant structural analysis, and 3D applications in agricultural research.
中文: MaizeField3D数据集通过提供1045个带标注的田间玉米三维点云数据,解决了三维表型分析数据匮乏的问题,其分割的叶片茎秆结构和多分辨率数据为农业人工智能研究奠定了基础。
English: The MaizeField3D dataset addresses the scarcity of diverse 3D maize data by providing 1,045 annotated point clouds with segmented leaves and stalks, enabling AI-driven agricultural research through structured parametric models and multi-resolution data.

Authors:Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang
Title: CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models
Abstract:
Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://github.com/DDVD233/climb.
中文:临床大规模综合多模态基准(CLIMB)通过整合多样化的临床数据,为全面评估患者健康提供了基础,研究表明多任务预训练能显著提升超声和心电图等薄弱领域的性能及模型泛化能力。
English: The Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB) introduces a comprehensive dataset integrating diverse clinical data to advance holistic patient assessments, demonstrating that multitask pretraining significantly enhances performance and generalization in understudied domains like ultrasound and ECG analysis.

Authors:Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, Liang Lin
Title: Cross-modal Causal Relation Alignment for Video Question Grounding
Abstract:
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Grounding (GSG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter, ii) Cross-Modal Alignment (CMA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features, iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Codes are available at https://github.com/WissingChen/CRA-GQA.
中文摘要:本文提出的跨模态因果关联对齐(CRA)框架通过高斯平滑定位、跨模态对齐和显式因果干预三大模块,有效消除视频问答任务中的伪相关性,显著提升了时序定位精度与问答推理的鲁棒性。
English Summary: The proposed Cross-modal Causal Relation Alignment (CRA) framework addresses spurious correlations in Video Question Grounding by integrating Gaussian smoothing, cross-modal alignment, and explicit causal intervention to improve temporal localization and reasoning robustness.

Authors:Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang
Title: AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Abstract:
OpenAI o1 and DeepSeek R1 achieve or even surpass human expert-level performance in complex domains like mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tailed problems due to limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data, without further exploration of training strategies or optimizations specifically tailored for planning. In this paper, we propose AlphaDrive, a RL and reasoning framework for VLMs in autonomous driving. AlphaDrive introduces four GRPO-based RL rewards tailored for planning and employs a two-stage planning reasoning training strategy that combines SFT with RL. As a result, AlphaDrive significantly improves both planning performance and training efficiency compared to using only SFT or without reasoning. Moreover, we are also excited to discover that, following RL training, AlphaDrive exhibits some emergent multimodal planning capabilities, which is critical for improving driving safety and efficiency. To the best of our knowledge, AlphaDrive is the first to integrate GRPO-based RL with planning reasoning into autonomous driving. Code will be released to facilitate future research.
中文:AlphaDrive为自动驾驶中的视觉语言模型提出了强化学习与推理框架,通过定制化强化学习奖励和双阶段训练策略,显著提升规划性能与效率,并展现出新兴多模态规划能力。
English: AlphaDrive introduces a reinforcement learning and reasoning framework for vision-language models in autonomous driving, employing tailored RL rewards and a two-stage training strategy to significantly enhance planning performance and efficiency while developing emergent multimodal capabilities.

Authors:Ying Xu, Marius Pedersen, Kiran Raja
Title: VoD: Learning Volume of Differences for Video-Based Deepfake Detection
Abstract:
The rapid development of deep learning and generative AI technologies has profoundly transformed the digital content landscape, creating realistic Deepfakes that pose substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detection framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach with intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels with the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at https://github.com/xuyingzhongguo/VoD.
中文: 本文提出差异量(VoD)框架,通过分析视频帧间的时空不一致性来提升深度伪造检测的准确性,并在多个数据集中展现出优异的性能和适应性。
English: This paper presents the Volume of Differences (VoD) framework, which improves Deepfake detection by analyzing temporal and spatial inconsistencies in video frames and demonstrates high accuracy and adaptability across datasets.
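
The consecutive frame difference (CFD) input can be computed in one line; stacking these deltas gives the kind of "volume of differences" a detector can consume (a sketch, with tensor shapes assumed):

import torch

def consecutive_frame_differences(frames):
    """Build frame deltas from a clip.

    frames: (T, C, H, W) video tensor; returns (T-1, C, H, W) of consecutive differences.
    """
    return frames[1:] - frames[:-1]

# toy usage on a 16-frame clip
clip = torch.rand(16, 3, 224, 224)
print(consecutive_frame_differences(clip).shape)  # torch.Size([15, 3, 224, 224])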

Authors:Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou
Title: Balanced Image Stylization with Style Matching Score
Abstract:
We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.
中文:风格匹配评分(SMS)方法将图像风格化重新定义为风格分布匹配问题,通过渐进频谱正则化和语义感知梯度优化,有效平衡风格转换与内容保留,性能优于现有技术。
English: The Style Matching Score (SMS) method reframes image stylization as a style distribution matching problem, utilizing Progressive Spectrum Regularization and Semantic-Aware Gradient Refinement to effectively balance style transfer with content preservation, outperforming existing approaches.
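
A minimal sketch of a frequency-domain content constraint in the spirit of Progressive Spectrum Regularization: a radial low-pass mask whose cutoff grows over the course of optimisation. The mask shape and cutoff schedule are assumptions, not the paper's regulariser:

import torch

def lowpass_filter(img, cutoff):
    """Keep only frequencies below a radial cutoff (0..1 of the band).

    Early in optimisation a small cutoff preserves the low-frequency layout;
    later a larger cutoff also lets high-frequency detail change.
    img: (C, H, W) tensor.
    """
    C, H, W = img.shape
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).float()
    f = f * mask
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real

# toy usage: keep only the low-frequency layout of an image
out = lowpass_filter(torch.rand(3, 128, 128), cutoff=0.25)
print(out.shape)  # torch.Size([3, 128, 128])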

Authors:Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu
Title: VACE: All-in-One Video Creation and Editing
Abstract:
Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.

Authors:Yuhong Zhang, Guanlin Wu, Ling-Hao Chen, Zhuokai Zhao, Jing Lin, Xiaoke Jiang, Jiamin Wu, Zhuoheng Li, Hao Frank Yang, Haoqian Wang, Lei Zhang
Title: HumanMM: Global Human Motion Recovery from Multi-shot Videos
Abstract:
In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in the world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are of great challenge to be recovered due to abrupt shot transitions, partial occlusions, and dynamic backgrounds presented in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle the challenges by integrating an enhanced camera pose estimation with Human Motion Recovery (HMR) by incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our created multi-shot dataset from public 3D human datasets demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.

Authors:Youjun Zhao, Jiaying Lin, Rynson W. H. Lau
Title: Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Abstract:
Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. The recent success of vision-language models (VLMs) has demonstrated their remarkable capabilities to understand open vocabularies. Existing works that leverage VLMs for 3D object detection (3DOD) generally resort to representations that lose the rich scene context required for 3D perception. To address this problem, we propose in this paper a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD. Specifically, we first design a Hierarchical Data Integration (HDI) approach to obtain coarse-to-fine 3D-image-text data, which is fed into a VLM to extract object-centric knowledge. To facilitate the association of feature hierarchies, we then propose an Interactive Cross-Modal Alignment (ICMA) strategy to establish effective intra-level and inter-level feature connections. To better align features across different levels, we further propose an Object-Focusing Context Adjustment (OFCA) module to refine multi-level features by emphasizing object-related features. Extensive experiments demonstrate that the proposed method outperforms SOTA methods on the existing OV-3DOD benchmarks. It also achieves promising OV-3DOD results even without any 3D annotations.

Authors:Bardia Safaei, Faizan Siddiqui, Jiacong Xu, Vishal M. Patel, Shao-Yuan Lo
Title: Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Abstract:
Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images with the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. The link to our project page: https://bardisafa.github.io/PreSel
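
A hedged sketch of the within-task selection step: cluster image embeddings and keep the image nearest each centroid, with the per-task budget supplied externally; the use of k-means and the helper name are assumptions, not the released PreSel code:

import numpy as np
from sklearn.cluster import KMeans

def select_representative_images(features, budget, seed=0):
    """Pick the images closest to k-means centroids within one vision task.

    features: (N, d) image embeddings for a single task
    budget:   number of images to keep for that task (its task-wise budget)
    """
    km = KMeans(n_clusters=budget, random_state=seed, n_init=10).fit(features)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[d.argmin()]))   # most representative image of cluster c
    return chosen

# toy usage
idx = select_representative_images(np.random.rand(500, 64), budget=20)
print(len(idx))  # 20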

Authors:Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li
Title: When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Abstract:
Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are in https://github.com/VisionXLab/LRS-VQA.
中文摘要:本文提出一种结合动态图像金字塔的文本引导令牌剪枝方法,通过区域聚焦模块和由粗到细的图像块选择策略,在保持遥感图像细节的同时降低计算复杂度,并创建了包含7,333个问答对的新基准LRS-VQA用于评估。
English Summary: This paper introduces a text-guided token pruning method with Dynamic Image Pyramid integration to efficiently process large remote sensing images by preserving details while reducing computational costs, and establishes a new benchmark LRS-VQA for comprehensive evaluation.
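
The core idea of text-guided token pruning can be sketched as scoring every vision token against the text query and keeping only the top-scoring fraction. The scoring rule and keep ratio below are simplifying assumptions, not the paper's Region Focus Module or DIP logic.

```python
import torch

def prune_vision_tokens(vision_tokens, text_tokens, keep_ratio=0.25):
    """vision_tokens: (B, Nv, D); text_tokens: (B, Nt, D).
    Keeps the vision tokens most attended by the text query."""
    # Text-to-vision attention scores, averaged over text tokens.
    scale = vision_tokens.shape[-1] ** 0.5
    attn = torch.softmax(text_tokens @ vision_tokens.transpose(1, 2) / scale, dim=-1)
    scores = attn.mean(dim=1)                      # (B, Nv): relevance of each vision token
    k = max(1, int(vision_tokens.shape[1] * keep_ratio))
    top = scores.topk(k, dim=1).indices            # indices of the tokens to keep
    batch = torch.arange(vision_tokens.shape[0]).unsqueeze(-1)
    return vision_tokens[batch, top]               # (B, k, D) pruned token set
```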

Authors:Jen-tse Huang, Jiantong Qin, Jianping Zhang, Youliang Yuan, Wenxuan Wang, Jieyu Zhao
Title: VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models
Abstract:
This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., "What is the education level of the person in the image?") (2) Yes-No comparisons using two images (e.g., "Is the person in the first image more educated than the person in the second image?") For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at https://github.com/uscnlp-lime/VisBias.
本研究通过直接提问和间接任务测试视觉语言模型在性别和种族方面的显性与隐性社会偏见,评估了包括Gemini-1.5和GPT-4V在内的多个模型。
This study examines explicit and implicit social biases in Vision-Language Models by testing them with direct questions and indirect tasks related to gender and race, evaluating models like Gemini-1.5 and GPT-4V.

Authors:Clément Chadebec, Onur Tasar, Sanjeev Sreetharan, Benjamin Aubin
Title: LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Abstract:
In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation. We provide an implementation at https://github.com/gojasper/LBM.
中文: 本文提出潜在桥匹配(LBM)方法,通过在潜在空间进行桥匹配实现高效图像转换,仅需单步推理即可在物体移除、重照明等多项任务中达到顶尖水平。
English: This paper presents Latent Bridge Matching (LBM), an efficient and versatile method for image-to-image translation that achieves state-of-the-art results in a single inference step across tasks like object removal and relighting.

Authors:Zhangquan Chen, Xufang Luo, Dongsheng Li
Title: VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Abstract:
Visual understanding is inherently intention-driven - humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
中文摘要:VisRL是首个将强化学习应用于意图驱动视觉感知的框架,通过仅使用奖励信号优化整个视觉推理过程,无需昂贵区域标注即可超越现有基线方法。
English Summary: VisRL is a novel reinforcement learning framework that eliminates the need for supervised region annotations by optimizing visual reasoning through trial-and-error, achieving superior performance across multiple benchmarks.

Authors:Takeru Inoue, Ryusuke Miyamoto
Title: FastInstShadow: A Simple Query-Based Model for Instance Shadow Detection
Abstract:
Instance shadow detection is the task of detecting pairs of shadows and objects, where existing methods first detect shadows and objects independently, then associate them. This paper introduces FastInstShadow, a method that enhances detection accuracy through a query-based architecture featuring an association transformer decoder with two dual-path transformer decoders to assess relationships between shadows and objects during detection. Experimental results using the SOBA dataset showed that the proposed method outperforms all existing methods across all criteria. This method makes real-time processing feasible for moderate-resolution images with better accuracy than SSISv2, the most accurate existing method. Our code is available at https://github.com/wlotkr/FastInstShadow.
中文摘要:FastInstShadow是一种基于查询的实例阴影检测方法,通过关联变换器解码器和双路径解码器提升检测精度,在SOBA数据集上全面超越现有方法,并能实现中等分辨率图像的实时处理。
English Summary: FastInstShadow is a query-based method that improves instance shadow detection accuracy using an association transformer decoder and dual-path decoders, outperforming existing techniques on the SOBA dataset while enabling real-time processing.

Authors:Jie Hu, Shizun Wang, Xinchao Wang
Title: PE3R: Perception-Efficient 3D Reconstruction
Abstract:
Recent advancements in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face critical challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction speeds. To address these limitations, we propose Perception-Efficient 3D Reconstruction (PE3R), a novel framework designed to enhance both accuracy and efficiency. PE3R employs a feed-forward architecture to enable rapid 3D semantic field reconstruction. The framework demonstrates robust zero-shot generalization across diverse scenes and objects while significantly improving reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate the effectiveness and versatility of PE3R. The framework achieves a minimum 9-fold speedup in 3D semantic field reconstruction, along with substantial gains in perception accuracy and reconstruction precision, setting new benchmarks in the field. The code is publicly available at: https://github.com/hujiecpp/PE3R.
Chinese: PE3R框架通过前馈架构实现快速3D语义场重建,在零样本泛化能力上表现优异,不仅将重建速度提升至少9倍,还在感知精度和重建质量上取得显著突破。
English: The proposed PE3R framework enables rapid 3D semantic field reconstruction with robust zero-shot generalization, achieving at least a 9-fold speedup and significant improvements in accuracy and precision across diverse scenes.

Authors:Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Title: Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
Abstract:
Reasoning segmentation is a challenging vision-language task that aims to output the segmentation mask with respect to a complex, implicit, and even non-visual query text. Previous works incorporated multimodal Large Language Models (MLLMs) with segmentation models to approach the difficult problem. However, their segmentation quality often falls short in complex cases, particularly when dealing with out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with surroundings. In this paper, we introduce ThinkFirst, a training-free reasoning segmentation framework that leverages GPT's chain of thought to address these challenging cases. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. Our framework allows users to easily interact with the segmentation agent using multimodal inputs, such as easy text and image scribbles, for successive refinement or communication. We evaluate the performance of ThinkFirst on diverse objects. Extensive experiments show that this zero-shot-CoT approach significantly improves the vanilla reasoning segmentation agent, both qualitatively and quantitatively, while being less sensitive or critical to user-supplied prompts after Thinking First.
中文:ThinkFirst是一种无需训练的分割推理框架,通过利用GPT的思维链生成详细图像描述,再由分割助手处理,有效提升复杂场景下的分割精度。
English: ThinkFirst is a training-free reasoning segmentation framework that enhances segmentation accuracy in complex scenarios by using GPT's chain of thought to generate detailed image descriptions, which are then processed by a segmentation assistant.

Authors:Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Zhuoer Yin, Keisuke Fujii
Title: AthletePose3D: A Benchmark Dataset for 3D Human Pose Estimation and Kinematic Validation in Athletic Movements
Abstract:
Human pose estimation is a critical task in computer vision and sports biomechanics, with applications spanning sports science, rehabilitation, and biomechanical research. While significant progress has been made in monocular 3D pose estimation, current datasets often fail to capture the complex, high-acceleration movements typical of competitive sports. In this work, we introduce AthletePose3D, a novel dataset designed to address this gap. AthletePose3D includes 12 types of sports motions across various disciplines, with approximately 1.3 million frames and 165 thousand individual postures, specifically capturing high-speed, high-acceleration athletic movements. We evaluate state-of-the-art (SOTA) monocular 2D and 3D pose estimation models on the dataset, revealing that models trained on conventional datasets perform poorly on athletic motions. However, fine-tuning these models on AthletePose3D notably reduces the SOTA model mean per joint position error (MPJPE) from 214 mm to 65 mm, a reduction of over 69%. We also validate the kinematic accuracy of monocular pose estimations through waveform analysis, highlighting strong correlations in joint angle estimations but limitations in velocity estimation. Our work provides a comprehensive evaluation of monocular pose estimation models in the context of sports, contributing valuable insights for advancing monocular pose estimation techniques in high-performance sports environments. The dataset, code, and model checkpoints are available at: https://github.com/calvinyeungck/AthletePose3D
中文: 本研究提出了AthletePose3D数据集,旨在解决现有数据集难以捕捉高速运动姿态的局限,实验表明基于该数据集微调模型可将姿态估计误差降低69%以上,并通过波形分析验证了运动学参数的准确性。
English: The study introduces AthletePose3D, a dataset addressing the limitations of existing datasets in capturing high-acceleration sports movements, and demonstrates that fine-tuning models on it reduces pose estimation errors by over 69% while validating kinematic accuracy through waveform analysis.
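
For reference, the MPJPE figures quoted above (214 mm down to 65 mm) follow the standard definition: the average Euclidean distance between predicted and ground-truth 3D joints, typically after aligning both skeletons at the root joint. A minimal version:

```python
import numpy as np

def mpjpe(pred, gt, root_joint=0):
    """pred, gt: (F, J, 3) arrays of 3D joint positions in millimetres.
    Root-aligns both skeletons, then averages per-joint Euclidean error."""
    pred = pred - pred[:, root_joint:root_joint + 1]
    gt = gt - gt[:, root_joint:root_joint + 1]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```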

Authors:Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li
Title: V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
Abstract:
We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM's vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. Code is available at https://github.com/zhangguiwei610/V2Flow.
中文: V2Flow是一种创新的视觉分词器,通过流匹配方法将图像映射至大语言模型词表空间,结合视觉词汇重采样器和掩码自回归整流流解码器,实现高质量重建与自回归视觉生成。
English: V2Flow is a novel visual tokenizer that maps images into discrete tokens aligned with LLM vocabulary, enabling high-fidelity reconstruction and autoregressive visual generation through flow-matching and specialized decoder designs.

Authors:Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, Hao Zhao
Title: Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction
Abstract:
Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at https://github.com/XR-Lee/neural-symbolic
Chinese: 本文提出Chameleon算法,通过快慢神经符号系统交替处理常规与复杂场景,结合视觉语言模型的链式推理,实现了高效低成本的车道拓扑提取,并在OpenLane-V2数据集上验证了其优越性能。
English: This paper introduces Chameleon, a fast-slow neuro-symbolic algorithm that combines efficient instance reasoning with vision-language model analysis to affordably extract lane topology for autonomous driving, demonstrating improved performance on the OpenLane-V2 dataset.
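
The fast-slow alternation amounts to a simple dispatch rule: run the cheap synthesized-program reasoner first and escalate only ambiguous cases to the VLM. Everything below, including the confidence test and the placeholder callables run_symbolic_program and query_vlm_cot, is a hypothetical sketch of that control flow, not the released code.

```python
def extract_topology(detections, run_symbolic_program, query_vlm_cot, conf_threshold=0.7):
    """Fast-slow dispatch: a symbolic program handles routine cases, a chain-of-thought
    VLM handles corner cases. `run_symbolic_program(detections)` is assumed to return
    (relations, confidence); `query_vlm_cot(detections)` returns relations."""
    relations, confidence = run_symbolic_program(detections)   # fast system
    if confidence < conf_threshold:                            # ambiguous / corner case
        relations = query_vlm_cot(detections)                  # slow system (VLM with CoT)
    return relations
```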

Authors:Jiacheng Ruan, Wenzhen Yuan, Xian Gao, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Ting Liu, Yuzhuo Fu
Title: VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Abstract:
Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed as VLRMBench, encompassing 12,634 questions. VLRMBench is constructed based on three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in the aspects of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in the `Forecasting Future', a binary classification task, the advanced GPT-4o achieves only a 76.0% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs. Code and datasets will be available at https://github.com/JCruan519/VLRMBench.
Chinese: 本文提出了VLRMBench基准,旨在全面评估视觉语言奖励模型在过程理解、结果判断和批判生成方面的能力,通过大量实验揭示了现有模型的显著挑战,弥补了当前评估体系的不足。
English: This paper introduces VLRMBench, a comprehensive benchmark designed to evaluate vision-language reward models across process understanding, outcome judgment, and critique generation, addressing limitations in existing assessments and demonstrating significant challenges through extensive experiments on multiple models.

Authors:Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Title: YOLOE: Real-Time Seeing Anything
Abstract:
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
中文: YOLOE是一种高效的开集检测与分割模型,通过创新的文本、视觉和无提示机制,以极低的训练成本和计算开销实现了实时性能。
English: YOLOE is an efficient open-set model integrating detection and segmentation that achieves real-time performance with minimal training cost and computational overhead through innovative text, visual, and prompt-free mechanisms.

Authors:Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
Title: REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF-VLM significantly. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at https://github.com/MacavityT/REF-VLM.
Chinese: REF-VLM提出了一种端到端的统一框架,采用基于三元组的指代范式来提高多模态大语言模型在复杂视觉解码任务中的适应性,并借助大规模数据集在多个基准测试中优于现有模型。
English: REF-VLM introduces a unified end-to-end framework with a triplet-based referring paradigm to enhance multimodal large language models' adaptability in complex visual decoding tasks, supported by a large-scale dataset and outperforming existing models on benchmarks.

Authors:Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Title: SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models
Abstract:
Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. However, existing methods either require costly fine-tuning or degrade image quality for non-target concepts (i.e., prior) due to inherent optimization limitations. In this paper, we introduce SPEED, a model editing-based concept erasure approach that leverages null-space constraints for scalable, precise, and efficient erasure. Specifically, SPEED incorporates Influence-based Prior Filtering (IPF) to retain the most affected non-target concepts during erasing, Directed Prior Augmentation (DPA) to expand prior coverage while maintaining semantic consistency, and Invariant Equality Constraints (IEC) to regularize model editing by explicitly preserving key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in prior preservation while achieving efficient and high-fidelity concept erasure, successfully removing 100 concepts within just 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.
中文摘要:SPEED是一种通过空空间优化直接编辑模型参数的高效概念消除方法,能够在快速移除目标概念的同时保持非目标内容的生成质量。
English Summary: SPEED is an efficient concept erasure method that directly edits diffusion model parameters through null space optimization, enabling rapid removal of target concepts while preserving non-target content quality.
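
One way to read the null-space constraint is that the weight edit is projected so it leaves the layer's response to preserved (prior) concepts unchanged. The generic projection below is a sketch under that reading; the feature matrix, tolerance, and the assumption of a single linear layer are all simplifications.

```python
import torch

def nullspace_project(delta_w, prior_feats, rtol=1e-5):
    """Project a weight update onto the null space of prior-concept features.

    delta_w:     (out_dim, in_dim) proposed edit to a linear layer's weight
    prior_feats: (num_prior, in_dim) features of concepts that must be preserved
    Returns an edit that leaves W @ f unchanged for every prior feature f (up to tolerance).
    """
    # Orthonormal basis of the row space of the prior features via SVD.
    _, s, vh = torch.linalg.svd(prior_feats, full_matrices=False)
    rank = int((s > rtol * s[0]).sum())
    v = vh[:rank]                                  # (rank, in_dim)
    # Remove the component of the edit that acts on the span of the prior features.
    return delta_w - (delta_w @ v.T) @ v
```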

Authors:Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
Title: TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
Abstract:
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: http://github.com/ddgoodgood/TRCE. CAUTION: This paper includes model-generated content that may contain offensive material.
中文摘要:本研究提出TRCE方法,通过两阶段概念消除策略:先消除文本提示中的隐含恶意语义,再通过对比学习引导去噪过程朝向安全方向,从而在有效清除恶意内容的同时更好地保持模型的正常生成能力。
English Summary: The study introduces TRCE, a two-stage concept erasure method that effectively removes malicious content from text-to-image models by first neutralizing harmful semantics in prompts and then guiding the denoising process toward safe outputs, achieving a balance between erasure and preservation of generation quality.

Authors:Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao
Title: MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Abstract:
DeepSeek R1 and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimodal reasoning. However, these efforts have been constrained by the limited difficulty of selected tasks and relatively small training scales, making it challenging to demonstrate strong multimodal reasoning abilities. To address this gap, we introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters. The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes. The latter is a multimodal model employing rule-based reinforcement learning on MMK12, utilizing online filtering and a two-stage training strategy to enhance training stability. MM-EUREKA demonstrates remarkable performance gains in multimodal mathematical reasoning, outperforming previous powerful models like InternVL2.5-78B or InternVL2.5-38B-MPO. In particular, MM-EUREKA achieves competitive or superior performance compared to both open-source and closed-source models, and trails slightly behind o1 in multidisciplinary reasoning tasks. We open-source our complete pipeline to foster further research in this area. We release all our codes, models, data, etc. at https://github.com/ModalMinds/MM-EUREKA
中文: MM-EUREKA模型基于高质量多模态数学推理数据集MMK12,采用规则强化学习训练,在多模态数学推理中表现卓越,超越先前强大模型并与顶尖系统性能相当,同时全面开源促进后续研究。
English: The MM-EUREKA model, trained on the high-quality MMK12 dataset using rule-based reinforcement learning, achieves significant advancements in multimodal mathematical reasoning, outperforming previous models and closely competing with leading systems while being fully open-sourced.

Authors:Johan Edstedt, Georg Bökman, Mårten Wadenbäck, Michael Felsberg
Title: DaD: Distilled Reinforcement Learning for Diverse Keypoint Detection
Abstract:
Keypoints are what enable Structure-from-Motion (SfM) systems to scale to thousands of images. However, designing a keypoint detection objective is a non-trivial task, as SfM is non-differentiable. Typically, an auxiliary objective involving a descriptor is optimized. This, however, induces a dependency on the descriptor, which is undesirable. In this paper we propose a fully self-supervised and descriptor-free objective for keypoint detection, through reinforcement learning. To ensure training does not degenerate, we leverage a balanced top-K sampling strategy. While this already produces competitive models, we find that two qualitatively different types of detectors emerge, which are only able to detect light and dark keypoints respectively. To remedy this, we train a third detector, DaD, that optimizes the Kullback-Leibler divergence of the pointwise maximum of both light and dark detectors. Our approach significantly improves upon SotA across a range of benchmarks. Code and model weights are publicly available at https://github.com/parskatt/dad
中文摘要:本文提出了一种完全自监督且无需描述符的关键点检测方法,通过强化学习和平衡采样策略,在多个基准测试中显著超越了现有最优技术。
English Summary: This paper introduces a fully self-supervised, descriptor-free keypoint detection method using reinforcement learning and a balanced sampling strategy, which significantly outperforms state-of-the-art approaches across multiple benchmarks.
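
The fusion step, distilling a third detector from the pointwise maximum of the light and dark detectors, can be sketched as a KL loss against the renormalized maximum of the two teacher heatmaps; the renormalization and reduction choices below are assumptions.

```python
import torch
import torch.nn.functional as F

def dad_distill_loss(student_logits, light_prob, dark_prob, eps=1e-8):
    """student_logits: (B, H*W) raw scores from the fused detector;
    light_prob, dark_prob: (B, H*W) keypoint distributions from the two teachers."""
    target = torch.maximum(light_prob, dark_prob)
    target = target / (target.sum(dim=-1, keepdim=True) + eps)   # renormalize the pointwise max
    log_student = F.log_softmax(student_logits, dim=-1)
    # KL(target || student), averaged over the batch.
    return F.kl_div(log_student, target, reduction="batchmean")
```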

Authors:Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu
Title: Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Abstract:
We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks globally coherent text-to-image generation in autoregressive LLMs without architectural modifications. Different from prior works that require complex architectural redesigns, ARRA aligns the LLM's hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training T2I LLMs from scratch, ARRA reduces FID by 16.6% (ImageNet), 12.0% (LAION-COCO) for autoregressive LLMs like LlamaGen, without modifying the original architecture or inference mechanism. For training from text-generation-only LLMs, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet) for advanced LLMs like Chameleon. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). These results demonstrate that training objective redesign, rather than architectural modifications, can resolve cross-modal global coherence challenges. ARRA offers a complementary paradigm for advancing autoregressive models. The code is available at https://github.com/xiexing0916/ARRA.
Chinese: 我们提出自回归表征对齐(ARRA)训练框架,通过混合令牌和双重约束将LLM隐藏状态与外部视觉表征对齐,无需修改架构即可实现全局连贯的文生图生成,在多个数据集和领域显著提升性能。
English: We introduce Autoregressive Representation Alignment (ARRA), a training framework that enables autoregressive LLMs to generate globally coherent images by aligning hidden states with visual representations through a hybrid token and dual constraints, without architectural changes, significantly improving performance across various datasets and domains.
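
The training objective amounts to ordinary next-token prediction plus a global term that pulls a designated hidden state toward a frozen visual encoder's representation. The cosine form of the alignment term, the weighting, and the shared feature dimension are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def arra_loss(token_logits, target_tokens, hybrid_hidden, visual_feat, lam=1.0):
    """token_logits:  (B, T, V) LLM output logits; target_tokens: (B, T) next-token labels;
    hybrid_hidden: (B, D) LLM hidden state at the hybrid token position;
    visual_feat:   (B, D) feature from a frozen external visual foundation model."""
    # Local constraint: standard autoregressive next-token prediction.
    ce = F.cross_entropy(token_logits.reshape(-1, token_logits.shape[-1]),
                         target_tokens.reshape(-1))
    # Global constraint: align the hidden state with the external visual representation.
    align = 1.0 - F.cosine_similarity(hybrid_hidden, visual_feat, dim=-1).mean()
    return ce + lam * align
```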

Authors:Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low
Title: Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions
Abstract:
Machine learning models often have uneven performance among subpopulations (a.k.a., groups) in the data distributions. This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment. To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups. However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data. We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then tinkers the model by iteratively retraining its last layer on the reweighted data using influence functions. Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts. In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.
中文摘要:机器学习模型常在不同数据子群体间表现不均,本研究提出的群体鲁棒样本重加权(GSR)方法,通过两阶段策略利用有限标注数据优化未标注样本权重,以更高效的标签使用显著提升模型对群体分布变化的鲁棒性,性能优于现有最佳方法。
English summary: Machine learning models often struggle with performance disparities across subpopulations, and this study introduces Group-robust Sample Reweighting (GSR), a lightweight two-stage method that enhances robustness to group shifts by efficiently utilizing limited labeled data to reweight unlabeled samples, outperforming prior approaches with equal or fewer labels.

Authors:Weijia Wu, Zeyu Zhu, Mike Zheng Shou
Title: Automated Movie Generation via Multi-Agent CoT Planning
Abstract:
Existing long-form video generation frameworks lack automated planning, requiring manual input for storylines, scenes, cinematography, and character interactions, resulting in high costs and inefficiencies. To address these challenges, we present MovieAgent, a framework for automated movie generation via multi-agent Chain of Thought (CoT) planning. MovieAgent offers two key advantages: 1) We are the first to explore and define the paradigm of automated movie/long-video generation. Given a script and character bank, our MovieAgent can generate multi-scene, multi-shot long-form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio throughout the film. 2) MovieAgent introduces a hierarchical CoT-based reasoning process to automatically structure scenes, camera settings, and cinematography, significantly reducing human effort. By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline. Experiments demonstrate that MovieAgent achieves new state-of-the-art results in script faithfulness, character consistency, and narrative coherence. Our hierarchical framework takes a step forward and provides new insights into fully automated movie generation. The code and project website are available at: https://github.com/showlab/MovieAgent and https://weijiawu.github.io/MovieAgent.
中文摘要:MovieAgent通过多智能体思维链规划实现了自动化长视频生成,能够根据剧本自动构建场景和摄影设置,在保证角色一致性和叙事连贯性的同时大幅降低了人工成本。
English Summary: MovieAgent introduces an automated long-form video generation system using multi-agent Chain of Thought planning to create coherent multi-scene videos from scripts while ensuring character consistency and synchronized elements, significantly reducing manual effort.

Authors:Jiarui Wu, Yujin Wang, Lingen Li, Zhang Fan, Tianfan Xue
Title: Goal Conditioned Reinforcement Learning for Photo Finishing Tuning
Abstract:
Photo finishing tuning aims to automate the manual tuning process of the photo finishing pipeline, like Adobe Lightroom or Darktable. Previous works either use zeroth-order optimization, which is slow when the set of parameters increases, or rely on a differentiable proxy of the target finishing pipeline, which is hard to train. To overcome these challenges, we propose a novel goal-conditioned reinforcement learning framework for efficiently tuning parameters using a goal image as a condition. Unlike previous approaches, our tuning framework does not rely on any proxy and treats the photo finishing pipeline as a black box. Utilizing a trained reinforcement learning policy, it can efficiently find the desired set of parameters within just 10 queries, while optimization based approaches normally take 200 queries. Furthermore, our architecture utilizes a goal image to guide the iterative tuning of pipeline parameters, allowing for flexible conditioning on pixel-aligned target images, style images, or any other visually representable goals. We conduct detailed experiments on photo finishing tuning and photo stylization tuning tasks, demonstrating the advantages of our method. Project website: https://openimaginglab.github.io/RLPixTuner/.
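
At inference, the tuner reduces to a short closed loop: the policy looks at the current render and the goal image and proposes the next parameter set for the black-box pipeline. The policy interface and the fixed query count below are assumptions made for illustration.

```python
def tune_parameters(policy, pipeline, raw_image, goal_image, init_params, max_queries=10):
    """policy(current, goal, params) -> new params; pipeline(raw, params) -> rendered image.
    Both are placeholders for the trained goal-conditioned policy and the black-box
    photo-finishing pipeline (e.g., a Lightroom-style ISP)."""
    params = init_params
    for _ in range(max_queries):                      # roughly 10 queries per the abstract
        current = pipeline(raw_image, params)         # query the black-box pipeline
        params = policy(current, goal_image, params)  # goal-conditioned parameter update
    return params
```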

Authors:Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, Li Yuan
Title: WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Abstract:
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text to image generation. To address this challenge, we propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce $\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
中文: WISE基准通过引入涵盖25个领域的结构化提示和创新的WiScore评估指标,解决了文本到图像模型在复杂语义理解与世界知识整合方面的评估空白,揭示了现有模型在知识应用上的显著不足。
English: The WISE benchmark addresses the gap in evaluating complex semantic understanding and world knowledge integration in text-to-image models by introducing structured prompts across 25 domains and a novel WiScore metric, revealing significant limitations in current models' knowledge application.

Authors:Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim
Title: COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
Abstract:
Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at https://github.com/Breezelled/COMODO .
中文摘要:COMODO是一种跨模态自监督蒸馏框架,通过将视频语义知识迁移至IMU传感器,无需标注数据即可实现高效且保护隐私的人类活动识别,并展现出卓越性能。
English Summary: COMODO is a self-supervised framework that transfers semantic knowledge from video to IMU sensors, enabling efficient and privacy-preserving human activity recognition without labeled data while achieving superior performance.
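
A rough sketch of queue-based cross-modal distillation: the frozen video encoder defines a similarity distribution over a queue of stored video embeddings, and the IMU student is trained to match it. The temperatures, the KL form, and the queue handling are assumptions rather than COMODO's exact recipe.

```python
import torch
import torch.nn.functional as F

def comodo_step(video_emb, imu_emb, queue, tau_teacher=0.05, tau_student=0.1):
    """video_emb: (B, D) from the frozen video teacher; imu_emb: (B, D) from the IMU student;
    queue: (K, D) previously stored video embeddings acting as anchors."""
    video_emb, imu_emb, queue = map(lambda x: F.normalize(x, dim=-1),
                                    (video_emb, imu_emb, queue))
    teacher = F.softmax(video_emb @ queue.T / tau_teacher, dim=-1)        # (B, K) target
    student_log = F.log_softmax(imu_emb @ queue.T / tau_student, dim=-1)  # (B, K)
    # Pull the student's similarity distribution toward the teacher's.
    return F.kl_div(student_log, teacher, reduction="batchmean")
```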

Authors:Xinyu Nan, Meng He, Zifan Chen, Bin Dong, Lei Tang, Li Zhang
Title: AI-Driven Automated Tool for Abdominal CT Body Composition Analysis in Gastrointestinal Cancer Management
Abstract:
The incidence of gastrointestinal cancers remains significantly high, particularly in China, emphasizing the importance of accurate prognostic assessments and effective treatment strategies. Research shows a strong correlation between abdominal muscle and fat tissue composition and patient outcomes. However, existing manual methods for analyzing abdominal tissue composition are time-consuming and costly, limiting clinical research scalability. To address these challenges, we developed an AI-driven tool for automated analysis of abdominal CT scans to effectively identify and segment muscle, subcutaneous fat, and visceral fat. Our tool integrates a multi-view localization model and a high-precision 2D nnUNet-based segmentation model, demonstrating a localization accuracy of 90% and a Dice Score Coefficient of 0.967 for segmentation. Furthermore, it features an interactive interface that allows clinicians to refine the segmentation results, ensuring high-quality outcomes effectively. Our tool offers a standardized method for effectively extracting critical abdominal tissues, potentially enhancing the management and treatment for gastrointestinal cancers. The code is available at https://github.com/NanXinyu/AI-Tool4Abdominal-Seg.git.
Chinese Summary: 开发了一款基于人工智能的工具,用于自动分析腹部CT扫描,精确分割肌肉和脂肪组织,提供标准化方法以改善胃肠道癌症的治疗管理。
English Summary: An AI-driven tool was developed to automate the analysis of abdominal CT scans, accurately segmenting muscle and fat tissues with high precision, offering a standardized method to enhance gastrointestinal cancer management.
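
For reference, the Dice Score Coefficient reported above (0.967) is the standard overlap metric between a predicted and a ground-truth mask:

```python
import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-7):
    """pred_mask, gt_mask: boolean arrays of the same shape (e.g., a muscle or fat mask)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
```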

Authors:Haowen Bai, Jiangshe Zhang, Zixiang Zhao, Lilun Deng, Yukun Cui, Shuang Xu
Title: Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion
Abstract:
Multi-exposure image fusion (MEF) synthesizes multiple, differently exposed images of the same scene into a single, well-exposed composite. Retinex theory, which separates image illumination from scene reflectance, provides a natural framework to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To address this limitation, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components with a shared reflectance component, and effectively models the glare induced by overexposure. The shared reflectance is learned via a bidirectional loss, which enables our approach to effectively mitigate the glare effect. Furthermore, we introduce a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of a fixed exposure level. Extensive experiments on diverse datasets, including underexposure-overexposure fusion, exposure controlled fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model. The code is available at https://github.com/HaowenBai/Retinex-MEF
中文:Retinex-MEF是一种无监督可控方法,通过将多曝光图像分解为光照和共享反射分量,有效抑制过曝光眩光,并能在保持对比度的同时实现灵活的全局曝光调节。
English: Retinex-MEF is an unsupervised and controllable method that decomposes multi-exposure images into illumination and shared reflectance components, effectively mitigating glare effects and enabling flexible global exposure adjustments while preserving contrast.
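
The decomposition constraint, namely that every exposure is explained by its own illumination map times one shared reflectance, can be written as a reconstruction-plus-consistency loss. The terms below are illustrative and omit the paper's glare modeling and controllable fusion criterion.

```python
import torch
import torch.nn.functional as F

def retinex_decomposition_loss(images, illuminations, reflectance):
    """images:        list of (B, C, H, W) differently exposed shots of one scene
    illuminations: list of (B, 1, H, W) per-exposure illumination maps
    reflectance:   (B, C, H, W) single reflectance shared by all exposures."""
    # Each exposure should be reproduced by its illumination times the shared reflectance.
    recon = sum(F.l1_loss(L * reflectance, I) for I, L in zip(images, illuminations))
    # Consistency in the other direction: reflectance recovered from any exposure should agree.
    refl_from = [I / (L + 1e-4) for I, L in zip(images, illuminations)]
    consist = sum(F.l1_loss(r, reflectance) for r in refl_from)
    return recon + consist
```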

Authors:Zixuan Wang, Chi-Keung Tang, Yu-Wing Tai
Title: ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation
Abstract:
Current audio generation conditioned by text or video focuses on aligning audio with text/video modalities. Despite excellent alignment results, these multimodal frameworks still cannot be directly applied to compelling movie storytelling involving multiple scenes, where "on-screen" sounds require temporally-aligned audio generation, while "off-screen" sounds contribute to appropriate environment sounds accompanied by background music when applicable. Inspired by professional movie production, this paper proposes a multi-agentic framework for audio generation supervised by an autonomous Sound Director agent, engaging multi-turn conversations with other agents for on-screen and off-screen sound generation through multimodal LLM. To address on-screen sound generation, after detecting any talking humans in videos, we capture semantically and temporally synchronized sound by training a prediction model that forecasts interpretable, time-varying audio control signals: loudness, pitch, and timbre, which are used by a Foley Artist agent to condition a cross-attention module in the sound generation. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off-screen sound to complement the overall production. Each agent takes on specific roles similar to those of a movie production team. To temporally ground audio language models, in ReelWave, text/video conditions are decomposed into atomic, specific sound generation instructions synchronized with visuals when applicable. Consequently, our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.
中文: 本文提出了一种由声音导演监督的多智能体框架,通过多模态大语言模型和专门角色代理,生成与视频同步的屏幕内声音及补充性的屏幕外环境音效,以增强电影叙事的音频表现力。
English: This paper introduces a multi-agent framework supervised by a Sound Director for generating synchronized on-screen and off-screen audio in movies, utilizing multimodal LLM and specialized agents to produce temporally aligned sounds and complementary background audio.

Authors:Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
Title: Effective and Efficient Masked Image Generation Models
Abstract:
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512x512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models. Code is available at https://github.com/ML-GSAI/eMIGM.
中文: eMIGM模型将掩码图像生成与扩散模型统一起来,在ImageNet基准测试中以更少的函数评估次数实现了优于现有方法的效率与性能表现。
English: The eMIGM model unifies masked image generation and diffusion approaches, demonstrating superior efficiency and performance on ImageNet benchmarks by requiring significantly fewer function evaluations than existing methods.
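
Models in this family share one training step: mask a random subset of discrete image tokens and predict them from the visible ones. The sketch below shows that generic MaskGIT-style step, not the specific eMIGM design choices the paper explores.

```python
import torch
import torch.nn.functional as F

def masked_generation_step(model, tokens, mask_token_id, min_ratio=0.5, max_ratio=1.0):
    """tokens: (B, N) discrete image-token ids from a tokenizer; model maps (B, N) -> (B, N, V).
    Masks a random fraction of tokens and trains the model to recover them."""
    b, n = tokens.shape
    ratio = torch.empty(b, 1, device=tokens.device).uniform_(min_ratio, max_ratio)
    mask = torch.rand(b, n, device=tokens.device) < ratio          # True = masked position
    inputs = tokens.masked_fill(mask, mask_token_id)
    logits = model(inputs)                                         # (B, N, V)
    # Compute the loss only on masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```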

Authors:Yan Ren, Shilin Lu, Adams Wai-Kin Kong
Title: All That Glitters Is Not Gold: Key-Secured 3D Secrets within 3D Gaussian Splatting
Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have revolutionized scene reconstruction, opening new possibilities for 3D steganography by hiding 3D secrets within 3D covers. The key challenge in steganography is ensuring imperceptibility while maintaining high-fidelity reconstruction. However, existing methods often suffer from detectability risks and utilize only suboptimal 3DGS features, limiting their full potential. We propose a novel end-to-end key-secured 3D steganography framework (KeySS) that jointly optimizes a 3DGS model and a key-secured decoder for secret reconstruction. Our approach reveals that Gaussian features contribute unequally to secret hiding. The framework incorporates a key-controllable mechanism enabling multi-secret hiding and unauthorized access prevention, while systematically exploring optimal feature update to balance fidelity and security. To rigorously evaluate steganographic imperceptibility beyond conventional 2D metrics, we introduce 3D-Sinkhorn distance analysis, which quantifies distributional differences between original and steganographic Gaussian parameters in the representation space. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both cover and secret reconstruction while maintaining high security levels, advancing the field of 3D steganography. Code is available at https://github.com/RY-Paper/KeySS
中文:我们提出的KeySS框架开创了一种新型端到端3D隐写方法,通过联合优化高斯溅射模型与密钥保护解码器,借助优化特征选择和3D-Sinkhorn距离评估,实现了卓越的安全性和重建保真度。
English: Our proposed KeySS framework introduces a novel end-to-end 3D steganography method that jointly optimizes Gaussian Splatting models with key-secured decoders, achieving superior security and reconstruction fidelity through optimized feature selection and 3D-Sinkhorn distance evaluation.
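
The 3D-Sinkhorn analysis compares the distributions of original and steganographic Gaussian parameters; a generic entropy-regularized Sinkhorn distance over two parameter sets gives the flavor. The cost, regularization strength, and uniform weights below are assumptions, and a log-domain implementation would be more numerically stable.

```python
import torch

def sinkhorn_distance(x, y, eps=0.05, n_iters=100):
    """x: (n, d), y: (m, d), e.g., flattened Gaussian attributes of the cover and stego
    3DGS models. Returns the entropy-regularized optimal-transport cost."""
    cost = torch.cdist(x, y, p=2) ** 2                        # (n, m) squared Euclidean cost
    a = torch.full((x.shape[0],), 1.0 / x.shape[0], device=x.device)
    b = torch.full((y.shape[0],), 1.0 / y.shape[0], device=x.device)
    K = torch.exp(-cost / eps)                                # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                                  # Sinkhorn-Knopp iterations
        v = b / (K.T @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    plan = u[:, None] * K * v[None, :]                        # transport plan
    return (plan * cost).sum()
```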

Authors:Kazuya Nishimura, Ryoma Bise, Yasuhiro Kojima
Title: Towards Spatial Transcriptomics-guided Pathological Image Recognition with Batch-Agnostic Encoder
Abstract:
Spatial transcriptomics (ST) is a novel technique that simultaneously captures pathological images and gene expression profiling with spatial coordinates. Since ST is closely related to pathological features such as disease subtypes, it may be valuable to augment image representation with pathological information. However, there have been no attempts to leverage ST for image recognition (i.e., patch-level classification of subtypes of pathological images). One of the big challenges is significant batch effects in spatial transcriptomics that make it difficult to extract pathological features of images from ST. In this paper, we propose a batch-agnostic contrastive learning framework that can extract consistent signals from gene expression of ST in multiple patients. To extract consistent signals from ST, we utilize a batch-agnostic gene encoder that is trained in a variational inference manner. Experiments demonstrated the effectiveness of our framework on a publicly available dataset. Code is publicly available at https://github.com/naivete5656/TPIRBAE
中文: 空间转录组学是一种结合病理图像与基因表达数据的新技术,本文提出了一种批次无关的对比学习框架,能从中提取一致性信号,并在公开数据集上验证了其有效性。
English: Spatial transcriptomics is a cutting-edge method that combines pathological images with gene expression data, and this paper introduces a batch-agnostic contrastive learning framework to extract consistent signals from it, validated on a public dataset.

Authors:Jing Yang, Sen Yang, Xiao Tan, Hanli Wang
Title: HisTrackMap: Global Vectorized High-Definition Map Construction via History Map Tracking
Abstract:
As an essential component of autonomous driving systems, high-definition (HD) maps provide rich and precise environmental information for auto-driving scenarios; however, existing methods, which primarily rely on query-based detection frameworks to directly model map elements or implicitly propagate queries over time, often struggle to maintain consistent temporal perception outcomes. These inconsistencies pose significant challenges to the stability and reliability of real-world autonomous driving and map data collection systems. To address this limitation, we propose a novel end-to-end tracking framework for global map construction by temporally tracking map elements' historical trajectories. Firstly, instance-level historical rasterization map representation is designed to explicitly store previous perception results, which can control and maintain different global instances' history information in a fine-grained way. Secondly, we introduce a Map-Trajectory Prior Fusion module within this tracking framework, leveraging historical priors for tracked instances to improve temporal smoothness and continuity. Thirdly, we propose a global perspective metric to evaluate the quality of temporal geometry construction in HD maps, filling the gap in current metrics for assessing global geometric perception results. Substantial experiments on the nuScenes and Argoverse2 datasets demonstrate that the proposed method outperforms state-of-the-art (SOTA) methods in both single-frame and temporal metrics. The project page is available at: https://yj772881654.github.io/HisTrackMap.

Authors:Hanqing Guo, Xiuxiu Lin, Shiyu Zhao
Title: YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion
Abstract:
Vision-based drone-to-drone detection has attracted increasing attention due to its importance in numerous tasks such as vision-based swarming, aerial see-and-avoid, and malicious drone detection. However, existing methods often encounter failures when the background is complex or the target is tiny. This paper proposes a novel end-to-end framework that accurately identifies small drones in complex environments using motion guidance. It starts by creating a motion difference map to capture the motion characteristics of tiny drones. Next, this motion difference map is combined with an RGB image using a bimodal fusion module, allowing for adaptive feature learning of the drone. Finally, the fused feature map is processed through an enhanced backbone and detection head based on the YOLOv5 framework to achieve accurate detection results. To validate our method, we propose a new dataset, named ARD100, which comprises 100 videos (202,467 frames) covering various challenging conditions and has the smallest average object size compared with the existing drone detection datasets. Extensive experiments on the ARD100 and NPS-Drones datasets show that our proposed detector performs exceptionally well under challenging conditions and surpasses state-of-the-art algorithms across various metrics. We publicly release the codes and ARD100 dataset at https://github.com/Irisky123/YOLOMG.
中文: 本文提出了一种新颖的端到端框架,通过融合运动引导与RGB图像来提升复杂环境下小型无人机的检测精度,并在多个数据集上验证了其优越性能。
English: This paper introduces an end-to-end framework that enhances small drone detection in complex environments by integrating motion guidance with RGB imagery and outperforms existing methods on challenging datasets.
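
As a rough illustration of the motion-guidance idea, the sketch below builds a single-channel motion cue by grayscale frame differencing so it can be stacked with the RGB frame for bimodal fusion; the paper's actual motion difference map construction may differ, and the function name here is illustrative.

```python
import numpy as np

def motion_difference_map(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Illustrative motion cue: absolute grayscale difference between consecutive frames.

    prev_frame, curr_frame: HxWx3 uint8 RGB frames.
    Returns an HxW float32 map in [0, 1].
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # standard luminance weights
    prev_gray = prev_frame.astype(np.float32) @ weights
    curr_gray = curr_frame.astype(np.float32) @ weights
    diff = np.abs(curr_gray - prev_gray)
    return diff / 255.0  # normalized so it can be concatenated with the RGB input

# Toy usage with random "frames"; a real pipeline would feed consecutive video frames.
f0 = np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)
f1 = np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)
print(motion_difference_map(f0, f1).shape)  # (360, 640)
```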

Authors:Haiyang Xie, Xi Shen, Shihua Huang, Qirui Wang, Zheng Wang
Title: SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements
Abstract:
Most visual models are designed for sRGB images, yet RAW data offers significant advantages for object detection by preserving sensor information before ISP processing. This enables improved detection accuracy and more efficient hardware designs by bypassing the ISP. However, RAW object detection is challenging due to limited training data, unbalanced pixel distributions, and sensor noise. To address this, we propose SimROD, a lightweight and effective approach for RAW object detection. We introduce a Global Gamma Enhancement (GGE) module, which applies a learnable global gamma transformation with only four parameters, improving feature representation while keeping the model efficient. Additionally, we leverage the green channel's richer signal to enhance local details, aligning with the human eye's sensitivity and Bayer filter design. Extensive experiments on multiple RAW object detection datasets and detectors demonstrate that SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency. Our work highlights the potential of RAW data for real-world object detection. Code is available at https://ocean146.github.io/SimROD2025/.
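
To make the Global Gamma Enhancement idea concrete, here is a minimal sketch of a learnable global gamma transform for packed RAW input. The assumption that the four parameters correspond to one exponent per RGGB Bayer channel is ours, not stated in the abstract, and the module name is illustrative.

```python
import torch
import torch.nn as nn

class GlobalGammaEnhancement(nn.Module):
    """Sketch of a learnable global gamma transform with four parameters.

    Assumes a 4-channel packed RGGB RAW input in [0, 1] and one learnable gamma
    exponent per Bayer channel; SimROD's actual GGE parameterization may differ.
    """
    def __init__(self, num_channels: int = 4):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(num_channels))  # log keeps gamma positive

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 4, H, W) packed RAW in [0, 1].
        gamma = torch.exp(self.log_gamma).view(1, -1, 1, 1)
        return x.clamp(min=1e-6) ** gamma

raw = torch.rand(2, 4, 256, 256)
print(GlobalGammaEnhancement()(raw).shape)  # torch.Size([2, 4, 256, 256])
```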

Authors:Shuhe Wang, Xiaoya Li, Jiwei Li, Guoyin Wang, Xiaofei Sun, Bob Zhu, Han Qiu, Mo Yu, Shengjie Shen, Tianwei Zhang, Eduard Hovy
Title: FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset
Abstract:
Due to the data-driven nature of current face identity (FaceID) customization methods, all state-of-the-art models rely on large-scale datasets containing millions of high-quality text-image pairs for training. However, none of these datasets are publicly available, which restricts transparency and hinders further advancements in the field. To address this issue, in this paper, we collect and release FaceID-6M, the first large-scale, open-source FaceID dataset containing 6 million high-quality text-image pairs. Filtered from LAION-5B \cite{schuhmann2022laion}, FaceID-6M undergoes rigorous image and text filtering steps to ensure dataset quality, including resolution filtering to maintain high-quality images and faces, face filtering to remove images that lack human faces, and a keyword-based strategy to retain descriptions containing human-related terms (e.g., nationality, professions and names). Through these cleaning processes, FaceID-6M provides a high-quality dataset optimized for training powerful FaceID customization models, facilitating advancements in the field by offering an open resource for research and development. We conduct extensive experiments to show the effectiveness of our FaceID-6M, demonstrating that models trained on our FaceID-6M dataset achieve performance that is comparable to, and slightly better than, currently available industrial models. Additionally, to support and advance research in the FaceID customization community, we make our code, datasets, and models fully publicly available. Our codes, models, and datasets are available at: https://github.com/ShuheSH/FaceID-6M.
中文: 现有FaceID模型依赖非公开数据集,因此本文推出高质量开源数据集FaceID-6M,包含600万图文对,既能训练出有竞争力的模型,又通过公开资源推动该领域研究发展。
English: Current FaceID models rely on non-public datasets, so this paper introduces FaceID-6M, a high-quality open-source dataset with 6 million text-image pairs that enables training competitive models and promotes transparency in research.

Authors:Spyros Kondylatos, Nikolaos Ioannis Bountos, Dimitrios Michail, Xiao Xiang Zhu, Gustau Camps-Valls, Ioannis Papoutsis
Title: On the Generalization of Representation Uncertainty in Earth Observation
Abstract:
Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. Initiating the discussion on representation uncertainty in EO, our study provides insights into its strengths and limitations, paving the way for future research in the field. Code and weights are available at: https://github.com/Orion-AI-Lab/EOUncertaintyGeneralization.
中文: 计算机视觉的最新进展通过预训练表示不确定性实现了零样本不确定性估计,该技术在地球观测任务中展现出跨领域和地理位置的强大泛化能力,同时与任务特定不确定性及实际噪声敏感性保持一致。
English: Recent advances in computer vision enable zero-shot uncertainty estimation through pretrained representation uncertainty, which shows strong generalization in Earth Observation tasks across domains and geographic locations while aligning with task-specific uncertainties and real-world noise sensitivity.

Authors:Sheng Luo, Yi Zhou, Tao Zhou
Title: Universal Incremental Learning: Mitigating Confusion from Inter- and Intra-task Distribution Randomness
Abstract:
Incremental learning (IL) aims to overcome catastrophic forgetting of previous tasks while learning new ones. Existing IL methods make strong assumptions that the incoming task type will either only increase new classes or domains (i.e. Class IL, Domain IL), or increase by a static scale in a class- and domain-agnostic manner (i.e. Versatile IL (VIL)), which greatly limit their applicability in the unpredictable and dynamic wild. In this work, we investigate $\textbf{Universal Incremental Learning (UIL)}$, where a model neither knows which new classes or domains will increase along sequential tasks, nor the scale of the increments within each task. This uncertainty prevents the model from confidently learning knowledge from all task distributions and symmetrically focusing on the diverse knowledge within each task distribution. Consequently, UIL presents a more general and realistic IL scenario, making the model face confusion arising from inter-task and intra-task distribution randomness. To $\textbf{Mi}$tigate both types of $\textbf{Co}$nfusion, we propose a simple yet effective framework for UIL, named $\textbf{MiCo}$. At the inter-task distribution level, we employ a multi-objective learning scheme to enforce accurate and deterministic predictions, and its effectiveness is further enhanced by a direction recalibration module that reduces conflicting gradients. Moreover, at the intra-task distribution level, we introduce a magnitude recalibration module to alleviate asymmetrical optimization towards imbalanced class distribution. Extensive experiments on three benchmarks demonstrate the effectiveness of our method, outperforming existing state-of-the-art methods in both the UIL scenario and the VIL scenario. Our code will be available at https://github.com/rolsheng/UIL.
中文: 本文提出通用增量学习(UIL)这一更现实的场景,模型需面对类别和领域不可预测的双重增量挑战,并通过多目标学习和校准模块的MiCo框架有效缓解任务间与任务内的混淆,在多个基准测试中显著超越现有方法。
English: This paper introduces Universal Incremental Learning (UIL), a more realistic scenario where models face unpredictable task increments in both classes and domains, and proposes MiCo, a framework that mitigates inter-task and intra-task confusion through multi-objective learning and recalibration modules, demonstrating superior performance over existing methods.

Authors:Dong-Hee Paek, Seung-Hyun Kong
Title: Availability-aware Sensor Fusion via Unified Canonical Space for 4D Radar, LiDAR, and Camera
Abstract:
Sensor fusion of camera, LiDAR, and 4-dimensional (4D) Radar has brought a significant performance improvement in autonomous driving (AD). However, there still exist fundamental challenges: deeply coupled fusion methods assume continuous sensor availability, making them vulnerable to sensor degradation and failure, whereas sensor-wise cross-attention fusion methods struggle with computational cost and unified feature representation. This paper presents availability-aware sensor fusion (ASF), a novel method that employs unified canonical projection (UCP) to enable consistency in all sensor features for fusion and cross-attention across sensors along patches (CASAP) to enhance robustness of sensor fusion against sensor degradation and failure. As a result, the proposed ASF shows a superior object detection performance to the existing state-of-the-art fusion methods under various weather and sensor degradation (or failure) conditions; Extensive experiments on the K-Radar dataset demonstrate that ASF achieves improvements of 9.7% in AP BEV (87.2%) and 20.1% in AP 3D (73.6%) in object detection at IoU=0.5, while requiring a low computational cost. The code will be available at https://github.com/kaist-avelab/K-Radar.
中文: 本文提出可用性感知传感器融合(ASF)方法,通过统一规范投影和跨传感器补丁交叉注意力机制,在恶劣天气和传感器故障条件下显著提升自动驾驶物体检测性能,同时保持较低计算成本。
English: This paper introduces availability-aware sensor fusion (ASF), a novel method that enhances autonomous driving robustness by employing unified canonical projection and cross-attention across sensors, achieving superior object detection performance under adverse conditions with low computational cost.

Authors:Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo
Title: Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
Abstract:
Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control the mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals and apply quantization or continuity to each, so that they can be effectively predicted from video by a devised video-to-all (V2X) predictor. Then, the predicted signals are recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency. Our codes and demos will be released at https://wjc2830.github.io/MelQCD/.

Authors:Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Haibao Yu, Lei He, Shaobing Xu
Title: Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark
Abstract:
Despite significant advancements, autonomous driving systems continue to struggle with occluded objects and long-range detection due to the inherent limitations of single-perspective sensing. Aerial-ground cooperation offers a promising solution by integrating UAVs' aerial views with ground vehicles' local observations. However, progress in this emerging field has been hindered by the absence of public datasets and standardized evaluation benchmarks. To address this gap, this paper presents a comprehensive solution for aerial-ground cooperative 3D perception through three key contributions: (1) Griffin, a large-scale multi-modal dataset featuring over 200 dynamic scenes (30k+ frames) with varied UAV altitudes (20-60m), diverse weather conditions, and occlusion-aware 3D annotations, enhanced by CARLA-AirSim co-simulation for realistic UAV dynamics; (2) A unified benchmarking framework for aerial-ground cooperative detection and tracking tasks, including protocols for evaluating communication efficiency, latency tolerance, and altitude adaptability; (3) AGILE, an instance-level intermediate fusion baseline that dynamically aligns cross-view features through query-based interaction, achieving an advantageous balance between communication overhead and perception accuracy. Extensive experiments prove the effectiveness of aerial-ground cooperative perception and demonstrate the direction of further research. The dataset and codes are available at https://github.com/wang-jh18-SVM/Griffin.
中文: 自动驾驶系统因单视角感知的局限难以处理遮挡和远距离检测,而空地协同通过提供数据集、基准框架和融合方法,有效提升了感知精度与效率。
English: Autonomous driving faces challenges with occlusions and long-range detection, which aerial-ground cooperation addresses through a new dataset, benchmarking framework, and fusion method that enhance perception accuracy and efficiency.

Authors:Shining Wang, Yunlong Wang, Ruiqi Wu, Bingliang Jiao, Wenxuan Wang, Peng Wang
Title: SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks
Abstract:
When discussing the Aerial-Ground Person Re-identification (AGPReID) task, we face the main challenge of the significant appearance variations caused by different viewpoints, making identity matching difficult. To address this issue, previous methods attempt to reduce the differences between viewpoints by critical attributes and decoupling the viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework relies on the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering $4,231$ unique identities and containing $63,841$ high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible and effective solution for the AGPReID task. The datasets and source code are available at https://github.com/wangshining681/SeCap-AGPReID.
中文:SeCap方法通过自适应重新校准提示和优化局部特征来解决空地行人重识别难题,同时引入两个新数据集以推动该领域研究。
English: The SeCap method addresses aerial-ground person re-identification challenges by adaptively recalibrating prompts and refining local features, while introducing two new datasets to advance research in this field.

Authors:Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
Title: A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
Abstract:
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data--a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from the non-object-centric robotics dataset is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck to reduce the number of prototypes to encourage the emergence of objectness as well as cross-view consistency regularization for encouraging multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferrable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.
中文摘要:研究表明,预训练视觉模型在机器人技术中的成功关键在于其从非物体中心数据中形成物体中心表征的能力,据此开发的SlotMIM方法在多项任务和数据集上均实现了卓越性能提升。
English Summary: This study reveals that pre-trained vision models' success in robotics hinges on their ability to form object-centric representations from non-object-centric data, leading to the development of SlotMIM, which achieves superior performance across various tasks and datasets.

Authors:Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley
Title: Motion Anything: Any to Motion Generation
Abstract:
Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Music-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website https://steve-zeyu-zhang.github.io/MotionAnything

Authors:Junyan Lin, Feng Gao, Lin Qi, Junyu Dong, Qian Du, Xinbo Gao
Title: Dynamic Cross-Modal Feature Interaction Network for Hyperspectral and LiDAR Data Classification
Abstract:
Hyperspectral image (HSI) and LiDAR data joint classification is a challenging task. Existing multi-source remote sensing data classification methods often rely on human-designed frameworks for feature extraction, which heavily depend on expert knowledge. To address these limitations, we propose a novel Dynamic Cross-Modal Feature Interaction Network (DCMNet), the first framework leveraging a dynamic routing mechanism for HSI and LiDAR classification. Specifically, our approach introduces three feature interaction blocks: Bilinear Spatial Attention Block (BSAB), Bilinear Channel Attention Block (BCAB), and Integration Convolutional Block (ICB). These blocks are designed to effectively enhance spatial, spectral, and discriminative feature interactions. A multi-layer routing space with routing gates is designed to determine optimal computational paths, enabling data-dependent feature fusion. Additionally, bilinear attention mechanisms are employed to enhance feature interactions in spatial and channel representations. Extensive experiments on three public HSI and LiDAR datasets demonstrate the superiority of DCMNet over state-of-the-art methods. Our code will be available at https://github.com/oucailab/DCMNet.
中文: 本文提出DCMNet动态跨模态网络,通过路由机制和双线性注意力模块增强高光谱与激光雷达数据的特征交互,在公开数据集上实现了优于现有方法的分类性能。
English: The paper introduces DCMNet, a dynamic cross-modal network using routing mechanisms and bilinear attention blocks to enhance feature fusion for hyperspectral and LiDAR image classification, outperforming existing methods on public datasets.

Authors:Jianxiong Gao, Yichang Liu, Baofeng Yang, Jianfeng Feng, Yanwei Fu
Title: CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing
Abstract:
In this paper, we introduce CineBrain, the first large-scale dataset featuring simultaneous EEG and fMRI recordings during dynamic audiovisual stimulation. Recognizing the complementary strengths of EEG's high temporal resolution and fMRI's deep-brain spatial coverage, CineBrain provides approximately six hours of narrative-driven content from the popular television series The Big Bang Theory for each of six participants. Building upon this unique dataset, we propose CineSync, an innovative multimodal decoding framework that integrates a Multi-Modal Fusion Encoder with a diffusion-based Neural Latent Decoder. Our approach effectively fuses EEG and fMRI signals, significantly improving the reconstruction quality of complex audiovisual stimuli. To facilitate rigorous evaluation, we introduce Cine-Benchmark, a comprehensive evaluation protocol that assesses reconstructions across semantic and perceptual dimensions. Experimental results demonstrate that CineSync achieves state-of-the-art video reconstruction performance and highlight our initial success in combining fMRI and EEG for reconstructing both video and audio stimuli. Project Page: https://jianxgao.github.io/CineBrain.

Authors:Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang
Title: From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
Abstract:
Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features in the previous timesteps and then reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching, significantly harming the generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99$\times$ on FLUX and 5.00$\times$ on HunyuanVideo without additional training. On DiT, it achieves $3.41$ lower FID compared with previous SOTA at $4.53\times$ acceleration. Our code has been released on GitHub: https://github.com/Shenyi-Z/TaylorSeer
Chinese: TaylorSeer通过泰勒级数展开预测未来时间步的特征,无需额外训练即可在图像和视频合成中实现近乎无损的高速加速效果。
English: TaylorSeer accelerates diffusion models by predicting future timestep features using Taylor series expansion, achieving near-lossless high-speed performance in image and video synthesis without extra training.
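
For intuition, the sketch below extrapolates a feature one timestep ahead from equally spaced cached features using backward finite differences, which for equal spacing coincides with a truncated Taylor expansion built from finite-difference derivative estimates. TaylorSeer's actual caching schedule, spacing, and expansion order are not assumed here.

```python
import torch

def taylor_predict(cached: list, order: int = 2) -> torch.Tensor:
    """Extrapolate one step ahead from equally spaced cached features (oldest first).

    Higher-order backward differences stand in for scaled derivatives, so the
    prediction is the current feature plus the accumulated differences.
    """
    assert len(cached) >= order + 1, "need at least order+1 cached features"
    pred = cached[-1].clone()
    diffs = list(cached)
    for _ in range(order):
        diffs = [b - a for a, b in zip(diffs[:-1], diffs[1:])]
        pred = pred + diffs[-1]
    return pred

# Toy check on a quadratic trajectory, which a 2nd-order expansion extrapolates exactly.
feats = [torch.full((4,), float(t * t)) for t in (0, 1, 2)]
print(taylor_predict(feats, order=2))  # tensor([9., 9., 9., 9.]) = 3^2
```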

Authors:Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, Yiu-ming Cheung
Title: Iterative Prompt Relocation for Distribution-Adaptive Visual Prompt Tuning
Abstract:
Visual prompt tuning (VPT) provides an efficient and effective solution for adapting pre-trained models to various downstream tasks by incorporating learnable prompts. However, most prior art indiscriminately applies a fixed prompt distribution across different tasks, neglecting the importance of each block differing depending on the task. In this paper, we investigate adaptive distribution optimization (ADO) by addressing two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through in-depth analysis, we provide an affirmative answer that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution building upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy for ADO derived from this formulation, comprising two optimization steps: identifying and pruning idle prompts, followed by determining the optimal blocks for their relocation. By iteratively performing prompt relocation and VPT, our proposal adaptively learns the optimal prompt distribution, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VPT methods, e.g., PRO-VPT surpasses VPT by 1.6% average accuracy, leading prompt-based methods to state-of-the-art performance on the VTAB-1k benchmark. The code is available at https://github.com/ckshang/PRO-VPT.
中文摘要:本文提出PRO-VPT框架,通过迭代式提示重定位策略自适应优化提示分布,显著提升视觉提示微调性能,在多个基准测试中达到最先进水平。
English Summary: This paper introduces PRO-VPT, an adaptive prompt distribution optimization framework that iteratively relocates prompts between model blocks to enhance visual prompt tuning performance, achieving state-of-the-art results on multiple benchmarks.
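
The prune-then-relocate step can be pictured with a toy routine like the one below, which drops the globally least important prompts and reassigns their budget to the block whose prompts are on average most important. The saliency measure, the destination rule, and all names here are assumptions for illustration; PRO-VPT's nested optimization is not reproduced.

```python
import torch

def relocate_prompts(prompt_scores: dict, num_move: int = 1) -> dict:
    """Toy one-step prompt relocation.

    prompt_scores: block_idx -> tensor of per-prompt importance scores.
    Returns a mapping {source_block: destination_block} for the moved prompts.
    """
    # Globally least important prompts are the pruning candidates.
    flat = [(s.item(), b, i) for b, sc in prompt_scores.items() for i, s in enumerate(sc)]
    flat.sort()
    pruned = flat[:num_move]
    # Destination: the block whose prompts look most useful on average.
    dest = max(prompt_scores, key=lambda b: prompt_scores[b].mean().item())
    return {src_block: dest for _, src_block, _ in pruned}

scores = {0: torch.tensor([0.9, 0.1]), 1: torch.tensor([0.8, 0.7]), 2: torch.tensor([0.05, 0.2])}
print(relocate_prompts(scores))  # {2: 1}: move one prompt from block 2 to block 1
```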

Authors:Ta Duc Huy, Sen Kim Tran, Phan Nguyen, Nguyen Hoang Tran, Tran Bao Sam, Anton van den Hengel, Zhibin Liao, Johan W. Verjans, Minh-Son To, Vu Minh Hieu Phan
Title: Interactive Medical Image Analysis with Concept-based Similarity Reasoning
Abstract:
The ability to interpret and intervene in model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link the model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts are at the image level, which hinders the model from pinpointing the exact patches where the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare these with test image patches, using the similarity scores for final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototype with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanation by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4.5\% across three biomedical datasets. Our code is released at https://github.com/tadeephuy/InteractCSR.
Chinese: 本文提出了基于概念的相似性推理网络(CSR),通过提供具有内在概念解释的补丁级原型和空间交互性,增强了医学影像的可解释性,并在三个生物医学数据集上比现有最优方法性能提升高达4.5%。
English: The paper introduces the Concept-based Similarity Reasoning network (CSR), which provides patch-level prototypes with intrinsic concept interpretation and spatial interactivity, improving interpretability in medical imaging and outperforming prior methods by up to 4.5% on biomedical datasets.

Authors:Sungsik Kim, Janghyun Baek, Jinkyu Kim, Jaekoo Lee
Title: GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought
Abstract:
While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy by combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at https://github.com/ai-kmu/GUIDE-CoT.
Chinese Summary: 提出的GUIDE-CoT方法通过整合目标导向的视觉提示和思维链推理,克服了大语言模型在行人轨迹预测中的视觉信息利用不足和轨迹生成困难等局限,实现了可控路径生成和最先进的预测性能。
English Summary: The proposed GUIDE-CoT method overcomes LLMs' limitations in pedestrian trajectory prediction by integrating goal-oriented visual prompts and chain-of-thought reasoning, achieving state-of-the-art performance with controllable path generation.

Authors:Siyu Li, Yihong Cao, Hao Shi, Yongsheng Zang, Xuan He, Kailun Yang, Zhiyong Li
Title: HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors
Abstract:
The exploration of Bird's-Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly accommodate all BEV mapping tasks. To address this gap, this paper proposes HierDAMap, a universal and holistic BEV domain adaptation framework with hierarchical perspective priors. Unlike existing research that solely focuses on image-level learning using prior knowledge, this paper explores the guiding role of perspective prior knowledge across three distinct levels: global, sparse, and instance levels. With these priors, HierDA consists of three essential components, including Semantic-Guided Pseudo Supervision (SGPS), Dynamic-Aware Coherence Learning (DACL), and Cross-Domain Frustum Mixing (CDFM). SGPS constrains the cross-domain consistency of perspective feature distribution through pseudo labels generated by vision foundation models in 2D space. To mitigate feature distribution discrepancies caused by spatial variations, DACL employs uncertainty-aware predicted depth as an intermediary to derive dynamic BEV labels from perspective pseudo-labels, thereby constraining the coarse BEV features derived from corresponding perspective features. CDFM, on the other hand, leverages perspective masks of view frustum to mix multi-view perspective images from both domains, which guides cross-domain view transformation and encoding learning through mixed BEV labels. The proposed method is verified on multiple BEV mapping tasks, such as BEV semantic segmentation, high-definition semantic, and vectorized mapping. The source code will be made publicly available at https://github.com/lynn-yu/HierDAMap.
中文摘要:本文提出HierDAMap框架,通过全局、稀疏和实例三个层次整合视角先验知识,解决鸟瞰图映射在无标签真实场景中的跨域适应问题,显著提升自动驾驶视觉感知的泛化能力。
English Summary: This paper introduces HierDAMap, a hierarchical domain adaptation framework that enhances BEV mapping for autonomous driving by incorporating perspective priors at global, sparse, and instance levels to address cross-domain challenges in unlabeled environments.

Authors:Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang
Title: VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
Abstract:
Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions remains unclear (e.g., playing tennis, backflip). Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code are available at https://videophy2.github.io/.
Chinese: VideoPhy-2作为评估视频生成模型物理常识的新基准,揭示了模型在遵循现实世界物理规则方面的重大缺陷,尤其在守恒定律方面表现不佳,最优模型在其困难子集上仅达到22%的综合性能。
English: VideoPhy-2 is a new benchmark that evaluates video generative models' physical commonsense, revealing significant shortcomings in adhering to real-world physics, particularly in conservation laws, with the best model achieving only 22% joint performance on its hard subset.

Authors:Abdelaziz Bouzidi, Hamid Laga, Hazem Wannous, Ferdous Sohel
Title: GNF: Gaussian Neural Fields for Multidimensional Signal Representation and Reconstruction
Abstract:
Neural fields have emerged as a powerful framework for representing continuous multidimensional signals such as images and videos, 3D and 4D objects and scenes, and radiance fields. While efficient, achieving high-quality representation requires the use of wide and deep neural networks. These, however, are slow to train and evaluate. Although several acceleration techniques have been proposed, they either trade memory for faster training and/or inference, rely on thousands of fitted primitives with considerable optimization time, or compromise the smooth, continuous nature of neural fields. In this paper, we introduce Gaussian Neural Fields (GNF), a novel compact neural decoder that maps learned feature grids into continuous non-linear signals, such as RGB images, Signed Distance Functions (SDFs), and radiance fields, using a single compact layer of Gaussian kernels defined in a high-dimensional feature space. Our key observation is that neurons in traditional MLPs perform simple computations, usually a dot product followed by an activation function, necessitating wide and deep MLPs or high-resolution feature grids to model complex functions. In this paper, we show that replacing MLP-based decoders with Gaussian kernels whose centers are learned features yields highly accurate representations of 2D (RGB), 3D (geometry), and 5D (radiance fields) signals with just a single layer of such kernels. This representation is highly parallelizable, operates on low-resolution grids, and trains in under $15$ seconds for 3D geometry and under $11$ minutes for view synthesis. GNF matches the accuracy of deep MLP-based decoders with far fewer parameters and significantly higher inference throughput.
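
A minimal sketch of the decoder idea follows: a queried grid feature is mapped to the output signal by a single layer of Gaussian kernels whose centers live in the same feature space, with per-kernel widths and output weights. The exact parameterization and any normalization GNF uses are assumptions here.

```python
import torch
import torch.nn as nn

class GaussianKernelDecoder(nn.Module):
    """One-layer Gaussian-kernel decoder sketch: output = sum_k w_k * exp(-||f - c_k||^2 / (2*sigma_k^2))."""
    def __init__(self, feat_dim: int, num_kernels: int, out_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_kernels, feat_dim))   # kernel centers in feature space
        self.log_sigma = nn.Parameter(torch.zeros(num_kernels))           # per-kernel width
        self.weights = nn.Parameter(torch.randn(num_kernels, out_dim) * 0.01)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) features interpolated from a low-resolution grid.
        d2 = torch.cdist(feats, self.centers) ** 2                          # (N, K) squared distances
        resp = torch.exp(-d2 / (2.0 * torch.exp(self.log_sigma) ** 2))      # Gaussian responses
        return resp @ self.weights                                          # (N, out_dim)

dec = GaussianKernelDecoder(feat_dim=32, num_kernels=256, out_dim=3)
print(dec(torch.randn(1024, 32)).shape)  # torch.Size([1024, 3]), e.g. RGB per query
```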

Authors:Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin
Title: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Abstract:
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released in: https://github.com/Osilly/Vision-R1 .
中文: Vision-R1通过构建无需人工标注的20万规模多模态思维链数据集,并采用渐进式训练策略,有效提升了多模态数学推理能力,在多个基准测试中表现优异。
English: Vision-R1 is a multimodal reasoning model that enhances reasoning capabilities by creating a high-quality 200K multimodal CoT dataset without human annotation and employing progressive training strategies, achieving significant improvements on math reasoning benchmarks.
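
The "hard formatting result reward" can be imagined as an all-or-nothing check like the toy function below, which only rewards responses that follow a fixed tag format and contain the correct final answer. The tag names, the exact-match criterion, and the binary scoring are our assumptions, not Vision-R1's published specification.

```python
import re

def format_and_result_reward(response: str, gold_answer: str) -> float:
    """Toy GRPO-style reward: 0 unless the output is well-formatted and correct."""
    pattern = r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*"
    m = re.fullmatch(pattern, response, flags=re.S)
    if m is None:
        return 0.0  # malformed reasoning/answer structure gets no reward
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

print(format_and_result_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.0
print(format_and_result_reward("the answer is 4", "4"))                          # 0.0
```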

Authors:Hantao Zhang, Yuhe Liu, Jiancheng Yang, Weidong Guo, Xinyuan Wang, Pascal Fua
Title: DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion
Abstract:
Accurate medical image segmentation is crucial for precise anatomical delineation. Deep learning models like U-Net have shown great success but depend heavily on large datasets and struggle with domain shifts, complex structures, and limited training samples. Recent studies have explored diffusion models for segmentation by iteratively refining masks. However, these methods still retain the conventional image-to-mask mapping, making them highly sensitive to input data, which hampers stability and generalization. In contrast, we introduce DiffAtlas, a novel generative framework that models both images and masks through diffusion during training, effectively ``GenAI-fying'' atlas-based segmentation. During testing, the model is guided to generate a specific target image-mask pair, from which the corresponding mask is obtained. DiffAtlas retains the robustness of the atlas paradigm while overcoming its scalability and domain-specific limitations. Extensive experiments on CT and MRI across same-domain, cross-modality, varying-domain, and different data-scale settings using the MMWHS and TotalSegmentator datasets demonstrate that our approach outperforms existing methods, particularly in limited-data and zero-shot modality segmentation. Code is available at https://github.com/M3DV/DiffAtlas.
中文: DiffAtlas提出了一种新颖的生成框架,在训练中通过扩散过程同时建模图像和掩码,显著提升了医学图像分割的鲁棒性和泛化能力,尤其在数据有限和跨模态场景下优于现有方法。
English: DiffAtlas introduces a generative framework that models both images and masks using diffusion during training, enhancing robustness and generalization in medical image segmentation, particularly outperforming existing methods in limited-data and cross-modality scenarios.
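
The core modeling choice, denoising image and mask jointly rather than mapping image to mask, can be sketched with a toy network that predicts noise for both modalities concatenated along the channel axis. The backbone below is a placeholder convolutional stack with no timestep conditioning, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    """Toy joint image-mask denoiser: both modalities share one noise-prediction network."""
    def __init__(self, img_ch: int = 1, mask_ch: int = 1, hidden: int = 32):
        super().__init__()
        in_ch = img_ch + mask_ch
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, in_ch, 3, padding=1),  # predicted noise for image and mask channels
        )

    def forward(self, noisy_img: torch.Tensor, noisy_mask: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([noisy_img, noisy_mask], dim=1))

eps = JointDenoiser()(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
print(eps.shape)  # torch.Size([2, 2, 64, 64])
```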

Authors:Hantao Zhou, Rui Yang, Longxiang Tang, Guanyi Qin, Runze Hu, Xiu Li
Title: Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts
Abstract:
Image assessment aims to evaluate the quality and aesthetics of images and has been applied across various scenarios, such as natural and AIGC scenes. Existing methods mostly address these sub-tasks or scenes individually. While some works attempt to develop unified image assessment models, they have struggled to achieve satisfactory performance or cover a broad spectrum of assessment scenarios. In this paper, we present \textbf{Gamma}, a \textbf{G}eneric im\textbf{A}ge assess\textbf{M}ent model using \textbf{M}ixture of \textbf{A}ssessment Experts, which can effectively assess images from diverse scenes through mixed-dataset training. Achieving unified training in image assessment presents significant challenges due to annotation biases across different datasets. To address this issue, we first propose a Mixture of Assessment Experts (MoAE) module, which employs shared and adaptive experts to dynamically learn common and specific knowledge for different datasets, respectively. In addition, we introduce a Scene-based Differential Prompt (SDP) strategy, which uses scene-specific prompts to provide prior knowledge and guidance during the learning process, further boosting adaptation for various scenes. Our Gamma model is trained and evaluated on 12 datasets spanning 6 image assessment scenarios. Extensive experiments show that our unified Gamma outperforms other state-of-the-art mixed-training methods by significant margins while covering more scenes. Codes are available at https://github.com/zht8506/Gamma.
中文摘要:本文提出Gamma模型,通过混合评估专家模块和场景差异提示策略,实现了跨多种场景的统一图像质量评估,在覆盖更广场景的同时显著优于现有混合训练方法。
English Summary: The paper introduces Gamma, a unified image assessment model that employs a Mixture of Assessment Experts and Scene-based Differential Prompts to effectively evaluate images across diverse scenarios, outperforming existing methods while covering more scenes.
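
A rough sketch of a Mixture-of-Assessment-Experts block is given below: one shared expert captures dataset-agnostic knowledge while a gate mixes several adaptive experts for dataset-specific knowledge. The expert architecture, gating, and residual combination are illustrative assumptions rather than Gamma's exact design.

```python
import torch
import torch.nn as nn

class MoAE(nn.Module):
    """Toy mixture of a shared expert and gated adaptive experts."""
    def __init__(self, dim: int, num_adaptive: int = 4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.adaptive = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_adaptive)
        )
        self.gate = nn.Linear(dim, num_adaptive)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, dim) pooled image feature.
        w = self.gate(x).softmax(dim=-1)                                  # (B, E) expert weights
        adaptive = torch.stack([e(x) for e in self.adaptive], dim=1)      # (B, E, dim)
        mixed = (w.unsqueeze(-1) * adaptive).sum(dim=1)
        return x + self.shared(x) + mixed                                 # residual combination

print(MoAE(128)(torch.randn(8, 128)).shape)  # torch.Size([8, 128])
```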

Authors:AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian Shen, Chengshi Shi, Mingkang Shi, Modi Shi, Chonghao Sima, Jianheng Song, Huijie Wang, Wenhao Wang, Dafeng Wei, Chengen Xie, Guo Xu, Junchi Yan, Cunbiao Yang, Lei Yang, Shukai Yang, Maoqing Yao, Jia Zeng, Chi Zhang, Qinglin Zhang, Bin Zhao, Chengyue Zhao, Jiaqi Zhao, Jianchao Zhu
Title: AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Abstract:
We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.
中文: 本研究推出了AgiBot World大规模机器人数据集平台和GO-1通用策略,通过利用海量高质量数据显著提升了机器人操作性能,在复杂任务中表现优异。
English: This research introduces AgiBot World, a large-scale platform with over 1 million robot trajectories, and GO-1, a generalist policy that significantly improves robotic manipulation performance by leveraging extensive, high-quality data.

Authors:Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, S. Kevin Zhou
Title: AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
Abstract:
Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved through a straightforward yet effective two-stage approach: it first creates anomaly-aware text anchors to differentiate normal and abnormal semantics clearly, then aligns patch-level visual features with these anchors for precise anomaly localization. This two-stage strategy, with the help of residual adapters, gradually adapts CLIP in a controlled manner, achieving effective AD while maintaining CLIP's class knowledge. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. The code is available at https://github.com/Mwxinnn/AA-CLIP.
中文: 提出的异常感知CLIP(AA-CLIP)通过文本锚点和视觉特征对齐的两阶段方法增强CLIP的异常识别能力,在工业和医疗应用的零样本异常检测中取得了最先进的结果。
English: The proposed Anomaly-Aware CLIP (AA-CLIP) enhances CLIP's anomaly discrimination through a two-stage approach using text anchors and visual feature alignment, achieving state-of-the-art zero-shot anomaly detection results in industrial and medical applications.
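
The anchor-based scoring can be pictured as below: patch embeddings are compared against a "normal" and an "abnormal" text anchor, and the softmax over the two similarities gives a per-patch anomaly probability. AA-CLIP's residual adapters and two-stage training are not modeled; the temperature and names are illustrative.

```python
import torch
import torch.nn.functional as F

def anomaly_map(patch_feats, normal_anchor, abnormal_anchor, temperature: float = 0.07):
    """Toy patch-level anomaly scoring against two text anchors.

    patch_feats: (N, D) patch embeddings; anchors: (D,) text embeddings.
    Returns (N,) probabilities of being abnormal.
    """
    feats = F.normalize(patch_feats, dim=-1)
    anchors = F.normalize(torch.stack([normal_anchor, abnormal_anchor]), dim=-1)
    logits = feats @ anchors.t() / temperature  # (N, 2) similarities to normal/abnormal
    return logits.softmax(dim=-1)[:, 1]

scores = anomaly_map(torch.randn(196, 512), torch.randn(512), torch.randn(512))
print(scores.shape)  # torch.Size([196]); reshape to the patch grid for a heatmap
```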

Authors:Hasan Abed Al Kader Hammoud, Bernard Ghanem
Title: DiffCLIP: Differential Attention Meets CLIP
Abstract:
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at https://github.com/hammoudhasan/DiffCLIP.
中文摘要:DiffCLIP通过在CLIP模型中引入差分注意力机制,以极少的计算开销显著提升了图像-文本理解任务的性能,同时保持了高效性。
English Summary: DiffCLIP enhances the CLIP model by integrating a differential attention mechanism, which improves performance on image-text tasks with minimal computational cost and no efficiency loss.
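
As a reminder of what differential attention computes, the single-head sketch below subtracts a second softmax attention map, scaled by a learnable lambda, from the first before attending to values. DiffCLIP's multi-head layout and lambda re-parameterization follow the Differential Transformer and are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttention(nn.Module):
    """Single-head differential attention sketch: attn = softmax(A1) - lambda * softmax(A2)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, 2 * dim)   # two query groups, split later
        self.k_proj = nn.Linear(dim, 2 * dim)   # two key groups, split later
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token embeddings.
        d = x.shape[-1]
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
        attn = a1 - self.lam * a2  # noise-cancelling differential attention map
        return self.out_proj(attn @ v)

print(DifferentialAttention(64)(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```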

Authors:Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, Shuxiang Song
Title: Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
Abstract:
Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking, which places extreme emphasis on efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize the structure of ViTs. Our approach dynamically disables a large number of representation-similar layers and selectively retains only a single optimal layer among them, aiming to achieve a better accuracy-speed trade-off. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision. Codes and models are available at https://github.com/GXNU-ZhongLab/SGLATrack.
中文: 针对无人机跟踪,本研究通过相似性引导的层自适应方法精简视觉变换器结构,动态去除冗余层并保留最优单层,实现了高速实时跟踪且保持高精度。
English: Vision transformers are optimized for UAV tracking by dynamically disabling redundant layers and retaining only the most effective one, achieving high-speed real-time performance without compromising accuracy.
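
The similarity-guided selection can be approximated by the toy routine below: consecutive layers whose token representations are nearly identical (by mean cosine similarity) are treated as redundant and only the last of each run is kept. The threshold and the exact selection rule in SGLATrack may differ.

```python
import torch
import torch.nn.functional as F

def select_layers(layer_feats, thresh: float = 0.95):
    """Toy layer selection: keep a layer only when it changes the representation enough.

    layer_feats: list of per-layer token features, each of shape (N, D).
    Returns the indices of layers to keep (the final layer is always kept).
    """
    keep = []
    for i in range(len(layer_feats) - 1):
        sim = F.cosine_similarity(layer_feats[i], layer_feats[i + 1], dim=-1).mean()
        if sim < thresh:  # the next layer still adds new information, so keep this one
            keep.append(i)
    keep.append(len(layer_feats) - 1)
    return keep

feats = [torch.randn(196, 256) for _ in range(12)]  # stand-in for 12 ViT blocks
print(select_layers(feats))
```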

Authors:Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song
Title: Dynamic Updates for Language Adaptation in Visual-Language Tracking
Abstract:
The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at https://github.com/GXNU-ZhongLab/DUTrack.
中文: DUTrack提出了一种动态视觉语言跟踪框架,通过大语言模型生成动态描述和自适应模板更新多模态参考信息,以保持语义一致性并提升跟踪器鲁棒性,在多个基准测试中取得了最优性能。
English: DUTrack introduces a dynamic vision-language tracking framework that updates multi-modal references using a large language model and adaptive templates to maintain semantic consistency and enhance tracker robustness, achieving state-of-the-art results on multiple benchmarks.

Authors:Ruchi Bhatt, Shreya Bansal, Amanpreet Chander, Rupinder Kaur, Malya Singh, Mohan Kankanhalli, Abdulmotaleb El Saddik, Mukesh Kumar Saini
Title: GroMo: Plant Growth Modeling with Multiview Images
Abstract:
Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish, okra, wheat, and mustard. Each crop consists of multiple plants (p1, p2, ..., pn) captured over different days (d1, d2, ..., dm) and categorized into five levels (L1, L2, L3, L4, L5). Each plant is captured from 24 different angles with a 15-degree gap between images. Participants are required to perform both tasks for all four crops with these multiview images. We proposed a Multiview Vision Transformer (MVVT) model for the GroMo challenge and evaluated the crop-wise performance on GroMo25. MVVT reports an average MAE of 7.74 for age prediction and an MAE of 5.52 for leaf count. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth. The GitHub repository is publicly available at https://github.com/mriglab/GroMo-Plant-Growth-Modeling-with-Multiview-Images.
中文: GroMo挑战赛通过提供多视角图像数据集,提出植物年龄预测和叶片计数两项任务以推动农业表型研究,其MVVT模型实现了较低误差率。
English: The GroMo Challenge introduces a multiview image dataset and tasks for predicting plant age and leaf count to advance agricultural phenotyping, with a proposed MVVT model achieving low error rates.

Authors:Yixin Yang, Yang Zhou, Hui Huang
Title: Introducing Unbiased Depth into 2D Gaussian Splatting for High-accuracy Surface Reconstruction
Abstract:
Recently, 2D Gaussian Splatting (2DGS) has demonstrated superior geometry reconstruction quality than the popular 3DGS by using 2D surfels to approximate thin surfaces. However, it falls short when dealing with glossy surfaces, resulting in visible holes in these areas. We find that the reflection discontinuity causes the issue. To fit the jump from diffuse to specular reflection at different viewing angles, depth bias is introduced in the optimized Gaussian primitives. To address that, we first replace the depth distortion loss in 2DGS with a novel depth convergence loss, which imposes a strong constraint on depth continuity. Then, we rectify the depth criterion in determining the actual surface, which fully accounts for all the intersecting Gaussians along the ray. Qualitative and quantitative evaluations across various datasets reveal that our method significantly improves reconstruction quality, with more complete and accurate surfaces than 2DGS. Code is available at https://github.com/XiaoXinyyx/Unbiased_Surfel.
中文摘要:本文针对二维高斯泼溅技术在光泽表面重建中的不足,提出新型深度收敛损失函数和改进的深度判定准则,相比原有方法显著提升了表面重建的完整度与精确度。
English Summary: This paper introduces a novel depth convergence loss and refined depth criterion to address 2D Gaussian Splatting's limitations in reconstructing glossy surfaces, significantly improving reconstruction completeness and accuracy compared to previous methods.
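
The "account for all intersecting Gaussians" idea can be sketched as an opacity-weighted expected depth along each ray, as below; the paper's actual depth criterion and convergence loss are more involved, and this toy function only illustrates blending over every contributing primitive.

```python
import torch

def expected_ray_depth(depths: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Toy per-ray surface depth from all intersected surfels via alpha blending.

    depths: (K,) intersection depths ordered front to back; alphas: (K,) opacities.
    """
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)  # transmittance
    weights = trans * alphas                       # standard volumetric blending weights
    return (weights * depths).sum() / weights.sum().clamp(min=1e-8)

d = torch.tensor([2.0, 2.1, 5.0])
a = torch.tensor([0.6, 0.5, 0.9])
print(float(expected_ray_depth(d, a)))  # ~2.57, dominated by the front surfels
```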

Authors:Qiyuan He, Angela Yao
Title: Conceptrol: Concept Control of Zero-shot Personalized Image Generation
Abstract:
Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image model has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA. The source code is available at https://github.com/QY-H00/Conceptrol.
中文: Conceptrol是一种创新框架,通过引入文本概念掩码增强零样本适配器,在无需额外计算开销的情况下显著提升了图像生成中个性化内容与文本指令的协调能力。
English: Conceptrol is a novel framework that enhances zero-shot adapters by integrating textual concept masks to better balance personalized content with text prompts, achieving significant improvements in image generation benchmarks without extra computational cost.

Authors:Jiangdong Cai, Haotian Jiang, Zhenrong Shen, Yonghao Li, Honglin Xiong, Lichi Zhang, Qian Wang
Title: LSA: Latent Style Augmentation Towards Stain-Agnostic Cervical Cancer Screening
Abstract:
The deployment of computer-aided diagnosis systems for cervical cancer screening using whole slide images (WSIs) faces critical challenges due to domain shifts caused by staining variations across different scanners and imaging environments. While existing stain augmentation methods improve patch-level robustness, they fail to scale to WSIs due to two key limitations: (1) inconsistent stain patterns when extending patch operations to gigapixel slides, and (2) prohibitive computational/storage costs from offline processing of augmented WSIs. To address this, we propose Latent Style Augmentation (LSA), a framework that performs efficient, online stain augmentation directly on WSI-level latent features. We first introduce WSAug, a WSI-level stain augmentation method ensuring consistent stain across patches within a WSI. Using WSIs offline-augmented by WSAug, we design and train a Stain Transformer, which can simulate targeted styles in the latent space, efficiently enhancing the robustness of the WSI-level classifier. We validate our method on a multi-scanner WSI dataset for cervical cancer diagnosis. Despite being trained on data from a single scanner, our approach achieves significant performance improvements on out-of-distribution data from other scanners. Code will be available at https://github.com/caijd2000/LSA.
Chinese: 所提出的潜在样式增强(LSA)框架通过在全切片图像潜在特征上直接进行高效的在线染色增强,有效解决了宫颈癌筛查中的域偏移问题,显著提升了分类器在不同扫描仪间的泛化性能,且无需多扫描仪训练数据。
English: The proposed Latent Style Augmentation (LSA) framework addresses domain shifts in cervical cancer screening by enabling efficient, online stain augmentation directly on whole slide image latent features, significantly improving classifier robustness across different scanners without requiring multi-scanner training data.

Authors:Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, Xiaokang Yang
Title: QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation
Abstract:
Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and cache mechanism, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of 6.72$\times$ on Open-Sora with minimal loss in generation quality. Extensive experiments across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. The code and models will be available at https://github.com/JunyiWuCode/QuantCache.
Chinese: 本文提出QuantCache,一种免训练的推理加速框架,通过联合优化分层潜在缓存、自适应量化与结构剪枝,在保持生成质量的同时实现了DiT视频生成6.72倍的端到端加速。
English: This paper introduces QuantCache, a training-free inference acceleration framework that combines hierarchical latent caching, adaptive quantization, and structural pruning to achieve a 6.72× speedup in DiT-based video generation with minimal quality loss.
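As a rough illustration of the caching ingredient described above (not the quantization or pruning components, and not the QuantCache code), the sketch below reuses a transformer block's output across denoising steps whenever its input has barely changed. The block, tolerance, and cache policy are all assumptions chosen for clarity.

```python
# Toy sketch: skip recomputing a DiT block when its input changes very little
# between denoising steps, caching the last input/output pair.
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    def __init__(self, dim=64, tol=1e-2):
        super().__init__()
        self.block = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.tol = tol
        self._last_in = None
        self._last_out = None

    def forward(self, x):
        if self._last_in is not None:
            # Relative change of the input since the cached step.
            delta = (x - self._last_in).norm() / (self._last_in.norm() + 1e-8)
            if delta < self.tol:
                return self._last_out          # cache hit: reuse the stored output
        out = self.block(x)                    # cache miss: recompute and store
        self._last_in, self._last_out = x.detach(), out.detach()
        return out

blk = CachedBlock()
x = torch.randn(1, 16, 64)
y1 = blk(x)
y2 = blk(x + 1e-4 * torch.randn_like(x))      # nearly identical input -> reuse
print(torch.allclose(y1, y2))
```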

Authors:Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
Title: ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability
Abstract:
Unified multimodal understanding and generation have recently received much attention in the area of vision and language. Existing unified models (UniMs) are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image content. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify the embedding space integrating textual and visual modalities, enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a ``what or how to generate'' algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at https://github.com/finyorko/armor.
Chinese: ARMOR是一种资源高效的自回归框架,通过非对称编码器-解码器架构、精心策划的数据集和渐进式训练算法,将现有多模态大语言模型升级为能实现统一多模态理解与生成的系统,以最小计算开销实现流畅的图文交错生成。
English: ARMOR is a resource-efficient autoregressive framework that enhances existing multimodal large language models to achieve unified multimodal understanding and generation through an asymmetric encoder-decoder architecture, curated dataset, and progressive training algorithm, enabling seamless text-image interleaved generation with minimal computational overhead.

Authors:Xiaoyang Liu, Yuquan Wang, Zheng Chen, Jiezhang Cao, He Zhang, Yulun Zhang, Xiaokang Yang
Title: One-Step Diffusion Model for Image Motion-Deblurring
Abstract:
Currently, methods for single-image deblurring based on CNNs and transformers have demonstrated promising performance. However, these methods often suffer from perceptual limitations, poor generalization ability, and struggle with heavy or complex blur. While diffusion-based methods can partially address these shortcomings, their multi-step denoising process limits their practical usage. In this paper, we conduct an in-depth exploration of diffusion models in deblurring and propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step, significantly improving inference efficiency while maintaining high fidelity. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Additionally, we construct a high-quality synthetic deblurring dataset to mitigate perceptual collapse and design a dynamic dual-adapter (DDA) to enhance perceptual quality while preserving fidelity. Extensive experiments demonstrate that our method achieves strong performance on both full and no-reference metrics. Our code and pre-trained model will be publicly available at https://github.com/xyLiu339/OSDD.
中文: 本文提出了一种用于去模糊的一步扩散模型(OSDD),通过改进的变分自编码器和动态双适配器,在显著提高推理效率的同时保持了高保真度,并在去模糊基准测试中表现出色。
English: This paper introduces a one-step diffusion model for deblurring (OSDD) that enhances inference efficiency and maintains high fidelity through an improved variational autoencoder and a dynamic dual-adapter, validated by strong performance on deblurring benchmarks.

Authors:Chen-Lin Zhang, Lin Sui, Shuming Liu, Fangzhou Mu, Zhangcheng Wang, Bernard Ghanem
Title: TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos
Abstract:
Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at https://github.com/sming256/TimeLoc.
中文: TimeLoc是一个统一的端到端时间定位框架,通过高效的单阶段模型处理未剪辑视频中的多种任务,并在多个基准测试中取得了最先进的性能。
English: TimeLoc is a unified end-to-end framework for temporal localization in untrimmed videos that handles multiple tasks through an efficient one-stage model, achieving state-of-the-art results across various benchmarks.
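The temporal-chunking idea mentioned above, encoding a very long per-frame feature sequence in overlapping windows and stitching the results back together, can be sketched as follows. The chunk and stride sizes and the stand-in encoder are illustrative assumptions, not the TimeLoc configuration.

```python
# Illustrative inference-time sketch of temporal chunking for long videos:
# encode overlapping windows of per-frame features and average the overlaps.
import torch
import torch.nn as nn

@torch.no_grad()
def encode_in_chunks(features: torch.Tensor, encoder: nn.Module,
                     chunk: int = 512, stride: int = 448) -> torch.Tensor:
    """features: (T, D) per-frame features; returns (T, D) encoded features."""
    T, D = features.shape
    out = torch.zeros(T, D)
    counts = torch.zeros(T, 1)
    for start in range(0, T, stride):
        end = min(start + chunk, T)
        out[start:end] += encoder(features[start:end])
        counts[start:end] += 1
        if end == T:
            break
    return out / counts                       # average the overlapping regions

encoder = nn.Linear(64, 64)                   # stand-in per-chunk encoder
feats = torch.randn(30_000, 64)               # e.g. a 30k-frame video
print(encode_in_chunks(feats, encoder).shape)
```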

Authors:Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia
Title: Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Abstract:
Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18\%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.
Chinese: Seg-Zero是一种创新框架,通过解耦架构和强化学习解决了传统分割方法的局限性,实现了强大的零样本泛化能力和显式推理链,在基准测试中显著超越先前模型。
English: Seg-Zero is a novel framework that overcomes the limitations of traditional segmentation methods by using a decoupled architecture and reinforcement learning to achieve robust zero-shot generalization and explicit reasoning chains, significantly outperforming previous models on benchmarks.
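The abstract's combination of a format reward and an accuracy reward can be illustrated with the toy composite reward below. The JSON output schema, the box-IoU accuracy term, and the weights are hypothetical stand-ins; the actual reward design in the paper differs in detail.

```python
# Toy sketch of a composite RL reward: a format check plus an accuracy term.
import json

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def reward(response: str, gt_box, w_format=0.2, w_acc=0.8) -> float:
    """Expect a JSON response like {"reasoning": ..., "box": [x1, y1, x2, y2]}."""
    try:
        parsed = json.loads(response)
        fmt = 1.0 if "reasoning" in parsed and len(parsed["box"]) == 4 else 0.0
        acc = iou(parsed["box"], gt_box) if fmt else 0.0
    except (json.JSONDecodeError, KeyError, TypeError):
        fmt, acc = 0.0, 0.0                   # malformed output gets no reward
    return w_format * fmt + w_acc * acc

print(reward('{"reasoning": "the red mug", "box": [10, 10, 50, 60]}', [12, 8, 48, 58]))
```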

Authors:Shinnosuke Matsuo, Riku Togashi, Ryoma Bise, Seiichi Uchida, Masahiro Nomura
Title: Instance-wise Supervision-level Optimization in Active Learning
Abstract:
Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost. This code is available at https://github.com/matsuo-shinnosuke/ISOAL.
Chinese: ISO框架通过动态选择实例及其标注级别,利用优化的价值成本比和多样性,以更低成本实现了更高的分类准确率。
English: The ISO framework enhances active learning by dynamically selecting both instances and their annotation levels, achieving superior accuracy at lower cost through optimized value-to-cost ratios and diversity.
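The value-to-cost-ratio idea described above lends itself to a simple greedy sketch: pick the candidate with the best value-to-cost ratio under the remaining budget, while discounting candidates too similar to those already chosen. The scoring function, diversity penalty, and toy data below are assumptions, not the ISO algorithm.

```python
# Toy sketch of VCR-based selection under a fixed annotation budget.
import numpy as np

def select_annotations(values, costs, embeddings, budget, lam=0.5):
    """Greedily pick candidates with the best value-to-cost ratio, penalising
    similarity to already selected embeddings to encourage diversity."""
    values, costs = np.asarray(values, float), np.asarray(costs, float)
    chosen, spent = [], 0.0
    while spent < budget:
        best, best_score = None, -np.inf
        for i in range(len(values)):
            if i in chosen or spent + costs[i] > budget:
                continue
            vcr = values[i] / costs[i]
            sim = max(float(embeddings[i] @ embeddings[j]) for j in chosen) if chosen else 0.0
            score = vcr - lam * sim
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        chosen.append(best)
        spent += costs[best]
    return chosen

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(select_annotations(rng.random(20), rng.uniform(1, 3, 20), emb, budget=10))
```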

Authors:Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Title: Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation
Abstract:
Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts, such as missing entities, attribute binding errors, and incorrect relationships, remains a formidable challenge. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the constraints extracted from the text, without any unnecessary additions. These constraints are formulated as losses (entity missing, entity mixing, attribute binding, and spatial relationships) and integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to guide the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at \href{https://github.com/hadi-hosseini/noise-refinement}{https://github.com/hadi-hosseini/noise-refinement}.
中文摘要:本文提出一种无需训练的方法,通过将文本约束整合为统一损失函数并采用反馈驱动的噪声优化系统,显著提升了文本到图像生成的组合能力与空间关系准确性。
English Summary: This paper introduces a training-free method that enhances text-to-image generation by integrating textual constraints as unified losses and employing a feedback-driven noise refinement system, significantly improving compositionality and spatial accuracy.
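Of the constraint losses listed in the abstract, the entity-missing idea is the easiest to illustrate: each entity token should attend strongly to at least some spatial location in the cross-attention map. The sketch below is a toy, hypothetical version of such a term, not the paper's unified objective.

```python
# Toy "entity missing" style loss over per-token cross-attention maps.
import torch

def entity_missing_loss(attn_maps: torch.Tensor, entity_indices) -> torch.Tensor:
    """attn_maps: (tokens, H, W) cross-attention map for each prompt token."""
    losses = []
    for idx in entity_indices:
        a = attn_maps[idx]
        a = a / (a.sum() + 1e-8)               # normalise to a spatial distribution
        losses.append(1.0 - a.max())           # penalise entities with no strong peak
    return torch.stack(losses).mean()

attn = torch.rand(12, 16, 16)                  # toy attention maps for 12 tokens
print(entity_missing_loss(attn, entity_indices=[2, 7]))
```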

Authors:Xirui Hu, Jiahao Wang, Hao Chen, Weizhan Zhang, Benqi Wang, Yikun Li, Haishun Nan
Title: DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
Abstract:
Recent advances in text-to-image generation have driven interest in generating personalized human images that depict specific identities from reference images. Although existing methods achieve high-fidelity identity preservation, they are generally limited to single-ID scenarios and offer insufficient facial editability. We present DynamicID, a tuning-free framework that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the base model when injecting ID features and achieve multi-ID personalization without requiring multi-ID samples during training. 2) Identity-Motion Reconfigurator (IMR), which applies feature-space manipulation to effectively disentangle and reconfigure facial motion and identity features, supporting flexible facial editing. 3) a task-decoupled training paradigm that reduces data dependency, together with VariFace-10k, a curated dataset of 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability. Our code will be released at https://github.com/ByteCat-bot/DynamicID.
中文: DynamicID框架通过语义激活注意力和身份-运动重构器,无需调优即可实现高保真的单人与多人身份图像生成及灵活面部编辑,性能超越现有方法。
English: DynamicID is a novel framework that enables high-fidelity single and multi-ID image generation with enhanced facial editability through Semantic-Activated Attention and Identity-Motion Reconfigurator, outperforming existing methods without requiring fine-tuning.

Authors:Huaqi Tao, Bingxi Liu, Calvin Chen, Tingjun Huang, He Li, Jinqiang Cui, Hong Zhang
Title: TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification
Abstract:
Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene texts frequently appear in indoor spaces and can help distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene texts. Finally, the discriminative texts are filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
中文摘要:TextInPlace是一种创新的视觉位置识别框架,通过集成场景文本检测技术来区分视觉重复的室内环境,采用双分支架构并结合新建基准数据集实现了优越性能。
English Summary: TextInPlace is a novel Visual Place Recognition framework that integrates scene text spotting to distinguish visually repetitive indoor environments, achieving superior performance through a dual-branch architecture and a newly established benchmark dataset.
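The re-ranking step described above, filtering discriminative scene texts and using text similarity to reorder the top-K retrievals, can be sketched as below. The Jaccard-overlap text score and the weighting are illustrative assumptions, not the TextInPlace re-ranking rule.

```python
# Toy sketch: re-rank top-K retrieved places by combining appearance score
# with a simple overlap score between spotted scene-text strings.
def text_similarity(query_texts, candidate_texts):
    q = set(t.lower() for t in query_texts)
    c = set(t.lower() for t in candidate_texts)
    return len(q & c) / max(len(q | c), 1)      # Jaccard overlap of spotted strings

def rerank(candidates, query_texts, w_text=0.5):
    """candidates: list of (image_id, appearance_score, spotted_texts)."""
    scored = [(img, (1 - w_text) * s + w_text * text_similarity(query_texts, texts))
              for img, s, texts in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

topk = [("imgA", 0.82, ["room 301", "exit"]),
        ("imgB", 0.85, ["room 417"]),
        ("imgC", 0.80, ["room 301"])]
print(rerank(topk, query_texts=["Room 301"]))
```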

Authors:Xiao Wang, Yuehang Li, Fuling Wang, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang, Bin Luo
Title: Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms
Abstract:
Accurate sign language understanding serves as a crucial communication channel for individuals with disabilities. Current sign language translation algorithms predominantly rely on RGB frames, which may be limited by fixed frame rates, variable lighting conditions, and motion blur caused by rapid hand movements. Inspired by the recent successful application of event cameras in other fields, we propose to leverage event streams to assist RGB cameras in capturing gesture data, addressing the various challenges mentioned above. Specifically, we first collect a large-scale RGB-Event sign language translation dataset using the DVS346 camera, termed VECSL, which contains 15,676 RGB-Event samples, 15,191 glosses, and covers 2,568 Chinese characters. These samples were gathered across a diverse range of indoor and outdoor environments, capturing multiple viewing angles, varying light intensities, and different camera motions. Due to the absence of benchmark algorithms for comparison in this new task, we retrained and evaluated multiple state-of-the-art SLT algorithms, and believe that this benchmark can effectively support subsequent related research. Additionally, we propose a novel RGB-Event sign language translation framework (i.e., M$^2$-SLT) that incorporates fine-grained micro-sign and coarse-grained macro-sign retrieval, achieving state-of-the-art results on the proposed dataset. Both the source code and dataset will be released on https://github.com/Event-AHU/OpenESL.
中文摘要:本研究提出一种新颖的RGB-事件手语翻译方法,利用事件相机克服RGB相机的局限性,构建了VECSL数据集和M²-SLT框架,实现了最先进的性能表现。
English Summary: This study introduces a novel RGB-Event sign language translation approach using event cameras to overcome RGB limitations, creating the VECSL dataset and M²-SLT framework that achieves state-of-the-art performance.

Authors:Qiaole Dong, Yanwei Fu
Title: Online Dense Point Tracking with Streaming Memory
Abstract:
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting without considering temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with \textbf{S}treaming memory for dense \textbf{PO}int \textbf{T}racking and online video processing. The \textbf{SPOT} framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT with 10$\times$ smaller parameter numbers operates at least 2$\times$ faster than previous state-of-the-art models while maintaining the best performance on CVO. We will release the models and codes at: https://dqiaole.github.io/SPOT/.

Authors:Shijia Zhao, Qiming Xia, Xusheng Guo, Pufan Zou, Maoji Zheng, Hai Wu, Chenglu Wen, Cheng Wang
Title: SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Abstract:
Recently, sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D detectors while requiring only a few annotated instances. Nevertheless, these methods struggle when accurate labels are extremely scarce. In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. Specifically, we first develop a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts through boundary-constrained center cluster selection. Based on these accurate semantic prompts, which we treat as seed points, we introduce a Dynamic Cluster Pseudo-label Generation (DCPG) module to yield pseudo-supervision signals from the geometry shape of multi-scale neighbor points. Additionally, we design a Distribution Shape score (DS score) that chooses high-quality supervision signals for the initial training of the 3D detector. Experiments on the KITTI dataset and Waymo Open Dataset (WOD) have validated that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions. Moreover, we verified SP3D in the zero-shot setting, where its performance exceeded that of the state-of-the-art methods. The code is available at https://github.com/xmuqimingxia/SP3D.
Chinese: 本文提出SP3D增强策略,利用大型多模态模型生成的跨模态语义提示,在稀疏标注条件下显著提升3D物体检测性能,并在多个基准数据集上验证了其优越性。
English: The paper introduces SP3D, a boosting strategy that leverages cross-modal semantic prompts from Large Multimodal Models to enhance 3D object detection performance under sparse annotation settings, achieving significant improvements on benchmark datasets.

Authors:Yanbiao Ma, Wei Dai, Wenke Huang, Jiayi Chen
Title: Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
Abstract:
Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence. Code published at: https://github.com/WeiDai-David/2025CVPR_GGEUR
中文摘要:本文提出一种几何引导的数据生成方法,通过全局嵌入形状在本地模拟全局分布,显著提升了联邦学习在标签偏移、域偏移等高度异构数据场景下的性能表现。
English Summary: This paper introduces a geometry-guided data generation method that uses global embedding shapes to simulate the global distribution locally, significantly improving federated learning performance under highly heterogeneous data conditions including label and domain skew.
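One way to picture the geometry-guided generation described above is to treat the global "geometric shape" of a class as a covariance over embeddings and to sample new local embeddings along its principal directions. The sketch below does exactly that under those heavy assumptions; it is illustrative only and not the GGEUR procedure.

```python
# Hedged sketch: augment local class embeddings using a global covariance as a
# stand-in for the class's global geometric shape.
import numpy as np

def augment_with_global_shape(local_emb, global_cov, n_new=50, scale=1.0, seed=0):
    """local_emb: (n, d) local embeddings of one class.
       global_cov: (d, d) covariance approximating the global embedding shape."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(global_cov)
    eigvals = np.clip(eigvals, 0.0, None)
    # Sample offsets from the global shape and attach them to random local anchors.
    z = rng.normal(size=(n_new, len(eigvals))) * np.sqrt(eigvals) * scale
    offsets = z @ eigvecs.T
    anchors = local_emb[rng.integers(0, len(local_emb), size=n_new)]
    return anchors + offsets

local = np.random.randn(10, 16)
cov = np.cov(np.random.randn(200, 16), rowvar=False)
print(augment_with_global_shape(local, cov).shape)   # (50, 16)
```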

Authors:Chengxuan Qian, Kai Han, Jingchao Wang, Zhenlong Yuan, Chongwen Lyu, Jun Chen, Zhe Liu
Title: DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning
Abstract:
Multimodal learning integrates complementary information from diverse modalities to enhance the decision-making process. However, the potential of multimodal collaboration remains under-exploited due to disparities in data quality and modality representation capabilities. To address this, we introduce DynCIM, a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample's difficulty according to prediction deviation, consistency, and stability, while a modality-level curriculum measures modality contributions from both global and local perspectives. Furthermore, a gating-based dynamic fusion mechanism is introduced to adaptively adjust modality contributions, minimizing redundancy and optimizing fusion effectiveness. Extensive experiments on six multimodal benchmarking datasets, spanning both bimodal and trimodal scenarios, demonstrate that DynCIM consistently outperforms state-of-the-art methods. Our approach effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks. Our code is available at https://github.com/Raymond-Qiancx/DynCIM.
Chinese: 本文提出DynCIM动态课程学习框架,通过动态评估样本难度和模态贡献度,并自适应融合模态,有效解决了多模态学习中的不平衡问题,在多种数据集上显著提升了性能。
English: This paper introduces DynCIM, a dynamic curriculum learning framework that addresses imbalances in multimodal learning by dynamically assessing sample difficulty and modality contributions, and adaptively fusing modalities to enhance performance across diverse datasets.
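The sample-level criteria named in the abstract (prediction deviation, consistency, and stability) can be combined into a toy difficulty score as sketched below. The weights and exact definitions are hypothetical, not the DynCIM formulation.

```python
# Toy sample-difficulty score from a history of per-epoch softmax outputs.
import numpy as np

def sample_difficulty(prob_history: np.ndarray, label: int, w=(0.5, 0.3, 0.2)) -> float:
    """prob_history: (epochs, classes) softmax outputs for one sample."""
    p_true = prob_history[:, label]
    deviation = 1.0 - p_true[-1]                 # gap between latest prediction and the label
    preds = prob_history.argmax(axis=1)
    consistency = 1.0 - np.mean(preds == label)  # fraction of epochs predicted wrongly
    stability = float(np.std(p_true))            # fluctuation of the true-class probability
    return w[0] * deviation + w[1] * consistency + w[2] * stability

hist = np.array([[0.2, 0.8], [0.4, 0.6], [0.55, 0.45]])
print(round(sample_difficulty(hist, label=0), 3))
```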

Authors:Mingxiang Cao, Weiying Xie, Xin Zhang, Jiaqing Zhang, Kai Jiang, Jie Lei, Yunsong Li
Title: M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification
Abstract:
Multi-modal fusion holds great promise for integrating information from different modalities. However, due to a lack of consideration for modal consistency, existing multi-modal fusion methods in the field of remote sensing still face challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the visual language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M$^3$amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion to address these challenges. Specifically, we introduce CLIP-driven modality-specific adapters in the fusion architecture to avoid the bias in understanding specific domains caused by direct inference, endowing the original CLIP encoder with modality-specific perception. This unified framework enables minimal training to achieve a comprehensive semantic understanding of different modalities, thereby guiding cross-modal feature fusion. To further enhance the consistent association between modality mappings, a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module Cross-SS2D are designed, which fully considers effective and efficient information interaction to achieve complete fusion. Extensive experiments have shown that M$^3$amba achieves an average performance improvement of at least 5.98\% compared with the state-of-the-art methods in multi-modal hyperspectral image classification tasks in the remote sensing field, while also demonstrating excellent training efficiency, with gains in both accuracy and efficiency. The code is released at https://github.com/kaka-Cao/M3amba.
Chinese: 提出的M$^3$amba模型利用CLIP驱动的适配器和多模态Mamba架构,在遥感多模态融合中提升了语义整合能力与计算效率,实现了显著的性能提升和训练效率优化。
English: The proposed M$^3$amba model leverages CLIP-driven adapters and a multi-modal Mamba architecture to enhance semantic integration and computational efficiency in remote sensing fusion, achieving significant performance gains and training efficiency improvements.

Authors:Yu Jin, Jingming Liu, Zhexu Luo, Yifei Peng, Ziang Qin, Wang-Zhou Dai, Yao-Xiang Ding, Kun Zhou
Title: Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning
Abstract:
Visual generative abductive learning studies jointly training a symbol-grounded neural visual generator and inducing logic rules from data, such that after learning, the visual generation process is guided by the induced logic rules. A major challenge for this task is to reduce the time cost of logic abduction during learning, an essential step when the logic symbol set is large and the logic rule to induce is complicated. To address this challenge, we propose a pre-training method for obtaining a meta-rule selection policy for the recently proposed visual generative learning approach AbdGen [Peng et al., 2023], aiming at significantly reducing the candidate meta-rule set and pruning the search space. The selection model is built based on the embedding representation of both symbol grounding of cases and meta-rules, which can be effectively integrated with both the neural model and the logic reasoning system. The pre-training process is done on pure symbol data, not involving symbol grounding learning of raw visual inputs, making the entire learning process low-cost. An additional interesting observation is that the selection policy can rectify symbol grounding errors unseen during pre-training, which results from the memorization ability of the attention mechanism and the relative stability of symbolic patterns. Experimental results show that our method is able to effectively address the meta-rule selection problem for visual abduction, boosting the efficiency of visual generative abductive learning. Code is available at https://github.com/future-item/metarule-select.
中文摘要:本研究提出一种预训练方法,通过元规则选择策略缩减逻辑规则搜索空间,在无需视觉输入的情况下低成本提升视觉生成溯因学习的效率,并能修正符号基础错误。
English Summary: This study introduces a pre-training method to enhance visual generative abductive learning by reducing logic rule search space through meta-rule selection, significantly improving efficiency without compromising performance.

Authors:Guofeng Zhang, Ruyi Zha, Hao He, Yixun Liang, Alan Yuille, Hongdong Li, Yuanhao Cai
Title: X-LRM: X-ray Large Reconstruction Model for Extremely Sparse-View Computed Tomography Recovery in One Second
Abstract:
Sparse-view 3D CT reconstruction aims to recover volumetric structures from a limited number of 2D X-ray projections. Existing feedforward methods are constrained by the limited capacity of CNN-based architectures and the scarcity of large-scale training datasets. In this paper, we propose an X-ray Large Reconstruction Model (X-LRM) for extremely sparse-view (<10 views) CT reconstruction. X-LRM consists of two key components: X-former and X-triplane. Our X-former can handle an arbitrary number of input views using an MLP-based image tokenizer and a Transformer-based encoder. The output tokens are then upsampled into our X-triplane representation, which models the 3D radiodensity as an implicit neural field. To support the training of X-LRM, we introduce Torso-16K, a large-scale dataset comprising over 16K volume-projection pairs of various torso organs. Extensive experiments demonstrate that X-LRM outperforms the state-of-the-art method by 1.5 dB and achieves 27x faster speed and better flexibility. Furthermore, the downstream evaluation of lung segmentation tasks also suggests the practical value of our approach. Our code, pre-trained models, and dataset will be released at https://github.com/caiyuanhao1998/X-LRM
中文: 本文提出X-LRM模型,通过基于Transformer的编码器和三平面表示实现稀疏视图3D CT重建,在性能、速度和灵活性上均超越现有方法,并配套发布了大规模数据集。
English: This paper introduces X-LRM, a novel model for sparse-view 3D CT reconstruction that utilizes a Transformer-based encoder and triplane representation to achieve superior performance, faster speed, and enhanced flexibility, supported by a newly created large-scale dataset.

Authors:Chen Liu, Tobias Ritschel
Title: Generative Video Bi-flow
Abstract:
We propose a novel generative video model to robustly learn temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective which combines two aspects: The first is to map from the past into future video frames directly. Previous work has mapped the noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame, instead of noise, is more prone to drifting errors. Hence, second, we additionally learn how to remove the accumulated errors as the joint objective by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a conditional diffusion baseline but with higher speed, i.e., fewer ODE solver steps.

Authors:Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu
Title: DOFA-CLIP: Multimodal Vision-Language Foundation Models for Earth Observation
Abstract:
Earth observation (EO) spans a broad spectrum of modalities, including optical, radar, multispectral, and hyperspectral data, each capturing distinct environmental signals. However, current vision-language models in EO, particularly CLIP-based variants, remain confined to individual modalities, limiting generalization and scalability across diverse tasks. We present DOFA-CLIP (Dynamic-One-For-All CLIP), a unified vision-language foundation model that dynamically adapts to EO modalities with flexible spectral configurations through a single Transformer backbone. Our approach introduces three key contributions: 1) the construction of GeoLangBind-2M, a large-scale EO image-text dataset covering six heterogeneous modalities with rich natural language descriptions; 2) a novel training strategy called VECT (Vision-models Enhanced Contrastive Text-image pretraining), which enhances the spatial awareness of CLIP features with multiple vision foundation models; and 3) a Modality-aware Knowledge Agglomeration (MaKA) module that refines feature distillation with modality-specific awareness. DOFA-CLIP achieves state-of-the-art zero-shot performance across a wide range of EO benchmarks, including unseen modalities and a diverse number of input spectral bands. Together, these contributions establish a scalable foundation for multimodal EO understanding and open new avenues for integrating heterogeneous EO data with large language models. Code and datasets will be publicly released.
中文: DOFA-CLIP 是一个统一的视觉语言基础模型,通过单一Transformer主干动态适应多种地球观测模态,结合大规模数据集和创新训练策略,在各类EO基准测试中实现了最先进的零样本性能。
English: DOFA-CLIP is a unified vision-language foundation model that dynamically adapts to multiple Earth observation modalities through a single Transformer backbone, achieving state-of-the-art zero-shot performance across diverse EO benchmarks by integrating large-scale datasets and novel training strategies.

Authors:Siyi Du, Xinzhe Luo, Declan P. O'Regan, Chen Qin
Title: STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
Abstract:
Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms the state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code is available at https://github.com/siyi-wind/STiL.
中文: 多模态图像-表格学习面临标注数据有限和任务无关特征欠佳的挑战,STiL框架通过解耦对比一致性模块和共识引导伪标签策略,有效利用标注与未标注数据中的共享及模态特定信息,从而提升性能。
English: Multimodal image-tabular learning faces challenges from limited labeled data and suboptimal task-agnostic features, which the proposed STiL framework addresses by integrating a disentangled contrastive consistency module and consensus-guided pseudo-labeling to effectively leverage both shared and modality-specific information across labeled and unlabeled data.

Authors:Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro
Title: Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Abstract:
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
中文:Zero-AVSR框架通过生成语言无关的罗马化文本表征,并利用大语言模型将其转换为目标语言,实现了零样本多语言视听语音识别,其有效性已通过大规模多语言实验验证。
English: The Zero-AVSR framework enables zero-shot multilingual audio-visual speech recognition by generating language-agnostic Roman text representations and leveraging large language models for conversion into target languages, validated through extensive multilingual experiments.

Authors:Shaobin Zhuang, Zhipeng Huang, Binxin Yang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Chong Sun, Zheng-Jun Zha, Chen Li, Yali Wang
Title: Get In Video: Add Anything You Want to the Video
Abstract:
Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage, yet current approaches fundamentally fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions. We formalize this overlooked yet critical editing paradigm as "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos. To address this task's dual challenges of severe training data scarcity and the difficulty of maintaining spatiotemporal coherence, we introduce three key contributions. First, we develop the GetIn-1M dataset, created through our automated Recognize-Track-Erase pipeline, which sequentially performs video captioning, salient instance identification, object detection, temporal tracking, and instance removal to generate high-quality video editing pairs with comprehensive annotations (reference image, tracking mask, instance prompt). Second, we present GetInVideo, a novel end-to-end framework that leverages a diffusion transformer architecture with 3D full attention to process reference images, condition videos, and masks simultaneously, maintaining temporal coherence, preserving visual identity, and ensuring natural scene interactions when integrating reference objects into videos. Finally, we establish GetInBench, the first comprehensive benchmark for the Get-In-Video Editing scenario, demonstrating our approach's superior performance through extensive evaluations. Our work enables accessible, high-quality incorporation of specific real-world subjects into videos, significantly advancing personalized video editing capabilities.

Authors:Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Kaixin Cai, Yiyang Yin, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Yu-Jie Yuan, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang
Title: Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
Abstract:
In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that different levels of reasoning abilities can be combined dynamically to tackle questions with different complexity. To this end, we propose a paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps. Different from existing methods that rely on structured templates or free-form paradigms, our method can not only generate cognitive CoT structures for various complex tasks but also mitigates the phenomenon of overthinking. To introduce structured reasoning capabilities into visual understanding models, we further design a novel AtomThink framework with four key modules, including (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single step utilization rate. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10\% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 times and boosts inference efficiency by 85.3\%. Our code is now publicly available at https://github.com/Quinn777/AtomThink.
中文: 本文提出自结构化思维链范式和AtomThink框架,通过最小语义原子步骤动态整合多模态大语言模型的推理能力,显著提升了数学推理的准确性和效率。
English: This paper introduces the Self-structured Chain of Thought (SCoT) paradigm and AtomThink framework to enhance multimodal mathematical reasoning in MLLMs by dynamically combining reasoning abilities through minimal semantic atomic steps, achieving significant accuracy gains and efficiency improvements.

Authors:Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Title: Explainable Synthetic Image Detection through Diffusion Timestep Ensembling
Abstract:
Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that are extractable for detection, such as Fourier power spectrum high-frequency discrepancies and inter-pixel variance distributions. Based on these observations, we propose a novel synthetic image detection method that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps, circumventing conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation generation and refinement module to identify and explain AI-generated flaws. Additionally, we construct the GenHard and GenExplain benchmarks to provide detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and challenging samples respectively, and demonstrates generalizability and robustness. Our code and datasets are available at https://github.com/Shadowlized/ESIDE.
中文: 本研究提出了一种新颖的合成图像检测方法,通过利用多时间步中间噪声图像特征,在常规样本和挑战性基准上分别实现了98.91%和95.89%的最优检测准确率,同时提供了可解释的AI生成缺陷识别功能。
English: This study introduces a novel synthetic image detection method that leverages intermediate noised image features across multiple timesteps, achieving state-of-the-art detection accuracy of 98.91% on regular samples and 95.89% on challenging benchmarks while providing explainable AI-generated flaw identification.
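One of the cues named in the abstract, high-frequency discrepancies in the Fourier power spectrum, can be turned into a simple per-noise-level feature as sketched below. The Gaussian noise stand-in for DDIM timesteps, the radial cutoff, and the feature itself are illustrative assumptions, not the paper's pipeline.

```python
# Toy sketch: a high-frequency Fourier-power feature computed at several noise
# levels, the kind of per-timestep cue an ensemble of detectors could use.
import numpy as np

def highfreq_ratio(img: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral power above a radial frequency cutoff."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    return float(power[radius > cutoff].sum() / power.sum())

def timestep_features(img, noise_levels=(0.0, 0.1, 0.3), seed=0):
    """Compute the feature for the clean image and two noised versions."""
    rng = np.random.default_rng(seed)
    return [highfreq_ratio(img + s * rng.normal(size=img.shape)) for s in noise_levels]

img = np.random.rand(64, 64)
print(timestep_features(img))
```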

Authors:YingLiang Ma, Sandra Howell, Aldo Rinaldi, Tarv Dhanjal, Kawal S. Rhode
Title: Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images
Abstract:
Objective: Interventional devices, catheters and insertable imaging devices such as transesophageal echo (TOE) probes are routinely used in minimally invasive cardiovascular procedures. Detecting their positions and orientations in X-ray fluoroscopic images is important for many clinical applications. Method: In this paper, a novel attention mechanism was designed to guide a convolutional neural network (CNN) model to the areas of wires in X-ray images, as nearly all interventional devices and catheters used in cardiovascular procedures contain wires. The attention mechanism includes multi-scale Gaussian derivative filters and a dot-product-based attention layer. By utilizing the proposed attention mechanism, a lightweight foundation model can be created to detect multiple objects simultaneously with higher precision and real-time speed. Results: The proposed model was trained and tested on a total of 12,438 X-ray images. An accuracy of 0.88 was achieved for detecting an echo probe and 0.87 for detecting an artificial valve at 58 FPS. The accuracy was measured by intersection-over-union (IoU). We also achieved a 99.8% success rate in detecting a 10-electrode catheter and a 97.8% success rate in detecting an ablation catheter. Conclusion: Our detection foundation model can simultaneously detect and identify both interventional devices and flexible catheters in real-time X-ray fluoroscopic images. Significance: The proposed model employs a novel attention mechanism to achieve high-performance object detection, making it suitable for various clinical applications and robotic-assisted surgeries. Codes are available at https://github.com/YingLiangMa/AttWire.
中文摘要:本研究提出一种新型注意力机制,通过多尺度高斯导数滤波器实现了对X射线影像中介入器械的实时高精度检测,在58帧/秒速度下对多种心血管器械的检测准确率均超过0.87。
English Summary: A novel attention mechanism using multi-scale Gaussian filters enables real-time, high-precision detection of interventional devices in X-ray images, achieving over 0.87 accuracy for various cardiovascular tools at 58 FPS.
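To give a feel for the multi-scale Gaussian derivative idea mentioned above, the sketch below builds a wire-like spatial prior from Gaussian first derivatives and uses it to gate feature maps element-wise. The scales, the max-over-scales pooling, and the gating are illustrative assumptions; this is not the AttWire architecture.

```python
# Hedged sketch: a wire-oriented spatial prior from multi-scale Gaussian
# derivative filters, used to weight stand-in CNN feature maps.
import numpy as np
from scipy.ndimage import gaussian_filter

def wire_attention(img: np.ndarray, sigmas=(1.0, 2.0, 4.0)) -> np.ndarray:
    """Return a [0, 1] map highlighting thin, wire-like structures."""
    responses = []
    for s in sigmas:
        gx = gaussian_filter(img, sigma=s, order=(0, 1))   # first derivative in x
        gy = gaussian_filter(img, sigma=s, order=(1, 0))   # first derivative in y
        responses.append(np.hypot(gx, gy))                 # gradient magnitude
    att = np.max(responses, axis=0)                        # strongest response over scales
    return (att - att.min()) / (np.ptp(att) + 1e-8)

img = np.random.rand(128, 128)
features = np.random.rand(8, 128, 128)                     # stand-in CNN features
attended = features * wire_attention(img)[None]            # broadcast gating
print(attended.shape)
```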

Authors:Xiang Gao, Shuai Yang, Jiaying Liu
Title: PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model
Abstract:
Optical illusion hidden picture is an interesting visual perceptual phenomenon where an image is cleverly integrated into another picture in a way that is not immediately obvious to the viewer. Established on the off-the-shelf text-to-image (T2I) diffusion model, we propose a novel training-free text-guided image-to-image (I2I) translation framework dubbed as \textbf{P}hase-\textbf{T}ransferred \textbf{Diffusion} Model (PTDiffusion) for hidden art syntheses. PTDiffusion harmoniously embeds an input reference image into arbitrary scenes described by the text prompts, producing illusion images exhibiting hidden visual cues of the reference image. At the heart of our method is a plug-and-play phase transfer mechanism that dynamically and progressively transplants the phase spectra of diffusion features from the denoising process that reconstructs the reference image into the process that samples the generated illusion image, realizing deep fusion of the reference structural information and the textual semantic information in the diffusion model latent space. Furthermore, we propose asynchronous phase transfer to enable flexible control over the degree of hidden content discernability. Our method bypasses any model training and fine-tuning process, all while substantially outperforming related text-guided I2I methods in image generation quality, text fidelity, visual discernibility, and contextual naturalness for illusion picture synthesis, as demonstrated by extensive qualitative and quantitative experiments. Our project is publicly available at \href{https://xianggao1102.github.io/PTDiffusion_webpage/}{this web page}.
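The underlying idea of phase transfer, keeping one signal's magnitude spectrum while adopting another's phase, can be illustrated with a tiny NumPy sketch on pixel arrays. Note that PTDiffusion operates on diffusion features in latent space during denoising, not on raw pixels; the function name and blending parameter here are assumptions for illustration.

```python
# Toy sketch of Fourier phase transfer between two grayscale arrays.
import numpy as np

def phase_transfer(reference: np.ndarray, target: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Blend the phase spectrum of `reference` into `target`, keeping the
    target's magnitude spectrum."""
    ref_fft = np.fft.fft2(reference)
    tgt_fft = np.fft.fft2(target)
    # Interpolate the phase toward the reference's phase; keep target magnitude.
    phase = (1 - alpha) * np.angle(tgt_fft) + alpha * np.angle(ref_fft)
    mixed = np.abs(tgt_fft) * np.exp(1j * phase)
    return np.real(np.fft.ifft2(mixed))

rng = np.random.default_rng(0)
ref = rng.random((64, 64))   # stand-in for the reference image
tgt = rng.random((64, 64))   # stand-in for the generated image
print(phase_transfer(ref, tgt, alpha=0.7).shape)
```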

Authors:Shawn Li, Jiashu Qu, Yuxiao Zhou, Yuehan Qin, Tiankai Yang, Yue Zhao
Title: Treble Counterfactual VLMs: A Causal Approach to Hallucination
Abstract:
Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. However, they often generate hallucinated outputs inconsistent with the visual context or prompt, limiting reliability in critical applications like autonomous driving and medical imaging. Existing studies link hallucination to statistical biases, language priors, and biased feature learning but lack a structured causal understanding. In this work, we introduce a causal perspective to analyze and mitigate hallucination in VLMs. We hypothesize that hallucination arises from unintended direct influences of either the vision or text modality, bypassing proper multi-modal fusion. To address this, we construct a causal graph for VLMs and employ counterfactual analysis to estimate the Natural Direct Effect (NDE) of vision, text, and their cross-modal interaction on the output. We systematically identify and mitigate these unintended direct effects to ensure that responses are primarily driven by genuine multi-modal fusion. Our approach consists of three steps: (1) designing structural causal graphs to distinguish correct fusion pathways from spurious modality shortcuts, (2) estimating modality-specific and cross-modal NDE using perturbed image representations, hallucinated text embeddings, and degraded visual inputs, and (3) implementing a test-time intervention module to dynamically adjust the model's dependence on each modality. Experimental results demonstrate that our method significantly reduces hallucination while preserving task performance, providing a robust and interpretable framework for improving VLM reliability. To enhance accessibility and reproducibility, our code is publicly available at https://github.com/TREE985/Treble-Counterfactual-VLMs.
中文: 本研究提出一种因果框架,通过反事实估计和动态干预分析并减轻视觉或文本模态的意外直接影响,从而显著减少视觉语言模型的幻觉问题,在保持性能的同时提高其可靠性。
English: This study introduces a causal framework to address hallucination in Vision-Language Models by analyzing and mitigating unintended direct influences of vision or text modalities through counterfactual estimation and dynamic intervention, significantly improving reliability while maintaining performance.
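At its simplest, estimating a modality's natural direct effect amounts to contrasting model outputs under a factual input and a counterfactual (here, zeroed) input while holding the other modality fixed. The sketch below uses a toy stand-in model and an additive estimator; it is a schematic illustration only, not the paper's estimator or intervention module.

```python
# Schematic sketch: NDE of the vision modality via factual vs. counterfactual inputs.
import torch

def vision_nde(model, image, text, null_image):
    """NDE of vision = logits(image, text) - logits(null_image, text)."""
    with torch.no_grad():
        factual = model(image, text)
        counterfactual = model(null_image, text)
    return factual - counterfactual

def toy_model(image, text):
    # Stand-in VLM head: logits depend additively on both modalities.
    return image.mean() + 0.5 * text.mean() + torch.zeros(3)

img = torch.rand(3, 32, 32)
txt = torch.rand(16)
print(vision_nde(toy_model, img, txt, torch.zeros_like(img)))
```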

Authors:Zidu Wang, Jiankuo Zhao, Miao Xu, Xiangyu Zhu, Zhen Lei
Title: SRM-Hair: Single Image Head Mesh Reconstruction via 3D Morphable Hair
Abstract:
3D Morphable Models (3DMMs) have played a pivotal role as a fundamental representation or initialization for 3D avatar animation and reconstruction. However, extending 3DMMs to hair remains challenging due to the difficulty of enforcing vertex-level consistent semantic meaning across hair shapes. This paper introduces a novel method, Semantic-consistent Ray Modeling of Hair (SRM-Hair), for making 3D hair morphable and controlled by coefficients. The key contribution lies in semantic-consistent ray modeling, which extracts ordered hair surface vertices and exhibits notable properties such as additivity for hairstyle fusion, adaptability, flipping, and thickness modification. We collect a dataset of over 250 high-fidelity real hair scans paired with 3D face data to serve as a prior for the 3D morphable hair. Based on this, SRM-Hair can reconstruct a hair mesh combined with a 3D head from a single image. Note that SRM-Hair produces an independent hair mesh, facilitating applications in virtual avatar creation, realistic animation, and high-fidelity hair rendering. Both quantitative and qualitative experiments demonstrate that SRM-Hair achieves state-of-the-art performance in 3D mesh reconstruction. Our project is available at https://github.com/wang-zidu/SRM-Hair
中文摘要:本文提出SRM-Hair方法,通过语义一致的光线建模实现三维可变形头发建模,在保持独立网格结构的同时,能够从单张图像实现最先进的头发重建,适用于多种虚拟应用场景。
English Summary: This paper introduces SRM-Hair, a novel method that enables 3D morphable hair modeling through semantic-consistent ray modeling, achieving state-of-the-art hair reconstruction from single images while maintaining independent mesh structures for versatile applications.

Authors:Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang
Title: X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
Abstract:
Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, currently there is no straightforward and efficient framework to transfer the multimodal comprehension abilities of MLLMs to T2I models to enable them to understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained on merely a 100K English corpus in 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows a performance degradation of less than 1\% while gaining various multimodal understanding abilities, including multilingual to image, image to image, image-text to image, video to image, audio to image, and creative fusion to enhance imagery. Furthermore, it is applicable to LoRA training for image-text to image generation, filling a void in the industry in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, multifunctionality, and transferability of our X2I. The open-source code and checkpoints for X2I can be found at the following link: https://github.com/OPPO-Mente-Lab/X2I.
中文: X2I框架将多模态大语言模型的理解能力高效迁移至文本生成图像模型,使其能够处理多种输入并实现创意图像生成,同时保持高性能和多功能性。
English: The X2I framework efficiently transfers multimodal comprehension abilities from MLLMs to T2I models, enabling them to process diverse inputs like text, images, and audio with minimal performance loss and enhanced creative capabilities.

Authors:Xiangxiang Chu, Renda Li, Yong Wang
Title: USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Abstract:
Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available at https://github.com/AMAP-ML/USP.
中文摘要:提出的统一自监督预训练(USP)框架通过在VAE潜在空间中进行掩码潜在建模来初始化扩散模型,不仅保持了可比较的理解任务性能,还显著提升了扩散模型的收敛速度与生成质量。
English Summary: The proposed Unified Self-supervised Pretraining (USP) framework overcomes transfer challenges by initializing diffusion models through masked latent modeling in VAE space, achieving competitive understanding performance while significantly boosting diffusion models' convergence speed and generation quality.
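
A minimal sketch of masked latent modeling in a VAE latent space, assuming a frozen `vae` exposing an `encode` method and a `predictor` network being pretrained; the mask granularity and the loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_latent_modeling_loss(vae, predictor, images, mask_ratio=0.75):
    """Sketch of masked latent modeling in a VAE latent space.

    `vae.encode(images)` is assumed to return latents of shape (B, C, H, W) and
    `predictor` to map a masked latent grid back to a full one; both stand in
    for the frozen VAE and the network being pretrained.
    """
    with torch.no_grad():
        latents = vae.encode(images)                         # (B, C, H, W)

    b, _, h, w = latents.shape
    # Random spatial mask over latent positions (1 = keep, 0 = masked).
    keep = (torch.rand(b, 1, h, w, device=latents.device) > mask_ratio).float()

    pred = predictor(latents * keep)                         # reconstruct the full latent
    # Supervise only the masked positions, as in masked image modeling.
    return F.mse_loss(pred * (1 - keep), latents * (1 - keep))
```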

Authors:Hongjia Zhai, Boming Zhao, Hai Li, Xiaokun Pan, Yijia He, Zhaopeng Cui, Hujun Bao, Guofeng Zhang
Title: NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features
Abstract:
Recently, neural radiance fields (NeRF) have gained significant attention in the field of visual localization. However, existing NeRF-based approaches either lack geometric constraints or require extensive storage for feature matching, limiting their practical applications. To address these challenges, we propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field, avoiding the need to explicitly store point-wise features. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields, which enhance the quality and reliability of 2D-3D correspondences. Besides, we propose descriptor similarity distribution alignment to minimize the domain gap between 2D and 3D feature spaces during matching. Finally, we construct the matching graph using both complementary descriptors and contextual features to establish accurate 2D-3D correspondences for 6-DoF pose estimation. Compared with the recent NeRF-based approaches, our method achieves a 3$\times$ faster training speed and a 45$\times$ reduction in model storage. Extensive experiments on two widely used datasets demonstrate that our approach outperforms or is highly competitive with other state-of-the-art NeRF-based visual localization methods. Project page: https://zju3dv.github.io/neuraloc
Chinese: 本文提出了一种基于神经隐式地图与互补特征的高效视觉定位方法,通过结合几何约束和语义上下文特征,实现了3倍训练加速和45倍存储缩减,性能优于现有基于NeRF的方法。
English: This paper introduces an efficient visual localization method using neural implicit maps with complementary features, which accelerates training by 3 times, reduces storage by 45 times, and outperforms existing NeRF-based approaches by integrating geometric constraints and semantic contextual features.

Authors:Xianjie Liu, Keren Fu, Qijun Zhao
Title: Patch-Depth Fusion: Dichotomous Image Segmentation via Fine-Grained Patch Strategy and Depth Integrity-Prior
Abstract:
Dichotomous Image Segmentation (DIS) is a high-precision object segmentation task for high-resolution natural images. The current mainstream methods focus on the optimization of local details but overlook the fundamental challenge of modeling the integrity of objects. We have found that the depth integrity-prior implicit in the pseudo-depth maps generated by Depth Anything Model v2 and the local detail features of image patches can jointly address the above dilemmas. Based on these findings, we have designed a novel Patch-Depth Fusion Network (PDFNet) for high-precision dichotomous image segmentation. The core of PDFNet consists of three aspects. Firstly, object perception is enhanced through multi-modal input fusion; by utilizing a fine-grained patch strategy, coupled with patch selection and enhancement, the sensitivity to details is improved. Secondly, by leveraging the depth integrity-prior distributed in the depth maps, we propose an integrity-prior loss to enhance the uniformity of the segmentation results in the depth maps. Finally, we utilize the features of the shared encoder and, through a simple depth refinement decoder, improve the ability of the shared encoder to capture subtle depth-related information in the images. Experiments on the DIS-5K dataset show that PDFNet significantly outperforms state-of-the-art non-diffusion methods. Due to the incorporation of the depth integrity-prior, PDFNet matches or even surpasses the performance of the latest diffusion-based methods while using less than 11% of their parameters. The source code is available at https://github.com/Tennine2077/PDFNet.
中文: PDFNet提出了一种新颖的块-深度融合网络,通过结合局部图像块细节与深度完整性先验来提升分割精度,在参数量远少于扩散模型的情况下实现了顶尖性能。
English: PDFNet introduces a novel Patch-Depth Fusion Network that enhances segmentation precision by integrating local patch details with depth integrity-priors, achieving state-of-the-art performance with significantly fewer parameters than diffusion-based methods.

Authors:Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng
Title: GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
Abstract:
While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters ($e.g.$, QRS/PR Intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN $7.4\% \uparrow$), explainability ($22.7\% \uparrow$), and grounding ($24.8\% \uparrow$), making it more suitable for real-world clinical applications. GitHub repository: https://github.com/lanxiang1017/GEM.git
中文: GEM作为首个融合心电时间序列、12导联图像与文本的多模态大模型,通过跨模态特征对齐和知识引导实现了可解释的心电分析,显著提升了诊断性能和临床适用性。
English: GEM is a novel multimodal large language model that integrates ECG time series, images, and text to enhance diagnostic accuracy, explainability, and clinical alignment through cross-modal feature extraction and evidence-driven reasoning.

Authors:Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin, Hui Su, Jinlan Fu, Xiaoyu Shen
Title: Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Abstract:
Multimodal Large Language Models (MLLMs) have made significant advancements in recent years, with visual features playing an increasingly critical role in enhancing model performance. However, the integration of multi-layer visual features in MLLMs remains underexplored, particularly with regard to optimal layer selection and fusion strategies. Existing methods often rely on arbitrary design choices, leading to suboptimal outcomes. In this paper, we systematically investigate two core aspects of multi-layer visual feature fusion: (1) selecting the most effective visual layers and (2) identifying the best fusion approach with the language model. Our experiments reveal that while combining visual features from multiple stages improves generalization, incorporating additional features from the same stage typically leads to diminished performance. Furthermore, we find that direct fusion of multi-layer visual features at the input stage consistently yields superior and more stable performance across various configurations. We make all our code publicly available: https://github.com/EIT-NLP/Layer_Select_Fuse_for_MLLM.
Chinese: 本研究系统探索了多模态大语言模型中多层视觉特征的最佳层级选择与融合策略,发现跨阶段特征组合能提升泛化能力,而输入层直接融合则能实现更优且稳定的性能表现。
English: This study systematically explores optimal layer selection and fusion strategies for multi-layer visual features in Multimodal Large Language Models, finding that combining features from different stages enhances generalization while direct input-stage fusion delivers superior performance.
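
A short sketch of the "direct fusion at the input stage" finding, assuming access to per-layer hidden states from the vision encoder; the chosen layer indices, dimensions, and channel-wise concatenation are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiLayerFusionProjector(nn.Module):
    """Illustrative input-stage fusion of visual features from several encoder layers.

    Assumes per-layer hidden states of shape (B, N, D) from the vision encoder;
    features from different stages are concatenated channel-wise, projected once,
    and used as visual tokens for the language model. Layer indices and sizes
    are placeholders.
    """
    def __init__(self, vis_dim=1024, llm_dim=4096, layers=(8, 16, 24)):
        super().__init__()
        self.layers = layers
        self.proj = nn.Linear(vis_dim * len(layers), llm_dim)

    def forward(self, hidden_states):
        # hidden_states: sequence indexed by layer, each entry of shape (B, N, D).
        fused = torch.cat([hidden_states[i] for i in self.layers], dim=-1)
        return self.proj(fused)                              # (B, N, llm_dim)

# Usage with dummy hidden states from a 25-layer encoder.
hs = [torch.randn(2, 196, 1024) for _ in range(25)]
visual_tokens = MultiLayerFusionProjector()(hs)
```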

Authors:Yuheng Li, Yuxiang Lai, Maria Thor, Deborah Marshall, Zachary Buchwald, David S. Yu, Xiaofeng Yang
Title: Towards Universal Text-driven CT Image Segmentation
Abstract:
Computed tomography (CT) is extensively used for accurate visualization and segmentation of organs and lesions. While deep learning models such as convolutional neural networks (CNNs) and vision transformers (ViTs) have significantly improved CT image analysis, their performance often declines when applied to diverse, real-world clinical data. Although foundation models offer a broader and more adaptable solution, their potential is limited due to the challenge of obtaining large-scale, voxel-level annotations for medical images. In response to these challenges, prompting-based models using visual or text prompts have emerged. Visual-prompting methods, such as the Segment Anything Model (SAM), still require significant manual input and can introduce ambiguity when applied to clinical scenarios. Instead, foundation models that use text prompts offer a more versatile and clinically relevant approach. Notably, current text-prompt models, such as the CLIP-Driven Universal Model, are limited to text prompts already encountered during training and struggle to process the complex and diverse scenarios of real-world clinical applications. Instead of fine-tuning models trained from natural imaging, we propose OpenVocabCT, a vision-language model pretrained on large-scale 3D CT images for universal text-driven segmentation. Using the large-scale CT-RATE dataset, we decompose the diagnostic reports into fine-grained, organ-level descriptions using large language models for multi-granular contrastive learning. We evaluate our OpenVocabCT on downstream segmentation tasks across nine public datasets for organ and tumor segmentation, demonstrating the superior performance of our model compared to existing methods. All code, datasets, and models will be publicly released at https://github.com/ricklisz/OpenVocabCT.
Chinese: OpenVocabCT是一种基于大规模3D CT图像预训练的视觉语言模型,通过利用细粒度诊断报告进行多粒度对比学习,在文本驱动的通用分割任务中表现出优于现有方法的性能。
English: OpenVocabCT is a vision-language model pretrained on large-scale 3D CT images that enables universal text-driven segmentation, outperforming existing methods by leveraging fine-grained diagnostic reports for multi-granular contrastive learning.

Authors:Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang
Title: Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity
Abstract:
Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present \texttt{MD-3k}, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
中文摘要:本研究通过提出免训练的拉普拉斯视觉提示技术,从预训练模型中提取多层深度信息以解决透明场景中的深度模糊问题,并基于新基准验证了其在三维空间理解中的有效性。
English Summary: This research introduces a training-free Laplacian Visual Prompting technique to address depth ambiguity in transparent scenes by extracting multi-layer depth from pre-trained models, validated through a new benchmark for robust 3D spatial understanding.
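
A small sketch of the training-free prompting idea: the RGB input is Laplacian-filtered and fed to a frozen depth model to elicit a second depth hypothesis alongside the standard RGB-based estimate. The kernel and renormalization below are illustrative choices.

```python
import torch
import torch.nn.functional as F

def laplacian_visual_prompt(rgb):
    """Build a Laplacian-filtered copy of an RGB batch as a training-free visual prompt.

    `rgb` is a (B, 3, H, W) tensor in [0, 1]. Feeding this prompt to a frozen
    monocular depth model, alongside the original RGB, is meant to elicit a
    second depth hypothesis; the kernel and renormalization are assumptions.
    """
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3).repeat(3, 1, 1, 1)
    lap = F.conv2d(rgb, kernel.to(rgb), padding=1, groups=3)
    # Rescale to the input range so the depth model still sees a valid image.
    return (lap - lap.amin()) / (lap.amax() - lap.amin() + 1e-8)

# Two hypotheses from one frozen model (depth_model is a placeholder):
#   depth_rgb = depth_model(rgb)
#   depth_lvp = depth_model(laplacian_visual_prompt(rgb))
```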

Authors:Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou
Title: MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice
Abstract:
We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme, integrating audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions to balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is enhanced via our innovative unified step and cfg distillation techniques, achieving a 20x inference speed boost over the base model: generating a 10-second 540x540p video in 10 seconds or 720x720p in 30 seconds on 8 H100 GPUs, without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.

Authors:Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, Wei Yin
Title: GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
Abstract:
We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory. Recent methods have increasingly focused on modeling multimodal trajectory distributions. However, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on the Navsim benchmark, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. GoalFlow achieves a PDMS of 90.3, significantly surpassing other methods. Compared with other diffusion-policy-based methods, our approach requires only a single denoising step to obtain excellent performance. The code is available at https://github.com/YvanYin/GoalFlow.
Chinese: GoalFlow是一种端到端的自动驾驶方法,通过引入目标点约束和新颖的评分机制生成高质量多模态轨迹,仅需单步去噪即可实现最优性能。
English: GoalFlow is an end-to-end autonomous driving method that generates high-quality multimodal trajectories by introducing goal point constraints and a novel scoring mechanism, achieving state-of-the-art performance with only a single denoising step.
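
A compact sketch of a goal-conditioned flow-matching training objective of the kind described above, assuming a `velocity_net` that takes the noisy trajectory, the timestep, a goal point, and a scene embedding; the linear interpolation path and the interfaces are standard flow-matching choices, not necessarily GoalFlow's exact design.

```python
import torch
import torch.nn.functional as F

def goal_conditioned_flow_matching_loss(velocity_net, traj, goal, scene_feat):
    """Sketch of a flow-matching objective for trajectories conditioned on a goal point.

    `traj` is a (B, T, 2) ground-truth trajectory, `goal` is (B, 2), `scene_feat`
    is a (B, D) scene embedding, and `velocity_net(x_t, t, goal, scene_feat)`
    predicts a velocity field.
    """
    noise = torch.randn_like(traj)                   # x_0 ~ N(0, I)
    t = torch.rand(traj.size(0), 1, 1, device=traj.device)
    x_t = (1 - t) * noise + t * traj                 # point on the linear path
    target_velocity = traj - noise                   # d x_t / d t along this path
    pred_velocity = velocity_net(x_t, t.view(-1), goal, scene_feat)
    return F.mse_loss(pred_velocity, target_velocity)
```

At inference, integrating the learned velocity field from noise (even with a single Euler step) produces candidate trajectories that a scoring mechanism can then rank.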

Authors:Zhenxuan Zhang, Hongjie Wu, Jiahao Huang, Baihong Xie, Zhifan Gao, Junxian Du, Pete Lally, Guang Yang
Title: Task-oriented Uncertainty Collaborative Learning for Label-Efficient Brain Tumor Segmentation
Abstract:
Multi-contrast magnetic resonance imaging (MRI) plays a vital role in brain tumor segmentation and diagnosis by leveraging complementary information from different contrasts. Each contrast highlights specific tumor characteristics, enabling a comprehensive understanding of tumor morphology, edema, and pathological heterogeneity. However, existing methods still face the challenges of multi-level specificity perception across different contrasts, especially with limited annotations. These challenges include data heterogeneity, granularity differences, and interference from redundant information. To address these limitations, we propose a Task-oriented Uncertainty Collaborative Learning (TUCL) framework for multi-contrast MRI segmentation. TUCL introduces a task-oriented prompt attention (TPA) module with intra-prompt and cross-prompt attention mechanisms to dynamically model feature interactions across contrasts and tasks. Additionally, a cyclic process is designed to map the predictions back to the prompt to ensure that the prompts are effectively utilized. In the decoding stage, the TUCL framework proposes a dual-path uncertainty refinement (DUR) strategy which ensures robust segmentation by refining predictions iteratively. Extensive experimental results on limited labeled data demonstrate that TUCL significantly improves segmentation accuracy (88.2\% in Dice and 10.853 mm in HD95). It shows that TUCL has the potential to extract multi-contrast information and reduce the reliance on extensive annotations. The code is available at: https://github.com/Zhenxuan-Zhang/TUCL_BrainSeg.
中文: 提出的任务导向不确定性协同学习(TUCL)框架通过动态建模特征交互和迭代优化预测,有效提升了多对比度MRI脑肿瘤分割的精度,并降低了对大量标注数据的依赖。
English: The proposed Task-oriented Uncertainty Collaborative Learning (TUCL) framework enhances multi-contrast MRI brain tumor segmentation by dynamically modeling feature interactions and refining predictions, achieving higher accuracy with limited annotations.

Authors:Zhongyi Shui, Ruizhe Guo, Honglin Li, Yuxuan Sun, Yunlong Zhang, Chenglu Zhu, Jiatong Cai, Pingyi Chen, Yanzhou Su, Lin Yang
Title: Towards Effective and Efficient Context-aware Nucleus Detection in Histopathology Whole Slide Images
Abstract:
Nucleus detection in histopathology whole slide images (WSIs) is crucial for a broad spectrum of clinical applications. Current approaches for nucleus detection in gigapixel WSIs utilize a sliding window methodology, which overlooks broader contextual information (e.g., tissue structure) and easily leads to inaccurate predictions. To address this problem, recent studies additionally crop a large Field-of-View (FoV) region around each sliding window to extract contextual features. However, such methods substantially increase the inference latency. In this paper, we propose an effective and efficient context-aware nucleus detection algorithm. Specifically, instead of leveraging large FoV regions, we aggregate contextual clues from off-the-shelf features of historically visited sliding windows. This design greatly reduces computational overhead. Moreover, compared to large FoV regions at a low magnification, the sliding window patches have higher magnification and provide finer-grained tissue details, thereby enhancing the detection accuracy. To further improve the efficiency, we propose a grid pooling technique to compress dense feature maps of each patch into a few contextual tokens. Finally, we craft OCELOT-seg, the first benchmark dedicated to context-aware nucleus instance segmentation. Code, dataset, and model checkpoints will be available at https://github.com/windygoo/PathContext.
Chinese: 本文提出了一种高效的上下文感知细胞核检测算法,通过聚合历史访问滑动窗口的现成特征来整合上下文线索,不仅降低了计算开销,还利用高倍率切片和网格池化技术提升了检测精度。
English: The paper introduces an efficient context-aware nucleus detection algorithm that aggregates contextual clues from previously visited sliding windows, reducing computational overhead and improving accuracy by utilizing higher magnification patches and a grid pooling technique.
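
A minimal sketch of the grid pooling step, which compresses a window's dense feature map into a handful of contextual tokens that can be cached for later windows; the grid size and pooling operator are assumptions.

```python
import torch
import torch.nn.functional as F

def grid_pool_tokens(feature_map, grid=4):
    """Compress a window's dense feature map into a few contextual tokens.

    `feature_map` is (B, C, H, W) from an already-processed sliding window;
    adaptive pooling to a `grid x grid` layout yields grid**2 tokens that can be
    cached and attended to when detecting nuclei in later windows.
    """
    pooled = F.adaptive_avg_pool2d(feature_map, (grid, grid))   # (B, C, grid, grid)
    return pooled.flatten(2).transpose(1, 2)                    # (B, grid*grid, C)

# Usage: 16 contextual tokens summarizing a 64x64 feature map.
tokens = grid_pool_tokens(torch.randn(1, 256, 64, 64))
```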

Authors:Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Ioannis Patras
Title: AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data
Abstract:
Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. We investigate a fine-tuning paradigm starting from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned using unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm, 1) the low quality of synthetic data, which can still happen even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG) to generate data using a text-to-image diffusion model (T2I) with prompts generated by a context-aware LLM, ensuring both data diversity and control of bias in synthetic data. To resolve domain and bias shifts, we introduce a novel selective fine-tuning scheme in which only model parameters more sensitive to bias and less sensitive to domain shift are updated. Experiments on CelebA and UTKFace datasets show that our AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.
中文: 本文提出AIM-Fair方法,通过使用上下文感知提示生成的高质量合成数据对偏置模型进行选择性参数微调,在保持模型效用的同时显著提升了算法公平性。
English: This paper introduces AIM-Fair, a method that enhances algorithmic fairness by fine-tuning biased models with high-quality synthetic data generated via context-aware prompts and a selective parameter update scheme, effectively improving fairness without compromising utility.

Authors:Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu
Title: VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
Abstract:
Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.
中文摘要:VideoPainter提出了一种双流范式,通过高效的上下文编码器将背景信息融入预训练扩散模型,显著提升了视频修复的内容一致性和处理任意长度视频的能力,并借助大规模数据集在多项指标中展现出卓越性能。
English Summary: VideoPainter introduces a dual-stream paradigm with an efficient context encoder to enhance video inpainting by integrating background context into pre-trained diffusion models, enabling superior performance in generating consistent content and supporting any-length video processing through novel techniques and a large-scale dataset.

Authors:Mufan Liu, Qi Yang, Miaoran Zhao, He Huang, Le Yang, Zhu Li, Yiling Xu
Title: D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS
Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting. However, their implicit formulation limits both interpretability and efficacy, undermining their practicality as a comprehensive solution. We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV, which aims to achieve three key objectives: 1) improved efficiency while delivering superior quality; 2) enhanced scalability and interpretability; and 3) increased friendliness for downstream tasks. Specifically, we initially divide the video sequence into fixed-length Groups of Pictures (GoP) to allow parallel training and linear scalability with video length. For each GoP, D2GV represents video frames by applying differentiable rasterization to 2D Gaussians, which are deformed from a canonical space into their corresponding timestamps. Notably, leveraging efficient CUDA-based rasterization, D2GV converges fast and decodes at speeds exceeding 400 FPS, while delivering quality that matches or surpasses state-of-the-art INRs. Moreover, we incorporate a learnable pruning and quantization strategy to streamline D2GV into a more compact representation. We demonstrate D2GV's versatility in tasks including video interpolation, inpainting and denoising, underscoring its potential as a promising solution for video representation. Code is available at: https://github.com/Evan-sudo/D2GV.
中文: 提出的D2GV视频表示方法通过可变形二维高斯溅射克服了隐式神经表示的局限性,在多种视频任务中实现了超过400 FPS的高效处理、优越的可扩展性,同时保持了卓越的质量表现。
English: The proposed D2GV video representation method overcomes the limitations of Implicit Neural Representations by using deformable 2D Gaussian splatting, achieving superior efficiency, scalability, and performance exceeding 400 FPS while maintaining high quality across various video tasks.
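
A toy sketch of the canonical-plus-deformation structure for 2D Gaussians within one Group of Pictures; differentiable rasterization, pruning, and quantization are omitted, and the MLP deformation head is an illustrative stand-in for the paper's design.

```python
import torch
import torch.nn as nn

class DeformableGaussians2D(nn.Module):
    """Toy canonical-plus-deformation model for 2D Gaussians within one GoP.

    Each Gaussian has a canonical center, scale, and color; a small MLP predicts
    per-timestamp offsets for the center and log-scale. The outputs would feed a
    2D rasterizer, which is omitted here.
    """
    def __init__(self, num_gaussians=4096, hidden=64):
        super().__init__()
        self.mu = nn.Parameter(torch.rand(num_gaussians, 2))        # canonical centers
        self.log_scale = nn.Parameter(torch.zeros(num_gaussians, 2))
        self.color = nn.Parameter(torch.rand(num_gaussians, 3))
        self.deform = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, num_gaussians * 4),                   # d_mu (2) + d_log_scale (2)
        )

    def forward(self, t):
        # t: scalar timestamp in [0, 1] indexing a frame inside the GoP.
        d = self.deform(torch.tensor([[float(t)]])).view(-1, 4)
        mu = self.mu + d[:, :2]
        scale = torch.exp(self.log_scale + d[:, 2:])
        return mu, scale, self.color
```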

Authors:Libo Zhu, Haotong Qin, Kaicheng Yang, Wenbo Li, Yong Guo, Yulun Zhang, Susanto Rahardja, Xiaokang Yang
Title: QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution
Abstract:
One-step diffusion-based image super-resolution (OSDSR) models are showing increasingly superior performance nowadays. However, although their denoising steps are reduced to one and they can be quantized to 8-bit to reduce the costs further, there is still significant potential for OSDSR to be quantized to lower bits. To explore more possibilities of quantized OSDSR, we propose an efficient method, Quantization via reverse-module and timestep-retraining for OSDSR, named QArtSR. Firstly, we investigate the influence of the timestep value on the performance of quantized models. Then, we propose Timestep Retraining Quantization (TRQ) and Reversed Per-module Quantization (RPQ) strategies to calibrate the quantized model. Meanwhile, we adopt module and image losses to update all quantized modules. We only update the parameters in the quantization finetuning components, excluding the original weights. To ensure that all modules are fully finetuned, we add extended end-to-end training after the per-module stage. Our 4-bit and 2-bit quantization experiments indicate that QArtSR outperforms recent leading methods. The performance of 4-bit QArtSR is close to the full-precision one. Our code will be released at https://github.com/libozhu03/QArtSR.
中文: 提出的QArtSR方法通过时间步重训练和反向逐模块量化优化了一步扩散图像超分辨率模型,在4位和2位量化中实现了接近全精度的卓越性能。
English: The proposed QArtSR method enhances one-step diffusion-based image super-resolution (OSDSR) models through timestep retraining and reversed per-module quantization, achieving superior 4-bit and 2-bit performance that rivals full-precision models.

Authors:Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Hossein Rahmani, Ajmal Mian, Lin Wang
Title: Novel Object 6D Pose Estimation with a Single Reference View
Abstract:
Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in a common coordinate system based on state space models (SSMs). Specifically, iterative object-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.
中文: 提出的SinRef-6D方法通过基于状态空间模型的迭代点对点对齐,仅需单张参考视图即可实现新物体的6D姿态估计,在无需重新训练或CAD模型的情况下达到了与基于CAD方法相当的性能。
English: The proposed SinRef-6D method enables novel object 6D pose estimation using only a single reference view through iterative point-wise alignment with state space models, achieving performance comparable to CAD-based methods without requiring retraining or CAD models.

Authors:Xiaobei Zhao, Xiangrong Zeng, Yihang Ma, Pengjin Tang, Xiang Li
Title: TomatoScanner: phenotyping tomato fruit based on only RGB image
Abstract:
In tomato greenhouses, phenotypic measurement helps researchers and farmers monitor crop growth and precisely control environmental conditions in time, leading to better quality and higher yield. Traditional phenotyping mainly relies on manual measurement, which is accurate but inefficient and, more importantly, can endanger the health and safety of workers. Several studies have explored computer vision-based methods to replace manual phenotyping. However, 2D-based methods need extra calibration, damage the fruit, or can only measure limited and uninformative traits, while 3D-based methods require an extra depth camera, which is expensive and impractical for most farmers. In this paper, we propose a non-contact tomato fruit phenotyping method, titled TomatoScanner, where an RGB image is all you need as input. First, pixel features are extracted by instance segmentation with our proposed EdgeYOLO, after preprocessing for individual separation and pose correction. Second, depth features are extracted by depth estimation with Depth Pro. Third, pixel and depth features are fused to output phenotype results in real-world units. We establish a self-built Tomato Phenotype Dataset to test TomatoScanner, which achieves excellent phenotyping on width, height, vertical area and volume, with median relative errors of 5.63%, 7.03%, -0.64% and 37.06%, respectively. We propose and add three innovative modules - EdgeAttention, EdgeLoss and EdgeBoost - into EdgeYOLO to enhance segmentation accuracy on edge regions. Precision and mean Edge Error greatly improve from 0.943 and 5.641% to 0.986 and 2.963%, respectively. Meanwhile, EdgeYOLO remains lightweight and efficient, with a 48.7 M weight size and 76.34 FPS. Codes and datasets: https://github.com/AlexTraveling/TomatoScanner.
中文: 本文提出的TomatoScanner方法仅需RGB图像,通过实例分割与深度估计实现非接触式番茄表型测量,在保持轻量化的同时显著提升了边缘分割精度,且无需昂贵设备。
English: This paper introduces TomatoScanner, a non-contact method using only RGB images to accurately measure tomato fruit phenotypes through instance segmentation and depth estimation, achieving low error rates without expensive equipment.
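
A back-of-the-envelope sketch of how a pixel measurement and an estimated depth can be fused into a metric phenotype via the pinhole camera model; the function, the focal-length parameter, and the numbers are illustrative assumptions, not the exact TomatoScanner pipeline.

```python
def pixel_to_metric_width(pixel_width, depth_m, focal_px):
    """Convert a fruit's pixel width into a metric width with the pinhole model.

    `pixel_width` comes from the segmentation mask, `depth_m` from monocular
    depth estimation, and `focal_px` is the focal length in pixels. This is a
    simplified illustration of fusing pixel and depth features.
    """
    return pixel_width * depth_m / focal_px

# Example: a 300-pixel-wide fruit at 0.4 m with a 1600 px focal length
# maps to roughly 300 * 0.4 / 1600 = 0.075 m (7.5 cm).
width_m = pixel_to_metric_width(300, 0.4, 1600)
```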

Authors:Juan Miguel Valverde, Maja Østergaard, Adrian Rodriguez-Palomo, Peter Alling Strange Vibe, Nina Kølln Wittig, Henrik Birkedal, Anders Bjorholm Dahl
Title: Disconnect to Connect: A Data Augmentation Method for Improving Topology Accuracy in Image Segmentation
Abstract:
Accurate segmentation of thin, tubular structures (e.g., blood vessels) is challenging for deep neural networks. These networks classify individual pixels, and even minor misclassifications can break the thin connections within these structures. Existing methods for improving topology accuracy, such as topology loss functions, rely on very precise, topologically-accurate training labels, which are difficult to obtain. This is because annotating images, especially 3D images, is extremely laborious and time-consuming. Low image resolution and contrast further complicate the annotation by causing tubular structures to appear disconnected. We present CoLeTra, a data augmentation strategy that integrates into the model the prior knowledge that structures which appear broken are actually connected. This is achieved by creating images with the appearance of disconnected structures while maintaining the original labels. Our extensive experiments, involving different architectures, loss functions, and datasets, demonstrate that CoLeTra leads to topologically more accurate segmentations while often improving the Dice coefficient and Hausdorff distance. CoLeTra's hyper-parameters are intuitive to tune, and our sensitivity analysis shows that CoLeTra is robust to changes in these hyper-parameters. We also release a dataset specifically suited for image segmentation methods with a focus on topology accuracy. CoLeTra's code can be found at https://github.com/jmlipman/CoLeTra.
中文: CoLeTra是一种数据增强策略,通过使用看似断开但实际保持连接标签的图像进行训练,提高了深度学习网络对管状结构分割的拓扑准确性,无需精确标注即可获得更优的分割效果。
English: CoLeTra is a data augmentation technique that enhances the topological accuracy of tubular structure segmentation in deep neural networks by training with intentionally disconnected-looking images while preserving correct connectivity labels, achieving better results without requiring precise annotations.
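
A small sketch of the "disconnect to connect" augmentation on a 2D grayscale image with a binary label mask: gaps are painted over labelled pixels while the label is left untouched. The gap placement and the median background fill are simplifications of the paper's strategy.

```python
import numpy as np

def disconnecting_augmentation(image, label, num_gaps=10, gap_size=5, rng=None):
    """Paint small background-coloured gaps over labelled pixels, keeping labels intact.

    `image` is a 2D grayscale array and `label` a binary mask of the tubular
    structure. The network then sees structures that look disconnected while the
    supervision still says they are connected.
    """
    rng = rng if rng is not None else np.random.default_rng()
    aug = image.copy()
    ys, xs = np.nonzero(label)
    if len(ys) == 0:
        return aug, label
    background = np.median(image[label == 0])       # proxy for the background intensity
    for _ in range(num_gaps):
        i = rng.integers(len(ys))
        y, x = ys[i], xs[i]
        aug[max(0, y - gap_size):y + gap_size, max(0, x - gap_size):x + gap_size] = background
    return aug, label                                # labels stay unchanged
```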

Authors:Haotian Hu, Jingwei Xu, Fanyi Wang, Toyota Li, Yaonong Wang, Laifeng Hu, Zhiwang Zhang
Title: FastMap: Fast Queries Initialization Based Vectorized HD Map Reconstruction Framework
Abstract:
Reconstruction of high-definition maps is a crucial task in perceiving the autonomous driving environment, as its accuracy directly impacts the reliability of prediction and planning capabilities in downstream modules. Current vectorized map reconstruction methods based on the DETR framework encounter limitations due to the redundancy in the decoder structure, necessitating the stacking of six decoder layers to maintain performance, which significantly hampers computational efficiency. To tackle this issue, we introduce FastMap, an innovative framework designed to reduce decoder redundancy in existing approaches. FastMap optimizes the decoder architecture by employing a single-layer, two-stage transformer that achieves multilevel representation capabilities. Our framework eliminates the conventional practice of randomly initializing queries and instead incorporates a heatmap-guided query generation module during the decoding phase, which effectively maps image features into structured query vectors using learnable positional encoding. Additionally, we propose a geometry-constrained point-to-line loss mechanism for FastMap, which adeptly addresses the challenge of distinguishing highly homogeneous features that often arise in traditional point-to-point loss computations. Extensive experiments demonstrate that FastMap achieves state-of-the-art performance on both the nuScenes and Argoverse2 datasets, with its decoder operating 3.2$\times$ faster than the baseline. Code and more demos are available at https://github.com/hht1996ok/FastMap.
Chinese: FastMap提出了一种创新框架,通过热图引导的查询生成模块和几何约束损失机制,有效减少矢量地图重建中的解码器冗余,在保持顶尖性能的同时将解码速度提升了3.2倍。
English: FastMap introduces a novel framework that reduces decoder redundancy in vectorized map reconstruction by employing a heatmap-guided query generation module and a geometry-constrained loss, achieving state-of-the-art performance with a 3.2 times faster decoder.

Authors:Miaowei Wang, Yibo Zhang, Rui Ma, Weiwei Xu, Changqing Zou, Daniel Morris
Title: DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction
Abstract:
We present DecoupledGaussian, a novel system that decouples static objects from the surfaces they contact in videos captured in the wild, a key prerequisite for realistic Newtonian-based physical simulations. Unlike prior methods focused on synthetic data or elastic jittering along the contact surface, which prevent objects from fully detaching or moving independently, DecoupledGaussian allows for significant positional changes without being constrained by the initial contacted surface. Recognizing the limitations of current 2D inpainting tools for restoring 3D locations, our approach proposes joint Poisson fields to repair and expand the Gaussians of both objects and contacted scenes after separation. This is complemented by a multi-carve strategy to refine the object's geometry. Our system enables realistic simulations of decoupling motions, collisions, and fractures driven by user-specified impulses, supporting complex interactions within and across multiple scenes. We validate DecoupledGaussian through a comprehensive user study and quantitative benchmarks. This system enhances digital interaction with objects and scenes in real-world environments, benefiting industries such as VR, robotics, and autonomous driving. Our project page is at: https://wangmiaowei.github.io/DecoupledGaussian.github.io/.

Authors:Run He, Di Fang, Yicheng Xu, Yawen Cui, Ming Li, Cen Chen, Ziqian Zeng, Huiping Zhuang
Title: Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning
Abstract:
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn from distinct categories without retaining exemplars but easily suffers from catastrophic forgetting of learned knowledge. While existing EFCIL methods leverage knowledge distillation to alleviate forgetting, they still face two critical challenges: semantic shift and decision bias. Specifically, the embeddings of old tasks shift in the embedding space after learning new tasks, and the classifier becomes biased towards new tasks due to training solely with new data, hindering the balance between old and new knowledge. To address these issues, we propose the Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic shift through a dual projection, which combines a learnable transformation with a row-space projection to capture both task-wise and category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs ridge regression to reformulate the classifier reconstruction process. This reconstruction exploits the previously preserved covariance and prototype of each class, calibrated with the estimated shift, thereby reducing decision bias. Extensive experiments demonstrate that, on various datasets, DPCR effectively balances old and new tasks, outperforming state-of-the-art EFCIL methods. Our codes are available at https://github.com/RHe502/ICML25-DPCR.
中文: 提出的双投影偏移估计与分类器重构(DPCR)方法通过双投影估计语义偏移并利用岭回归重构分类器,有效解决了无范例类增量学习中的语义偏移和决策偏差问题,在多个数据集上实现了最优性能。
English: The proposed Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach effectively addresses semantic shift and decision bias in exemplar-free class-incremental learning by estimating shifts through dual projections and reconstructing classifiers using ridge regression, achieving superior performance across multiple datasets.
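
A minimal sketch of the ridge-regression classifier reconstruction step, assuming running second-order statistics (a feature Gram matrix and per-class feature sums) that have already been calibrated with the estimated semantic shift; the interface and the ridge value are assumptions.

```python
import torch

def reconstruct_classifier(gram, class_sums, ridge=1e3):
    """Closed-form ridge-regression reconstruction of a linear classifier.

    `gram` is the accumulated feature Gram matrix sum_i x_i x_i^T of shape (D, D)
    and `class_sums` stacks sum_i x_i per class, shape (C, D); both are assumed
    to have been calibrated with the estimated semantic shift beforehand.
    """
    d = gram.size(0)
    eye = torch.eye(d, dtype=gram.dtype, device=gram.device)
    # Solve (G + ridge * I) W = class_sums^T in one linear solve, with no replay data.
    w = torch.linalg.solve(gram + ridge * eye, class_sums.t())   # (D, C)
    return w.t()   # (C, D) weight matrix; logits for features x of shape (N, D) are x @ w
```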

Authors:Zhenxuan Zhang, Peiyuan Jing, Coraline Beitone, Jiahao Huang, Zhifan Gao, Guang Yang, Pete Lally
Title: Pretext Task Adversarial Learning for Unpaired Low-field to Ultra High-field MRI Synthesis
Abstract:
Given the scarcity and cost of high-field MRI, the synthesis of high-field MRI from low-field MRI holds significant potential when there is limited data for training downstream tasks (e.g. segmentation). Low-field MRI often suffers from a reduced signal-to-noise ratio (SNR) and spatial resolution compared to high-field MRI. However, synthesizing high-field MRI data presents challenges. These involve aligning image features across domains while preserving anatomical accuracy and enhancing fine details. To address these challenges, we propose a Pretext Task Adversarial (PTA) learning framework for high-field MRI synthesis from low-field MRI data. The framework comprises three processes: (1) The slice-wise gap perception (SGP) network aligns the slice inconsistencies of low-field and high-field datasets based on contrastive learning. (2) The local structure correction (LSC) network extracts local structures by restoring the locally rotated and masked images. (3) The pretext task-guided adversarial training process introduces additional supervision and incorporates a discriminator to improve image realism. Extensive experiments on low-field to ultra high-field task demonstrate the effectiveness of our method, achieving state-of-the-art performance (16.892 in FID, 1.933 in IS, and 0.324 in MS-SSIM). This enables the generation of high-quality high-field-like MRI data from low-field MRI data to augment training datasets for downstream tasks. The code is available at: https://github.com/Zhenxuan-Zhang/PTA4Unpaired_HF_MRI_SYN.
中文: 提出的前置任务对抗(PTA)学习框架通过特征对齐和细节增强,有效从低场MRI数据合成高场MRI,实现了顶尖性能,为下游任务扩充训练数据集。
English: The proposed Pretext Task Adversarial (PTA) learning framework effectively synthesizes high-field-like MRI from low-field data by aligning features and enhancing details, achieving state-of-the-art performance to augment training datasets for downstream tasks.

Authors:Jungho Lee, Donghyeong Kim, Dogyoon Lee, Suhwan Cho, Minhyeok Lee, Wonjoon Lee, Taeoh Kim, Dongyoon Wee, Sangyoun Lee
Title: CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
Abstract:
3D Gaussian Splatting (3DGS) has gained significant attention due to its high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, preserving the shape and size of the object, but rely on the discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur.
Chinese: 本研究提出CoMoGaussian方法,通过连续运动感知的高斯泼溅和神经常微分方程,从运动模糊图像中精确重建3D场景,在实时渲染中取得了最先进的性能。
English: This study introduces CoMoGaussian, a method that uses continuous motion-aware Gaussian splatting and neural ODEs to reconstruct accurate 3D scenes from motion-blurred images, achieving state-of-the-art results in real-time rendering.

Authors:Souhail Hadgi, Luca Moschella, Andrea Santilli, Diego Gomez, Qixing Huang, Emanuele Rodolà, Simone Melzi, Maks Ovsjanikov
Title: Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces
Abstract:
Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data compared to other representations. Our code and weights are available at https://github.com/Souhail-01/3d-text-alignment
中文: 研究表明,三维与文本编码器的简单后训练对齐效果有限,但将特征投影到精选的低维子空间可显著提升对齐质量与任务性能,同时揭示了语义与几何数据表征的分离特性。
English: This study demonstrates that naive post-training alignment of 3D and text encoders yields limited results, but projecting features onto carefully selected lower-dimensional subspaces significantly enhances alignment quality and task performance, revealing distinct semantic and geometric data representations.
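
A compact sketch of post-hoc subspace alignment between paired uni-modal embeddings: both modalities are projected onto low-dimensional PCA subspaces and then mapped onto each other with orthogonal Procrustes. The subspace dimension and the Procrustes step are illustrative choices, not necessarily the authors' exact procedure.

```python
import numpy as np

def subspace_align(feats_3d, feats_text, dim=64):
    """Project paired 3D and text embeddings onto PCA subspaces, then align them.

    `feats_3d` (N, D3) and `feats_text` (N, Dt) are paired embeddings from frozen
    uni-modal encoders.
    """
    def pca_project(x, k):
        x = x - x.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        return x @ vt[:k].T                               # (N, k)

    z3, zt = pca_project(feats_3d, dim), pca_project(feats_text, dim)
    # Orthogonal Procrustes: rotation R minimising ||z3 @ R - zt||_F.
    u, _, vt = np.linalg.svd(z3.T @ zt)
    aligned_3d = z3 @ (u @ vt)
    return aligned_3d, zt                                 # comparable spaces for retrieval

# Usage on random placeholders (real features would come from the two encoders).
a, b = subspace_align(np.random.randn(500, 768), np.random.randn(500, 512))
```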

Authors:Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, Jiaqi Wang
Title: Unified Reward Model for Multimodal Understanding and Generation
Abstract:
Recent advances in human preference alignment have significantly enhanced multimodal generation and understanding. A key approach is training reward models to guide preference optimization. However, existing models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that jointly learning to assess multiple tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring, which can be employed for vision model preference alignment. Specifically, (1) we first develop UnifiedReward on our constructed large-scale human preference dataset, including both image and video generation/understanding tasks. (2) Then, it is utilized to automatically construct high-quality preference pair data based on the vision models, finely filtering their outputs through pair ranking and point sifting. (3) Finally, these data are used for their preference alignment through Direct Preference Optimization (DPO). Experimental results demonstrate that joint learning to assess diverse visual tasks can lead to substantial mutual benefits and we apply our pipeline to both image and video understanding/generation tasks, significantly improving the performance in each domain.
中文摘要:本文提出首个统一奖励模型UnifiedReward,通过联合学习多种视觉任务实现多模态理解与生成评估,实验证明该方法能在图像和视频领域产生显著协同效应并有效提升各领域性能。
English Summary: This paper introduces UnifiedReward, a unified reward model for multimodal understanding and generation assessment that enables joint learning across diverse visual tasks, demonstrating substantial mutual benefits through experimental validation and application in both image and video domains.

Authors:Ruoxuan Zhang, Hongxia Xie, Yi Yao, Jian-Yu Jiang-Lin, Bin Wen, Ling Lo, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng
Title: RecipeGen: A Benchmark for Real-World Recipe Image Generation
Abstract:
Recipe image generation is an important challenge in food computing, with applications from culinary education to interactive recipe platforms. However, there is currently no real-world dataset that comprehensively connects recipe goals, sequential steps, and corresponding images. To address this, we introduce RecipeGen, the first real-world goal-step-image benchmark for recipe generation, featuring diverse ingredients, varied recipe steps, multiple cooking styles, and a broad collection of food categories. Data is available at https://github.com/zhangdaxia22/RecipeGen.
Chinese: RecipeGen 是首个真实世界的菜谱图像生成基准数据集,它通过涵盖多样食材、烹饪风格和食物类别,解决了菜谱目标、步骤与图像之间缺乏全面关联的问题。
English: RecipeGen is introduced as the first real-world benchmark dataset for recipe image generation, addressing the lack of comprehensive connections between recipe goals, steps, and images across diverse ingredients, cooking styles, and food categories.

Authors:Yifan Liu, Yu Fang, Zhouhan Lin
Title: DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
Abstract:
Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters. Code and weights can be found at https://github.com/PussyCat0700/DiVISe.
中文摘要:DiVISe是一种端到端的视频语音合成模型,仅从无声视频直接生成梅尔频谱图,无需声学提示即可在保持说话人特征的同时实现卓越的语音可懂度。
English Summary: DiVISe is an end-to-end video-to-speech model that directly generates Mel-spectrograms from silent video, achieving superior acoustic intelligibility and speaker characteristic preservation without requiring acoustic hints.

Authors:Bill Cassidy, Christian McBride, Connah Kendrick, Neil D. Reeves, Joseph M. Pappachan, Shaghayegh Raad, Moi Hoon Yap
Title: Gaussian Random Fields as an Abstract Representation of Patient Metadata for Multimodal Medical Image Segmentation
Abstract:
The growing rate of chronic wound occurrence, especially in patients with diabetes, has become a concerning trend in recent years. Chronic wounds are difficult and costly to treat, and have become a serious burden on health care systems worldwide. Chronic wounds can have devastating consequences for the patient, with infection often leading to reduced quality of life and increased mortality risk. Innovative deep learning methods for the detection and monitoring of such wounds have the potential to reduce the impact to both patient and clinician. We present a novel multimodal segmentation method which allows for the introduction of patient metadata into the training workflow whereby the patient data are expressed as Gaussian random fields. Our results indicate that the proposed method improved performance when utilising multiple models, each trained on different metadata categories. Using the Diabetic Foot Ulcer Challenge 2022 test set, when compared to the baseline results (intersection over union = 0.4670, Dice similarity coefficient = 0.5908) we demonstrate improvements of +0.0220 and +0.0229 for intersection over union and Dice similarity coefficient respectively. This paper presents the first study to focus on integrating patient data into a chronic wound segmentation workflow. Our results show significant performance gains when training individual models using specific metadata categories, followed by average merging of prediction masks using distance transforms. All source code for this study is available at: https://github.com/mmu-dermatology-research/multimodal-grf
中文: 本研究提出了一种新颖的多模态分割方法,将患者元数据以高斯随机场形式融入训练流程,在慢性伤口分析中通过深度学习技术显著提升了性能表现。
English: This study introduces a novel multimodal segmentation method that integrates patient metadata as Gaussian random fields, demonstrating improved performance in chronic wound analysis using deep learning techniques.
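A minimal sketch of the core idea under stated assumptions: a patient metadata value is rendered as a smooth Gaussian random field (here realised by Gaussian-smoothing white noise; the paper's exact field construction, scaling, and fusion are not reproduced) and appended to the image as an extra input channel.

```python
# Hypothetical sketch: encode a scalar metadata value as a Gaussian random field
# channel appended to the wound image. Field construction is an assumption.
import numpy as np
from scipy.ndimage import gaussian_filter

def metadata_to_grf(value: float, shape=(256, 256), sigma=16.0, seed=0) -> np.ndarray:
    """Turn a normalised metadata value into a smooth random field in [0, value]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=shape)
    field = gaussian_filter(noise, sigma=sigma)            # spatially correlated noise
    field = (field - field.min()) / (np.ptp(field) + 1e-8) # rescale to [0, 1]
    return value * field                                    # modulate amplitude by metadata

image = np.random.rand(256, 256, 3).astype(np.float32)     # stand-in wound image
grf = metadata_to_grf(value=0.7, seed=42)                   # e.g. an encoded metadata category
multimodal_input = np.concatenate([image, grf[..., None]], axis=-1)  # H x W x 4 network input
```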

Authors:Qingyuan Zhou, Yuehu Gong, Weidong Yang, Jiaze Li, Yeqi Luo, Baixin Xu, Shuhao Li, Ben Fei, Ying He
Title: MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
Abstract:
Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3D-GS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian splatting for Surface Reconstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches--one based on 2D-GS and the other on 3D-GS. The 2D-GS branch excels in surface reconstruction, providing precise geometry information to the 3D-GS branch. Leveraging this geometry, the 3D-GS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2D-GS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2D-GS and 3D-GS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object and scene levels, demonstrating strong performance in rendering and surface reconstruction. Code is available at https://github.com/TsingyuanChou/MGSR.
中文: 提出的MGSR方法通过2D和3D高斯泼溅分支的相互促进框架,利用交替优化和几何引导的光照分解,在多变光照条件下同时提升了渲染质量和表面重建精度。
English: The proposed MGSR method introduces a mutual-boosted framework between 2D and 3D Gaussian splatting branches that enhances both rendering quality under varied lighting and surface reconstruction accuracy through alternating optimization and geometry-guided illumination decomposition.

Authors:Junxiang Qiu, Lin Liu, Shuo Wang, Jinda Lu, Kezhou Chen, Yanbin Hao
Title: Accelerating Diffusion Transformer via Gradient-Optimized Cache
Abstract:
Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. It is a challenging problem since (1) Progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50% of blocks are cached; (2) Current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. These gradients are weighted and propagated to subsequent steps, directly compensating for the approximation errors introduced by caching. (2) Inflection-Aware Optimization: Through statistical analysis of feature variation patterns, we identify critical inflection points where the denoising trajectory changes direction. By aligning gradient updates with these detected phases, we prevent conflicting gradient directions during error correction. Extensive evaluations on ImageNet demonstrate GOC's superior trade-off between efficiency and quality. With 50% cached blocks, GOC achieves IS 216.28 (26.3% higher) and FID 3.907 (43% lower) compared to baseline DiT, while maintaining identical computational costs. These improvements persist across various cache ratios, demonstrating robust adaptability to different acceleration requirements. Code is available at https://github.com/qiujx0520/GOC_ICCV2025.git.
中文: 提出的梯度优化缓存(GOC)通过动态传播梯度差异和对齐拐点更新,解决了扩散变换器采样中的误差累积问题,在保持计算效率的同时显著提升了生成质量。
English: The proposed Gradient-Optimized Cache (GOC) addresses error accumulation in diffusion transformer sampling by dynamically propagating gradient differences and aligning updates with inflection points, achieving significant quality improvements while maintaining computational efficiency.
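A minimal, hypothetical sketch of gradient-compensated feature caching for one DiT block, assuming the "gradient" is the difference between consecutive recomputed outputs and that a single scalar weight propagates it; GOC's actual gradient queue, weighting scheme, and inflection-aware scheduling are more involved.

```python
# Sketch of gradient-compensated caching: reuse a stale block output, corrected
# by a weighted feature difference from the last two recomputations.
import torch

class GradientOptimizedCacheSketch:
    def __init__(self, weight: float = 0.5):
        self.weight = weight    # propagation weight for the cached gradient
        self.feature = None     # last recomputed block output
        self.grad = None        # difference between the last two recomputed outputs

    def recompute(self, block, x: torch.Tensor) -> torch.Tensor:
        out = block(x)
        if self.feature is not None:
            self.grad = out - self.feature     # update the cached "gradient"
        self.feature = out
        return out

    def reuse(self) -> torch.Tensor:
        # Compensate the stale feature with the propagated gradient.
        if self.grad is None:
            return self.feature
        return self.feature + self.weight * self.grad

# Usage: recompute on selected timesteps, reuse (with compensation) otherwise.
block = torch.nn.Linear(64, 64)                 # stand-in for a transformer block
cache = GradientOptimizedCacheSketch(weight=0.5)
x = torch.randn(1, 64)
for step in range(10):
    y = cache.recompute(block, x) if step % 2 == 0 else cache.reuse()
```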

Authors:Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
Title: R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Abstract:
Recently, DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by ~2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL to instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
中文摘要:该研究首次在20亿参数的非SFT多模态模型上成功复现了DeepSeek R1式"顿悟时刻"推理能力的涌现,通过在SAT数据上直接进行强化学习大幅提升了准确率,同时揭示了指令模型和简单长度奖励在激发推理能力方面的局限性。
English Summary: This work presents the first replication of DeepSeek R1's emergent "aha moment" on a 2B non-SFT multimodal model via reinforcement learning on SAT data, achieving substantial accuracy gains while revealing the limitations of instruct models and naive length rewards for eliciting reasoning.

Authors:Xi Li, Tong Rao, Cihui Pan
Title: EDM: Efficient Deep Feature Matching
Abstract:
Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code is available at https://github.com/chicleee/EDM.
中文: 提出的高效深度特征匹配网络(EDM)通过采用更深层CNN、关联注入模块和轻量级回归头,在保证精度的同时显著提升了特征匹配效率,在多个基准测试中表现优异。
English: The proposed Efficient Deep feature Matching network (EDM) enhances detector-free feature matching by integrating a deeper CNN with a correlation injection module and a lightweight regression head, achieving both high accuracy and superior efficiency across benchmarks.

Authors:Wenhao Liang, Wei Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Title: We Care Each Pixel: Calibrating on Medical Segmentation Model
Abstract:
Medical image segmentation is fundamental for computer-aided diagnostics, providing accurate delineation of anatomical structures and pathological regions. While common metrics such as Accuracy, DSC, IoU, and HD primarily quantify spatial agreement between predictions and ground-truth labels, they do not assess the calibration quality of segmentation models, which is crucial for clinical reliability. To address this limitation, we propose pixel-wise Expected Calibration Error (pECE), a novel metric that explicitly measures miscalibration at the pixel level, thereby ensuring both spatial precision and confidence reliability. We further introduce a morphological adaptation strategy that applies morphological operations to ground-truth masks before computing calibration losses, particularly benefiting margin-based losses such as Margin SVLS and NACL. Additionally, we present the Signed Distance Calibration Loss (SDC), which aligns boundary geometry with calibration objectives by penalizing discrepancies between predicted and ground-truth signed distance functions (SDFs). Extensive experiments demonstrate that our method not only enhances segmentation performance but also improves calibration quality, yielding more trustworthy confidence estimates. Code is available at: https://github.com/EagleAdelaide/SDC-Loss.
Chinese Summary: 本文提出像素级期望校准误差(pECE)和符号距离校准损失(SDC),通过形态学自适应策略提升医学图像分割的校准质量与空间精度,弥补传统指标不足。
English Summary: This paper introduces pixel-wise Expected Calibration Error (pECE) and Signed Distance Calibration Loss (SDC) to improve both segmentation accuracy and confidence calibration in medical imaging, addressing limitations of traditional metrics.
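As a rough illustration (not necessarily the paper's exact definition), a pixel-wise expected calibration error can be sketched by binning per-pixel foreground probabilities and comparing mean confidence with empirical accuracy within each bin:

```python
# Sketch of a pixel-level ECE for binary segmentation; bin count and binning
# scheme are illustrative assumptions.
import torch

def pixel_ece(probs: torch.Tensor, labels: torch.Tensor, n_bins: int = 10) -> torch.Tensor:
    """probs, labels: (N, H, W) foreground probabilities and binary ground-truth masks."""
    probs, labels = probs.flatten(), labels.flatten().float()
    bins = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            conf = probs[in_bin].mean()       # average confidence in the bin
            acc = labels[in_bin].mean()       # empirical foreground frequency in the bin
            ece += in_bin.float().mean() * (conf - acc).abs()
    return ece

probs = torch.rand(2, 64, 64)
labels = (torch.rand(2, 64, 64) > 0.5).long()
print(pixel_ece(probs, labels).item())
```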

Authors:Yingji Zhong, Zhihao Li, Dave Zhenyu Chen, Lanqing Hong, Dan Xu
Title: Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
Abstract:
Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose to use a reconstruction by generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that do not fully benefit subsequent 3DGS modeling. To address the challenge of inconsistencies, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method. It effectively identifies regions that are outside the field of view and occluded. We further design a scheme tailored for 3DGS optimization with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.

Authors:Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal
Title: LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
Abstract:
Despite recent efforts in understanding the compression impact on large language models (LLMs) in terms of their downstream task performance and trustworthiness on relatively simpler uni-modal benchmarks (for example, question answering, common sense reasoning), a detailed study on multi-modal Large Vision-Language Models (LVLMs) is yet to be unveiled. Towards mitigating this gap, we present LVLM-Compress-Bench, a framework to first thoroughly study the broad impact of compression on the generative performance of LVLMs with multi-modal input driven tasks. Specifically, we consider two major classes of compression for autoregressive models, namely KV cache and weight compression, for the dynamically growing intermediate cache and static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis, integrating various state-of-the-art KV and weight compression methods including uniform, outlier-reduced, and group quantization for the KV cache and weights. With this framework we demonstrate results on ten different multi-modal datasets covering different capabilities including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes and bias. Specifically, our framework demonstrates the compression impact on both general and ethically critical metrics, leveraging a combination of real-world and synthetic datasets to encompass diverse societal intersectional attributes. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budgets for the KV cache and weights, both maintaining and losing performance relative to the baseline model in FP16 format. Code will be open-sourced at https://github.com/opengear-project/LVLM-compress-bench.
中文: 本研究提出LVLM-Compress-Bench框架,全面评估压缩技术对多模态大视觉语言模型在各项任务中性能的影响,揭示了其在通用指标和伦理指标上的多样化表现。
English: This study introduces LVLM-Compress-Bench, a framework to comprehensively evaluate how compression techniques affect the performance of multimodal large vision-language models across various tasks, revealing diverse impacts on both general and ethical metrics.
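For concreteness, a small sketch of per-group uniform quantization, one of the compression styles the benchmark integrates; the group size, bit-width, and quantize-dequantize form are illustrative assumptions rather than the framework's exact implementation.

```python
# Per-group uniform quantization of a tensor's last dimension, returning the
# dequantized tensor so the reconstruction error can be inspected.
import torch

def group_quantize(x: torch.Tensor, group_size: int = 64, bits: int = 4) -> torch.Tensor:
    orig_shape = x.shape
    g = x.reshape(-1, group_size)
    lo = g.min(dim=1, keepdim=True).values
    hi = g.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2 ** bits - 1)
    q = torch.round((g - lo) / scale)              # integer codes in [0, 2^bits - 1]
    return (q * scale + lo).reshape(orig_shape)    # dequantized values

kv = torch.randn(2, 8, 128, 64)                    # e.g. a cached key tensor
kv_4bit = group_quantize(kv, group_size=64, bits=4)
print((kv - kv_4bit).abs().mean())                 # mean reconstruction error
```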

Authors:Albert Wilcox, Mohamed Ghanem, Masoud Moghani, Pierre Barroso, Benjamin Joffe, Animesh Garg
Title: Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning
Abstract:
Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle to observations outside the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address these challenges, we propose Adapt3R, a general-purpose 3D observation encoder which synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
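A rough sketch of the stated idea under simple assumptions: semantic features come from a 2D backbone, while depth and a camera-to-end-effector transform serve only to express where each feature lies relative to the end-effector. The helper name, mean pooling, and tensor shapes are illustrative, not Adapt3R's actual architecture.

```python
# Back-project 2D feature locations to 3D, express them in the end-effector
# frame, attach the 2D semantic features, and pool into a conditioning vector.
import torch

def adapt3r_like_encoding(feats, depth, K_inv, T_cam_to_ee):
    """feats: (C, H, W) 2D backbone features; depth: (H, W);
    K_inv: (3, 3) inverse intrinsics; T_cam_to_ee: (4, 4) camera-to-end-effector."""
    C, H, W = feats.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()        # (H, W, 3)
    pts = (pix @ K_inv.T) * depth[..., None]                             # back-project to camera frame
    pts_h = torch.cat([pts, torch.ones(H, W, 1)], dim=-1)                # homogeneous coordinates
    pts_ee = (pts_h @ T_cam_to_ee.T)[..., :3]                            # end-effector frame
    tokens = torch.cat([pts_ee.reshape(-1, 3),
                        feats.permute(1, 2, 0).reshape(-1, C)], dim=-1)  # (H*W, 3 + C)
    return tokens.mean(dim=0)                                            # pooled conditioning vector

vec = adapt3r_like_encoding(torch.randn(32, 60, 80), torch.rand(60, 80),
                            torch.eye(3), torch.eye(4))
```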

Authors:Thilo Reinold, Suman Ghosh, Guillermo Gallego
Title: Combined Physics and Event Camera Simulator for Slip Detection
Abstract:
Robot manipulation is a common task in fields like industrial manufacturing. Detecting when objects slip from a robot's grasp is crucial for safe and reliable operation. Event cameras, which register pixel-level brightness changes at high temporal resolution (called "events"), offer an elegant feature when mounted on a robot's end effector: since they only detect motion relative to their viewpoint, a properly grasped object produces no events, while a slipping object immediately triggers them. To research this feature, representative datasets are essential, both for analytic approaches and for training machine learning models. The majority of current research on slip detection with event-based data relies on real-world scenarios, manual data collection, and additional setups for data labeling. This can result in a significant increase in the time required for data collection, a lack of flexibility in scene setups, and a high level of complexity in the repetition of experiments. This paper presents a simulation pipeline for generating slip data using the described camera-gripper configuration on a robot arm, and demonstrates its effectiveness through initial data-driven experiments. The use of a simulator, once it is set up, has the potential to reduce the time spent on data collection, allow the setup to be altered at any time, simplify the repetition of experiments, and enable the generation of arbitrarily large datasets. Two distinct datasets were created and validated through visual inspection and artificial neural networks (ANNs). Visual inspection confirmed photorealistic frame generation and accurate slip modeling, while three ANNs trained on this data achieved high validation accuracy and demonstrated good generalization capabilities on a separate test set, along with initial applicability to real-world data. Project page: https://github.com/tub-rip/event_slip
中文: 本文提出了一种在机器人手臂上使用事件相机生成滑移检测数据的仿真流程,该方法通过视觉检查和神经网络验证,能有效减少数据采集时间并提高实验灵活性。
English: This paper introduces a simulation pipeline for generating slip detection data using event cameras on robot arms, which reduces data collection time and enhances flexibility while validating the approach through visual inspection and neural networks.

Authors:Donghyeok Shin, HeeSun Bae, Gyuwon Sim, Wanmo Kang, Il-Chul Moon
Title: Distilling Dataset into Neural Field
Abstract:
Utilizing a large-scale dataset is essential for training high-performance deep learning models, but it also comes with substantial computation and storage costs. To overcome these challenges, dataset distillation has emerged as a promising solution by compressing the large-scale dataset into a smaller synthetic dataset that retains the essential information needed for training. This paper proposes a novel parameterization framework for dataset distillation, coined Distilling Dataset into Neural Field (DDiF), which leverages the neural field to store the necessary information of the large-scale dataset. Due to the unique nature of the neural field, which takes coordinates as input and outputs the corresponding quantities, DDiF effectively preserves the information and easily generates data of various shapes. We theoretically confirm that DDiF exhibits greater expressiveness than some previous literature when the utilized budget for a single synthetic instance is the same. Through extensive experiments, we demonstrate that DDiF achieves superior performance on several benchmark datasets, extending beyond the image domain to include video, audio, and 3D voxel. We release the code at https://github.com/aailab-kaist/DDiF.
中文: 本文提出DDiF这一新型数据集蒸馏框架,通过神经场将大规模数据集高效压缩为紧凑的合成数据并保留关键信息,在图像、视频、音频和3D体素等多个领域展现出卓越性能。
English: This paper introduces DDiF, a novel dataset distillation framework that uses neural fields to efficiently compress large datasets into compact synthetic forms while preserving essential information, demonstrating superior performance across multiple domains including images, video, audio, and 3D data.
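A minimal sketch of parameterising a synthetic instance as a coordinate-based neural field; the MLP width, activation, and coordinate normalisation are assumptions, but it illustrates how one set of field parameters can be decoded at arbitrary output shapes.

```python
# A synthetic image stored as a small coordinate->value MLP; during distillation
# the field parameters (not pixels) would be optimised.
import torch
import torch.nn as nn

class NeuralFieldInstance(nn.Module):
    def __init__(self, hidden: int = 64, out_channels: int = 3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, out_channels),
        )

    def decode(self, height: int, width: int) -> torch.Tensor:
        ys = torch.linspace(-1, 1, height)
        xs = torch.linspace(-1, 1, width)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)   # (H, W, 2)
        out = self.mlp(grid.reshape(-1, 2))
        return out.reshape(height, width, -1).permute(2, 0, 1)              # (C, H, W)

field = NeuralFieldInstance()
img32 = field.decode(32, 32)   # the same parameters can be decoded at other shapes
img64 = field.decode(64, 64)
```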

Authors:Emanuele Bugliarello, Anurag Arnab, Roni Paiss, Pieter-Jan Kindermans, Cordelia Schmid
Title: What Are You Doing? A Closer Look at Controllable Human Video Generation
Abstract:
High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at https://github.com/google-deepmind/wyd-benchmark.
中文: 该摘要介绍了“你在做什么?”(WYD)基准,这是一个用于评估可控图像到视频生成的数据集,通过提供细粒度类别和自动指标来弥补现有数据集的不足,从而深入分析模型性能。
English: The abstract introduces the "What Are You Doing?" (WYD) benchmark, a comprehensive dataset designed to evaluate controllable image-to-video generation of humans, addressing the limitations of existing datasets by providing fine-grained categories and automatic metrics for in-depth model analysis.

Authors:Hong Liu, Haosen Yang, Federica Eduati, Josien P. W. Pluim, Mitko Veta
Title: Adaptive Prototype Learning for Multimodal Cancer Survival Analysis
Abstract:
Leveraging multimodal data, particularly the integration of whole-slide histology images (WSIs) and transcriptomic profiles, holds great promise for improving cancer survival prediction. However, excessive redundancy in multimodal data can degrade model performance. In this paper, we propose Adaptive Prototype Learning (APL), a novel and effective approach for multimodal cancer survival analysis. APL adaptively learns representative prototypes in a data-driven manner, reducing redundancy while preserving critical information. Our method employs two sets of learnable query vectors that serve as a bridge between high-dimensional representations and survival prediction, capturing task-relevant features. Additionally, we introduce a multimodal mixed self-attention mechanism to enable cross-modal interactions, further enhancing information fusion. Extensive experiments on five benchmark cancer datasets demonstrate the superiority of our approach over existing methods. The code is available at https://github.com/HongLiuuuuu/APL.
中文: 本文提出的自适应原型学习(APL)方法通过数据驱动方式学习代表性原型并引入跨模态交互,有效降低多模态癌症数据冗余,在五个基准数据集上展现出优越性能。
English: The proposed Adaptive Prototype Learning (APL) method effectively reduces redundancy in multimodal cancer data by adaptively learning prototypes and employing cross-modal interactions, demonstrating superior performance across five benchmark datasets.
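A hypothetical sketch of the learnable-query idea: a small set of query vectors cross-attends to many modality tokens to yield compact prototypes. Dimensions, the shared module, and the simple concatenation are illustrative stand-ins for APL's full design, which also includes a multimodal mixed self-attention mechanism.

```python
# Learnable queries cross-attend to WSI patch tokens and transcriptomic tokens,
# producing a small set of prototypes per modality that can be fused downstream.
import torch
import torch.nn as nn

class PrototypeLearner(nn.Module):
    def __init__(self, dim: int = 256, n_prototypes: int = 16, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, N, dim) modality tokens -> (B, n_prototypes, dim) prototypes."""
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        prototypes, _ = self.attn(q, tokens, tokens)
        return prototypes

wsi_tokens = torch.randn(2, 1000, 256)   # e.g. patch embeddings from a WSI
rna_tokens = torch.randn(2, 200, 256)    # e.g. grouped transcriptomic embeddings
learner = PrototypeLearner()
fused = torch.cat([learner(wsi_tokens), learner(rna_tokens)], dim=1)  # joint prototype set
```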

Authors:Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
Title: Simulating the Real World: A Unified Survey of Multimodal Generative Models
Abstract:
Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified review of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, the survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
中文摘要:本综述首次提出多模态生成模型的统一框架,系统梳理从二维到四维生成的演进路径,旨在克服现有方法在模态整合方面的局限,推动人工智能通用智能中的真实世界模拟研究。
English Summary: This survey presents a unified framework for multimodal generative models, systematically reviewing the progression from 2D to 4D generation to advance real-world simulation in AGI research while addressing current limitations in modality integration.

Authors:Hong Liu, Haosen Yang, Evi M. C. Huijben, Mark Schuiveling, Ruisheng Su, Josien P. W. Pluim, Mitko Veta
Title: PathoPainter: Augmenting Histopathology Segmentation via Tumor-aware Inpainting
Abstract:
Tumor segmentation plays a critical role in histopathology, but it requires costly, fine-grained image-mask pairs annotated by pathologists. Thus, synthesizing histopathology data to expand the dataset is highly desirable. Previous works suffer from inaccuracies and limited diversity in image-mask pairs, both of which hamper segmentation training, particularly on small-scale datasets and given the inherently complex nature of histopathology images. To address this challenge, we propose PathoPainter, which reformulates image-mask pair generation as a tumor inpainting task. Specifically, our approach preserves the background while inpainting the tumor region, ensuring precise alignment between the generated image and its corresponding mask. To enhance dataset diversity while maintaining biological plausibility, we incorporate a sampling mechanism that conditions tumor inpainting on regional embeddings from a different image. Additionally, we introduce a filtering strategy to exclude uncertain synthetic regions, further improving the quality of the generated data. Our comprehensive evaluation spans multiple datasets featuring diverse tumor types and various training data scales. As a result, segmentation improved significantly with our synthetic data, surpassing existing segmentation data synthesis approaches, e.g., 75.69% -> 77.69% on CAMELYON16. The code is available at https://github.com/HongLiuuuuu/PathoPainter.
Chinese: PathoPainter通过将图像-掩码对生成重构为肿瘤修复任务,在保留背景的同时生成多样且生物学可信的肿瘤区域,显著提升了多数据集上的分割性能。
English: PathoPainter enhances tumor segmentation by generating diverse and biologically plausible synthetic image-mask pairs through tumor inpainting, significantly improving segmentation accuracy across multiple datasets.

Authors:Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
Title: The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
Abstract:
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a ~14,000x compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
中文:LanDiff是一种结合自回归语言模型与扩散模型优势的混合文本到视频框架,通过从粗到细的生成方式克服了两者固有缺陷,在标准及长视频生成基准测试中均实现了最先进的性能表现。
English: LanDiff is a hybrid text-to-video framework that combines autoregressive language models and diffusion models to overcome their individual limitations, achieving state-of-the-art performance in both standard and long video generation benchmarks.

Authors:Qing Zhou, Tao Yang, Junyu Gao, Weiping Ni, Junzheng Wu, Qi Wang
Title: A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning
Abstract:
Remote Sensing Image Captioning (RSIC) is a cross-modal field bridging vision and language, aimed at automatically generating natural language descriptions of features and scenes in remote sensing imagery. Despite significant advances in developing sophisticated methods and large-scale datasets for training vision-language models (VLMs), two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models. These limitations fundamentally impede the progress and practical deployment of RSIC, particularly in the era of large VLMs. To address these challenges, this paper presents several significant contributions to the field. First, we introduce and analyze BRSIC (Bilingual Remote Sensing Image Captioning), a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions. Building upon this foundation, we develop a systematic evaluation framework that addresses the prevalent inconsistency in evaluation protocols, enabling rigorous assessment of model performance through standardized retraining procedures on BRSIC. Furthermore, we present an extensive empirical study of eight state-of-the-art large vision-language models (LVLMs), examining their capabilities across multiple paradigms including zero-shot inference, supervised fine-tuning, and multi-lingual training. This comprehensive evaluation provides crucial insights into the strengths and limitations of current LVLMs in handling multilingual remote sensing tasks. Additionally, our cross-dataset transfer experiments reveal interesting findings. The code and data will be available at https://github.com/mrazhou/BRSIC.
中文摘要:本文提出了BRSIC双语遥感图像描述数据集,构建了标准化评估框架系统评估大视觉语言模型的多语言能力,解决了该领域缺乏多语言数据和统一评估标准的关键瓶颈。
English Summary: This paper introduces BRSIC, a bilingual dataset with Chinese and English captions for remote sensing images, and establishes a standardized evaluation framework to assess multilingual capabilities of large vision-language models, addressing key limitations in the field.

Authors:Kai Luo, Hao Shi, Sheng Wu, Fei Teng, Mengfei Duan, Chang Huang, Yuhang Wang, Kaiwei Wang, Kailun Yang
Title: Omnidirectional Multi-Object Tracking
Abstract:
Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in panoramic field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset--a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as panoramic fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The established dataset and source code are available at https://github.com/xifen523/OmniTrack.
中文摘要:OmniTrack是一种创新的全景多目标跟踪框架,通过整合轨迹管理、灵活跟踪实例和环形状态模块,有效解决了全景图像失真问题,并在JRDB和新型QuadTrack数据集上实现了最先进的性能表现。
English Summary: OmniTrack is a novel omnidirectional multi-object tracking framework designed to overcome panoramic image distortions and the limitations of traditional tracking methods, achieving state-of-the-art performance on both JRDB and the newly introduced QuadTrack dataset.

Authors:Alessandro Scherl, Stefan Thalhammer, Bernhard Neuberger, Wilfried Wöber, José García-Rodríguez
Title: ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing
Abstract:
Visual servoing enables robots to precisely position their end-effector relative to a target object. While classical methods rely on hand-crafted features and thus are universally applicable without task-specific training, they often struggle with occlusions and environmental variations, whereas learning-based approaches improve robustness but typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also being able to generalize beyond the provided sample. Our approach achieves full convergence in unperturbed scenarios and surpasses classical image-based visual servoing by up to 31.2% relative improvement in perturbed scenarios. Even the convergence rates of learning-based methods are matched despite requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference from the same category. Our code and simulation environment are available at: https://alessandroscherl.github.io/ViT-VS/

Authors:Matias Cosarinsky, Ramiro Billot, Lucas Mansilla, Gabriel Jimenez, Nicolas Gaggión, Guanghui Fu, Tom Tirer, Enzo Ferrante
Title: Conformal In-Context Reverse Classification Accuracy: Efficient Estimation of Segmentation Quality with Statistical Guarantees
Abstract:
Assessing the quality of automatic image segmentation is crucial in clinical practice, but often very challenging due to the limited availability of ground truth annotations. Reverse Classification Accuracy (RCA) is an approach that estimates the quality of new predictions on unseen samples by training a segmenter on those predictions, and then evaluating it against existing annotated images. In this work, we introduce Conformal In-Context RCA, a novel method for automatically estimating segmentation quality with statistical guarantees in the absence of ground-truth annotations, which consists of two main innovations. First, In-Context RCA, which leverages recent in-context learning models for image segmentation and incorporates retrieval-augmentation techniques to select the most relevant reference images. This approach enables efficient quality estimation with minimal reference data while avoiding the need to train additional models. Second, we introduce Conformal RCA, which extends both the original RCA framework and In-Context RCA to go beyond point estimation. Using tools from split conformal prediction, Conformal RCA produces prediction intervals for segmentation quality, providing statistical guarantees that the true score lies within the estimated interval with a user-specified probability. Validated across 10 different medical imaging tasks in various organs and modalities, our methods demonstrate robust performance and computational efficiency, offering a promising solution for automated quality control in clinical workflows, where fast and reliable segmentation assessment is essential. The code is available at https://github.com/mcosarinsky/Conformal-In-Context-RCA.
中文: 本文提出Conformal In-Context RCA新方法,结合情境学习和保形预测技术,能够在缺乏真实标注的情况下自动评估医学图像分割质量并提供统计保证,经多临床任务验证具备高效可靠的自动化质量控制能力。
English: This paper introduces Conformal In-Context RCA, a novel method that leverages in-context learning and conformal prediction to automatically estimate medical image segmentation quality with statistical guarantees, validated across diverse clinical tasks for reliable automated quality control.
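A small sketch of the split-conformal step under a standard formulation (absolute-error nonconformity scores with the finite-sample quantile correction); the calibration values below are toy numbers, and the paper's exact score definition may differ.

```python
# Split conformal intervals around point estimates of segmentation quality.
import numpy as np

def conformal_interval(estimated_cal, true_cal, estimated_new, alpha=0.1):
    """Return an interval per new quality estimate with ~(1 - alpha) coverage."""
    scores = np.abs(np.asarray(true_cal) - np.asarray(estimated_cal))   # nonconformity scores
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n                        # finite-sample correction
    q = np.quantile(scores, min(q_level, 1.0))
    est = np.asarray(estimated_new)
    return np.clip(est - q, 0, 1), np.clip(est + q, 0, 1)               # e.g. Dice lies in [0, 1]

lo, hi = conformal_interval(
    estimated_cal=[0.82, 0.75, 0.90, 0.60], true_cal=[0.80, 0.70, 0.88, 0.66],
    estimated_new=[0.78], alpha=0.1)
```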

Authors:Zhihao Shi, Dong Huo, Yuhongze Zhou, Kejia Yin, Yan Min, Juwei Lu, Xinxin Zuo
Title: IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement
Abstract:
Current 3D inpainting and object removal methods are largely limited to front-facing scenes, facing substantial challenges when applied to diverse, "unconstrained" scenes where the camera orientation and trajectory are unrestricted. To bridge this gap, we introduce a novel approach that produces inpainted 3D scenes with consistent visual quality and coherent underlying geometry across both front-facing and unconstrained scenes. Specifically, we propose a robust 3D inpainting pipeline that incorporates geometric priors and a multi-view refinement network trained via test-time adaptation, building on a pre-trained image inpainting model. Additionally, we develop a novel inpainting mask detection technique to derive targeted inpainting masks from object masks, boosting the performance in handling unconstrained scenes. To validate the efficacy of our approach, we create a challenging and diverse benchmark that spans a wide range of scenes. Comprehensive experiments demonstrate that our proposed method substantially outperforms existing state-of-the-art approaches.

Authors:Benjamin Billot, Ramya Muthukrishnan, Esra Abaci-Turk, P. Ellen Grant, Nicholas Ayache, Hervé Delingette, Polina Golland
Title: Spatial regularisation for improved accuracy and interpretability in keypoint-based registration
Abstract:
Unsupervised registration strategies bypass the need for ground-truth transforms or segmentations by optimising similarity metrics between fixed and moved volumes. Among these methods, a recent subclass of approaches based on unsupervised keypoint detection stands out as very promising for interpretability. Specifically, these methods train a network to predict feature maps for fixed and moving images, from which explainable centres of mass are computed to obtain point clouds that are then aligned in closed form. However, the features returned by the network often yield spatially diffuse patterns that are hard to interpret, thus undermining the purpose of keypoint-based registration. Here, we propose a three-fold loss to regularise the spatial distribution of the features. First, we use the KL divergence to model features as point spread functions that we interpret as probabilistic keypoints. Then, we sharpen the spatial distributions of these features to increase the precision of the detected landmarks. Finally, we introduce a new repulsive loss across keypoints to encourage spatial diversity. Overall, our loss considerably improves the interpretability of the features, which now correspond to precise and anatomically meaningful landmarks. We demonstrate our three-fold loss in foetal rigid motion tracking and brain MRI affine registration tasks, where it not only outperforms state-of-the-art unsupervised strategies, but also bridges the gap with state-of-the-art supervised methods. Our code is available at https://github.com/BenBillot/spatial_regularisation.
中文摘要:本文提出了一种三重损失函数,通过锐化特征分布和促进空间多样性,显著提升了无监督关键点图像配准的可解释性,在医学影像任务中实现了与监督方法相媲美的性能。
English Summary: This paper introduces a three-fold loss function to enhance the interpretability of unsupervised keypoint-based image registration by sharpening feature distributions and promoting spatial diversity, achieving performance comparable to supervised methods in medical imaging tasks.
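A hypothetical sketch of two of the three regularisers: a sharpness term that penalises the spatial variance of each feature map around its centre of mass, and a repulsive term that pushes keypoints apart. The KL point-spread-function term, loss weights, and the paper's exact formulations are omitted.

```python
# Sharpness and repulsion regularisers over probabilistic keypoints derived from
# normalised feature maps (centres of mass over voxel coordinates).
import torch

def sharpness_loss(feature_maps, keypoints, coords):
    """feature_maps: (K, N) maps summing to 1 over N voxels; keypoints: (K, 3); coords: (N, 3)."""
    d2 = ((coords.unsqueeze(0) - keypoints.unsqueeze(1)) ** 2).sum(-1)   # (K, N) squared distances
    return (feature_maps * d2).sum(-1).mean()                            # expected spatial variance

def repulsive_loss(keypoints, margin: float = 5.0):
    """Penalise pairs of keypoints closer than `margin` (in voxel units)."""
    dists = torch.cdist(keypoints, keypoints)
    off_diag = ~torch.eye(keypoints.size(0), dtype=torch.bool)
    return torch.clamp(margin - dists[off_diag], min=0).mean()

coords = torch.rand(1000, 3) * 64                       # voxel coordinates
maps = torch.softmax(torch.randn(8, 1000), dim=-1)      # 8 normalised feature maps
kpts = maps @ coords                                    # centres of mass as keypoints
loss = sharpness_loss(maps, kpts, coords) + repulsive_loss(kpts)
```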

Authors:Hongyeob Kim, Inyoung Jung, Dayoon Suh, Youjia Zhang, Sangmin Lee, Sungeun Hong
Title: Question-Aware Gaussian Experts for Audio-Visual Question Answering
Abstract:
Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at https://aim-skku.github.io/QA-TIGER/
中文摘要:本文提出QA-TIGER框架,通过显式结合问题信息、采用高斯建模和专家混合方法模拟连续时间动态,在多个视听问答基准测试中实现了最先进的性能。
English Summary: The paper introduces QA-TIGER, a framework that explicitly integrates question information and models continuous temporal dynamics using Gaussian-based modeling and Mixture of Experts to achieve state-of-the-art performance on AVQA benchmarks.
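A hypothetical sketch of question-conditioned Gaussian temporal experts: each expert places a Gaussian over the frame axis with centre and width predicted from the question embedding, and a question-gated mixture re-weights the frame features. Dimensions and the single-layer heads are assumptions; the actual QA-TIGER design includes progressive refinement and richer audio-visual fusion.

```python
# Mixture of Gaussian temporal experts gated by the question embedding.
import torch
import torch.nn as nn

class GaussianTemporalExperts(nn.Module):
    def __init__(self, dim: int = 512, n_experts: int = 4):
        super().__init__()
        self.param_head = nn.Linear(dim, n_experts * 2)   # (mu, log sigma) per expert
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        """frames: (B, T, dim); question: (B, dim) -> question-focused video feature (B, dim)."""
        B, T, _ = frames.shape
        t = torch.linspace(0, 1, T, device=frames.device).view(1, 1, T)
        mu, log_sigma = self.param_head(question).view(B, -1, 2).unbind(-1)           # (B, E) each
        w = torch.exp(-0.5 * ((t - mu.unsqueeze(-1)) / log_sigma.exp().unsqueeze(-1)) ** 2)
        w = w / (w.sum(-1, keepdim=True) + 1e-8)                                      # (B, E, T)
        gate = torch.softmax(self.gate(question), dim=-1)                             # (B, E)
        weights = (gate.unsqueeze(-1) * w).sum(1)                                     # (B, T)
        return (weights.unsqueeze(-1) * frames).sum(1)

out = GaussianTemporalExperts()(torch.randn(2, 60, 512), torch.randn(2, 512))
```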

Authors:Chanda Grover Kamra, Indra Deep Mastan, Debayan Gupta
Title: ObjMST: An Object-Focused Multimodal Style Transfer Framework
Abstract:
We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments, using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: https://github.com/chandagrover/ObjMST.
Chinese: ObjMST是一种多模态风格迁移框架,通过引入风格特定掩码定向CLIP损失和显著对象到关键映射机制,分别对显著物体和背景元素进行风格监督,有效解决了多模态表示学习中的对齐和内容不匹配问题。
English: ObjMST is a multimodal style transfer framework that addresses alignment and content mismatch issues by providing separate style supervision for salient objects and surroundings through a Style-Specific Masked Directional CLIP Loss and a salient-to-key mapping mechanism.

Authors:Shen Zhang, Siyuan Liang, Yaning Tan, Zhaowei Chen, Linze Li, Ge Wu, Yuhao Chen, Shuheng Li, Zhenyu Zhao, Caihua Chen, Jiajun Liang, Yao Tang
Title: LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding
Abstract:
Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that explicit positional encodings (PE), such as RoPE, must be extrapolated to unseen positions, which degrades performance when the inference resolution differs from training. In this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT) to overcome this limitation. LEDiT needs no explicit PEs, thereby avoiding PE extrapolation. The key innovation of LEDiT lies in the use of causal attention. We demonstrate that causal attention can implicitly encode global positional information and show that such information facilitates extrapolation. We further introduce a locality enhancement module, which captures fine-grained local information to complement the global coarse-grained position information encoded by causal attention. Experimental results on both conditional and text-to-image generation tasks demonstrate that LEDiT supports up to 4x resolution scaling (e.g., from 256x256 to 512x512), achieving better image quality compared to state-of-the-art length extrapolation methods. We believe that LEDiT marks a departure from standard RoPE-based methods and offers promising insight into length extrapolation. Project page: https://shenzhang2145.github.io/ledit/
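A small sketch of the core mechanism as described, namely causal attention with no positional encoding over flattened patch tokens; the block below is a generic attention layer for illustration, not LEDiT's architecture, and omits the locality enhancement module.

```python
# Causal attention over patch tokens without any positional encoding: each
# token's position is implied by how many predecessors it can attend to, so
# nothing needs extrapolating at a higher inference resolution.
import torch
import torch.nn as nn

class CausalNoPEBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, N, dim) flattened image patches, no positional encoding added."""
        n = tokens.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # mask future tokens
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=causal)
        return self.norm(tokens + out)

block = CausalNoPEBlock()
y = block(torch.randn(2, 16 * 16, 256))        # 256 patch tokens at training resolution
y_hr = block(torch.randn(2, 32 * 32, 256))     # 4x more tokens at inference, same weights
```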

Authors:Cecilia Diana-Albelda, Roberto Alcover-Couso, Álvaro García-Martín, Jesus Bescos, Marcos Escudero-Viñolo
Title: GBT-SAM: Adapting a Foundational Deep Learning Model for Generalizable Brain Tumor Segmentation via Efficient Integration of Multi-Parametric MRI Data
Abstract:
Gliomas are aggressive brain tumors that require accurate imaging-based diagnosis, with segmentation playing a critical role in evaluating morphology and treatment decisions. Manual delineation of gliomas is time-consuming and prone to variability, motivating the use of deep learning to improve consistency and alleviate clinical workload. However, existing methods often fail to fully exploit the information available in multi-parametric MRI (mp-MRI), particularly inter-slice contextual features, and typically require considerable computational resources while lacking robustness across tumor type variations. We present GBT-SAM, a parameter-efficient deep learning framework that adapts the Segment Anything Model (SAM), a large-scale vision model, to volumetric mp-MRI data. GBT-SAM reduces input complexity by selecting fewer than 2.6% of slices per scan while incorporating all four MRI modalities, preserving essential tumor-related information with minimal cost. Furthermore, our model is trained by a two-step fine-tuning strategy that incorporates a depth-aware module to capture inter-slice correlations and lightweight adaptation layers, resulting in just 6.5M trainable parameters, which is the lowest among SAM-based approaches. GBT-SAM achieves a Dice Score of 93.54 on the BraTS Adult Glioma dataset and demonstrates robust performance on Meningioma, Pediatric Glioma, and Sub-Saharan Glioma datasets. These results highlight GBT-SAM's potential as a computationally efficient and domain-robust framework for brain tumor segmentation using mp-MRI. Our code and models are available at https://github.com/vpulab/med-sam-brain.
中文: GBT-SAM是一种参数高效的深度学习框架,通过改进Segment Anything模型实现多参数MRI中的脑肿瘤分割,仅需少量计算资源即可获得高精度,并在多种肿瘤类型上表现出稳定的性能。
English: GBT-SAM is a parameter-efficient deep learning framework that adapts the Segment Anything Model for brain tumor segmentation in multi-parametric MRI, achieving high accuracy with minimal computational resources and demonstrating robust performance across various tumor types.

Authors:Lars Bredereke, Yale Hartmann, Tanja Schultz
Title: A Modular Pipeline for 3D Object Tracking Using RGB Cameras
Abstract:
Object tracking is a key challenge of computer vision with various applications that all require different architectures. Most tracking systems have limitations such as constraining all movement to a 2D plane, and they often track only one object. In this paper, we present a new modular pipeline that calculates 3D trajectories of multiple objects. It is adaptable to various settings where multiple time-synced and stationary cameras record moving objects, using off-the-shelf webcams. Our pipeline was tested on the Table Setting Dataset, where participants are recorded with various sensors as they set a table with tableware objects. We need to track these manipulated objects, using six RGB webcams. Challenges include: detecting small objects in 9,874,699 camera frames, determining camera poses, discriminating between nearby and overlapping objects, temporary occlusions, and finally calculating a 3D trajectory using the right subset of an average of 11.12.456 pixel coordinates per 3-minute trial. We implement a robust pipeline that results in accurate trajectories, with the covariance of the x,y,z-position as a confidence metric. It deals dynamically with appearing and disappearing objects, instantiating new Extended Kalman Filters. It scales to hundreds of table-setting trials with very little human annotation input, even with the camera poses of each trial unknown. The code is available at https://github.com/LarsBredereke/object_tracking
中文: 本文提出了一种模块化流程,利用固定网络摄像头计算多个物体的三维轨迹,能有效应对小物体检测和遮挡等挑战,且所需人工标注极少。
English: This paper introduces a modular pipeline for calculating 3D trajectories of multiple objects using stationary webcams, effectively handling challenges like small object detection and occlusions with minimal human input.
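A minimal constant-velocity Kalman filter sketch (the linear special case of an EKF) showing how a new filter can be instantiated for each appearing object and how the state covariance doubles as a confidence estimate; the noise levels, frame rate, and detection format are illustrative assumptions.

```python
# One constant-velocity Kalman filter per tracked object, created on first detection.
import numpy as np

class ConstantVelocityKF:
    def __init__(self, xyz: np.ndarray, dt: float = 1 / 30):
        self.x = np.concatenate([xyz, np.zeros(3)])           # state: position + velocity
        self.P = np.eye(6)                                     # state covariance (confidence)
        self.F = np.eye(6); self.F[:3, 3:] = dt * np.eye(3)    # constant-velocity transition
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])      # only position is observed
        self.Q = 1e-3 * np.eye(6)                              # process noise
        self.R = 1e-2 * np.eye(3)                              # measurement noise

    def step(self, z: np.ndarray) -> np.ndarray:
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3]                                      # filtered 3D position

detections_per_frame = [{"cup": np.array([0.10, 0.20, 0.80])},
                        {"cup": np.array([0.11, 0.21, 0.80])}]
filters = {}
for detections in detections_per_frame:
    for obj, xyz in detections.items():
        if obj not in filters:                                 # object appeared: new filter
            filters[obj] = ConstantVelocityKF(xyz)
        filtered_xyz = filters[obj].step(xyz)
```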

Authors:Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Mingfang Zhang, Lijin Yang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, Limin Wang
Title: An Egocentric Vision-Language Model based Portable Real-time Smart Assistant
Abstract:
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. Including the frontend, backend, and models, all codes of Vinci are available at https://github.com/OpenGVLab/vinci.
中文: Vinci 是一款便携式视觉语言系统,通过结合 EgoVideo-VL 模型及记忆、生成和检索模块,可在多种设备上提供实时AI辅助,支持场景理解和技能学习等任务。
English: Vinci is a portable vision-language system that integrates the EgoVideo-VL model with memory, generation, and retrieval modules to deliver real-time AI assistance for tasks like scene understanding and skill acquisition across various devices.

Authors:Haitao Wu, Qing Li, Changqing Zhang, Zhen He, Xiaomin Ying
Title: Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior
Abstract:
Can our brain signals faithfully reflect the original visual stimuli, even including high-frequency details? Although human perceptual and cognitive capacities enable us to process and remember visual information, these abilities are constrained by several factors, such as limited attentional resources and the finite capacity of visual memory. When visual stimuli are processed by the human visual system into brain signals, some information is inevitably lost, leading to a discrepancy known as the System GAP. Additionally, perceptual and cognitive dynamics, along with technical noise in signal acquisition, degrade the fidelity of brain signals relative to the visual stimuli, known as the Random GAP. When encoded brain representations are directly aligned with the corresponding pretrained image features, the System GAP and Random GAP between paired data challenge the model, requiring it to bridge these gaps. However, in the context of limited paired data, these gaps are difficult for the model to learn, leading to overfitting and poor generalization to new data. To address these GAPs, we propose a simple yet effective approach called the Uncertainty-aware Blur Prior (UBP). It estimates the uncertainty within the paired data, reflecting the mismatch between brain signals and visual stimuli. Based on this uncertainty, UBP dynamically blurs the high-frequency details of the original images, reducing the impact of the mismatch and improving alignment. Our method achieves a top-1 accuracy of 50.9% and a top-5 accuracy of 79.7% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by margins of 13.7% and 9.8%, respectively. Code is available at https://github.com/HaitaoWuTJU/Uncertainty-aware-Blur-Prior.
Chinese: 该研究提出了一种不确定性感知模糊先验方法,通过基于不确定性动态模糊原始图像的高频细节,弥合了脑信号与视觉刺激之间的系统性和随机性差距,在脑信号到图像的检索任务中取得了顶尖的准确率。
English: The study introduces an Uncertainty-aware Blur Prior method to bridge the System and Random Gaps between brain signals and visual stimuli, achieving state-of-the-art accuracy in brain-to-image retrieval by dynamically blurring high-frequency details based on uncertainty.
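A hypothetical sketch of the blur prior: each image is blurred with a strength tied to an estimated per-pair uncertainty, here using a plain Gaussian blur whose sigma scales linearly with the uncertainty; the paper's uncertainty estimator and exact blurring schedule are not reproduced.

```python
# Uncertainty-scaled Gaussian blur applied per image before feature alignment.
import torch
from torchvision.transforms.functional import gaussian_blur

def uncertainty_blur(images: torch.Tensor, uncertainty: torch.Tensor,
                     max_sigma: float = 3.0) -> torch.Tensor:
    """images: (B, 3, H, W); uncertainty: (B,) values in [0, 1]."""
    out = []
    for img, u in zip(images, uncertainty):
        sigma = float(u) * max_sigma + 1e-3
        k = int(2 * round(3 * sigma) + 1)                  # odd kernel covering ~3 sigma
        out.append(gaussian_blur(img, kernel_size=[k, k], sigma=[sigma, sigma]))
    return torch.stack(out)

images = torch.rand(4, 3, 224, 224)
uncertainty = torch.tensor([0.1, 0.4, 0.7, 1.0])           # e.g. estimated mismatch per pair
blurred_targets = uncertainty_blur(images, uncertainty)     # used as softened alignment targets
```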

Authors:Yufang Liu, Yao Du, Tao Ji, Jianing Wang, Yang Liu, Yuanbin Wu, Aimin Zhou, Mengdi Zhang, Xunliang Cai
Title: The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights
Abstract:
Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. In testing leading models, their failure to detect these subtle visual differences suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning. Our benchmark and code will be available at https://github.com/Yufang-Liu/visual_modality_role.
Chinese: 当前多模态数学模型未能充分利用视觉信息,为此引入HC-M3D数据集以强化图像依赖性并揭示视觉感知的局限性,同时通用的视觉问答能力提升对数学推理并无助益。
English: Current multimodal mathematical models underutilize visual information, prompting the introduction of the HC-M3D dataset to enforce image reliance and reveal limitations in visual perception, while general VQA enhancements fail to improve math reasoning performance.

Authors:Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu
Title: Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
Abstract:
Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually causes MVL methods designed for specific combinations of views to lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in multi-view unsupervised clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate RML's effectiveness. Code is available at https://github.com/SubmissionsIn/RML.
中文:提出的鲁棒多视角学习方法RML通过基于Transformer的融合网络整合异构视角,并利用模拟扰动对比学习增强表征鲁棒性,在多种无监督和噪声标签任务中验证了其有效性。
English: The proposed robust multi-view learning method, RML, integrates heterogeneous views through a transformer-based fusion network and enhances representation robustness via simulated perturbation contrastive learning, demonstrating effectiveness in various unsupervised and noisy-label tasks.
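Note: a minimal PyTorch sketch of the alignment idea described above, where two fused representations of the same samples, produced under different simulated perturbations, are pulled together with a symmetric InfoNCE-style loss. The helper names (fusion_net, add_noise, mask_views) and the temperature are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn.functional as F

def perturbation_alignment_loss(z_noisy, z_masked, temperature=0.1):
    """z_noisy, z_masked: (batch, dim) fused representations of the same samples,
    obtained under two different simulated perturbations."""
    z1 = F.normalize(z_noisy, dim=1)
    z2 = F.normalize(z_masked, dim=1)
    logits = z1 @ z2.t() / temperature                      # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric InfoNCE: the two perturbed views of the same sample are positives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage:
# z_noisy = fusion_net(add_noise(views)); z_masked = fusion_net(mask_views(views))
# loss = perturbation_alignment_loss(z_noisy, z_masked)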

Authors:Senming Tan, Zhenyu Hou, Zhihao Zhang, Long Xu, Mengke Zhang, Zhaoqi He, Chao Xu, Fei Gao, Yanjun Cao
Title: Real-time Spatial-temporal Traversability Assessment via Feature-based Sparse Gaussian Process
Abstract:
Terrain analysis is critical for the practical application of ground mobile robots in real-world tasks, especially in outdoor unstructured environments. In this paper, we propose a novel spatial-temporal traversability assessment method, which aims to enable autonomous robots to effectively navigate through complex terrains. Our approach utilizes sparse Gaussian processes (SGP) to extract geometric features (curvature, gradient, elevation, etc.) directly from point cloud scans. These features are then used to construct a high-resolution local traversability map. Then, we design a spatial-temporal Bayesian Gaussian kernel (BGK) inference method to dynamically evaluate traversability scores, integrating historical and real-time data while considering factors such as slope, flatness, gradient, and uncertainty metrics. GPU acceleration is applied in the feature extraction step, and the system achieves real-time performance. Extensive simulation experiments across diverse terrain scenarios demonstrate that our method outperforms SOTA approaches in both accuracy and computational efficiency. Additionally, we develop an autonomous navigation framework integrated with the traversability map and validate it with a differential driven vehicle in complex outdoor environments. Our code will be open-source for further research and development by the community, https://github.com/ZJU-FAST-Lab/FSGP_BGK.
中文: 本文提出了一种新颖的时空可通行性评估方法,结合稀疏高斯过程和贝叶斯推理,实现了复杂户外地形中的实时自主导航,在精度和计算效率上均优于现有方法。
English: This paper introduces a novel spatial-temporal traversability assessment method using sparse Gaussian processes and Bayesian inference to enable real-time autonomous navigation in complex outdoor terrains, demonstrating superior accuracy and efficiency over existing approaches.
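For readers unfamiliar with Bayesian Generalized Kernel (BGK) inference, the sketch below shows the basic kernel-weighted Beta-posterior update commonly used for traversability or occupancy scores. It is a simplified, spatial-only illustration; the paper additionally fuses slope, flatness, gradient, uncertainty and temporal history, and the kernel length and priors here are assumptions:

import numpy as np

def sparse_kernel(dist, length):
    """Compactly supported kernel (Melkumyan-Ramos form) often used in BGK mapping."""
    r = np.clip(dist / length, 0.0, 1.0)
    k = (2.0 + np.cos(2.0 * np.pi * r)) / 3.0 * (1.0 - r) + np.sin(2.0 * np.pi * r) / (2.0 * np.pi)
    return np.where(dist < length, k, 0.0)

def bgk_traversability(query_xy, support_xy, support_scores, length=1.0, alpha0=1e-3, beta0=1e-3):
    """Kernel-weighted Bayesian update of a per-cell traversability score in [0, 1]."""
    d = np.linalg.norm(support_xy - query_xy, axis=1)
    w = sparse_kernel(d, length)
    alpha = alpha0 + np.sum(w * support_scores)        # evidence for "traversable"
    beta = beta0 + np.sum(w * (1.0 - support_scores))  # evidence against
    mean = alpha / (alpha + beta)
    var = mean * (1.0 - mean) / (alpha + beta + 1.0)   # Beta-posterior uncertainty
    return mean, var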

Authors:Haoran Wang, Lian Huai, Wenbin Li, Lei Qi, Xingqun Jiang, Yinghuan Shi
Title: WeakMedSAM: Weakly-Supervised Medical Image Segmentation via SAM with Sub-Class Exploration and Prompt Affinity Mining
Abstract:
We have witnessed remarkable progress in foundation models in vision tasks. Currently, several recent works have utilized the segmenting anything model (SAM) to boost the segmentation performance in medical images, where most of them focus on training an adaptor for fine-tuning a large amount of pixel-wise annotated medical images following a fully supervised manner. In this paper, to reduce the labeling cost, we investigate a novel weakly-supervised SAM-based segmentation model, namely WeakMedSAM. Specifically, our proposed WeakMedSAM contains two modules: 1) to mitigate severe co-occurrence in medical images, a sub-class exploration module is introduced to learn accurate feature representations. 2) to improve the quality of the class activation maps, our prompt affinity mining module utilizes the prompt capability of SAM to obtain an affinity map for random-walk refinement. Our method can be applied to any SAM-like backbone, and we conduct experiments with SAMUS and EfficientSAM. The experimental results on three popularly-used benchmark datasets, i.e., BraTS 2019, AbdomenCT-1K, and MSD Cardiac dataset, show the promising results of our proposed WeakMedSAM. Our code is available at https://github.com/wanghr64/WeakMedSAM.
中文摘要:本文提出WeakMedSAM这一新型弱监督分割模型,通过子类探索模块解决医学图像共现问题,并利用提示亲和力挖掘模块提升类别激活图质量,在三个主流医学数据集上验证了其降低标注成本的有效性。
English Summary: This paper introduces WeakMedSAM, a novel weakly-supervised segmentation model that reduces labeling costs in medical imaging by incorporating a sub-class exploration module to address co-occurrence issues and a prompt affinity mining module to enhance class activation maps using SAM's capabilities.

Authors:Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan
Title: Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments
Abstract:
Effective monitoring of underwater ecosystems is crucial for tracking environmental changes, guiding conservation efforts, and ensuring long-term ecosystem health. However, automating underwater ecosystem management with robotic platforms remains challenging due to the complexities of underwater imagery, which pose significant difficulties for traditional visual localization methods. We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes. Furthermore, we introduce the SQUIDLE+ VPR Benchmark-the first large-scale underwater VPR benchmark designed to leverage an extensive collection of unstructured data from multiple robotic platforms, spanning time intervals from days to years. The dataset encompasses diverse trajectories, arbitrary overlap and diverse seafloor types captured under varying environmental conditions, including differences in depth, lighting, and turbidity. Our code is available at: https://github.com/bev-gorry/underloc
中文: 提出的集成流程结合了视觉地点识别、特征匹配和图像分割技术,能够可靠识别水下区域并分析生态系统变化,同时引入了利用多平台广泛数据的SQUIDLE+新型基准数据集。
English: The proposed integrated pipeline combining Visual Place Recognition, feature matching, and image segmentation enables robust underwater area identification and ecosystem change analysis, supported by the novel SQUIDLE+ benchmark leveraging extensive multi-platform data.

Authors:Idris O. Sunmola, Zhenjun Zhao, Samuel Schmidgall, Yumeng Wang, Paul Maria Scheikl, Viet Pham, Axel Krieger
Title: Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels
Abstract:
Accurate geometric reconstruction of deformable tissues in monocular endoscopic video remains a fundamental challenge in robot-assisted minimally invasive surgery. Although recent volumetric and point primitive methods based on neural radiance fields (NeRF) and 3D Gaussian primitives have efficiently rendered surgical scenes, they still struggle with handling artifact-free tool occlusions and preserving fine anatomical details. These limitations stem from unrestricted Gaussian scaling and insufficient surface alignment constraints during reconstruction. To address these issues, we introduce Surgical Gaussian Surfels (SGS), which transform anisotropic point primitives into surface-aligned elliptical splats by constraining the scale component of the Gaussian covariance matrix along the view-aligned axis. We also introduce the Fully Fused Deformation Multilayer Perceptron (FFD-MLP), a lightweight Multi-Layer Perceptron (MLP) that predicts accurate surfel motion fields up to 5x faster than a standard MLP. This is coupled with locality constraints to handle complex tissue deformations. We use homodirectional view-space positional gradients to capture fine image details by splitting Gaussian Surfels in over-reconstructed regions. In addition, we define surface normals as the direction of the steepest density change within each Gaussian surfel primitive, enabling accurate normal estimation without requiring monocular normal priors. We evaluate our method on two in-vivo surgical datasets, where it outperforms current state-of-the-art methods in surface geometry, normal map quality, and rendering efficiency, while remaining competitive in real-time rendering performance. We make our code available at https://github.com/aloma85/SurgicalGaussianSurfels
中文: 本文提出的手术高斯表面元(SGS)方法通过将点基元转换为表面对齐的椭圆块并采用快速变形网络,显著提升了内窥镜手术中可变形组织的几何重建质量,在表面细节和效率方面优于现有技术。
English: This paper introduces Surgical Gaussian Surfels (SGS), a method that enhances geometric reconstruction of deformable tissues in endoscopic surgery by transforming point primitives into surface-aligned splats and incorporating a fast deformation network, outperforming existing techniques in surface detail and efficiency.
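As a rough illustration of the surfel idea above (constraining one axis of the Gaussian covariance so each primitive degenerates into a flat elliptical splat), the following sketch clamps one scale component and reads the surfel normal off the corresponding rotation column. This is a generic Gaussian-surfel construction, not the authors' exact view-aligned constraint:

import torch

def surfelize_scales(log_scales, flat_axis=2, eps=1e-6):
    """log_scales: (N, 3) per-primitive log-scales. Clamping one axis to (near) zero
    turns each anisotropic 3D Gaussian into a flat, elliptical surfel-like splat."""
    scales = torch.exp(log_scales).clone()
    scales[:, flat_axis] = eps
    return scales

def surfel_normals(rotations, flat_axis=2):
    """rotations: (N, 3, 3) primitive rotation matrices. For a flattened Gaussian the
    direction of steepest density change is the (rotated) flattened axis."""
    return rotations[:, :, flat_axis]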

Authors:Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
Title: PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Abstract:
With the rapid advancement of digitalization, various document images are being applied more extensively in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of the content in document images. Therefore, this report presents PP-DocBee, a novel multimodal large language model designed for end-to-end document image understanding. First, we develop a data synthesis strategy tailored to document scenarios in which we build a diverse dataset to improve the model generalization. Then, we apply a few training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies. Extensive evaluations demonstrate the superior performance of PP-DocBee, achieving state-of-the-art results on English document understanding benchmarks and even outperforming existing open source and commercial models in Chinese document understanding. The source code and pre-trained models are publicly available at \href{https://github.com/PaddlePaddle/PaddleMIX}{https://github.com/PaddlePaddle/PaddleMIX}.
中文: 本报告提出的PP-DocBee多模态大语言模型,通过针对性的数据合成与训练策略,在英文和中文文档图像理解任务中均实现了最优性能。
English: This report introduces PP-DocBee, a multimodal large language model that achieves state-of-the-art performance in both English and Chinese document image understanding through specialized data synthesis and training techniques.

Authors:Amin Karimi, Charalambos Poullis
Title: DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation
Abstract:
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-$5^{i}$ and COCO-$20^{i}$ demonstrate that our framework achieves state-of-the-art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios. The source code is available at \href{https://github.com/aminpdik/DSV-LFS}{https://github.com/aminpdik/DSV-LFS}.
中文: 本文提出DSV-LFS框架,通过大语言模型生成语义提示并结合密集像素匹配的视觉提示,显著提升了小样本语义分割在新类别上的泛化能力和性能,在基准测试中达到最优效果。
English: This paper introduces DSV-LFS, a novel few-shot semantic segmentation framework that leverages large language models and dense pixel-wise matching to generate semantic and visual prompts, achieving state-of-the-art performance and superior generalization on benchmark datasets.

Authors:Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
Title: RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce \textit{RetinalGPT}, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to enhance both retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms MLLM in the generic domain by a large margin in the diagnosis of retinal diseases in 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT
中文: 多模态大语言模型在通用领域表现出色,但在医学领域,尤其是视网膜图像解读方面存在不足;为此开发的RetinalGPT通过定量分析、疾病诊断和病灶定位,显著提升了临床应用的准确性和可解释性。
English: Multimodal Large Language Models (MLLMs) are advancing but lack specialized capabilities for precise medical diagnostics, particularly in interpreting retinal images, leading to the development of RetinalGPT, which excels in quantitative analysis, disease diagnosis, and lesion localization for enhanced clinical applications.

Authors:Jiangtong Zhu, Zhao Yang, Yinan Shi, Jianwu Fang, Jianru Xue
Title: IC-Mapper: Instance-Centric Spatio-Temporal Modeling for Online Vectorized Map Construction
Abstract:
Online vector map construction based on visual data can bypass the processes of data collection, post-processing, and manual annotation required by traditional map construction, which significantly enhances map-building efficiency. However, existing work treats the online mapping task as a local range perception task, overlooking the spatial scalability required for map construction. We propose IC-Mapper, an instance-centric online mapping framework, which comprises two primary components: 1) Instance-centric temporal association module: For the detection queries of adjacent frames, we measure them in both feature and geometric dimensions to obtain the matching correspondence between instances across frames. 2) Instance-centric spatial fusion module: We perform point sampling on the historical global map from a spatial dimension and integrate it with the detection results of instances corresponding to the current frame to achieve real-time expansion and update of the map. Based on the nuScenes dataset, we evaluate our approach on detection, tracking, and global mapping metrics. Experimental results demonstrate the superiority of IC-Mapper against other state-of-the-art methods. Code will be released on https://github.com/Brickzhuantou/IC-Mapper.
中文:IC-Mapper是一种以实例为中心的在线地图构建框架,通过时序关联和空间融合模块提升空间扩展能力,在检测、跟踪和全局地图指标上优于现有方法。
English: IC-Mapper is an instance-centric online mapping framework that enhances spatial scalability through temporal association and spatial fusion modules, outperforming existing methods on detection, tracking, and global mapping metrics.
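The temporal association step described above (matching instance queries of adjacent frames in both feature and geometric dimensions) can be illustrated with a simple weighted-cost Hungarian assignment. The weights, normalization, and threshold below are assumptions for illustration only:

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def associate_instances(prev_feats, curr_feats, prev_centers, curr_centers,
                        w_feat=0.5, w_geo=0.5, max_cost=1.0):
    """Match map-element instances across frames using a weighted sum of
    feature (cosine) distance and geometric (normalized Euclidean) distance."""
    f_prev = prev_feats / np.linalg.norm(prev_feats, axis=1, keepdims=True)
    f_curr = curr_feats / np.linalg.norm(curr_feats, axis=1, keepdims=True)
    cost_feat = 1.0 - f_prev @ f_curr.T                  # (M, N) cosine distance
    geo = cdist(prev_centers, curr_centers)              # Euclidean distance between centers
    cost_geo = geo / (geo.max() + 1e-8)                  # rough normalization to [0, 1]
    cost = w_feat * cost_feat + w_geo * cost_geo
    rows, cols = linear_sum_assignment(cost)
    # Keep only sufficiently cheap matches; unmatched detections become new instances.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]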

Authors:Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu
Title: EgoLife: Towards Egocentric Life Assistant
Abstract:
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.
中文: EgoLife项目通过开发AI智能眼镜和构建全面的EgoLife数据集,推出了集成EgoButler框架的自我中心生活助手系统,在自我中心视频理解方面达到领先水平,为日常生活提供智能化支持。
English: The EgoLife project introduces an AI-powered wearable glasses system and the comprehensive EgoLife Dataset, enabling advanced daily life assistance through the integrated EgoButler framework with state-of-the-art egocentric understanding capabilities.

Authors:Zanting Ye, Xiaolong Niu, Xu Han, Xuanbin Wu, Wantong Lu, Yijun Lu, Hao Sun, Yanchao Huang, Hubing Wu, Lijun Lu
Title: Self is the Best Learner: CT-free Ultra-Low-Dose PET Organ Segmentation via Collaborating Denoising and Segmentation Learning
Abstract:
Organ segmentation in Positron Emission Tomography (PET) plays a vital role in cancer quantification. Low-dose PET (LDPET) provides a safer alternative by reducing radiation exposure. However, the inherent noise and blurred boundaries make organ segmentation more challenging. Additionally, existing PET organ segmentation methods rely on coregistered Computed Tomography (CT) annotations, overlooking the problem of modality mismatch. In this study, we propose LDOS, a novel CT-free ultra-LDPET organ segmentation pipeline. Inspired by Masked Autoencoders (MAE), we reinterpret LDPET as a naturally masked version of Full-Dose PET (FDPET). LDOS adopts a simple yet effective architecture: a shared encoder extracts generalized features, while task-specific decoders independently refine outputs for denoising and segmentation. By integrating CT-derived organ annotations into the denoising process, LDOS improves anatomical boundary recognition and alleviates the PET/CT misalignments. Experiments demonstrate that LDOS achieves state-of-the-art performance with mean Dice scores of 73.11% (18F-FDG) and 73.97% (68Ga-FAPI) across 18 organs in 5% dose PET. Our code will be available at https://github.com/yezanting/LDOS.
中文: LDOS提出了一种无需CT的超低剂量PET器官分割方法,通过共享编码器和任务特定解码器增强去噪与分割效果,并结合CT标注改善边界识别与模态不匹配问题,实现了领先性能。
English: LDOS introduces a CT-free pipeline for ultra-low-dose PET organ segmentation, using a shared encoder and task-specific decoders to enhance denoising and segmentation while integrating CT-derived annotations to improve boundary recognition and address modality mismatches, achieving state-of-the-art results.
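A toy 2D stand-in for the shared-encoder / task-specific-decoder layout described above (the real pipeline operates on volumetric PET with a more elaborate architecture); channel sizes and the joint objective in the trailing comment are assumptions:

import torch
import torch.nn as nn

class SharedEncoderDualDecoder(nn.Module):
    """One encoder feeds a denoising head and a segmentation head."""
    def __init__(self, in_ch=1, feat=32, n_organs=18):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.denoise_head = nn.Conv2d(feat, in_ch, 3, padding=1)   # predicts full-dose PET
        self.seg_head = nn.Conv2d(feat, n_organs, 3, padding=1)    # predicts organ logits

    def forward(self, ldpet):
        z = self.encoder(ldpet)
        return self.denoise_head(z), self.seg_head(z)

# Hypothetical joint objective, assuming full-dose PET and organ labels are available:
# recon, logits = model(ld_pet)
# loss = F.mse_loss(recon, fd_pet) + F.cross_entropy(logits, organ_labels)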

Authors:Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel
Title: OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Abstract:
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
中文:OTTER是一种新颖的视觉-语言-动作模型,通过选择性提取与语言指令语义对齐的任务相关视觉特征,无需微调即可保持预训练语义对齐,展现出强大的零样本泛化能力。
English: OTTER is a novel Vision-Language-Action model that preserves pre-trained semantic alignments by selectively extracting task-relevant visual features aligned with language instructions, enabling strong zero-shot generalization without fine-tuning.
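One simplified reading of the text-aware visual feature extraction described above: with both encoders frozen, keep only the visual tokens most similar to the pooled instruction embedding before they reach the policy. The top-k rule and the value of k are assumptions, not the paper's exact selection mechanism:

import torch
import torch.nn.functional as F

def select_text_aware_tokens(visual_tokens, text_embedding, k=16):
    """visual_tokens: (B, N, D) patch features from a frozen vision encoder.
    text_embedding: (B, D) pooled instruction embedding from a frozen text encoder.
    Keeps only the k visual tokens most similar to the instruction."""
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embedding, dim=-1).unsqueeze(-1)    # (B, D, 1)
    sim = torch.bmm(v, t).squeeze(-1)                        # (B, N) cosine similarities
    topk = sim.topk(k, dim=1).indices                        # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return torch.gather(visual_tokens, 1, idx)               # (B, k, D) selected tokens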

Authors:Nianzu Yang, Pandeng Li, Liming Zhao, Yang Li, Chen-Wei Xie, Yehui Tang, Xudong Lu, Zhihang Liu, Yun Zheng, Yu Liu, Junchi Yan
Title: Rethinking Video Tokenization: A Conditioned Diffusion-based Approach
Abstract:
Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, its training process often relies on complex multi-stage training tricks that go beyond basic reconstruction loss and KL regularization. Among these tricks, the most challenging is the precise tuning of adversarial training with additional Generative Adversarial Networks (GANs) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer more stable training processes and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer, that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature cache mechanism to generate videos of arbitrary length while maintaining temporal continuity and adopt sampling acceleration technique to enhance efficiency. Trained using only a basic MSE diffusion loss for reconstruction, along with KL term and LPIPS perceptual loss from scratch, extensive experiments demonstrate that CDT achieves state-of-the-art performance in video reconstruction tasks with just a single-step sampling. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines. Moreover, the latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights are available at https://github.com/ali-vilab/CDT.
中文: 提出的CDT视频分词器采用条件扩散模型替代传统基于GAN的解码器,通过稳定训练和高效推理机制,无需复杂对抗性调优即可实现最先进的视频重建效果。
English: The proposed CDT video tokenizer replaces the traditional GAN-based decoder with a conditional diffusion model, achieving state-of-the-art video reconstruction through stable training and efficient inference mechanisms without complex adversarial tuning.

Authors:Zhao Yang, Zezhong Qian, Xiaofan Li, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, Longjun Liu
Title: DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance
Abstract:
Accurate and high-fidelity driving scene reconstruction demands the effective utilization of comprehensive scene information as conditional inputs. Existing methods predominantly rely on 3D bounding boxes and BEV road maps for foreground and background control, which fail to capture the full complexity of driving scenes and adequately integrate multimodal information. In this work, we present DualDiff, a dual-branch conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. Specifically, we introduce Occupancy Ray-shape Sampling (ORS) as a conditional input, offering rich foreground and background semantics alongside 3D spatial geometry to precisely control the generation of both elements. To improve the synthesis of fine-grained foreground objects, particularly complex and distant ones, we propose a Foreground-Aware Mask (FGM) denoising loss function. Additionally, we develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise, enabling more effective multimodal fusion. Finally, to ensure high-quality image-to-video generation, we introduce the Reward-Guided Diffusion (RGD) framework, which maintains global consistency and semantic coherence in generated videos. Extensive experiments demonstrate that DualDiff achieves state-of-the-art (SOTA) performance across multiple datasets. On the NuScenes dataset, DualDiff reduces the FID score by 4.09% compared to the best baseline. In downstream tasks, such as BEV segmentation, our method improves vehicle mIoU by 4.50% and road mIoU by 1.70%, while in BEV 3D object detection, the foreground mAP increases by 1.46%. Code will be made available at https://github.com/yangzhaojason/DualDiff.
中文: DualDiff采用双分支扩散模型,通过占用射线采样和语义融合注意力等创新技术提升多模态驾驶场景生成质量,在多项任务中实现最优性能。
English: DualDiff introduces a dual-branch diffusion model with novel techniques like Occupancy Ray-shape Sampling and Semantic Fusion Attention to enhance multimodal driving scene generation, achieving state-of-the-art performance in reconstruction and downstream tasks.
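A plausible minimal form of a foreground-aware denoising objective like the FGM loss mentioned above: the standard diffusion noise-prediction MSE is re-weighted by a foreground mask so that small or distant objects contribute more. The weighting scheme is an assumption for illustration, not the paper's exact loss:

import torch
import torch.nn.functional as F

def foreground_aware_mse(noise_pred, noise_gt, fg_mask, fg_weight=2.0):
    """noise_pred, noise_gt: (B, C, H, W) predicted and true diffusion noise.
    fg_mask: (B, 1, H, W) binary mask of foreground regions (e.g. projected boxes)."""
    per_pixel = F.mse_loss(noise_pred, noise_gt, reduction="none")
    weights = 1.0 + (fg_weight - 1.0) * fg_mask   # 1 in background, fg_weight in foreground
    return (weights * per_pixel).mean()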

Authors:Wei Li, Bing Hu, Rui Shao, Leyang Shen, Liqiang Nie
Title: LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Abstract:
First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1) Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2) Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
中文摘要:提出的LION-FS系统通过双路径优化策略,在快速路径中实现动态特征融合与冗余消除,在慢速路径中通过多粒度特征增强与思维模板,成功解决了在线视频助手效能与效率的权衡问题。
English Summary: The proposed LION-FS system overcomes the trade-off between efficacy and efficiency in online video assistants through a dual-path optimization strategy, achieving real-time performance with temporally and contextually precise responses.

Authors:Rui Zhao, Weijia Mao, Mike Zheng Shou
Title: DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
Abstract:
Adapting generative models to specific domains presents an effective solution for satisfying specialized requirements. However, adapting to some complex domains remains challenging, especially when these domains require substantial paired data to capture the targeted distributions. Since unpaired data from a single modality, such as vision or language, is more readily available, we utilize the bidirectional mappings between vision and language learned by the unified generative model to enable training on unpaired data for domain adaptation. Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. The model is optimized through cross-entropy loss computed at the cycle endpoints, where both endpoints share the same modality. This facilitates self-evolution of the model without reliance on annotated text-image pairs. Experimental results demonstrate that for tasks independent of paired knowledge, such as stylization, DoraCycle can effectively adapt the unified model using only unpaired data. For tasks involving new paired knowledge, such as specific identities, a combination of a small set of paired image-text examples and larger-scale unpaired data is sufficient for effective domain-oriented adaptation. The code will be released at https://github.com/showlab/DoraCycle.
中文: DoraCycle通过双向多模态循环利用非配对数据实现生成模型的领域适应,仅用非配对数据即可有效完成风格化任务,而结合少量配对数据还能处理特定身份任务。
English: DoraCycle enables domain adaptation of generative models using unpaired data through bidirectional multimodal cycles, achieving effective stylization with unpaired data alone and handling identity-specific tasks with minimal paired examples.

Authors:Xiaojun Bi, Shuo Li, Junyao Xing, Ziyue Wang, Fuwen Luo, Weizheng Qiao, Lu Han, Ziwei Sun, Peng Li, Yang Liu
Title: DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms
Abstract:
The Dongba script is the only pictographic script still in use in the world. Its pictorial ideographic features carry rich cultural and contextual information. However, due to the lack of relevant datasets, research on semantic understanding of Dongba pictographs has progressed slowly. To this end, we constructed \textbf{DongbaMIE}, the first dataset focusing on multimodal information extraction of Dongba pictographs. The dataset consists of images of Dongba pictographic characters and their corresponding semantic annotations in Chinese. It contains 23,530 sentence-level and 2,539 paragraph-level high-quality text-image pairs. The annotations cover four semantic dimensions: object, action, relation and attribute. Systematic evaluation of mainstream multimodal large language models shows that these models struggle to perform information extraction on Dongba pictographs efficiently under zero-shot and few-shot learning. Although supervised fine-tuning can improve performance, accurate extraction of complex semantics remains a great challenge.
中文: 为解决东巴象形文字语义理解研究的数据匮乏问题,我们构建了首个多模态数据集DongbaMIE,包含26,069个高质量图文对,但实验表明现有模型在零样本和微调场景下仍难以实现精确的语义信息抽取。
English: The DongbaMIE dataset, the first multimodal resource for Dongba pictographs, was created to advance semantic understanding by providing 26,069 annotated text-image pairs, yet current models struggle with accurate information extraction even after fine-tuning.

Authors:Woo-Jin Jung, Dong-Hee Paek, Seung-Hyun Kong
Title: L2RDaS: Synthesizing 4D Radar Tensors for Model Generalization via Dataset Expansion
Abstract:
4-dimensional (4D) radar is increasingly adopted in autonomous driving for perception tasks, owing to its robustness under adverse weather conditions. To better utilize the spatial information inherent in 4D radar data, recent deep learning methods have transitioned from using sparse point cloud to 4D radar tensors. However, the scarcity of publicly available 4D radar tensor datasets limits model generalization across diverse driving scenarios. Previous methods addressed this by synthesizing radar data, but the outputs did not fully exploit the spatial information characteristic of 4D radar. To overcome these limitations, we propose LiDAR-to-4D radar data synthesis (L2RDaS), a framework that synthesizes spatially informative 4D radar tensors from LiDAR data available in existing autonomous driving datasets. L2RDaS integrates a modified U-Net architecture to effectively capture spatial information and an object information supplement (OBIS) module to enhance reflection fidelity. This framework enables the synthesis of radar tensors across diverse driving scenarios without additional sensor deployment or data collection. L2RDaS improves model generalization by expanding real datasets with synthetic radar tensors, achieving an average increase of 4.25\% in ${{AP}_{BEV}}$ and 2.87\% in ${{AP}_{3D}}$ across three detection models. Additionally, L2RDaS supports ground-truth augmentation (GT-Aug) by embedding annotated objects into LiDAR data and synthesizing them into radar tensors, resulting in further average increases of 3.75\% in ${{AP}_{BEV}}$ and 4.03\% in ${{AP}_{3D}}$. The implementation will be available at https://github.com/kaist-avelab/K-Radar.
Chinese: L2RDaS框架通过从现有LiDAR数据合成具有丰富空间信息的4D雷达张量,无需额外采集数据即可提升自动驾驶感知模型的泛化能力和检测精度。
English: The L2RDaS framework synthesizes spatially rich 4D radar tensors from LiDAR data to enhance autonomous driving perception, improving model generalization and detection accuracy across diverse scenarios without additional data collection.

Authors:Songlong Xing, Zhengyu Zhao, Nicu Sebe
Title: CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Abstract:
Despite its prevalent use in image-text matching tasks in a zero-shot manner, CLIP has been shown to be highly vulnerable to adversarial perturbations added onto images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed as zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to `falsely stable' images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, which is orthogonal to existing methods aiming to boost zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code is available \href{https://github.com/Sxing2/CLIP-Test-time-Counterattacks}{here}.
中文: 本文提出一种无需训练的防御方法,利用CLIP预训练视觉编码器在推理阶段对对抗图像进行反制,在多个数据集上实现稳健性能且不影响正常图像的识别精度。
English: This paper introduces a training-free defense method that uses CLIP's pre-trained vision encoder to counterattack adversarial images during inference, achieving robust performance across multiple datasets without compromising clean image accuracy.
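A hedged sketch of what a test-time counterattack can look like under the "falsely stable" intuition above: perturb the incoming image within a small L-infinity ball so that its embedding under the frozen vision encoder moves as far as possible from its current (possibly adversarially pinned) location. The objective, step sizes, and budget are assumptions; the authors' exact procedure may differ:

import torch

def testtime_counterattack(image, encode_image, steps=2, eps=4.0 / 255.0, alpha=1.0 / 255.0):
    """image: (B, C, H, W) in [0, 1]; encode_image: frozen CLIP-style vision encoder.
    Perturbs the input to maximise how far its embedding moves, on the intuition that
    adversarially attacked inputs are 'falsely stable' in feature space."""
    with torch.no_grad():
        anchor = encode_image(image)                       # embedding of the incoming input
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        feat = encode_image((image + delta).clamp(0, 1))
        loss = (feat - anchor).pow(2).sum(dim=-1).mean()   # push away from the anchor embedding
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()             # PGD-style step
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (image + delta.detach()).clamp(0, 1)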

Authors:Wenqiao Li, Yao Gu, Xintao Chen, Xiaohao Xu, Ming Hu, Xiaonan Huang, Yingna Wu
Title: Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
Abstract:
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our project is available at https://guyao2023.github.io/Phys-AD/.
Chinese Summary: 人类通过感知、交互和基于物体条件的物理知识推理来检测现实世界中的物体异常,而当前工业异常检测方法因依赖静态数据集难以应对真实场景,为此我们推出了Phys-AD数据集,通过基于物理的视频数据填补这一空白。
English Summary: Humans detect object anomalies through perception, interaction, and physical reasoning, while current industrial anomaly detection methods fall short in real-world scenarios due to their reliance on static datasets, prompting the introduction of the Phys-AD dataset to bridge this gap with physics-grounded video data.

Authors:Po-Chien Luan, Yang Gao, Celine Demonsant, Alexandre Alahi
Title: Unified Human Localization and Trajectory Prediction with Monocular Vision
Abstract:
Conventional human trajectory prediction models rely on clean curated data, requiring specialized equipment or manual labeling, which is often impractical for robotic applications. The existing predictors tend to overfit to clean observation affecting their robustness when used with noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird's Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person using 2D human poses, enhanced by a novel directional loss for smoother sequential localizations. The trajectory prediction module predicts future motion from these estimates. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios made of noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on BEV localization and trajectory prediction. On real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at https://github.com/vita-epfl/MonoTransmotion.
中文:提出的MonoTransmotion框架通过单目摄像头联合实现定位与轨迹预测,在干净数据和真实噪声场景中均展现出优于基准模型的鲁棒性和泛化能力。
English: The proposed MonoTransmotion framework jointly performs localization and trajectory prediction using a monocular camera, demonstrating enhanced robustness and generalization with significant improvements over baselines in both curated and noisy real-world datasets.
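The "directional loss" above is not specified in the abstract; the following is an illustrative guess at such a regularizer, penalizing abrupt direction changes between consecutive estimated BEV displacements. Treat it purely as a sketch, not the paper's definition:

import torch
import torch.nn.functional as F

def directional_smoothness_loss(positions, eps=1e-6):
    """positions: (B, T, 2) sequential BEV localizations of a person.
    Penalizes abrupt direction changes between consecutive displacement vectors."""
    disp = positions[:, 1:] - positions[:, :-1]                       # (B, T-1, 2)
    cos = F.cosine_similarity(disp[:, 1:], disp[:, :-1], dim=-1, eps=eps)
    return (1.0 - cos).mean()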

Authors:Linyu Fan, Che Wang, Ming Ye, Qizhi Yang, Zejun Wu, Xinghao Ding, Yue Huang, Jianfeng Bao, Shuhui Cai, Congbo Cai
Title: Bridging Synthetic-to-Real Gaps: Frequency-Aware Perturbation and Selection for Single-shot Multi-Parametric Mapping Reconstruction
Abstract:
Data-centric artificial intelligence (AI) has remarkably advanced medical imaging, with emerging methods using synthetic data to address data scarcity while introducing synthetic-to-real gaps. Unsupervised domain adaptation (UDA) shows promise in ground truth-scarce tasks, but its application in reconstruction remains underexplored. Although multiple overlapping-echo detachment (MOLED) achieves ultra-fast multi-parametric reconstruction, extending its application to various clinical scenarios, the quality suffers from deficiency in mitigating the domain gap, difficulty in maintaining structural integrity, and inadequacy in ensuring mapping accuracy. To resolve these issues, we proposed frequency-aware perturbation and selection (FPS), comprising Wasserstein distance-modulated frequency-aware perturbation (WDFP) and hierarchical frequency-aware selection network (HFSNet), which integrates frequency-aware adaptive selection (FAS), compact FAS (cFAS) and feature-aware architecture integration (FAI). Specifically, perturbation activates domain-invariant feature learning within uncertainty, while selection refines optimal solutions within perturbation, establishing a robust and closed-loop learning pathway. Extensive experiments on synthetic data, along with diverse real clinical cases from 5 healthy volunteers, 94 ischemic stroke patients, and 46 meningioma patients, demonstrate the superiority and clinical applicability of FPS. Furthermore, FPS is applied to diffusion tensor imaging (DTI), underscoring its versatility and potential for broader medical applications. The code is available at https://github.com/flyannie/FPS.
中文摘要:提出的频率感知扰动与选择(FPS)方法通过结合扰动和选择机制,有效解决了医学影像重建中的领域适应问题,在多种临床数据集中展现出卓越性能。
English Summary: The proposed Frequency-aware Perturbation and Selection (FPS) method effectively addresses domain adaptation challenges in medical imaging reconstruction by combining perturbation and selection mechanisms, demonstrating superior performance across diverse clinical datasets.

Authors:Xiaoyong Lu, Songlin Du
Title: JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba
Abstract:
Existing state-of-the-art feature matchers capture long-range dependencies with Transformers but are hindered by high spatial complexity, leading to demanding training and high-latency inference. Striking a better balance between performance and efficiency remains a challenge in feature matching. Inspired by the linear complexity O(N) of Mamba, we propose an ultra-lightweight Mamba-based matcher, named JamMa, which converges on a single GPU and achieves an impressive performance-efficiency balance in inference. To unlock the potential of Mamba for feature matching, we propose Joint Mamba with a scan-merge strategy named JEGO, which enables: (1) Joint scan of two images to achieve high-frequency mutual interaction, (2) Efficient scan with skip steps to reduce sequence length, (3) Global receptive field, and (4) Omnidirectional feature representation. With the above properties, the JEGO strategy significantly outperforms the scan-merge strategies proposed in VMamba and EVMamba in the feature matching task. Compared to attention-based sparse and semi-dense matchers, JamMa demonstrates a superior balance between performance and efficiency, delivering better performance with less than 50% of the parameters and FLOPs.

Authors:Junhao Xu, Yanan Zhang, Zhi Cai, Di Huang
Title: CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization
Abstract:
Multi-agent collaborative perception enhances perceptual capabilities by utilizing information from multiple agents and is considered a fundamental solution to the problem of weak single-vehicle perception in autonomous driving. However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization, dubbed CoSDH. By modeling the supply-demand relationship between agents, the framework refines the selection of collaboration regions, reducing unnecessary communication cost while maintaining accuracy. In addition, we innovatively introduce the intermediate-late hybrid collaboration mode, where late-stage collaboration compensates for the performance degradation in collaborative perception under low communication bandwidth. Extensive experiments on multiple datasets, including both simulated and real-world scenarios, demonstrate that CoSDH achieves state-of-the-art detection accuracy and optimal bandwidth trade-offs, delivering superior detection precision under real communication bandwidths, thus proving its effectiveness and practical applicability. The code will be released at https://github.com/Xu2729/CoSDH.
中文: 提出的CoSDH框架通过供需感知优化协作区域,并采用中晚期混合协作模式,在降低通信成本的同时保持感知精度,在真实带宽条件下实现了最优性能表现。
English: The proposed CoSDH framework enhances multi-agent collaborative perception by optimizing collaboration regions through supply-demand awareness and employing an intermediate-late hybrid approach to maintain accuracy while reducing communication costs, achieving state-of-the-art performance in real bandwidth conditions.

Authors:Juan Miguel Valverde, Motoya Koga, Nijihiko Otsuka, Anders Bjorholm Dahl
Title: TopoMortar: A dataset to evaluate image segmentation methods focused on topology accuracy
Abstract:
We present TopoMortar, a brick wall dataset that is the first dataset specifically designed to evaluate topology-focused image segmentation methods, such as topology loss functions. Motivated by the known sensitivity of methods to dataset challenges, such as small training sets, noisy labels, and out-of-distribution test-set images, TopoMortar is created to enable investigating methods' effectiveness at improving topology accuracy in two ways. First, by eliminating dataset challenges that, as we show, impact the effectiveness of topology loss functions. Second, by allowing different dataset challenges to be represented in the same dataset, isolating methods' performance from those challenges. TopoMortar includes three types of labels (accurate, pseudo-labels, and noisy labels), two fixed training sets (large and small), and in-distribution and out-of-distribution test-set images. We compared eight loss functions on TopoMortar, and we found that clDice achieved the most topologically accurate segmentations, and that the relative advantageousness of the other loss functions depends on the experimental setting. Additionally, we show that data augmentation and self-distillation can elevate Cross entropy Dice loss to surpass most topology loss functions, and that those simple methods can enhance topology loss functions as well. TopoMortar and our code can be found at https://jmlipman.github.io/TopoMortar
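Since clDice is reported above as the most topologically accurate loss, here is a standard 2D soft-clDice sketch (differentiable skeletonization via min/max pooling, then a centerline Dice). It follows the widely used recipe rather than the benchmark's own code; the iteration count is an assumption, and in practice the term is usually combined with a region loss such as soft Dice:

import torch
import torch.nn.functional as F

def _soft_erode(x):  return -F.max_pool2d(-x, 3, stride=1, padding=1)
def _soft_dilate(x): return  F.max_pool2d( x, 3, stride=1, padding=1)
def _soft_open(x):   return _soft_dilate(_soft_erode(x))

def soft_skeleton(x, iters=10):
    """Differentiable morphological skeleton of a soft mask (B, 1, H, W)."""
    skel = F.relu(x - _soft_open(x))
    for _ in range(iters):
        x = _soft_erode(x)
        delta = F.relu(x - _soft_open(x))
        skel = skel + F.relu(delta - skel * delta)
    return skel

def soft_cldice_loss(pred, target, iters=10, eps=1e-6):
    """pred, target: (B, 1, H, W) soft foreground probabilities in [0, 1]."""
    skel_p, skel_t = soft_skeleton(pred, iters), soft_skeleton(target, iters)
    tprec = ((skel_p * target).sum() + eps) / (skel_p.sum() + eps)   # topology precision
    tsens = ((skel_t * pred).sum() + eps) / (skel_t.sum() + eps)     # topology sensitivity
    return 1.0 - 2.0 * tprec * tsens / (tprec + tsens)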

Authors:Guoyu Yang, Yuan Wang, Daming Shi, Yanzhong Wang
Title: Golden Cudgel Network for Real-Time Semantic Segmentation
Abstract:
Recent real-time semantic segmentation models, whether single-branch or multi-branch, achieve good performance and speed. However, their speed is limited by multi-path blocks, and some depend on high-performance teacher models for training. To overcome these issues, we propose Golden Cudgel Network (GCNet). Specifically, GCNet uses vertical multi-convolutions and horizontal multi-paths for training, which are reparameterized into a single convolution for inference, optimizing both performance and speed. This design allows GCNet to self-enlarge during training and self-contract during inference, effectively becoming a "teacher model" without needing external ones. Experimental results show that GCNet outperforms existing state-of-the-art models in terms of performance and speed on the Cityscapes, CamVid, and Pascal VOC 2012 datasets. The code is available at https://github.com/gyyang23/GCNet.
Chinese: 金箍棒网络(GCNet)提出了一种创新架构,在训练时采用垂直多卷积和水平多路径设计,并在推理时重参数化为单一卷积,无需外部教师模型即可自我增强,在多个基准数据集上实现了卓越的性能和速度。
English: The Golden Cudgel Network (GCNet) introduces a novel architecture that utilizes vertical multi-convolutions and horizontal multi-paths during training, which are reparameterized into a single convolution for inference, enabling self-enhancement without external teacher models and achieving superior performance and speed on benchmark datasets.
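GCNet's "self-enlarge then self-contract" behavior relies on structural reparameterization: parallel convolution branches used during training are folded into a single convolution for inference. The sketch below shows the core linearity argument for plain 3x3 branches plus an optional identity shortcut; BatchNorm fusion and GCNet's specific vertical/horizontal layout are omitted, so this is an illustration of the mechanism rather than the paper's exact scheme:

import torch
import torch.nn as nn

@torch.no_grad()
def merge_parallel_convs(convs, include_identity=False):
    """Fold parallel 3x3 Conv2d branches (same in/out channels, stride 1, padding 1,
    plus an optional identity shortcut) into one equivalent Conv2d."""
    ref = convs[0]
    merged = nn.Conv2d(ref.in_channels, ref.out_channels, 3, padding=1, bias=True)
    weight = sum(c.weight for c in convs)
    bias = sum(c.bias if c.bias is not None else torch.zeros(ref.out_channels) for c in convs)
    if include_identity:
        # The identity map equals a 3x3 kernel with a centered 1 on the channel diagonal.
        eye = torch.zeros_like(ref.weight)
        for i in range(ref.out_channels):
            eye[i, i, 1, 1] = 1.0
        weight = weight + eye
    merged.weight.copy_(weight)
    merged.bias.copy_(bias)
    return merged

# Sanity check of the equivalence:
# x = torch.randn(1, 16, 32, 32); branches = [nn.Conv2d(16, 16, 3, padding=1) for _ in range(3)]
# y_train = sum(b(x) for b in branches) + x
# y_infer = merge_parallel_convs(branches, include_identity=True)(x)
# assert torch.allclose(y_train, y_infer, atol=1e-5)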

Authors:Ji Zhao, Banglei Guan, Zibin Liu, Laurent Kneip
Title: Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers
Abstract:
For event cameras, current sparse geometric solvers for egomotion estimation assume that the rotational displacements are known, such as those provided by an IMU. Thus, they can only recover the translational motion parameters. Recovering full-DoF motion parameters using a sparse geometric solver is a more challenging task, and has not yet been investigated. In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. Our method leverages event manifolds induced by line segments. The problem formulations are based on either an incidence relation for lines or a novel coplanarity relation for normal vectors. We demonstrate the possibility of recovering full-DoF egomotion parameters for both angular and linear velocities without requiring extra sensor measurements or motion priors. To achieve efficient optimization, we exploit the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on both synthetic and real-world data demonstrate the effectiveness of our method. The code is available at https://github.com/jizhaox/relpose-event.
中文: 本文提出多种求解器,利用线段产生的事件流形,在统一框架下估计事件相机的旋转和平移速度,无需额外传感器或运动先验即可实现全自由度运动估计。
English: This paper introduces novel solvers that estimate both rotational and translational velocities for event cameras using event manifolds from line segments, enabling full-DoF motion recovery without external sensors or motion priors.

Authors:Li Lun, Kunyu Feng, Qinglong Ni, Ling Liang, Yuan Wang, Ying Li, Dunshan Yu, Xiaoxin Cui
Title: Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients
Abstract:
Spiking neural networks (SNNs) have shown their competence in handling spatial-temporal event-based data with low energy consumption. Similar to conventional artificial neural networks (ANNs), SNNs are also vulnerable to gradient-based adversarial attacks, wherein gradients are calculated by spatial-temporal back-propagation (STBP) and surrogate gradients (SGs). However, the SGs may be invisible for an inference-only model as they do not influence the inference results, and current gradient-based attacks are ineffective for binary dynamic images captured by the dynamic vision sensor (DVS). While some approaches addressed the issue of invisible SGs through universal SGs, their SGs lack a correlation with the victim model, resulting in sub-optimal performance. Moreover, the imperceptibility of existing SNN-based binary attacks is still insufficient. In this paper, we introduce an innovative potential-dependent surrogate gradient (PDSG) method to establish a robust connection between the SG and the model, thereby enhancing the adaptability of adversarial attacks across various models with invisible SGs. Additionally, we propose the sparse dynamic attack (SDA) to effectively attack binary dynamic images. Utilizing a generation-reduction paradigm, SDA can fully optimize the sparsity of adversarial perturbations. Experimental results demonstrate that our PDSG and SDA outperform state-of-the-art SNN-based attacks across various models and datasets. Specifically, our PDSG achieves 100% attack success rate on ImageNet, and our SDA obtains 82% attack success rate by modifying only 0.24% of the pixels on CIFAR10DVS. The code is available at https://github.com/ryime/PDSG-SDA .
中文: 本文提出了一种基于电位依赖的替代梯度方法和稀疏动态攻击,通过解决不可见替代梯度问题并优化扰动稀疏性,显著提升了针对脉冲神经网络的对抗攻击效果,在多种模型和数据集上实现了最优性能。
English: This paper introduces a potential-dependent surrogate gradient (PDSG) method and sparse dynamic attack (SDA) to enhance adversarial attacks on spiking neural networks, achieving superior performance across models and datasets by addressing invisible surrogate gradients and optimizing perturbation sparsity.
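For context on the "invisible surrogate gradient" problem above: gradient-based SNN attacks must substitute some differentiable surrogate for the non-differentiable spike. The sketch below shows the generic rectangular surrogate implemented as a custom autograd Function; PDSG's potential-dependent shaping is not reproduced here, and the threshold and window width are assumptions:

import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; a rectangular surrogate gradient in the
    backward pass (the generic mechanism that gradient-based SNN attacks rely on)."""
    @staticmethod
    def forward(ctx, membrane_potential, threshold=1.0, width=1.0):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold, ctx.width = threshold, width
        return (membrane_potential >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        # Gradient is 1/width inside a window around the threshold, zero elsewhere.
        window = (torch.abs(u - ctx.threshold) < ctx.width / 2).float()
        return grad_output * window / ctx.width, None, None

# Usage: spikes = SurrogateSpike.apply(membrane_potential)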

Authors:Ping Chen, Xingpeng Zhang, Zhaoxiang Liu, Huan Hu, Xiang Liu, Kai Wang, Min Wang, Yanlin Qian, Shiguo Lian
Title: Optimizing for the Shortest Path in Denoising Diffusion Model
Abstract:
In this research, we propose a novel denoising diffusion model based on shortest-path modeling that optimizes residual propagation to enhance both denoising efficiency and quality. Drawing on Denoising Diffusion Implicit Models (DDIM) and insights from graph theory, our model, termed the Shortest Path Diffusion Model (ShortDF), treats the denoising process as a shortest-path problem aimed at minimizing reconstruction error. By optimizing the initial residuals, we improve the efficiency of the reverse diffusion process and the quality of the generated samples. Extensive experiments on multiple standard benchmarks demonstrate that ShortDF significantly reduces diffusion time (or steps) while enhancing the visual fidelity of generated samples compared to prior arts. This work, we suppose, paves the way for interactive diffusion-based applications and establishes a foundation for rapid data generation. Code is available at https://github.com/UnicomAI/ShortDF.
中文: 本研究提出了一种基于最短路径建模的ShortDF模型,通过优化残差传播提高了去噪效率和样本质量,显著减少了扩散步骤并增强了生成样本的视觉保真度。
English: This study introduces the Shortest Path Diffusion Model (ShortDF), which optimizes residual propagation to enhance denoising efficiency and quality, significantly reducing diffusion steps while improving sample fidelity.

Authors:Gangwei Xu, Jiaxin Liu, Xianqi Wang, Junda Cheng, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang
Title: BANet: Bilateral Aggregation Network for Mobile Stereo Matching
Abstract:
State-of-the-art stereo matching methods typically use costly 3D convolutions to aggregate a full cost volume, but their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation often results in edge blurring, detail loss, and mismatches in textureless regions. Some complex operations, like deformable convolutions and iterative warping, can partially alleviate this issue; however, they are not mobile-friendly, limiting their deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregations accordingly, ultimately fusing both to obtain the final disparity map. Experimental results demonstrate that our BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3\% higher accuracy on the KITTI 2015 leaderboard than MobileStereoNet-2D, with faster runtime on mobile devices. Code: \textcolor{magenta}{https://github.com/gangweix/BANet}.
中文: 本文提出BANet,一种仅使用二维卷积的移动端立体匹配网络,通过分别聚合细节和平滑代价体积实现清晰边缘与精细细节,在KITTI 2015数据集上以35.3%更高精度超越其他方法,且在移动设备上运行更快。
English: This paper introduces BANet, a mobile-friendly stereo matching network that uses only 2D convolutions to achieve sharp edges and fine details by separately aggregating detailed and smooth cost volumes, outperforming other methods with 35.3% higher accuracy on KITTI 2015 while running faster on mobile devices.
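A minimal module illustrating the bilateral idea above: a spatial attention map splits a 2D cost/feature volume into a detail-dominated part and a smooth part, each aggregated with its own 2D convolutions before fusion. Channel sizes, kernel sizes, and the single-layer aggregation are simplifications of the actual network:

import torch
import torch.nn as nn

class BilateralSplitAggregation(nn.Module):
    """Splits a 2D-encoded cost volume with a spatial attention map, aggregates the
    detail and smooth parts separately, then fuses them."""
    def __init__(self, channels=64):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.detail_agg = nn.Conv2d(channels, channels, 3, padding=1)
        self.smooth_agg = nn.Conv2d(channels, channels, 5, padding=2)   # larger context
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, cost):
        a = self.attn(cost)                        # (B, 1, H, W) spatial attention
        detail = self.detail_agg(cost * a)         # edge / fine-detail regions
        smooth = self.smooth_agg(cost * (1 - a))   # textureless / smooth regions
        return self.fuse(torch.cat([detail, smooth], dim=1))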

Authors:Gangwei Xu, Haotong Lin, Zhaoxing Zhang, Hongcheng Luo, Haiyang Sun, Xin Yang
Title: BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation
Abstract:
Event cameras deliver visual information characterized by a high dynamic range and high temporal resolution, offering significant advantages in estimating optical flow for complex lighting conditions and fast-moving objects. Current advanced optical flow methods for event cameras largely adopt established image-based frameworks. However, the spatial sparsity of event data limits their performance. In this paper, we present BAT, an innovative framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. BAT includes three novel designs: 1) a bidirectional temporal correlation that transforms bidirectional temporally dense motion cues into spatially dense ones, enabling accurate and spatially dense optical flow estimation; 2) an adaptive temporal sampling strategy for maintaining temporal consistency in correlation; 3) spatially adaptive temporal motion aggregation to efficiently and adaptively aggregate consistent target motion features into adjacent motion features while suppressing inconsistent ones. Our results rank 1st on the DSEC-Flow benchmark, outperforming existing state-of-the-art methods by a large margin while also exhibiting sharp edges and high-quality details. Notably, our BAT can accurately predict future optical flow using only past events, significantly outperforming E-RAFT's warm-start approach. Code: https://github.com/gangweiX/BAT.
中文摘要:BAT框架通过双向自适应时间关联技术解决了事件相机数据空间稀疏性的限制,在实现最先进光流估计精度的同时,仅使用过去事件即可准确预测未来光流,在DSEC-Flow基准测试中大幅领先现有方法。
English Summary: The BAT framework introduces bidirectional adaptive temporal correlation to overcome the spatial sparsity limitations of event cameras, achieving state-of-the-art optical flow estimation with superior accuracy and detail while enabling future flow prediction using only past events.

Authors:Zhumei Wang, Zechen Hu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Sida Peng, Xiaowei Zhou, Mingtao Pei, Siyuan Huang
Title: Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
Abstract:
Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, constraining out-of-distribution generalization. The second issue is the difficulty of estimating metric-scale poses from monocular input. To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions. To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data. Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference. Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability. Our code will be made publicly available.
中文: Mocap-2-to-3框架通过利用2D数据预训练和创新的运动表示方法,解决了单目输入恢复绝对人体运动的难题,实现了高精度三维运动重建,并展现出卓越的泛化能力和真实感。
English: The Mocap-2-to-3 framework overcomes monocular motion recovery challenges by leveraging 2D data pre-training and a novel human motion representation to reconstruct metrically accurate 3D motions with superior generalization and realism.

Authors:Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan
Title: SpiritSight Agent: Advanced GUI Agent with One Look
Abstract:
Graphical User Interface (GUI) agents show amazing abilities in assisting human-computer interaction, automating human users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility for different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called GUI-Lasagne using scalable methods, empowering SpiritSight with robust GUI understanding and grounding capabilities. Second, we introduce the Universal Block Parsing (UBP) method to resolve the ambiguity problem in dynamic high-resolution visual inputs, further enhancing SpiritSight's ability to ground GUI objects. Through these efforts, the SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. Models and datasets are available at https://hzhiyuan.github.io/SpiritSight-Agent.

Authors:Jingzhou Luo, Yang Liu, Weixing Chen, Zhen Li, Yaowei Wang, Guanbin Li, Liang Lin
Title: DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
Abstract:
3D Question Answering (3D QA) requires the model to comprehensively understand its situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation. However, existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images. Moreover, due to the inherent noise in camera poses and complex occlusions, there exists significant feature degradation and reduced feature robustness problems when aligning 3D point cloud with multi-view images. In this paper, we propose a Dual-vision Scene Perception Network (DSPNet), to comprehensively integrate multi-view and point cloud features to improve robustness in 3D QA. Our Text-guided Multi-view Fusion (TGMF) module prioritizes image views that closely match the semantic content of the text. To adaptively fuse back-projected multi-view images with point cloud features, we design the Adaptive Dual-vision Perception (ADVP) module, enhancing 3D scene comprehension. Additionally, our Multimodal Context-guided Reasoning (MCGR) module facilitates robust reasoning by integrating contextual information across visual and linguistic modalities. Experimental results on SQA3D and ScanQA datasets demonstrate the superiority of our DSPNet. Codes will be available at https://github.com/LZ-CH/DSPNet.
中文: 本文提出的DSPNet通过文本引导的多视角融合和自适应双视觉感知模块,有效整合点云与多视角图像特征,在3D问答任务中实现了更鲁棒的场景理解与推理能力。
English: The proposed DSPNet enhances 3D Question Answering by integrating multi-view and point cloud features through modules that prioritize text-relevant views and adaptively fuse visual data, achieving superior performance on benchmark datasets.

Authors:Awais Nizamani, Hamid Laga, Guanjin Wang, Farid Boussaid, Mohammed Bennamoun, Anuj Srivastava
Title: Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
Abstract:
We propose a novel framework for the statistical analysis of genus-zero 4D surfaces, i.e., 3D surfaces that deform and evolve over time. This problem is particularly challenging due to the arbitrary parameterizations of these surfaces and their varying deformation speeds, necessitating effective spatiotemporal registration. Traditionally, 4D surfaces are discretized, in space and time, before computing their spatiotemporal registrations, geodesics, and statistics. However, this approach may result in suboptimal solutions and, as we demonstrate in this paper, is not necessary. In contrast, we treat 4D surfaces as continuous functions in both space and time. We introduce Dynamic Spherical Neural Surfaces (D-SNS), an efficient smooth and continuous spatiotemporal representation for genus-0 4D surfaces. We then demonstrate how to perform core 4D shape analysis tasks such as spatiotemporal registration, geodesics computation, and mean 4D shape estimation, directly on these continuous representations without upfront discretization and meshing. By integrating neural representations with classical Riemannian geometry and statistical shape analysis techniques, we provide the building blocks for enabling full functional shape analysis. We demonstrate the efficiency of the framework on 4D human and face datasets. The source code and additional results are available at https://4d-dsns.github.io/DSNS/.
中文摘要:本文提出动态球面神经表面(D-SNS)这一连续框架,无需预先离散化即可直接对四维形状进行时空配准、测地线计算和统计均值分析。
English Summary: This paper introduces Dynamic Spherical Neural Surfaces (D-SNS), a continuous framework for analyzing 4D shapes without prior discretization, enabling direct computation of registrations, geodesics, and statistical means.

Authors:Wenqi Guo, Yiyang Du, Shan Du
Title: LangGas: Introducing Language in Selective Zero-Shot Background Subtraction for Semi-Transparent Gas Leak Detection with a New Dataset
Abstract:
Gas leakage poses a significant hazard that requires prevention. Traditionally, human inspection has been used for detection, a slow and labour-intensive process. Recent research has applied machine learning techniques to this problem, yet there remains a shortage of high-quality, publicly available datasets. This paper introduces a synthetic dataset, SimGas, featuring diverse backgrounds, interfering foreground objects, diverse leak locations, and precise segmentation ground truth. We propose a zero-shot method that combines background subtraction, zero-shot object detection, filtering, and segmentation to leverage this dataset. Experimental results indicate that our approach significantly outperforms baseline methods based solely on background subtraction and zero-shot object detection with segmentation, reaching an IoU of 69%. We also present an analysis of various prompt configurations and threshold settings to provide deeper insights into the performance of our method. Finally, because ground truth is unavailable for GasVid, we evaluated our method on it qualitatively and obtained decent results on this real-world dataset. The dataset, code, and full qualitative results are available at https://github.com/weathon/Lang-Gas.
中文: 本文提出了用于气体泄漏检测的合成数据集SimGas和一种零样本方法,该方法以69%的交并比显著优于基线方法,并在真实数据上取得了良好效果。
English: This paper introduces SimGas, a synthetic dataset for gas leak detection, and proposes a zero-shot method that outperforms baseline approaches with a 69% IoU, also demonstrating decent performance on real-world data.
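
A rough sketch of the background-subtraction stage of the pipeline described above, using OpenCV's MOG2 subtractor; the zero-shot detection, filtering, and segmentation stages are not shown. The history length, kernel size, and synthetic test frames are illustrative guesses rather than values from the paper.

import cv2
import numpy as np

def foreground_mask(frames: list[np.ndarray], history: int = 50) -> np.ndarray:
    subtractor = cv2.createBackgroundSubtractorMOG2(history=history, detectShadows=False)
    mask = None
    for frame in frames:
        mask = subtractor.apply(frame)          # per-frame foreground mask
    # Morphological opening removes small speckle noise from the last mask.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

# Synthetic example: a faint moving blob over a static background.
frames = []
for t in range(30):
    img = np.full((120, 160, 3), 128, np.uint8)
    cv2.circle(img, (40 + 2 * t, 60), 12, (150, 150, 150), -1)
    frames.append(img)
print(foreground_mask(frames).shape)  # (120, 160)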

Authors:Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
Title: Surgical Vision World Model
Abstract:
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action-labeled data, limiting their applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Codes and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
Chinese: 本文提出了首个手术视觉世界模型,通过利用未标记视频数据生成可控制动作的手术数据,并在SurgToolLoc-2022数据集上进行了验证。
English: This paper introduces the first surgical vision world model, which generates action-controllable surgical data by leveraging unlabeled video data and has been validated on the SurgToolLoc-2022 dataset.

Authors:Qinyu Zhao, Stephen Gould, Liang Zheng
Title: ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models
Abstract:
Existing autoregressive (AR) image generative models use a token-by-token generation schema. That is, they predict a per-token probability distribution and sample the next token from that distribution. The main challenge is how to model the complex distribution of high-dimensional tokens. Previous methods either are too simplistic to fit the distribution or result in slow generation speed. Instead of fitting the distribution of whole tokens, we explore using an AR model to generate each token in a feature-by-feature way, i.e., taking the generated features as input and generating the next feature. Based on that, we propose ARINAR (AR-in-AR), a bi-level AR model. The outer AR layer takes previous tokens as input and predicts a condition vector z for the next token. The inner layer, conditioned on z, generates the features of the next token autoregressively. In this way, the inner layer only needs to model the distribution of a single feature, for example, using a simple Gaussian Mixture Model. On the ImageNet 256x256 image generation task, ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while being five times faster than the latter.
中文摘要:提出的ARINAR模型采用双层自回归方法,先预测下一标记的条件向量,再逐步生成标记特征,在保持与先进模型相当图像生成质量的同时,大幅提升了生成速度。
English Summary: The proposed ARINAR model introduces a bi-level autoregressive approach that first predicts a condition vector for the next token and then generates token features sequentially, achieving comparable image generation quality to state-of-the-art models while significantly improving speed.
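
A toy sketch of the bi-level "AR-in-AR" idea from the abstract: an outer autoregressive model maps previous tokens to a condition vector z, and an inner model, conditioned on z and the features generated so far (zeros stand in for not-yet-generated positions), emits a Gaussian mixture over each scalar feature of the next token. Dimensions, the GRU stand-in for the outer model, and all module names are illustrative, not the ARINAR-B configuration.

import torch
import torch.nn as nn

class InnerFeatureAR(nn.Module):
    def __init__(self, token_dim: int, z_dim: int, n_mix: int = 4):
        super().__init__()
        self.token_dim, self.n_mix = token_dim, n_mix
        self.head = nn.Linear(z_dim + token_dim, 3 * n_mix)  # logits, means, log-stds

    def sample_token(self, z: torch.Tensor) -> torch.Tensor:
        feats = torch.zeros(z.shape[0], self.token_dim, device=z.device)
        for i in range(self.token_dim):            # feature-by-feature generation
            params = self.head(torch.cat([z, feats], dim=-1))
            logits, mean, log_std = params.chunk(3, dim=-1)
            comp = torch.distributions.Categorical(logits=logits).sample()
            mu = mean.gather(-1, comp.unsqueeze(-1)).squeeze(-1)
            sigma = log_std.gather(-1, comp.unsqueeze(-1)).squeeze(-1).exp()
            feats[:, i] = torch.normal(mu, sigma)   # sample one scalar feature
        return feats

class OuterTokenAR(nn.Module):
    def __init__(self, token_dim: int = 16, z_dim: int = 32):
        super().__init__()
        self.rnn = nn.GRU(token_dim, z_dim, batch_first=True)  # stand-in for a transformer
        self.inner = InnerFeatureAR(token_dim, z_dim)

    def generate(self, prev_tokens: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(prev_tokens)               # condition vector z for the next token
        return self.inner.sample_token(h[-1])

model = OuterTokenAR()
print(model.generate(torch.randn(2, 10, 16)).shape)  # torch.Size([2, 16])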

Authors:Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
Title: SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models
Abstract:
Advancing AI in computational pathology requires large, high-quality, and diverse datasets, yet existing public datasets are often limited in organ diversity, class coverage, or annotation quality. To bridge this gap, we introduce SPIDER (Supervised Pathology Image-DEscription Repository), the largest publicly available patch-level dataset covering multiple organ types, including Skin, Colorectal, Thorax, and Breast with comprehensive class coverage for each organ. SPIDER provides high-quality annotations verified by expert pathologists and includes surrounding context patches, which enhance classification performance by providing spatial context. Alongside the dataset, we present baseline models trained on SPIDER using the Hibou-L foundation model as a feature extractor combined with an attention-based classification head. The models achieve state-of-the-art performance across multiple tissue categories and serve as strong benchmarks for future digital pathology research. Beyond patch classification, the model enables rapid identification of significant areas, quantitative tissue metrics, and establishes a foundation for multimodal approaches. Both the dataset and trained models are publicly available to advance research, reproducibility, and AI-driven pathology development. Access them at: https://github.com/HistAI/SPIDER
中文摘要:SPIDER作为最大的公开病理图像数据集,提供多器官覆盖和专家验证标注,其配套的基准模型实现了最先进的性能,为计算病理学AI研究设立了新标准并支持多维组织分析。
English Summary: SPIDER is introduced as the largest public patch-level pathology dataset with expert-verified annotations and multi-organ coverage, accompanied by state-of-the-art baseline models that enable advanced tissue analysis and set new benchmarks for AI in computational pathology.
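
An illustrative sketch of the baseline recipe named in the abstract: a frozen foundation-model feature extractor (Hibou-L in the paper, mocked here) followed by an attention-based classification head that pools the centre patch together with its surrounding context patches. All sizes and names below are assumptions for demonstration, not the released baseline.

import torch
import torch.nn as nn

class AttentionPoolingHead(nn.Module):
    def __init__(self, feat_dim: int = 1024, n_classes: int = 10):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # attention score per patch
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, n_patches, feat_dim), centre patch plus context patches.
        weights = torch.softmax(self.score(patch_feats), dim=1)   # (B, n, 1)
        pooled = (weights * patch_feats).sum(dim=1)               # (B, feat_dim)
        return self.classifier(pooled)

# In practice patch_feats would come from the frozen extractor; here we mock it.
feats = torch.randn(4, 9, 1024)   # 1 centre + 8 context patches (assumption)
print(AttentionPoolingHead()(feats).shape)  # torch.Size([4, 10])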

Authors:Nuria Alina Chandra, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Hannah Lee, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Sejin Paik, Changyeon Lee, Jongwook Choi, Aerin Kim, Oren Etzioni
Title: Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024
Abstract:
In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.
中文: Deepfake-Eval-2024基准测试表明,现有深度伪造检测模型在真实场景中表现不佳,相比过时的学术数据集性能下降高达50%,尽管商业模型和微调模型相比开源方案有所改进。
English: The Deepfake-Eval-2024 benchmark reveals that current deepfake detection models perform poorly on real-world content, with performance dropping up to 50% compared to outdated academic datasets, though commercial and fine-tuned models show some improvement over open-source alternatives.

Authors:Weihang Wang, Duolin Sun, Jielei Zhang, Longwen Gao
Title: MX-Font++: Mixture of Heterogeneous Aggregation Experts for Few-shot Font Generation
Abstract:
Few-shot Font Generation (FFG) aims to create new font libraries using limited reference glyphs, with crucial applications in digital accessibility and equity for low-resource languages, especially in multilingual artificial intelligence systems. Although existing methods have shown promising performance, transitioning to unseen characters in low-resource languages remains a significant challenge, especially when font glyphs vary considerably across training sets. MX-Font considers the content of a character from the perspective of a local component, employing a Mixture of Experts (MoE) approach to adaptively extract the component for better transition. However, the lack of a robust feature extractor prevents it from adequately decoupling content and style, leading to sub-optimal generation results. To alleviate these problems, we propose Heterogeneous Aggregation Experts (HAE), a powerful feature extraction expert that aggregates information across channel and spatial dimensions, helping downstream modules decouple content and style. Additionally, we propose a novel content-style homogeneity loss to enhance the disentanglement. Extensive experiments on several datasets demonstrate that our MX-Font++ yields superior visual results in FFG and effectively outperforms state-of-the-art methods. Code and data are available at https://github.com/stephensun11/MXFontpp.
中文:MX-Font++ 提出了异构聚合专家和内容风格同质性损失函数,能有效解耦内容与风格,在少样本字体生成任务中实现了优于现有方法的视觉效果。
English: MX-Font++ introduces Heterogeneous Aggregation Experts and a content-style homogeneity loss to effectively decouple content and style, achieving superior few-shot font generation results compared to existing methods.

Authors:Shuang Chen, Yifeng He, Barry Lennox, Farshad Arvin, Amir Atapour-Abarghouei
Title: Deep Learning-Enhanced Visual Monitoring in Hazardous Underwater Environments with a Swarm of Micro-Robots
Abstract:
Long-term monitoring and exploration of extreme environments, such as underwater storage facilities, is costly, labor-intensive, and hazardous. Automating this process with low-cost, collaborative robots can greatly improve efficiency. These robots capture images from different positions, which must be processed simultaneously to create a spatio-temporal model of the facility. In this paper, we propose a novel approach that integrates data simulation, a multi-modal deep learning network for coordinate prediction, and image reassembly to address the challenges posed by environmental disturbances causing drift and rotation in the robots' positions and orientations. Our approach enhances the precision of alignment in noisy environments by integrating visual information from snapshots, global positional context from masks, and noisy coordinates. We validate our method through extensive experiments using synthetic data that simulate real-world robotic operations in underwater settings. The results demonstrate very high coordinate prediction accuracy and plausible image assembly, indicating the real-world applicability of our approach. The assembled images provide clear and coherent views of the underwater environment for effective monitoring and inspection, showcasing the potential for broader use in extreme settings, further contributing to improved safety, efficiency, and cost reduction in hazardous field monitoring. Code is available on https://github.com/ChrisChen1023/Micro-Robot-Swarm.
中文摘要:本文提出一种利用协作机器人和多模态深度学习的新方法,通过集成数据模拟和图像重组技术解决水下机器人位置漂移和旋转问题,从而提高极端环境监测的精度与效率。
English Summary: This paper presents a novel method using collaborative robots and multi-modal deep learning to enhance the precision of underwater environmental monitoring by addressing positional drift and rotation through integrated data simulation and image reassembly.

Authors:Jia Wang, Xinfeng Zhang, Gai Zhang, Jun Zhu, Lv Tang, Li Zhang
Title: UAR-NVC: A Unified AutoRegressive Framework for Memory-Efficient Neural Video Compression
Abstract:
Implicit Neural Representations (INRs) have demonstrated significant potential in video compression by representing videos as neural networks. However, as the number of frames increases, the memory consumption for training and inference increases substantially, posing challenges in resource-constrained scenarios. Inspired by the success of traditional video compression frameworks, which process video frame by frame and can efficiently compress long videos, we adopt this modeling strategy for INRs to decrease memory consumption, while aiming to unify the frameworks from the perspective of timeline-based autoregressive modeling. In this work, we present a novel understanding of INR models from an autoregressive (AR) perspective and introduce a Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC). UAR-NVC integrates timeline-based and INR-based neural video compression under a unified autoregressive paradigm. It partitions videos into several clips and processes each clip using a different INR model instance, leveraging the advantages of both compression frameworks while allowing seamless adaptation to either form. To further reduce temporal redundancy between clips, we design two modules to optimize the initialization, training, and compression of these model parameters. UAR-NVC supports adjustable latencies by varying the clip length. Extensive experimental results demonstrate that UAR-NVC, with its flexible video clip setting, can adapt to resource-constrained environments and significantly improve performance compared to different baseline models. The project page is available at https://wj-inf.github.io/UAR-NVC-page/.

Authors:Xiaoyu Zheng, Xu Chen, Shaogang Gong, Xavier Griffin, Greg Slabaugh
Title: XFMamba: Cross-Fusion Mamba for Multi-View Medical Image Classification
Abstract:
Compared to single view medical image classification, using multiple views can significantly enhance predictive accuracy as it can account for the complementarity of each view while leveraging correlations between views. Existing multi-view approaches typically employ separate convolutional or transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and suffer from challenges with limited receptive field (CNNs) or quadratic computational complexity (transformers). Inspired by state space sequence models, we propose XFMamba, a pure Mamba-based cross-fusion architecture to address the challenge of multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy, facilitating the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies in each view while enhancing seamless information transfer between views. Results on three public datasets, MURA, CheXpert and DDSM, illustrate the effectiveness of our approach across diverse multi-view medical image classification tasks, showing that it outperforms existing convolution-based and transformer-based multi-view methods. Code is available at https://github.com/XZheng0427/XFMamba.
中文摘要:提出的XFMamba模型采用基于状态空间序列的交叉融合架构,能有效学习单视图特征及其跨视图差异,在多个医学图像数据集上超越了现有卷积和Transformer方法。
English Summary: The proposed XFMamba model introduces a novel cross-fusion architecture using state space sequence models to effectively capture both single-view features and cross-view correlations, outperforming existing methods on multiple medical image datasets.

Authors:Zirun Guo, Tao Jin
Title: Smoothing the Shift: Towards Stable Test-Time Adaptation under Complex Multimodal Noises
Abstract:
Test-Time Adaptation (TTA) aims to tackle distribution shifts using unlabeled test data without access to the source data. In the context of multimodal data, there are more complex noise patterns than in unimodal data, such as simultaneous corruptions of multiple modalities and missing modalities. Besides, in real-world applications, corruptions from different distribution shifts are always mixed. Existing TTA methods always fail in such multimodal scenarios because the abrupt distribution shifts destroy the prior knowledge from the source model, thus leading to performance degradation. To this end, we reveal a new challenge named multimodal wild TTA. To address this challenging problem, we propose two novel strategies: sample identification with interquartile range Smoothing and unimodal assistance, and Mutual information sharing (SuMi). SuMi smooths the adaptation process with the interquartile range, which avoids abrupt distribution shifts. Then, SuMi fully utilizes the unimodal features to select low-entropy samples with rich multimodal information for optimization. Furthermore, mutual information sharing is introduced to align the information, reduce the discrepancies, and enhance the information utilization across different modalities. Extensive experiments on two public datasets show the effectiveness and superiority over existing methods under the complex noise patterns in multimodal data. Code is available at https://github.com/zrguo/SuMi.
中文摘要:测试时自适应(TTA)旨在利用未标记测试数据应对分布偏移,但在复杂多模态噪声场景中表现不佳;提出的SuMi方法通过平滑适应过程和增强跨模态信息共享,有效解决了这一难题。
English Summary: Test-Time Adaptation (TTA) addresses distribution shifts using unlabeled test data, but struggles with complex multimodal noise patterns; the proposed SuMi method overcomes this by smoothing adaptation processes and enhancing cross-modal information sharing.
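
A small schematic of the two sample-selection ideas named in the SuMi abstract, interquartile-range (IQR) based smoothing and low-entropy sample selection, written as a stand-alone filter over test-time logits. This is our own reading of the abstract with placeholder thresholds, not the authors' implementation.

import torch

def entropy(probs: torch.Tensor) -> torch.Tensor:
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)

def select_samples(logits: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask of test samples kept for adaptation."""
    ent = entropy(torch.softmax(logits, dim=-1))
    q1, q3 = torch.quantile(ent, 0.25), torch.quantile(ent, 0.75)
    iqr = q3 - q1
    # Keep samples whose entropy is not an outlier (IQR rule) and is low,
    # i.e. confident predictions that are less likely to destabilise adaptation.
    inlier = (ent >= q1 - 1.5 * iqr) & (ent <= q3 + 1.5 * iqr)
    low_entropy = ent <= ent.median()
    return inlier & low_entropy

logits = torch.randn(32, 10)
mask = select_samples(logits)
print(mask.sum().item(), "of", mask.numel(), "samples selected")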

Authors:Yizhou Huang, Fan Yang, Guoliang Zhu, Gen Li, Hao Shi, Yukun Zuo, Wenrui Chen, Zhiyong Li, Kailun Yang
Title: Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts
Abstract:
Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large parameter counts, making it difficult to meet the performance requirements for practical deployment. To address these issues, this paper proposes the BiT-Align image-depth-text affordance mapping framework. The framework includes a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM integrates the auxiliary modality depth image directly as a prompt to the primary modality RGB image, embedding it into the primary modality encoder without introducing additional encoders. This reduces the model's parameter count and effectively improves functional region localization accuracy. The TFG mechanism guides the selection and enhancement of attention heads in the image encoder using textual features, improving the understanding of affordance characteristics. Experimental results demonstrate that the proposed method achieves significant performance improvements on the public AGD20K and HICO-IIF datasets. On the AGD20K dataset, compared with the current state-of-the-art method, we achieve a 6.0% improvement in the KLD metric while reducing model parameters by 88.8%, demonstrating practical application value. The source code will be made publicly available at https://github.com/DAWDSE/BiT-Align.
中文: 本文提出的BiT-Align框架通过深度图像提示和文本引导注意力机制改进功能映射,在基准数据集上以更少参数实现了更高精度。
English: This paper introduces the BiT-Align framework, which enhances affordance mapping by integrating depth images as prompts and using text-guided attention, achieving higher accuracy with fewer parameters on benchmark datasets.
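
A loose sketch of the "depth as a prompt" idea summarised above: depth features are projected into a few prompt tokens and prepended to the RGB token sequence of a single shared encoder, so no second encoder is needed. The projection, the generic transformer encoder, and all sizes are placeholders, not the BiT-Align BPM or TFG modules.

import torch
import torch.nn as nn

class DepthPromptEncoder(nn.Module):
    def __init__(self, dim: int = 256, n_prompts: int = 4):
        super().__init__()
        self.depth_to_prompts = nn.Linear(dim, n_prompts * dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_prompts, self.dim = n_prompts, dim

    def forward(self, rgb_tokens: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # rgb_tokens: (B, N, dim); depth_feat: (B, dim) pooled depth descriptor.
        prompts = self.depth_to_prompts(depth_feat).view(-1, self.n_prompts, self.dim)
        x = torch.cat([prompts, rgb_tokens], dim=1)       # prompts lead the sequence
        return self.encoder(x)[:, self.n_prompts:]        # keep only RGB positions

enc = DepthPromptEncoder()
out = enc(torch.randn(2, 196, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 196, 256])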

Authors:Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi
Title: Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Abstract:
Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem, we propose AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows AKI to achieve superior performance on 12 multimodal understanding benchmarks (+7.2% on average) without introducing additional parameters or increasing training time. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios. The code and model are publicly available at https://github.com/sony/aki to encourage further advancements in MLLMs across various directions.
中文:针对多模态大语言模型中视觉与语言不对齐的问题,提出的MapleLeaf AKI模型通过将因果注意力机制替换为模态互注意力机制,使图像标记能够关注文本标记,从而在12个多模态理解基准上实现卓越性能,且无需额外参数或增加训练时间。
English: Recent Multimodal Large Language Models (MLLMs) face vision-language misalignment issues, which the proposed MapleLeaf AKI model addresses by replacing causal attention with modality-mutual attention, enabling image tokens to interact with text tokens and achieving superior performance across 12 benchmarks without extra parameters or training time.
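
A toy illustration of the attention-mask change described in the abstract: under a standard causal mask, earlier image tokens cannot attend to later text tokens; modality-mutual attention lifts that restriction so image tokens see the full text while text generation stays causal. The mask construction below is a schematic reading of the abstract, not the released AKI code.

import torch

def modality_mutual_mask(n_image: int, n_text: int) -> torch.Tensor:
    """True = attention allowed. Image tokens come first, text tokens after."""
    n = n_image + n_text
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # causal baseline
    # Unlock: every image token may also attend to every text token.
    mask[:n_image, n_image:] = True
    return mask

mask = modality_mutual_mask(n_image=3, n_text=4)
print(mask.int())
# Rows 0-2 (image) now have 1s over columns 3-6 (text); rows 3-6 stay causal.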

Authors:Zhaoxing Gan, Mengtian Li, Ruhua Chen, Zhongxia Ji, Sichen Guo, Huanling Hu, Guangnan Ye, Zuo Hu
Title: StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts
Abstract:
In this work, we introduce StageDesigner, the first comprehensive framework for artistic stage generation using large language models combined with layout-controlled diffusion models. Given the professional requirements of stage scenography, StageDesigner simulates the workflows of seasoned artists to generate immersive 3D stage scenes. Specifically, our approach is divided into three primary modules: Script Analysis, which extracts thematic and spatial cues from input scripts; Foreground Generation, which constructs and arranges essential 3D objects; and Background Generation, which produces a harmonious background aligned with the narrative atmosphere and maintains spatial coherence by managing occlusions between foreground and background elements. Furthermore, we introduce the StagePro-V1 dataset, a dedicated dataset with 276 unique stage scenes spanning different historical styles and annotated with scripts, images, and detailed 3D layouts, specifically tailored for this task. Finally, evaluations using both standard and newly proposed metrics, along with extensive user studies, demonstrate the effectiveness of StageDesigner. Project can be found at: https://deadsmither5.github.io/2025/01/03/StageDesigner/

Authors:Yanlong Xu, Haoxuan Qu, Jun Liu, Wenxiao Zhang, Xun Yang
Title: CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework
Abstract:
The goal of point cloud localization based on linguistic description is to identify a 3D position using textual description in large urban environments, which has potential applications in various fields, such as determining the location for vehicle pickup or goods delivery. Ideally, for a textual description and its corresponding 3D location, the objects around the 3D location should be fully described in the text description. However, in practical scenarios, e.g., vehicle pickup, passengers usually describe only part of the most significant and nearby surroundings instead of the entire environment. In response to this partially relevant challenge, we propose CMMLoc, an uncertainty-aware Cauchy-Mixture-Model (CMM) based framework for text-to-point-cloud Localization. To model the uncertain semantic relations between text and point cloud, we integrate CMM constraints as a prior during the interaction between the two modalities. We further design a spatial consolidation scheme to enable adaptive aggregation of different 3D objects with varying receptive fields. To achieve precise localization, we propose a cardinal direction integration module alongside a modality pre-alignment strategy, helping capture the spatial relationships among objects and bringing the 3D objects closer to the text modality. Comprehensive experiments validate that CMMLoc outperforms existing methods, achieving state-of-the-art results on the KITTI360Pose dataset. Codes are available in this GitHub repository: https://github.com/kevin301342/CMMLoc.
Chinese: 针对点云定位中文本描述部分相关性的挑战,我们提出了CMMLoc框架,通过整合柯西混合模型约束和空间整合方案,实现了精确的文本到点云定位,在KITTI360Pose数据集上取得了最先进的性能。
English: To address the challenge of partially relevant text descriptions in point cloud localization, we propose CMMLoc, an uncertainty-aware framework that integrates Cauchy-Mixture-Model constraints and spatial consolidation for precise text-to-point-cloud localization, achieving state-of-the-art performance on the KITTI360Pose dataset.

Authors:Patryk Marszałek, Maciej Rut, Piotr Kawa, Przemysław Spurek, Piotr Syga
Title: A Hypernetwork-Based Approach to KAN Representation of Audio Signals
Abstract:
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
Chinese: 本研究提出Kolmogorov-Arnold网络(KAN)作为音频的有效隐式神经表示,通过其FewSound扩展展示了卓越的性能和适应性,显著超越了现有方法。
English: This study introduces the Kolmogorov-Arnold Network (KAN) as an effective implicit neural representation for audio, demonstrating superior performance and adaptability through its FewSound extension, which significantly outperforms existing methods.
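
A minimal, simplified KAN-style layer for intuition: instead of a fixed nonlinearity after a linear map, each input-output edge applies its own learnable activation, parameterised here as a small radial-basis expansion. This is a didactic approximation of the Kolmogorov-Arnold idea, not the architecture used in the paper or the FewSound hypernetwork.

import torch
import torch.nn as nn

class TinyKANLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, n_basis: int = 8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1, 1, n_basis), requires_grad=False)
        # One coefficient per (output, input, basis) triple -> a learnable
        # activation on every edge of the layer.
        self.coeff = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_dim). Gaussian RBF features of each scalar input.
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2) / 0.1)  # (B, in, n_basis)
        # Sum the per-edge activations over inputs and basis functions.
        return torch.einsum('bik,oik->bo', phi, self.coeff)

# Example: an audio INR mapping a time coordinate to a waveform sample.
net = nn.Sequential(TinyKANLayer(1, 32), TinyKANLayer(32, 1))
t = torch.linspace(-1, 1, 16).unsqueeze(-1)    # (16, 1) time coordinates
print(net(t).shape)                            # torch.Size([16, 1])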

Authors:Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, Kailun Yang
Title: Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance
Abstract:
The perception capability of robotic systems relies on the richness of the dataset. Although the Segment Anything Model 2 (SAM2), trained on large datasets, demonstrates strong potential in perception tasks, its inherent training paradigm prevents it from being suitable for RGB-T tasks. To address these challenges, we propose SHIFNet, a novel SAM2-driven Hybrid Interaction Paradigm that unlocks the potential of SAM2 with linguistic guidance for efficient RGB-Thermal perception. Our framework consists of two key components: (1) a Semantic-Aware Cross-modal Fusion (SACF) module that dynamically balances modality contributions through text-guided affinity learning, overcoming SAM2's inherent RGB bias; (2) a Heterogeneous Prompting Decoder (HPD) that enhances global semantic information through a semantic enhancement module and then combines it with category embeddings to amplify cross-modal semantic consistency. With 32.27M trainable parameters, SHIFNet achieves state-of-the-art segmentation performance on public benchmarks, reaching 89.8% on PST900 and 67.8% on FMB, respectively. The framework facilitates the adaptation of pre-trained large models to RGB-T segmentation tasks, effectively mitigating the high costs associated with data collection while endowing robotic systems with comprehensive perception capabilities. The source code will be made publicly available at https://github.com/iAsakiT3T/SHIFNet.
中文: 提出的SHIFNet框架通过文本引导的跨模态融合和语义增强机制,成功将Segment Anything Model 2适配于RGB-热成像任务,以少量可训练参数实现了最先进的分割性能。
English: The proposed SHIFNet framework enhances robotic perception by adapting the Segment Anything Model 2 for RGB-Thermal tasks through text-guided cross-modal fusion and semantic prompting, achieving state-of-the-art segmentation performance with minimal trainable parameters.

Authors:Ege Özsoy, Chantal Pellegrini, Tobias Czempiel, Felix Tristram, Kun Yuan, David Bani-Harouni, Ulrich Eck, Benjamin Busam, Matthias Keicher, Nassir Navab
Title: MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
Abstract:
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding, and open the path towards multimodal scene analysis in complex, high-stakes environments. Our code and data are available at https://github.com/egeozsoy/MM-OR.
中文摘要:本研究推出首个多模态手术室数据集MM-OR及场景图生成模型MM2SG,通过整合多元数据与先进算法,为复杂医疗场景的全面感知建立了新标准。
English Summary: The study introduces MM-OR, a comprehensive multimodal dataset for operating rooms, and MM2SG, a novel vision-language model, establishing new benchmarks for holistic surgical environment understanding through enhanced scene graph generation.

Authors:Xinying Hong, Siyu Li, Kang Zeng, Hao Shi, Bomin Peng, Kailun Yang, Zhiyong Li
Title: TS-CGNet: Temporal-Spatial Fusion Meets Centerline-Guided Diffusion for BEV Mapping
Abstract:
Bird's Eye View (BEV) perception technology is crucial for autonomous driving, as it generates top-down 2D maps for environment perception, navigation, and decision-making. Nevertheless, the majority of current BEV map generation studies focusing on visual map generation lack depth-aware reasoning capabilities. They exhibit limited efficacy in managing occlusions and handling complex environments, with a notable decline in perceptual performance under adverse weather conditions or low-light scenarios. Therefore, this paper proposes TS-CGNet, which leverages Temporal-Spatial fusion with Centerline-Guided diffusion. This visual framework, grounded in prior knowledge, is designed for integration into any existing network for building BEV maps. Specifically, the framework is decoupled into three parts: a local mapping system that performs the initial generation of semantic maps using purely visual information; the Temporal-Spatial Aligner Module (TSAM), which integrates historical information into map generation by applying transformation matrices; and the Centerline-Guided Diffusion Model (CGDM), a prediction module based on the diffusion model. CGDM incorporates centerline information through spatial-attention mechanisms to enhance semantic segmentation reconstruction. We construct BEV semantic segmentation maps with our method on the public nuScenes dataset and on robustness benchmarks under various corruptions. Our method improves performance by 1.90%, 1.73%, and 2.87% for perceived ranges of 60x30m, 120x60m, and 240x60m in the task of BEV HD mapping. TS-CGNet attains an improvement of 1.92% for a perceived range of 100x100m in the task of BEV semantic mapping. Moreover, TS-CGNet achieves an average improvement of 2.92% in detection accuracy under varying weather conditions and sensor interferences in the perception range of 240x60m. The source code will be publicly available at https://github.com/krabs-H/TS-CGNet.
中文摘要:本文提出TS-CGNet框架,通过时空融合与中心线引导扩散技术,有效提升鸟瞰图感知的深度推理能力,在不同环境和感知范围内均实现性能提升。
English Summary: This paper introduces TS-CGNet, a temporal-spatial fusion framework with centerline-guided diffusion that enhances Bird's Eye View perception by addressing depth reasoning limitations and improving performance across various conditions and ranges.

Authors:Grzegorz Skorupko, Fotios Avgoustidis, Carlos Martín-Isla, Lidia Garrucho, Dimitri A. Kessler, Esmeralda Ruiz Pujadas, Oliver Díaz, Maciej Bobowicz, Katarzyna Gwoździewicz, Xavier Bargalló, Paulius Jaruševičius, Richard Osuala, Kaisar Kushibar, Karim Lekadir
Title: Federated nnU-Net for Privacy-Preserving Medical Image Segmentation
Abstract:
The nnU-Net framework has played a crucial role in medical image segmentation and has become the gold standard in a multitude of applications targeting different diseases, organs, and modalities. However, so far it has been used primarily in a centralized approach where the collected data is stored in the same location where nnU-Net is trained. This centralized approach has various limitations, such as potential leakage of sensitive patient information and violation of patient privacy. Federated learning has emerged as a key approach for training segmentation models in a decentralized manner, enabling collaborative development while prioritising patient privacy. In this paper, we propose FednnU-Net, a plug-and-play, federated learning extension of the nnU-Net framework. To this end, we contribute two federated methodologies to unlock decentralized training of nnU-Net, namely, Federated Fingerprint Extraction (FFE) and Asymmetric Federated Averaging (AsymFedAvg). We conduct a comprehensive set of experiments demonstrating high and consistent performance of our methods for breast, cardiac and fetal segmentation based on a multi-modal collection of 6 datasets representing samples from 18 different institutions. To democratize research as well as real-world deployments of decentralized training in clinical centres, we publicly share our framework at https://github.com/faildeny/FednnUNet.
中文:FednnU-Net框架通过联邦指纹提取和异步联邦平均方法扩展了nnU-Net,实现了跨医疗机构的去中心化医学图像分割,在保护数据隐私的同时展现出卓越性能。
English: The FednnU-Net framework extends nnU-Net with federated learning capabilities through Federated Fingerprint Extraction and Asymmetric Federated Averaging, enabling decentralized medical image segmentation while maintaining data privacy across multiple institutions.
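
For background, a sketch of the standard federated-averaging step that schemes such as AsymFedAvg build on: each institution trains locally and only model weights, never patient data, are aggregated on the server, weighted by local dataset size. The asymmetric variant proposed in the paper differs in details not shown here; this is plain FedAvg.

from collections import OrderedDict
import torch

def fed_avg(client_states: list[OrderedDict], client_sizes: list[int]) -> OrderedDict:
    """Average client state dicts, weighted by the number of local samples."""
    total = sum(client_sizes)
    averaged = OrderedDict()
    for k in client_states[0].keys():
        averaged[k] = sum(
            (n / total) * state[k].float() for state, n in zip(client_states, client_sizes)
        )
    return averaged

# Tiny demonstration with two "clients" sharing the same model architecture.
model = torch.nn.Linear(4, 2)
states = [
    OrderedDict((k, v + 1.0) for k, v in model.state_dict().items()),
    OrderedDict((k, v - 1.0) for k, v in model.state_dict().items()),
]
avg = fed_avg(states, client_sizes=[30, 10])
model.load_state_dict(avg)   # the averaged weights form a valid state dict
print(sorted(avg.keys()))    # ['bias', 'weight']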

Authors:Zhen Yang, Guibao Shen, Minyang Li, Liang Hou, Mushui Liu, Luozhou Wang, Xin Tao, Pengfei Wan, Di Zhang, Ying-Cong Chen
Title: Efficient Training-Free High-Resolution Synthesis with Energy Rectification in Diffusion Models
Abstract:
Diffusion models have achieved remarkable progress across various visual generation tasks. However, their performance significantly declines when generating content at resolutions higher than those used during training. Although numerous methods have been proposed to enable high-resolution generation, they all suffer from inefficiency. In this paper, we propose RectifiedHR, a straightforward and efficient solution for training-free high-resolution synthesis. Specifically, we propose a noise refresh strategy that unlocks the model's training-free high-resolution synthesis capability and improves efficiency. Additionally, we are the first to observe the phenomenon of energy decay, which may cause image blurriness during the high-resolution synthesis process. To address this issue, we introduce average latent energy analysis and find that tuning the classifier-free guidance hyperparameter can significantly improve generation performance. Our method is entirely training-free and demonstrates efficient performance. Furthermore, we show that RectifiedHR is compatible with various diffusion model techniques, enabling advanced features such as image editing, customized generation, and video synthesis. Extensive comparisons with numerous baseline methods validate the superior effectiveness and efficiency of RectifiedHR.
中文: 本文提出无需训练的RectifiedHR方法,通过噪声刷新策略实现高效高分辨率图像生成,并针对能量衰减现象调整分类器引导参数以提升画质,同时兼容多种扩散模型的高级应用功能。
English: This paper introduces RectifiedHR, a training-free method that enables efficient high-resolution image generation through a noise refresh strategy and addresses energy decay by adjusting classifier-free guidance, while maintaining compatibility with various diffusion model applications.

Authors:Xin Ding, Xin Li, Haotong Qin, Zhibo Chen
Title: Q&C: When Quantization Meets Cache in Efficient Image Generation
Abstract:
Quantization and cache mechanisms are typically applied individually for efficient Diffusion Transformers (DiTs), each demonstrating notable potential for acceleration. However, the promoting effect of combining the two mechanisms on efficient generation remains under-explored. Through empirical investigation, we find that the combination of quantization and cache mechanisms for DiT is not straightforward, and two key challenges lead to severe catastrophic performance degradation: (i) the sample efficacy of calibration datasets in post-training quantization (PTQ) is largely eliminated by the cache operation; (ii) the combination of the above mechanisms introduces more severe exposure bias within the sampling distribution, resulting in amplified error accumulation in the image generation process. In this work, we take advantage of these two acceleration mechanisms and propose a hybrid acceleration method by tackling the above challenges, aiming to further improve the efficiency of DiTs while maintaining excellent generation capability. Concretely, a temporal-aware parallel clustering (TAP) is designed to dynamically improve the sample selection efficacy for calibration within PTQ for different diffusion steps. A variance compensation (VC) strategy is derived to correct the sampling distribution, mitigating exposure bias through adaptive correction factor generation. Extensive experiments have shown that our method accelerates DiTs by 12.7x while preserving competitive generation capability. The code will be available at https://github.com/xinding-sys/Quant-Cache.
中文摘要:本研究提出了一种结合量化和缓存机制的混合加速方法,通过时序感知并行聚类和方差补偿策略解决关键挑战,在保持生成质量的同时实现了Diffusion Transformers 12.7倍的加速效果。
English Summary: This study introduces a hybrid acceleration method for Diffusion Transformers that combines quantization and cache mechanisms, addressing key challenges through temporal-aware parallel clustering and variance compensation to achieve a 12.7x speedup while maintaining generation quality.

Authors:Wei Luo, Yunkang Cao, Haiming Yao, Xiaotian Zhang, Jianan Lou, Yuqi Cheng, Weiming Shen, Wenyong Yu
Title: Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection
Abstract:
Anomaly detection (AD) is essential for industrial inspection, yet existing methods typically rely on "comparing" test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the testing image. These INPs then guide the INP-Guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Code is available at: https://github.com/luow23/INP-Former.
Chinese: INP-Former提出了一种新颖的异常检测方法,直接从测试图像中提取内在正常原型来重建正常区域,在多个基准测试中实现了最先进的性能,并展现出零样本检测能力。
English: INP-Former introduces a novel anomaly detection method that extracts intrinsic normal prototypes directly from test images to reconstruct normal regions, achieving state-of-the-art performance across various benchmarks and demonstrating zero-shot capability.
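
A schematic sketch of the idea summarised above: a small module forms a few prototypes as attention-weighted combinations of the test image's own tokens, reconstructs every token from those prototypes, and uses the per-token reconstruction error as an anomaly score. Layer sizes and the single-step reconstruction are simplifications of our own, not the INP-Former modules.

import torch
import torch.nn as nn

class IntrinsicPrototypeScorer(nn.Module):
    def __init__(self, dim: int = 256, n_protos: int = 6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_protos, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch tokens of the test image itself.
        attn = torch.softmax(self.queries @ tokens.transpose(1, 2), dim=-1)  # (B, P, N)
        protos = attn @ tokens                                               # (B, P, dim)
        # Reconstruct each token as a soft combination of the prototypes.
        recon_attn = torch.softmax(tokens @ protos.transpose(1, 2), dim=-1)  # (B, N, P)
        recon = recon_attn @ protos                                          # (B, N, dim)
        return (tokens - recon).pow(2).mean(dim=-1)   # per-token anomaly score (B, N)

tokens = torch.randn(1, 196, 256)   # e.g. 14x14 patch tokens
scores = IntrinsicPrototypeScorer()(tokens)
print(scores.shape)                 # torch.Size([1, 196])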

Authors:Jiesi Hu, Chenfei Ye, Yanwu Yang, Xutao Guo, Yang Shang, Pengcheng Shi, Hanyang Peng, Ting Ma
Title: Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D
Abstract:
In-context learning (ICL), a type of universal model, demonstrates exceptional generalization across a wide range of tasks without retraining by leveraging task-specific guidance from context, making it particularly effective for the intricate demands of neuroimaging. However, current ICL models, limited to 2D inputs and thus exhibiting suboptimal performance, struggle to extend to 3D inputs due to the high memory demands of ICL. In this regard, we introduce Neuroverse3D, an ICL model capable of performing multiple neuroimaging tasks in 3D (e.g., segmentation, denoising, inpainting). Neuroverse3D overcomes the large memory consumption associated with 3D inputs through adaptive parallel-sequential context processing and a U-shaped fusion strategy, allowing it to handle an unlimited number of context images. Additionally, we propose an optimized loss function to balance multi-task training and enhance focus on anatomical boundaries. Our study incorporates 43,674 3D multi-modal scans from 19 neuroimaging datasets and evaluates Neuroverse3D on 14 diverse tasks using held-out test sets. The results demonstrate that Neuroverse3D significantly outperforms existing ICL models and closely matches task-specific models, enabling flexible adaptation to medical center variations without retraining. The code and model weights are publicly available at https://github.com/jiesihu/Neuroverse3D.
中文: Neuroverse3D是一种创新的上下文学习模型,通过自适应并行-顺序上下文处理技术突破内存限制,能在无需重新训练的情况下处理3D神经影像任务,性能显著优于现有模型并接近专用模型水平。
English: Neuroverse3D is a novel in-context learning model that overcomes memory limitations to handle 3D neuroimaging tasks, outperforming existing models and matching task-specific performance without retraining.

Authors:Xinyu Wang, Bohan Zhuang, Qi Wu
Title: Are Large Vision Language Models Good Game Players?
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. However, existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering and image captioning, often fail to capture the full scope of LVLMs' capabilities. These benchmarks are limited by issues such as inadequate assessment of detailed visual perception, data contamination, and a lack of focus on multi-turn reasoning. To address these challenges, we propose \method{}, a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs' cognitive and reasoning skills in structured environments. \method{} uses a set of games to evaluate LVLMs on four core tasks: Perceiving, Question Answering, Rule Following, and End-to-End Playing, with each target task designed to assess specific abilities, including visual perception, reasoning, decision-making, etc. Based on this framework, we conduct extensive experiments that explore the limitations of current LVLMs, such as handling long structured outputs and perceiving detailed and dense elements. Code and data are publicly available at https://github.com/xinke-wang/LVLM-Playground.
中文: 作者提出了\method{},一种基于游戏的评估框架,通过感知、问答、规则遵循和端到端游戏四个核心任务全面评估大型视觉语言模型的认知与推理能力,解决了现有基准在详细视觉感知评估和多轮推理关注不足等局限性。
English: The authors introduce \method{}, a game-based evaluation framework that comprehensively assesses Large Vision Language Models' cognitive and reasoning abilities through four core tasks, addressing limitations in current benchmarks like inadequate visual perception assessment and lack of multi-turn reasoning focus.

Authors:Zicheng Zhang, Tengchuan Kou, Shushi Wang, Chunyi Li, Wei Sun, Wei Wang, Xiaoyu Li, Zongyu Wang, Xuezhi Cao, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai
Title: Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
Abstract:
Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to Scaling Law, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for the mentioned two aspects. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with context prompt, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. Data and codes will be available at https://github.com/zzc-1998/Q-Eval.
中文: Q-EVAL-100K数据集通过大规模人工标注,使Q-Eval-Score模型在评估文本到视觉内容的质量和对齐度方面表现出卓越性能和泛化能力。
English: The Q-EVAL-100K dataset, featuring extensive human annotations, enables the Q-Eval-Score model to excel in evaluating text-to-vision content's visual quality and alignment with superior performance and generalization.

Authors:Gen Shi, Hui Zhang, Jie Tian
Title: COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation
Abstract:
Accurate segmentation of 3D vascular structures is essential for various medical imaging applications. The dispersed nature of vascular structures leads to inherent spatial uncertainty and necessitates location awareness, yet most current 3D medical segmentation models rely on the patch-wise training strategy that usually loses this spatial context. In this study, we introduce the Coordinate-aware Modulated Mamba Network (COMMA) and contribute a manually labeled dataset of 570 cases, the largest publicly available 3D vessel dataset to date. COMMA leverages both entire and cropped patch data through global and local branches, ensuring robust and efficient spatial location awareness. Specifically, COMMA employs a channel-compressed Mamba (ccMamba) block to encode entire image data, capturing long-range dependencies while optimizing computational costs. Additionally, we propose a coordinate-aware modulated (CaM) block to enhance interactions between the global and local branches, allowing the local branch to better perceive spatial information. We evaluate COMMA on six datasets, covering two imaging modalities and five types of vascular tissues. The results demonstrate COMMA's superior performance compared to state-of-the-art methods with computational efficiency, especially in segmenting small vessels. Ablation studies further highlight the importance of our proposed modules and spatial information. The code and data will be open source at https://github.com/shigen-StoneRoot/COMMA.
中文摘要:本研究提出坐标感知调制Mamba网络(COMMA),通过全局和局部分支增强空间定位能力,在多个数据集的血管分割任务中表现出卓越性能,尤其在小血管分割方面优势明显。
English Summary: This study introduces the Coordinate-aware Modulated Mamba Network (COMMA) for 3D vascular segmentation, which enhances spatial location awareness through global and local branches and demonstrates superior performance in segmenting small vessels across multiple datasets.

Authors:Guotao Shen, Ziheng Yan, Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm
Title: Exploring Simple Siamese Network for High-Resolution Video Quality Assessment
Abstract:
In the research of video quality assessment (VQA), the two-branch network has emerged as a promising solution. It decouples VQA into separate technical and aesthetic branches to measure the perception of low-level distortions and high-level semantics respectively. However, we argue that while technical and aesthetic perspectives are complementary, the technical perspective itself should be measured in a semantic-aware manner. We hypothesize that the existing technical branch struggles to perceive the semantics of high-resolution videos, as it is trained on local mini-patches sampled from videos. This issue can be hidden by apparently good results on low-resolution videos, but indeed becomes critical for high-resolution VQA. This work introduces SiamVQA, a simple but effective Siamese network for high-resolution VQA. SiamVQA shares weights between technical and aesthetic branches, enhancing the semantic perception ability of the technical branch to facilitate technical-quality representation learning. Furthermore, it integrates a dual cross-attention layer for fusing technical and aesthetic features. SiamVQA achieves state-of-the-art accuracy on high-resolution benchmarks, and competitive results on lower-resolution benchmarks. Codes will be available at: https://github.com/srcn-ivl/SiamVQA
English Summary: SiamVQA introduces a Siamese network with shared weights between technical and aesthetic branches to enhance semantic perception in technical quality assessment, achieving state-of-the-art performance on high-resolution video benchmarks.

Authors:Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm
Title: Unified Arbitrary-Time Video Frame Interpolation and Prediction
Abstract:
Video frame interpolation and prediction aim to synthesize frames in-between and subsequent to existing frames, respectively. Despite being closely-related, these two tasks are traditionally studied with different model architectures, or the same architecture but individually trained weights. Furthermore, while arbitrary-time interpolation has been extensively studied, the value of arbitrary-time prediction has been largely overlooked. In this work, we present uniVIP - unified arbitrary-time Video Interpolation and Prediction. Technically, we first extend an interpolation-only network for arbitrary-time interpolation and prediction, with a special input channel for task (interpolation or prediction) encoding. Then, we show how to train a unified model on common triplet frames. Our uniVIP provides competitive results for video interpolation, and outperforms existing state-of-the-art methods for video prediction. Codes will be available at: https://github.com/srcn-ivl/uniVIP
中文: 本文提出了uniVIP模型,统一处理任意时间点的视频插帧和预测任务,在插帧方面表现优异,在预测方面超越了现有最优方法。
English: This paper introduces uniVIP, a unified model for arbitrary-time video interpolation and prediction that achieves competitive results in interpolation and outperforms state-of-the-art methods in prediction.
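
A minimal sketch of how such a task-encoding input channel might be assembled alongside the two context frames and a time map, so a single network can serve both interpolation and prediction; the channel layout and the 0/1 task convention below are assumptions for illustration, not the paper's exact design.

import torch

def build_univip_input(frame0: torch.Tensor, frame1: torch.Tensor, t: float, task: str) -> torch.Tensor:
    """Stack two context frames with a time map and a task-encoding channel.

    frame0, frame1: (B, 3, H, W) RGB frames
    t: target time (e.g. 0.5 for mid-interpolation, 1.5 for prediction beyond frame1)
    task: 'interp' or 'predict'; encoding it as a constant 0/1 channel is an assumption
    """
    b, _, h, w = frame0.shape
    time_map = torch.full((b, 1, h, w), t)
    task_map = torch.full((b, 1, h, w), 0.0 if task == "interp" else 1.0)
    return torch.cat([frame0, frame1, time_map, task_map], dim=1)  # (B, 8, H, W)

# Example: one call prepares an interpolation input, another a prediction input.
x_interp = build_univip_input(torch.rand(1, 3, 256, 448), torch.rand(1, 3, 256, 448), t=0.5, task="interp")
x_pred = build_univip_input(torch.rand(1, 3, 256, 448), torch.rand(1, 3, 256, 448), t=1.5, task="predict")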

Authors:Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang
Title: A Token-level Text Image Foundation Model for Document Understanding
Abstract:
In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at https://github.com/Token-family/TokenFD.
中文:TokenOCR是首个专为文本图像任务设计的令牌级视觉基础模型,通过创新的数据集和流程解决了预测错误问题,显著提升了文档理解等下游应用的性能。
English: TokenOCR is the first token-level visual foundation model developed to address fundamental prediction errors in text-image tasks by leveraging a novel dataset and pipeline, enhancing performance in downstream applications like document understanding.

Authors:Tong Liang, Jim Davis
Title: Making Better Mistakes in CLIP-Based Zero-Shot Classification with Hierarchy-Aware Language Prompts
Abstract:
Recent studies are leveraging advancements in large language models (LLMs) trained on extensive internet-crawled text data to generate textual descriptions of downstream classes in CLIP-based zero-shot image classification. While most of these approaches aim at improving accuracy, our work focuses on "making better mistakes", of which the mistakes' severities are derived from the given label hierarchy of downstream tasks. Since CLIP's image encoder is trained with language supervising signals, it implicitly captures the hierarchical semantic relationships between different classes. This motivates our goal of making better mistakes in zero-shot classification, a task for which CLIP is naturally well-suited. Our approach (HAPrompts) queries the language model to produce textual representations for given classes as zero-shot classifiers of CLIP to perform image classification on downstream tasks. To our knowledge, this is the first work to introduce making better mistakes in CLIP-based zero-shot classification. Our approach outperforms the related methods in a holistic comparison across five datasets of varying scales with label hierarchies of different heights in our experiments. Our code and LLM-generated image prompts: https://github.com/ltong1130ztr/HAPrompts.
Chinese: 本研究提出HAPrompts方法,利用大语言模型为基于CLIP的零样本图像分类生成层次感知文本提示,通过依据标签层级结构产生语义合理的分类错误,在多个数据集上展现出优越性能。
English: This study introduces HAPrompts, a novel approach that leverages large language models to generate hierarchical-aware textual prompts for CLIP-based zero-shot image classification, focusing on making semantically meaningful mistakes according to label hierarchies and demonstrating superior performance across multiple datasets.

Authors:Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen
Title: WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation
Abstract:
Object Goal Navigation-requiring an agent to locate a specific object in an unseen environment-remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes the online maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for navigation policy. By decomposing according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model plan and observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.

Authors:Bo Cheng, Jueqing Lu, Yuan Tian, Haifeng Zhao, Yi Chang, Lan Du
Title: CGMatch: A Different Perspective of Semi-supervised Learning
Abstract:
Semi-supervised learning (SSL) has garnered significant attention due to its ability to leverage limited labeled data and a large amount of unlabeled data to improve model generalization performance. Recent approaches achieve impressive successes by combining ideas from both consistency regularization and pseudo-labeling. However, these methods tend to underperform in the more realistic situations with relatively scarce labeled data. We argue that this issue arises because existing methods rely solely on the model's confidence, making it challenging to accurately assess the model's state and identify unlabeled examples that contribute to the training phase when supervision information is limited, especially during the early stages of model training. In this paper, we propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap (CG). We demonstrate that CG is effective in discovering unlabeled examples beneficial for model training. Along with confidence, a commonly used metric in SSL, we propose a fine-grained dynamic selection (FDS) strategy. This strategy dynamically divides the unlabeled dataset into three subsets with different characteristics: easy-to-learn set, ambiguous set, and hard-to-learn set. By selectively filtering these subsets and applying corresponding regularization to the selected subsets, we mitigate the negative impact of incorrect pseudo-labels on model optimization and generalization. Extensive experimental results on several common SSL benchmarks indicate the effectiveness of CGMatch especially when the labeled data are particularly limited. Source code is available at https://github.com/BoCheng-96/CGMatch.
Chinese: 半监督学习在标记数据稀缺时表现不佳,而提出的CGMatch模型通过引入计数间隙指标和细粒度动态选择策略,能有效识别有益的无标记样本并提升性能。
English: Semi-supervised learning faces challenges with scarce labeled data, but the proposed CGMatch model introduces a Count-Gap metric and fine-grained dynamic selection strategy to effectively identify beneficial unlabeled examples and improve performance.

Authors:Haoyuan Li, Ziqin Ye, Yue Hao, Weiyang Lin, Chao Ye
Title: DQO-MAP: Dual Quadrics Multi-Object mapping with Gaussian Splatting
Abstract:
Accurate object perception is essential for robotic applications such as object navigation. In this paper, we propose DQO-MAP, a novel object-SLAM system that seamlessly integrates object pose estimation and reconstruction. We employ 3D Gaussian Splatting for high-fidelity object reconstruction and leverage quadrics for precise object pose estimation. Management of both is handled on the CPU, while optimization is performed on the GPU, significantly improving system efficiency. By associating objects with unique IDs, our system enables rapid object extraction from the scene. Extensive experimental results on object reconstruction and pose estimation demonstrate that DQO-MAP achieves outstanding performance in terms of precision, reconstruction quality, and computational efficiency. The code and dataset are available at: https://github.com/LiHaoy-ux/DQO-MAP.
中文: DQO-MAP是一种新型物体SLAM系统,通过结合3D高斯泼溅实现高保真物体重建和二次曲面精确定位,在精度、重建质量和计算效率方面均表现出卓越性能。
English: DQO-MAP is a novel object-SLAM system that integrates 3D Gaussian Splatting for high-fidelity object reconstruction and quadrics for precise pose estimation, achieving superior performance in precision, reconstruction quality, and computational efficiency.

Authors:Zhihua Shen, Siyang Chen, Han Wang, Tongsu Zhang, Xiaohu Zhang, Xiangpeng Xu, Xia Yang
Title: Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection
Abstract:
Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves an improvement of 5.63% / 18.36% in nIoU, while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at https://github.com/ZhihuaShen/LVNet.
中文: LVNet提出了一种结合多尺度CNN前端和U型视频Transformer的混合架构,通过优化底层特征学习和时空上下文建模,在红外小目标多帧检测中实现了更高精度,同时大幅降低了参数量和计算开销。
English: LVNet introduces a hybrid CNN-Transformer architecture with a multi-scale CNN frontend for enhanced low-level feature extraction and a U-shaped video Transformer for spatiotemporal modeling, achieving superior performance in multi-frame infrared small target detection with significantly reduced parameters and computational costs.

Authors:Shuo Wang, Tong Ren, Nan Cheng, Rong Wang, Li Zhang
Title: Time-Varying Coronary Artery Deformation: A Dynamic Skinning Framework for Surgical Training
Abstract:
Purpose: This study proposes a novel anatomically-driven dynamic modeling framework for coronary arteries using skeletal skinning weights computation, aiming to achieve precise control over vessel deformation while maintaining real-time performance for surgical simulation applications. Methods: We developed a computational framework based on biharmonic energy minimization for skinning weight calculation, incorporating volumetric discretization through tetrahedral mesh generation. The method implements temporal sampling and interpolation for continuous vessel deformation throughout the cardiac cycle, with mechanical constraints and volume conservation enforcement. The framework was validated using clinical datasets from 5 patients, comparing interpolated deformation results against ground truth data obtained from frame-by-frame segmentation across cardiac phases. Results: The proposed framework effectively handled interactive vessel manipulation. Geometric accuracy evaluation showed mean Hausdorff distance of 4.96 ± 1.78 mm and mean surface distance of 1.78 ± 0.75 mm between interpolated meshes and ground truth models. The Branch Completeness Ratio achieved 1.82 ± 0.46, while Branch Continuity Score maintained 0.84 ± 0.06 (scale 0-1) across all datasets. The system demonstrated capability in supporting real-time guidewire-vessel collision detection and contrast medium flow simulation throughout the complete coronary tree structure. Conclusion: Our skinning weight-based methodology enhances model interactivity and applicability while maintaining geometric accuracy. The framework provides a more flexible technical foundation for virtual surgical training systems, demonstrating promising potential for both clinical practice and medical education applications. The code is available at https://github.com/ipoirot/DynamicArtery.
中文摘要:本研究提出一种基于骨骼蒙皮权重的动态冠状动脉建模框架,通过临床数据验证表明该方法在保持几何精度的同时,能够实现实时交互式血管变形控制,适用于手术模拟应用。
English Summary: This study introduces a dynamic coronary artery modeling framework using skeletal skinning weights to enable real-time deformation control for surgical simulations, validated with clinical data showing high geometric accuracy and interactive performance.
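
For reference, a schematic of the linear blend skinning deformation and the biharmonic weight objective the abstract alludes to; the notation below is generic and not taken from the paper.

% Vessel vertex v_i deformed by bone transforms T_j with skinning weights w_{ij} (schematic):
\[
v_i' = \sum_{j=1}^{m} w_{ij}\, T_j\, v_i, \qquad \sum_{j=1}^{m} w_{ij} = 1, \quad w_{ij} \ge 0,
\]
% with the weights obtained by minimizing a biharmonic energy over the tetrahedral mesh:
\[
\min_{w_1, \dots, w_m} \; \sum_{j=1}^{m} \int_{\Omega} \lVert \Delta w_j \rVert^2 \, dV \quad \text{subject to the constraints above.}
\]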

Authors:Toan Nguyen, Kien Do, Duc Kieu, Thin Nguyen
Title: h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
Abstract:
We introduce a theoretical framework for diffusion-based image editing by formulating it as a reverse-time bridge modeling problem. This approach modifies the backward process of a pretrained diffusion model to construct a bridge that converges to an implicit distribution associated with the editing target at time 0. Building on this framework, we propose h-Edit, a novel editing method that utilizes Doob's h-transform and Langevin Monte Carlo to decompose the update of an intermediate edited sample into two components: a "reconstruction" term and an "editing" term. This decomposition provides flexibility, allowing the reconstruction term to be computed via existing inversion techniques and enabling the combination of multiple editing terms to handle complex editing tasks. To our knowledge, h-Edit is the first training-free method capable of performing simultaneous text-guided and reward-model-based editing. Extensive experiments, both quantitative and qualitative, show that h-Edit outperforms state-of-the-art baselines in terms of editing effectiveness and faithfulness. Our source code is available at https://github.com/nktoan/h-edit.
中文摘要:我们提出h-Edit这一无需训练的扩散图像编辑方法,通过Doob's h变换将编辑过程分解为重建和编辑两个组件,在文本引导与奖励模型协同编辑方面实现了最优性能。
English Summary: We propose h-Edit, a training-free diffusion-based image editing method that employs Doob's h-transform to decompose edits into reconstruction and editing components, achieving superior performance in simultaneous text-guided and reward-model-based editing.
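
A schematic of the decomposed update described above, written as a pretrained reverse diffusion step plus an h-transform drift; the symbols are illustrative and do not reproduce the paper's exact notation.

% One intermediate editing step at time t (schematic):
\[
x_{t-1} = \underbrace{\mu_\theta(x_t, t)}_{\text{reconstruction term}}
 + \underbrace{\sigma_t^2\, \nabla_{x_t} \log h(x_t, t)}_{\text{editing term}}
 + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),
\]
% where h encodes the editing target (text guidance, a reward model, or several such terms summed)
% and the reconstruction term can be computed with an existing inversion technique.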

Authors:Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Title: DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, are proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as Max-Min Diversity Problem (MMDP) where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at https://github.com/vbdi/divprune.
中文: 提出的DivPrune方法将令牌剪枝构建为最大最小多样性问题,通过最大化所选视觉令牌的多样性,在无需微调的情况下实现了跨多个数据集的最优精度,同时降低了延迟和内存使用。
English: The proposed DivPrune method formulates token pruning as a Max-Min Diversity Problem to maximize diversity among selected visual tokens, achieving state-of-the-art accuracy across multiple datasets while reducing latency and memory usage without requiring fine-tuning.
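
A minimal sketch of max-min diversity selection over visual tokens using greedy farthest-point sampling with cosine distance; both the greedy strategy and the distance choice are assumptions for illustration rather than the authors' exact MMDP solver.

import torch

def maxmin_diverse_select(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Greedily pick k mutually distant tokens (farthest-point sampling).

    tokens: (N, D) visual token embeddings
    returns: indices of the k selected tokens
    """
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    dist = 1.0 - normed @ normed.T                 # (N, N) cosine distance matrix
    selected = [int(dist.sum(dim=1).argmax())]     # seed with the most spread-out token
    min_dist = dist[selected[0]].clone()           # each token's distance to the selected set
    for _ in range(k - 1):
        nxt = int(min_dist.argmax())               # token farthest from the current subset
        selected.append(nxt)
        min_dist = torch.minimum(min_dist, dist[nxt])
    return torch.tensor(selected)

# Example: prune 576 visual tokens down to 64 before they reach the LLM.
visual_tokens = torch.randn(576, 1024)
pruned = visual_tokens[maxmin_diverse_select(visual_tokens, k=64)]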

Authors:Boyong He, Yuxiang Ji, Qianwen Ye, Zhuoyue Tan, Liaoni Wu
Title: Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection
Abstract:
Domain generalization (DG) for object detection aims to enhance detectors' performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios. The code is available at https://github.com/heboyong/Generalized-Diffusion-Detector.
中文: 本研究提出了一种新颖的物体检测领域泛化方法,通过提取扩散模型的多步中间特征获取领域不变特征,并采用知识转移框架实现特征对齐,在不增加推理时间的情况下比现有方法显著提升14.0% mAP。
English: This study introduces a novel domain generalization method for object detection that leverages intermediate features from diffusion models to achieve domain-invariant representations and employs a knowledge transfer framework for feature alignment, resulting in significant performance improvements of up to 14.0% mAP over existing approaches without increasing inference time.

Authors:Ayush Gaggar, Todd D. Murphey
Title: Data Augmentation for NeRFs in the Low Data Limit
Abstract:
Current methods based on Neural Radiance Fields fail in the low data limit, particularly when training on incomplete scene data. Prior works augment training data only in next-best-view applications, which lead to hallucinations and model collapse with sparse data. In contrast, we propose adding a set of views during training by rejection sampling from a posterior uncertainty distribution, generated by combining a volumetric uncertainty estimator with spatial coverage. We validate our results on partially observed scenes; on average, our method performs 39.9% better with 87.5% less variability across established scene reconstruction benchmarks, as compared to state of the art baselines. We further demonstrate that augmenting the training set by sampling from any distribution leads to better, more consistent scene reconstruction in sparse environments. This work is foundational for robotic tasks where augmenting a dataset with informative data is critical in resource-constrained, a priori unknown environments. Videos and source code are available at https://murpheylab.github.io/low-data-nerf/.
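
A minimal sketch of rejection-sampling additional training views from a posterior proportional to volumetric uncertainty times spatial coverage; the per-candidate scores are placeholders for whatever uncertainty estimator and coverage measure are available, not the paper's exact quantities.

import numpy as np

def rejection_sample_views(candidates, uncertainty, coverage, n_views, seed=0):
    """Pick n_views candidate poses by rejection sampling.

    candidates:  list of candidate camera poses
    uncertainty: (N,) volumetric uncertainty score per candidate (placeholder)
    coverage:    (N,) spatial-coverage score per candidate (placeholder)
    """
    rng = np.random.default_rng(seed)
    posterior = np.asarray(uncertainty) * np.asarray(coverage)
    accept_prob = posterior / posterior.max()      # acceptance probability in (0, 1]
    chosen = []
    while len(chosen) < n_views:
        i = int(rng.integers(len(candidates)))     # propose uniformly
        if i not in chosen and rng.random() < accept_prob[i]:   # accept or reject
            chosen.append(i)
    return [candidates[i] for i in chosen]

# Example with 100 hypothetical candidate poses and random scores:
views = rejection_sample_views(list(range(100)), np.random.rand(100), np.random.rand(100), n_views=5)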

Authors:Rustin Soraki, Huayu Wang, Joann G. Elmore, Linda Shapiro
Title: CrossFusion: A Multi-Scale Cross-Attention Convolutional Fusion Model for Cancer Survival Prediction
Abstract:
Cancer survival prediction from whole slide images (WSIs) is a challenging task in computational pathology due to the large size, irregular shape, and high granularity of the WSIs. These characteristics make it difficult to capture the full spectrum of patterns, from subtle cellular abnormalities to complex tissue interactions, which are crucial for accurate prognosis. To address this, we propose CrossFusion, a novel multi-scale feature integration framework that extracts and fuses information from patches across different magnification levels. By effectively modeling both scale-specific patterns and their interactions, CrossFusion generates a rich feature set that enhances survival prediction accuracy. We validate our approach across six cancer types from public datasets, demonstrating significant improvements over existing state-of-the-art methods. Moreover, when coupled with domain-specific feature extraction backbones, our method shows further gains in prognostic performance compared to general-purpose backbones. The source code is available at: https://github.com/RustinS/CrossFusion
Chinese: CrossFusion是一种新颖的多尺度特征融合框架,通过有效整合不同放大倍数下的全切片图像信息,显著提升了六种癌症类型的生存预测准确性。
English: CrossFusion is a novel multi-scale feature integration framework that enhances cancer survival prediction accuracy by effectively fusing information from whole slide images at different magnification levels, demonstrating significant improvements across six cancer types.

Authors:Ruth Crasto
Title: Robustness to Geographic Distribution Shift Using Location Encoders
Abstract:
Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at test time. The most common approaches to tackling geographic distribution shift treat regions delimited by administrative boundaries such as countries or continents as separate domains and apply standard domain adaptation methods, ignoring geographic coordinates that are often available as metadata. This paper proposes the use of location encoders for modeling continuous, learnable domain assignment. We show how both non-parametric sine-cosine encoders and pre-trained location encoders can be used in conjunction with standard domain adaptation methods for improved robustness to geographic distribution shift. Our proposed methods achieve new state-of-the-art results on two geo-tagged remote sensing datasets from the WILDS benchmark. We have made our code publicly available at: https://github.com/crastoru/wilds-geoshift.
中文摘要:本文提出使用位置编码器来建模连续可学习的域分配,通过将其与标准域适应方法结合,在两个地理标记遥感数据集上实现了最先进的性能,有效应对地理分布偏移问题。
English Summary: This paper introduces location encoders to model continuous domain assignments for addressing geographic distribution shift, achieving state-of-the-art results on geo-tagged datasets by integrating them with standard domain adaptation methods.
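
A minimal sketch of a non-parametric sine-cosine encoding of (latitude, longitude) that could serve as a continuous domain-assignment input alongside image features; the multi-scale frequencies are an illustrative choice, not the paper's exact encoder or a pretrained one.

import numpy as np

def sincos_location_encoding(lat_deg: float, lon_deg: float, num_freqs: int = 4) -> np.ndarray:
    """Encode a geographic coordinate as multi-frequency sine/cosine features."""
    lat, lon = np.deg2rad(lat_deg), np.deg2rad(lon_deg)
    feats = []
    for k in range(num_freqs):
        f = 2.0 ** k                               # doubling frequency per octave
        feats.extend([np.sin(f * lat), np.cos(f * lat),
                      np.sin(f * lon), np.cos(f * lon)])
    return np.asarray(feats, dtype=np.float32)

# The resulting vector can feed a small head that outputs soft domain assignments for a
# standard domain adaptation method, instead of hard country/continent labels.
z = sincos_location_encoding(45.5, -122.6)         # shape: (4 * num_freqs,)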

Authors:Zhusi Zhong, Yuli Wang, Lulu Bi, Zhuoqi Ma, Sun Ho Ahn, Christopher J. Mullin, Colin F. Greineder, Michael K. Atalay, Scott Collins, Grayson L. Baird, Cheng Ting Lin, Webster Stayman, Todd M. Kolb, Ihab Kamel, Harrison X. Bai, Zhicheng Jiao
Title: Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA
Abstract:
Medical imaging plays a pivotal role in modern healthcare, with computed tomography pulmonary angiography (CTPA) being a critical tool for diagnosing pulmonary embolism and other thoracic conditions. However, the complexity of interpreting CTPA scans and generating accurate radiology reports remains a significant challenge. This paper introduces Abn-BLIP (Abnormality-aligned Bootstrapping Language-Image Pretraining), an advanced diagnosis model that aligns abnormal findings to improve the accuracy and comprehensiveness of generated radiology reports. By leveraging learnable queries and cross-modal attention mechanisms, our model demonstrates superior performance in detecting abnormalities, reducing missed findings, and generating structured reports compared to existing methods. Our experiments show that Abn-BLIP outperforms state-of-the-art medical vision-language models and 3D report generation methods in both accuracy and clinical relevance. These results highlight the potential of integrating multimodal learning strategies for improving radiology reporting. The source code is available at https://github.com/zzs95/abn-blip.
中文: 本文提出的Abn-BLIP模型通过多模态学习对齐异常发现,在CTPA影像解读中展现出优于现有方法的准确性和临床实用性,显著提升了放射学报告的质量。
English: This paper presents Abn-BLIP, an advanced diagnostic model that enhances radiology reporting by aligning abnormal findings through multimodal learning, demonstrating superior accuracy and clinical relevance in CTPA interpretation compared to existing methods.

Authors:Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Title: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Abstract:
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at https://github.com/aimagelab/ReT.
Chinese: 本文提出ReT模型,采用多模态查询和文档,通过基于Transformer的循环单元实现多层次特征融合,在多个基准测试中取得了领先性能。
English: The paper introduces ReT, a novel cross-modal retrieval model that uses multimodal queries and documents with a Transformer-based recurrent cell for enhanced feature integration, achieving state-of-the-art results on benchmarks.
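
A schematic of an LSTM-inspired recurrent cell with a sigmoidal gate that folds per-layer text and image features into a running multi-level state, in the spirit of the description above; the exact cell used by ReT may differ.

import torch
import torch.nn as nn

class GatedFusionCell(nn.Module):
    """Recurrently merge per-layer text and image features into one state (schematic)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(3 * dim, dim)
        self.cand = nn.Linear(3 * dim, dim)

    def forward(self, state: torch.Tensor, txt_l: torch.Tensor, img_l: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([state, txt_l, img_l], dim=-1)
        g = torch.sigmoid(self.gate(joint))        # sigmoidal gate, as in classic LSTMs
        c = torch.tanh(self.cand(joint))           # candidate update from both modalities
        return g * c + (1.0 - g) * state           # gated blend of old state and update

# Iterate over features extracted from successive backbone layers.
cell = GatedFusionCell()
state = torch.zeros(2, 512)
layer_feats = [(torch.randn(2, 512), torch.randn(2, 512)) for _ in range(4)]
for txt_l, img_l in layer_feats:
    state = cell(state, txt_l, img_l)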

Authors:Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Elke Ruth Gizewski, Rainer Schubert
Title: What are You Looking at? Modality Contribution in Multimodal Medical Deep Learning Methods
Abstract:
Purpose: High-dimensional, multimodal data can nowadays be analyzed by large deep neural networks with little effort. Several fusion methods for bringing together different modalities have been developed. Particularly in the field of medicine, with its abundance of high-dimensional multimodal patient data, multimodal models represent the next step. However, how these models process the source information in detail remains largely underexplored. Methods: To this end, we implemented an occlusion-based, model- and performance-agnostic modality contribution method that quantitatively measures the importance of each modality in the dataset for the model to fulfill its task. We applied our method to three different multimodal medical problems for experimental purposes. Results: We found that some networks have modality preferences that tend toward unimodal collapse, while some datasets are imbalanced from the ground up. Moreover, we could determine a link between our metric and the performance of networks trained on single modalities. Conclusion: The information gained through our metric holds remarkable potential to improve the development of multimodal models and the creation of datasets in the future. With our method we make a crucial contribution to interpretability in deep learning based multimodal research, thereby notably advancing the integration of multimodal AI into clinical practice. Our code is publicly available at https://github.com/ChristianGappGit/MC_MMD.
中文: 本研究提出了一种基于遮挡的方法,用于定量评估多模态深度学习模型中各模态的重要性,揭示了模态偏好和数据集不平衡问题,并为临床AI整合提供了可解释性支持。
English: This study introduces an occlusion-based method to quantitatively assess the importance of individual modalities in multimodal deep learning models, revealing modality preferences and dataset imbalances while providing interpretability for clinical AI integration.
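
A minimal sketch of an occlusion-style modality contribution score: each modality is masked in turn and the resulting performance drop is normalized into a relative importance; masking by zeroing and the normalization are assumptions for illustration, not the paper's exact metric.

import numpy as np

def modality_contribution(model, batches, modalities, metric):
    """Relative importance of each modality for a multimodal model.

    model(inputs)     -> predictions, where inputs is a dict keyed by modality name
    batches           -> list of (inputs, targets) pairs
    metric(pred, tgt) -> scalar performance, higher is better
    """
    def score(occlude=None):
        vals = []
        for inputs, targets in batches:
            x = {k: (np.zeros_like(v) if k == occlude else v) for k, v in inputs.items()}
            vals.append(metric(model(x), targets))
        return float(np.mean(vals))

    base = score()                                            # full multimodal performance
    drops = {m: base - score(occlude=m) for m in modalities}  # loss when one modality is hidden
    total = sum(max(d, 0.0) for d in drops.values()) or 1.0
    return {m: max(d, 0.0) / total for m, d in drops.items()}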

Authors:Chenxu Dang, Zaipeng Duan, Pei An, Xinmin Zhang, Xuzhong Hu, Jie Ma
Title: FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection
Abstract:
Recent top-performing temporal 3D detectors based on Lidars have increasingly adopted region-based paradigms. They first generate coarse proposals, followed by encoding and fusing regional features. However, indiscriminate sampling and fusion often overlook the varying contributions of individual points and lead to exponentially increased complexity as the number of input frames grows. Moreover, arbitrary result-level concatenation limits the global information extraction. In this paper, we propose a Focal Token Acquiring-and-Scaling Transformer (FASTer), which dynamically selects focal tokens and condenses token sequences in an adaptive and lightweight manner. Emphasizing the contribution of individual tokens, we propose a simple but effective Adaptive Scaling mechanism to capture geometric contexts while sifting out focal points. Adaptively storing and processing only focal points in historical frames dramatically reduces the overall complexity. Furthermore, a novel Grouped Hierarchical Fusion strategy is proposed, progressively performing sequence scaling and Intra-Group Fusion operations to facilitate the exchange of global spatial and temporal information. Experiments on the Waymo Open Dataset demonstrate that our FASTer significantly outperforms other state-of-the-art detectors in both performance and efficiency while also exhibiting improved flexibility and robustness. The code is available at https://github.com/MSunDYY/FASTer.git.
中文: 提出的FASTer变换器通过动态选择并缩放焦点令牌,有效降低复杂度并优化特征融合,在Waymo开放数据集上以更优的性能和效率超越了现有最先进的检测器。
English: The proposed FASTer transformer dynamically selects and scales focal tokens to efficiently reduce complexity and enhance feature fusion, outperforming state-of-the-art detectors in both performance and efficiency on the Waymo Open Dataset.

Authors:Aleksei Zhuravlev, Zorah Lähner, Vladislav Golyanik
Title: Denoising Functional Maps: Diffusion Models for Shape Correspondence
Abstract:
Estimating correspondences between pairs of deformable shapes remains a challenging problem. Despite substantial progress, existing methods lack broad generalization capabilities and require category-specific training data. To address these limitations, we propose a fundamentally new approach to shape correspondence based on denoising diffusion models. In our method, a diffusion model learns to directly predict the functional map, a low-dimensional representation of a point-wise map between shapes. We use a large dataset of synthetic human meshes for training and employ two steps to reduce the number of functional maps that need to be learned. First, the maps refer to a template rather than shape pairs. Second, the functional map is defined in a basis of eigenvectors of the Laplacian, which is not unique due to sign ambiguity. Therefore, we introduce an unsupervised approach to select a specific basis by correcting the signs of eigenvectors based on surface features. Our model achieves competitive performance on standard human datasets, meshes with anisotropic connectivity, non-isometric humanoid shapes, as well as animals compared to existing descriptor-based and large-scale shape deformation methods. See our project page for the source code and the datasets.

Authors:Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, Hao Su
Title: Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning
Abstract:
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
中文:DEMO3框架通过将长时程机器人操作任务分解为子目标,并融合多阶段密集奖励学习、双阶段训练方案与世界模型学习,在稀疏奖励环境中实现了高达70%的数据效率提升。
English: The DEMO3 framework addresses the challenges of long-horizon robotic manipulation tasks by decomposing them into subgoals and integrating multi-stage dense reward learning, bi-phasic training, and world model learning, achieving up to 70% higher data efficiency in sparse-reward environments.

Authors:Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: Visual-RFT: Visual Reinforcement Fine-Tuning
Abstract:
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by 21.9 on COCO's two-shot setting and 15.4 on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
中文摘要:Visual-RFT通过可验证的视觉奖励函数优化大型视觉语言模型,将强化微调扩展至多模态领域,在数据稀缺的视觉任务中展现出卓越性能和泛化能力。
English Summary: Visual-RFT extends reinforcement fine-tuning to multi-modal tasks by using verifiable visual reward functions to optimize large vision-language models, achieving superior performance and generalization in visual tasks with limited data.
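
A minimal sketch of a verifiable IoU-based reward for the detection case mentioned above: each predicted box is scored against its best-matching ground-truth box and the mean becomes the rollout's reward; the greedy matching and mean aggregation are illustrative simplifications.

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def iou_reward(pred_boxes, gt_boxes):
    """Verifiable reward: mean best-IoU of predicted boxes against the ground truth."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    return sum(max(box_iou(p, g) for g in gt_boxes) for p in pred_boxes) / len(pred_boxes)

# Rollouts with higher reward receive larger advantages in a GRPO-style policy update.
reward = iou_reward([(10, 10, 50, 60)], [(12, 8, 48, 58)])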

Authors:Ali Tourani, Saad Ejaz, Hriday Bavle, David Morilla-Cabello, Jose Luis Sanchez-Lopez, Holger Voos
Title: vS-Graphs: Integrating Visual SLAM and Situational Graphs through Multi-level Scene Understanding
Abstract:
Current Visual Simultaneous Localization and Mapping (VSLAM) systems often struggle to create maps that are both semantically rich and easily interpretable. While incorporating semantic scene knowledge aids in building richer maps with contextual associations among mapped objects, representing them in structured formats like scene graphs has not been widely addressed, which hampers map comprehension and limits scalability. This paper introduces visual S-Graphs (vS-Graphs), a novel real-time VSLAM framework that integrates vision-based scene understanding with map reconstruction and comprehensible graph-based representation. The framework infers structural elements (i.e., rooms and corridors) from detected building components (i.e., walls and ground surfaces) and incorporates them into optimizable 3D scene graphs. This solution enhances the reconstructed map's semantic richness, comprehensibility, and localization accuracy. Extensive experiments on standard benchmarks and real-world datasets demonstrate that vS-Graphs outperforms state-of-the-art VSLAM methods, reducing trajectory error by an average of 3.38% and up to 9.58% on real-world data. Furthermore, the proposed framework achieves environment-driven semantic entity detection accuracy comparable to precise LiDAR-based frameworks using only visual features. A web page containing more media and evaluation outcomes is available at https://snt-arg.github.io/vsgraphs-results/.
中文: 本文提出vS-Graphs实时视觉SLAM框架,通过结合场景理解与三维场景图构建,提升了地图语义丰富度、可理解性和定位精度,在轨迹误差减少方面优于现有方法。
English: This paper introduces vS-Graphs, a real-time VSLAM framework that integrates scene understanding with 3D scene graph construction to enhance semantic richness, map comprehensibility, and localization accuracy, outperforming existing methods in trajectory error reduction.

Authors:Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
Title: Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation
Abstract:
Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed-often by large margins-while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at https://github.com/neilwen987/CSR_Adaptive_Rep
Chinese: 对比稀疏表示(CSR)提出了一种稀疏编码方法,将预训练嵌入转换为高维稀疏特征,相比嵌套表示学习,能以更高精度、更快检索速度和大幅缩短的训练时间实现自适应表示。
English: Contrastive Sparse Representation (CSR) introduces a sparse coding method that transforms pre-trained embeddings into high-dimensional sparse features, enabling adaptive representation with superior accuracy, faster retrieval, and significantly reduced training time compared to Matryoshka Representation Learning.
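
A minimal sketch of sparsifying a frozen embedding with a lightweight top-k autoencoding head, standing in for CSR's selectively activated high-dimensional code; the top-k activation and the omission of the task-aware contrastive objective are assumptions for illustration.

import torch
import torch.nn as nn

class TopKSparseHead(nn.Module):
    """Lift a frozen embedding to a high-dimensional code and keep the k largest activations."""

    def __init__(self, in_dim: int = 768, code_dim: int = 8192, k: int = 32):
        super().__init__()
        self.k = k
        self.encode = nn.Linear(in_dim, code_dim)
        self.decode = nn.Linear(code_dim, in_dim)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encode(x))
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)  # keep only k activations
        return sparse, self.decode(sparse)

head = TopKSparseHead()
x = torch.randn(4, 768)                              # frozen pre-trained embeddings
code, recon = head(x)
recon_loss = nn.functional.mse_loss(recon, x)        # lightweight autoencoding term
# The sparsity level k can be varied at inference to trade accuracy against retrieval cost.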

Authors:Wenhao Wang, Yi Yang
Title: VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
Abstract:
Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, our VideoUFO also features: (1) minimal (0.29%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. The VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips with specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) current 16 text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on worst-performing topics. The dataset and code are publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO and https://github.com/WangWenhao0716/BenchUFO under the CC BY 4.0 License.
中文: 本文提出了VideoUFO数据集,专门针对文本到视频模型在现实场景中的不足,通过收集109万条与用户关注点匹配的视频片段,有效提升了模型在薄弱主题上的生成效果。
English: This paper introduces VideoUFO, a novel video dataset designed to address the limitations of text-to-video models by aligning with real-world user interests, featuring over 1.09 million clips with minimal overlap from existing sources and demonstrating improved model performance on challenging topics.

Authors:Linhao Huang, Jing Yu
Title: ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts
Abstract:
Recent training-free layout-to-image diffusion models have demonstrated remarkable performance in generating high-quality images with controllable layouts. These models follow a one-stage framework: Encouraging the model to focus the attention map of each concept on its corresponding region by defining attention map-based losses. However, these models still struggle to accurately follow layouts with significant overlap, often leading to issues like attribute leakage and missing entities. In this paper, we propose ToLo, a two-stage, training-free layout-to-image generation framework for high-overlap layouts. Our framework consists of two stages: the aggregation stage and the separation stage, each with its own loss function based on the attention map. To provide a more effective evaluation, we partition the HRS dataset based on the Intersection over Union (IoU) of the input layouts, creating a new dataset for layout-to-image generation with varying levels of overlap. Through extensive experiments on this dataset, we demonstrate that ToLo significantly enhances the performance of existing methods when dealing with high-overlap layouts. Our code and dataset are available here: https://github.com/misaka12435/ToLo.
中文: 本文提出了ToLo,一种无需训练的双阶段框架,通过聚合和分离阶段的注意力损失改进高重叠布局的图像生成,并在按IoU分级的新数据集上验证了其有效性。
English: The paper introduces ToLo, a two-stage training-free framework that improves layout-to-image generation for high-overlap scenarios by using attention-based losses in aggregation and separation stages, validated on a new dataset partitioned by IoU levels.

Authors:Shuvendu Roy, Franklin Ogidi, Ali Etemad, Elham Dolatabadi, Arash Afkanpour
Title: A Shared Encoder Approach to Multimodal Representation Learning
Abstract:
Multimodal representation learning has demonstrated remarkable potential in enabling models to process and integrate diverse data modalities, such as text and images, for improved understanding and performance. While the medical domain can benefit significantly from this paradigm, the scarcity of paired multimodal data and reliance on proprietary or pretrained encoders pose significant challenges. In this work, we present a shared encoder framework for multimodal representation learning tailored to the medical domain. Our approach employs a single set of encoder parameters shared across modalities, augmented with learnable modality features. Empirical results demonstrate that our shared encoder idea achieves superior performance compared to separate modality-specific encoders, demonstrating improved generalization in data-constrained settings. Notably, the performance gains are more pronounced with fewer training examples, underscoring the efficiency of our shared encoder framework for real-world medical applications with limited data. Our code and experiment setup are available at https://github.com/VectorInstitute/shared_encoder.
Chinese: 本文提出了一种适用于医疗领域的多模态表示学习共享编码器框架,通过跨模态共享编码器参数,在数据有限的情况下实现了更优的性能和泛化能力。
English: This paper introduces a shared encoder framework for multimodal representation learning in the medical domain, which uses a single set of encoder parameters across different data modalities and achieves superior performance and generalization, especially with limited training data.
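
A minimal PyTorch sketch of the shared-encoder idea: one set of Transformer encoder weights processes every modality, with a learnable per-modality feature added to the input tokens; the layer sizes and the additive modality embedding are assumptions for illustration.

import torch
import torch.nn as nn

class SharedModalityEncoder(nn.Module):
    """Single encoder shared across modalities, plus learnable modality features."""

    def __init__(self, dim: int = 256, num_modalities: int = 2, depth: int = 4):
        super().__init__()
        self.modality_embed = nn.Embedding(num_modalities, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor, modality_id: int) -> torch.Tensor:
        # tokens: (B, L, dim) patch or word embeddings from one modality
        m = self.modality_embed(torch.tensor(modality_id, device=tokens.device))
        return self.encoder(tokens + m)              # same weights, modality-specific offset

enc = SharedModalityEncoder()
img_repr = enc(torch.randn(2, 49, 256), modality_id=0)   # image tokens
txt_repr = enc(torch.randn(2, 32, 256), modality_id=1)   # text tokens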

Authors:Yingxue Xu, Fengtao Zhou, Chenyu Zhao, Yihui Wang, Can Yang, Hao Chen
Title: Distilled Prompt Learning for Incomplete Multimodal Survival Prediction
Abstract:
The integration of multimodal data including pathology images and gene profiles is widely applied in precise survival prediction. Despite recent advances in multimodal survival models, collecting complete modalities for multimodal fusion still poses a significant challenge, hindering their application in clinical settings. Current approaches tackling incomplete modalities often fall short, as they typically compensate for only a limited part of the knowledge of missing modalities. To address this issue, we propose a Distilled Prompt Learning framework (DisPro) to utilize the strong robustness of Large Language Models (LLMs) to missing modalities, which employs two-stage prompting for compensation of comprehensive information for missing modalities. In the first stage, Unimodal Prompting (UniPro) distills the knowledge distribution of each modality, preparing for supplementing modality-specific knowledge of the missing modality in the subsequent stage. In the second stage, Multimodal Prompting (MultiPro) leverages available modalities as prompts for LLMs to infer the missing modality, which provides modality-common information. Simultaneously, the unimodal knowledge acquired in the first stage is injected into multimodal inference to compensate for the modality-specific knowledge of the missing modality. Extensive experiments covering various missing scenarios demonstrated the superiority of the proposed method. The code is available at https://github.com/Innse/DisPro.
中文摘要:提出的蒸馏提示学习框架(DisPro)通过两阶段提示方法,利用大型语言模型的强大鲁棒性,有效补偿缺失模态的特定信息和通用信息,解决了生存预测中多模态数据不完整的难题。
English Summary: The proposed Distilled Prompt Learning framework (DisPro) effectively addresses incomplete multimodal data in survival prediction by leveraging Large Language Models through two-stage prompting to compensate for both modality-specific and modality-common information of missing modalities.

Authors:Dianyi Yang, Yu Gao, Xihan Wang, Yufeng Yue, Yi Yang, Mengyin Fu
Title: OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding
Abstract:
Recent advancements in 3D Gaussian Splatting have significantly improved the efficiency and quality of dense semantic SLAM. However, previous methods are generally constrained by limited-category pre-trained classifiers and implicit semantic representation, which hinder their performance in open-set scenarios and restrict 3D object-level scene understanding. To address these issues, we propose OpenGS-SLAM, an innovative framework that utilizes 3D Gaussian representation to perform dense semantic SLAM in open-set environments. Our system integrates explicit semantic labels derived from 2D foundational models into the 3D Gaussian framework, facilitating robust 3D object-level scene understanding. We introduce Gaussian Voting Splatting to enable fast 2D label map rendering and scene updating. Additionally, we propose a Confidence-based 2D Label Consensus method to ensure consistent labeling across multiple views. Furthermore, we employ a Segmentation Counter Pruning strategy to improve the accuracy of semantic scene representation. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method in scene understanding, tracking, and mapping, achieving 10 times faster semantic rendering and 2 times lower storage costs compared to existing methods. Project page: https://young-bit.github.io/opengs-github.github.io/.
中文摘要:OpenGS-SLAM提出了一种基于3D高斯表示的开集语义SLAM框架,通过集成2D基础模型的显式语义标签,实现了比现有方法快10倍的语义渲染速度和降低50%的存储成本。
English Summary: OpenGS-SLAM introduces an open-set semantic SLAM framework using 3D Gaussian representation with explicit semantic labels from 2D foundational models, achieving faster semantic rendering and lower storage costs than existing methods.
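
A minimal sketch of a confidence-weighted label consensus across views, in the spirit of the Confidence-based 2D Label Consensus step described above; the weighted-vote rule is an illustrative simplification rather than the system's exact procedure.

from collections import defaultdict

def label_consensus(observations):
    """Pick a consensus semantic label for one map element from multi-view observations.

    observations: list of (label, confidence) pairs from 2D foundation-model predictions
    """
    scores = defaultdict(float)
    for label, conf in observations:
        scores[label] += conf                        # accumulate confidence per label
    return max(scores, key=scores.get)               # highest total confidence wins

# Example: three views disagree, but 'chair' carries the most accumulated confidence.
label = label_consensus([("chair", 0.9), ("sofa", 0.4), ("chair", 0.7)])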

Authors:Luyi Qiu, Tristan Till, Xiaobao Guo, Adams Wai-Kin Kong
Title: SparseMamba-PCL: Scribble-Supervised Medical Image Segmentation via SAM-Guided Progressive Collaborative Learning
Abstract:
Scribble annotations significantly reduce the cost and labor required for dense labeling in large medical datasets with complex anatomical structures. However, current scribble-supervised learning methods are limited in their ability to effectively propagate sparse annotation labels to dense segmentation masks and accurately segment object boundaries. To address these issues, we propose a Progressive Collaborative Learning framework that leverages novel algorithms and the Med-SAM foundation model to enhance information quality during training. (1) We enrich ground truth scribble segmentation labels through a new algorithm, propagating scribbles to estimate object boundaries. (2) We enhance feature representation by optimizing Med-SAM-guided training through the fusion of feature embeddings from Med-SAM and our proposed Sparse Mamba network. This enriched representation also facilitates the fine-tuning of the Med-SAM decoder with enriched scribbles. (3) For inference, we introduce a Sparse Mamba network, which is highly capable of capturing local and global dependencies by replacing the traditional sequential patch processing method with a skip-sampling procedure. Experiments on the ACDC, CHAOS, and MSCMRSeg datasets validate the effectiveness of our framework, outperforming nine state-of-the-art methods. Our code is available at https://github.com/QLYCode/SparseMamba-PCL.
English Summary: The proposed Progressive Collaborative Learning framework enhances scribble-supervised medical image segmentation by propagating sparse scribble annotations to dense masks and improving boundary accuracy through novel algorithms and Med-SAM integration, outperforming nine state-of-the-art methods on multiple datasets.

Authors:Kaveen Perera, Fouad Khelifi, Ammar Belatreche
Title: Robust Palm-Vein Recognition Using the MMD Filter: Improving SIFT-Based Feature Matching
Abstract:
A major challenge with palm vein images is that slight movements of the fingers and thumb, or variations in hand posture, can stretch the skin in different areas and alter the vein patterns. This can result in an infinite number of variations in palm vein images for a given individual. This paper introduces a novel filtering technique for SIFT-based feature matching, known as the Mean and Median Distance (MMD) Filter. This method evaluates the differences in keypoint coordinates and computes the mean and median in each direction to eliminate incorrect matches. Experiments conducted on the 850nm subset of the CASIA dataset indicate that the proposed MMD filter effectively preserves correct points while reducing false positives detected by other filtering methods. A comparison with existing SIFT-based palm vein recognition systems demonstrates that the proposed MMD filter delivers outstanding performance, achieving lower Equal Error Rate (EER) values. This article presents an extended author's version based on our previous work, A Keypoint Filtering Method for SIFT based Palm-Vein Recognition.
中文摘要:本文提出一种均值中值距离(MMD)滤波器,通过分析关键点坐标差异有效剔除掌静脉识别中错误的SIFT特征匹配,在CASIA数据集上以更低等错误率实现了优越性能。
English Summary: This paper proposes a Mean and Median Distance (MMD) filter that effectively removes incorrect SIFT feature matches in palm vein recognition by analyzing keypoint coordinate differences, achieving superior performance with lower Equal Error Rates on the CASIA dataset.
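As a rough illustration of the idea described above, the sketch below filters matched SIFT keypoints by their coordinate differences. The tolerance `tol`, the acceptance rule (a match must lie close to both the mean and the median displacement), and the array layout are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def mmd_filter(src_pts, dst_pts, tol=1.5):
    """Hypothetical sketch of a mean/median distance (MMD) style filter.

    src_pts, dst_pts: (N, 2) arrays of matched SIFT keypoint coordinates.
    tol: allowed pixel deviation from the mean/median displacement
         (the value and the exact acceptance rule are assumptions).
    Returns a boolean mask of matches to keep.
    """
    diffs = dst_pts - src_pts                  # per-match displacement (dx, dy)
    mean_d = diffs.mean(axis=0)                # mean displacement per direction
    median_d = np.median(diffs, axis=0)        # median displacement per direction

    # keep a match only if its displacement is close to BOTH statistics
    near_mean = np.all(np.abs(diffs - mean_d) <= tol, axis=1)
    near_median = np.all(np.abs(diffs - median_d) <= tol, axis=1)
    return near_mean & near_median
```

With OpenCV, `src_pts` and `dst_pts` could be assembled from the `.pt` fields of matched keypoints, keeping only the matches where the returned mask is True.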

Authors:Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, Chen Cao
Title: Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
Abstract:
We present Vid2Avatar-Pro, a method to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos. Building a high-quality avatar that supports animation with diverse poses from a monocular video is challenging because the observation of pose diversity and view points is inherently limited. The lack of pose variations typically leads to poor generalization to novel poses, and avatars can easily overfit to limited input view points, producing artifacts and distortions from other views. In this work, we address these limitations by leveraging a universal prior model (UPM) learned from a large corpus of multi-view clothed human performance capture data. We build our representation on top of expressive 3D Gaussians with canonical front and back maps shared across identities. Once the UPM is learned to accurately reproduce the large-scale multi-view human images, we fine-tune the model with an in-the-wild video via inverse rendering to obtain a personalized photorealistic human avatar that can be faithfully animated to novel human motions and rendered from novel views. The experiments show that our approach based on the learned universal prior sets a new state-of-the-art in monocular avatar reconstruction by substantially outperforming existing approaches relying only on heuristic regularization or a shape prior of minimally clothed bodies (e.g., SMPL) on publicly available datasets.

Authors:Saad Ejaz, Hriday Bavle, Laura Ribeiro, Holger Voos, Jose Luis Sanchez-Lopez
Title: Category-level Meta-learned NeRF Priors for Efficient Object Mapping
Abstract:
In 3D object mapping, category-level priors enable efficient object reconstruction and canonical pose estimation, requiring only a single prior per semantic category (e.g., chair, book, laptop, etc.). DeepSDF has been used predominantly as a category-level shape prior, but it struggles to reconstruct sharp geometry and is computationally expensive. In contrast, NeRFs capture fine details but have yet to be effectively integrated with category-level priors in a real-time multi-object mapping framework. To bridge this gap, we introduce PRENOM, a Prior-based Efficient Neural Object Mapper that integrates category-level priors with object-level NeRFs to enhance reconstruction efficiency and enable canonical object pose estimation. PRENOM gets to know objects on a first-name basis by meta-learning on synthetic reconstruction tasks generated from open-source shape datasets. To account for object category variations, it employs a multi-objective genetic algorithm to optimize the NeRF architecture for each category, balancing reconstruction quality and training time. Additionally, prior-based probabilistic ray sampling directs sampling toward expected object regions, accelerating convergence and improving reconstruction quality under constrained resources. Experimental results highlight the ability of PRENOM to achieve high-quality reconstructions while maintaining computational feasibility. Specifically, comparisons with prior-free NeRF-based approaches on a synthetic dataset show a 21\% lower Chamfer distance. Furthermore, evaluations against other approaches using shape priors on a noisy real-world dataset indicate a 13\% improvement averaged across all reconstruction metrics, and comparable pose and size estimation accuracy, while being trained for 5$\times$ less time. Code available at: https://github.com/snt-arg/PRENOM
中文: PRENOM通过将类别级先验与物体级神经辐射场相结合,提升了重建效率并实现了规范姿态估计,相比现有方法以更少训练时间获得了更高质量的重建结果。
English: PRENOM integrates category-level priors with object-level NeRFs to enhance reconstruction efficiency and enable canonical pose estimation, achieving higher quality reconstructions with reduced computational time compared to existing methods.

Authors:Mojtaba Safari, Shansong Wang, Zach Eidex, Qiang Li, Erik H. Middlebrooks, David S. Yu, Xiaofeng Yang
Title: MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting
Abstract:
Objective: This study introduces a residual error-shifting mechanism that drastically reduces sampling steps while preserving critical anatomical details, thus accelerating MRI reconstruction. Approach: We propose a novel diffusion-based SR framework called Res-SRDiff, which integrates residual error shifting into the forward diffusion process. This enables efficient HR image reconstruction by aligning the degraded HR and LR distributions. We evaluated Res-SRDiff on ultra-high-field brain T1 MP2RAGE maps and T2-weighted prostate images, comparing it with Bicubic, Pix2pix, CycleGAN, and a conventional denoising diffusion probabilistic model with vision transformer backbone (TM-DDPM), using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), gradient magnitude similarity deviation (GMSD), and learned perceptual image patch similarity (LPIPS). Main results: Res-SRDiff significantly outperformed all comparative methods in terms of PSNR, SSIM, and GMSD across both datasets, with statistically significant improvements (p-values < 0.05). The model achieved high-fidelity image restoration with only four sampling steps, drastically reducing computational time to under one second per slice, which is substantially faster than conventional TM-DDPM, which requires around 20 seconds per slice. Qualitative analyses further demonstrated that Res-SRDiff effectively preserved fine anatomical details and lesion morphology in both brain and pelvic MRI images. Significance: Our findings show that Res-SRDiff is an efficient and accurate MRI SR method, markedly improving computational efficiency and image quality. Integrating residual error shifting into the diffusion process allows for rapid and robust HR image reconstruction, enhancing clinical MRI workflows and advancing medical imaging research. The source code is available at: https://github.com/mosaf/Res-SRDiff
中文摘要:本研究提出的Res-SRDiff方法通过残差偏移机制,仅需四步采样即可实现高质量MRI图像重建,在计算效率和图像质量上均显著优于现有方法。
English Summary: This study introduces Res-SRDiff, a diffusion-based MRI reconstruction method that uses a residual error-shifting mechanism to achieve high-fidelity image restoration in just four sampling steps, significantly outperforming other methods in both computational efficiency and image quality.
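The description of residual error shifting suggests a forward process that moves the clean image toward its degraded counterpart while adding noise. The sketch below illustrates one such step under assumed choices; the shifting schedule `eta`, the noise scale `kappa`, and the exact functional form are illustrative assumptions and not taken from the paper.

```python
import torch

def residual_shift_forward(x_hr, x_lr, t, eta, kappa=1.0):
    """Hypothetical residual-shifting forward step.

    x_hr: clean high-resolution image, shape (B, C, H, W)
    x_lr: degraded/low-resolution image upsampled to the same size
    t:    (B,) tensor of timestep indices
    eta:  monotonically increasing schedule in [0, 1], shape (T,)
    The schedule and noise scaling used by Res-SRDiff may differ.
    """
    e = eta[t].view(-1, 1, 1, 1)
    residual = x_lr - x_hr                      # gap between the two distributions
    noise = torch.randn_like(x_hr)
    # shift x_hr toward x_lr by a fraction of the residual, plus scaled noise
    return x_hr + e * residual + kappa * torch.sqrt(e) * noise
```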

Authors:Yuheng Xu, Shijie Yang, Xin Liu, Jie Liu, Jie Tang, Gangshan Wu
Title: AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning
Abstract:
In recent years, the increasing popularity of Hi-DPI screens has driven a rising demand for high-resolution images. However, the limited computational power of edge devices poses a challenge in deploying complex super-resolution neural networks, highlighting the need for efficient methods. While prior works have made significant progress, they have not fully exploited pixel-level information. Moreover, their reliance on fixed sampling patterns limits both accuracy and the ability to capture fine details in low-resolution images. To address these challenges, we introduce two plug-and-play modules designed to capture and leverage pixel information effectively in Look-Up Table (LUT) based super-resolution networks. Our method introduces Automatic Sampling (AutoSample), a flexible LUT sampling approach where sampling weights are automatically learned during training to adapt to pixel variations and expand the receptive field without added inference cost. We also incorporate Adaptive Residual Learning (AdaRL) to enhance inter-layer connections, enabling detailed information flow and improving the network's ability to reconstruct fine details. Our method achieves significant performance improvements on both MuLUT and SPF-LUT while maintaining similar storage sizes. Specifically, for MuLUT, we achieve a PSNR improvement of approximately +0.20 dB on average across five datasets. For SPF-LUT, with more than a 50% reduction in storage space and about a 2/3 reduction in inference time, our method still maintains performance comparable to the original. The code is available at https://github.com/SuperKenVery/AutoLUT.
中文摘要:本研究提出AutoSample和AdaRL两个即插即用模块,通过自适应学习像素采样和增强细节重建能力,在保持存储效率的同时显著提升了基于查找表的超分辨率网络性能。
English Summary: This paper introduces two plug-and-play modules, AutoSample and AdaRL, that enhance LUT-based super-resolution networks by adaptively learning pixel sampling and improving detail reconstruction, achieving significant performance gains with minimal storage and computational costs.
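One plausible reading of AutoSample is that each LUT input becomes a learned, normalized combination of a small pixel neighborhood rather than a fixed-offset read. The module below is a speculative sketch of that idea only; the kernel size, number of LUT inputs, and softmax normalization are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class AutoSampleSketch(nn.Module):
    """Hypothetical sketch of learned LUT sampling weights.

    Instead of reading four pixels at fixed offsets, each of the four LUT
    inputs is a learned weighted combination of a k x k neighborhood.
    """
    def __init__(self, k=3, n_inputs=4):
        super().__init__()
        # one k x k weight map per LUT input position
        self.weights = nn.Parameter(torch.randn(n_inputs, 1, k, k) * 0.01)
        self.k = k

    def forward(self, x):
        # x: (B, 1, H, W) single-channel patch; softmax keeps each sample a
        # convex combination of its neighborhood, so it stays in value range
        w = torch.softmax(self.weights.flatten(1), dim=1).view_as(self.weights)
        samples = nn.functional.conv2d(x, w, padding=self.k // 2)
        return samples  # (B, n_inputs, H, W) values that would index the LUT
```

Because the weights are fixed after training, such a sampling step could in principle be folded into the LUT construction, which matches the claim of no added inference cost.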

Authors:Ramkrishna Acharya
Title: An Approach for Air Drawing Using Background Subtraction and Contour Extraction
Abstract:
In this paper, we propose a novel approach for air drawing that uses image processing techniques to draw on the screen by moving fingers in the air. This approach benefits a wide range of applications such as sign language, in-air drawing, and 'writing' in the air as a new way of input. The approach starts by preparing an ROI (Region of Interest) background image with a running average over the initial camera frames and later subtracting it from the live camera frames to obtain a binary mask image. We calculate the pointer's position as the topmost point of the contour in the binary image, and drawing a circle on the canvas at that position simulates the stroke. Furthermore, we incorporate the pre-trained Tesseract model for OCR. To address false contours, we perform Haar cascade-based hand detection before the background subtraction. In an experimental setup, we achieved a latency of only 100 ms for air drawing. The code used in this research is available on GitHub at https://github.com/q-viper/Contour-Based-Writing
中文: 本文提出一种基于图像处理和背景减除的空中绘图方法,通过追踪手指运动实现手语识别和空中书写等功能,实验延迟仅100毫秒且代码已开源。
English: This paper introduces an air drawing method using image processing and background subtraction to track finger movements for applications like sign language and in-air writing, achieving 100ms latency with code available on GitHub.
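The pipeline described above (running-average background model, binary mask, topmost contour point as the pointer) can be sketched with OpenCV roughly as follows. Thresholds and the accumulation rate are illustrative assumptions, and the Haar-cascade hand detection and Tesseract OCR stages mentioned in the abstract are omitted here.

```python
import cv2
import numpy as np

# Minimal sketch of the background-subtraction / contour-tip idea; not the
# authors' implementation.
cap = cv2.VideoCapture(0)
background = None
canvas = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)

    if background is None:
        background = gray.astype("float")      # initialize background model
        canvas = np.zeros_like(frame)
        continue
    cv2.accumulateWeighted(gray, background, 0.05)  # running-average update

    diff = cv2.absdiff(gray, cv2.convertScaleAbs(background))
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        hand = max(contours, key=cv2.contourArea)
        # "pointer" = topmost point of the largest foreground contour
        x, y = tuple(hand[hand[:, :, 1].argmin()][0])
        cv2.circle(canvas, (x, y), 5, (0, 255, 0), -1)

    cv2.imshow("drawing", cv2.add(frame, canvas))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```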

Authors:Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladyslav Rozov, Angela Dai, Matthias Nießner
Title: MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing
Abstract:
We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
中文:MeshPad是一种通过顶点对齐推测预测策略从草图交互式生成3D网格的方法,在网格质量和用户偏好方面均优于现有技术。
English: MeshPad is a generative method that enables interactive 3D mesh creation from sketches through a vertex-aligned speculative prediction strategy, achieving superior quality and user preference over existing techniques.

Authors:Siddhant Prakash, David R. Walton, Rafael K. dos Anjos, Anthony Steed, Tobias Ritschel
Title: Blind Augmentation: Calibration-free Camera Distortion Model Estimation for Real-time Mixed-reality Consistency
Abstract:
Real camera footage is subject to noise, motion blur (MB) and depth of field (DoF). In some applications these might be considered distortions to be removed, but in others it is important to model them because it would be ineffective, or interfere with an aesthetic choice, to simply remove them. In augmented reality applications where virtual content is composed into a live video feed, we can model noise, MB and DoF to make the virtual content visually consistent with the video. Existing methods for this typically suffer two main limitations. First, they require a camera calibration step to relate a known calibration target to the specific camera's response. Second, existing work requires methods that can be (differentiably) tuned to the calibration, such as slow and specialized neural networks. We propose a method which estimates parameters for noise, MB and DoF instantly, which allows using off-the-shelf real-time simulation methods from, e.g., a game engine when compositing augmented content. Our main idea is to unlock both features by showing how to use modern computer vision methods that can remove noise, MB and DoF from the video stream, essentially providing self-calibration. This allows auto-tuning of any black-box real-time noise+MB+DoF method to deliver fast and high-fidelity augmentation consistency.

Authors:Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang
Title: UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Abstract:
Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
中文:UFO框架通过开放式语言接口统一了细粒度视觉感知任务,将检测、分割和视觉语言任务整合到单一模型中,并在多个基准测试中实现了领先性能。
English: The UFO framework unifies fine-grained visual perception tasks through a language interface, integrating detection, segmentation, and vision-language tasks into a single model while achieving state-of-the-art performance on benchmarks.

Authors:Kun Zhang, Jingyu Li, Zhe Li, Jingjing Zhang, Fan Li, Yandong Liu, Rui Yan, Zihang Jiang, Nan Chen, Lei Zhang, Yongdong Zhang, Zhendong Mao, S. Kevin Zhou
Title: Composed Multi-modal Retrieval: A Survey of Approaches and Applications
Abstract:
The burgeoning volume of multi-modal data necessitates advanced retrieval paradigms beyond unimodal and cross-modal approaches. Composed Multi-modal Retrieval (CMR) emerges as a pivotal next-generation technology, enabling users to query images or videos by integrating a reference visual input with textual modifications, thereby achieving unprecedented flexibility and precision. This paper provides a comprehensive survey of CMR, covering its fundamental challenges, technical advancements, and applications. CMR is categorized into supervised, zero-shot, and semi-supervised learning paradigms. We discuss key research directions, including data construction, model architecture, and loss optimization in supervised CMR, as well as transformation frameworks and linear integration in zero-shot CMR, and semi-supervised CMR that leverages generated pseudo-triplets while addressing data noise/uncertainty. Additionally, we extensively survey the diverse application landscape of CMR, highlighting its transformative potential in e-commerce, social media, search engines, public security, etc. Seven high impact application scenarios are explored in detail with benchmark data sets and performance analysis. Finally, we further provide new potential research directions with the hope of inspiring exploration in other yet-to-be-explored fields. A curated list of works is available at: https://github.com/kkzhang95/Awesome-Composed-Multi-modal-Retrieval
中文: 组合多模态检索(CMR)是一种前沿技术,通过结合视觉输入与文本修改实现灵活精准的图像或视频查询,涵盖监督、零样本和半监督学习范式,并在电子商务、公共安全等领域展现出变革性应用潜力。
English: Composed Multi-modal Retrieval (CMR) is a cutting-edge technology that enables flexible and precise querying of images or videos by combining visual inputs with textual modifications, addressing challenges through supervised, zero-shot, and semi-supervised learning paradigms, and showcasing transformative applications across fields like e-commerce and public security.

Authors:Xu Liang
Title: Group Relative Policy Optimization for Image Captioning
Abstract:
Image captioning tasks usually use two-stage training to complete model optimization. The first stage uses cross-entropy as the loss function for optimization, and the second stage uses self-critical sequence training (SCST) for reinforcement learning optimization. However, the SCST algorithm has certain defects. SCST relies only on a single greedy decoding result as a baseline. If the model itself is not stable enough, the greedy decoding result may be relatively poor, which leads to high variance in advantage estimation and, in turn, unstable policy updates. In addition, SCST only compares one sampling result with the greedy decoding result, so generation diversity is limited and the model may fall into a local optimum. In this paper, we propose using the latest Group Relative Policy Optimization (GRPO) reinforcement learning algorithm as an optimization solution for the second stage. GRPO generates multiple candidate captions for the input image and then continuously optimizes the model through intragroup comparison. By constraining the magnitude of policy updates and the KL divergence, GRPO greatly improves training stability. In addition, compared to SCST, which only samples one answer, GRPO samples and generates multiple answers. Multiple candidate answers in the group cover a wider solution space. Combined with KL divergence constraints, GRPO can improve diversity while ensuring model stability. The code for this article is available at https://github.com/liangxu-one/ms-models/tree/image_caption_grpo/research/arxiv_papers/Image_Caption_GRPO.
中文: 本文提出在图像描述任务的第二阶段使用群组相对策略优化(GRPO)强化学习算法替代自临界序列训练(SCST),通过生成多个候选描述并进行组内比较,结合KL散度约束,有效解决SCST方差高和多样性受限的问题,实现更稳定、更多样化的模型优化。
English: The abstract proposes replacing the self-critical sequence training (SCST) algorithm in image captioning with the Group Relative Policy Optimization (GRPO) to address SCST's issues of high variance and limited diversity by generating multiple candidate captions and using intragroup comparisons with KL divergence constraints for more stable and diverse model optimization.
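For intuition, a group-relative update of the kind described above can be sketched as follows. Normalizing rewards within the sampled group replaces SCST's single greedy baseline, while the clipping threshold, KL coefficient, and KL estimator below are common GRPO-style choices assumed for illustration rather than taken from the paper.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages for a batch of sampled captions.

    rewards: (G,) scores (e.g., CIDEr) for G captions sampled from one image.
    The 1e-8 epsilon is an implementation assumption.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards, clip=0.2, beta=0.04):
    """Clipped policy objective with a KL penalty toward a reference policy.

    logprobs, old_logprobs, ref_logprobs: (G,) summed caption log-probabilities
    under the current, behavior, and reference policies.
    """
    adv = grpo_advantages(rewards)
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()
    # simple non-negative KL estimate constraining drift from the reference
    diff = ref_logprobs - logprobs
    kl = (torch.exp(diff) - diff - 1).mean()
    return policy_loss + beta * kl
```

In practice the reward would come from a caption metric computed against the ground-truth references, and the loss would be summed over all images in the batch.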

Authors:Xuewen Liu, Zhikai Li, Qingyi Gu
Title: CacheQuant: Comprehensively Accelerated Diffusion Models
Abstract:
Diffusion models have gradually gained prominence in the field of image synthesis, showcasing remarkable generative capabilities. Nevertheless, the slow inference and complex networks, resulting from redundancy at both temporal and structural levels, hinder their low-latency applications in real-world scenarios. Current acceleration methods for diffusion models focus separately on temporal and structural levels. However, independent optimization at each level to further push the acceleration limits results in significant performance degradation. On the other hand, integrating optimizations at both levels can compound the acceleration effects. Unfortunately, we find that the optimizations at these two levels are not entirely orthogonal. Performing separate optimizations and then simply integrating them results in unsatisfactory performance. To tackle this issue, we propose CacheQuant, a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Specifically, we employ a dynamic programming approach to determine the optimal cache schedule, in which the properties of caching and quantization are carefully considered to minimize errors. Additionally, we propose decoupled error correction to further mitigate the coupled and accumulated errors step by step. Experimental results show that CacheQuant achieves a 5.18× speedup and 4× compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score. Our code is open-sourced at https://github.com/BienLuky/CacheQuant.
中文:CacheQuant框架通过联合优化模型缓存与量化技术,在保持性能基本不变的前提下,显著提升了扩散模型的推理速度并实现了模型压缩。
English: The proposed CacheQuant framework jointly optimizes model caching and quantization to comprehensively accelerate diffusion models, achieving significant speedup and compression with minimal performance degradation.
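The abstract mentions a dynamic programming search for the cache schedule. A generic version of such a search is sketched below, assuming a precomputed error matrix `err[i, t]` that estimates the cost of reusing step-i features at step t; how CacheQuant actually measures this cost, and how quantization effects are folded in, is not specified here, so this is an illustrative reconstruction only.

```python
import numpy as np

def optimal_cache_schedule(err, budget):
    """Hypothetical DP for choosing which denoising steps to recompute.

    err[i, t]: estimated error of reusing features cached at step i when
               executing step t (for t > i), e.g., measured on a calibration set.
    budget:    number of steps allowed to run the full network (step 0 included).
    Returns the recomputed step indices that minimize the summed error.
    """
    T = err.shape[0]
    INF = float("inf")

    def seg_cost(i, j):
        # error of reusing the step-i cache for every step strictly between i and j
        return float(err[i, i + 1:j].sum())

    dp = np.full((T, budget + 1), INF)
    parent = np.full((T, budget + 1), -1, dtype=int)
    dp[0, 1] = 0.0                         # the first step is always recomputed
    for k in range(2, budget + 1):
        for j in range(1, T):              # j = position of the k-th recompute
            for i in range(j):             # i = position of the previous recompute
                if dp[i, k - 1] == INF:
                    continue
                cost = dp[i, k - 1] + seg_cost(i, j)
                if cost < dp[j, k]:
                    dp[j, k] = cost
                    parent[j, k] = i

    # close the schedule: reuse the last cache until the final step T-1
    best_j = min(range(T), key=lambda j: dp[j, budget] + seg_cost(j, T))
    sched, j, k = [best_j], best_j, budget
    while k > 1:
        j = parent[j, k]
        k -= 1
        sched.append(j)
    return sorted(sched)
```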

Authors:Peishan Cong, Ziyi Wang, Yuexin Ma, Xiangyu Yue
Title: SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
Abstract:
Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages the text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of generative motions. Our method achieves state-of-the-art performance on three datasets and demonstrates superior generalization capability for diverse interaction scenarios. The project page and code can be found at https://4dvlab.github.io/project_page/semgeomo/.

Authors:Linhao Li, Changhui Su, Yu Guo, Huimao Zhang, Dong Liang, Kun Shang
Title: Interactive Gadolinium-Free MRI Synthesis: A Transformer with Localization Prompt Learning
Abstract:
Contrast-enhanced magnetic resonance imaging (CE-MRI) is crucial for tumor detection and diagnosis, but the use of gadolinium-based contrast agents (GBCAs) in clinical settings raises safety concerns due to potential health risks. To circumvent these issues while preserving diagnostic accuracy, we propose a novel Transformer with Localization Prompts (TLP) framework for synthesizing CE-MRI from non-contrast MR images. Our architecture introduces three key innovations: a hierarchical backbone that uses efficient Transformer to process multi-scale features; a multi-stage fusion system consisting of Local and Global Fusion modules that hierarchically integrate complementary information via spatial attention operations and cross-attention mechanisms, respectively; and a Fuzzy Prompt Generation (FPG) module that enhances the TLP model's generalization by emulating radiologists' manual annotation through stochastic feature perturbation. The framework uniquely enables interactive clinical integration by allowing radiologists to input diagnostic prompts during inference, synergizing artificial intelligence with medical expertise. This research establishes a new paradigm for contrast-free MRI synthesis while addressing critical clinical needs for safer diagnostic procedures. Codes are available at https://github.com/ChanghuiSu/TLP.
中文摘要:本研究提出了一种带定位提示的Transformer(TLP)框架,通过分层特征处理和交互式临床集成,能够从非增强MRI合成对比增强图像,在保持诊断准确性的同时避免了钆基造影剂的使用需求。
English Summary: This study introduces a Transformer with Localization Prompts (TLP) framework that synthesizes contrast-enhanced MRI from non-contrast images, eliminating the need for gadolinium-based agents while maintaining diagnostic accuracy through hierarchical feature processing and interactive clinical integration.

Authors:Xuan Zhu, Jijun Xiang, Xianqi Wang, Longliang Liu, Yu Wang, Hong Zhang, Fei Guo, Xin Yang
Title: SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion
Abstract:
Lightweight direct Time-of-Flight (dToF) sensors are ideal for 3D sensing on mobile devices. However, due to the manufacturing constraints of compact devices and the inherent physical principles of imaging, dToF depth maps are sparse and noisy. In this paper, we propose a novel video depth completion method, called SVDC, by fusing the sparse dToF data with the corresponding RGB guidance. Our method employs a multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the sparse dToF imaging. Misalignment between consecutive frames during multi-frame fusion could cause blending between object edges and the background, which results in a loss of detail. To address this, we introduce an adaptive frequency selective fusion (AFSF) module, which automatically selects convolution kernel sizes to fuse multi-frame features. Our AFSF utilizes a channel-spatial enhancement attention (CSEA) module to enhance features and generates an attention map as fusion weights. The AFSF ensures edge detail recovery while suppressing high-frequency noise in smooth regions. To further enhance temporal consistency, we propose a cross-window consistency loss to ensure consistent predictions across different windows, effectively reducing flickering. Our proposed SVDC achieves optimal accuracy and consistency on the TartanAir and Dynamic Replica datasets. Code is available at https://github.com/Lan1eve/SVDC.
中文摘要:本文提出SVDC视频深度补全方法,通过融合稀疏直接飞行时间数据与RGB引导,采用多帧融合和自适应频率选择模块,在增强边缘细节和时序一致性的同时有效抑制噪声。
English Summary: The paper introduces SVDC, a video depth completion method that fuses sparse dToF data with RGB guidance using multi-frame fusion and an adaptive frequency selective module to enhance edge details and temporal consistency while reducing noise.

Authors:Xiaolong Yu, Junqiao Zhao, Shuangfu Song, Zhongyang Zhu, Zihan Yuan, Chen Ye, Tiantian Feng
Title: Convex Hull-based Algebraic Constraint for Visual Quadric SLAM
Abstract:
Using Quadrics as the object representation has the benefits of both generality and closed-form projection derivation between image and world spaces. Although numerous constraints have been proposed for dual quadric reconstruction, we found that many of them are imprecise and provide minimal improvements to localization. After scrutinizing the existing constraints, we introduce a concise yet more precise convex hull-based algebraic constraint for object landmarks, which is applied to object reconstruction, frontend pose estimation, and backend bundle adjustment. This constraint is designed to fully leverage precise semantic segmentation, effectively mitigating mismatches between complex-shaped object contours and dual quadrics. Experiments on public datasets demonstrate that our approach is applicable to both monocular and RGB-D SLAM and achieves better object mapping and localization than existing quadric SLAM methods. The implementation of our method is available at https://github.com/tiev-tongji/convexhull-based-algebraic-constraint.
Chinese: 本研究提出了一种基于凸包的精确代数约束方法,通过充分利用语义分割技术,在单目和RGB-D SLAM系统中实现了比现有方法更优的物体建图与定位效果。
English: This study introduces a precise convex hull-based algebraic constraint for dual quadric reconstruction that enhances object mapping and localization in both monocular and RGB-D SLAM systems by leveraging accurate semantic segmentation.

Authors:Jiqing Wu, Ingrid Berg, Yawei Li, Ender Konukoglu, Viktor H. Koelzer
Title: Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion
Abstract:
Holistic 3D modeling of molecularly defined brain structures is crucial for understanding complex brain functions. Emerging tissue profiling technologies enable the construction of a comprehensive atlas of the mammalian brain with sub-cellular resolution and spatially resolved gene expression data. However, such tera-scale volumetric datasets present significant computational challenges in understanding complex brain functions within their native 3D spatial context. Here, we propose the novel generative approach Tera-MIND, which can simulate Tera-scale Mouse braINs in 3D using a patch-based and boundary-aware Diffusion model. Taking spatial transcriptomic data as the conditional input, we generate virtual mouse brains with comprehensive cellular morphological detail at teravoxel scale. Through the lens of 3D gene-gene self-attention, we identify spatial molecular interactions for key transcriptomic pathways in the murine brain, exemplified by glutamatergic and dopaminergic neuronal systems. Importantly, these in-silico biological findings are consistent and reproducible across three tera-scale virtual mouse brains. Therefore, Tera-MIND showcases a promising path toward efficient and generative simulations of whole organ systems for biomedical research. Project website: https://musikisomorphie.github.io/Tera-MIND.html

Authors:Guanyao Wu, Haoyu Liu, Hongming Fu, Yichuan Peng, Jinyuan Liu, Xin Fan, Risheng Liu
Title: Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
Abstract:
Multi-modality image fusion, particularly infrared and visible, plays a crucial role in integrating diverse modalities to enhance scene understanding. Although early research prioritized visual quality, preserving fine details and adapting to downstream tasks remains challenging. Recent approaches attempt task-specific design but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Enable downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via the persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allow the student network to effectively extract knowledge. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency. The code is available at https://github.com/RollingPlain/SAGE_IVIF.
中文摘要:提出的SAGE方法利用Segment Anything Model的语义知识,通过新型注意力模块和蒸馏机制实现多模态图像融合的平衡,在提升视觉质量的同时增强了下游任务的适应性。
English Summary: The proposed SAGE method leverages semantic knowledge from the Segment Anything Model to achieve balanced multi-modality image fusion, enhancing both visual quality and downstream task adaptability through a novel attention module and distillation mechanism.

Authors:Tianjie Ju, Yi Hua, Hao Fei, Zhenyu Shao, Yubin Zheng, Haodong Zhao, Mong-Li Lee, Wynne Hsu, Zhuosheng Zhang, Gongshen Liu
Title: Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models
Abstract:
Multi-Modal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly generate task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at https://github.com/illusionhi/ProbingPrivacy.
中文: 研究表明多模态大语言模型在部分小批量训练中会通过虚假关联无意记忆与任务无关的私有内容,水印实验显示即使这些内容不影响输出,模型仍会触发不同的表征模式。
English: This study reveals that multi-modal large language models can inadvertently memorize task-irrelevant private content through spurious correlations in partial mini-batch training, as demonstrated by watermark experiments showing distinct representational patterns even when such content doesn't affect model outputs.
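A minimal sketch of the data side of such a probing setup, i.e., injecting task-irrelevant watermarks into fine-tuning images at a chosen probability, is shown below; the text string, placement, and rendering are illustrative assumptions and not the paper's protocol.

```python
import random
from PIL import Image, ImageDraw

def embed_watermark(image, text="ID-4729", prob=0.3):
    """Hypothetical injection of a task-irrelevant watermark.

    image: a PIL RGB image from the VQA fine-tuning set.
    With probability `prob`, the (randomly chosen, task-irrelevant) text is
    drawn at a random position; otherwise the image is returned unchanged.
    Returns (image, was_watermarked) so that probing can later check whether
    the model encodes the watermark content.
    """
    if random.random() >= prob:
        return image, False
    img = image.copy()
    draw = ImageDraw.Draw(img)
    x = random.randint(0, max(0, img.width - 80))
    y = random.randint(0, max(0, img.height - 20))
    draw.text((x, y), text, fill=(255, 255, 255))
    return img, True
```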

Authors:Xingyuan Li, Zirui Wang, Yang Zou, Zhixin Chen, Jun Ma, Zhiying Jiang, Long Ma, Jinyuan Liu
Title: DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
Abstract:
Infrared imaging is essential for autonomous driving and robotic operations as a supportive modality due to its reliable performance in challenging environments. Despite its popularity, the limitations of infrared cameras, such as low spatial resolution and complex degradations, consistently challenge imaging quality and subsequent visual tasks. Hence, infrared image super-resolution (IISR) has been developed to address this challenge. While recent developments in diffusion models have greatly advanced this field, current methods to solve it either ignore the unique modal characteristics of infrared imaging or overlook the machine perception requirements. To bridge these gaps, we propose DifIISR, an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. Our approach achieves task-based guidance for diffusion by injecting gradients derived from visual and perceptual priors into the noise during the reverse process. Specifically, we introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity, ensuring that the reconstructed infrared images closely align with high-resolution images by matching their frequency components. Subsequently, we incorporate various visual foundational models as the perceptual guidance for downstream visual tasks, infusing generalizable perceptual features beneficial for detection and segmentation. As a result, our approach gains superior visual results while attaining State-Of-The-Art downstream task performance. Code is available at https://github.com/zirui0625/DifIISR
中文摘要:提出的DifIISR模型通过融合热光谱分布调控以保持视觉保真度,并利用视觉基础模型作为感知指导,显著提升了红外图像超分辨率效果,在视觉质量和下游任务性能上均达到领先水平。
English Summary: The proposed DifIISR model enhances infrared image super-resolution by integrating thermal spectrum regulation for visual fidelity and leveraging visual foundation models as perceptual guidance, achieving superior image quality and state-of-the-art performance in downstream tasks.

Authors:Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, Yanwei Fu
Title: HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
Abstract:
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, which have attracted increasing attention in multimodal research. While the existing methods have made strides in gesture accuracy, challenges remain in generating diverse and coherent gestures, as most approaches assume independence among multimodal inputs and lack explicit modeling of their interactions. In this work, we propose a novel multimodal learning method named HOP for co-speech gesture generation that captures the heterogeneous entanglement between gesture motion, audio rhythm, and text semantics, enabling the generation of coordinated gestures. By leveraging spatiotemporal graph modeling, we achieve the alignment of audio and action. Moreover, to enhance modality coherence, we build the audio-text semantic representation based on a reprogramming module, which is beneficial for cross-modality adaptation. Our approach enables the trimodal system to learn each other's features and represent them in the form of topological entanglement. Extensive experiments demonstrate that HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation. More information, codes, and demos are available here: https://star-uu-wang.github.io/HOP/

Authors:Zhipeng Huang, Shaobin Zhuang, Canmiao Fu, Binxin Yang, Ying Zhang, Chong Sun, Zhizheng Zhang, Yali Wang, Chen Li, Zheng-Jun Zha
Title: WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Abstract:
Existing multimodal generative models fall short as qualified design copilots, as they often struggle to generate imaginative outputs once instructions are less detailed or lack the ability to maintain consistency with the provided references. In this work, we introduce WeGen, a model that unifies multimodal generation and understanding, and promotes their interplay in iterative generation. It can generate diverse results with high creativity for less detailed instructions. And it can progressively refine prior generation results or integrate specific contents from references, following the instructions in its chat with users. During this process, it is capable of preserving consistency in the parts that the user is already satisfied with. To this end, we curate a large-scale dataset, extracted from Internet videos, containing rich object dynamics and dynamics descriptions auto-labeled by advanced foundation models. These two kinds of information are interleaved into a single sequence to enable WeGen to learn consistency-aware generation, where the specified dynamics are generated while the consistency of unspecified content is preserved in line with the instructions. Besides, we introduce a prompt self-rewriting mechanism to enhance generation diversity. Extensive experiments demonstrate the effectiveness of unifying multimodal understanding and generation in WeGen and show it achieves state-of-the-art performance across various visual generation benchmarks. These also demonstrate the potential of WeGen as a user-friendly design copilot as desired. The code and models will be available at https://github.com/hzphzp/WeGen.
Chinese: WeGen是一种统一的多模态模型,通过迭代生成提升创造性和一致性,作为用户友好的设计助手实现了最先进的性能。
English: WeGen is a unified multimodal model that enhances creativity and consistency in iterative generation, achieving state-of-the-art performance as a user-friendly design copilot.

Authors:Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Shengyong Chen
Title: SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures
Abstract:
Pixel-level segmentation of structural cracks across various scenarios remains a considerable challenge. Current methods struggle to effectively model crack morphology and texture while balancing segmentation quality with low computational resource usage. To overcome these limitations, we propose a lightweight Structure-Aware Vision Mamba Network (SCSegamba), capable of generating high-quality pixel-level segmentation maps by leveraging both the morphological information and texture cues of crack pixels with minimal computational cost. Specifically, we developed a Structure-Aware Visual State Space module (SAVSS), which incorporates a lightweight Gated Bottleneck Convolution (GBC) and a Structure-Aware Scanning Strategy (SASS). The key insight of GBC lies in its effectiveness in modeling the morphological information of cracks, while the SASS enhances the perception of crack topology and texture by strengthening the continuity of semantic information between crack pixels. Experiments on crack benchmark datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods, achieving the highest performance with only 2.8M parameters. On the multi-scenario dataset, our method reached 0.8390 in F1 score and 0.8479 in mIoU. The code is available at https://github.com/Karl1109/SCSegamba.
中文:提出的轻量级结构感知视觉Mamba网络(SCSegamba)通过建模裂缝形态和纹理,以最小计算资源有效解决分割难题,在基准数据集上实现了最优性能。
English: The proposed lightweight Structure-Aware Vision Mamba Network (SCSegamba) effectively addresses crack segmentation challenges by modeling morphology and texture with minimal computational resources, achieving state-of-the-art performance on benchmark datasets.

Authors:Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Title: Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
Abstract:
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512x512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256x256.
Chinese: 本文提出直接判别优化(DDO)框架,将基于似然的生成训练与GAN式判别相结合,突破最大似然估计的限制,在无需引导机制的情况下于多个数据集上实现了最先进的生成性能。
English: This paper introduces Direct Discriminative Optimization (DDO), a unified framework that combines likelihood-based generative training with GAN-type discrimination to overcome the limitations of maximum likelihood estimation, achieving state-of-the-art performance across multiple datasets without guidance mechanisms.
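The central idea, an implicit discriminator given by the likelihood ratio between the learnable model and a frozen reference copy, can be sketched as below. The logistic (non-saturating) loss form is an assumption about how that ratio might be trained, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def ddo_style_loss(logp_theta_real, logp_ref_real, logp_theta_fake, logp_ref_fake):
    """Hypothetical sketch of a DDO-style finetuning objective.

    The discriminator is parameterized implicitly as
        d(x) = log p_theta(x) - log p_ref(x),
    where p_ref is a frozen copy of the pretrained model and the "fake"
    samples are generated by the model itself.  All four inputs are per-sample
    log-likelihoods (or ELBO estimates) of shape (B,).
    """
    d_real = logp_theta_real - logp_ref_real   # ratio on real training data
    d_fake = logp_theta_fake - logp_ref_fake   # ratio on self-generated data
    # push the ratio up on real samples and down on the model's own samples
    return F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
```

Only the target model's parameters receive gradients; the reference model stays fixed, which is what removes the need for a separately trained discriminator network.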

Authors:Wanjun Jia, Fan Yang, Mengfei Duan, Xianchi Chen, Yinxi Wang, Yiming Jiang, Wenrui Chen, Kailun Yang, Zhiyong Li
Title: One-Shot Affordance Grounding of Deformable Objects in Egocentric Organizing Scenes
Abstract:
Deformable object manipulation in robotics presents significant challenges due to uncertainties in component properties, diverse configurations, visual interference, and ambiguous prompts. These factors complicate both perception and control tasks. To address these challenges, we propose a novel method for One-Shot Affordance Grounding of Deformable Objects (OS-AGDO) in egocentric organizing scenes, enabling robots to recognize previously unseen deformable objects with varying colors and shapes using minimal samples. Specifically, we first introduce the Deformable Object Semantic Enhancement Module (DefoSEM), which enhances hierarchical understanding of the internal structure and improves the ability to accurately identify local features, even under conditions of weak component information. Next, we propose the ORB-Enhanced Keypoint Fusion Module (OEKFM), which optimizes feature extraction of key components by leveraging geometric constraints and improves adaptability to diversity and visual interference. Additionally, we propose an instance-conditional prompt based on image data and task context, which effectively mitigates the issue of region ambiguity caused by prompt words. To validate these methods, we construct a diverse real-world dataset, AGDDO15, which includes 15 common types of deformable objects and their associated organizational actions. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods, achieving improvements of 6.2%, 3.2%, and 2.9% in KLD, SIM, and NSS metrics, respectively, while exhibiting high generalization performance. Source code and benchmark dataset are made publicly available at https://github.com/Dikay1/OS-AGDO.
中文: 本文提出了一种新颖的单次可变形物体功能基元感知方法(OS-AGDO),通过语义增强模块和优化特征提取技术,使机器人能够用最少样本识别不同形态的可变形物体,在真实场景实验中显著超越了现有最优方法。
English: This paper introduces a novel One-Shot Affordance Grounding method for Deformable Objects (OS-AGDO) that enables robots to recognize unseen deformable objects with minimal samples through semantic enhancement and optimized feature extraction, achieving superior performance over state-of-the-art methods in real-world experiments.

Authors:Jingyu Song, Haoyu Ma, Onur Bagoren, Advaith V. Sethuraman, Yiting Zhang, Katherine A. Skinner
Title: OceanSim: A GPU-Accelerated Underwater Robot Perception Simulation Framework
Abstract:
Underwater simulators offer support for building robust underwater perception solutions. Significant work has recently been done to develop new simulators and to advance the performance of existing underwater simulators. Still, there remains room for improvement on physics-based underwater sensor modeling and rendering efficiency. In this paper, we propose OceanSim, a high-fidelity GPU-accelerated underwater simulator to address this research gap. We propose advanced physics-based rendering techniques to reduce the sim-to-real gap for underwater image simulation. We develop OceanSim to fully leverage the computing advantages of GPUs and achieve real-time imaging sonar rendering and fast synthetic data generation. We evaluate the capabilities and realism of OceanSim using real-world data to provide qualitative and quantitative results. The code and detailed documentation are made available on the project website to support the marine robotics community: https://umfieldrobotics.github.io/OceanSim.

Authors:Lie Ju, Sijin Zhou, Yukun Zhou, Huimin Lu, Zhuoting Zhu, Pearse A. Keane, Zongyuan Ge
Title: Delving into Out-of-Distribution Detection with Medical Vision-Language Models
Abstract:
Recent advances in medical vision-language models (VLMs) demonstrate impressive performance in image classification tasks, driven by their strong zero-shot generalization capabilities. However, given the high variability and complexity inherent in medical imaging data, the ability of these models to detect out-of-distribution (OOD) data in this domain remains underexplored. In this work, we conduct the first systematic investigation into the OOD detection potential of medical VLMs. We evaluate state-of-the-art VLM-based OOD detection methods across a diverse set of medical VLMs, including both general and domain-specific purposes. To accurately reflect real-world challenges, we introduce a cross-modality evaluation pipeline for benchmarking full-spectrum OOD detection, rigorously assessing model robustness against both semantic shifts and covariate shifts. Furthermore, we propose a novel hierarchical prompt-based method that significantly enhances OOD detection performance. Extensive experiments are conducted to validate the effectiveness of our approach. The codes are available at https://github.com/PyJulie/Medical-VLMs-OOD-Detection.
中文: 本研究开创性地系统评估了医学视觉语言模型的分布外检测能力,提出了跨模态基准和新型分层提示方法,显著提升了检测性能。
English: This study pioneers a systematic evaluation of out-of-distribution detection in medical vision-language models, introducing a cross-modality benchmark and a novel hierarchical prompt method that significantly improves performance.

Authors:Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
Title: Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
Abstract:
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline employing a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks. Code and data are available at https://github.com/OpenRobotLab/EgoHOD/.
中文: 本研究提出EgoVideo模型,通过创新的HOD流程和运动适配器将细粒度手物动态整合到自我中心视频表征学习中,在多个下游任务中实现了领先性能。
English: This work introduces EgoVideo, a model that integrates fine-grained hand-object dynamics into egocentric video representation learning using a novel HOD pipeline and motion adapter, achieving state-of-the-art performance across multiple downstream tasks.

Authors:Xingzhuo Guo, Yu Zhang, Baixu Chen, Haoran Xu, Jianmin Wang, Mingsheng Long
Title: Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models
Abstract:
Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies. Code is available at this repository: https://github.com/thuml/dynamical-diffusion.
中文: Dynamical Diffusion (DyDiff) 提出了一种时间感知框架,通过显式建模时间动态增强扩散模型,从而在多种预测任务中提升性能。
English: Dynamical Diffusion (DyDiff) introduces a temporally aware framework that enhances diffusion models by explicitly modeling temporal dynamics, leading to improved performance in various predictive tasks.

Authors:Zhu Liu, Zijun Wang, Jinyuan Liu, Fanqi Meng, Long Ma, Risheng Liu
Title: DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging
Abstract:
Thermal imaging is often compromised by dynamic, complex degradations caused by hardware limitations and unpredictable environmental factors. The scarcity of high-quality infrared data, coupled with the challenges of dynamic, intricate degradations, makes it difficult to recover details using existing methods. In this paper, we introduce thermal degradation simulation integrated into the training process via a mini-max optimization, by modeling these degraded factors as adversarial attacks on thermal images. The simulation is dynamic to maximize objective functions, thus capturing a broad spectrum of degraded data distributions. This approach enables training with limited data, thereby improving model performance. Additionally, we introduce a dual-interaction network that combines the benefits of spiking neural networks with scale transformation to capture degraded features with sharp spike signal intensities. This architecture ensures compact model parameters while preserving efficient feature representation. Extensive experiments demonstrate that our method not only achieves superior visual quality under diverse single and composited degradation, but also delivers a significant reduction in processing when trained on only fifty clear images, outperforming existing techniques in efficiency and accuracy. The source code will be available at https://github.com/LiuZhu-CV/DEAL.
中文: 本文提出了一种热成像退化模拟方法,通过极小极大优化将退化建模为对抗攻击,结合脉冲神经网络与尺度变换的双交互网络,在有限数据下实现高效训练和模型性能提升。
English: This paper introduces a thermal degradation simulation method using mini-max optimization to model degradations as adversarial attacks, enabling effective training with limited data and enhanced model performance through a dual-interaction network that combines spiking neural networks with scale transformation for efficient feature representation.

Authors:Rui Yi Yong, Samuel Picosson, Arnold Wiliem
Title: MTReD: 3D Reconstruction Dataset for Fly-over Videos of Maritime Domain
Abstract:
This work tackles 3D scene reconstruction for a video fly-over perspective problem in the maritime domain, with a specific emphasis on geometrically and visually sound reconstructions. This will allow for downstream tasks such as segmentation, navigation, and localization. To our knowledge, there is no dataset available in this domain. As such, we propose a novel maritime 3D scene reconstruction benchmarking dataset, named MTReD (Maritime Three-Dimensional Reconstruction Dataset). The MTReD comprises 19 fly-over videos curated from the Internet containing ships, islands, and coastlines. As the task is aimed towards geometrical consistency and visual completeness, the dataset uses two metrics: (1) Reprojection error; and (2) Perception-based metrics. We find that existing perception-based metrics, such as Learned Perceptual Image Patch Similarity (LPIPS), do not appropriately measure the completeness of a reconstructed image. Thus, we propose a novel semantic similarity metric utilizing DINOv2 features, coined DiFPS (DinoV2 Features Perception Similarity). We perform initial evaluation on two baselines: (1) Structure from Motion (SfM) through Colmap; and (2) the recent state-of-the-art MASt3R model. We find that the scenes reconstructed by MASt3R have higher reprojection errors, but superior perception-based metric scores. To this end, some pre-processing methods are explored, and we find a pre-processing method which improves both the reprojection error and the perception-based score. We envisage that our proposed MTReD will stimulate further research in these directions. The dataset and all the code will be made available at https://github.com/RuiYiYong/MTReD.
中文: 本研究提出MTReD这一新型海事三维重建数据集,填补了飞越视角视频数据空白,通过新提出的语义相似度指标DiFPS进行评估,并展示了预处理方法对提升重建质量的有效性。
English: This study introduces MTReD, a novel maritime 3D reconstruction dataset addressing the lack of fly-over video data, featuring a new semantic similarity metric (DiFPS) and demonstrating improved reconstruction performance through preprocessing methods.
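As a rough illustration of a DINOv2-feature perceptual similarity in the spirit of DiFPS, the sketch below compares global embeddings of two images with cosine similarity; the backbone size, pooling, and similarity function are assumptions rather than the paper's exact metric.

```python
# Hedged sketch of a DINOv2-feature similarity score; not the DiFPS definition from the paper.
import torch
import torch.nn.functional as F

# Publicly released ViT-S/14 DINOv2 checkpoint via torch hub (assumed available).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

@torch.no_grad()
def feature_similarity(img_a, img_b):
    """img_a, img_b: (1, 3, H, W) tensors, H and W multiples of 14, ImageNet-normalized."""
    fa, fb = dinov2(img_a), dinov2(img_b)        # global embeddings
    return F.cosine_similarity(fa, fb).item()    # higher = more semantically similar
```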

Authors:Yalun Dai, Lingao Xiao, Ivor W. Tsang, Yang He
Title: Training-Free Dataset Pruning for Instance Segmentation
Abstract:
Existing dataset pruning techniques primarily focus on classification tasks, limiting their applicability to more complex and practical tasks like instance segmentation. Instance segmentation presents three key challenges: pixel-level annotations, instance area variations, and class imbalances, which significantly complicate dataset pruning efforts. Directly adapting existing classification-based pruning methods proves ineffective due to their reliance on a time-consuming model training process. To address this, we propose a novel Training-Free Dataset Pruning (TFDP) method for instance segmentation. Specifically, we leverage shape and class information from image annotations to design a Shape Complexity Score (SCS), refining it into Scale-Invariant (SI-SCS) and Class-Balanced (CB-SCS) versions to address instance area variations and class imbalances, all without requiring model training. We achieve state-of-the-art results on VOC 2012, Cityscapes, and COCO datasets, generalizing well across CNN and Transformer architectures. Remarkably, our approach accelerates the pruning process by an average of 1349$\times$ on COCO compared to the adapted baselines. Source code is available at: https://github.com/he-y/dataset-pruning-for-instance-segmentation
中文: 本文提出了一种无需训练的实例分割数据集剪枝方法(TFDP),利用标注中的形状和类别信息设计复杂度评分,无需模型训练即可在多个数据集上取得最优性能,并将剪枝速度提升了1300倍以上。
English: This paper introduces a Training-Free Dataset Pruning (TFDP) method for instance segmentation that uses shape and class information from annotations to create complexity scores, achieving state-of-the-art results and accelerating pruning by over 1300 times without model training.
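To make the annotation-only scoring concrete, the sketch below computes a hypothetical shape-complexity score from an instance's contour polygon using the isoperimetric ratio, which is naturally scale-invariant; this particular measure and its normalization are assumptions, not the SCS definition from the paper.

```python
# Hedged illustration of annotation-only shape scoring; the perimeter^2 / area measure is assumed.
import numpy as np

def shape_complexity(polygon):
    """polygon: (N, 2) array of instance-mask contour points (closed implicitly)."""
    d = np.diff(np.vstack([polygon, polygon[:1]]), axis=0)
    perimeter = np.linalg.norm(d, axis=1).sum()
    x, y = polygon[:, 0], polygon[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))  # shoelace formula
    # Isoperimetric ratio: 1.0 for a circle, larger for more intricate shapes; scale-invariant.
    return perimeter ** 2 / (4 * np.pi * area + 1e-8)
```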

Authors:Bowen Zheng, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
Title: Task-Agnostic Guided Feature Expansion for Class-Incremental Learning
Abstract:
The ability to learn new concepts while preserving learned knowledge is desirable for learning systems in Class-Incremental Learning (CIL). Recently, feature expansion of the model has become a prevalent solution for CIL, where the old features are fixed during the training of the new task while new features are expanded for the new tasks. However, such task-specific features learned from the new task may collide with the old features, leading to misclassification between tasks. Therefore, the expanded model is often encouraged to capture diverse features from the new task, aiming to avoid such collision. However, existing solutions are largely restricted to samples from the current task because of the poor accessibility to previous samples. To promote the learning and transferring of diverse features across tasks, we propose a framework called Task-Agnostic Guided Feature Expansion (TagFex). Firstly, it captures task-agnostic features continually with a separate model, providing extra task-agnostic features for subsequent tasks. Secondly, to obtain useful features from the task-agnostic model for the current task, it aggregates the task-agnostic features with the task-specific feature using a merge-attention module. Then the aggregated feature is transferred back into the task-specific feature for inference, helping the task-specific model capture diverse features. Extensive experiments show the effectiveness and superiority of TagFex on various CIL settings. Code is available at https://github.com/bwnzheng/TagFex_CVPR2025.
中文: 提出的TagFex框架通过持续捕获任务无关特征,并利用注意力机制将其与任务特定特征融合,有效解决了类增量学习中的特征冲突问题,实现了跨任务的多样化特征学习与知识保持。
English: The proposed TagFex framework addresses feature collision in Class-Incremental Learning by continually capturing task-agnostic features and merging them with task-specific features through attention mechanisms, enabling diverse feature learning across tasks while maintaining performance.
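A minimal sketch of the merge-attention idea follows: task-specific features act as queries over the task-agnostic features from the separate model, and the fused result is added back for inference. The single-layer design and dimensions are assumptions for illustration, not the TagFex implementation.

```python
# Toy merge-attention module: task-specific features query task-agnostic features.
import torch
import torch.nn as nn

class MergeAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, task_specific, task_agnostic):
        # (B, N, dim) sequences; the fused output is transferred back to the task-specific branch.
        fused, _ = self.attn(task_specific, task_agnostic, task_agnostic)
        return task_specific + fused
```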

Authors:Lu Ma, Kaibo Cao, Hao Liang, Jiaxin Lin, Zhuang Li, Yuhong Liu, Jihong Zhang, Wentao Zhang, Bin Cui
Title: Evaluating and Predicting Distorted Human Body Parts for Generated Images
Abstract:
Recent advancements in text-to-image (T2I) models enable high-quality image synthesis, yet generating anatomically accurate human figures remains challenging. AI-generated images frequently exhibit distortions such as proliferated limbs, missing fingers, deformed extremities, or fused body parts. Existing evaluation metrics like Inception Score (IS) and Fréchet Inception Distance (FID) lack the granularity to detect these distortions, while human preference-based metrics focus on abstract quality assessments rather than anatomical fidelity. To address this gap, we establish the first standards for identifying human body distortions in AI-generated images and introduce Distortion-5K, a comprehensive dataset comprising 4,700 annotated images of normal and malformed human figures across diverse styles and distortion types. Based on this dataset, we propose ViT-HD, a Vision Transformer-based model tailored for detecting human body distortions in AI-generated images, which outperforms state-of-the-art segmentation models and visual language models, achieving an F1 score of 0.899 and IoU of 0.831 on distortion localization. Additionally, we construct the Human Distortion Benchmark with 500 human-centric prompts to evaluate four popular T2I models using the trained ViT-HD, revealing that nearly 50% of generated images contain distortions. This work pioneers a systematic approach to evaluating anatomical accuracy in AI-generated humans, offering tools to advance the fidelity of T2I models and their real-world applicability. The Distortion-5K dataset and trained ViT-HD will soon be released in our GitHub repository: https://github.com/TheRoadQaQ/Predicting-Distortion.
Chinese: 本研究推出了首个用于识别AI生成图像中人体畸变的标注数据集Distortion-5K,并提出基于视觉Transformer的ViT-HD模型,该模型在检测解剖结构异常方面优于现有方法,揭示近半数生成图像存在此类缺陷。
English: This study introduces Distortion-5K, the first annotated dataset for identifying human body distortions in AI-generated images, and proposes ViT-HD, a Vision Transformer model that outperforms existing methods in detecting anatomical inaccuracies, revealing nearly half of generated images contain such flaws.

Authors:Xulin Chen, Junzhou Huang
Title: DELST: Dual Entailment Learning for Hyperbolic Image-Gene Pretraining in Spatial Transcriptomics
Abstract:
Spatial transcriptomics (ST) maps gene expression within tissue at individual spots, making it a valuable resource for multimodal representation learning. Additionally, ST inherently contains rich hierarchical information both across and within modalities. For instance, different spots exhibit varying numbers of nonzero gene expressions, corresponding to different levels of cellular activity and semantic hierarchies. However, existing methods rely on contrastive alignment of image-gene pairs, failing to accurately capture the intricate hierarchical relationships in ST data. Here, we propose DELST, the first framework to embed hyperbolic representations while modeling hierarchy for image-gene pretraining at two levels: (1) Cross-modal entailment learning, which establishes an order relationship between genes and images to enhance image representation generalization; (2) Intra-modal entailment learning, which encodes gene expression patterns as hierarchical relationships, guiding hierarchical learning across different samples at a global scale and integrating biological insights into single-modal representations. Extensive experiments on ST benchmarks annotated by pathologists demonstrate the effectiveness of our framework, achieving improved predictive performance compared to existing methods. Our code and models are available at: https://github.com/XulinChen/DELST.
中文: 提出的DELST框架通过双模态和单模态蕴含学习引入双曲表征嵌入,有效捕捉空间转录组数据中的层级关系,在基准测试中优于现有方法。
English: The proposed DELST framework introduces hyperbolic representation embedding with cross-modal and intra-modal entailment learning to capture hierarchical relationships in spatial transcriptomics data, outperforming existing methods on benchmarks.

Authors:Qingwen Zhang, Ajinkya Khoche, Yi Yang, Li Ling, Sina Sharif Mansouri, Olov Andersson, Patric Jensfelt
Title: HiMo: High-Speed Objects Motion Compensation in Point Clouds
Abstract:
LiDAR point cloud is essential for autonomous vehicles, but motion distortions from dynamic objects degrade the data quality. While previous work has considered distortions caused by ego motion, distortions caused by other moving objects remain largely overlooked, leading to errors in object shape and position. This distortion is particularly pronounced in high-speed environments such as highways and in multi-LiDAR configurations, a common setup for heavy vehicles. To address this challenge, we introduce HiMo, a pipeline that repurposes scene flow estimation for non-ego motion compensation, correcting the representation of dynamic objects in point clouds. During the development of HiMo, we observed that existing self-supervised scene flow estimators often produce degenerate or inconsistent estimates under high-speed distortion. We further propose SeFlow++, a real-time scene flow estimator that achieves state-of-the-art performance on both scene flow and motion compensation. Since well-established motion distortion metrics are absent in the literature, we introduce two evaluation metrics: compensation accuracy at a point level and shape similarity of objects. We validate HiMo through extensive experiments on Argoverse 2, ZOD, and a newly collected real-world dataset featuring highway driving and multi-LiDAR-equipped heavy vehicles. Our findings show that HiMo improves the geometric consistency and visual fidelity of dynamic objects in LiDAR point clouds, benefiting downstream tasks such as semantic segmentation and 3D detection. See https://kin-zhang.github.io/HiMo for more details.

Authors:Ukcheol Shin, Kyunghyun Lee, Jean Oh
Title: Bridging Spectral-wise and Multi-spectral Depth Estimation via Geometry-guided Contrastive Learning
Abstract:
Deploying depth estimation networks in the real world requires high-level robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar. They mainly adopt two strategies to use multiple sensors: modality-wise and multi-modal fused inference. The former method is flexible but memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide high-level reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named the align-and-fuse strategy, for depth estimation from multi-spectral images. In the align stage, we align embedding spaces between multiple spectrum bands to learn a shareable representation across multi-spectral images by minimizing the contrastive loss of global and spatially aligned local features with a geometry cue. After that, in the fuse stage, we train an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust prediction results. Based on the proposed method, a single depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.
Chinese: 本文提出一种对齐与融合策略,用于多光谱图像的稳健深度估计,通过对齐不同光谱的嵌入空间并选择性融合特征,实现可靠、内存高效且灵活的预测结果。
English: This paper introduces an align-and-fuse strategy for robust depth estimation from multi-spectral images, which aligns embedding spaces across spectra and selectively fuses features to achieve reliable, memory-efficient, and flexible predictions.
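The align stage can be pictured with a standard InfoNCE-style contrastive loss that pulls together global features of the same scene captured in different spectral bands; the temperature and symmetric formulation below are common defaults and an assumption, not the exact loss from the paper.

```python
# Minimal InfoNCE-style sketch for cross-spectral alignment (assumed formulation).
import torch
import torch.nn.functional as F

def spectral_alignment_loss(feat_rgb, feat_thermal, tau=0.07):
    """feat_*: (B, D) global features of the same B scenes in two spectral bands."""
    a = F.normalize(feat_rgb, dim=1)
    b = F.normalize(feat_thermal, dim=1)
    logits = a @ b.t() / tau                               # (B, B) cross-spectrum similarities
    targets = torch.arange(a.size(0), device=a.device)     # matching scenes lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```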

Authors:Elahe Delavari, Aws Khalil, Jaerock Kwon
Title: CARIL: Confidence-Aware Regression in Imitation Learning for Autonomous Driving
Abstract:
End-to-end vision-based imitation learning has demonstrated promising results in autonomous driving by learning control commands directly from expert demonstrations. However, traditional approaches rely on either regression-based models, which provide precise control but lack confidence estimation, or classification-based models, which offer confidence scores but suffer from reduced precision due to discretization. This limitation makes it challenging to quantify the reliability of predicted actions and apply corrections when necessary. In this work, we introduce a dual-head neural network architecture that integrates both regression and classification heads to improve decision reliability in imitation learning. The regression head predicts continuous driving actions, while the classification head estimates confidence, enabling a correction mechanism that adjusts actions in low-confidence scenarios, enhancing driving stability. We evaluate our approach in a closed-loop setting within the CARLA simulator, demonstrating its ability to detect uncertain actions, estimate confidence, and apply real-time corrections. Experimental results show that our method reduces lane deviation and improves trajectory accuracy by up to 50%, outperforming conventional regression-only models. These findings highlight the potential of classification-guided confidence estimation in enhancing the robustness of vision-based imitation learning for autonomous driving. The source code is available at https://github.com/ElaheDlv/Confidence_Aware_IL.
中文摘要:本研究提出一种结合回归与分类的双头神经网络,通过实时置信度估计和动作校正机制提升自动驾驶可靠性,在CARLA模拟器中显著减少车道偏离并提高轨迹精度达50%。
English Summary: This study introduces a dual-head neural network combining regression and classification to enhance autonomous driving reliability by enabling real-time confidence estimation and action correction, significantly reducing lane deviation and improving trajectory accuracy in CARLA simulations.
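A schematic version of the dual-head design is sketched below: one linear head regresses the continuous action while a second head over discretized action bins yields a softmax confidence that can trigger corrections. Layer sizes, the number of bins, and the confidence rule are assumptions, not the released implementation.

```python
# Schematic dual-head model: regression for control, classification for confidence.
import torch
import torch.nn as nn

class DualHead(nn.Module):
    def __init__(self, feat_dim=512, n_bins=15):
        super().__init__()
        self.reg = nn.Linear(feat_dim, 1)       # continuous steering command
        self.cls = nn.Linear(feat_dim, n_bins)  # discretized steering bins -> confidence

    def forward(self, feats):
        steer = self.reg(feats).squeeze(-1)
        probs = self.cls(feats).softmax(-1)
        confidence = probs.max(-1).values        # low value => uncertain action, apply correction
        return steer, confidence
```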

Authors:Zihao Luo, Zijun Gao, Wenjun Liao, Shichuan Zhang, Guotai Wang, Xiangde Luo
Title: Dynamic Gradient Sparsification Training for Few-Shot Fine-tuning of CT Lymph Node Segmentation Foundation Model
Abstract:
Accurate lymph node (LN) segmentation is critical in radiotherapy treatment and prognosis analysis, but is limited by the need for large annotated datasets. While deep learning-based segmentation foundation models show potential in developing high-performing models with fewer samples, their medical adaptation faces LN domain-specific prior deficiencies and inefficient few-shot fine-tuning for complex clinical practices, highlighting the necessity of an LN segmentation foundation model. In this work, we annotated 36,106 visible LNs from 3,346 publicly available head-and-neck CT scans to establish a robust LN segmentation model (nnUNetv2). Building on this, we propose Dynamic Gradient Sparsification Training (DGST), a few-shot fine-tuning approach that preserves foundational knowledge while dynamically updating the most critical parameters of the LN segmentation model with few annotations. We validate it on two publicly available LN segmentation datasets: SegRap2023 and LNQ2023. The results show that DGST outperforms existing few-shot fine-tuning methods, achieving satisfactory performance with limited labeled data. We release the dataset, models and all implementations to facilitate relevant research: https://github.com/Zihaoluoh/LN-Seg-FM.
Chinese: 本研究针对淋巴结分割在放疗和预后中因标注数据稀缺而受限的问题,通过大规模标注数据和创新的少样本微调方法,开发出优于现有技术的模型,有效利用有限标注数据实现高精度分割。
English: Accurate lymph node segmentation is essential for radiotherapy and prognosis but is hindered by the scarcity of annotated data, which this study addresses by developing a robust model using extensive annotations and a novel few-shot fine-tuning method that outperforms existing approaches with limited labeled data.
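The sketch below illustrates one plausible reading of gradient sparsification during few-shot fine-tuning: after backpropagation, only the parameters with the largest gradient magnitudes keep their gradients, so the optimizer updates a small, dynamically chosen subset at each step. The per-tensor top-k criterion and keep ratio are assumptions about how "critical parameters" are selected.

```python
# Hedged sketch of dynamic gradient sparsification; call between backward() and optimizer.step().
import torch

def sparsify_gradients(model, keep_ratio=0.1):
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(keep_ratio * g.numel()))
        thresh = torch.topk(g, k).values.min()            # k-th largest gradient magnitude
        p.grad.mul_((p.grad.abs() >= thresh).float())     # zero out the non-critical gradients
```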

Authors:Fei Teng, Buyin Deng, Boyuan Zheng, Kai Luo, Kunyu Peng, Jiaming Zhang, Kailun Yang
Title: Unifying Light Field Perception with Field of Parallax
Abstract:
We introduce the Field of Parallax (FoP), a spatial field that distills the common features from different LF representations to provide flexible and consistent support for multi-task learning. FoP is built upon three core features--projection difference, adjacency divergence, and contextual consistency--which are essential for cross-task adaptability. To implement FoP, we design a two-step angular adapter: the first step captures angular-specific differences, while the second step consolidates contextual consistency to ensure robust representation. Leveraging the FoP-based representation, we introduce the LFX framework, the first to handle arbitrary LF representations seamlessly, unifying LF multi-task vision. We evaluated LFX across three different tasks, achieving new state-of-the-art results compared with previous task-specific architectures: 84.74% in mIoU for semantic segmentation on UrbanLF, 0.84% in AP for object detection on PKU, and 0.030 in MAE and 0.026 in MAE for salient object detection on Duftv2 and PKU, respectively. The source code will be made publicly available at https://github.com/warriordby/LFX.
中文: 视差场(FoP)通过核心特征构建了支持跨任务适应的空间场,基于此的LFX框架统一了光场多任务视觉,在语义分割、目标检测和显著目标检测任务中均取得了最优性能。
English: The Field of Parallax (FoP) introduces a spatial field with core features enabling cross-task adaptability, and the LFX framework built on it unifies light field multi-task vision, achieving state-of-the-art results across semantic segmentation, object detection, and salient object detection.

Authors:Teng Zhang, Hongxu Jiang, Kuang Gong, Wei Shao
Title: Geodesic Diffusion Models for Medical Image-to-Image Generation
Abstract:
Diffusion models transform an unknown data distribution into a Gaussian prior by progressively adding noise until the data become indistinguishable from pure noise. This stochastic process traces a path in probability space, evolving from the original data distribution (considered as a Gaussian with near-zero variance) to an isotropic Gaussian. The denoiser then learns to reverse this process, generating high-quality samples from random Gaussian noise. However, standard diffusion models, such as the Denoising Diffusion Probabilistic Model (DDPM), do not ensure a geodesic (i.e., shortest) path in probability space. This inefficiency necessitates the use of many intermediate time steps, leading to high computational costs in training and sampling. To address this limitation, we propose the Geodesic Diffusion Model (GDM), which defines a geodesic path under the Fisher-Rao metric with a variance-exploding noise scheduler. This formulation transforms the data distribution into a Gaussian prior with minimal energy, significantly improving the efficiency of diffusion models. We trained GDM by continuously sampling time steps from 0 to 1 and using as few as 15 evenly spaced time steps for model sampling. We evaluated GDM on two medical image-to-image generation tasks: CT image denoising and MRI image super-resolution. Experimental results show that GDM achieved state-of-the-art performance while reducing training time by 50-fold compared to DDPM and 10-fold compared to Fast-DDPM, with 66 times faster sampling than DDPM and a similar sampling speed to Fast-DDPM. These efficiency gains enable rapid model exploration and real-time clinical applications. Our code is publicly available at: https://github.com/mirthAI/GDM-VE.
Chinese: 测地扩散模型(GDM)在Fisher-Rao度量下引入测地路径以提升扩散模型效率,在医学影像任务中实现最优性能,相比标准模型大幅减少了训练和采样时间。
English: The Geodesic Diffusion Model (GDM) introduces a geodesic path under the Fisher-Rao metric to enhance diffusion model efficiency, achieving state-of-the-art performance in medical imaging tasks with significantly reduced training and sampling times compared to standard models.
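As a hedged illustration of training with continuously sampled time steps and a variance-exploding corruption, the sketch below draws t uniformly in [0, 1] and noises images with a geometric sigma schedule; the schedule itself is a generic assumption and not GDM's Fisher-Rao geodesic formulation.

```python
# Generic variance-exploding (VE) forward corruption with continuous time sampling.
# The sigma schedule is an assumption, not GDM's geodesic one.
import torch

def ve_noise(x0, sigma_min=0.01, sigma_max=50.0):
    """x0: (B, C, H, W) clean images."""
    t = torch.rand(x0.size(0), device=x0.device)          # sample t uniformly in [0, 1]
    sigma = sigma_min * (sigma_max / sigma_min) ** t       # geometric VE schedule (assumed)
    noise = torch.randn_like(x0)
    x_t = x0 + sigma.view(-1, 1, 1, 1) * noise             # variance-exploding corruption
    return x_t, noise, t
```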

Authors:Henrui Tian, Wenhui Lei, Linrui Dai, Hanyu Chen, Xiaofan Zhang
Title: LesionDiffusion: Towards Text-controlled General Lesion Synthesis
Abstract:
Fully-supervised lesion recognition methods in medical imaging face challenges due to the reliance on large annotated datasets, which are expensive and difficult to collect. To address this, synthetic lesion generation has become a promising approach. However, existing models struggle with scalability, fine-grained control over lesion attributes, and the generation of complex structures. We propose LesionDiffusion, a text-controllable lesion synthesis framework for 3D CT imaging that generates both lesions and corresponding masks. By utilizing a structured lesion report template, our model provides greater control over lesion attributes and supports a wider variety of lesion types. We introduce a dataset of 1,505 annotated CT scans with paired lesion masks and structured reports, covering 14 lesion types across 8 organs. LesionDiffusion consists of two components: a lesion mask synthesis network (LMNet) and a lesion inpainting network (LINet), both guided by lesion attributes and image features. Extensive experiments demonstrate that LesionDiffusion significantly improves segmentation performance, with strong generalization to unseen lesion types and organs, outperforming current state-of-the-art models. Code is available at https://github.com/HengruiTianSJTU/LesionDiffusion.
中文: LesionDiffusion是一种基于文本控制的3D CT病灶合成框架,通过结构化属性调控生成病灶和对应掩模,显著提升了分割性能并展现出对未知病灶类型的强泛化能力。
English: LesionDiffusion is a text-controllable framework for generating synthetic 3D CT lesions and masks, enhancing segmentation performance and generalization across diverse lesion types through structured attribute control.

Authors:Junyao Gao, Yanan Sun, Fei Shen, Xin Jiang, Zhening Xing, Kai Chen, Cairong Zhao
Title: FaceShot: Bring Any Character into Life
Abstract:
In this paper, we present FaceShot, a novel training-free portrait animation framework designed to bring any character to life from any driven video without fine-tuning or retraining. We achieve this by offering precise and robust reposed landmark sequences from an appearance-guided landmark matching module and a coordinate-based landmark retargeting module. Together, these components harness the robust semantic correspondences of latent diffusion models to produce facial motion sequences across a wide range of character types. After that, we input the landmark sequences into a pre-trained landmark-driven animation model to generate animated video. With this powerful generalization capability, FaceShot can significantly extend the application of portrait animation by breaking the limitation of realistic portrait landmark detection for any stylized character and driven video. Also, FaceShot is compatible with any landmark-driven animation model, significantly improving overall performance. Extensive experiments on our newly constructed character benchmark CharacBench confirm that FaceShot consistently surpasses state-of-the-art (SOTA) approaches across any character domain. More results are available at our project website https://faceshot2024.github.io/faceshot/.

Authors:Jinjiang You, Hewei Wang, Yijie Li, Mingxiao Huo, Long Van Tran Ha, Mingyuan Ma, Jinfeng Xu, Jiayi Zhang, Puzhen Wu, Shubham Garg, Wei Pu
Title: Multi-Cali Anything: Dense Feature Multi-Frame Structure-from-Motion for Large-Scale Camera Array Calibration
Abstract:
Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the necessity for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes, and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything
中文摘要:本文提出一种基于密集特征的多帧标定方法,可直接从场景数据优化相机内参,通过引入外参正则化、密集特征重投影和内参方差约束,在无需专门标定采集的情况下达到与传统方法相近的精度,显著提升了三维重建精度。
English Summary: This paper introduces a dense-feature-driven multi-frame calibration method that refines camera intrinsics directly from scene data, eliminating dedicated calibration captures while achieving comparable precision through enhanced SfM pipelines with specialized regularization terms.
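The three extra objective terms named in the abstract can be pictured as a single weighted loss, sketched below with assumed residual forms and weights; it is meant only to show how dense-feature reprojection, extrinsics regularization, and an intrinsics-variance penalty could be combined, not the repository's implementation.

```python
# Hedged sketch of a combined calibration objective (assumed residuals and weights).
import torch

def calibration_loss(feat_residuals, extrinsics, extrinsics_gt, intrinsics_per_frame,
                     w_extr=1.0, w_var=0.1):
    reproj = (feat_residuals ** 2).mean()                      # dense-feature reprojection term
    extr_reg = ((extrinsics - extrinsics_gt) ** 2).mean()      # pull extrinsics toward known rig values
    intr_var = intrinsics_per_frame.var(dim=0).mean()          # encourage shared intrinsics across frames
    return reproj + w_extr * extr_reg + w_var * intr_var
```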

Authors:Wenhui Lei, Anqi Li, Yusheng Tan, Hanyu Chen, Xiaofan Zhang
Title: Shazam: Unifying Multiple Foundation Models for Advanced Computational Pathology
Abstract:
Foundation Models (FMs) in computational pathology (CPath) have significantly advanced the extraction of meaningful features from histopathology image datasets, achieving strong performance across various clinical tasks. Despite their impressive performance, these models often exhibit variability when applied to different tasks, prompting the need for a unified framework capable of consistently excelling across various applications. In this work, we propose Shazam, a novel framework designed to efficiently combine multiple CPath models. Unlike previous approaches that train a fixed-parameter FM, Shazam dynamically extracts and refines information from diverse FMs for each specific task. To ensure that each FM contributes effectively without dominance, a novel distillation strategy is applied, guiding the student model with features from all teacher models, which enhances its generalization ability. Experimental results on two pathology patch classification datasets demonstrate that Shazam outperforms existing CPath models and other fusion methods. Its lightweight, flexible design makes it a promising solution for improving CPath analysis in real-world settings. Code will be available at https://github.com/Tuner12/Shazam.
Chinese: Shazam框架通过动态特征提取和新型蒸馏策略,有效整合多个计算病理学基础模型,在病理分类任务中展现出卓越的性能和泛化能力。
English: The Shazam framework efficiently integrates multiple computational pathology foundation models through dynamic feature extraction and a novel distillation strategy, demonstrating superior performance and generalization in pathology classification tasks.
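A toy version of multi-teacher feature distillation in this spirit is sketched below: the student's features are projected toward each foundation-model teacher and the losses are averaged so no single teacher dominates. The MSE objective, linear projectors, and equal weighting are assumptions, not Shazam's distillation strategy.

```python
# Toy multi-teacher feature distillation (assumed loss and weighting).
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_feat, teacher_feats, projectors):
    """student_feat: (B, D); teacher_feats: list of (B, D_i); projectors: matching nn.Linear list."""
    losses = [F.mse_loss(proj(student_feat), t.detach())
              for t, proj in zip(teacher_feats, projectors)]
    return sum(losses) / len(losses)   # equal weighting so no teacher dominates
```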

Authors:Yang Ding, Can Han, Sijia Du, Yaqi Wang, Dahong Qian
Title: LightEndoStereo: A Real-time Lightweight Stereo Matching Method for Endoscopy Images
Abstract:
Real-time acquisition of accurate depth of scene is essential for automated robotic minimally invasive surgery, and stereo matching with binocular endoscopy can generate such depth. However, existing algorithms struggle with ambiguous tissue boundaries and real-time performance in prevalent high-resolution endoscopic scenes. We propose LightEndoStereo, a lightweight real-time stereo matching method for endoscopic images. We introduce a 3D Mamba Coordinate Attention module to streamline the cost aggregation process by generating position-sensitive attention maps and capturing long-range dependencies across spatial dimensions using the Mamba block. Additionally, we introduce a High-Frequency Disparity Optimization module to refine disparity estimates at tissue boundaries by enhancing high-frequency information in the wavelet domain. Our method is evaluated on the SCARED and SERV-CT datasets, achieving state-of-the-art matching accuracy and a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/LightEndoStereo.
中文: LightEndoStereo是一种用于内窥镜图像的轻量级实时立体匹配方法,通过引入3D曼巴坐标注意力模块和高频视差优化模块,在SCARED和SERV-CT数据集上实现了最先进的匹配精度和42 FPS的实时推理速度。
English: LightEndoStereo is a lightweight real-time stereo matching method for endoscopic images that introduces a 3D Mamba Coordinate Attention module and a High-Frequency Disparity Optimization module, achieving state-of-the-art accuracy and 42 FPS inference speed on SCARED and SERV-CT datasets.

Authors:Changlin Song, Jiaqi Wang, Liyun Zhu, He Weng
Title: Enhancing Monocular 3D Scene Completion with Diffusion Model
Abstract:
3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving, enabling machines to understand and interact with complex environments. Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance, but this dependence limits their use in scenarios where only a single image is available. In this work, we introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image, significantly reducing the need for multi-view inputs. Our approach leverages a pre-trained vision-language model to generate descriptive prompts for the scene, guiding a diffusion model to produce images from various perspectives, which are then fused to form a cohesive 3D reconstruction. Extensive experiments show that our method effectively and robustly expands single-image inputs into a comprehensive 3D scene, extending monocular 3D reconstruction capabilities without further training. Our code is available https://github.com/CharlieSong1999/FlashDreamer/tree/main.
Chinese: FlashDreamer提出了一种从单张图像重建完整3D场景的创新方法,通过视觉语言模型和扩散模型生成多视角图像,显著减少对多视角输入的依赖,无需额外训练即可扩展单目3D重建能力。
English: FlashDreamer introduces a novel method for reconstructing complete 3D scenes from single images by using vision-language models and diffusion models to generate multi-view images, significantly reducing reliance on multi-view inputs and extending monocular 3D reconstruction capabilities without additional training.

Authors:Zhiqi Kang, Liyuan Wang, Xingxing Zhang, Karteek Alahari
Title: Advancing Prompt-Based Methods for Replay-Independent General Continual Learning
Abstract:
General continual learning (GCL) is a broad concept to describe real-world continual learning (CL) problems, which are often characterized by online data streams without distinct transitions between tasks, i.e., blurry task boundaries. Such requirements result in poor initial performance, limited generalizability, and severe catastrophic forgetting, heavily impacting the effectiveness of mainstream GCL models trained from scratch. While the use of a frozen pretrained backbone with appropriate prompt tuning can partially address these challenges, such prompt-based methods remain suboptimal for on-the-fly continual learning of the remaining tunable parameters. In this regard, we propose an innovative approach named MISA (Mask and Initial Session Adaption) to advance prompt-based methods in GCL. It includes a forgetting-aware initial session adaption that employs pretraining data to initialize prompt parameters and improve generalizability, as well as a non-parametric logit mask of the output layers to mitigate catastrophic forgetting. Empirical results demonstrate substantial performance gains of our approach compared to recent competitors, especially without a replay buffer (e.g., up to 18.39%, 22.06%, and 11.96% performance lead on CIFAR-100, Tiny-ImageNet, and ImageNet-R, respectively). Moreover, our approach features the plug-in nature for prompt-based methods, independence of replay, ease of implementation, and avoidance of CL-relevant hyperparameters, serving as a strong baseline for GCL research. Our source code is publicly available at https://github.com/kangzhiq/MISA
中文:MISA方法通过预训练数据初始化提示参数提升泛化能力,并采用非参数化逻辑掩码减轻灾难性遗忘,在无需回放缓冲区的情况下显著提升了通用持续学习的性能。
English: The MISA method enhances general continual learning by using pretraining data to initialize prompts for better adaptability and a logit mask to reduce forgetting, achieving significant performance improvements without a replay buffer.
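One way to picture a non-parametric logit mask is sketched below: logits of classes that have not yet appeared are set to negative infinity so they contribute neither to the loss nor to predictions. The exact masking policy used by MISA may differ; this is an assumed minimal form.

```python
# Minimal non-parametric logit mask for class-incremental learning (assumed policy).
import torch

def mask_logits(logits, seen_classes):
    """logits: (B, C); seen_classes: 1-D LongTensor of class indices observed so far."""
    mask = torch.full_like(logits, float("-inf"))
    mask[:, seen_classes] = 0.0
    return logits + mask   # unseen classes get -inf and vanish after softmax
```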

Authors:Ashish Verma, Aupendu Kar, Krishnendu Ghosh, Sobhan Kanti Dhara, Debashis Sen, Prabir Kumar Biswas
Title: Artificially Generated Visual Scanpath Improves Multi-label Thoracic Disease Classification in Chest X-Ray Images
Abstract:
Expert radiologists visually scan Chest X-Ray (CXR) images, sequentially fixating on anatomical structures to perform disease diagnosis. An automatic multi-label classifier of diseases in CXR images can benefit by incorporating aspects of the radiologists' approach. Recorded visual scanpaths of radiologists on CXR images can be used for the said purpose. But, such scanpaths are not available for most CXR images, which creates a gap even for modern deep learning based classifiers. This paper proposes to mitigate this gap by generating effective artificial visual scanpaths using a visual scanpath prediction model for CXR images. Further, a multi-class multi-label classifier framework is proposed that uses a generated scanpath and visual image features to classify diseases in CXR images. While the scanpath predictor is based on a recurrent neural network, the multi-label classifier involves a novel iterative sequential model with an attention module. We show that our scanpath predictor generates human-like visual scanpaths. We also demonstrate that the use of artificial visual scanpaths improves multi-class multi-label disease classification results on CXR images. The above observations are made from experiments involving around 0.2 million CXR images from 2 widely-used datasets considering the multi-label classification of 14 pathological findings. Code link: https://github.com/ashishverma03/SDC
中文: 本文提出通过循环神经网络生成模拟放射科医生视觉扫描路径的人工轨迹,并结合图像特征构建新型多标签分类器,有效提升了胸部X光片的疾病检测性能。
English: This paper proposes generating artificial visual scanpaths using a recurrent neural network to mimic radiologists' eye movements, which are then integrated with image features in a novel multi-label classifier to improve disease detection in chest X-rays.

Authors:Siddhartha Gairola, Moritz Böhle, Francesco Locatello, Bernt Schiele
Title: How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations
Abstract:
Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover a strong dependency on the training details of a pre-trained model's classification layer (less than 10 percent of model parameters) and demonstrate that they play a crucial role, much more so than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are interpreted eventually. With this finding we also present simple yet effective adjustments to the classification layers, that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.
Chinese: 本研究揭示,解释深度神经网络的事后重要性归因方法受分类层训练细节的显著影响,挑战了其与模型训练无关的假设,并提出简单调整以提升解释质量。
English: This study reveals that post-hoc importance attribution methods for explaining deep neural networks are significantly influenced by the training details of the classification layer, challenging the assumption of independence from model training and offering simple adjustments to improve explanation quality.

Authors:Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang
Title: Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Abstract:
We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
Chinese: ReKV提出了一种无需训练的高效流式视频问答方法,通过结合视频大语言模型、滑动窗口注意力机制和键值缓存管理,实现了实时响应并显著提升了计算效率。
English: ReKV introduces a training-free method for efficient streaming video question-answering by integrating with Video-LLMs, utilizing sliding-window attention and key-value cache management to enable real-time responses and improved computational efficiency.
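The retrieval step can be pictured with the hedged sketch below: each cached video chunk keeps one summary embedding, the question embedding scores all chunks, and only the top-k KV-caches are moved from RAM or disk back to the GPU. The cosine scoring, chunking, and `top_k` value are assumptions, not ReKV's implementation.

```python
# Hedged sketch of query-relevant KV-cache retrieval (assumed scoring and chunking).
import torch
import torch.nn.functional as F

def retrieve_kv(query_emb, chunk_embs, kv_chunks, top_k=4, device="cuda"):
    """query_emb: (D,); chunk_embs: (N, D), one summary embedding per cached video chunk;
    kv_chunks: list of N KV-cache tensors kept in CPU RAM / on disk."""
    scores = F.cosine_similarity(chunk_embs, query_emb.unsqueeze(0), dim=1)   # (N,)
    idx = scores.topk(min(top_k, len(kv_chunks))).indices
    return [kv_chunks[int(i)].to(device) for i in idx]   # only query-relevant caches go to GPU memory
```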

Authors:Jiawen Zhu, Huayi Tang, Xin Chen, Xinying Wang, Dong Wang, Huchuan Lu
Title: Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking
Abstract:
Efficient tracking has garnered attention for its ability to operate on resource-constrained platforms for real-world deployment beyond desktop GPUs. Current efficient trackers mainly follow precision-oriented trackers, adopting a one-stream framework with lightweight modules. However, blindly adhering to the one-stream paradigm may not be optimal, as incorporating template computation in every frame leads to redundancy, and pervasive semantic interaction between template and search region places stress on edge devices. In this work, we propose a novel asymmetric Siamese tracker named \textbf{AsymTrack} for efficient tracking. AsymTrack disentangles template and search streams into separate branches, with the template computed only once during initialization to generate modulation signals. Building on this architecture, we devise an efficient template modulation mechanism to unidirectionally inject crucial cues into the search features, and design an object perception enhancement module that integrates abstract semantics and local details to overcome the limited representation in lightweight trackers. Extensive experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms compared to the current state-of-the-arts. For instance, AsymTrack-T achieves 60.8\% AUC on LaSOT and 224/81/84 FPS on GPU/CPU/AGX, surpassing HiT-Tiny by 6.0\% AUC with higher speeds. The code is available at https://github.com/jiawen-zhu/AsymTrack.
中文: AsymTrack提出了一种非对称孪生跟踪器,通过分离模板和搜索流来减少冗余并提升效率,在不同平台上实现了优越的速度与精度平衡。
English: AsymTrack introduces an asymmetric Siamese tracker that separates template and search streams to reduce redundancy and enhance efficiency, achieving superior speed-precision trade-offs across various platforms.

Authors:Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, Jianke Zhu
Title: Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
Abstract:
Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) that is capable of understanding and reasoning in complex 3D environments. Most previous methods typically encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. To obtain the fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning simultaneously without the subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks. Source code is available at https://github.com/hanxunyu/Inst3D-LMM
中文: 本文提出Inst3D-LMM统一模型,通过多视图跨模态融合注入2D语义至3D几何特征,并利用3D实例空间关系模块捕捉物体间复杂空间关联,在多项3D场景理解任务中实现最优性能。
English: This paper introduces Inst3D-LMM, a unified model that enhances 3D scene understanding by integrating multi-view 2D semantics with 3D geometric features and capturing spatial relationships among objects, achieving state-of-the-art performance across multiple tasks.

Authors:Xuanchen Li, Jianyu Wang, Yuhao Cheng, Yikun Zeng, Xingyu Ren, Wenhan Zhu, Weiming Zhao, Yichao Yan
Title: Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture
Abstract:
Significant progress has been made for speech-driven 3D face animation, but most works focus on learning the motion of mesh/geometry, ignoring the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset \textbf{TexTalk4D}, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. Based on the dataset, we explore the inherent correlation between motion and texture, and propose a diffusion-based framework \textbf{TexTalker} to simultaneously generate facial motions and dynamic textures from speech. Furthermore, we propose a novel pivot-based style injection strategy to capture the complicity of different texture and motion styles, which allows disentangled control. TexTalker, as the first method to generate audio-synced facial motion with dynamic texture, not only outperforms the prior arts in synthesising facial motions, but also produces realistic textures that are consistent with the underlying facial movements. Project page: https://xuanchenli.github.io/TexTalk/.

Authors:Giuseppe Stracquadanio, Federico Vasile, Elisa Maiettini, Nicolò Boccardo, Lorenzo Natale
Title: Bring Your Own Grasp Generator: Leveraging Robot Grasp Generation for Prosthetic Grasping
Abstract:
One of the most important research challenges in upper-limb prosthetics is enhancing the user-prosthesis communication to closely resemble the experience of a natural limb. As prosthetic devices become more complex, users often struggle to control the additional degrees of freedom. In this context, leveraging shared-autonomy principles can significantly improve the usability of these systems. In this paper, we present a novel eye-in-hand prosthetic grasping system that follows these principles. Our system initiates the approach-to-grasp action based on user's command and automatically configures the DoFs of a prosthetic hand. First, it reconstructs the 3D geometry of the target object without the need of a depth camera. Then, it tracks the hand motion during the approach-to-grasp action and finally selects a candidate grasp configuration according to user's intentions. We deploy our system on the Hannes prosthetic hand and test it on able-bodied subjects and amputees to validate its effectiveness. We compare it with a multi-DoF prosthetic control baseline and find that our method enables faster grasps, while simplifying the user experience. Code and demo videos are available online at https://hsp-iit.github.io/byogg/.

Authors:Lixu Wang, Bingqi Shang, Yi Li, Payal Mohapatra, Wei Dong, Xiao Wang, Qi Zhu
Title: Split Adaptation for Pre-trained Vision Transformers
Abstract:
Vision Transformers (ViTs), extensively pre-trained on large-scale datasets, have become essential to foundation models, allowing excellent performance on diverse downstream tasks with minimal adaptation. Consequently, there is growing interest in adapting pre-trained ViTs across various fields, including privacy-sensitive domains where clients are often reluctant to share their data. Existing adaptation methods typically require direct data access, rendering them infeasible under these constraints. A straightforward solution may be sending the pre-trained ViT to clients for local adaptation, which poses issues of model intellectual property protection and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting data and models. SA, inspired by split learning (SL), segments the pre-trained ViT into a frontend and a backend, with only the frontend shared with the client for data representation extraction. But unlike regular SL, SA replaces frontend parameters with low-bit quantized values, preventing direct exposure of the model. SA allows the client to add bi-level noise to the frontend and the extracted data representations, ensuring data protection. Accordingly, SA incorporates data-level and model-level out-of-distribution enhancements to mitigate noise injection's impact on adaptation performance. Our SA focuses on the challenging few-shot adaptation and adopts patch retrieval augmentation for overfitting alleviation. Extensive experiments on multiple datasets validate SA's superiority over state-of-the-art methods and demonstrate its defense against advanced data reconstruction attacks while preventing model leakage with minimal computation cost on the client side. The source codes can be found at https://github.com/conditionWang/Split_Adaptation.
中文: 提出的分割适配方法将视觉Transformer分割为前端和后端组件,通过量化和噪声注入在少样本下游任务中同时保护数据隐私和模型知识产权,同时保持高性能并降低客户端计算成本。
English: The proposed split adaptation method segments Vision Transformers into frontend and backend components, using quantization and noise injection to protect both data privacy and model intellectual property during few-shot downstream tasks, while maintaining high performance and low client computation costs.
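A rough sketch of the split-and-quantize idea follows, assuming a timm-style ViT whose transformer blocks live in `vit.blocks`: the first few blocks become the client-side frontend with uniformly quantized low-bit weights, and the rest stay on the server. The split point, bit width, and quantizer are illustrative assumptions, not the SA implementation.

```python
# Hedged sketch of splitting a pre-trained ViT and quantizing the shared frontend.
import copy
import torch

def split_and_quantize(vit, split_at=4, bits=4):
    # Assumes a timm-style model exposing its transformer blocks as `vit.blocks`.
    frontend = copy.deepcopy(torch.nn.Sequential(*vit.blocks[:split_at]))   # client-side copy
    backend = torch.nn.Sequential(*vit.blocks[split_at:])                   # stays on the server
    qmax = 2 ** (bits - 1) - 1
    for p in frontend.parameters():                    # low-bit uniform quantization of frontend weights
        scale = p.abs().max() / qmax + 1e-12
        p.data = torch.round(p.data / scale).clamp(-qmax, qmax) * scale
    return frontend, backend
```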

Authors:Jingyi Yang, Xun Lin, Zitong Yu, Liepiao Zhang, Xin Liu, Hui Li, Xiaochen Yuan, Xiaochun Cao
Title: DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing
Abstract:
With the availability of diverse sensor modalities (i.e., RGB, Depth, Infrared) and the success of multi-modal learning, multi-modal face anti-spoofing (FAS) has emerged as a prominent research focus. The intuition behind it is that leveraging multiple modalities can uncover more intrinsic spoofing traces. However, this approach presents more risk of misalignment. We identify two main types of misalignment: (1) \textbf{Intra-domain modality misalignment}, where the importance of each modality varies across different attacks. For instance, certain modalities (e.g., Depth) may be non-defensive against specific attacks (e.g., 3D mask), indicating that each modality has unique strengths and weaknesses in countering particular attacks. Consequently, simple fusion strategies may fall short. (2) \textbf{Inter-domain modality misalignment}, where the introduction of additional modalities exacerbates domain shifts, potentially overshadowing the benefits of complementary fusion. To tackle (1), we propose an alignment module between modalities based on mutual information, which adaptively enhances favorable modalities while suppressing unfavorable ones. To address (2), we employ a dual alignment optimization method that aligns both sub-domain hyperplanes and modality angle margins, thereby mitigating domain gaps. Our method, dubbed \textbf{D}ual \textbf{A}lignment of \textbf{D}omain and \textbf{M}odality (DADM), achieves state-of-the-art performance in extensive experiments across four challenging protocols, demonstrating its robustness in multi-modal domain generalization scenarios. The codes will be released soon.
中文摘要:多模态人脸防伪技术利用多种传感器检测伪造痕迹,但面临模态内和模态间不对齐问题,提出的DADM方法通过互信息和双重对齐优化解决了这些问题,实现了最优性能。
English Summary: Multi-modal face anti-spoofing leverages diverse sensors to detect spoofing traces but faces intra-domain and inter-domain misalignment issues, which are addressed by the proposed DADM method using mutual information and dual alignment optimization to achieve state-of-the-art performance.

Authors:Xin Lin, Chong Shi, Zuopeng Yang, Haojin Tang, Zhili Zhou
Title: SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
Abstract:
Recent open-vocabulary human-object interaction (OV-HOI) detection methods primarily rely on large language model (LLM) for generating auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities. To address these challenges, we propose a stratified granular comparison network. First, we introduce a granularity sensing alignment module that aggregates global semantic features with local details, refining interaction representations and ensuring robust alignment between intermediate visual features and text embeddings. Second, we develop a hierarchical group comparison module that recursively compares and groups classes using LLMs, generating fine-grained and discriminative descriptions for each interaction category. Experimental results on two widely-used benchmark datasets, SWIG-HOI and HICO-DET, demonstrate that our method achieves state-of-the-art results in OV-HOI detection. Codes will be released on https://github.com/Phil0212/SGC-Net.
Chinese: 针对现有开放词汇人-物交互检测方法存在的特征粒度不足和语义相似性混淆问题,我们提出了分层粒度比较网络,通过粒度感知对齐和层次化分组比较模块,在基准数据集上取得了最优性能。
English: Recent OV-HOI detection methods face challenges of feature granularity deficiency and semantic similarity confusion, which are addressed by our proposed stratified granular comparison network through granularity sensing alignment and hierarchical group comparison modules, achieving state-of-the-art results on benchmark datasets.

Authors:Zhaoyi Tian, Feifeng Wang, Shiwei Wang, Zihao Zhou, Yao Zhu, Liquan Shen
Title: High Dynamic Range Video Compression: A Large-Scale Benchmark Dataset and A Learned Bit-depth Scalable Compression Algorithm
Abstract:
Recently, learned video compression (LVC) has been undergoing a period of rapid development. However, due to the absence of large and high-quality high dynamic range (HDR) video training data, LVC on HDR video remains unexplored. In this paper, we are the first to collect a large-scale HDR video benchmark dataset, named HDRVD2K, featuring huge quantity, diverse scenes and multiple motion types. HDRVD2K fills the gap in video training data and facilitates the development of LVC on HDR videos. Based on HDRVD2K, we further propose the first learned bit-depth scalable video compression (LBSVC) network for HDR videos by effectively exploiting bit-depth redundancy between videos of multiple dynamic ranges. To achieve this, we first propose a compression-friendly bit-depth enhancement module (BEM) to effectively predict original HDR videos based on compressed tone-mapped low dynamic range (LDR) videos and a dynamic range prior, instead of reducing redundancy only through spatio-temporal predictions. Our method greatly improves the reconstruction quality and compression performance on HDR videos. Extensive experiments demonstrate the effectiveness of HDRVD2K for learned HDR video compression and the strong compression performance of our proposed LBSVC network. Code and dataset will be released at https://github.com/sdkinda/HDR-Learned-Video-Coding.
中文: 本文提出了首个大规模HDR视频数据集HDRVD2K,并设计了一种基于比特深度冗余的端到端可伸缩视频压缩网络,通过创新的比特深度增强模块显著提升了HDR视频的重建质量与压缩性能。
English: This paper introduces HDRVD2K, the first large-scale HDR video dataset, and proposes a learned bit-depth scalable video compression network that significantly enhances HDR video reconstruction and compression by exploiting bit-depth redundancy.

Authors:Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu
Title: Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
Abstract:
Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves comparable or even better performance to large-scale grounding-enhanced OS-Atlas with less than 0.1% of training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.
中文摘要:提出的查询推理方法通过从截图和坐标推断用户查询,弥合了GUI基础与推理之间的差距,在极少训练数据下显著超越现有技术。
English Summary: The proposed query inference method bridges the gap between GUI grounding and reasoning by inferring user queries from screenshots and coordinates, significantly outperforming previous techniques with minimal training data.
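As a rough illustration of the query-oriented pivot idea, the sketch below builds a hypothetical training example in which a model is asked to infer a plausible user query from a screenshot and one element's coordinates. The prompt wording, field names, and the build_example helper are illustrative assumptions, not the paper's actual data format.

# Hypothetical construction of a query-inference training example: given a
# screenshot and an element's bounding box, the model is asked to infer the
# user query that would lead to interacting with that element.

def build_example(screenshot_path, element_bbox, gold_query):
    """Return one supervised example for the query-inference pivot task."""
    x1, y1, x2, y2 = element_bbox
    prompt = (
        "You are given a GUI screenshot and the coordinates of one element.\n"
        f"Element bounding box: ({x1}, {y1}, {x2}, {y2}).\n"
        "Infer the user query that this element most likely satisfies."
    )
    return {
        "image": screenshot_path,   # path to the screenshot
        "instruction": prompt,      # coordinate-conditioned prompt
        "response": gold_query,     # inferred (or annotated) user query
    }

if __name__ == "__main__":
    example = build_example(
        "screenshots/settings_page.png",  # hypothetical path
        (120, 340, 280, 372),
        "Turn off Wi-Fi",
    )
    print(example["instruction"])
    print("Target:", example["response"])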

Authors:Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai
Title: MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
Abstract:
Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which are further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting cancer diagnosis.
Chinese: MIRROR是一种新颖的多模态学习方法,通过平衡模态对齐与特征保留,有效整合组织病理学和转录组学数据,在癌症亚型分类和生存分析中展现出卓越性能。
English: MIRROR is a novel multi-modal learning method that effectively integrates histopathology and transcriptomics data by balancing modality alignment with the preservation of unique features, demonstrating superior performance in cancer subtyping and survival analysis.

Authors:Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Title: Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Abstract:
Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach for all input conditions. In this paper, we revisit this process through extensive experiments. Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Code is available at https://github.com/LijunZhang01/Octopus.
中文: Octopus框架通过动态识别幻觉类型并自适应调整对比解码策略,有效减少大型视觉语言模型的虚构生成,在多项基准测试中超越现有方法。
English: The Octopus framework dynamically identifies hallucination types and adapts contrastive decoding strategies to effectively mitigate fabrication in Large Vision-Language Models, outperforming existing methods across multiple benchmarks.

Authors:Samuel Garske, Konrad Heidler, Bradley Evans, KC Wong, Xiao Xiang Zhu
Title: SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping
Abstract:
The increasing frequency of environmental hazards due to climate change underscores the urgent need for effective monitoring systems. Current approaches either rely on expensive labelled datasets, struggle with seasonal variations, or require multiple observations for confirmation (which delays detection). To address these challenges, this work presents SHAZAM - Self-Supervised Change Monitoring for Hazard Detection and Mapping. SHAZAM uses a lightweight conditional UNet to generate expected images of a region of interest (ROI) for any day of the year, allowing for the direct modelling of normal seasonal changes and the ability to distinguish potential hazards. A modified structural similarity measure compares the generated images with actual satellite observations to compute region-level anomaly scores and pixel-level hazard maps. Additionally, a theoretically grounded seasonal threshold eliminates the need for dataset-specific optimisation. Evaluated on four diverse datasets that contain bushfires (wildfires), burned regions, extreme and out-of-season snowfall, floods, droughts, algal blooms, and deforestation, SHAZAM achieved F1 score improvements of between 0.066 and 0.234 over existing methods. This was achieved primarily through more effective hazard detection (higher recall) while using only 473K parameters. SHAZAM demonstrated superior mapping capabilities through higher spatial resolution and improved ability to suppress background features while accentuating both immediate and gradual hazards. SHAZAM has been established as an effective and generalisable solution for hazard detection and mapping across different geographical regions and a diverse range of hazards. The Python code is available at: https://github.com/WiseGamgee/SHAZAM
中文: 本研究提出SHAZAM自监督系统,通过轻量级条件UNet建模季节性变化,并对比生成图像与卫星观测来检测环境灾害,在多种数据集上实现了显著性能提升。
English: This work introduces SHAZAM, a self-supervised system that uses a lightweight conditional UNet to model seasonal changes and detect environmental hazards by comparing generated images with satellite observations, achieving significant performance improvements across diverse datasets.
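The hazard-scoring step lends itself to a compact sketch: compare the expected image generated for a given day with the actual satellite observation using a structural-similarity map, and turn dissimilar regions into a pixel-level hazard map. The sketch below uses scikit-image's standard SSIM as a stand-in for the paper's modified similarity measure, and the threshold value is a placeholder assumption rather than SHAZAM's theoretically grounded seasonal threshold.

import numpy as np
from skimage.metrics import structural_similarity as ssim

def hazard_map(expected, observed, threshold=0.5):
    """Toy pixel-level hazard map from an expected image and a real observation.

    Plain SSIM substitutes for SHAZAM's modified structural similarity; higher
    values in the returned anomaly map mean "more anomalous".
    """
    # full=True returns the per-pixel similarity map alongside the mean score
    mean_score, sim_map = ssim(
        expected, observed, data_range=1.0, channel_axis=-1, full=True
    )
    anomaly = 1.0 - sim_map.mean(axis=-1)        # per-pixel dissimilarity
    region_score = float(anomaly.mean())         # region-level anomaly score
    return region_score, (anomaly > threshold)   # binary hazard mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expected = rng.random((64, 64, 3)).astype(np.float32)
    observed = expected.copy()
    observed[20:40, 20:40] = 1.0                 # simulate a localized hazard
    score, mask = hazard_map(expected, observed)
    print(f"region anomaly score: {score:.3f}, hazard pixels: {int(mask.sum())}")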

Authors:Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
Title: ABC: Achieving Better Control of Multimodal Embeddings using VLMs
Abstract:
Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. Such tasks require an embedding model whose output representation can be controlled by a natural language instruction. Existing CLIP-based approaches embed images and text independently, and fuse the result. We find that this results in weak interactions between modalities, and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control. Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/

Authors:Zihan Huang, Wei Fang, Tong Bu, Peng Xue, Zecheng Hao, Wenxuan Liu, Yuanhong Tang, Zhaofei Yu, Tiejun Huang
Title: Differential Coding for Training-Free ANN-to-SNN Conversion
Abstract:
Spiking Neural Networks (SNNs) exhibit significant potential due to their low energy consumption. Converting Artificial Neural Networks (ANNs) to SNNs is an efficient way to achieve high-performance SNNs. However, many conversion methods are based on rate coding, which requires numerous spikes and longer time-steps compared to directly trained SNNs, leading to increased energy consumption and latency. This article introduces differential coding for ANN-to-SNN conversion, a novel coding scheme that reduces spike counts and energy consumption by transmitting changes in rate information rather than rates directly, and explores its application across various layers. Additionally, the threshold iteration method is proposed to optimize thresholds based on activation distribution when converting Rectified Linear Units (ReLUs) to spiking neurons. Experimental results on various Convolutional Neural Networks (CNNs) and Transformers demonstrate that the proposed differential coding significantly improves accuracy while reducing energy consumption, particularly when combined with the threshold iteration method, achieving state-of-the-art performance. The source codes of the proposed method are available at https://github.com/h-z-h-cell/ANN-to-SNN-DCGS.
中文: 本文提出用于人工神经网络到脉冲神经网络转换的差分编码方法,通过传输速率变化而非直接速率来减少脉冲数量和能耗,并结合阈值迭代法优化神经元阈值,在多种网络结构上实现了精度和能效的最优性能。
English: This article introduces differential coding for ANN-to-SNN conversion, which reduces spike counts and energy consumption by transmitting rate changes instead of direct rates, and proposes a threshold iteration method to optimize neuron thresholds, achieving state-of-the-art performance in accuracy and efficiency across various networks.
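To make the coding idea concrete, here is a minimal toy comparison of classic rate coding with a differential scheme that transmits changes in the rate information instead of the rate itself. The neuron model, the signed "change events", and the reconstruction are simplified assumptions for illustration and not the paper's conversion pipeline.

import numpy as np

def rate_code(activation, timesteps):
    """Classic rate coding: roughly activation * timesteps spikes."""
    spikes = np.zeros(timesteps, dtype=int)
    acc = 0.0
    for t in range(timesteps):
        acc += activation
        if acc >= 1.0:
            spikes[t] = 1
            acc -= 1.0
    return spikes

def differential_code(activations, timesteps):
    """Toy differential coding: signal only changes in the target rate.

    A constant activation is transmitted once instead of at every time-step,
    which is where the spike-count (and energy) saving comes from.
    """
    prev = 0.0
    events = []  # (time-step, signed change) pairs, a stand-in for +/- spikes
    for t, a in enumerate(activations):
        delta = a - prev
        if abs(delta) > 1e-6:
            events.append((t, delta))
        prev = a
    # The receiver reconstructs the rate by integrating the transmitted changes.
    recon = np.zeros(timesteps)
    level, idx = 0.0, 0
    for t in range(timesteps):
        while idx < len(events) and events[idx][0] == t:
            level += events[idx][1]
            idx += 1
        recon[t] = level
    return events, recon

if __name__ == "__main__":
    T = 8
    constant_act = 0.75
    print("rate-coded spikes:", int(rate_code(constant_act, T).sum()))
    events, recon = differential_code([constant_act] * T, T)
    print("differential events:", len(events), "reconstructed rate:", recon[-1])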

Authors:Haoxin Li, Yingchen Yu, Qilong Wu, Hanwang Zhang, Song Bai, Boyang Li
Title: Learning to Animate Images from A Few Videos to Portray Delicate Human Actions
Abstract:
Despite recent progress, video generative models still struggle to animate static images into videos that portray delicate human actions, particularly when handling uncommon or novel actions whose training data are limited. In this paper, we explore the task of learning to animate images to portray delicate human actions using a small number of videos -- 16 or fewer -- which is highly valuable for real-world applications like video and movie production. Learning generalizable motion patterns that smoothly transition from user-provided reference images in a few-shot setting is highly challenging. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which learns generalizable motion patterns by forcing the model to reconstruct a video using the motion features and cross-frame correspondences of another video with the same motion but different appearance. This encourages transferable motion learning and mitigates overfitting to limited training data. Additionally, FLASH extends the decoder with additional layers to propagate details from the reference image to generated frames, improving transition smoothness. Human judges overwhelmingly favor FLASH, with 65.78% of 488 responses preferring FLASH over baselines. We strongly recommend watching the videos on the website: https://lihaoxin05.github.io/human_action_animation/, as motion artifacts are hard to notice from images.

Authors:Xinwei Luo, Songlin Zhao, Yun Zong, Yong Chen, Gui-shuang Ying, Lifang He
Title: SegImgNet: Segmentation-Guided Dual-Branch Network for Retinal Disease Diagnoses
Abstract:
Retinal images play a crucial role in diagnosing various diseases, as retinal structures provide essential diagnostic information. However, effectively capturing structural features while integrating them with contextual information from retinal images remains a challenge. In this work, we propose a segmentation-guided dual-branch network for retinal disease diagnosis using retinal images and their segmentation maps, named SegImgNet. SegImgNet incorporates a segmentation module to generate multi-scale retinal structural feature maps from retinal images. The classification module employs two encoders to independently extract features from segmented images and retinal images for disease classification. To further enhance feature extraction, we introduce the Segmentation-Guided Attention (SGA) block, which leverages feature maps from the segmentation module to refine the classification process. We evaluate SegImgNet on the public AIROGS dataset and the private e-ROP dataset. Experimental results demonstrate that SegImgNet consistently outperforms existing methods, underscoring its effectiveness in retinal disease diagnosis. The code is publicly available at https://github.com/hawk-sudo/SegImgNet.
中文摘要:提出的SegImgNet通过分割引导的双分支网络和专用注意力机制,有效整合视网膜图像的结构特征与上下文信息,在基准数据集上展现出卓越的疾病诊断性能。
English Summary: The proposed SegImgNet utilizes segmentation-guided dual-branch networks with a specialized attention mechanism to effectively integrate structural and contextual features from retinal images, demonstrating superior diagnostic performance on benchmark datasets.
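A rough sketch of a segmentation-guided attention block in the spirit of the SGA idea: segmentation features are turned into a spatial attention mask that reweights the classification branch's features. The module name, channel sizes, and the exact fusion below are assumptions; the real SGA block may differ.

import torch
import torch.nn as nn

class SegGuidedAttention(nn.Module):
    """Toy segmentation-guided attention: a spatial mask derived from
    segmentation features modulates classification features."""

    def __init__(self, seg_channels, cls_channels):
        super().__init__()
        self.to_mask = nn.Sequential(
            nn.Conv2d(seg_channels, 1, kernel_size=1),
            nn.Sigmoid(),                        # attention values in (0, 1)
        )
        self.fuse = nn.Conv2d(cls_channels, cls_channels, kernel_size=1)

    def forward(self, cls_feat, seg_feat):
        attn = self.to_mask(seg_feat)            # (B, 1, H, W) spatial mask
        return self.fuse(cls_feat * attn) + cls_feat   # residual reweighting

if __name__ == "__main__":
    block = SegGuidedAttention(seg_channels=32, cls_channels=64)
    cls_feat = torch.randn(2, 64, 28, 28)
    seg_feat = torch.randn(2, 32, 28, 28)
    print(block(cls_feat, seg_feat).shape)       # torch.Size([2, 64, 28, 28])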

Authors:Milad Yazdani, Yasamin Medghalchi, Pooria Ashrafian, Ilker Hacihaliloglu, Dena Shahriari
Title: Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality
Abstract:
Deep learning models have emerged as a powerful tool for various medical applications. However, their success depends on large, high-quality datasets that are challenging to obtain due to privacy concerns and costly annotation. Generative models, such as diffusion models, offer a potential solution by synthesizing medical images, but their practical adoption is hindered by long inference times. In this paper, we propose the use of an optimal transport flow matching approach to accelerate image generation. By introducing a straighter mapping between the source and target distribution, our method significantly reduces inference time while preserving and further enhancing the quality of the outputs. Furthermore, this approach is highly adaptable, supporting various medical imaging modalities, conditioning mechanisms (such as class labels and masks), and different spatial dimensions, including 2D and 3D. Beyond image generation, it can also be applied to related tasks such as image enhancement. Our results demonstrate the efficiency and versatility of this framework, making it a promising advancement for medical imaging applications. Code with checkpoints and a synthetic dataset (beneficial for classification and segmentation) is now available on: https://github.com/milad1378yz/MOTFM.
中文: 本文提出一种最优传输流匹配方法,通过构建更直接的源-目标分布映射来加速医学图像生成,在显著减少推理时间的同时提升输出质量,并支持多种影像模态与任务的高适应性应用。
English: This paper introduces an optimal transport flow matching approach that accelerates medical image generation by creating straighter source-to-target mappings, significantly reducing inference time while improving output quality and adaptability across various imaging modalities and tasks.
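The core training objective behind flow matching is compact enough to sketch: sample an interpolation time, move a source sample part of the way toward a data sample along a straight path, and regress the network's velocity prediction onto the constant displacement. The tiny MLP and 2-D toy data below are placeholder assumptions; the paper's models are conditional image networks.

import torch
import torch.nn as nn

# Minimal (rectified) flow-matching step on toy 2-D data: the network learns
# to predict the straight-line velocity x1 - x0 at points along the
# interpolation x_t = (1 - t) * x0 + t * x1.
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, 2) * 0.5 + 2.0     # stand-in "data" samples
    x0 = torch.randn(256, 2)                 # source (noise) samples
    t = torch.rand(256, 1)                   # interpolation times in [0, 1]
    xt = (1 - t) * x0 + t * x1               # point on the straight path
    target_v = x1 - x0                       # constant target velocity
    pred_v = net(torch.cat([xt, t], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: integrate the learned velocity field with a few Euler steps.
x = torch.randn(1000, 2)
for i in range(8):
    t = torch.full((1000, 1), i / 8)
    x = x + net(torch.cat([x, t], dim=1)).detach() / 8
print("sample mean (target roughly 2.0):", x.mean(dim=0))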

Authors:Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song
Title: Unified Video Action Model
Abstract:
A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video generation and action prediction remains challenging, and current video generation-based methods struggle to match the performance of direct policy learning in action accuracy and inference speed. To bridge this gap, we introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference. The key lies in learning a joint video-action latent representation and decoupling video-action decoding. The joint latent representation bridges the visual and action domains, effectively modeling the relationship between video and action sequences. Meanwhile, the decoupled decoding, powered by two lightweight diffusion heads, enables high-speed action inference by bypassing video generation during inference. Such a unified framework further enables versatile functionality through masked input training. By selectively masking actions or videos, a single model can tackle diverse tasks beyond policy learning, such as forward and inverse dynamics modeling and video generation. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications. Results are best viewed on https://unified-video-action-model.github.io/.

Authors:Amar Kumar, Anita Kriz, Mohammad Havaei, Tal Arbel
Title: PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion
Abstract:
Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures that are robust to the unique complexities posed by medical imaging data. Rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.
中文摘要:PRISM框架利用视觉-语言基础模型生成精准的语言引导医学图像反事实,能够选择性修改虚假相关性及疾病特征,从而开发出更鲁棒的分类器以供临床部署应用。
English Summary: The PRISM framework utilizes vision-language foundation models to generate precise, language-guided medical image counterfactuals, enabling selective modification of spurious correlations and disease features to develop more robust classifiers for clinical deployment.

Authors:Benedikt Blumenstiel, Nassim Ait Ali Braham, Conrad M Albrecht, Stefano Maurogiovanni, Paolo Fraccaro
Title: SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated
Abstract:
This technical report presents SSL4EO-S12 v1.1, a multimodal, multitemporal Earth Observation dataset designed for pretraining large-scale foundation models. Building on the success of SSL4EO-S12 v1.0, the new version addresses the previous challenges of data misalignment and a limited data structure for low-barrier, analysis-ready EO processing. SSL4EO-S12 v1.1 covers the world's 10,000 largest cities and their surroundings within a 50 km radius across four seasons, resulting in a diverse collection of nearly one million patches. SSL4EO-S12 v1.1 packages the data in the Zarr file format for cloud-efficient loading and representation of meta-information such as cloud masks and geolocation. Released under the CC-BY-4.0 license, SSL4EO-S12 v1.1 facilitates open research and provides a robust foundation for future advancements in self-supervised learning and geospatial analysis. The dataset is available online through https://datapub.fz-juelich.de/ssl4eo-s12, and we provide additional resources at https://github.com/DLR-MF-DAS/SSL4EO-S12-v1.1.
中文:SSL4EO-S12 v1.1是一个增强型多模态地球观测数据集,覆盖全球万个城市四季影像,采用Zarr格式存储并开放授权,专为自监督学习设计。
English: SSL4EO-S12 v1.1 is an enhanced multimodal Earth Observation dataset covering 10,000 cities across four seasons, designed for self-supervised learning with cloud-optimized Zarr format and open licensing.

Authors:Zhenxing Cui, Lu Chen, Yunhai Wang, Daniel Haehn, Yong Wang, Hanspeter Pfister
Title: Generalization of CNNs on Relational Reasoning with Bar Charts
Abstract:
This paper presents a systematic study of the generalization of convolutional neural networks (CNNs) and humans on relational reasoning tasks with bar charts. We first revisit previous experiments on graphical perception and update the benchmark performance of CNNs. We then test the generalization performance of CNNs on a classic relational reasoning task: estimating bar length ratios in a bar chart, by progressively perturbing the standard visualizations. We further conduct a user study to compare the performance of CNNs and humans. Our results show that CNNs outperform humans only when the training and test data have the same visual encodings. Otherwise, they may perform worse. We also find that CNNs are sensitive to perturbations in various visual encodings, regardless of their relevance to the target bars. Yet, humans are mainly influenced by bar lengths. Our study suggests that robust relational reasoning with visualizations is challenging for CNNs. Improving CNNs' generalization performance may require training them to better recognize task-related visual properties.
中文: 本研究表明,尽管卷积神经网络在视觉编码一致的条形图关系推理任务中表现优于人类,但在视觉干扰下其性能显著下降,而人类主要关注条形长度,这凸显了卷积神经网络在此类任务中实现稳健泛化的挑战。
English: This study demonstrates that while CNNs can outperform humans in relational reasoning tasks with bar charts when visual encodings remain consistent, their performance significantly deteriorates under visual perturbations, unlike humans who focus primarily on bar lengths, highlighting the challenge of achieving robust generalization in CNNs for such tasks.

Authors:Zezeng Li, Xiaoyu Du, Na Lei, Liming Chen, Weimin Wang
Title: NoPain: No-box Point Cloud Attack via Optimal Transport Singular Boundary
Abstract:
Adversarial attacks exploit the vulnerability of deep models against adversarial samples. Existing point cloud attackers are tailored to specific models, iteratively optimizing perturbations based on gradients in either a white-box or black-box setting. Despite their promising attack performance, they often struggle to produce transferable adversarial samples due to overfitting the specific parameters of surrogate models. To overcome this issue, we shift our focus to the data distribution itself and introduce a novel approach named NoPain, which employs optimal transport (OT) to identify the inherent singular boundaries of the data manifold for cross-network point cloud attacks. Specifically, we first calculate the OT mapping from noise to the target feature space, then identify singular boundaries by locating non-differentiable positions. Finally, we sample along singular boundaries to generate adversarial point clouds. Once the singular boundaries are determined, NoPain can efficiently produce adversarial samples without the need of iterative updates or guidance from the surrogate classifiers. Extensive experiments demonstrate that the proposed end-to-end method outperforms baseline approaches in terms of both transferability and efficiency, while also maintaining notable advantages even against defense strategies. Code and model are available at https://github.com/cognaclee/nopain
中文总结:NoPain方法创新性地利用最优传输技术识别数据流形的固有奇异边界,无需迭代更新或依赖代理模型即可高效生成具有强迁移性的对抗性点云。
English summary: The NoPain method introduces a novel approach using optimal transport to identify singular boundaries in data manifolds, enabling efficient generation of highly transferable adversarial point clouds without iterative updates or reliance on surrogate models.

Authors:Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang
Title: VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. VOILA employs an analogical mapping approach in the visual domain, requiring models to generate an image that completes an analogy between two given image pairs, reference and application, without relying on predefined choices. Our experiments demonstrate that the analogical reasoning tasks in VOILA present a challenge to MLLMs. Through multi-step analysis, we reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning. Notably, we observe that performance improves when following a multi-step strategy of least-to-most prompting. Comprehensive evaluations on open-source models and GPT-4o show that on text-based answers, the best accuracy for challenging scenarios is 13% (LLaMa 3.2) and even for simpler tasks is only 29% (GPT-4o), while human performance is significantly higher at 70% across both difficulty levels.
中文: VOILA作为新型评估基准,通过要求多模态大语言模型生成完成视觉类比的图像来测试其抽象关系推理能力,结果显示模型在跨图像理解方面存在显著困难,准确率远低于人类水平。
English: VOILA is a novel benchmark that evaluates multimodal large language models' abstract relational reasoning by requiring them to generate images completing visual analogies, revealing their significant struggles in inter-image understanding with accuracy far below human performance.

Authors:Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, Salman Khan
Title: LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Abstract:
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLMs performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining, addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
中文摘要:本综述系统探讨了大型语言模型的后训练方法,这些方法在预训练基础上通过增强推理能力、事实准确性和伦理对齐来优化模型性能,同时解决了灾难性遗忘等关键挑战并展望了未来研究方向。
English Summary: This survey systematically examines post-training methods that refine Large Language Models beyond pretraining by enhancing reasoning, factual accuracy, and ethical alignment, while addressing challenges like catastrophic forgetting and outlining future research directions.

Authors:Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
Title: MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing
Abstract:
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation, then introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art in the new task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
Chinese: MIGE是一个通过多模态指令统一处理主体驱动生成和指令编辑的框架,联合训练提升了任务表现并实现了跨任务泛化能力。
English: MIGE is a unified framework that standardizes subject-driven generation and instruction-based editing through multimodal instructions, enabling joint training for enhanced performance and generalization across tasks.

Authors:Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, Qixiang Ye
Title: Adaptive Keyframe Sampling for Long Video Understanding
Abstract:
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at https://github.com/ncTimTang/AKS.
中文: 本文提出自适应关键帧采样算法(AKS),通过优化关键帧与提示的相关性及视频覆盖范围,有效提升多模态大语言模型在长视频理解中的问答准确率。
English: This paper introduces Adaptive Keyframe Sampling (AKS), a plug-and-play module that optimizes video token selection by balancing relevance to prompts and video coverage, thereby enhancing long video understanding accuracy in multimodal large language models.
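The keyframe-selection objective, trading off relevance to the prompt against coverage of the video, can be illustrated with a simple greedy sketch over per-frame relevance scores. The scoring function and the distance-based coverage term below are assumptions for illustration, not the paper's adaptive algorithm.

import numpy as np

def select_keyframes(relevance, k, coverage_weight=0.5):
    """Greedy keyframe selection balancing prompt relevance and temporal coverage.

    relevance: per-frame relevance scores (e.g. frame-prompt similarity).
    Each candidate is scored by its relevance plus a bonus for being far in
    time from frames that were already selected, approximating coverage.
    """
    n = len(relevance)
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            nearest = min(abs(i - j) for j in selected) / n if selected else 1.0
            score = relevance[i] + coverage_weight * nearest
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rel = rng.random(100)                 # stand-in frame-prompt similarities
    print("selected keyframes:", select_keyframes(rel, k=8))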

Authors:Maria Koshkina, James H. Elder
Title: Towards long-term player tracking with graph hierarchies and domain-specific features
Abstract:
In team sports analytics, long-term player tracking remains a challenging task due to player appearance similarity, occlusion, and dynamic motion patterns. Accurately re-identifying players and reconnecting tracklets after extended absences from the field of view or prolonged occlusions is crucial for robust analysis. We introduce SportsSUSHI, a hierarchical graph-based approach that leverages domain-specific features, including jersey numbers, team IDs, and field coordinates, to enhance tracking accuracy. SportsSUSHI achieves high performance on the SoccerNet dataset and a newly proposed hockey tracking dataset. Our hockey dataset, recorded using a stationary camera capturing the entire playing surface, contains long sequences and annotations for team IDs and jersey numbers, making it well-suited for evaluating long-term tracking capabilities. The inclusion of domain-specific features in our approach significantly improves association accuracy, as demonstrated in our experiments. The dataset and code are available at https://github.com/mkoshkina/sports-SUSHI.
中文摘要:SportsSUSHI提出了一种基于分层图的方法,利用球衣号码和队伍标识等特定领域特征来提升团队运动中球员的长期追踪准确性,在足球和冰球数据集上均表现出优异性能。
English Summary: SportsSUSHI introduces a hierarchical graph-based method using domain-specific features like jersey numbers and team IDs to improve long-term player tracking in team sports, demonstrating high accuracy on soccer and hockey datasets.

Authors:Zihan Huang, Xinyu Shi, Zecheng Hao, Tong Bu, Jianhao Ding, Zhaofei Yu, Tiejun Huang
Title: Towards High-performance Spiking Transformers from ANN to SNN Conversion
Abstract:
Spiking neural networks (SNNs) show great potential due to their energy efficiency, fast processing capabilities, and robustness. There are two main approaches to constructing SNNs. Direct training methods require much memory, while conversion methods offer a simpler and more efficient option. However, current conversion methods mainly focus on converting convolutional neural networks (CNNs) to SNNs. Converting Transformers to SNNs is challenging because of the presence of non-linear modules. In this paper, we propose an Expectation Compensation Module to preserve the accuracy of the conversion. The core idea is to use information from the previous T time-steps to calculate the expected output at time-step T. We also propose a Multi-Threshold Neuron and the corresponding Parallel Parameter normalization to address the challenge of large time steps needed for high accuracy, aiming to reduce network latency and power consumption. Our experimental results demonstrate that our approach achieves state-of-the-art performance. For example, we achieve a top-1 accuracy of 88.60% with only a 1% loss in accuracy using 4 time steps while consuming only 35% of the original power of the Transformer. To our knowledge, this is the first successful Artificial Neural Network (ANN) to SNN conversion for Spiking Transformers that achieves high accuracy, low latency, and low power consumption on complex datasets. The source code of the proposed method is available at https://github.com/h-z-h-cell/Transformer-to-SNN-ECMT.
中文: 本文提出期望补偿模块和多阈值神经元,有效将Transformer转换为脉冲神经网络,在保持高精度的同时实现了低延迟和低功耗。
English: This paper introduces an Expectation Compensation Module and a Multi-Threshold Neuron to efficiently convert Transformers into Spiking Neural Networks, achieving high accuracy with low latency and power consumption.

Authors:Yunfan Lu, Xiaogang Xu, Hao Lu, Yanlin Qian, Pengteng Li, Huizai Yao, Bin Yang, Junyi Li, Qianyi Cai, Weiyu Guo, Hui Xiong
Title: SEE: See Everything Every Time -- Adaptive Brightness Adjustment for Broad Light Range Images via Events
Abstract:
Event cameras, with a high dynamic range exceeding 120 dB, significantly outperform traditional embedded cameras, robustly recording detailed changing information under various lighting conditions, including both low- and high-light situations. However, recent research on utilizing event data has primarily focused on low-light image enhancement, neglecting image enhancement and brightness adjustment across a broader range of lighting conditions, such as normal or high illumination. Based on this, we propose a novel research question: how to employ events to enhance and adaptively adjust the brightness of images captured under broad lighting conditions? To investigate this question, we first collected a new dataset, SEE-600K, consisting of 610,126 images and corresponding events across 202 scenarios, each featuring an average of four lighting conditions with over a 1000-fold variation in illumination. Subsequently, we propose a framework that effectively utilizes events to smoothly adjust image brightness through the use of prompts. Our framework captures color through sensor patterns, uses cross-attention to model events as a brightness dictionary, and adjusts the image's dynamic range to form a broad light-range representation (BLR), which is then decoded at the pixel level based on the brightness prompt. Experimental results demonstrate that our method not only performs well on the low-light enhancement dataset but also shows robust performance on broader light-range image enhancement using the SEE-600K dataset. Additionally, our approach enables pixel-level brightness adjustment, providing flexibility for post-processing and inspiring more imaging applications. The dataset and source code are publicly available at: https://github.com/yunfanLu/SEE.
中文: 事件相机虽能在多种光照下捕捉细节,但研究多集中于弱光增强,为此我们提出了新框架和数据集,以实现广泛光照条件下的自适应亮度调整。
English: Event cameras excel in capturing details across varied lighting but are underutilized beyond low-light enhancement, prompting the development of a framework and dataset for adaptive brightness adjustment in broad conditions.

Authors:Marina D'Amato, Jeroen van der Laak, Francesco Ciompi
Title: "No negatives needed": weakly-supervised regression for interpretable tumor detection in whole-slide histopathology images
Abstract:
Accurate tumor detection in digital pathology whole-slide images (WSIs) is crucial for cancer diagnosis and treatment planning. Multiple Instance Learning (MIL) has emerged as a widely used approach for weakly-supervised tumor detection with large-scale data without the need for manual annotations. However, traditional MIL methods often depend on classification tasks that require tumor-free cases as negative examples, which are challenging to obtain in real-world clinical workflows, especially for surgical resection specimens. We address this limitation by reformulating tumor detection as a regression task, estimating tumor percentages from WSIs, a clinically available target across multiple cancer types. In this paper, we provide an analysis of the proposed weakly-supervised regression framework by applying it to multiple organs, specimen types and clinical scenarios. We characterize the robustness of our framework to tumor percentage as a noisy regression target, and introduce a novel concept of amplification technique to improve tumor detection sensitivity when learning from small tumor regions. Finally, we provide interpretable insights into the model's predictions by analyzing visual attention and logit maps. Our code is available at https://github.com/DIAGNijmegen/tumor-percentage-mil-regression.
中文: 本研究提出了一种弱监督回归框架,通过估算临床可用的肿瘤百分比来检测全切片图像中的肿瘤,无需无肿瘤样本,并利用放大技术提高检测灵敏度。
English: This study introduces a weakly-supervised regression framework for tumor detection in whole-slide images, eliminating the need for tumor-free cases by estimating clinically available tumor percentages and enhancing sensitivity through amplification techniques.
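A minimal sketch of reformulating MIL as regression in this spirit: attention-pooled patch features are mapped to a tumor fraction in [0, 1] and trained against the slide-level percentage with an MSE loss. The layer sizes, the attention pooling form, and the loss choice are assumptions for illustration, not the paper's exact model.

import torch
import torch.nn as nn

class MILRegressor(nn.Module):
    """Attention-MIL pooling followed by a tumor-percentage regression head."""

    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, patches):               # patches: (num_patches, feat_dim)
        weights = torch.softmax(self.attn(patches), dim=0)   # attention over patches
        slide_feat = (weights * patches).sum(dim=0)          # weighted pooling
        return self.head(slide_feat).squeeze(-1), weights    # fraction in [0, 1]

if __name__ == "__main__":
    model = MILRegressor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    patches = torch.randn(300, 512)           # stand-in patch embeddings of one WSI
    target = torch.tensor(0.35)               # slide-level tumor percentage
    pred, attn = model(patches)
    loss = nn.functional.mse_loss(pred, target)
    loss.backward()
    opt.step()
    print(f"predicted tumor fraction: {pred.item():.3f}, loss: {loss.item():.4f}")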

Authors:Jingru Fu, Yuqi Zheng, Neel Dey, Daniel Ferreira, Rodrigo Moreno
Title: Synthesizing Individualized Aging Brains in Health and Disease with Generative Models and Parallel Transport
Abstract:
Simulating prospective magnetic resonance imaging (MRI) scans from a given individual brain image is challenging, as it requires accounting for canonical changes in aging and/or disease progression while also considering the individual brain's current status and unique characteristics. While current deep generative models can produce high-resolution anatomically accurate templates for population-wide studies, their ability to predict future aging trajectories for individuals remains limited, particularly in capturing subject-specific neuroanatomical variations over time. In this study, we introduce Individualized Brain Synthesis (InBrainSyn), a framework for synthesizing high-resolution subject-specific longitudinal MRI scans that simulate neurodegeneration in both Alzheimer's disease (AD) and normal aging. InBrainSyn uses a parallel transport algorithm to adapt the population-level aging trajectories learned by a generative deep template network, enabling individualized aging synthesis. As InBrainSyn uses diffeomorphic transformations to simulate aging, the synthesized images are topologically consistent with the original anatomy by design. We evaluated InBrainSyn both quantitatively and qualitatively on AD and healthy control cohorts from the Open Access Series of Imaging Studies - version 3 dataset. Experimentally, InBrainSyn can also model neuroanatomical transitions between normal aging and AD. An evaluation of an external set supports its generalizability. Overall, with only a single baseline scan, InBrainSyn synthesizes realistic 3D spatiotemporal T1w MRI scans, producing personalized longitudinal aging trajectories. The code for InBrainSyn is available at: https://github.com/Fjr9516/InBrainSyn.
中文: InBrainSyn框架通过微分同胚变换调整群体水平的老化轨迹,能够基于单次基线扫描合成个性化的纵向MRI图像,实现阿尔茨海默病和正常衰老中神经退行性变的真实模拟。
English: The InBrainSyn framework synthesizes personalized longitudinal MRI scans by adapting population-level aging trajectories through diffeomorphic transformations, enabling realistic simulation of neurodegeneration in Alzheimer's disease and normal aging from a single baseline scan.

Authors:Chanhui Lee, Yeonghwan Song, Jeany Son
Title: Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior
Abstract:
Data-free Universal Adversarial Perturbation (UAP) is an image-agnostic adversarial attack that deceives deep neural networks using a single perturbation generated solely from random noise without relying on data priors. However, traditional data-free UAP methods often suffer from limited transferability due to the absence of semantic content in random noise. To address this issue, we propose a novel data-free universal attack method that recursively extracts pseudo-semantic priors directly from the UAPs during training to enrich the semantic content within the data-free UAP framework. Our approach effectively leverages latent semantic information within UAPs via region sampling, enabling input transformations that are typically ineffective in traditional data-free UAP methods due to the lack of semantic cues, and significantly enhancing black-box transferability. Furthermore, we introduce a sample reweighting technique to mitigate potential imbalances from random sampling and transformations, emphasizing hard examples less affected by the UAPs. Comprehensive experiments on ImageNet show that our method achieves state-of-the-art performance in average fooling rate by a substantial margin, notably improves attack transferability across various CNN architectures compared to existing data-free UAP methods, and even surpasses data-dependent UAP methods. Code is available at: https://github.com/ChnanChan/PSP-UAP.
Chinese: 本文提出了一种新颖的无数据通用对抗扰动方法,通过在训练中递归提取扰动中的伪语义先验来增强迁移性,在不依赖数据的情况下实现了跨多种CNN架构的最优性能。
English: This paper introduces a novel data-free universal adversarial perturbation method that enhances transferability by recursively extracting pseudo-semantic priors from perturbations during training, achieving state-of-the-art performance across multiple CNN architectures without relying on data.

Authors:Junchao Zhu, Ruining Deng, Tianyuan Yao, Juming Xiong, Chongyu Qu, Junlin Guo, Siqi Lu, Yucheng Tang, Daguang Xu, Mengmeng Yin, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo
Title: MagNet: Multi-Level Attention Graph Network for Predicting High-Resolution Spatial Transcriptomics
Abstract:
The rapid development of spatial transcriptomics (ST) offers new opportunities to explore the gene expression patterns within the spatial microenvironment. Current research integrates pathological images to infer gene expression, addressing the high costs and time-consuming processes to generate spatial transcriptomics data. However, as spatial transcriptomics resolution continues to improve, existing methods remain primarily focused on gene expression prediction at low-resolution spot levels. These methods face significant challenges, especially the information bottleneck, when they are applied to high-resolution HD data. To bridge this gap, this paper introduces MagNet, a multi-level attention graph network designed for accurate prediction of high-resolution HD data. MagNet employs cross-attention layers to integrate features from multi-resolution image patches hierarchically and utilizes a GAT-Transformer module to aggregate neighborhood information. By integrating multilevel features, MagNet overcomes the limitations posed by low-resolution inputs in predicting high-resolution gene expression. We systematically evaluated MagNet and existing ST prediction models on both a private spatial transcriptomics dataset and a public dataset at three different resolution levels. The results demonstrate that MagNet achieves state-of-the-art performance at both spot level and high-resolution bin levels, providing a novel methodology and benchmark for future research and applications in high-resolution HD-level spatial transcriptomics. Code is available at https://github.com/Junchao-Zhu/MagNet.
中文: 本文提出MagNet多层注意力图网络,通过整合多分辨率图像特征和邻域信息克服高分辨率空间转录组学的信息瓶颈,在不同分辨率级别均实现了最先进的性能。
English: This paper introduces MagNet, a multi-level attention graph network that overcomes the information bottleneck in high-resolution spatial transcriptomics by integrating multi-resolution image features and neighborhood information, achieving state-of-the-art performance across different resolution levels.
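As a rough sketch of the multi-resolution fusion idea, the snippet below lets a spot-level query attend over features from a coarser-resolution neighborhood using a standard cross-attention layer. The feature dimensions and the two-level setup are assumptions, not MagNet's actual architecture.

import torch
import torch.nn as nn

class MultiResFusion(nn.Module):
    """Toy cross-attention fusion of multi-resolution patch features."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spot_feat, context_feats):
        # spot_feat: (B, 1, dim) query for the target spot/bin
        # context_feats: (B, N, dim) features from coarser-resolution patches
        fused, _ = self.cross_attn(spot_feat, context_feats, context_feats)
        return self.norm(spot_feat + fused)      # residual connection

if __name__ == "__main__":
    fusion = MultiResFusion()
    spot = torch.randn(8, 1, 256)                # fine-resolution query features
    context = torch.randn(8, 9, 256)             # e.g. a 3x3 coarser neighborhood
    print(fusion(spot, context).shape)           # torch.Size([8, 1, 256])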

Authors:Woo Kyoung Han, Byeonghun Lee, Hyunmin Cho, Sunghoon Im, Kyong Hwan Jin
Title: Towards Lossless Implicit Neural Representation via Bit Plane Decomposition
Abstract:
We quantify the upper bound on the size of the implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hypothesis that reducing the upper bound leads to faster convergence with constant model size. Our method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals, such as 16-bit, which was previously unachievable. We are the first to identify a bit bias, whereby the INR prioritizes the most significant bit (MSB). We expand the application of the INR task to bit depth expansion, lossless image compression, and extreme network quantization. Our source code is available at https://github.com/WooKyoungHan/LosslessINR.
中文: 本研究提出了一种位平面分解方法,通过降低隐式神经表示模型大小的上限,实现了高比特深度信号的无损表示,并拓展了在比特深度扩展和压缩等领域的应用。
English: This study introduces a bit-plane decomposition method that reduces the upper bound on implicit neural representation (INR) model size, enabling lossless representation for high bit-depth signals and expanding applications to bit depth expansion and compression.
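The decomposition itself is easy to illustrate: a 16-bit signal is split into binary planes, each of which an INR can be asked to predict, and stacking the planes back reproduces the signal exactly. The sketch below shows only the lossless decompose/reconstruct step, not the INR fitting.

import numpy as np

def to_bit_planes(img, bits=16):
    """Split an unsigned-integer image into `bits` binary planes (MSB first)."""
    planes = [(img >> b) & 1 for b in range(bits - 1, -1, -1)]
    return np.stack(planes).astype(np.uint8)

def from_bit_planes(planes):
    """Reassemble the original integers from binary planes (lossless)."""
    bits = planes.shape[0]
    img = np.zeros(planes.shape[1:], dtype=np.uint32)
    for i in range(bits):
        img = (img << 1) | planes[i]
    return img

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img16 = rng.integers(0, 2**16, size=(32, 32), dtype=np.uint32)
    planes = to_bit_planes(img16, bits=16)       # 16 binary targets for the INR
    recon = from_bit_planes(planes)
    print("lossless:", bool(np.array_equal(recon, img16)))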

Authors:Xue Yang, Tao Chen, Lei Guo, Wenbo Jiang, Ji Guo, Yongming Li, Jiaming He
Title: BadRefSR: Backdoor Attacks Against Reference-based Image Super Resolution
Abstract:
Reference-based image super-resolution (RefSR) represents a promising advancement in super-resolution (SR). In contrast to single-image super-resolution (SISR), RefSR leverages an additional reference image to help recover high-frequency details, yet its vulnerability to backdoor attacks has not been explored. To fill this research gap, we propose a novel attack framework called BadRefSR, which embeds backdoors in the RefSR model by adding triggers to the reference images and training with a mixed loss function. Extensive experiments across various backdoor attack settings demonstrate the effectiveness of BadRefSR. The compromised RefSR network performs normally on clean input images, while outputting attacker-specified target images on triggered input images. Our study aims to alert researchers to the potential backdoor risks in RefSR. Codes are available at https://github.com/xuefusiji/BadRefSR.
中文:BadRefSR提出了一种针对参考图像超分辨率模型的后门攻击框架,通过在参考图像中嵌入触发器来操控输出结果,同时在干净输入上保持正常性能,揭示了该技术潜在的安全隐患。
English: BadRefSR introduces a backdoor attack framework for reference-based image super-resolution models, embedding triggers in reference images to manipulate outputs while maintaining normal performance on clean inputs, highlighting security risks in RefSR systems.
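The mixed training objective can be sketched in a few lines: clean reference images are supervised toward the true high-resolution target, while triggered reference images are pushed toward the attacker-specified target. The trigger (a small white patch), the L1 losses, and the weighting are illustrative assumptions rather than the paper's exact setup.

import torch
import torch.nn.functional as F

def add_trigger(ref, size=8):
    """Stamp a small white patch into the reference image as the backdoor trigger."""
    triggered = ref.clone()
    triggered[..., :size, :size] = 1.0
    return triggered

def mixed_backdoor_loss(model, lr_img, ref_img, hr_target, attack_target, alpha=0.5):
    """Toy mixed loss: clean behaviour on clean references, attacker-specified
    output when the reference carries the trigger."""
    clean_out = model(lr_img, ref_img)
    bad_out = model(lr_img, add_trigger(ref_img))
    clean_loss = F.l1_loss(clean_out, hr_target)
    attack_loss = F.l1_loss(bad_out, attack_target)
    return clean_loss + alpha * attack_loss

if __name__ == "__main__":
    # A stand-in "RefSR model" that simply upsamples the LR input.
    model = lambda lr, ref: F.interpolate(lr, scale_factor=4, mode="bilinear",
                                          align_corners=False)
    lr = torch.rand(2, 3, 16, 16)
    ref = torch.rand(2, 3, 64, 64)
    hr = torch.rand(2, 3, 64, 64)
    attack = torch.zeros(2, 3, 64, 64)           # attacker-specified target image
    print("mixed loss:", mixed_backdoor_loss(model, lr, ref, hr, attack).item())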

Authors:Shaoming Li, Qing Cai, Songqi Kong, Runqing Tan, Heng Tong, Shiji Qiu, Yongguo Jiang, Zhi Liu
Title: MESC-3D:Mining Effective Semantic Cues for 3D Reconstruction from a Single Image
Abstract:
Reconstructing 3D shapes from a single image plays an important role in computer vision. Many methods have been proposed and achieve impressive performance. However, existing methods mainly focus on extracting semantic information from images and then simply concatenating it with 3D point clouds without further exploring the concatenated semantics. As a result, these entangled semantic features significantly hinder the reconstruction performance. In this paper, we propose a novel single-image 3D reconstruction method called Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D), which can actively mine effective semantic cues from entangled features. Specifically, we design an Effective Semantic Mining Module to establish connections between point clouds and image semantic attributes, enabling the point clouds to autonomously select the necessary information. Furthermore, to address the potential insufficiencies in semantic information from a single image, such as occlusions, inspired by the human ability to represent 3D objects using prior knowledge drawn from daily experiences, we introduce a 3D Semantic Prior Learning Module. This module incorporates semantic understanding of spatial structures, enabling the model to interpret and reconstruct 3D objects with greater accuracy and realism, closely mirroring human perception of complex 3D environments. Extensive evaluations show that our method achieves significant improvements in reconstruction quality and robustness compared to prior works. Additionally, further experiments validate the method's strong generalization capability and excellent zero-shot performance on unseen classes. Code is available at https://github.com/QINGQINGLE/MESC-3D.
中文: 提出的MESC-3D方法通过从纠缠特征中主动挖掘有效语义线索并引入三维语义先验知识,显著提升了单图像三维重建的质量、鲁棒性和泛化能力,优于现有方法。
English: The proposed MESC-3D method enhances single-image 3D reconstruction by actively mining effective semantic cues from entangled features and incorporating 3D semantic priors, achieving superior quality, robustness, and generalization compared to existing approaches.

Authors:Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen
Title: Oscillation-Reduced MXFP4 Training for Vision Transformers
Abstract:
Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and there is a lack of systematic research on the reason. In this work, we propose a novel training method, TetraJet, for more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full precision training. The code is available at https://github.com/thu-ml/TetraJet-MXFP4Training.
中文摘要:本研究提出TetraJet新型训练方法,通过识别并利用Q-EMA和Q-Ramping技术解决MXFP4训练中的权重振荡问题,相比基线方法将精度损失降低超过50%,在视觉Transformer上实现了与全精度训练相媲美的性能。
English Summary: The study introduces TetraJet, a novel training method that addresses accuracy degradation in FP4 pre-training by identifying and mitigating weight oscillation through Q-EMA and Q-Ramping techniques, achieving over 50% reduction in accuracy loss compared to baseline methods.
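To illustrate the oscillation-reduction idea attributed to Q-EMA, the sketch below keeps an exponential moving average of the weights and fake-quantizes that smoothed copy instead of the raw weights, so the quantized values flip between grid points less often. The 16-level grid, EMA decay, and per-tensor scaling are assumptions and not the paper's MXFP4 implementation.

import torch

class EMAQuantizer:
    """Toy EMA-smoothed fake quantizer to damp weight oscillation."""

    def __init__(self, weight, decay=0.99, levels=16):
        self.shadow = weight.detach().clone()    # EMA copy of the weights
        self.decay = decay
        self.levels = levels                     # 16 levels, roughly a 4-bit grid

    def quantize(self, weight):
        # Update the EMA copy, then fake-quantize it per-tensor.
        self.shadow.mul_(self.decay).add_(weight.detach(), alpha=1 - self.decay)
        scale = self.shadow.abs().max() / (self.levels / 2 - 1) + 1e-12
        q = torch.round(self.shadow / scale).clamp(-self.levels / 2,
                                                   self.levels / 2 - 1)
        return q * scale                         # dequantized forward-pass weights

if __name__ == "__main__":
    w = torch.randn(4, 4)
    quant = EMAQuantizer(w)
    for step in range(3):
        w += 0.01 * torch.randn_like(w)          # simulate small training updates
        wq = quant.quantize(w)
        print(f"step {step}: max quant error {(w - wq).abs().max().item():.4f}")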

Authors:Bach-Thuan Bui, Huy-Hoang Bui, Yasuyuki Fujii, Dinh-Tuan Tran, Joo-Ho Lee
Title: Improved 3D Point-Line Mapping Regression for Camera Relocalization
Abstract:
In this paper, we present a new approach for improving 3D point and line mapping regression for camera re-localization. Previous methods typically rely on feature matching (FM) with stored descriptors or use a single network to encode both points and lines. While FM-based methods perform well in large-scale environments, they become computationally expensive with a growing number of mapping points and lines. Conversely, approaches that learn to encode mapping features within a single network reduce memory footprint but are prone to overfitting, as they may capture unnecessary correlations between points and lines. We propose that these features should be learned independently, each with a distinct focus, to achieve optimal accuracy. To this end, we introduce a new architecture that learns to prioritize each feature independently before combining them for localization. Experimental results demonstrate that our approach significantly enhances the 3D map point and line regression performance for camera re-localization. The implementation of our method will be publicly available at: https://github.com/ais-lab/pl2map/.
中文: 本文提出了一种新架构,通过独立学习点和线特征再整合用于相机重定位,显著提升了3D映射回归性能,有效解决了现有方法计算量大和容易过拟合的问题。
English: This paper introduces a novel architecture that independently learns point and line features before integrating them for camera re-localization, significantly improving 3D mapping regression while overcoming the computational and overfitting limitations of prior methods.

Authors:Qiao Yan, Yuchen Yuan, Xiaowei Hu, Yihan Wang, Jiaqi Xu, Jinpeng Li, Chi-Wing Fu, Pheng-Ann Heng
Title: MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models
Abstract:
The increasing use of vision-language models (VLMs) in healthcare applications presents great challenges related to hallucinations, in which the models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming the diagnosis and treatments. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boost their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Codes and dataset will be available at https://github.com/russellyq/MedHallTune.
中文: 本研究提出了MedHallTune这一大规模基准,用于评估和减少医学视觉语言模型的幻觉问题,实验表明基于该数据的微调能有效提升模型在医疗应用中的可靠性。
English: This study introduces MedHallTune, a large-scale benchmark to evaluate and reduce hallucinations in medical vision-language models, demonstrating that fine-tuning with it enhances model reliability for healthcare applications.

Authors:Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, Yan Lu
Title: Towards Practical Real-Time Neural Video Compression
Abstract:
We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs depends on 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show our proposed DCVC-RT achieves an impressive average encoding/decoding speed at 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code is available at https://github.com/microsoft/DCVC.
中文摘要:本文提出一种实时神经视频编解码器,通过隐式时间建模和单一低分辨率潜在表示降低操作成本,在1080p视频上实现125.2/112.8 fps的编解码速度,相比H.266/VTM节省21%码率。
English Summary: This paper presents a real-time neural video codec that enhances coding speed by minimizing operational costs through implicit temporal modeling and single low-resolution latent representations, achieving 125.2/112.8 fps for 1080p video while reducing bitrate by 21% compared to H.266/VTM.

Authors:Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, Rongrong Ji
Title: Towards General Visual-Linguistic Face Forgery Detection (V2)
Abstract:
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct Multimodal Large Language Model (MLLM) generation, often suffer from hallucination issues, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification, followed by a comprehensive prompting strategy to guide MLLMs in reducing hallucination. We validate our approach through fine-tuning both CLIP with a three-branch training framework combining unimodal and multimodal objectives, and MLLMs with our structured annotations. Experimental results demonstrate that our method not only achieves more accurate annotations with higher region identification accuracy, but also leads to improvements in model performance across various forgery detection benchmarks. Our code is available at https://github.com/skJack/VLFFD.git.
中文: 面部伪造文本生成器(FFTG)是一种新颖的标注流程,通过利用伪造掩码和全面的提示策略来减少多模态模型中的幻觉问题,从而提高了面部伪造检测文本描述的准确性,并在多个基准测试中提升了模型性能。
English: The Face Forgery Text Generator (FFTG) is a novel annotation pipeline that enhances the accuracy of text descriptions for face forgery detection by utilizing forgery masks and a comprehensive prompting strategy to reduce hallucination in multimodal models, leading to improved performance across benchmarks.

Authors:Shanshan Wan, Yingmei Wei, Lai Kang, Tianrui Shen, Haixuan Wang, Yee-Hong Yang
Title: SciceVPR: Stable Cross-Image Correlation Enhanced Model for Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) is a major challenge for robotics and autonomous systems, with the goal of predicting the location of an image based solely on its visual features. State-of-the-art (SOTA) models extract global descriptors using the powerful foundation model DINOv2 as backbone. These models either explore the cross-image correlation or propose a time-consuming two-stage re-ranking strategy to achieve better performance. However, existing works only utilize the final output of DINOv2, and the current cross-image correlation causes unstable retrieval results. To produce both discriminative and constant global descriptors, this paper proposes stable cross-image correlation enhanced model for VPR called SciceVPR. This model explores the full potential of DINOv2 in providing useful feature representations that implicitly encode valuable contextual knowledge. Specifically, SciceVPR first uses a multi-layer feature fusion module to capture increasingly detailed task-relevant channel and spatial information from the multi-layer output of DINOv2. Secondly, SciceVPR considers the invariant correlation between images within a batch as valuable knowledge to be distilled into the proposed self-enhanced encoder. In this way, SciceVPR can acquire fairly robust global features regardless of domain shifts (e.g., changes in illumination, weather and viewpoint between pictures taken in the same place). Experimental results demonstrate that the base variant, SciceVPR-B, outperforms SOTA one-stage methods with single input on multiple datasets with varying domain conditions. The large variant, SciceVPR-L, performs on par with SOTA two-stage models, scoring over 3% higher in Recall@1 compared to existing models on the challenging Tokyo24/7 dataset. Our code will be released at https://github.com/shuimushan/SciceVPR.
中文: 本文提出SciceVPR模型,通过融合DINOv2多层特征并提取稳定的跨图像关联知识,在多种域条件下实现了鲁棒的视觉位置识别性能。
English: This paper introduces SciceVPR, a model that enhances visual place recognition by leveraging multi-layer DINOv2 features and distilling stable cross-image correlations to achieve robust performance across varying domain conditions.

Authors:Xiang Xiang, Zhuo Xu, Yao Deng, Qinhao Zhou, Yifan Liang, Ke Chen, Qingfang Zheng, Yaowei Wang, Xilin Chen, Wen Gao
Title: OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing
Abstract:
The advancement of remote sensing, including satellite systems, facilitates the continuous acquisition of remote sensing imagery globally, introducing novel challenges for achieving open-world tasks. Deployed models need to continuously adjust to a constant influx of new data, which frequently exhibits diverse shifts from the data encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update their parameters without forgetting learned knowledge, as has been considered in works on a variety of open-world tasks. However, existing studies are typically conducted within a single dataset to simulate realistic conditions, with a lack of large-scale benchmarks capable of evaluating multiple open-world tasks. In this paper, we introduce \textbf{OpenEarthSensing (OES)}, a large-scale fine-grained benchmark for open-world remote sensing. OES includes 189 scene and object categories, covering the vast majority of potential semantic shifts that may occur in the real world. Additionally, to provide a more comprehensive testbed for evaluating the generalization performance, OES encompasses five data domains with significant covariate shifts, including two RGB satellite domains, one RGB aerial domain, one multispectral RGB domain, and one infrared domain. We evaluate the baselines and existing methods for diverse tasks on OES, demonstrating that it serves as a meaningful and challenging benchmark for open-world remote sensing. The proposed dataset OES is available at https://haiv-lab.github.io/OES.

Authors:Yu Pan, Jiahao Chen, Bingrong Dai, Lin Wang, Yi Du, Jiao Liu
Title: Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
Abstract:
In recent years, Diffusion Models (DMs) have demonstrated significant advances in the field of image generation. However, according to current research, DMs are vulnerable to backdoor attacks, which allow attackers to control the model's output by inputting data containing covert triggers, such as a specific visual patch or phrase. Existing defense strategies are well equipped to thwart such attacks through backdoor detection and trigger inversion because previous attack methods are constrained by limited input spaces and low-dimensional triggers. For example, visual triggers are easily observed by defenders, text-based or attention-based triggers are more susceptible to neural network detection. To explore more possibilities of backdoor attack in DMs, we propose Gungnir, a novel method that enables attackers to activate the backdoor in DMs through style triggers within input images. Our approach proposes using stylistic features as triggers for the first time and implements backdoor attacks successfully in image-to-image tasks by introducing Reconstructing-Adversarial Noise (RAN) and Short-Term Timesteps-Retention (STTR). Our technique generates trigger-embedded images that are perceptually indistinguishable from clean images, thus bypassing both manual inspection and automated detection neural networks. Experiments demonstrate that Gungnir can easily bypass existing defense methods. Among existing DM defense frameworks, our approach achieves a 0 backdoor detection rate (BDR). Our codes are available at https://github.com/paoche11/Gungnir.
中文摘要:本文提出Gungnir方法,首次在扩散模型中利用风格特征作为隐蔽触发器,通过重构对抗噪声和短时步保留技术实现无法被检测的后门攻击,实验表明该方法能完全规避现有防御机制。
English Summary: The paper introduces Gungnir, a novel backdoor attack method for Diffusion Models that uses imperceptible style triggers and specialized noise techniques to bypass existing defenses, achieving a 0% detection rate in experiments.

Authors:Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound
Title: T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
Abstract:
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.
中文:T2ICount是一种基于扩散的框架,通过分层语义校正模块和表征区域一致性损失提升文本敏感性,在多个基准测试中实现了卓越的零样本物体计数性能。
English: T2ICount is a diffusion-based framework that enhances zero-shot object counting by improving text sensitivity through a Hierarchical Semantic Correction Module and Representational Regional Coherence Loss, achieving superior performance across benchmarks.

Authors:Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane, Zhaolin Chen
Title: Style Content Decomposition-based Data Augmentation for Domain Generalizable Medical Image Segmentation
Abstract:
Due to domain shifts across diverse medical imaging modalities, learned segmentation models often suffer significant performance degradation during deployment. These domain shifts, typically caused by variations in imaging systems, generally comprise two principal components: 1) \textbf{"style" shifts}, referring to global disparities in image properties such as illumination, contrast, and color; and 2) \textbf{"content" shifts}, which involve local discrepancies in anatomical structures. To address domain shifts in medical image segmentation, a core challenge arises: how can we decouple the factors within images that determine their "style" and "content" components? To this end, we first propose a linear style-content decomposition method that factorizes an image into style codes and content maps, explicitly modeling the "style" and "content" components. Building on this, we introduce a \textbf{Sty}le-\textbf{Con}tent decomposition-based data \textbf{a}ugmentation algorithm (StyCona), which leverages this decomposition strategy to guide augmentation of both the global style and local content of source-domain images, enabling the training of a well-generalized model for domain-generalizable medical image segmentation. StyCona is a simple yet effective plug-and-play module that substantially improves model generalization without requiring additional training parameters or modifications to segmentation model architectures. Experiments on cardiac magnetic resonance imaging and fundus photography segmentation tasks, with single and multiple target domains respectively, demonstrate the effectiveness of StyCona and its superiority over state-of-the-art domain generalization methods. The code will be released at https://github.com/Senyh/StyCona.
Chinese: 提出的StyCona算法通过将医学图像分解为风格和内容成分,实现了有效的数据增强,能在不改变模型架构的情况下显著提升分割模型的泛化能力,从而解决跨域适应问题。
English: The proposed StyCona algorithm addresses domain shifts in medical image segmentation by decomposing images into style and content components, enabling effective data augmentation that improves model generalization without architectural changes.
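A minimal sketch of the style-content decomposition and augmentation idea, assuming a simple linear factorization (per-channel statistics as the style code, the normalized residual as the content map); StyCona's exact factorization and augmentation schedule may differ:

```python
import torch

def decompose(img, eps=1e-6):
    """Split an image batch (B, C, H, W) into a style code and a content map.

    Illustrative linear decomposition: per-channel mean/std act as "style",
    the normalized residual as "content". The actual StyCona factorization
    may be different.
    """
    mu = img.mean(dim=(2, 3), keepdim=True)
    sigma = img.std(dim=(2, 3), keepdim=True) + eps
    style = torch.cat([mu, sigma], dim=1)         # (B, 2C, 1, 1) style code
    content = (img - mu) / sigma                  # (B, C, H, W) content map
    return style, content

def recompose(style, content):
    mu, sigma = style.chunk(2, dim=1)
    return content * sigma + mu

def stycona_like_augment(batch, alpha=0.5):
    """Augment global style by mixing style codes between shuffled samples,
    and perturb local content with small additive noise (both hypothetical
    choices for illustration)."""
    style, content = decompose(batch)
    perm = torch.randperm(batch.size(0))
    mixed_style = (1 - alpha) * style + alpha * style[perm]       # global style shift
    jittered_content = content + 0.1 * torch.randn_like(content)  # local content shift
    return recompose(mixed_style, jittered_content)

if __name__ == "__main__":
    imgs = torch.rand(4, 3, 64, 64)
    print(stycona_like_augment(imgs).shape)  # torch.Size([4, 3, 64, 64])
```

Because the augmentation operates purely on the data, it can be dropped in front of any segmentation model without architectural changes, which is the plug-and-play property the abstract emphasizes.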

Authors:Vicente Balmaseda, Bokun Wang, Ching-Long Lin, Tianbao Yang
Title: Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning
Abstract:
In self-supervised contrastive learning, negative pairs are typically constructed using an anchor image and a sample drawn from the entire dataset, excluding the anchor. However, this approach can result in the creation of negative pairs with similar semantics, referred to as "false negatives", leading to their embeddings being falsely pushed apart. To address this issue, we introduce GloFND, an optimization-based approach that automatically learns on the fly the threshold for each anchor data to identify its false negatives during training. In contrast to previous methods for false negative discovery, our approach globally detects false negatives across the entire dataset rather than locally within the mini-batch. Moreover, its per-iteration computation cost remains independent of the dataset size. Experimental results on image and image-text data demonstrate the effectiveness of the proposed method. Our implementation is available at https://github.com/vibalcam/GloFND.
中文: 本文提出GloFND方法,通过自适应学习每个锚点的阈值来全局识别自监督对比学习中的假阴性样本,有效避免语义相似样本被错误分离,且计算成本不随数据集规模增加。
English: This paper introduces GloFND, an optimization-based method that globally identifies false negatives in self-supervised contrastive learning by adaptively learning per-anchor thresholds, effectively preventing semantically similar pairs from being incorrectly separated while maintaining computational efficiency independent of dataset size.
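A rough sketch of the per-anchor thresholding idea: negatives whose similarity to the anchor exceeds a learned threshold are treated as false negatives and dropped from the contrastive denominator. The threshold update shown here (a stochastic quantile step) is an assumption for illustration, not GloFND's exact optimization rule:

```python
import torch
import torch.nn.functional as F

def glofnd_like_loss(z, pos, lam, alpha=0.01, temperature=0.1, lr_lam=0.05):
    """Contrastive loss where per-anchor thresholds `lam` flag likely false
    negatives and exclude them.  z, pos: (N, D) L2-normalized embeddings;
    lam: (N,) per-anchor thresholds, updated in place."""
    sim = z @ z.t()                                    # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)

    # Nudge each threshold toward the (1 - alpha) similarity quantile.
    with torch.no_grad():
        above = (sim.masked_fill(self_mask, -1.0) > lam[:, None]).float()
        lam += lr_lam * (above.mean(dim=1) - alpha)    # stochastic quantile step

    # Negatives above the anchor's threshold are treated as false negatives.
    false_neg = (sim > lam[:, None]) & ~self_mask
    logits = (sim / temperature).masked_fill(self_mask | false_neg, float('-inf'))

    pos_logits = (z * pos).sum(dim=1) / temperature
    denom = torch.logsumexp(torch.cat([pos_logits[:, None], logits], dim=1), dim=1)
    return (denom - pos_logits).mean()

if __name__ == "__main__":
    z = F.normalize(torch.randn(8, 32), dim=1)
    pos = F.normalize(z + 0.05 * torch.randn_like(z), dim=1)
    lam = torch.full((8,), 0.9)
    print(glofnd_like_loss(z, pos, lam).item())
```

Because the thresholds are maintained per anchor across iterations rather than recomputed inside each mini-batch, the discovery is global while the per-step cost stays independent of dataset size, matching the property claimed in the abstract.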

Authors:Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek
Title: Interpreting CLIP with Hierarchical Sparse Autoencoders
Abstract:
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited by optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA. We make the codebase available at https://github.com/WolodjaZ/MSAE.
Chinese: Matryoshka稀疏自编码器(MSAE)通过分层架构同时优化重构质量与稀疏性,在CLIP等视觉语言模型中实现了最先进的性能,并能通过语义概念提取有效进行特征解释与控制。
English: The Matryoshka Sparse Autoencoder (MSAE) introduces a hierarchical architecture that simultaneously optimizes reconstruction quality and sparsity for vision-language models like CLIP, achieving state-of-the-art performance while enabling effective feature interpretation and control through semantic concept extraction.
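A toy sketch of the Matryoshka idea: the sparse code is decoded from several nested prefixes, so coarse prefixes must already reconstruct well while the full code refines the result. Activation choice, sparsity penalty, and prefix sizes here are illustrative assumptions, not the exact MSAE recipe:

```python
import torch
import torch.nn as nn

class MatryoshkaSAE(nn.Module):
    """Sparse autoencoder trained at multiple nested granularities."""

    def __init__(self, d_in=512, d_latent=4096, prefixes=(256, 1024, 4096)):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)
        self.prefixes = prefixes

    def forward(self, x, l1_coef=1e-3):
        z = torch.relu(self.enc(x))                    # non-negative sparse code
        loss = torch.tensor(0.0, device=x.device)
        for k in self.prefixes:                        # nested granularities
            mask = torch.zeros_like(z)
            mask[:, :k] = 1.0                          # keep only the first k units
            loss = loss + (self.dec(z * mask) - x).pow(2).mean()
        loss = loss / len(self.prefixes) + l1_coef * z.abs().mean()
        return loss, z

if __name__ == "__main__":
    sae = MatryoshkaSAE()
    clip_feats = torch.randn(16, 512)                  # e.g. CLIP image embeddings
    loss, code = sae(clip_feats)
    print(loss.item(), code.shape)
```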

Authors:Sari Masri, Huthaifa I. Ashqar, Mohammed Elhenawy
Title: Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection
Abstract:
Traffic control in unsignalized urban intersections presents significant challenges due to the complexity, frequent conflicts, and blind spots. This study explores the capability of leveraging Multimodal Large Language Models (MLLMs), such as GPT-4o, to provide logical and visual reasoning by directly using bird's-eye-view videos of four-legged intersections. In this proposed method, GPT-4o acts as an intelligent system to detect conflicts and provide explanations and recommendations for the drivers. The fine-tuned model achieved an accuracy of 77.14%, while the manual evaluation of the true predicted values of the fine-tuned GPT-4o showed significant achievements of 89.9% accuracy for model-generated explanations and 92.3% for the recommended next actions. These results highlight the feasibility of using MLLMs for real-time traffic management using videos as inputs, offering scalable and actionable insights into intersection traffic management and operation. Code used in this study is available at https://github.com/sarimasri3/Traffic-Intersection-Conflict-Detection-using-images.git.
中文: 本研究证明,多模态大语言模型如GPT-4o能通过分析鸟瞰视频有效管理无信号灯交叉口,实现冲突检测和驾驶建议,在解释说明和可操作建议方面均展现出高准确率。
English: This study demonstrates that Multimodal Large Language Models like GPT-4o can effectively manage unsignalized intersections by analyzing bird's-eye-view videos to detect conflicts and provide driving recommendations, achieving high accuracy in explanations and actionable insights.

Authors:Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio
Title: LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks
Abstract:
State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at https://github.com/Joana-Cabral/LISArD.
Chinese: 本文提出LISArD防御机制,通过在分类学习中将近似的扰动与干净图像嵌入的互相关矩阵对角化,无需对抗训练即可有效防御灰盒和白盒攻击,并保持计算效率。
English: This paper introduces LISArD, a defense mechanism that effectively protects models against gray-box and white-box attacks without adversarial training by approximating a cross-correlation matrix of embeddings from perturbed and clean images to a diagonal matrix during classification learning.
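A minimal sketch of the core LISArD objective as described in the abstract: the cross-correlation matrix between clean and perturbed embeddings is pushed toward a diagonal (identity) matrix while the classifier is trained as usual. Loss weighting and normalization details are assumptions:

```python
import torch
import torch.nn.functional as F

def lisard_like_loss(feat_clean, feat_pert, logits_clean, targets, lam=1.0):
    """Classification loss plus a similarity term that diagonalizes the
    cross-correlation between clean and perturbed embeddings, so both views
    agree without adversarial training."""
    n = feat_clean.size(0)
    z1 = (feat_clean - feat_clean.mean(0)) / (feat_clean.std(0) + 1e-6)
    z2 = (feat_pert - feat_pert.mean(0)) / (feat_pert.std(0) + 1e-6)
    corr = z1.t() @ z2 / n                             # (D, D) cross-correlation
    eye = torch.eye(corr.size(0), device=corr.device)
    sim_loss = ((corr - eye) ** 2).mean()              # approximate a diagonal
    cls_loss = F.cross_entropy(logits_clean, targets)
    return cls_loss + lam * sim_loss

if __name__ == "__main__":
    f_clean, f_pert = torch.randn(32, 128), torch.randn(32, 128)
    logits = torch.randn(32, 10)
    y = torch.randint(0, 10, (32,))
    print(lisard_like_loss(f_clean, f_pert, logits, y).item())
```

The extra term adds only one matrix product per batch, which is consistent with the abstract's claim of no increase in computational or temporal cost relative to standard training.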

Authors:Li-Wei Chen, Ombretta Strafforello, Anne-Sofie Maerten, Tinne Tuytelaars, Johan Wagemans
Title: On the Role of Individual Differences in Current Approaches to Computational Image Aesthetics
Abstract:
Image aesthetic assessment (IAA) evaluates image aesthetics, a task complicated by image diversity and user subjectivity. Current approaches address this in two stages: Generic IAA (GIAA) models estimate mean aesthetic scores, while Personal IAA (PIAA) models adapt GIAA using transfer learning to incorporate user subjectivity. However, a theoretical understanding of transfer learning between GIAA and PIAA, particularly concerning the impact of group composition, group size, aesthetic differences between groups and individuals, and demographic correlations, is lacking. This work establishes a theoretical foundation for IAA, proposing a unified model that encodes individual characteristics in a distributional format for both individual and group assessments. We show that transferring from GIAA to PIAA involves extrapolation, while the reverse involves interpolation, which is generally more effective for machine learning. Extensive experiments with varying group compositions, including sub-sampling by group size and disjoint demographics, reveal substantial performance variation even for GIAA, challenging the assumption that averaging scores eliminates individual subjectivity. Score-distribution analysis using Earth Mover's Distance (EMD) and the Gini index identifies education, photography experience, and art experience as key factors in aesthetic differences, with greater subjectivity in artworks than in photographs. Code is available at https://github.com/lwchen6309/aesthetics_transfer_learning.
中文摘要:本研究通过提出统一的分布模型为图像美学评估建立理论基础,揭示了从通用模型到个性化模型的迁移学习涉及外推过程,而反向过程涉及更有效的内插方法,并通过实验确定了影响美学主观性的关键人口统计因素。
English Summary: This study establishes a theoretical foundation for image aesthetic assessment by proposing a unified distributional model that reveals transfer learning between generic and personalized approaches involves extrapolation and interpolation respectively, with experiments identifying key demographic factors influencing aesthetic subjectivity.
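To make the score-distribution analysis concrete, the snippet below computes the two statistics named in the abstract, Earth Mover's Distance and the Gini index, on hypothetical 1-10 aesthetic ratings from two groups; the numbers are made up and only illustrate how such an analysis would be run:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def gini(x):
    """Gini index of a non-negative score distribution (standard formula)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# Hypothetical ratings from two demographic groups (illustration only).
group_a = np.array([7, 8, 6, 9, 7, 8, 5, 7], dtype=float)
group_b = np.array([4, 6, 5, 5, 7, 3, 6, 5], dtype=float)

print("EMD between score distributions:", wasserstein_distance(group_a, group_b))
print("Gini (group A):", gini(group_a))
print("Gini (group B):", gini(group_b))
```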

Authors:Julius Broomfield, Kartik Sharma, Srijan Kumar
Title: A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
Abstract:
Large language models (LLMs) have recently demonstrated remarkable advancements in embodying diverse personas, enhancing their effectiveness as conversational agents and virtual assistants. Consequently, LLMs have made significant strides in processing and integrating multimodal information. However, even though human personas can be expressed in both text and image, the extent to which the modality of a persona impacts the embodiment by the LLM remains largely unexplored. In this paper, we investigate how different modalities influence the expressiveness of personas in multimodal LLMs. To this end, we create a novel modality-parallel dataset of 40 diverse personas varying in age, gender, occupation, and location. This consists of four modalities to equivalently represent a persona: image-only, text-only, a combination of image and small text, and typographical images, where text is visually stylized to convey persona-related attributes. We then create a systematic evaluation framework with 60 questions and corresponding metrics to assess how well LLMs embody each persona across its attributes and scenarios. Comprehensive experiments on 5 multimodal LLMs show that personas represented by detailed text show more linguistic habits, while typographical images often show more consistency with the persona. Our results reveal that LLMs often overlook persona-specific details conveyed through images, highlighting underlying limitations and paving the way for future research to bridge this gap. We release the data and code at https://github.com/claws-lab/persona-modality.
中文摘要:本研究探讨了不同模态如何影响多模态大语言模型的人格体现,发现文本描述的人格能增强语言习惯,而排版图像则提高一致性,但模型常忽略图像传达的人格细节。
English Summary: This study explores how different modalities affect persona embodiment in multimodal large language models, revealing that text-based personas enhance linguistic habits while typographical images improve consistency, yet models often miss image-conveyed persona details.

Authors:Chaoyu Li, Sid Padmanabhuni, Maryam Cheema, Hasti Seifi, Pooyan Fazli
Title: VideoA11y: Method and Dataset for Accessible Video Description
Abstract:
Video descriptions are crucial for blind and low vision (BLV) users to access visual content. However, current artificial intelligence models for generating descriptions often fall short due to limitations in the quality of human annotations within training datasets, resulting in descriptions that do not fully meet BLV users' needs. To address this gap, we introduce VideoA11y, an approach that leverages multimodal large language models (MLLMs) and video accessibility guidelines to generate descriptions tailored for BLV individuals. Using this method, we have curated VideoA11y-40K, the largest and most comprehensive dataset of 40,000 videos described for BLV users. Rigorous experiments across 15 video categories, involving 347 sighted participants, 40 BLV participants, and seven professional describers, showed that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction. We evaluated models on VideoA11y-40K using both standard and custom metrics, demonstrating that MLLMs fine-tuned on this dataset produce high-quality accessible descriptions. Code and dataset are available at https://people-robots.github.io/VideoA11y.

Authors:Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu
Title: Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
Abstract:
Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches primarily rely on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to effectively extend this to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors -- highlighting that vision-based dexterous manipulation via sim-to-real RL is not only viable, but also scalable and broadly applicable to real-world humanoid manipulation tasks.
中文: 本文提出了一种实用的仿真到现实强化学习方法,使人形机器人能够执行复杂的基于视觉的双臂操作任务,在面对未见物体时表现出高成功率和强大的适应能力。
English: This paper presents a practical sim-to-real reinforcement learning method that enables humanoid robots to perform complex vision-based bimanual manipulation tasks with high success rates and robust adaptability to unseen objects.

Authors:Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui
Title: InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
Abstract:
Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
中文摘要:InterMimic框架通过“先完善后扩展”的课程策略,结合师生策略蒸馏和强化学习微调,能够从有缺陷的运动捕捉数据中稳健学习多样化人-物交互,生成逼真交互行为,并实现从单纯模仿到复杂交互生成建模的零样本泛化能力。
English Summary: InterMimic is a framework that enables robust learning from imperfect motion capture data for diverse human-object interactions through a curriculum strategy combining teacher-student distillation and RL fine-tuning, producing realistic interactions and generalizing beyond mere imitation to generative modeling.

Authors:Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Title: Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Abstract:
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
中文: xAR框架通过将预测单元扩展为灵活的实体并采用噪声上下文学习来缓解曝光偏差,在图像生成任务中实现了最先进的性能且推理速度显著提升。
English: The xAR framework generalizes autoregressive modeling by allowing flexible prediction units called entities and mitigates exposure bias through noisy context learning, achieving state-of-the-art image generation performance with significantly faster inference.
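A toy sketch of one xAR-style training step, combining continuous entity regression via flow matching with Noisy Context Learning (conditioning on a corrupted previous entity instead of the teacher-forced ground truth). The MLP, entity shapes, and noise schedule are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TinyXARStep(nn.Module):
    """Predicts a flow-matching velocity for the next entity, conditioned on a
    noisy version of the previous entity."""

    def __init__(self, d=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * d + 1, 256), nn.GELU(),
                                 nn.Linear(256, d))

    def forward(self, context, x_t, t):
        return self.net(torch.cat([context, x_t, t[:, None]], dim=1))

def training_step(model, prev_entity, next_entity):
    b = prev_entity.size(0)
    # Noisy Context Learning: corrupt the conditioning entity.
    s = torch.rand(b, 1)
    noisy_ctx = (1 - s) * torch.randn_like(prev_entity) + s * prev_entity

    # Flow matching on the entity being predicted.
    t = torch.rand(b)
    x0 = torch.randn_like(next_entity)
    x_t = (1 - t)[:, None] * x0 + t[:, None] * next_entity
    v_target = next_entity - x0
    v_pred = model(noisy_ctx, x_t, t)
    return ((v_pred - v_target) ** 2).mean()

if __name__ == "__main__":
    model = TinyXARStep()
    prev_e, next_e = torch.randn(8, 64), torch.randn(8, 64)
    print(training_step(model, prev_e, next_e).item())
```

Swapping what an "entity" is (patch, cell, subsample, scale, or whole image) only changes the shape of `prev_entity`/`next_entity`, which is the flexibility the framework advertises.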

Authors:Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Jun Zhou, Lin Gu
Title: InsTaG: Learning Personalized 3D Talking Head from Few-Second Video
Abstract:
Despite exhibiting impressive performance in synthesizing lifelike personalized 3D talking heads, prevailing methods based on radiance fields suffer from high demands for training data and time for each new identity. This paper introduces InsTaG, a 3D talking head synthesis framework that allows a fast learning of realistic personalized 3D talking head from few training data. Built upon a lightweight 3DGS person-specific synthesizer with universal motion priors, InsTaG achieves high-quality and fast adaptation while preserving high-level personalization and efficiency. As preparation, we first propose an Identity-Free Pre-training strategy that enables the pre-training of the person-specific model and encourages the collection of universal motion priors from long-video data corpus. To fully exploit the universal motion priors to learn an unseen new identity, we then present a Motion-Aligned Adaptation strategy to adaptively align the target head to the pre-trained field, and constrain a robust dynamic head structure under few training data. Experiments demonstrate our outstanding performance and efficiency under various data scenarios to render high-quality personalized talking heads.

Authors:Shuming Liu, Chen Zhao, Fatimah Zohra, Mattia Soldan, Alejandro Pardo, Mengmeng Xu, Lama Alssum, Merey Ramazanova, Juan León Alcázar, Anthony Cioppa, Silvio Giancola, Carlos Hinojosa, Bernard Ghanem
Title: OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
Abstract:
Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementation settings, evaluation protocols, etc., making it difficult to assess the real effectiveness of a specific technique. To address this issue, we propose \textbf{OpenTAD}, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular codebase. In OpenTAD, minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two. OpenTAD also facilitates straightforward benchmarking across various datasets and enables fair and in-depth comparisons among different methods. With OpenTAD, we comprehensively study how innovations in different network components affect detection performance and identify the most effective design choices through extensive experiments. This study has led to a new state-of-the-art TAD method built upon existing techniques for each component. We have made our code and models available at https://github.com/sming256/OpenTAD.
中文摘要:OpenTAD是一个统一框架,通过整合多种时序动作检测方法和数据集实现标准化,借助模块化设计促进公平比较并达到了最先进的检测性能。
English Summary: OpenTAD is a unified framework that standardizes temporal action detection by integrating multiple methods and datasets, enabling fair comparisons and achieving state-of-the-art performance through modular design.

Authors:Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi
Title: UniTok: A Unified Tokenizer for Visual Generation and Understanding
Abstract:
Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from limited representational capacity of discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on ImageNet 256$\times$256 benchmark. GitHub: https://github.com/FoundationVision/UniTok.
Chinese: UniTok通过创新的多码本量化机制构建统一分词器,有效扩展词汇量和瓶颈维度,在图像生成与理解任务中均创下新记录,且能无缝集成到多模态大模型中实现原生视觉生成能力。
English: UniTok introduces a unified tokenizer with a multi-codebook quantization mechanism that overcomes representational limitations in discrete token spaces, achieving state-of-the-art performance in both image generation and understanding tasks without inherent conflicts between objectives.
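A minimal sketch of multi-codebook quantization, the mechanism the abstract credits with scaling the effective vocabulary: the latent vector is split into chunks and each chunk is quantized against its own small codebook. Sizes and the straight-through estimator follow common VQ practice, not necessarily UniTok's exact configuration:

```python
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Quantize each channel chunk with its own codebook; the joint vocabulary
    grows multiplicatively with the number of codebooks."""

    def __init__(self, dim=256, num_codebooks=4, codebook_size=1024):
        super().__init__()
        assert dim % num_codebooks == 0
        self.chunk = dim // num_codebooks
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.chunk) for _ in range(num_codebooks))

    def forward(self, z):                               # z: (B, dim)
        outs, indices = [], []
        for i, cb in enumerate(self.codebooks):
            z_i = z[:, i * self.chunk:(i + 1) * self.chunk]
            idx = torch.cdist(z_i, cb.weight).argmin(dim=1)
            q_i = cb(idx)
            outs.append(z_i + (q_i - z_i).detach())     # straight-through estimator
            indices.append(idx)
        return torch.cat(outs, dim=1), torch.stack(indices, dim=1)

if __name__ == "__main__":
    vq = MultiCodebookQuantizer()
    z = torch.randn(8, 256)
    q, idx = vq(z)
    print(q.shape, idx.shape)                           # (8, 256) (8, 4)
```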

Authors:Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, Bin Xiao
Title: Mobius: Text to Seamless Looping Video Generation via Latent Shift
Abstract:
We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations, thereby creating new visual material for multimedia presentations. Our method repurposes a pre-trained video latent diffusion model for generating looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the videos. Given that temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end in each step. As a result, the denoising context varies in each step while maintaining consistency throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shifting approach to generate seamless looping videos beyond the scope of the video diffusion model's context. Unlike previous cinemagraphs, the proposed method does not require an image as an appearance reference, which would restrict the motion of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method, demonstrating its efficacy in different scenarios. All the code will be made available.
Chinese: Mobius提出了一种无需训练即可从文本描述直接生成无缝循环视频的新方法,通过预训练的视频潜在扩散模型和潜在循环结构,确保时间一致性和动态视觉效果。
English: Mobius introduces a novel method for generating seamlessly looping videos from text descriptions without training, utilizing a pre-trained video latent diffusion model and a latent cycle approach to ensure temporal consistency and dynamic motion.
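A schematic sketch of the latent-shift loop described in the abstract: frame latents form a cycle and are rolled by one frame before each denoising step, so every step sees a different temporal context while the sequence stays consistent around the loop. The `denoise_step` callable stands in for one call to a pretrained video diffusion model and is an assumption about its interface:

```python
import torch

def latent_shift_denoise(latents, denoise_step, num_steps):
    """latents: (batch, frames, channels, height, width) cyclic latent sequence."""
    for step in range(num_steps):
        latents = torch.roll(latents, shifts=-1, dims=1)  # move frame 0 to the end
        latents = denoise_step(latents, step)             # denoise the shifted window
    return latents

if __name__ == "__main__":
    lat = torch.randn(1, 16, 4, 8, 8)                     # toy latent cycle
    fake_step = lambda x, t: x * 0.98                      # placeholder denoiser
    print(latent_shift_denoise(lat, fake_step, num_steps=10).shape)
```

Because the cycle can be longer than the diffusion model's native context window, the same shifting scheme extends looping generation beyond that window, as the abstract notes.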

Authors:Qingsen Yan, Yixu Feng, Cheng Zhang, Guansong Pang, Kangbiao Shi, Peng Wu, Wei Dong, Jinqiu Sun, Yanning Zhang
Title: HVI: A New Color Space for Low-light Image Enhancement
Abstract:
Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods are based on standard RGB (sRGB) space, which often produce color bias and brightness artifacts due to inherent high color sensitivity in sRGB. While converting the images using Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms the state-of-the-art methods on 10 datasets. The code is available at https://github.com/Fediory/HVI-CIDNet.
中文摘要:提出的HVI色彩空间与CIDNet相结合,能有效消除低光图像增强中的红黑伪影,在多个数据集上实现卓越性能。
English Summary: The proposed HVI color space combined with CIDNet effectively eliminates red and black artifacts in low-light image enhancement, achieving superior performance across multiple datasets.
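A rough, illustrative sketch of an HVI-style transform: hue and saturation are mapped to polarized Cartesian coordinates and dark regions are compressed by an adjustable exponent. The exact polarization and learnable intensity-collapse functions in HVI-CIDNet differ; this only illustrates the idea of separating a polarized HS plane from intensity:

```python
import torch

def rgb_to_hvi_like(rgb, k=1.0):
    """rgb: (B, 3, H, W) in [0, 1].  Returns a 3-channel (hx, hy, intensity) map."""
    r, g, b = rgb.unbind(dim=1)
    v, _ = rgb.max(dim=1)                       # value / intensity
    mn, _ = rgb.min(dim=1)
    s = (v - mn) / (v + 1e-6)                   # saturation

    # Standard HSV hue in [0, 1).
    delta = v - mn + 1e-6
    h = torch.where(v == r, ((g - b) / delta) % 6,
        torch.where(v == g, (b - r) / delta + 2, (r - g) / delta + 4)) / 6.0

    radius = s * v.clamp(min=1e-6).pow(k)       # compress low-light regions
    hx = radius * torch.cos(2 * torch.pi * h)   # polarized "horizontal" axis
    hy = radius * torch.sin(2 * torch.pi * h)   # polarized "vertical" axis
    return torch.stack([hx, hy, v], dim=1)

if __name__ == "__main__":
    img = torch.rand(2, 3, 32, 32)
    print(rgb_to_hvi_like(img, k=0.8).shape)    # (2, 3, 32, 32)
```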

Authors:Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen
Title: Vector-Quantized Vision Foundation Models for Object-Centric Learning
Abstract:
Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed \textit{slots}. Its self-supervision, which reconstructs the input from slots, struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply sharing quantized VFM representations across OCL aggregation and decoding. Experiments show that across different VFMs, aggregators and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision. Our source code and model checkpoints are available at https://github.com/Genera1Z/VQ-VFM-OCL.
中文: 本文提出VQ-VFM-OCL(VVO)统一架构,通过在对象中心学习的聚合和解码中共享量化视觉基础模型表示,在物体发现、识别及下游任务中持续超越基线,并对其有效性进行了数学分析。
English: This paper introduces VQ-VFM-OCL (VVO), a unified architecture that enhances Object-Centric Learning by sharing quantized Vision Foundation Model representations across aggregation and decoding, consistently outperforming baselines in object discovery, recognition, and downstream tasks while providing mathematical analysis of its effectiveness.
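A minimal sketch of the shared-quantization idea: one codebook quantizes the VFM feature map, and the same quantized features serve as both the aggregation input and the reconstruction target. The aggregator and decoder below are placeholders, not the paper's modules:

```python
import torch
import torch.nn as nn

class SharedVQ(nn.Module):
    """Single codebook shared between OCL aggregation and decoding."""

    def __init__(self, dim=768, codebook_size=4096):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, vfm_feats):                       # (B, N, dim) patch tokens
        flat = vfm_feats.flatten(0, 1)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).view_as(vfm_feats)
        return vfm_feats + (q - vfm_feats).detach()     # straight-through estimator

def vvo_like_step(vq, aggregator, decoder, vfm_feats):
    q = vq(vfm_feats)                  # shared quantized representation
    slots = aggregator(q)              # aggregation reads the quantized features
    recon = decoder(slots)             # decoder reconstructs the same target
    return ((recon - q.detach()) ** 2).mean()

if __name__ == "__main__":
    vq = SharedVQ()
    agg = nn.Linear(768, 768)                           # placeholder aggregator
    dec = nn.Linear(768, 768)                           # placeholder decoder
    feats = torch.randn(2, 196, 768)                    # e.g. DINO patch features
    print(vvo_like_step(vq, agg, dec, feats).item())
```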

Authors:Yang Zhou, Xu Gao, Zichong Chen, Hui Huang
Title: Attention Distillation: A Unified Approach to Visual Characteristics Transfer
Abstract:
Recent advances in generative diffusion models have shown a notable inherent understanding of image style and semantics. In this paper, we leverage the self-attention features from pretrained diffusion networks to transfer the visual characteristics from a reference to generated images. Unlike previous work that uses these features as plug-and-play attributes, we propose a novel attention distillation loss calculated between the ideal and current stylization results, based on which we optimize the synthesized image via backpropagation in latent space. Next, we propose an improved Classifier Guidance that integrates attention distillation loss into the denoising sampling process, further accelerating the synthesis and enabling a broad range of image generation applications. Extensive experiments have demonstrated the extraordinary performance of our approach in transferring the examples' style, appearance, and texture to new images in synthesis. Code is available at https://github.com/xugao97/AttentionDistillation.
Chinese: 本文提出了一种新颖的注意力蒸馏损失和改进的分类器引导方法,利用预训练扩散模型的自注意力特征,将参考图像的视觉特征有效迁移至生成图像,在加速合成过程的同时实现了卓越的风格与纹理迁移效果。
English: This paper introduces a novel attention distillation loss and enhanced Classifier Guidance method that leverages self-attention features from pretrained diffusion models to effectively transfer visual characteristics from reference images to generated ones, achieving superior style and texture synthesis with accelerated performance.
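A bare-bones sketch of the optimization loop implied by the abstract: the synthesized latent is updated by backpropagation so that its self-attention features match the reference ("ideal") features. The `feat_extractor` stands in for hooking self-attention outputs of a frozen pretrained diffusion U-Net; the specific layers and the construction of the ideal target are not reproduced here:

```python
import torch
import torch.nn.functional as F

def attention_distillation_step(latent, feat_extractor, ref_feats, optimizer):
    """One step of latent-space optimization under an attention-distillation-style loss."""
    optimizer.zero_grad()
    cur_feats = feat_extractor(latent)                   # features of the current result
    loss = F.mse_loss(cur_feats, ref_feats)              # match the reference features
    loss.backward()                                      # gradients flow into the latent
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    latent = torch.randn(1, 4, 64, 64, requires_grad=True)
    extractor = torch.nn.Sequential(torch.nn.Flatten(),
                                    torch.nn.Linear(4 * 64 * 64, 256))
    for p in extractor.parameters():
        p.requires_grad_(False)                          # frozen feature extractor
    ref = torch.randn(1, 256)                            # stand-in reference features
    opt = torch.optim.Adam([latent], lr=0.05)
    for _ in range(3):
        print(attention_distillation_step(latent, extractor, ref, opt))
```

Folding the same loss into the denoising sampler (the paper's improved Classifier Guidance) replaces this outer optimization loop with a per-step correction, which is what accelerates synthesis.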

Authors:Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, Shunsuke Saito
Title: Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars
Abstract:
Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. More specifically, we make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. For better 3D head reconstructions, we employ position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. To animate the 3D head, our key discovery is that simple cross-attention to an expression code is already sufficient. Finally, we increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs, e.g., an imperfect phone capture with accidental movement, or frames from a monocular video. We compare Avat3r with current state-of-the-art methods for few-input and single-input scenarios, and find that our method has a competitive advantage in both tasks. Finally, we demonstrate the wide applicability of our proposed model, creating 3D head avatars from images of different sources, smartphone captures, single images, and even out-of-domain inputs like antique busts. Project website: https://tobias-kirschstein.github.io/avat3r/

Authors:Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
Title: Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Abstract:
The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.
Chinese: Dream Engine框架通过将多模态编码器与扩散模型结合,采用两阶段训练方法,有效解决了图像生成中文本与图像交替控制的统一性问题,并在性能上达到先进水平。
English: The Dream Engine framework addresses the lack of unified text-image interleaved control in image generation by integrating multimodal encoders like QwenVL with diffusion models, achieving competitive performance through a two-stage training approach.

Authors:Yating Yu, Congqi Cao, Yifan Zhang, Yanning Zhang
Title: Learning to Generalize without Bias for Open-Vocabulary Action Recognition
Abstract:
Leveraging the effective visual-text alignment and static generalizability from CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary action recognition in-context. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalization and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, being free of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios. Code is released at https://github.com/Mia-YatingYu/Open-MeDe.
中文摘要:Open-MeDe是一种新颖的元优化框架,通过跨批次元优化方案和自集成方法,有效减轻CLIP的静态偏差,显著提升了视频学习器在上下文内外开放词汇动作识别中的泛化能力。
English Summary: Open-MeDe is a meta-optimization framework that addresses CLIP's static bias in video learners by employing cross-batch meta-optimization and self-ensemble techniques to enhance generalization for both in-context and out-of-context open-vocabulary action recognition.

Authors:Yifan Jia, Xingda Yu, Zhengyang Ji, Songning Lai, Yutao Yue
Title: Adaptive H&E-IHC information fusion staining framework based on feature extra
Abstract:
Immunohistochemistry (IHC) staining plays a significant role in the evaluation of diseases such as breast cancer. The H&E-to-IHC transformation based on generative models provides a simple and cost-effective method for obtaining IHC images. Although previous models can perform digital staining well, they still suffer from (i) relying only on pixel features that are not prominent in H&E, which easily causes information loss during the staining process; and (ii) the lack of pixel-perfect H&E-IHC ground-truth pairs, which poses a challenge to the classical L1 loss. To address these challenges, we propose an adaptive information-enhanced staining framework based on feature extractors. We first propose the VMFE module to effectively extract color information features using multi-scale feature extraction and wavelet transform convolution, combined with a shared decoder for feature fusion. A high-performance dual feature extractor for H&E-IHC is trained by contrastive learning, which can effectively align H&E and IHC features in a high-dimensional latent space. At the same time, the trained feature encoder is used to enhance features and adaptively adjust the loss during H&E section staining to address the problems of unclear and asymmetric information. We have tested on different datasets and achieved excellent performance. Our code is available at https://github.com/babyinsunshine/CEFF.
中文: 本研究提出了一种自适应信息增强染色框架,通过多尺度特征提取和对比学习优化H&E到IHC的图像转换,有效解决了特征丢失和对齐问题,在多个数据集上取得了优异性能。
English: This study introduces an adaptive information-enhanced coloring framework that utilizes multi-scale feature extraction and contrastive learning to improve H&E-to-IHC image transformation by addressing feature loss and alignment issues, achieving superior performance on various datasets.

Authors:Lin Zhang, Yi Tian, XiYun Wang, Wanru Xu, Yi Jin, Yaping Huang
Title: Differential Contrastive Training for Gaze Estimation
Abstract:
Complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP model has achieved remarkable performance on various vision tasks, but its potential has not been fully exploited in gaze estimation. In this paper, we propose a novel Differential Contrastive Training strategy, which boosts gaze estimation performance with the help of CLIP. Accordingly, a Differential Contrastive Gaze Estimation network (DCGaze) composed of a Visual Appearance-aware branch and a Semantic Difference-aware branch is introduced. The Visual Appearance-aware branch is essentially a primary gaze estimation network, and it incorporates an Adaptive Feature-refinement Unit (AFU) and a Double-head Gaze Regressor (DGR), which both help the primary network to extract informative and gaze-related appearance features. Moreover, the Semantic Difference-aware branch is designed on the basis of CLIP's text encoder to reveal the semantic difference of gazes. This branch further empowers the Visual Appearance-aware branch with the capability of characterizing gaze-related semantic information. Extensive experimental results on four challenging datasets over within-domain and cross-domain tasks demonstrate the effectiveness of our DCGaze. The code is available at https://github.com/LinZhang-bjtu/DCGaze.
中文摘要:本文提出了一种利用CLIP的差分对比训练策略,通过结合视觉外观感知和语义差异感知的DCGaze网络,显著提升了跨域视线估计任务的性能表现。
English Summary: This paper introduces a Differential Contrastive Training strategy leveraging CLIP to enhance gaze estimation, resulting in the DCGaze network that integrates visual appearance and semantic difference awareness for improved performance across challenging datasets.

Authors:Mengjie Xu, Yitao Zhu, Haotian Jiang, Jiaming Li, Zhenrong Shen, Sheng Wang, Haolin Huang, Xinyu Wang, Qing Yang, Han Zhang, Qian Wang
Title: MITracker: Multi-View Integration for Visual Object Tracking
Abstract:
Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird's eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance. The code and the new dataset will be available at https://mii-laboratory.github.io/MITracker/.
中文: 作者提出了MVTrack数据集和MITracker方法,通过将2D特征转换为3D特征体并利用注意力机制整合多视角信息,在多目标跟踪任务中实现了最先进的性能表现。
English: The authors introduce the MVTrack dataset and MITracker, a novel multi-view object tracking method that transforms 2D features into 3D volumes and uses an attention mechanism to achieve state-of-the-art performance by effectively integrating cross-view information.

Authors:Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, Luc Van Gool
Title: UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Abstract:
Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth
中文: UniDepthV2是一种新颖的单目测距深度估计模型,能够直接从单张图像预测三维点,通过改进的边缘引导损失和几何不变性实现跨领域的卓越泛化能力,无需额外信息。
English: UniDepthV2 is a novel model that enables accurate monocular metric depth estimation across domains by directly predicting 3D points from single images, featuring enhanced edge precision and geometric invariance for superior generalization without requiring additional data.

Authors:Xuzheng Yang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen
Title: New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration
Abstract:
Referring Expression Comprehension (REC) is a foundational cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. It serves as an essential testing ground for Multimodal Large Language Models (MLLMs). To advance this field, we introduced a new REC dataset in our previous conference paper, characterized by two key features. First, it is designed with controllable difficulty levels, requiring multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it incorporates negative text and images generated through fine-grained editing and augmentation, explicitly testing a model's ability to reject scenarios where the target object is absent, an often overlooked yet critical challenge in existing datasets. In this extended work, we propose two new methods to tackle the challenges of fine-grained REC by combining the strengths of Specialist Models and MLLMs. The first method adaptively assigns simple cases to faster, lightweight models and reserves complex ones for powerful MLLMs, balancing accuracy and efficiency. The second method lets a specialist generate a set of possible object regions, and the MLLM selects the most plausible one using its reasoning ability. These collaborative strategies lead to significant improvements on our dataset and other challenging benchmarks. Our results show that combining specialized and general-purpose models offers a practical path toward solving complex real-world vision-language tasks. Our dataset and code are available at https://github.com/sleepyshep/FineCops-Ref.
中文: 本研究提出了一个具有可控难度和负样本的指代表达理解数据集,并设计了两种结合专家模型与多模态大语言模型的协作方法,通过优势互补在保证效率的同时显著提升了细粒度推理能力。
English: This work introduces a dataset for Referring Expression Comprehension that features controllable difficulty and negative examples, and proposes two collaborative methods combining Specialist Models and Multimodal Large Language Models to enhance fine-grained reasoning while balancing accuracy and efficiency.

Authors:Gilles Van De Vyver, Aksel Try Lenz, Erik Smistad, Sindre Hellum Olaisen, Bjørnar Grenne, Espen Holte, Håvard Dalen, Lasse Løvstakken
Title: Generative augmentations for improved cardiac ultrasound segmentation using diffusion models
Abstract:
One of the main challenges in current research on segmentation in cardiac ultrasound is the lack of large and varied labeled datasets and the differences in annotation conventions between datasets. This makes it difficult to design robust segmentation models that generalize well to external datasets. This work utilizes diffusion models to create generative augmentations that can significantly improve diversity of the dataset and thus the generalisability of segmentation models without the need for more annotated data. The augmentations are applied in addition to regular augmentations. A visual test survey showed that experts cannot clearly distinguish between real and fully generated images. Using the proposed generative augmentations, segmentation robustness was increased when training on an internal dataset and testing on an external dataset with an improvement of over 20 millimeters in Hausdorff distance. Additionally, the limits of agreement for automatic ejection fraction estimation improved by up to 20% of absolute ejection fraction value on out of distribution cases. These improvements come exclusively from the increased variation of the training data using the generative augmentations, without modifying the underlying machine learning model. The augmentation tool is available as an open source Python library at https://github.com/GillesVanDeVyver/EchoGAINS.
中文: 本研究利用扩散模型生成多样化心脏超声图像,无需额外标注即可提升分割模型的鲁棒性和泛化能力,在外部分割测试中实现了显著性能提升。
English: This study uses diffusion models to generate diverse cardiac ultrasound images, enhancing segmentation model robustness and generalization without additional annotations, achieving significant improvements in external dataset performance.

Authors:Mingjie Wu, Chenggui Yang, Huihua Wang, Chen Xue, Yibo Wang, Haoyu Wang, Yansong Wang, Can Peng, Yuqi Han, Ruoyu Li, Lijun Yun, Zaiqing Chen, Yuelong Xia
Title: WalnutData: A UAV Remote Sensing Dataset of Green Walnuts and Model Evaluation
Abstract:
The UAV technology is gradually maturing and can provide extremely powerful support for smart agriculture and precise monitoring. Currently, there is no dataset related to green walnuts in the field of agricultural computer vision. Thus, in order to promote the algorithm design in the field of agricultural computer vision, we used UAV to collect remote-sensing data from 8 walnut sample plots. Considering that green walnuts are subject to various lighting conditions and occlusion, we constructed a large-scale dataset with a higher-granularity of target features - WalnutData. This dataset contains a total of 30,240 images and 706,208 instances, and there are 4 target categories: being illuminated by frontal light and unoccluded (A1), being backlit and unoccluded (A2), being illuminated by frontal light and occluded (B1), and being backlit and occluded (B2). Subsequently, we evaluated many mainstream algorithms on WalnutData and used these evaluation results as the baseline standard. The dataset and all evaluation results can be obtained at https://github.com/1wuming/WalnutData.
Chinese: 本研究推出了WalnutData数据集,通过无人机采集了30,240张图像、706,208个青核桃实例,涵盖不同光照与遮挡条件,为农业计算机视觉算法提供了评估基准。
English: This study introduces WalnutData, a large-scale UAV-collected dataset of 30,240 images with 706,208 instances of green walnuts under varying light and occlusion conditions, establishing evaluation baselines for agricultural computer vision algorithms.

Authors:Meng Lou, Yizhou Yu
Title: OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
Abstract:
Top-down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer-grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub-networks: 1) a Base-Net that encodes low/mid-level features; 2) a lightweight Overview-Net that generates dynamic top-down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus-Net that performs finer-grained perception guided by top-down attention (i.e., look closely next). To fully unleash the power of top-down attention, we further propose a novel context-mixing dynamic convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection, our OverLoCK-S clearly surpasses MogaNet-B by 1% in AP^b. On semantic segmentation, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
中文摘要:OverLoCK是首个采用分支架构模拟人类自上而下注意力的纯卷积网络,通过三个协同子网络实现"先概览后细察"的机制,在多项视觉任务中显著超越现有方法且计算效率更高。
English Summary: OverLoCK is the first pure ConvNet backbone that mimics human top-down attention through a branched architecture with three sub-networks, achieving superior performance across multiple vision tasks while being more computationally efficient than existing methods.

Authors:Yejun Zhang, Shuzhe Wang, Juho Kannala
Title: A2-GNN: Angle-Annular GNN for Visual Descriptor-free Camera Relocalization
Abstract:
Visual localization involves estimating the 6-degree-of-freedom (6-DoF) camera pose within a known scene. A critical step in this process is identifying pixel-to-point correspondences between 2D query images and 3D models. Most advanced approaches currently rely on extensive visual descriptors to establish these correspondences, facing challenges in storage, privacy issues and model maintenance. Direct 2D-3D keypoint matching without visual descriptors is becoming popular as it can overcome those challenges. However, existing descriptor-free methods suffer from low accuracy or heavy computation. Addressing this gap, this paper introduces the Angle-Annular Graph Neural Network (A2-GNN), a simple approach that efficiently learns robust geometric structural representations with annular feature extraction. Specifically, this approach clusters neighbors and embeds each group's distance information and angle as supplementary information to capture local structures. Evaluation on matching and visual localization datasets demonstrates that our approach achieves state-of-the-art accuracy with low computational overhead among visual description-free methods. Our code will be released on https://github.com/YejunZhang/a2-gnn.
中文: 本文提出的A2-GNN方法通过几何结构表征实现无描述符的视觉定位,在保证高精度的同时显著降低计算开销,性能优于现有同类方法。
English: This paper introduces A2-GNN, a descriptor-free method that uses geometric structural representations for efficient and accurate visual localization, achieving top performance with low computational costs.
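
To make the angle-annular grouping idea more tangible, the following is a toy numpy sketch that bins a keypoint's 2D neighbors by distance ring and angular sector into a small histogram; the bin counts, normalization, and the purely 2D setting are illustrative assumptions, not the descriptor used by A2-GNN.

```python
import numpy as np

def annular_angle_histogram(center, neighbors, num_rings=3, num_sectors=8, max_radius=1.0):
    """Toy angle-annular descriptor: bin each neighbor of a keypoint by its
    distance ring and angular sector relative to the center, producing a
    histogram that encodes local geometric structure without visual descriptors."""
    offsets = neighbors - center
    radii = np.linalg.norm(offsets, axis=1)
    angles = np.arctan2(offsets[:, 1], offsets[:, 0])                 # in [-pi, pi]
    ring = np.clip((radii / max_radius * num_rings).astype(int), 0, num_rings - 1)
    sector = ((angles + np.pi) / (2 * np.pi) * num_sectors).astype(int) % num_sectors
    hist = np.zeros((num_rings, num_sectors))
    for r, s in zip(ring, sector):
        hist[r, s] += 1
    return hist.flatten() / max(len(neighbors), 1)

# Toy usage: describe a keypoint by the layout of its neighbors.
rng = np.random.default_rng(1)
center = np.array([0.5, 0.5])
neighbors = center + rng.normal(scale=0.2, size=(20, 2))
print(annular_angle_histogram(center, neighbors).shape)  # (24,)
```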

Authors:Xuyang Wei, Chunlin Tian, Li Li
Title: AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs
Abstract:
Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal Large Language Model (MLLM), where dataset composition dictates the model's adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts (stemming from modality-specific optimization objectives) and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks. Code: https://github.com/Clin0212/HydraLoRA/blob/main/MLLM-HydraLoRA/README.md.
中文摘要:AsymLoRA框架通过非对称低秩自适应技术,在解决多模态任务冲突目标的同时有效利用跨模态共性,其综合性能与系统效率均优于现有方法。
English Summary: The AsymLoRA framework enhances multimodal large language models by using asymmetric low-rank adaptation to simultaneously manage conflicting objectives and leverage cross-modal commonalities, outperforming existing methods in both performance and efficiency.
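
As a rough PyTorch illustration of the asymmetric design described above, the sketch below adds to a frozen linear layer one shared LoRA matrix A and several task-specific matrices B selected by a task_id argument; the class name, routing mechanism, and hyperparameters are assumptions for illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class AsymLoRALinear(nn.Module):
    """Minimal sketch of an asymmetric LoRA layer: one shared down-projection A
    (consolidating cross-modal commonalities) and several task-specific
    up-projections B_k (preserving separate adaptation pathways)."""

    def __init__(self, in_features, out_features, rank=8, num_tasks=2, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)   # shared
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_features, rank)) for _ in range(num_tasks)]
        )                                                  # task-specific, zero-init
        self.scaling = alpha / rank

    def forward(self, x, task_id=0):
        # y = W x + scaling * B_k (A x)
        delta = (x @ self.lora_A.T) @ self.lora_B[task_id].T
        return self.base(x) + self.scaling * delta

# Toy usage: route the same batch through two task-specific branches.
layer = AsymLoRALinear(64, 64, rank=4, num_tasks=2)
x = torch.randn(3, 64)
print(layer(x, task_id=0).shape, layer(x, task_id=1).shape)
```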

Authors:Fan Yang, Dongsheng Luo, Wenrui Chen, Jiacheng Lin, Junjie Cai, Kailun Yang, Zhiyong Li, Yaonan Wang
Title: Multi-Keypoint Affordance Representation for Functional Dexterous Grasping
Abstract:
Functional dexterous grasping requires precise hand-object interaction, going beyond simple gripping. Existing affordance-based methods primarily predict coarse interaction regions and cannot directly constrain the grasping posture, leading to a disconnection between visual perception and manipulation. To address this issue, we propose a multi-keypoint affordance representation for functional dexterous grasping, which directly encodes task-driven grasp configurations by localizing functional contact points. Our method introduces Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping experience images for weak supervision combined with Large Vision Models for fine affordance feature extraction, achieving generalization while avoiding manual keypoint annotations. Additionally, we present a Keypoint-based Grasp matrix Transformation (KGT) method, ensuring spatial consistency between hand keypoints and object contact points, thus providing a direct link between visual perception and dexterous grasping actions. Experiments on public real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks demonstrate that our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks, bridging the gap between visual affordance learning and dexterous robotic manipulation. The source code and demo videos are publicly available at https://github.com/PopeyePxx/MKA.
中文摘要:本文提出一种多关键点可供性表示方法,通过功能接触点定位直接编码抓取配置,结合人类抓取经验与视觉模型特征提取,有效连接视觉感知与灵巧操作,显著提升了抓取精度与泛化能力。
English Summary: This paper introduces a multi-keypoint affordance representation method that bridges visual perception and dexterous manipulation by directly encoding grasp configurations through functional contact point localization, achieving improved accuracy and generalization in robotic grasping tasks.

Authors:Quanxing Zha, Xin Liu, Shu-Juan Peng, Yiu-ming Cheung, Xing Xu, Nannan Wang
Title: ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning
Abstract:
Can we accurately identify the true correspondences from multimodal datasets containing mismatched data pairs? Existing methods primarily emphasize the similarity matching between the representations of objects across modalities, potentially neglecting the crucial relation consistency within modalities that is particularly important for distinguishing the true and false correspondences. Such an omission often runs the risk of misidentifying negatives as positives, thus leading to unanticipated performance degradation. To address this problem, we propose a general Relation Consistency learning framework, namely ReCon, to accurately discriminate the true correspondences among the multimodal data and thus effectively mitigate the adverse impact caused by mismatches. Specifically, ReCon leverages a novel relation consistency learning to ensure dual alignment of, respectively, the cross-modal relation consistency between different modalities and the intra-modal relation consistency within modalities. Thanks to such dual constraints on relations, ReCon significantly enhances its effectiveness for true correspondence discrimination and therefore reliably filters out the mismatched pairs to mitigate the risk of wrong supervision. Extensive experiments on three widely-used benchmark datasets, including Flickr30K, MS-COCO, and Conceptual Captions, are conducted to demonstrate the effectiveness and superiority of ReCon compared with other SOTAs. The code is available at: https://github.com/qxzha/ReCon.
中文摘要:现有方法因忽视模态内关系一致性而难以区分多模态数据中的真实对应关系,但提出的ReCon框架通过强制跨模态和模态内关系一致性的双重对齐,能准确识别真实匹配并过滤误配对。
English Summary: Existing multimodal matching methods often fail to distinguish true correspondences due to neglecting relation consistency within modalities, but the proposed ReCon framework addresses this by enforcing dual-alignment of cross-modal and intra-modal relation consistency to accurately identify true matches and filter mismatches.
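
A simplified cue inspired by the intra-modal relation consistency idea is sketched below: each image-text pair is scored by how similar its row in the image-image similarity matrix is to its row in the text-text similarity matrix, and low scores flag likely mismatches. ReCon's actual dual-alignment losses and training procedure differ, so the threshold and scoring here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def relation_consistency_scores(img_feats, txt_feats):
    """Toy relation-consistency cue for noisy correspondence filtering: compare
    each pair's row in the image-image similarity matrix with its row in the
    text-text similarity matrix. Well-matched pairs should induce similar
    relational patterns; a large discrepancy suggests a mismatched pair.
    Inputs: [B, d] image and text features for the same batch of pairs."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    rel_img = img @ img.T            # intra-modal relations among images
    rel_txt = txt @ txt.T            # intra-modal relations among captions
    return F.cosine_similarity(rel_img, rel_txt, dim=-1)   # higher -> likely true pair

# Toy usage: flag candidate mismatches below a threshold.
img_feats, txt_feats = torch.randn(8, 256), torch.randn(8, 256)
scores = relation_consistency_scores(img_feats, txt_feats)
print((scores < 0.2).nonzero(as_tuple=True)[0])  # indices of suspect pairs
```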

Authors:Yuhao Li, Mirana Claire Angel, Salman Khan, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan
Title: C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
Abstract:
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, the existing trajectory-based approaches are usually limited to only generating the motion trajectory of the controlled object and ignoring the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, our C-Drag first performs object perception and then reasons the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of various objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects that can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, codes, and models will be available at https://github.com/WesLee88524/C-Drag-Official-Repo.
中文: 提出的C-Drag方法通过结合物体感知和思维链推理来建模物体间的动态交互,改进了可控视频生成技术,其性能优于仅关注单个物体运动轨迹的现有方法。
English: The proposed C-Drag method enhances controllable video generation by incorporating object perception and Chain-of-Thought reasoning to model dynamic interactions between objects, outperforming existing trajectory-based approaches that focus solely on individual object motion.

Authors:Nan An, Long Ma, Guangchao Han, Xin Fan, Risheng Liu
Title: Striving for Faster and Better: A One-Layer Architecture with Auto Re-parameterization for Low-Light Image Enhancement
Abstract:
Deep learning-based low-light image enhancers have made significant progress in recent years, with a trend towards achieving satisfactory visual quality while gradually reducing the number of parameters and improving computational efficiency. In this work, we aim to delve into the limits of image enhancers in terms of both visual quality and computational efficiency, while striving for both better performance and faster processing. To be concrete, by rethinking the task demands, we build an explicit connection, i.e., visual quality and computational efficiency correspond to model learning and structure design, respectively. Around this connection, we enlarge the parameter space by introducing re-parameterization for ample model learning of a pre-defined minimalist network (e.g., just one layer), to avoid falling into a local solution. To strengthen the structural representation, we define a hierarchical search scheme for discovering a task-oriented re-parameterized structure, which also provides powerful support for efficiency. Ultimately, this achieves efficient low-light image enhancement using only a single convolutional layer, while maintaining excellent visual quality. Experimental results show our clear superiority in both quality and efficiency over recently proposed methods. In particular, our running time on various platforms (e.g., CPU, GPU, NPU, DSP) is consistently shorter than that of the existing fastest scheme. The source code will be released at https://github.com/vis-opt-group/AR-LLIE.
中文: 本研究通过结合重参数化和分层搜索方案,提出了一种高效的低光图像增强方法,在多个平台上均实现了优于现有方法的视觉质量和计算速度。
English: This study introduces an efficient low-light image enhancement method that achieves superior visual quality and computational speed by combining re-parameterization with a hierarchical search scheme, outperforming existing approaches across multiple platforms.
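
The single-layer design hinges on structural re-parameterization; as a hedged illustration, the sketch below shows the generic RepVGG-style folding of a parallel 3x3 + 1x1 training-time block into one inference-time 3x3 convolution. The paper's searched, task-oriented structure is more elaborate, so this only demonstrates the folding principle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamBlock(nn.Module):
    """Illustration of structural re-parameterization: train with parallel 3x3
    and 1x1 branches to enlarge the parameter space, then fold both into a single
    3x3 convolution for inference."""

    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)

    @torch.no_grad()
    def fold(self):
        """Return an equivalent single 3x3 convolution."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3, padding=1)
        kernel = self.conv3.weight + F.pad(self.conv1.weight, [1, 1, 1, 1])  # pad 1x1 to 3x3
        fused.weight.copy_(kernel)
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)
        return fused

# The folded layer reproduces the two-branch output exactly (up to float error).
block, x = ReparamBlock(8), torch.randn(1, 8, 16, 16)
print(torch.allclose(block(x), block.fold()(x), atol=1e-5))
```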

Authors:Chunyang Cheng, Tianyang Xu, Zhenhua Feng, Xiaojun Wu, Zhangyong Tang, Hui Li, Zeyang Zhang, Sara Atito, Muhammad Awais, Josef Kittler
Title: One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion
Abstract:
Advanced image fusion methods mostly prioritise high-level missions, where task interaction struggles with semantic gaps, requiring complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owing to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be available at https://github.com/AWCXV/GIFNet.
中文: 提出的GIFNet利用低层视觉任务和像素级监督,实现了无需抽象语义的无监督多模态融合,在多种任务中表现优异,并能支持单模态增强,具有广泛适用性。
English: The proposed GIFNet leverages low-level vision tasks with pixel-level supervision to enable effective unsupervised multimodal fusion, achieving high performance across diverse tasks and supporting single-modality enhancement without relying on abstract semantics.

Authors:Xiaofan Li, Xin Tan, Zhuo Chen, Zhizhong Zhang, Ruixin Zhang, Rizen Guo, Guannan Jiang, Yulong Chen, Yanyun Qu, Lizhuang Ma, Yuan Xie
Title: One-for-More: Continual Diffusion Model for Anomaly Detection
Abstract:
With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe "faithfulness hallucination" and "catastrophic forgetting", which cannot accommodate unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection regularizes model updates by modifying the gradient towards directions that protect the learned knowledge. As a double-edged sword, however, it also incurs large memory costs due to the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the diffusion model's risk of "over-fitting" to normal images, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, ours achieves first place in 17/18 settings on MVTec and VisA. Code is available at https://github.com/FuNz-0/One-for-More.
中文: 本研究针对扩散模型在异常检测中的问题,提出了采用梯度投影的持续扩散模型和异常掩码网络,在持续异常检测基准中取得了领先性能。
English: This study addresses issues in diffusion-based anomaly detection by proposing a continual diffusion model with gradient projection and an anomaly-masked network, achieving top performance in most continual anomaly detection benchmarks.
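
Gradient projection is the core continual-learning ingredient here; the sketch below shows the generic projection step that removes the gradient component lying in a protected subspace, so updates minimally disturb previously learned knowledge. How the basis is built, the iterative SVD that keeps memory small, and the anomaly-masked conditioning are not reproduced.

```python
import torch

@torch.no_grad()
def project_gradient(grad, basis):
    """Gradient projection for stable continual updates: remove from `grad` its
    component lying in the subspace spanned by the columns of `basis` (an
    orthonormal basis of directions important to previously learned knowledge).
      grad:  [d] flattened gradient of one parameter tensor
      basis: [d, k] orthonormal basis (e.g., top singular vectors of stored
             representations)"""
    return grad - basis @ (basis.T @ grad)

# Toy usage: the projected gradient is orthogonal to every protected direction.
d, k = 128, 8
basis, _ = torch.linalg.qr(torch.randn(d, k))   # orthonormal columns
grad = torch.randn(d)
g_proj = project_gradient(grad, basis)
print(torch.allclose(basis.T @ g_proj, torch.zeros(k), atol=1e-5))
```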

Authors:Aayush Dhakal, Srikumar Sastry, Subash Khanal, Adeel Ahmad, Eric Xing, Nathan Jacobs
Title: RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Abstract:
The choice of representation for geographic location significantly impacts the accuracy of models for a broad range of geospatial tasks, including fine-grained species classification, population density estimation, and biome classification. Recent works like SatCLIP and GeoCLIP learn such representations by contrastively aligning geolocation with co-located images. While these methods work exceptionally well, in this paper, we posit that the current training strategies fail to fully capture the important visual features. We provide an information-theoretic perspective on why the resulting embeddings from these methods discard crucial visual information that is important for many downstream tasks. To solve this problem, we propose a novel retrieval-augmented strategy called RANGE. We build our method on the intuition that the visual features of a location can be estimated by combining the visual features from multiple similar-looking locations. We evaluate our method across a wide variety of tasks. Our results show that RANGE outperforms the existing state-of-the-art models with significant margins in most tasks. We show gains of up to 13.1% on classification tasks and 0.145 $R^2$ on regression tasks. All our code and models will be made available at: https://github.com/mvrl/RANGE.
中文: 本文提出RANGE方法,通过整合相似地理位置的多重视觉特征来改进地理空间表示学习,在多项任务中显著超越现有最优模型。
English: The paper introduces RANGE, a retrieval-augmented strategy that enhances geospatial representation learning by combining visual features from similar locations, achieving significant performance gains over existing methods across various tasks.
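
As an assumed minimal version of the retrieval-augmented idea, the numpy sketch below estimates a query location's visual features by similarity-weighted averaging over the k most similar database entries and concatenates the result with the base location embedding; the retrieval key, weighting, and fusion actually used by RANGE may differ.

```python
import numpy as np

def retrieval_augmented_embedding(query_loc_emb, db_loc_embs, db_visual_feats, k=16):
    """Minimal retrieval-augmented geo-embedding: estimate the visual features of
    a query location by similarity-weighted averaging of the visual features stored
    for its k most similar database locations, then concatenate the result with the
    original location embedding.
    Shapes: query (d,), db_loc_embs (N, d), db_visual_feats (N, dv)."""
    sims = db_loc_embs @ query_loc_emb / (
        np.linalg.norm(db_loc_embs, axis=1) * np.linalg.norm(query_loc_emb) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    weights = np.exp(sims[top])
    weights = weights / weights.sum()
    visual_estimate = (weights[:, None] * db_visual_feats[top]).sum(axis=0)
    return np.concatenate([query_loc_emb, visual_estimate])

# Toy usage with random data standing in for a location/image-embedding database.
rng = np.random.default_rng(0)
db_loc = rng.normal(size=(1000, 64))
db_vis = rng.normal(size=(1000, 128))
query = rng.normal(size=64)
print(retrieval_augmented_embedding(query, db_loc, db_vis).shape)  # (192,)
```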

Authors:Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li
Title: Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network
Abstract:
In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance the efficiency and quality compared to diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as the next step on the path using complex networks through self-supervised training, which typically results in a gap from the global optimum. Meanwhile, most diffusion models are in the same path subspace generated by weights $f_A(t)$ and $f_B(t)$, as they follow the paradigm ($x_t = f_A(t)x_{Img} + f_B(t)\epsilon$). To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using a Kolmogorov-Arnold Network (KAN), which has the advantage of resistance to forgetting and continuous output. Experiments show that our LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network, while the KAN used for optimisation is smaller than 0.1 MB. The FID metric is reduced by more than 15%, with a reduction of 48.50% when the NFE of DDIM is 5 on the CelebA dataset. Code is available at https://github.com/PerceptionComputingLab/LDSB.
中文: 本文提出局部扩散薛定谔桥方法,在扩散路径子空间内利用科尔莫戈罗夫-阿诺德网络优化路径,以极小的计算开销显著提升图像生成质量与效率。
English: This paper introduces Local Diffusion Schrödinger Bridges (LDSB), a method that optimizes diffusion paths within a constrained subspace using Kolmogorov-Arnold Networks to significantly enhance image generation quality and efficiency while minimizing computational costs.
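
The path paradigm quoted in the abstract, $x_t = f_A(t)x_{Img} + f_B(t)\epsilon$, is easy to make concrete. The sketch below shows the standard DDPM choice of path weights and a tiny learnable stand-in for them; a small MLP is used purely as a placeholder for the paper's KAN, so the module is an assumption for illustration only.

```python
import torch
import torch.nn as nn

def ddpm_path(x_img, eps, alpha_bar_t):
    """Standard diffusion path weights: x_t = f_A(t)*x_img + f_B(t)*eps with
    f_A(t) = sqrt(alpha_bar_t) and f_B(t) = sqrt(1 - alpha_bar_t)."""
    return alpha_bar_t.sqrt() * x_img + (1.0 - alpha_bar_t).sqrt() * eps

class LearnablePathWeights(nn.Module):
    """Stand-in for a learnable path-weight module mapping a timestep in [0, 1]
    to (f_A, f_B). The paper optimizes these weights with a Kolmogorov-Arnold
    Network; a tiny MLP is used here purely as a placeholder."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 2))

    def forward(self, t):
        f = self.net(t.view(-1, 1))
        return f[:, 0], f[:, 1]

# Toy usage: place a batch of images on a fixed path and on a learnable path.
x_img, eps = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)
x_ddpm = ddpm_path(x_img, eps, torch.rand(4).view(-1, 1, 1, 1))
f_a, f_b = LearnablePathWeights()(torch.rand(4))
x_t = f_a.view(-1, 1, 1, 1) * x_img + f_b.view(-1, 1, 1, 1) * eps
print(x_ddpm.shape, x_t.shape)
```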

Authors:Qijie Xu, Defang Chen, Jiawei Chen, Siwei Lyu, Can Wang
Title: Recent Advances on Generalizable Diffusion-generated Image Detection
Abstract:
The rise of diffusion models has significantly improved the fidelity and diversity of generated images. With numerous benefits, these advancements also introduce new risks. Diffusion models can be exploited to create high-quality Deepfake images, which poses challenges for image authenticity verification. In recent years, research on generalizable diffusion-generated image detection has grown rapidly. However, a comprehensive review of this topic is still lacking. To bridge this gap, we present a systematic survey of recent advances and classify them into two main categories: (1) data-driven detection and (2) feature-driven detection. Existing detection methods are further classified into six fine-grained categories based on their underlying principles. Finally, we identify several open challenges and envision some future directions, with the hope of inspiring more research work on this important topic. Reviewed works in this survey can be found at https://github.com/zju-pi/Awesome-Diffusion-generated-Image-Detection.
中文摘要:本综述系统梳理了扩散生成图像检测的最新进展,将其分为数据驱动和特征驱动两大类方法,旨在应对深度伪造对图像真实性带来的挑战,并指出了该领域未来研究的关键方向。
English Summary: This survey systematically reviews and classifies recent advances in detecting diffusion-generated deepfake images into data-driven and feature-driven methods, addressing the growing risks they pose to image authenticity while highlighting open challenges and future research directions.

Authors:Jianning Chi, Zelan Li, Geng Lin, MingYang Sun, Xiaosheng Yu
Title: Weakly Supervised Segmentation Framework for Thyroid Nodule Based on High-confidence Labels and High-rationality Losses
Abstract:
Weakly supervised segmentation methods can delineate thyroid nodules in ultrasound images efficiently using training data with coarse labels, but suffer from: 1) low-confidence pseudo-labels that follow topological priors, introducing significant label noise, and 2) low-rationality loss functions that rigidly compare segmentation with labels, ignoring discriminative information for nodules with diverse and complex shapes. To solve these issues, we clarify the objective and references for weakly supervised ultrasound image segmentation, presenting a framework with high-confidence pseudo-labels to represent topological and anatomical information and high-rationality losses to capture multi-level discriminative features. Specifically, we fuse geometric transformations of four-point annotations and MedSAM model results prompted by specific annotations to generate high-confidence box, foreground, and background labels. Our high-rationality learning strategy includes: 1) Alignment loss measuring spatial consistency between segmentation and box label, and topological continuity within the foreground label, guiding the network to perceive nodule location; 2) Contrastive loss pulling features from labeled foreground regions while pushing features from labeled foreground and background regions, guiding the network to learn nodule and background feature distribution; 3) Prototype correlation loss measuring consistency between correlation maps derived by comparing features with foreground and background prototypes, refining uncertain regions to accurate nodule edges. Experimental results show that our method achieves state-of-the-art performance on the TN3K and DDTI datasets. The code is available at https://github.com/bluehenglee/MLI-MSC.
中文: 该框架通过融合四点标注的几何变换与MedSAM模型结果生成高置信度伪标签,并采用空间对齐、对比学习和原型相关等多层次损失策略,有效解决了弱监督超声图像分割中的标注噪声和形状复杂性问题,在TN3K和DDTI数据集上实现了最优性能。
English: The proposed framework addresses limitations in weakly supervised thyroid nodule segmentation by generating high-confidence pseudo-labels from fused geometric transformations and MedSAM outputs, while employing a multi-loss strategy to enhance spatial, topological, and feature discrimination for state-of-the-art results on benchmark datasets.

Authors:Chen-Chen Zong, Sheng-Jun Huang
Title: Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach
Abstract:
Active learning (AL), which iteratively queries the most informative examples from a large pool of unlabeled candidates for model training, faces significant challenges in the presence of open-set classes. Existing methods either prioritize query examples likely to belong to known classes, indicating low epistemic uncertainty (EU), or focus on querying those with highly uncertain predictions, reflecting high aleatoric uncertainty (AU). However, they both yield suboptimal performance, as low EU corresponds to limited useful information, and closed-set AU metrics for unknown class examples are less meaningful. In this paper, we propose an Energy-based Active Open-set Annotation (EAOA) framework, which effectively integrates EU and AU to achieve superior performance. EAOA features a $(C+1)$-class detector and a target classifier, incorporating an energy-based EU measure and a margin-based energy loss designed for the detector, alongside an energy-based AU measure for the target classifier. Another crucial component is the target-driven adaptive sampling strategy. It first forms a smaller candidate set with low EU scores to ensure closed-set properties, making AU metrics meaningful. Subsequently, examples with high AU scores are queried to form the final query set, with the candidate set size adjusted adaptively. Extensive experiments show that EAOA achieves state-of-the-art performance while maintaining high query precision and low training overhead. The code is available at https://github.com/chenchenzong/EAOA.
中文: 主动学习在开放集类别中面临挑战,而提出的EAOA框架通过自适应采样策略有效整合认知和偶然不确定性,实现了最优性能。
English: Active learning struggles with open-set classes, but the proposed EAOA framework effectively combines epistemic and aleatoric uncertainty with adaptive sampling to achieve state-of-the-art performance.
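
As a hedged sketch of the two-stage, target-driven sampling idea, the code below uses the standard energy score over logits as a proxy for both uncertainty measures and keeps the candidate-set size fixed; the paper's exact energy-based EU/AU definitions, margin-based energy loss, and adaptive size adjustment are not reproduced.

```python
import torch

def energy_score(logits, temperature=1.0):
    """Energy of a sample given classifier logits: E(x) = -T * logsumexp(z / T).
    Lower energy typically indicates an in-distribution (known-class) sample."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def two_stage_query(detector_logits, classifier_logits, candidate_size, query_size):
    """Sketch of target-driven sampling: (1) keep the unlabeled examples with the
    lowest epistemic-uncertainty proxy (detector energy) so the candidate pool is
    approximately closed-set; (2) within that pool, query the examples with the
    highest aleatoric-uncertainty proxy (classifier energy)."""
    eu = energy_score(detector_logits)                   # stage 1: epistemic proxy
    candidates = torch.topk(-eu, candidate_size).indices
    au = energy_score(classifier_logits[candidates])     # stage 2: aleatoric proxy
    picked = torch.topk(au, query_size).indices
    return candidates[picked]

# Toy usage over a pool of 1000 unlabeled examples with C = 10 known classes.
detector_logits = torch.randn(1000, 11)    # (C+1)-class detector
classifier_logits = torch.randn(1000, 10)  # target classifier
print(two_stage_query(detector_logits, classifier_logits, candidate_size=200, query_size=32))
```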

Authors:Xiongfei Su, Tianyi Zhu, Lina Liu, Zheng Chen, Yulun Zhang, Siyuan Li, Juntian Ye, Feihu Xu, Xin Yuan
Title: Dual-branch Graph Feature Learning for NLOS Imaging
Abstract:
The domain of non-line-of-sight (NLOS) imaging is advancing rapidly, offering the capability to reveal occluded scenes that are not directly visible. However, contemporary NLOS systems face several significant challenges: (1) The computational and storage requirements are profound due to the inherent three-dimensional grid data structure, which restricts practical application. (2) The simultaneous reconstruction of albedo and depth information requires a delicate balance using hyperparameters in the loss function, rendering the concurrent reconstruction of texture and depth information difficult. This paper introduces an innovative methodology, DG-NLOS, which integrates an albedo-focused reconstruction branch dedicated to albedo information recovery and a depth-focused reconstruction branch that extracts geometrical structure, to overcome these obstacles. The dual-branch framework segregates content delivery to the respective reconstructions, thereby enhancing the quality of the retrieved data. To our knowledge, we are the first to employ a GNN as a fundamental component to transform dense NLOS grid data into sparse structural features for efficient reconstruction. Comprehensive experiments demonstrate that our method attains the highest level of performance among existing methods across synthetic and real data. Code is available at https://github.com/Nicholassu/DG-NLOS.
中文: 创新的DG-NLOS方法通过采用双分支框架分别重建反照率和深度信息,利用图神经网络技术高效处理数据,克服了非视距成像中的关键挑战,实现了卓越性能。
English: The innovative DG-NLOS method overcomes key challenges in non-line-of-sight imaging by employing a dual-branch framework that separately reconstructs albedo and depth information, achieving superior performance through efficient data processing with GNN technology.

Authors:Moo Jin Kim, Chelsea Finn, Percy Liang
Title: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Abstract:
Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($π_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.

Authors:Kanglei Zhou, Zikai Hao, Liyuan Wang, Xiaohui Liang
Title: Adaptive Score Alignment Learning for Continual Perceptual Quality Assessment of 360-Degree Videos in Virtual Reality
Abstract:
Virtual Reality Video Quality Assessment (VR-VQA) aims to evaluate the perceptual quality of 360-degree videos, which is crucial for ensuring a distortion-free user experience. Traditional VR-VQA methods trained on static datasets with limited distortion diversity struggle to balance correlation and precision. This becomes particularly critical when generalizing to diverse VR content and continually adapting to dynamic and evolving video distribution variations. To address these challenges, we propose a novel approach for assessing the perceptual quality of VR videos, Adaptive Score Alignment Learning (ASAL). ASAL integrates correlation loss with error loss to enhance alignment with human subjective ratings and precision in predicting perceptual quality. In particular, ASAL can naturally adapt to continually changing distributions through a feature space smoothing process that enhances generalization to unseen content. To further improve continual adaptation to dynamic VR environments, we extend ASAL with adaptive memory replay as a novel Continual Learning (CL) framework. Unlike traditional CL models, ASAL utilizes key frame extraction and feature adaptation to address the unique challenges of non-stationary variations with both the computation and storage restrictions of VR devices. We establish a comprehensive benchmark for VR-VQA and its CL counterpart, introducing new data splits and evaluation metrics. Our experiments demonstrate that ASAL outperforms recent strong baseline models, achieving overall correlation gains of up to 4.78% in the static joint training setting and 12.19% in the dynamic CL setting on various datasets. This validates the effectiveness of ASAL in addressing the inherent challenges of VR-VQA. Our code is available at https://github.com/ZhouKanglei/ASAL_CVQA.
中文: 提出的自适应分数对齐学习(ASAL)方法通过整合相关性与误差损失,有效提升VR视频质量评估与人类感知的一致性及对动态内容的适应性,在静态与持续学习场景中均实现了显著性能提升。
English: The proposed Adaptive Score Alignment Learning (ASAL) method enhances VR video quality assessment by integrating correlation and error losses to better align with human perception and adapt to dynamic content variations, achieving significant performance gains in both static and continual learning settings.
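
The central objective couples an error term with a correlation term; the sketch below shows one common way to combine MSE with a Pearson-correlation loss for quality regression. The balancing weight is arbitrary, and ASAL's feature-space smoothing and adaptive memory replay are not reproduced, so treat this as an assumed simplification.

```python
import torch

def score_alignment_loss(pred, target, weight=0.5, eps=1e-8):
    """Combined quality-assessment objective: an error term (MSE, for precision)
    plus a correlation term (1 - Pearson correlation, for trend agreement with
    subjective scores). `weight` balances the two terms."""
    mse = torch.mean((pred - target) ** 2)
    p = pred - pred.mean()
    t = target - target.mean()
    plcc = (p * t).sum() / (p.norm() * t.norm() + eps)
    return weight * mse + (1.0 - weight) * (1.0 - plcc)

# Toy usage on a batch of predicted vs. subjective quality scores.
pred = torch.tensor([3.1, 4.0, 2.2, 4.8])
mos = torch.tensor([3.0, 4.2, 2.0, 5.0])
print(score_alignment_loss(pred, mos))
```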

Authors:Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan
Title: Sensor-Invariant Tactile Representation
Abstract:
High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This limitation hinders the ability to transfer models or knowledge learned from one sensor to another. To address this, we introduce a novel method for extracting Sensor-Invariant Tactile Representations (SITR), enabling zero-shot transfer across optical tactile sensors. Our approach utilizes a transformer-based architecture trained on a diverse dataset of simulated sensor designs, allowing it to generalize to new sensors in the real world with minimal calibration. Experimental results demonstrate the method's effectiveness across various tactile sensing applications, facilitating data and model transferability for future advancements in the field.

Authors:Hoonhee Cho, Jae-young Kang, Youngho Kim, Kuk-Jin Yoon
Title: Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras
Abstract:
Detecting 3D objects in point clouds plays a crucial role in autonomous driving systems. Recently, advanced multi-modal methods incorporating camera information have achieved notable performance. For a safe and effective autonomous driving system, algorithms that excel not only in accuracy but also in speed and low latency are essential. However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. We leverage their high temporal resolution and low bandwidth to enable high-speed 3D object detection. Our method enables detection even during inter-frame intervals when synchronized data is unavailable, by retrieving previous 3D information through the event camera. Furthermore, we introduce the first event-based 3D object detection dataset, DSEC-3DOD, which includes ground-truth 3D bounding boxes at 100 FPS, establishing the first benchmark for event-based 3D detectors. The code and dataset are available at https://github.com/mickeykang16/Ev3DOD.
中文: 本研究首次将异步事件相机引入自动驾驶的3D物体检测,利用其高时间分辨率实现帧间高速检测,并发布DSEC-3DOD数据集构建了首个事件驱动的3D检测基准。
English: This study introduces the first asynchronous event camera-based 3D object detection method for autonomous driving, enabling high-speed detection during inter-frame intervals and releasing the DSEC-3DOD dataset to establish an event-based 3D detection benchmark.

Authors:Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis
Title: Tell me why: Visual foundation models as self-explainable classifiers
Abstract:
Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEM) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.
Chinese Summary: 本文提出的ProtoFM方法将视觉基础模型与原型架构及专门训练目标相结合,仅需在冻结模型上训练轻量级头部即可实现高效可解释的分类,在保持竞争力的分类性能同时显著超越了现有模型的可解释性指标。
English Summary: This paper introduces ProtoFM, an efficient and interpretable method that combines visual foundation models with a prototypical architecture and specialized training objectives, achieving competitive classification performance while surpassing existing models in interpretability metrics.
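
A minimal sketch of a prototypical, self-explainable head on top of frozen features is given below: logits are a weighted sum of cosine similarities to learned prototypes, so every prediction decomposes into per-concept contributions. The feature dimension, prototype count, and training objectives are illustrative assumptions rather than ProtoFM's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalHead(nn.Module):
    """Self-explainable head on frozen visual-foundation-model features:
    predictions are a weighted sum of similarities to learned prototypes, so each
    logit decomposes into per-concept contributions."""

    def __init__(self, feat_dim=768, num_prototypes=32, num_classes=10):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.class_weights = nn.Linear(num_prototypes, num_classes, bias=False)

    def forward(self, frozen_features):
        # Concept activations: cosine similarity between features and prototypes.
        concepts = F.normalize(frozen_features, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        logits = self.class_weights(concepts)
        return logits, concepts   # `concepts` serve as the explanation

# Toy usage: explain a prediction by its largest per-concept contributions.
head = PrototypicalHead()
feats = torch.randn(1, 768)                      # stand-in for frozen VFM features
logits, concepts = head(feats)
pred = logits.argmax(dim=-1).item()
contrib = concepts[0] * head.class_weights.weight[pred]
print(contrib.topk(3).indices)                   # most influential prototypes
```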

Authors:Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Title: ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Abstract:
Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
中文: ImageChain通过将图像序列建模为多轮对话,增强了多模态大语言模型的顺序推理能力,在下一场景描述等任务中显著提升性能并实现强大的跨领域泛化。
English: ImageChain enhances multimodal large language models by modeling image sequences as multi-turn conversations, significantly improving sequential reasoning and achieving robust performance in tasks like next-scene description.
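
The framing of a visual sequence as a multi-turn conversation can be sketched with a generic chat-message schema, shown below: each past scene contributes an image turn and a description turn, and the final turn asks for the next-scene description. The message format and prompts are assumptions, not the paper's exact template.

```python
def build_next_scene_conversation(frames, descriptions):
    """Frame a visual sequence as a multi-turn dialogue for next-scene description:
    each prior scene contributes one user turn with an image plus one assistant
    turn with its description; the final user turn asks the model to describe the
    upcoming scene."""
    assert len(frames) == len(descriptions) + 1, "last frame has no description yet"
    messages = [{"role": "system",
                 "content": "You describe scenes in a temporally coherent story."}]
    for frame, desc in zip(frames[:-1], descriptions):
        messages.append({"role": "user",
                         "content": [{"type": "image", "image": frame},
                                     {"type": "text", "text": "Scene:"}]})
        messages.append({"role": "assistant", "content": desc})
    messages.append({"role": "user",
                     "content": [{"type": "image", "image": frames[-1]},
                                 {"type": "text",
                                  "text": "Describe this next scene given the story so far."}]})
    return messages

# Toy usage with placeholder frame paths.
conv = build_next_scene_conversation(
    ["frame_0.png", "frame_1.png", "frame_2.png"],
    ["A robot picks up a red cube.", "It moves the cube toward a box."],
)
print(len(conv))  # 6 messages
```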

Authors:Adam Celarek, George Kopanas, George Drettakis, Michael Wimmer, Bernhard Kerbl
Title: Does 3D Gaussian Splatting Need Accurate Volumetric Rendering?
Abstract:
Since its introduction, 3D Gaussian Splatting (3DGS) has become an important reference method for learning 3D representations of a captured scene, allowing real-time novel-view synthesis with high visual quality and fast training times. Neural Radiance Fields (NeRFs), which preceded 3DGS, are based on a principled ray-marching approach for volumetric rendering. In contrast, while sharing a similar image formation model with NeRF, 3DGS uses a hybrid rendering solution that builds on the strengths of volume rendering and primitive rasterization. A crucial benefit of 3DGS is its performance, achieved through a set of approximations, in many cases with respect to volumetric rendering theory. A naturally arising question is whether replacing these approximations with more principled volumetric rendering solutions can improve the quality of 3DGS. In this paper, we present an in-depth analysis of the various approximations and assumptions used by the original 3DGS solution. We demonstrate that, while more accurate volumetric rendering can help for low numbers of primitives, the power of efficient optimization and the large number of Gaussians allows 3DGS to outperform volumetric rendering despite its approximations.
中文: 3D高斯泼溅通过高效优化和大量高斯函数,在实时新视角合成中展现出优越性能,尽管存在近似处理,仍超越了更精确的体积渲染方法。
English: 3D Gaussian Splatting achieves superior performance in real-time novel-view synthesis through efficient optimization and numerous Gaussians, outperforming more accurate volumetric rendering despite its approximations.

Authors:Nadya Abdel Madjid, Murad Mebrahtu, Abdulrahman Ahmad, Abdelmoamen Nasser, Bilal Hassan, Naoufel Werghi, Jorge Dias, Majid Khonji
Title: EMT: A Visual Multi-Task Benchmark Dataset for Autonomous Driving
Abstract:
This paper introduces the Emirates Multi-Task (EMT) dataset, designed to support multi-task benchmarking within a unified framework. It comprises over 30,000 frames from a dash-camera perspective and 570,000 annotated bounding boxes, covering approximately 150 kilometers of driving routes that reflect the distinctive road topology, congestion patterns, and driving behavior of Gulf region traffic. The dataset supports three primary tasks: tracking, trajectory forecasting, and intention prediction. Each benchmark is accompanied by corresponding evaluations: (1) multi-agent tracking experiments addressing multi-class scenarios and occlusion handling; (2) trajectory forecasting evaluation using deep sequential and interaction-aware models; and (3) intention prediction experiments based on observed trajectories. The dataset is publicly available at https://avlab.io/emt-dataset, with pre-processing scripts and evaluation models at https://github.com/AV-Lab/emt-dataset.
中文: EMT数据集包含3万多帧行车记录仪图像和57万个标注框,专为海湾地区交通设计,支持追踪、轨迹预测和意图识别三大任务,并提供公开数据和评估工具。
English: The EMT dataset provides over 30,000 dash-camera frames and 570,000 annotations for multi-task benchmarking in Gulf region traffic, supporting tracking, trajectory forecasting, and intention prediction with public access to data and tools.

Authors:Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Angelica I Aviles-Rivero, Chuanlong Xie, Yao Zhu
Title: A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs
Abstract:
Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance by indiscriminately discarding the entire information of the layer. This paper reveals the "Patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, in the experiment with 35% pruning on the Vicuna-7B model, our method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our codes are available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
中文: 深度剪枝虽能加速推理,但移除整个层会损害性能,因此本文提出滑动层合并方法,基于相似度动态选择并融合连续层,在零样本任务和剪枝后重训练恢复上优于现有技术,如在Vicuna-7B模型上实现1.654%的平均性能提升。
English: Depth-wise pruning accelerates inference but risks performance loss by removing entire layers, so this paper introduces a sliding layer merging method that selectively fuses consecutive layers based on similarity, outperforming existing techniques in zero-shot tasks and retraining recovery, as shown in experiments including a 1.654% improvement on Vicuna-7B.
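
As an assumed, simplified version of the method, the sketch below scores consecutive layers by the cosine similarity of their outputs on a calibration batch (standing in for the paper's reproducing-kernel correlation analysis) and greedily merges consecutive layers above a threshold by parameter averaging (standing in for the paper's merging rule).

```python
import torch

@torch.no_grad()
def layer_output_similarity(hidden_states):
    """Cosine similarity between the (flattened) outputs of consecutive layers.
    hidden_states: list of [batch, seq, dim] tensors, one per layer, collected
    on a small calibration set (e.g., via output_hidden_states=True)."""
    sims = []
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        a, b = h_prev.flatten(1), h_next.flatten(1)
        sims.append(torch.nn.functional.cosine_similarity(a, b, dim=-1).mean().item())
    return sims

@torch.no_grad()
def average_parameters(group):
    """Fuse a group of identically-shaped layers by averaging their parameters."""
    fused = group[0]
    for name, param in fused.named_parameters():
        stacked = torch.stack([dict(m.named_parameters())[name] for m in group])
        param.copy_(stacked.mean(dim=0))
    return fused

@torch.no_grad()
def slide_and_merge(layers, sims, threshold=0.95):
    """Top-down sliding pass: greedily group consecutive layers whose output
    similarity exceeds `threshold`, then fuse each group into a single layer."""
    merged, group = [], [layers[-1]]
    for idx in range(len(layers) - 2, -1, -1):          # slide from top to bottom
        if sims[idx] >= threshold:
            group.append(layers[idx])
        else:
            merged.append(average_parameters(group))
            group = [layers[idx]]
    merged.append(average_parameters(group))
    return list(reversed(merged))

# Toy usage: merge a stack of small layers given precomputed similarities.
layers = [torch.nn.Linear(8, 8) for _ in range(6)]
sims = [0.97, 0.99, 0.60, 0.98, 0.96]      # similarity between layer i and layer i+1
print(len(slide_and_merge(layers, sims)))  # 2 merged layers remain
```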

Authors:Junlong Ren, Hao Wu, Hui Xiong, Hao Wang
Title: SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation
Abstract:
The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within the realms of robotics and embodied artificial intelligence (AI) applications. However, the scarcity and expensiveness of 3D data constrain the performance of existing cross-modal 3D retrieval methods. These methods heavily rely on features derived from the limited number of 3D shapes, resulting in poor generalization ability across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a component library, captioning each segmented part of every 3D shape within the dataset. Notably, it facilitates the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter and intra distances to align various components into a new 3D shape, ensuring that the components do not overlap and are closely fitted. Further, text templates are utilized to process the captions of each component and generate new text descriptions. Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts based on the enriched dataset. We then calculate fine-grained cross-modal similarity using Earth Mover's Distance (EMD) and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show our SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Codes can be found in https://github.com/3DAgentWorld/SCA3D.
中文: SCA3D提出了一种新颖的在线数据增强方法,通过组件分割和描述生成丰富的3D-文本对,利用对比学习和Earth Mover's Distance对齐显著提升了跨模态3D检索性能。
English: SCA3D introduces a novel online data augmentation method that generates enriched 3D-text pairs using component segmentation and captioning, significantly improving cross-modal 3D retrieval performance through contrastive learning and Earth Mover's Distance alignment.

Authors:Ziyuan Luo, Anderson Rocha, Boxin Shi, Qing Guo, Haoliang Li, Renjie Wan
Title: The NeRF Signature: Codebook-Aided Watermarking for Neural Radiance Fields
Abstract:
Neural Radiance Fields (NeRF) have been gaining attention as a significant form of 3D content representation. With the proliferation of NeRF-based creations, the need for copyright protection has emerged as a critical issue. Although some approaches have been proposed to embed digital watermarks into NeRF, they often neglect essential model-level considerations and incur substantial time overheads, resulting in reduced imperceptibility and robustness, along with user inconvenience. In this paper, we extend the previous criteria for image watermarking to the model level and propose NeRF Signature, a novel watermarking method for NeRF. We employ a Codebook-aided Signature Embedding (CSE) that does not alter the model structure, thereby maintaining imperceptibility and enhancing robustness at the model level. Furthermore, after optimization, any desired signatures can be embedded through the CSE, and no fine-tuning is required when NeRF owners want to use new binary signatures. Then, we introduce a joint pose-patch encryption watermarking strategy to hide signatures into patches rendered from a specific viewpoint for higher robustness. In addition, we explore a Complexity-Aware Key Selection (CAKS) scheme to embed signatures in high visual complexity patches to enhance imperceptibility. The experimental results demonstrate that our method outperforms other baseline methods in terms of imperceptibility and robustness. The source code is available at: https://github.com/luo-ziyuan/NeRF_Signature.
中文: 本文提出NeRF Signature方法,通过码本辅助签名嵌入和联合位姿-补丁加密策略,在不改变模型结构的情况下为神经辐射场嵌入难以察觉且鲁棒的数字水印,有效提升版权保护的隐蔽性和稳健性。
English: This paper introduces NeRF Signature, a novel watermarking method that embeds imperceptible and robust digital signatures into Neural Radiance Fields without altering the model structure, using a codebook-aided approach and joint pose-patch encryption strategy.

Authors:Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Dongdong Lei, Sebastien Ourselin, Hongbin Liu
Title: EndoMamba: An Efficient Foundation Model for Endoscopic Videos via Hierarchical Pre-training
Abstract:
Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference in online video streams. Second, we propose a self-supervised hierarchical pre-training paradigm to enhance EndoMamba's representation learning using endoscopic videos and incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatial-temporal structures and high-level alignment to transfer broader knowledge from a pretrained general-video domain foundation model. Extensive experiments on four downstream tasks (classification, segmentation, surgical phase recognition, and localization) demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code is available at https://github.com/TianCuteQY/EndoMamba.
中文: EndoMamba是一种计算高效的基础模型,通过结合双向和普通Mamba模块进行时空建模,并采用分层预训练方法,在保持实时推理的同时,显著提升了内窥镜视频分析在分类、分割、手术阶段识别和定位等多种任务中的性能表现。
English: EndoMamba is a computationally efficient foundation model that overcomes limitations in endoscopic video analysis by integrating bidirectional and vanilla Mamba blocks for spatiotemporal modeling and employing hierarchical pre-training to achieve superior performance across multiple surgical tasks while maintaining real-time inference.

Authors:Jiebin Yan, Ziwen Tan, Yuming Fang, Jiale Rao, Yifan Zuo
Title: Max360IQ: Blind Omnidirectional Image Quality Assessment with Multi-axis Attention
Abstract:
An omnidirectional image, also called a 360-degree image, captures the entire 360-degree scene, thereby providing a more realistic immersive experience for users than general 2D and stereoscopic images. Meanwhile, this feature brings great challenges to measuring the perceptual quality of omnidirectional images, which is closely related to users' quality of experience, especially when the omnidirectional images suffer from non-uniform distortion. In this paper, we propose a novel and effective blind omnidirectional image quality assessment (BOIQA) model with multi-axis attention (Max360IQ), which can proficiently measure not only the quality of uniformly distorted omnidirectional images but also the quality of non-uniformly distorted omnidirectional images. Specifically, the proposed Max360IQ is mainly composed of a backbone with stacked multi-axis attention modules for capturing both global and local spatial interactions of extracted viewports, a multi-scale feature integration (MSFI) module to fuse multi-scale features, and a quality regression module with deep semantic guidance for predicting the quality of omnidirectional images. Experimental results demonstrate that the proposed Max360IQ outperforms the state-of-the-art Assessor360 by 3.6% in terms of SRCC on the JUFE database with non-uniform distortion, and gains improvements of 0.4% and 0.8% in terms of SRCC on the OIQA and CVIQ databases, respectively. The source code is available at https://github.com/WenJuing/Max360IQ.
Chinese: 本文提出了一种名为Max360IQ的新型盲视全景图像质量评估模型,通过多轴注意力模块和多尺度特征融合技术,能够有效评估均匀与非均匀失真的360度图像质量,并在多个数据库中展现出优于现有方法的性能表现。
English: This paper introduces Max360IQ, a novel blind omnidirectional image quality assessment model that effectively evaluates both uniformly and non-uniformly distorted 360-degree images by utilizing multi-axis attention modules and multi-scale feature integration, demonstrating superior performance over existing methods.

Authors:Chenyang Zhao, Kun Wang, Janet H. Hsiao, Antoni B. Chan
Title: Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Abstract:
Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from the previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with the state-of-the-art methods. Furthermore, a series of analyses is conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of the explanation map to indicate the text-specific saliency region of the input image, we also propose an application of Grad-ECLIP that boosts fine-grained alignment during CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.
中文: 本文提出Grad-ECLIP,这是一种基于梯度的解释方法,通过分析中间特征为CLIP的图文匹配结果提供视觉和文本解释,经评估证明其优越性,并能在微调过程中实现更精细的对齐优化。
English: This paper introduces Grad-ECLIP, a gradient-based method that provides visual and textual explanations for CLIP's image-text matching results by analyzing intermediate features, demonstrating superior effectiveness through evaluations and enabling improved fine-grained alignment during fine-tuning.
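Since the method weights token features by gradients of the image-text similarity, a compact illustration of that idea is given below. This is a generic gradient-times-activation relevance map; the function name grad_token_heatmap and the image_proj argument are hypothetical, and simple mean pooling stands in for CLIP's attention pooling, so this is not the exact Grad-ECLIP formulation.

```python
import torch
import torch.nn.functional as F

def grad_token_heatmap(image_tokens, text_embed, image_proj):
    """Generic gradient-times-activation relevance map over spatial tokens.
    image_tokens: (N, D) patch-token features from the image encoder.
    text_embed:   (D',) text embedding in the joint space.
    image_proj:   callable mapping pooled tokens to the joint space.
    Mean pooling stands in for CLIP's attention pooling; illustration only."""
    image_tokens = image_tokens.detach().requires_grad_(True)
    pooled = image_proj(image_tokens.mean(dim=0))              # (D',)
    sim = F.cosine_similarity(pooled, text_embed, dim=0)       # image-text similarity
    grads = torch.autograd.grad(sim, image_tokens)[0]          # (N, D) channel weights
    relevance = (grads * image_tokens).sum(dim=-1).relu()      # per-token relevance
    return relevance / (relevance.max() + 1e-8)                # normalized heat map
```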

Authors:Gabriele Tiboni, Raffaello Camoriano, Tatiana Tommasi
Title: MaskPlanner: Learning-Based Object-Centric Motion Generation from 3D Point Clouds
Abstract:
Object-Centric Motion Generation (OCMG) plays a key role in a variety of industrial applications, such as robotic spray painting and welding, requiring efficient, scalable, and generalizable algorithms to plan multiple long-horizon trajectories over free-form 3D objects. However, existing solutions rely on specialized heuristics, expensive optimization routines, or restrictive geometry assumptions that limit their adaptability to real-world scenarios. In this work, we introduce a novel, fully data-driven framework that tackles OCMG directly from 3D point clouds, learning to generalize expert path patterns across free-form surfaces. We propose MaskPlanner, a deep learning method that predicts local path segments for a given object while simultaneously inferring "path masks" to group these segments into distinct paths. This design induces the network to capture both local geometric patterns and global task requirements in a single forward pass. Extensive experimentation on a realistic robotic spray painting scenario shows that our approach attains near-complete coverage (above 99%) for unseen objects, while it remains task-agnostic and does not explicitly optimize for paint deposition. Moreover, our real-world validation on a 6-DoF specialized painting robot demonstrates that the generated trajectories are directly executable and yield expert-level painting quality. Our findings crucially highlight the potential of the proposed learning method for OCMG to reduce engineering overhead and seamlessly adapt to several industrial use cases.

Authors:Yao Su, Keqi Han, Mingjie Zeng, Lichao Sun, Liang Zhan, Carl Yang, Lifang He, Xiangnan Kong
Title: End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework
Abstract:
Brain imaging analysis is fundamental in neuroscience, providing valuable insights into brain structure and function. Traditional workflows follow a sequential pipeline (brain extraction, registration, segmentation, parcellation, network generation, and classification), treating each step as an independent task. These methods rely heavily on task-specific training data and expert intervention to correct intermediate errors, making them particularly burdensome for high-dimensional neuroimaging data, where annotations and quality control are costly and time-consuming. We introduce UniBrain, a unified end-to-end framework that integrates all processing steps into a single optimization process, allowing tasks to interact and refine each other. Unlike traditional approaches that require extensive task-specific annotations, UniBrain operates with minimal supervision, leveraging only low-cost labels (i.e., classification and extraction) and a single labeled atlas. By jointly optimizing extraction, registration, segmentation, parcellation, network generation, and classification, UniBrain enhances both accuracy and computational efficiency while significantly reducing annotation effort. Experimental results demonstrate its superiority over existing methods across multiple tasks, offering a more scalable and reliable solution for neuroimaging analysis. Our code and data can be found at https://github.com/Anonymous7852/UniBrain.
中文: UniBrain是一种统一端到端框架,将神经影像处理步骤整合为单一优化流程,在减少标注需求的同时显著提高了分析的准确性和效率。
English: UniBrain is a unified end-to-end framework that integrates all neuroimaging processing steps into a single optimization process, enhancing accuracy and efficiency while minimizing annotation requirements.

Authors:Sefik Serengil, Alper Ozpinar
Title: CipherFace: A Fully Homomorphic Encryption-Driven Framework for Secure Cloud-Based Facial Recognition
Abstract:
Facial recognition systems rely on embeddings to represent facial images and determine identity by verifying if the distance between embeddings is below a pre-tuned threshold. While embeddings are not reversible to original images, they still contain sensitive information, making their security critical. Traditional encryption methods like AES are limited in securely utilizing cloud computational power for distance calculations. Homomorphic Encryption, allowing calculations on encrypted data, offers a robust alternative. This paper introduces CipherFace, a homomorphic encryption-driven framework for secure cloud-based facial recognition, which we have open-sourced at http://github.com/serengil/cipherface. By leveraging FHE, CipherFace ensures the privacy of embeddings while utilizing the cloud for efficient distance computation. Furthermore, we propose a novel encrypted distance computation method for both Euclidean and Cosine distances, addressing key challenges in performing secure similarity calculations on encrypted data. We also conducted experiments with different facial recognition models, various embedding sizes, and cryptosystem configurations, demonstrating the scalability and effectiveness of CipherFace in real-world applications.
Chinese: 本文提出了CipherFace,一个基于同态加密的开源框架,通过保护敏感嵌入特征并高效计算加密的欧几里得与余弦距离,实现安全的云端人脸识别,适用于实际应用场景。
English: This paper introduces CipherFace, an open-source homomorphic encryption framework that enables secure cloud-based facial recognition by protecting sensitive embeddings while efficiently computing encrypted Euclidean and Cosine distances for real-world applications.
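The encrypted distance computation hinges on the fact that both Euclidean and cosine distances reduce to inner products, which homomorphic schemes evaluate natively. The sketch below shows that algebraic reduction on plaintext NumPy vectors only; in CipherFace the inner product would be evaluated over ciphertexts, and the helper names are illustrative rather than taken from the released code.

```python
import numpy as np

def euclidean_sq_via_dot(a, b):
    """||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b: only the inner product needs to be
    evaluated on encrypted data, since each party can supply its own squared norm."""
    return float(a @ a + b @ b - 2.0 * (a @ b))

def cosine_distance_via_dot(a, b):
    """1 - cos(a, b): if embeddings are L2-normalized before encryption, this too
    reduces to a single (homomorphic) inner product."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(1.0 - a @ b)
```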

Authors:Yukun Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren
Title: REFINE: Inversion-Free Backdoor Defense via Model Reprogramming
Abstract:
Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, which is one of the most important defense paradigms, typically focuses on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during the inference process. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model's output domain to guide the input transformations effectively. By further integrating supervised contrastive loss, REFINE enhances the defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our REFINE and its resistance to potential adaptive attacks.
中文摘要:REFINE是一种基于模型重编程的无逆向后门防御方法,通过输入转换和输出重映射模块协同工作,在无需触发模式重构的情况下有效消除后门威胁,同时保持模型性能。
English Summary: REFINE is a novel backdoor defense method that uses model reprogramming with input transformation and output remapping to neutralize embedded triggers without inversion, effectively balancing security and model performance.

Authors:Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie
Title: A Comprehensive Survey on Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration. The curated collection of related works is maintained and continuously updated at https://github.com/haokunwen/Awesome-Composed-Image-Retrieval.
中文: 组合图像检索(CIR)是一项结合参考图像和修改文本进行目标图像搜索的多模态任务,本综述综合了120多篇文献,系统分类模型、评估基准并展望未来方向。
English: Composed Image Retrieval (CIR) is a challenging multimodal task that combines a reference image and modification text to search for target images, and this review synthesizes over 120 publications to systematically categorize models, evaluate benchmarks, and outline future directions.

Authors:Mira Adra, Simone Melcarne, Nelida Mirabet-Herranz, Jean-Luc Dugelay
Title: Event-based Solutions for Human-centered Applications: A Comprehensive Review
Abstract:
Event cameras, often referred to as dynamic vision sensors, are groundbreaking sensors capable of capturing changes in light intensity asynchronously, offering exceptional temporal resolution and energy efficiency. These attributes make them particularly suited for human-centered applications, as they capture both the most intricate details of facial expressions and the complex motion dynamics of the human body. Despite growing interest, research in human-centered applications of event cameras remains scattered, with no comprehensive overview encompassing both body and face tasks. This survey bridges that gap by being the first to unify these domains, presenting an extensive review of advancements, challenges, and opportunities. We also examine less-explored areas, including event compression techniques and simulation frameworks, which are essential for the broader adoption of event cameras. This survey is designed to serve as a foundational reference that helps both new and experienced researchers understand the current state of the field and identify promising directions for future work in human-centered event camera applications. A summary of this survey can be found at https://github.com/nmirabeth/event_human
中文摘要:本综述首次统一了事件相机在人体和面部分析中的人本应用研究,系统梳理了该领域的发展现状、挑战及未来方向。
English Summary: This survey provides the first unified overview of event camera applications in human-centered tasks, addressing advancements, challenges, and future directions for both body and face analysis.

Authors:Ziheng Ouyang, Zhen Li, Qibin Hou
Title: K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
Abstract:
Recent studies have explored combining different LoRAs to jointly generate learned style and content. However, existing methods either fail to effectively preserve both the original subject and style simultaneously or require additional training. In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. Building on this insight, we propose K-LoRA, a simple yet effective training-free LoRA fusion approach. In each attention layer, K-LoRA compares the Top-K elements in each LoRA to be fused, determining which LoRA to select for optimal fusion. This selection mechanism ensures that the most representative features of both subject and style are retained during the fusion process, effectively balancing their contributions. Experimental results demonstrate that the proposed method effectively integrates the subject and style information learned by the original LoRAs, outperforming state-of-the-art training-based approaches in both qualitative and quantitative results.
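As a rough illustration of the Top-K comparison described above, the hedged sketch below scores two LoRA weight updates for one attention layer by the sum of their K largest-magnitude entries and keeps the stronger one. The function name, the scaling factor gamma, and the winner-takes-all rule are simplifying assumptions rather than the paper's exact selection mechanism.

```python
import torch

def klora_select(delta_w_subject, delta_w_style, k=64, gamma=1.0):
    """Score two LoRA weight updates (delta_W = B @ A) for one attention layer by the
    sum of their K largest-magnitude entries and keep the stronger one. The scaling
    factor gamma and the winner-takes-all rule are simplifying assumptions."""
    score_subject = delta_w_subject.abs().flatten().topk(k).values.sum()
    score_style = gamma * delta_w_style.abs().flatten().topk(k).values.sum()
    return delta_w_subject if score_subject >= score_style else delta_w_style
```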

Authors:Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov
Title: GHOST 2.0: generative high-fidelity one shot transfer of heads
Abstract:
While the task of face swapping has recently gained attention in the research community, a related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swap poses extra challenges, such as the need to preserve structural information of the whole head during synthesis and inpaint gaps between the swapped head and the background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce an enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Second, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on the corresponding tasks, allowing us to achieve state-of-the-art results in head swapping. We also tackle complex cases, such as a large difference in hairstyles between source and target. Code is available at https://github.com/ai-forever/ghost-2.0
Chinese: GHOST 2.0通过增强的Aligner模块实现鲁棒头部重演,结合Blender模块无缝融合头部与背景,在头部替换任务中达到领先水平,并能有效处理姿态变化和发型差异等复杂情况。
English: GHOST 2.0 introduces an enhanced Aligner for robust head reenactment and a Blender for seamless integration, achieving state-of-the-art head swapping results while handling challenges like pose variations and hairstyle differences.

Authors:Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen
Title: OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Abstract:
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.
中文: 本文提出了包含20万高质量样本的OmniAlign-V数据集和人工标注的MM-AlignBench基准,通过微调方法在保持多模态大语言模型基础能力的同时,显著提升了其与人类偏好的对齐程度。
English: This paper introduces OmniAlign-V, a 200K-sample dataset, and MM-AlignBench, a human-annotated benchmark, to enhance multi-modal large language models' alignment with human preferences while preserving their foundational capabilities through fine-tuning methods.

Authors:Pengzhi Li, Pengfei Yu, Zide Liu, Wei He, Xuhao Pan, Xudong Rao, Tao Wei, Wei Chen
Title: LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation
Abstract:
In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit limitations in multilingual processing, hindering image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. Subsequently, we incorporate a lightweight adapter and a cross-modal refiner to facilitate efficient feature alignment and interaction between LLMs and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.

Authors:Jun Zeng, Debesh Jha, Ertugrul Aktas, Elif Keles, Alpay Medetalibeyoglu, Matthew Antalek, Robert Lewandowski, Daniela Ladner, Amir A. Borhani, Gorkem Durak, Ulas Bagci
Title: A Reverse Mamba Attention Network for Pathological Liver Segmentation
Abstract:
We present RMA-Mamba, a novel architecture that advances the capabilities of vision state space models through a specialized reverse mamba attention module (RMA). The key innovation lies in RMA-Mamba's ability to capture long-range dependencies while maintaining precise local feature representation through its hierarchical processing pipeline. By integrating Vision Mamba (VMamba)'s efficient sequence modeling with RMA's targeted feature refinement, our architecture achieves superior feature learning across multiple scales. This dual-mechanism approach enables robust handling of complex morphological patterns while maintaining computational efficiency. We demonstrate RMA-Mamba's effectiveness in the challenging domain of pathological liver segmentation (from both CT and MRI), where traditional segmentation approaches often fail due to tissue variations. When evaluated on a newly introduced cirrhotic liver dataset (CirrMRI600+) of T2-weighted MRI scans, RMA-Mamba achieves the state-of-the-art performance with a Dice coefficient of 92.08%, mean IoU of 87.36%, and recall of 92.96%. The architecture's generalizability is further validated on the cancerous liver segmentation from CT scans (LiTS: Liver Tumor Segmentation dataset), yielding a Dice score of 92.9% and mIoU of 88.99%. Our code is publicly available at https://github.com/JunZengz/RMAMamba.
中文: RMA-Mamba通过反向曼巴注意力模块提升视觉状态空间模型,能捕捉长程依赖并优化局部特征,在CT和MRI的病理肝脏分割任务中实现了最先进的性能。
English: RMA-Mamba introduces a reverse mamba attention module to enhance vision state space models by capturing long-range dependencies and refining local features, achieving state-of-the-art performance in pathological liver segmentation from CT and MRI scans.

Authors:Jun Zeng, Debesh Jha, Ertugrul Aktas, Elif Keles, Alpay Medetalibeyoglu, Matthew Antalek, Federica Proietto Salanitri, Amir A. Borhani, Daniela P. Ladner, Gorkem Durak, Ulas Bagci
Title: Liver Cirrhosis Stage Estimation from MRI with Deep Learning
Abstract:
We present an end-to-end deep learning framework for automated liver cirrhosis stage estimation from multi-sequence MRI. Cirrhosis is the severe scarring (fibrosis) of the liver and a common endpoint of various chronic liver diseases. Early diagnosis is vital to prevent complications such as decompensation and cancer, which significantly decreases life expectancy. However, diagnosing cirrhosis in its early stages is challenging, and patients often present with life-threatening complications. Our approach integrates multi-scale feature learning with sequence-specific attention mechanisms to capture subtle tissue variations across cirrhosis progression stages. Using CirrMRI600+, a large-scale publicly available dataset of 628 high-resolution MRI scans from 339 patients, we demonstrate state-of-the-art performance in three-stage cirrhosis classification. Our best model achieves 72.8% accuracy on T1W and 63.8% on T2W sequences, significantly outperforming traditional radiomics-based approaches. Through extensive ablation studies, we show that our architecture effectively learns stage-specific imaging biomarkers. We establish new benchmarks for automated cirrhosis staging and provide insights for developing clinically applicable deep learning systems. The source code will be available at https://github.com/JunZengz/CirrhosisStage.
中文: 我们提出了一种端到端的深度学习框架,通过多尺度特征学习和序列特异性注意力机制,在大规模MRI数据集上实现了肝硬化自动分期的先进性能。
English: We introduce an end-to-end deep learning framework that achieves state-of-the-art performance in automated liver cirrhosis staging from multi-sequence MRI, using multi-scale feature learning and sequence-specific attention mechanisms on a large-scale dataset.

Authors:He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler
Title: UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking
Abstract:
Multi-modal tracking is essential in single-object tracking (SOT), as different sensor types contribute unique capabilities to overcome challenges caused by variations in object appearance. However, existing unified RGB-X trackers (X represents depth, event, or thermal modality) either rely on task-specific training strategies for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification, as well as adaptive modality discrimination across various multi-modal tracking tasks. To achieve modality-adaptive perception in joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels, thereby distinguishing the data distributions of auxiliary modalities. Furthermore, we propose a Task-Customized Optimization Adapter (TCOA) tailored to various modalities in the latent space. This strategy effectively filters noise redundancy and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons conducted on five benchmarks including LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack, covering RGB-T, RGB-E, and RGB-D tracking scenarios, demonstrate that our approach achieves competitive performance while introducing only 1.87M additional training parameters and 1.95G FLOPs. The code will be available at https://github.com/wanghe/UASTrack.
中文: UASTrack提出了一种统一的自适应选择框架,通过判别性自动选择器和任务定制优化适配器实现模态自适应感知,在多个RGB-X跟踪基准上以少量额外参数实现了优越性能。
English: UASTrack introduces a unified adaptive selection framework that achieves modality-adaptive perception through a Discriminative Auto-Selector and Task-Customized Optimization Adapter, delivering competitive performance across multiple RGB-X tracking benchmarks with minimal additional parameters.

Authors:Botao Ye, Sifei Liu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang
Title: Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training
Abstract:
Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. A crucial factor leading to this issue is the limited utilization of contextual information from reference views. Specifically, when there is an overlap in the viewing frustum between two views, it is essential to ensure that the corresponding regions maintain consistency in both geometry and appearance. This observation leads to a simple yet effective approach, where we propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning, as the process requires no learnable parameters. Furthermore, to enhance the overall consistency of generated views, we extend the utilization of epipolar attention to a multi-view setting, allowing retrieval of overlapping information from the input view and other target views. Qualitative and quantitative experimental results demonstrate the effectiveness of our method in significantly improving the consistency of synthesized views without the need for any fine-tuning. Moreover, this enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.
中文摘要:本研究提出一种新方法,利用极线几何从参考视图中检索重叠信息,以增强零样本新视角合成的一致性,无需训练或微调。
English Summary: This study introduces a novel method that leverages epipolar geometry to enhance consistency in zero-shot novel view synthesis by retrieving overlapping information from reference views, eliminating the need for training or fine-tuning.
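The retrieval step relies on standard two-view geometry: a pixel in the reference view constrains its correspondences in a target view to an epipolar line. A minimal sketch of that mapping is below, assuming a known fundamental matrix F between the two views; how the retrieved features are fused into the diffusion model's attention is not shown, and the function name is illustrative.

```python
import numpy as np

def epipolar_line(F, x_ref):
    """Given the fundamental matrix F (3x3) from the reference view to a target view
    and a reference pixel x_ref = (u, v), return line coefficients (a, b, c) with
    a*u' + b*v' + c = 0 in the target image. Points near this line are the candidates
    whose features epipolar attention retrieves."""
    x_h = np.array([x_ref[0], x_ref[1], 1.0])
    l = F @ x_h
    return l / (np.linalg.norm(l[:2]) + 1e-8)  # normalize so |a*u' + b*v' + c| is a pixel distance
```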

Authors:Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
Title: VCT: Training Consistency Models with Variational Noise Coupling
Abstract:
Consistency Training (CT) has recently emerged as a strong alternative to diffusion models for image generation. However, non-distillation CT often suffers from high variance and instability, motivating ongoing research into its training dynamics. We propose Variational Consistency Training (VCT), a flexible and effective framework compatible with various forward kernels, including those in flow matching. Its key innovation is a learned noise-data coupling scheme inspired by Variational Autoencoders, where a data-dependent encoder models noise emission. This enables VCT to adaptively learn noise-to-data pairings, reducing training variance relative to the fixed, unsorted pairings in classical CT. Experiments on multiple image datasets demonstrate significant improvements: our method surpasses baselines, achieves state-of-the-art FID among non-distillation CT approaches on CIFAR-10, and matches SoTA performance on ImageNet 64 x 64 with only two sampling steps. Code is available at https://github.com/sony/vct.
中文: 变分一致性训练(VCT)通过学习噪声与数据的自适应配对机制,有效降低了训练方差,在图像生成任务中仅需两步采样即可达到顶尖性能。
English: Variational Consistency Training (VCT) introduces a learned noise-data coupling scheme to reduce training variance and instability, achieving state-of-the-art results on image generation tasks with minimal sampling steps.

Authors:Anh-Khoa Nguyen Vu, Quoc-Truong Truong, Vinh-Tiep Nguyen, Thanh Duc Ngo, Thanh-Toan Do, Tam V. Nguyen
Title: Multi-Perspective Data Augmentation for Few-shot Object Detection
Abstract:
Recent few-shot object detection (FSOD) methods have focused on augmenting synthetic samples for novel classes, showing promising results with the rise of diffusion models. However, the diversity of such datasets is often limited in representativeness because they lack awareness of typical and hard samples, especially in the context of foreground and background relationships. To tackle this issue, we propose a Multi-Perspective Data Augmentation (MPAD) framework. In terms of foreground-foreground relationships, we propose in-context learning for object synthesis (ICOS) with bounding box adjustments to enhance the detail and spatial information of synthetic samples. Support samples play a vital role in defining class boundaries; therefore, inspired by the large margin principle, we design a Harmonic Prompt Aggregation Scheduler (HPAS) to mix prompt embeddings at each time step of the generation process in diffusion models, producing hard novel samples. For foreground-background relationships, we introduce a Background Proposal method (BAP) to sample typical and hard backgrounds. Extensive experiments on multiple FSOD benchmarks demonstrate the effectiveness of our approach. Our framework significantly outperforms traditional methods, achieving an average increase of 17.5% in nAP50 over the baseline on PASCAL VOC. Code is available at https://github.com/nvakhoa/MPAD.
Chinese: 提出的多视角数据增强(MPAD)框架通过前景调整、困难样本合成和背景优化生成多样且具挑战性的合成样本,显著提升了少样本目标检测性能,在PASCAL VOC数据集上相比基线方法nAP50指标平均提升17.5%。
English: The proposed Multi-Perspective Data Augmentation (MPAD) framework enhances few-shot object detection by generating diverse and challenging synthetic samples through foreground adjustments, hard sample synthesis, and background optimization, achieving a 17.5% nAP50 improvement over baselines on PASCAL VOC.
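The HPAS component mixes prompt embeddings across diffusion timesteps to synthesize hard novel samples near class boundaries. The sketch below shows one plausible timestep-dependent interpolation between a base-class and a novel-class prompt embedding; the linear schedule, the alpha_max value, the assumed timestep direction, and the function name are all assumptions for illustration, not the paper's exact scheduler.

```python
import torch

def harmonic_prompt_mix(base_embed, novel_embed, t, T, alpha_max=0.5):
    """Mix two prompt embeddings with a timestep-dependent weight during diffusion
    sampling, so early (noisy) steps lean toward the base-class prompt and later
    steps toward the novel-class prompt. The linear schedule is illustrative."""
    alpha = alpha_max * (t / T)  # assumes t counts down from T (noisy) to 0 (clean)
    return (1.0 - alpha) * novel_embed + alpha * base_embed
```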

Authors:Adnan Iltaf, Rayan Merghani Ahmed, Zhenxi Zhang, Bin Li, Shoujun Zhou
Title: VesselSAM: Leveraging SAM for Aortic Vessel Segmentation with AtrousLoRA
Abstract:
Medical image segmentation is crucial for clinical diagnosis and treatment planning, especially when dealing with complex anatomical structures such as vessels. However, accurately segmenting vessels remains challenging due to their small size, intricate edge structures, and susceptibility to artifacts and imaging noise. In this work, we propose VesselSAM, an enhanced version of the Segment Anything Model (SAM), specifically tailored for aortic vessel segmentation. VesselSAM incorporates AtrousLoRA, a novel module integrating Atrous Attention and Low-Rank Adaptation (LoRA), to enhance segmentation performance. Atrous Attention enables the model to capture multi-scale contextual information, preserving both fine-grained local details and broader global context. Additionally, LoRA facilitates efficient fine-tuning of the frozen SAM image encoder, reducing the number of trainable parameters and thereby enhancing computational efficiency. We evaluate VesselSAM using two challenging datasets: the Aortic Vessel Tree (AVT) dataset and the Type-B Aortic Dissection (TBAD) dataset. VesselSAM achieves state-of-the-art performance, attaining DSC scores of 93.50%, 93.25%, 93.02%, and 93.26% across multi-center datasets. Our results demonstrate that VesselSAM delivers high segmentation accuracy while significantly reducing computational overhead compared to existing large-scale models. This development paves the way for enhanced AI-based aortic vessel segmentation in clinical environments. The code and models will be released at https://github.com/Adnan-CAS/AtrousLora.
中文: VesselSAM通过集成AtrousLoRA模块改进了Segment Anything模型,在主动脉血管分割中实现了顶尖的精度和计算效率,适用于多中心临床数据。
English: VesselSAM enhances the Segment Anything Model with AtrousLoRA for superior aortic vessel segmentation, achieving state-of-the-art accuracy and computational efficiency on multi-center datasets.
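The LoRA half of AtrousLoRA fine-tunes the frozen SAM image encoder through low-rank updates to its linear layers. Below is a minimal, generic sketch of such a wrapper (the Atrous Attention part is omitted); the class name and argument defaults are illustrative, not the released VesselSAM code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update, as used to adapt
    a frozen SAM image encoder. The Atrous Attention half of AtrousLoRA is omitted."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts at zero
        self.scale = alpha / rank
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```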

Authors:Mingkun Zhang, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng
Title: CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification
Abstract:
In this paper, we aim to build an adversarially robust zero-shot image classifier. We ground our work on CLIP, a vision-language pre-trained encoder model that can perform zero-shot classification by matching an image with text prompts of the form "a photo of a <class-name>". Purification is the path we choose since it does not require adversarial training on specific attack types and thus can cope with unforeseen attacks. We then formulate purification risk as the KL divergence between the joint distributions of the purification process of denoising the adversarial samples and the attack process of adding perturbations to benign samples, through bidirectional Stochastic Differential Equations (SDEs). The final derived results inspire us to explore purification in the multi-modal latent space of CLIP. We propose two variants for our CLIPure approach: CLIPure-Diff, which models the likelihood of images' latent vectors with the DiffusionPrior module in DALL-E 2 (modeling the generation process of CLIP's latent vectors), and CLIPure-Cos, which models the likelihood with the cosine similarity between the embeddings of an image and of the blank template "a photo of a.". As far as we know, CLIPure is the first purification method in multi-modal latent space and CLIPure-Cos is the first purification method that is not based on generative models, which substantially improves defense efficiency. We conducted extensive experiments on CIFAR-10, ImageNet, and 13 datasets that previous CLIP-based defense methods used for evaluating zero-shot classification robustness. Results show that CLIPure boosts the SOTA robustness by a large margin, e.g., from 71.7% to 91.1% on CIFAR-10, from 59.6% to 72.6% on ImageNet, and a 108% relative improvement in average robustness on the 13 datasets over the previous SOTA. The code is available at https://github.com/TMLResearchGroup-CAS/CLIPure.
中文: 本研究提出了CLIPure,一种在CLIP多模态潜在空间中运行的零样本图像分类器对抗净化新方法,无需依赖生成模型即可显著提升防御鲁棒性。
English: This paper introduces CLIPure, a novel adversarial purification method for zero-shot image classifiers that operates in CLIP's multi-modal latent space, significantly boosting robustness without relying on generative models.
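The generative-model-free variant, CLIPure-Cos, treats the cosine similarity to the blank-template embedding as a likelihood proxy and purifies in latent space. The hedged sketch below performs a few gradient-ascent steps on that similarity; the step count, learning rate, and function name are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def cos_purify(image_embed, blank_prompt_embed, steps=10, lr=0.1):
    """Nudge an image's CLIP latent toward higher cosine similarity with the embedding
    of the blank template ("a photo of a."), treating that similarity as a likelihood
    proxy. Step count and learning rate are illustrative."""
    z = image_embed.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        loss = -F.cosine_similarity(z, blank_prompt_embed, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(z.detach(), dim=-1)
```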

Authors:Yunfeng Li, Bo Wang, Ye Li
Title: LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking
Abstract:
Despite great progress in multimodal tracking, these trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of the template-search-area integrated features with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through a module fine-tuning paradigm. Comprehensive experiments show that our LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% and 5.7% in SR and PR on the LasHeR benchmark, while achieving a 2.6x reduction in parameters and a 2.7x speedup. It runs in real time on a CPU at 22 fps. The code is available at https://github.com/LiYunfengLYF/LightFC-X.
中文:LightFC-X提出了一种轻量级多模态跟踪框架,通过高效的跨模态注意力和时空模块,在减少参数的同时实现了顶尖性能,并能在CPU上实时运行。
English: LightFC-X introduces a lightweight multimodal tracking framework with efficient cross-attention and spatiotemporal modules, achieving state-of-the-art performance with reduced parameters and real-time CPU speed.

Authors:Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Title: SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Abstract:
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.
中文: SpargeAttn提出了一种通用的稀疏量化注意力机制,通过两阶段在线过滤跳过冗余计算,在加速语言、图像和视频生成等多种模型的同时保持端到端性能不损失。
English: SpargeAttn introduces a universal sparse and quantized attention mechanism that accelerates diverse models across language, image, and video generation without compromising performance by employing a two-stage online filter to skip unnecessary computations.
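To make the sparsity idea concrete, the sketch below implements a block-level approximation: block importance is estimated from cheap mean-pooled summaries, and only the most promising query-key block pairs are computed exactly. This illustrates the skipping principle only; SpargeAttn's actual two-stage online filter, quantization, and kernel-level implementation are not reproduced, and the function name and ratios are assumptions.

```python
import torch

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.3):
    """Block-level sparse attention: estimate each query/key block pair's importance
    from mean-pooled summaries, keep only the top fraction, and compute exact attention
    inside the kept blocks. Tokens beyond the last full block are ignored for brevity."""
    T, d = q.shape
    nb = T // block
    qb = q[: nb * block].reshape(nb, block, d)
    kb = k[: nb * block].reshape(nb, block, d)
    vb = v[: nb * block].reshape(nb, block, d)
    q_mean, k_mean = qb.mean(1), kb.mean(1)                  # cheap block summaries
    score = q_mean @ k_mean.t() / d ** 0.5                   # (nb, nb) block importance
    keep = score >= torch.quantile(score, 1.0 - keep_ratio, dim=-1, keepdim=True)
    out = torch.zeros_like(qb)
    for i in range(nb):
        idx = keep[i].nonzero(as_tuple=True)[0]              # kept key blocks for query block i
        ks = kb[idx].reshape(-1, d)
        vs = vb[idx].reshape(-1, d)
        attn = torch.softmax(qb[i] @ ks.t() / d ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out.reshape(nb * block, d)
```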

Authors:Ryo Takizawa, Izumi Karino, Koki Nakagawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi
Title: Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze Information and Motion Bottlenecks
Abstract:
Autonomous agents capable of diverse object manipulations should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of learned motions without sacrificing dexterity or reactivity. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves high success rates compared with state-of-the-art imitation learning methods, particularly when the object positions and end-effector poses differ from those in the provided demonstrations. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided. Videos and code are available at https://crumbyrobotics.github.io/gazebot.

Authors:Han Nie, Bin Luo, Jun Liu, Zhitao Fu, Huan Zhou, Shuo Zhang, Weixing Liu
Title: PromptMID: Modal Invariant Descriptors Based on Diffusion and Vision Foundation Models for Optical-SAR Image Matching
Abstract:
The ideal goal of image matching is to achieve stable and efficient performance in unseen domains. However, many existing learning-based optical-SAR image matching methods, despite their effectiveness in specific scenarios, exhibit limited generalization and struggle to adapt to practical applications. Repeatedly training or fine-tuning matching models to address domain differences is not only inelegant but also introduces additional computational overhead and data production costs. In recent years, general foundation models have shown great potential for enhancing generalization. However, the disparity in visual domains between natural and remote sensing images poses challenges for their direct application. Therefore, effectively leveraging foundation models to improve the generalization of optical-SAR image matching remains a challenge. To address the above challenges, we propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts based on land use classification as prior information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules effectively fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods, achieving superior results in both seen and unseen domains and exhibiting strong cross-domain generalization capabilities. The source code will be made publicly available at https://github.com/HanNieWHU/PromptMID.
Chinese: PromptMID是一种创新方法,通过利用文本提示和预训练模型构建模态不变描述符来提升光学-SAR图像匹配效果,在多个区域实验中展现出卓越的跨域泛化能力,显著优于现有先进技术。
English: PromptMID is a novel method that enhances optical-SAR image matching by using text prompts and pre-trained models to create modality-invariant descriptors, achieving superior cross-domain generalization and outperforming existing techniques in diverse regions.

Authors:Cao Yuxuan, Wu Jiayang, Alistair Cheong Liang Chuen, Bryan Shan Guanrong, Theodore Lee Chong Jen, Sherman Chann Zhi Shen
Title: Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models
Abstract:
Traditional online content moderation systems struggle to classify modern multimodal means of communication, such as memes, a highly nuanced and information-dense medium. This task is especially hard in a culturally diverse society like Singapore, where low-resource languages are used and extensive knowledge of local context is needed to interpret online content. We curate a large collection of 112K memes labeled by GPT-4V for fine-tuning a VLM to classify offensive memes in the Singapore context. We show the effectiveness of fine-tuned VLMs on our dataset, and propose a pipeline containing OCR, translation and a 7-billion parameter-class VLM. Our solutions reach 80.62% accuracy and 0.8192 AUROC on a held-out test set, and can greatly aid humans in moderating online content. The dataset, code, and model weights have been open-sourced at https://github.com/aliencaocao/vlm-for-memes-aisg.
中文摘要:传统内容审核系统难以处理如表情包这类多模态内容,尤其是在文化多元的新加坡,但通过在大规模数据集上微调视觉语言模型,对冒犯性表情包的分类准确率达到了80.62%。
English Summary: Traditional content moderation systems are ineffective for nuanced multimodal content like memes, especially in culturally diverse Singapore, but fine-tuning a vision-language model on a large dataset achieves 80.62% accuracy in classifying offensive memes.

Authors:Shengtian Mian, Ya Wang, Nannan Gu, Yuping Wang, Xiaoqing Li
Title: FwNet-ECA: A Classification Model Enhancing Window Attention with Global Receptive Fields via Fourier Filtering Operations
Abstract:
Windowed attention mechanisms were introduced to mitigate the issue of excessive computation inherent in global attention mechanisms. In this paper, we present FwNet-ECA, a novel method that utilizes Fourier transforms paired with learnable weight matrices to enhance the spectral features of images. This method establishes a global receptive field through Filter Enhancement and avoids the use of moving window attention. Additionally, we incorporate the Efficient Channel Attention (ECA) module to improve communication between different channels. Instead of relying on physically shifted windows, our approach leverages frequency domain enhancement to implicitly bridge information across spatial regions. We validate our model on the iCartoonFace dataset and conduct downstream tasks on ImageNet, demonstrating that our model achieves lower parameter counts and computational overheads compared to shifted window approaches, while maintaining competitive accuracy. Furthermore, our visualization operations clearly demonstrated that the Filter Enhancement technique achieves greater effectiveness in the model's shallow layers, where feature maps are relatively larger. This work offers a more efficient and effective alternative for leveraging attention mechanisms in visual processing tasks, alleviating the challenges associated with windowed attention models. Code is available at https://github.com/qingxiaoli/FwNet-ECA
中文: FwNet-ECA提出了一种结合傅里叶变换与可学习权重的方法来增强图像频谱特征,通过滤波器增强建立全局感受野,无需移动窗口注意力即可在降低参数量的同时保持竞争力准确率。
English: FwNet-ECA introduces a method using Fourier transforms and learnable weights to enhance image spectral features, establishing a global receptive field without windowed attention while reducing parameters and maintaining competitive accuracy on benchmark datasets.
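The Filter Enhancement idea, multiplying the feature map's spectrum by learnable weights so every spatial position sees global context, can be sketched as below, in the spirit of frequency-domain token mixing; the module name and parameterization are assumptions, and the ECA channel-attention part is omitted.

```python
import torch
import torch.nn as nn

class FourierFilter(nn.Module):
    """Frequency-domain filter enhancement: FFT the feature map, multiply by a learnable
    complex filter (one weight per channel and frequency), and transform back, giving
    every position a global receptive field without shifted-window attention."""
    def __init__(self, channels, height, width):
        super().__init__()
        # rfft2 keeps width // 2 + 1 frequency bins along the last axis
        self.filter = nn.Parameter(torch.randn(channels, height, width // 2 + 1, 2) * 0.02)
    def forward(self, x):                                  # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")
        freq = freq * torch.view_as_complex(self.filter)   # learnable spectral weighting
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
```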

Authors:Ankita Raj, Deepankar Varma, Chetan Arora
Title: Examining the Threat Landscape: Foundation Models and Model Stealing
Abstract:
Foundation models (FMs) for computer vision learn rich and robust representations, enabling their adaptation to task/domain-specific deployments with little to no fine-tuning. However, we posit that the very same strength can make applications based on FMs vulnerable to model stealing attacks. Through empirical analysis, we reveal that models fine-tuned from FMs harbor heightened susceptibility to model stealing, compared to conventional vision architectures like ResNets. We hypothesize that this behavior is due to the comprehensive encoding of visual patterns and features learned by FMs during pre-training, which are accessible to both the attacker and the victim. We report that an attacker is able to obtain 94.28% agreement (matched predictions with victim) for a Vision Transformer based victim model (ViT-L/16) trained on CIFAR-10 dataset, compared to only 73.20% agreement for a ResNet-18 victim, when using ViT-L/16 as the thief model. We arguably show, for the first time, that utilizing FMs for downstream tasks may not be the best choice for deployment in commercial APIs due to their susceptibility to model theft. We thereby alert model owners towards the associated security risks, and highlight the need for robust security measures to safeguard such models against theft. Code is available at https://github.com/rajankita/foundation_model_stealing.
中文: 计算机视觉基础模型虽然能轻松适应特定任务,但极易遭受模型窃取攻击,攻击者在微调模型上可获得高达94.28%的预测一致性,这对商业部署构成严重安全威胁。
English: Foundation models in computer vision, while enabling easy adaptation to specific tasks, are highly vulnerable to model stealing attacks, with attackers achieving up to 94.28% prediction agreement on fine-tuned models, highlighting significant security risks for commercial deployments.
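The agreement numbers quoted above are simply the fraction of queries on which the stolen (thief) model reproduces the victim's prediction. A minimal sketch of that metric is given below; the function name and loader interface are illustrative.

```python
import torch

@torch.no_grad()
def agreement(victim, thief, loader, device="cpu"):
    """Fraction of samples on which the thief model reproduces the victim's prediction,
    i.e. the 'agreement' metric quoted in the abstract."""
    match, total = 0, 0
    for x, _ in loader:
        x = x.to(device)
        match += (victim(x).argmax(dim=1) == thief(x).argmax(dim=1)).sum().item()
        total += x.size(0)
    return match / total
```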

Authors:Carlos Vélez García, Miguel Cazorla, Jorge Pomares
Title: Escaping The Big Data Paradigm in Self-Supervised Representation Learning
Abstract:
The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. Alongside, we propose to train on a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), operating in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required, without relying on massive external datasets for pretraining. We validate our method on three small-size, standard-resolution, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve competitive results with SOTA approaches that rely on large-scale pretraining, complex image augmentations and bigger model sizes. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model sizes, our work paves the way for computer vision applications in resource-constrained environments such as medical imaging or robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning in vision, offering a new pathway toward more accessible and inclusive advancements in the field.
中文: 本文提出的SCOTT架构与MIM-JEPA框架相结合,使视觉Transformer能在小规模数据集上无需大规模预训练即可学习到强大表征,突破了计算机视觉领域必须依赖大数据的传统范式。
English: This paper introduces SCOTT, a shallow tokenization architecture integrated with MIM-JEPA, enabling Vision Transformers to learn robust representations from small datasets without large-scale pretraining, challenging the necessity of big data in computer vision.
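The tokenizer side of the method replaces the usual single large-stride patch embedding with a shallow convolutional stem that injects locality biases before the ViT. The sketch below shows that general pattern with dense convolutions; SCOTT additionally uses sparse convolutions to stay compatible with masked image modeling, which is not reproduced here, and the class name and dimension choices are assumptions.

```python
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Shallow convolutional stem in front of a ViT: a few strided convolutions inject
    locality biases before tokens are flattened. SCOTT additionally uses sparse
    convolutions to stay compatible with masked image modeling, not reproduced here."""
    def __init__(self, in_ch=3, dim=384):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 4, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 4, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
        )
    def forward(self, x):                        # (B, 3, H, W) -> (B, N, dim)
        feats = self.stem(x)
        return feats.flatten(2).transpose(1, 2)
```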

Authors:Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
Title: ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Abstract:
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba-NLP/ViDoRAG.
中文摘要:本文提出ViDoRAG多智能体框架,通过混合检索策略和迭代推理工作流解决现有RAG方法在处理视觉文档时的不足,在ViDoSeek基准测试中性能提升超过10%。
English Summary: The abstract introduces ViDoRAG, a multi-agent framework that addresses limitations in current RAG methods for visually rich documents by employing hybrid retrieval and iterative reasoning workflows, achieving over 10% improvement on the new ViDoSeek benchmark.
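One way to realize the GMM-based retrieval strategy mentioned above is to fit a small mixture model to the per-document similarity scores and keep the component with the higher mean instead of applying a fixed top-k cutoff. The sketch below shows that idea with scikit-learn; it is an interpretation of the abstract, not the released pipeline, and the function name is hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_retrieval_cutoff(scores):
    """Fit a two-component Gaussian mixture to per-document retrieval scores and keep
    the documents assigned to the higher-mean ('relevant') component, instead of a
    fixed top-k cutoff."""
    s = np.asarray(scores, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(s)
    relevant = int(np.argmax(gmm.means_.ravel()))      # component with the higher mean
    return np.where(gmm.predict(s) == relevant)[0]     # indices of retained documents
```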

Authors:Mingyuan Sun, Zheng Fang, Jiaxu Wang, Junjie Jiang, Delei Kong, Chenming Hu, Yuetong Fang, Renjing Xu
Title: Optimal Brain Apoptosis
Abstract:
The increasing complexity and parameter count of Convolutional Neural Networks (CNNs) and Transformers pose challenges in terms of computational efficiency and resource demands. Pruning has been identified as an effective strategy to address these challenges by removing redundant elements such as neurons, channels, or connections, thereby enhancing computational efficiency without heavily compromising performance. This paper builds on the foundational work of Optimal Brain Damage (OBD) by advancing the methodology of parameter importance estimation using the Hessian matrix. Unlike previous approaches that rely on approximations, we introduce Optimal Brain Apoptosis (OBA), a novel pruning method that calculates the Hessian-vector product directly for each parameter. By decomposing the Hessian matrix across network layers and identifying conditions under which inter-layer Hessian submatrices are non-zero, we propose a highly efficient technique for computing the second-order Taylor expansion of parameters. This approach allows for a more precise pruning process, particularly in the context of CNNs and Transformers, as validated in our experiments including VGG19, ResNet32, ResNet50, and ViT-B/16 on the CIFAR-10, CIFAR-100, and ImageNet datasets. Our code is available at https://github.com/NEU-REAL/OBA.
中文: 本文提出了一种名为最优脑凋亡(OBA)的新剪枝方法,通过直接计算Hessian向量积来精确评估参数重要性,从而在多个数据集上验证了其在提升卷积神经网络和Transformer计算效率方面的有效性。
English: This paper introduces Optimal Brain Apoptosis (OBA), a novel pruning method that enhances computational efficiency by directly calculating Hessian-vector products for precise parameter importance estimation in CNNs and Transformers, validated on multiple datasets.
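As a rough illustration of Hessian-vector-product-based importance (a simplified proxy in the OBD/OBA spirit, not the paper's exact estimator), double backpropagation in PyTorch computes H·θ without ever forming the Hessian:

```python
# Minimal sketch: per-parameter second-order importance via a Hessian-vector
# product, computed with double backprop. This is a generic pruning proxy, not
# OBA's exact algorithm.
import torch
import torch.nn as nn

def hvp_importance(model: nn.Module, loss: torch.Tensor):
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Differentiate <grad, theta_detached> to obtain H @ theta
    # (theta is treated as a constant vector on the right-hand side).
    dot = sum((g * p.detach()).sum() for g, p in zip(grads, params))
    hvps = torch.autograd.grad(dot, params)
    # 0.5 * theta * (H theta): second-order Taylor term used as an importance score.
    return [0.5 * p.detach() * hv for p, hv in zip(params, hvps)]

# Usage: scores = hvp_importance(net, criterion(net(x), y)); prune weights with the smallest scores.
```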

Authors:Vishal Nedungadi, Muhammad Akhtar Munir, Marc Rußwurm, Ron Sarafian, Ioannis N. Athanasiadis, Yinon Rudich, Fahad Shahbaz Khan, Salman Khan
Title: AirCast: Improving Air Pollution Forecasting Through Multi-Variable Data Alignment
Abstract:
Air pollution remains a leading global health risk, exacerbated by rapid industrialization and urbanization, contributing significantly to morbidity and mortality rates. In this paper, we introduce AirCast, a novel multi-variable air pollution forecasting model that combines weather and air quality variables. AirCast employs a multi-task head architecture that simultaneously forecasts atmospheric conditions and pollutant concentrations, improving its understanding of how weather patterns affect air quality. Predicting extreme pollution events is challenging due to their rare occurrence in historic data, resulting in a heavy-tailed distribution of pollution levels. To address this, we propose a novel Frequency-weighted Mean Absolute Error (fMAE) loss, adapted from the class-balanced loss for regression tasks. Informed by domain knowledge, we investigate the selection of key variables known to influence pollution levels. Additionally, we align existing weather and chemical datasets across spatial and temporal dimensions. AirCast's integrated approach, combining multi-task learning, frequency-weighted loss and domain-informed variable selection, enables more accurate pollution forecasts. Our source code and models are made public here (https://github.com/vishalned/AirCast.git).
中文摘要:AirCast是一种创新的多变量空气污染预测模型,通过整合天气和空气质量数据,采用多任务学习架构、频率加权损失函数及领域知识驱动的变量选择方法,显著提升了污染预测的精确度。
English Summary: AirCast is a novel multi-variable air pollution forecasting model that integrates weather and air quality data through multi-task learning, frequency-weighted loss, and domain-informed variable selection to improve prediction accuracy.
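A frequency-weighted MAE for regression might look like the sketch below; the binning scheme and the effective-number weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a frequency-weighted MAE in the spirit of AirCast's fMAE.
# `bin_edges` is a sorted 1-D tensor of pollution-level boundaries (assumed).
import torch

def fmae_loss(pred, target, bin_edges, beta=0.999):
    err = (pred - target).abs().flatten()
    bins = torch.bucketize(target.flatten(), bin_edges)          # bin index per sample
    counts = torch.bincount(bins, minlength=len(bin_edges) + 1).clamp(min=1)
    weights = (1.0 - beta) / (1.0 - beta ** counts.float())      # class-balanced "effective number"
    weights = weights / weights.mean()                           # keep the loss scale roughly unchanged
    return (weights[bins] * err).mean()
```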

Authors:Rong Liu, Junye Liang, Jiaqi Yang, Jiang He, Peng Zhu
Title: Dual Classification Head Self-training Network for Cross-scene Hyperspectral Image Classification
Abstract:
Due to the difficulty of obtaining labeled data for hyperspectral images (HSIs), cross-scene classification has emerged as a widely adopted approach in the remote sensing community. It involves training a model using labeled data from a source domain (SD) and unlabeled data from a target domain (TD), followed by inferencing on the TD. However, variations in the reflectance spectrum of the same object between the SD and the TD, as well as differences in the feature distribution of the same land cover class, pose significant challenges to the performance of cross-scene classification. To address this issue, we propose a dual classification head self-training network (DHSNet). This method aligns class-wise features across domains, ensuring that the trained classifier can accurately classify TD data of different classes. We introduce a dual classification head self-training strategy for the first time in the cross-scene HSI classification field. The proposed approach mitigates domain gap while preventing the accumulation of incorrect pseudo-labels in the model. Additionally, we incorporate a novel central feature attention mechanism to enhance the model's capacity to learn scene-invariant features across domains. Experimental results on three cross-scene HSI datasets demonstrate that the proposed DHSNET significantly outperforms other state-of-the-art approaches. The code for DHSNet will be available at https://github.com/liurongwhm.
Chinese: 针对跨场景高光谱图像分类中的域差异问题,DHSNet首次提出双分类头自训练网络并引入中心特征注意力机制,在三个数据集上显著优于现有先进方法。
English: To overcome domain discrepancies in cross-scene hyperspectral image classification, DHSNet introduces a dual classification head self-training strategy with a central feature attention mechanism, achieving superior performance on benchmark datasets.

Authors:Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji
Title: MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
Abstract:
Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) leverage both their rich parametric knowledge and the dynamic, external knowledge to excel in tasks such as Question Answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance poses a critical yet underexplored safety risk: knowledge poisoning attacks, where misinformation or irrelevant knowledge is intentionally injected into external knowledge bases to manipulate model outputs to be incorrect and even harmful. To expose such vulnerabilities in multimodal RAG, we propose MM-PoisonRAG, a novel knowledge poisoning attack framework with two attack strategies: Localized Poisoning Attack (LPA), which injects query-specific misinformation in both text and images for targeted manipulation, and Globalized Poisoning Attack (GPA) to provide false guidance during MLLM generation to elicit nonsensical responses across all queries. We evaluate our attacks across multiple tasks, models, and access settings, demonstrating that LPA successfully manipulates the MLLM to generate attacker-controlled answers, with a success rate of up to 56% on MultiModalQA. Moreover, GPA completely disrupts model generation to 0% accuracy with just a single irrelevant knowledge injection. Our results highlight the urgent need for robust defenses against knowledge poisoning to safeguard multimodal RAG frameworks.
中文摘要:多模态检索增强生成系统面临知识投毒攻击的安全隐患,攻击者通过注入误导性内容操控模型输出,MM-PoisonRAG框架实验显示其定向攻击成功率最高可达56%。
English Summary: Multimodal RAG systems face critical safety risks from knowledge poisoning attacks, where adversaries inject misleading content to manipulate model outputs, as demonstrated by the MM-PoisonRAG framework achieving up to 56% targeted attack success.

Authors:Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, Liefeng Bo
Title: LAM: Large Avatar Model for One-shot Animatable Gaussian Head
Abstract:
We present LAM, an innovative Large Avatar Model for animatable Gaussian head reconstruction from a single image. Unlike previous methods that require extensive training on captured video sequences or rely on auxiliary neural networks for animation and rendering during inference, our approach generates Gaussian heads that are immediately animatable and renderable. Specifically, LAM creates an animatable Gaussian head in a single forward pass, enabling reenactment and rendering without additional networks or post-processing steps. This capability allows for seamless integration into existing rendering pipelines, ensuring real-time animation and rendering across a wide range of platforms, including mobile phones. The centerpiece of our framework is the canonical Gaussian attributes generator, which utilizes FLAME canonical points as queries. These points interact with multi-scale image features through a Transformer to accurately predict Gaussian attributes in the canonical space. The reconstructed canonical Gaussian avatar can then be animated utilizing standard linear blend skinning (LBS) with corrective blendshapes as the FLAME model did and rendered in real-time on various platforms. Our experimental results demonstrate that LAM outperforms state-of-the-art methods on existing benchmarks. Our code and video are available at https://aigc3d.github.io/projects/LAM/
中文: LAM是一种创新的大型虚拟化身模型,通过单张图像一次性生成可动画化的高斯头部,无需额外网络即可实现跨平台实时动画与渲染。
English: LAM is a novel Large Avatar Model that reconstructs animatable Gaussian heads from a single image in one forward pass, enabling real-time animation and rendering across multiple platforms without additional networks.

Authors:Hyeonjeong Ha, Xiaomeng Jin, Jeonghwan Kim, Jiateng Liu, Zhenhailong Wang, Khanh Duy Nguyen, Ansel Blume, Nanyun Peng, Kai-Wei Chang, Heng Ji
Title: SYNTHIA: Novel Concept Design with Affordance Composition
Abstract:
Text-to-image (T2I) models enable rapid concept design, making them widely used in AI-driven design. While recent studies focus on generating semantic and stylistic variations of given design concepts, functional coherence--the integration of multiple affordances into a single coherent concept--remains largely overlooked. In this paper, we introduce SYNTHIA, a framework for generating novel, functionally coherent designs based on desired affordances. Our approach leverages a hierarchical concept ontology that decomposes concepts into parts and affordances, serving as a crucial building block for functionally coherent design. We also develop a curriculum learning scheme based on our ontology that contrastively fine-tunes T2I models to progressively learn affordance composition while maintaining visual novelty. To elaborate, we (i) gradually increase affordance distance, guiding models from basic concept-affordance association to complex affordance compositions that integrate parts of distinct affordances into a single, coherent form, and (ii) enforce visual novelty by employing contrastive objectives to push learned representations away from existing concepts. Experimental results show that SYNTHIA outperforms state-of-the-art T2I models, demonstrating absolute gains of 25.1% and 14.7% for novelty and functional coherence in human evaluation, respectively.
Chinese: 本文提出SYNTHIA框架,通过分层概念本体和课程学习,将多种功能整合到单一概念中生成新颖且功能一致的设计,在人类评估中显著超越了现有T2I模型的新颖性和功能一致性表现。
English: This paper introduces SYNTHIA, a framework that uses a hierarchical concept ontology and curriculum learning to generate novel, functionally coherent designs by integrating multiple affordances into a single concept, significantly outperforming existing T2I models in both novelty and functional coherence.

Authors:Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari
Title: Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
Abstract:
We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving +14.5%, +10.2%, and +13.6% F1-score improvements. Our feature-based approach for predicting task graphs from textual/video embeddings demonstrates emerging video understanding abilities. We also achieved top performance on the procedure understanding benchmark on Ego-Exo4D and significantly improved online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.
中文: 我们提出的基于梯度的学习方法通过最大似然优化边权重来从程序性活动中学习任务图,在多个数据集上实现了显著的F1分数提升,并在程序理解基准测试中表现出色。
English: Our gradient-based method learns task graphs from procedural activities by optimizing edge weights through maximum likelihood, achieving significant F1-score improvements on multiple datasets and excelling in procedure understanding benchmarks.

Authors:Sushmita Sarker, Prithul Sarker, George Bebis, Alireza Tavakkoli
Title: Can Score-Based Generative Modeling Effectively Handle Medical Image Classification?
Abstract:
The remarkable success of deep learning in recent years has prompted applications in medical image classification and diagnosis tasks. While classification models have demonstrated robustness in classifying simpler datasets like MNIST or natural images such as ImageNet, this resilience is not consistently observed in complex medical image datasets where data is more scarce and lacks diversity. Moreover, previous findings on natural image datasets have indicated a potential trade-off between data likelihood and classification accuracy. In this study, we explore the use of score-based generative models as classifiers for medical images, specifically mammographic images. Our findings suggest that our proposed generative classifier model not only achieves superior classification results on CBIS-DDSM, INbreast and Vin-Dr Mammo datasets, but also introduces a novel approach to image classification in a broader context. Our code is publicly available at https://github.com/sushmitasarker/sgc_for_medical_image_classification
中文摘要: 本研究提出了一种基于分数的生成式分类器,在医学乳腺摄影数据集上实现了优越的分类性能,并为图像分类领域提供了超越传统方法的新思路。
English Summary: This study proposes a score-based generative classifier that achieves superior performance on medical mammography datasets while offering a novel approach to image classification beyond traditional methods.
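Conceptually, a generative classifier of this kind picks the class whose class-conditional model best explains the image. A hedged, minimal sketch follows; the per-class log-likelihood functions are assumed to come from trained score-based models, which are not shown.

```python
# Toy sketch of generative classification via Bayes' rule:
# argmax_c [ log p(x | c) + log p(c) ], with log-likelihoods supplied by
# per-class generative models (assumed available as callables).
def generative_classify(x, log_likelihood_fns, log_priors=None):
    if log_priors is None:
        log_priors = [0.0] * len(log_likelihood_fns)
    scores = [ll(x) + lp for ll, lp in zip(log_likelihood_fns, log_priors)]
    return max(range(len(scores)), key=lambda c: scores[c])
```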

Authors:Lei Cheng, Lihao Guo, Tianya Zhang, Tam Bang, Austin Harris, Mustafa Hajij, Mina Sartipi, Siyang Cao
Title: CalibRefine: Deep Learning-Based Online Automatic Targetless LiDAR-Camera Calibration with Iterative and Attention-Driven Post-Refinement
Abstract:
Accurate multi-sensor calibration is essential for deploying robust perception systems in applications such as autonomous driving and intelligent transportation. Existing LiDAR-camera calibration methods often rely on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, limiting their scalability and adaptability in real-world settings. In this work, we propose a fully automatic, targetless, and online calibration framework, CalibRefine, which directly processes raw LiDAR point clouds and camera images. Our approach is divided into four stages: (1) a Common Feature Discriminator that leverages relative spatial positions, visual appearance embeddings, and semantic class cues to identify and generate reliable LiDAR-camera correspondences, (2) a coarse homography-based calibration that uses the matched feature correspondences to estimate an initial transformation between the LiDAR and camera frames, serving as the foundation for further refinement, (3) an iterative refinement to incrementally improve alignment as additional data frames become available, and (4) an attention-based refinement that addresses non-planar distortions by leveraging a Vision Transformer and cross-attention mechanisms. Extensive experiments on two urban traffic datasets demonstrate that CalibRefine achieves high-precision calibration with minimal human input, outperforming state-of-the-art targetless methods and matching or surpassing manually tuned baselines. Our results show that robust object-level feature matching, combined with iterative refinement and self-supervised attention-based refinement, enables reliable sensor alignment in complex real-world conditions without ground-truth matrices or elaborate preprocessing. Code is available at https://github.com/radar-lab/Lidar_Camera_Automatic_Calibration
中文:CalibRefine是一种全自动、无目标且在线的标定框架,通过鲁棒的特征匹配和迭代优化实现高精度LiDAR-相机校准,在实际场景中超越了现有方法。
English: CalibRefine is a fully automatic, targetless, and online framework that achieves high-precision LiDAR-camera calibration through robust feature matching and iterative refinement, outperforming existing methods in real-world conditions.

Authors:Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
Title: Fractal Generative Models
Abstract:
Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
中文: 本文提出分形生成模型,通过递归调用原子生成模块构建自相似结构,在图像生成任务中表现出色,为生成建模开辟了新范式。
English: This paper introduces fractal generative models, which recursively use atomic generative modules to create self-similar architectures, demonstrating strong performance in image generation tasks and offering a new paradigm for generative modeling.
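The recursive structure can be pictured with a toy module that invokes smaller copies of itself; this is purely illustrative and not the paper's autoregressive instantiation.

```python
# Toy sketch of a "fractal" module: each level recursively instantiates a smaller
# generator of the same form as its atomic building block.
import torch.nn as nn

class FractalBlock(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.head = nn.Linear(dim, dim)
        self.child = FractalBlock(dim, depth - 1) if depth > 1 else None

    def forward(self, z):
        h = self.head(z).relu()
        # Recursive call: the child refines the parent's output, mirroring the
        # self-similar architecture described in the abstract.
        return h if self.child is None else h + self.child(h)
```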

Authors:Yichi Zhang, Yici Yan, Alex Schwing, Zhizhen Zhao
Title: Towards Hierarchical Rectified Flow
Abstract:
We formulate a hierarchical rectified flow to model data distributions. It hierarchically couples multiple ordinary differential equations (ODEs) and defines a time-differentiable stochastic process that generates a data distribution from a known source distribution. Each ODE resembles the ODE that is solved in a classic rectified flow, but differs in its domain, i.e., location, velocity, acceleration, etc. Unlike the classic rectified flow formulation, which formulates a single ODE in the location domain and only captures the expected velocity field (sufficient to capture a multi-modal data distribution), the hierarchical rectified flow formulation models the multi-modal random velocity field, acceleration field, etc., in their entirety. This more faithful modeling of the random velocity field enables integration paths to intersect when the underlying ODE is solved during data generation. Intersecting paths in turn lead to integration trajectories that are more straight than those obtained in the classic rectified flow formulation, where integration paths cannot intersect. This leads to modeling of data distributions with fewer neural function evaluations. We empirically verify this on synthetic 1D and 2D data as well as MNIST, CIFAR-10, and ImageNet-32 data. Our code is available at: https://riccizz.github.io/HRF/.

Authors:Chen-Wei Chang, Cheng-De Fan, Chia-Che Chang, Yi-Chen Lo, Yu-Chee Tseng, Jiun-Long Huang, Yu-Lun Liu
Title: GCC: Generative Color Constancy via Diffusing a Color Checker
Abstract:
Color constancy methods often struggle to generalize across different camera sensors due to varying spectral sensitivities. We present GCC, which leverages diffusion models to inpaint color checkers into images for illumination estimation. Our key innovations include (1) a single-step deterministic inference approach that inpaints color checkers reflecting scene illumination, (2) a Laplacian decomposition technique that preserves checker structure while allowing illumination-dependent color adaptation, and (3) a mask-based data augmentation strategy for handling imprecise color checker annotations. By harnessing rich priors from pre-trained diffusion models, GCC demonstrates strong robustness in challenging cross-camera scenarios. These results highlight our method's effective generalization capability across different camera characteristics without requiring sensor-specific training, making it a versatile and practical solution for real-world applications.

Authors:Vishal Thengane, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Lu Yin, Xiatian Zhu, Salman Khan
Title: CLIMB-3D: Continual Learning for Imbalanced 3D Instance Segmentation
Abstract:
While 3D instance segmentation (3DIS) has advanced significantly, existing methods typically assume that all object classes are known in advance and are uniformly distributed. However, this assumption is unrealistic in dynamic, real-world environments where new classes emerge gradually and exhibit natural imbalance. Although some approaches have addressed class emergence, they often overlook class imbalance, resulting in suboptimal performance -- particularly on rare categories. To tackle this challenge, we propose CLIMB-3D, a unified framework for class-incremental, imbalance-aware 3DIS. Building upon established exemplar replay (ER) strategies, we show that ER alone is insufficient to achieve robust performance under constrained memory conditions. To mitigate this, we introduce a novel pseudo-label generator (PLG) that extends supervision to previously learned categories by leveraging predictions from a frozen prior model. Despite its promise, PLG tends to bias towards frequent classes. Therefore, we propose a class-balanced re-weighting (CBR) scheme that estimates object frequencies from pseudo-labels and dynamically adjusts training bias -- without requiring access to past data. We design and evaluate three incremental scenarios for 3DIS on the challenging ScanNet200 dataset, and additionally on semantic segmentation on ScanNetV2. Our approach achieves state-of-the-art results, surpassing prior work by up to 16.76% mAP for instance segmentation and approximately 30% mIoU for semantic segmentation, demonstrating strong generalization across both frequent and rare classes.
中文: CLIMB-3D 提出了一种解决三维实例分割中类别不平衡和增量学习问题的统一框架,通过结合样本回放、伪标签生成器和类别平衡重加权机制,在稀有和常见类别上均实现了最先进的性能。
English: CLIMB-3D is a novel framework addressing class imbalance and incremental learning in 3D instance segmentation by combining exemplar replay with a pseudo-label generator and class-balanced re-weighting, achieving state-of-the-art performance on rare and frequent classes.
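The class-balanced re-weighting step can be sketched as inverse-frequency weights estimated from pseudo-label counts; the exact weighting formula below is an assumption, not the paper's.

```python
# Hedged sketch of class-balanced re-weighting driven by pseudo-label frequencies.
import torch

def cbr_weights(pseudo_labels: torch.Tensor, num_classes: int, smooth: float = 1.0):
    counts = torch.bincount(pseudo_labels, minlength=num_classes).float() + smooth
    weights = counts.sum() / (num_classes * counts)   # inverse frequency, mean roughly 1
    return weights

# Usage (hypothetical): loss = torch.nn.functional.cross_entropy(
#     logits, targets, weight=cbr_weights(pseudo_labels, num_classes))
```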

Authors:Runpeng Yu, Xinyin Ma, Xinchao Wang
Title: Introducing Visual Perception Token into Multimodal Large Language Model
Abstract:
To utilize visual information, Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLM still lacks the autonomous capability to control its own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of Visual Perception Token, aiming to empower MLLM with a mechanism to control its visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and even outperforms a 7B parameter model by 13.4% (from 0.624). Please check out our repo https://github.com/yu-rp/VisualPerceptionToken
中文摘要:视觉感知令牌的提出使多模态大语言模型能够自主控制其视觉感知过程,通过区域选择令牌和视觉重编码令牌实现选择性图像区域审查,从而显著提升空间推理和细粒度理解等任务的性能表现。
English Summary: The proposed Visual Perception Token empowers Multimodal Large Language Models with autonomous control over visual perception processes, enabling selective region review and enhanced visual re-encoding to significantly boost performance in spatial reasoning and fine-grained understanding tasks.

Authors:Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
Title: MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
中文摘要:研究发现多模态大语言模型在感知图像细微视觉信息方面存在不足,但其注意力机制即使回答错误时仍能准确定位关键区域;通过利用模型内部的注意力和梯度图,开发出无需训练的可视干预方法,显著提升了模型对微小视觉细节的识别准确率。
English Summary: This study reveals that Multimodal Large Language Models struggle with perceiving small visual details in images, but their attention mechanisms correctly identify relevant areas even when answering incorrectly, leading to the development of training-free intervention methods that significantly improve accuracy by leveraging the models' internal attention and gradient maps.
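One of the simplest training-free interventions described, zooming in on the region the model already attends to, can be sketched as follows; the attention-extraction interface is hypothetical, since each MLLM exposes its internals differently.

```python
# Hedged sketch: crop the image around the patch with the highest attention mass,
# then re-ask the question on the crop. `patch_attn` is assumed to be a per-patch
# attention score obtained from the MLLM's internals.
import torch

def attention_crop(image: torch.Tensor, patch_attn: torch.Tensor,
                   grid=(24, 24), keep_frac=0.4):
    """image: (C, H, W); patch_attn: (grid_h * grid_w,)."""
    h, w = image.shape[-2:]
    gh, gw = grid
    attn = patch_attn.reshape(gh, gw)
    row, col = divmod(int(attn.argmax()), gw)
    cy, cx = (row + 0.5) / gh, (col + 0.5) / gw          # normalized center of the hottest patch
    half_h, half_w = int(h * keep_frac / 2), int(w * keep_frac / 2)
    top = min(max(int(cy * h) - half_h, 0), h - 2 * half_h)
    left = min(max(int(cx * w) - half_w, 0), w - 2 * half_w)
    return image[..., top:top + 2 * half_h, left:left + 2 * half_w]
```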

Authors:Chong Cheng, Gaochao Song, Yiyang Yao, Qinzheng Zhou, Gangjian Zhang, Hao Wang
Title: Graph-Guided Scene Reconstruction from Images with 3D Gaussian Splatting
Abstract:
This paper investigates an open research challenge of reconstructing high-quality, large 3D open scenes from images. It is observed that existing methods have various limitations, such as requiring precise camera poses for input and dense viewpoints for supervision. To perform effective and efficient 3D scene reconstruction, we propose a novel graph-guided 3D scene reconstruction framework, GraphGS. Specifically, given a set of images of a scene captured by RGB cameras, we first design a spatial prior-based scene structure estimation method. This is then used to create a camera graph that includes information about the camera topology. Further, we propose to apply the graph-guided multi-view consistency constraint and adaptive sampling strategy to the 3D Gaussian Splatting optimization process. This greatly alleviates the issue of Gaussian points overfitting to specific sparse viewpoints and expedites the 3D reconstruction process. We demonstrate that GraphGS achieves high-fidelity 3D reconstruction from images, presenting state-of-the-art performance in quantitative and qualitative evaluations across multiple datasets. Project Page: https://3dagentworld.github.io/graphgs.

Authors:Tianrui Zhu, Shiyi Zhang, Jiawei Shao, Yansong Tang
Title: KV-Edit: Training-Free Image Editing for Precise Background Preservation
Abstract:
Background consistency remains a significant challenge in image editing tasks. Despite extensive developments, existing works still face a trade-off between maintaining similarity to the original image and generating content that aligns with the target. Here, we propose KV-Edit, a training-free approach that uses KV cache in DiTs to maintain background consistency, where background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, ultimately generating new content that seamlessly integrates with the background within user-provided regions. We further explore the memory consumption of the KV cache during editing and optimize the space complexity to $O(1)$ using an inversion-free method. Our approach is compatible with any DiT-based generative model without additional training. Experiments demonstrate that KV-Edit significantly outperforms existing approaches in terms of both background and image quality, even surpassing training-based methods. Project webpage is available at https://xilluill.github.io/projectpages/KV-Edit
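The core idea, reusing cached keys/values for background tokens so that only the user-selected region is regenerated, can be sketched roughly as below; this is an assumption-level illustration of a single attention layer, not the paper's DiT integration.

```python
# Hedged sketch of KV-cache-style background preservation: background tokens keep
# cached keys/values from the inversion pass, while foreground (mask == True)
# tokens use freshly computed ones.
import torch

def edit_attention(q, k_new, v_new, k_cache, v_cache, fg_mask):
    """q, k_*, v_*: (tokens, dim); fg_mask: (tokens,) bool, True = editable region."""
    k = torch.where(fg_mask[:, None], k_new, k_cache)
    v = torch.where(fg_mask[:, None], v_new, v_cache)
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```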

Authors:André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
Title: DIS-CO: Discovering Copyrighted Content in VLMs Training Data
Abstract:
How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. Our code and data are available at https://github.com/avduarte333/DIS-CO
中文: DIS-CO是一种通过向视觉语言模型输入特定帧并分析其文本补全来检测训练数据中是否包含受版权保护内容的新方法,其检测性能显著优于现有技术,并揭示所有测试模型均存在不同程度的版权内容暴露。
English: DIS-CO is a novel method that detects whether copyrighted content was used to train vision-language models by querying them with specific frames and analyzing text completions, significantly outperforming prior methods and revealing widespread exposure to such content across tested models.
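The querying loop can be pictured as repeatedly showing the VLM sampled frames and checking whether its free-form completion names the source title; `ask_vlm` below is a placeholder for whatever chat interface the model exposes, and the scoring rule is a simplification.

```python
# Hedged sketch of a DIS-CO-style membership probe: fraction of frames whose
# completion names the target title, used as a rough "seen in training" signal.
def membership_score(frames, title, ask_vlm,
                     prompt="Which film is this frame from?"):
    hits = 0
    for frame in frames:
        answer = ask_vlm(image=frame, prompt=prompt)    # free-form text completion
        hits += int(title.lower() in answer.lower())
    return hits / max(len(frames), 1)                   # higher => more likely seen in training
```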

Authors:Inbar Gat, Sigal Raab, Guy Tevet, Yuval Reshef, Amit H. Bermano, Daniel Cohen-Or
Title: AnyTop: Character Animation Diffusion with Any Topology
Abstract:
Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model's latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation and motion editing. Our webpage, https://anytop2025.github.io/Anytop-page, includes links to videos and code.
中文:AnyTop是一种扩散模型,仅通过骨骼结构输入即可为不同角色生成动作,通过将拓扑信息融入注意力机制和文本关节描述,学习跨骨架的语义对应关系,并支持动作编辑等下游任务。
English: AnyTop is a diffusion model that generates motion for various skeletons using only their structure as input, incorporating topology into attention mechanisms and textual joint descriptions to learn semantic correspondences and enable tasks like motion editing.

Authors:Federico Vasile, Elisa Maiettini, Giulia Pasquale, Nicolò Boccardo, Lorenzo Natale
Title: Continuous Wrist Control on the Hannes Prosthesis: a Vision-based Shared Autonomy Framework
Abstract:
Most control techniques for prosthetic grasping focus on dexterous finger control, but overlook the wrist motion. This forces the user to perform compensatory movements with the elbow, shoulder and hip to adapt the wrist for grasping. We propose a computer vision-based system that leverages the collaboration between the user and an automatic system in a shared autonomy framework, to perform continuous control of the wrist degrees of freedom in a prosthetic arm, promoting a more natural approach-to-grasp motion. Our pipeline allows the prosthetic wrist to be seamlessly controlled to follow the target object and finally orient it for grasping according to the user's intent. We assess the effectiveness of each system component through quantitative analysis and finally deploy our method on the Hannes prosthetic arm. Code and videos: https://hsp-iit.github.io/hannes-wrist-control.

Authors:Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
Title: VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
Abstract:
Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/
中文摘要:VideoGrain提出了一种零样本方法,通过优化时空注意力机制来提升多粒度视频编辑能力,在真实场景中实现了卓越的控制效果和顶尖性能。
English Summary: VideoGrain introduces a zero-shot method that enhances multi-grained video editing by optimizing space-time attention mechanisms, achieving superior control and state-of-the-art results in real-world applications.

Authors:Gabriele Berton, Carlo Masone
Title: MegaLoc: One Retrieval to Place Them All
Abstract:
Retrieving images from the same location as a given query is an important component of multiple computer vision tasks, like Visual Place Recognition, Landmark Retrieval, Visual Localization, 3D reconstruction, and SLAM. However, existing solutions are built to specifically work for one of these tasks, and are known to fail when the requirements slightly change or when they meet out-of-distribution data. In this paper we combine a variety of existing methods, training techniques, and datasets to train a retrieval model, called MegaLoc, that is performant on multiple tasks. We find that MegaLoc (1) achieves state-of-the-art results on a large number of Visual Place Recognition datasets, (2) achieves impressive results on common Landmark Retrieval datasets, and (3) sets a new state of the art for Visual Localization on the LaMAR datasets, where we only swapped the retrieval method into the existing localization pipeline. The code for MegaLoc is available at https://github.com/gmberton/MegaLoc
中文摘要:MegaLoc是一种多功能图像检索模型,通过整合多种现有方法和训练技术,在视觉位置识别、地标检索和视觉定位等多项计算机视觉任务中均实现了最先进的性能表现。
English Summary: MegaLoc is a versatile image retrieval model that achieves state-of-the-art performance across multiple computer vision tasks including Visual Place Recognition, Landmark Retrieval, and Visual Localization by integrating various existing methods and training techniques.

Authors:Hogun Kee, Wooseok Oh, Minjae Kang, Hyemin Ahn, Songhwai Oh
Title: Tidiness Score-Guided Monte Carlo Tree Search for Visual Tabletop Rearrangement
Abstract:
In this paper, we present the tidiness score-guided Monte Carlo tree search (TSMCTS), a novel framework designed to address the tabletop tidying up problem using only an RGB-D camera. We address two major problems in tabletop tidying: (1) the lack of public datasets and benchmarks, and (2) the difficulty of specifying the goal configuration of unseen objects. We address the former by presenting the tabletop tidying up (TTU) dataset, a structured dataset collected in simulation. Using this dataset, we train a vision-based discriminator capable of predicting the tidiness score. This discriminator can consistently evaluate the degree of tidiness across unseen configurations, including real-world scenes. Addressing the second problem, we employ Monte Carlo tree search (MCTS) to find tidying trajectories without specifying explicit goals. Instead of providing specific goals, we demonstrate that our MCTS-based planner can find diverse tidied configurations using the tidiness score as guidance. Consequently, we propose TSMCTS, which integrates a tidiness discriminator with an MCTS-based tidying planner to find optimal tidied arrangements. TSMCTS has successfully demonstrated its capability across various environments, including coffee tables, dining tables, office desks, and bathrooms. The TTU dataset is available at: https://github.com/rllab-snu/TTU-Dataset.
中文: 本文提出TSMCTS新框架,通过结合整洁度评分判别器与蒙特卡洛树搜索,仅需RGB-D相机即可解决桌面整理问题,并在多种真实场景中成功验证了其有效性。
English: This paper introduces TSMCTS, a novel framework that combines a tidiness score discriminator with Monte Carlo tree search to solve tabletop tidying problems using only RGB-D camera input, successfully demonstrating its effectiveness across multiple real-world environments.

Authors:Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
Title: DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Abstract:
Our primary goal here is to create a good, generalist perception model that can tackle multiple tasks, within limits on computational resources and training data. To achieve this, we resort to text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluations demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. Thus, DICEPTION can be efficiently trained at a cost orders of magnitude lower than that of conventional models trained from scratch. When adapting our model to other tasks, it only requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.
中文: 本文提出了DICEPTION视觉通用模型,通过利用预训练扩散模型并保留其先验知识,能以极少数据和计算资源高效处理多种感知任务。
English: This paper introduces DICEPTION, a robust visual generalist model that efficiently tackles multiple perception tasks with minimal data and computational resources by leveraging pre-trained diffusion models while preserving their prior knowledge.

Authors:Zekun Wang, Mingyang Yi, Shuchen Xue, Zhenguo Li, Ming Liu, Bing Qin, Zhi-Ming Ma
Title: Improved Diffusion-based Generative Model with Better Adversarial Robustness
Abstract:
Diffusion Probabilistic Models (DPMs) have achieved significant success in generative tasks. However, their training and sampling processes suffer from the issue of distribution mismatch. During the denoising process, the input data distributions differ between the training and inference stages, potentially leading to inaccurate data generation. To obviate this, we analyze the training objective of DPMs and theoretically demonstrate that this mismatch can be alleviated through Distributionally Robust Optimization (DRO), which is equivalent to performing robustness-driven Adversarial Training (AT) on DPMs. Furthermore, for the recently proposed Consistency Model (CM), which distills the inference process of the DPM, we prove that its training objective also encounters the mismatch issue. Fortunately, this issue can be mitigated by AT as well. Based on these insights, we propose to conduct efficient AT on both DPM and CM. Finally, extensive empirical studies validate the effectiveness of AT in diffusion-based models. The code is available at https://github.com/kugwzk/AT_Diff.
中文摘要:扩散概率模型在训练与推理阶段存在分布不匹配问题,通过对抗性训练可有效缓解该问题,提升生成数据的准确性。
English Summary: Diffusion Probabilistic Models face distribution mismatch during training and inference, which can be effectively addressed through Adversarial Training to improve generation accuracy.

Authors:Boris Shirokikh, Anvar Kurmukov, Mariia Donskova, Valentin Samokhin, Mikhail Belyaev, Ivan Oseledets
Title: M3DA: Benchmark for Unsupervised Domain Adaptation in 3D Medical Image Segmentation
Abstract:
Domain shift presents a significant challenge in applying Deep Learning to the segmentation of 3D medical images from sources like Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). Although numerous Domain Adaptation methods have been developed to address this issue, they are often evaluated under impractical data shift scenarios. Specifically, the medical imaging datasets used are often either private, too small for robust training and evaluation, or limited to single or synthetic tasks. To overcome these limitations, we introduce the M3DA (pronounced /ˈmɛdə/) benchmark, comprising four publicly available, multiclass segmentation datasets. We have designed eight domain pairs featuring diverse and practically relevant distribution shifts. These include inter-modality shifts between MRI and CT and intra-modality shifts among various MRI acquisition parameters, different CT radiation doses, and presence/absence of contrast enhancement in images. Within the proposed benchmark, we evaluate more than ten existing domain adaptation methods. Our results show that none of them can consistently close the performance gap between the domains. For instance, the most effective method reduces the performance gap by about 62% across the tasks. This highlights the need for developing novel domain adaptation algorithms to enhance the robustness and scalability of deep learning models in medical imaging. We made our M3DA benchmark publicly available: https://github.com/BorisShirokikh/M3DA.
中文: M3DA基准通过引入八个具有现实分布差异的多样化领域对,解决了当前三维医学图像分割领域适应方法的局限性,结果表明现有技术无法持续缩小领域间性能差距,凸显了开发更强健算法的必要性。
English: The M3DA benchmark addresses limitations in current domain adaptation methods for 3D medical image segmentation by introducing eight diverse domain pairs with realistic shifts, revealing that existing techniques fail to consistently bridge performance gaps and highlighting the need for more robust algorithms.

Authors:Maksim Zhdanov, Max Welling, Jan-Willem van de Meent
Title: Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems
Abstract:
Large-scale physical systems defined on irregular grids pose significant scalability challenges for deep learning methods, especially in the presence of long-range interactions and multi-scale coupling. Traditional approaches that compute all pairwise interactions, such as attention, become computationally prohibitive as they scale quadratically with the number of nodes. We present Erwin, a hierarchical transformer inspired by methods from computational many-body physics, which combines the efficiency of tree-based algorithms with the expressivity of attention mechanisms. Erwin employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size. Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, it captures both fine-grained local details and global features. We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, PDE solving, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.
中文摘要:Erwin是一种结合树算法效率与注意力机制表达力的分层变换器,通过球树分区实现线性时间计算,在保持多尺度交互能力的同时,在多个物理系统领域展现出卓越性能。
English summary: Erwin is a hierarchical transformer that combines tree-based efficiency with attention expressivity, enabling linear-time computation while capturing multi-scale interactions across diverse physical systems.
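The linear-time attention trick, restricting attention to fixed-size groups induced by the ball tree, can be sketched as follows; the ball-tree ordering itself is assumed to be given, and padding tokens are not masked for brevity.

```python
# Hedged sketch of grouped (local) attention over nodes already sorted by
# ball-tree leaf order, giving linear cost in the number of nodes.
import torch

def grouped_attention(x: torch.Tensor, group_size: int = 64):
    """x: (num_nodes, dim) features in ball-tree order."""
    n, d = x.shape
    pad = (-n) % group_size
    x = torch.cat([x, x.new_zeros(pad, d)])              # pad to a whole number of groups
    groups = x.view(-1, group_size, d)                   # (num_groups, group_size, dim)
    attn = torch.softmax(groups @ groups.transpose(1, 2) / d ** 0.5, dim=-1)
    return (attn @ groups).reshape(-1, d)[:n]            # drop the padding rows
```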

Authors:Valentin Wagner, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens
Title: Semantic Neural Radiance Fields for Multi-Date Satellite Data
Abstract:
In this work we propose a satellite-specific Neural Radiance Fields (NeRF) model capable of obtaining a three-dimensional semantic representation (neural semantic field) of the scene. The model derives the output from a set of multi-date satellite images with corresponding pixel-wise semantic labels. We demonstrate the robustness of our approach and its capability to improve noisy input labels. We enhance the color prediction by utilizing the semantic information to address temporal image inconsistencies caused by non-stationary categories such as vehicles. To facilitate further research in this domain, we present a dataset comprising manually generated labels for popular multi-view satellite images. Our code and dataset are available at https://github.com/wagnva/semantic-nerf-for-satellite-data.
中文: 本研究提出了一种卫星专用的神经辐射场模型,利用多时相卫星图像构建三维语义场景,通过语义信息提升标签精度并解决因动态物体导致的时序不一致问题。
English: This study introduces a satellite-specific Neural Radiance Fields model that constructs a 3D semantic representation from multi-date satellite images, enhancing label accuracy and addressing temporal inconsistencies through semantic integration.

Authors:Nico Heider, Lorenz Gunreben, Sebastian Zürner, Martin Schieck
Title: A survey of datasets for computer vision in agriculture
Abstract:
In agricultural research, there has been a recent surge in the amount of Computer Vision (CV)-focused work. But unlike general CV research, large high-quality public datasets are sparsely available. This can be partially attributed to the high variability between different agricultural tasks, crops and environments, as well as the complexity of data collection, but it is also influenced by the reticence of many authors to publish their datasets. This, together with the lack of a widely used agricultural data repository, hinders research in applied CV for agriculture as well as the usage of agricultural data in general-purpose CV research. In this survey, we compile a large number of high-quality datasets of images taken on agricultural fields. Overall, we find 45 datasets, which are listed in this paper as well as in an online catalog on the project website: https://smartfarminglab.github.io/field_dataset_survey/.

Authors:Farzad Beizaee, Gregory Lodygensky, Christian Desrosiers, Jose Dolz
Title: MAD-AD: Masked Diffusion for Unsupervised Brain Anomaly Detection
Abstract:
Unsupervised anomaly detection in brain images is crucial for identifying injuries and pathologies without access to labels. However, the accurate localization of anomalies in medical images remains challenging due to the inherent complexity and variability of brain structures and the scarcity of annotated abnormal data. To address this challenge, we propose a novel approach that incorporates masking within diffusion models, leveraging their generative capabilities to learn robust representations of normal brain anatomy. During training, our model processes only normal brain MRI scans and performs a forward diffusion process in the latent space that adds noise to the features of randomly-selected patches. Following a dual objective, the model learns to identify which patches are noisy and recover their original features. This strategy ensures that the model captures intricate patterns of normal brain structures while isolating potential anomalies as noise in the latent space. At inference, the model identifies noisy patches corresponding to anomalies and generates a normal counterpart for these patches by applying a reverse diffusion process. Our method surpasses existing unsupervised anomaly detection techniques, demonstrating superior performance in generating accurate normal counterparts and localizing anomalies. The code is available at https://github.com/farzad-bz/MAD-AD.
中文: 本研究提出了一种基于掩码扩散模型的无监督脑部MRI异常检测新方法,通过在潜在空间中对图像块进行噪声添加与恢复来学习正常解剖结构的鲁棒表征,从而有效定位异常区域。
English: This study introduces a novel unsupervised anomaly detection method for brain MRI scans using masked diffusion models, which effectively localizes anomalies by learning robust representations of normal anatomy through noise addition and recovery in latent space patches.
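The masked-patch noising used during training can be sketched roughly as below; the patch size, noising ratio, and latent layout are assumptions for illustration.

```python
# Hedged sketch: add Gaussian noise only to randomly selected latent patches and
# return the patch mask, so a model can be trained both to flag noisy patches and
# to recover their original features. Assumes H and W are divisible by `patch`.
import torch

def masked_noise_step(latent: torch.Tensor, patch: int = 8, ratio: float = 0.3):
    """latent: (C, H, W) feature map; returns (noised latent, boolean patch mask)."""
    c, h, w = latent.shape
    gh, gw = h // patch, w // patch
    mask = torch.rand(gh, gw) < ratio                                   # True = patch to noise
    mask_full = mask.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    noised = torch.where(mask_full, latent + torch.randn_like(latent), latent)
    return noised, mask_full
```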

Authors:Jiehao Luo, Jintao Cheng, Xiaoyu Tang, Qingwen Zhang, Bohuan Xue, Rui Fan
Title: MambaFlow: A Novel and Flow-guided State Space Model for Scene Flow Estimation
Abstract:
Scene flow estimation aims to predict 3D motion from consecutive point cloud frames, which is of great interest in the autonomous driving field. Existing methods face challenges such as insufficient spatio-temporal modeling and the inherent loss of fine-grained features during voxelization. However, the success of Mamba, a representative state space model (SSM) that enables global modeling with linear complexity, provides a promising solution. In this paper, we propose MambaFlow, a novel scene flow estimation network with a Mamba-based decoder. It enables deep interaction and coupling of spatio-temporal features using a well-designed backbone. Innovatively, we steer the global attention modeling of voxel-based features with point offset information using an efficient Mamba-based decoder, learning voxel-to-point patterns that are used to devoxelize shared voxel representations into point-wise features. To further enhance the model's generalization capabilities across diverse scenarios, we propose a novel scene-adaptive loss function that automatically adapts to different motion patterns. Extensive experiments on the Argoverse 2 benchmark demonstrate that MambaFlow achieves state-of-the-art performance with real-time inference speed among existing works, enabling accurate flow estimation in real-world urban scenarios. The code is available at https://github.com/SCNU-RISLAB/MambaFlow.
Chinese: MambaFlow提出了一种基于Mamba解码器的新型场景流估计网络,通过深度时空特征交互和自适应损失函数,在Argoverse 2基准测试中实现了最优性能并保持实时推理速度。
English: MambaFlow introduces a novel scene flow estimation network with a Mamba-based decoder that enables deep spatio-temporal feature interaction and achieves state-of-the-art performance on the Argoverse 2 benchmark with real-time inference.

Authors:Suchae Jeong, Inseong Choi, Youngsik Yun, Jihie Kim
Title: Culture-TRIP: Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement
Abstract:
Text-to-Image models, including Stable Diffusion, have significantly improved in generating images that are highly semantically aligned with the given prompts. However, existing models may fail to produce appropriate images for the cultural concepts or objects that are not well known or underrepresented in western cultures, such as `hangari' (Korean utensil). In this paper, we propose a novel approach, Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement (Culture-TRIP), which refines the prompt in order to improve the alignment of the image with such culture nouns in text-to-image models. Our approach (1) retrieves cultural contexts and visual details related to the culture nouns in the prompt and (2) iteratively refines and evaluates the prompt based on a set of cultural criteria and large language models. The refinement process utilizes the information retrieved from Wikipedia and the Web. Our user survey, conducted with 66 participants from eight different countries demonstrates that our proposed approach enhances the alignment between the images and the prompts. In particular, C-TRIP demonstrates improved alignment between the generated images and underrepresented culture nouns. Resource can be found at https://shane3606.github.io/Culture-TRIP.
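The iterative refinement loop can be outlined as follows; `retrieve_context`, `refine`, and `evaluate` are placeholders standing in for the paper's Wikipedia/Web retrieval and LLM-based cultural criteria, which are not reproduced here.

```python
# Hedged sketch of an iterative prompt-refinement loop in the spirit of Culture-TRIP.
def refine_prompt(prompt, culture_nouns, retrieve_context, refine, evaluate,
                  max_iters=3):
    context = {noun: retrieve_context(noun) for noun in culture_nouns}
    for _ in range(max_iters):
        candidate = refine(prompt, context)        # rewrite with cultural/visual details
        if evaluate(candidate, culture_nouns):     # passes the cultural criteria?
            return candidate
        prompt = candidate
    return prompt
```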

Authors:Meilu Zhu, Qiushi Yang, Zhifan Gao, Yixuan Yuan, Jun Liu
Title: FedBM: Stealing Knowledge from Pre-trained Language Models for Heterogeneous Federated Learning
Abstract:
Federated learning (FL) has shown great potential in medical image computing since it provides a decentralized learning paradigm that allows multiple clients to train a model collaboratively without privacy leakage. However, current studies have shown that data heterogeneity incurs local learning bias in classifiers and feature extractors of client models during local training, leading to the performance degradation of a federation system. To address these issues, we propose a novel framework called Federated Bias eliMinating (FedBM) to get rid of local learning bias in heterogeneous federated learning (FL), which mainly consists of two modules, i.e., Linguistic Knowledge-based Classifier Construction (LKCC) and Concept-guided Global Distribution Estimation (CGDE). Specifically, LKCC exploits class concepts, prompts and pre-trained language models (PLMs) to obtain concept embeddings. These embeddings are used to estimate the latent concept distribution of each class in the linguistic space. Based on the theoretical derivation, we can rely on these distributions to pre-construct a high-quality classifier for clients to achieve classification optimization, which is frozen to avoid classifier bias during local training. CGDE samples probabilistic concept embeddings from the latent concept distributions to learn a conditional generator to capture the input space of the global model. Three regularization terms are introduced to improve the quality and utility of the generator. The generator is shared by all clients and produces pseudo data to calibrate updates of local feature extractors. Extensive comparison experiments and ablation studies on public datasets demonstrate the superior performance of FedBM over state-of-the-arts and confirm the effectiveness of each module, respectively. The code is available at https://github.com/CUHK-AIM-Group/FedBM.
中文: 联邦学习在医学影像中因数据异构性导致性能下降,FedBM框架通过基于语言知识的分类器构建和概念引导的分布估计生成伪数据,有效消除局部学习偏差并提升系统性能。
English: Federated learning in medical imaging faces performance issues due to data heterogeneity, which the proposed FedBM framework addresses by using linguistic knowledge to construct unbiased classifiers and concept-guided distribution estimation to calibrate local feature extractors with pseudo data.

Authors:Zijing Zhao, Jianlong Yu, Lin Zhang, Shunli Zhang
Title: CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization
Abstract:
Multi-object tracking under low-light environments is prevalent in real life. Recent years have seen rapid development in the field of multi-object tracking. However, due to the lack of datasets and the high cost of annotations, multi-object tracking under low-light environments remains a persistent challenge. In this paper, we focus on multi-object tracking under low-light conditions. To address the issues of limited data and the lack of a dedicated dataset, we first constructed a low-light multi-object tracking dataset (LLMOT). This dataset comprises data from MOT17 that has been enhanced for nighttime conditions as well as multiple unannotated low-light videos. Subsequently, to tackle the high annotation costs and address the issue of image quality degradation, we propose a semi-supervised multi-object tracking method based on consistency regularization named CRTrack. First, we calibrate a consistent adaptive sampling assignment to replace the static IoU-based strategy, enabling the semi-supervised tracking method to resist noisy pseudo-bounding boxes. Then, we design an adaptive semi-supervised network update method, which effectively leverages unannotated data to enhance model performance. Dataset and Code: https://github.com/ZJZhao123/CRTrack.
中文: 本文针对低光环境下多目标跟踪的数据稀缺和标注成本高的问题,构建了低光多目标跟踪数据集LLMOT,并提出基于一致性正则化的半监督方法CRTrack,通过自适应采样和网络更新有效利用未标注数据提升模型性能。
English: This paper addresses the challenges of multi-object tracking in low-light conditions by introducing a new dataset (LLMOT) and proposing CRTrack, a semi-supervised method that uses consistency regularization to handle noisy data and improve performance with unannotated videos.

Authors:Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong
Title: SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding
Abstract:
Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. Therefore, to address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve the alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion by cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features from shallow to deep layers gradually. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable abilities and considerable benefits in terms of efficiency. Our code is available at https://github.com/liuting20/SwimVG.
中文摘要:提出的SwimVG框架通过逐步多模态提示和跨模态交互适配器,以参数高效的方式增强视觉与语言的深层融合,在多个基准测试中实现了卓越性能并显著提升计算效率。
English Summary: The proposed SwimVG framework introduces step-wise multimodal prompts and cross-modal interactive adapters to enhance visual-linguistic alignment efficiently, achieving superior performance on benchmarks while reducing computational costs.

Authors:Yaxuan Huang, Xili Dai, Jianan Wang, Xianbiao Qi, Yixing Yuan, Xiangyu Yue
Title: Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model
Abstract:
Room layout estimation from multiple-perspective images is poorly investigated due to the complexities that emerge from multi-view geometry, which requires multi-step solutions such as camera intrinsic and extrinsic estimation, image matching, and triangulation. However, in 3D reconstruction, the advancement of recent 3D foundation models such as DUSt3R has shifted the paradigm from the traditional multi-step structure-from-motion process to an end-to-end single-step approach. To this end, we introduce Plane-DUSt3R, a novel method for multi-view room layout estimation leveraging the 3D foundation model DUSt3R. Plane-DUSt3R incorporates the DUSt3R framework and fine-tunes on a room layout dataset (Structure3D) with a modified objective to estimate structural planes. By generating uniform and parsimonious results, Plane-DUSt3R enables room layout estimation with only a single post-processing step and 2D detection results. Unlike previous methods that rely on single-perspective or panorama images, Plane-DUSt3R extends the setting to handle multiple-perspective images. Moreover, it offers a streamlined, end-to-end solution that simplifies the process and reduces error accumulation. Experimental results demonstrate that Plane-DUSt3R not only outperforms state-of-the-art methods on the synthetic dataset but also proves robust and effective on in-the-wild data with different image styles such as cartoons. Our code is available at: https://github.com/justacar/Plane-DUSt3R
Chinese: Plane-DUSt3R 利用三维基础模型 DUSt3R,提出了一种端到端的多视角房间布局估计方法,通过单一后处理步骤简化流程,并在合成与真实数据上均超越了现有最优方法。
English: Plane-DUSt3R introduces an end-to-end method for multi-view room layout estimation by leveraging the 3D foundation model DUSt3R, streamlining the process with a single post-processing step and outperforming state-of-the-art methods on both synthetic and real-world data.

Authors:Rui Li, Xiaowei Zhao
Title: AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation
Abstract:
As a novel and challenging task, referring segmentation combines computer vision and natural language processing to localize and segment objects based on textual descriptions. While referring image segmentation (RIS) has been extensively studied in natural images, little attention has been given to aerial imagery, particularly from unmanned aerial vehicles (UAVs). The unique challenges of UAV imagery, including complex spatial scales, occlusions, and varying object orientations, render existing RIS approaches ineffective. A key limitation has been the lack of UAV-specific datasets, as manually annotating pixel-level masks and generating textual descriptions is labour-intensive and time-consuming. To address this gap, we design an automatic labelling pipeline that leverages pre-existing UAV segmentation datasets and Multimodal Large Language Models (MLLM) for generating textual descriptions. Furthermore, we propose Aerial Referring Transformer (AeroReformer), a novel framework for UAV referring image segmentation (UAV-RIS), featuring a Vision-Language Cross-Attention Module (VLCAM) for effective cross-modal understanding and a Rotation-Aware Multi-Scale Fusion (RAMSF) decoder to enhance segmentation accuracy in aerial scenes. Extensive experiments on two newly developed datasets demonstrate the superiority of AeroReformer over existing methods, establishing a new benchmark for UAV-RIS. The datasets and code will be publicly available at: https://github.com/lironui/AeroReformer.
中文: 本文提出了AeroReformer这一无人机指代图像分割新框架,通过跨模态注意力模块和旋转感知解码器解决航空影像的特殊挑战,在新建数据集上实现了最优性能。
English: This paper introduces AeroReformer, a novel framework for UAV referring image segmentation that addresses the unique challenges of aerial imagery through a cross-modal attention module and a rotation-aware decoder, achieving state-of-the-art performance on newly developed datasets.

Authors:Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan
Title: Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
Abstract:
Despite significant progress on popular multimodal benchmarks, state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle with basic visual reasoning tasks that are trivially solved by humans, such as recognizing spatial relationships. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. These subtests span four core domains of human visual cognition: (1) Visualization and Spatial Processing, (2) Perceptual and Closure, (3) Memory, and (4) Reasoning. We evaluate 20 frontier MLLMs from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that current MLLM performance gains on high-level benchmarks do not reflect human-like low-level visual cognition, challenging the assumption that large-scale pretraining naturally induces gestalt-like perceptual capabilities. The dataset and evaluation toolkit are publicly available at: https://github.com/CUHK-ARISE/VisFactor.
Chinese: VisFactor基准测试显示,当前多模态大语言模型在人类基础视觉推理任务上表现不佳,评估了20个模型在四个核心视觉认知领域的表现,最高得分仅为25.19分(满分100分),远低于人类水平。
English: Current Multimodal Large Language Models perform poorly on basic visual reasoning tasks compared to humans, as revealed by the VisFactor benchmark, which evaluates 20 models across four core visual cognition domains and finds the highest score to be only 25.19 out of 100.

Authors:Kaibin Zhou, Kaifeng Huang, Hao Deng, Zelin Tao, Ziniu Liu, Lin Zhang, Shengjie Zhao
Title: Learning from Rendering: Realistic and Controllable Extreme Rainy Image Synthesis for Autonomous Driving Simulation
Abstract:
Autonomous driving simulators provide an effective and low-cost alternative for evaluating or enhancing visual perception models. However, the reliability of evaluation depends on the diversity and realism of the generated scenes. Extreme weather conditions, particularly extreme rainfalls, are rare and costly to capture in real-world settings. While simulated environments can help address this limitation, existing rainy image synthesizers often suffer from poor controllability over illumination and limited realism, which significantly undermines the effectiveness of the model evaluation. To that end, we propose a learning-from-rendering rainy image synthesizer, which combines the benefits of the realism of rendering-based methods and the controllability of learning-based methods. To validate the effectiveness of our extreme rainy image synthesizer on the semantic segmentation task, we require a continuous set of well-labeled extreme rainy images. By integrating the proposed synthesizer with the CARLA driving simulator, we develop CARLARain, an extreme rainy street scene simulator that can obtain paired rainy-clean images and labels under complex illumination conditions. Qualitative and quantitative experiments validate that CARLARain can effectively improve the accuracy of semantic segmentation models in extreme rainy scenes, with the models' accuracy (mIoU) improved by 5% - 8% on the synthetic dataset and significantly enhanced in real extreme rainy scenarios under complex illuminations. Our source code and datasets are available at https://github.com/kb824999404/CARLARain/.
中文: 自动驾驶模拟器为视觉感知模型的评估提供了低成本方案,但其可靠性依赖于场景的真实性,尤其针对极端降雨等罕见天气;现有合成器在可控性和真实性上不足,而提出的CARLARain模拟器融合渲染真实性和学习可控性,能生成配对的雨天-晴天图像,将语义分割模型在极端降雨场景下的准确率提升5%-8%。
English: Autonomous driving simulators are cost-effective for testing perception models, but their reliability depends on scene realism, especially for rare extreme weather like rainfall, which existing synthesizers handle poorly; the proposed CARLARain simulator combines rendering realism with learning-based controllability to generate paired rainy-clean images, improving semantic segmentation accuracy by 5%-8% in synthetic and real extreme rainy conditions.

Authors:Jianbin Jiao, Xina Cheng, Kailun Yang, Xiangrong Zhang, Licheng Jiao
Title: DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion
Abstract:
3D human pose estimation has wide applications in fields such as intelligent surveillance, motion capture, and virtual reality. However, in real-world scenarios, issues such as occlusion, noise interference, and missing viewpoints can severely affect pose estimation. To address these challenges, we introduce the task of Deficiency-Aware 3D Pose Estimation. Traditional 3D pose estimation methods often rely on multi-stage networks and modular combinations, which can lead to cumulative errors and increased training complexity, making them unable to effectively address deficiency-aware estimation. To this end, we propose DeProPose, a flexible method that simplifies the network architecture to reduce training complexity and avoid information loss in multi-stage designs. Additionally, the model innovatively introduces a multi-view feature fusion mechanism based on relative projection error, which effectively utilizes information from multiple viewpoints and dynamically assigns weights, enabling efficient integration and enhanced robustness to overcome deficiency-aware 3D Pose Estimation challenges. Furthermore, to thoroughly evaluate this end-to-end multi-view 3D human pose estimation model and to advance research on occlusion-related challenges, we have developed a novel 3D human pose estimation dataset, termed the Deficiency-Aware 3D Pose Estimation (DA-3DPE) dataset. This dataset encompasses a wide range of deficiency scenarios, including noise interference, missing viewpoints, and occlusion challenges. Compared to state-of-the-art methods, DeProPose not only excels in addressing the deficiency-aware problem but also shows improvement in conventional scenarios, providing a powerful and user-friendly solution for 3D human pose estimation. The source code will be available at https://github.com/WUJINHUAN/DeProPose.
中文: DeProPose 提出了一种简化网络结构和多视角特征融合机制的方法,有效应对缺陷感知的3D姿态估计问题,其性能优于现有技术,并配有专门针对遮挡和噪声场景的新数据集。
English: DeProPose introduces a simplified network with a multi-view feature fusion mechanism to address deficiency-aware 3D pose estimation challenges, outperforming existing methods and supported by a novel dataset for occlusion and noise scenarios.
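The adaptive multi-view fusion described above can be illustrated with a small sketch: per-view features are combined with weights derived from each view's relative projection error, so occluded or noisy views contribute less. The softmax-over-negative-errors rule and the temperature below are assumptions for illustration only.

```python
import torch

def fuse_views(view_feats, proj_errors, temperature=1.0):
    """view_feats: [V, D] per-view features; proj_errors: [V] relative projection errors."""
    # lower projection error -> larger weight (softmax over negative errors)
    weights = torch.softmax(-proj_errors / temperature, dim=0)   # [V]
    return (weights.unsqueeze(-1) * view_feats).sum(dim=0)       # [D]

# toy usage: a heavily occluded view (large error) is down-weighted
feats = torch.randn(4, 256)
errors = torch.tensor([0.1, 0.2, 5.0, 0.15])
fused = fuse_views(feats, errors)
```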

Authors:Xichen Xu, Yanshu Wang, Yawen Huang, Jiaqi Liu, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Title: A Survey on Industrial Anomalies Synthesis
Abstract:
This paper comprehensively reviews anomaly synthesis methodologies. Existing surveys focus on limited techniques, missing an overall field view and understanding method interconnections. In contrast, our study offers a unified review, covering about 40 representative methods across Hand-crafted, Distribution-hypothesis-based, Generative models (GM)-based, and Vision-language models (VLM)-based synthesis. We introduce the first industrial anomaly synthesis (IAS) taxonomy. Prior works lack formal classification or use simplistic taxonomies, hampering structured comparisons and trend identification. Our taxonomy provides a fine-grained framework reflecting methodological progress and practical implications, grounding future research. Furthermore, we explore cross-modality synthesis and large-scale VLM. Previous surveys overlooked multimodal data and VLM in anomaly synthesis, limiting insights into their advantages. Our survey analyzes their integration, benefits, challenges, and prospects, offering a roadmap to boost IAS with multimodal learning. More resources are available at https://github.com/M-3LAB/awesome-anomaly-synthesis.
中文: 本文对异常合成方法进行了统一评述,首次提出工业分类体系并探索跨模态技术,为该领域的未来发展提供了系统框架。
English: This paper presents a unified review of anomaly synthesis methods, introducing the first industrial taxonomy and exploring cross-modality techniques to advance the field.

Authors:Kyungbok Lee, You Zhang, Zhiyao Duan
Title: Audio Visual Segmentation Through Text Embeddings
Abstract:
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt to overcome the challenge of limited data by leveraging the vision foundation model, Segment Anything Model (SAM), prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden on understanding visual modality by utilizing knowledge of pre-trained SAM, it does not address the fundamental challenge of learning audio-visual correspondence with limited data. To address this limitation, we propose \textbf{AV2T-SAM}, a novel framework that bridges audio features with the text embedding space of pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $\mathbf{\textit{\textbf{f}}_{CLIP} \odot \textit{\textbf{f}}_{CLAP}}$, which emphasizes shared semantics of audio and visual modalities while filtering irrelevant noise. Our approach outperforms existing methods on the AVSBench dataset by effectively utilizing pre-trained segmentation models and cross-modal semantic alignment. The source code is released at https://github.com/bok-bok/AV2T-SAM.
中文:提出的AV2T-SAM框架通过将音频特征与SAM的文本嵌入空间相连接并利用跨模态语义对齐,有效提升了音频-视觉分割性能,在AVSBench数据集上超越了现有方法。
English: The proposed AV2T-SAM framework enhances audio-visual segmentation by bridging audio features with the text embedding space of SAM and leveraging cross-modal semantic alignment, outperforming existing methods on the AVSBench dataset.
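The f_CLIP ⊙ f_CLAP feature from the abstract can be sketched as an element-wise product of visual and audio embeddings after projecting both into a shared space, which keeps directions the two modalities agree on and damps modality-specific noise. The projection layers and dimensions below are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSemanticFeature(nn.Module):
    """Sketch of f_CLIP ⊙ f_CLAP: emphasize semantics shared by audio and vision."""
    def __init__(self, clip_dim=768, clap_dim=512, shared_dim=256):
        super().__init__()
        self.proj_v = nn.Linear(clip_dim, shared_dim)
        self.proj_a = nn.Linear(clap_dim, shared_dim)

    def forward(self, f_clip, f_clap):
        v = F.normalize(self.proj_v(f_clip), dim=-1)
        a = F.normalize(self.proj_a(f_clap), dim=-1)
        return v * a   # element-wise product keeps components both modalities agree on

# usage with precomputed frame/audio embeddings
feat = SharedSemanticFeature()(torch.randn(2, 768), torch.randn(2, 512))
```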

Authors:Tuan-Anh Yang, Truong-Son Hy, Phuong D. Dao
Title: MOB-GCN: A Novel Multiscale Object-Based Graph Neural Network for Hyperspectral Image Classification
Abstract:
This paper introduces a novel multiscale object-based graph neural network called MOB-GCN for hyperspectral image (HSI) classification. The central aim of this study is to enhance feature extraction and classification performance by utilizing multiscale object-based image analysis (OBIA). Traditional pixel-based methods often suffer from low accuracy and speckle noise, while single-scale OBIA approaches may overlook crucial information of image objects at different levels of detail. MOB-GCN addresses this issue by extracting and integrating features from multiple segmentation scales to improve classification results using the Multiresolution Graph Network (MGN) architecture that can model fine-grained and global spatial patterns. By constructing a dynamic multiscale graph hierarchy, MOB-GCN offers a more comprehensive understanding of the intricate details and global context of HSIs. Experimental results demonstrate that MOB-GCN consistently outperforms single-scale graph convolutional networks (GCNs) in terms of classification accuracy, computational efficiency, and noise reduction, particularly when labeled data is limited. The implementation of MOB-GCN is publicly available at https://github.com/HySonLab/MultiscaleHSI
中文: 本文提出MOB-GCN多尺度对象图神经网络,通过整合多尺度分割特征改进高光谱图像分类,在精度和效率上均优于单尺度方法。
English: This paper presents MOB-GCN, a multiscale object-based graph neural network that enhances hyperspectral image classification by integrating features from multiple segmentation scales, demonstrating superior accuracy and efficiency over single-scale methods.

Authors:Ziwei Shan, Yaoyu He, Chengfeng Zhao, Jiashen Du, Jingyan Zhang, Qixuan Zhang, Jingyi Yu, Lan Xu
Title: Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens
Abstract:
Human bodily movements convey critical insights into action intentions and cognitive processes, yet existing multimodal systems primarily focused on understanding human motion via language, vision, and audio, which struggle to capture the dynamic forces and torques inherent in 3D motion. Inertial measurement units (IMUs) present a promising alternative, offering lightweight, wearable, and privacy-conscious motion sensing. However, processing of streaming IMU data faces challenges such as wireless transmission instability, sensor noise, and drift, limiting their utility for long-term real-time motion capture (MoCap), and more importantly, online motion analysis. To address these challenges, we introduce Mojito, an intelligent motion agent that integrates inertial sensing with large language models (LLMs) for interactive motion capture and behavioral analysis.
中文摘要:该摘要提出Mojito智能运动代理,通过融合惯性传感与大语言模型,解决了传统方法难以捕捉三维运动动态力并实现实时交互式运动分析的局限性。
English Summary: The abstract introduces Mojito, an intelligent motion agent that combines inertial sensing with large language models to overcome limitations in capturing dynamic 3D motion forces and enable interactive motion analysis.

Authors:Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai
Title: OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
Abstract:
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a unified framework. Central to our approach is the proposed Structured-Points-of-Thought (SPOT) prompting schemas, which improves model performance across diverse scenarios by leveraging a unified encoder-decoder architecture, objective, and input\&output representation. SPOT eliminates the need for task-specific architectures and loss functions, significantly simplifying the processing pipeline. Our extensive evaluations across four tasks on eight different datasets show that OmniParser V2 achieves state-of-the-art or competitive results in VsTP. Additionally, we explore the integration of SPOT within a multimodal large language model structure, further enhancing text localization and recognition capabilities, thereby confirming the generality of SPOT prompting technique. The code is available at \href{https://github.com/AlibabaResearch/AdvancedLiterateMachinery}{AdvancedLiterateMachinery}.
中文摘要:OmniParser V2通过提出的结构化思维点(SPOT)提示模式,将视觉文本解析中的多个任务统一到单一框架中,简化了处理流程并在多个数据集上取得了领先或竞争性的性能表现。
English Summary: OmniParser V2 introduces a unified framework using Structured-Points-of-Thought (SPOT) prompting to simplify visually-situated text parsing by integrating multiple tasks into a single model, achieving state-of-the-art performance across various datasets.

Authors:Feng Liu, Hanyang Wang, Siyuan Shen
Title: Robust Dynamic Facial Expression Recognition
Abstract:
The study of Dynamic Facial Expression Recognition (DFER) is a nascent field of research that involves the automated recognition of facial expressions in video data. Although existing research has primarily focused on learning representations under noisy and hard samples, the issue of the coexistence of both types of samples remains unresolved. In order to overcome this challenge, this paper proposes a robust method of distinguishing between hard and noisy samples. This is achieved by evaluating the prediction agreement of the model on different sampled clips of the video. Subsequently, methodologies that reinforce the learning of hard samples and mitigate the impact of noisy samples can be employed. Moreover, to identify the principal expression in a video and enhance the model's capacity for representation learning, a method comprising a key expression re-sampling framework and a dual-stream hierarchical network, namely Robust Dynamic Facial Expression Recognition (RDFER), is proposed. The key expression re-sampling framework is designed to identify the key expression, thereby mitigating the potential confusion caused by non-target expressions. RDFER employs two sequence models with the objective of disentangling short-term facial movements and long-term emotional changes. The proposed method has been shown to outperform current State-Of-The-Art approaches in DFER through extensive experimentation on benchmark datasets such as DFEW and FERV39K. A comprehensive analysis provides valuable insights and observations regarding the proposed agreement-based approach. This work has significant implications for the field of dynamic facial expression recognition and promotes the further development of the field of noise-consistent robust learning in dynamic facial expression recognition. The code is available from [https://github.com/Cross-Innovation-Lab/RDFER].
中文: 本文提出RDFER方法,通过评估预测一致性区分困难与噪声样本,并采用双流网络加强表征学习,在动态面部表情识别基准数据集上实现了最优性能。
English: This paper introduces RDFER, a robust method for dynamic facial expression recognition that distinguishes hard and noisy samples through prediction agreement evaluation and employs a dual-stream network to enhance representation learning, achieving state-of-the-art results on benchmark datasets.
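The hard-versus-noisy distinction above rests on how consistently the model predicts across differently sampled clips of the same video. A minimal sketch of that agreement check follows; the thresholds and the triage rule are hypothetical.

```python
import torch

@torch.no_grad()
def clip_agreement(model, clips):
    """clips: list of [T, C, H, W] tensors sampled from one video."""
    probs = torch.stack([model(c.unsqueeze(0)).softmax(dim=-1).squeeze(0) for c in clips])
    preds = probs.argmax(dim=-1)
    agreement = (preds == preds.mode().values).float().mean().item()
    return agreement, probs.mean(dim=0)

def triage_sample(agreement, mean_probs, label, agree_thr=0.8):
    """Hypothetical rule: consistent-but-wrong -> hard sample; inconsistent -> likely noisy label."""
    pred = mean_probs.argmax().item()
    if agreement >= agree_thr:
        return "clean" if pred == label else "hard"
    return "noisy"
```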

Authors:Heng Gao, Zhuolin He, Jian Pu
Title: Detecting OOD Samples via Optimal Transport Scoring Function
Abstract:
To deploy machine learning models in the real world, researchers have proposed many OOD detection algorithms to help models identify unknown samples during the inference phase and prevent them from making untrustworthy predictions. Unlike methods that rely on extra data for outlier exposure training, post hoc methods detect Out-of-Distribution (OOD) samples by developing scoring functions, which are model agnostic and do not require additional training. However, previous post hoc methods may fail to capture the geometric cues embedded in network representations. Thus, in this study, we propose a novel score function based on optimal transport theory, named OTOD, for OOD detection. We utilize information from features, logits, and the softmax probability space to calculate the OOD score for each test sample. Our experiments show that combining this information can boost the performance of OTOD by a certain margin. Experiments on the CIFAR-10 and CIFAR-100 benchmarks demonstrate the superior performance of our method. Notably, OTOD outperforms the state-of-the-art method GEN by 7.19% in the mean FPR@95 on the CIFAR-10 benchmark using ResNet-18 as the backbone, and by 12.51% in the mean FPR@95 using WideResNet-28 as the backbone. In addition, we provide theoretical guarantees for OTOD. The code is available at https://github.com/HengGao12/OTOD.
中文: 本研究提出基于最优传输理论的OTOD方法,通过综合特征、逻辑值和Softmax概率信息进行分布外检测,在CIFAR基准测试中显著优于现有最优方法。
English: This study introduces OTOD, a novel post hoc OOD detection method using optimal transport theory that outperforms state-of-the-art approaches by combining features, logits, and softmax probabilities, achieving significant improvements on CIFAR benchmarks.

Authors:Zheling Tan, Kexin Ding, Jin Gao, Mu Zhou, Dimitris Metaxas, Shaoting Zhang, Dequan Wang
Title: MedForge: Building Medical Foundation Models Like Open Source Software Development
Abstract:
Foundational models (FMs) have made significant strides in the healthcare domain. Yet the data silo challenge and privacy concern remain in healthcare systems, hindering safe medical data sharing and collaborative model development among institutions. The collection and curation of scalable clinical datasets increasingly become the bottleneck for training strong FMs. In this study, we propose Medical Foundation Models Merging (MedForge), a cooperative framework enabling a community-driven medical foundation model development, meanwhile preventing the information leakage of raw patient data and mitigating synchronization model development issues across clinical institutions. MedForge offers a bottom-up model construction mechanism by flexibly merging task-specific Low-Rank Adaptation (LoRA) modules, which can adapt to downstream tasks while retaining original model parameters. Through an asynchronous LoRA module integration scheme, the resulting composite model can progressively enhance its comprehensive performance on various clinical tasks. MedForge shows strong performance on multiple clinical datasets (e.g., breast cancer, lung cancer, and colon cancer) collected from different institutions. Our major findings highlight the value of collaborative foundation models in advancing multi-center clinical collaboration effectively and cohesively. Our code is publicly available at https://github.com/TanZheling/MedForge.
中文: MedForge提出了一种通过融合任务特定LoRA模块的协作式医疗基础模型开发框架,既保护原始患者数据隐私又实现跨机构协同建模,在多种临床数据集上展现出卓越性能。
English: MedForge introduces a collaborative framework for developing medical foundation models by merging task-specific LoRA modules, enabling multi-institutional cooperation without sharing raw patient data and demonstrating strong performance across various clinical datasets.
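The bottom-up construction via LoRA merging can be illustrated generically: each institution contributes a low-rank update (A, B) for a shared base weight, and the composite model folds these updates in without ever exchanging patient data. The simple weighted-sum merge below is a common LoRA-merging baseline used for illustration, not necessarily MedForge's exact integration scheme.

```python
import torch

def merge_lora_modules(base_weight, lora_modules, scales=None):
    """base_weight: [out, in]; lora_modules: list of (A [r, in], B [out, r]) pairs."""
    if scales is None:
        scales = [1.0 / len(lora_modules)] * len(lora_modules)
    merged = base_weight.clone()
    for (A, B), s in zip(lora_modules, scales):
        merged += s * (B @ A)   # fold each institution's low-rank update into the base
    return merged

# toy usage: two institutions, rank-4 adapters on a 512x512 projection
W = torch.randn(512, 512)
mods = [(torch.randn(4, 512) * 0.01, torch.randn(512, 4) * 0.01) for _ in range(2)]
W_merged = merge_lora_modules(W, mods)
```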

Authors:Mike Ranzinger, Greg Heinrich, Pavlo Molchanov, Jan Kautz, Bryan Catanzaro, Andrew Tao
Title: FeatSharp: Your Vision Model Features, Sharper
Abstract:
The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general purpose vision backbones is Vision Transformers (ViT), typically trained using contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at $224 \times 224$px, while the "high-resolution" versions are around $378-448$px, but still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-resolution vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model training using RADIO as a way of providing richer targets for distillation. Code available at https://github.com/NVlabs/FeatSharp .
中文摘要:本文提出了一种新颖方法,能够高效地对低分辨率视觉编码器(如CLIP)的特征图进行上采样,捕捉通常因分辨率不足而丢失的精细细节,并在核心感知任务和模型蒸馏中验证了其有效性。
English Summary: This paper introduces a novel method to efficiently upsample feature maps from low-resolution vision encoders like CLIP, capturing fine-grained details that are typically lost, and demonstrates its effectiveness on core perception tasks and model distillation.

Authors:Prashant Shekhar, Bidur Devkota, Dumindu Samaraweera, Laxima Niure Kandel, Manoj Babu
Title: Cross-Model Transferability of Adversarial Patches in Real-time Segmentation for Autonomous Driving
Abstract:
Adversarial attacks pose a significant threat to deep learning models, particularly in safety-critical applications like healthcare and autonomous driving. Recently, patch-based attacks have demonstrated effectiveness in real-time inference scenarios owing to their 'drag and drop' nature. Following this idea for Semantic Segmentation (SS), here we propose a novel Expectation Over Transformation (EOT) based adversarial patch attack that is more realistic for autonomous vehicles. To effectively train this attack we also propose a 'simplified' loss function that is easy to analyze and implement. Using this attack as our basis, we investigate whether adversarial patches, once optimized on a specific SS model, can fool other models or architectures. We conduct a comprehensive cross-model transferability analysis of adversarial patches trained on SOTA Convolutional Neural Network (CNN) models such as PIDNet-S, PIDNet-M and PIDNet-L, among others. Additionally, we also include the Segformer model to study transferability to Vision Transformers (ViTs). All of our analysis is conducted on the widely used Cityscapes dataset. Our study reveals key insights into how model architectures (CNN vs CNN or CNN vs. Transformer-based) influence attack susceptibility. In particular, we conclude that although the transferability (effectiveness) of attacks on unseen images of any dimension is really high, the attacks trained against one particular model are minimally effective on other models. And this was found to be true for both ViT- and CNN-based models. Additionally, our results also indicate that for CNN-based models, the repercussions of patch attacks are local, unlike ViTs. Per-class analysis reveals that simple classes like 'sky' suffer less misclassification than others. The code for the project is available at: https://github.com/p-shekhar/adversarial-patch-transferability
中文摘要:本研究针对自动驾驶中的语义分割任务,提出了一种基于期望变换的新型对抗性补丁攻击,并通过跨模型分析发现:尽管攻击在不同图像维度间具有高迁移性,但对特定模型的攻击对其他模型效果有限,且CNN模型受影响范围局部化,而视觉变换器受影响更广泛。
English Summary: This study introduces a novel adversarial patch attack using Expectation Over Transformation for semantic segmentation in autonomous vehicles, revealing through cross-model analysis that while attacks show high transferability across image dimensions, they remain model-specific with localized effects on CNNs compared to Vision Transformers.
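The Expectation Over Transformation training described above can be sketched as a loop that pastes the patch at random locations (standing in for the full transformation distribution) and pushes the segmentation model away from the ground-truth labels, averaged over the sampled transformations. The placement distribution and the loss are placeholders rather than the paper's 'simplified' loss.

```python
import torch
import torch.nn.functional as F

def eot_patch_step(model, images, labels, patch, optimizer, n_transforms=4):
    """One EOT update of an adversarial patch against a segmentation model (sketch).
    images: [B, C, H, W], labels: [B, H, W] class indices, patch: [C, ph, pw] with requires_grad."""
    optimizer.zero_grad()
    B, C, H, W = images.shape
    ph, pw = patch.shape[-2:]
    loss = 0.0
    for _ in range(n_transforms):
        # random placement as a stand-in for the full transformation distribution
        y = torch.randint(0, H - ph + 1, (1,)).item()
        x = torch.randint(0, W - pw + 1, (1,)).item()
        pad = (x, W - x - pw, y, H - y - ph)                  # left, right, top, bottom
        canvas = F.pad(patch.clamp(0, 1), pad)                # patch placed on an empty canvas
        mask = F.pad(torch.ones_like(patch), pad)
        patched = images * (1 - mask) + canvas * mask
        logits = model(patched)                               # [B, num_classes, H, W]
        loss = loss - F.cross_entropy(logits, labels) / n_transforms  # maximize seg. error
    loss.backward()
    optimizer.step()
    return loss.item()
```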

Authors:Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat
Title: Multi-Agent Multimodal Models for Multicultural Text to Image Generation
Abstract:
Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.
Chinese: 本文提出MosAIG多智能体框架,通过利用具有不同文化角色的LLMs增强多元文化图像生成,并基于涵盖五个国家、三种年龄组等维度的9000张图像数据集,证明其在多项评估指标上优于无智能体模型。
English: This paper introduces MosAIG, a multi-agent framework that enhances multicultural image generation by leveraging LLMs with distinct cultural personas, demonstrating superior performance over no-agent models through a comprehensive dataset of 9,000 images across diverse cultural dimensions.

Authors:William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh
Title: Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
Abstract:
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
Chinese: 多模态大语言模型在视觉数学推理上存在显著缺陷,依赖直觉联想而非系统分析,但通过视觉引导的提示方法(如VC-CoT)可大幅提升其表现。
English: Multimodal Large Language Models exhibit significant deficiencies in visual-mathematical reasoning, relying on intuitive associations rather than deliberate analysis, but their performance can be dramatically improved through visually-guided prompting techniques like VC-CoT.

Authors:Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, Jun Zhu
Title: RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
Abstract:
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch--achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3x extrapolation by minimal fine-tuning without long videos. Project page and codes: https://riflex-video.github.io/.
中文: 本文提出RIFLEx方法,通过调整位置编码的固有频率,无需训练即可实现高质量2倍视频时长外推并保持时序连贯性,经少量微调更可扩展至3倍外推能力。
English: This paper introduces RIFLEx, a training-free method that adjusts the intrinsic frequency of positional embeddings to enable high-quality 2x video length extrapolation while maintaining temporal coherence, with extended 3x capability through minimal fine-tuning.
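A rough sketch of the idea on RoPE-style temporal position embeddings: among the per-dimension frequency components, the one whose period is closest to the training length is treated as the intrinsic frequency and is scaled down so that its period covers the extrapolated length, suppressing repetition. The selection rule and scaling below are simplifications, not the released implementation.

```python
import math
import torch

def riflex_like_inv_freq(dim, train_frames, extrap_factor=2.0, base=10000.0):
    """RoPE-style inverse frequencies with a RIFLEx-like reduction of the intrinsic frequency."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    period = 2 * math.pi / inv_freq                      # period of each component, in frames
    # treat the component whose period is closest to the training length as "intrinsic"
    k = torch.argmin((period - train_frames).abs())
    inv_freq[k] = inv_freq[k] / extrap_factor            # stretch its period past the new length
    return inv_freq
```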

Authors:Hongjie Zhu, Zeyu Zhang, Guansong Pang, Xu Wang, Shimin Wen, Yu Bai, Daji Ergu, Ying Cai, Yang Zhao
Title: DOEI: Dual Optimization of Embedding Information for Attention-Enhanced Class Activation Maps
Abstract:
Weakly supervised semantic segmentation (WSSS) typically utilizes limited semantic annotations to obtain initial Class Activation Maps (CAMs). However, due to the inadequate coupling between class activation responses and semantic information in high-dimensional space, the CAM is prone to object co-occurrence or under-activation, resulting in inferior recognition accuracy. To tackle this issue, we propose DOEI, Dual Optimization of Embedding Information, a novel approach that reconstructs embedding representations through semantic-aware attention weight matrices to optimize the expression capability of embedding information. Specifically, DOEI amplifies tokens with high confidence and suppresses those with low confidence during the class-to-patch interaction. This alignment of activation responses with semantic information strengthens the propagation and decoupling of target features, enabling the generated embeddings to more accurately represent target features in high-level semantic space. In addition, we propose a hybrid-feature alignment module in DOEI that combines RGB values, embedding-guided features, and self-attention weights to increase the reliability of candidate tokens. Comprehensive experiments show that DOEI is an effective plug-and-play module that empowers state-of-the-art visual transformer-based WSSS models to significantly improve the quality of CAMs and segmentation performance on popular benchmarks, including PASCAL VOC (+3.6%, +1.5%, +1.2% mIoU) and MS COCO (+1.2%, +1.6% mIoU). Code will be available at https://github.com/AIGeeksGroup/DOEI.
Chinese: DOEI是一种通过语义感知注意力优化嵌入表示的新方法,在类别到区块交互中增强高置信度标记并抑制低置信度标记,从而显著提升PASCAL VOC和MS COCO等基准测试中的分割性能。
English: DOEI is a novel approach that optimizes embedding representations through semantic-aware attention to enhance Class Activation Maps (CAMs) by amplifying high-confidence tokens and suppressing low-confidence ones, significantly improving segmentation performance on benchmarks like PASCAL VOC and MS COCO.
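The amplify-high-confidence, suppress-low-confidence behavior in the class-to-patch interaction can be sketched as a reweighting of attention by per-patch confidence. The linear scaling rule and renormalization below are assumptions used only to illustrate the mechanism.

```python
import torch

def reweight_class_to_patch(attn, confidence, gamma=1.0, eps=1e-6):
    """attn: [num_classes, num_patches] class-to-patch weights;
    confidence: [num_patches] per-patch confidence in [0, 1]."""
    scale = (1.0 + gamma * (confidence - confidence.mean())).clamp(min=0)
    reweighted = attn * scale.unsqueeze(0)   # amplify confident tokens, suppress uncertain ones
    return reweighted / (reweighted.sum(dim=-1, keepdim=True) + eps)
```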

Authors:Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper
Title: MaxSup: Overcoming Representation Collapse in Label Smoothing
Abstract:
Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS, consistently reducing overconfidence while preserving richer feature representations. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
中文: 标签平滑会导致错误预测的过度自信和特征压缩,引发表示坍塌,而提出的最大抑制方法通过均匀正则化预测,恢复特征多样性并减少过度自信。
English: Label Smoothing is found to induce overconfidence in errors and compress features, leading to representation collapse, while the proposed Max Suppression method uniformly regularizes predictions to restore feature diversity and reduce overconfidence.
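Following the decomposition described in the abstract, a minimal sketch contrasts the two objectives: label smoothing's implicit regularizer acts on the ground-truth logit and therefore only dampens confidence on correct predictions, while a MaxSup-style penalty targets the top-1 logit so the regularization also applies to misclassified samples. The margin term (top-1 logit minus mean logit) and its coefficient are illustrative.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, eps=0.1):
    """Standard LS: the implicit regularizer acts on the ground-truth logit only."""
    return F.cross_entropy(logits, targets, label_smoothing=eps)

def maxsup_style_loss(logits, targets, eps=0.1):
    """MaxSup-style sketch: penalize the top-1 logit's margin over the mean logit,
    so over-confidence is damped whether or not the prediction is correct."""
    ce = F.cross_entropy(logits, targets)
    margin = logits.max(dim=-1).values - logits.mean(dim=-1)
    return ce + eps * margin.mean()
```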

Authors:Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord
Title: VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
Abstract:
We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at https://github.com/valeoai/VideoActionModel
中文摘要:本研究提出VaViM和VaVAM两种生成式视频模型,构建了从感知到行动的完整自动驾驶流程,证明视频预训练能有效学习驾驶场景语义并通过模仿学习生成行驶轨迹。
English Summary: This study introduces VaViM and VaVAM, two generative video models that form a perception-to-action pipeline for autonomous driving, demonstrating video pre-training's effectiveness in capturing driving semantics and generating trajectories through imitation learning.

Authors:Ziqian Ni, Sicong Du, Zhenghua Hou, Chenming Wu, Sheng Yang
Title: Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis
Abstract:
To evaluate end-to-end autonomous driving systems, a simulation environment based on Novel View Synthesis (NVS) techniques is essential, which synthesizes photo-realistic images and point clouds from previously recorded sequences under new vehicle poses, particularly in cross-lane scenarios. Therefore, the development of a multi-lane dataset and benchmark is necessary. While recent synthetic scene-based NVS datasets have been prepared for cross-lane benchmarking, they still lack the realism of captured images and point clouds. To further assess the performance of existing methods based on NeRF and 3DGS, we present the first multi-lane dataset registering parallel scans, specifically for novel driving view synthesis, derived from real-world scans, comprising 25 groups of associated sequences, including 16,000 front-view images, 64,000 surround-view images, and 16,000 LiDAR frames. All frames are labeled to differentiate moving objects from static elements. Using this dataset, we evaluate the performance of existing approaches in various testing scenarios at different lanes and distances. Additionally, our method provides the solution for solving and assessing the quality of multi-sensor poses for multi-modal data alignment for curating such a dataset in the real world. We plan to continually add new sequences to test the generalization of existing methods across different scenarios. The dataset is released publicly at the project page: https://nizqleo.github.io/paralane-dataset/.

Authors:Sicheng Yu, Chong Cheng, Yifan Zhou, Xiaojun Yang, Hao Wang
Title: RGB-Only Gaussian Splatting SLAM for Unbounded Outdoor Scenes
Abstract:
3D Gaussian Splatting (3DGS) has become a popular solution in SLAM, as it can produce high-fidelity novel views. However, previous GS-based methods primarily target indoor scenes and rely on RGB-D sensors or pre-trained depth estimation models, hence underperforming in outdoor scenarios. To address this issue, we propose an RGB-only Gaussian Splatting SLAM method for unbounded outdoor scenes--OpenGS-SLAM. Technically, we first employ a pointmap regression network to generate consistent pointmaps between frames for pose estimation. Compared to commonly used depth maps, pointmaps include spatial relationships and scene geometry across multiple views, enabling robust camera pose estimation. Then, we propose integrating the estimated camera poses with 3DGS rendering as an end-to-end differentiable pipeline. Our method achieves simultaneous optimization of camera poses and 3DGS scene parameters, significantly enhancing system tracking accuracy. Specifically, we also design an adaptive scale mapper for the pointmap regression network, which provides more accurate pointmap mapping to the 3DGS map representation. Our experiments on the Waymo dataset demonstrate that OpenGS-SLAM reduces tracking error to 9.8\% of previous 3DGS methods, and achieves state-of-the-art results in novel view synthesis. Project Page: https://3dagentworld.github.io/opengs-slam/

Authors:Kai Liu, Dehui Wang, Zhiteng Li, Zheng Chen, Yong Guo, Wenbo Li, Linghe Kong, Yulun Zhang
Title: CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution
Abstract:
Low-bit model quantization for image super-resolution (SR) is a longstanding task that is renowned for its surprising compression and acceleration ability. However, accuracy degradation is inevitable when compressing the full-precision (FP) model to ultra-low bit widths (2~4 bits). Experimentally, we observe that the degradation of quantization is mainly attributed to the quantization of activations instead of model weights. In numerical analysis, the condition number of weights could measure how much the output value can change for a small change in the input argument, inherently reflecting the quantization error. Therefore, we propose CondiQuant, a condition number based low-bit post-training quantization for image super-resolution. Specifically, we formulate the quantization error as the condition number of weight matrices. By decoupling the representation ability and the quantization sensitivity, we design an efficient proximal gradient descent algorithm to iteratively minimize the condition number while keeping the output unchanged. With comprehensive experiments, we demonstrate that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead and gains the theoretically optimal compression ratio in model parameters. Our code and model are released at https://github.com/Kai-Liu001/CondiQuant.
Chinese: CondiQuant提出了一种基于条件数的图像超分辨率后训练量化方法,通过最小化量化误差,在无计算开销的情况下有效缓解了超低比特模型的精度下降问题。
English: CondiQuant introduces a condition number-based post-training quantization method for image super-resolution, effectively reducing accuracy degradation in ultra-low bit models by minimizing quantization errors without computational overhead.
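The quantity driving the method above is the condition number of each weight matrix, which bounds how strongly a small perturbation of the input (such as activation quantization error) can be amplified at the output. A minimal sketch of measuring it is below; the proximal gradient descent that lowers the condition number while preserving the layer's output is not reproduced here.

```python
import torch

def condition_number(weight):
    """kappa(W) = sigma_max / sigma_min, a proxy for sensitivity to activation quantization error."""
    s = torch.linalg.svdvals(weight.float())
    return (s.max() / s.min().clamp(min=1e-12)).item()

# toy usage: a well-conditioned layer amplifies quantization noise far less
w_good = torch.eye(64) + 0.01 * torch.randn(64, 64)
w_bad = torch.randn(64, 1) @ torch.randn(1, 64) + 1e-3 * torch.randn(64, 64)
print(condition_number(w_good), condition_number(w_bad))
```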

Authors:Sanghee Park, Geewook Kim
Title: Evaluating Multimodal Generative AI with Korean Educational Standards
Abstract:
This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models - open-source, open-access, and closed APIs - by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be made fully open-sourced at https://github.com/naver-ai/KoNET.
中文: KoNET基准通过韩国四个级别的国家级教育考试,全面评估多模态生成式AI系统在较少研究语言中的表现,涵盖多样化的学科和难度。
English: The KoNET benchmark evaluates multimodal generative AI systems using rigorous Korean national educational tests across four levels to analyze performance in less-explored languages and diverse subjects.

Authors:Jiebin Yan, Ziwen Tan, Yuming Fang, Junjie Chen, Wenhui Jiang, Zhou Wang
Title: Omnidirectional Image Quality Captioning: A Large-scale Database and A New Model
Abstract:
The fast growing application of omnidirectional images calls for effective approaches for omnidirectional image quality assessment (OIQA). Existing OIQA methods have been developed and tested on homogeneously distorted omnidirectional images, but it is hard to transfer their success directly to the heterogeneously distorted omnidirectional images. In this paper, we conduct the largest study so far on OIQA, where we establish a large-scale database called OIQ-10K containing 10,000 omnidirectional images with both homogeneous and heterogeneous distortions. A comprehensive psychophysical study is elaborated to collect human opinions for each omnidirectional image, together with the spatial distributions (within local regions or globally) of distortions, and the head and eye movements of the subjects. Furthermore, we propose a novel multitask-derived adaptive feature-tailoring OIQA model named IQCaption360, which is capable of generating a quality caption for an omnidirectional image in a manner of textual template. Extensive experiments demonstrate the effectiveness of IQCaption360, which outperforms state-of-the-art methods by a significant margin on the proposed OIQ-10K database. The OIQ-10K database and the related source codes are available at https://github.com/WenJuing/IQCaption360.
中文摘要:本研究建立了最大的全景图像质量评估数据库OIQ-10K,并提出IQCaption360模型,能生成质量描述文本,其性能显著优于现有最优方法。
English Summary: This study introduces the largest omnidirectional image quality assessment (OIQA) database, OIQ-10K, and proposes IQCaption360, a novel model that generates quality captions and significantly outperforms existing methods.

Authors:Nie Lin, Takehiko Ohkawa, Yifei Huang, Mingfang Zhang, Minjie Cai, Ming Li, Ryosuke Furuta, Yoichi Sato
Title: SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training
Abstract:
We present a framework for pre-training of 3D hand pose estimation from in-the-wild hand images sharing similar hand characteristics, dubbed SimHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. Our method not only learns from similar samples but also adaptively weights the contrastive learning loss based on inter-sample distance, leading to additional performance gains. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) on various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands. Our code is available at https://github.com/ut-vision/SiMHand.
Chinese: 我们提出了SimHand框架,通过利用野外多样手部图像进行预训练,并采用新颖的对比学习方法,在多个数据集上显著超越了现有最先进方法的性能。
English: We introduce SimHand, a framework that enhances 3D hand pose estimation through pre-training on diverse in-the-wild hand images using a novel contrastive learning method, which significantly outperforms existing approaches across multiple datasets.
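The contrastive objective described above treats non-identical hands with similar poses as positives and weights the loss by inter-sample distance. A compact sketch of such a distance-weighted InfoNCE term follows; the weighting function, temperature, and pose-distance input are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def similar_hands_infonce(anchor, positive, negatives, pose_dist, tau=0.1, sigma=1.0):
    """anchor, positive: [D] embeddings of two different samples with similar hand poses;
    negatives: [N, D] embeddings; pose_dist: scalar pose distance between the positive pair."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    logits = torch.cat([(a * p).sum().view(1), n @ a]) / tau        # [1 + N] similarities
    loss = F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
    # adaptive weighting: closer pose pairs contribute more strongly (assumed form)
    weight = math.exp(-float(pose_dist) / sigma)
    return weight * loss
```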

Authors:Yifan Jiang, Yannick Lemaréchal, Sophie Plante, Josée Bafaro, Jessica Abi-Rjeile, Philippe Joubert, Philippe Després, Venkata Manem
Title: Lung-DDPM: Semantic Layout-guided Diffusion Models for Thoracic CT Image Synthesis
Abstract:
With the rapid development of artificial intelligence (AI), AI-assisted medical imaging analysis demonstrates remarkable performance in early lung cancer screening. However, the costly annotation process and privacy concerns limit the construction of large-scale medical datasets, hampering the further application of AI in healthcare. To address the data scarcity in lung cancer screening, we propose Lung-DDPM, a thoracic CT image synthesis approach that effectively generates high-fidelity 3D synthetic CT images, which prove helpful in downstream lung nodule segmentation tasks. Our method is based on semantic layout-guided denoising diffusion probabilistic models (DDPM), enabling anatomically reasonable, seamless, and consistent sample generation even from incomplete semantic layouts. Our results suggest that the proposed method outperforms other state-of-the-art (SOTA) generative models in image quality evaluation and downstream lung nodule segmentation tasks. Specifically, Lung-DDPM achieved superior performance on our large validation cohort, with a Fréchet inception distance (FID) of 0.0047, maximum mean discrepancy (MMD) of 0.0070, and mean squared error (MSE) of 0.0024. These results were 7.4$\times$, 3.1$\times$, and 29.5$\times$ better than the second-best competitors, respectively. Furthermore, the lung nodule segmentation model, trained on a dataset combining real and Lung-DDPM-generated synthetic samples, attained a Dice Coefficient (Dice) of 0.3914 and sensitivity of 0.4393. This represents 8.8% and 18.6% improvements in Dice and sensitivity compared to the model trained solely on real samples. The experimental results highlight Lung-DDPM's potential for a broader range of medical imaging applications, such as general tumor segmentation, cancer survival estimation, and risk prediction. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM/.
中文: 本研究提出Lung-DDPM,一种基于语义布局引导的扩散模型,能生成高保真三维合成CT图像以解决肺癌筛查数据稀缺问题,显著提升了后续肺结节分割任务的性能表现。
English: The study introduces Lung-DDPM, a semantic layout-guided diffusion model that generates high-fidelity 3D synthetic CT images to overcome data scarcity in lung cancer screening, significantly enhancing downstream nodule segmentation performance.

Authors:Luoying Hao, Yan Hu, Yang Yue, Li Wu, Huazhu Fu, Jinming Duan, Jiang Liu
Title: Hierarchical Context Transformer for Multi-level Semantic Scene Understanding
Abstract:
A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide systematical analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the tasks set [phase recognition --> step recognition --> action and instrument detection] as multi-level semantic scene understanding (MSSU). For this target, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the different level tasks. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features via absorbing complementary information from other tasks. Furthermore, considering the computational costs of the transformer, we propose HCT+ to integrate the spatial and temporal adapter to access competitive performance on substantially fewer tunable parameters. Extensive experiments on our cataract dataset and a publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding the state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.
中文摘要:本研究提出了一种分层上下文变换器(HCT)网络,通过多级关系聚合和任务间对比学习实现层次化手术场景理解,在显著降低计算成本的同时大幅超越现有最优方法。
English Summary: This study introduces a hierarchical context transformer (HCT) network for multi-level surgical scene understanding, integrating relation aggregation and contrastive learning to achieve state-of-the-art performance with reduced computational costs.

Authors:Junliang Chen, Huaiyuan Xu, Yi Wang, Lap-Pui Chau
Title: OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework
Abstract:
Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting the future 3D occupancy status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, i.e., OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while improving forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D multi-frame voxels using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training- and inference-friendly. OccProphet reduces the computational cost by 58\%$\sim$78\% with a 2.6$\times$ speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves a 4\%$\sim$18\% relative improvement in forecasting accuracy. Code and models are publicly available at https://github.com/JLChen-C/OccProphet.
Chinese: 本文提出OccProphet轻量级框架,在显著降低计算成本的同时,提高了自动驾驶中3D占据预测的准确性。
English: This paper introduces OccProphet, a lightweight framework that significantly reduces computational costs while improving forecasting accuracy for 3D occupancy prediction in autonomous driving.

Authors:Chuan Cui, Kejiang Chen, Zhihua Wei, Wen Shen, Weiming Zhang, Nenghai Yu
Title: M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment
Abstract:
The rapid advancement of AI-generated image (AIGI) models presents new challenges for evaluating image quality, particularly across three aspects: perceptual quality, prompt correspondence, and authenticity. To address these challenges, we introduce M3-AGIQA, a comprehensive framework that leverages Multimodal Large Language Models (MLLMs) to enable more human-aligned, holistic evaluation of AI-generated images across both visual and textual domains. Besides, our framework features a structured multi-round evaluation process, generating and analyzing intermediate image descriptions to provide deeper insight into these three aspects. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance on tested datasets and aspects, and exhibits strong generalizability in most cross-dataset settings. Code is available at https://github.com/strawhatboy/M3-AGIQA.
中文: 针对AI生成图像评估的挑战,我们提出M3-AGIQA框架,利用多模态大语言模型进行涵盖感知质量、提示对应性和真实性的全面、人类对齐评估,实现了最先进的性能和强大的泛化能力。
English: To address the challenges in evaluating AI-generated images, we propose M3-AGIQA, a framework using Multimodal Large Language Models for holistic, human-aligned assessment across perceptual quality, prompt correspondence, and authenticity, achieving state-of-the-art performance and strong generalizability.

Authors:Ebenezer Tarubinga, Jenifer Kalafatovich, Seong-Whan Lee
Title: CW-BASS: Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation
Abstract:
Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilizing large amounts of unlabeled data with limited labeled samples. Existing methods often suffer from coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by limited boundary-awareness and ambiguous edge cues. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS), remain underutilized in SSSS. Specifically, our method: (1) reduces coupling via a confidence-weighted loss that adjusts pseudo-label influence based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) tackles boundary blur using a boundary-aware module to refine segmentation near object edges, and (4) reduces label noise through a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on Pascal VOC 2012 and Cityscapes demonstrate that CW-BASS achieves state-of-the-art performance. Notably, CW-BASS achieves a 65.9% mIoU on Cityscapes under a challenging and underexplored 1/30 (3.3%) split (100 images), highlighting its effectiveness in limited-label settings. Our code is available at https://github.com/psychofict/CW-BASS.
中文: CW-BASS框架通过引入置信度加权的伪标签和边界描绘技术,有效解决了半监督语义分割中的耦合、确认偏差和边界模糊问题,在有限标注数据下于多个基准数据集上取得了领先性能。
English: The CW-BASS framework enhances semi-supervised semantic segmentation by introducing confidence-weighted pseudo-labeling and boundary-delineation techniques to address coupling, confirmation bias, and boundary blur issues, achieving state-of-the-art results on benchmark datasets with limited labeled data.
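The confidence-weighting and dynamic-thresholding components described above can be illustrated with a short sketch. The snippet below is a minimal, hedged illustration assuming a standard teacher/student pseudo-labeling setup with dense logits; the function names and the threshold update rule are placeholders, not the paper's exact formulation.

```python
# Illustrative sketch of a confidence-weighted pseudo-label loss with dynamic
# thresholding (assumed logit shape: (B, C, H, W)); not the exact CW-BASS loss.
import torch
import torch.nn.functional as F

def confidence_weighted_loss(student_logits, teacher_logits, threshold):
    probs = teacher_logits.softmax(dim=1)
    conf, pseudo = probs.max(dim=1)                    # per-pixel confidence and pseudo-label
    mask = (conf >= threshold).float()                 # dynamic filtering of unreliable pixels
    pixel_ce = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (conf * mask * pixel_ce).sum() / mask.sum().clamp(min=1.0)

def update_threshold(current, val_score, lo=0.6, hi=0.95):
    # Hypothetical rule: raise the threshold (never lower it) as validation performance improves.
    return float(min(hi, max(current, lo + (hi - lo) * val_score)))
```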

Authors:Zihao Sheng, Zilin Huang, Yansong Qu, Yue Leng, Sruthi Bhavanam, Sikai Chen
Title: CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models
Abstract:
Ensuring safety in autonomous driving systems remains a critical challenge, particularly in handling rare but potentially catastrophic safety-critical scenarios. While existing research has explored generating safety-critical scenarios for autonomous vehicle (AV) testing, there is limited work on effectively incorporating these scenarios into policy learning to enhance safety. Furthermore, developing training curricula that adapt to an AV's evolving behavioral patterns and performance bottlenecks remains largely unexplored. To address these challenges, we propose CurricuVLM, a novel framework that leverages Vision-Language Models (VLMs) to enable personalized curriculum learning for autonomous driving agents. Our approach uniquely exploits VLMs' multimodal understanding capabilities to analyze agent behavior, identify performance weaknesses, and dynamically generate tailored training scenarios for curriculum adaptation. Through comprehensive analysis of unsafe driving situations with narrative descriptions, CurricuVLM performs in-depth reasoning to evaluate the AV's capabilities and identify critical behavioral patterns. The framework then synthesizes customized training scenarios targeting these identified limitations, enabling effective and personalized curriculum learning. Extensive experiments on the Waymo Open Motion Dataset show that CurricuVLM outperforms state-of-the-art baselines across both regular and safety-critical scenarios, achieving superior performance in terms of navigation success, driving efficiency, and safety metrics. Further analysis reveals that CurricuVLM serves as a general approach that can be integrated with various RL algorithms to enhance autonomous driving systems. The code and demo video are available at: https://zihaosheng.github.io/CurricuVLM/.

Authors:Anthony Fuller, Yousef Yassin, Daniel G. Kyrollos, Evan Shelhamer, James R. Green
Title: Simpler Fast Vision Transformers with a Jumbo CLS Token
Abstract:
We introduce a simple enhancement of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Since there is only one Jumbo token, its cost is minimal, and because we share this FFN across layers, its parameter count is controlled. Jumbo significantly improves over ViT+Registers on ImageNet-1K and ImageNet-21K. These gains are largest at small sizes / high speeds, e.g., ViT-nano+Jumbo outperforms ViT-nano+Registers by 13%. In fact, our Jumbo models are so efficient that they outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs, such as support for token dropping and other modalities. Accordingly, we demonstrate that Jumbo excels in these two settings via masked autoencoding and on a suite of time series benchmarks. Code and weights available: https://github.com/antofuller/jumbo
中文: Jumbo通过拓宽CLS标记并应用专用前馈网络来增强视觉变换器,以极低成本显著提升精度,尤其在小规模高速模型中表现突出。
English: Jumbo enhances vision transformers by widening the CLS token and applying a dedicated feed-forward network, significantly improving accuracy with minimal cost, especially in small, high-speed models.
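The abstract's description of the Jumbo token (a wide CLS token split to patch width before attention, reassembled afterwards, and passed through its own wider FFN shared across layers) can be sketched as follows. This is a hedged reconstruction from the abstract only; the module and parameter names are assumptions, not the released implementation.

```python
# Sketch of a transformer block with a "Jumbo" CLS token that is k times wider
# than the patch tokens (hypothetical names; simplified pre-norm layout).
import torch
import torch.nn as nn

class JumboBlock(nn.Module):
    def __init__(self, dim, k, num_heads, jumbo_ffn):
        super().__init__()
        self.k = k                                   # Jumbo CLS token is k * dim wide
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.patch_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.jumbo_ffn = jumbo_ffn                   # wider FFN, shared across layers

    def forward(self, jumbo_cls, patches):           # jumbo_cls: (B, k*dim), patches: (B, N, dim)
        B, N, d = patches.shape
        cls_tokens = jumbo_cls.view(B, self.k, d)    # split the wide token to patch width
        x = torch.cat([cls_tokens, patches], dim=1)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # joint self-attention
        cls_tokens, patches = x[:, :self.k], x[:, self.k:]
        jumbo_cls = cls_tokens.reshape(B, self.k * d)        # reassemble the wide CLS token
        jumbo_cls = jumbo_cls + self.jumbo_ffn(jumbo_cls)    # dedicated wider FFN
        patches = patches + self.patch_ffn(patches)
        return jumbo_cls, patches

dim, k = 192, 6
shared_ffn = nn.Sequential(nn.Linear(k * dim, 2 * k * dim), nn.GELU(), nn.Linear(2 * k * dim, k * dim))
block = JumboBlock(dim, k, num_heads=3, jumbo_ffn=shared_ffn)
cls, tokens = block(torch.randn(2, k * dim), torch.randn(2, 196, dim))
```

Because the wider FFN is applied to a single token and shared across layers, the extra compute and parameter cost stays small, which matches the abstract's claim of preserved throughput.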

Authors:Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni
Title: CrossOver: 3D Scene Cross-Modal Alignment
Abstract:
Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities -- RGB images, point clouds, CAD models, floorplans, and text descriptions -- with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting the adaptability for real-world applications in 3D scene understanding.
中文:CrossOver是一种灵活的跨模态3D场景理解框架,通过无需严格数据对齐或对象语义的方式学习统一嵌入空间,即使在数据不完整时也能实现稳健的场景检索和物体定位。
English: CrossOver is a flexible framework for cross-modal 3D scene understanding that learns a unified embedding space by aligning various modalities without strict data alignment or object semantics, demonstrating robust performance in scene retrieval and object localization even with incomplete data.

Authors:Manuel Knott, Ignacio Serna, Ethan Mann, Pietro Perona
Title: A Rapid Test for Accuracy and Bias of Face Recognition Technology
Abstract:
Measuring the accuracy of face recognition (FR) systems is essential for improving performance and ensuring responsible use. Accuracy is typically estimated using large annotated datasets, which are costly and difficult to obtain. We propose a novel method for 1:1 face verification that benchmarks FR systems quickly and without manual annotation, starting from approximate labels (e.g., from web search results). Unlike previous methods for training set label cleaning, ours leverages the embedding representation of the models being evaluated, achieving high accuracy in smaller-sized test datasets. Our approach reliably estimates FR accuracy and ranking, significantly reducing the time and cost of manual labeling. We also introduce the first public benchmark of five FR cloud services, revealing demographic biases, particularly lower accuracy for Asian women. Our rapid test method can democratize FR testing, promoting scrutiny and responsible use of the technology. Our method is provided as a publicly accessible tool at https://github.com/caltechvisionlab/frt-rapid-test
中文: 本文提出了一种无需人工标注的快速人脸识别系统评测方法,利用近似标签和模型嵌入向量,在显著降低测试成本的同时揭示了商业服务中存在的种族与性别偏见。
English: This paper introduces a rapid, annotation-free method for benchmarking face recognition systems using approximate labels and model embeddings, significantly reducing testing costs while revealing demographic biases in commercial services.
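A rough sketch of the general idea, benchmarking 1:1 verification from approximate web labels by using the evaluated model's own embeddings to prune likely label noise, is given below. The outlier-pruning rule and all names are illustrative assumptions; the paper's actual cleaning procedure is more sophisticated.

```python
# Illustrative label-noise pruning and 1:1 verification scoring from
# approximate identity groups (e.g., web search results); not the paper's method.
import numpy as np

def clean_identity(embs, keep_quantile=0.8):
    # Keep the embeddings closest to their identity centroid; drop likely mislabels.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    centroid = embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = embs @ centroid
    keep = sims >= np.quantile(sims, 1.0 - keep_quantile)
    return embs[keep]

def verification_scores(identities):
    # identities: dict mapping identity name -> normalized embeddings from clean_identity.
    genuine, impostor = [], []
    names = list(identities)
    for i, a in enumerate(names):
        ea = identities[a]
        genuine += (ea @ ea.T)[np.triu_indices(len(ea), k=1)].tolist()
        for b in names[i + 1:]:
            impostor += (ea @ identities[b].T).ravel().tolist()
    return np.array(genuine), np.array(impostor)
```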

Authors:Thomas Froech, Olaf Wysocki, Yan Xia, Junyu Xie, Benedikt Schwab, Daniel Cremers, Thomas H. Kolbe
Title: FacaDiffy: Inpainting Unseen Facade Parts Using Diffusion Models
Abstract:
High-detail semantic 3D building models are frequently utilized in robotics, geoinformatics, and computer vision. One key aspect of creating such models is employing 2D conflict maps that detect openings' locations in building facades. Yet, in reality, these maps are often incomplete due to obstacles encountered during laser scanning. To address this challenge, we introduce FacaDiffy, a novel method for inpainting unseen facade parts by completing conflict maps with a personalized Stable Diffusion model. Specifically, we first propose a deterministic ray analysis approach to derive 2D conflict maps from existing 3D building models and corresponding laser scanning point clouds. Furthermore, we facilitate the inpainting of unseen facade objects into these 2D conflict maps by leveraging the potential of personalizing a Stable Diffusion model. To compensate for the scarcity of real-world training data, we also develop a scalable pipeline to produce synthetic conflict maps using random city model generators and annotated facade images. Extensive experiments demonstrate that FacaDiffy achieves state-of-the-art performance in conflict map completion compared to various inpainting baselines and increases the detection rate by $22\%$ when applying the completed conflict maps for high-definition 3D semantic building reconstruction. The code is publicly available in the corresponding GitHub repository: https://github.com/ThomasFroech/InpaintingofUnseenFacadeObjects
中文摘要:FacaDiffy提出了一种利用个性化稳定扩散模型补全二维冲突地图的创新方法,用于高细节三维建筑重建,实现了最先进的性能,并将检测率提高了22%。
English Summary: FacaDiffy introduces a novel method using personalized Stable Diffusion models to complete incomplete 2D conflict maps for high-detail 3D building reconstruction, achieving state-of-the-art performance with a 22% detection rate improvement.

Authors:Ziye Wang, Yiran Qin, Lin Zeng, Ruimao Zhang
Title: High-Dynamic Radar Sequence Prediction for Weather Nowcasting Using Spatiotemporal Coherent Gaussian Representation
Abstract:
Weather nowcasting is an essential task that involves predicting future radar echo sequences based on current observations, offering significant benefits for disaster management, transportation, and urban planning. Current prediction methods are limited by training and storage efficiency, mainly focusing on 2D spatial predictions at specific altitudes. Meanwhile, 3D volumetric predictions at each timestamp remain largely unexplored. To address such a challenge, we introduce a comprehensive framework for 3D radar sequence prediction in weather nowcasting, using the newly proposed SpatioTemporal Coherent Gaussian Splatting (STC-GS) for dynamic radar representation and GauMamba for efficient and accurate forecasting. Specifically, rather than relying on a 4D Gaussian for dynamic scene reconstruction, STC-GS optimizes 3D scenes at each frame by employing a group of Gaussians while effectively capturing their movements across consecutive frames. It ensures consistent tracking of each Gaussian over time, making it particularly effective for prediction tasks. With the temporally correlated Gaussian groups established, we utilize them to train GauMamba, which integrates a memory mechanism into the Mamba framework. This allows the model to learn the temporal evolution of Gaussian groups while efficiently handling a large volume of Gaussian tokens. As a result, it achieves both efficiency and accuracy in forecasting a wide range of dynamic meteorological radar signals. The experimental results demonstrate that our STC-GS can efficiently represent 3D radar sequences with over $16\times$ higher spatial resolution compared with the existing 3D representation methods, while GauMamba outperforms state-of-the-art methods in forecasting a broad spectrum of high-dynamic weather conditions.

Authors:Junhyeok Kim, Min Soo Kim, Jiwan Chung, Jungbin Cho, Jisoo Kim, Sungwoong Kim, Gyeongbo Sim, Youngjae Yu
Title: EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild
Abstract:
Predicting when to initiate speech in real-world environments remains a fundamental challenge for conversational agents. We introduce EgoSpeak, a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker's first-person viewpoint, EgoSpeak is tailored for human-like interactions in which a conversational agent must continuously observe its environment and dynamically decide when to talk. Our approach bridges the gap between simplified experimental setups and complex natural conversations by integrating four key capabilities: (1) first-person perspective, (2) RGB processing, (3) online processing, and (4) untrimmed video processing. We also present YT-Conversation, a diverse collection of in-the-wild conversational videos from YouTube, as a resource for large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that EgoSpeak outperforms random and silence-based baselines in real time. Our results also highlight the importance of multimodal input and context length in effectively deciding when to speak.
中文: EgoSpeak是一种新颖的实时语音起始预测框架,通过整合第一人称视角、RGB处理、在线及未剪辑视频处理能力,弥合了实验设置与自然对话之间的差距,并在多模态实验中优于基线方法。
English: EgoSpeak is a novel framework that predicts speech initiation in real-time from egocentric video, bridging experimental setups and natural conversations by integrating first-person perspective, RGB processing, online and untrimmed video capabilities, and outperforms baselines in multimodal experiments.

Authors:Zhe Huang, Shuo Wang, Yongcai Wang, Lei Wang
Title: CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection
Abstract:
Collaborative 3D object detection holds significant importance in the field of autonomous driving, as it greatly enhances the perception capabilities of each individual agent by facilitating information exchange among multiple agents. However, in practice, due to pose estimation errors and time delays, the fusion of information across agents often results in feature representations with spatial and temporal noise, leading to detection errors. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to explore the use of diffusion models to address the noise problem between multi-agent systems. In this work, we propose CoDiff, a novel robust collaborative perception framework that leverages the potential of diffusion models to generate more comprehensive and clearer feature representations. To the best of our knowledge, this is the first work to apply diffusion models to multi-agent collaborative perception. Specifically, we project high-dimensional feature maps into the latent space of a powerful pre-trained autoencoder. Within this space, individual agent information serves as a condition to guide the diffusion model's sampling. This process denoises coarse feature maps and progressively refines the fused features. Experimental studies on both simulated and real-world datasets demonstrate that the proposed framework CoDiff consistently outperforms existing relevant methods in terms of collaborative object detection performance, and exhibits the desired robustness when agents' pose and delay information contains high levels of noise. The code is released at https://github.com/HuangZhe885/CoDiff
中文摘要:CoDiff是一种创新的协作感知框架,利用扩散模型对多智能体系统中的噪声特征进行去噪和优化,显著提升了自动驾驶中3D物体检测的准确性和对姿态误差及延迟的鲁棒性。
English Summary: Collaborative 3D object detection in autonomous driving is enhanced by CoDiff, a novel framework that uses diffusion models to denoise and refine multi-agent features, improving detection accuracy and robustness against pose and delay errors.

Authors:Zhiyu Zhu, Zhibo Jin, Jiayu Zhang, Nan Yang, Jiahao Huang, Jianlong Zhou, Fang Chen
Title: Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability
Abstract:
The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the Narrowing Information Bottleneck Theory, a novel framework that fundamentally redefines the traditional bottleneck approach. This theory is specifically designed to satisfy contemporary attribution axioms, providing a more robust and reliable solution for improving the interpretability of multimodal models. In our experiments, compared to state-of-the-art methods, our approach enhances image interpretability by an average of 9%, text interpretability by an average of 58.83%, and accelerates processing speed by 63.95%. Our code is publicly accessible at https://github.com/LMBTough/NIB.
Chinese: 提出的“窄化信息瓶颈理论”通过重新定义传统瓶颈方法,显著提升了CLIP模型的可解释性,在图像和文本理解方面取得重大改进,同时大幅提高了处理速度。
English: The proposed Narrowing Information Bottleneck Theory enhances CLIP's interpretability by redefining traditional bottleneck methods, achieving significant improvements in image and text interpretability along with faster processing speeds.

Authors:Insu Han, Zeliang Zhang, Zhiyuan Wang, Yifan Zhu, Susan Liang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao
Title: CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance across diverse applications. However, their computational overhead during deployment remains a critical bottleneck. While Key-Value (KV) caching effectively trades memory for computation to enhance inference efficiency, the growing memory footprint from extensive KV caches significantly reduces throughput and restricts prolonged deployment on memory-constrained GPU devices. To address this challenge, we propose CalibQuant, a simple yet highly effective visual quantization strategy that drastically reduces both memory and computational overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, complemented by novel post-scaling and calibration techniques tailored to the intrinsic patterns of KV caches, thereby ensuring high efficiency without compromising model performance. Leveraging Triton for runtime optimization, we achieve a 10x throughput increase on InternVL models. Our method is designed to be plug-and-play, seamlessly integrating with various existing MLLMs without requiring architectural changes. Extensive experiments confirm that our approach significantly reduces memory usage while maintaining computational efficiency and preserving multimodal capabilities. Codes are available at https://github.com/insuhan/calibquant.
中文:CalibQuant提出了一种极端的1位量化策略,通过定制化校准技术显著降低多模态大语言模型的内存和计算开销,在保持性能的同时实现了10倍吞吐量提升,且无需修改模型架构。
English: CalibQuant introduces an extreme 1-bit quantization strategy with specialized calibration techniques to drastically reduce memory and computational overhead in Multimodal Large Language Models while maintaining performance, achieving a 10x throughput increase without architectural modifications.
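The core memory saving can be pictured with a bare-bones 1-bit quantizer for cached keys/values: store the sign of each element plus a per-channel scale obtained by a simple calibration (the mean absolute value, which minimizes L2 error for a symmetric {-s, +s} codebook). This is only a sketch of the general idea; CalibQuant's post-scaling and calibration techniques go beyond it.

```python
# Minimal 1-bit KV quantization sketch with per-channel scale calibration.
import torch

def quantize_1bit(x):
    # x: (..., seq, head_dim). Store one bit per element plus a per-channel scale.
    sign = (x >= 0)                                  # 1 bit per element
    scale = x.abs().mean(dim=-2, keepdim=True)       # per-channel calibration over the sequence
    return sign, scale

def dequantize_1bit(sign, scale):
    return torch.where(sign, scale, -scale)

k = torch.randn(1, 8, 1024, 64)                      # (batch, heads, seq, head_dim)
sign, scale = quantize_1bit(k)
k_hat = dequantize_1bit(sign, scale)                 # reconstructed keys for attention
```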

Authors:Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, Dacheng Tao
Title: A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations
Abstract:
With the rapid advancement of Large Vision-Language Models (LVLMs), ensuring their safety has emerged as a crucial area of research. This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods. We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs and the corresponding mitigation strategies. Through an analysis of the LVLM lifecycle, we introduce a classification framework that distinguishes between inference and training phases, with further subcategories to provide deeper insights. Furthermore, we highlight limitations in existing research and outline future directions aimed at strengthening the robustness of LVLMs. As part of our research, we conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results. Our findings provide strategic recommendations for advancing LVLM safety and ensuring their secure and reliable deployment in high-stakes, real-world applications. This survey aims to serve as a cornerstone for future research, facilitating the development of models that not only push the boundaries of multimodal intelligence but also adhere to the highest standards of security and ethical integrity. Furthermore, to aid the growing research in this field, we have created a public repository to continuously compile and update the latest work on LVLM safety: https://github.com/XuankunRong/Awesome-LVLM-Safety .
中文摘要:本综述通过统一框架全面分析大型视觉语言模型的安全性,涵盖攻击、防御和评估方法,同时提供对Deepseek Janus-Pro的安全评估,并指出加强模型鲁棒性的未来研究方向。
English Summary: This survey comprehensively analyzes the safety of Large Vision-Language Models by examining attacks, defenses, and evaluation methods through a unified framework, while also providing safety assessments of Deepseek Janus-Pro and outlining future research directions to enhance model robustness.

Authors:Dong Chen, Zhengqing Hu, Peiguang Fan, Yueting Zhuang, Yafei Li, Qidong Liu, Xiaoheng Jiang, Mingliang Xu
Title: KKA: Improving Vision Anomaly Detection through Anomaly-related Knowledge from Large Language Models
Abstract:
Vision anomaly detection, particularly in unsupervised settings, often struggles to distinguish between normal samples and anomalies due to the wide variability in anomalies. Recently, an increasing number of studies have focused on generating anomalies to help detectors learn more effective boundaries between normal samples and anomalies. However, as the generated anomalies are often derived from random factors, they frequently lack realism. Additionally, randomly generated anomalies typically offer limited support in constructing effective boundaries, as most differ substantially from normal samples and lie far from the boundary. To address these challenges, we propose Key Knowledge Augmentation (KKA), a method that extracts anomaly-related knowledge from large language models (LLMs). More specifically, KKA leverages the extensive prior knowledge of LLMs to generate meaningful anomalies based on normal samples. Then, KKA classifies the generated anomalies as easy anomalies and hard anomalies according to their similarity to normal samples. Easy anomalies exhibit significant differences from normal samples, whereas hard anomalies closely resemble normal samples. KKA iteratively updates the generated anomalies, gradually increasing the proportion of hard anomalies to enable the detector to learn a more effective boundary. Experimental results show that the proposed method significantly improves the performance of various vision anomaly detectors while maintaining low generation costs. The code for CMG can be found at https://github.com/Anfeather/KKA.
中文: 提出的关键知识增强方法利用大型语言模型生成与正常样本高度相似的逼真困难异常,通过迭代优化帮助视觉异常检测器学习更有效的边界,从而显著提升检测性能。
English: The proposed Key Knowledge Augmentation (KKA) method leverages large language models to generate realistic hard anomalies that closely resemble normal samples, iteratively refining them to help vision anomaly detectors learn more effective boundaries and significantly improve performance.
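The easy/hard split and the curriculum described above can be sketched as follows, assuming the LLM-generated anomalies have already been embedded by some feature extractor; the quantile-based split and the linear schedule are illustrative choices, not the paper's exact rules.

```python
# Sketch: rank generated anomalies by similarity to normals, then gradually
# increase the share of hard (normal-like) anomalies across training rounds.
import torch
import torch.nn.functional as F

def split_easy_hard(anomaly_feats, normal_feats, hard_quantile=0.5):
    a = F.normalize(anomaly_feats, dim=1)
    n = F.normalize(normal_feats, dim=1)
    sim = (a @ n.T).max(dim=1).values            # similarity to the closest normal sample
    thr = sim.quantile(hard_quantile)
    hard = sim >= thr                            # close to normals -> hard anomalies
    return anomaly_feats[~hard], anomaly_feats[hard]

def curriculum_batch(easy, hard, round_idx, total_rounds, batch_size=64):
    # Illustrative schedule: linearly raise the proportion of hard anomalies per round.
    p_hard = min(1.0, 0.2 + 0.8 * round_idx / max(1, total_rounds - 1))
    n_hard = min(len(hard), int(p_hard * batch_size))
    n_easy = min(len(easy), batch_size - n_hard)
    return torch.cat([hard[:n_hard], easy[:n_easy]], dim=0)
```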

Authors:Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Title: Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
Abstract:
Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models' capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. Our code is available at: \url{https://github.com/mbzuai-oryx/TimeTravel}.
中文摘要:TimeTravel基准提供了一个涵盖多文化历史文物的标准化数据集和评估框架,旨在推动人工智能在文化遗产分析和历史研究中的应用,确保技术进步对历史发现与保护做出实质性贡献。
English Summary: The TimeTravel benchmark introduces a comprehensive dataset and evaluation framework to enhance AI models' capabilities in analyzing historical artifacts, aiming to integrate AI as a reliable tool for cultural heritage preservation and historical research.

Authors:Yuming Yang, Jiang Zhong, Li Jin, Jingwang Huang, Jingpeng Gao, Qing Liu, Yang Bai, Jingyuan Zhang, Rui Jiang, Kaiwen Wei
Title: Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework
Abstract:
Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggle in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate consistent text-over-visual modality bias during Chart-based MRAG reasoning. The CHARGE and Chart-MRAG Bench are released at https://github.com/Nomothings/CHARGE.git.
中文摘要:本文提出了基于图表的多模态检索增强生成新任务,通过CHARGE框架生成评估数据并构建Chart-MRAG基准测试集,揭示了现有方法在图表检索性能差和文本模态偏好等关键缺陷。
English Summary: This paper introduces Chart-based MRAG, a novel task addressing the gap in multimodal benchmarks by proposing the CHARGE framework to generate evaluation data and constructing Chart-MRAG Bench, revealing critical limitations in current methods including poor chart retrieval performance and text-biased reasoning.

Authors:Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
Title: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Abstract:
Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.

Authors:Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Title: Dynamic Concepts Personalization from Single Videos
Abstract:
Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.
中文: Set-and-Sequence框架通过无序帧学习外观特征和序列训练捕捉运动动态的两阶段方法,实现了文本到视频模型中动态概念的个性化,为动态概念编辑设立了新标准。
English: The Set-and-Sequence framework introduces a novel two-stage approach for personalizing text-to-video models by first learning appearance through unordered frames and then capturing motion dynamics via sequential training, enabling unprecedented editability of dynamic concepts.

Authors:Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
Title: LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Abstract:
Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
中文:现有大型视觉语言模型因缺乏长输出示例而难以生成连贯长文本,我们通过LongWriter-V-22k数据集和IterDPO方法提升了生成长度与保真度,使7B参数模型在性能上超越GPT-4o等大型专有模型。
English: Existing LVLMs struggle with long coherent outputs due to lacking long output examples in SFT, so we introduce LongWriter-V-22k dataset and IterDPO method to enhance generation fidelity and length, achieving superior performance with a 7B model over larger proprietary models.
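A minimal sketch of the IterDPO idea, splitting a long output into segments and turning iteratively corrected segments into DPO preference pairs, is shown below. `correct_segment` stands in for whatever model or judge produces the improved segment; the word-count segmentation and the pair format are assumptions for illustration.

```python
# Sketch: build segment-level (chosen, rejected) pairs from one long output.
def build_iterdpo_pairs(prompt, long_output, correct_segment, seg_len=500):
    words = long_output.split()
    segments = [" ".join(words[i:i + seg_len]) for i in range(0, len(words), seg_len)]
    pairs, context = [], prompt
    for seg in segments:
        better = correct_segment(context, seg)          # iterative correction of this segment
        pairs.append({"prompt": context, "chosen": better, "rejected": seg})
        context = context + "\n" + better               # continue from the corrected text
    return pairs
```

Keeping the comparison at segment granularity is what makes preference collection affordable for outputs of several thousand words, since no annotator or judge has to compare two full-length documents at once.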

Authors:Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin
Title: Improving the Diffusability of Autoencoders
Abstract:
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K $256^2$ and FVD by at least 44% for video generation on Kinetics-700 $17 \times 256^2$. The source code is available at https://github.com/snap-research/diffusability.
中文: 潜在扩散模型因自编码器潜在空间中的过度高频分量而影响生成质量,通过提出的尺度等变性正则化方法,仅需少量调整即可显著提升图像和视频生成的性能指标。
English: Latent diffusion models face quality issues due to high-frequency components in autoencoder latent spaces, which are mitigated by a proposed scale equivariance regularization that significantly improves image and video generation metrics with minimal adjustments.
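One way to picture a scale-equivariance regularizer on the decoder is sketched below: decoding a downscaled latent should match the downscaled full-resolution reconstruction, which penalizes latent content that only encodes spurious high frequencies. This is a hedged reading of the abstract, assuming a fully convolutional decoder that accepts arbitrary latent sizes; the paper's exact loss may differ.

```python
# Illustrative scale-equivariance regularizer for an autoencoder decoder.
import torch
import torch.nn.functional as F

def scale_equivariance_loss(decoder, latent, factor=0.5):
    full = decoder(latent)                                # (B, 3, H, W) reconstruction
    small_latent = F.interpolate(latent, scale_factor=factor,
                                 mode="bilinear", align_corners=False)
    decoded_small = decoder(small_latent)                 # decode the downscaled latent
    target_small = F.interpolate(full, scale_factor=factor,
                                 mode="bilinear", align_corners=False)
    return F.l1_loss(decoded_small, target_small.detach())
```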

Authors:Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue
Title: A Survey on Text-Driven 360-Degree Panorama Generation
Abstract:
The advent of text-driven 360-degree panorama generation, enabling the synthesis of 360-degree panoramic images directly from textual descriptions, marks a transformative advancement in immersive visual content creation. This innovation significantly simplifies the traditionally complex process of producing such content. Recent progress in text-to-image diffusion models has accelerated the rapid development in this emerging field. This survey presents a comprehensive review of text-driven 360-degree panorama generation, offering an in-depth analysis of state-of-the-art algorithms and their expanding applications in 360-degree 3D scene generation. Furthermore, we critically examine current limitations and propose promising directions for future research. A curated project page with relevant resources and research papers is available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/.

Authors:Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
Title: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Abstract:
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
中文: SigLIP 2 作为新一代多语言视觉语言编码器,融合了多种先进技术,在核心能力、定位任务和多语言公平性方面均超越前代,并提供多种可扩展的模型尺寸。
English: SigLIP 2 is an enhanced multilingual vision-language encoder that integrates multiple advanced techniques to surpass its predecessor in core capabilities, localization tasks, and multilingual fairness while offering scalable model sizes.

Authors:Hongji Yang, Wencheng Han, Yucheng Zhou, Jianbing Shen
Title: DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
Abstract:
In this paper, we introduce DC (Decouple)-ControlNet, a highly flexible and precisely controllable framework for multi-condition image generation. The core idea behind DC-ControlNet is to decouple control conditions, transforming global control into a hierarchical system that integrates distinct elements, contents, and layouts. This enables users to mix these individual conditions with greater flexibility, leading to more efficient and accurate image generation control. Previous ControlNet-based models rely solely on global conditions, which affect the entire image and lack the ability to exert element- or region-specific control. This limitation reduces flexibility and can cause condition misunderstandings in multi-conditional image generation. To address these challenges, we propose both Intra-Element and Inter-Element Controllers in DC-ControlNet. The Intra-Element Controller handles different types of control signals within individual elements, accurately describing the content and layout characteristics of the object. For interactions between elements, we introduce the Inter-Element Controller, which accurately handles multi-element interactions and occlusion based on user-defined relationships. Extensive evaluations show that DC-ControlNet significantly outperforms existing ControlNet models and Layout-to-Image generative models in terms of control flexibility and precision in multi-condition control. Our project website is available at: https://um-lab.github.io/DC-ControlNet/
中文: DC-ControlNet提出了一种解耦的多条件图像生成框架,通过将控制条件分解为层级化的元素内与元素间控制器,显著提升了控制的灵活性和精确度。
English: DC-ControlNet introduces a decoupled framework for multi-condition image generation, enhancing flexibility and precision by separating control conditions into hierarchical intra-element and inter-element controllers.

Authors:Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa
Title: Harnessing PDF Data for Improving Japanese Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
Chinese: 本研究开发了一种从日语PDF中自动提取图文对的流程,利用这一未充分开发的资源显著提升了日语大型多模态模型的性能,在基准测试中实现了最高13.8%的性能提升。
English: This study introduces an automated pipeline to extract image-text pairs from Japanese PDFs, significantly enhancing the performance of Japanese Large Multimodal Models by leveraging this underutilized resource and achieving up to 13.8% improvement on benchmarks.

Authors:Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari
Title: MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders
Abstract:
Medical images are acquired at high resolutions with large fields of view in order to capture fine-grained features necessary for clinical decision-making. Consequently, training deep learning models on medical images can incur large computational costs. In this work, we address the challenge of downsizing medical images in order to improve downstream computational efficiency while preserving clinically-relevant features. We introduce MedVAE, a family of six large-scale 2D and 3D autoencoders capable of encoding medical images as downsized latent representations and decoding latent representations back to high-resolution images. We train MedVAE autoencoders using a novel two-stage training approach with 1,052,730 medical images. Across diverse tasks obtained from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent representations in place of high-resolution images when training downstream models can lead to efficiency benefits (up to 70x improvement in throughput) while simultaneously preserving clinically-relevant features and (2) MedVAE can decode latent representations back to high-resolution images with high fidelity. Our work demonstrates that large-scale, generalizable autoencoders can help address critical efficiency challenges in the medical domain. Our code is available at https://github.com/StanfordMIMI/MedVAE.
中文: MedVAE提出了一系列自编码器,可将医学图像压缩为高效潜在表征,在保持临床特征和高保真重建的同时,实现高达70倍的计算吞吐量提升。
English: MedVAE introduces a family of autoencoders that compress medical images into efficient latent representations, enabling up to 70x computational throughput improvement while preserving clinical features and high-fidelity reconstruction.
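The intended workflow, encode once and then train downstream models on the downsized latents instead of full-resolution pixels, can be sketched as below. The `autoencoder.encode` interface and the simple full-batch training loop are placeholders for illustration, not the MedVAE API.

```python
# Sketch: cache downsized latents, then train a downstream classifier on them.
import torch

@torch.no_grad()
def encode_dataset(autoencoder, loader):
    autoencoder.eval()
    latents, labels = [], []
    for imgs, y in loader:
        latents.append(autoencoder.encode(imgs))    # downsized latent representation
        labels.append(y)
    return torch.cat(latents), torch.cat(labels)

def train_on_latents(classifier, latents, labels, epochs=10, lr=1e-3):
    # classifier is any module that accepts the latent shape (e.g., a small CNN or MLP).
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(classifier(latents), labels)
        loss.backward()
        opt.step()
    return classifier
```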

Authors:Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber
Title: NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Abstract:
Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
Chinese: 本研究提出了Navig图像地理定位框架,利用高质量数据集NaviClues通过语言推理增强分析能力,仅需少量训练样本即可将平均距离误差降低14%,优于现有最优模型。
English: The study introduces Navig, a novel image geo-localization framework that leverages a high-quality dataset called NaviClues to enhance reasoning with language, reducing average distance error by 14% over previous models with minimal training data.

Authors:Jiangyuan Liu, Hongxuan Ma, Yuxin Guo, Yuhao Zhao, Chi Zhang, Wei Sui, Wei Zou
Title: Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion
Abstract:
Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting and estimating the depth of transparent objects remain challenging due to complex optical properties. Existing methods primarily delve into only one task using extra inputs or specialized sensors, neglecting the valuable interactions among tasks and the subsequent refinement process, leading to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects, with only a single-image input. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. In addition, drawing inspiration from human perception of objects, we further incorporate an iterative strategy, which progressively refines initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input. Codes and models are publicly available at https://github.com/L-J-Yuan/MODEST.
Chinese Summary: 本文提出一种单目视觉框架,首次仅用单张RGB图像就实现了透明物体的精确分割与深度估计,通过多尺度融合模块和迭代优化策略,在合成与真实数据集上以38.8%-46.2%的显著优势超越现有最佳方法。
English Summary: This paper introduces a monocular framework that simultaneously excels in transparent object segmentation and depth estimation using only a single RGB image, achieving a 38.8%-46.2% performance improvement over existing methods through novel multi-scale fusion and iterative refinement strategies.

Authors:Jiahao Qi, Chuanhong Zhou, Xingyue Liu, Chen Chen, Dehui Zhu, Kangcheng Bin, Ping Zhong
Title: Nearshore Underwater Target Detection Meets UAV-borne Hyperspectral Remote Sensing: A Novel Hybrid-level Contrastive Learning Framework and Benchmark Dataset
Abstract:
UAV-borne hyperspectral remote sensing has emerged as a promising approach for underwater target detection (UTD). However, its effectiveness is hindered by spectral distortions in nearshore environments, which compromise the accuracy of traditional hyperspectral UTD (HUTD) methods that rely on bathymetric models. These distortions lead to significant uncertainty in target and background spectra, challenging the detection process. To address this, we propose the Hyperspectral Underwater Contrastive Learning Network (HUCLNet), a novel framework that integrates contrastive learning with a self-paced learning paradigm for robust HUTD in nearshore regions. HUCLNet extracts discriminative features from distorted hyperspectral data through contrastive learning, while the self-paced learning strategy selectively prioritizes the most informative samples. Additionally, a reliability-guided clustering strategy enhances the robustness of learned representations. To evaluate the method's effectiveness, we construct a novel nearshore HUTD benchmark dataset, ATR2-HUTD, covering three diverse scenarios with varying water types, turbidity, and target types. Extensive experiments demonstrate that HUCLNet significantly outperforms state-of-the-art methods. The dataset and code will be publicly available at: https://github.com/qjh1996/HUTD
Chinese: 提出的高光谱水下对比学习网络(HUCLNet)通过结合对比学习和自步学习,有效解决了近岸水下目标检测中的光谱失真问题,在新型ATR2-HUTD基准数据集上显著优于现有方法。
English: The proposed Hyperspectral Underwater Contrastive Learning Network (HUCLNet) effectively addresses spectral distortions in nearshore underwater target detection by integrating contrastive learning and self-paced learning, significantly outperforming existing methods on the new ATR2-HUTD benchmark dataset.

Authors:Chengyu Fang, Chunming He, Longxiang Tang, Yuelin Zhang, Chenyang Zhu, Yuqi Shen, Chubin Chen, Guoxia Xu, Xiu Li
Title: Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well
Abstract:
Camouflaged Object Segmentation (COS) remains challenging because camouflaged objects exhibit only subtle visual differences from their backgrounds and single-modality RGB methods provide limited cues, leading researchers to explore multimodal data to improve segmentation accuracy. In this work, we present MultiCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. MultiCOS comprises two modules: Bi-space Fusion Segmentor (BFSer), which employs a state space and a latent space fusion mechanism to integrate cross-modal features within a shared representation and employs a fusion-feedback mechanism to refine context-specific features, and Cross-modal Knowledge Learner (CKLer), which leverages external multimodal datasets to generate pseudo-modal inputs and establish cross-modal semantic associations, transferring knowledge to COS models when real multimodal pairs are missing. When real multimodal COS data are unavailable, CKLer yields additional segmentation gains using only non-COS multimodal sources. Experiments on standard COS benchmarks show that BFSer outperforms existing multimodal baselines with both real and pseudo-modal data. Code will be released at \href{https://github.com/cnyvfang/MultiCOS}{GitHub}.
Chinese: MultiCOS是一种新颖框架,通过双空间融合分割器整合跨模态特征,并利用跨模态知识学习器借助外部数据集提升伪装物体分割性能,即使在缺乏真实多模态数据时也能实现精度提升。
English: MultiCOS is a novel framework that enhances Camouflaged Object Segmentation by integrating cross-modal features through its Bi-space Fusion Segmentor and leveraging external datasets via the Cross-modal Knowledge Learner to improve accuracy even without real multimodal data.

Authors:Haeyun Choi, Heemin Yang, Janghyeok Han, Sunghyun Cho
Title: Exploiting Deblurring Networks for Radiance Fields
Abstract:
In this paper, we propose DeepDeblurRF, a novel radiance field deblurring approach that can synthesize high-quality novel views from blurred training views with significantly reduced training time. DeepDeblurRF leverages deep neural network (DNN)-based deblurring modules to enjoy their deblurring performance and computational efficiency. To effectively combine DNN-based deblurring and radiance field construction, we propose a novel radiance field (RF)-guided deblurring and an iterative framework that performs RF-guided deblurring and radiance field construction in an alternating manner. Moreover, DeepDeblurRF is compatible with various scene representations, such as voxel grids and 3D Gaussians, expanding its applicability. We also present BlurRF-Synth, the first large-scale synthetic dataset for training radiance field deblurring frameworks. We conduct extensive experiments on both camera motion blur and defocus blur, demonstrating that DeepDeblurRF achieves state-of-the-art novel-view synthesis quality with significantly reduced training time.
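The alternation between DNN-based deblurring and radiance-field construction reads as a simple outer loop; a high-level sketch is given below, with every component (`deblur_net`, `build_rf`, `render`) as a placeholder callable rather than the paper's actual modules.

```python
# High-level sketch of alternating RF-guided deblurring and radiance-field fitting.
def train_deepdeblur_rf(blurred_views, poses, deblur_net, build_rf, render, rounds=3):
    views = [deblur_net(v, guidance=None) for v in blurred_views]   # initial DNN deblurring
    rf = None
    for _ in range(rounds):
        rf = build_rf(views, poses)                  # fit the radiance field to current views
        renders = [render(rf, p) for p in poses]     # RF renderings serve as guidance
        views = [deblur_net(v, guidance=r) for v, r in zip(blurred_views, renders)]
    return rf
```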

Authors:Paul Friedrich, Florentin Bieder, Julian McGinnis, Julia Wolleb, Daniel Rueckert, Philippe C. Cattin
Title: MedFuncta: A Unified Framework for Learning Efficient Medical Neural Fields
Abstract:
Research in medical imaging primarily focuses on discrete data representations that poorly scale with grid resolution and fail to capture the often continuous nature of the underlying signal. Neural Fields (NFs) offer a powerful alternative by modeling data as continuous functions. While single-instance NFs have successfully been applied in medical contexts, extending them to large-scale medical datasets remains an open challenge. We therefore introduce MedFuncta, a unified framework for large-scale NF training on diverse medical signals. Building on Functa, our approach encodes data into a unified representation, namely a 1D latent vector, that modulates a shared, meta-learned NF, enabling generalization across a dataset. We revisit common design choices, introducing a non-constant frequency parameter $ω$ in widely used SIREN activations, and establish a connection between this $ω$-schedule and layer-wise learning rates, relating our findings to recent work in theoretical learning dynamics. We additionally introduce a scalable meta-learning strategy for shared network learning that employs sparse supervision during training, thereby reducing memory consumption and computational overhead while maintaining competitive performance. Finally, we evaluate MedFuncta across a diverse range of medical datasets and show how to solve relevant downstream tasks on our neural data representation. To promote further research in this direction, we release our code, model weights and the first large-scale dataset - MedNF - containing > 500 k latent vectors for multi-instance medical NFs.
Chinese: 医学影像研究常受限于离散数据表示,因此MedFuncta提出了一个统一框架,利用神经场建模连续函数,并通过元学习和优化激活函数实现在大规模医学数据集上的泛化能力。
English: Medical imaging research often struggles with discrete data representations, so MedFuncta introduces a unified framework using neural fields to model continuous functions and enable generalization across large-scale medical datasets through meta-learning and optimized activation functions.
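A short sketch can make the ω-schedule and latent modulation concrete. The snippet below is a hedged illustration, not the released MedFuncta code: it assumes a Functa-style shift modulation from a 1D latent and an example per-layer ω schedule; layer sizes and the output dimensionality are placeholders.

```python
# Minimal sketch (assumptions, not the released MedFuncta code) of a SIREN-style
# MLP whose frequency parameter omega varies per layer and whose hidden features
# are shift-modulated by a 1D latent vector, in the spirit of Functa.
import math
import torch
import torch.nn as nn

class ModulatedSirenLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega, latent_dim, is_first=False):
        super().__init__()
        self.omega = omega
        self.linear = nn.Linear(in_dim, out_dim)
        self.shift = nn.Linear(latent_dim, out_dim)   # latent -> per-unit shift
        # Standard SIREN initialisation, scaled by omega for hidden layers.
        bound = 1.0 / in_dim if is_first else math.sqrt(6.0 / in_dim) / omega
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x, latent):
        return torch.sin(self.omega * self.linear(x) + self.shift(latent))

class MedFunctaLikeNF(nn.Module):
    def __init__(self, coord_dim=2, hidden=128, depth=4, latent_dim=256,
                 omegas=(30.0, 20.0, 10.0, 5.0)):          # example omega schedule
        super().__init__()
        dims = [coord_dim] + [hidden] * depth
        self.layers = nn.ModuleList(
            ModulatedSirenLayer(dims[i], dims[i + 1], omegas[i], latent_dim,
                                is_first=(i == 0))
            for i in range(depth))
        self.head = nn.Linear(hidden, 1)                    # e.g. image intensity

    def forward(self, coords, latent):
        h = coords
        for layer in self.layers:
            h = layer(h, latent)
        return self.head(h)

# Toy usage: one shared network, one latent per instance.
net = MedFunctaLikeNF()
coords = torch.rand(1024, 2) * 2 - 1
latent = torch.zeros(256)                                   # optimised per sample
pred = net(coords, latent)                                   # (1024, 1)
```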

Authors:Ke Cao, Jing Wang, Ao Ma, Jiasong Feng, Zhanjie Zhang, Xuanhua He, Shanyuan Liu, Bo Cheng, Dawei Leng, Yuhui Yin, Jie Zhang
Title: RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
Abstract:
The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score"-i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta.
Chinese: 扩散变换器在文本到图像和视频生成中发挥关键作用,但现有方法因参数和计算开销大且资源分配低效而受限;我们提出的RelaCtrl框架通过评估控制信息在各层的相关性并优化控制层结构与计算模块,仅需15%的参数和计算量即可实现卓越性能。
English: The Diffusion Transformer significantly advances text-to-image and video generation but faces efficiency challenges due to excessive parameters and poor resource allocation, which the proposed RelaCtrl framework addresses by optimizing control layer relevance and computational structure for high performance with minimal resources.
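The relevance-scoring step can be sketched as a simple leave-one-layer-out loop. The code below is an assumption about the procedure (the exact quality and control metrics are not specified here); generate, quality, and control_fidelity are hypothetical callables standing in for the diffusion pipeline and its evaluators.

```python
# Minimal sketch (an assumption about the procedure, not the official code) of a
# "ControlNet Relevance Score": measure how much skipping one control layer hurts
# generation quality and control fidelity, then rank layers by that impact.
from typing import Callable, List

def relevance_scores(num_control_layers: int,
                     generate: Callable[[List[int]], object],
                     quality: Callable[[object], float],
                     control_fidelity: Callable[[object], float],
                     alpha: float = 0.5) -> List[float]:
    """generate(skip) runs inference while skipping the listed control layers."""
    ref = generate([])                               # full model as the reference
    ref_score = alpha * quality(ref) + (1 - alpha) * control_fidelity(ref)
    scores = []
    for layer in range(num_control_layers):
        out = generate([layer])                      # skip exactly one layer
        score = alpha * quality(out) + (1 - alpha) * control_fidelity(out)
        scores.append(ref_score - score)             # bigger drop => more relevant
    return scores

# Toy usage with stand-in callables (real ones would wrap the diffusion pipeline).
if __name__ == "__main__":
    fake = relevance_scores(
        num_control_layers=4,
        generate=lambda skip: skip,
        quality=lambda out: 1.0 - 0.1 * len(out),
        control_fidelity=lambda out: 1.0 - 0.2 * len(out))
    print(fake)   # layers with larger values would get heavier control blocks
```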

Authors:Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang
Title: PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
Abstract:
In the field of MLLM-based GUI agents, compared to smartphones, the PC scenario not only features a more complex interactive environment, but also involves more intricate intra- and inter-app workflows. To address these issues, we propose a hierarchical agent framework named PC-Agent. Specifically, from the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making processes into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) are set up for instruction decomposition, progress tracking and step-by-step decision-making respectively. Additionally, a Reflection agent is adopted to enable timely bottom-up error feedback and adjustment. We also introduce a new benchmark PC-Eval with 25 real-world complex instructions. Empirical results on PC-Eval show that our PC-Agent achieves a 32% absolute improvement of task success rate over previous state-of-the-art methods. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/PC-Agent.
中文:PC-Agent框架通过引入具有主动感知和反思机制的分层多智能体系统,解决了PC交互的复杂性,在新型PC-Eval基准测试中实现了任务成功率32%的绝对提升。
English: The PC-Agent framework addresses complex PC interactions by introducing a hierarchical multi-agent system with active perception and reflection mechanisms, achieving a 32% improvement in task success rate on the new PC-Eval benchmark.

Authors:Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang
Title: EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement
Abstract:
Over the past decade, generative models have achieved significant success in enhancing fundus images. However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are difficult to extend to downstream real-world clinical research (e.g., vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) Multi-dimensional clinically aligned downstream evaluation: in addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: we introduce a novel dataset that promotes comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at \url{https://github.com/Retinal-Research/EyeBench}
中文摘要:过去十年,生成模型在眼底图像增强方面取得显著进展,但其评估仍面临挑战,为此我们提出EyeBench综合基准,通过多维临床对齐评估和医学专家指导设计,使增强模型更贴合临床需求。
English Summary: Over the past decade, generative models have advanced fundus image enhancement, but their evaluation remains challenging, leading to the creation of EyeBench, a comprehensive benchmark that aligns models with clinical needs through multi-dimensional assessments and expert-guided protocols.

Authors:Jiayu Yang, Taizhang Shang, Weixuan Sun, Xibin Song, Ziang Cheng, Senbo Wang, Shenzhou Chen, Weizhe Liu, Hongdong Li, Pan Ji
Title: Pandora3D: A Comprehensive Framework for High-Quality 3D Shape and Texture Generation
Abstract:
This report presents a comprehensive framework for generating high-quality 3D shapes and textures from diverse input prompts, including single images, multi-view images, and text descriptions. The framework consists of 3D shape generation and texture generation. (1). The 3D shape generation pipeline employs a Variational Autoencoder (VAE) to encode implicit 3D geometries into a latent space and a diffusion network to generate latents conditioned on input prompts, with modifications to enhance model capacity. An alternative Artist-Created Mesh (AM) generation approach is also explored, yielding promising results for simpler geometries. (2). Texture generation involves a multi-stage process starting with frontal images generation followed by multi-view images generation, RGB-to-PBR texture conversion, and high-resolution multi-view texture refinement. A consistency scheduler is plugged into every stage, to enforce pixel-wise consistency among multi-view textures during inference, ensuring seamless integration. The pipeline demonstrates effective handling of diverse input formats, leveraging advanced neural architectures and novel methodologies to produce high-quality 3D content. This report details the system architecture, experimental results, and potential future directions to improve and expand the framework. The source code and pretrained weights are released at: https://github.com/Tencent/Tencent-XR-3DGen.
中文摘要:该框架通过基于变分自编码器的三维形状生成与多阶段纹理处理流程,结合一致性调度机制,能够从多种输入生成高质量的三维模型和纹理。
English Summary: This framework generates high-quality 3D shapes and textures from various inputs using a VAE-based shape generator with diffusion modeling and a multi-stage texture pipeline enhanced by consistency scheduling for seamless results.

Authors:Gengxu Li, Yuan Wu
Title: Asymmetric Co-Training for Source-Free Few-Shot Domain Adaptation
Abstract:
Source-free unsupervised domain adaptation (SFUDA) has gained significant attention as an alternative to traditional unsupervised domain adaptation (UDA), which relies on the constant availability of labeled source data. However, SFUDA approaches come with inherent limitations that are frequently overlooked. These challenges include performance degradation when the unlabeled target data fails to meet critical assumptions, such as having a closed-set label distribution identical to that of the source domain, or when sufficient unlabeled target data is unavailable, a common situation in real-world applications. To address these issues, we propose an asymmetric co-training (ACT) method specifically designed for the source-free few-shot domain adaptation (SFFSDA) scenario. SFFSDA presents a more practical alternative to SFUDA, as gathering a few labeled target instances is more feasible than acquiring large volumes of unlabeled target data in many real-world contexts. Our ACT method begins by employing a weak-strong augmentation to enhance data diversity. Then we use a two-step optimization process to train the target model. In the first step, we optimize the label smoothing cross-entropy loss, the entropy of the class-conditional distribution, and the reverse-entropy loss to bolster the model's discriminative ability while mitigating overfitting. The second step focuses on reducing redundancy in the output space by minimizing classifier determinacy disparity. Extensive experiments across four benchmarks demonstrate the superiority of our ACT approach, which outperforms state-of-the-art SFUDA methods and transfer learning techniques. Our findings suggest that adapting a source pre-trained model using only a small amount of labeled target data offers a practical and dependable solution. The code is available at https://github.com/gengxuli/ACT.
中文: 提出的非对称协同训练方法通过强弱数据增强和两步优化过程解决了无源无监督域适应的局限性,仅需少量目标域标注数据即可超越现有方法的性能表现。
English: The proposed asymmetric co-training (ACT) method addresses the limitations of source-free unsupervised domain adaptation by using weak-strong data augmentation and a two-step optimization process, demonstrating superior performance over existing methods with only minimal labeled target data.
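A rough sketch of the first optimization step may help. The snippet below assumes equal weighting of the terms and substitutes a plain per-sample entropy penalty for the paper's reverse-entropy loss, so it should be read as an approximation rather than the authors' objective.

```python
# Minimal sketch (assumptions, not the authors' exact formulation) of the first
# optimisation step: label-smoothing cross-entropy plus a diversity term on the
# mean class distribution. The paper's reverse-entropy loss is replaced here by a
# plain per-sample entropy (confidence) penalty as a stand-in.
import torch
import torch.nn.functional as F

def act_step_one_loss(logits, targets, smoothing=0.1, eps=1e-8):
    """logits: (N, C) target-domain predictions; targets: (N,) few-shot labels."""
    # 1) Label-smoothing cross-entropy on the few labeled target samples.
    ls_ce = F.cross_entropy(logits, targets, label_smoothing=smoothing)

    probs = logits.softmax(dim=1)
    # 2) Entropy of the mean (class-conditional) distribution: maximising it
    #    discourages collapse onto a few classes, so it enters with a minus sign.
    mean_probs = probs.mean(dim=0)
    marginal_entropy = -(mean_probs * (mean_probs + eps).log()).sum()

    # 3) Stand-in confidence term: minimise per-sample prediction entropy.
    sample_entropy = -(probs * (probs + eps).log()).sum(dim=1).mean()

    return ls_ce - marginal_entropy + sample_entropy

# Toy usage.
logits = torch.randn(16, 10, requires_grad=True)
targets = torch.randint(0, 10, (16,))
act_step_one_loss(logits, targets).backward()
```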

Authors:Michihiro Yasunaga, Luke Zettlemoyer, Marjan Ghazvininejad
Title: Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
Abstract:
Reward models play an essential role in training vision-language models (VLMs) by assessing output quality to enable aligning with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for evaluating multimodal reward models in VLMs. To address this gap, we introduce Multimodal RewardBench, an expert-annotated benchmark covering six domains: general correctness, preference, knowledge, reasoning, safety, and visual question-answering. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various VLMs. In evaluating a range of VLM judges, we find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy. Notably, most models struggle in the reasoning and safety domains. These findings suggest that Multimodal RewardBench offers a challenging testbed for advancing reward model development across multiple domains. We release the benchmark at https://github.com/facebookresearch/multimodal_rewardbench.
中文摘要:Multimodal RewardBench作为专家标注的多模态奖励模型评估基准涵盖六大领域,结果显示即使表现最佳的模型总体准确率仅为72%,且在推理和安全领域存在明显不足。
English Summary: Multimodal RewardBench is introduced as an expert-annotated benchmark to evaluate multimodal reward models across six domains, revealing that top models like Gemini 1.5 Pro and Claude 3.5 Sonnet achieve only 72% accuracy and struggle particularly in reasoning and safety.

Authors:Xiaoxing Liu, Zhilei Liu, Chongke Bi
Title: NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis
Abstract:
Talking head synthesis aims to synthesize a lip-synchronized talking head video from audio. Recently, the capability of NeRF to enhance the realism and texture details of synthesized talking heads has attracted the attention of researchers. However, most current NeRF methods based on audio are exclusively concerned with the rendering of frontal faces. These methods are unable to generate clear talking heads in novel views. Another prevalent challenge in current 3D talking head synthesis is the difficulty in aligning acoustic and visual spaces, which often results in suboptimal lip-syncing of the generated talking heads. To address these issues, we propose Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis (NeRF-3DTalker). Specifically, the proposed method employs 3D prior information to synthesize clear talking heads with free views. Additionally, we propose a 3D Prior Aided Audio Disentanglement module, which is designed to disentangle the audio into two distinct categories: features related to 3D-aware speech movements and features related to speaking style. Moreover, to reposition the generated frames that are distant from the speaker's motion space in the real space, we have devised a local-global Standardized Space. This method normalizes the irregular positions in the generated frames from both global and local semantic perspectives. Through comprehensive qualitative and quantitative experiments, it has been demonstrated that our NeRF-3DTalker outperforms the state of the art in synthesizing realistic talking head videos, exhibiting superior image quality and lip synchronization. Project page: https://nerf-3dtalker.github.io/NeRF-3Dtalker.

Authors:Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam
Title: PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery
Abstract:
Vision-Language Models (VLMs) in visual question answering (VQA) offer a unique opportunity to enhance intra-operative decision-making, promote intuitive interactions, and significantly advance surgical education. However, the development of VLMs for surgical VQA is challenging due to limited datasets and the risk of overfitting and catastrophic forgetting during full fine-tuning of pretrained weights. While parameter-efficient techniques like Low-Rank Adaptation (LoRA) and Matrix of Rank Adaptation (MoRA) address adaptation challenges, their uniform parameter distribution overlooks the feature hierarchy in deep networks, where earlier layers, which learn general features, require more parameters than later ones. This work introduces PitVQA++ with an open-ended PitVQA dataset and vector matrix-low-rank adaptation (Vector-MoLoRA), an innovative VLM fine-tuning approach for adapting GPT-2 to pituitary surgery. Open-Ended PitVQA comprises around 101,803 frames from 25 procedural videos with 745,972 question-answer sentence pairs, covering key surgical elements such as phase and step recognition, context understanding, tool detection, localization, and interactions recognition. Vector-MoLoRA incorporates the principles of LoRA and MoRA to develop a matrix-low-rank adaptation strategy that employs vector ranking to allocate more parameters to earlier layers, gradually reducing them in the later layers. Our approach, validated on the Open-Ended PitVQA and EndoVis18-VQA datasets, effectively mitigates catastrophic forgetting while significantly enhancing performance over recent baselines. Furthermore, our risk-coverage analysis highlights its enhanced reliability and trustworthiness in handling uncertain predictions. Our source code and dataset are available at~\url{https://github.com/HRL-Mike/PitVQA-Plus}.
中文: 本研究提出了PitVQA++,包含开放式手术视觉问答数据集和Vector-MoLoRA创新微调方法,通过分层参数分配策略有效提升视觉语言模型在手术应用中的性能,同时防止灾难性遗忘问题。
English: This research introduces PitVQA++, featuring an open-ended surgical visual question answering dataset and Vector-MoLoRA, a novel fine-tuning method that strategically allocates parameters across network layers to enhance performance while preventing catastrophic forgetting in vision-language models for surgical applications.
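The core idea of allocating larger adapter ranks to earlier layers can be sketched independently of the full Vector-MoLoRA recipe. The code below is an illustration under assumptions (a plain LoRA update and a linear rank schedule); layer counts, ranks, and dimensions are placeholders.

```python
# Minimal sketch (an illustration under assumptions, not the released Vector-MoLoRA
# code) of allocating more low-rank adapter parameters to earlier layers and fewer
# to later ones, applied to frozen linear projections.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update of a given rank."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                  # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def rank_schedule(num_layers, max_rank=32, min_rank=4):
    """Linearly decreasing ranks: earlier layers (general features) get more."""
    step = (max_rank - min_rank) / max(num_layers - 1, 1)
    return [round(max_rank - i * step) for i in range(num_layers)]

# Toy usage: wrap the per-layer projections of a frozen stack with decreasing ranks.
layers = [nn.Linear(768, 768) for _ in range(12)]
ranks = rank_schedule(len(layers))                       # e.g. [32, 29, ..., 4]
adapted = nn.ModuleList(LoRALinear(l, r) for l, r in zip(layers, ranks))
out = adapted[0](torch.randn(2, 768))
```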

Authors:Takahiko Furuya
Title: Token Adaptation via Side Graph Convolution for Efficient Fine-tuning of 3D Point Cloud Transformers
Abstract:
Parameter-efficient fine-tuning (PEFT) of pre-trained 3D point cloud Transformers has emerged as a promising technique for 3D point cloud analysis. While existing PEFT methods attempt to minimize the number of tunable parameters, they often suffer from high temporal and spatial computational costs during fine-tuning. This paper proposes a novel PEFT algorithm called Side Token Adaptation on a neighborhood Graph (STAG) to achieve superior temporal and spatial efficiency. STAG employs a graph convolutional side network operating in parallel with a frozen backbone Transformer to adapt tokens to downstream tasks. Through efficient graph convolution, parameter sharing, and reduced gradient computation, STAG significantly reduces both temporal and spatial costs for fine-tuning. We also present Point Cloud Classification 13 (PCC13), a new benchmark comprising diverse publicly available 3D point cloud datasets to facilitate comprehensive evaluation. Extensive experiments using multiple pre-trained models and PCC13 demonstrate the effectiveness of STAG. Specifically, STAG maintains classification accuracy comparable to existing methods while reducing tunable parameters to only 0.43M and achieving significant reductions in both computation time and memory consumption for fine-tuning. Code and benchmark will be available at: https://github.com/takahikof/STAG.
Chinese: 本文提出STAG参数高效微调方法,通过图卷积侧网络在保持分类精度的同时,显著降低了计算时间和内存消耗。
English: This paper introduces STAG, a parameter-efficient fine-tuning method that uses a graph convolutional side network to significantly reduce computational costs while maintaining classification accuracy comparable to existing approaches.
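A compact sketch of a graph-convolutional side branch over a frozen backbone is given below. It is not the official STAG implementation: the k-NN EdgeConv-style message passing, the adapter shared across blocks, and all dimensions are assumptions chosen to illustrate how only the side path receives gradients.

```python
# Minimal sketch (assumptions, not the official STAG implementation) of a side
# network that adapts frozen-backbone tokens with a simple k-NN graph convolution
# over point-cloud coordinates; only the side branch receives gradients.
import torch
import torch.nn as nn

class KNNGraphConv(nn.Module):
    def __init__(self, dim, k=8):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, tokens, xyz):
        """tokens: (B, N, D) features; xyz: (B, N, 3) token coordinates."""
        dist = torch.cdist(xyz, xyz)                        # (B, N, N)
        idx = dist.topk(self.k, largest=False).indices      # (B, N, k) neighbours
        b = torch.arange(tokens.size(0), device=tokens.device).view(-1, 1, 1)
        neigh = tokens[b, idx]                              # (B, N, k, D)
        center = tokens.unsqueeze(2).expand_as(neigh)
        # EdgeConv-style message: concat(center, neighbour - center), then max-pool.
        msg = self.proj(torch.cat([center, neigh - center], dim=-1))
        return msg.max(dim=2).values

class SideAdapter(nn.Module):
    def __init__(self, dim=384, k=8):
        super().__init__()
        self.gconv = KNNGraphConv(dim, k)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frozen_tokens, xyz, side_state):
        return self.norm(side_state + self.gconv(frozen_tokens + side_state, xyz))

# Toy usage: frozen backbone tokens from each block feed the trainable side branch;
# sharing one SideAdapter across blocks mirrors the parameter-sharing idea.
tokens_per_block = [torch.randn(2, 128, 384) for _ in range(3)]  # no grads needed
xyz = torch.rand(2, 128, 3)
side = SideAdapter()
state = torch.zeros(2, 128, 384)
for toks in tokens_per_block:
    state = side(toks, xyz, state)
```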

Authors:Rui Zhao, Zeyu Zhang, Yi Xu, Yi Yao, Yan Huang, Wenxin Zhang, Zirui Song, Xiuying Chen, Yang Zhao
Title: PedDet: Adaptive Spectral Optimization for Multimodal Pedestrian Detection
Abstract:
Pedestrian detection in intelligent transportation systems has made significant progress but faces two critical challenges: (1) insufficient fusion of complementary information between visible and infrared spectra, particularly in complex scenarios, and (2) sensitivity to illumination changes, such as low-light or overexposed conditions, leading to degraded performance. To address these issues, we propose PedDet, an adaptive spectral optimization complementarity framework specifically enhanced and optimized for multispectral pedestrian detection. PedDet introduces the Multi-scale Spectral Feature Perception Module (MSFPM) to adaptively fuse visible and infrared features, enhancing robustness and flexibility in feature extraction. Additionally, the Illumination Robustness Feature Decoupling Module (IRFDM) improves detection stability under varying lighting by decoupling pedestrian and background features. We further design a contrastive alignment to enhance intermodal feature discrimination. Experiments on LLVIP and MSDS datasets demonstrate that PedDet achieves state-of-the-art performance, improving the mAP by 6.6% with superior detection accuracy even in low-light conditions, marking a significant step forward for road safety. Code will be available at https://github.com/AIGeeksGroup/PedDet.
中文: PedDet提出了一种自适应光谱优化框架,通过多尺度特征融合和光照鲁棒性模块,在复杂光照条件下将检测性能提升6.6% mAP,实现了最先进的行人检测效果。
English: PedDet introduces an adaptive spectral optimization framework with multi-scale feature fusion and illumination robustness modules, achieving state-of-the-art pedestrian detection performance with 6.6% mAP improvement under challenging lighting conditions.

Authors:Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu
Title: Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
Abstract:
Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address the above challenge, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, and carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of synthetic data generation and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
中文: 该研究提出的视觉拒绝采样框架通过迭代生成基于专家定义概念的可验证视觉特征合成数据,并利用无奖励模型的筛选机制进行微调,有效提升了大规模多模态模型在专业视觉分类任务中的准确性和可解释性。
English: The proposed visual rejection sampling framework enhances Large Multimodal Models' fine-grained reasoning by iteratively generating and fine-tuning with synthetic data featuring human-verifiable visual features, significantly improving both accuracy and explainability in specialized visual tasks.
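The iterative synthesize-filter-finetune loop can be summarized in a few lines. The sketch below uses hypothetical callables (synthesize, score, finetune) in place of the actual model, concept-alignment check, and training step, and assumes a simple best-of-k selection rule.

```python
# Minimal sketch (assumptions about the selection rule) of the iterative loop: the
# model synthesises candidate explanations, a reward-model-free filter keeps the
# best candidate per image, and the kept pairs drive the next round of fine-tuning.
from typing import Callable, Iterable, List, Tuple

def rejection_sampling_rounds(images: Iterable,
                              synthesize: Callable[[object, int], List[str]],
                              score: Callable[[object, str], float],
                              finetune: Callable[[List[Tuple[object, str]]], None],
                              rounds: int = 3,
                              candidates_per_image: int = 4) -> None:
    """score() is any model-free check, e.g. agreement with expert-defined concepts."""
    for _ in range(rounds):
        kept = []
        for img in images:
            answers = synthesize(img, candidates_per_image)
            best = max(answers, key=lambda a: score(img, a))
            kept.append((img, best))
        finetune(kept)                       # next round starts from the updated model

# Toy usage with stand-in callables.
rejection_sampling_rounds(
    images=["img_0", "img_1"],
    synthesize=lambda img, k: [f"{img}-answer-{i}" for i in range(k)],
    score=lambda img, ans: len(ans),
    finetune=lambda pairs: None)
```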

Authors:Masane Fuchi, Tomohiro Takagi
Title: Erasing with Precision: Evaluating Specific Concept Erasure from Text-to-Image Generative Models
Abstract:
Studies have been conducted to prevent specific concepts from being generated from pretrained text-to-image generative models, achieving concept erasure in various ways. However, the performance evaluation of these studies is still largely reliant on visualization, with the superiority of studies often determined by human subjectivity. The metrics of quantitative evaluation also vary, making comprehensive comparisons difficult. We propose EraseEval, an evaluation method that differs from previous evaluation methods in that it involves three fundamental evaluation criteria: (1) how well the prompt containing the target concept is reflected, (2) to what extent concepts related to the erased concept can reduce the impact of the erased concept, and (3) whether other concepts are preserved. These criteria are evaluated and integrated into a single metric, such that a lower score is given if any of the evaluations are low, leading to a more robust assessment. We experimentally evaluated baseline concept erasure methods, organized their characteristics, and identified challenges with them. Even on these fundamental evaluation criteria, some concept erasure methods failed to achieve high scores, pointing toward future research directions for concept erasure methods. Our code is available at https://github.com/fmp453/erase-eval.
中文: 本文提出EraseEval评估框架,通过将三个核心标准整合为单一指标来系统评估文本到图像模型的概念消除效果,实验发现现有方法在此标准下存在不足,为未来研究指明了方向。
English: This paper introduces EraseEval, a novel evaluation framework for concept erasure in text-to-image models that integrates three key criteria into a single robust metric, addressing limitations in current visualization-dependent assessments and revealing challenges in existing methods through experimental analysis.
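How three criteria can be folded into a single score where any weak criterion dominates is easy to illustrate; the geometric mean below is one such aggregation and is an assumption, not necessarily the paper's exact formula.

```python
# Minimal sketch (the aggregation form is an assumption) of folding the three
# EraseEval criteria into one score such that a low value on any criterion pulls
# the overall score down, e.g. via a geometric mean.
def erase_eval_score(prompt_reflection: float,
                     related_concept_suppression: float,
                     other_concept_preservation: float) -> float:
    """Each criterion is assumed to be normalised to [0, 1]; higher is better."""
    crits = [prompt_reflection, related_concept_suppression, other_concept_preservation]
    score = 1.0
    for c in crits:
        score *= max(c, 0.0)
    return score ** (1.0 / len(crits))     # geometric mean: any near-zero term dominates

# Toy usage: strong erasure but poor preservation yields a low overall score.
print(erase_eval_score(0.9, 0.85, 0.2))   # ~0.53
print(erase_eval_score(0.9, 0.85, 0.9))   # ~0.88
```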

Authors:Eduard Chelebian, Pratiti Dasgupta, Zainalabedin Samadi, Carolina Wählby, Amjad Askary
Title: Segmentation-free integration of nuclei morphology and spatial transcriptomics for retinal images
Abstract:
This study introduces SEFI (SEgmentation-Free Integration), a novel method for integrating morphological features of cell nuclei with spatial transcriptomics data. Cell segmentation poses a significant challenge in the analysis of spatial transcriptomics data, as tissue-specific structural complexities and densely packed cells in certain regions make it difficult to develop a universal approach. SEFI addresses this by utilizing self-supervised learning to extract morphological features from fluorescent nuclear staining images, enhancing the clustering of gene expression data without requiring segmentation. We demonstrate SEFI on spatially resolved gene expression profiles of the developing retina, acquired using multiplexed single molecule Fluorescence In Situ Hybridization (smFISH). SEFI is publicly available at https://github.com/eduardchelebian/sefi.
中文: 本研究提出SEFI方法,通过自监督学习从荧光核染色图像中提取形态特征,无需细胞分割即可整合空间转录组数据,并在发育视网膜数据中验证了其提升基因表达聚类的效果。
English: This study presents SEFI, a segmentation-free method that integrates nuclear morphological features with spatial transcriptomics data through self-supervised learning, improving gene expression clustering without cell segmentation, as demonstrated on developing retina data.

Authors:Arjun Gupta, Rishik Sathua, Saurabh Gupta
Title: A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects
Abstract:
Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop training-free framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM employs an RGB-D wrist camera and uses visual servoing for control. Our novelty lies in the use of state-of-the-art vision models to reliably compute 3D targets from the wrist image for diverse tasks and under occlusion due to the end-effector. To mitigate occlusion artifacts, we employ vision models to out-paint the end-effector thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module to identify semantic targets (e.g. knobs) and point tracking methods can reliably track interaction sites indicated by user clicks. This training-free method obtains an 85% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method and an imitation learning baseline trained on 1000+ demonstrations by an absolute success rate of 50%.

Authors:Sara Dorfman, Dana Cohen-Bar, Rinon Gal, Daniel Cohen-Or
Title: IP-Composer: Semantic Composition of Visual Concepts
Abstract:
Content creators often draw inspiration from multiple visual sources, combining distinct elements to craft new compositions. Modern computational approaches now aim to emulate this fundamental creative process. Although recent diffusion models excel at text-guided compositional synthesis, text as a medium often lacks precise control over visual details. Image-based composition approaches can capture more nuanced features, but existing methods are typically limited in the range of concepts they can capture, and require expensive training procedures or specialized data. We present IP-Composer, a novel training-free approach for compositional image generation that leverages multiple image references simultaneously, while using natural language to describe the concept to be extracted from each image. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. Through comprehensive evaluation, we show that our approach enables more precise control over a larger range of visual concept compositions.
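A minimal sketch of the composite-embedding idea follows. It assumes the concept subspace is obtained by SVD over a set of CLIP text embeddings describing variations of the concept, and that projections are summed; the real stitching procedure and embedding dimensions may differ.

```python
# Minimal sketch (assumptions about the exact procedure) of building a composite
# conditioning embedding: for each reference image, project its CLIP embedding onto
# a concept subspace spanned by text-derived directions, then sum the projections.
import torch

def concept_subspace(text_embs: torch.Tensor, rank: int = 8) -> torch.Tensor:
    """text_embs: (M, D) CLIP text embeddings describing variations of one concept.
    Returns an orthonormal basis (rank, D) of the dominant directions via SVD."""
    centered = text_embs - text_embs.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:rank]                                        # (rank, D)

def project(image_emb: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Orthogonal projection of a (D,) image embedding onto the concept subspace."""
    coeffs = basis @ image_emb                              # (rank,)
    return basis.t() @ coeffs                               # back to (D,)

# Toy usage with random stand-ins for CLIP embeddings (D = 768 assumed).
D = 768
bases = [concept_subspace(torch.randn(32, D)) for _ in range(2)]   # two concepts
image_embs = [torch.randn(D), torch.randn(D)]                      # two references
composite = sum(project(e, b) for e, b in zip(image_embs, bases))  # conditioning vector
```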

Authors:Idris Hamoud, Vinkle Srivastav, Muhammad Abdullah Jamal, Didier Mutter, Omid Mohareri, Nicolas Padoy
Title: Multi-view Video-Pose Pretraining for Operating Room Surgical Activity Recognition
Abstract:
Understanding the workflow of surgical procedures in complex operating rooms requires a deep understanding of the interactions between clinicians and their environment. Surgical activity recognition (SAR) is a key computer vision task that detects activities or phases from multi-view camera recordings. Existing SAR models often fail to account for fine-grained clinician movements and multi-view knowledge, or they require calibrated multi-view camera setups and advanced point-cloud processing to obtain better results. In this work, we propose a novel calibration-free multi-view multi-modal pretraining framework called Multiview Pretraining for Video-Pose Surgical Activity Recognition PreViPS, which aligns 2D pose and vision embeddings across camera views. Our model follows CLIP-style dual-encoder architecture: one encoder processes visual features, while the other encodes human pose embeddings. To handle the continuous 2D human pose coordinates, we introduce a tokenized discrete representation to convert the continuous 2D pose coordinates into discrete pose embeddings, thereby enabling efficient integration within the dual-encoder framework. To bridge the gap between these two modalities, we propose several pretraining objectives using cross- and in-modality geometric constraints within the embedding space and incorporating masked pose token prediction strategy to enhance representation learning. Extensive experiments and ablation studies demonstrate improvements over the strong baselines, while data-efficiency experiments on two distinct operating room datasets further highlight the effectiveness of our approach. We highlight the benefits of our approach for surgical activity recognition in both multi-view and single-view settings, showcasing its practical applicability in complex surgical environments. Code will be made available at: https://github.com/CAMMA-public/PreViPS.
Chinese: 本文提出PreViPS,一种免校准的多视角多模态预训练框架,通过跨视角对齐2D姿态与视觉嵌入,解决了现有手术活动识别模型在细粒度动作和多视角知识整合方面的不足。
English: This paper introduces PreViPS, a calibration-free multi-view multi-modal pretraining framework that aligns 2D pose and vision embeddings across camera views to improve surgical activity recognition by addressing limitations in existing models regarding fine-grained movements and multi-view integration.

Authors:Taewoo Kim, Yujeong Chae, Hyun-Kurl Jang, Kuk-Jin Yoon
Title: Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields
Abstract:
Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since event cameras are bio-inspired sensors that only encode brightness changes with microsecond temporal resolution, several works have utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only events or approximations, which cannot account for the complex motion in real-world scenarios. In this paper, we propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. In detail, our EIF-BiOFNet utilizes each valuable characteristic of the events and images for direct estimation of inter-frame motion fields without any approximation methods. Moreover, we develop an interactive attention-based frame synthesis network to efficiently leverage the complementary warping-based and synthesis-based features. Finally, we build a large-scale event-based VFI dataset, ERF-X170FPS, with a high frame rate, extreme motion, and dynamic textures to overcome the limitations of previous event-based VFI datasets. Extensive experimental results validate that our method shows significant performance improvement over the state-of-the-art VFI methods on various datasets. Our project pages are available at: https://github.com/intelpro/CBMNet
中文: 本文提出了一种新颖的基于事件的视频帧插值框架,通过跨模态非对称双向运动场估计和交互式注意力合成网络,在新构建的高帧率数据集上实现了超越现有方法的性能表现。
English: This paper introduces a novel event-based video frame interpolation framework that utilizes cross-modal asymmetric bidirectional motion field estimation and an interactive attention-based synthesis network, achieving superior performance on a newly created high-frame-rate dataset.

Authors:Zheng Wu, Yiping Xie, Bo Zhao, Jiguang He, Fei Luo, Ning Deng, Zitong Yu
Title: CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement
Abstract:
Heart rate (HR) estimation via remote photoplethysmography (rPPG) offers a non-invasive solution for health monitoring. However, traditional single-modality approaches (RGB or Radio Frequency (RF)) face challenges in balancing robustness and accuracy due to lighting variations, motion artifacts, and skin tone bias. In this paper, we propose CardiacMamba, a multimodal RGB-RF fusion framework that leverages the complementary strengths of both modalities. It introduces the Temporal Difference Mamba Module (TDMM) to capture dynamic changes in RF signals using timing differences between frames, enhancing the extraction of local and global features. Additionally, CardiacMamba employs a Bidirectional SSM for cross-modal alignment and a Channel-wise Fast Fourier Transform (CFFT) to effectively capture and refine the frequency domain characteristics of RGB and RF signals, ultimately improving heart rate estimation accuracy and periodicity detection. Extensive experiments on the EquiPleth dataset demonstrate state-of-the-art performance, achieving marked improvements in accuracy and robustness. CardiacMamba significantly mitigates skin tone bias, reducing performance disparities across demographic groups, and maintains resilience under missing-modality scenarios. By addressing critical challenges in fairness, adaptability, and precision, the framework advances rPPG technology toward reliable real-world deployment in healthcare. The codes are available at: https://github.com/WuZheng42/CardiacMamba.
中文: CardiacMamba是一种多模态RGB-RF融合框架,通过整合互补特征提升心率估计性能,在医疗应用中实现了更高的准确性、鲁棒性并显著降低了肤色偏差。
English: CardiacMamba is a multimodal RGB-RF fusion framework that enhances heart rate estimation by integrating complementary features, achieving superior accuracy, robustness, and reduced skin tone bias in real-world healthcare applications.
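The channel-wise FFT component can be illustrated with a short module. The sketch below is an assumption about the CFFT block (real FFT along time per channel, low-frequency magnitude bins, a light linear refinement) rather than the released code.

```python
# Minimal sketch (an assumption about the CFFT block, not the released code) of a
# channel-wise FFT feature: apply a real FFT along time independently per channel
# and keep magnitude spectra, which emphasise periodic (heart-rate) components.
import torch
import torch.nn as nn

class ChannelwiseFFT(nn.Module):
    def __init__(self, channels: int, keep_bins: int = 32):
        super().__init__()
        self.keep_bins = keep_bins
        self.mix = nn.Linear(keep_bins, keep_bins)   # light refinement over frequency bins

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, T) temporal features from the RGB or RF branch."""
        spec = torch.fft.rfft(x, dim=-1)             # (B, C, T//2 + 1) complex
        mag = spec.abs()[..., : self.keep_bins]      # low-frequency magnitudes
        return self.mix(mag)                          # (B, C, keep_bins)

# Toy usage: 30 s of features at 30 fps for both modalities.
rgb_feats = torch.randn(4, 64, 900)
rf_feats = torch.randn(4, 64, 900)
cfft = ChannelwiseFFT(channels=64)
fused = torch.cat([cfft(rgb_feats), cfft(rf_feats)], dim=1)   # (4, 128, 32)
```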

Authors:Ziming Hong, Yongli Xiang, Tongliang Liu
Title: Toward Robust Non-Transferable Learning: A Survey and Benchmark
Abstract:
Over the past decades, researchers have primarily focused on improving the generalization abilities of models, with limited attention given to regulating such generalization. However, the ability of models to generalize to unintended data (e.g., harmful or unauthorized data) can be exploited by malicious adversaries in unforeseen ways, potentially resulting in violations of model ethics. Non-transferable learning (NTL), a task aimed at reshaping the generalization abilities of deep learning models, was proposed to address these challenges. While numerous methods have been proposed in this field, a comprehensive review of existing progress and a thorough analysis of current limitations remain lacking. In this paper, we bridge this gap by presenting the first comprehensive survey on NTL and introducing NTLBench, the first benchmark to evaluate NTL performance and robustness within a unified framework. Specifically, we first introduce the task settings, general framework, and criteria of NTL, followed by a summary of NTL approaches. Furthermore, we emphasize the often-overlooked issue of robustness against various attacks that can destroy the non-transferable mechanism established by NTL. Experiments conducted via NTLBench verify the limitations of existing NTL methods in robustness. Finally, we discuss the practical applications of NTL, along with its future directions and associated challenges.
Chinese Summary: 本文首次对不可迁移学习进行全面综述,提出了首个评估NTL方法性能与鲁棒性的基准NTLBench,同时揭示了现有方法在抗攻击鲁棒性方面的局限性。
English Summary: This paper presents the first comprehensive survey on non-transferable learning (NTL), introducing NTLBench as the inaugural benchmark to evaluate NTL methods' performance and robustness while highlighting their limitations against attacks.

Authors:Ziyuan Liu, Ruifei Zhu, Long Gao, Yuanxiu Zhou, Jingyu Ma, Yuantao Gu
Title: JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework
Abstract:
Change detection (CD) in remote sensing images plays a vital role in Earth observation. However, the scarcity of high-resolution, comprehensive open-source datasets and the difficulty in achieving robust performance across varying change types remain major challenges. To address these issues, we introduce JL1-CD, a large-scale, sub-meter CD dataset consisting of 5,000 image pairs. We further propose a novel Origin-Partition (O-P) strategy and integrate it into a Multi-Teacher Knowledge Distillation (MTKD) framework to enhance CD performance. The O-P strategy partitions the training set by Change Area Ratio (CAR) and trains specialized teacher models on each subset. The MTKD framework then distills complementary knowledge from these teachers into a single student model, enabling improved detection results across diverse CAR scenarios without additional inference cost. Our MTKD approach demonstrated strong performance in the 2024 ``Jilin-1'' Cup challenge, ranking first in the preliminary and second in the final rounds. Extensive experiments on the JL1-CD and SYSU-CD datasets show that the MTKD framework consistently improves the performance of CD models with various network architectures and parameter sizes, establishing new state-of-the-art results. Code and dataset are available at https://github.com/circleLZY/MTKD-CD.
中文摘要:本文提出了大规模变化检测数据集JL1-CD,并创新性地将源分区策略融入多教师知识蒸馏框架,在不增加推理成本的情况下显著提升了各类变化场景的检测性能。
English Summary: This paper introduces JL1-CD, a large-scale change detection dataset, and proposes a novel Origin-Partition strategy integrated into a Multi-Teacher Knowledge Distillation framework that enhances detection performance across diverse scenarios without increasing inference costs.
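The Origin-Partition and distillation steps can be sketched together. The snippet below assumes a three-bin CAR split with example thresholds and a standard temperature-scaled KL distillation from the teacher assigned to each sample's bin; the actual partition edges and loss weights are the paper's choices, not reproduced here.

```python
# Minimal sketch (assumptions about details) of the Origin-Partition idea: split
# training pairs by Change Area Ratio (CAR), train one teacher per bin, then distill
# all teachers into a single student so inference cost stays unchanged.
import torch
import torch.nn.functional as F

def change_area_ratio(change_mask: torch.Tensor) -> float:
    """change_mask: (H, W) binary ground-truth change map."""
    return change_mask.float().mean().item()

def assign_partition(car: float, edges=(0.05, 0.2)) -> int:
    """Example 3-way split: small / medium / large change areas (edges assumed)."""
    return sum(car > e for e in edges)

def mtkd_loss(student_logits, teacher_logits_by_bin, bin_ids, labels, temp=2.0, w=0.5):
    """Distill each sample from the teacher responsible for its CAR partition."""
    teacher_logits = torch.stack(
        [teacher_logits_by_bin[b][i] for i, b in enumerate(bin_ids)])
    kd = F.kl_div(
        F.log_softmax(student_logits / temp, dim=1),
        F.softmax(teacher_logits / temp, dim=1),
        reduction="batchmean") * temp * temp
    ce = F.cross_entropy(student_logits, labels)
    return ce + w * kd

# Toy usage: 2-class (change / no-change) logits over a 64x64 patch, batch of 4.
student = torch.randn(4, 2, 64, 64, requires_grad=True)
teachers = [torch.randn(4, 2, 64, 64) for _ in range(3)]     # one per CAR bin
labels = torch.randint(0, 2, (4, 64, 64))
bins = [assign_partition(change_area_ratio(labels[i])) for i in range(4)]
mtkd_loss(student, teachers, bins, labels).backward()
```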

Authors:Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao, Zhengzhong Tu
Title: Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization
Abstract:
The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all the code in https://github.com/taco-group/Re-Align.
中文: Re-Align框架通过图像检索和扩展的rDPO方法,有效减少视觉语言模型中的跨模态幻觉问题,并显著提升视觉问答任务的性能。
English: The Re-Align framework introduces image retrieval and an extended rDPO method to effectively reduce cross-modal hallucinations in Vision Language Models while enhancing performance in visual question-answering tasks.
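A rough sketch of rDPO follows. It assumes the visual preference term mirrors the textual DPO term but contrasts the original image against a retrieved, less relevant image; the exact construction of the dual-preference data and the weighting are assumptions.

```python
# Minimal sketch (the exact visual objective is an assumption) of rDPO: the standard
# DPO objective on chosen/rejected responses, plus an analogous preference term that
# contrasts the original image against a retrieved (less relevant) image.
import torch
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """All inputs are summed log-probabilities of the response under a given input."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

def rdpo_loss(text_logps, image_logps, beta=0.1, lam=1.0):
    """text_logps / image_logps: dicts with policy and reference log-probs."""
    text = dpo_term(text_logps["chosen"], text_logps["rejected"],
                    text_logps["ref_chosen"], text_logps["ref_rejected"], beta)
    vis = dpo_term(image_logps["orig_image"], image_logps["retrieved_image"],
                   image_logps["ref_orig_image"], image_logps["ref_retrieved_image"],
                   beta)
    return text + lam * vis

# Toy usage with random log-probabilities for a batch of 8 preference pairs.
rand = lambda: torch.randn(8)
text_logps = {k: rand() for k in ["chosen", "rejected", "ref_chosen", "ref_rejected"]}
image_logps = {k: rand() for k in
               ["orig_image", "retrieved_image", "ref_orig_image", "ref_retrieved_image"]}
loss = rdpo_loss(text_logps, image_logps)
```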

Authors:Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang
Title: Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
Abstract:
Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLM or vision encoders. We propose a seeding strategy to carve Mamba from a trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates 20.6$\times$ speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves 13.5$\times$ speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba
中文:mmMamba框架通过渐进式蒸馏将多模态大语言模型转换为线性复杂度架构,在保持竞争力的性能的同时实现了显著的速度提升和内存节省。
English: The mmMamba framework enables efficient conversion of multimodal large language models to linear-complexity architectures through progressive distillation, achieving significant speed improvements and memory savings while maintaining competitive performance.

Authors:Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang
Title: RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Abstract:
Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.
中文: 本文提出了一种基于3DGS的闭环强化学习自动驾驶框架,通过在逼真数字环境中进行安全探索来克服模仿学习的局限性,实现了碰撞率降低三倍的显著效果。
English: This paper introduces a 3DGS-based closed-loop reinforcement learning framework for autonomous driving that overcomes imitation learning limitations by enabling safe exploration in photorealistic environments, achieving a threefold reduction in collision rates.
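One way to read the objective is as an RL surrogate plus an imitation regularizer. The sketch below assumes a REINFORCE-style policy-gradient term and a mean-squared behaviour-cloning term; the paper's specialized safety rewards and the exact IL formulation are not modeled.

```python
# Minimal sketch (assumptions about the exact objective) of mixing closed-loop RL
# with imitation learning as a regulariser: a policy-gradient surrogate on rollout
# returns plus a behaviour-cloning term toward human demonstrations.
import torch
import torch.nn.functional as F

def rad_like_objective(log_probs, advantages, policy_actions, expert_actions,
                       il_weight=0.5):
    """log_probs, advantages: (T,) from 3DGS closed-loop rollouts.
    policy_actions, expert_actions: (N, A) continuous controls on logged states."""
    rl_loss = -(log_probs * advantages.detach()).mean()     # REINFORCE-style surrogate
    il_loss = F.mse_loss(policy_actions, expert_actions)    # imitation regulariser
    return rl_loss + il_weight * il_loss

# Toy usage.
log_probs = torch.randn(128, requires_grad=True)
advantages = torch.randn(128)
policy_actions = torch.randn(32, 2, requires_grad=True)     # e.g. steering, accel
expert_actions = torch.randn(32, 2)
rad_like_objective(log_probs, advantages, policy_actions, expert_actions).backward()
```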

Authors:Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollhoefer, Dimitris Samaras, Alexander Richard
Title: AV-Flow: Transforming Text to Audio-Visual Human-like Interactions
Abstract:
We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose; all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus, synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In case of dyadic conversations, AV-Flow produces an always-on avatar, that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/

Authors:Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
Title: Magma: A Foundation Model for Multimodal AI Agents
Abstract:
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.
中文: Magma是一个多模态基础模型,不仅具备视觉语言理解能力,还通过空间-时间智能在数字和物理世界中规划执行任务,在界面导航和机器人操控等任务中创造了最新最优性能。
English: Magma is a multimodal foundation model that extends vision-language capabilities by integrating spatial-temporal intelligence for planning and executing agentic tasks in both digital and physical environments, achieving state-of-the-art performance in UI navigation and robotic manipulation.

Authors:Ekin Celikkan, Timo Kunzmann, Yertay Yeskaliyev, Sibylle Itzerott, Nadja Klein, Martin Herold
Title: WeedsGalore: A Multispectral and Multitemporal UAV-based Dataset for Crop and Weed Segmentation in Agricultural Maize Fields
Abstract:
Weeds are one of the major reasons for crop yield loss but current weeding practices fail to manage weeds in an efficient and targeted manner. Effective weed management is especially important for crops with high worldwide production such as maize, to maximize crop yield for meeting increasing global demands. Advances in near-sensing and computer vision enable the development of new tools for weed management. Specifically, state-of-the-art segmentation models, coupled with novel sensing technologies, can facilitate timely and accurate weeding and monitoring systems. However, learning-based approaches require annotated data and show a lack of generalization to aerial imaging for different crops. We present a novel dataset for semantic and instance segmentation of crops and weeds in agricultural maize fields. The multispectral UAV-based dataset contains images with RGB, red-edge, and near-infrared bands, a large number of plant instances, dense annotations for maize and four weed classes, and is multitemporal. We provide extensive baseline results for both tasks, including probabilistic methods to quantify prediction uncertainty, improve model calibration, and demonstrate the approach's applicability to out-of-distribution data. The results show the effectiveness of the two additional bands compared to RGB only, and better performance in our target domain than models trained on existing datasets. We hope our dataset advances research on methods and operational systems for fine-grained weed identification, enhancing the robustness and applicability of UAV-based weed management. The dataset and code are available at https://github.com/GFZ/weedsgalore
中文: 本研究提出了一种新型多光谱无人机数据集,用于玉米田间作物和杂草的语义与实例分割,通过额外光谱波段和优化模型校准显著提升了杂草管理的准确性和适用性。
English: This study introduces a novel multispectral UAV dataset for semantic and instance segmentation of crops and weeds in maize fields, demonstrating enhanced performance with additional spectral bands and improved model calibration for robust weed management.

Authors:Yuxiang Wei, Yiheng Zheng, Yabo Zhang, Ming Liu, Zhilong Ji, Lei Zhang, Wangmeng Zuo
Title: Personalized Image Generation with Deep Generative Models: A Decade Survey
Abstract:
Recent advancements in generative models have significantly facilitated the development of personalized content creation. Given a small set of images with a user-specific concept, personalized image generation allows the creation of images that incorporate the specified concept and adhere to provided text descriptions. Due to its wide applications in content creation, significant effort has been devoted to this field in recent years. Nonetheless, the technologies used for personalization have evolved alongside the development of generative models, with their distinct and interrelated components. In this survey, we present a comprehensive review of generalized personalized image generation across various generative models, including traditional GANs, contemporary text-to-image diffusion models, and emerging multimodal autoregressive models. We first define a unified framework that standardizes the personalization process across different generative models, encompassing three key components, i.e., inversion spaces, inversion methods, and personalization schemes. This unified framework offers a structured approach to dissecting and comparing personalization techniques across different generative architectures. Building upon this unified framework, we further provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations. Through comparative analysis, this survey elucidates the current landscape of personalized image generation, identifying commonalities and distinguishing features among existing methods. Finally, we discuss the open challenges in the field and propose potential directions for future research. We continue to track related works at https://github.com/csyxwei/Awesome-Personalized-Image-Generation.
中文摘要:本综述系统梳理了各类生成模型中的个性化图像生成技术,通过建立统一框架分析核心组件并进行方法对比,同时指出了未来研究方向。
English Summary: This survey comprehensively reviews personalized image generation techniques across various generative models, establishing a unified framework to analyze their key components and comparing methods while identifying future research directions.

Authors:Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Title: Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
Abstract:
Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While Large Multimodal Models (LMMs) have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both supervised fine-tuning (SFT) and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Analysis reveals that our approach achieves improved robustness under adversarial attacks compared to SFT models. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability. Code available at https://github.com/JingbiaoMei/RGCL
Chinese: 本文提出了一种鲁棒适应框架,通过提升领域内准确性和跨领域泛化能力来增强仇恨表情包检测,同时保留大型多模态模型的视觉语言能力,实现了最先进的性能并通过高质量解释提升了模型可解释性。
English: This paper introduces a robust adaptation framework that enhances hateful meme detection by improving in-domain accuracy and cross-domain generalization while preserving LMMs' vision-language capabilities, achieving state-of-the-art performance and superior interpretability through high-quality rationales.

Authors:Rema Daher, Francisco Vasconcelos, Danail Stoyanov
Title: SHADeS: Self-supervised Monocular Depth Estimation Through Non-Lambertian Image Decomposition
Abstract:
Purpose: Visual 3D scene reconstruction can support colonoscopy navigation. It can help in recognising which portions of the colon have been visualised and characterising the size and shape of polyps. This is still a very challenging problem due to complex illumination variations, including abundant specular reflections. We investigate how to effectively decouple light and depth in this problem. Methods: We introduce a self-supervised model that simultaneously characterises the shape and lighting of the visualised colonoscopy scene. Our model estimates shading, albedo, depth, and specularities (SHADeS) from single images. Unlike previous approaches (IID), we use a non-Lambertian model that treats specular reflections as a separate light component. The implementation of our method is available at https://github.com/RemaDaher/SHADeS. Results: We demonstrate on real colonoscopy images (Hyper Kvasir) that previous models for light decomposition (IID) and depth estimation (MonoVIT, Monodepth2) are negatively affected by specularities. In contrast, SHADeS can simultaneously produce light decomposition and depth maps that are robust to specular regions. We also perform a quantitative comparison on phantom data (C3VD) where we further demonstrate the robustness of our model. Conclusion: Modelling specular reflections improves depth estimation in colonoscopy. We propose an effective self-supervised approach that uses this insight to jointly estimate light decomposition and depth. Light decomposition has the potential to help with other problems, such as place recognition within the colon.
中文摘要:本研究提出的SHADeS自监督模型通过单独处理镜面反射,有效提升了结肠镜三维重建中的深度估计和光照分解的鲁棒性。
English Summary: The study introduces SHADeS, a self-supervised model that improves colonoscopy 3D reconstruction by separately modeling specular reflections to enhance depth estimation and light decomposition robustness.
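To make the decomposition idea concrete, the sketch below composes an image from albedo, shading, and an additive specular term and compares it against the input frame in a self-supervised reconstruction loss. The exact image-formation model and loss terms used by SHADeS are not given in the abstract, so this is only an assumed form.

```python
import torch
import torch.nn.functional as F

def compose_non_lambertian(albedo, shading, specular):
    # Assumed non-Lambertian image model: a diffuse term (albedo * shading)
    # plus an additive specular component treated as a separate light source.
    return albedo * shading + specular

def reconstruction_loss(albedo, shading, specular, image):
    # Self-supervised objective: the decomposed components must re-compose
    # into the observed colonoscopy frame.
    return F.l1_loss(compose_non_lambertian(albedo, shading, specular), image)
```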

Authors:Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Doruk Oner, Pascal Fua
Title: PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization
Abstract:
Accurate 3D shape representation is essential in engineering applications such as design, optimization, and simulation. In practice, engineering workflows require structured, part-aware representations, as objects are inherently designed as assemblies of distinct components. However, most existing methods either model shapes holistically or decompose them without predefined part structures, limiting their applicability in real-world design tasks. We propose PartSDF, a supervised implicit representation framework that explicitly models composite shapes with independent, controllable parts while maintaining shape consistency. Despite its simple single-decoder architecture, PartSDF outperforms both supervised and unsupervised baselines in reconstruction and generation tasks. We further demonstrate its effectiveness as a structured shape prior for engineering applications, enabling precise control over individual components while preserving overall coherence. Code available at https://github.com/cvlab-epfl/PartSDF.
中文: PartSDF是一种监督式隐式表示框架,通过独立可控部件建模复合形状,在重建和生成任务中优于现有方法,并为工程应用提供精确的组件控制能力。
English: PartSDF is a supervised implicit representation framework that models composite shapes with independent, controllable parts, outperforming existing methods in reconstruction and generation tasks while enabling precise component control for engineering applications.
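One simple way to obtain a whole-shape signed distance field from per-part predictions is a CSG-style union, sketched below. Whether PartSDF composes parts exactly this way is not stated in the abstract, so the min-union rule here is an assumption.

```python
import torch

def composite_sdf(part_sdfs: torch.Tensor) -> torch.Tensor:
    """Union of per-part signed distance fields.

    part_sdfs: [num_parts, num_points] signed distances from each query point
    to each part. Taking the per-point minimum yields the SDF of the union,
    so editing one part leaves the other parts' fields untouched.
    """
    return part_sdfs.min(dim=0).values
```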

Authors:Zhexiong Wan, Bin Fan, Le Hui, Yuchao Dai, Gim Hee Lee
Title: Instance-Level Moving Object Segmentation from a Single Image with Events
Abstract:
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects, while the difficulty lies in accounting for both spatial texture structures and temporal motion cues. Existing methods based on video frames encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion due to the complexities of accurate image-based motion modeling. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities, but instead lead to challenges in segmenting pixel-level object masks due to the lack of dense texture structures in events. To address these two limitations imposed by unimodal settings, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, explicit contrastive feature learning, and flow-guided motion enhancement to exploit dense texture information from a single image and rich motion information from events, respectively. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. Through extensive evaluations on multiple datasets, as well as ablation experiments with different input settings and real-time efficiency analysis of the proposed framework, we believe that our first attempt to incorporate image and event data for practical deployment can provide new insights for future event-based motion-related research. The source code with model training and pre-trained weights is released at https://npucvr.github.io/EvInsMOS

Authors:Gianluca Guglielmo, Marc Masana
Title: Leveraging Intermediate Representations for Better Out-of-Distribution Detection
Abstract:
In real-world applications, machine learning models must reliably detect Out-of-Distribution (OoD) samples to prevent unsafe decisions. Current OoD detection methods often rely on analyzing the logits or the embeddings of the penultimate layer of a neural network. However, little work has been conducted on exploiting the rich information encoded in intermediate layers. To address this, we analyze the discriminative power of intermediate layers and show that they can be used effectively for OoD detection. We therefore propose to regularize intermediate layers with an energy-based contrastive loss and to group multiple layers into a single aggregated response. A comprehensive evaluation across multiple datasets demonstrates that exploiting intermediate layer activations improves OoD detection performance.
中文: 机器学习模型可通过基于能量的对比损失和层聚合方法,有效利用中间层信息来提升分布外样本检测性能,多数据集验证了其有效性。
English: Machine learning models can enhance Out-of-Distribution detection by utilizing intermediate layers' information through energy-based contrastive loss and layer aggregation, as validated across multiple datasets.
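A minimal sketch of how intermediate-layer responses might be aggregated into a single OoD score is shown below, assuming a small linear head per hooked layer (a hypothetical design; the paper's actual heads, loss, and aggregation may differ). Each layer contributes an energy score, and the scores are summed into one aggregated response.

```python
import torch
import torch.nn as nn

class MultiLayerEnergyOoD(nn.Module):
    """Illustrative aggregation of intermediate-layer energy scores.

    Each hooked feature map is globally pooled, passed through a linear head,
    and converted to an energy score; per-layer energies are summed.
    Lower aggregated energy is treated as more in-distribution.
    """

    def __init__(self, feature_dims, num_classes):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in feature_dims])

    def forward(self, features):
        # features: list of tensors [B, C_i, H, W], one per hooked layer
        energies = []
        for feat, head in zip(features, self.heads):
            pooled = feat.mean(dim=(2, 3))                         # global average pooling
            logits = head(pooled)
            energies.append(-torch.logsumexp(logits, dim=1))       # per-layer energy score
        return torch.stack(energies, dim=0).sum(dim=0)             # aggregated response
```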

Authors:Tanqiu Jiang, Changjiang Li, Fenglong Ma, Ting Wang
Title: RAPID: Retrieval Augmented Training of Differentially Private Diffusion Models
Abstract:
Differentially private diffusion models (DPDMs) harness the remarkable generative capabilities of diffusion models while enforcing differential privacy (DP) for sensitive data. However, existing DPDM training approaches often suffer from significant utility loss, large memory footprint, and expensive inference cost, impeding their practical uses. To overcome such limitations, we present RAPID: Retrieval Augmented PrIvate Diffusion model, a novel approach that integrates retrieval augmented generation (RAG) into DPDM training. Specifically, RAPID leverages available public data to build a knowledge base of sample trajectories; when training the diffusion model on private data, RAPID computes the early sampling steps as queries, retrieves similar trajectories from the knowledge base as surrogates, and focuses on training the later sampling steps in a differentially private manner. Extensive evaluation using benchmark datasets and models demonstrates that, with the same privacy guarantee, RAPID significantly outperforms state-of-the-art approaches by large margins in generative quality, memory footprint, and inference cost, suggesting that retrieval-augmented DP training represents a promising direction for developing future privacy-preserving generative models. The code is available at: https://github.com/TanqiuJiang/RAPID
中文: RAPID提出了一种检索增强的差分隐私扩散模型方法,利用公共数据提升训练效率,在同等隐私保护下显著提高了生成质量并降低了内存和计算成本。
English: RAPID introduces a retrieval-augmented approach to differentially private diffusion models, leveraging public data to enhance training efficiency and significantly improving generative quality while reducing memory and computational costs under the same privacy guarantees.

Authors:Fabian Bongratz, Yitong Li, Sama Elbaroudy, Christian Wachinger
Title: 3D Shape-to-Image Brownian Bridge Diffusion for Brain MRI Synthesis from Cortical Surfaces
Abstract:
Despite recent advances in medical image generation, existing methods struggle to produce anatomically plausible 3D structures. In synthetic brain magnetic resonance images (MRIs), characteristic fissures are often missing, and reconstructed cortical surfaces appear scattered rather than densely convoluted. To address this issue, we introduce Cor2Vox, the first diffusion model-based method that translates continuous cortical shape priors to synthetic brain MRIs. To achieve this, we leverage a Brownian bridge process which allows for direct structured mapping between shape contours and medical images. Specifically, we adapt the concept of the Brownian bridge diffusion model to 3D and extend it to embrace various complementary shape representations. Our experiments demonstrate significant improvements in the geometric accuracy of reconstructed structures compared to previous voxel-based approaches. Moreover, Cor2Vox excels in image quality and diversity, yielding high variation in non-target structures like the skull. Finally, we highlight the capability of our approach to simulate cortical atrophy at the sub-voxel level. Our code is available at https://github.com/ai-med/Cor2Vox.
中文:Cor2Vox提出了一种创新的扩散模型,能将皮层形状先验转化为解剖结构精确的3D脑部核磁共振图像,显著提升了几何精度,并能实现亚体素级别的皮层萎缩模拟。
English: Cor2Vox introduces a novel diffusion model that translates cortical shape priors into anatomically accurate 3D brain MRIs, significantly improving geometric precision and enabling sub-voxel simulation of cortical atrophy.
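The Brownian bridge process pins the diffusion to both endpoints, the cortical shape prior and the target MRI volume, so the noise variance vanishes at t = 0 and t = T. Below is a minimal forward-sampling sketch under one common parameterization; the schedule, scaling, and 3D-specific details used by Cor2Vox are not given in the abstract and are assumed here.

```python
import math
import torch

def brownian_bridge_sample(x0, y, t, T, s_max=1.0):
    # x0: shape-prior volume, y: target MRI volume, 0 <= t <= T.
    # The mean interpolates between the two endpoints; the variance peaks
    # mid-bridge and is zero at both ends, giving a structured mapping
    # from shape contours to images instead of mapping to pure noise.
    m = t / T
    mean = (1.0 - m) * x0 + m * y
    std = math.sqrt(max(s_max * 2.0 * m * (1.0 - m), 0.0))
    return mean + std * torch.randn_like(x0)
```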

Authors:Thierry Judge, Olivier Bernard, Woo-Jin Cho Kim, Alberto Gomez, Arian Beqiri, Agisilaos Chartsias, Pierre-Marc Jodoin
Title: Uncertainty Propagation for Echocardiography Clinical Metric Estimation via Contour Sampling
Abstract:
Echocardiography plays a fundamental role in the extraction of important clinical parameters (e.g. left ventricular volume and ejection fraction) required to determine the presence and severity of heart-related conditions. When deploying automated techniques for computing these parameters, uncertainty estimation is crucial for assessing their utility. Since clinical parameters are usually derived from segmentation maps, there is no clear path for converting pixel-wise uncertainty values into uncertainty estimates in the downstream clinical metric calculation. In this work, we propose a novel uncertainty estimation method based on contouring rather than segmentation. Our method explicitly predicts contour location uncertainty from which contour samples can be drawn. Finally, the sampled contours can be used to propagate uncertainty to clinical metrics. Our proposed method not only provides accurate uncertainty estimations for the task of contouring but also for the downstream clinical metrics on two cardiac ultrasound datasets. Code is available at: https://github.com/ThierryJudge/contouring-uncertainty.
中文: 本研究提出了一种基于轮廓检测的新型不确定性估计方法,能准确预测轮廓位置的不确定性并将其传递至临床指标,在心脏超声数据集上优于基于分割的方法。
English: This study introduces a novel uncertainty estimation method based on contouring, which accurately predicts contour location uncertainty and propagates it to clinical metrics, outperforming segmentation-based approaches on cardiac ultrasound datasets.
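To illustrate the propagation step, the toy sketch below assumes an independent Gaussian model for each predicted contour point, draws contour samples, and turns each sample into a downstream scalar (polygon area via the shoelace formula, standing in for a clinical metric such as ventricular area). The paper's actual contour parameterization and sampling scheme may differ.

```python
import numpy as np

def propagate_contour_uncertainty(mean_pts, std_pts, n_samples=100, seed=0):
    """mean_pts, std_pts: [N, 2] predicted contour point means and stds."""
    rng = np.random.default_rng(seed)
    areas = []
    for _ in range(n_samples):
        pts = mean_pts + std_pts * rng.standard_normal(mean_pts.shape)   # one contour sample
        x, y = pts[:, 0], pts[:, 1]
        # Shoelace formula for the area of the sampled (closed) contour.
        areas.append(0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1))))
    return float(np.mean(areas)), float(np.std(areas))   # metric estimate and its uncertainty
```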

Authors:Timon Winter, Stanislav Frolov, Brian Bernhard Moser, Andreas Dengel
Title: Spherical Dense Text-to-Image Synthesis
Abstract:
Recent advancements in text-to-image (T2I) generation have improved synthesis results, but challenges remain in layout control and in generating omnidirectional panoramic images. Dense T2I (DT2I) and spherical T2I (ST2I) models address these issues, but so far no unified approach exists. Trivial approaches, such as prompting a DT2I model to generate panoramas, cannot produce proper spherical distortions and seamless transitions at the borders. Our work shows that spherical dense text-to-image (SDT2I) can be achieved by integrating training-free DT2I approaches into finetuned panorama models. Specifically, we propose MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF) by integrating MultiDiffusion into StitchDiffusion and PanFusion, respectively. Since no benchmark for SDT2I exists, we further construct Dense-Synthetic-View (DSynView), a new synthetic dataset containing spherical layouts to evaluate our models. Our results show that MSTD outperforms MPF across image quality as well as prompt- and layout adherence. MultiPanFusion generates more diverse images but struggles to synthesize flawless foreground objects. We propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground as improvements to MPF. Link to code: https://github.com/sdt2i/spherical-dense-text-to-image
中文: 本研究提出了MultiStitchDiffusion和MultiPanFusion两种统一方法,通过整合现有模型实现球形密集文本到图像生成,解决了布局控制和全景无缝合成的难题,其中MSTD在图像质量与内容贴合度方面表现更优。
English: This work introduces MultiStitchDiffusion and MultiPanFusion as unified approaches for spherical dense text-to-image generation, addressing layout control and seamless panoramic synthesis through integration with existing models, with MSTD showing superior performance in quality and adherence metrics.

Authors:Tanzhe Li, Caoshuo Li, Jiayi Lyu, Hongjuan Pei, Baochang Zhang, Taisong Jin, Rongrong Ji
Title: DAMamba: Vision State Space Model with Dynamic Adaptive Scan
Abstract:
State space models (SSMs) have recently garnered significant attention in computer vision. However, due to the unique characteristics of image data, adapting SSMs from natural language processing to computer vision has not outperformed the state-of-the-art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily leverage manually designed scans to flatten image patches into sequences locally or globally. This approach disrupts the original semantic spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions. This enables more flexible modeling capabilities while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it surpasses some of the latest state-of-the-art CNNs and ViTs. Code will be available at https://github.com/ltzovo/DAMamba.
中文: 提出的动态自适应扫描(DAS)方法自适应分配扫描顺序和区域,克服了现有视觉状态空间模型的局限,并基于此开发了DAMamba视觉主干网络,在多项视觉任务中显著超越了当前最先进模型。
English: The proposed Dynamic Adaptive Scan (DAS) method adaptively allocates scanning orders and regions to overcome the limitations of existing vision SSMs, leading to the development of DAMamba, a vision backbone that outperforms current state-of-the-art models in various vision tasks.
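As a toy illustration of replacing a fixed raster scan with a learned ordering, the sketch below scores each patch token and reorders the sequence before it would be fed to the state space model. The real DAS mechanism (adaptive regions, linear-complexity scanning, and how the ordering is trained end to end) is considerably more involved; this only conveys the general idea and is not the paper's implementation.

```python
import torch
import torch.nn as nn

class DynamicScan(nn.Module):
    """Toy data-driven scan: rank patch tokens with a learned scorer and
    reorder the sequence by that ranking. Note that argsort is not
    differentiable, so this sketch omits whatever relaxation or training
    scheme the actual method uses."""

    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, tokens):                                   # tokens: [B, N, C]
        order = self.scorer(tokens).squeeze(-1).argsort(dim=1)   # [B, N] learned scan order
        return torch.gather(tokens, 1, order.unsqueeze(-1).expand_as(tokens))
```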

Authors:Juefeng Xiao, Tianqi Xiang, Zhigang Tu
Title: Adaptive Prototype Model for Attribute-based Multi-label Few-shot Action Recognition
Abstract:
In real-world action recognition systems, incorporating more attributes helps achieve a more comprehensive understanding of human behavior. However, using a single model to simultaneously recognize multiple attributes can lead to a decrease in accuracy. In this work, we propose a novel method, the Adaptive Attribute Prototype Model (AAPM), for human action recognition, which captures rich action-relevant attribute information and strikes a balance between accuracy and robustness. First, we introduce the Text-Constrain Module (TCM) to incorporate textual information from potential labels and to constrain the construction of the different attribute prototype representations. In addition, we explore the Attribute Assignment Method (AAM) to address training bias and increase robustness during the training process. Furthermore, we construct a new attribute-based multi-label video dataset called Multi-Kinetics for evaluation, which contains various attribute labels (e.g., action, scene, object) related to human behavior. Extensive experiments demonstrate that our AAPM achieves state-of-the-art performance in both attribute-based multi-label few-shot action recognition and single-label few-shot action recognition. The project and dataset are available at an anonymous account: https://github.com/theAAPM/AAPM
中文: 自适应属性原型模型(AAPM)通过引入文本约束模块和属性分配方法,有效提升了多属性动作识别的准确性和鲁棒性,在新建的Multi-Kinetics数据集上实现了多标签和单标签小样本识别的最优性能。
English: The Adaptive Attribute Prototype Model (AAPM) enhances human action recognition by integrating textual constraints and addressing training bias, achieving state-of-the-art performance in both multi-label and single-label few-shot tasks, as validated on the new Multi-Kinetics dataset.

Authors:Minghao Fu, Guo-Hua Wang, Liangfu Cao, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Title: CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation
Abstract:
Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, their independent application in current text-to-image models continues to face significant challenges in achieving strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we explore, for the first time, facilitating the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. Consequently, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to utilize the useful information contained in both distributions. We observe that CHATS exhibits exceptional data efficiency, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference alignment methods, setting a new state of the art across various standard benchmarks.
Chinese: 本文提出CHATS框架,通过结合人类偏好对齐与测试时采样,分别建模偏好与非偏好分布,以少量高质量数据实现了文本到图像生成的卓越效果,在多项基准测试中创下新纪录。
English: This paper introduces CHATS, a novel framework that combines human preference alignment with test-time sampling to enhance text-to-image generation by modeling preferred and dispreferred distributions, achieving state-of-the-art performance with high data efficiency.

Authors:Chao Yang, Yong Fan, Cheng Lu, Minghao Yuan, Zhijing Yang
Title: GVTNet: Graph Vision Transformer For Face Super-Resolution
Abstract:
Recent advances in face super-resolution research have utilized the Transformer architecture, which processes the input image as a series of small patches. However, facial components in face images are strongly correlated, and existing algorithms do not model the relationships between patches well when super-resolving low-resolution images, resulting in distorted facial components in the super-resolution results. To solve this problem, we propose a Transformer architecture based on graph neural networks, called the Graph Vision Transformer Network (GVTNet). We treat each patch as a graph node and establish an adjacency matrix based on the information between patches. In this way, each patch interacts only with its neighboring patches, better modeling the relationships among facial components. Quantitative and visualization experiments have underscored the superiority of our algorithm over state-of-the-art techniques. Through detailed comparisons, we have demonstrated that our algorithm possesses more advanced super-resolution capabilities, particularly in enhancing facial components. The PyTorch code is available at https://github.com/continueyang/GVTNet
中文摘要:提出的图视觉Transformer网络通过将图像块视为图节点并利用邻接矩阵优化块间交互,有效解决了超分辨率中面部组件失真的问题,展现出优于现有技术的性能。
English Summary: The proposed Graph Vision Transformer Network addresses facial component distortion in super-resolution by treating image patches as graph nodes and using an adjacency matrix to optimize patch interactions, demonstrating superior performance over existing methods.
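A minimal sketch of patch-level graph attention: patches on an h x w grid are connected to their 4-neighbours, and attention is masked so each patch only attends to adjacent patches. The criterion GVTNet actually uses to build its adjacency matrix ("information between patches") is not spelled out in the abstract, so the purely spatial rule here is an assumption.

```python
import torch

def grid_adjacency(h, w):
    """4-neighbour adjacency over an h x w grid of patch tokens (self-loops included)."""
    idx = torch.arange(h * w).reshape(h, w)
    adj = torch.eye(h * w, dtype=torch.bool)
    adj[idx[:, :-1].reshape(-1), idx[:, 1:].reshape(-1)] = True   # right neighbours
    adj[idx[:, 1:].reshape(-1), idx[:, :-1].reshape(-1)] = True   # left neighbours
    adj[idx[:-1].reshape(-1), idx[1:].reshape(-1)] = True         # bottom neighbours
    adj[idx[1:].reshape(-1), idx[:-1].reshape(-1)] = True         # top neighbours
    return adj

def neighbour_attention(q, k, v, adj):
    """Scaled dot-product attention restricted to adjacent patches."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~adj, float("-inf"))   # non-neighbours are masked out
    return torch.softmax(scores, dim=-1) @ v
```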

Authors:Chao Yang, Yong Fan, Qichao Zhang, Cheng Lu, Zhijing Yang
Title: DeltaDiff: Reality-Driven Diffusion with AnchorResiduals for Faithful SR
Abstract:
Recently, the transfer application of diffusion models to super-resolution tasks has faced the problem of decreased fidelity. Due to the inherent random sampling characteristics of diffusion models, direct application in super-resolution tasks can result in generated details deviating from the true distribution of high-resolution images. To address this, we propose DeltaDiff, a novel framework that constrains the diffusion process; its essence is to establish a deterministic mapping path between HR and LR, rather than the random noise disturbance process of traditional diffusion models. Theoretical analysis demonstrates a 25% reduction in diffusion entropy in the residual space compared to pixel-space diffusion, effectively suppressing irrelevant noise interference. The experimental results show that our method surpasses state-of-the-art models and generates results with better fidelity. This work establishes a new low-rank constrained paradigm for applying diffusion models to image reconstruction tasks, balancing stochastic generation with structural fidelity. Our code and model are publicly available at https://github.com/continueyang/DeltaDiff
中文:DeltaDiff通过建立高分辨率与低分辨率图像间的确定性映射路径,将扩散熵降低25%,在超分辨率任务中有效抑制噪声干扰,以超越现有最优模型的保真度生成结果。
English: DeltaDiff introduces a deterministic mapping between high- and low-resolution images to reduce diffusion entropy by 25%, surpassing state-of-the-art models in fidelity by constraining noise interference in super-resolution tasks.
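The core idea, as described, is to diffuse in a residual space anchored to the low-resolution input rather than in raw pixel space. A minimal sketch of that anchoring is shown below; the concrete network, schedule, and deterministic mapping used by DeltaDiff are assumptions not covered here.

```python
import torch
import torch.nn.functional as F

def make_residual_target(hr, lr):
    """Residual-space target (assumed formulation): the model learns the delta
    between the HR image and a bicubic upsampling of the LR input, rather than
    the full image, anchoring generation to the LR content."""
    anchor = F.interpolate(lr, size=hr.shape[-2:], mode="bicubic", align_corners=False)
    return hr - anchor, anchor

# Usage sketch: train a denoiser on the residual, then reconstruct
#   residual, anchor = make_residual_target(hr, lr)
#   sr = anchor + predicted_residual
```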

Authors:Huaying Yuan, Jian Ni, Zheng Liu, Yueze Wang, Junjie Zhou, Zhengyang Liang, Bo Zhao, Zhao Cao, Zhicheng Dou, Ji-Rong Wen
Title: MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval
Abstract:
Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on the end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LMVR), distinguished by the following features. First, it is created based on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios in three levels: global-level, event-level, object-level, covering common tasks like action recognition, object localization, and causal reasoning, etc. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker(https://yhy-2000.github.io/MomentSeeker/) to facilitate future research in this area.
中文: MomentSeeker是一个新颖的长视频片段检索基准,通过采用多样化的长视频和多种查询形式,解决了现有基准在评估关键时刻准确定位方面的不足。
English: MomentSeeker is a new benchmark for long-video moment retrieval that uses diverse, lengthy videos and various query types to address the limitations of existing benchmarks in evaluating accurate key moment localization.

Authors:Zhen Li, Weiwei Sun, Shrisudhan Govindarajan, Shaobo Xia, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
Title: NoKSR: Kernel-Free Neural Surface Reconstruction via Point Cloud Serialization
Abstract:
We present a novel approach to large-scale point cloud surface reconstruction by developing an efficient framework that converts an irregular point cloud into a signed distance field (SDF). Our backbone builds upon recent transformer-based architectures (i.e., PointTransformerV3), which serialize the point cloud into a locality-preserving sequence of tokens. We efficiently predict the SDF value at a point by aggregating nearby tokens, where fast approximate neighbors can be retrieved thanks to the serialization. We serialize the point cloud at different levels/scales, and non-linearly aggregate a feature to predict the SDF value. We show that aggregating across multiple scales is critical to overcome the approximations introduced by the serialization (i.e., false negatives in the neighborhood). Our framework sets a new state of the art in terms of accuracy and efficiency (better or similar performance with half the latency of the best prior method, coupled with a simpler implementation), particularly on outdoor datasets where sparse-grid methods have shown limited performance.

Authors:Yunjie Tian, Qixiang Ye, David Doermann
Title: YOLOv12: Attention-Centric Real-Time Object Detectors
Abstract:
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
中文: YOLOv12提出了一种以注意力为核心的框架,在保持竞争性速度的同时实现了更高的精度,在效率和性能上均超越了基于CNN和端到端的实时检测器。
English: YOLOv12 introduces an attention-centric framework that achieves superior accuracy with competitive speed, outperforming both CNN-based and end-to-end real-time detectors in efficiency and performance.

Authors:Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
Title: RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
Abstract:
After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. We compare our dataset with other widely used datasets of equivalent scale for CLIP training. Models pre-trained on RealSyn consistently achieve state-of-the-art performance across various downstream tasks, including linear probe, zero-shot transfer, zero-shot robustness, and zero-shot retrieval. Furthermore, extensive experiments confirm that RealSyn significantly enhances contrastive vision-language representation learning and demonstrates robust scalability. To facilitate future research, the RealSyn dataset and pretrained model weights are released at https://github.com/deepglint/RealSyn.
中文: RealSyn数据集通过分层检索方法提取高质量图像并关联语义相关文本,结合合成文本生成增强视觉信息,使CLIP模型在多种下游任务中实现最优性能,并展现出强大的可扩展性。
English: The RealSyn dataset, constructed by extracting and associating high-quality images with semantically relevant texts through a hierarchical retrieval method and enhanced with synthetic text generation, enables CLIP models to achieve state-of-the-art performance across various downstream tasks.

Authors:Xiang He, Dongcheng Zhao, Yiting Dong, Guobin Shen, Xin Yang, Yi Zeng
Title: Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning
Abstract:
Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, thereby limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration. Additionally, a semantic alignment optimization mechanism is introduced to align cross-modal features within a shared semantic space, improving their consistency and complementarity. Extensive experiments on three benchmark datasets CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS demonstrate that S-CMRL significantly outperforms existing multimodal SNN methods, achieving the state-of-the-art performance. The code is publicly available at https://github.com/Brain-Cog-Lab/S-CMRL.
中文:提出的S-CMRL框架采用基于Transformer的脉冲神经网络,通过跨模态残差学习和语义对齐机制有效整合视听信息,在多个基准数据集上实现了最先进的性能。
English: The proposed S-CMRL framework introduces a Transformer-based spiking neural network with cross-modal residual learning and semantic alignment to effectively integrate audio-visual information, achieving state-of-the-art performance on multiple benchmark datasets.
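An illustrative cross-modal residual fusion step is sketched below: visual features attend to audio features, and the attended result is added back to the visual stream as a residual. The spiking-specific machinery and spatiotemporal attention of S-CMRL are omitted, so this is only a conventional (non-spiking) stand-in for the general structure.

```python
import torch
import torch.nn as nn

class CrossModalResidual(nn.Module):
    """Assumed fusion structure: audio information is injected into the visual
    stream through attention, and added as a residual so the original visual
    features are preserved."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual, audio):
        # visual: [B, T_v, C], audio: [B, T_a, C]
        attended, _ = self.attn(query=visual, key=audio, value=audio)
        return visual + self.proj(attended)   # residual keeps the visual stream intact
```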

Authors:Emily Jin, Joy Hsu, Jiajun Wu
Title: Predicate Hierarchies Improve Few-Shot State Classification
Abstract:
State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desideratum for state classification models to generalize to novel queries with few examples. To this end, we propose PHIER, which leverages predicate hierarchies to generalize effectively in few-shot scenarios. PHIER uses an object-centric scene encoder, self-supervised losses that infer semantic relations between predicates, and a hyperbolic distance metric that captures hierarchical structure; it learns a structured latent space of image-predicate pairs that guides reasoning over state classification queries. We evaluate PHIER in the CALVIN and BEHAVIOR robotic environments and show that PHIER significantly outperforms existing methods in few-shot, out-of-distribution state classification, and demonstrates strong zero- and few-shot generalization from simulated to real-world tasks. Our results demonstrate that leveraging predicate hierarchies improves performance on state classification tasks with limited data.
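The hyperbolic component can be illustrated with the standard Poincaré-ball distance, which grows rapidly near the boundary of the ball and thus lends itself to encoding hierarchy depth. PHIER's exact formulation, curvature, and embedding scheme are not given in the abstract, so the textbook version below is an assumption.

```python
import torch

def poincare_distance(u, v, eps=1e-7):
    """Distance between points u, v inside the unit Poincare ball (||.|| < 1)."""
    sq = ((u - v) ** 2).sum(-1)
    denom = (1 - (u ** 2).sum(-1)).clamp_min(eps) * (1 - (v ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom)
```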

Authors:Gang Yang, Miao Wang, Quan Zhou, Jiangchuan Li
Title: YUNet: Improved YOLOv11 Network for Skyline Detection
Abstract:
Skyline detection plays an important role in geolocalization, flight control, visual navigation, port security, etc. The appearance of sky and non-sky areas varies with weather and illumination, which makes skyline detection challenging. In this research, we propose the YUNet algorithm, which improves the YOLOv11 architecture to segment the sky region and extract the skyline under complicated and variable circumstances. To improve multi-scale and long-range contextual feature fusion, the YOLOv11 architecture is extended into a UNet-like architecture consisting of encoder, neck, and decoder submodules. The encoder extracts multi-scale features from the given images, the neck fuses these multi-scale features, and the decoder applies the fused features to reconstruct the prediction. To validate the proposed approach, YUNet was tested on the Skyfinder and CH1 datasets for segmentation and skyline detection, respectively. Our tests show that the IoU of YUNet segmentation reaches 0.9858, and the average error of YUNet skyline detection is just 1.36 pixels. The implementation is published at https://github.com/kuazhangxiaoai/SkylineDet-YOLOv11Seg.git.
中文: YUNet算法基于YOLOv11改进,采用类似UNet的结构,在复杂多变环境下能有效分割天空区域并提取天际线,分割精度高且检测误差极小。
English: The YUNet algorithm, an enhanced version of YOLOv11 with a UNet-like structure, effectively segments sky regions and detects skylines in diverse conditions, achieving high accuracy in segmentation and minimal error in detection.

Authors:Yinghao Shuai, Ran Yu, Yuantao Chen, Zijian Jiang, Xiaowei Song, Nan Wang, Jv Zheng, Jianzhu Ma, Meng Yang, Zhicheng Wang, Wenbo Ding, Hao Zhao
Title: PUGS: Zero-shot Physical Understanding with Gaussian Splatting
Abstract:
Current robotic systems can understand the categories and poses of objects well. But understanding physical properties like mass, friction, and hardness, in the wild, remains challenging. We propose a new method that reconstructs 3D objects using the Gaussian splatting representation and predicts various physical properties in a zero-shot manner. We propose two techniques during the reconstruction phase: a geometry-aware regularization loss function to improve the shape quality and a region-aware feature contrastive loss function to promote region affinity. Two other new techniques are designed during inference: a feature-based property propagation module and a volume integration module tailored for the Gaussian representation. Our framework is named as zero-shot physical understanding with Gaussian splatting, or PUGS. PUGS achieves new state-of-the-art results on the standard benchmark of ABO-500 mass prediction. We provide extensive quantitative ablations and qualitative visualization to demonstrate the mechanism of our designs. We show the proposed methodology can help address challenging real-world grasping tasks. Our codes, data, and models are available at https://github.com/EverNorif/PUGS
中文:PUGS框架采用高斯溅射表示重建三维物体,并以零样本方式预测质量、摩擦等物理属性,在标准基准测试中创下最新记录,同时有效提升了机器人抓取任务的现实应用能力。
English: The proposed PUGS framework uses Gaussian splatting to reconstruct 3D objects and predict physical properties like mass and friction in a zero-shot manner, achieving state-of-the-art results and enhancing real-world robotic grasping tasks.

Authors:Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
Title: Diffusion Models without Classifier-free Guidance
Abstract:
This paper presents Model-guidance (MG), a novel objective for training diffusion models that addresses and removes the need for the commonly used Classifier-free guidance (CFG). Our approach goes beyond modeling the data distribution alone by incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is simple yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates training, doubles the inference speed, and achieves exceptional quality that parallels and even surpasses that of concurrent diffusion models with CFG. Extensive experiments demonstrate its effectiveness, efficiency, and scalability across different models and datasets. Finally, we establish state-of-the-art performance on the ImageNet 256 benchmark with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
中文: 本文提出模型引导(MG)这一创新目标,通过引入条件后验概率取代了常用的无分类器引导(CFG),在加速训练和推理的同时实现了更优的图像生成质量,并在ImageNet 256基准测试中取得了当前最佳性能。
English: This paper introduces Model-guidance (MG), a novel training objective for diffusion models that replaces Classifier-free guidance (CFG) by incorporating conditional posterior probabilities, achieving faster training, doubled inference speed, and superior image quality with state-of-the-art results on benchmarks.
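For context, classifier-free guidance, the mechanism MG is designed to make unnecessary, combines conditional and unconditional predictions at every sampling step and therefore needs two forward passes per step. The sketch below shows the standard CFG combination; the `model(x_t, t, cond)` signature is a hypothetical interface, not the paper's code.

```python
import torch

def cfg_epsilon(model, x_t, t, cond, w=4.0):
    # Standard classifier-free guidance: extrapolate from the unconditional
    # noise prediction toward the conditional one with guidance weight w.
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)
```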

Authors:Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
Title: HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Abstract:
The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
Chinese: 本研究提出了HermesFlow框架,通过同源偏好数据和迭代优化,有效弥合了多模态大语言模型中理解与生成能力之间的差距,在统一这两种能力方面展现出卓越性能。
English: The study introduces HermesFlow, a framework that effectively bridges the gap between understanding and generation in Multimodal Large Language Models using homologous preference data and iterative optimization, demonstrating superior performance in aligning these capabilities.

Authors:Ye Tian, Ling Yang, Xinchen Zhang, Yunhai Tong, Mengdi Wang, Bin Cui
Title: Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
Abstract:
We propose Diffusion-Sharpening, a fine-tuning approach that enhances downstream alignment by optimizing sampling trajectories. Existing RL-based fine-tuning methods focus on single training timesteps and neglect trajectory-level alignment, while recent sampling trajectory optimization methods incur significant inference NFE costs. Diffusion-Sharpening overcomes this by using a path integral framework to select optimal trajectories during training, leveraging reward feedback, and amortizing inference costs. Our method demonstrates superior training efficiency with faster convergence, and best inference efficiency without requiring additional NFEs. Extensive experiments show that Diffusion-Sharpening outperforms RL-based fine-tuning methods (e.g., Diffusion-DPO) and sampling trajectory optimization methods (e.g., Inference Scaling) across diverse metrics including text alignment, compositional capabilities, and human preferences, offering a scalable and efficient solution for future diffusion model fine-tuning. Code: https://github.com/Gen-Verse/Diffusion-Sharpening
中文: Diffusion-Sharpening 是一种基于路径积分框架优化采样轨迹的微调方法,在提升对齐效果和效率的同时,无需额外计算成本,显著优于现有方法。
English: Diffusion-Sharpening is a fine-tuning method that optimizes sampling trajectories using a path integral framework to enhance alignment and efficiency, outperforming existing approaches in both training convergence and inference without extra computational cost.

Authors:Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, Gordon Wetzstein
Title: FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
Abstract:
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining the inference efficiency (i.e., less than 0.5 seconds). The project page and code can be found at: https://zhanghe3z.github.io/FLARE/
Chinese: FLARE是一种前馈模型,能够从稀疏视图图像中高效推断相机姿态并重建三维几何结构,在姿态估计、几何重建和新视角合成任务中实现顶尖性能,且推理时间低于0.5秒。
English: FLARE is a feed-forward model that efficiently estimates camera poses and reconstructs 3D geometry from sparse-view images, achieving state-of-the-art results in pose estimation, geometry reconstruction, and novel-view synthesis in under 0.5 seconds.

Authors:Chaoyue Song, Jianfeng Zhang, Xiu Li, Fan Yang, Yiwen Chen, Zhongcong Xu, Jun Hao Liew, Xiaoyang Guo, Fayao Liu, Jiashi Feng, Guosheng Lin
Title: MagicArticulate: Make Your 3D Models Articulation-Ready
Abstract:
With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutions. In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL. Second, we propose a novel skeleton generation method that formulates the task as a sequence modeling problem, leveraging an auto-regressive transformer to naturally handle varying numbers of bones or joints within skeletons and their inherent dependencies across different 3D models. Third, we predict skinning weights using a functional diffusion process that incorporates volumetric geodesic distance priors between vertices and joints. Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories, achieving high-quality articulation that enables realistic animation. Project page: https://chaoyuesong.github.io/MagicArticulate.
中文:MagicArticulate是一个创新框架,通过构建大规模标注数据集、采用基于Transformer的自回归骨架生成方法和结合体积测地距离先验的功能扩散过程,自动将静态3D模型转换为支持逼真动画的关节化资源,在各类物体上实现卓越的动画效果。
English: MagicArticulate is an innovative framework that automatically converts static 3D models into articulation-ready assets using a large-scale benchmark, a transformer-based skeleton generation method, and functional diffusion for skinning weights, achieving superior animation quality across diverse categories.

Authors:En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
Title: Unhackable Temporal Rewarding for Scalable Video MLLMs
Abstract:
In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models shortcut by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate the temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.
中文总结:本研究揭示了视频处理多模态大模型中导致性能退化的"时间黑客"现象,提出了时间困惑度指标和不可黑客时间奖励框架,不仅能有效应对该问题,还显著提升了视频理解能力。
English Summary: This study identifies "temporal hacking" as the cause of performance degradation in video-processing MLLMs and introduces the Temporal Perplexity metric and Unhackable Temporal Rewarding framework to effectively counteract this issue while enhancing video comprehension.

Authors:Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji
Title: HumanGif: Single-View Human Diffusion with Generative Prior
Abstract:
Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perpetually realistic, view-consistent, and temporally coherent human avatars from a single image, as limited information is available in the single-view input setting. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with generative prior. Specifically, we formulate the single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models to complement the missing information. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif to learn spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformation. Furthermore, we introduce an image-level loss during optimization to bridge the gap between latent and image spaces in diffusion models. Extensive experiments on RenderPeople, DNA-Rendering, THuman 2.1, and TikTok datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
中文摘要:先前的人体三维生成方法难以从单张图像创建真实、视角一致且时序连贯的虚拟形象,为此提出的HumanGif模型通过扩散生成先验与神经辐射场模块,在新型视角与姿态合成中实现了最优的感知效果。
English Summary: Previous 3D human creation methods struggle with generating realistic, view-consistent, and temporally coherent avatars from a single image, leading to the development of HumanGif, which uses a diffusion model with generative priors and a Human NeRF module to achieve superior perceptual performance in novel view and pose synthesis.

Authors:Yifan Liang, Fangkun Liu, Andong Li, Xiaodong Li, Chengshi Zheng
Title: NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing
Abstract:
Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits of leveraging VSR models. However, these methods typically rely on mel-spectrograms as an intermediate representation, which may introduce a key bottleneck: the domain gap between synthetic mel-spectrograms, generated from inherently error-prone lip-to-speech mappings, and real mel-spectrograms used to train vocoders. This mismatch inevitably degrades synthesis quality. To bridge this gap, we propose Natural Lip-to-Speech (NaturalL2S), an end-to-end framework integrating acoustic inductive biases with differentiable speech generation components. Specifically, we introduce a fundamental frequency (F0) predictor to capture prosodic variations in synthesized speech. The predicted F0 then drives a Differentiable Digital Signal Processing (DDSP) synthesizer to generate a coarse signal which serves as prior information for subsequent speech synthesis. Additionally, instead of relying on a reference speaker embedding as an auxiliary input, our approach achieves satisfactory performance on speaker similarity without explicitly modelling speaker characteristics. Both objective and subjective evaluation results demonstrate that NaturalL2S can effectively enhance the quality of the synthesized speech when compared to state-of-the-art methods. Our demonstration page is accessible at https://yifan-liang.github.io/NaturalL2S/.
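A minimal DDSP-style harmonic synthesizer illustrates how a predicted F0 track can drive a coarse, differentiable signal prior: sinusoids at integer multiples of F0 are summed with per-harmonic amplitudes. The paper's actual synthesizer (noise and filter components, frame rates, conditioning) is assumed to be richer than this numpy sketch.

```python
import numpy as np

def harmonic_synth(f0, amps, sr=16000):
    """f0: [T] per-sample fundamental frequency in Hz; amps: [T, K] harmonic amplitudes."""
    phase = 2 * np.pi * np.cumsum(f0) / sr                      # instantaneous phase of the fundamental
    k = np.arange(1, amps.shape[1] + 1)                         # harmonic numbers 1..K
    return (amps * np.sin(phase[:, None] * k[None, :])).sum(axis=1)   # coarse harmonic signal
```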

Authors:Yinan Chen, Jiangning Zhang, Yali Bi, Xiaobin Hu, Teng Hu, Zhucun Xue, Ran Yi, Yong Liu, Ying Tai
Title: Image Inversion: A Survey from GANs to Diffusion and Beyond
Abstract:
Image inversion is a fundamental task in generative models, aiming to map images back to their latent representations to enable downstream applications such as editing, restoration, and style transfer. This paper provides a comprehensive review of the latest advancements in image inversion techniques, focusing on two main paradigms: Generative Adversarial Network (GAN) inversion and diffusion model inversion. We categorize these techniques based on their optimization methods. For GAN inversion, we systematically classify existing methods into encoder-based approaches, latent optimization approaches, and hybrid approaches, analyzing their theoretical foundations, technical innovations, and practical trade-offs. For diffusion model inversion, we explore training-free strategies, fine-tuning methods, and the design of additional trainable modules, highlighting their unique advantages and limitations. Additionally, we discuss several popular downstream applications and emerging applications beyond image tasks, identifying current challenges and future research directions. By synthesizing the latest developments, this paper aims to provide researchers and practitioners with a valuable reference resource, promoting further advancements in the field of image inversion. We keep track of the latest works at https://github.com/RyanChenYN/ImageInversion
中文: 本文系统综述了图像反转技术的最新进展,重点对GAN反转和扩散模型反转的方法进行分类分析,探讨其应用与挑战,旨在为研究者提供全面的参考资源。
English: This paper offers a comprehensive review of image inversion techniques, categorizing GAN and diffusion model methods by optimization approaches and analyzing their applications, challenges, and future directions to serve as a key reference for researchers.
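Among the surveyed GAN-inversion families, latent optimization is the simplest to illustrate: a latent code is optimized directly so the frozen generator reproduces the target image. Real pipelines typically add perceptual losses, extended latent spaces, and regularizers; the sketch below uses only an L2 objective and a placeholder generator interface.

```python
import torch

def optimize_latent(generator, target, latent_dim=512, steps=200, lr=0.05):
    """Latent-optimization inversion sketch: find z such that generator(z) ~ target."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(generator(z), target)   # reconstruction-only objective
        loss.backward()
        opt.step()
    return z.detach()
```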

Authors:Sheng-Yu Huang, Zi-Ting Chou, Yu-Chiang Frank Wang
Title: 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency
Abstract:
When performing 3D inpainting using novel-view rendering methods like Neural Radiance Field (NeRF) or 3D Gaussian Splatting (3DGS), how to achieve texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views for updating the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods quantitatively and qualitatively.

Authors:Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
Title: Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning
Abstract:
Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and constrained by various training limitations. In this paper, we propose the Modular-based Visual Contrastive Decoding (MVCD) framework to overcome this obstacle. Our framework leverages LLMs' In-Context Learning (ICL) capability and the proposed visual contrastive-example decoding (CED), specifically tailored for this framework, without requiring any additional training. By converting visual signals into text and focusing on contrastive output distributions during decoding, we can highlight the new information introduced by contextual examples, explore their connections, and avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual perception, enabling them to see and reason over the input visuals. To demonstrate MVCD's effectiveness, we conduct experiments with four LLMs across five question answering datasets. Our results not only show consistent improvements in model accuracy but also explain the effective components of our decoding strategy. Our code will be available at https://github.com/Pbhgit/MVCD.
中文: 提出的模块化视觉对比解码(MVCD)框架通过将视觉信号转化为文本并采用对比解码,无需额外训练即可增强大语言模型的视觉推理能力,在多个数据集上提升了准确性。
English: The proposed Modular-based Visual Contrastive Decoding (MVCD) framework enhances LLMs' visual reasoning without additional training by converting visuals to text and using contrastive decoding, improving accuracy across multiple datasets.
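The contrastive step can be sketched as follows: the next-token logits obtained with the visual contextual examples are contrasted against those obtained without them, amplifying what the examples contribute. The exact weighting and any plausibility constraints used by MVCD are not stated in the abstract, so the form below is an assumption.

```python
import torch

def contrastive_decode_step(logits_with_ctx, logits_without_ctx, alpha=1.0):
    # Amplify the shift introduced by the visual contextual examples relative
    # to the plain prompt, then pick the next token greedily.
    contrast = (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx
    return torch.argmax(contrast, dim=-1)
```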

Authors:Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang
Title: MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow
Abstract:
In this paper, we present MVTokenFlow for high-quality 4D content creation from monocular videos. Recent advancements in generative models such as video diffusion models and multiview diffusion models enable us to create videos or 3D models. However, extending these generative models for dynamic 4D content creation is still a challenging task that requires the generated content to be consistent spatially and temporally. To address this challenge, MVTokenFlow utilizes the multiview diffusion model to generate multiview images on different timesteps, which attains spatial consistency across different viewpoints and allows us to reconstruct a reasonable coarse 4D field. Then, MVTokenFlow further regenerates all the multiview images using the rendered 2D flows as guidance. The 2D flows effectively associate pixels from different timesteps and improve the temporal consistency by reusing tokens in the regeneration process. Finally, the regenerated images are spatiotemporally consistent and utilized to refine the coarse 4D field to get a high-quality 4D field. Experiments demonstrate the effectiveness of our design and show significantly improved quality than baseline methods.
中文:MVTokenFlow提出了一种从单目视频生成高质量4D内容的新方法,通过首先生成空间一致的多视角图像,再利用令牌复用技术优化时间一致性,最终显著超越了现有基线方法的质量表现。
English: MVTokenFlow introduces a novel method for generating high-quality 4D content from monocular videos by first creating spatially consistent multiview images and then refining them with temporal consistency through token reuse, outperforming existing baseline methods.

Authors:Angel Villar-Corrales, Gjergj Plepi, Sven Behnke
Title: TextOCVP: Object-Centric Video Prediction with Language Guidance
Abstract:
Understanding and forecasting future scene states is critical for autonomous agents to plan and act effectively in complex environments. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and predicting future scene states, but often struggle to scale beyond simple synthetic datasets and to integrate external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for video prediction guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, enabling accurate and controllable predictions. TextOCVP's structured latent space offers a more precise control of the forecasting process, outperforming several video prediction baselines on two datasets. Additionally, we show that structured object-centric representations provide superior robustness to novel scene configurations, as well as improved controllability and interpretability, enabling more precise and understandable predictions. Videos and code are available at https://play-slot.github.io/TextOCVP.

Authors:Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang
Title: MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression
Abstract:
Large vision-language models (LVLMs) have shown great promise in medical applications, particularly in visual question answering (MedVQA) and diagnosis from medical images. However, existing datasets and models often fail to consider critical aspects of medical diagnostics, such as the integration of historical records and the analysis of disease progression over time. In this paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel dataset for MedVQA that focuses on identifying changes in specific regions between two patient visits. Unlike previous datasets that primarily address single-image questions, MMXU enables multi-image questions, incorporating both current and historical patient data. We demonstrate the limitations of current LVLMs in identifying disease progression on MMXU-test, even those that perform well on traditional benchmarks. To address this, we propose a MedRecord-Augmented Generation (MAG) approach, incorporating both global and regional historical records. Our experiments show that integrating historical records significantly enhances diagnostic accuracy by at least 20%, bridging the gap between current LVLMs and human expert performance. Additionally, we fine-tune models with MAG on MMXU-dev, which demonstrates notable improvements. We hope this work illuminates avenues for advancing the use of LVLMs in medical diagnostics by emphasizing the importance of historical context in interpreting medical images. Our dataset is released at https://github.com/linjiemu/MMXU.
中文: 本文提出了用于医学视觉问答的新型数据集MMXU,通过整合历史患者记录实现多图像分析,并提出医疗记录增强生成方法,结合全局和局部历史数据将诊断准确率提升至少20%。
English: This paper introduces MMXU, a novel dataset for medical visual question answering that enables multi-image analysis by incorporating historical patient records, and proposes a MedRecord-Augmented Generation (MAG) approach that improves diagnostic accuracy by at least 20% by integrating both global and regional historical data.
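To make the MAG idea concrete, here is a hedged sketch of how current and historical records might be packed into a single multi-image prompt; the field names, image placeholders, and template wording are hypothetical and not taken from the paper.

def build_mag_prompt(question, current_image_tag, prior_image_tag, global_history, regional_history):
    # Assemble a prompt that pairs the current study with the prior study and its records.
    lines = [
        "You are a radiology assistant comparing two chest X-rays of the same patient.",
        f"Current study: {current_image_tag}",
        f"Prior study: {prior_image_tag}",
        f"Global history: {global_history}",
        "Regional findings from the prior report:",
    ]
    lines += [f"- {region}: {finding}" for region, finding in regional_history.items()]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_mag_prompt(
    question="Has the opacity in the left lower lobe progressed since the last visit?",
    current_image_tag="<image_1>", prior_image_tag="<image_2>",
    global_history="72-year-old patient with chronic heart failure and a prior pleural effusion.",
    regional_history={"left lower lobe": "patchy opacity", "right lung": "clear"},
)
print(prompt)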

Authors:Dariush Lotfi, Mohammad-Ali Nikouei Mahani, Mohamad Koohi-Moghadam, Kyongtae Ty Bae
Title: Safeguarding AI in Medical Imaging: Post-Hoc Out-of-Distribution Detection with Normalizing Flows
Abstract:
In AI-driven medical imaging, the failure to detect out-of-distribution (OOD) data poses a severe risk to clinical reliability, potentially leading to critical diagnostic errors. Current OOD detection methods often demand impractical retraining or modifications to pre-trained models, hindering their adoption in regulated clinical environments. To address this challenge, we propose a post-hoc normalizing flow-based approach that seamlessly integrates with existing pre-trained models without altering their weights. Our evaluation used a novel in-house built dataset, MedOOD, meticulously curated to simulate clinically relevant distributional shifts, alongside the MedMNIST benchmark dataset. On our in-house MedOOD dataset, our method achieved an AUROC of 84.61%, outperforming state-of-the-art methods like ViM (80.65%) and MDS (80.87%). Similarly, on MedMNIST, it reached an exceptional AUROC of 93.8%, surpassing leading approaches such as ViM (88.08%) and ReAct (87.05%). This superior performance, coupled with its post-hoc integration capability, positions our method as a vital safeguard for enhancing safety in medical imaging workflows. The model and code to build OOD datasets are publicly accessible at https://github.com/dlotfi/MedOODFlow.
中文摘要:本研究提出了一种后验归一化流方法,无需重新训练即可与预训练的医学影像模型无缝集成,在MedOOD和MedMNIST数据集上分别实现了84.61%和93.8%的AUROC评分,显著提升了分布外检测性能。
English Summary: This study introduces a post-hoc normalizing flow-based method that integrates with pre-trained medical imaging models without retraining, achieving superior OOD detection performance with AUROC scores of 84.61% on MedOOD and 93.8% on MedMNIST datasets.
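The minimal PyTorch sketch below conveys the post-hoc flavor of the approach: a toy single-coupling-layer flow is fit on frozen in-distribution features and the negative log-likelihood is used as the OOD score. It is a simplification under assumed shapes and hyperparameters, not the MedOODFlow implementation.

import math
import torch
import torch.nn as nn

class Coupling(nn.Module):
    # One affine coupling layer: z1 = x1, z2 = x2 * exp(s(x1)) + t(x1).
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                                   # bounded scales for numerical stability
        z = torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)
        return z, s.sum(dim=-1)                             # log-determinant of the Jacobian

def ood_score(flow, x):
    z, log_det = flow(x)
    log_pz = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * z.shape[-1] * math.log(2 * math.pi)
    return -(log_pz + log_det)                              # negative log-likelihood as OOD score

feats = torch.randn(512, 32)                                # stand-in for frozen in-distribution features
flow = Coupling(32)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for _ in range(200):
    loss = ood_score(flow, feats).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# Far-out-of-distribution inputs should receive higher scores; this typically prints tensor(True).
print(ood_score(flow, torch.randn(4, 32) * 4).mean() > ood_score(flow, feats).mean())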

Authors:Arnaud Bougaham, Benoît Frénay
Title: Towards a Trustworthy Anomaly Detection for Critical Applications through Approximated Partial AUC Loss
Abstract:
Anomaly Detection is a crucial step for critical applications such as in the industrial, medical, or cybersecurity domains. These sectors share the same requirement of handling the different types of classification errors differently. Indeed, even if false positives are acceptable, false negatives are not, because they would reflect a missed detection of a quality issue, a disease, or a cyber threat. To fulfill this requirement, we propose a method that dynamically applies a trustworthy approximated partial AUC ROC loss (tapAUC). A binary classifier is trained to optimize the specific range of the AUC ROC curve that prevents the True Positive Rate (TPR) from reaching 100% while minimizing the False Positive Rate (FPR). The optimal threshold that does not trigger any false negative is then kept and used at the test step. The results show a TPR of 92.52% at a 20.43% FPR, averaged across 6 datasets, representing a TPR improvement of 4.3% at an FPR cost of 12.2% compared with other state-of-the-art methods. The code is available at https://github.com/ArnaudBougaham/tapAUC.
中文: 本文提出了一种可信赖的近似部分AUC ROC损失方法,通过优化分类器在关键应用中避免漏报,在六个数据集上实现了92.52%的真阳性率和20.43%的假阳性率。
English: This paper introduces a trustworthy approximated partial AUC ROC loss method that optimizes classifiers to prevent false negatives in critical applications, achieving a 92.52% true positive rate at a 20.43% false positive rate across six datasets.
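The sketch below is only a generic pairwise surrogate for optimizing a restricted FPR range of the ROC curve; the paper's tapAUC loss and thresholding procedure are more involved, and the fpr_max and margin values here are illustrative.

import torch

def partial_auc_surrogate(scores_pos, scores_neg, fpr_max=0.2, margin=0.1):
    # Focus on the hardest negatives, i.e. the region of the curve within the targeted FPR budget.
    k = max(1, int(fpr_max * scores_neg.numel()))
    hard_neg = scores_neg.topk(k).values
    diff = hard_neg.unsqueeze(0) - scores_pos.unsqueeze(1) + margin
    return torch.relu(diff).mean()          # penalize positives not ranked above these negatives

pos = torch.randn(64) + 1.0                 # anomaly scores for positive (defective) samples
neg = torch.randn(256)                      # anomaly scores for negative (normal) samples
print(partial_auc_surrogate(pos, neg))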

Authors:Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
Title: Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
Abstract:
Vision tokens in multimodal large language models often incur huge computational overhead due to their excessive length compared to the linguistic modality. Many recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually results in inferior performance compared to random token pruning and leads to incompatibility with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to a 1.99$\times$ and 2.99$\times$ speed-up in total time and the prefilling stage, respectively, with good compatibility with efficient attention operators. Our codes are available at https://github.com/ZichenWen1/DART.
中文: 本文提出DART方法,通过基于视觉令牌与关键令牌的重复性进行剪枝,无需训练即可实现高达88.9%的令牌削减和近3倍加速,同时保持模型性能。
English: This paper introduces DART, a training-free method that accelerates multimodal models by pruning vision tokens based on duplication with pivot tokens, achieving up to 88.9% token reduction and nearly 3x speed-up while maintaining performance.
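A minimal sketch of duplication-aware reduction (not the released DART code): pick pivot tokens, measure each token's maximum similarity to the pivots, and keep the least-duplicated ones. Pivot selection is random here purely for brevity, and the keep ratio is set to roughly match the 88.9% pruning figure above.

import torch
import torch.nn.functional as F

def prune_by_duplication(tokens, num_pivots=8, keep_ratio=0.111):
    # tokens: (N, D) vision tokens from the visual encoder.
    n = tokens.shape[0]
    pivots = tokens[torch.randperm(n)[:num_pivots]]
    sim = F.normalize(tokens, dim=-1) @ F.normalize(pivots, dim=-1).T
    duplication = sim.max(dim=-1).values                      # max cosine similarity to any pivot
    keep = max(1, int(n * keep_ratio))
    kept_idx = duplication.topk(keep, largest=False).indices  # least-duplicated tokens survive
    return tokens[kept_idx.sort().values]

pruned = prune_by_duplication(torch.randn(576, 1024))
print(pruned.shape)                                           # roughly 11% of the 576 tokens remain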

Authors:Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
Title: Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Abstract:
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic, seemingly simple capabilities such as object counting or length comparison, which are essential for complex tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the performance of three diverse VLMs on our proposed probing tasks. Furthermore, CogAlign enhances performance by an average of 4.6% on CHOCOLATE and 2.9% on MATH-VISION, outperforming or matching supervised fine-tuning methods while requiring only 60% less training data. These results highlight the effectiveness and generalizability of CogAlign in improving fundamental visual arithmetic capabilities and their transfer to downstream tasks.
Chinese: 视觉语言模型常因文本解码器问题在视觉算术任务中表现不佳,但提出的CogAlign训练策略显著提升了其性能与泛化能力,同时减少了数据需求。
English: Vision Language Models often fail at visual arithmetic tasks due to text decoder limitations, but the proposed CogAlign training strategy significantly enhances their performance and generalizability with less data.

Authors:Hao Xu, Tengfei Xue, Jianan Fan, Dongnan Liu, Yuqian Chen, Fan Zhang, Carl-Fredrik Westin, Ron Kikinis, Lauren J. O'Donnell, Weidong Cai
Title: Medical Image Registration Meets Vision Foundation Model: Prototype Learning and Contour Awareness
Abstract:
Medical image registration is a fundamental task in medical image analysis, aiming to establish spatial correspondences between paired images. However, existing unsupervised deformable registration methods rely solely on intensity-based similarity metrics, lacking explicit anatomical knowledge, which limits their accuracy and robustness. Vision foundation models, such as the Segment Anything Model (SAM), can generate high-quality segmentation masks that provide explicit anatomical structure knowledge, addressing the limitations of traditional methods that depend only on intensity similarity. Based on this, we propose a novel SAM-assisted registration framework incorporating prototype learning and contour awareness. The framework includes: (1) Explicit anatomical information injection, where SAM-generated segmentation masks are used as auxiliary inputs throughout training and testing to ensure the consistency of anatomical information; (2) Prototype learning, which leverages segmentation masks to extract prototype features and aligns prototypes to optimize semantic correspondences between images; and (3) Contour-aware loss, which leverages the edges of segmentation masks to improve the model's performance on fine-grained deformation fields. Extensive experiments demonstrate that the proposed framework significantly outperforms existing methods across multiple datasets, particularly in challenging scenarios with complex anatomical structures and ambiguous boundaries. Our code is available at https://github.com/HaoXu0507/IPMI25-SAM-Assisted-Registration.
中文摘要:本文提出了一种新颖的医学图像配准框架,通过结合分割一切模型(SAM)的原型学习和轮廓感知损失来注入解剖知识,在多个数据集上显著超越了现有方法的准确性和鲁棒性。
English Summary: This paper introduces a novel medical image registration framework that integrates the Segment Anything Model (SAM) to inject anatomical knowledge through prototype learning and contour-aware loss, significantly outperforming existing methods in accuracy and robustness.

Authors:Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Guoqi Li
Title: Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization
Abstract:
Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. However, most existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules, including the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module utilizes a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, consequently improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to enhance the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.
中文: 本文提出DMNIL,一种自监督的无人机视角地理定位方法,无需配对图像,通过伪标签和对比学习结合记忆模块,有效提升了跨视角特征对齐能力。
English: This paper introduces DMNIL, a self-supervised method for drone-view geo-localization that eliminates the need for paired drone-satellite images by using pseudo-labels and contrastive learning with memory modules to enhance feature alignment across views.

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?
Abstract:
Multimodal Large Language Models (MLLMs) are renowned for their superior instruction-following and reasoning capabilities across diverse problem domains. However, existing benchmarks primarily focus on assessing factual and logical correctness in downstream tasks, with limited emphasis on evaluating MLLMs' ability to interpret pragmatic cues and intermodal relationships. To address this gap, we assess the competency of MLLMs in performing Multimodal Discourse Analysis (MDA) using Coherence Relations. Our benchmark, CORDIAL, encompasses a broad spectrum of Coherence Relations across 3 different discourse domains at varying levels of granularity. Through our experiments on 10+ MLLMs employing different prompting strategies, we show that even top models like Gemini 1.5 Pro and GPT-4o fail to match the performance of simple classifier-based baselines. This study emphasizes the need to move beyond similarity-based metrics and adopt a discourse-driven framework for evaluating MLLMs, providing a more nuanced assessment of their capabilities. The benchmark and code are available at: https://aashish2000.github.io/CORDIAL/

Authors:Michael Fuest, Vincent Tao Hu, Björn Ommer
Title: MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation
Abstract:
Generating long, high-quality videos remains a challenge due to the complex interplay of spatial and temporal dynamics and hardware limitations. In this work, we introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching to enable efficient generation of high-quality long videos. By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos with lengths ten times beyond that of the training sequences. MaskFlow does so very efficiently by enabling the use of fast Masked Generative Model (MGM)-style sampling and can be deployed in both fully autoregressive as well as full-sequence generation modes. We validate the quality of our method on the FaceForensics (FFS) and Deepmind Lab (DMLab) datasets and report Frechet Video Distance (FVD) competitive with state-of-the-art approaches. We also provide a detailed analysis on the sampling efficiency of our method and demonstrate that MaskFlow can be applied to both timestep-dependent and timestep-independent models in a training-free manner.
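As a hedged illustration of the frame-level masking strategy (shapes and ratios are placeholders, not the paper's settings): earlier generated frames are left unmasked as conditioning context while the remaining frame tokens are masked and regenerated.

import torch

def frame_mask(num_frames, num_context, tokens_per_frame):
    # True marks masked token positions; the first num_context frames stay visible as context.
    mask = torch.ones(num_frames, tokens_per_frame, dtype=torch.bool)
    mask[:num_context] = False
    return mask

m = frame_mask(num_frames=16, num_context=4, tokens_per_frame=256)
print(m.float().mean().item())   # 0.75: three quarters of the tokens are masked in this window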

Authors:Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
Title: How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Abstract:
Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.
Chinese: 本研究通过知识回路演化的视角,揭示了大语言模型获取新知识受其与已有知识相关性影响,遵循从深层到浅层的模式,并经历形成到优化的阶段转变,为改进持续预训练策略提供了理论依据。
English: This study investigates how Large Language Models structurally embed new knowledge through knowledge circuit evolution, revealing that acquisition depends on relevance to existing knowledge, follows a deep-to-shallow pattern, and shifts from formation to optimization, offering insights to improve continual pre-training strategies.

Authors:Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Title: ReLearn: Unlearning via Learning for Large Language Models
Abstract:
Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts subsequent token prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.
中文摘要:提出的ReLearn方法通过数据增强和微调有效实现大语言模型的定向遗忘,同时保持输出质量,优于会破坏语言连贯性的反向优化方法。
English Summary: The proposed ReLearn method effectively achieves targeted unlearning in large language models through data augmentation and fine-tuning while maintaining output quality, outperforming reverse optimization approaches that compromise linguistic coherence.

Authors:Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie
Title: RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer
Abstract:
Traditional Chinese acupuncture methods often face controversy in clinical practice due to their high subjectivity. Additionally, current intelligent-assisted acupuncture systems have two major limitations: slow acupoint localization speed and low accuracy. To address these limitations, a new method leverages the excellent inference efficiency of the state-space model Mamba, while retaining the advantages of the attention mechanism in the traditional DETR architecture, to achieve efficient global information integration and provide high-quality feature information for acupoint localization tasks. Furthermore, by employing the concept of residual likelihood estimation, it eliminates the need for complex upsampling processes, thereby accelerating the acupoint localization task. Our method achieved state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human back, with an average Euclidean distance pixel error (EPE) of 7.792 and an average time consumption of 10.05 milliseconds per localization task. Compared to the second-best algorithm, our method improved both accuracy and speed by approximately 14%. This significant advancement not only enhances the efficacy of acupuncture treatment but also demonstrates the commercial potential of automated acupuncture robot systems. Access to our method is available at https://github.com/Sohyu1/RT-DEMT
中文:新型针灸定位方法融合Mamba推理效率与DETR注意力机制,通过残差似然估计避免复杂上采样,在背部穴位数据集上以7.792像素误差和10.05毫秒单次定位速度达到最优精度。
English: A novel acupuncture localization method combining Mamba's inference efficiency with DETR's attention mechanism achieves state-of-the-art accuracy (7.792px error) and speed (10.05ms per task) while eliminating complex upsampling through residual likelihood estimation.

Authors:Jingyuan Huang, Jen-tse Huang, Ziyi Liu, Xiaoyuan Liu, Wenxuan Wang, Jieyu Zhao
Title: AI Sees Your Location, But With A Bias Toward The Wealthy World
Abstract:
Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, VLMs still show regional biases in this task. To systematically evaluate these issues, we introduce a benchmark consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, regional biases persist in the form of frequent over-prediction of certain locations; for instance, the models consistently predict Sydney for images taken in Australia, as shown by the low entropy scores for these countries. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.
中文: 视觉语言模型在识别图像地理信息方面表现出显著准确性,但存在明显区域偏见,偏向发达和人口稠密地区,同时引发了对在线图像分享隐私问题的担忧。
English: Visual-Language Models demonstrate notable accuracy in identifying geographic details from images but exhibit significant regional biases, favoring developed and densely populated areas while raising privacy concerns for online image sharing.
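The low-entropy observation above can be checked with a few lines; the city lists below are synthetic, purely to show how a collapsed prediction distribution yields a low entropy score.

from collections import Counter
import math

def prediction_entropy(predicted_cities):
    counts = Counter(predicted_cities)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(prediction_entropy(["Sydney"] * 18 + ["Melbourne"] * 2))                 # ~0.47 bits: heavy over-prediction
print(prediction_entropy(["Sydney", "Melbourne", "Perth", "Brisbane"] * 5))    # 2.0 bits: balanced predictions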

Authors:Mohammad Mehdi Hosseini, Ali Pourramezan Fard, Mohammad H. Mahoor
Title: Faces of Fairness: Examining Bias in Facial Expression Recognition Datasets and Models
Abstract:
Building AI systems, including Facial Expression Recognition (FER), involves two critical aspects: data and model design. Both components significantly influence bias and fairness in FER tasks. Issues related to bias and fairness in FER datasets and models remain underexplored. This study investigates bias sources in FER datasets and models. Four common FER datasets--AffectNet, ExpW, Fer2013, and RAF-DB--are analyzed. The findings demonstrate that AffectNet and ExpW exhibit high generalizability despite data imbalances. Additionally, this research evaluates the bias and fairness of six deep models, including three state-of-the-art convolutional neural network (CNN) models: MobileNet, ResNet, XceptionNet, as well as three transformer-based models: ViT, CLIP, and GPT-4o-mini. Experimental results reveal that while GPT-4o-mini and ViT achieve the highest accuracy scores, they also display the highest levels of bias. These findings underscore the urgent need for developing new methodologies to mitigate bias and ensure fairness in datasets and models, particularly in affective computing applications. See our implementation details at https://github.com/MMHosseini/bias_in_FER.
中文摘要:本研究分析了四种面部表情识别数据集和六种深度学习模型的偏见与公平性,发现GPT-4o-mini和ViT虽获得最高准确率,但表现出最强偏见,凸显了在情感计算领域减少偏见的迫切需求。
English Summary: This study analyzes bias and fairness in four FER datasets and six deep learning models, finding that while GPT-4o-mini and ViT achieve top accuracy, they exhibit the highest bias, highlighting the need for bias mitigation in affective computing.

Authors:Jiuwu Hao, Liguo Sun, Ti Xiang, Yuting Wan, Haolin Song, Pin Lv
Title: FeaKM: Robust Collaborative Perception under Noisy Pose Conditions
Abstract:
Collaborative perception is essential for networks of agents with limited sensing capabilities, enabling them to work together by exchanging information to achieve a robust and comprehensive understanding of their environment. However, localization inaccuracies often lead to significant spatial message displacement, which undermines the effectiveness of these collaborative efforts. To tackle this challenge, we introduce FeaKM, a novel method that employs Feature-level Keypoints Matching to effectively correct pose discrepancies among collaborating agents. Our approach begins by utilizing a confidence map to identify and extract salient points from intermediate feature representations, allowing for the computation of their descriptors. This step ensures that the system can focus on the most relevant information, enhancing the matching process. We then implement a target-matching strategy that generates an assignment matrix, correlating the keypoints identified by different agents. This is critical for establishing accurate correspondences, which are essential for effective collaboration. Finally, we employ a fine-grained transformation matrix to synchronize the features of all agents and ascertain their relative statuses, ensuring coherent communication among them. Our experimental results demonstrate that FeaKM significantly outperforms existing methods on the DAIR-V2X dataset, confirming its robustness even under severe noise conditions. The code and implementation details are available at https://github.com/uestchjw/FeaKM.
中文摘要:FeaKM是一种通过特征级关键点匹配来校正姿态差异的新型协同感知方法,在DAIR-V2X数据集上显著优于现有方法,即使在严重噪声条件下也表现出强大鲁棒性。
English Summary: FeaKM is a novel collaborative perception method that corrects pose discrepancies through feature-level keypoint matching, significantly outperforming existing approaches on the DAIR-V2X dataset even under severe noise conditions.
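A toy 2D version of the matching-and-alignment idea (illustrative only; FeaKM operates on learned feature-level keypoints with a finer assignment strategy): descriptors are matched greedily and a rigid correction is estimated from the matched keypoints with a Procrustes/Kabsch fit.

import numpy as np

def match_and_align(kp_a, desc_a, kp_b, desc_b):
    # Greedy cosine matching stands in for the assignment matrix.
    da = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    db = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    idx_b = (da @ db.T).argmax(axis=1)
    src, dst = kp_a, kp_b[idx_b]
    mu_s, mu_d = src.mean(0), dst.mean(0)
    u, _, vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    if np.linalg.det(vt.T @ u.T) < 0:        # keep a proper rotation
        vt[-1] *= -1
    r = vt.T @ u.T
    return r, mu_d - r @ mu_s

theta = np.deg2rad(5.0)
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
pts = np.random.rand(30, 2) * 50.0
desc = np.random.rand(30, 32)
r_est, t_est = match_and_align(pts, desc, pts @ rot.T + np.array([1.0, -2.0]), desc)
print(np.round(t_est, 2))                    # recovers the simulated [1.0, -2.0] pose offset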

Authors:Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, Changsheng Xu
Title: SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
Abstract:
Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the capabilities of streaming video understanding of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://yzy-bupt.github.io/SVBench.

Authors:Ming Meng, Ke Mu, Yonggui Zhu, Zhe Zhu, Haoyu Sun, Heyang Yan, Zhaoxin Fan
Title: VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS
Abstract:
Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation. Though existing methods have achieved remarkable performance, they often exhibit limitations due to constrained dataset diversity and the restricted amount of information derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with the Variation-Enhanced Feature Extraction (VEFE) module, which seamlessly incorporates style-reference video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ the Variation-Compensation Style Encoder (VCSE), a transformer-style encoder equipped with an additive attention mechanism pooling layer, to robustly encode diverse StyleCLIPS representations and effectively manage stylistic variations. Finally, the Variation-Driven Gesture Predictor (VDGP) module fuses MFCC audio features with StyleCLIPS encodings via cross-attention, injecting this fused data into a cross-conditional autoregressive model to modulate 3D human gesture generation based on audio input and stylistic clues. The efficacy of our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness. The code and video results will be made publicly available upon acceptance: https://github.com/mookerr/VarGES/.
中文摘要:VarGes是一种新颖的框架,通过将视觉风格线索与音频输入相结合来增强伴随语音的手势生成,在3D人体手势的多样性和自然度方面优于现有方法。
English Summary: VarGes is a novel framework that enhances co-speech gesture generation by integrating visual style cues with audio inputs, achieving superior diversity and naturalness in 3D human gestures compared to existing methods.

Authors:Mingyang Zhao, Gaofeng Meng, Dong-Ming Yan
Title: Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy
Abstract:
Non-rigid alignment of point clouds is crucial for scene understanding, reconstruction, and various computer vision and robotics tasks. Recent advancements in implicit deformation networks for non-rigid registration have significantly reduced the reliance on large amounts of annotated training data. However, existing state-of-the-art methods still face challenges in handling occlusion scenarios. To address this issue, this paper introduces an innovative unsupervised method called Occlusion-Aware Registration (OAR) for non-rigidly aligning point clouds. The key innovation of our method lies in the utilization of the adaptive correntropy function as a localized similarity measure, enabling us to treat individual points distinctly. In contrast to previous approaches that solely minimize overall deviations between two shapes, we combine unsupervised implicit neural representations with the maximum correntropy criterion to optimize the deformation of unoccluded regions. This effectively avoids collapse, tearing, and other physically implausible results. Moreover, we present a theoretical analysis and establish the relationship between the maximum correntropy criterion and the commonly used Chamfer distance, highlighting that the correntropy-induced metric can serve as a more universal measure for point cloud analysis. Additionally, we introduce locally linear reconstruction to ensure that regions lacking correspondences between shapes still undergo physically natural deformations. Our method achieves superior or competitive performance compared to existing approaches, particularly when dealing with occluded geometries. We also demonstrate the versatility of our method in challenging tasks such as large deformations, shape interpolation, and shape completion under occlusion disturbances.
中文: 本文提出了一种无监督的遮挡感知配准方法,通过自适应熵和隐式神经表示有效处理非刚性点云对齐中的遮挡问题,在复杂场景下实现了优越性能。
English: This paper introduces an unsupervised Occlusion-Aware Registration (OAR) method that employs adaptive correntropy and implicit neural representations to effectively handle occlusions in non-rigid point cloud alignment, achieving superior performance in challenging scenarios.
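A worked one-liner for the correntropy-style objective (illustrative, with an assumed kernel bandwidth): a Gaussian kernel saturates the penalty for badly matched points, so occluded regions cannot dominate the loss the way they would in a plain Chamfer sum.

import torch

def correntropy_loss(deformed_src, tgt, sigma=0.05):
    d2 = torch.cdist(deformed_src, tgt).pow(2).min(dim=1).values   # squared distance to nearest target point
    return (1.0 - torch.exp(-d2 / (2.0 * sigma ** 2))).mean()      # maximizing correntropy = minimizing this

src = torch.rand(1024, 3)
tgt = src[:800] + 0.01 * torch.randn(800, 3)   # only 800 points remain visible; the rest are "occluded"
print(correntropy_loss(src, tgt))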

Authors:Khanh Nguyen, Ghulam Mubashar Hassan, Ajmal Mian
Title: Occlusion-aware Text-Image-Point Cloud Pretraining for Open-World 3D Object Recognition
Abstract:
Recent open-world representation learning approaches have leveraged CLIP to enable zero-shot 3D object recognition. However, performance on real point clouds with occlusions still falls short due to unrealistic pretraining settings. Additionally, these methods incur high inference costs because they rely on Transformer's attention modules. In this paper, we make two contributions to address these limitations. First, we propose occlusion-aware text-image-point cloud pretraining to reduce the training-testing domain gap. From 52K synthetic 3D objects, our framework generates nearly 630K partial point clouds for pretraining, consistently improving real-world recognition performances of existing popular 3D networks. Second, to reduce computational requirements, we introduce DuoMamba, a two-stream linear state space model tailored for point clouds. By integrating two space-filling curves with 1D convolutions, DuoMamba effectively models spatial dependencies between point tokens, offering a powerful alternative to Transformer. When pretrained with our framework, DuoMamba surpasses current state-of-the-art methods while reducing latency and FLOPs, highlighting the potential of our approach for real-world applications. Our code and data are available at https://ndkhanh360.github.io/project-occtip.

Authors:Muhammad Ashad Kabir, Nidita Roy, Md. Ekramul Hossain, Jill Featherston, Sayed Ahmed
Title: Deep Learning for Wound Tissue Segmentation: A Comprehensive Evaluation using A Novel Dataset
Abstract:
Deep learning (DL) techniques have emerged as promising solutions for medical wound tissue segmentation. However, a notable limitation in this field is the lack of publicly available labelled datasets and a standardised performance evaluation of state-of-the-art DL models on such datasets. This study addresses this gap by comprehensively evaluating various DL models for wound tissue segmentation using a novel dataset. We have curated a dataset comprising 147 wound images exhibiting six tissue types: slough, granulation, maceration, necrosis, bone, and tendon. The dataset was meticulously labelled for semantic segmentation employing supervised machine learning techniques. Three distinct labelling formats were developed -- full image, patch, and superpixel. Our investigation encompassed a wide array of DL segmentation and classification methodologies, ranging from conventional approaches like UNet, to generative adversarial networks such as cGAN, and modified techniques like FPN+VGG16. Also, we explored DL-based classification methods (e.g., ResNet50) and machine learning-based classification leveraging DL features (e.g., AlexNet+RF). In total, 82 wound tissue segmentation models were derived across the three labelling formats. Our analysis yielded several notable findings, including identifying optimal DL models for each labelling format based on weighted average Dice or F1 scores. Notably, FPN+VGG16 emerged as the top-performing DL model for wound tissue segmentation, achieving a dice score of 82.25%. This study provides a valuable benchmark for evaluating wound image segmentation and classification models, offering insights to inform future research and clinical practice in wound care. The labelled dataset created in this study is available at https://github.com/akabircs/WoundTissue.
中文摘要:本研究利用包含六种组织类型的147张伤口图像新数据集,评估了多种深度学习模型在伤口组织分割中的表现,确定FPN+VGG16为最佳模型(Dice得分82.25%),为未来伤口护理研究提供了重要基准。
English Summary: This study evaluates various deep learning models for wound tissue segmentation using a novel dataset of 147 images with six tissue types, identifying FPN+VGG16 as the top-performing model with an 82.25% dice score and providing a benchmark for future wound care research.

Authors:Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, Kobe Guo
Title: Universal Lesion Segmentation Challenge 2023: A Comparative Research of Different Algorithms
Abstract:
In recent years, machine learning algorithms have achieved much success in segmenting lesions across various tissues. There is, however, no single satisfying model that works well on all tissue types universally. In response to this need, we attempt to train a model that 1) works well on all tissue types, and 2) is still capable of performing fast inference. To this end, we design our architectures, test multiple existing architectures, compare their results, and settle upon SwinUnet. We document our rationales, successes, and failures. Finally, we propose some further directions that we think are worth exploring. Code: https://github.com/KWFredShi/ULS2023NGKD.git
中文: 研究人员利用SwinUnet架构开发了一种通用病灶分割模型,该模型能在所有组织类型上有效工作并保持快速推理能力,同时公布了研究结果和代码以供进一步探索。
English: Researchers have developed a universal lesion segmentation model using SwinUnet architecture that performs effectively across all tissue types while maintaining fast inference speeds, with findings and code shared for further exploration.

Authors:Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, Xiang Bai
Title: The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey
Abstract:
Driving World Model (DWM), which focuses on predicting scene evolution during the driving process, has emerged as a promising paradigm in pursuing autonomous driving. These methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments. In this survey, we provide a comprehensive overview of the latest progress in DWM. We categorize existing approaches based on the modalities of the predicted scenes and summarize their specific contributions to autonomous driving. In addition, high-impact datasets and various metrics tailored to different tasks within the scope of DWM research are reviewed. Finally, we discuss the potential limitations of current research and propose future directions. This survey provides valuable insights into the development and application of DWM, fostering its broader adoption in autonomous driving. The relevant papers are collected at https://github.com/LMD0311/Awesome-World-Model.
中文: 驾驶世界模型(DWM)作为自动驾驶领域的新范式,通过预测场景演变来提升系统对动态环境的感知与交互能力,本综述系统梳理了其研究进展、数据集、评估指标及未来发展方向。
English: The Driving World Model (DWM) is a promising autonomous driving paradigm that predicts scene evolution, enabling systems to better perceive and interact with dynamic environments, with this survey comprehensively reviewing its progress, datasets, metrics, and future directions.

Authors:Wenxuan Guo, Xiuwei Xu, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Title: TSP3D: Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
Abstract:
In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet real-time inference requirements due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, since the 3D visual grounding task requires the 3D scene representation to interact deeply with text features, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation and text features in an efficient way through gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features via cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively fixes over-pruned regions by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves the top inference speed, surpassing the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in Acc@0.5 on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D, respectively. The code is available at https://github.com/GWxuan/TSP3D.
中文: 本文提出了一种高效的多级卷积架构用于三维视觉定位,通过文本引导剪枝和基于补全的添加方法,实现了最优的推理速度和精度。
English: This paper introduces an efficient multi-level convolution architecture for 3D visual grounding, utilizing text-guided pruning and completion-based addition to achieve state-of-the-art speed and accuracy.
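A rough sketch of the text-guided pruning step (not the TSP3D code; the scoring rule and keep ratio are assumptions): voxel features are scored by their similarity to text tokens, and only the best-scoring fraction is kept before the next interaction stage.

import torch

def text_guided_prune(voxel_feats, text_feats, keep_ratio=0.25):
    # voxel_feats: (N, D) sparse voxel features, text_feats: (T, D) text token features.
    scores = (voxel_feats @ text_feats.T / voxel_feats.shape[-1] ** 0.5).max(dim=-1).values
    keep = max(1, int(keep_ratio * voxel_feats.shape[0]))
    kept = scores.topk(keep).indices
    return voxel_feats[kept], kept

feats, idx = text_guided_prune(torch.randn(20000, 128), torch.randn(12, 128))
print(feats.shape)   # (5000, 128): three quarters of the voxels pruned in one pass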

Authors:Thien B. Nguyen-Tat, Hoang-An Vo, Phuoc-Sang Dang
Title: QMaxViT-Unet+: A Query-Based MaxViT-Unet with Edge Enhancement for Scribble-Supervised Segmentation of Medical Images
Abstract:
The deployment of advanced deep learning models for medical image segmentation is often constrained by the requirement for extensively annotated datasets. Weakly-supervised learning, which allows less precise labels, has become a promising solution to this challenge. Building on this approach, we propose QMaxViT-Unet+, a novel framework for scribble-supervised medical image segmentation. This framework is built on the U-Net architecture, with the encoder and decoder replaced by Multi-Axis Vision Transformer (MaxViT) blocks. These blocks enhance the model's ability to learn local and global features efficiently. Additionally, our approach integrates a query-based Transformer decoder to refine features and an edge enhancement module to compensate for the limited boundary information in the scribble label. We evaluate the proposed QMaxViT-Unet+ on four public datasets focused on cardiac structures, colorectal polyps, and breast cancer: ACDC, MS-CMRSeg, SUN-SEG, and BUSI. Evaluation metrics include the Dice similarity coefficient (DSC) and the 95th percentile of Hausdorff distance (HD95). Experimental results show that QMaxViT-Unet+ achieves 89.1% DSC and 1.316mm HD95 on ACDC, 88.4% DSC and 2.226mm HD95 on MS-CMRSeg, 71.4% DSC and 4.996mm HD95 on SUN-SEG, and 69.4% DSC and 50.122mm HD95 on BUSI. These results demonstrate that our method outperforms existing approaches in terms of accuracy, robustness, and efficiency while remaining competitive with fully-supervised learning approaches. This makes it ideal for medical image analysis, where high-quality annotations are often scarce and require significant effort and expense. The code is available at: https://github.com/anpc849/QMaxViT-Unet
中文: QMaxViT-Unet+框架提出了一种基于涂鸦标注的弱监督医学图像分割新方法,通过多轴视觉Transformer模块与查询式解码器的结合,在四个公共数据集上实现了优于现有方法的精确度和鲁棒性。
English: The QMaxViT-Unet+ framework introduces a novel weakly-supervised approach for medical image segmentation, combining Multi-Axis Vision Transformer blocks with query-based decoding and edge enhancement to achieve superior accuracy and efficiency across multiple clinical datasets.
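For reference, the Dice similarity coefficient reported above can be computed as below on binary masks (the masks here are synthetic; HD95 additionally requires surface-distance computations, e.g. via scipy, and is omitted).

import numpy as np

def dice(pred, gt, eps=1e-7):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

gt = np.zeros((64, 64)); gt[16:48, 16:48] = 1
pred = np.zeros((64, 64)); pred[20:48, 16:48] = 1
print(round(float(dice(pred, gt)), 3))   # 0.933 for this toy pair of masks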

Authors:Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
Title: Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Abstract:
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
中文: Step-Video-T2V是一个拥有300亿参数的最先进文本到视频模型,通过深度压缩变分自编码器和三维全注意力扩散变换器,能生成长达204帧的双语高质量视频,在性能评估中展现出业界领先水平。
English: Step-Video-T2V is a 30B-parameter text-to-video model that generates high-quality 204-frame videos using advanced compression and denoising techniques, achieving state-of-the-art performance in bilingual video synthesis.

Authors:Ruslan Agishev, Karel Zimmermann
Title: FusionForce: End-to-end Differentiable Neural-Symbolic Layer for Trajectory Prediction
Abstract:
We propose an end-to-end differentiable model that predicts robot trajectories on rough offroad terrain from camera images and/or lidar point clouds. The model integrates a learnable component that predicts robot-terrain interaction forces with a neural-symbolic layer that enforces the laws of classical mechanics and consequently improves generalization on out-of-distribution data. The neural-symbolic layer includes a differentiable physics engine that computes the robot's trajectory by querying these forces at the points of contact with the terrain. As the proposed architecture comprises substantial geometrical and physics priors, the resulting model can also be seen as a learnable physics engine conditioned on real sensor data that delivers $10^4$ trajectories per second. We argue and empirically demonstrate that this architecture reduces the sim-to-real gap and mitigates out-of-distribution sensitivity. The differentiability, in conjunction with the rapid simulation speed, makes the model well-suited for various applications including model predictive control, trajectory shooting, supervised and reinforcement learning, or SLAM.
中文: 本文提出了一种端到端的可微分模型,通过融合可学习的力预测与神经符号层来执行经典力学定律,利用传感器数据预测机器人在崎岖地形上的轨迹,提升泛化能力并实现高速仿真,适用于多种应用场景。
English: This paper introduces an end-to-end differentiable model that predicts robot trajectories on rough terrain using sensor data, integrating learnable force predictions with a neural-symbolic layer to enforce physics laws and enhance generalization, while enabling rapid simulation for diverse applications.
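The point-mass sketch below conveys the neural-symbolic split under heavy simplification (no rigid-body dynamics; the force-network inputs are hypothetical): a small network predicts an interaction force, and a plain Euler integration step plays the role of the differentiable physics engine.

import torch

def rollout(force_net, terrain_points, state, mass=50.0, dt=0.05, steps=40):
    pos, vel = state
    traj = [pos]
    for _ in range(steps):
        f = force_net(torch.cat([pos, vel, terrain_points.mean(dim=0)]))   # queried interaction force
        acc = f / mass + torch.tensor([0.0, 0.0, -9.81])                   # classical mechanics step
        vel = vel + acc * dt
        pos = pos + vel * dt
        traj.append(pos)
    return torch.stack(traj)

net = torch.nn.Sequential(torch.nn.Linear(9, 64), torch.nn.Tanh(), torch.nn.Linear(64, 3))
traj = rollout(net, torch.rand(100, 3), (torch.zeros(3), torch.tensor([1.0, 0.0, 0.0])))
print(traj.shape)   # (41, 3); the trajectory is differentiable w.r.t. the force network's parameters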

Authors:Teng Li, Guangcong Zheng, Rui Jiang, Shuigen Zhan, Tao Wu, Yehao Lu, Yining Lin, Chuanyun Deng, Yepan Xiong, Min Chen, Lin Cheng, Xi Li
Title: RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
Abstract:
Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control compared to text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters when working with arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish a 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scales, ensuring compatibility and scale consistency across diverse real-world images. In inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance precise camera control and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise and also allows the framework to maintain dynamic and coherent video generation in lower noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and out-of-domain images. We further enable applications like camera-controlled looping video generation and generative frame interpolation. Project page: https://zgctroy.github.io/RealCam-I2V.

Authors:Luca Parolari, Andrea Cherubini, Lamberto Ballan, Carlo Biffi
Title: Towards Polyp Counting In Full-Procedure Colonoscopy Videos
Abstract:
Automated colonoscopy reporting holds great potential for enhancing quality control and improving the cost-effectiveness of colonoscopy procedures. A major challenge lies in the automated identification, tracking, and re-association (ReID) of polyp tracklets across full-procedure colonoscopy videos. This is essential for precise polyp counting and enables automated computation of key quality metrics, such as Adenoma Detection Rate (ADR) and Polyps Per Colonoscopy (PPC). However, polyp ReID is challenging due to variations in polyp appearance, frequent disappearance from the field of view, and occlusions. In this work, we leverage the REAL-Colon dataset, the first open-access dataset providing full-procedure videos, to define tasks, data splits and metrics for the problem of automatically counting polyps in full-procedure videos, establishing an open-access framework. We re-implement previously proposed SimCLR-based methods for learning representations of polyp tracklets, both single-frame and multi-view, and adapt them to the polyp counting task. We then propose an Affinity Propagation-based clustering method to further improve ReID based on these learned representations, ultimately enhancing polyp counting. Our approach achieves state-of-the-art performance, with a polyp fragmentation rate of 6.30 and a false positive rate (FPR) below 5% on the REAL-Colon dataset. We release code at https://github.com/lparolari/towards-polyp-counting.
中文摘要:自动化结肠镜报告通过先进的息肉追踪与重识别方法,在REAL-Colon数据集上实现了低碎片率和误报率的最优性能,从而提升质量控制与成本效益。
English Summary: Automated colonoscopy reporting can enhance quality control and cost-effectiveness by accurately counting polyps through advanced tracking and re-identification methods, achieving state-of-the-art performance with low fragmentation and false positive rates on the REAL-Colon dataset.
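The clustering step can be pictured as below: tracklet embeddings are grouped with Affinity Propagation and the number of clusters is read off as the polyp count. The embeddings are synthetic, and the representation learning that precedes this step is omitted.

import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
tracklet_embeddings = np.vstack([
    rng.normal(loc=c, scale=0.05, size=(5, 16)) for c in (0.0, 1.0, 2.0)   # 3 polyps, 5 tracklets each
])
labels = AffinityPropagation(random_state=0).fit_predict(tracklet_embeddings)
print("estimated polyp count:", len(set(labels)))   # should recover 3 for this well-separated toy data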

Authors:Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao
Title: X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability
Abstract:
Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.
中文摘要:本研究提出X-Boundary方法,通过精确区分安全与有害特征表示并仅消除后者,在保持大语言模型通用能力的同时,将过度拒绝率降低约20%,实现了对多轮越狱攻击的最优防御效果。
English Summary: The study introduces X-Boundary, a novel defense method that enhances LLM robustness against multi-turn jailbreaks by precisely distinguishing and erasing harmful representations while preserving usability and reducing over-refusal by approximately 20%.
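Purely as a schematic of the separation idea (the paper's actual objective differs in detail), the loss below keeps boundary-safe hidden states anchored while pushing harmful hidden states beyond a margin; all tensors and the margin value are toy placeholders.

import torch
import torch.nn.functional as F

def separation_loss(harmful_h, boundary_safe_h, safe_anchor_h, margin=1.0):
    pull = F.mse_loss(boundary_safe_h, safe_anchor_h)            # keep safe representations where they were
    dist = torch.cdist(harmful_h, boundary_safe_h).mean()
    push = torch.relu(margin - dist)                             # enforce a gap around the safety boundary
    return pull + push

h = torch.randn(8, 256); b = torch.randn(8, 256); a = b + 0.01 * torch.randn(8, 256)
print(separation_loss(h, b, a))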

Authors:Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F. Smith, Yu-Chiang Frank Wang, Min-Hung Chen
Title: V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models
Abstract:
Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on perception tasks like detection or tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates a Multi-Modal LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Multi-Modal Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer various types of driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. The code and data will be released to the public to facilitate open-source research in this field. Our project website: https://eddyhkchiu.github.io/v2vllm.github.io/ .

Authors:Siqi Wu, Yinda Chen, Dong Liu, Zhihai He
Title: Conditional Latent Coding with Learnable Synthesized Reference for Deep Image Compression
Abstract:
In this paper, we study how to synthesize a dynamic reference from an external dictionary to perform conditional coding of the input image in the latent domain and how to learn the conditional latent synthesis and coding modules in an end-to-end manner. Our approach begins by constructing a universal image feature dictionary using a multi-stage approach involving modified spatial pyramid pooling, dimension reduction, and multi-scale feature clustering. For each input image, we learn to synthesize a conditioning latent by selecting and synthesizing relevant features from the dictionary, which significantly enhances the model's capability in capturing and exploring image source correlation. This conditional latent synthesis involves a correlation-based feature matching and alignment strategy, comprising a Conditional Latent Matching (CLM) module and a Conditional Latent Synthesis (CLS) module. The synthesized latent is then used to guide the encoding process, allowing for more efficient compression by exploiting the correlation between the input image and the reference dictionary. According to our theoretical analysis, the proposed conditional latent coding (CLC) method is robust to perturbations in the external dictionary samples and the selected conditioning latent, with an error bound that scales logarithmically with the dictionary size, ensuring stability even with large and diverse dictionaries. Experimental results on benchmark datasets show that our new method improves the coding performance by a large margin (up to 1.2 dB) with a very small overhead of approximately 0.5% bits per pixel. Our code is publicly available at https://github.com/ydchen0806/CLC.
中文: 本文提出了一种条件潜在编码方法,通过从外部字典合成动态参考来利用图像源相关性增强压缩性能,在极低比特开销下实现了显著的编码质量提升。
English: This paper introduces a conditional latent coding method that synthesizes dynamic references from an external dictionary to enhance image compression by exploiting source correlations, achieving significant performance gains with minimal overhead.
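As a rough illustration of the matching-and-synthesis idea, the sketch below retrieves the dictionary entries most similar to an image latent and blends them into a conditioning latent with softmax weights; the top-k retrieval, temperature, and plain weighted averaging are illustrative stand-ins for the CLM/CLS modules rather than the paper's actual design.

```python
import torch
import torch.nn.functional as F

def synthesize_conditioning_latent(z, dictionary, k: int = 8, temperature: float = 0.1):
    """z: (D,) latent of the input image; dictionary: (N, D) feature bank.
    Returns a (D,) conditioning latent built from the k most similar entries."""
    sims = F.cosine_similarity(z.unsqueeze(0), dictionary, dim=-1)   # (N,)
    topk = torch.topk(sims, k)
    weights = torch.softmax(topk.values / temperature, dim=0)        # (k,)
    return (weights.unsqueeze(-1) * dictionary[topk.indices]).sum(dim=0)

z = torch.randn(256)
bank = torch.randn(10_000, 256)
print(synthesize_conditioning_latent(z, bank).shape)  # torch.Size([256])
```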

Authors:Kun Guo, Gang Cao, Zijie Lou, Xianglin Huang, Jiaoyun Liu
Title: A Lightweight and Effective Image Tampering Localization Network with Vision Mamba
Abstract:
Current image tampering localization methods primarily rely on Convolutional Neural Networks (CNNs) and Transformers. While CNNs suffer from limited local receptive fields, Transformers offer global context modeling at the expense of quadratic computational complexity. Recently, the state space model Mamba has emerged as a competitive alternative, enabling linear-complexity global dependency modeling. Inspired by this, we propose a lightweight and effective FORensic network based on vision MAmba (ForMa) for blind image tampering localization. First, ForMa captures multi-scale global features, achieving efficient global dependency modeling with linear complexity. Then the pixel-wise localization map is generated by a lightweight decoder, which employs a parameter-free pixel shuffle layer for upsampling. Additionally, a noise-assisted decoding strategy is proposed to integrate complementary manipulation traces from tampered images, boosting decoder sensitivity to forgery cues. Experimental results on 10 standard datasets demonstrate that ForMa achieves state-of-the-art generalization ability and robustness, while maintaining the lowest computational complexity. Code is available at https://github.com/multimediaFor/ForMa.
中文:提出的ForMa网络利用视觉Mamba实现线性复杂度的全局特征建模,并结合噪声辅助解码策略,以最低计算成本实现了最先进的图像篡改定位性能。
English: The proposed ForMa network leverages vision Mamba for linear-complexity global feature modeling and integrates noise-assisted decoding to achieve state-of-the-art image tampering localization with minimal computational cost.
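The parameter-free upsampling mentioned in the abstract corresponds to PyTorch's built-in pixel shuffle. A minimal decoder head in that spirit is sketched below; the channel counts and the single 1x1 projection are placeholders, not ForMa's actual decoder.

```python
import torch
import torch.nn as nn

class PixelShuffleHead(nn.Module):
    """Project features to (r^2 * n_classes) channels, then rearrange them into
    a full-resolution localization map without learnable upsampling weights."""
    def __init__(self, in_channels: int = 256, n_classes: int = 1, upscale: int = 8):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, n_classes * upscale ** 2, kernel_size=1)
        self.shuffle = nn.PixelShuffle(upscale)  # parameter-free upsampling

    def forward(self, feats):
        return self.shuffle(self.proj(feats))

head = PixelShuffleHead()
feats = torch.randn(1, 256, 32, 32)   # e.g. 1/8-resolution backbone features
print(head(feats).shape)              # torch.Size([1, 1, 256, 256])
```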

Authors:Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic
Title: Precise Parameter Localization for Textual Generation in Diffusion Models
Abstract:
Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that only less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that a LoRA-based fine-tuning solely of the localized layers enhances, even more, the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models' generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to the large language models like T5). Project page available at https://t2i-text-loc.github.io/.
中文摘要:新型扩散模型能生成融合高质量文本的逼真图像,研究发现仅需调整注意力层中不足1%的参数即可控制文本生成,从而实现了针对性优化及文本编辑、防毒性内容等应用,适用于多种模型架构。
English Summary: Novel diffusion models can generate realistic images with integrated text, and researchers have discovered that less than 1% of parameters in attention layers control text generation, enabling targeted improvements and applications like text editing and toxicity prevention across various model architectures.
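Restricting fine-tuning to the localized attention layers can be sketched with a plain LoRA wrapper applied only to matching projection modules; the "attn2" name filter (the cross-attention blocks in diffusers-style U-Nets), the rank, and the scaling are assumptions for illustration, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def add_lora_to_localized_layers(model: nn.Module, name_filter: str = "attn2"):
    """Wrap every nn.Linear inside modules whose name matches the filter
    (assumed to be the text-responsible attention blocks) with LoRA."""
    for name, module in list(model.named_modules()):
        if name_filter in name:
            for child_name, child in list(module.named_children()):
                if isinstance(child, nn.Linear):
                    setattr(module, child_name, LoRALinear(child))
    return model
```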

Authors:Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang
Title: TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
Abstract:
Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality, reducing manual intervention. Incorporating TaskGalaxy into LLaVA-v1.5 and InternVL-Chat-v1.0 models shows substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.
Chinese: TaskGalaxy是一个大规模多模态指令微调数据集,通过自动化生成显著提升了任务多样性和数据质量,使LLaVA-v1.5和InternVL-Chat-v1.0等模型在多个基准测试中取得了显著性能提升。
English: TaskGalaxy is a large-scale multimodal instruction fine-tuning dataset that significantly enhances task diversity and data quality through automated generation, leading to substantial performance improvements in models like LLaVA-v1.5 and InternVL-Chat-v1.0 across multiple benchmarks.

Authors:Peng Ling, Wenxiao Xiong
Title: FrGNet: A fourier-guided weakly-supervised framework for nuclear instance segmentation
Abstract:
Nuclear instance segmentation has played a critical role in pathology image analysis. The main challenges arise from the difficulty of accurately segmenting instances and the high cost of precise mask-level annotations for fully-supervised training. In this work, we propose a Fourier guidance framework for solving the weakly-supervised nuclear instance segmentation problem. In this framework, we construct a Fourier guidance module to fuse prior information into the training process of the model, which helps the model capture relevant nuclear features. Meanwhile, to further improve the model's ability to represent nuclear features, we propose the guide-based instance-level contrastive module. This module makes full use of the framework's own properties and guidance information to effectively enhance the representation of nuclear features. We show on two public datasets that our model outperforms current SOTA methods under a fully-supervised design, and in weakly-supervised experiments, with only a small amount of labeling, our model still maintains performance close to that under full supervision. In addition, we also perform generalization experiments on a private dataset, and without any labeling, our model effectively segments nuclear images not seen during training. As open science, all codes and pre-trained models are available at https://github.com/LQY404/FrGNet.
中文摘要:本研究提出了一种傅里叶引导框架用于弱监督核实例分割,通过融合先验信息和对比学习增强特征表示能力,在少量标注下即可达到接近全监督的性能水平。
English Summary: This study introduces a Fourier guidance framework for weakly-supervised nuclear instance segmentation, which integrates prior information and contrastive learning to enhance feature representation, achieving state-of-the-art performance with minimal labeling requirements.
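One simple way to realize a Fourier-derived guide is to keep only the low-frequency band of the image spectrum and use the reconstruction as a coarse prior for nuclei; the circular cutoff below is an illustrative assumption and not FrGNet's actual guidance module.

```python
import torch

def low_frequency_guide(image: torch.Tensor, cutoff: float = 0.1) -> torch.Tensor:
    """image: (B, C, H, W). Returns a coarse guide map rebuilt from the
    low-frequency components of the 2D Fourier spectrum."""
    B, C, H, W = image.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H, device=image.device),
                            torch.arange(W, device=image.device), indexing="ij")
    dist = torch.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    mask = (dist <= cutoff * min(H, W)).to(image.dtype)   # circular low-pass mask
    guide = torch.fft.ifft2(torch.fft.ifftshift(spectrum * mask, dim=(-2, -1))).real
    return guide

img = torch.rand(2, 3, 256, 256)
print(low_frequency_guide(img).shape)  # torch.Size([2, 3, 256, 256])
```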

Authors:Jinpei Guo, Zheng Chen, Wenbo Li, Yong Guo, Yulun Zhang
Title: Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
Abstract:
Diffusion models have demonstrated remarkable success in image restoration tasks. However, their multi-step denoising process introduces significant computational overhead, limiting their practical deployment. Furthermore, existing methods struggle to effectively remove severe JPEG artifacts, especially in highly compressed images. To address these challenges, we propose CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. The core of CODiff is the compression-aware visual embedder (CaVE), which extracts and leverages JPEG compression priors to guide the diffusion model. We propose a dual learning strategy that combines explicit and implicit learning. Specifically, explicit learning enforces a quality prediction objective to differentiate low-quality images with different compression levels. Implicit learning employs a reconstruction objective that enhances the model's generalization. This dual learning allows for a deeper and more comprehensive understanding of JPEG compression. Experimental results demonstrate that CODiff surpasses recent leading methods in both quantitative and visual quality metrics. The code is released at https://github.com/jp-guo/CODiff.
中文:CODiff是一种创新的单步扩散模型,通过压缩感知视觉嵌入器和双重学习策略有效去除JPEG伪影,在定量和视觉质量指标上均超越了现有领先方法。
English: CODiff is a novel one-step diffusion model that utilizes a compression-aware visual embedder and dual learning strategy to effectively remove JPEG artifacts, achieving superior performance in both quantitative and visual quality metrics.

Authors:Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi
Title: HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
Abstract:
We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
中文: HealthGPT是一种统一的自回归医学大视觉语言模型,通过创新的异构低秩适应技术和分层视觉感知,在医学视觉任务中展现出卓越的性能与扩展性。
English: HealthGPT is a unified autoregressive Med-LVLM that integrates medical visual comprehension and generation through a novel H-LoRA technique and hierarchical perception, achieving exceptional performance in medical tasks.

Authors:Onur Bagoren, Marc Micatka, Katherine A. Skinner, Aaron Marburg
Title: PUGS: Perceptual Uncertainty for Grasp Selection in Underwater Environments
Abstract:
When navigating and interacting in challenging environments where sensory information is imperfect and incomplete, robots must make decisions that account for these shortcomings. We propose a novel method for quantifying and representing such perceptual uncertainty in 3D reconstruction through occupancy uncertainty estimation. We develop a framework to incorporate it into grasp selection for autonomous manipulation in underwater environments. Instead of treating each measurement equally when deciding which location to grasp from, we present a framework that propagates uncertainty inherent in the multi-view reconstruction process into the grasp selection. We evaluate our method with both simulated and real-world data, showing that by accounting for uncertainty, the grasp selection becomes robust against partial and noisy measurements. Code will be made available at https://onurbagoren.github.io/PUGS/

Authors:Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, Serena Yeung-Levy
Title: CellFlux: Simulating Cellular Morphology Changes via Flow Matching
Abstract:
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlux, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlux models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects -- a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlux generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlux enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research. Project page: https://yuhui-zh15.github.io/CellFlux/.
中文: CellFlux是一种创新的图像生成模型,通过流匹配技术模拟化学和遗传扰动引起的细胞形态变化,在图像质量和生物学预测准确性上相比现有方法实现了显著提升。
English: CellFlux is an innovative image-generative model that simulates cellular morphology changes from chemical and genetic perturbations using flow matching, achieving significant improvements in image quality and biological prediction accuracy over existing methods.
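Flow matching itself reduces to a simple regression objective. The sketch below shows the standard conditional flow-matching loss between unperturbed and perturbed cell images with a toy velocity network; CellFlux's actual architecture and perturbation conditioning are not reproduced here.

```python
import torch
import torch.nn as nn

def flow_matching_loss(v_net: nn.Module, x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with linear paths:
    x_t = (1 - t) x0 + t x1, target velocity = x1 - x0."""
    t = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)
    x_t = (1 - t) * x0 + t * x1
    pred = v_net(x_t, t.flatten())
    return ((pred - (x1 - x0)) ** 2).mean()

class TinyVelocityNet(nn.Module):
    """Placeholder velocity field; a real model would also take the perturbation label."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, x, t):
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, x.shape[2], x.shape[3])
        return self.net(torch.cat([x, t_map], dim=1))

x0, x1 = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)   # control vs. perturbed images
print(float(flow_matching_loss(TinyVelocityNet(), x0, x1)))
```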

Authors:Xiaohong Liu, Xulong Zhao, Gang Liu, Zili Wu, Tao Wang, Lei Meng, Yuhan Wang
Title: IMM-MOT: A Novel 3D Multi-object Tracking Framework with Interacting Multiple Model Filter
Abstract:
3D Multi-Object Tracking (MOT) provides the trajectories of surrounding objects, assisting robots or vehicles in smarter path planning and obstacle avoidance. Existing 3D MOT methods based on the Tracking-by-Detection framework typically use a single motion model to track an object throughout its entire tracking process. However, objects may change their motion patterns due to variations in the surrounding environment. In this paper, we introduce the Interacting Multiple Model filter in IMM-MOT, which accurately fits the complex motion patterns of individual objects, overcoming the limitation of single-model tracking in existing approaches. In addition, we incorporate a Damping Window mechanism into the trajectory lifecycle management, leveraging the continuous association status of trajectories to control their creation and termination, reducing the occurrence of overlooked low-confidence true targets. Furthermore, we propose the Distance-Based Score Enhancement module, which enhances the differentiation between false positives and true positives by adjusting detection scores, thereby improving the effectiveness of the Score Filter. On the NuScenes Val dataset, IMM-MOT outperforms most other single-modal models using 3D point clouds, achieving an AMOTA of 73.8%. Our project is available at https://github.com/Ap01lo/IMM-MOT.
中文:IMM-MOT通过交互多模型滤波器精确跟踪物体运动模式,结合阻尼窗口管理轨迹生命周期和基于距离的分数增强模块,在NuScenes数据集上实现了73.8%的AMOTA性能。
English: IMM-MOT introduces an Interacting Multiple Model filter for accurate motion pattern tracking, a Damping Window for trajectory management, and a Distance-Based Score Enhancement module, achieving 73.8% AMOTA on the NuScenes dataset.
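The mixing step at the core of an Interacting Multiple Model filter is standard and compact; the two-model setup and transition matrix below are illustrative, not IMM-MOT's exact parameterization.

```python
import numpy as np

def imm_mixing(mu, P, states, covs):
    """One IMM mixing step.
    mu:     (M,) model probabilities from the previous cycle
    P:      (M, M) transition matrix, P[i, j] = Pr(model j at k | model i at k-1)
    states: list of per-model state means (n,); covs: list of covariances (n, n)
    Returns mixed initial means/covariances per model and predicted model probs."""
    M = len(mu)
    c_bar = P.T @ mu                                  # predicted model probabilities
    mix = (P * mu[:, None]) / c_bar[None, :]          # mix[i, j] = Pr(i at k-1 | j at k)
    mixed_states, mixed_covs = [], []
    for j in range(M):
        x0 = sum(mix[i, j] * states[i] for i in range(M))
        P0 = sum(mix[i, j] * (covs[i] + np.outer(states[i] - x0, states[i] - x0))
                 for i in range(M))
        mixed_states.append(x0)
        mixed_covs.append(P0)
    return mixed_states, mixed_covs, c_bar

# Two motion models over a 4-D state (x, y, vx, vy).
mu = np.array([0.7, 0.3])
P = np.array([[0.95, 0.05], [0.05, 0.95]])
print(imm_mixing(mu, P, [np.zeros(4), np.ones(4)], [np.eye(4), 2 * np.eye(4)])[2])
```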

Authors:Maizhe Yang, Kaiyuan Tang, Chaoli Wang
Title: Meta-INR: Efficient Encoding of Volumetric Data via Meta-Learning Implicit Neural Representation
Abstract:
Implicit neural representation (INR) has emerged as a promising solution for encoding volumetric data, offering continuous representations and seamless compatibility with the volume rendering pipeline. However, optimizing an INR network from randomly initialized parameters for each new volume is computationally inefficient, especially for large-scale time-varying or ensemble volumetric datasets where volumes share similar structural patterns but require independent training. To close this gap, we propose Meta-INR, a pretraining strategy adapted from meta-learning algorithms to learn initial INR parameters from partial observation of a volumetric dataset. Compared to training an INR from scratch, the learned initial parameters provide a strong prior that enhances INR generalizability, allowing significantly faster convergence with just a few gradient updates when adapting to a new volume and better interpretability when analyzing the parameters of the adapted INRs. We demonstrate that Meta-INR can effectively extract high-quality generalizable features that help encode unseen similar volume data across diverse datasets. Furthermore, we highlight its utility in tasks such as simulation parameter analysis and representative timestep selection. The code is available at https://github.com/spacefarers/MetaINR.
中文: Meta-INR采用基于元学习的预训练策略,为隐式神经表示学习初始参数,从而在相似体积数据集上实现更快收敛和更强的泛化能力。
English: Meta-INR introduces a meta-learning-based pretraining approach to learn initial parameters for implicit neural representations, enabling faster convergence and improved generalizability across similar volumetric datasets.
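The meta-learned initialization can be sketched with a Reptile-style outer loop: adapt a copy of the INR to one volume for a few steps, then move the shared initialization toward the adapted weights. The tiny MLP, step counts, and learning rates are placeholders rather than Meta-INR's reported settings.

```python
import copy
import torch
import torch.nn as nn

def reptile_meta_init(sample_batch, meta_steps=100, inner_steps=16, inner_lr=1e-3, meta_lr=0.1):
    """sample_batch() -> (coords, values): a (coordinate, scalar) batch drawn
    from one randomly chosen volume of the dataset."""
    inr = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(),
                        nn.Linear(128, 1))
    for _ in range(meta_steps):
        fast = copy.deepcopy(inr)                         # task-specific copy
        opt = torch.optim.Adam(fast.parameters(), lr=inner_lr)
        coords, values = sample_batch()
        for _ in range(inner_steps):                      # adapt to this volume
            loss = ((fast(coords) - values) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                             # Reptile outer update
            for p, q in zip(inr.parameters(), fast.parameters()):
                p.add_(meta_lr * (q - p))
    return inr  # strong initialization for new, similar volumes

def sample_batch(n=1024):                                 # toy stand-in data source
    c = torch.rand(n, 3)
    return c, torch.sin(3.14 * c.sum(dim=1, keepdim=True))

meta_init = reptile_meta_init(sample_batch, meta_steps=5)
```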

Authors:Duc Kieu, Kien Do, Toan Nguyen, Dang Nguyen, Thin Nguyen
Title: Bidirectional Diffusion Bridge Models
Abstract:
Diffusion bridges have shown potential in paired image-to-image (I2I) translation tasks. However, existing methods are limited by their unidirectional nature, requiring separate models for forward and reverse translations. This not only doubles the computational cost but also restricts their practicality. In this work, we introduce the Bidirectional Diffusion Bridge Model (BDBM), a scalable approach that facilitates bidirectional translation between two coupled distributions using a single network. BDBM leverages the Chapman-Kolmogorov Equation for bridges, enabling it to model data distribution shifts across timesteps in both forward and backward directions by exploiting the interchangeability of the initial and target timesteps within this framework. Notably, when the marginal distribution given endpoints is Gaussian, BDBM's transition kernels in both directions possess analytical forms, allowing for efficient learning with a single network. We demonstrate the connection between BDBM and existing bridge methods, such as Doob's h-transform and variational approaches, and highlight its advantages. Extensive experiments on high-resolution I2I translation tasks demonstrate that BDBM not only enables bidirectional translation with minimal additional cost but also outperforms state-of-the-art bridge models. Our source code is available at https://github.com/kvmduc/BDBM.
Chinese: 双向扩散桥模型(BDBM)提出了一种可扩展的方法,通过单一网络实现双向图像翻译,不仅降低了计算成本,而且性能优于现有技术。
English: The Bidirectional Diffusion Bridge Model (BDBM) introduces a scalable approach for bidirectional image-to-image translation using a single network, reducing computational costs and outperforming existing methods.

Authors:Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi
Title: Heterogeneous Mixture of Experts for Remote Sensing Image Super-Resolution
Abstract:
Remote sensing image super-resolution (SR) aims to reconstruct high-resolution remote sensing images from low-resolution inputs, thereby addressing limitations imposed by sensors and imaging conditions. However, the inherent characteristics of remote sensing images, including diverse ground object types and complex details, pose significant challenges to achieving high-quality reconstruction. Existing methods typically employ a uniform structure to process various types of ground objects without distinction, making it difficult to adapt to the complex characteristics of remote sensing images. To address this issue, we introduce a Mixture of Experts (MoE) model and design a set of heterogeneous experts. These experts are organized into multiple expert groups, where experts within each group are homogeneous while being heterogeneous across groups. This design ensures that specialized activation parameters can be employed to handle the diverse and intricate details of ground objects effectively. To better accommodate the heterogeneous experts, we propose a multi-level feature aggregation strategy to guide the routing process. Additionally, we develop a dual-routing mechanism to adaptively select the optimal expert for each pixel. Experiments conducted on the UCMerced and AID datasets demonstrate that our proposed method achieves superior SR reconstruction accuracy compared to state-of-the-art methods. The code will be available at https://github.com/Mr-Bamboo/MFG-HMoE.
Chinese Summary: 本文提出了一种混合专家模型,通过异构专家组和双路由机制有效处理遥感图像的多样化特征,在基准数据集上实现了更优的超分辨率重建精度。
English Summary: This paper introduces a Mixture of Experts model with heterogeneous expert groups and a dual-routing mechanism to effectively handle the diverse characteristics of remote sensing images for superior super-resolution reconstruction, demonstrating improved accuracy on benchmark datasets.
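A per-pixel mixture-of-experts layer in this spirit routes every spatial location to one expert through a learned gate. The snippet below is a minimal top-1 router with homogeneous toy experts, not the paper's heterogeneous expert groups, multi-level feature aggregation, or dual-routing mechanism.

```python
import torch
import torch.nn as nn

class PixelwiseMoE(nn.Module):
    """Route each pixel's feature vector to its top-1 expert and merge the outputs."""
    def __init__(self, channels: int = 64, num_experts: int = 4):
        super().__init__()
        self.router = nn.Conv2d(channels, num_experts, kernel_size=1)
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_experts))

    def forward(self, x):
        top1 = self.router(x).argmax(dim=1, keepdim=True)   # hard expert index per pixel
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (top1 == e).float()                       # (B, 1, H, W)
            out = out + mask * expert(x)                     # keep only routed pixels
        return out

moe = PixelwiseMoE()
print(moe(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```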

Authors:Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs
Abstract:
Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.10%, 50.98%, and 43.10% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
中文: 本文提出首个无编码器的3D大模型ENEL,通过预训练阶段的语义编码策略和指令调优阶段的层次几何聚合,使大语言模型直接处理3D点云,在分类、描述和视觉问答任务上达到与更大编码器模型相当的性能。
English: This paper introduces ENEL, the first encoder-free 3D Large Multimodal Model that eliminates traditional 3D encoders by embedding semantic encoding during pre-training and hierarchical geometry aggregation during fine-tuning, achieving performance comparable to larger encoder-based models across classification, captioning, and VQA tasks.

Authors:Jing Wen, Alexander G. Schwing, Shenlong Wang
Title: LIFe-GoM: Generalizable Human Rendering with Learned Iterative Feedback Over Multi-Resolution Gaussians-on-Mesh
Abstract:
Generalizable rendering of an animatable human avatar from sparse inputs relies on data priors and inductive biases extracted from training on large data to avoid scene-specific optimization and to enable fast reconstruction. This raises two main challenges: First, unlike iterative gradient-based adjustment in scene-specific optimization, generalizable methods must reconstruct the human shape representation in a single pass at inference time. Second, rendering is preferably computationally efficient yet of high resolution. To address both challenges we augment the recently proposed dual shape representation, which combines the benefits of a mesh and Gaussian points, in two ways. To improve reconstruction, we propose an iterative feedback update framework, which successively improves the canonical human shape representation during reconstruction. To achieve computationally efficient yet high-resolution rendering, we study a coupled-multi-resolution Gaussians-on-Mesh representation. We evaluate the proposed approach on the challenging THuman2.0, XHuman and AIST++ data. Our approach reconstructs an animatable representation from sparse inputs in less than 1s, renders views with 95.1FPS at $1024 \times 1024$, and achieves PSNR/LPIPS*/FID of 24.65/110.82/51.27 on THuman2.0, outperforming the state-of-the-art in rendering quality.

Authors:Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
Title: DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References
Abstract:
We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such a controller is complicated by the intricate contact dynamics of dexterous manipulation and the need for adaptivity, generalizability, and robustness. Current reinforcement learning and trajectory optimization methods often fall short due to their dependence on task-specific rewards or precise system models. We introduce an approach that curates large-scale successful robot tracking demonstrations, comprising pairs of human references and robot actions, to train a neural controller. Utilizing a data flywheel, we iteratively enhance the controller's performance, as well as the number and quality of successful tracking demonstrations. We exploit available tracking demonstrations and carefully integrate reinforcement learning and imitation learning to boost the controller's performance in dynamic environments. At the same time, to obtain high-quality tracking demonstrations, we individually optimize per-trajectory tracking by leveraging the learned tracking controller in a homotopy optimization method. The homotopy optimization, mimicking chain-of-thought, aids in solving challenging trajectory tracking problems to increase demonstration diversity. We showcase our success by training a generalizable neural controller and evaluating it in both simulation and real world. Our method achieves over a 10% improvement in success rates compared to leading baselines. The project website with animated results is available at https://meowuu7.github.io/DexTrack/.
Chinese: 本研究通过整合人类-机器人示范数据并结合模仿学习与同伦优化方法,开发出适用于灵巧操作的通用神经控制器,在真实环境测试中相比主流基线模型成功率提升超过10%。
English: This study develops a generalizable neural controller for dexterous robotic manipulation by curating human-robot demonstration pairs and integrating imitation learning with homotopy optimization, achieving over 10% higher success rates than leading baselines in real-world evaluations.

Authors:Chaoyi Zhou, Xi Liu, Feng Luo, Siyu Huang
Title: Latent Radiance Fields with 3D-aware 2D Representations
Abstract:
Latent 3D reconstruction has shown great promise in empowering 3D semantic understanding and 3D generation by distilling 2D features into the 3D space. However, existing approaches struggle with the domain gap between 2D feature space and 3D representations, resulting in degraded rendering performance. To address this challenge, we propose a novel framework that integrates 3D awareness into the 2D latent space. The framework consists of three stages: (1) a correspondence-aware autoencoding method that enhances the 3D consistency of 2D latent representations, (2) a latent radiance field (LRF) that lifts these 3D-aware 2D representations into 3D space, and (3) a VAE-Radiance Field (VAE-RF) alignment strategy that improves image decoding from the rendered 2D representations. Extensive experiments demonstrate that our method outperforms the state-of-the-art latent 3D reconstruction approaches in terms of synthesis performance and cross-dataset generalizability across diverse indoor and outdoor scenes. To our knowledge, this is the first work showing the radiance field representations constructed from 2D latent representations can yield photorealistic 3D reconstruction performance.

Authors:Youming Deng, Wenqi Xian, Guandao Yang, Leonidas Guibas, Gordon Wetzstein, Steve Marschner, Paul Debevec
Title: Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction
Abstract:
In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. In particular, our technique enables high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. Our approach introduces a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets.
中文: 本文提出了一种自校准框架,通过联合优化相机参数、镜头畸变和3D高斯表示,结合混合网络和立方体贴图重采样策略,实现了从广角图像进行高质量场景重建,并在合成与真实数据集上达到最优性能。
English: This paper introduces a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations for high-quality scene reconstruction from wide-angle images, achieving state-of-the-art accuracy through a hybrid network and cubemap-based resampling strategy.

Authors:Yi Yu, Xue Yang, Yansheng Li, Zhenjun Han, Feipeng Da, Junchi Yan
Title: Wholly-WOOD: Wholly Leveraging Diversified-quality Labels for Weakly-supervised Oriented Object Detection
Abstract:
Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To equip the detectors with orientation awareness, supervised regression/classification modules have been introduced at the high cost of rotation annotation. Meanwhile, some existing datasets with oriented objects are already annotated with horizontal boxes or even single points. It is therefore attractive, yet remains an open problem, to effectively utilize weaker single-point and horizontal annotations to train an oriented object detector (OOD). We develop Wholly-WOOD, a weakly-supervised OOD framework, capable of wholly leveraging various labeling forms (Points, HBoxes, RBoxes, and their combination) in a unified fashion. By only using HBox for training, our Wholly-WOOD achieves performance very close to that of the RBox-trained counterpart in remote sensing and other areas, significantly reducing the tedious efforts of labor-intensive annotation for oriented objects. The source codes are available at https://github.com/VisionXLab/whollywood (PyTorch-based) and https://github.com/VisionXLab/whollywood-jittor (Jittor-based).
中文: 该研究提出了Wholly-WOOD弱监督框架,能统一利用点、水平框和旋转框等多种标注形式训练定向物体检测器,仅用水平框即可达到接近全标注的性能,大幅降低标注成本。
English: The study introduces Wholly-WOOD, a weakly-supervised framework that effectively trains oriented object detectors using various annotation forms like points, horizontal boxes, and rotated boxes, achieving near-optimal performance with minimal annotation effort.

Authors:Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
Title: Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Abstract:
Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework that integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://github.com/ccccai239/PixelRIST.
中文: 本研究提出了基于多轮对话的像素级推理分割新任务,通过构建PRIST数据集和MIRAS框架,实现了对动态用户意图的追踪与精细分割,在分割效果和推理指标上均优于现有基准方法。
English: This work introduces Pixel-level Reasoning Segmentation (Pixel-level RS), a novel task that tracks evolving user intent through multi-turn conversations for fine-grained segmentation, supported by the PRIST dataset and MIRAS framework, which outperform existing methods in segmentation and reasoning metrics.

Authors:Xiaoliu Guan, Yu Wu, Huayang Huang, Xiao Liu, Jiaxu Miao, Yi Yang
Title: Redistribute Ensemble Training for Mitigating Memorization in Diffusion Models
Abstract:
Diffusion models, known for their tremendous ability to generate high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent methods for memory mitigation have primarily addressed the issue within the context of the text modality in cross-modal generation tasks, restricting their applicability to specific conditions. In this paper, we propose a novel method for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. Directly exposing visual data to the model increases memorization risk, so we design a framework where models learn through proxy model parameters instead. Specifically, the training dataset is divided into multiple shards, with each shard training a proxy model, then aggregated to form the final model. Additionally, practical analysis of training losses illustrates that the losses for easily memorable images tend to be obviously lower. Thus, we skip the samples with abnormally low loss values from the current mini-batch to avoid memorizing. However, balancing the need to skip memorization-prone samples while maintaining sufficient training data for high-quality image generation presents a key challenge. Thus, we propose IET-AGC+, which redistributes highly memorizable samples between shards, to keep these samples from being over-skipped. Furthermore, we dynamically augment samples based on their loss values to further reduce memorization. Extensive experiments and analysis on four datasets show that our method successfully reduces memory capacity while maintaining performance. Moreover, we fine-tune the pre-trained diffusion models, e.g., Stable Diffusion, and decrease the memorization score by 46.7%, demonstrating the effectiveness of our method. Code is available at: https://github.com/liuxiao-guan/IET_AGC.
中文摘要:本文提出一种通过代理模型参数和动态样本管理的新方法,有效降低扩散模型的数据记忆风险,在保持图像生成质量的同时显著提升隐私保护能力。
English Summary: This paper introduces a novel method to mitigate data memorization in diffusion models by using proxy model parameters and dynamic sample management, effectively reducing privacy risks while maintaining image generation quality.
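The loss-based skipping rule can be written directly: compute per-sample losses within the mini-batch and drop samples whose loss is abnormally low before the gradient step. The relative threshold and the `denoising_loss` interface are assumptions; the full IET-AGC+ method additionally redistributes such samples across shards and augments them, which is not shown.

```python
import torch

def memorization_aware_step(model, batch, optimizer, skip_ratio: float = 0.5):
    """One training step that ignores samples whose denoising loss is far
    below the batch mean (a proxy for 'already memorized')."""
    per_sample_loss = model.denoising_loss(batch)     # assumed to return a (B,) tensor
    threshold = skip_ratio * per_sample_loss.mean()
    keep = per_sample_loss > threshold                # skip abnormally easy samples
    if keep.any():
        loss = per_sample_loss[keep].mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return keep                                       # mask of samples actually used
```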

Authors:Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried
Title: ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
Abstract:
Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt, and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation, trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models, and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at: https://rotem-shalev.github.io/ImageRAG

Authors:Mojtaba Safari, Shansong Wang, Zach Eidex, Richard Qiu, Chih-Wei Chang, David S. Yu, Xiaofeng Yang
Title: A Physics-Informed Deep Learning Model for MRI Brain Motion Correction
Abstract:
Background: MRI is crucial for brain imaging but is highly susceptible to motion artifacts due to long acquisition times. This study introduces PI-MoCoNet, a physics-informed motion correction network that integrates spatial and k-space information to remove motion artifacts without explicit motion parameter estimation, enhancing image fidelity and diagnostic reliability. Materials and Methods: PI-MoCoNet consists of a motion detection network (U-net with spatial averaging) to identify corrupted k-space lines and a motion correction network (U-net with Swin Transformer blocks) to reconstruct motion-free images. The correction is guided by three loss functions: reconstruction (L1), perceptual (LPIPS), and data consistency (Ldc). Motion artifacts were simulated via rigid phase encoding perturbations and evaluated on IXI and MR-ART datasets against Pix2Pix, CycleGAN, and U-net using PSNR, SSIM, and NMSE. Results: PI-MoCoNet significantly improved image quality. On IXI, for minor artifacts, PSNR increased from 34.15 dB to 45.95 dB, SSIM from 0.87 to 1.00, and NMSE reduced from 0.55% to 0.04%. For moderate artifacts, PSNR improved from 30.23 dB to 42.16 dB, SSIM from 0.80 to 0.99, and NMSE from 1.32% to 0.09%. For heavy artifacts, PSNR rose from 27.99 dB to 36.01 dB, SSIM from 0.75 to 0.97, and NMSE decreased from 2.21% to 0.36%. On MR-ART, PI-MoCoNet achieved PSNR gains of ~10 dB and SSIM improvements of up to 0.20, with NMSE reductions of ~6%. Ablation studies confirmed the importance of data consistency and perceptual losses, yielding a 1 dB PSNR gain and 0.17% NMSE reduction. Conclusions: PI-MoCoNet effectively mitigates motion artifacts in brain MRI, outperforming existing methods. Its ability to integrate spatial and k-space information makes it a promising tool for clinical use in motion-prone settings. Code: https://github.com/mosaf/PI-MoCoNet.git.
Chinese: PI-MoCoNet 是一种物理信息驱动的运动校正网络,通过整合空间和k空间数据,有效消除脑部MRI中的运动伪影,显著提升图像质量,且无需显式运动估计,性能优于现有方法。
English: PI-MoCoNet is a physics-informed motion correction network that integrates spatial and k-space data to effectively remove motion artifacts in brain MRI, significantly enhancing image quality and outperforming existing methods without explicit motion estimation.
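The three training terms can be combined as in the sketch below, with the lpips package providing the perceptual loss; the weighting coefficients and the exact form of the k-space data-consistency term are illustrative assumptions rather than the paper's reported settings.

```python
import torch
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="alex")  # downloads pretrained AlexNet weights on first use

def motion_correction_loss(pred, target, corrupted_kspace, clean_line_mask,
                           w_rec=1.0, w_perc=0.1, w_dc=1.0):
    """pred/target: (B, 1, H, W) images scaled to [-1, 1].
    clean_line_mask: (B, 1, H, W) mask of k-space lines judged uncorrupted."""
    l_rec = torch.abs(pred - target).mean()                        # L1 reconstruction
    l_perc = perceptual(pred.repeat(1, 3, 1, 1),                   # LPIPS expects 3 channels
                        target.repeat(1, 3, 1, 1)).mean()
    pred_k = torch.fft.fft2(pred)                                  # enforce consistency only
    l_dc = (torch.abs(pred_k - corrupted_kspace) * clean_line_mask).mean()  # on clean lines
    return w_rec * l_rec + w_perc * l_perc + w_dc * l_dc
```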

Authors:Onat Şahin, Mohammad Altillawi, George Eskandar, Carlos Carbone, Ziyuan Liu
Title: ConsistentDreamer: View-Consistent Meshes Through Balanced Multi-View Gaussian Optimization
Abstract:
Recent advances in diffusion models have significantly improved 3D generation, enabling the use of assets generated from an image for embodied AI simulations. However, the one-to-many nature of the image-to-3D problem limits their use due to inconsistent content and quality across views. Previous models optimize a 3D model by sampling views from a view-conditioned diffusion prior, but diffusion models cannot guarantee view consistency. Instead, we present ConsistentDreamer, where we first generate a set of fixed multi-view prior images and sample random views between them with another diffusion model through a score distillation sampling (SDS) loss. Thereby, we limit the discrepancies between the views guided by the SDS loss and ensure a consistent rough shape. In each iteration, we also use our generated multi-view prior images for fine-detail reconstruction. To balance between the rough shape and the fine-detail optimizations, we introduce dynamic task-dependent weights based on homoscedastic uncertainty, updated automatically in each iteration. Additionally, we employ opacity, depth distortion, and normal alignment losses to refine the surface for mesh extraction. Our method ensures better view consistency and visual quality compared to the state-of-the-art.
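The dynamic task-dependent weighting mentioned above follows the standard homoscedastic-uncertainty formulation with learnable log-variances per task; the sketch below applies it to a rough-shape loss and a fine-detail loss, both left as placeholder values.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Multi-task weighting: total = sum_i exp(-s_i) * L_i + s_i, with
    s_i = log(sigma_i^2) learned jointly with the 3D representation."""
    def __init__(self, num_tasks: int = 2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

weighting = UncertaintyWeighting(num_tasks=2)
sds_loss, detail_loss = torch.tensor(1.3), torch.tensor(0.4)   # placeholder loss values
print(float(weighting([sds_loss, detail_loss])))
```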

Authors:Shihao Zhang, Yuguang Yan, Angela Yao
Title: Improving Deep Regression with Tightness
Abstract:
For deep regression, preserving the ordinality of the targets with respect to the feature representation improves performance across various tasks. However, a theoretical explanation for the benefits of ordinality is still lacking. This work reveals that preserving ordinality reduces the conditional entropy $H(Z|Y)$ of representation $Z$ conditional on the target $Y$. However, our findings reveal that typical regression losses do little to reduce $H(Z|Y)$, even though it is vital for generalization performance. With this motivation, we introduce an optimal transport-based regularizer to preserve the similarity relationships of targets in the feature space to reduce $H(Z|Y)$. Additionally, we introduce a simple yet efficient strategy of duplicating the regressor targets, also with the aim of reducing $H(Z|Y)$. Experiments on three real-world regression tasks verify the effectiveness of our strategies to improve deep regression. Code: https://github.com/needylove/Regression_tightness.
中文: 在深度回归中保持目标的有序性可降低条件熵H(Z|Y)以提升性能,本研究通过引入最优传输正则化和目标复制策略有效减少了该熵,并在三个实际任务中验证了其有效性。
English: Preserving ordinality in deep regression reduces conditional entropy H(Z|Y) to enhance performance, and this work introduces both an optimal transport regularizer and a target duplication strategy to effectively minimize this entropy, validated across three real-world tasks.
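The abstract does not give the optimal-transport regularizer in closed form, so the sketch below is only a rough stand-in that conveys the intent: it penalizes disagreement between pairwise feature distances and pairwise target distances within a batch, tying the geometry of Z to Y in a similar spirit.

```python
import torch

def similarity_preserving_regularizer(features: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """features: (B, D) representations Z; targets: (B,) regression targets Y.
    Encourage samples with similar targets to stay close in feature space by
    matching the (scale-normalized) pairwise distance structures of Z and Y."""
    fd = torch.cdist(features, features, p=2)
    td = torch.cdist(targets.view(-1, 1), targets.view(-1, 1), p=2)
    fd = fd / (fd.mean() + 1e-8)
    td = td / (td.mean() + 1e-8)
    return ((fd - td) ** 2).mean()

z, y = torch.randn(16, 64), torch.rand(16)
print(float(similarity_preserving_regularizer(z, y)))
```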

Authors:Jinhui Guo, Lubin Fan, Bojian Wu, Jiaqi Gu, Shen Cao, Jieping Ye
Title: PTZ-Calib: Robust Pan-Tilt-Zoom Camera Calibration
Abstract:
In this paper, we present PTZ-Calib, a robust two-stage PTZ camera calibration method, that efficiently and accurately estimates camera parameters for arbitrary viewpoints. Our method includes an offline and an online stage. In the offline stage, we first uniformly select a set of reference images that sufficiently overlap to encompass a complete 360° view. We then utilize the novel PTZ-IBA (PTZ Incremental Bundle Adjustment) algorithm to automatically calibrate the cameras within a local coordinate system. Additionally, for practical application, we can further optimize camera parameters and align them with the geographic coordinate system using extra global reference 3D information. In the online stage, we formulate the calibration of any new viewpoints as a relocalization problem. Our approach balances the accuracy and computational efficiency to meet real-world demands. Extensive evaluations demonstrate our robustness and superior performance over state-of-the-art methods on various real and synthetic datasets. Datasets and source code can be accessed online at https://github.com/gjgjh/PTZ-Calib
中文: PTZ-Calib提出了一种鲁棒的双阶段标定方法,通过离线PTZ-IBA优化和在线重定位,高效估算任意视角的相机参数,实现了卓越的精度与计算效率。
English: PTZ-Calib introduces a robust two-stage calibration method that efficiently estimates camera parameters for arbitrary viewpoints through offline PTZ-IBA optimization and online relocalization, achieving superior accuracy and computational efficiency.

Authors:Zichong Chen, Shijin Wang, Yang Zhou
Title: StyleBlend: Enhancing Style-Specific Content Creation in Text-to-Image Diffusion Models
Abstract:
Synthesizing visually impressive images that seamlessly align both text prompts and specific artistic styles remains a significant challenge in Text-to-Image (T2I) diffusion models. This paper introduces StyleBlend, a method designed to learn and apply style representations from a limited set of reference images, enabling content synthesis that is both text-aligned and stylistically coherent. Our approach uniquely decomposes style into two components, composition and texture, each learned through different strategies. We then leverage two synthesis branches, each focusing on a corresponding style component, to facilitate effective style blending through shared features without affecting content generation. StyleBlend addresses the common issues of text misalignment and weak style representation that previous methods have struggled with. Extensive qualitative and quantitative comparisons demonstrate the superiority of our approach.

Authors:Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa, Kunihito Kato
Title: Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model
Abstract:
We propose a general visual inspection model using a Vision-Language Model (VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. Thus, we construct a dataset consisting of diverse images of non-defective and defective products collected from the web, along with unified formatted output text, and fine-tune the VLM. For new products, our method employs In-Context Learning, which allows the model to perform inspections with an example of a non-defective or defective image and the corresponding explanatory texts with visual prompts. This approach eliminates the need to collect a large number of training samples and re-train the model for each product. The experimental results show that our method achieves high performance, with an MCC of 0.804 and an F1-score of 0.950 on MVTec AD in a one-shot manner. Our code is available at https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.
中文: 本研究提出了一种基于视觉语言模型的视觉检测方法,通过少量样本和解释性文本识别产品缺陷,无需大量重新训练即可实现高性能。
English: This study introduces a vision-language model for visual inspection that uses few-shot examples and explanatory texts to identify defects in products, achieving high performance without extensive retraining.

Authors:Lingting Zhu, Guying Lin, Jinnan Chen, Xinjie Zhang, Zhenchao Jin, Zhao Wang, Lequan Yu
Title: Large Images are Gaussians: High-Quality Large Image Representation with Levels of 2D Gaussian Splatting
Abstract:
While Implicit Neural Representations (INRs) have demonstrated significant success in image representation, they are often hindered by large training memory and slow decoding speed. Recently, Gaussian Splatting (GS) has emerged as a promising solution in 3D reconstruction due to its high-quality novel view synthesis and rapid rendering capabilities, positioning it as a valuable tool for a broad spectrum of applications. In particular, a GS-based representation, 2DGS, has shown potential for image fitting. In our work, we present Large Images are Gaussians (LIG), which delves deeper into the application of 2DGS for image representations, addressing the challenge of fitting large images with 2DGS in the situation of numerous Gaussian points, through two distinct modifications: 1) we adopt a variant of representation and optimization strategy, facilitating the fitting of a large number of Gaussian points; 2) we propose a Level-of-Gaussian approach for reconstructing both coarse low-frequency initialization and fine high-frequency details. Consequently, we successfully represent large images as Gaussian points and achieve high-quality large image representation, demonstrating its efficacy across various types of large images. Code is available at https://github.com/HKU-MedAI/LIG.
中文:LIG方法通过优化高斯点管理和采用多层次重建策略,改进了2D高斯泼溅技术,成功实现了对大尺寸图像的高效高质量表示,并在多种图像类型上验证了其有效性。
English: The LIG method enhances 2D Gaussian Splatting to efficiently represent large images by optimizing Gaussian point management and employing a multi-level reconstruction strategy, achieving high-quality results across diverse image types.

Authors:Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
Title: EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition
Abstract:
Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB cameras which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds. In this paper, we propose to recognize the scene text using bio-inspired event cameras by collecting and annotating a large-scale benchmark dataset, termed EventSTR. It contains 9,928 high-definition (1280 * 720) event samples and involves both Chinese and English characters. We also benchmark multiple STR algorithms as the baselines for future works to compare. In addition, we propose a new event-based scene text recognition framework, termed SimC-ESTR. It first extracts the event features using a visual encoder and projects them into tokens using a Q-former module. More importantly, we propose to augment the vision tokens based on a memory mechanism before feeding into the large language models. A similarity-based error correction mechanism is embedded within the large language model to correct potential minor errors fundamentally based on contextual information. Extensive experiments on the newly proposed EventSTR dataset and two simulation STR datasets fully demonstrate the effectiveness of our proposed model. We believe that the dataset and algorithmic model can innovatively propose an event-based STR task and are expected to accelerate the application of event cameras in various industries. The source code and pre-trained models will be released on https://github.com/Event-AHU/EventSTR
中文: 本文提出了基于事件相机的大规模场景文本识别数据集EventSTR和新型框架SimC-ESTR,该框架通过记忆增强的视觉标记和基于相似度的纠错机制显著提升了识别性能,在多个数据集上验证了其有效性。
English: This paper introduces EventSTR, a large-scale dataset for bio-inspired event camera-based scene text recognition, and proposes SimC-ESTR, a novel framework that enhances text recognition through memory-augmented vision tokens and similarity-based error correction, demonstrating superior performance on multiple datasets.

Authors:Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
Title: Zero-shot Concept Bottleneck Models
Abstract:
Concept bottleneck models (CBMs) are inherently interpretable and intervenable neural network models, which explain their final label prediction by the intermediate prediction of high-level semantic concepts. However, they require target task training to learn input-to-concept and concept-to-label mappings, incurring target dataset collection and training costs. In this paper, we present zero-shot concept bottleneck models (Z-CBMs), which predict concepts and labels in a fully zero-shot manner without training neural networks. Z-CBMs utilize a large-scale concept bank, which is composed of millions of vocabulary terms extracted from the web, to describe arbitrary input in various domains. For the input-to-concept mapping, we introduce concept retrieval, which dynamically finds input-related concepts by cross-modal search on the concept bank. In the concept-to-label inference, we apply concept regression to select essential concepts from the retrieved concepts by sparse linear regression. Through extensive experiments, we confirm that our Z-CBMs provide interpretable and intervenable concepts without any additional training. Code will be available at https://github.com/yshinya6/zcbm.
中文: 零样本概念瓶颈模型(Z-CBMs)通过从大规模网络概念库中动态检索相关概念,并利用稀疏回归筛选关键概念,实现了无需训练即可进行可解释和可干预的预测。
English: Zero-shot concept bottleneck models (Z-CBMs) enable interpretable predictions by dynamically retrieving relevant concepts from a large-scale web-based concept bank and selecting essential ones through sparse regression, eliminating the need for target task training.
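The two stages, cross-modal concept retrieval and sparse concept regression, can be sketched with precomputed embeddings and an off-the-shelf Lasso. The random placeholder embeddings below stand in for CLIP-style image and concept-text features, and reconstructing the image embedding as a sparse non-negative mix of retrieved concepts is one plausible reading of the concept-regression step rather than the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def zero_shot_concepts(image_emb, concept_embs, concept_names, k=50, alpha=0.01):
    """Retrieve the k concepts most similar to the image embedding, then keep
    the essential ones via sparse regression of the image embedding onto them."""
    sims = concept_embs @ image_emb                     # embeddings assumed L2-normalized
    top = np.argsort(-sims)[:k]                         # cross-modal concept retrieval
    reg = Lasso(alpha=alpha, positive=True, max_iter=5000)
    reg.fit(concept_embs[top].T, image_emb)             # image ~ sparse mix of concepts
    selected = [(concept_names[top[i]], w) for i, w in enumerate(reg.coef_) if w > 0]
    return sorted(selected, key=lambda t: -t[1])

rng = np.random.default_rng(0)                          # placeholder concept bank
bank = rng.normal(size=(1000, 512)); bank /= np.linalg.norm(bank, axis=1, keepdims=True)
img = rng.normal(size=512); img /= np.linalg.norm(img)
names = [f"concept_{i}" for i in range(1000)]
print(zero_shot_concepts(img, bank, names)[:5])
```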

Authors:Jocelyn Dzuong
Title: DejAIvu: Identifying and Explaining AI Art on the Web in Real-Time with Saliency Maps
Abstract:
The recent surge in advanced generative models, such as diffusion models and generative adversarial networks (GANs), has led to an alarming rise in AI-generated images across various domains on the web. While such technologies offer benefits such as democratizing artistic creation, they also pose challenges in misinformation, digital forgery, and authenticity verification. Additionally, the uncredited use of AI-generated images in media and marketing has sparked significant backlash from online communities. In response to this, we introduce DejAIvu, a Chrome Web extension that combines real-time AI-generated image detection with saliency-based explainability while users browse the web. Using an ONNX-optimized deep learning model, DejAIvu automatically analyzes images on websites such as Google Images, identifies AI-generated content using model inference, and overlays a saliency heatmap to highlight AI-related artifacts. Our approach integrates efficient in-browser inference, gradient-based saliency analysis, and a seamless user experience, ensuring that AI detection is both transparent and interpretable. We also evaluate DejAIvu across multiple pretrained architectures and benchmark datasets, demonstrating high accuracy and low latency, making it a practical and deployable tool for enhancing AI image accountability. The code for this system can be found at https://github.com/Noodulz/dejAIvu.
Chinese: DejAIvu 是一款 Chrome 浏览器扩展,通过优化的深度学习模型实时检测 AI 生成图像并提供基于显著性的可解释热力图,为增强 AI 图像可信度提供透明高效的解决方案。
English: The DejAIvu Chrome extension uses an optimized deep learning model to detect AI-generated images in real-time and provides explainable saliency heatmaps, offering a transparent and efficient solution for enhancing AI image accountability.
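
For readers who want the gist of the explainability step, here is a minimal gradient-saliency sketch against an arbitrary differentiable classifier. It only illustrates the idea of backpropagating an "AI-generated" score to the pixels; it is not the extension's ONNX/JavaScript pipeline, and the placeholder model is an assumption.

# Gradient-based saliency sketch (illustrative, not DejAIvu's in-browser code).
import torch
import torch.nn as nn

model = nn.Sequential(            # placeholder detector: 1 logit = "AI-generated"
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)
model.eval()

def saliency_heatmap(image: torch.Tensor) -> torch.Tensor:
    """Return a per-pixel saliency map for the 'AI-generated' score."""
    image = image.clone().requires_grad_(True)
    score = model(image.unsqueeze(0)).squeeze()
    score.backward()
    # Max |gradient| over channels gives a single-channel heatmap.
    heat = image.grad.abs().max(dim=0).values
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

img = torch.rand(3, 224, 224)
print(saliency_heatmap(img).shape)   # torch.Size([224, 224])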

Authors:Joshua Omolegan, Pak Hei Yeung, Madeleine K. Wyburd, Linde Hesse, Monique Haak, Intergrowth-21st Consortium, Ana I. L. Namburete, Nicola K. Dinsdale
Title: Exploring Test Time Adaptation for Subcortical Segmentation of the Fetal Brain in 3D Ultrasound
Abstract:
Monitoring the growth of subcortical regions of the fetal brain in ultrasound (US) images can help identify the presence of abnormal development. Manually segmenting these regions is a challenging task, but recent work has shown that it can be automated using deep learning. However, applying pretrained models to unseen freehand US volumes often leads to a degradation of performance due to the vast differences in acquisition and alignment. In this work, we first demonstrate that test time adaptation (TTA) can be used to improve model performance in the presence of both real and simulated domain shifts. We further propose a novel TTA method by incorporating a normative atlas as a prior for anatomy. In the presence of various types of domain shifts, we benchmark the performance of different TTA methods and demonstrate the improvements brought by our proposed approach, which may further facilitate automated monitoring of fetal brain development. Our code is available at https://github.com/joshuaomolegan/TTA-for-3D-Fetal-Subcortical-Segmentation.
中文摘要:通过引入规范图谱作为解剖先验的测试时自适应方法,有效提升了深度学习模型在不同域偏移下对胎儿脑部皮层下区域超声图像分割的性能,有助于实现胎儿大脑发育的自动化监测。
English Summary: Test time adaptation, enhanced with a normative atlas prior, improves deep learning model performance for segmenting fetal brain subcortical regions in ultrasound images under domain shifts, facilitating automated monitoring of brain development.
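
The following is a hedged sketch of test-time adaptation with an anatomical prior in the spirit of this work: normalization parameters of a pretrained segmenter are adapted on an unlabeled test volume by minimizing prediction entropy plus a soft agreement term with a registered normative atlas. The loss form, the Tent-style choice of adapting only normalization layers, and the toy network are assumptions, not the paper's exact method.

# Test-time adaptation with an atlas prior (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def tta_adapt(model: nn.Module, volume: torch.Tensor, atlas_prob: torch.Tensor,
              steps: int = 10, lr: float = 1e-4, lam: float = 0.1) -> torch.Tensor:
    """volume: (1, 1, D, H, W); atlas_prob: (1, C, D, H, W) registered prior."""
    # Adapt only affine parameters of normalization layers (a Tent-style choice).
    params = [p for m in model.modules()
              if isinstance(m, (nn.BatchNorm3d, nn.InstanceNorm3d, nn.GroupNorm))
              for p in m.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        prob = model(volume).softmax(dim=1)
        entropy = -(prob * prob.clamp_min(1e-8).log()).sum(dim=1).mean()
        prior = F.kl_div(prob.clamp_min(1e-8).log(), atlas_prob, reduction="batchmean")
        loss = entropy + lam * prior
        opt.zero_grad(); loss.backward(); opt.step()
    return model(volume).argmax(dim=1)

seg = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.BatchNorm3d(8), nn.ReLU(),
                    nn.Conv3d(8, 4, 1))                   # 4 tissue classes (toy)
vol = torch.randn(1, 1, 16, 32, 32)
atlas = torch.full((1, 4, 16, 32, 32), 0.25)              # uniform placeholder prior
print(tta_adapt(seg, vol, atlas).shape)                   # torch.Size([1, 16, 32, 32])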

Authors:Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai
Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Abstract:
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, ensuring the generation of the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and delivers prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.
中文: CineMaster是一个创新的两阶段框架,通过交互式3D信号构建和引导式扩散模型,实现具有三维感知的文本到视频生成,支持精确的对象布局与摄像机运动控制。
English: CineMaster is a two-stage framework for 3D-aware text-to-video generation that enables precise object placement, camera manipulation, and layout control through interactive 3D signal construction and guided diffusion modeling.

Authors:Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille
Title: Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Abstract:
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single-object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluate various LMMs on Spatial457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code and data are released at https://github.com/XingruiWang/Spatial457.
中文: 该摘要介绍了Spatial457合成数据集,旨在通过多难度评估框架测试大型多模态模型的6D空间推理能力,发现模型在3D任务中表现下降并识别出预测偏差。
English: This abstract introduces Spatial457, a synthetic dataset designed to evaluate large multimodal models' 6D spatial reasoning capabilities across varying complexities, revealing performance declines in 3D tasks and identifying prediction biases through a novel evaluation framework.
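
Since the abstract does not spell out the metric, the snippet below assumes the Relative Performance Dropping Rate is the fractional accuracy drop from an easier question level to a harder one; both the formula and the accuracy values are placeholders for illustration, not results from the paper.

# Illustrative relative performance dropping rate (assumed form, dummy numbers).
def relative_performance_drop(acc_easy: float, acc_hard: float) -> float:
    return (acc_easy - acc_hard) / acc_easy if acc_easy > 0 else 0.0

levels = {  # placeholder accuracies, not the paper's measurements
    "single-object recognition": 0.91,
    "3D location": 0.62,
    "6D spatial reasoning": 0.41,
}
base = levels["single-object recognition"]
for name, acc in levels.items():
    print(f"{name}: RPDR vs. base = {relative_performance_drop(base, acc):.2f}")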

Authors:Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Anyi Rao, Jiaqi Wang, Li Niu
Title: Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Abstract:
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers of the image relighting model to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted videos while maintaining the relighted image quality, ensuring coherent lighting transitions across frames. Project page: https://bujiazi.github.io/light-a-video.github.io/.
中文摘要:Light-A-Video是一种无需训练的解决方案,通过引入一致性光照注意力模块和渐进式光照融合策略,有效提升视频重照明的时序连贯性,在消除闪烁的同时保持画面质量。
English Summary: Light-A-Video is a training-free method that enhances video relighting consistency by introducing a Consistent Light Attention module and Progressive Light Fusion strategy, addressing flickering issues while preserving image quality.
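
A minimal sketch of the progressive linear blending idea follows: the relit appearance is fused into the source appearance with a weight that grows over denoising steps, relying on the linearity of light transport. The schedule, tensor layout, and function name are illustrative assumptions rather than the paper's exact formulation.

# Progressive linear light fusion (illustrative sketch).
import torch

def progressive_light_fusion(source: torch.Tensor, relit: torch.Tensor,
                             step: int, total_steps: int) -> torch.Tensor:
    """Linearly blend source and relit appearance; the relit weight grows with step.

    source, relit: (T, C, H, W) frame stacks with identical content.
    """
    alpha = (step + 1) / total_steps          # simple monotone schedule (assumption)
    return (1.0 - alpha) * source + alpha * relit

frames_src = torch.rand(8, 3, 64, 64)
frames_rel = torch.rand(8, 3, 64, 64)
for s in range(4):
    fused = progressive_light_fusion(frames_src, frames_rel, s, total_steps=4)
print(fused.shape)  # torch.Size([8, 3, 64, 64])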

Authors:Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
Title: Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion
Abstract:
The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction at the global and voxel level. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
中文: 提出的Brain Latent Progression(BrLP)模型通过在紧凑潜在空间中运行、整合个体元数据和疾病动态知识,并利用新型稳定化算法保证时空一致性,有效解决了AI驱动疾病进展预测中的关键难题。
English: The proposed Brain Latent Progression (BrLP) model addresses key challenges in AI-driven disease progression prediction by operating in a compact latent space, integrating subject metadata and disease dynamics, and ensuring spatiotemporal consistency through its novel stabilization algorithm.
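
To make the stabilization step concrete, here is a hedged sketch of the averaging idea behind Latent Average Stabilization: repeat the stochastic latent prediction several times, average for a stable estimate, and read the spread as an uncertainty proxy. The sampler and latent shapes are stand-ins, not BrLP's actual components.

# Latent averaging with a variance-based uncertainty proxy (illustrative).
import torch

@torch.no_grad()
def latent_average_stabilization(sample_fn, n_repeats: int = 8):
    """sample_fn() -> one predicted latent of shape (C, D, H, W) (stochastic)."""
    samples = torch.stack([sample_fn() for _ in range(n_repeats)])   # (n, C, D, H, W)
    mean_latent = samples.mean(dim=0)               # stabilized prediction
    voxel_uncertainty = samples.var(dim=0).mean(0)  # per-voxel spread across channels
    global_uncertainty = voxel_uncertainty.mean()   # scalar summary
    return mean_latent, voxel_uncertainty, global_uncertainty

# Toy stochastic "sampler" standing in for one diffusion rollout.
sampler = lambda: torch.randn(4, 8, 16, 16) * 0.1 + 1.0
latent, unc_map, unc = latent_average_stabilization(sampler)
print(latent.shape, unc_map.shape, float(unc))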

Authors:Kevin Flanagan, Dima Damen, Michael Wray
Title: Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval
Abstract:
Video Moment Retrieval is a common task to evaluate the performance of visual-language models - it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. $98.4\%$) scores while retaining moment retrieval scores to within $3.87\%$ Recall@1. Dataset splits and code are available at https://github.com/keflanagan/MomentofUntruth
中文摘要:本文提出负感知视频片段检索新任务,通过区分域内与域外负查询构建评估基准,所提UniVTG-NA模型在保持检索性能的同时实现98.4%的平均负样本拒绝准确率。
English Summary: This paper introduces Negative-Aware Video Moment Retrieval (NA-VMR), a new task that evaluates both moment localization accuracy and negative query rejection, and proposes the UniVTG-NA model, which maintains high retrieval performance while achieving 98.4% average negative rejection accuracy.
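
As a simple illustration of negative-aware retrieval, the wrapper below keeps a base model's top-scoring moment but rejects the query when its relevance score falls below a threshold. The score source and threshold value are assumptions for illustration, not UniVTG-NA's exact mechanism.

# Thresholded rejection wrapper around a base moment-retrieval model (sketch).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MomentPrediction:
    span: Optional[Tuple[float, float]]   # (start_s, end_s), or None if rejected
    relevance: float

def negative_aware_predict(spans_with_scores, reject_threshold: float = 0.35) -> MomentPrediction:
    """spans_with_scores: list of ((start, end), score) from a base VMR model."""
    (start, end), score = max(spans_with_scores, key=lambda x: x[1])
    if score < reject_threshold:                    # likely a negative / irrelevant query
        return MomentPrediction(span=None, relevance=score)
    return MomentPrediction(span=(start, end), relevance=score)

print(negative_aware_predict([((12.0, 18.5), 0.81), ((40.0, 44.0), 0.42)]))
print(negative_aware_predict([((3.0, 7.0), 0.21)]))   # rejected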

Authors:Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
Title: mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Abstract:
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets and models are released in https://github.com/haon-chen/mmE5.
Chinese: 本研究提出了高质量合成多模态数据的三个标准——广泛覆盖、强跨模态对齐和高保真度,并据此生成数据集训练mmE5模型,在多个基准测试中取得了最优性能。
English: This study establishes three criteria for high-quality synthetic multimodal data—broad scope, robust cross-modal alignment, and high fidelity—and uses them to create datasets that train the mmE5 model, achieving state-of-the-art performance on benchmarks.

Authors:Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, Anand Mishra
Title: Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
Abstract:
Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for numbats. Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., numbat digging in the ground. In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of difficult-to-name but easy-to-draw objects and text describing difficult-to-sketch but easy-to-verbalize object attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of approx. 2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at our project website.

Authors:Ziyue Yang, Kehan Wang, Yuhang Ming, Yong Peng, Han Yang, Qiong Chen, Wanzeng Kong
Title: Uncertainty Aware Human-machine Collaboration in Camouflaged Object Detection
Abstract:
Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework on the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56% in balanced accuracy (BA) and 3.66% in the F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6% in BA and 6.66% in the F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. Overall, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: https://github.com/ziyuey/Uncertainty-aware-human-machine-collaboration-in-camouflaged-object-identification.
中文: 本研究提出了一种人机协作框架,通过结合计算机视觉不确定性估计与脑机接口技术,提升了伪装物体检测的性能与可靠性,并取得了领先的实验结果。
English: This study introduces a human-machine collaboration framework that enhances camouflaged object detection by integrating computer vision uncertainty estimation with brain-computer interfaces, achieving state-of-the-art performance and improved reliability.
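
The deferral logic can be sketched in a few lines: the CV model answers high-confidence cases and hands low-confidence ones to the human (BCI) channel. The multiview disagreement measure and thresholds below are illustrative assumptions, not the paper's calibrated policy.

# Confidence-based deferral between a CV model and a human evaluator (sketch).
import torch

def classify_with_deferral(view_probs: torch.Tensor, tau: float = 0.02):
    """view_probs: (V,) per-view probabilities that a camouflaged object is present."""
    p = float(view_probs.mean())
    uncertainty = float(view_probs.var(unbiased=False))   # disagreement across views
    if uncertainty > tau or min(p, 1 - p) > 0.4:          # too uncertain -> ask the human
        return "defer_to_human", p, uncertainty
    return ("present" if p > 0.5 else "absent"), p, uncertainty

print(classify_with_deferral(torch.tensor([0.92, 0.88, 0.95])))  # confident -> machine
print(classify_with_deferral(torch.tensor([0.30, 0.75, 0.55])))  # uncertain -> human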

Authors:Tianle Liu, Shuangming Zhao, Wanshou Jiang, Bingxuan Guo
Title: Sat-DN: Implicit Surface Reconstruction from Multi-View Satellite Images with Depth and Normal Supervision
Abstract:
With advancements in satellite imaging technology, acquiring high-resolution multi-view satellite imagery has become increasingly accessible, enabling rapid and location-independent ground model reconstruction. However, traditional stereo matching methods struggle to capture fine details, and while neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long. Moreover, challenges such as low visibility of building facades, illumination and style differences between pixels, and weakly textured regions in satellite imagery further make it hard to reconstruct reasonable terrain geometry and detailed building facades. To address these issues, we propose Sat-DN, a novel framework leveraging a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints to enhance reconstruction quality. The multi-resolution hash grid accelerates training, while the progressive strategy incrementally increases the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details. The depth and normal constraints ensure a clear building outline and correct planar distribution. Extensive experiments on the DFC2019 dataset demonstrate that Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations. The code is available at https://github.com/costune/SatDN.
中文摘要:Sat-DN框架通过多分辨率哈希网格结合深度引导和表面法向约束,解决了卫星图像重建中的细节缺失和训练效率问题,在DFC2019数据集上实现了最优性能。
English Summary: The proposed Sat-DN framework overcomes limitations in satellite imagery reconstruction by combining multi-resolution hash grids with depth guidance and surface normal constraints, achieving state-of-the-art results through accelerated training and enhanced detail recovery.

Authors:Fenghe Tang, Qingsong Yao, Wenxin Ma, Chenxu Wu, Zihang Jiang, S. Kevin Zhou
Title: Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation
Abstract:
Medical image segmentation remains a formidable challenge due to label scarcity. Pre-training Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in output layers and fail to exploit the rich representations across different ViT layers that better capture fine-grained semantic information needed for more precise medical downstream tasks. To fill the above gap, we hereby present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution, which centers on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluate its performance across seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
中文:Hi-End-MAE框架通过利用分层ViT表征和编码器驱动的重建,提升了医学图像分割的性能,在多个基准测试中表现卓越。
English: The Hi-End-MAE framework enhances medical image segmentation by leveraging hierarchical ViT representations and encoder-driven reconstruction, achieving superior performance across multiple benchmarks.

Authors:Keqi Chen, Lilien Schewski, Vinkle Srivastav, Joël Lavanchy, Didier Mutter, Guido Beldi, Sandra Keller, Nicolas Padoy
Title: When do they StOP?: A First Step Towards Automatically Identifying Team Communication in the Operating Room
Abstract:
Purpose: Surgical performance depends not only on surgeons' technical skills but also on team communication within and across the different professional groups present during the operation. Therefore, automatically identifying team communication in the OR is crucial for patient safety and advances in the development of computer-assisted surgical workflow analysis and intra-operative support systems. To take the first step, we propose a new task of detecting communication briefings involving all OR team members, i.e. the team Time-out and the StOP?-protocol, by localizing their start and end times in video recordings of surgical operations. Methods: We generate an OR dataset of real surgeries, called Team-OR, with more than one hundred hours of surgical videos captured by the multi-view camera system in the OR. The dataset contains temporal annotations of 33 Time-out and 22 StOP?-protocol activities in total. We then propose a novel group activity detection approach, where we encode both scene context and action features, and use an efficient neural network model to output the results. Results: The experimental results on the Team-OR dataset show that our approach outperforms existing state-of-the-art temporal action detection approaches. It also demonstrates the lack of research on group activities in the OR, proving the significance of our dataset. Conclusion: We investigate the Team Time-Out and the StOP?-protocol in the OR, by presenting the first OR dataset with temporal annotations of group activities protocols, and introducing a novel group activity detection approach that outperforms existing approaches. Code is available at https://github.com/CAMMA-public/Team-OR.
中文: 本研究提出了一种新的群体活动检测方法和Team-OR数据集,用于自动识别手术中的团队沟通简报,其性能优于现有方法。
English: This study introduces a novel group activity detection method and the Team-OR dataset to automatically identify team communication briefings in surgical operations, demonstrating superior performance over existing approaches.

Authors:Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
Title: FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
Abstract:
We present FloVD, a novel video diffusion model for camera-controllable video generation. FloVD leverages optical flow to represent the motions of the camera and moving objects. This approach offers two key benefits. Since optical flow can be directly estimated from videos, our approach allows for the use of arbitrary training videos without ground-truth camera parameters. Moreover, as background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
Chinese: FloVD是一种新型视频扩散模型,利用光流实现相机可控的视频生成,无需真实相机参数即可使用任意视频进行训练,并在相机控制和自然物体运动方面表现出优越性能。
English: FloVD is a novel video diffusion model that uses optical flow to enable camera-controllable video generation, allowing training with arbitrary videos without ground-truth camera parameters and achieving superior camera control and natural object motion.

Authors:Yilu Wu, Chenhui Zhu, Shuai Wang, Hanlin Wang, Jing Wang, Zhaoxiang Zhang, Limin Wang
Title: Learning Human Skill Generators at Key-Step Levels
Abstract:
We are committed to learning human skill generators at key-step levels. The generation of skills is a challenging endeavor, but its successful implementation could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesize simple, atomic human operations, they struggle with human skills due to their complex procedures. Human skills involve multi-step, long-duration actions and complex scene transitions, so the existing naive auto-regressive methods for synthesizing long videos cannot generate human skills. To address this, we propose a novel task, the Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of key steps to complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for this task. First, a multimodal large language model (MLLM) generates descriptions for key steps using retrieval augmentation. Subsequently, we use a Key-step Image Generator (KIG) to address the discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insights on human skill generation. All models and data are available at https://github.com/MCG-NJU/KS-Gen.
Chinese Summary: 我们提出了关键步骤技能生成(KS-Gen)新任务,通过生成关键步骤视频片段而非完整视频来简化人类技能视频制作,采用结合多模态大语言模型和图像生成的新框架以提升时间连贯性。
English Summary: We introduce the Key-step Skill Generation (KS-Gen) task to simplify human skill video creation by generating key-step clips instead of full videos, using a novel framework that combines multimodal language models and image generation for improved temporal consistency.

Authors:Yunjiang Xu, Lingzhi Li, Jin Wang, Benyuan Yang, Zhiwen Wu, Xinhong Chen, Jianping Wang
Title: CoDynTrust: Robust Asynchronous Collaborative Perception via Dynamic Feature Trust Modulus
Abstract:
Collaborative perception, fusing information from multiple agents, can extend the perception range and thus improve perception performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or sampling configuration differences, can lead to information mismatches. If this is not handled well, collaborative performance becomes patchy and, worse, safety accidents may occur. To tackle this challenge, we propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony. CoDynTrust generates a dynamic feature trust modulus (DFTM) for each region of interest by modeling aleatoric and epistemic uncertainty as well as selectively suppressing or retaining single-vehicle features, thereby mitigating information mismatches. We then design a multi-scale fusion module to handle multi-scale feature maps processed by DFTM. Compared to existing works that also consider asynchronous collaborative perception, CoDynTrust combats various low-quality information in temporally asynchronous scenarios and allows uncertainty to be propagated to downstream tasks such as planning and control. Experimental results demonstrate that CoDynTrust significantly reduces performance degradation caused by temporal asynchrony across multiple datasets, achieving state-of-the-art detection performance even with temporal asynchrony. The code is available at https://github.com/CrazyShout/CoDynTrust.
Chinese: 协作感知通过融合多智能体信息扩展感知范围并提升性能,但现实环境中的时间异步性会导致信息不匹配;为解决此问题,提出的CoDynTrust框架利用不确定性建模和动态特征信任机制来缓解这些影响,从而提高检测性能。
English: Collaborative perception, which fuses data from multiple agents, can enhance perception range and performance but is often compromised by temporal asynchrony, leading to mismatches; to address this, the proposed CoDynTrust framework uses uncertainty modeling and dynamic feature trust to mitigate these issues and improve detection performance.
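
A hedged sketch of trust-weighted fusion follows: each agent's per-region feature map is scaled by a trust score derived from its estimated aleatoric and epistemic uncertainty before aggregation. The exponential uncertainty-to-trust mapping and tensor shapes are illustrative choices, not the paper's exact DFTM.

# Uncertainty-driven, trust-weighted feature fusion across agents (sketch).
import torch

def trust_weighted_fusion(features: torch.Tensor, aleatoric: torch.Tensor,
                          epistemic: torch.Tensor) -> torch.Tensor:
    """features: (A, C, H, W) per-agent BEV features;
    aleatoric, epistemic: (A, 1, H, W) per-region uncertainty maps."""
    trust = torch.exp(-(aleatoric + epistemic))         # low uncertainty -> high trust
    weights = trust / trust.sum(dim=0, keepdim=True).clamp_min(1e-8)
    return (weights * features).sum(dim=0)              # (C, H, W) fused feature

feats = torch.randn(3, 64, 32, 32)
alea  = torch.rand(3, 1, 32, 32)
epis  = torch.rand(3, 1, 32, 32)
print(trust_weighted_fusion(feats, alea, epis).shape)   # torch.Size([64, 32, 32])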

Authors:Mingyu Xing, Lechao Cheng, Shengeng Tang, Yaxiong Wang, Zhun Zhong, Meng Wang
Title: Knowledge Swapping via Learning and Unlearning
Abstract:
We introduce \textbf{Knowledge Swapping}, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of \textit{Learning Before Forgetting}. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at https://github.com/xingmingyu123456/KnowledgeSwapping.
中文: 该研究提出了知识交换任务,通过“先学习后遗忘”策略,使预训练模型能够同时遗忘特定信息、保留关键知识并学习新内容,并在多项任务中验证了其有效性。
English: The study introduces Knowledge Swapping, a task that enables pretrained models to forget specific information, retain essential knowledge, and acquire new knowledge simultaneously, using a "Learning Before Forgetting" strategy validated across multiple tasks.
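
To make the schedule concrete, here is a hedged sketch of a Learning Before Forgetting loop: first fine-tune on the new-knowledge set while rehearsing a retention set, then unlearn the user-specified forget set by ascending its loss while still minimizing the retention loss. The loss choices, loaders, and hyperparameters are assumptions for illustration, not the paper's training recipe.

# Two-phase learn-then-forget schedule (illustrative sketch).
import torch
import torch.nn.functional as F

def learning_before_forgetting(model, learn_loader, retain_loader, forget_loader,
                               lr=1e-4, learn_epochs=1, forget_epochs=1, beta=1.0):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    # Phase 1: acquire new knowledge while rehearsing retained knowledge.
    for _ in range(learn_epochs):
        for (xl, yl), (xr, yr) in zip(learn_loader, retain_loader):
            loss = F.cross_entropy(model(xl), yl) + F.cross_entropy(model(xr), yr)
            opt.zero_grad(); loss.backward(); opt.step()
    # Phase 2: forget the specified set (ascend its loss) without hurting retention.
    for _ in range(forget_epochs):
        for (xf, yf), (xr, yr) in zip(forget_loader, retain_loader):
            loss = -beta * F.cross_entropy(model(xf), yf) + F.cross_entropy(model(xr), yr)
            opt.zero_grad(); loss.backward(); opt.step()
    return model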

Authors:Kristofer Grover Roos, Atsushi Fukuda, Quan Huu Cap
Title: From Brainwaves to Brain Scans: A Robust Neural Network for EEG-to-fMRI Synthesis
Abstract:
While functional magnetic resonance imaging (fMRI) offers valuable insights into brain activity, it is limited by high operational costs and significant infrastructural demands. In contrast, electroencephalography (EEG) provides millisecond-level precision in capturing electrical activity but lacks the spatial fidelity necessary for precise neural localization. To bridge these gaps, we propose E2fNet, a simple yet effective deep learning model for synthesizing fMRI images from low-cost EEG data. E2fNet is an encoder-decoder network specifically designed to capture and translate meaningful multi-scale features from EEG across electrode channels into accurate fMRI representations. Extensive evaluations across three public datasets demonstrate that E2fNet consistently outperforms existing CNN- and transformer-based methods, achieving state-of-the-art results in terms of the structural similarity index measure (SSIM). These results demonstrate that E2fNet is a promising, cost-effective solution for enhancing neuroimaging capabilities. The code is available at https://github.com/kgr20/E2fNet.
中文: E2fNet是一种创新的深度学习模型,能够通过经济高效的脑电图数据合成高质量的功能磁共振成像图像,在多个数据集上均展现出超越现有方法的卓越性能。
English: E2fNet is a novel deep learning model that synthesizes high-quality fMRI images from cost-effective EEG data, achieving superior performance over existing methods across multiple datasets.

Authors:Xiaofei Wang, Hanyu Liu, Yupei Zhang, Boyang Zhao, Hao Duan, Wanming Hu, Yonggao Mou, Stephen Price, Chao Li
Title: Joint Modelling Histology and Molecular Markers for Cancer Classification
Abstract:
Cancers are characterized by remarkable heterogeneity and diverse prognosis. Accurate cancer classification is essential for patient stratification and clinical decision-making. Although digital pathology has been advancing cancer diagnosis and prognosis, the paradigm in cancer pathology has shifted from purely relying on histology features to incorporating molecular markers. There is an urgent need for digital pathology methods to meet the needs of the new paradigm. We introduce a novel digital pathology approach to jointly predict molecular markers and histology features and model their interactions for cancer classification. Firstly, to mitigate the challenge of cross-magnification information propagation, we propose a multi-scale disentangling module, enabling the extraction of multi-scale features from high-magnification (cellular-level) to low-magnification (tissue-level) whole slide images. Further, based on the multi-scale features, we propose an attention-based hierarchical multi-task multi-instance learning framework to simultaneously predict histology and molecular markers. Moreover, we propose a co-occurrence probability-based label correlation graph network to model the co-occurrence of molecular markers. Lastly, we design a cross-modal interaction module with a dynamic confidence constraint loss and a cross-modal gradient modulation strategy, to model the interactions of histology and molecular markers. Our experiments demonstrate that our method outperforms other state-of-the-art methods in classifying glioma, histology features and molecular markers. Our method promises to promote precise oncology with the potential to advance biomedical research and clinical applications. The code is available at https://github.com/LHY1007/M3C2
中文: 本研究提出一种新型数字病理方法,通过多尺度特征提取与跨模态交互技术联合预测组织学特征和分子标记,在胶质瘤分类中表现优异,有望推动精准肿瘤学发展。
English: This study introduces a novel digital pathology method that integrates multi-scale feature extraction and cross-modal interactions to jointly predict histology features and molecular markers, demonstrating superior performance in glioma classification and advancing precision oncology.

Authors:Ao Liang, Haiyang Hua, Jian Fang, Wenyu Chen, Huaici Zhao
Title: PDM-SSD: Single-Stage Three-Dimensional Object Detector With Point Dilation
Abstract:
Current point-based detectors can only learn from the provided points, with limited receptive fields and insufficient global learning capabilities. In this paper, we present a novel Point Dilation Mechanism for single-stage 3D detection (PDM-SSD) that takes advantage of both point and grid representations. Specifically, we first use a PointNet-style 3D backbone for efficient feature encoding. Then, a neck with Point Dilation Mechanism (PDM) is used to expand the feature space, which involves two key steps: point dilation and feature filling. The former expands points into a grid of a certain size centered on the sampled points in Euclidean space. The latter fills the unoccupied grid with features for backpropagation using spherical harmonic coefficients and a Gaussian density function in terms of direction and scale. Next, we associate multiple dilation centers and fuse coefficients to obtain sparse grid features through height compression. Finally, we design a hybrid detection head for joint learning, where on one hand, the scene heatmap is predicted to complement the voting point set for improved detection accuracy, and on the other hand, the target probability of detected boxes is calibrated through feature fusion. On the challenging Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, PDM-SSD achieves state-of-the-art results for multi-class detection among single-modal methods with an inference speed of 68 frames per second. We also demonstrate the advantages of PDM-SSD in detecting sparse and incomplete objects through numerous object-level instances. Additionally, PDM can serve as an auxiliary network to establish a connection between sampling points and object centers, thereby improving the accuracy of the model without sacrificing inference speed. Our code will be available at https://github.com/AlanLiangC/PDM-SSD.git.
中文: 本文提出了一种用于单阶段3D检测的点扩展机制,通过点扩展和特征填充扩大特征空间,在KITTI数据集上实现了最先进的多类别检测效果,同时保持了较高的推理速度。
English: This paper introduces a Point Dilation Mechanism (PDM) for single-stage 3D detection that expands feature space through point dilation and feature filling, achieving state-of-the-art multi-class detection results on the KITTI dataset while maintaining high inference speed.

Authors:Dongsu Song, Daehwa Ko, Jay Hoon Jung
Title: Amnesia as a Catalyst for Enhancing Black Box Pixel Attacks in Image Classification and Object Detection
Abstract:
It is well known that query-based attacks tend to have relatively higher success rates in adversarial black-box attacks. While research on black-box attacks is actively being conducted, relatively few studies have focused on pixel attacks that target only a limited number of pixels. In image classification, query-based pixel attacks often rely on patches, which heavily depend on randomness and neglect the fact that scattered pixels are more suitable for adversarial attacks. Moreover, to the best of our knowledge, query-based pixel attacks have not been explored in the field of object detection. To address these issues, we propose a novel pixel-based black-box attack called Remember and Forget Pixel Attack using Reinforcement Learning (RFPAR), consisting of two main components: the Remember and Forget processes. RFPAR mitigates randomness and avoids patch dependency by leveraging rewards generated through a one-step RL algorithm to perturb pixels. RFPAR effectively creates perturbed images that minimize the confidence scores while adhering to limited pixel constraints. Furthermore, we advance our proposed attack beyond image classification to object detection, where RFPAR reduces the confidence scores of detected objects to avoid detection. Experiments on the ImageNet-1K dataset for classification show that RFPAR outperforms state-of-the-art query-based pixel attacks. For object detection, using the MSCOCO dataset with YOLOv8 and DDQ, RFPAR demonstrates comparable mAP reduction to state-of-the-art query-based attacks while requiring fewer queries. Further experiments on the Argoverse dataset using YOLOv8 confirm that RFPAR effectively removes objects on a larger-scale dataset. Our code is available at https://github.com/KAU-QuantumAILab/RFPAR.
中文: 本文提出RFPAR这一基于强化学习的像素攻击方法,通过记忆-遗忘机制降低黑盒攻击中的随机性和补丁依赖性,在图像分类中表现优异,并在目标检测任务中以更少查询实现有效目标消除。
English: This paper introduces RFPAR, a reinforcement learning-based pixel attack method that reduces randomness and patch dependency in black-box attacks, demonstrating superior performance in image classification and effective object removal in detection tasks with fewer queries.
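
A loose, query-budgeted sketch of the remember/forget idea is shown below: propose scattered single-pixel perturbations, keep those whose reward (a drop in the black-box confidence) is positive, and discard the rest. It follows only the abstract's description; the one-step RL parameterization is not reproduced, and the toy black box is an assumption.

# Remember/forget-style black-box pixel perturbation loop (illustrative).
import numpy as np

def pixel_attack(query_fn, image: np.ndarray, budget: int = 200, eps: float = 1.0):
    """query_fn(img) -> confidence of the true class (black box); image in [0, 1]."""
    rng = np.random.default_rng(0)
    adv = image.copy()
    best_conf = query_fn(adv)
    remembered = []                                    # pixels worth keeping perturbed
    for _ in range(budget):
        y, x = rng.integers(image.shape[0]), rng.integers(image.shape[1])
        candidate = adv.copy()
        candidate[y, x] = np.clip(candidate[y, x] + rng.choice([-eps, eps]), 0, 1)
        conf = query_fn(candidate)
        if conf < best_conf:                           # remember: perturbation helped
            adv, best_conf = candidate, conf
            remembered.append((y, x))
        # else: forget this candidate and resample elsewhere
    return adv, remembered, best_conf

# Toy black box: "confidence" is just the mean intensity of a random image.
img = np.random.default_rng(1).random((32, 32, 3))
adv, px, conf = pixel_attack(lambda im: float(im.mean()), img)
print(len(px), "pixels changed, confidence", round(conf, 3))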

Authors:Le-Anh Tran
Title: Unpaired Image Dehazing via Kolmogorov-Arnold Transformation of Latent Features
Abstract:
This paper proposes an innovative framework for Unsupervised Image Dehazing via Kolmogorov-Arnold Transformation, termed UID-KAT. Image dehazing is recognized as a challenging and ill-posed vision task that requires complex transformations and interpretations in the feature space. Recent advancements have introduced Kolmogorov-Arnold Networks (KANs), inspired by the Kolmogorov-Arnold representation theorem, as promising alternatives to Multi-Layer Perceptrons (MLPs) since KANs can leverage their polynomial foundation to more efficiently approximate complex functions while requiring fewer layers than MLPs. Motivated by this potential, this paper explores the use of KANs combined with adversarial training and contrastive learning to model the intricate relationship between hazy and clear images. Adversarial training is employed due to its capacity in producing high-fidelity images, and contrastive learning promotes the model's emphasis on significant features while suppressing the influence of irrelevant information. The proposed UID-KAT framework is trained in an unsupervised setting to take advantage of the abundance of real-world data and address the challenge of preparing paired hazy/clean images. Experimental results show that UID-KAT achieves state-of-the-art dehazing performance across multiple datasets and scenarios, outperforming existing unpaired methods while reducing model complexity. The source code for this work is publicly available at https://github.com/tranleanh/uid-kat.
中文: 本文提出UID-KAT无监督图像去雾框架,通过结合Kolmogorov-Arnold网络与对抗训练和对比学习,有效建模有雾与清晰图像间的复杂映射关系,在降低模型复杂度的同时实现了最先进的去雾性能。
English: This paper introduces UID-KAT, an unsupervised image dehazing framework that utilizes Kolmogorov-Arnold Networks with adversarial and contrastive learning to effectively model the complex mapping between hazy and clear images, achieving superior performance with reduced model complexity.

Authors:Feng Liang, Haoyu Ma, Zecheng He, Tingbo Hou, Ji Hou, Kunpeng Li, Xiaoliang Dai, Felix Juefei-Xu, Samaneh Azadi, Animesh Sinha, Peizhao Zhang, Peter Vajda, Diana Marculescu
Title: Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts
Abstract:
Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts, including face, body, and animal images, into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
中文:Movie Weaver通过锚定提示和概念嵌入技术,实现了多概念视频个性化生成,有效避免了身份混合问题,在保持各参考图像独立特征的同时,其生成质量优于现有方法。
English: Movie Weaver introduces anchored prompts and concept embeddings to enable seamless multi-concept video personalization, effectively preventing identity blending while preserving distinct attributes from multiple reference images and outperforming existing methods in quality.

Authors:Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov
Title: Pippo: High-Resolution Multi-View Humans from a Single Image
Abstract:
We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs - e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low-resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high-resolution and use pixel-aligned controls (e.g., Spatial anchor and Plucker rays) to enable 3D consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.

Authors:Ivan Lopes, Valentin Deschaintre, Yannick Hold-Geoffroy, Raoul de Charette
Title: MatSwap: Light-aware material transfers in images
Abstract:
We present MatSwap, a method to transfer materials to designated surfaces in an image photorealistically. Such a task is non-trivial due to the large entanglement of material appearance, geometry, and lighting in a photograph. In the literature, material editing methods typically rely on either cumbersome text engineering or extensive manual annotations requiring artist knowledge and 3D scene properties that are impractical to obtain. In contrast, we propose to directly learn the relationship between the input material -- as observed on a flat surface -- and its appearance within the scene, without the need for explicit UV mapping. To achieve this, we rely on a custom light- and geometry-aware diffusion model. We fine-tune a large-scale pre-trained text-to-image model for material transfer using our synthetic dataset, preserving its strong priors to ensure effective generalization to real images. As a result, our method seamlessly integrates a desired material into the target location in the photograph while retaining the identity of the scene. We evaluate our method on synthetic and real images and show that it compares favorably to recent work both qualitatively and quantitatively. We release our code and data on https://github.com/astra-vision/MatSwap
中文: MatSwap是一种通过定制扩散模型直接学习材质外观关系的真实感材质替换方法,无需三维场景属性即可将指定材质无缝融入目标图像。
English: MatSwap is a photorealistic material transfer method that uses a custom diffusion model to learn material appearance relationships without needing 3D scene properties, enabling seamless integration of materials into target images.

Authors:Shuhuai Ren, Shuming Ma, Xu Sun, Furu Wei
Title: Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
Abstract:
Next-Token Prediction (NTP) is a de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128x128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
中文: 提出的下一块预测框架通过将生成单元从单个标记转为块级预测并采用双向注意力机制,在视频生成质量上超越传统方法的同时实现了11倍的加速效果。
English: The proposed Next-Block Prediction (NBP) framework improves video generation by shifting from token-level to block-level prediction with bidirectional attention, achieving superior quality and an 11x speedup over traditional methods.
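
The semi-autoregressive attention pattern implied by next-block prediction can be written as a small mask: tokens attend bidirectionally inside their own block and causally to all earlier blocks. The block size and token layout below are illustrative, not the paper's configuration.

# Semi-autoregressive (block-causal) attention mask (illustrative).
import torch

def semi_ar_attention_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    """Return a (num_tokens, num_tokens) boolean mask; True = attention allowed."""
    block_id = torch.arange(num_tokens) // block_size
    # Query token i may attend to key token j iff j's block is not after i's block,
    # so attention is bidirectional within a block and causal across blocks.
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

mask = semi_ar_attention_mask(num_tokens=8, block_size=4)
print(mask.int())
# Tokens 0-3 (block 0) see only block 0; tokens 4-7 (block 1) see both blocks,
# so each token in block 0 can predict its counterpart in block 1 in parallel.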

Authors:Bing Fan, Yunhe Feng, Yapeng Tian, James Chenhao Liang, Yuewei Lin, Yan Huang, Heng Fan
Title: PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
Abstract:
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video due to the lack of sufficient target cues, leading to degraded performance. Addressing this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features for improving target localization. Our PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, is utilized as guidance to refine the query and video features for the next stage, which are used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, our PRVQL, besides the given object cues, enjoys additional crucial target information from a video as guidance to refine features, and hence enhances EgoVQL in complicated scenes. In our experiments on the challenging Ego4D benchmark, PRVQL achieves state-of-the-art results and largely surpasses other methods, showing its efficacy. Our code, model and results will be released at https://github.com/fb-reps/PRVQL.
中文: PRVQL提出了一种渐进式知识引导优化框架,通过从视频中迭代提取并利用目标相关知识来优化查询和视频特征,从而在复杂场景中显著提升了第一人称视觉查询定位的准确率,并在Ego4D数据集上取得了领先性能。
English: PRVQL introduces a progressive knowledge-guided refinement framework that enhances egocentric visual query localization by iteratively extracting and utilizing target-specific knowledge from videos to refine features, achieving state-of-the-art performance on complex datasets like Ego4D.

Authors:Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou
Title: Magic 1-For-1: Generating One Minute Video Clips within One Minute
Abstract:
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that with the same optimization algorithm, the image-to-video task is indeed easier to converge over the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup by using a multi-modal prior condition injection; 2) inference latency speed up by applying an adversarial step distillation, and 3) inference memory cost optimization with parameter sparsification. With those techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second for generating 1 second video clips on average. We conduct a series of preliminary explorations to find out the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this could be a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
中文:Magic141是一种高效的视频生成模型,通过将文本到视频任务分解为文本到图像和图像到视频两个步骤,并运用多种优化技术降低计算成本和延迟,从而快速生成高质量视频。
English: Magic141 is an efficient video generation model that simplifies text-to-video creation by splitting it into text-to-image and image-to-video tasks, using optimization techniques to reduce memory and latency while maintaining quality.

Authors:Yuanxun Lu, Jingyang Zhang, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao, Shiwei Li
Title: Matrix3D: Large Photogrammetry Model All-in-One
Abstract:
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, all within the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.
Chinese: Matrix3D是一个统一的摄影测量模型,采用多模态扩散变换器处理姿态估计、深度预测和新视角合成,通过掩码学习策略利用不完整数据进行训练,实现了最先进的性能。
English: Matrix3D is a unified photogrammetry model that uses a multi-modal diffusion transformer to handle pose estimation, depth prediction, and novel view synthesis, achieving state-of-the-art performance through mask learning for training with incomplete data.

Authors:Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wagner, Christoph Stiller
Title: Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving
Abstract:
Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion related tasks, such as prediction and planning, impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning. Specifically, we employ a set of learned motion queries that operate in parallel with detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset with UniAD and SparseDrive confirm the effectiveness of our divide and merge approach, resulting in performance improvements across perception, prediction, and planning. Our code is available at https://github.com/shenyinzhe/DMAD.
中文: 本文提出神经贝叶斯运动解码方法,通过分离语义与运动学习的并行检测、跟踪及预测机制解决自动驾驶中的负迁移问题,在nuScenes数据集上实现了感知、预测与规划任务的全面性能提升。
English: This paper introduces Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning to mitigate negative transfer in autonomous driving, achieving performance improvements across perception, prediction, and planning on the nuScenes dataset.

Authors:Lin-Zhuo Chen, Kangjie Liu, Youtian Lin, Siyu Zhu, Zhihao Li, Xun Cao, Yao Yao
Title: Flow Distillation Sampling: Regularizing 3D Gaussians with Pre-trained Matching Priors
Abstract:
3D Gaussian Splatting (3DGS) has achieved excellent rendering quality with fast training and rendering speed. However, its optimization process lacks explicit geometric constraints, leading to suboptimal geometric reconstruction in regions with sparse or no observational input views. In this work, we try to mitigate the issue by incorporating a pre-trained matching prior into the 3DGS optimization process. We introduce Flow Distillation Sampling (FDS), a technique that leverages pre-trained geometric knowledge to bolster the accuracy of the Gaussian radiance field. Our method employs a strategic sampling technique to target unobserved views adjacent to the input views, utilizing the optical flow calculated from the matching model (Prior Flow) to guide the flow analytically calculated from the 3DGS geometry (Radiance Flow). Comprehensive experiments in depth rendering, mesh reconstruction, and novel view synthesis showcase the significant advantages of FDS over state-of-the-art methods. Additionally, our interpretive experiments and analysis aim to shed light on the effects of FDS on geometric accuracy and rendering quality, potentially providing readers with insights into its performance. Project page: https://nju-3dv.github.io/projects/fds
中文摘要:本文提出流蒸馏采样方法,通过引入预训练的几何先验知识,利用光流引导优化3D高斯溅射过程,有效提升了稀疏观测区域的几何重建精度。
English Summary: This paper introduces Flow Distillation Sampling (FDS), a method that enhances 3D Gaussian Splatting by integrating pre-trained geometric priors to improve reconstruction accuracy in under-observed areas through optical flow guidance.
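A minimal sketch of the kind of objective FDS implies: penalize the discrepancy between the flow analytically derived from the 3DGS geometry (Radiance Flow) and the flow predicted by the pre-trained matching model (Prior Flow). The L1 form, mask handling, and tensor layout below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def flow_distillation_loss(radiance_flow: torch.Tensor,
                           prior_flow: torch.Tensor,
                           valid_mask: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the flow implied by the 3DGS geometry
    and the flow from a pre-trained matching model, restricted to pixels
    where the prior is trusted.

    Shapes: flows are (B, 2, H, W); valid_mask is (B, 1, H, W) in {0, 1}.
    """
    diff = F.l1_loss(radiance_flow, prior_flow, reduction="none")
    diff = diff * valid_mask
    denom = (valid_mask.sum() * radiance_flow.shape[1]).clamp(min=1.0)
    return diff.sum() / denom

# Toy usage with random tensors.
rf = torch.randn(1, 2, 32, 32)
pf = torch.randn(1, 2, 32, 32)
mask = torch.ones(1, 1, 32, 32)
print(flow_distillation_loss(rf, pf, mask))
```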

Authors:Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi
Title: Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
Abstract:
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/

Authors:Angel Villar-Corrales, Sven Behnke
Title: PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning
Abstract:
Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on videos and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot allows the generation of multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available on https://play-slot.github.io/PlaySlot/.

Authors:Fangwen Wu, Lechao Cheng, Shengeng Tang, Xiaofeng Zhu, Chaowei Fang, Dingwen Zhang, Meng Wang
Title: Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning
Abstract:
Class-incremental learning (CIL) seeks to enable a model to sequentially learn new classes while retaining knowledge of previously learned ones. Balancing flexibility and stability remains a significant challenge, particularly when the task ID is unknown. To address this, our study reveals that the gap in feature distribution between novel and existing tasks is primarily driven by differences in mean and covariance moments. Building on this insight, we propose a novel semantic drift calibration method that incorporates mean shift compensation and covariance calibration. Specifically, we calculate each class's mean by averaging its sample embeddings and estimate task shifts using weighted embedding changes based on their proximity to the previous mean, effectively capturing mean shifts for all learned classes with each new task. We also apply a Mahalanobis distance constraint for covariance calibration, aligning class-specific embedding covariances between old and current networks to mitigate the covariance shift. Additionally, we integrate a feature-level self-distillation approach to enhance generalization. Comprehensive experiments on commonly used datasets demonstrate the effectiveness of our approach. The source code is available at https://github.com/fwu11/MACIL.git.
中文: 本研究针对类增量学习提出了一种语义漂移校准方法,通过均值漂移补偿和协方差校准来弥补特征分布差异,有效提升了任务间的稳定性和泛化能力。
English: This study introduces a semantic drift calibration method for class-incremental learning, addressing feature distribution gaps through mean shift compensation and covariance calibration to enhance stability and generalization across tasks.
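The mean shift compensation step can be illustrated with a short sketch: each stored class mean is moved by a proximity-weighted average of the embedding changes observed on current-task samples. The softmax weighting and bandwidth below are assumptions; the paper's exact weighting scheme may differ.

```python
import torch

def compensate_class_means(old_means: torch.Tensor,
                           emb_before: torch.Tensor,
                           emb_after: torch.Tensor,
                           sigma: float = 1.0) -> torch.Tensor:
    """Shift stored means of old classes after learning a new task.

    Each old class mean is moved by a weighted average of the embedding
    changes observed on current-task samples, with weights decaying with
    the sample's distance to that mean (closer samples say more about the
    local drift).

    old_means: (C, D) stored means of previously learned classes
    emb_before / emb_after: (N, D) current-task embeddings from the
    previous and the updated network, respectively.
    """
    drift = emb_after - emb_before                   # (N, D)
    dists = torch.cdist(old_means, emb_before)       # (C, N)
    weights = torch.softmax(-dists / sigma, dim=1)   # (C, N), rows sum to 1
    shifts = weights @ drift                         # (C, D)
    return old_means + shifts

# Toy example: 3 old classes, 5 new-task samples, 8-dim embeddings.
means = torch.randn(3, 8)
before, after = torch.randn(5, 8), torch.randn(5, 8)
print(compensate_class_means(means, before, after).shape)  # torch.Size([3, 8])
```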

Authors:Viacheslav Vasilev, Julia Agafonova, Nikolai Gerasimenko, Alexander Kapitanov, Polina Mikhailova, Evelina Mironova, Denis Dimitrov
Title: RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation
Abstract:
Text-to-image generation models have gained popularity among users around the world. However, many of these models exhibit a strong bias toward English-speaking cultures, ignoring or misrepresenting the unique characteristics of other language groups, countries, and nationalities. The lack of cultural awareness can reduce the generation quality and lead to undesirable consequences such as unintentional insult, and the spread of prejudice. In contrast to the field of natural language processing, cultural awareness in computer vision has not been explored as extensively. In this paper, we strive to reduce this gap. We propose a RusCode benchmark for evaluating the quality of text-to-image generation containing elements of the Russian cultural code. To do this, we form a list of 19 categories that best represent the features of Russian visual culture. Our final dataset consists of 1250 text prompts in Russian and their translations into English. The prompts cover a wide range of topics, including complex concepts from art, popular culture, folk traditions, famous people's names, natural objects, scientific achievements, etc. We present the results of a human evaluation of the side-by-side comparison of Russian visual concepts representations using popular generative models.
中文:文本到图像生成模型常偏向英语文化,因此我们提出RusCode基准,通过涵盖俄罗斯文化特征的人类评估提示来提升其文化表现力。
English: Text-to-image models often exhibit cultural bias favoring English-speaking contexts, prompting the development of the RusCode benchmark to evaluate and improve their representation of Russian cultural elements through human-assessed prompts.

Authors:Duong Anh Kiet
Title: Hierarchical Document Parsing via Large Margin Feature Matching and Heuristics
Abstract:
We present our solution to the AAAI-25 VRD-IU challenge, achieving first place in the competition. Our approach integrates large margin loss for improved feature discrimination and employs heuristic rules to refine hierarchical relationships. By combining a deep learning-based matching strategy with greedy algorithms, we achieve a significant boost in accuracy while maintaining computational efficiency. Our method attains an accuracy of 0.98904 on the private leaderboard, demonstrating its effectiveness in document structure parsing. Source codes are publicly available at https://github.com/ffyyytt/VRUID-AAAI-DAKiet
Chinese: 我们的方案在AAAI-25 VRD-IU挑战赛中夺冠,通过结合大间隔损失提升特征区分度和启发式规则优化层级关系,以0.98904的准确率实现了高效计算。
English: Our solution won the AAAI-25 VRD-IU challenge by integrating large margin loss for better feature discrimination and heuristic rules to refine hierarchical relationships, achieving 0.98904 accuracy with efficient computation.

Authors:Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao
Title: EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Abstract:
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance. Our dataset is released at: https://github.com/zhousheng97/EgoTextVQA.
中文: EgoTextVQA是一个包含1.5K视频和7K问题的新型基准,用于评估以自我为中心的视觉文本问答,目前最佳模型(Gemini 1.5 Pro)准确率仅33%,表明需要提升时间定位和多帧推理能力。
English: EgoTextVQA is a new benchmark with 1.5K videos and 7K questions for evaluating egocentric scene-text QA, where current models like Gemini 1.5 Pro achieve only 33% accuracy, highlighting the need for improved temporal grounding and multi-frame reasoning.

Authors:Yuxu Lu, Ai Chen, Dong Yang, Ryan Wen Liu
Title: USRNet: Unified Scene Recovery Network for Enhancing Traffic Imaging under Multiple Adverse Weather Conditions
Abstract:
Advancements in computer vision technology have facilitated the extensive deployment of intelligent transportation systems and visual surveillance systems across various applications, including autonomous driving, public safety, and environmental monitoring. However, adverse weather conditions such as haze, rain, snow, and more complex mixed degradation can significantly degrade image quality. The degradation compromises the accuracy and reliability of these systems across various scenarios. To tackle the challenge of developing adaptable models for scene restoration, we introduce the unified scene recovery network (USRNet), capable of handling multiple types of image degradation. The USRNet features a sophisticated architecture consisting of a scene encoder, an attention-driven node independent learning mechanism (NILM), an edge decoder, and a scene restoration module. The scene encoder, powered by advanced residual blocks, extracts deep features from degraded images in a progressive manner, ensuring thorough encoding of degradation information. To enhance the USRNet's adaptability in diverse weather conditions, we introduce NILM, which enables the network to learn and respond to different scenarios with precision, thereby increasing its robustness. The edge decoder is designed to extract edge features with precision, which is essential for maintaining image sharpness. Experimental results demonstrate that USRNet surpasses existing methods in handling complex imaging degradations, thereby improving the accuracy and reliability of visual systems across diverse scenarios. The code resources for this work can be accessed in https://github.com/LouisYxLu/USRNet.
中文: USRNet模型通过创新的架构有效应对恶劣天气导致的多种图像退化问题,显著提升了视觉系统在不同场景下的可靠性。
English: The USRNet model effectively addresses multiple image degradations caused by adverse weather through its innovative architecture, enhancing the reliability of visual systems across diverse scenarios.

Authors:Ruining Deng, Yihe Yang, David J. Pisapia, Benjamin Liechty, Junchao Zhu, Juming Xiong, Junlin Guo, Zhengyi Lu, Jiacheng Wang, Xing Yao, Runxuan Yu, Rendong Zhang, Gaurav Rudravaram, Mengmeng Yin, Pinaki Sarder, Haichun Yang, Yuankai Huo, Mert R. Sabuncu
Title: CASC-AI: Consensus-aware Self-corrective Learning for Noise Cell Segmentation
Abstract:
Multi-class cell segmentation in high-resolution gigapixel whole slide images (WSIs) is crucial for various clinical applications. However, training such models typically requires labor-intensive, pixel-wise annotations by domain experts. Recent efforts have democratized this process by involving lay annotators without medical expertise. However, conventional non-corrective approaches struggle to handle annotation noise adaptively because they lack mechanisms to mitigate false positives (FP) and false negatives (FN) at both the image-feature and pixel levels. In this paper, we propose a consensus-aware self-corrective AI agent that leverages the Consensus Matrix to guide its learning process. The Consensus Matrix defines regions where both the AI and annotators agree on cell and non-cell annotations, which are prioritized with stronger supervision. Conversely, areas of disagreement are adaptively weighted based on their feature similarity to high-confidence consensus regions, with more similar regions receiving greater attention. Additionally, contrastive learning is employed to separate features of noisy regions from those of reliable consensus regions by maximizing their dissimilarity. This paradigm enables the model to iteratively refine noisy labels, enhancing its robustness. Validated on one real-world lay-annotated cell dataset and two reasoning-guided simulated noisy datasets, our method demonstrates improved segmentation performance, effectively correcting FP and FN errors and showcasing its potential for training robust models on noisy datasets. The official implementation and cell annotations are publicly available at https://github.com/ddrrnn123/CASC-AI.
中文: 本文提出了一种基于共识矩阵的自校正AI代理,通过对比学习和自适应加权机制迭代修正非专业标注者产生的细胞标注噪声,在多个数据集上验证了其能有效改善分割性能并纠正误报和漏报错误。
English: This paper introduces a consensus-aware self-corrective AI agent that leverages a Consensus Matrix and contrastive learning to adaptively refine noisy cell annotations from non-expert annotators, demonstrating improved segmentation performance by effectively correcting false positives and negatives across multiple datasets.
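A hedged sketch of consensus-aware weighting: pixels where the model and the annotators agree receive full supervision, while disagreement pixels are down-weighted by a similarity score to high-confidence consensus regions. The contrastive branch of the method is omitted, and all names and shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def consensus_weighted_loss(logits: torch.Tensor,
                            annotator_mask: torch.Tensor,
                            model_mask: torch.Tensor,
                            disagree_weight: torch.Tensor) -> torch.Tensor:
    """Supervise consensus pixels strongly and disagreement pixels softly.

    logits: (B, 1, H, W) raw predictions of the current model
    annotator_mask / model_mask: (B, 1, H, W) binary lay-annotator labels
    and binarized model predictions used to build the consensus
    disagree_weight: (B, 1, H, W) in [0, 1], e.g. feature similarity of a
    disagreement pixel to high-confidence consensus regions.
    """
    consensus = (annotator_mask == model_mask).float()      # agreement map
    weights = consensus + (1.0 - consensus) * disagree_weight
    loss = F.binary_cross_entropy_with_logits(
        logits, annotator_mask.float(), reduction="none")
    return (weights * loss).mean()

# Toy usage.
logits = torch.randn(1, 1, 16, 16)
ann = (torch.rand(1, 1, 16, 16) > 0.5).float()
pred = (torch.rand(1, 1, 16, 16) > 0.5).float()
sim = torch.rand(1, 1, 16, 16)
print(consensus_weighted_loss(logits, ann, pred, sim))
```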

Authors:Yuechen Xie, Jie Song, Mengqi Xue, Haofei Zhang, Xingen Wang, Bingde Hu, Genlang Chen, Mingli Song
Title: Dataset Ownership Verification in Contrastive Pre-trained Models
Abstract:
High-quality open-source datasets, which necessitate substantial efforts for curation, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised pre-trained models by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insights that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MOCO v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a p-value markedly below 0.05, surpassing all previous methodologies. Our code is available at https://github.com/xieyc99/DOV4CL.
中文: 本研究首次提出针对自监督预训练模型的数据集所有权验证方法,通过对比学习分析嵌入空间中的实例关系来准确判断模型是否使用特定数据集训练,并在多个模型中验证了其显著有效性。
English: This study introduces the first dataset ownership verification method for self-supervised pre-trained models using contrastive learning, effectively determining if a model was trained on a specific dataset by analyzing embedding space relationships, with validation across multiple models showing significant results.

Authors:Pinxin Liu, Pengfei Zhang, Hyeongwoo Kim, Pablo Garrido, Ari Shapiro, Kyle Olszewski
Title: Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation
Abstract:
Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interactions by synchronizing gestures with speech. Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connection to link gesture keypoints to improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1.

Authors:Sen Peng, Mingyue Wang, Jianfei He, Jijia Yang, Xiaohua Jia
Title: CAT: Contrastive Adversarial Training for Evaluating the Robustness of Protective Perturbations in Latent Diffusion Models
Abstract:
Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose the Contrastive Adversarial Training (CAT) utilizing lightweight adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization, urging the community to reconsider and improve the robustness of existing protective perturbations. The code is available at https://github.com/senp98/CAT.
Chinese: 潜在扩散模型易受对抗样本干扰,因其扭曲潜在表示,但提出的对比对抗训练(CAT)方法有效削弱了这些保护性扰动的防御效果,揭示了其鲁棒性不足。
English: Latent diffusion models are vulnerable to adversarial examples that distort latent representations, but the proposed Contrastive Adversarial Training (CAT) method effectively counteracts these protective perturbations, exposing their lack of robustness.

Authors:Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang
Title: Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
Abstract:
Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in terms of video quality, but also exhibits strong competitiveness in lip synchronization while offering improved flexibility in controlling emotion and head pose. The code will be available at https://github.com/Playmate111/Playmate.
Chinese: 提出的Playmate框架采用两阶段训练方法,通过解耦的3D表征和情感控制模块,能够生成具有精确唇部同步和精细情感控制的逼真说话人脸,在视频质量和控制灵活性方面优于现有方法。
English: The proposed Playmate framework introduces a two-stage training approach with decoupled 3D representation and emotion-control modules to generate lifelike talking faces with accurate lip-sync and fine-grained emotional control, outperforming existing methods in video quality and flexibility.

Authors:Ravi Shah, Atsushi Fukuda, Quan Huu Cap
Title: Color-Quality Invariance for Robust Medical Image Segmentation
Abstract:
Single-source domain generalization (SDG) in medical image segmentation remains a significant challenge, particularly for images with varying color distributions and qualities. Previous approaches often struggle when models trained on high-quality images fail to generalize to low-quality test images due to these color and quality shifts. In this work, we propose two novel techniques to enhance generalization: dynamic color image normalization (DCIN) module and color-quality generalization (CQG) loss. The DCIN dynamically normalizes the color of test images using two reference image selection strategies. Specifically, the DCIN utilizes a global reference image selection (GRIS), which finds a universal reference image, and a local reference image selection (LRIS), which selects a semantically similar reference image per test sample. Additionally, CQG loss enforces invariance to color and quality variations by ensuring consistent segmentation predictions across transformed image pairs. Experimental results show that our proposals significantly improve segmentation performance over the baseline on two target domain datasets, despite being trained solely on a single source domain. Notably, our model achieved up to a 32.3-point increase in Dice score compared to the baseline, consistently producing robust and usable results even under substantial domain shifts. Our work contributes to the development of more robust medical image segmentation models that generalize across unseen domains. The implementation code is available at https://github.com/RaviShah1/DCIN-CQG.
中文摘要:本研究提出了动态颜色图像归一化和颜色质量泛化损失两种新技术,显著提升了医学图像分割中单源域泛化能力,在面对颜色和质量变化的未知域时表现出优越性能。
English Summary: This study introduces two novel techniques, dynamic color image normalization and color-quality generalization loss, to enhance single-source domain generalization in medical image segmentation, significantly improving performance on unseen domains with varying color and quality.

Authors:ByungOk Han, Woo-han Yun, Beom-Su Seo, Jaehong Kim
Title: Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired
Abstract:
Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, which is crucial for navigation in complex environments such as street crossings. We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully open-sourced the SAIT dataset and SA-Bench, along with the related code, at https://github.com/byungokhan/Space-awareVLM
中文摘要:本文提出的SAIT数据集和SA-Bench旨在提升视觉语言模型的空间感知能力,使导盲犬机器人能为视障人士提供更安全的导航指引,该模型在实验中表现优异且所有资源均已开源。
English Summary: The SAIT dataset and SA-Bench are introduced to enhance Vision-Language Models' spatial understanding for safer guide dog robots, with the proposed model outperforming existing methods and all resources being open-sourced.

Authors:Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
Title: Explaining 3D Computed Tomography Classifiers with Counterfactuals
Abstract:
Counterfactual explanations enhance the interpretability of deep learning models in medical imaging, yet adapting them to 3D CT scans poses challenges due to volumetric complexity and resource demands. We extend the Latent Shift counterfactual generation method from 2D applications to explain 3D computed tomography (CT) scans classifiers. We address the challenges associated with 3D classifiers, such as limited training samples and high memory demands, by implementing a slice-based autoencoder and gradient blocking except for specific chunks of slices. This method leverages a 2D encoder trained on CT slices, which are subsequently combined to maintain 3D context. We demonstrate this technique on two models for clinical phenotype prediction and lung segmentation. Our approach is both memory-efficient and effective for generating interpretable counterfactuals in high-resolution 3D medical imaging.
中文: 本研究通过采用基于切片的自动编码器和梯度阻断技术,将潜在偏移方法扩展至三维CT扫描的反事实解释生成,在保持临床可解释性的同时,有效解决了内存限制和体积复杂性的问题。
English: The study extends the Latent Shift method to generate counterfactual explanations for 3D CT scans by using a slice-based autoencoder and gradient blocking, effectively addressing memory constraints and volumetric complexity while maintaining interpretability in clinical applications.
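Below is a minimal sketch of the Latent Shift idea applied to a volume: encode, shift the latent along the gradient of the classifier score, and decode the counterfactual. The toy modules and the single-step update are assumptions; the paper's slice-based autoencoder and gradient blocking over slice chunks are not reproduced here.

```python
import torch
import torch.nn as nn

def latent_shift_counterfactual(encoder, decoder, classifier,
                                volume: torch.Tensor,
                                lam: float = 100.0) -> torch.Tensor:
    """Generate a counterfactual by perturbing the autoencoder latent in
    the direction that changes the classifier's output (Latent Shift).

    encoder/decoder stand in for a slice-based autoencoder applied to a
    CT volume of shape (1, D, H, W); classifier maps a decoded volume to
    a scalar logit. All three are assumed to be differentiable modules.
    """
    z = encoder(volume)
    z = z.detach().requires_grad_(True)
    score = classifier(decoder(z)).sum()
    grad = torch.autograd.grad(score, z)[0]
    z_shifted = z - lam * grad            # move against the predicted class
    return decoder(z_shifted).detach()

# Toy modules on a tiny (1, 4, 8, 8) "volume" just to show the call pattern.
enc = nn.Sequential(nn.Flatten(), nn.Linear(4 * 8 * 8, 32))
dec = nn.Sequential(nn.Linear(32, 4 * 8 * 8), nn.Unflatten(1, (4, 8, 8)))
clf = nn.Sequential(nn.Flatten(), nn.Linear(4 * 8 * 8, 1))
cf = latent_shift_counterfactual(enc, dec, clf, torch.randn(1, 4, 8, 8))
print(cf.shape)  # torch.Size([1, 4, 8, 8])
```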

Authors:Girish A. Koushik, Diptesh Kanojia, Helen Treharne
Title: Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content
Abstract:
Social media platforms enable the propagation of hateful content across different modalities such as textual, auditory, and visual, necessitating effective detection methods. While recent approaches have shown promise in handling individual modalities, their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content. Our comprehensive evaluation reveals significant modality-specific limitations: while simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset) with a 9.9% points F1-score improvement, it struggles with complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, we demonstrate how current fusion approaches fail to capture nuanced cross-modal interactions, particularly in cases involving benign confounders. Our findings provide crucial insights for developing more robust hate detection systems and highlight the need for modality-specific architectural considerations. The code is available at https://github.com/gak97/Video-vs-Meme-Hate.
Chinese: 本研究系统评估了基于融合的多模态仇恨内容检测方法,发现简单嵌入融合在视频内容上表现优异,但在处理表情包中复杂的图文关系时存在不足,因其难以捕捉细微的跨模态交互特征。
English: This study systematically evaluates fusion-based approaches for multimodal hate detection, revealing that while simple embedding fusion excels with video content, it struggles with complex image-text relationships in memes due to limitations in capturing nuanced cross-modal interactions.
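For reference, the "simple embedding fusion" baseline discussed here is essentially concatenation of per-modality embeddings followed by a small classifier; the sketch below is a generic version with assumed embedding sizes, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SimpleFusionClassifier(nn.Module):
    """Late fusion baseline: concatenate per-modality embeddings
    (e.g., video, audio, text) and classify with a small MLP."""

    def __init__(self, dims=(768, 512, 768), hidden=512, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, video_emb, audio_emb, text_emb):
        fused = torch.cat([video_emb, audio_emb, text_emb], dim=-1)
        return self.mlp(fused)

# Toy batch of 4 samples.
model = SimpleFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```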

Authors:Behzad Hejrati, Soumyanil Banerjee, Carri Glide-Hurst, Ming Dong
Title: Conditional diffusion model with spatial attention and latent embedding for medical image segmentation
Abstract:
Diffusion models have been used extensively for high quality image and video generation tasks. In this paper, we propose a novel conditional diffusion model with spatial attention and latent embedding (cDAL) for medical image segmentation. In cDAL, a convolutional neural network (CNN) based discriminator is used at every time-step of the diffusion process to distinguish between the generated labels and the real ones. A spatial attention map is computed based on the features learned by the discriminator to help cDAL generate more accurate segmentation of discriminative regions in an input image. Additionally, we incorporated a random latent embedding into each layer of our model to significantly reduce the number of training and sampling time-steps, thereby making it much faster than other diffusion models for image segmentation. We applied cDAL on 3 publicly available medical image segmentation datasets (MoNuSeg, Chest X-ray and Hippocampus) and observed significant qualitative and quantitative improvements with higher Dice scores and mIoU over the state-of-the-art algorithms. The source code is publicly available at https://github.com/Hejrati/cDAL/.
中文: 本文提出了一种结合空间注意力和潜在嵌入的条件扩散模型cDAL,通过提升判别区域分割精度和大幅减少训练步骤,在三个公开医学图像数据集上实现了超越现有最佳算法的分割性能。
English: This paper introduces cDAL, a conditional diffusion model with spatial attention and latent embedding that enhances medical image segmentation by improving accuracy and reducing computational time, achieving superior performance on three public datasets.

Authors:Yixing Lu, Junting Dong, Youngjoong Kwon, Qin Zhao, Bo Dai, Fernando De la Torre
Title: GAS: Generative Avatar Synthesis from a Single Image
Abstract:
We present a unified and generalizable framework for synthesizing view-consistent and temporally coherent avatars from a single image, addressing the challenging task of single-image avatar generation. Existing diffusion-based methods often condition on sparse human templates (e.g., depth or normal maps), which leads to multi-view and temporal inconsistencies due to the mismatch between these signals and the true appearance of the subject. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. In a first step, an initial 3D reconstructed human through a generalized NeRF provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. Subsequently, the derived geometry and appearance from the generalized NeRF serve as input to a video-based diffusion model. This strategic integration is pivotal for enforcing both multi-view and temporal consistency throughout the avatar's generation. Empirical results underscore the superior generalization ability of our proposed method, demonstrating its effectiveness across diverse in-domain and out-of-domain in-the-wild datasets.

Authors:Wen Zhou, Shuichiro Miwa, Yang Liu, Koji Okamoto
Title: BF-GAN: Development of an AI-driven Bubbly Flow Image Generation Model Using Generative Adversarial Networks
Abstract:
A generative AI architecture called bubbly flow generative adversarial networks (BF-GAN) is developed, designed to generate realistic and high-quality bubbly flow images through physically conditioned inputs, jg and jf. Initially, 52 sets of bubbly flow experiments under varying conditions are conducted to collect 140,000 bubbly flow images with physical labels of jg and jf for training data. A multi-scale loss function is then developed, incorporating mismatch loss and pixel loss to further enhance the generative performance of BF-GAN. Regarding evaluative metrics of generative AI, BF-GAN surpasses conventional GANs. Physically, key parameters of bubbly flow generated by BF-GAN are extracted and compared with measurement values and empirical correlations, validating BF-GAN's generative performance. The comparative analysis demonstrates that BF-GAN can generate realistic and high-quality bubbly flow images for any given jg and jf within the research scope. BF-GAN offers a generative AI solution for two-phase flow research, substantially lowering the time and cost required to obtain high-quality data. In addition, it can function as a benchmark dataset generator for bubbly flow detection and segmentation algorithms, enhancing overall productivity in this research domain. The BF-GAN model is available online (https://github.com/zhouzhouwen/BF-GAN).
中文: BF-GAN是一种生成对抗网络架构,通过物理条件输入生成高质量气泡流图像,其性能超越传统GAN并验证了关键参数,显著降低了研究时间和成本。
English: The BF-GAN is a generative AI architecture that produces realistic bubbly flow images using physical inputs, outperforming traditional GANs and validating key parameters against experimental data while reducing research time and costs.

Authors:Xingye Chen, Wei Feng, Zhenbang Du, Weizhen Wang, Yanyin Chen, Haohan Wang, Linkai Liu, Yaoyu Li, Jinyuan Zhao, Yu Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Zhangang Lin, Jingping Shao, Yuanjie Shao, Xinge You, Changxin Gao, Nong Sang
Title: CTR-Driven Advertising Image Generation with Multimodal Large Language Models
Abstract:
In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods for generating product backgrounds focus primarily on aesthetic quality, which may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. Firstly, we build targeted pre-training tasks and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation tasks. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL), which can jointly utilize multimodal features and accurately reflect user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments have demonstrated that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: https://github.com/Chenguoz/CAIG.
中文: 本研究提出了一种利用多模态大语言模型生成广告图像的新方法,通过针对性预训练、结合奖励模型的强化学习以及以产品为中心的优化策略,显著提升了点击率,在线上线下指标中均实现了最优性能。
English: This study introduces a novel approach using Multimodal Large Language Models (MLLMs) to generate advertising images optimized for Click-Through Rate (CTR), employing targeted pre-training, reinforcement learning with a reward model, and product-centric optimization to achieve state-of-the-art performance in both online and offline metrics.

Authors:Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
Title: EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Abstract:
Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
中文摘要:无编码器的视觉语言模型EVEv2.0通过分层模态关联和优化训练策略,在数据效率和视觉推理能力上表现出色,正快速缩小与基于编码器模型的性能差距。
English Summary: Encoder-free vision-language models like EVEv2.0 are closing performance gaps with encoder-based models through hierarchical modality association and optimized training strategies, demonstrating superior data efficiency and vision-reasoning capabilities.

Authors:Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari
Title: Visual Agentic AI for Spatial Reasoning with a Dynamic API
Abstract:
Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab-caltech.github.io/vadar/

Authors:Yue Zhu, Haiwen Diao, Shang Gao, Long Chen, Huchuan Lu
Title: KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification
Abstract:
Fine-tuning pre-trained vision models for specific tasks is a common practice in computer vision. However, this process becomes more expensive as models grow larger. Recently, parameter-efficient fine-tuning (PEFT) methods have emerged as a popular solution to improve training efficiency and reduce storage needs by tuning additional low-rank modules within pre-trained backbones. Despite their advantages, they struggle with limited representation capabilities and misalignment with pre-trained intermediate features. To address these issues, we introduce an innovative Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST) for various recognition tasks. Specifically, its multi-kernel design extends Kronecker projections horizontally and separates adaptation matrices into multiple complementary spaces, reducing parameter dependency and creating more compact subspaces. Besides, it incorporates extra learnable re-scaling factors to better align with pre-trained feature distributions, allowing for more flexible and balanced feature aggregation. Extensive experiments validate that our KARST outperforms other PEFT counterparts with a negligible inference cost due to its re-parameterization characteristics. Code is publicly available at: https://github.com/Lucenova/KARST.
Chinese: 提出的多核Kronecker适应与重缩放传输(KARST)方法通过扩展投影空间和特征对齐,以极低的推理成本实现了优于同类参数高效微调方法的性能。
English: The proposed Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST) enhances parameter-efficient fine-tuning by expanding projection spaces and aligning features, achieving superior performance with minimal inference cost.
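A rough sketch of the Kronecker adaptation idea: the frozen weight is augmented by a sum of Kronecker products of small factors, with a learnable re-scaling vector on the output, and the update can be folded back into the weight for inference. Dimensions, initialization, and the number of kernels below are assumptions, not KARST's exact configuration.

```python
import torch
import torch.nn as nn

class KroneckerAdapter(nn.Module):
    """Sketch of multi-kernel Kronecker adaptation of a frozen linear layer.

    The frozen weight W (d_out x d_in) is augmented by a sum of Kronecker
    products kron(A_k, B_k), each a low-parameter update, plus a learnable
    re-scaling vector applied to the adapted output.
    """

    def __init__(self, d_out=768, d_in=768, factor=16, num_kernels=2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02,
                                   requires_grad=False)   # frozen backbone weight
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(factor, factor) * 0.01)
             for _ in range(num_kernels)])
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out // factor, d_in // factor))
             for _ in range(num_kernels)])                # zero-init: no update at start
        self.rescale = nn.Parameter(torch.ones(d_out))    # re-scaling transmission

    def forward(self, x):
        delta = sum(torch.kron(a, b) for a, b in zip(self.A, self.B))
        w = self.weight + delta           # can be re-parameterized (merged) at test time
        return nn.functional.linear(x, w) * self.rescale

adapter = KroneckerAdapter()
print(adapter(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```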

Authors:Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, Kaipeng Zhang
Title: SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement
Abstract:
In this paper, we explore a principal way to enhance the quality of widely pre-existing coarse masks, enabling them to serve as reliable training data for segmentation models to reduce the annotation cost. In contrast to prior refinement techniques that are tailored to specific models or tasks in a closed-world manner, we propose SAMRefiner, a universal and efficient approach by adapting SAM to the mask refinement task. The core technique of our model is the noise-tolerant prompting scheme. Specifically, we introduce a multi-prompt excavation strategy to mine diverse input prompts for SAM (i.e., distance-guided points, context-aware elastic bounding boxes, and Gaussian-style masks) from initial coarse masks. These prompts can collaborate with each other to mitigate the effect of defects in coarse masks. In particular, considering the difficulty of SAM to handle the multi-object case in semantic segmentation, we introduce a split-then-merge (STM) pipeline. Additionally, we extend our method to SAMRefiner++ by introducing an additional IoU adaption step to further boost the performance of the generic SAMRefiner on the target dataset. This step is self-boosted and requires no additional annotation. The proposed framework is versatile and can flexibly cooperate with existing segmentation methods. We evaluate our framework on a wide range of benchmarks under different settings, demonstrating better accuracy and efficiency. SAMRefiner holds significant potential to expedite the evolution of refinement tools. Our code is available at https://github.com/linyq2117/SAMRefiner.
Chinese: 本文提出SAMRefiner,一种通用高效的方法,通过适配SAM模型进行掩码优化,利用抗噪提示策略将粗糙掩码转化为可靠的分割训练数据,从而降低标注成本。
English: This paper introduces SAMRefiner, a universal and efficient method that adapts the Segment Anything Model (SAM) to refine coarse masks into reliable training data for segmentation models, reducing annotation costs through noise-tolerant prompting and a split-then-merge pipeline.
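One of the prompt types, distance-guided points, can be sketched simply: take the pixel of a coarse mask that is farthest from the mask boundary (via a distance transform) as a robust positive click for SAM. This is an illustrative fragment, not the full multi-prompt excavation pipeline.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_guided_point(coarse_mask: np.ndarray) -> tuple:
    """Pick a positive point prompt for SAM from a coarse binary mask.

    The point deepest inside the mask (largest distance to the mask
    boundary) is least likely to sit on a noisy edge, so it remains a
    reliable positive click even when the coarse mask has ragged borders.
    """
    dist = distance_transform_edt(coarse_mask.astype(bool))
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return int(x), int(y)   # SAM expects (x, y) point coordinates

# Toy coarse mask: a square blob.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:45, 15:40] = 1
print(distance_guided_point(mask))
```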

Authors:Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
Title: Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
Abstract:
To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical usage of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos and models on our project website: https://osu-nlp-group.github.io/SAE-V.
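For readers unfamiliar with SAEs, a minimal sparse autoencoder over frozen vision-model activations looks like the following; the dictionary size, L1 coefficient, and training details are assumptions and do not reflect the authors' exact setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete sparse autoencoder over frozen ViT activations:
    a linear encoder with ReLU and L1 sparsity, and a linear decoder
    that reconstructs the original activation from the sparse code."""

    def __init__(self, d_model=768, d_hidden=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(recon, acts, codes, l1_coef=1e-3):
    """Reconstruction error plus an L1 penalty that drives most code
    units to zero, so the remaining active units tend to be interpretable."""
    return ((recon - acts) ** 2).mean() + l1_coef * codes.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(32, 768)        # e.g., patch activations from a frozen ViT
recon, codes = sae(acts)
print(sae_loss(recon, acts, codes))
```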

Authors:Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, Jiang Gui
Title: Learning Musical Representations for Music Performance Question Answering
Abstract:
Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances continuously involve dense audio signals throughout. While existing multimodal learning methods on the audio-video QA demonstrate impressive capabilities in general scenarios, they are incapable of dealing with fundamental problems within the music performances: they underexplore the interaction between the multimodal signals in performance and fail to consider the distinctive characteristics of instruments and music. Therefore, existing methods tend to answer questions regarding musical performances inaccurately. To bridge the above research gaps, (i) given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; (ii) to enable the model to learn music characteristics, we annotate and release rhythmic and music sources in the current music datasets; (iii) for time-aware audio-visual modeling, we align the model's music predictions with the temporal dimension. Our experiments show state-of-the-art effects on the Music AVQA datasets. Our code is available at https://github.com/xid32/Amuse.
中文摘要:本研究针对现有多模态学习方法在音乐表演分析中的不足,开发了一个增强视听交互建模、融合音乐特性并确保时间对齐的框架,在音乐视听问答数据集上取得了领先效果。
English Summary: This study addresses the limitations of existing multimodal learning methods in music performance analysis by developing a framework that enhances audio-visual interaction modeling, incorporates musical characteristics, and ensures temporal alignment, achieving state-of-the-art results on Music AVQA datasets.

Authors:Bessie Dominguez-Dager, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla
Title: CHIRLA: Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis
Abstract:
Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across cameras, locations, and time. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust systems that handle long-term variations caused by clothing and physical changes. We present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset designed for video-based long-term person Re-ID. CHIRLA was recorded over seven months in four connected indoor environments using seven strategically placed cameras, capturing realistic movements with substantial clothing and appearance variability. The dataset includes 22 individuals, more than five hours of video, and about 1M bounding boxes with identity annotations obtained through semi-automatic labeling. We also define benchmark protocols for person tracking and Re-ID, covering diverse and challenging scenarios such as occlusion, reappearance, and multi-camera conditions. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios. The benchmark code is publicly available at: https://github.com/bdager/CHIRLA.
中文摘要:CHIRLA数据集通过提供七个月的多摄像头视频数据,包含显著的服装变化,解决了长期行人重识别难题,并建立了基准测试以提升实际应用中的Re-ID算法性能。
English Summary: The CHIRLA dataset addresses long-term person re-identification challenges by providing seven months of multi-camera video data with significant clothing variations, establishing benchmarks to improve Re-ID algorithms for real-world applications.

Authors:Shihuan He, Zhihui Lai, Ruxin Wang, Heng Kong
Title: Prototype Contrastive Consistency Learning for Semi-Supervised Medical Image Segmentation
Abstract:
Medical image segmentation is a crucial task in medical image analysis, but it can be very challenging, especially when labeled data are scarce and unlabeled data are abundant. Contrastive learning has proven to be effective for medical image segmentation in semi-supervised learning by constructing contrastive samples from partial pixels. However, although previous contrastive learning methods can mine semantic information from partial pixels within images, they ignore the whole context information of unlabeled images, which is very important to precise segmentation. In order to solve this problem, we propose a novel prototype contrastive learning method called Prototype Contrastive Consistency Segmentation (PCCS) for semi-supervised medical image segmentation. The core idea is to enforce the prototypes of the same semantic class to be closer and push the prototypes in different semantic classes far away from each other. Specifically, we construct a signed distance map and an uncertainty map from unlabeled images. The signed distance map is used to construct prototypes for contrastive learning, and then we estimate the prototype uncertainty from the uncertainty map as a trade-off among prototypes. In order to obtain better prototypes, based on the student-teacher architecture, a new mechanism named prototype updating prototype is designed to assist in updating the prototypes for contrastive learning. In addition, we propose an uncertainty-consistency loss to mine more reliable information from unlabeled data. Extensive experiments on medical image segmentation demonstrate that PCCS achieves better segmentation performance than the state-of-the-art methods. The code is available at https://github.com/comphsh/PCCS.
中文: 提出的原型对比一致性分割(PCCS)方法通过结合原型对比学习和不确定性一致性损失,在半监督医学图像分割中实现了优于现有技术的分割性能。
English: The proposed Prototype Contrastive Consistency Segmentation (PCCS) method enhances semi-supervised medical image segmentation by leveraging prototype contrastive learning with uncertainty-consistency loss, achieving superior performance over existing approaches.

Authors:Jiachen Li, Xiaojin Gong
Title: Unleashing the Potential of Pre-Trained Diffusion Models for Generalizable Person Re-Identification
Abstract:
Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at https://github.com/RikoLi/DCAC.
Chinese: 本文提出DCAC方法,通过相关感知条件机制将判别式与对比式重识别模型同预训练扩散模型结合,有效缓解捷径学习问题并提升特征泛化能力,从而在领域泛化重识别任务中实现最优性能。
English: This paper introduces DCAC, a novel method that integrates discriminative and contrastive Re-ID models with a pre-trained diffusion model using a correlation-aware conditioning scheme to enhance domain-generalizable re-identification by mitigating shortcut learning and improving feature generalization.

Authors:Kamil Garifullin, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
Title: MaterialFusion: High-Quality, Zero-Shot, and Controllable Material Transfer with Diffusion Models
Abstract:
Manipulating the material appearance of objects in images is critical for applications like augmented reality, virtual prototyping, and digital content creation. We present MaterialFusion, a novel framework for high-quality material transfer that allows users to adjust the degree of material application, achieving an optimal balance between new material properties and the object's original features. MaterialFusion seamlessly integrates the modified object into the scene by maintaining background consistency and mitigating boundary artifacts. To thoroughly evaluate our approach, we have compiled a dataset of real-world material transfer examples and conducted complex comparative analyses. Through comprehensive quantitative evaluations and user studies, we demonstrate that MaterialFusion significantly outperforms existing methods in terms of quality, user control, and background preservation. Code is available at https://github.com/ControlGenAI/MaterialFusion.
Chinese: MaterialFusion是一种高质量材质转换的新框架,允许用户调节材质应用程度,在保持物体特征和背景一致性的同时,显著优于现有方法的质量和可控性。
English: MaterialFusion is a novel framework for high-quality material transfer that enables adjustable material application while preserving object features and background consistency, outperforming existing methods in quality and user control.

Authors:Paschalis Giakoumoglou, Dimitrios Karageorgiou, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Title: SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting
Abstract:
Recent advancements in generative AI have made text-guided image inpainting - adding, removing, or altering image regions using textual prompts - widely accessible. However, generating semantically correct photorealistic imagery, typically requires carefully-crafted prompts and iterative refinement by evaluating the realism of the generated content - tasks commonly performed by humans. To automate the generative process, we propose Semantically Aligned and Uncertainty Guided AI Image Inpainting (SAGI), a model-agnostic pipeline, to sample prompts from a distribution that closely aligns with human perception and to evaluate the generated content and discard instances that deviate from such a distribution, which we approximate using pretrained large language models and vision-language models. By applying this pipeline on multiple state-of-the-art inpainting models, we create the SAGI Dataset (SAGI-D), currently the largest and most diverse dataset of AI-generated inpaintings, comprising over 95k inpainted images and a human-evaluated subset. Our experiments show that semantic alignment significantly improves image quality and aesthetics, while uncertainty guidance effectively identifies realistic manipulations - human ability to distinguish inpainted images from real ones drops from 74% to 35% in terms of accuracy, after applying our pipeline. Moreover, using SAGI-D for training several image forensic approaches increases in-domain detection performance on average by 37.4% and out-of-domain generalization by 26.1% in terms of IoU, also demonstrating its utility in countering malicious exploitation of generative AI. Code and dataset are available at https://mever-team.github.io/SAGI/

Authors:Haokai Zhao, Haowei Lou, Lina Yao, Wei Peng, Ehsan Adeli, Kilian M Pohl, Yu Zhang
Title: Diffusion Models for Computational Neuroimaging: A Survey
Abstract:
Computational neuroimaging involves analyzing brain images or signals to provide mechanistic insights and predictive tools for human cognition and behavior. While diffusion models have shown stability and high-quality generation in natural images, there is increasing interest in adapting them to analyze brain data for various neurological tasks such as data enhancement, disease diagnosis and brain decoding. This survey provides an overview of recent efforts to integrate diffusion models into computational neuroimaging. We begin by introducing the common neuroimaging data modalities, followed by the diffusion formulations and conditioning mechanisms. Then we discuss how the variations of the denoising starting point, condition input and generation target of diffusion models are developed and enhance specific neuroimaging tasks. For a comprehensive overview of the ongoing research, we provide a publicly available repository at https://github.com/JoeZhao527/dm4neuro.
Chinese: 本综述探讨了如何将扩散模型整合到计算神经影像学中,通过调整其公式和条件机制,以改进数据增强、疾病诊断和脑解码等任务。
English: This survey explores the integration of diffusion models into computational neuroimaging to enhance tasks like data improvement, disease diagnosis, and brain decoding by adapting their formulations and conditioning mechanisms.

Authors:Soobin Um, Beomsu Kim, Jong Chul Ye
Title: Boost-and-Skip: A Simple Guidance-Free Diffusion for Minority Generation
Abstract:
Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation, creative content generation, etc. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated for minority generation. To address this, here we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework is that it requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting the emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations. Code is available at https://github.com/soobin-um/BnS.
中文摘要:本文提出的Boost-and-Skip方法通过方差增强初始化和时间步跳跃这两个简单修改,无需引导机制即可有效生成数据流形中的少数样本,在保持与先进引导方法相当性能的同时大幅降低了计算开销。
English Summary: The paper introduces Boost-and-Skip, a computationally efficient guidance-free method that enhances minority sample generation in diffusion models through variance-boosted initialization and timestep skipping, achieving performance comparable to state-of-the-art guidance-based approaches with significantly reduced computational costs.
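A minimal sketch of the two modifications, assuming a generic epsilon-prediction diffusion model and a precomputed cumulative-alpha schedule; the boost factor and skip length below are placeholders, not the paper's settings.

```python
import torch

@torch.no_grad()
def boost_and_skip_sample(model, alphas_cumprod, shape, boost=1.25, skip=100, device="cuda"):
    """Guidance-free sampling with (i) variance-boosted init and (ii) timestep skipping."""
    T = len(alphas_cumprod)
    # (i) variance-boosted initialization: start from noise with inflated standard deviation
    x = boost * torch.randn(shape, device=device)
    # (ii) timestep skipping: begin denoising at T - skip instead of T
    for t in reversed(range(T - skip)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()         # estimate of the clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps     # DDIM-style deterministic step
    return x
```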

Authors:Hongyu Qu, Jianan Wei, Xiangbo Shu, Wenguan Wang
Title: Learning Clustering-based Prototypes for Compositional Zero-shot Learning
Abstract:
Learning primitive (i.e., attribute and object) concepts from seen compositions is the primary challenge of Compositional Zero-Shot Learning (CZSL). Existing CZSL solutions typically rely on oversimplified data assumptions, e.g., modeling each primitive with a single centroid primitive representation, ignoring the natural diversities of the attribute (resp. object) when coupled with different objects (resp. attribute). In this work, we develop ClusPro, a robust clustering-based prototype mining framework for CZSL that defines the conceptual boundaries of primitives through a set of diversified prototypes. Specifically, ClusPro conducts within-primitive clustering on the embedding space for automatically discovering and dynamically updating prototypes. These representative prototypes are subsequently used to repaint a well-structured and independent primitive embedding space, ensuring intra-primitive separation and inter-primitive decorrelation through prototype-based contrastive learning and decorrelation learning. Moreover, ClusPro efficiently performs prototype clustering in a non-parametric fashion without the introduction of additional learnable parameters or computational budget during testing. Experiments on three benchmarks demonstrate ClusPro outperforms various top-leading CZSL solutions under both closed-world and open-world settings.
Chinese: ClusPro是一种基于聚类的原型挖掘框架,通过多样化原型定义原始概念并重构嵌入空间,在组合零样本学习中实现了优于现有方法的性能。
English: ClusPro is a robust clustering-based prototype mining framework for Compositional Zero-Shot Learning that discovers diversified prototypes to define primitive concepts and restructures the embedding space for improved performance.
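A hedged sketch of within-primitive prototype mining and a prototype-based contrastive term, using off-the-shelf k-means as a stand-in for the paper's non-parametric online clustering; `embeddings`, `primitive_ids`, and the number of prototypes are assumptions, and the decorrelation loss is omitted.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def mine_prototypes(embeddings, primitive_ids, k=5):
    """Return a dict: primitive id -> (k, d) tensor of L2-normalised prototypes."""
    protos = {}
    for pid in primitive_ids.unique().tolist():
        feats = embeddings[primitive_ids == pid].detach().cpu().numpy()
        k_eff = min(k, len(feats))
        centers = KMeans(n_clusters=k_eff, n_init="auto").fit(feats).cluster_centers_
        protos[pid] = F.normalize(torch.tensor(centers, dtype=torch.float32), dim=-1)
    return protos

def prototype_contrastive_loss(z, pid, protos, tau=0.1):
    """Pull an embedding towards its nearest own-primitive prototype, push it from all others."""
    z = F.normalize(z, dim=-1)
    pos = (z @ protos[pid].T).max() / tau
    neg = torch.cat([z @ p.T for q, p in protos.items() if q != pid]) / tau
    return -pos + torch.logsumexp(torch.cat([pos.view(1), neg]), dim=0)
```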

Authors:Vlad Hosu, Lorenzo Agnolucci, Daisuke Iso, Dietmar Saupe
Title: Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
Abstract:
Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale where an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves the performance compared to using only ground-truth labels. The code, dataset, and pre-trained models are available at https://github.com/SonyResearch/IISA.
Chinese: 本研究提出了图像内在尺度概念及其评估任务,以系统量化图像尺度对感知质量的影响,并设计了一种弱标注策略,在适配尺度预测时显著提升了图像质量评估方法的性能。
English: This study introduces the Image Intrinsic Scale (IIS) and a corresponding assessment task to systematically quantify how image scale affects perceived quality, proposing a weak-labeling strategy that enhances IQA method performance when adapted for scale prediction.

Authors:Weijia Mao, Zhenheng Yang, Mike Zheng Shou
Title: UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
Abstract:
Unified multimodal transformers, which handle both generation and understanding tasks within a shared parameter space, have received increasing attention in recent research. Although various unified transformers have been proposed, training these models is costly due to redundant tokens and heavy attention computation. In the past, studies on large language models have demonstrated that token pruning methods, such as Mixture of Depths (MoD), can significantly improve computational efficiency. MoD employs a router to select the most important ones for processing within a transformer layer. However, directly applying MoD-based token pruning to unified transformers will result in suboptimal performance because different tasks exhibit varying levels of token redundancy. In our work, we analyze the unified transformers by (1) examining attention weight patterns, (2) evaluating the layer importance and token redundancy, and (3) analyzing task interactions. Our findings reveal that token redundancy is primarily influenced by different tasks and layers. Building on these findings, we introduce UniMoD, a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned. We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks. Code will be released at https://github.com/showlab/UniMoD.
中文: UniMoD提出了一种任务感知的令牌剪枝方法,能在统一多模态Transformer中降低高达40%的计算成本,同时保持或提升多项任务的性能表现。
English: UniMoD introduces a task-aware token pruning method that reduces computational costs by up to 40% in unified multimodal transformers while preserving or enhancing performance across various tasks.
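A hedged sketch of the per-task router idea: each task gets its own lightweight router that scores tokens, and only a task-dependent fraction survives the layer, in the spirit of Mixture-of-Depths. The capacity ratios and the linear scoring head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TaskAwareTokenRouter(nn.Module):
    def __init__(self, dim, task_capacity):
        super().__init__()
        # one separate router per task, as UniMoD prescribes
        self.routers = nn.ModuleDict({t: nn.Linear(dim, 1) for t in task_capacity})
        self.task_capacity = task_capacity  # e.g. {"understanding": 0.5, "generation": 0.9}

    def forward(self, tokens, task):
        """tokens: (B, N, D); returns the kept tokens and their indices for this task."""
        scores = self.routers[task](tokens).squeeze(-1)                 # (B, N)
        k = max(1, int(self.task_capacity[task] * tokens.shape[1]))
        idx = scores.topk(k, dim=1).indices                             # top-k tokens per sample
        kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        return kept, idx
```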

Authors:Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi
Title: Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments
Abstract:
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs - Claude-3, Gemini-1.5, and GPT-4o - are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
中文摘要:本文提出一个用于评估视觉语言模型在视频光学字符识别任务中表现的开源基准,结果表明尽管视觉语言模型在许多场景下优于传统OCR系统,但仍面临幻觉内容和对复杂文本敏感等挑战。
English Summary: This paper presents an open-source benchmark for evaluating Vision-Language Models on video OCR tasks, revealing that while VLMs can outperform traditional OCR systems in many scenarios, they still face challenges like hallucinations and sensitivity to complex text.
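The reported metrics are standard and can be reproduced with the jiwer package; the snippet below is a minimal sketch on toy strings, and the benchmark's own text normalisation rules are not assumed.

```python
import jiwer

reference = "def main(): print('hello world')"
hypothesis = "def main(): print('helo world')"

wer = jiwer.wer(reference, hypothesis)   # word error rate
cer = jiwer.cer(reference, hypothesis)   # character error rate
accuracy = 1.0 - cer                     # one possible accuracy proxy (not necessarily the paper's)
print(f"WER={wer:.3f}  CER={cer:.3f}  acc={accuracy:.3f}")
```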

Authors:Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang
Title: Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
Abstract:
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
中文: 本文提出了一个公平评估数据集蒸馏与剪枝方法的基准,发现随机子集在主流设定中可与依赖软标签的复杂方法相媲美,并提出了仅使用硬标签的新框架PCA,通过聚焦图像数据实现了最优性能。
English: This paper introduces a benchmark that fairly compares dataset distillation and pruning methods, revealing that random subsets can match the performance of complex soft-label approaches, and proposes a new hard-label-only framework called PCA that achieves state-of-the-art results by focusing on image data.

Authors:Huaqiu Li, Wang Zhang, Xiaowan Hu, Tao Jiang, Zikang Chen, Haoqian Wang
Title: Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising
Abstract:
Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pairs sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes the preservation of structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap from images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID. Our code will be released at https://github.com/huaqlili/Prompt-SID.
中文: 本文提出Prompt-SID自监督图像去噪框架,通过提示学习保留结构细节,克服了现有方法的局限,在多个数据集上展现出卓越性能。
English: This paper introduces Prompt-SID, a self-supervised image denoising framework that uses prompt learning to preserve structural details, overcoming limitations of existing methods and demonstrating strong performance across multiple datasets.

Authors:Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong
Title: CoS: Chain-of-Shot Prompting for Long Video Understanding
Abstract:
Multi-modal Large Language Models (MLLMs) struggle with long videos due to the need for excessive visual tokens. These tokens massively exceed the context length of MLLMs, so the context becomes filled with redundant, task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimize long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.
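A hedged sketch of the binary relevance code at the heart of shot selection: score each shot against the task query with any vision-language encoder, keep the highest-scoring fraction as positives, and treat the rest as negatives for co-reasoning. The keep ratio and thresholding rule are illustrative; CoS's pseudo temporal grounding and test-time optimisation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def binary_shot_code(shot_embs, query_emb, keep_ratio=0.25):
    """shot_embs: (num_shots, d) shot features; query_emb: (d,) task/query feature."""
    sims = F.cosine_similarity(shot_embs, query_emb.unsqueeze(0), dim=-1)   # (num_shots,)
    k = max(1, int(keep_ratio * len(sims)))
    code = torch.zeros_like(sims, dtype=torch.bool)
    code[sims.topk(k).indices] = True                 # 1 = task-relevant (positive) shot
    positives = torch.nonzero(code).flatten()
    negatives = torch.nonzero(~code).flatten()
    return code, positives, negatives
```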

Authors:Aobotao Dai, Xinyu Ma, Lei Chen, Songze Li, Lin Wang
Title: When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs
Abstract:
Vision-Language Models (VLMs) have gained considerable prominence in recent years due to their remarkable capability to effectively integrate and process both textual and visual information. This integration has significantly enhanced performance across a diverse spectrum of applications, such as scene perception and robotics. However, the deployment of VLMs has also given rise to critical safety and security concerns, necessitating extensive research to assess the potential vulnerabilities these VLM systems may harbor. In this work, we present an in-depth survey of the attack strategies tailored for VLMs. We categorize these attacks based on their underlying objectives - namely jailbreak, camouflage, and exploitation - while also detailing the various methodologies employed for data manipulation of VLMs. Meanwhile, we outline corresponding defense mechanisms that have been proposed to mitigate these vulnerabilities. By discerning key connections and distinctions among the diverse types of attacks, we propose a compelling taxonomy for VLM attacks. Moreover, we summarize the evaluation metrics that comprehensively describe the characteristics and impact of different attacks on VLMs. Finally, we conclude with a discussion of promising future research directions that could further enhance the robustness and safety of VLMs, emphasizing the importance of ongoing exploration in this critical area of study. To facilitate community engagement, we maintain an up-to-date project page, accessible at: https://github.com/AobtDai/VLM_Attack_Paper_List.
中文: 视觉语言模型(VLMs)在提升多模态应用性能的同时,面临越狱、伪装等安全威胁,促使研究防御策略并探索未来防护方向。
English: Vision-Language Models (VLMs) enhance multimodal applications but face security threats like jailbreak and camouflage attacks, prompting research into defense mechanisms and future safeguards.

Authors:Yiru Jiao, Sander van Cranenburgh, Simeon Calvert, Hans van Lint
Title: Structure-preserving contrastive learning for spatial time series
Abstract:
The effectiveness of neural network models largely relies on learning meaningful latent patterns from data, where self-supervised learning of informative representations can enhance model performance and generalisability. However, self-supervised representation learning for spatially characterised time series, which are ubiquitous in the transportation domain, poses unique challenges due to the necessity of maintaining fine-grained spatio-temporal similarities in the latent space. In this study, we introduce two structure-preserving regularisers for the contrastive learning of spatial time series: one regulariser preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. To balance the contrastive learning objective and the need for structure preservation, we propose a dynamic weighting mechanism that adaptively manages this trade-off and stabilises training. We validate the proposed method through extensive experiments, including multivariate time series classification to demonstrate its general applicability, as well as macroscopic and microscopic traffic prediction to highlight its particular usefulness in encoding traffic interactions. Across all tasks, our method preserves the similarity structures more effectively and improves state-of-the-art task performance. This method can be integrated with an arbitrary neural network model and is particularly beneficial for time series data with spatial or geographical features. Furthermore, our findings suggest that well-preserved similarity structures in the latent space indicate more informative and useful representations. This provides insights to design more effective neural networks for data-driven transportation research. Our code is made openly accessible with all resulting data at https://github.com/yiru-jiao/spclt
中文: 本研究针对空间时间序列提出了两个结构保持正则化器和一个动态权重机制,通过对比学习有效保持时空相似性结构,在多项任务中提升了模型性能,尤其适用于交通领域的预测应用。
English: This study introduces two structure-preserving regularizers and a dynamic weighting mechanism for contrastive learning of spatial time series, which effectively maintains spatio-temporal similarities and enhances model performance across various tasks, particularly in transportation applications.

Authors:Oliver Boyne, Roberto Cipolla
Title: FOCUS -- Multi-View Foot Reconstruction From Synthetically Trained Dense Correspondences
Abstract:
Surface reconstruction from multiple, calibrated images is a challenging task - often requiring a large number of collected images with significant overlap. We look at the specific case of human foot reconstruction. As with previous successful foot reconstruction work, we seek to extract rich per-pixel geometry cues from multi-view RGB images, and fuse these into a final 3D object. Our method, FOCUS, tackles this problem with 3 main contributions: (i) SynFoot2, an extension of an existing synthetic foot dataset to include a new data type: dense correspondence with the parameterized foot model FIND; (ii) an uncertainty-aware dense correspondence predictor trained on our synthetic dataset; (iii) two methods for reconstructing a 3D surface from dense correspondence predictions: one inspired by Structure-from-Motion, and one optimization-based using the FIND model. We show that our reconstruction achieves state-of-the-art reconstruction quality in a few-view setting, performing comparably to state-of-the-art when many views are available, and runs substantially faster. We release our synthetic dataset to the research community. Code is available at: https://github.com/OllieBoyne/FOCUS
Chinese: FOCUS方法通过扩展合成数据集、引入不确定性感知对应点预测器及两种表面重建技术,在少量视角下实现了顶尖的三维足部重建质量,且运行速度显著提升。
English: The FOCUS method introduces a synthetic dataset extension, an uncertainty-aware correspondence predictor, and two surface reconstruction techniques to achieve state-of-the-art 3D foot reconstruction quality with fewer views and faster processing.

Authors:Sihwan Park, Doohyuk Jang, Sungyub Kim, Souvik Kundu, Eunho Yang
Title: LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models
Abstract:
Speculative decoding has been widely used to accelerate auto-regressive (AR) text generation. However, its effectiveness for visual AR models remains limited due to token selection ambiguity, where multiple tokens share similarly low probabilities and thus reduce acceptance rates. Recently, relaxed speculative decoding with dynamic tree drafting was proposed to mitigate this ambiguity, demonstrating promising results in accelerating visual AR models. However, we observe that token selection ambiguity still negatively affects dynamic tree drafting, resulting in shallow draft trees and limited acceleration. To overcome this issue, we introduce LANTERN++, a refined framework that integrates static tree drafting with a tailored relaxed acceptance condition, allowing drafts to be selected independently of low-confidence predictions. This enables the acceptance of deeper sequences, improving decoding efficiency while preserving image quality. Extensive experiments on state-of-the-art visual AR models demonstrate that LANTERN++ significantly accelerates inference, achieving up to $\mathbf{\times 2.56}$ speedup over standard AR decoding while maintaining high image quality. The code is publicly available at https://github.com/jadohu/LANTERN.
中文总结:LANTERN++ 通过结合静态树草拟和定制化的宽松接受条件,改进了视觉自回归模型的推测解码,在保持图像质量的同时实现了高达 2.56 倍的加速效果。
English Summary: LANTERN++ enhances speculative decoding for visual auto-regressive models by integrating static tree drafting with a custom relaxed acceptance condition, achieving up to 2.56× speedup while preserving image quality.

Authors:Lee Hyoseok, Kyeong Seon Kim, Kwon Byung-Ki, Tae-Hyun Oh
Title: Zero-shot Depth Completion via Test-time Alignment with Affine-invariant Depth Prior
Abstract:
Depth completion, predicting dense depth maps from sparse depth measurements, is an ill-posed problem requiring prior knowledge. Recent methods adopt learning-based approaches to implicitly capture priors, but the priors primarily fit in-domain data and do not generalize well to out-of-domain scenarios. To address this, we propose a zero-shot depth completion method composed of an affine-invariant depth diffusion model and test-time alignment. We use pre-trained depth diffusion models as depth prior knowledge, which implicitly understand how to fill in depth for scenes. Our approach aligns the affine-invariant depth prior with metric-scale sparse measurements, enforcing them as hard constraints via an optimization loop at test-time. Our zero-shot depth completion method demonstrates generalization across various domain datasets, achieving up to a 21% average performance improvement over the previous state-of-the-art methods while enhancing spatial understanding by sharpening scene details. We demonstrate that aligning a monocular affine-invariant depth prior with sparse metric measurements is a proven strategy to achieve domain-generalizable depth completion without relying on extensive training data. Project page: https://hyoseok1223.github.io/zero-shot-depth-completion/.
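A minimal sketch of the alignment step, assuming an affine-invariant depth prediction and sparse metric measurements on a per-pixel mask: a global scale and shift are optimised so the prior honours the sparse constraints. The paper additionally propagates this alignment through the diffusion sampling loop, which is omitted here.

```python
import torch

def align_affine_depth(prior_depth, sparse_depth, mask, iters=200, lr=1e-2):
    """prior_depth: (H, W) affine-invariant prediction; sparse_depth, mask: (H, W) metric points."""
    scale = torch.ones(1, requires_grad=True)
    shift = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([scale, shift], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        aligned = scale * prior_depth + shift
        # enforce the sparse metric measurements as (soft) constraints
        loss = ((aligned - sparse_depth)[mask.bool()] ** 2).mean()
        loss.backward()
        opt.step()
    return (scale * prior_depth + shift).detach()
```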

Authors:Yeho Gwon, Sehyun Hwang, Hoyoung Kim, Jungseul Ok, Suha Kwak
Title: Enhancing Cost Efficiency in Active Learning with Candidate Set Query
Abstract:
This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 48% on ImageNet64x64. The project page can be found at https://yehogwon.github.io/csq-al.
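A hedged sketch of how a conformal candidate set can be produced: calibrate a score threshold on held-out labelled data so the true class is covered with roughly 1 - alpha probability, then present only the surviving classes to the oracle. The acquisition function and labeling-cost model from the paper are not reproduced.

```python
import numpy as np

def conformal_threshold(calib_probs, calib_labels, alpha=0.1):
    """calib_probs: (n, C) softmax outputs; calib_labels: (n,) ground-truth class ids."""
    # nonconformity score: 1 - probability assigned to the true class
    scores = 1.0 - calib_probs[np.arange(len(calib_labels)), calib_labels]
    level = min(np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores), 1.0)
    return np.quantile(scores, level, method="higher")

def candidate_set(probs, q_hat):
    """Classes whose nonconformity score stays below the calibrated quantile."""
    return np.where(1.0 - probs <= q_hat)[0]
```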

Authors:Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, Liefeng Bo
Title: Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
Abstract:
Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. Beyond extracting motion signals from source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region with the exclusion of characters and our model generates characters to populate these regions while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending for feature injection. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.

Authors:Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie
Title: Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models
Abstract:
While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
Chinese: 近期大型视觉语言模型常产生与视觉输入不符的幻觉文本,而新型无训练算法DeGF通过文本到图像生成模型在解码过程中提供自反馈,有效减少幻觉生成,在多个基准测试中超越现有最优方法。
English: Recent Large Vision-Language Models (LVLMs) often produce hallucinatory text responses, but a new training-free algorithm called Decoding with Generative Feedback (DeGF) leverages text-to-image generative models to provide self-feedback during decoding, effectively reducing hallucinations and outperforming state-of-the-art methods across multiple benchmarks.
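A hedged sketch of the decoding-time correction: the next-token logits under the original image and under an auxiliary image generated from the initial response are fused, complementarily when the two views agree and contrastively when they disagree. The blending weights and the agreement test are illustrative assumptions, not the paper's exact rule.

```python
import torch

def degf_adjust_logits(logits_orig, logits_aux, agree: bool, alpha=0.5):
    """logits_orig / logits_aux: (vocab,) next-token logits from the two visual references."""
    if agree:
        # complementary decoding: the generated image supports the response, so blend evidence
        return (1 - alpha) * logits_orig + alpha * logits_aux
    # contrastive decoding: push away from tokens favoured only under the original view
    return (1 + alpha) * logits_orig - alpha * logits_aux

# example: pick the next token after correction
# next_token = degf_adjust_logits(l_orig, l_aux, agree=False).softmax(-1).argmax(-1)
```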

Authors:Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Bruno Ribeiro, Chaim Baskin, Moshe Eliasof
Title: DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations
Abstract:
Pre-trained Vision Transformers now serve as powerful tools for computer vision. Yet, efficiently adapting them for multiple tasks remains a challenge that arises from the need to modify the rich hidden representations encoded by the learned weight matrices, without inducing interference between tasks. Current parameter-efficient methods like LoRA, which apply low-rank updates, force tasks to compete within constrained subspaces, ultimately degrading performance. We introduce DiTASK, a novel Diffeomorphic Multi-Task Fine-Tuning approach that maintains pre-trained representations by preserving weight matrix singular vectors, while enabling task-specific adaptations through neural diffeomorphic transformations of the singular values. By following this approach, DiTASK enables both shared and task-specific feature modulations with minimal added parameters. Our theoretical analysis shows that DiTASK achieves full-rank updates during optimization, preserving the geometric structure of pre-trained features, and establishing a new paradigm for efficient multi-task learning (MTL). Our experiments on PASCAL MTL and NYUD show that DiTASK achieves state-of-the-art performance across four dense prediction tasks, using 75% fewer parameters than existing methods. Our code is available [here](https://github.com/ipsitmantri/DiTASK).
中文: DiTASK提出了一种新颖的多任务微调方法,通过微分同胚变换高效适配预训练视觉变换器,以显著更少的参数实现了最先进的性能。
English: DiTASK introduces a novel multi-task fine-tuning method that uses diffeomorphic transformations to adapt pre-trained vision transformers efficiently, achieving state-of-the-art performance with significantly fewer parameters.
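A hedged sketch of the core mechanism: the singular vectors of a frozen pre-trained weight are kept, and only a learned 1D transformation of the singular values is task-specific. A small MLP with a softplus stands in for the paper's neural diffeomorphic transformation, so this is an illustrative approximation rather than the exact parameterisation.

```python
import torch
import torch.nn as nn

class SingularValueModulation(nn.Module):
    def __init__(self, weight: torch.Tensor, hidden=16):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)    # frozen left singular vectors
        self.register_buffer("S", S)    # frozen singular values
        self.register_buffer("Vh", Vh)  # frozen right singular vectors
        # small learned transformation applied pointwise to the singular values
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):
        # softplus keeps transformed singular values positive (monotonicity not enforced here)
        s_new = nn.functional.softplus(self.S + self.mlp(self.S.unsqueeze(-1)).squeeze(-1))
        W_new = self.U @ torch.diag(s_new) @ self.Vh
        return x @ W_new.T
```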

Authors:Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui
Title: Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding
Abstract:
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.
中文: 提出的时序工作记忆(TWM)模块通过选择性保留跨时间序列的任务相关信息,优化多模态基础模型的容量使用,显著提升了视频与音频处理任务的性能表现。
English: The proposed temporal working memory (TWM) module enhances multimodal foundation models by selectively retaining task-relevant information across temporal sequences, significantly improving performance in video and audio tasks through optimized capacity utilization.
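A hedged sketch of the query-guided selection: pooled per-segment features are scored by attention against the task query, and only the top-k segments are retained, in temporal order, before being handed to the multimodal foundation model. Shapes and k are illustrative.

```python
import torch
import torch.nn.functional as F

def select_segments(segment_feats, query_feat, k=8):
    """segment_feats: (T, d) pooled per-segment features; query_feat: (d,) text-query feature."""
    attn = F.softmax(segment_feats @ query_feat / segment_feats.shape[-1] ** 0.5, dim=0)  # (T,)
    keep = attn.topk(min(k, len(attn))).indices.sort().values   # keep top-k, preserve temporal order
    return segment_feats[keep], keep
```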

Authors:Raza Imam, Asif Hanif, Jian Zhang, Khaled Waleed Dawoud, Yova Kementchedjhieva, Mohammad Yaqub
Title: Noise is an Efficient Learner for Zero-Shot Vision-Language Models
Abstract:
Recently, test-time adaptation has garnered attention as a method for tuning models without labeled data. The conventional modus operandi for adapting pre-trained vision-language models (VLMs) during test-time primarily focuses on tuning learnable prompts; however, this approach overlooks potential distribution shifts in the visual representations themselves. In this work, we address this limitation by introducing Test-Time Noise Tuning (TNT), a novel method for handling unpredictable shifts in the visual space. TNT leverages, for the first time, a noise adaptation strategy that optimizes learnable noise directly in the visual input space, enabling adaptive feature learning from a single test sample. We further introduce a novel approach for inter-view representation alignment by explicitly enforcing coherence in embedding distances, ensuring consistent feature representations across views. Combined with scaled logits and confident view selection at inference, TNT substantially enhances VLM generalization and calibration, achieving average gains of +7.38% on natural distribution benchmarks and +0.80% on cross-dataset evaluations over zero-shot CLIP. These improvements lay a strong foundation for adaptive out-of-distribution handling.
中文: 本文提出测试时噪声调优(TNT)方法,通过在视觉输入空间优化可学习噪声来处理不可预测的分布偏移,相比零样本CLIP显著提升了视觉语言模型的泛化能力和校准性能。
English: This paper introduces Test-Time Noise Tuning (TNT), a novel method that optimizes learnable noise in the visual input space to handle unpredictable distribution shifts, significantly improving vision-language model generalization and calibration over zero-shot CLIP.
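A hedged sketch of the noise-tuning loop, assuming a CLIP-like model with `encode_image`, precomputed normalised text features, and an augmentation function: a single additive noise tensor on the input image is optimised to minimise the marginal entropy over augmented views. The paper's inter-view distance alignment, scaled logits, and confident-view selection are omitted.

```python
import torch
import torch.nn.functional as F

def tune_noise(clip_model, image, text_feats, augment, n_views=8, steps=10, lr=5e-3):
    """image: (C, H, W); text_feats: (num_classes, d) normalised class embeddings."""
    noise = torch.zeros_like(image, requires_grad=True)       # learnable noise in the input space
    opt = torch.optim.AdamW([noise], lr=lr)
    for _ in range(steps):
        views = torch.stack([augment(image + noise) for _ in range(n_views)])
        img_feats = F.normalize(clip_model.encode_image(views), dim=-1)
        probs = (100.0 * img_feats @ text_feats.T).softmax(dim=-1).mean(0)   # marginal over views
        loss = -(probs * probs.clamp_min(1e-8).log()).sum()                  # entropy minimisation
        opt.zero_grad(); loss.backward(); opt.step()
    return (image + noise).detach()
```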

Authors:Julia Hornauer, Amir El-Ghoussani, Vasileios Belagiannis
Title: Revisiting Gradient-based Uncertainty for Monocular Depth Estimation
Abstract:
Monocular depth estimation, similar to other image-based tasks, is prone to erroneous predictions due to ambiguities in the image, for example, caused by dynamic objects or shadows. For this reason, pixel-wise uncertainty assessment is required for safety-critical applications to highlight the areas where the prediction is unreliable. We address this in a post hoc manner and introduce gradient-based uncertainty estimation for already trained depth estimation models. To extract gradients without depending on the ground truth depth, we introduce an auxiliary loss function based on the consistency of the predicted depth and a reference depth. The reference depth, which acts as pseudo ground truth, is in fact generated using a simple image or feature augmentation, making our approach simple and effective. To obtain the final uncertainty score, the derivatives w.r.t. the feature maps from single or multiple layers are calculated using back-propagation. We demonstrate that our gradient-based approach is effective in determining the uncertainty without re-training using the two standard depth estimation benchmarks KITTI and NYU. In particular, for models trained with monocular sequences and therefore most prone to uncertainty, our method outperforms related approaches. In addition, we publicly provide our code and models: https://github.com/jhornauer/GrUMoDepth
中文: 本文提出了一种基于梯度的单目深度预测不确定性估计方法,通过辅助损失函数和特征增强生成参考深度作为伪真值,无需重新训练模型即可有效识别预测不可靠区域。
English: This paper introduces a gradient-based uncertainty estimation method for monocular depth prediction models, which uses an auxiliary loss function and feature augmentation to generate reference depth as pseudo ground truth, effectively identifying unreliable areas without model retraining.
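A hedged sketch of the post-hoc recipe: a horizontally flipped input serves as the augmentation that produces the reference depth, the consistency between the two predictions acts as the auxiliary loss, and the per-pixel uncertainty is read from gradient magnitudes at an intermediate feature map. The choice of augmentation and hook layer is illustrative.

```python
import torch
import torch.nn.functional as F

def gradient_uncertainty(depth_net, image, feature_layer):
    stash = {}
    def hook(module, inputs, output):
        output.retain_grad()          # keep gradients of this intermediate feature map
        stash["feat"] = output
    handle = feature_layer.register_forward_hook(hook)

    pred = depth_net(image)                                                   # prediction to assess
    with torch.no_grad():                                                     # pseudo ground truth
        reference = torch.flip(depth_net(torch.flip(image, dims=[-1])), dims=[-1])
    loss = F.l1_loss(pred, reference)                                         # auxiliary consistency loss
    loss.backward()
    handle.remove()

    grad = stash["feat"].grad                                                 # (B, C, h, w)
    return grad.abs().mean(dim=1, keepdim=True)                               # per-pixel uncertainty map
```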

Authors:Hongyu Ge, Longkun Hao, Zihui Xu, Zhenxin Lin, Bin Li, Shoujun Zhou, Hongjin Zhao, Yihang Liu
Title: ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images
Abstract:
Medical Visual Question Answering (Med-VQA) represents a critical and challenging subtask within the general VQA domain. Despite significant advancements in general VQA, multimodal large language models (MLLMs) still exhibit substantial limitations when handling multi-task VQA scenarios. These limitations manifest through erroneous spatial localization and misinterpretation of medical images, which primarily arise from two fundamental issues: inadequate image-text alignment and insufficient domain-specific knowledge for medical applications. To address these issues, we introduce the Cross-Modal Clinical Knowledge Distiller (ClinKD), an innovative framework designed to enhance image-text alignment and establish more effective medical knowledge transformation mechanisms, which enables MLLMs to perform better even when lacking prior medical knowledge. Our extensive experimental evaluations demonstrate that the ClinKD achieves state-of-the-art performance on several datasets which are challenging for the Med-VQA task. The results indicate that our approach not only significantly improves image-text alignment but also effectively enables MLLMs to adapt to medical knowledge. The source code for ClinKD is available at: https://github.com/overloadedHenry/ClinKD.
中文:提出的跨模态临床知识蒸馏框架通过加强图像-文本对齐和医学知识迁移,有效解决了医学视觉问答中的关键问题,在多个数据集上实现了最优性能。
English: The proposed Cross-Modal Clinical Knowledge Distiller (ClinKD) framework addresses medical VQA limitations by enhancing image-text alignment and medical knowledge transfer, achieving state-of-the-art performance across multiple datasets.

Authors:Vera Soboleva, Maksim Nakhodnov, Aibek Alanov
Title: Beyond Fine-Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation
Abstract:
Personalized text-to-image generation aims to create images tailored to user-defined concepts and textual descriptions. Balancing the fidelity of the learned concept with its ability for generation in various contexts presents a significant challenge. Existing methods often address this through diverse fine-tuning parameterizations and improved sampling strategies that integrate superclass trajectories during the diffusion process. While improved sampling offers a cost-effective, training-free solution for enhancing fine-tuned models, systematic analyses of these methods remain limited. Current approaches typically tie sampling strategies with fixed fine-tuning configurations, making it difficult to isolate their impact on generation outcomes. To address this issue, we systematically analyze sampling strategies beyond fine-tuning, exploring the impact of concept and superclass trajectories on the results. Building on this analysis, we propose a decision framework evaluating text alignment, computational constraints, and fidelity objectives to guide strategy selection. It integrates with diverse architectures and training approaches, systematically optimizing concept preservation, prompt adherence, and resource efficiency. The source code can be found at https://github.com/ControlGenAI/PersonGenSampler.
中文: 本研究系统分析了个性化文本到图像生成中的采样策略,提出了一个决策框架,通过评估文本对齐、计算约束和保真度来优化不同架构下的概念保持与资源效率。
English: This study systematically analyzes sampling strategies in personalized text-to-image generation, proposing a decision framework that evaluates text alignment, computational constraints, and fidelity to optimize concept preservation and resource efficiency across various architectures.

Authors:Zhifei Yang, Keyang Lu, Chao Zhang, Jiaxing Qi, Hanqi Jiang, Ruifei Ma, Shenglin Yin, Yifan Xu, Mingzhe Xing, Zhen Xiao, Jieyi Long, Guangyao Zhai
Title: MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation
Abstract:
Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high levels of realism and controllability in terms of geometry. Scene graphs provide a suitable data representation that facilitates these applications. However, current graph-based methods for scene generation are constrained to text-based inputs and exhibit insufficient adaptability to flexible user inputs, hindering the ability to precisely control object geometry. To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, visual enhancement module, and relation predictor. The mixed-modality graph allows object nodes to integrate textual and visual modalities, with optional relationships between nodes. It enhances adaptability to flexible user inputs and enables meticulous control over the geometry of objects in the generated scenes. The visual enhancement module enriches the visual fidelity of text-only nodes by constructing visual representations using text embeddings. Furthermore, our relation predictor leverages node representations to infer absent relationships between nodes, resulting in more coherent scene layouts. Extensive experimental results demonstrate that MMGDreamer exhibits superior control of object geometry, achieving state-of-the-art scene generation performance. Project page: https://yangzhifeio.github.io/project/MMGDreamer.

Authors:Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, Sida Peng
Title: EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
Abstract:
Learning an agent model that behaves like humans, capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective, is a fundamental challenge in computer vision. Existing methods typically train separate models for these abilities, which fail to capture their intrinsic relationships and prevent them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding-action-prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method. The code and trained models will be publicly available at https://github.com/zju3dv/EgoAgent.
Chinese: EgoAgent是一种基于Transformer的统一智能体模型,通过单一架构同时学习感知、预测和行动,并通过显式建模它们之间的因果与时间依赖关系,在多项任务中展现出卓越性能。
English: EgoAgent is a unified transformer-based model that jointly learns perception, prediction, and action through a single architecture, demonstrating superior performance across multiple tasks by explicitly modeling their causal and temporal dependencies.

Authors:Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, Feifei Feng
Title: DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
Abstract:
Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert that is separable from the VLA on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks. We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA's adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy.

Authors:Yuhui Zeng, Haoxiang Wu, Wenjie Nie, Xiawu Zheng, Guangyao Chen, Yunhang Shen, Jun Peng, Yonghong Tian, Rongrong Ji
Title: From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning
Abstract:
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. This deficiency arises from their architecture's emphasis on discrete object identification rather than modeling the compositional reasoning, inter-object correlations, and contextual semantics essential for comprehensive event understanding. To address this challenge, we present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding through LLM-guided symbolic reasoning. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training. The proposed plug-and-play framework interfaces with any open-vocabulary detector while extending their inherent capabilities across architectures. At its core, our approach combines (i) a symbolic regression mechanism exploring relationship patterns among detected entities and (ii) an LLM-guided strategy that steers the search toward meaningful expressions. These discovered symbolic rules transform low-level visual perception into interpretable event understanding, providing a transparent reasoning path from objects to events with strong transferability across domains. We compared our training-free framework against specialized event recognition systems across diverse application domains. Experiments demonstrate that our framework enhances multiple object detector architectures to recognize complex events such as illegal fishing activities (75% AUROC, +8.36% improvement), construction safety violations (+15.77%), and abnormal crowd behaviors (+23.16%). Code is available at https://github.com/MAC-AutoML/SymbolicDet.
Chinese: 该框架通过LLM引导的符号推理增强物体检测器,无需额外训练即可实现跨领域的复杂事件理解,在多种应用场景中展现出显著的性能提升。
English: This framework enhances object detectors with LLM-guided symbolic reasoning to achieve complex event understanding across domains without additional training, demonstrating significant performance improvements in diverse applications.

Authors:Yixiong Jing, Wei Lin, Brian Sheil, Sinan Acikgoz
Title: A 3D Multimodal Feature for Infrastructure Anomaly Detection
Abstract:
Ageing structures require periodic inspections to identify structural defects. Previous work has used geometric distortions to locate cracks in synthetic masonry bridge point clouds but has struggled to detect small cracks. To address this limitation, this study proposes a novel 3D multimodal feature, 3DMulti-FPFHI, that combines a customized Fast Point Feature Histogram (FPFH) with an intensity feature. This feature is integrated into the PatchCore anomaly detection algorithm and evaluated through statistical and parametric analyses. The method is further evaluated using point clouds of a real masonry arch bridge and a full-scale experimental model of a concrete tunnel. Results show that the 3D intensity feature enhances inspection quality by improving crack detection; it also enables the identification of water ingress which introduces intensity anomalies. The 3DMulti-FPFHI outperforms FPFH and a state-of-the-art multimodal anomaly detection method. The potential of the method to address diverse infrastructure anomaly detection scenarios is highlighted by the minimal requirements for data compared to learning-based methods. The code and related point cloud dataset are available at https://github.com/Jingyixiong/3D-Multi-FPFHI.
中文摘要:本研究提出了一种名为3DMulti-FPFHI的新型三维多模态特征,通过将定制化快速点特征直方图与强度特征相结合,有效提升了老化结构中微小裂缝的检测能力,并在识别渗水异常方面优于现有方法。
English Summary: This study introduces a novel 3D multimodal feature called 3DMulti-FPFHI, which enhances crack detection in aging structures by combining customized Fast Point Feature Histogram with intensity features, showing superior performance in identifying small cracks and water ingress compared to existing methods.
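A hedged sketch of the feature construction and a PatchCore-style nearest-neighbour scoring, using Open3D's FPFH implementation and appending a normalised intensity channel; the radii, the intensity feature design, and the memory-bank subsampling of the actual method are simplified.

```python
import numpy as np
import open3d as o3d
from sklearn.neighbors import NearestNeighbors

def multimodal_features(pcd: o3d.geometry.PointCloud, intensity: np.ndarray, radius=0.05):
    """Per-point FPFH descriptors concatenated with a normalised intensity value."""
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=radius, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=2 * radius, max_nn=100))
    feats = fpfh.data.T                                     # (N, 33)
    inten = (intensity - intensity.mean()) / (intensity.std() + 1e-8)
    return np.hstack([feats, inten[:, None]])               # (N, 34) multimodal feature

def anomaly_scores(train_feats, test_feats):
    """Distance to the nearest anomaly-free feature acts as the per-point anomaly score."""
    nn = NearestNeighbors(n_neighbors=1).fit(train_feats)   # memory bank of normal features
    dists, _ = nn.kneighbors(test_feats)
    return dists[:, 0]                                       # larger distance = more anomalous
```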

Authors:Enquan Yang, Peng Xing, Hanyang Sun, Wenbo Guo, Yuanwei Ma, Zechao Li, Dan Zeng
Title: 3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly
Abstract:
Industrial anomaly detection has progressed thanks to datasets such as MVTec-AD and VisA. However, they suffer from limitations in terms of the number of defect samples, types of defects, and availability of real-world scenes. These constraints inhibit researchers from further exploring the performance of industrial detection with higher accuracy. To this end, we propose a new large-scale anomaly detection dataset called 3CAD, which is derived from real 3C production lines. Specifically, the proposed 3CAD includes eight different types of manufactured parts, totaling 27,039 high-resolution images labeled with pixel-level anomalies. The key features of 3CAD are that it covers anomalous regions of different sizes, multiple anomaly types, and the possibility of multiple anomalous regions and multiple anomaly types per anomaly image. This is the first and largest anomaly detection dataset dedicated to 3C product quality control, released for community exploration and development. Meanwhile, we introduce a simple yet effective framework for unsupervised anomaly detection: a Coarse-to-Fine detection paradigm with Recovery Guidance (CFRG). To detect small defect anomalies, the proposed CFRG utilizes a coarse-to-fine detection paradigm. Specifically, we utilize a heterogeneous distillation model for coarse localization and then a segmentation model for fine localization. In addition, to better capture normal patterns, we introduce recovery features as guidance. Finally, we report the results of our CFRG framework and popular anomaly detection methods on the 3CAD dataset, demonstrating strong competitiveness and providing a highly challenging benchmark to promote the development of the anomaly detection field. Data and code are available at: https://github.com/EnquanYang2022/3CAD.
中文:3CAD数据集通过提供来自3C生产线的27,039张高分辨率图像,包含多样化缺陷类型和像素级标注,解决了工业异常检测的局限性;同时提出的CFRG框架采用由粗到精的检测方法结合恢复指导,有效提升小缺陷识别能力并建立了具有竞争力的性能基准。
English: The 3CAD dataset addresses limitations in industrial anomaly detection by offering a large-scale collection of 27,039 high-resolution images from 3C production lines, featuring diverse defect types and pixel-level annotations, while the proposed CFRG framework introduces a coarse-to-fine detection method with recovery guidance to enhance small defect identification and establish competitive benchmarks.

Authors:Yue Pan, Xingguang Zhong, Liren Jin, Louis Wiesmann, Marija Popović, Jens Behley, Cyrill Stachniss
Title: PINGS: Gaussian Splatting Meets Distance Fields within a Point-Based Implicit Neural Map
Abstract:
Robots benefit from high-fidelity reconstructions of their environment, which should be geometrically accurate and photorealistic to support downstream tasks. While this can be achieved by building distance fields from range sensors and radiance fields from cameras, realising scalable incremental mapping of both fields consistently and with high quality at the same time is challenging. In this paper, we propose a novel map representation that unifies a continuous signed distance field and a Gaussian splatting radiance field within an elastic and compact point-based implicit neural map. By enforcing geometric consistency between these fields, we achieve mutual improvements by exploiting both modalities. We present a novel LiDAR-visual SLAM system called PINGS using the proposed map representation and evaluate it on several challenging large-scale datasets. Experimental results demonstrate that PINGS can incrementally build globally consistent distance and radiance fields encoded with a compact set of neural points. Compared to state-of-the-art methods, PINGS achieves superior photometric and geometric rendering at novel views by constraining the radiance field with the distance field. Furthermore, by utilizing dense photometric cues and multi-view consistency from the radiance field, PINGS produces more accurate distance fields, leading to improved odometry estimation and mesh reconstruction. We also provide an open-source implementation of PINGS at: https://github.com/PRBonn/PINGS.
中文: 本文提出PINGS系统,通过将符号距离场与高斯溅射辐射场统一于基于点的隐式神经地图中,实现了几何一致的高质量环境重建,在新型视角下展现出卓越的光度与几何渲染效果。
English: This paper introduces PINGS, a LiDAR-visual SLAM system that unifies signed distance and Gaussian splatting radiance fields in a point-based neural map to achieve globally consistent, high-quality geometric and photorealistic reconstructions with mutual improvements between modalities.

Authors:Kaizhen Zhu, Mokai Pan, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Title: UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control
Abstract:
Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob's $h$-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at https://github.com/UniDB-SOC/UniDB/.
中文摘要:提出的UniDB框架通过随机最优控制统一扩散桥模型,通过优化终端惩罚系数显著提升图像细节保留能力和输出质量。
English Summary: The proposed UniDB framework utilizes Stochastic Optimal Control to unify diffusion bridge models, enhancing image detail preservation and quality by optimizing terminal penalty coefficients.
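To make the role of the terminal penalty concrete, a schematic linear-quadratic SOC objective of the kind the framework generalizes can be written as follows; this exact functional is an illustrative assumption rather than UniDB's derivation:

$\min_{u}\; \mathbb{E}\left[\int_{0}^{T} \tfrac{1}{2}\lVert u_t\rVert^{2}\,dt \;+\; \tfrac{\gamma}{2}\,\lVert x_T - x_{\mathrm{target}}\rVert^{2}\right]$

where $\gamma$ is the tunable terminal penalty coefficient. As $\gamma \to \infty$ the terminal constraint becomes hard and the Doob's $h$-transform bridge is recovered as a special case, while a finite $\gamma$ trades terminal accuracy against control cost, which the abstract credits with better detail preservation.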

Authors:Donghui Feng, Zhengxue Cheng, Shen Wang, Ronghua Wu, Hongwei Hu, Guo Lu, Li Song
Title: Linear Attention Modeling for Learned Image Compression
Abstract:
In recent years, learned image compression has made tremendous progress in achieving impressive coding efficiency. Its coding gain mainly comes from non-linear neural network-based transforms and learnable entropy modeling. However, most studies focus on a strong backbone, and few studies consider a low complexity design. In this paper, we propose LALIC, a linear attention modeling approach for learned image compression. Specifically, we propose to use Bi-RWKV blocks, utilizing the Spatial Mix and Channel Mix modules to achieve more compact feature extraction, and apply the Conv based Omni-Shift module to adapt to two-dimensional latent representations. Furthermore, we propose a RWKV-based Spatial-Channel ConTeXt model (RWKV-SCCTX) that leverages Bi-RWKV to model the correlation between neighboring features effectively. To our knowledge, this is the first work to utilize efficient Bi-RWKV models with linear attention for learned image compression. Experimental results demonstrate that our method achieves competitive RD performance, outperforming VTM-9.1 by -15.26%, -15.41%, -17.63% in BD-rate on the Kodak, CLIC and Tecnick datasets. The code is available at https://github.com/sjtu-medialab/RwkvCompress.
中文摘要:本文提出LALIC方法,通过线性注意力建模和双向RWKV模块实现高效特征提取,结合新型上下文模型在图像压缩中取得了优异的率失真性能。
English Summary: This paper introduces LALIC, a learned image compression method using linear attention modeling with Bi-RWKV blocks and a novel RWKV-based context model to achieve efficient feature extraction and competitive rate-distortion performance.

Authors:Xiao Wang, Qingquan Yang, Fuling Wang, Qiang Chen, Wentao Wu, Yu Jin, Jingtao Jiang, Liye Jin, Bo Jiang, Dengdi Sun, Wanli Lv, Meiwen Chen, Zehua Chen, Guosheng Xu, Jin Tang
Title: XiHeFusion: Harnessing Large Language Models for Science Communication in Nuclear Fusion
Abstract:
Nuclear fusion is one of the most promising ways for humans to obtain infinite energy. Currently, with the rapid development of artificial intelligence, nuclear fusion research has also entered a critical period of its development. Helping more people understand nuclear fusion and join its research is one of the effective means to accelerate the realization of fusion. This paper proposes the first large model in the field of nuclear fusion, XiHeFusion, which is obtained through supervised fine-tuning based on the open-source large model Qwen2.5-14B. We have collected multi-source knowledge about nuclear fusion tasks to support the training of this model, including Common Crawl, eBooks, arXiv, dissertations, etc. After the model mastered the knowledge of the nuclear fusion field, we further used chain-of-thought prompting to enhance its logical reasoning ability, enabling XiHeFusion to provide more accurate and logical answers. In addition, we propose a test questionnaire containing 180+ questions to assess the conversational ability of this science popularization large model. Extensive experimental results show that our nuclear fusion dialogue model, XiHeFusion, performs well in answering science popularization knowledge. The pre-trained XiHeFusion model is released at https://github.com/Event-AHU/XiHeFusion.
Chinese: 本文提出了核聚变领域的首个大模型羲和聚变,通过基于Qwen2.5-14B的监督微调和思维链增强,能够为科学普及提供准确且逻辑清晰的回答。
English: This paper introduces XiHeFusion, the first large model in nuclear fusion, developed by fine-tuning Qwen2.5-14B with multi-source data and enhanced reasoning to provide accurate, logical answers for science popularization.

Authors:Yufan Zhou, Haoyu Shen, Huan Wang
Title: FreeBlend: Advancing Concept Blending with Staged Feedback-Driven Interpolation Diffusion
Abstract:
Concept blending is a promising yet underexplored area in generative models. While recent approaches, such as embedding mixing and latent modification based on structural sketches, have been proposed, they often suffer from incompatible semantic information and discrepancies in shape and appearance. In this work, we introduce FreeBlend, an effective, training-free framework designed to address these challenges. To mitigate cross-modal loss and enhance feature detail, we leverage transferred image embeddings as conditional inputs. The framework employs a stepwise increasing interpolation strategy between latents, progressively adjusting the blending ratio to seamlessly integrate auxiliary features. Additionally, we introduce a feedback-driven mechanism that updates the auxiliary latents in reverse order, facilitating global blending and preventing rigid or unnatural outputs. Extensive experiments demonstrate that our method significantly improves both the semantic coherence and visual quality of blended images, yielding compelling and coherent results.

Authors:Shiao Wang, Xiao Wang, Chao Wang, Liye Jin, Lin Zhu, Bo Jiang, Yonghong Tian, Jin Tang
Title: Event Stream-based Visual Object Tracking: HDETrack V2 and A High-Definition Benchmark
Abstract:
We then introduce a novel hierarchical knowledge distillation strategy that incorporates the similarity matrix, feature representation, and response map-based distillation to guide the learning of the student Transformer network. We also enhance the model's ability to capture temporal dependencies by applying the temporal Fourier transform to establish temporal relationships between video frames. We adapt the network model to specific target objects during testing via a newly proposed test-time tuning strategy to achieve high performance and flexibility in target tracking. Recognizing the limitations of existing event-based tracking datasets, which are predominantly low-resolution, we propose EventVOT, the first large-scale high-resolution event-based tracking dataset. It comprises 1141 videos spanning diverse categories such as pedestrians, vehicles, UAVs, ping pong, etc. Extensive experiments on both low-resolution (FE240hz, VisEvent, FELT), and our newly proposed high-resolution EventVOT dataset fully validated the effectiveness of our proposed method. Both the benchmark dataset and source code have been released on https://github.com/Event-AHU/EventVOT_Benchmark
中文摘要:本文提出了一种结合相似性矩阵、特征表示和响应图的分层知识蒸馏方法,并使用时域傅里叶变换增强视频跟踪的时序建模能力,通过新发布的高分辨率事件数据集EventVOT验证了其有效性。
English Summary: This paper introduces a hierarchical knowledge distillation method and temporal Fourier transform to enhance a student Transformer network for video tracking, validated by a new high-resolution event-based dataset called EventVOT.

Authors:Qirui Wu, Shizhou Zhang, De Cheng, Yinghui Xing, Di Xu, Peng Wang, Yanning Zhang
Title: Demystifying Catastrophic Forgetting in Two-Stage Incremental Object Detector
Abstract:
Catastrophic forgetting is a critical challenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through dissection of Faster R-CNN, we reveal a key insight: catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to the subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provides pivotal insights for catastrophic forgetting mitigation in IOD. Code is available at \href{https://github.com/fanrena/NSGP-RePRE}{https://github.com/fanrena/NSGP-RePRE}.
Chinese: 该研究发现增量目标检测中的灾难性遗忘主要影响分类器组件,因此提出了NSGP-RePRE框架,通过区域原型回放和零空间梯度投影,在多个数据集上实现了最先进的性能。
English: The study identifies that catastrophic forgetting in incremental object detection mainly affects the classifier component, leading to the development of NSGP-RePRE, which uses regional prototype replay and null space gradient projection to achieve state-of-the-art results on benchmark datasets.
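
The null-space projection step is simple to sketch. The snippet below is a hedged illustration: the stored-feature matrix, the singular-value threshold, and the plain SGD step are assumptions rather than the NSGP-RePRE implementation. It projects a linear layer's weight gradient onto the orthogonal complement of the subspace spanned by old-task input features, so the update barely perturbs responses to old inputs.

```python
import torch

def null_space_projection(grad_w, old_feats, eps=1e-3):
    """
    grad_w:    (out_dim, in_dim) gradient of a linear layer's weight.
    old_feats: (n_old, in_dim) features of inputs from previous tasks.
    Returns the gradient projected onto the null space of the old-input
    subspace, so that (W + dW) a stays close to W a for old features a.
    """
    # Orthonormal basis of the old-input subspace; keep directions with
    # non-negligible singular values.
    U, S, _ = torch.linalg.svd(old_feats.T, full_matrices=False)  # U: (in_dim, k)
    U = U[:, S > eps * S.max()]
    # Remove the component of each gradient row lying in that subspace.
    return grad_w - grad_w @ U @ U.T

# Toy usage on one linear layer (old features deliberately span a low-rank subspace).
layer = torch.nn.Linear(64, 10)
x_new, y_new = torch.randn(32, 64), torch.randint(0, 10, (32,))
old_feats = torch.randn(200, 16) @ torch.randn(16, 64)
loss = torch.nn.functional.cross_entropy(layer(x_new), y_new)
loss.backward()
with torch.no_grad():
    layer.weight.grad = null_space_projection(layer.weight.grad, old_feats)
    layer.weight -= 0.1 * layer.weight.grad   # gradient step restricted to the null space
```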

Authors:Zinan Lin, Tadas Baltrusaitis, Wenyu Wang, Sergey Yekhanin
Title: Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model
Abstract:
Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art (SoTA) models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow APIs beyond foundation models. In particular, we demonstrate that many SoTA data synthesizers that do not rely on neural networks--such as computer graphics-based image generators, which we refer to as simulators--can be effectively integrated into PE. This insight significantly broadens PE's applicability and unlocks the potential of powerful simulators for DP data synthesis. We explore this approach, named Sim-PE, in the context of image synthesis. Across four diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x, reducing FID by up to 80%, and offering much greater efficiency. We also show that simulators and foundation models can be easily leveraged together within PE to achieve further improvements. The code is open-sourced in the Private Evolution Python library: https://github.com/microsoft/DPSDA.
中文: 本研究提出了Sim-PE,作为私有进化框架的扩展,通过整合非神经网络的模拟器来生成差分隐私合成数据,显著提升了图像合成中的性能、效率和适用性。
English: The study introduces Sim-PE, an extension of the Private Evolution framework that integrates non-neural network simulators for differentially private synthetic data generation, significantly enhancing performance, efficiency, and applicability in image synthesis.
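
A minimal sketch of a Private-Evolution-style loop with a simulator in place of a foundation model is shown below. It is hedged: the embedding function, the Gaussian noise scale, and the resampling rule are simplified placeholders passed in by the caller, not the Sim-PE algorithm as released.

```python
import numpy as np

def pe_with_simulator(private_emb, random_api, variation_api, embed,
                      n_synth=256, iters=5, sigma=10.0, rng=np.random):
    """
    private_emb:   (n_priv, d) embeddings of the private data (fixed).
    random_api:    callable(n) -> n initial samples drawn from the simulator.
    variation_api: callable(sample) -> a perturbed sample from the simulator.
    embed:         callable(samples) -> (n, d) embeddings in the shared space.
    """
    synth = random_api(n_synth)                       # initial simulator draws
    for _ in range(iters):
        synth_emb = embed(synth)
        # Each private point votes for its nearest synthetic sample.
        dists = np.linalg.norm(private_emb[:, None] - synth_emb[None], axis=-1)
        votes = np.bincount(dists.argmin(1), minlength=n_synth).astype(float)
        votes += rng.normal(0.0, sigma, n_synth)      # Gaussian noise for DP
        probs = np.clip(votes, 0, None)
        probs = probs / probs.sum() if probs.sum() > 0 else np.full(n_synth, 1 / n_synth)
        parents = [synth[i] for i in rng.choice(n_synth, n_synth, p=probs)]
        synth = [variation_api(p) for p in parents]   # simulator perturbs selected parents
    return synth
```

The key point is that `random_api` and `variation_api` only need to generate and perturb samples; a graphics-based simulator can play that role just as a foundation-model inference API can.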

Authors:Yongfan Chen, Xiuwen Zhu, Tianyu Li
Title: A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction
Abstract:
Recent advances in video generation models demonstrate their potential as world simulators, but they often struggle with videos deviating from physical laws, a key concern overlooked by most text-to-video benchmarks. We introduce a benchmark designed specifically to assess the Physical Coherence of generated videos, PhyCoBench. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We evaluated four state-of-the-art (SoTA) T2V models on PhyCoBench and conducted manual assessments. Additionally, we propose an automated evaluation model: PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner. Through a consistency evaluation comparing automated and manual sorting, the experimental results show that PhyCoPredictor currently aligns most closely with human evaluation. Therefore, it can effectively evaluate the physical coherence of videos, providing insights for future model optimization. Our benchmark, including physical coherence prompts, the automatic evaluation tool PhyCoPredictor, and the generated video dataset, has been released on GitHub at https://github.com/Jeckinchen/PhyCoBench.
中文: 该摘要介绍了PhyCoBench基准测试,专门用于评估生成视频在七类物理原理上的连贯性,并提出了PhyCoPredictor自动评估模型,其评估结果与人工评估高度一致,可为未来模型优化提供指导。
English: This abstract introduces PhyCoBench, a benchmark designed to evaluate the physical coherence of generated videos across seven categories of physical principles, and proposes PhyCoPredictor, an automated evaluation model that aligns closely with human assessments to guide future model improvements.

Authors:Shengdong Zhang, Fan Jia, Xiang Li, Hao Zhang, Jun Shi, Liyan Ma, Shihui Ying
Title: LMS-Net: A Learned Mumford-Shah Network For Few-Shot Medical Image Segmentation
Abstract:
Few-shot semantic segmentation (FSS) methods have shown great promise in handling data-scarce scenarios, particularly in medical image segmentation tasks. However, most existing FSS architectures lack sufficient interpretability and fail to fully incorporate the underlying physical structures of semantic regions. To address these issues, in this paper, we propose a novel deep unfolding network, called the Learned Mumford-Shah Network (LMS-Net), for the FSS task. Specifically, motivated by the effectiveness of pixel-to-prototype comparison in prototypical FSS methods and the capability of deep priors to model complex spatial structures, we leverage our learned Mumford-Shah model (LMS model) as a mathematical foundation to integrate these insights into a unified framework. By reformulating the LMS model into prototype update and mask update tasks, we propose an alternating optimization algorithm to solve it efficiently. Further, the iterative steps of this algorithm are unfolded into corresponding network modules, resulting in LMS-Net with clear interpretability. Comprehensive experiments on three publicly available medical segmentation datasets verify the effectiveness of our method, demonstrating superior accuracy and robustness in handling complex structures and adapting to challenging segmentation scenarios. These results highlight the potential of LMS-Net to advance FSS in medical imaging applications. Our code will be available at: https://github.com/SDZhang01/LMSNet
中文: 本文提出LMS-Net,通过将原型比较与学习的Mumford-Shah模型相结合,构建可解释的深度展开网络,在少样本医学图像分割中实现了更优的准确性和结构处理能力。
English: This paper introduces LMS-Net, a deep unfolding network that enhances few-shot medical image segmentation by integrating prototype comparison with the learned Mumford-Shah model, achieving superior accuracy and interpretability through alternating optimization.
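
The alternating optimization that the network unfolds can be sketched in a few lines. This is a hedged toy version: the prototype and mask updates below use simple masked averaging and a blur as a stand-in for the learned regularity term, not the learned LMS-Net modules.

```python
import torch
import torch.nn.functional as F

def alternating_segmentation(feats, init_mask, n_iters=3, tau=0.1, lam=0.1):
    """
    feats:     (C, H, W) query feature map.
    init_mask: (H, W) initial soft foreground probability in [0, 1].
    """
    mask = init_mask
    for _ in range(n_iters):
        # Prototype update: masked average of features (foreground / background).
        fg = (feats * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-6)
        bg = (feats * (1 - mask)).sum(dim=(1, 2)) / ((1 - mask).sum() + 1e-6)
        # Mask update: pixel-to-prototype distances turned into probabilities,
        # lightly smoothed as a crude stand-in for the regularity term.
        d_fg = ((feats - fg[:, None, None]) ** 2).sum(0)
        d_bg = ((feats - bg[:, None, None]) ** 2).sum(0)
        mask = torch.sigmoid((d_bg - d_fg) / tau)
        mask = (1 - lam) * mask + lam * F.avg_pool2d(
            mask[None, None], 3, stride=1, padding=1)[0, 0]
    return mask

feats = torch.randn(32, 64, 64)
mask = alternating_segmentation(feats, torch.rand(64, 64))
```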

Authors:Shadab Ahamed, Simon Ghyselincks, Pablo Chang Huang Arias, Julian Kloiber, Yasin Ranjbar, Jingrong Tang, Niloufar Zakariaei, Eldad Haber
Title: Inversion of Magnetic Data using Learned Dictionaries and Scale Space
Abstract:
Magnetic data inversion is an important tool in geophysics, used to infer subsurface magnetic susceptibility distributions from surface magnetic field measurements. This inverse problem is inherently ill-posed, characterized by non-unique solutions, depth ambiguity, and sensitivity to noise. Traditional inversion approaches rely on predefined regularization techniques to stabilize solutions, limiting their adaptability to complex or diverse geological scenarios. In this study, we propose an approach that integrates variable dictionary learning and scale-space methods to address these challenges. Our method employs learned dictionaries, allowing for adaptive representation of complex subsurface features that are difficult to capture with predefined bases. Additionally, we extend classical variational inversion by incorporating multi-scale representations through a scale-space framework, enabling the progressive introduction of structural detail while mitigating overfitting. We implement both fixed and dynamic dictionary learning techniques, with the latter introducing iteration-dependent dictionaries for enhanced flexibility. Using a synthetic dataset to simulate geological scenarios, we demonstrate significant improvements in reconstruction accuracy and robustness compared to conventional variational and dictionary-based methods. Our results highlight the potential of learned dictionaries, especially when coupled with scale-space dynamics, to improve model recovery and noise handling. These findings underscore the promise of our data-driven approach for advancing magnetic data inversion and its applications in geophysical exploration, environmental assessment, and mineral prospecting. The code is publicly available at: https://github.com/ahxmeds/magnetic-inversion-dictionary.git.
中文: 本研究提出了一种结合自适应字典学习和多尺度框架的新型磁数据反演方法,通过优化特征表征和噪声抑制能力显著提升了地下结构重建效果,在精度与鲁棒性上均优于传统方法。
English: This study introduces a novel magnetic data inversion method that combines adaptive dictionary learning with a multi-scale framework to enhance subsurface reconstruction by improving feature representation and noise resilience, outperforming traditional techniques in accuracy and robustness.
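
The core variational idea, representing the susceptibility model through a dictionary and fitting its coefficients against the forward operator, can be sketched as follows. This is hedged: the operator, dictionary, and sparsity penalty are toy stand-ins for the learned and physical components described in the paper.

```python
import torch

def dictionary_inversion(A, D, d, lam=1e-2, lr=1e-2, n_iters=500):
    """
    A: (n_data, n_cells) linear forward operator (sensitivity matrix).
    D: (n_cells, n_atoms) dictionary whose columns are (learned) atoms.
    d: (n_data,) observed magnetic data.
    Fits coefficients c so that m = D @ c explains the data with sparse c.
    """
    c = torch.zeros(D.shape[1], requires_grad=True)
    opt = torch.optim.Adam([c], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        m = D @ c                                        # susceptibility model
        loss = ((A @ m - d) ** 2).mean() + lam * c.abs().mean()
        loss.backward()
        opt.step()
    return (D @ c).detach()

A = torch.randn(100, 400)      # toy sensitivity matrix
D = torch.randn(400, 64)       # toy dictionary
d = A @ torch.randn(400)       # toy observations
model = dictionary_inversion(A, D, d)
```

A scale-space variant would run this fit repeatedly while progressively sharpening the dictionary or the model parameterization, introducing finer structure at each stage.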

Authors:Xuanyu Tian, Lixuan Chen, Qing Wu, Chenhe Du, Jingjing Shi, Hongjiang Wei, Yuyao Zhang
Title: Unsupervised Self-Prior Embedding Neural Representation for Iterative Sparse-View CT Reconstruction
Abstract:
Emerging unsupervised implicit neural representation (INR) methods, such as NeRP, NeAT, and SCOPE, have shown great potential to address sparse-view computed tomography (SVCT) inverse problems. Although these INR-based methods perform well in relatively dense SVCT reconstructions, they struggle to achieve comparable performance to supervised methods in sparser SVCT scenarios. They are prone to being affected by noise, limiting their applicability in real clinical settings. Additionally, current methods have not fully explored the use of image domain priors for solving SVCT inverse problems. In this work, we demonstrate that imperfect reconstruction results can provide effective image domain priors for INRs to enhance performance. To leverage this, we introduce Self-prior embedding neural representation (Spener), a novel unsupervised method for SVCT reconstruction that integrates iterative reconstruction algorithms. During each iteration, Spener extracts local image prior features from the previous iteration and embeds them to constrain the solution space. Experimental results on multiple CT datasets show that our unsupervised Spener method achieves performance comparable to supervised state-of-the-art (SOTA) methods on in-domain data while outperforming them on out-of-domain datasets. Moreover, Spener significantly improves the performance of INR-based methods in handling SVCT with noisy sinograms. Our code is available at https://github.com/MeijiTian/Spener.
中文摘要:无监督Spener方法通过嵌入迭代重建中的图像先验来增强稀疏视图CT重建,在域内数据上达到与监督方法相当的性能,并在域外数据集上展现出更优的泛化能力。
English Summary: The unsupervised Spener method enhances sparse-view CT reconstruction by embedding image priors from iterative reconstructions, achieving performance comparable to supervised methods on in-domain data and superior generalization on out-of-domain datasets.

Authors:Vanshali Sharma, Debesh Jha, M. K. Bhuyan, Pradip K. Das, Ulas Bagci
Title: Diverse Image Generation with Diffusion Models and Cross Class Label Learning for Polyp Classification
Abstract:
Pathologic diagnosis is a critical phase in deciding the optimal treatment procedure for dealing with colorectal cancer (CRC). Colonic polyps, precursors to CRC, can pathologically be classified into two major types: adenomatous and hyperplastic. For precise classification and early diagnosis of such polyps, the medical procedure of colonoscopy has been widely adopted paired with various imaging techniques, including narrow band imaging and white light imaging. However, the existing classification techniques mainly rely on a single imaging modality and show limited performance due to data scarcity. Recently, generative artificial intelligence has been gaining prominence in overcoming such issues. Additionally, various generation-controlling mechanisms using text prompts and images have been introduced to obtain visually appealing and desired outcomes. However, such mechanisms require class labels to make the model respond efficiently to the provided control input. In the colonoscopy domain, such controlling mechanisms are rarely explored; specifically, the text prompt is a completely uninvestigated area. Moreover, the unavailability of expensive class-wise labels for diverse sets of images limits such explorations. Therefore, we develop a novel model, PathoPolyp-Diff, that generates text-controlled synthetic images with diverse characteristics in terms of pathology, imaging modalities, and quality. We introduce cross-class label learning to make the model learn features from other classes, reducing the burdensome task of data annotation. The experimental results report an improvement of up to 7.91% in balanced accuracy using a publicly available dataset. Moreover, cross-class label learning achieves a statistically significant improvement of up to 18.33% in balanced accuracy during video-level analysis. The code is available at https://github.com/Vanshali/PathoPolyp-Diff.
Chinese: 该研究提出了PathoPolyp-Diff模型,这是一种创新的生成式人工智能,通过文本提示生成多样化的合成结肠镜图像,并采用跨类别标签学习来提高结直肠癌诊断的分类准确性,有效解决了数据稀缺问题。
English: The study introduces PathoPolyp-Diff, a novel generative AI model that uses text prompts to create diverse synthetic colonoscopy images, enhancing classification accuracy through cross-class label learning and addressing data scarcity in colorectal cancer diagnosis.

Authors:Shuheng Zhang, Yuqi Liu, Hongbo Zhou, Jun Peng, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
Title: AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection
Abstract:
Despite great progress, text-driven long video editing is still notoriously challenging mainly due to excessive memory overhead. Although recent efforts have simplified this task into a two-step process of keyframe translation and interpolation generation, the token-wise keyframe translation still plagues the upper limit of video length. In this paper, we propose a novel and training-free approach towards efficient and effective long video editing, termed AdaFlow. We first reveal that not all tokens of video frames hold equal importance for keyframe translation, based on which we propose an Adaptive Attention Slimming scheme for AdaFlow to squeeze the $KV$ sequence, thus increasing the number of keyframes for translations by an order of magnitude. In addition, an Adaptive Keyframe Selection scheme is also equipped to select the representative frames for joint editing, further improving generation quality. With these innovative designs, AdaFlow achieves high-quality long video editing of minutes in one inference, i.e., more than 1$k$ frames on one A800 GPU, which is about ten times longer than the compared methods, e.g., TokenFlow. To validate AdaFlow, we also build a new benchmark for long video editing with high-quality annotations, termed LongV-EVAL. Our code is released at: https://github.com/jidantang55/AdaFlow.
中文: AdaFlow提出了一种无需训练的长视频编辑方法,通过自适应注意力精简和关键帧选择机制,实现了对超过1000帧视频的高效处理,性能远超TokenFlow等方法。
English: AdaFlow introduces a training-free method for long video editing by adaptively slimming attention mechanisms and selecting keyframes, enabling efficient processing of over 1,000 frames and significantly outperforming existing approaches like TokenFlow.
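
The attention-slimming idea, dropping unimportant tokens from the key/value sequence before cross-frame attention, can be sketched as below. This is hedged: the importance score used here, mean attention received from the queries, is an assumption rather than AdaFlow's exact criterion.

```python
import torch
import torch.nn.functional as F

def slim_kv(q, k, v, keep_ratio=0.25):
    """
    q: (n_q, d) query tokens; k, v: (n_kv, d) keys/values from many keyframes.
    Returns the pruned (k, v) containing only the most-attended tokens.
    """
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (n_q, n_kv)
    scores = attn.mean(dim=0)                                    # importance per KV token
    keep = max(1, int(keep_ratio * k.shape[0]))
    idx = scores.topk(keep).indices
    return k[idx], v[idx]

q = torch.randn(77, 64)        # tokens of the frame being edited
k = torch.randn(4096, 64)      # keys gathered across keyframes
v = torch.randn(4096, 64)
k_s, v_s = slim_kv(q, k, v)
out = F.scaled_dot_product_attention(q[None], k_s[None], v_s[None])
```

Shrinking the KV sequence this way is what lets an order of magnitude more keyframes fit in the same memory budget.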

Authors:Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng
Title: UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding
Abstract:
Consistency models (CMs) have shown promise in the efficient generation of both image and text. This raises the natural question of whether we can learn a unified CM for efficient multimodal generation (e.g., text-to-image) and understanding (e.g., image-to-text). Intuitively, such a model could be acquired by applying the consistency distillation (CD) to existing unified multimodal models. However, the key challenge is establishing a unified denoising perspective for both image and text generation, which is essential for establishing the consistency mapping. To tackle this, at the representation level, we advocate for discrete tokens for both modalities to best preserve language modeling capabilities. Critically, instead of defining the text denoising trajectory via recent discrete diffusion language modeling principles, we specify it using the parallel decoding trace of an autoregressive language model, benefiting from the latter's superior performance in general text generation tasks. The denoising trajectory of image tokens adheres to standard discrete diffusion. We train our unified consistency models (UniCMs) on these combined multimodal trajectories simultaneously with a unified objective. We introduce a trajectory segmentation strategy to further improve the training convergence. Empirically, in text-to-image generation, UniCMs outperform SD3 on GenEval, Image Reward, and CLIP Score metrics, while requiring only approximately ${1}/{8}$ of the sampling time. Meanwhile, in image-to-text generation, UniCMs surpass Show-o on the MMMU benchmark while being $1.5 \times$ faster at long-sequence generating speed. The code is available at https://github.com/zhijie-group/UniCMs.
Chinese: 统一一致性模型(UniCMs)通过结合图像和文本的离散令牌去噪轨迹,实现了多模态内容的高效生成与理解,在文生图和图生文任务中均超越现有模型性能,同时大幅提升了生成速度。
English: Unified consistency models (UniCMs) efficiently generate and understand multimodal content by combining discrete token denoising trajectories for images and text, achieving superior performance in both text-to-image and image-to-text tasks with significantly faster speeds than existing models.

Authors:Shiyao Li, Antoine Guédon, Clémentin Boittiaux, Shizhe Chen, Vincent Lepetit
Title: NextBestPath: Efficient 3D Mapping of Unseen Environments
Abstract:
This work addresses the problem of active 3D mapping, where an agent must find an efficient trajectory to exhaustively reconstruct a new scene. Previous approaches mainly predict the next best view near the agent's location, which is prone to getting stuck in local areas. Additionally, existing indoor datasets are insufficient due to limited geometric complexity and inaccurate ground truth meshes. To overcome these limitations, we introduce a novel dataset AiMDoom with a map generator for the Doom video game, enabling better benchmarking of active 3D mapping in diverse indoor environments. Moreover, we propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views. The model jointly predicts accumulated surface coverage gains for long-term goals and obstacle maps, allowing it to efficiently plan optimal paths with a unified model. By leveraging online data collection, data augmentation and curriculum learning, NBP significantly outperforms state-of-the-art methods on both the existing MP3D dataset and our AiMDoom dataset, achieving more efficient mapping in indoor environments of varying complexity.

Authors:Yitian Long, Zhongze Wu, Xiu Su, Lining Yu, Ruining Deng, Haichun Yang, Yuankai Huo
Title: Towards Fine-grained Renal Vasculature Segmentation: Full-Scale Hierarchical Learning with FH-Seg
Abstract:
Accurate fine-grained segmentation of the renal vasculature is critical for nephrological analysis, yet it faces challenges due to diverse and insufficiently annotated images. Existing methods struggle to accurately segment intricate regions of the renal vasculature, such as the inner and outer walls, arteries and lesions. In this paper, we introduce FH-Seg, a Full-scale Hierarchical Learning Framework designed for comprehensive segmentation of the renal vasculature. Specifically, FH-Seg employs full-scale skip connections that merge detailed anatomical information with contextual semantics across scales, effectively bridging the gap between structural and pathological contexts. Additionally, we implement learnable hierarchical soft attention gates to adaptively reduce interference from non-core information, enhancing the focus on critical vascular features. To advance research on renal pathology segmentation, we also developed a Large Renal Vasculature (LRV) dataset, which contains 16,212 fine-grained annotated images of 5,600 renal arteries. Extensive experiments on the LRV dataset demonstrate FH-Seg's superior accuracy (71.23% Dice, 73.06% F1), outperforming Omni-Seg by 2.67 and 2.13 percentage points respectively. Code is available at: https://github.com/hrlblab/FH-seg.
中文: FH-Seg通过全尺度分层学习框架,结合多尺度解剖信息并采用可学习的注意力门来聚焦关键血管特征,在新型LRV数据集上实现了卓越的肾脏血管分割精度。
English: FH-Seg, a full-scale hierarchical learning framework, enhances renal vasculature segmentation by integrating multi-scale anatomical details and using attention gates to focus on critical features, achieving superior accuracy on the new LRV dataset.
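
A soft attention gate on a skip connection can be sketched as follows. This is hedged: channel sizes and layer choices are illustrative, in the spirit of the learnable gates described above rather than the FH-Seg implementation.

```python
import torch
import torch.nn as nn

class SoftAttentionGate(nn.Module):
    """Weights skip-connection features by a pixel-wise gate derived from a deeper signal."""
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, 1)   # transform skip features
        self.phi = nn.Conv2d(gate_ch, inter_ch, 1)     # transform gating signal
        self.psi = nn.Conv2d(inter_ch, 1, 1)           # scalar attention per pixel

    def forward(self, skip, gate):
        # gate is assumed already upsampled to the skip resolution
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(skip) + self.phi(gate))))
        return skip * attn                              # suppress non-core regions

gate = SoftAttentionGate(skip_ch=64, gate_ch=128, inter_ch=32)
out = gate(torch.randn(1, 64, 56, 56), torch.randn(1, 128, 56, 56))
```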

Authors:Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
Title: Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning
Abstract:
Dense contrastive representation learning (DCRL) has greatly improved the learning efficiency for image-dense prediction tasks, showing its great potential to reduce the large costs of medical image collection and dense annotation. However, the properties of medical images make correspondence discovery unreliable, raising the open problem of large-scale false positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric vIsual deNse sImilarity (GEMINI) learning, which embeds the homeomorphism prior into DCRL and enables reliable correspondence discovery for effective dense contrast. We propose deformable homeomorphism learning (DHL), which models the homeomorphism of medical images and learns to estimate a deformable mapping to predict the pixels' correspondence under topological preservation. It effectively reduces the searching space of pairing and drives an implicit and soft learning of negative pairs via a gradient. We also propose geometric semantic similarity (GSS), which extracts semantic information in features to measure the alignment degree for the correspondence learning. It promotes the learning efficiency and performance of the deformation, constructing positive pairs reliably. We implement two practical variants on two typical representation learning tasks in our experiments. Our promising results on seven datasets, outperforming existing methods, demonstrate the superiority of our approach. We will release our code on a companion link: https://github.com/YutingHe-list/GEMINI.
中文:GEMINI方法通过引入可变形同胚映射和几何语义相似性,将同胚先验融入密集对比表示学习中,有效减少了医学图像中的误匹配对,并在多个数据集上显著提升了性能表现。
English: The GEMINI method enhances dense contrastive representation learning for medical images by incorporating a homeomorphism prior through deformable mapping and geometric semantic similarity, effectively reducing false positives and negatives while improving performance across various datasets.

Authors:Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
Title: Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
Abstract:
The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attack, where available, and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.
Chinese: 大型模型通过大规模预训练重塑了人工智能格局,但其广泛应用也带来了严重的安全隐患,本综述系统梳理了相关威胁与防御策略,为构建安全AI体系提供重要参考。
English: Large models are revolutionizing AI across various fields but face significant safety risks, prompting a systematic review of threats and defenses to guide future secure development.

Authors:Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
Title: FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
Abstract:
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and accordingly adjust the prompt before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.
Chinese Summary: FlashVideo提出了一种两阶段框架,通过首先生成低分辨率视频确保提示准确性,再通过流匹配增强细节,在高效计算的同时实现了最先进的高分辨率视频生成。
English Summary: FlashVideo introduces a two-stage framework that efficiently balances computational cost and quality by first generating low-resolution videos with high prompt fidelity and then enhancing details through flow matching, achieving state-of-the-art high-resolution video generation.

Authors:Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang
Title: QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
Abstract:
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
中文: QLIP是一种视觉标记化方法,通过动态损失平衡和两阶段训练,将高质量图像重建与卓越的零样本理解能力相结合,并在多模态理解和图像生成任务中验证了其有效性。
English: QLIP is a visual tokenization method that integrates high-quality image reconstruction with superior zero-shot understanding, balancing reconstruction and language-image alignment through dynamic loss management and a two-stage training pipeline.
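
The binary-spherical-quantization step at the heart of the tokenizer can be sketched as below. This is hedged: it follows the common BSQ formulation with a straight-through estimator and may differ in detail from QLIP's released code.

```python
import torch
import torch.nn.functional as F

def bsq_quantize(z):
    """
    z: (..., d) latent vectors. Each is projected onto the unit sphere and then
    snapped to the nearest corner of the scaled hypercube {-1/sqrt(d), +1/sqrt(d)}^d,
    which also lies on the unit sphere.
    """
    d = z.shape[-1]
    u = F.normalize(z, dim=-1)            # point on the unit sphere
    q = torch.sign(u) / d ** 0.5          # binary code, rescaled back to the sphere
    return u + (q - u).detach()           # straight-through gradient to the encoder

z = torch.randn(16, 256)
tokens = bsq_quantize(z)                  # continuous-valued, gradient-friendly
codes = (torch.sign(z) > 0)               # the discrete bits actually stored
```

Because each dimension contributes one bit, the codebook is implicit (2^d entries) and never materialized, which keeps the quantizer cheap enough to pair with a CLIP-style alignment objective.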

Authors:Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, Xing Sun
Title: Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Abstract:
We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension on a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
Chinese: Long-VITA 是一种高效的大型多模态模型,擅长处理长上下文视觉语言任务,在基准测试中表现卓越,并通过创新的训练和推理技术实现了显著的加速和扩展能力。
English: Long-VITA is a highly efficient large multi-modal model that excels at processing long-context visual-language tasks, achieving state-of-the-art performance on benchmarks and offering significant speed and scalability improvements through innovative training and inference techniques.

Authors:Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
Title: AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
Abstract:
Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes.

Authors:Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Title: VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Abstract:
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce \textbf{VideoRoPE}, with a \textit{3D structure} designed to preserve spatio-temporal relationships. VideoRoPE features \textit{low-frequency temporal allocation} to mitigate periodic oscillations, a \textit{diagonal layout} to maintain spatial symmetry, and \textit{adjustable temporal spacing} to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at \href{https://github.com/Wiselnn570/VideoRoPE}{https://github.com/Wiselnn570/VideoRoPE}.
Chinese: 本文提出VideoRoPE,一种专为视频数据设计的旋转位置嵌入三维扩展方法,通过保持时空关系克服了先前变体的局限性,并在多种视频任务中表现优异。
English: This paper introduces VideoRoPE, a 3D extension of Rotary Position Embedding designed specifically for video data, which overcomes limitations of previous variants by preserving spatio-temporal relationships and outperforms them across various video tasks.
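
A simplified 3D rotary embedding conveys the channel-allocation idea. This sketch is hedged: it only illustrates splitting frequency pairs across the temporal, height, and width axes and assigning the lowest frequencies to time; VideoRoPE's diagonal layout and adjustable temporal spacing are omitted.

```python
import torch

def rope_3d(x, t, h, w, base=10000.0):
    """
    x: (n_tokens, d) with d divisible by 6; t, h, w: (n_tokens,) integer positions.
    Channel pairs are split across the three axes; the tail of inv_freq holds the
    lowest frequencies (longest wavelengths) and is assigned to the temporal axis.
    """
    d = x.shape[-1]
    n_pairs = d // 2
    inv_freq = base ** (-torch.arange(n_pairs, dtype=torch.float) / n_pairs)
    n_t = n_pairs // 3
    pos = torch.cat([
        t[:, None] * inv_freq[None, 2 * n_t:],      # time gets the low-frequency band
        h[:, None] * inv_freq[None, :n_t],
        w[:, None] * inv_freq[None, n_t:2 * n_t],
    ], dim=-1)                                      # (n_tokens, n_pairs)
    cos, sin = pos.cos(), pos.sin()
    x1, x2 = x[..., :n_pairs], x[..., n_pairs:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x = torch.randn(8 * 16 * 16, 192)                   # 8 frames of 16x16 patch tokens
grid = torch.stack(torch.meshgrid(
    torch.arange(8), torch.arange(16), torch.arange(16), indexing="ij"), -1).reshape(-1, 3)
out = rope_3d(x, grid[:, 0], grid[:, 1], grid[:, 2])
```

Giving time the slowly-oscillating frequencies is what the paper's V-NIAH-D analysis motivates: fast-oscillating temporal channels are easily confused by periodic distractors.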

Authors:Minh-Quan Le, Gaurav Mittal, Tianjian Meng, A S M Iftekhar, Vishwas Suryanarayanan, Barun Patra, Dimitris Samaras, Mei Chen
Title: Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
Abstract:
While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce $\textbf{Hummingbird}$, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks. Project page: https://roar-ai.github.io/hummingbird

Authors:Zefan Yang, Xuanang Xu, Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
Title: Chest X-ray Foundation Model with Global and Local Representations Integration
Abstract:
Chest X-ray (CXR) is the most frequently ordered imaging test, supporting diverse clinical tasks from thoracic disease detection to postoperative monitoring. However, task-specific classification models are limited in scope, require costly labeled data, and lack generalizability to out-of-distribution datasets. To address these challenges, we introduce CheXFound, a self-supervised vision foundation model that learns robust CXR representations and generalizes effectively across a wide range of downstream tasks. We pretrain CheXFound on a curated CXR-1M dataset, comprising over one million unique CXRs from publicly available sources. We propose a Global and Local Representations Integration (GLoRI) module for downstream adaptations, by incorporating disease-specific local features with global image features for enhanced performance in multilabel classification. Our experimental results show that CheXFound outperforms state-of-the-art models in classifying 40 disease findings across different prevalence levels on the CXR-LT 24 dataset and exhibits superior label efficiency on downstream tasks with limited training data. Additionally, CheXFound achieved significant improvements on new tasks with out-of-distribution datasets, including opportunistic cardiovascular disease risk estimation and mortality prediction. These results highlight CheXFound's strong generalization capabilities, enabling diverse adaptations with improved label efficiency. The project source code is publicly available at https://github.com/RPIDIAL/CheXFound.
中文:CheXFound是一种自监督视觉基础模型,能够学习稳健的胸部X光表征,在多种疾病分类任务中超越现有最优模型,并在不同任务中展现出卓越的泛化能力和标签效率。
English: CheXFound is a self-supervised vision foundation model that learns robust chest X-ray representations, outperforming state-of-the-art models in classifying diseases and demonstrating strong generalization with improved label efficiency across diverse tasks.
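
One way to integrate global and disease-specific local representations for multilabel classification is sketched below. This is hedged: the learnable per-finding queries and the concatenation-based classifier are assumptions for illustration, not the GLoRI module itself.

```python
import torch
import torch.nn as nn

class GlobalLocalHead(nn.Module):
    """One learnable query per finding attends over patch tokens; its pooled
    local feature is combined with the global image feature for that finding's logit."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(n_classes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 1)      # per-class logit from [global, local]

    def forward(self, patch_tokens, global_feat):
        # patch_tokens: (B, N, D); global_feat: (B, D)
        B = patch_tokens.shape[0]
        q = self.class_queries.unsqueeze(0).expand(B, -1, -1)          # (B, C, D)
        local, _ = self.attn(q, patch_tokens, patch_tokens)            # disease-specific pooling
        g = global_feat.unsqueeze(1).expand(-1, local.shape[1], -1)    # (B, C, D)
        return self.classifier(torch.cat([g, local], dim=-1)).squeeze(-1)  # (B, C)

head = GlobalLocalHead(dim=768, n_classes=40)
logits = head(torch.randn(2, 196, 768), torch.randn(2, 768))
```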

Authors:Gorkem Can Ates, Yu Xin, Kuang Gong, Wei Shao
Title: DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions
Abstract:
Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports. In zero-shot and fine-tuned detection of 18 pathologies, as well as in image-text retrieval tasks, DCFormer consistently outperforms state-of-the-art 3D vision encoders, including CT-ViT, ViT, ConvNeXt, PoolFormer, and TransUNet. These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs. Our code is available at: https://github.com/mirthAI/DCFormer.
Chinese: DCFormer提出了一种高效的3D图像编码器,通过分解的1D卷积在降低计算成本的同时保持视觉语言任务性能,在医学影像基准测试中超越了现有模型。
English: DCFormer introduces an efficient 3D image encoder using factorized 1D convolutions to reduce computational costs while maintaining performance in vision-language tasks, outperforming existing models on medical imaging benchmarks.
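
The decomposition itself is easy to sketch. This is hedged: how the three 1D branches are combined (summation here) is an assumption about one plausible wiring rather than the released DCFormer block.

```python
import torch
import torch.nn as nn

class DecomposedConv3d(nn.Module):
    """Replaces a dense k x k x k convolution with three parallel 1D convolutions
    along depth, height, and width."""
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        p = k // 2
        self.conv_d = nn.Conv3d(in_ch, out_ch, (k, 1, 1), padding=(p, 0, 0))
        self.conv_h = nn.Conv3d(in_ch, out_ch, (1, k, 1), padding=(0, p, 0))
        self.conv_w = nn.Conv3d(in_ch, out_ch, (1, 1, k), padding=(0, 0, p))

    def forward(self, x):                       # x: (B, C, D, H, W)
        return self.conv_d(x) + self.conv_h(x) + self.conv_w(x)

# A dense kernel costs O(k^3) weights per channel pair; the decomposed block
# costs O(3k), which is where the parameter/FLOP savings come from.
block = DecomposedConv3d(1, 32, k=7)
y = block(torch.randn(1, 1, 64, 128, 128))
```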

Authors:Loïck Chambon, Eloi Zablocki, Alexandre Boulch, Mickaël Chen, Matthieu Cord
Title: GaussRender: Learning 3D Occupancy with Gaussian Rendering
Abstract:
Understanding the 3D geometry and semantics of driving scenes is critical for safe autonomous driving. Recent advances in 3D occupancy prediction have improved scene representation but often suffer from visual inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce visible geometric coherence. In this paper, we propose GaussRender, a module that improves 3D occupancy learning by enforcing projective consistency. Our key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views, where we apply supervision. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure. To achieve this efficiently, we leverage differentiable rendering with Gaussian splatting. GaussRender seamlessly integrates with existing architectures while maintaining efficiency and requiring no inference-time modifications. Extensive evaluations on multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) demonstrate that GaussRender significantly improves geometric fidelity across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), achieving state-of-the-art results, particularly on surface-sensitive metrics such as RayIoU. The code is open-sourced at https://github.com/valeoai/GaussRender.
中文: 本文提出的GaussRender模块通过可微分渲染强制投影一致性来改进3D占据预测,在多个基准测试中显著提升了几何保真度,且不影响推理效率。
English: This paper introduces GaussRender, a module that enhances 3D occupancy prediction by enforcing projective consistency through differentiable rendering, significantly improving geometric fidelity across multiple benchmarks without altering inference efficiency.

Authors:Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J. Harrison
Title: SurGen: 1020 H&E-stained Whole Slide Images With Survival and Genetic Markers
Abstract:
$\textbf{Background}$: Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine. $\textbf{Results}$: We present SurGen, a dataset comprising 1,020 H&E-stained whole slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. To demonstrate SurGen's practical utility, we conducted a proof-of-concept machine learning experiment predicting mismatch repair status from the WSIs, achieving a test AUROC of 0.8316. These preliminary results underscore the dataset's potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer. $\textbf{Conclusions}$: SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset's capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online at https://doi.org/10.6019/S-BIAD1285.
中文:SurGen数据集整合了1020张结直肠癌全切片图像及其基因突变注释与生存数据,通过机器学习模型成功预测错配修复状态,证实了其在推进精准医疗研究中的实用价值。
English: The SurGen dataset provides 1,020 whole slide images from colorectal cancer cases with genetic mutation annotations and survival data, demonstrating its utility through a machine learning model that achieved high accuracy in predicting mismatch repair status.

Authors:Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
Title: Goku: Flow Based Video Generative Foundation Models
Abstract:
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
中文: 本文介绍了Goku模型系列,它采用整流流Transformer实现联合图像与视频生成,在定性和定量评估中均创下行业新标杆。
English: This paper presents Goku, a cutting-edge family of joint image-and-video generation models using rectified flow Transformers that set new industry benchmarks in both qualitative and quantitative evaluations.

Authors:Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, Jianke Zhu
Title: HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation
Abstract:
Human motion video generation has advanced significantly, while existing methods still struggle with accurately rendering detailed body parts like hands and faces, especially in long sequences and intricate motions. Current approaches also rely on fixed resolution and struggle to maintain visual consistency. To address these limitations, we propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a large and wild dataset containing 14,000 hours of high-quality video to produce high-fidelity videos with fine-grained body rendering. Specifically, (i) HumanDiT, built on DiT, supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation; (ii) we introduce a prefix-latent reference strategy to maintain personalized characteristics across extended sequences. Furthermore, during inference, HumanDiT leverages Keypoint-DiT to generate subsequent pose sequences, facilitating video continuation from static images or existing videos. It also utilizes a Pose Adapter to enable pose transfer with given sequences. Extensive experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.
中文:HumanDiT是一种基于姿态引导扩散Transformer的框架,通过支持多种分辨率和采用前缀潜在参考策略,有效解决了长视频序列中身体细节渲染和视觉一致性的问题,并在14,000小时视频数据上训练实现高质量生成。
English: HumanDiT is a pose-guided Diffusion Transformer framework that overcomes limitations in rendering detailed body parts and maintaining visual consistency in long video sequences by supporting multiple resolutions and using a prefix-latent reference strategy, trained on 14,000 hours of video data.

Authors:Wonjun Lee, Doehyeon Lee, Eugene Choi, Sangyoon Yu, Ashkan Yousefpour, Haon Park, Bumsub Ham, Suhyun Kim
Title: ELITE: Enhanced Language-Image Toxicity Evaluation for Safety
Abstract:
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or produce inaccurate evaluations. Moreover, we found that existing benchmarks suffer from low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but unharmful descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.

Authors:Yijun Wang, Yong Wang, Chendong xu, Shuai Yao, Qisong Wu
Title: SelaFD: Seamless Adaptation of Vision Transformer Fine-tuning for Radar-based Human Activity Recognition
Abstract:
Human Activity Recognition (HAR) such as fall detection has become increasingly critical due to the aging population, necessitating effective monitoring systems to prevent serious injuries and fatalities associated with falls. This study focuses on fine-tuning the Vision Transformer (ViT) model specifically for HAR using radar-based Time-Doppler signatures. Unlike traditional image datasets, these signals present unique challenges due to their non-visual nature and the high degree of similarity among various activities. Directly fine-tuning the ViT with all parameters proves suboptimal for this application. To address this challenge, we propose a novel approach that employs Low-Rank Adaptation (LoRA) fine-tuning in the weight space to facilitate knowledge transfer from pre-trained ViT models. Additionally, to extract fine-grained features, we enhance feature representation through the integration of a serial-parallel adapter in the feature space. Our innovative joint fine-tuning method, tailored for radar-based Time-Doppler signatures, significantly improves HAR accuracy, surpassing existing state-of-the-art methodologies in this domain. Our code is released at https://github.com/wangyijunlyy/SelaFD.
中文: 本研究提出了一种新颖的联合微调方法,通过低秩适应和串并联适配器优化视觉变换器在雷达时频信号上的人类活动识别性能,实现了超越现有方法的更高精度。
English: This study introduces a novel joint fine-tuning approach using Low-Rank Adaptation and serial-parallel adapters to optimize Vision Transformers for human activity recognition with radar-based Time-Doppler signatures, achieving superior accuracy over existing methods.
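As a concrete illustration of the weight-space adaptation described above, the sketch below wraps a frozen linear layer of a pre-trained ViT with a trainable low-rank residual in the standard LoRA style; the rank, scaling factor, and which projections to wrap are illustrative assumptions rather than the SelaFD configuration.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update (W + B A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pre-trained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

In a ViT, such a wrapper would typically replace the attention projections while leaving the rest of the backbone untouched; the serial-parallel adapter in the feature space is a separate module not sketched here.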

Authors:Zhiqiang Yang, Qiu Guan, Zhongwen Yu, Xinli Xu, Haixia Long, Sheng Lian, Haigen Hu, Ying Tang
Title: MHAF-YOLO: Multi-Branch Heterogeneous Auxiliary Fusion YOLO for accurate object detection
Abstract:
Due to the effective multi-scale feature fusion capabilities of the Path Aggregation FPN (PAFPN), it has become a widely adopted component in YOLO-based detectors. However, PAFPN struggles to integrate high-level semantic cues with low-level spatial details, limiting its performance in real-world applications, especially with significant scale variations. In this paper, we propose MHAF-YOLO, a novel detection framework featuring a versatile neck design called the Multi-Branch Auxiliary FPN (MAFPN), which consists of two key modules: the Superficial Assisted Fusion (SAF) and Advanced Assisted Fusion (AAF). The SAF bridges the backbone and the neck by fusing shallow features, effectively transferring crucial low-level spatial information with high fidelity. Meanwhile, the AAF integrates multi-scale feature information at deeper neck layers, delivering richer gradient information to the output layer and further enhancing the model learning capacity. To complement MAFPN, we introduce the Global Heterogeneous Flexible Kernel Selection (GHFKS) mechanism and the Reparameterized Heterogeneous Multi-Scale (RepHMS) module to enhance feature fusion. RepHMS is globally integrated into the network, utilizing GHFKS to select larger convolutional kernels for various feature layers, expanding the vertical receptive field and capturing contextual information across spatial hierarchies. Locally, it optimizes convolution by processing both large and small kernels within the same layer, broadening the lateral receptive field and preserving crucial details for detecting smaller targets. The source code of this work is available at: https://github.com/yang-0201/MHAF-YOLO.
中文: MHAF-YOLO框架采用多分支辅助特征金字塔网络,通过浅层辅助融合和高级辅助融合模块优化多尺度特征整合,并结合全局异构卷积核选择与重参数化多尺度模块,有效扩大感受野并保留细节,显著提升了多尺度目标检测性能。
English: The MHAF-YOLO framework introduces a Multi-Branch Auxiliary FPN with SAF and AAF modules to better fuse multi-scale features, complemented by GHFKS and RepHMS mechanisms that enhance receptive fields and preserve details for improved object detection across varying scales.

Authors:Yong Li, Yingjing Huang, Gengchen Mai, Fan Zhang
Title: Learning Street View Representations with Spatiotemporal Contrast
Abstract:
Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is challenging for existing image representations to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery to address downstream tasks related to the city. In this work, we propose an innovative self-supervised learning framework that leverages temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. By employing street view images captured at the same location over time and spatially nearby views at the same time, we construct contrastive learning tasks designed to learn the temporal-invariant characteristics of the built environment and the spatial-invariant neighborhood ambiance. Our approach significantly outperforms traditional supervised and unsupervised methods in tasks such as visual place recognition, socioeconomic estimation, and human-environment perception. Moreover, we demonstrate the varying behaviors of image representations learned through different contrastive learning objectives across various downstream tasks. This study systematically discusses representation learning strategies for urban studies based on street view images, providing a benchmark that enhances the applicability of visual data in urban science. The code is available at https://github.com/yonglleee/UrbanSTCL.
中文摘要:本研究提出一种创新自监督学习框架,利用街景图像的时空特性学习城市动态环境表征,在视觉场所识别、社会经济评估等任务中显著优于传统方法,为城市研究提供了有效的视觉表征基准。
English Summary: This paper introduces a self-supervised learning framework that uses temporal and spatial attributes of street view imagery to effectively represent dynamic urban environments, built environments, and neighborhood ambiance, outperforming existing methods in various urban tasks.
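The abstract describes two contrastive objectives: images of the same location captured at different times act as temporal positives, and spatially adjacent images captured at the same time act as spatial positives. The sketch below pairs a standard InfoNCE loss with that batch construction; the encoder, temperature, and equal loss weighting are assumptions, not the UrbanSTCL implementation.

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE: the positive for each anchor is the row with the same index."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def spatiotemporal_loss(encoder, imgs_t0, imgs_t1, imgs_near_a, imgs_near_b):
    # Temporal pairs: same location, different capture times (built environment).
    loss_temporal = info_nce(encoder(imgs_t0), encoder(imgs_t1))
    # Spatial pairs: nearby locations, same capture time (neighborhood ambiance).
    loss_spatial = info_nce(encoder(imgs_near_a), encoder(imgs_near_b))
    return loss_temporal + loss_spatial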

Authors:Kunxiao Liu, Guowu Yuan, Hongyu Liu, Hao Wu
Title: Multiscale style transfer based on a Laplacian pyramid for traditional Chinese painting
Abstract:
Style transfer is adopted to synthesize appealing stylized images that preserve the structure of a content image but carry the pattern of a style image. Many recently proposed style transfer methods use only western oil paintings as style images to achieve image stylization. As a result, unnatural messy artistic effects are produced in stylized images when using these methods to directly transfer the patterns of traditional Chinese paintings, which are composed of plain colors and abstract objects. Moreover, most of them work only at the original image scale and thus ignore multiscale image information during training. In this paper, we present a novel effective multiscale style transfer method based on Laplacian pyramid decomposition and reconstruction, which can transfer unique patterns of Chinese paintings by learning different image features at different scales. In the first stage, the holistic patterns are transferred at low resolution by adopting a Style Transfer Base Network. Then, the details of the content and style are gradually enhanced at higher resolutions by a Detail Enhancement Network with an edge information selection (EIS) module in the second stage. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a comparison with some state-of-the-art style transfer methods. Datasets and codes are available at https://github.com/toby-katakuri/LP_StyleTransferNet.
Chinese: 本文提出了一种基于拉普拉斯金字塔分解的多尺度风格迁移方法,通过在不同尺度学习图像特征,有效迁移中国传统绘画的独特图案,解决了现有方法在处理此类风格时产生不自然效果且忽略多尺度信息的问题。
English: This paper introduces a multiscale style transfer method using Laplacian pyramid decomposition to effectively transfer patterns from traditional Chinese paintings by learning features at different scales, overcoming limitations of existing methods that produce unnatural results with such styles and ignore multiscale information.
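The two-stage pipeline rests on ordinary Laplacian pyramid decomposition and reconstruction, with stylization applied at the coarse level and details refined at finer levels. The sketch below shows just that pyramid machinery in plain PyTorch; the number of levels and the bilinear resampling are illustrative choices, not the paper's exact implementation.

import torch.nn.functional as F

def build_laplacian_pyramid(img, levels=3):
    """Decompose (B, C, H, W) into band-pass levels plus a low-resolution residual."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)                 # coarse copy
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)                                # high-frequency band
        current = down
    pyramid.append(current)                                         # low-resolution base
    return pyramid

def reconstruct_from_pyramid(pyramid):
    """Invert the decomposition by upsampling the base and adding the bands back."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = F.interpolate(current, size=band.shape[-2:], mode="bilinear",
                                align_corners=False) + band
    return current

In the method described above, the Style Transfer Base Network would operate on the low-resolution base while the Detail Enhancement Network progressively refines the higher-frequency bands.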

Authors:Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, Hao Zhang
Title: Fast Video Generation with Sliding Tile Attention
Abstract:
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench. We make our codebase public at https://github.com/hao-ai-lab/FastVideo.
中文: 本文提出滑动分块注意力(STA),通过聚焦局部三维窗口的硬件高效方法,将视频生成中的注意力计算时间最多减少17倍且不损失质量。
English: This paper introduces sliding tile attention (STA), a hardware-efficient method that accelerates video generation by focusing on localized 3D windows, reducing attention computation time by up to 17 times without quality loss.
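To make the tile-wise locality concrete, the sketch below builds a block-level mask in which each (t, h, w) tile may attend only to tiles inside a fixed 3D window; the grid and window sizes are illustrative, and the actual kernel works on contiguous tiles for hardware efficiency rather than materializing a dense mask.

import torch

def sliding_tile_mask(grid_t, grid_h, grid_w, window=(3, 3, 3)):
    """Boolean (N, N) mask over tiles, True where a tile pair may attend (N = t*h*w)."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_t), torch.arange(grid_h), torch.arange(grid_w),
        indexing="ij"), dim=-1).reshape(-1, 3)                 # (N, 3) tile coordinates
    diff = (coords[:, None, :] - coords[None, :, :]).abs()     # pairwise tile offsets
    radius = torch.tensor([w // 2 for w in window])
    return (diff <= radius).all(dim=-1)

mask = sliding_tile_mask(6, 8, 8)          # e.g. a 6x8x8 tile grid with a 3x3x3 tile window
print(mask.float().mean())                 # fraction of tile pairs that remain

Because attention scores in pretrained video diffusion models concentrate in such local windows, most tile pairs can be skipped, which is where the reported speedups over full attention come from.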

Authors:Imad Eddine Marouf, Enzo Tartaglione, Stephane Lathuiliere, Joost van de Weijer
Title: Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
Abstract:
Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead, but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior task questions to counteract overfitting to the answer space of the current task, addressing the out-of-answer-set problem. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA. Code is available at: https://github.com/IemProg/QUAD.
中文:QUAD是一种新颖的视觉问答持续学习方法,仅通过问题回放和注意力蒸馏机制,在不存储视觉数据的情况下保持跨任务稳定性,在基准数据集上实现了卓越性能。
English: QUAD is a novel continual learning method for Visual Question Answering that uses question-only replay and attention distillation to maintain stability across tasks without storing visual data, achieving superior performance on benchmark datasets.

Authors:Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang
Title: Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
Abstract:
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.
Chinese: Time-VLM提出了一种新颖的多模态框架,利用预训练的视觉语言模型整合时序、视觉和文本数据以提升时间序列预测性能,在少样本和零样本场景下表现尤为卓越。
English: Time-VLM introduces a novel multimodal framework that integrates temporal, visual, and textual data using pre-trained vision-language models to enhance time series forecasting, achieving superior performance especially in few-shot and zero-shot settings.

Authors:Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Title: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Abstract:
We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository(https://github.com/eai-lab/On-device-Sora).
中文: On-device Sora 是一种无需训练的解决方案,通过线性比例跳跃、时间维度令牌合并和动态加载并发推理三项新技术,在智能手机上实现高效文本到视频生成,其生成质量可与高端GPU相媲美。
English: On-device Sora is a training-free solution that enables efficient text-to-video generation on smartphones through three novel techniques—Linear Proportional Leap, Temporal Dimension Token Merging, and Concurrent Inference with Dynamic Loading—producing high-quality videos comparable to those from high-end GPUs.
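A rough sketch of the temporal token-merging idea: consecutive latent tokens along the temporal axis are averaged in pairs before the attention layers and expanded back afterwards, cutting the quadratic attention cost. The tensor layout and merge factor below are assumptions for illustration, not the On-device Sora code.

import torch

def merge_temporal_tokens(x, factor=2):
    """x: (B, T, S, D) latent tokens; average groups of `factor` consecutive frames."""
    b, t, s, d = x.shape
    t_trim = (t // factor) * factor                   # drop a trailing frame if needed
    x = x[:, :t_trim].reshape(b, t_trim // factor, factor, s, d)
    return x.mean(dim=2)                              # (B, T // factor, S, D)

def unmerge_temporal_tokens(x, factor=2):
    """Repeat merged tokens so later layers see the original temporal length."""
    return x.repeat_interleave(factor, dim=1)

tokens = torch.randn(1, 16, 256, 64)
reduced = merge_temporal_tokens(tokens)               # attention now sees 8 frames, not 16
restored = unmerge_temporal_tokens(reduced)
print(reduced.shape, restored.shape)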

Authors:Afshin Khadangi, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Title: CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements
Abstract:
Art, as a universal language, can be interpreted in diverse ways, with artworks embodying profound meanings and nuances. The advent of Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these transformative models can be used to assess and interpret the artistic elements of artworks. While research has been conducted in this domain, to the best of our knowledge, a deep and detailed understanding of the technical and expressive features of artworks using LLMs has not been explored. In this study, we investigate the automation of a formal art analysis framework to analyze a high-throughput number of artworks rapidly and examine how their patterns evolve over time. We explore how LLMs can decode artistic expressions, visual elements, composition, and techniques, revealing emerging patterns that develop across periods. Finally, we discuss the strengths and limitations of LLMs in this context, emphasizing their ability to process vast quantities of art-related data and generate insightful interpretations. Due to the exhaustive and granular nature of the results, we have developed interactive data visualizations, available online https://cognartive.github.io/, to enhance understanding and accessibility.

Authors:Junjie Ye, David Paz, Hengyuan Zhang, Yuliang Guo, Xinyu Huang, Henrik I. Christensen, Yue Wang, Liu Ren
Title: SMART: Advancing Scalable Map Priors for Driving Topology Reasoning
Abstract:
Topology reasoning is crucial for autonomous driving as it enables comprehensive understanding of connectivity and relationships between lanes and traffic elements. While recent approaches have shown success in perceiving driving topology using vehicle-mounted sensors, their scalability is hindered by the reliance on training data captured by consistent sensor configurations. We identify that the key factor in scalable lane perception and topology reasoning is the elimination of this sensor-dependent feature. To address this, we propose SMART, a scalable solution that leverages easily available standard-definition (SD) and satellite maps to learn a map prior model, supervised by large-scale geo-referenced high-definition (HD) maps independent of sensor settings. Attributed to scaled training, SMART alone achieves superior offline lane topology understanding using only SD and satellite inputs. Extensive experiments further demonstrate that SMART can be seamlessly integrated into any online topology reasoning methods, yielding significant improvements of up to 28% on the OpenLane-V2 benchmark.

Authors:Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Title: Ola: Pushing the Frontiers of Omni-Modal Language Model
Abstract:
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, pushing the frontiers of the omni-modal language model to a large extent. We conduct a comprehensive exploration of architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
中文摘要:本文介绍了Ola全模态语言模型,通过架构改进和渐进式训练方法,在图像、视频和音频理解方面实现了与专业模型相媲美的性能,并完全开源以推动该领域未来发展。
English summary: The paper introduces Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding through architectural improvements and a progressive training approach, while being fully open-sourced to advance future research.

Authors:Lirui Wang, Kevin Zhao, Chaoqi Liu, Xinlei Chen
Title: Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression
Abstract:
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks. HMA uses masked autoregression to generate quantized or soft tokens for video predictions. HMA achieves better visual fidelity and controllability than the previous robotic video generation models with 15 times faster speed in the real world. After post-training, this model can be used as a video simulator from low-level action inputs for evaluating policies and generating synthetic data. See this link https://liruiw.github.io/hma for more information.

Authors:Weihang Li, Hongli Xu, Junwen Huang, Hyunjun Jung, Peter KT Yu, Nassir Navab, Benjamin Busam
Title: GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation
Abstract:
A key challenge in model-free category-level pose estimation is the extraction of contextual object features that generalize across varying instances within a specific category. Recent approaches leverage foundational features to capture semantic and geometry cues from data. However, these approaches fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction utilizing class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating category-level global context prior. GCE-Pose performs semantic shape reconstruction with a proposed Semantic Shape Reconstruction (SSR) module. Given an unseen partial RGB-D object instance, our SSR module reconstructs the instance's global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We further introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from partial RGB-D observations and the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on challenging real-world datasets HouseCat6D and NOCS-REAL275. Our project page is available at https://colin-de.github.io/GCE-Pose/.
中文:GCE-Pose提出了一种新颖的位姿估计方法,通过语义形状变形重建全局几何特征并与局部观测特征融合,有效解决了局部可见性难题,在基准数据集上实现了最先进的性能。
English: GCE-Pose introduces a novel pose estimation method that overcomes partial visibility limitations by reconstructing global geometry through semantic shape deformation and fusing features with partial observations, achieving state-of-the-art performance on benchmark datasets.

Authors:Yi Yu, Botao Ren, Peiyuan Zhang, Mingxin Liu, Junwei Luo, Shaofeng Zhang, Feipeng Da, Junchi Yan, Xue Yang
Title: Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances
Abstract:
With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At the core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through watershed on Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To our best knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is elegant and lightweight, yet it is expected to give a competitive performance especially in densely packed scenes: 62.61%/86.15%/34.71% on DOTA/HRSC/FAIR1M. Code is available at https://github.com/VisionXLab/point2rbox-v2.
Chinese: Point2RBox-v2提出了一种基于点标注的弱监督旋转目标检测新方法,通过高斯重叠、Voronoi分水岭和一致性损失三大核心原则学习实例间的空间布局,在密集场景中实现了优越性能。
English: Point2RBox-v2 introduces a novel weakly-supervised approach for oriented object detection using point annotations, employing three key principles—Gaussian overlap, Voronoi watershed, and consistency losses—to effectively learn instance layouts and achieve competitive results in densely packed scenes.
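The Gaussian overlap principle can be sketched as modeling every instance as a 2D Gaussian and penalizing pairwise overlap. The version below uses axis-aligned variances and the Bhattacharyya coefficient as the overlap measure; both are simplifying assumptions for illustration, not the paper's exact formulation (which handles rotated boxes).

import torch

def gaussian_overlap_loss(mu, var):
    """mu: (N, 2) instance centers; var: (N, 2) axis-aligned variances.
    Returns the mean pairwise Bhattacharyya coefficient as an overlap penalty."""
    mu_diff = mu[:, None, :] - mu[None, :, :]                  # (N, N, 2)
    var_mean = 0.5 * (var[:, None, :] + var[None, :, :])       # averaged covariances
    term1 = 0.125 * (mu_diff ** 2 / var_mean).sum(-1)
    term2 = 0.5 * torch.log(var_mean.prod(-1)
                            / torch.sqrt(var.prod(-1)[:, None] * var.prod(-1)[None, :]))
    overlap = torch.exp(-(term1 + term2))                      # Bhattacharyya coefficient
    off_diag = ~torch.eye(mu.size(0), dtype=torch.bool)
    return overlap[off_diag].mean()

Minimizing this term pushes neighboring instances apart, giving each point-annotated object an upper bound on its extent; the Voronoi watershed and consistency losses described above supply the complementary lower bound and augmentation constraint.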

Authors:Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov
Title: Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Abstract:
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
中文摘要:本研究发现,由于模态内不对齐问题,在图像检索等模态内任务中单独使用CLIP的文本或图像编码器效果欠佳,但通过跨模态映射方法能显著提升性能。
English Summary: This study reveals that using CLIP's individual text or image encoders for intra-modal tasks is suboptimal due to intra-modal misalignment, but performance significantly improves by adopting inter-modal approaches through modality inversion techniques.

Authors:Shangkun Sun, Xiaoyu Liang, Bowen Qu, Wei Gao
Title: Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency
Abstract:
The advent of next-generation video generation models like Sora poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially mitigate flickering artifacts prevalent in prior models, enable longer and complex text prompts and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods designed for simple text and basic motion patterns struggle to evaluate these content-rich videos. To this end, we propose CRAVE (Content-Rich AIGC Video Evaluator), specifically for the evaluation of Sora-era AIGC videos. CRAVE proposes the multi-granularity text-temporal fusion that aligns long-form complex textual semantics with video dynamics. Additionally, CRAVE leverages the hybrid motion-fidelity modeling to assess temporal artifacts. Furthermore, given the straightforward prompts and content in current AIGC VQA datasets, we introduce CRAVE-DB, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments have shown that the proposed CRAVE achieves excellent results on multiple AIGC VQA benchmarks, demonstrating a high degree of alignment with human perception. All data and code will be publicly available at https://github.com/littlespray/CRAVE.
中文: 提出的CRAVE模型通过多粒度文本-时序融合和混合运动-保真度建模,有效解决了评估内容丰富的AIGC视频的挑战,实现了与人类感知的高度一致。
English: The proposed CRAVE model effectively addresses the challenge of assessing content-rich AIGC videos by aligning complex text semantics with video dynamics and modeling temporal artifacts, achieving high alignment with human perception.

Authors:Aleksandar Cvejic, Abdelrahman Eldesokey, Peter Wonka
Title: PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models
Abstract:
We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Diffusion-based image editing approaches capitalized on the deep understanding of diffusion models of image semantics to perform a variety of edits. However, existing diffusion models lack sufficient understanding of many object parts, hindering fine-grained edits requested by users. To address this, we propose to expand the knowledge of pre-trained diffusion models to allow them to understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step to localize the editing region. Leveraging these masks, we design feature-blending and adaptive thresholding strategies to execute the edits seamlessly. To evaluate our approach, we establish a benchmark and an evaluation protocol for part editing. Experiments show that our approach outperforms existing editing methods on all metrics and is preferred by users 66-90% of the time in conducted user studies.

Authors:Yu Yuan, Shizhao Sun, Qi Liu, Jiang Bian
Title: CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing
Abstract:
Computer Aided Design (CAD) is indispensable across various industries. Text-based CAD editing, which automates the modification of CAD models based on textual instructions, holds great potential but remains underexplored. Existing methods primarily focus on design variation generation or text-based CAD generation, either lacking support for text-based control or neglecting existing CAD models as constraints. We introduce CAD-Editor, the first framework for text-based CAD editing. To address the challenge of demanding triplet data with accurate correspondence for training, we propose an automated data synthesis pipeline. This pipeline utilizes design variation models to generate pairs of original and edited CAD models and employs Large Vision-Language Models (LVLMs) to summarize their differences into editing instructions. To tackle the composite nature of text-based CAD editing, we propose a locate-then-infill framework that decomposes the task into two focused sub-tasks: locating regions requiring modification and infilling these regions with appropriate edits. Large Language Models (LLMs) serve as the backbone for both sub-tasks, leveraging their capabilities in natural language understanding and CAD knowledge. Experiments show that CAD-Editor achieves superior performance both quantitatively and qualitatively. The code is available at https://github.com/microsoft/CAD-Editor.
中文摘要:CAD-Editor是首个基于文本的CAD编辑框架,通过自动化数据合成流程和结合大型语言模型的"定位-填充"方法,实现了通过文本指令修改CAD模型的卓越性能。
English Summary: CAD-Editor is the first framework for text-based CAD editing, using an automated data synthesis pipeline and a locate-then-infill approach with LLMs to achieve superior performance in modifying CAD models through textual instructions.

Authors:Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat
Title: LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models
Abstract:
Visual-language foundation models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on large-scale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FMs across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher resolution models are less robust against LR. Our analysis further reveals that the model makes semantically reasonable predictions at LR, and the lack of fine-grained details in input adversely impacts the model's initial layers more than the deeper layers. We use these insights and introduce a simple strategy, LR-TK0, to enhance the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low-resolution across several datasets and its generalization capability across backbones and other approaches. Code is available at https://github.com/shyammarjit/LR0.FM
中文摘要:视觉语言基础模型在零样本任务中表现出色,但在低分辨率图像上的鲁棒性不足;为此提出的LR0.FM基准和LR-TK0增强策略,可在不修改预训练权重的情况下有效提升模型对低分辨率输入的适应能力。
English Summary: Visual-language foundation models show strong zero-shot capabilities but struggle with low-resolution images, leading to the development of the LR0.FM benchmark and a simple enhancement strategy called LR-TK0 that improves robustness without altering pre-trained weights.

Authors:Xiangyu Wu, Feng Yu, Qing-Guo Chen, Yang Yang, Jianfeng Lu
Title: Multi-Label Test-Time Adaptation with Bound Entropy Minimization
Abstract:
Mainstream test-time adaptation (TTA) techniques endeavor to mitigate distribution shifts via entropy minimization for multi-class classification, inherently increasing the probability of the most confident class. However, when encountering multi-label instances, the primary challenge stems from the varying number of labels per image, and prioritizing only the highest probability class inevitably undermines the adaptation of other positive labels. To address this issue, we investigate TTA within the multi-label scenario (ML-TTA), developing a Bound Entropy Minimization (BEM) objective to simultaneously increase the confidence of multiple top predicted labels. Specifically, to determine the number of labels for each augmented view, we retrieve a paired caption with yielded textual labels for that view. These labels are allocated to both the view and caption, called the weak label set and strong label set with the same size k. Following this, the proposed BEM considers the highest top-k predicted labels from the view and caption as a single entity, respectively, learning both view and caption prompts concurrently. By binding the top-k predicted labels, BEM overcomes the limitation of vanilla entropy minimization, which exclusively optimizes the most confident class. Across the MSCOCO, VOC, and NUSWIDE multi-label datasets, our ML-TTA framework equipped with BEM exhibits superior performance compared to the latest SOTA methods, across various model architectures, prompt initialization, and varying label scenarios. The code is available at https://github.com/Jinx630/ML-TTA.
中文摘要:本研究针对多标签测试时适应场景,提出边界熵最小化方法,通过利用配对标题同时提升多个预测标签的置信度,有效克服了传统熵最小化仅优化最置信类别的局限性。
English Summary: This study introduces a Bound Entropy Minimization (BEM) method for multi-label test-time adaptation, which simultaneously boosts confidence in multiple top predicted labels by leveraging paired captions and overcoming limitations of traditional entropy minimization.
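A minimal sketch of the bound-entropy idea: rather than sharpening only the single most confident class, the top-k predicted labels of each sample are bound together and their combined probability mass is pushed up. Reading BEM as the negative log of that bound mass, and the per-sample k values, are assumptions for illustration; the prompt-learning machinery is omitted.

import torch
import torch.nn.functional as F

def bound_entropy_loss(logits, k):
    """logits: (B, C) class logits; k: (B,) number of labels assigned to each sample.
    Treats the top-k labels of a sample as one entity and maximizes their total mass."""
    probs = F.softmax(logits, dim=-1)
    losses = []
    for p, ki in zip(probs, k.tolist()):
        top_mass = p.topk(ki).values.sum()          # probability mass bound to top-k labels
        losses.append(-torch.log(top_mass + 1e-8))
    return torch.stack(losses).mean()

logits = torch.randn(4, 80, requires_grad=True)
k = torch.tensor([2, 1, 3, 2])                       # label counts from retrieved captions
bound_entropy_loss(logits, k).backward()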

Authors:Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Fengyu Ye, Lida Chen, Wei Wang, Qinghua Huang
Title: A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma
Abstract:
Hepatocellular carcinoma (HCC), ranking as the third leading cause of cancer-related mortality worldwide, demands urgent improvements in early detection to enhance patient survival. While ultrasound remains the preferred screening modality due to its cost-effectiveness and real-time capabilities, its sensitivity (59%-78%) heavily relies on radiologists' expertise, leading to inconsistent diagnostic outcomes and operational inefficiencies. Recent advancements in AI technology offer promising solutions to bridge this gap. This study introduces the Hierarchical Sparse Query Transformer (HSQformer), a novel hybrid architecture that synergizes CNNs' local feature extraction with Vision Transformers' global contextual awareness through latent space representation and sparse learning. By dynamically activating task-specific experts via a Mixture-of-Experts (MoE) framework, HSQformer achieves hierarchical feature integration without structural redundancy. Evaluated across three clinical scenarios: single-center, multi-center, and high-risk patient cohorts, HSQformer outperforms state-of-the-art models (e.g., 95.38% AUC in multi-center testing) and matches senior radiologists' diagnostic accuracy while significantly surpassing junior counterparts. These results highlight the potential of AI-assisted tools to standardize HCC screening, reduce dependency on human expertise, and improve early diagnosis rates. The full code is available at https://github.com/Asunatan/HSQformer.
中文: 本研究提出的HSQformer模型融合了CNN与视觉Transformer,通过多中心临床验证不仅超越现有检测方法,更达到资深放射科医生诊断水平,为肝细胞癌早期筛查提供了标准化解决方案。
English: The study introduces HSQformer, an AI model combining CNNs and Vision Transformers, which outperforms existing methods in detecting hepatocellular carcinoma across multiple clinical settings and matches senior radiologists' accuracy, offering a standardized screening solution to improve early diagnosis.

Authors:Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie
Title: Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Abstract:
Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future works of building non-compressive vision models. Code is available at https://github.com/wangf3014/Patch_Scaling.
中文: 研究表明,在视觉Transformer中减小补丁尺寸能持续提升预测性能,最小1x1像素标记化效果最佳,这一发现适用于多种任务和架构。
English: The study reveals that reducing patch sizes in Vision Transformers consistently enhances predictive performance, with the best results achieved at the smallest 1x1 pixel tokenization, applicable across various tasks and architectures.
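The token count in the title follows directly from the patchification arithmetic: a 224x224 image cut into PxP patches yields (224/P)^2 tokens, so P = 16 gives 196 tokens and P = 1 gives 224^2 = 50,176 tokens. The snippet below simply tabulates that relationship.

image_size = 224
for patch in (16, 8, 4, 2, 1):
    tokens = (image_size // patch) ** 2
    print(f"patch {patch:>2}x{patch:<2} -> {tokens:>6} tokens")
# patch 16x16 ->    196 tokens
# ...
# patch  1x1  ->  50176 tokens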

Authors:Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C. -W. Phan
Title: MD-BERT: Action Recognition in Dark Videos via Dynamic Multi-Stream Fusion and Temporal Modeling
Abstract:
Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state-of-the-art performance. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: https://github.com/HrishavBakulBarua/DarkBERT
中文摘要:本文提出MD-BERT多流框架,通过动态特征融合模块整合增强视频输入与BERT时序模型,在暗光视频行为识别任务中实现了最优性能。
English Summary: This paper introduces MD-BERT, a multi-stream framework that combines enhanced video inputs with a BERT-based temporal model to achieve state-of-the-art action recognition in dark videos through dynamic feature fusion.
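The three input streams named above (raw dark frame, gamma-corrected frame, histogram-equalized frame) can be produced with standard image operations; the sketch below uses OpenCV with an illustrative gamma value and is not the paper's exact preprocessing.

import cv2
import numpy as np

def build_streams(frame_bgr, gamma=2.2):
    """Return the (raw, gamma-corrected, histogram-equalized) views of a dark frame."""
    # Gamma correction via a lookup table: brightens dark regions nonlinearly.
    table = np.array([(i / 255.0) ** (1.0 / gamma) * 255 for i in range(256)],
                     dtype=np.uint8)
    gamma_corrected = cv2.LUT(frame_bgr, table)
    # Histogram equalization applied to the luminance channel only.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    hist_eq = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    return frame_bgr, gamma_corrected, hist_eq

Each stream is then encoded separately and combined by the Dynamic Feature Fusion module before the BERT-based temporal model.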

Authors:Gian Mario Favero, Parham Saremi, Emily Kaczmarek, Brennan Nichyporuk, Tal Arbel
Title: Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for Free
Abstract:
Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme shown to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC Melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve competitive performance against SOTA discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable, and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical, clinical contexts. Further information is available on our project page: https://faverogian.github.io/med-diffusion-classifier.github.io/.
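The classification rule described above (compare reconstruction errors across class-conditional denoising, then aggregate over random noisings) can be sketched as follows. The model interface (add_noise, predict_noise, num_timesteps) and the vote count are generic assumptions, not the released code.

import torch

@torch.no_grad()
def diffusion_classify(model, image, class_labels, num_votes=16):
    """Pick the class whose conditional denoiser reconstructs the image best,
    aggregated over several random noise draws (majority vote)."""
    votes = torch.zeros(len(class_labels))
    for _ in range(num_votes):
        t = torch.randint(1, model.num_timesteps, (1,))       # random noise level
        noise = torch.randn_like(image)
        noisy = model.add_noise(image, noise, t)               # assumed forward-diffusion helper
        errors = []
        for c in class_labels:
            pred = model.predict_noise(noisy, t, cond=c)       # assumed conditional denoiser call
            errors.append((pred - noise).pow(2).mean())
        votes[int(torch.stack(errors).argmin())] += 1          # this draw's best class
    return class_labels[int(votes.argmax())]

The per-class error maps offer a simple visual explanation and the spread of votes one possible uncertainty signal, in the spirit of the explainability and uncertainty claims above.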

Authors:Kushagra Pandey, Farrin Marouf Sofian, Felix Draxler, Theofanis Karaletsos, Stephan Mandt
Title: Variational Control for Guidance in Diffusion Models
Abstract:
Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear, non-linear, and blind inverse problems without requiring additional model training or specificity to pixel or latent space diffusion models. Our code will be available at https://github.com/czi-ai/oc-guidance
中文摘要:扩散轨迹匹配(DTM)通过变分推理统一了预训练扩散模型的多种引导方法,无需额外训练或特定空间适配,即在多种逆问题上实现了最先进的性能。
English Summary: Diffusion Trajectory Matching (DTM) unifies various guidance methods for pretrained diffusion models through variational inference, achieving state-of-the-art performance on multiple inverse problems without requiring additional training or space-specific adaptations.

Authors:Huimin Zeng, Jiacheng Li, Ziqiang Zheng, Zhiwei Xiong
Title: All-in-One Image Compression and Restoration
Abstract:
Visual images corrupted by various types and levels of degradations are commonly encountered in practical image compression. However, most existing image compression methods are tailored for clean images, therefore struggling to achieve satisfying results on these images. Joint compression and restoration methods typically focus on a single type of degradation and fail to address a variety of degradations in practice. To this end, we propose a unified framework for all-in-one image compression and restoration, which incorporates the image restoration capability against various degradations into the process of image compression. The key challenges involve distinguishing authentic image content from degradations, and flexibly eliminating various degradations without prior knowledge. Specifically, the proposed framework approaches these challenges from two perspectives: i.e., content information aggregation, and degradation representation aggregation. Extensive experiments demonstrate the following merits of our model: 1) superior rate-distortion (RD) performance on various degraded inputs while preserving the performance on clean data; 2) strong generalization ability to real-world and unseen scenarios; 3) higher computing efficiency over compared methods. Our code is available at https://github.com/ZeldaM1/All-in-one.
中文摘要:该统一框架将图像修复能力融入压缩过程,无需先验知识即可灵活消除多种退化,在各类场景中实现了优越的性能、泛化能力和计算效率。
English Summary: The proposed unified framework integrates image restoration capabilities into compression to effectively handle various degradations without prior knowledge, achieving superior performance, generalization, and efficiency across diverse scenarios.

Authors:Yunuo Chen, Junli Cao, Anil Kag, Vidit Goel, Sergei Korolev, Chenfanfu Jiang, Sergey Tulyakov, Jian Ren
Title: Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
Abstract:
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, e.g., nonphysical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues like object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos. These videos involve complex interactions of solids, where 3D information is essential for perceiving deformation and contact. Furthermore, our model improves the overall quality of video generation by promoting the 3D consistency of moving objects and reducing abrupt changes in shape and motion.
中文: 本文提出了一种新颖的视频生成框架,通过将2D视频增强为3D点轨迹并微调潜在扩散模型,提升了三维几何和动态感知能力,有效减少了伪影并改善了复杂场景下的视频质量。
English: This paper introduces a novel video generation framework that enhances 3D geometry and dynamic awareness by augmenting 2D videos with 3D point trajectories and fine-tuning a latent diffusion model, effectively reducing artifacts and improving video quality in complex scenarios.

Authors:Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, Dimitris N. Metaxas
Title: The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
Abstract:
Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded contents. In this paper, we investigate the internal dynamics of hallucination by examining the tokens logits ranking throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss - visually grounded tokens gradually become less favored throughout generation, and (2) early excitation - semantically meaningful tokens achieve peak activation in the layers earlier than the final layer. (3) hidden genuine information - visually grounded tokens though not being eventually decoded still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA on average reduces hallucination by about 40% on evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies. Code is available at https://github.com/LzVv123456/VISTA.
中文: 大型视觉语言模型因视觉信息逐渐丢失和早期语义激发而产生幻觉内容,为此提出的VISTA框架通过增强视觉激活与利用早期层输出,无需训练即可有效减少幻觉现象。
English: Large Vision-Language Models often generate visually ungrounded content due to gradual loss of visual information and early token excitation, prompting the development of VISTA, a training-free framework that reduces hallucination by reinforcing visual activations and leveraging early layer outputs.

Authors:Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
Title: Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function
Abstract:
In this work, we present a novel approach to multi-label chest X-ray (CXR) image classification that enhances clinical interpretability while maintaining a streamlined, single-model, single-run training pipeline. Leveraging the CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical label groupings to capture clinically meaningful relationships between diagnoses. To achieve this, we designed a custom hierarchical binary cross-entropy (HBCE) loss function that enforces label dependencies using either fixed or data-driven penalty types. Our model achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.903 on the test set. Additionally, we provide visual explanations and uncertainty estimations to further enhance model interpretability. All code, model configurations, and experiment details are made available.
中文: 本研究提出了一种新颖的胸部X光多标签分类方法,通过自定义分层损失函数增强临床可解释性,在测试集上取得了0.903的平均AUROC值,同时提供可视化解释和不确定性评估。
English: This study introduces a hierarchical multi-label classification method for chest X-rays that improves clinical interpretability through a custom loss function while maintaining high diagnostic accuracy with a 0.903 AUROC score.
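One way to read the penalty-based hierarchical loss described above is as standard binary cross-entropy plus a term that fires when a child label is predicted more confidently than its parent. The parent-child map and penalty weight below are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def hierarchical_bce(logits, targets, parent_of, penalty=0.5):
    """logits, targets: (B, C); parent_of: dict mapping child index -> parent index."""
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    violation = 0.0
    for child, parent in parent_of.items():
        # Dependency: the parent finding should be at least as probable as the child.
        violation = violation + torch.relu(probs[:, child] - probs[:, parent]).mean()
    return loss + penalty * violation

# Toy hierarchy: index 0 = "lung opacity" (parent), index 1 = "pneumonia" (child).
logits = torch.randn(8, 2, requires_grad=True)
targets = torch.randint(0, 2, (8, 2)).float()
hierarchical_bce(logits, targets, parent_of={1: 0}).backward()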

Authors:Darina Koishigarina, Arnas Uselis, Seong Joon Oh
Title: CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
Abstract:
CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.
中文: CLIP常因跨模态对齐问题无法正确绑定属性与对象,而LABCLIP通过在计算相似度前对文本嵌入进行线性变换,显著提升了这种绑定能力。
English: CLIP often fails to bind attributes to objects due to cross-modal alignment issues, but LABCLIP enhances this by applying a linear transformation to text embeddings before similarity computation.
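The fix described above amounts to learning one linear map applied to the text embeddings before the usual cosine-similarity matching, with both CLIP encoders left frozen. The sketch below shows that wrapper; the embedding dimension and how the linear layer is trained are assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttributeBinding(nn.Module):
    """Learned linear map on text embeddings, applied before cosine similarity."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.transform = nn.Linear(embed_dim, embed_dim, bias=False)

    def similarity(self, image_emb, text_emb):
        t = F.normalize(self.transform(text_emb), dim=-1)   # transformed text features
        i = F.normalize(image_emb, dim=-1)
        return i @ t.T                                       # (num_images, num_texts)

# image_emb and text_emb would come from the frozen CLIP encoders (not shown here).
binder = LinearAttributeBinding(embed_dim=512)
scores = binder.similarity(torch.randn(4, 512), torch.randn(10, 512))
print(scores.shape)   # torch.Size([4, 10])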

Authors:Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, Xinchao Wang
Title: Seeing World Dynamics in a Nutshell
Abstract:
We consider the problem of efficiently representing casually captured monocular videos in a spatially- and temporally-coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques treating videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real-time. Demos and code will be available at https://github.com/Nut-World/NutWorld.
中文摘要:本文提出NutWorld框架,通过单次前向传播将单目视频高效转换为动态3D高斯表示,实现了高保真视频重建并支持实时下游应用。
English Summary: This paper introduces NutWorld, a novel framework that efficiently converts monocular videos into dynamic 3D Gaussian representations in a single forward pass, achieving high-fidelity reconstruction and enabling real-time applications.

Authors:Rudolf Herdt, Daniel Otero Baguer
Title: Concept Based Explanations and Class Contrasting
Abstract:
Explaining deep neural networks is challenging, due to their large size and non-linearity. In this paper, we introduce a concept-based explanation method, in order to explain the prediction for an individual class, as well as contrasting any two classes, i.e. explain why the model predicts one class over the other. We test it on several openly available classification models trained on ImageNet1K. We perform both qualitative and quantitative tests. For example, for a ResNet50 model from pytorch model zoo, we can use the explanation for why the model predicts a class 'A' to automatically select four dataset crops where the model does not predict class 'A'. The model then predicts class 'A' again for the newly combined image in 91.1% of the cases (works for 911 out of the 1000 classes). The code including an .ipynb example is available on github: https://github.com/rherdt185/concept-based-explanations-and-class-contrasting
Chinese: 本文提出了一种基于概念的解释方法,用于阐明深度神经网络对单个类别的预测及类别间对比,在ImageNet模型上通过定性和定量测试验证了其91.1%的有效性。
English: This paper presents a concept-based explanation method for deep neural networks that explains individual class predictions and contrasts any two classes; in quantitative tests on ImageNet-trained models, combining four explanation-selected crops (none individually predicted as the target class) yields an image the model assigns to that class in 91.1% of cases (911 of 1000 classes).

Authors:Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller
Title: Conditional Prediction by Simulation for Automated Driving
Abstract:
Modular automated driving systems commonly handle prediction and planning as sequential, separate tasks, thereby prohibiting cooperative maneuvers. To enable cooperative planning, this work introduces a prediction model that models the conditional dependencies between trajectories. For this, predictions are generated by a microscopic traffic simulation, with the individual traffic participants being controlled by a realistic behavior model trained via Adversarial Inverse Reinforcement Learning. By assuming various candidate trajectories for the automated vehicle, we generate predictions conditioned on each of them. Furthermore, our approach allows the candidate trajectories to adapt dynamically during the prediction rollout. Several example scenarios are available at https://conditionalpredictionbysimulation.github.io/.

Authors:Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Sethu Vijayakumar, Chris Xiaoxuan Lu, Oisin Mac Aodha
Title: When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning
Abstract:
The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise even in the presence of minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, enhancing robustness when evaluated on out-of-distribution scenes. Our experiments demonstrate significant performance improvements, particularly in PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.

Authors:Ying Zhang, Maoliang Yin, Wenfu Bi, Haibao Yan, Shaohan Bian, Cui-Hua Zhang, Changchun Hua
Title: ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models
Abstract:
Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real-world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation-to-reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the segment anything model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object-agnostic mask proposals from colorized depth images using SAM, (2) refining these proposals using attention-based features from the self-supervised ViT to filter non-object masks, and (3) applying K-Medoids clustering to generate point prompts that guide SAM towards precise object segmentation. Experimental validation on two benchmark datasets and a self-collected dataset demonstrates the superior performance of ZISVFM in complex environments, including hierarchical settings such as cabinets, drawers, and handheld objects. Our source code is available at https://github.com/Yinmlmaoliang/zisvfm.
中文摘要:本文提出ZISVFM框架,通过结合SAM的零样本分割能力和自监督ViT特征,采用掩码生成、精化及聚类提示的三阶段方法,在复杂环境中实现了精确的未知物体实例分割。
English Summary: This paper introduces ZISVFM, a novel framework that combines the zero-shot segmentation capability of SAM with self-supervised ViT features to achieve precise unseen object instance segmentation through a three-stage process of mask proposal, refinement, and clustering-based prompting.
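
The point-prompt step in stage (3) can be pictured with a small, self-contained K-Medoids routine over the foreground pixel coordinates of a coarse mask proposal; the number of prompts, the distance metric, and the way prompts are fed to SAM below are assumptions for illustration only.

```python
# Sketch of stage (3) of ZISVFM: cluster the pixel coordinates of a coarse
# mask proposal with K-Medoids and use the medoids as point prompts for SAM.
# This is a generic K-Medoids; the authors' exact settings are assumptions here.
import numpy as np

def k_medoids(points: np.ndarray, k: int = 3, iters: int = 20, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    medoids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest medoid (Euclidean distance).
        d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            cluster = points[labels == c]
            if len(cluster) == 0:
                continue
            # The medoid minimizes the total distance to the other cluster members.
            dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1).sum(axis=1)
            new_medoids[c] = cluster[dists.argmin()]
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

# Toy mask proposal: foreground pixel coordinates (y, x) of a 64x64 mask.
mask = np.zeros((64, 64), dtype=bool)
mask[10:30, 20:50] = True
coords = np.argwhere(mask).astype(float)
point_prompts = k_medoids(coords, k=3)  # (3, 2) points to pass to SAM as prompts
print(point_prompts)
```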

Authors:Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
Title: Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration
Abstract:
Recently computer-aided diagnosis has demonstrated promising performance, effectively alleviating the workload of clinicians. However, the inherent sample imbalance among different diseases leads algorithms biased to the majority categories, leading to poor performance for rare categories. Existing works formulated this challenge as a long-tailed problem and attempted to tackle it by decoupling the feature representation and classification. Yet, due to the imbalanced distribution and limited samples from tail classes, these works are prone to biased representation learning and insufficient classifier calibration. To tackle these problems, we propose a new Long-tailed Medical Diagnosis (LMD) framework for balanced medical image classification on long-tailed datasets. In the initial stage, we develop a Relation-aware Representation Learning (RRL) scheme to boost the representation ability by encouraging the encoder to capture intrinsic semantic features through different data augmentations. In the subsequent stage, we propose an Iterative Classifier Calibration (ICC) scheme to calibrate the classifier iteratively. This is achieved by generating a large number of balanced virtual features and fine-tuning the encoder using an Expectation-Maximization manner. The proposed ICC compensates for minority categories to facilitate unbiased classifier optimization while maintaining the diagnostic knowledge in majority classes. Comprehensive experiments on three public long-tailed medical datasets demonstrate that our LMD framework significantly surpasses state-of-the-art approaches. The source code can be accessed at https://github.com/peterlipan/LMD.
中文: 提出的长尾医疗诊断框架通过关系感知表征学习和迭代分类器校准解决医学图像中的类别不平衡问题,在三个公共数据集上显著超越了现有最优方法。
English: The proposed Long-tailed Medical Diagnosis (LMD) framework addresses class imbalance in medical image analysis through Relation-aware Representation Learning and Iterative Classifier Calibration, achieving state-of-the-art performance on three public datasets.

Authors:Ruizhe Li, Grazziela Figueredo, Dorothee Auer, Rob Dineen, Paul Morgan, Xin Chen
Title: A Unified Framework for Semi-Supervised Image Segmentation and Registration
Abstract:
Semi-supervised learning, which leverages both annotated and unannotated data, is an efficient approach for medical image segmentation, where obtaining annotations for the whole dataset is time-consuming and costly. Traditional semi-supervised methods primarily focus on extracting features and learning data distributions from unannotated data to enhance model training. In this paper, we introduce a novel approach incorporating an image registration model to generate pseudo-labels for the unannotated data, producing more geometrically correct pseudo-labels to improve the model training. Our method was evaluated on a 2D brain data set, showing excellent performance even using only 1% of the annotated data. The results show that our approach outperforms conventional semi-supervised segmentation methods (e.g. teacher-student model), particularly in a low percentage of annotation scenario. GitHub: https://github.com/ruizhe-l/UniSegReg.
中文: 本文提出了一种新颖的半监督医学图像分割方法,通过引入图像配准模型生成几何更准确的伪标签,在极少量标注数据下显著优于传统方法。
English: This paper presents a novel semi-supervised medical image segmentation method that integrates an image registration model to generate geometrically accurate pseudo-labels, demonstrating superior performance with minimal annotated data compared to traditional approaches.
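
A minimal sketch of the pseudo-labelling idea follows: an annotated label map is warped into the space of an unannotated image using a displacement field predicted by a registration network. The registration model itself is omitted; the constant field below is a stand-in.

```python
# Warp an annotated label map with a displacement field to create a pseudo-label
# for an unannotated image. The displacement here is a placeholder for the output
# of a registration network.
import torch
import torch.nn.functional as F

def warp_label(label: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """label: (N, 1, H, W) integer/one-hot map; disp: (N, 2, H, W) in pixels (dx, dy)."""
    n, _, h, w = label.shape
    # Identity sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)
    # Convert pixel displacements to normalized offsets and add to the grid.
    offset = torch.stack((disp[:, 0] * 2 / (w - 1), disp[:, 1] * 2 / (h - 1)), dim=-1)
    grid = base + offset
    # Nearest-neighbour sampling keeps the pseudo-label discrete.
    return F.grid_sample(label.float(), grid, mode="nearest", align_corners=True)

label = torch.zeros(1, 1, 32, 32)
label[:, :, 8:16, 8:16] = 1.0
disp = torch.full((1, 2, 32, 32), 4.0)  # shift everything by 4 pixels (toy field)
pseudo = warp_label(label, disp)
print(pseudo.sum())
```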

Authors:Wen Yan, Qianye Yang, Shiqi Huang, Yipei Wang, Shonit Punwani, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt
Title: Tell2Reg: Establishing spatial correspondence between images by the same language prompts
Abstract:
Spatial correspondence can be represented by pairs of segmented regions, such that the image registration networks aim to segment corresponding regions rather than predicting displacement fields or transformation parameters. In this work, we show that such a corresponding region pair can be predicted by the same language prompt on two different images using the pre-trained large multimodal models based on GroundingDINO and SAM. This enables a fully automated and training-free registration algorithm, potentially generalisable to a wide range of image registration tasks. In this paper, we present experimental results using one of the challenging tasks, registering inter-subject prostate MR images, which involves both highly variable intensity and morphology between patients. Tell2Reg is training-free, eliminating the need for costly and time-consuming data curation and labelling that was previously required for this registration task. This approach outperforms unsupervised learning-based registration methods tested, and has a performance comparable to weakly-supervised methods. Additional qualitative results are also presented to suggest that, for the first time, there is a potential correlation between language semantics and spatial correspondence, including the spatial invariance in language-prompted regions and the difference in language prompts between the obtained local and global correspondences. Code is available at https://github.com/yanwenCi/Tell2Reg.git.
中文摘要:本文提出Tell2Reg这一无需训练的图像配准方法,通过语言提示与预训练多模态模型自动分割不同图像中的对应区域,在无需数据标注的情况下实现了与弱监督方法相当的性能。
English Summary: This paper introduces Tell2Reg, a training-free image registration method that uses language prompts with pre-trained multimodal models to automatically segment corresponding regions in different images, achieving performance comparable to weakly-supervised methods without requiring data labeling.

Authors:Baoyao Yang, Junxiang Chen, Wanyun Li, Wenbin Yao, Yang Zhou
Title: Expertized Caption Auto-Enhancement for Video-Text Retrieval
Abstract:
Video-text retrieval has been stuck in the information mismatch caused by personalized and inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders an effective cross-modal representation alignment, resulting in ambiguous retrieval results. Although text rewriting methods have been proposed to broaden text expressions, the modality gap remains significant, as the text representation space is hardly expanded with insufficient semantic enrichment. Instead, this paper turns to enhancing visual presentation, bridging video expression closer to textual representation via caption generation and thereby facilitating video-text matching. While multimodal large language models (mLLM) have shown a powerful capability to convert video content into text, carefully crafted prompts are essential to ensure the reasonableness and completeness of the generated captions. Therefore, this paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. Additionally, an expertized caption selection mechanism is designed and introduced to customize augmented captions for each video, further exploring the utilization potential of caption augmentation. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo. Our code is publicly available at https://github.com/CaryXiang/ECA4VTR.
Chinese: 本文通过自动字幕增强方法和专业化字幕选择机制改进视觉表达以弥合视频与文本间的信息鸿沟,无需大量数据收集即在多个基准测试中取得领先性能。
English: This paper addresses the video-text retrieval mismatch by enhancing visual presentation through an automatic caption enhancement method and expertized caption selection, achieving state-of-the-art results on benchmarks without heavy data collection.

Authors:Hao Yin, Paritosh Parmar, Daoliang Xu, Yang Zhang, Tianyou Zheng, Weiwei Fu
Title: A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions
Abstract:
Action Quality Assessment (AQA) -- the ability to quantify the quality of human motion, actions, or skill levels and provide feedback -- has far-reaching implications in areas such as low-cost physiotherapy, sports training, and workforce development. As such, it has become a critical field in computer vision & video understanding over the past decade. Significant progress has been made in AQA methodologies, datasets, & applications, yet a pressing need remains for a comprehensive synthesis of this rapidly evolving field. In this paper, we present a thorough survey of the AQA landscape, systematically reviewing over 200 research papers using the preferred reporting items for systematic reviews & meta-analyses (PRISMA) framework. We begin by covering foundational concepts & definitions, then move to general frameworks & performance metrics, & finally discuss the latest advances in methodologies & datasets. This survey provides a detailed analysis of research trends, performance comparisons, challenges, & future directions. Through this work, we aim to offer a valuable resource for both newcomers & experienced researchers, promoting further exploration & progress in AQA. Data are available at https://haoyin116.github.io/Survey_of_AQA/

Authors:Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
Title: Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images
Abstract:
This study proposes a new loss function for deep neural networks, L1-weighted Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of voxels based on their classification difficulty, towards automated detection and segmentation of metastatic prostate cancer lesions in PET/CT scans. We obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with biochemical recurrence metastatic prostate cancer. We trained two 3D convolutional neural networks, Attention U-Net and SegResNet, and concatenated the PET and CT volumes channel-wise as input. The performance of our custom loss function was evaluated against the Dice and Dice Focal Loss functions. For clinical significance, we considered a detected region of interest (ROI) as a true positive if at least the voxel with the maximum standardized uptake value falls within the ROI. We assessed the models' performance based on the number of lesions in an image, tumour volume, activity, and extent of spread. The L1DFL outperformed the comparative loss functions by at least 13% on the test set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal Loss yielded more false positives, whereas the Dice Loss was more sensitive to smaller volumes and struggled to segment larger lesions accurately. They also exhibited network-specific variations and yielded declines in segmentation accuracy with increased tumour spread. Our results demonstrate the potential of L1DFL to yield robust segmentation of metastatic prostate cancer lesions in PSMA PET/CT images. The results further highlight potential complexities arising from the variations in lesion characteristics that may influence automated prostate cancer tumour detection and segmentation. The code is publicly available at: https://github.com/ObedDzik/pca_segment.git.
Chinese: 本研究提出L1DFL新型损失函数,通过基于分类难度自适应加权体素来改进PET/CT扫描中转移性前列腺癌的自动分割,在测试集上相比现有方法性能提升至少13%。
English: This study introduces L1DFL, a novel loss function that enhances automated segmentation of metastatic prostate cancer in PET/CT scans by adaptively weighting voxels based on classification difficulty, achieving superior performance over existing methods with at least 13% improvement on test sets.
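
A hedged sketch of how an L1-based difficulty weight could enter a combined Dice and focal objective is given below; the exact L1DFL formulation in the paper may differ, so this illustrates the weighting idea rather than the authors' loss.

```python
# Hedged sketch of the L1-weighted idea: weight each voxel by the L1 distance
# between its predicted probability and its label (a proxy for classification
# difficulty) and fold the weights into a Dice + focal objective.
import torch

def l1_weighted_dice_focal(logits: torch.Tensor, target: torch.Tensor,
                           gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """logits, target: (N, 1, D, H, W); target is binary."""
    p = torch.sigmoid(logits)
    w = (p - target).abs().detach()          # per-voxel L1 difficulty weight
    # Weighted soft Dice term.
    inter = (w * p * target).sum()
    denom = (w * (p + target)).sum().clamp_min(eps)
    dice = 1.0 - 2.0 * inter / denom
    # Weighted focal binary cross-entropy term.
    ce = -(target * torch.log(p.clamp_min(eps)) + (1 - target) * torch.log((1 - p).clamp_min(eps)))
    focal = ((1 - torch.where(target.bool(), p, 1 - p)) ** gamma * ce * w).mean()
    return dice + focal

pred = torch.randn(1, 1, 8, 16, 16)
gt = (torch.rand(1, 1, 8, 16, 16) > 0.7).float()
print(l1_weighted_dice_focal(pred, gt))
```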

Authors:Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
Title: COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
Abstract:
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
Chinese: 本文介绍了COCONut-PanCap数据集,它通过提供与全景分割掩码关联的细粒度区域级标注,显著提升了全景分割和基于图像的描述任务性能,为视觉语言模型的理解和生成能力设立了新基准。
English: This paper presents the COCONut-PanCap dataset, which enhances panoptic segmentation and grounded image captioning by providing fine-grained, region-level captions tied to panoptic masks, thereby improving vision-language model performance in both understanding and generation tasks.

Authors:Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li
Title: Calibrated Multi-Preference Optimization for Aligning Diffusion Models
Abstract:
Aligning text-to-image (T2I) diffusion models with preference optimization is valuable for human-annotated datasets, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative, however, current preference optimization methods fall short in exploiting the rich information, as they only consider pairwise preference distribution. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human annotated data. The core of our approach involves a reward calibration method to approximate the general preference by computing the expected win-rate against the samples generated by the pretrained models. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use regression loss to fine-tune diffusion models to match the difference between calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods, such as Direct Preference Optimization (DPO), in both single and multi-reward settings validated by evaluation on T2I benchmarks, including GenEval and T2I-Compbench.
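
The reward-calibration step can be illustrated as an expected win-rate against a pool of samples from the pretrained model; the Bradley-Terry style sigmoid and the pool construction below are assumptions, not the paper's exact recipe.

```python
# Sketch of reward calibration: approximate the general preference of a sample by
# its expected win-rate against reference generations from the pretrained model.
import numpy as np

def calibrated_reward(r_sample: float, r_reference_pool: np.ndarray) -> float:
    # Expected win-rate of the sample against the reference pool.
    return float((1.0 / (1.0 + np.exp(-(r_sample - r_reference_pool)))).mean())

ref_rewards = np.random.default_rng(0).normal(size=64)  # rewards of pretrained-model samples
print(calibrated_reward(0.8, ref_rewards))               # ~ probability of "winning"
```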

Authors:Divya Bharti, Sriprabha Ramanarayanan, Sadhana S, Kishore Kumar M, Keerthi Ram, Harsh Agarwal, Ramesh Venkatesan, Mohanasankar Sivaprakasam
Title: AAD-DCE: An Aggregated Multimodal Attention Mechanism for Early and Late Dynamic Contrast Enhanced Prostate MRI Synthesis
Abstract:
Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) is a medical imaging technique that plays a crucial role in the detailed visualization and identification of tissue perfusion in abnormal lesions and radiological suggestions for biopsy. However, DCE-MRI involves the administration of a Gadolinium based (Gad) contrast agent, which is associated with a risk of toxicity in the body. Previous deep learning approaches that synthesize DCE-MR images employ unimodal non-contrast or low-dose contrast MRI images lacking focus on the local perfusion information within the anatomy of interest. We propose AAD-DCE, a generative adversarial network (GAN) with an aggregated attention discriminator module consisting of global and local discriminators. The discriminators provide a spatial embedded attention map to drive the generator to synthesize early and late response DCE-MRI images. Our method employs multimodal inputs - T2 weighted (T2W), Apparent Diffusion Coefficient (ADC), and T1 pre-contrast for image synthesis. Extensive comparative and ablation studies on the ProstateX dataset show that our model (i) is agnostic to various generator benchmarks, (ii) outperforms other DCE-MRI synthesis approaches with improvement margins of +0.64 dB PSNR, +0.0518 SSIM, -0.015 MAE for early response and +0.1 dB PSNR, +0.0424 SSIM, -0.021 MAE for late response, and (iii) emphasizes the importance of attention ensembling. Our code is available at https://github.com/bhartidivya/AAD-DCE.
中文摘要:AAD-DCE模型采用带聚合注意力判别器的生成对抗网络,通过多模态MRI输入合成DCE-MRI图像,在精度上超越现有方法并突显了注意力集成的重要性。
English Summary: The AAD-DCE model uses a GAN with aggregated attention discriminators to synthesize DCE-MRI images from multimodal MRI inputs, outperforming existing methods in accuracy and emphasizing attention ensembling's importance.

Authors:Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy
Title: Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation
Abstract:
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.

Authors:Jian Liu, Wei Sun, Hui Yang, Pengchao Deng, Chongpei Liu, Nicu Sebe, Hossein Rahmani, Ajmal Mian
Title: Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation
Abstract:
Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for enabling augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance. Our code will be made public at https://github.com/CNJianLiu/Diff9D.
中文摘要:本文提出一种基于扩散模型的九自由度物体姿态估计方法,仅使用合成数据训练即可通过三步去噪实现卓越的跨领域泛化性能,并在真实场景中达到领先水平。
English Summary: This paper introduces a diffusion-based method for category-level 9-DoF object pose estimation that trains solely on synthetic data yet achieves state-of-the-art generalization to real-world scenes through efficient 3-step denoising.
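
The few-step inference described above follows the standard deterministic DDIM update; the toy sketch below runs three such updates on a 9-dimensional pose vector with an untrained stand-in denoiser and omits the observation conditioning used by the actual model.

```python
# Minimal deterministic (eta = 0) DDIM loop matching the "reverse diffusion in as
# few as 3 steps" idea, applied to a toy 9-DoF pose vector.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(9 + 1, 64), nn.SiLU(), nn.Linear(64, 9))  # predicts noise

def ddim_sample(steps=(999, 666, 333)) -> torch.Tensor:
    x = torch.randn(1, 9)  # start from pure noise in pose space
    timesteps = list(steps) + [0]
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
        eps = denoiser(torch.cat([x, torch.full((1, 1), t / T)], dim=-1))
        x0 = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()       # predicted clean pose
        x = ab_prev.sqrt() * x0 + (1 - ab_prev).sqrt() * eps   # deterministic DDIM step
    return x

print(ddim_sample())
```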

Authors:Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Title: Privacy Attacks on Image AutoRegressive Models
Abstract:
Image AutoRegressive generation has emerged as a new powerful paradigm with image autoregressive models (IARs) matching state-of-the-art diffusion models (DMs) in image quality (FID: 1.48 vs. 1.58) while allowing for a higher generation speed. However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to the ones of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a True Positive Rate at False Positive Rate = 1% of 86.38% vs. 6.38% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). Our results suggest a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are empirically significantly more vulnerable to privacy attacks compared to DMs that achieve similar performance. We release the code at https://github.com/sprintml/privacy_attacks_against_iars for reproducibility.
中文: 图像自回归模型在图像质量和生成速度上媲美扩散模型,但隐私风险更高,新型成员推理攻击成功率显著且能提取大量训练数据,揭示了其严重的隐私泄露问题。
English: Image autoregressive models (IARs) match diffusion models in image quality with faster generation but are significantly more vulnerable to privacy attacks, as demonstrated by a novel membership inference attack achieving high success rates and data extraction.
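
For orientation, the sketch below shows how a simple loss-threshold membership-inference score and the TPR at FPR = 1% metric are computed; the attack proposed in the paper is substantially stronger than this baseline, and the synthetic scores here are purely illustrative.

```python
# Baseline illustration of membership inference: score each image by the model's
# per-sample loss and report TPR at a fixed 1% false-positive rate.
import numpy as np

def tpr_at_fpr(member_scores: np.ndarray, nonmember_scores: np.ndarray, fpr: float = 0.01) -> float:
    # Higher score = more likely a training member (e.g. negative loss).
    threshold = np.quantile(nonmember_scores, 1.0 - fpr)   # allow only 1% false positives
    return float((member_scores > threshold).mean())

rng = np.random.default_rng(0)
members = -rng.gamma(2.0, 0.4, size=5000)       # members tend to have lower loss
nonmembers = -rng.gamma(2.0, 0.6, size=5000)
print(tpr_at_fpr(members, nonmembers))
```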

Authors:Yufan Chen, Ruiping Liu, Junwei Zheng, Di Wen, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen
Title: Graph-based Document Structure Analysis
Abstract:
When reading a document, glancing at the spatial layout of a document is an initial step to understand it roughly. Traditional document layout analysis (DLA) methods, however, offer only a superficial parsing of documents, focusing on basic instance detection and often failing to capture the nuanced spatial and logical relations between instances. These limitations hinder DLA-based models from achieving a gradually deeper comprehension akin to human reading. In this work, we propose a novel graph-based Document Structure Analysis (gDSA) task. This task requires that the model not only detect document elements but also generate spatial and logical relations in the form of a graph structure, allowing documents to be understood in a holistic and intuitive manner. For this new task, we construct a relation graph-based document structure analysis dataset (GraphDoc) with 80K document images and 4.13M relation annotations, enabling training models to complete multiple tasks like reading order, hierarchical structures analysis, and complex inter-element relation inference. Furthermore, a document relation graph generator (DRGG) is proposed to address the gDSA task, achieving 57.6% mAP_g@0.5 as a strong benchmark baseline on this novel task and dataset. We hope this graphical representation of document structure can mark an innovative advancement in document structure analysis and understanding. The new dataset and code will be made publicly available at https://yufanchen96.github.io/projects/GraphDoc.
中文摘要:传统文档布局分析方法仅提供浅层解析,因此提出了一种新颖的基于图的文档结构分析任务,通过生成空间与逻辑关系图来实现整体文档理解。
English Summary: Traditional document layout analysis methods provide only superficial parsing, prompting the introduction of a novel graph-based Document Structure Analysis task that generates spatial and logical relations through a graph structure for holistic document understanding.

Authors:Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
Title: VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Abstract:
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior to video generators, by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/

Authors:Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Tatiana Tommasi, Giuseppe Averta
Title: Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives
Abstract:
Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage tasks synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, which expands its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4d benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

Authors:Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Title: Towards Consistent and Controllable Image Synthesis for Face Editing
Abstract:
Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and preserving the consistency of other unchanged attributes, especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves the combinations of target background, identity and face attributes aimed to edit. We strive to sufficiently disentangle the control of these factors to enable consistency of face editing. Specifically, our method, coined as RigFace, contains: 1) A Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models. Code is publicly available at https://github.com/weimengting/RigFace.
中文: 本文提出RigFace方法,结合稳定扩散模型和粗略三维人脸模型,在精确控制光照、表情和头部姿态的同时,有效保持身份特征一致性和图像真实感。
English: This paper introduces RigFace, a novel face editing approach that utilizes Stable-Diffusion models and 3D face models to precisely control lighting, expressions, and poses while maintaining identity consistency and photorealism.

Authors:Qianhao Yuan, Yanjiang Liu, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Title: SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency
Abstract:
Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on these insights, we introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAAViT self-attention blocks, reducing computational overhead in both self-attention blocks and feed-forward networks (FFNs). Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and training budget by 26%, while achieving superior performance in terms of accuracy. Comprehensive ablation studies further validate the effectiveness of SAISA across various LLMs and visual encoders. The code and model will be publicly available at https://github.com/icip-cas/SAISA.
中文: 本文提出SAISA这一新型多模态大语言模型架构,通过消除视觉标记间的注意力机制,在提升模型精度的同时显著降低了训练与推理的计算成本。
English: This paper introduces SAISA, a novel multimodal large language model architecture that eliminates attention among visual tokens to significantly enhance both training and inference efficiency while achieving superior accuracy compared to existing models.
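
The effect of removing attention among visual tokens can be expressed as an additive attention mask; the token layout and the choice to keep each visual token's self-connection in the sketch below are illustrative assumptions.

```python
# Hedged sketch of the NAAViT idea: an additive attention mask that blocks
# attention between different visual tokens while leaving text-token attention
# and each token's self-connection untouched.
import torch

def naavit_mask(num_visual: int, num_text: int) -> torch.Tensor:
    n = num_visual + num_text
    mask = torch.zeros(n, n)
    vis = torch.arange(num_visual)
    # Disallow visual->visual attention, except a token attending to itself.
    mask[vis[:, None], vis[None, :]] = float("-inf")
    mask[vis, vis] = 0.0
    return mask  # add this to attention logits before softmax

m = naavit_mask(num_visual=4, num_text=3)
print(m)
```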

Authors:Chenhui Zhao, Yan Jiang, Todd C. Hollon
Title: Extending SEEDS to a Supervoxel Algorithm for Medical Image Analysis
Abstract:
In this work, we extend the SEEDS superpixel algorithm from 2D images to 3D volumes, resulting in 3D SEEDS, a faster, better, and open-source supervoxel algorithm for medical image analysis. We compare 3D SEEDS with the widely used supervoxel algorithm SLIC on 13 segmentation tasks across 10 organs. 3D SEEDS accelerates supervoxel generation by a factor of 10, improves the achievable Dice score by +6.5%, and reduces the under-segmentation error by -0.16%. The code is available at https://github.com/Zch0414/3d_seeds
Chinese: 本研究提出3D SEEDS算法,这一开源超体素方法在医学图像分析中比SLIC提速十倍,并将分割精度提升6.5%,同时代码已公开共享。
English: This study introduces 3D SEEDS, an enhanced open-source supervoxel algorithm that accelerates processing tenfold and improves segmentation accuracy by 6.5% compared to SLIC across multiple medical imaging tasks.

Authors:Alexander Kolesov, Manukhov Stepan, Vladimir V. Palyulin, Alexander Korotin
Title: Field Matching: an Electrostatic Paradigm to Generate and Transfer Data
Abstract:
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modeling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. Then we learn the electrostatic field of the capacitor using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments. Our code is available at https://github.com/justkolesov/FieldMatching
中文摘要:静电匹配场(EFM)是一种受电容器物理启发的新型分布转移方法,通过神经网络学习静电场,使样本沿电场线移动,从而可靠地实现分布间的相互映射。
English Summary: Electrostatic Field Matching (EFM) is a novel distribution transfer method inspired by capacitor physics, using neural networks to learn electrostatic fields that provably map source and target distributions by moving samples along field lines.
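
The transfer step can be pictured as integrating samples along a learned field; the tiny untrained MLP, the Euler step size, and the step count below are placeholders for the learned electrostatic field and its numerical integration.

```python
# Sketch of the transfer step in EFM: move source samples along the field lines
# of a (learned) field f(x) with simple Euler integration.
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))  # f: R^2 -> R^2

def transfer(x0: torch.Tensor, steps: int = 100, dt: float = 0.05) -> torch.Tensor:
    x = x0.clone()
    with torch.no_grad():
        for _ in range(steps):
            x = x + dt * field(x)   # follow the field lines
    return x

source = torch.randn(8, 2)          # samples from the source distribution
print(transfer(source).shape)       # mapped samples, ideally on the target plate
```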

Authors:Ziyan Guo, Zeyu Hu, De Wen Soh, Na Zhao
Title: MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
Abstract:
Human motion generation and editing are key components of computer vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at: https://diouo.github.io/motionlab.github.io/.
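
The rectified-flow component can be illustrated with the generic training step below: sample a point on the straight path between source and target motion and regress the constant velocity. Conditioning, the MotionFlow Transformer, and the positional encodings are omitted; the MLP and feature sizes are stand-ins.

```python
# Generic rectified-flow training step of the kind MotionLab builds on.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64 + 1, 128), nn.SiLU(), nn.Linear(128, 64))

def rectified_flow_loss(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    t = torch.rand(x0.size(0), 1)
    xt = (1 - t) * x0 + t * x1                     # point on the straight trajectory
    v_pred = model(torch.cat([xt, t], dim=-1))
    return ((v_pred - (x1 - x0)) ** 2).mean()      # match the constant velocity

x0, x1 = torch.randn(16, 64), torch.randn(16, 64)  # source / target motion features
print(rectified_flow_loss(x0, x1))
```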

Authors:Shangwei Guo, Hao Shi, Song Wang, Xiaoting Yin, Kailun Yang, Kaiwei Wang
Title: Event-aided Semantic Scene Completion
Abstract:
Autonomous driving systems rely on robust 3D scene understanding. Recent advances in Semantic Scene Completion (SSC) for autonomous driving underscore the limitations of RGB-based approaches, which struggle under motion blur, poor lighting, and adverse weather. Event cameras, offering high dynamic range and low latency, address these challenges by providing asynchronous data that complements RGB inputs. We present DSEC-SSC, the first real-world benchmark specifically designed for event-aided SSC, which includes a novel 4D labeling pipeline for generating dense, visibility-aware labels that adapt dynamically to object motion. Our proposed RGB-Event fusion framework, EvSSC, introduces an Event-aided Lifting Module (ELM) that effectively bridges 2D RGB-Event features to 3D space, enhancing view transformation and the robustness of 3D volume construction across SSC models. Extensive experiments on DSEC-SSC and simulated SemanticKITTI-E demonstrate that EvSSC is adaptable to both transformer-based and LSS-based SSC architectures. Notably, evaluations on SemanticKITTI-C demonstrate that EvSSC achieves consistently improved prediction accuracy across five degradation modes and both In-domain and Out-of-domain settings, achieving up to a 52.5% relative improvement in mIoU when the image sensor partially fails. Additionally, we quantitatively and qualitatively validate the superiority of EvSSC under motion blur and extreme weather conditions, where autonomous driving is challenged. The established datasets and our codebase will be made publicly at https://github.com/Pandapan01/EvSSC.
中文摘要:DSEC-SSC基准与EvSSC框架通过融合事件相机数据与RGB输入,显著提升了自动驾驶系统在传感器故障、运动模糊及恶劣天气下的三维场景理解鲁棒性。
English Summary: The DSEC-SSC benchmark and EvSSC framework enhance autonomous driving's 3D scene understanding by fusing event camera data with RGB inputs, significantly improving robustness against sensor degradation, motion blur, and adverse weather conditions.

Authors:Hsin-Cheng Lu, Chung-Yi Lin, Winston H. Hsu
Title: Improving Generalization Ability for 3D Object Detection by Learning Sparsity-invariant Features
Abstract:
In autonomous driving, 3D object detection is essential for accurately identifying and tracking objects. Despite the continuous development of various technologies for this task, a significant drawback is observed in most of them-they experience substantial performance degradation when detecting objects in unseen domains. In this paper, we propose a method to improve the generalization ability for 3D object detection on a single domain. We primarily focus on generalizing from a single source domain to target domains with distinct sensor configurations and scene distributions. To learn sparsity-invariant features from a single source domain, we selectively subsample the source data to a specific beam, using confidence scores determined by the current detector to identify the density that holds utmost importance for the detector. Subsequently, we employ the teacher-student framework to align the Bird's Eye View (BEV) features for different point clouds densities. We also utilize feature content alignment (FCA) and graph-based embedding relationship alignment (GERA) to instruct the detector to be domain-agnostic. Extensive experiments demonstrate that our method exhibits superior generalization capabilities compared to other baselines. Furthermore, our approach even outperforms certain domain adaptation methods that can access to the target domain data.
Chinese Summary: 本文提出了一种方法,通过从单一源域学习稀疏不变特征,并采用师生框架和特征对齐技术,提升自动驾驶中3D物体检测在未见领域的泛化能力。
English Summary: The paper introduces a method to enhance the generalization of 3D object detection in autonomous driving by learning sparsity-invariant features from a single source domain and employing techniques like teacher-student framework and feature alignment to improve performance across unseen domains.
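
The beam subsampling described above can be approximated by binning points by elevation angle into rings and keeping every k-th ring; the beam counts and the confidence-guided choice of target density from the paper are not reproduced in this sketch.

```python
# Illustration of beam subsampling: quantize points into elevation rings and keep
# every k-th ring to emulate a sparser LiDAR sensor.
import numpy as np

def subsample_beams(points: np.ndarray, src_beams: int = 64, keep_every: int = 2) -> np.ndarray:
    """points: (N, 3) xyz from the source sensor."""
    elev = np.arctan2(points[:, 2], np.linalg.norm(points[:, :2], axis=1))
    # Quantize elevation into src_beams rings.
    lo, hi = elev.min(), elev.max() + 1e-6
    ring = ((elev - lo) / (hi - lo) * src_beams).astype(int)
    return points[ring % keep_every == 0]

pts = np.random.randn(10000, 3) * np.array([20.0, 20.0, 1.5])
print(subsample_beams(pts).shape)   # roughly half the points remain
```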

Authors:Jiawei Qin, Xucong Zhang, Yusuke Sugano
Title: UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
Abstract:
Despite decades of research on data collection and model architectures, current gaze estimation models encounter significant challenges in generalizing across diverse data domains. Recent advances in self-supervised pre-training have shown remarkable performances in generalization across various vision tasks. However, their effectiveness in gaze estimation remains unexplored. We propose UniGaze, for the first time, leveraging large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Through systematic investigation, we clarify critical factors that are essential for effective pretraining in gaze estimation. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation, while our carefully designed pre-training pipeline consistently improves cross-domain performance. Through comprehensive experiments of challenging cross-dataset evaluation and novel protocols including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. source code and model are available at https://github.com/ut-vision/UniGaze.
中文: UniGaze首次通过自监督预训练利用大规模野外面部数据集进行视线估计,显著提升了跨领域泛化能力并降低了对标注数据的依赖。
English: UniGaze introduces a novel self-supervised pre-training approach using large-scale facial datasets to significantly enhance gaze estimation's cross-domain generalization while reducing dependence on labeled data.

Authors:Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, Chunhong Pan
Title: UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation
Abstract:
Pre-training techniques significantly enhance the performance of semantic segmentation tasks with limited training data. However, the efficacy under a large domain gap between pre-training (e.g. RGB) and fine-tuning (e.g. infrared) remains underexplored. In this study, we first benchmark the infrared semantic segmentation performance of various pre-training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre-trained attention maps uncovers that: (1) There are three typical attention patterns (local, hybrid, and global); (2) Pre-training tasks notably influence the pattern distribution across layers; (3) The hybrid pattern is crucial for semantic segmentation as it attends to both nearby and foreground elements; (4) The texture bias impedes model generalization in infrared tasks. Building on these insights, we propose UNIP, a UNified Infrared Pre-training framework, to enhance the pre-trained model performance. This framework uses the hybrid-attention distillation NMI-HAD as the pre-training target, a large-scale mixed dataset InfMix for pre-training, and a last-layer feature pyramid network LL-FPN for fine-tuning. Experimental results show that UNIP outperforms various pre-training methods by up to 13.5% in average mIoU on three infrared segmentation tasks, evaluated using fine-tuning and linear probing metrics. UNIP-S achieves performance on par with MAE-L while requiring only 1/10 of the computational cost. Furthermore, UNIP significantly surpasses state-of-the-art (SOTA) infrared or RGB segmentation methods and demonstrates broad potential for application in other modalities, such as RGB and depth. Our code is available at https://github.com/casiatao/UNIP.
Chinese: 预训练技术能提升有限数据下的语义分割性能,但在RGB与红外等大领域差异下的效果尚不明确,为此提出的统一红外预训练框架UNIP显著超越了现有方法,在性能和效率上均表现出色。
English: Pre-training boosts semantic segmentation with limited data, but its effectiveness across large domain gaps like RGB to infrared remains unclear, leading to the development of UNIP, a unified infrared pre-training framework that significantly outperforms existing methods in performance and efficiency.

Authors:Fei Wang, Kun Li, Yiqi Nie, Zhangling Duan, Peng Zou, Zhiliang Wu, Yuwei Wang, Yanyan Wei
Title: Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition
Abstract:
In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To accurately recognize sign language from different viewpoints, models must be capable of understanding gestures from multiple angles, making cross-view recognition challenging. To address this, we explore the advantages of ensemble learning, which enhances model robustness and generalization across diverse views. Our approach, built on a multi-dimensional Video Swin Transformer model, leverages this ensemble strategy to achieve competitive performance. Finally, our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based ISLR tracks, demonstrating the effectiveness in handling the challenges of cross-view recognition. The code is available at: https://github.com/Jiafei127/CV_ISLR_WWW2025.
中文摘要:我们针对跨视角孤立手语识别挑战提出的解决方案,采用基于视频Swin Transformer的集成学习方法,通过增强模型对不同视角的泛化能力,在两项赛道中均获得第三名。
English Summary: Our solution for the Cross-View Isolated Sign Language Recognition challenge employs an ensemble learning approach with a Video Swin Transformer model, achieving third place in both competition tracks by effectively recognizing signs from varied camera angles.
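
A minimal form of the ensemble strategy is probability averaging across the individual recognizers before taking the argmax, as sketched below; the models themselves (e.g. RGB and RGB-D Video Swin variants) are placeholders.

```python
# Average class probabilities from several independently trained recognizers
# before taking the argmax.
import torch

def ensemble_predict(logits_list: list) -> torch.Tensor:
    probs = torch.stack([l.softmax(dim=-1) for l in logits_list]).mean(dim=0)
    return probs.argmax(dim=-1)

logits_a = torch.randn(8, 500)   # model A over 500 sign classes (toy logits)
logits_b = torch.randn(8, 500)   # model B
print(ensemble_predict([logits_a, logits_b]))
```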

Authors:Rohit Girmaji, Bhav Beri, Ramanathan Subramanian, Vineet Gandhi
Title: EditIQ: Automated Cinematic Editing of Static Wide-Angle Videos via Dialogue Interpretation and Saliency Cues
Abstract:
We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual camera shots termed rushes are subsequently assembled using an automated editing algorithm, whose objective is to present the viewer with the most vivid scene content. To understand key scene elements and guide the editing process, we employ a two-pronged approach: (1) a large language model (LLM)-based dialogue understanding module to analyze conversational flow, coupled with (2) visual saliency prediction to identify meaningful scene elements and camera shots therefrom. We then formulate cinematic video editing as an energy minimization problem over shot selection, where cinematic constraints determine shot choices, transitions, and continuity. EditIQ synthesizes an aesthetically and visually compelling representation of the original narrative while maintaining cinematic coherence and a smooth viewing experience. Efficacy of EditIQ against competing baselines is demonstrated via a psychophysical study involving twenty participants on the BBC Old School dataset plus eleven theatre performance videos. Video samples from EditIQ can be found at https://editiq-ave.github.io/.

Authors:Simon de Moreau, Mathias Corsia, Hassan Bouchiba, Yasser Almehio, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde
Title: DOC-Depth: A novel approach for dense depth ground truth generation
Abstract:
Accurate depth information is essential for many computer vision applications. Yet, no available dataset recording method allows for fully dense accurate depth estimation in a large scale dynamic environment. In this paper, we introduce DOC-Depth, a novel, efficient and easy-to-deploy approach for dense depth generation from any LiDAR sensor. After reconstructing a consistent dense 3D environment using LiDAR odometry, we address dynamic object occlusions automatically thanks to DOC, our state-of-the-art dynamic object classification method. Additionally, DOC-Depth is fast and scalable, allowing for the creation of unbounded datasets in terms of size and time. We demonstrate the effectiveness of our approach on the KITTI dataset, improving its density from 16.1% to 71.2% and release this new fully dense depth annotation, to facilitate future research in the domain. We also showcase results using various LiDAR sensors and in multiple environments. All software components are publicly available for the research community.

Authors:Zaid Ilyas, Arooba Maqsood, Afsah Saleem, Erchuan Zhang, David Suter, Parminder Raina, Jonathan M. Hodgson, John T. Schousboe, William D. Leslie, Joshua R. Lewis, Syed Zulqarnain Gilani
Title: VerteNet -- A Multi-Context Hybrid CNN Transformer for Accurate Vertebral Landmark Localization in Lateral Spine DXA Images
Abstract:
Lateral Spine Image (LSI) analysis is important for medical diagnosis, treatment planning, and detailed spinal health assessments. Although modalities like Computed Tomography and Digital X-ray Imaging are commonly used, Dual Energy X-ray Absorptiometry (DXA) is often preferred due to lower radiation exposure, seamless capture, and cost-effectiveness. Accurate Vertebral Landmark Localization (VLL) on LSIs is important to detect spinal conditions like kyphosis and lordosis, as well as assessing Abdominal Aortic Calcification (AAC) using Inter-Vertebral Guides (IVGs). Nonetheless, few automated VLL methodologies have concentrated on DXA LSIs. We present VerteNet, a hybrid CNN-Transformer model featuring a novel dual-resolution attention mechanism in self and cross-attention domains, referred to as Dual Resolution Self-Attention (DRSA) and Dual Resolution Cross-Attention (DRCA). These mechanisms capture the diverse frequencies in DXA images by operating at two different feature map resolutions. Additionally, we design a Multi-Context Feature Fusion Block (MCFB) that efficiently integrates the features using DRSA and DRCA. We train VerteNet on 620 DXA LSIs from various machines and achieve superior results compared to existing methods. We also design an algorithm that utilizes VerteNet's predictions in estimating the Region of Interest (ROI) to detect potential abdominal aorta cropping, where inadequate soft tissue hinders calcification assessment. Additionally, we present a small proof-of-concept study to show that IVGs generated from VLL information can improve inter-reader correlation in AAC scoring, addressing two key areas of disagreement in expert AAC-24 scoring: IVG placement and quality control for full abdominal aorta assessment. The code for this work can be found at https://github.com/zaidilyas89/VerteNet.
Chinese: VerteNet是一种混合CNN-Transformer模型,通过双分辨率注意力机制在DXA脊柱侧位图像上实现精准椎骨定位,其性能优于现有方法,并能有效提升腹主动脉钙化评估的准确性。
English: VerteNet is a hybrid CNN-Transformer model that introduces dual-resolution attention mechanisms for accurate vertebral landmark localization on DXA lateral spine images, achieving superior performance and enabling improved abdominal aortic calcification assessment.

Authors:Mokshagna Sai Teja Karanam, Krithika Iyer, Sarang Joshi, Shireen Elhabian
Title: MORPH-LER: Log-Euclidean Regularization for Population-Aware Image Registration
Abstract:
Spatial transformations that capture population-level morphological statistics are critical for medical image analysis. Commonly used smoothness regularizers for image registration fail to integrate population statistics, leading to anatomically inconsistent transformations. Inverse consistency regularizers promote geometric consistency but lack population morphometrics integration. Regularizers that constrain deformations to a low-dimensional manifold address this. However, they prioritize reconstruction over interpretability and neglect diffeomorphic properties, such as group composition and inverse consistency. We introduce MORPH-LER, a Log-Euclidean regularization framework for population-aware unsupervised image registration. MORPH-LER learns population morphometrics from spatial transformations to guide and regularize registration networks, ensuring anatomically plausible deformations. It features a bottleneck autoencoder that computes the principal logarithm of deformation fields via iterative square-root predictions, creating a linearized latent space that respects diffeomorphic properties and enforces inverse consistency. By integrating a registration network with a diffeomorphic autoencoder, MORPH-LER produces smooth, meaningful deformation fields. The framework offers two main contributions: (1) a data-driven regularization strategy that incorporates population-level anatomical statistics to enhance transformation validity and (2) a linearized latent space that enables compact and interpretable deformation fields for efficient population morphometrics analysis. We validate MORPH-LER across two families of deep learning-based registration networks, demonstrating its ability to produce anatomically accurate, computationally efficient, and statistically meaningful transformations on the OASIS-1 brain imaging dataset. https://github.com/iyerkrithika21/MORPH_LER
中文摘要:MORPH-LER是一种新颖的对数欧几里得正则化框架,通过引入保持微分同胚特性的线性化潜空间,将群体形态统计融入无监督图像配准,确保产生解剖学合理的形变场。
English Summary: MORPH-LER is a novel Log-Euclidean regularization framework that integrates population morphometrics into unsupervised image registration, ensuring anatomically plausible deformations through a diffeomorphic autoencoder with a linearized latent space.

Authors:Jianze Li, Jiezhang Cao, Yong Guo, Wenbo Li, Yulun Zhang
Title: One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation
Abstract:
Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in a single sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in the transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods. The code and model will be released at https://github.com/JianzeLi-114/FluxSR.
Chinese: FluxSR提出了一种基于流匹配模型的新型单步扩散真实图像超分辨率技术,通过引入流轨迹蒸馏和感知损失等方法克服了教师模型限制并消除伪影,在性能上超越了现有单步扩散方法。
English: FluxSR introduces a novel one-step diffusion technique for real-world image super-resolution, utilizing flow matching models and innovative losses to overcome teacher model limitations and eliminate artifacts, achieving superior performance over existing methods.

Authors:Hongxin Li, Jingfan Chen, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang
Title: AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Abstract:
User interface understanding with vision-language models (VLMs) has received much attention due to its potential for enhancing software automation. However, existing datasets used to build UI-VLMs either only contain large-scale context-free element annotations or contextualized functional descriptions for elements at a small scale. In this work, we propose the \textbf{AutoGUI} pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing UI state changes before and after simulated interactions. To improve annotation quality, we propose LLM-aided rejection and verification, eliminating invalid annotations without human labor. We construct a high-quality AutoGUI-704k dataset using the proposed pipeline, featuring diverse and detailed functionality annotations that are hardly provided by previous datasets. Human evaluation shows that we achieve annotation correctness comparable to a trained human annotator. Extensive experiments show that our dataset remarkably enhances VLM's UI grounding capabilities and exhibits significant scaling effects. We also show the interesting potential use of our dataset in UI agent tasks. Please view our project at https://autogui-project.github.io/.
中文:AutoGUI 管道利用大语言模型自动为界面元素添加详细功能标注,构建的高质量数据集显著提升了视觉语言模型对用户界面的理解与定位能力,并展现出强大的扩展潜力。
English: The AutoGUI pipeline automates large-scale UI element annotation with detailed functionality descriptions using LLMs, creating a high-quality dataset that significantly enhances vision-language models' UI understanding and grounding capabilities.

Authors:Jingjing Liu, Li Zhang, Xiaoyang Zeng, Wanquan Liu, Jianhua Zhang
Title: MATCNN: Infrared and Visible Image Fusion Method Based on Multi-scale CNN with Attention Transformer
Abstract:
While attention-based approaches have shown considerable progress in enhancing image fusion and addressing the challenges posed by long-range feature dependencies, their efficacy in capturing local features is compromised by the lack of diverse receptive field extraction techniques. To overcome the shortcomings of existing fusion methods in extracting multi-scale local features and preserving global features, this paper proposes a novel cross-modal image fusion approach based on a multi-scale convolutional neural network with attention Transformer (MATCNN). MATCNN utilizes the multi-scale fusion module (MSFM) to extract local features at different scales and employs the global feature extraction module (GFEM) to extract global features. Combining the two reduces the loss of detail features and improves global feature representation. Simultaneously, an information mask is used to label pertinent details within the images, aiming to increase the proportion of salient infrared information and visible background textures preserved in the fused images. Subsequently, a novel optimization algorithm is developed, leveraging the mask to guide feature extraction through the integration of content, structural similarity index measurement, and global feature loss. Quantitative and qualitative evaluations are conducted across various datasets, revealing that MATCNN effectively highlights infrared salient targets, preserves additional details in visible images, and achieves better fusion results for cross-modal images. The code of MATCNN will be available at https://github.com/zhang3849/MATCNN.git.
中文: 本文提出MATCNN跨模态图像融合方法,通过结合多尺度卷积网络与注意力Transformer,有效提取局部与全局特征,在融合图像中增强细节保留与特征表征能力。
English: This paper introduces MATCNN, a cross-modal image fusion method that combines multi-scale convolutional networks with attention Transformers to effectively capture both local and global features, improving detail preservation and feature representation in fused images.

Authors:Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen
Title: D-Attn: Decomposed Attention for Large Vision-and-Language Models
Abstract:
Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture for visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived weighting strategy, namely $α$-weighting. Taking advantage of the flexibility, we are able to introduce two critical improvements in visual token processing while maintaining the capacity of pre-trained LLMs: 1) We rectify the biased positional encoding in textual-to-visual attention to boost visual understanding performance. 2) We diagonalize visual-to-visual attention to reduce computation complexity from $O(|V|^2)$ to $O(|V|)$ for $|V|$ visual tokens without compromising performance. Extensive experiments and analysis validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs (\eg, $5\times$ faster). Code will be available at https://github.com/bytedance/DecomposedAttention.
Chinese: 本文提出分解注意力(D-Attn)架构,通过分离视觉与文本标记处理,在保持预训练语言能力的同时显著提升了视觉理解性能并大幅降低了计算复杂度。
English: This paper introduces Decomposed Attention (D-Attn), a flexible architecture for large vision-and-language models that separates visual and textual token processing, enabling key improvements in visual understanding and computational efficiency while preserving pre-trained language capabilities.
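The abstract describes decomposing LVLM self-attention into visual-to-visual, textual-to-visual, and textual-to-textual branches, with the visual-to-visual branch diagonalized. A minimal single-head sketch of that decomposition is shown below; the constant `alpha` merge weight is a placeholder for the paper's derived alpha-weighting, and the positional-encoding rectification is omitted.

```python
import torch
import torch.nn.functional as F

def decomposed_attention(q_txt, k_txt, v_txt, q_vis, k_vis, v_vis, alpha=0.5):
    """Single-head sketch of a D-Attn-style decomposition.

    q/k/v_txt: (T, d) textual queries/keys/values
    q/k/v_vis: (V, d) visual queries/keys/values
    alpha:     illustrative merge weight (the paper derives its own
               alpha-weighting; this constant is a placeholder).
    """
    # textual-to-textual: ordinary causal self-attention over text tokens
    t2t = F.scaled_dot_product_attention(
        q_txt.unsqueeze(0), k_txt.unsqueeze(0), v_txt.unsqueeze(0), is_causal=True
    ).squeeze(0)

    # textual-to-visual: text queries attend to all visual tokens
    t2v = F.scaled_dot_product_attention(
        q_txt.unsqueeze(0), k_vis.unsqueeze(0), v_vis.unsqueeze(0)
    ).squeeze(0)

    # visual-to-visual, diagonalized: each visual token attends only to
    # itself, so the O(|V|^2) score matrix collapses to O(|V|)
    v2v = v_vis

    # merge the two textual branches
    txt_out = alpha * t2t + (1 - alpha) * t2v
    return txt_out, v2v

# toy shapes: 8 text tokens, 16 visual tokens, dim 32
txt = [torch.randn(8, 32) for _ in range(3)]
vis = [torch.randn(16, 32) for _ in range(3)]
print([o.shape for o in decomposed_attention(*txt, *vis)])
```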

Authors:Aashish Rai, Dilin Wang, Mihir Jain, Nikolaos Sarafianos, Kefan Chen, Srinath Sridhar, Aayush Prakash
Title: UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated superior quality in modeling 3D objects and scenes. However, generating 3DGS remains challenging due to their discrete, unstructured, and permutation-invariant nature. In this work, we present a simple yet effective method to overcome these challenges. We utilize spherical mapping to transform 3DGS into a structured 2D representation, termed UVGS. UVGS can be viewed as multi-channel images, with feature dimensions as a concatenation of Gaussian attributes such as position, scale, color, opacity, and rotation. We further find that these heterogeneous features can be compressed into a lower-dimensional (e.g., 3-channel) shared feature space using a carefully designed multi-branch network. The compressed UVGS can be treated as typical RGB images. Remarkably, we discover that typical VAEs trained with latent diffusion models can directly generalize to this new representation without additional training. Our novel representation makes it effortless to leverage foundational 2D models, such as diffusion models, to directly model 3DGS. Additionally, one can simply increase the 2D UV resolution to accommodate more Gaussians, making UVGS a scalable solution compared to typical 3D backbones. This approach immediately unlocks various novel generation applications of 3DGS by inherently utilizing the already developed superior 2D generation capabilities. In our experiments, we demonstrate various unconditional, conditional generation, and inpainting applications of 3DGS based on diffusion models, which were previously non-trivial.
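The core transformation described here is a spherical mapping from unstructured Gaussians to a structured UV image whose channels stack the Gaussian attributes. A rough sketch of that mapping, assuming object-centered Gaussians and last-write-wins collision handling (both simplifications relative to the paper), might look like this:

```python
import numpy as np

def gaussians_to_uvgs(xyz, attrs, height=256, width=512):
    """Map unstructured 3D Gaussians to a structured UV image (sketch).

    xyz:   (N, 3) Gaussian centers (assumed roughly centered at origin)
    attrs: (N, C) per-Gaussian attributes (scale, color, opacity, rotation...)
    Returns an (H, W, 3 + C) multi-channel image; empty texels stay zero.
    Collisions keep the last write, which a real implementation would
    resolve more carefully.
    """
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.linalg.norm(xyz, axis=1) + 1e-8
    theta = np.arccos(np.clip(z / r, -1.0, 1.0))   # polar angle in [0, pi]
    phi = np.arctan2(y, x)                         # azimuth in [-pi, pi]

    u = ((phi + np.pi) / (2 * np.pi) * (width - 1)).astype(int)
    v = (theta / np.pi * (height - 1)).astype(int)

    uvgs = np.zeros((height, width, 3 + attrs.shape[1]), dtype=np.float32)
    uvgs[v, u, :3] = xyz          # keep position so the mapping is invertible
    uvgs[v, u, 3:] = attrs
    return uvgs

# toy example: 10k Gaussians with 8 extra attribute channels
uv_image = gaussians_to_uvgs(np.random.randn(10_000, 3), np.random.rand(10_000, 8))
print(uv_image.shape)  # (256, 512, 11)
```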

Authors:Dazhou Yu, Genpei Zhang, Liang Zhao
Title: PolyhedronNet: Representation Learning for Polyhedra with Surface-attributed Graph
Abstract:
Ubiquitous geometric objects can be precisely and efficiently represented as polyhedra. The transformation of a polyhedron into a vector, known as polyhedra representation learning, is crucial for manipulating these shapes with mathematical and statistical tools for tasks like classification, clustering, and generation. Recent years have witnessed significant strides in this domain, yet most efforts focus on the vertex sequence of a polyhedron, neglecting the complex surface modeling crucial in real-world polyhedral objects. This study proposes \textbf{PolyhedronNet}, a general framework tailored for learning representations of 3D polyhedral objects. We propose the concept of the surface-attributed graph to seamlessly model the vertices, edges, faces, and their geometric interrelationships within a polyhedron. To effectively learn the representation of the entire surface-attributed graph, we first propose to break it down into local rigid representations to effectively learn each local region's relative positions against the remaining regions without geometric information loss. Subsequently, we propose PolyhedronGNN to hierarchically aggregate the local rigid representation via intra-face and inter-face geometric message passing modules, to obtain a global representation that minimizes information loss while maintaining rotation and translation invariance. Our experimental evaluations on four distinct datasets, encompassing both classification and retrieval tasks, substantiate PolyhedronNet's efficacy in capturing comprehensive and informative representations of 3D polyhedral objects. Code and data are available at {https://github.com/dyu62/3D_polyhedron}.
中文: 本研究提出PolyhedronNet框架,通过表面属性图建模多面体结构,并采用几何消息传递机制学习鲁棒的三维表示,在分类和检索任务中展现出卓越性能。
English: This study introduces PolyhedronNet, a framework that models polyhedra as surface-attributed graphs and employs geometric message passing to learn robust 3D representations, demonstrating superior performance in classification and retrieval tasks.

Authors:Zhengtong Xu, Qiang Qiu, Yu She
Title: VILP: Imitation Learning with Latent Video Planning
Abstract:
In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions. For more details, please refer to our open-source repository https://github.com/ZhengtongXu/VILP.
中文: 本文提出VILP方法,通过隐空间视频扩散模型生成时序一致的多视角机器人视频,在降低对高质量动作数据依赖的同时,在训练成本、推理速度和策略性能上均优于现有方法。
English: This paper introduces VILP, a latent video diffusion model for imitation learning that generates temporally consistent, multi-view robot videos efficiently, outperforming existing methods in training cost, speed, and policy performance while reducing reliance on high-quality action data.

Authors:Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
Title: Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Abstract:
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code is open-sourced and is available at https://github.com/svg-project/Sparse-VideoGen
中文:提出的Sparse VideoGen(SVG)框架通过动态识别扩散变换器中的空间与时间注意力模式,在保持生成质量的同时将视频生成速度提升超过2倍。
English: The proposed Sparse VideoGen (SVG) framework dynamically identifies spatial and temporal attention patterns in Diffusion Transformers to reduce computational costs, achieving over 2x speedup in video generation while maintaining quality.
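The key mechanism is classifying each attention head as spatial or temporal during online profiling. The sketch below uses the fraction of attention mass each sparse pattern preserves as a stand-in criterion; the paper's profiling instead compares the actual sparse and full attention outputs, so this is only an approximation of the idea.

```python
import torch

def classify_head(q, k, frame_ids, token_ids):
    """Decide whether one attention head is 'spatial' or 'temporal' (sketch).

    q, k:      (L, d) queries/keys of this head on a profiling sample
    frame_ids: (L,) frame index of each token
    token_ids: (L,) within-frame spatial index of each token
    A head is called spatial if attending only to same-frame tokens keeps
    more attention mass than attending only to same-location tokens.
    """
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (L, L)

    spatial_mask = frame_ids[:, None] == frame_ids[None, :]      # same frame
    temporal_mask = token_ids[:, None] == token_ids[None, :]     # same location

    # fraction of attention mass each sparse pattern preserves
    spatial_recall = (attn * spatial_mask).sum() / attn.sum()
    temporal_recall = (attn * temporal_mask).sum() / attn.sum()
    return "spatial" if spatial_recall >= temporal_recall else "temporal"

# toy example: 4 frames x 16 tokens per frame, head dim 32
L, d = 64, 32
frames = torch.arange(4).repeat_interleave(16)
tokens = torch.arange(16).repeat(4)
print(classify_head(torch.randn(L, d), torch.randn(L, d), frames, tokens))
```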

Authors:Haibo Tong, Zhaoyang Wang, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, Mingyu Ding, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
Title: MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
Abstract:
Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and bias. Addressing these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained reward. MJ-VIDEO can dynamically select relevant experts to accurately judge the preference based on the input text-video pair. This architecture enables more precise and adaptable preference judgments. Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively. Additionally, introducing MJ-VIDEO for preference tuning in video generation enhances the alignment performance. All our code, data, and models are available at https://aiming-lab.github.io/MJ-VIDEO.github.io/.
中文: 研究者提出了MJ-BENCH-VIDEO视频偏好基准和基于专家混合的MJ-VIDEO奖励模型,显著提升了视频偏好评估与生成对齐的性能。
English: The authors introduce MJ-BENCH-VIDEO, a comprehensive video preference benchmark, and MJ-VIDEO, a Mixture-of-Experts-based reward model that significantly improves video preference assessment and generation alignment.

Authors:Shilong Hong, Yanzhou Zhou, Weichao Xu
Title: DAGNet: A Dual-View Attention-Guided Network for Efficient X-ray Security Inspection
Abstract:
With the rapid development of modern transportation systems and the exponential growth of logistics volumes, intelligent X-ray-based security inspection systems play a crucial role in public safety. Although single-view X-ray baggage scanners are widely deployed, they struggle to accurately identify contraband in complex stacking scenarios due to strong viewpoint dependency and inadequate feature representation. To address this, we propose a Dual-View Attention-Guided Network for Efficient X-ray Security Inspection (DAGNet). This study builds on a shared-weight backbone network as the foundation and constructs three key modules that work together: (1) Frequency Domain Interaction Module (FDIM) dynamically enhances features by adjusting frequency components based on inter-view relationships; (2) Dual-View Hierarchical Enhancement Module (DVHEM) employs cross-attention to align features between views and capture hierarchical associations; (3) Convolutional Guided Fusion Module (CGFM) fuses features to suppress redundancy while retaining critical discriminative information. Collectively, these modules substantially improve the performance of dual-view X-ray security inspection. Experimental results demonstrate that DAGNet outperforms existing state-of-the-art approaches across multiple backbone architectures. The code is available at: https://github.com/ShilongHong/DAGNet.
中文摘要:针对单视角X光安检仪在复杂堆叠场景下识别违禁品的不足,本文提出DAGNet双视角注意力引导网络,通过频域交互、层次化特征对齐和卷积融合模块,显著提升了双视角X光安检性能。
English Summary: To overcome the limitations of single-view X-ray scanners in detecting contraband within complex baggage, this paper introduces DAGNet, a dual-view attention-guided network that enhances inspection accuracy through frequency domain interaction, hierarchical feature alignment, and convolutional fusion modules.

Authors:Yirui Zeng, Jun Fu, Hadi Amirpour, Huasheng Wang, Guanghui Yue, Hantao Liu, Ying Chen, Wei Zhou
Title: CLIP-DQA: Blindly Evaluating Dehazed Images from Global and Local Perspectives Using CLIP
Abstract:
Blind dehazed image quality assessment (BDQA), which aims to accurately predict the visual quality of dehazed images without any reference information, is essential for the evaluation, comparison, and optimization of image dehazing algorithms. Existing learning-based BDQA methods have achieved remarkable success, while the small scale of DQA datasets limits their performance. To address this issue, in this paper, we propose to adapt Contrastive Language-Image Pre-Training (CLIP), pre-trained on large-scale image-text pairs, to the BDQA task. Specifically, inspired by the fact that the human visual system understands images based on hierarchical features, we take global and local information of the dehazed image as the input of CLIP. To accurately map the input hierarchical information of dehazed images into the quality score, we tune both the vision branch and language branch of CLIP with prompt learning. Experimental results on two authentic DQA datasets demonstrate that our proposed approach, named CLIP-DQA, achieves more accurate quality predictions over existing BDQA methods. The code is available at https://github.com/JunFu1995/CLIP-DQA.
中文: 本文提出CLIP-DQA方法,通过结合分层特征和提示学习将对比语言-图像预训练模型应用于无参考去雾图像质量评估,在真实数据集上取得了优于现有方法的性能。
English: This paper introduces CLIP-DQA, a method that adapts Contrastive Language-Image Pre-Training (CLIP) to blind dehazed image quality assessment by incorporating hierarchical features and prompt learning, achieving superior performance over existing methods on authentic datasets.

Authors:Lifan Jiang, Boxi Wu, Jiahui Zhang, Xiaotong Guan, Shuang Chen
Title: HuViDPO:Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment
Abstract:
With the rapid development of AIGC technology, significant progress has been made in diffusion model-based technologies for text-to-image (T2I) and text-to-video (T2V). In recent years, a few studies have introduced the strategy of Direct Preference Optimization (DPO) into T2I tasks, significantly enhancing human preferences in generated images. However, existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences using DPO strategies. Additionally, challenges such as the scarcity of paired video preference data hinder effective model training. At the same time, the lack of training datasets poses a risk of insufficient flexibility and poor quality in the generated videos. Based on these problems, our work proposes three targeted solutions in sequence. 1) Our work is the first to introduce the DPO strategy into T2V tasks. By deriving a carefully structured loss function, we utilize human feedback to align video generation with human preferences. We refer to this new method as HuViDPO. 2) Our work constructs small-scale human preference datasets for each action category and fine-tunes the model, improving the aesthetic quality of the generated videos while reducing training costs. 3) We adopt a First-Frame-Conditioned strategy, leveraging the rich information from the first frame to guide the generation of subsequent frames, enhancing flexibility in video generation. At the same time, we employ a SparseCausal Attention mechanism to enhance the quality of the generated videos. More details and examples can be accessed on our website: https://tankowa.github.io/HuViDPO.github.io/.

Authors:Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan
Title: Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
Abstract:
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at https://github.com/HashmatShadab/Robust-LLaVA.
Chinese: 本研究通过整合经过对抗性预训练的视觉模型,为多模态大语言模型引入了一个鲁棒框架,显著提升了对抗各种威胁的鲁棒性,同时保持了良好的干净性能,在图像描述和视觉问答等任务中取得了显著成效。
English: This work introduces a robust framework for Multi-modal Large Language Models (MLLMs) by integrating adversarially pre-trained vision models, which significantly enhances adversarial robustness against various threats while maintaining clean performance, achieving notable gains in tasks like captioning and visual question-answering.

Authors:Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
Title: VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Abstract:
Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. Through comprehensive empirical evaluation on our proposed LongerVideos benchmark, comprising over 160 videos totaling 134+ hours across lecture, documentary, and entertainment categories, VideoRAG demonstrates substantial performance gains compared to existing RAG alternatives and long video understanding methods. The source code of the VideoRAG implementation and the benchmark dataset are openly available at: https://github.com/HKUDS/VideoRAG.
中文: 本文提出首个专为超长视频设计的检索增强生成框架VideoRAG,其采用双通道架构整合基于图结构的文本知识基础与多模态上下文编码,在包含160余个视频的LongerVideos基准测试中展现出卓越性能。
English: This paper presents VideoRAG, the first retrieval-augmented generation framework designed for processing unlimited-length videos through a dual-channel architecture that integrates graph-based textual knowledge grounding and multi-modal context encoding, demonstrating superior performance on the comprehensive LongerVideos benchmark.

Authors:Andrew Rouditchenko, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
Title: mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Abstract:
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
中文:提出的mWhisper-Flamingo模型结合预训练音频和视频模型,通过解码器模态丢弃技术实现多语言视听语音识别的最优性能,在九种语言的嘈杂环境中持续超越纯音频系统。
English: The proposed mWhisper-Flamingo model combines pre-trained audio and video models with decoder modality dropout to achieve state-of-the-art multilingual audio-visual speech recognition, consistently outperforming audio-only systems in noisy conditions across nine languages.
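Decoder modality dropout, as described here, trains the model on paired audio-visual inputs as well as audio-only and video-only inputs. A minimal sketch of that training-time switch is below; zeroing the dropped modality and the specific probabilities are assumptions, not necessarily the mWhisper-Flamingo recipe.

```python
import random
import torch

def modality_dropout(audio_feats, video_feats, p_audio_only=0.25, p_video_only=0.25):
    """Decoder modality dropout for audio-visual training (minimal sketch).

    With some probability the decoder sees only audio or only video features,
    so it learns to work from either modality alone as well as from the pair.
    The dropped modality is replaced by zeros of the same shape here.
    """
    r = random.random()
    if r < p_audio_only:
        return audio_feats, torch.zeros_like(video_feats)
    if r < p_audio_only + p_video_only:
        return torch.zeros_like(audio_feats), video_feats
    return audio_feats, video_feats

# toy usage inside a training step
audio = torch.randn(2, 100, 512)   # (batch, frames, dim)
video = torch.randn(2, 100, 512)
audio_in, video_in = modality_dropout(audio, video)
```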

Authors:Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas
Title: Improved Training Technique for Latent Consistency Models
Abstract:
Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-$c$ scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: https://github.com/quandao10/sLCT/
中文摘要:本研究通过采用柯西损失、引入扩散损失和最优传输耦合,并配合自适应调度器与无缩放层归一化,有效解决了潜在空间中异常值导致的性能下降问题,成功训练出能进行高质量少步采样的潜在一致性模型,显著缩小了与扩散模型的性能差距。
English Summary: This study improves latent consistency models by addressing performance degradation from impulsive outliers in latent spaces through Cauchy losses, diffusion loss integration, optimal transport coupling, and architectural enhancements, enabling high-quality few-step sampling that narrows the gap with diffusion models.
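The central loss change, replacing the Pseudo-Huber loss with a Cauchy loss to tame impulsive latent outliers, is easy to illustrate. The element-wise forms and the constant c below are illustrative; the paper defines its own scaling and scheduler.

```python
import torch

def pseudo_huber_loss(x, y, c=0.03):
    """Element-wise pseudo-Huber loss (as in iCT); grows roughly linearly
    for large errors, so a single impulsive outlier keeps a full-strength
    gradient."""
    return (torch.sqrt((x - y) ** 2 + c ** 2) - c).mean()

def cauchy_loss(x, y, c=0.03):
    """Cauchy (Lorentzian) loss; grows only logarithmically, so impulsive
    outliers common in latent space contribute far less to the gradient."""
    return torch.log(1 + ((x - y) / c) ** 2).mean()

# the outlier's contribution grows linearly under pseudo-Huber but only
# logarithmically under Cauchy
pred = torch.zeros(1000)
target = torch.zeros(1000)
target[0] = 50.0  # impulsive latent outlier
print(pseudo_huber_loss(pred, target), cauchy_loss(pred, target))
```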

Authors:Jue Gong, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
Title: Human Body Restoration with One-Step Diffusion Model and A New Benchmark
Abstract:
Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (\emph{PERSONA}) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose \emph{OSDHuman}, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be available at https://github.com/gobunu/OSDHuman.
中文: 本研究提出了一种高质量自动数据集构建流程(HQ-ACF)和新型一步扩散模型(OSDHuman),通过创建PERSONA数据集并采用高保真图像嵌入器避免误导提示,在人体修复任务中取得了优于现有方法的成果。
English: This study introduces a high-quality automated dataset creation pipeline (HQ-ACF) and a novel one-step diffusion model (OSDHuman) for human body restoration, achieving superior results through a new PERSONA dataset and avoiding misleading prompts with a high-fidelity image embedder.

Authors:Zhiteng Li, Mingyuan Xia, Jingyuan Zhang, Zheng Hui, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: AdaSVD: Adaptive Singular Value Decomposition for Large Language Models
Abstract:
Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices $\mathcal{U}$ and $\mathcal{V}^\top$. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM/VLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. Code and models of AdaSVD will be available at https://github.com/ZHITENGLI/AdaSVD.
Chinese: AdaSVD是一种自适应大语言模型压缩方法,通过误差补偿和分层压缩比分配,在显著降低内存占用的同时超越了现有SVD技术的性能表现。
English: AdaSVD is an adaptive LLM compression method that uses error compensation and layer-specific compression ratios to outperform existing SVD techniques while significantly reducing memory usage.
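The abstract pairs plain truncated SVD with an adaptive compensation step that alternately updates the two factors. The sketch below shows truncated SVD plus a simplified activation-aware alternating refit as a stand-in for adaComp; the actual adaComp update and the adaCR layer-wise ratio assignment are not reproduced here.

```python
import numpy as np

def svd_compress(W, rank):
    """Plain truncated SVD: W (d_out, d_in) -> two low-rank factors."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (d_out, rank)
    B = Vt[:rank]                     # (rank, d_in)
    return A, B

def compensate(W, A, B, X, n_iter=3):
    """Alternately refit the two factors on calibration activations X
    (d_in, n) to reduce the output error ||W X - A B X||, a simplified
    stand-in for truncation-error compensation."""
    for _ in range(n_iter):
        # fix B, solve least squares for A on the calibration outputs
        A = (W @ X) @ np.linalg.pinv(B @ X)
        # fix A, refit B (plain pseudo-inverse; ignores the X weighting)
        B = np.linalg.pinv(A) @ W
    return A, B

W = np.random.randn(256, 256)
X = np.random.randn(256, 64)          # calibration activations
A, B = svd_compress(W, rank=64)
A, B = compensate(W, A, B, X)
print(np.linalg.norm(W @ X - A @ B @ X) / np.linalg.norm(W @ X))
```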

Authors:Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Title: Detecting Backdoor Samples in Contrastive Language Image Pretraining
Abstract:
Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01\% of the training dataset. This raises security concerns on the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than that of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs. The code is publicly available in our \href{https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples}{GitHub repository}.
中文:CLIP模型极易因微量数据投毒遭受后门攻击,但基于其稀疏局部特征的传统离群点检测器能高效识别恶意样本,实现大规模数据集的快速净化。
English: CLIP models are highly susceptible to backdoor attacks through minimal data poisoning, but these malicious samples can be efficiently detected using local outlier detectors based on their sparse local neighborhoods, enabling rapid cleanup of large datasets.
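The detection recipe rests on poisoned samples sitting in unusually sparse local neighborhoods of the learned representation, so a density-ratio local outlier detector can flag them. The sketch below uses scikit-learn's LocalOutlierFactor as one familiar example of such a detector; the paper evaluates several variants, and the neighborhood size here is arbitrary.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def score_backdoor_candidates(embeddings, k=16):
    """Rank training samples by local-density outlierness (sketch).

    embeddings: (N, d) CLIP embeddings of the training samples. Poisoned
    samples tend to sit in sparse local neighborhoods, so a density-ratio
    detector assigns them high outlier scores.
    """
    feats = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    lof = LocalOutlierFactor(n_neighbors=k, metric="cosine")
    lof.fit_predict(feats)
    # higher score = sparser neighborhood = more suspicious
    return -lof.negative_outlier_factor_

scores = score_backdoor_candidates(np.random.randn(1000, 512))
suspects = np.argsort(scores)[::-1][:10]   # top-10 most suspicious samples
print(suspects)
```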

Authors:Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin
Title: Inverse Bridge Matching Distillation
Abstract:
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than the teacher model used, depending on the particular setup. We provide the code at https://github.com/ngushchin/IBMD
Chinese Summary: 作者提出了一种基于逆向桥匹配的新型蒸馏技术,可将扩散桥模型的推理速度提升4至100倍,并在多种图像处理任务中保持或超越原始模型的生成质量。
English Summary: The authors introduce a novel distillation technique using inverse bridge matching to significantly accelerate diffusion bridge models' inference by 4x to 100x while maintaining or improving generation quality across various image tasks.

Authors:Xiao Lin, Yun Peng, Liuyi Wang, Xianyou Zhong, Minghao Zhu, Jingwei Yang, Yi Feng, Chengju Liu, Qijun Chen
Title: CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
Abstract:
Category-level object pose estimation aims to recover the rotation, translation and size of unseen instances within predefined categories. In this task, deep neural network-based methods have demonstrated remarkable performance. However, previous studies show they suffer from spurious correlations raised by "unclean" confounders in models, hindering their performance on novel instances with significant variations. To address this issue, we propose CleanPose, a novel approach integrating causal learning and knowledge distillation to enhance category-level pose estimation. To mitigate the negative effect of unobserved confounders, we develop a causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further improve generalization ability, we devise a residual-based knowledge distillation method that has proven effective in providing comprehensive category information guidance. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of the proposed CleanPose over state-of-the-art methods. Code will be available at https://github.com/chrislin0621/CleanPose.
Chinese: CleanPose通过引入因果推理模块和基于残差的知识蒸馏方法,有效减少类别级物体姿态估计中的虚假相关性并提升泛化能力,在多个基准测试中展现出卓越性能。
English: CleanPose introduces a causal inference module and residual-based knowledge distillation to mitigate spurious correlations and enhance generalization in category-level object pose estimation, demonstrating superior performance across multiple benchmarks.

Authors:Haiduo Huang, Tian Xia, Wenzhe zhao, Pengju Ren
Title: Partial Channel Network: Compute Fewer, Perform Better
Abstract:
Designing a module or mechanism that enables a network to maintain low parameters and FLOPs without sacrificing accuracy and throughput remains a challenge. To address this challenge and exploit the redundancy within feature map channels, we propose a new solution: partial channel mechanism (PCM). Specifically, through the split operation, the feature map channels are divided into different parts, with each part corresponding to different operations, such as convolution, attention, pooling, and identity mapping. Based on this assumption, we introduce a novel partial attention convolution (PATConv) that can efficiently combine convolution with visual attention. Our exploration indicates that the PATConv can completely replace both the regular convolution and the regular visual attention while reducing model parameters and FLOPs. Moreover, PATConv can derive three new types of blocks: Partial Channel-Attention block (PAT_ch), Partial Spatial-Attention block (PAT_sp), and Partial Self-Attention block (PAT_sf). In addition, we propose a novel dynamic partial convolution (DPConv) that can adaptively learn the proportion of split channels in different layers to achieve better trade-offs. Building on PATConv and DPConv, we propose a new hybrid network family, named PartialNet, which achieves superior top-1 accuracy and inference speed compared to some SOTA models on ImageNet-1K classification and excels in both detection and segmentation on the COCO dataset. Our code is available at https://github.com/haiduo/PartialNet.
中文摘要:提出的PartialNet框架通过引入部分通道机制和动态部分卷积,在保持精度的同时有效减少模型参数和计算量,在ImageNet-1K和COCO基准测试中实现了最优性能。
English Summary: The proposed PartialNet framework introduces partial channel mechanisms and dynamic partial convolution to efficiently reduce model parameters and FLOPs while maintaining accuracy, achieving state-of-the-art performance on ImageNet-1K and COCO benchmarks.
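The partial channel mechanism splits feature-map channels and routes each part through a different operator before concatenation. The block below is a rough sketch of that idea with illustrative split ratios and a squeeze-and-excitation-style attention branch; it is not the PATConv or DPConv configuration from the paper.

```python
import torch
import torch.nn as nn

class PartialChannelBlock(nn.Module):
    """Sketch of a partial channel mechanism: channels are split and each
    part gets a different operator (conv / channel attention / identity),
    then the parts are concatenated again."""

    def __init__(self, channels, conv_ratio=0.25, attn_ratio=0.25):
        super().__init__()
        self.c_conv = int(channels * conv_ratio)
        self.c_attn = int(channels * attn_ratio)
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, padding=1)
        # squeeze-and-excitation style channel attention on the second slice
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(self.c_attn, self.c_attn, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        xc, xa, xi = torch.split(
            x, [self.c_conv, self.c_attn, x.shape[1] - self.c_conv - self.c_attn], dim=1
        )
        xc = self.conv(xc)            # local features on one slice
        xa = xa * self.attn(xa)       # attention-reweighted slice
        return torch.cat([xc, xa, xi], dim=1)   # identity on the rest

block = PartialChannelBlock(64)
print(block(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```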

Authors:Eun-Sol Park, MiSo Park, Seung Park, Yong-Goo Shin
Title: FSPGD: Rethinking Black-box Attacks on Semantic Segmentation
Abstract:
Transferability, the ability of adversarial examples crafted for one model to deceive other models, is crucial for black-box attacks. Despite advancements in attack methods for semantic segmentation, transferability remains limited, reducing their effectiveness in real-world applications. To address this, we introduce the Feature Similarity Projected Gradient Descent (FSPGD) attack, a novel black-box approach that enhances both attack performance and transferability. Unlike conventional segmentation attacks that rely on output predictions for gradient calculation, FSPGD computes gradients from intermediate layer features. Specifically, our method introduces a loss function that targets local information by comparing features between clean images and adversarial examples, while also disrupting contextual information by accounting for spatial relationships between objects. Experiments on Pascal VOC 2012 and Cityscapes datasets demonstrate that FSPGD achieves superior transferability and attack performance, establishing a new state-of-the-art benchmark. Code is available at https://github.com/KU-AIVS/FSPGD.
Chinese: FSPGD攻击通过从中间层特征计算梯度并引入针对局部和上下文信息的损失函数,显著提升了语义分割黑盒攻击中对抗样本的可迁移性。
English: The FSPGD attack enhances adversarial example transferability in black-box attacks on semantic segmentation by computing gradients from intermediate features and introducing a loss function that targets both local and contextual information.
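The distinguishing step here is driving PGD with a loss on intermediate-layer features rather than output predictions. The sketch below shows that core idea only: a forward hook collects features, and the update maximizes the feature discrepancy between adversarial and clean inputs; the paper's full loss additionally disrupts the spatial context between objects, which is omitted.

```python
import torch
import torch.nn.functional as F

def feature_pgd(model, layer, x, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD attack driven by intermediate-layer features (minimal sketch)."""
    x = x.detach()
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))

    with torch.no_grad():
        model(x)
        clean_feat = feats["out"].detach()

    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        model(x_adv)
        loss = F.mse_loss(feats["out"], clean_feat)   # push features apart
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
    handle.remove()
    return x_adv.detach()
```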

Authors:Ismail Khalfaoui-Hassani, Stefan Kesselheim
Title: Polynomial, trigonometric, and tropical activations
Abstract:
Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library, which can be accessed via: https://github.com/K-H-Ismail/torchortho.
中文摘要:本文研究表明,基于正交基(如埃尔米特多项式和傅里叶基)的激活函数能有效用于深度神经网络训练,不仅解决了梯度爆炸/消失问题,还能通过插值逼近经典激活函数,特别适用于微调任务。
English Summary: This article demonstrates that orthonormal basis functions, such as Hermite polynomials and Fourier bases, can effectively serve as activations in deep neural networks, enabling stable training without special mechanisms while providing insights into network structure and approximation capabilities.
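A small sketch of a learnable Hermite-basis activation with variance-preserving initialization is given below; it is written in the spirit of the described approach, not against the torchortho API, and the degree and initialization are illustrative.

```python
import math
import torch
import torch.nn as nn

class HermiteActivation(nn.Module):
    """Learnable activation built from the first few normalized
    (probabilists') Hermite polynomials.

    he_k = He_k / sqrt(k!) are orthonormal under a standard Gaussian, so for
    x ~ N(0, 1) the output variance equals the squared norm of the
    non-constant coefficients; initializing them as (0, 1, 0, 0) keeps the
    variance at 1 and starts the activation as the identity."""

    def __init__(self, degree=3):
        super().__init__()
        init = torch.zeros(degree + 1)
        init[1] = 1.0                       # start as the identity map
        self.coeffs = nn.Parameter(init)

    def forward(self, x):
        # recurrence: He_0 = 1, He_1 = x, He_{k+1} = x He_k - k He_{k-1}
        he = [torch.ones_like(x), x]
        for k in range(1, len(self.coeffs) - 1):
            he.append(x * he[k] - k * he[k - 1])
        out = torch.zeros_like(x)
        for k, c in enumerate(self.coeffs):
            out = out + c * he[k] / math.sqrt(math.factorial(k))
        return out

act = HermiteActivation()
print(act(torch.randn(4, 8)).std())   # close to 1 at initialization
```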

Authors:Tongkun Liu, Bing Li, Xiao Jin, Yupeng Shi, Qiuying Li, Xiang Wei
Title: Exploring Few-Shot Defect Segmentation in General Industrial Scenarios with Metric Learning and Vision Foundation Models
Abstract:
Industrial defect segmentation is critical for manufacturing quality control. Due to the scarcity of training defect samples, few-shot semantic segmentation (FSS) holds significant value in this field. However, existing studies mostly apply FSS to tackle defects on simple textures, without considering more diverse scenarios. This paper aims to address this gap by exploring FSS in broader industrial products with various defect types. To this end, we contribute a new real-world dataset and reorganize some existing datasets to build a more comprehensive few-shot defect segmentation (FDS) benchmark. On this benchmark, we thoroughly investigate metric learning-based FSS methods, including those based on meta-learning and those based on Vision Foundation Models (VFMs). We observe that existing meta-learning-based methods are generally not well-suited for this task, while VFMs hold great potential. We further systematically study the applicability of various VFMs in this task, involving two paradigms: feature matching and the use of Segment Anything (SAM) models. We propose a novel efficient FDS method based on feature matching. Meanwhile, we find that SAM2 is particularly effective for addressing FDS through its video track mode. The contributed dataset and code will be available at: https://github.com/liutongkun/GFDS.
中文: 本文构建了全面的少样本缺陷分割基准,针对工业场景中的多样化缺陷检测问题,发现视觉基础模型潜力显著,并提出新型特征匹配方法,同时验证了SAM2模型在视频追踪模式下的卓越性能。
English: This paper introduces a comprehensive few-shot defect segmentation benchmark to address industrial defect detection across diverse scenarios, finding that Vision Foundation Models show strong potential while proposing a novel feature-matching method and highlighting SAM2's effectiveness.
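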

Authors:Haiduo Huang, Zhenhua Liu, Tian Xia, Wenzhe zhao, Pengju Ren
Title: Nearly Lossless Adaptive Bit Switching
Abstract:
Model quantization is widely applied for compressing and accelerating deep neural networks (DNNs). However, conventional Quantization-Aware Training (QAT) focuses on training DNNs with uniform bit-width. The bit-width settings vary across different hardware and transmission demands, which induces considerable training and storage costs. Hence, the scheme of one-shot joint training of multiple precisions is proposed to address this issue. Previous works either store a larger FP32 model to switch between different precision models for higher accuracy or store a smaller INT8 model but compromise accuracy due to using shared quantization parameters. In this paper, we introduce the Double Rounding quantization method, which fully utilizes the quantized representation range to accomplish nearly lossless bit-switching while reducing storage by using the highest integer precision instead of full precision. Furthermore, we observe a competitive interference among different precisions during one-shot joint training, primarily due to inconsistent gradients of quantization scales during backward propagation. To tackle this problem, we propose an Adaptive Learning Rate Scaling (ALRS) technique that dynamically adapts learning rates for various precisions to optimize the training process. Additionally, we extend our Double Rounding to one-shot mixed precision training and develop a Hessian-Aware Stochastic Bit-switching (HASB) strategy. Experimental results on ImageNet-1K classification demonstrate that our methods hold advantages over state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision settings. We also validate the feasibility of our method on detection and segmentation tasks, as well as on LLM tasks. Our codes are available at https://github.com/haiduo/Double-Rounding.
中文摘要:本文提出双舍入量化方法和自适应学习率调整技术,通过一次性联合训练实现多精度模型的高效转换,在减少存储的同时保持高精度,并在多项视觉任务中验证了其优越性。
English Summary: The paper introduces Double Rounding quantization and Adaptive Learning Rate Scaling to enable efficient one-shot joint training of multiple precision models, achieving nearly lossless bit-switching while reducing storage and maintaining high accuracy across various tasks.
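The storage idea, keeping only the highest-integer-precision weights and deriving lower-bit weights by a second rounding, can be sketched as follows. This is a simplified reading of Double Rounding with naive per-tensor symmetric quantization and scale handling; the paper's exact scheme may differ.

```python
import torch

def quantize_int8(w, scale):
    """Symmetric per-tensor quantization to the highest precision (INT8)."""
    return torch.clamp(torch.round(w / scale), -128, 127)

def switch_bits(w_int8, bits):
    """Derive a lower-bit weight tensor directly from the stored INT8
    weights by a second rounding step, instead of re-quantizing from FP32."""
    shift = 2 ** (8 - bits)
    q = torch.round(w_int8 / shift)                     # second rounding
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(q, -qmax - 1, qmax) * shift      # back on the INT8 grid

w = torch.randn(4, 4)
scale = w.abs().max() / 127
w8 = quantize_int8(w, scale)
w4 = switch_bits(w8, bits=4)          # 4-bit weights without touching FP32
print((w8 * scale - w).abs().mean(), (w4 * scale - w).abs().mean())
```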

Authors:Qianyu Guo, Jingrong Wu, Tianxing Wu, Haofen Wang, Weifeng Ge, Wenqiang Zhang
Title: Enhancing Environmental Robustness in Few-shot Learning via Conditional Representation Learning
Abstract:
Few-shot learning (FSL) has recently been extensively utilized to overcome the scarcity of training data in domain-specific visual recognition. In real-world scenarios, environmental factors such as complex backgrounds, varying lighting conditions, long-distance shooting, and moving targets often cause test images to exhibit numerous incomplete targets or noise disruptions. However, current research on evaluation datasets and methodologies has largely ignored the concept of "environmental robustness", which refers to maintaining consistent performance in complex and diverse physical environments. This neglect has led to a notable decline in the performance of FSL models during practical testing compared to their training performance. To bridge this gap, we introduce a new real-world multi-domain few-shot learning (RD-FSL) benchmark, which includes four domains and six evaluation datasets. The test images in this benchmark feature various challenging elements, such as camouflaged objects, small targets, and blurriness. Our evaluation experiments reveal that existing methods struggle to utilize training images effectively to generate accurate feature representations for challenging test images. To address this problem, we propose a novel conditional representation learning network (CRLNet) that integrates the interactions between training and testing images as conditional information in their respective representation processes. The main goal is to reduce intra-class variance or enhance inter-class variance at the feature representation level. Finally, comparative experiments reveal that CRLNet surpasses the current state-of-the-art methods, achieving performance improvements ranging from 6.83% to 16.98% across diverse settings and backbones. The source code and dataset are available at https://github.com/guoqianyu-alberta/Conditional-Representation-Learning.
中文: 针对小样本学习在现实环境中因缺乏环境鲁棒性而表现不佳的问题,我们提出了新的多领域基准和条件表示学习网络(CRLNet),该网络在不同设置下显著提升了模型性能。
English: Few-shot learning models often underperform in real-world environments due to unaddressed robustness issues, prompting the introduction of a new benchmark and a conditional representation learning network (CRLNet) that significantly enhances performance across diverse settings.

Authors:Vernon Y. H. Toh, Yew Ken Chia, Deepanway Ghosal, Soujanya Poria
Title: The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Abstract:
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA, which demand fine-grained visual perception. Our results reveal that o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperform the GPT-[n] series and show strong scalability in multimodal reasoning. Nonetheless, despite these substantial advancements and the superior capabilities demonstrated by the o-[n] series, our findings highlight that even these leading models face persistent challenges. Difficulties are particularly evident in tasks requiring precise visual perception, robust compositional reasoning across multiple visual attributes, and solving complex algorithmic or highly combinatorial puzzles, indicating critical areas for future AGI development. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.
中文: OpenAI的o系列模型在多模态推理能力上显著优于GPT系列,但在需要精确视觉感知和复杂组合推理的任务中仍面临挑战。
English: OpenAI's o-series models demonstrate superior multimodal reasoning capabilities over GPT-series models but still face challenges in tasks requiring precise visual perception and complex compositional reasoning.

Authors:Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
Title: Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Abstract:
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
Chinese Summary: 本文提出潜在偏好优化(LPO)方法,通过将预训练扩散模型重新用作潜在空间的步级奖励模型,在显著提升训练速度的同时,更好地使生成图像符合人类审美和文本对齐等偏好。
English Summary: This paper introduces Latent Preference Optimization (LPO), a novel method that leverages pre-trained diffusion models as step-level reward models in latent space to better align generated images with human preferences while achieving significant training acceleration.

Authors:Jiaxin Ye, Boyuan Cao, Hongming Shan
Title: Emotional Face-to-Speech
Abstract:
How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed emotional face-to-speech, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance training efficiency and generation quality, we further introduce a coarse-to-fine curriculum learning algorithm for multi-level token processing. In addition, we develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively. Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech compared to baselines, even surpassing speech-driven methods. Demos are shown at https://demoface-ai.github.io/.

Authors:Zilong Wang, Zhiyang Dou, Yuan Liu, Cheng Lin, Xiao Dong, Yunhui Guo, Chenxu Zhang, Xin Li, Wenping Wang, Xiaohu Guo
Title: WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction
Abstract:
In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at https://wyiguanw.github.io/WonderHuman/.
中文: WonderHuman通过利用2D生成扩散模型先验和双空间优化技术,从单目视频中重建动态人体化身,精确渲染未观察到的身体部位,实现了最先进的逼真效果。
English: WonderHuman reconstructs dynamic human avatars from monocular videos using 2D generative diffusion priors and dual-space optimization to accurately render unseen body parts, achieving state-of-the-art photorealistic results.

Authors:Wenfei Zhang, Ruipeng Zhao, Yongxiang Yao, Yi Wan, Peihao Wu, Jiayuan Li, Yansheng Li, Yongjun Zhang
Title: Multi-Resolution SAR and Optical Remote Sensing Image Registration Methods: A Review, Datasets, and Future Perspectives
Abstract:
Synthetic Aperture Radar (SAR) and optical image registration is essential for remote sensing data fusion, with applications in military reconnaissance, environmental monitoring, and disaster management. However, challenges arise from differences in imaging mechanisms, geometric distortions, and radiometric properties between SAR and optical images. As image resolution increases, fine SAR textures become more significant, leading to alignment issues and 3D spatial discrepancies. Two major gaps exist: the lack of a publicly available multi-resolution, multi-scene registration dataset and the absence of systematic analysis of current methods. To address this, the MultiResSAR dataset was created, containing over 10k pairs of multi-source, multi-resolution, and multi-scene SAR and optical images. Sixteen state-of-the-art algorithms were tested. Results show no algorithm achieves 100% success, and performance decreases as resolution increases, with most failing on sub-meter data. XoFTR performs best among deep learning methods (40.58%), while RIFT performs best among traditional methods (66.51%). Future research should focus on noise suppression, 3D geometric fusion, cross-view transformation modeling, and deep learning optimization for robust registration of high-resolution SAR and optical images. The dataset is available at https://github.com/betterlll/Multi-Resolution-SAR-dataset-.
中文摘要:MultiResSAR数据集填补了多分辨率SAR与光学图像配准公开数据的空白,测试表明尚无算法能完全成功配准,且分辨率越高性能越差,其中XoFTR和RIFT分别成为深度学习和传统方法中的最佳算法。
English Summary: The MultiResSAR dataset addresses the lack of public multi-resolution SAR-optical registration data, revealing that no tested algorithm achieves perfect success, with performance declining at higher resolutions and XoFTR and RIFT being the top deep learning and traditional methods respectively.

Authors:Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li
Title: Adapting Foundation Models for Few-Shot Medical Image Segmentation: Actively and Sequentially
Abstract:
Recent advances in foundation models have brought promising results in computer vision, including medical image segmentation. Fine-tuning foundation models on specific low-resource medical tasks has become a standard practice. However, ensuring reliable and robust model adaptation when the target task has a large domain gap and few annotated samples remains a challenge. Previous few-shot domain adaptation (FSDA) methods seek to bridge the distribution gap between source and target domains by utilizing auxiliary data. The selection and scheduling of auxiliaries are often based on heuristics, which can easily cause negative transfer. In this work, we propose an Active and Sequential domain AdaPtation (ASAP) framework for dynamic auxiliary dataset selection in FSDA. We formulate FSDA as a multi-armed bandit problem and derive an efficient reward function to prioritize training on auxiliary datasets that align closely with the target task, through a single-round fine-tuning. Empirical validation on diverse medical segmentation datasets demonstrates that our method achieves favorable segmentation performance, significantly outperforming the state-of-the-art FSDA methods, achieving an average gain of 27.75% on MRI and 7.52% on CT datasets in Dice score. Code is available at the git repository: https://github.com/techicoco/ASAP.
中文:ASAP框架通过多臂老虎机方法动态选择辅助数据集,有效提升医学图像分割中的少样本域适应性能,相比现有方法取得显著改进。
English: The ASAP framework dynamically selects auxiliary datasets through a multi-armed bandit approach to enhance few-shot domain adaptation in medical image segmentation, achieving significant performance gains over existing methods.
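To make the bandit formulation above concrete, the sketch below shows a generic UCB-style loop over candidate auxiliary datasets. It is only a minimal illustration: `finetune_and_eval` is a hypothetical placeholder standing in for one round of fine-tuning on the chosen auxiliary set followed by target-set evaluation, and plain UCB stands in for whatever bandit strategy and reward function ASAP actually derives.

```python
# Minimal UCB-style sketch of bandit-driven auxiliary dataset selection.
# `finetune_and_eval` is a hypothetical placeholder standing in for one round
# of fine-tuning on the chosen auxiliary set followed by target-set Dice evaluation.
import math
import random

def finetune_and_eval(aux_name: str) -> float:
    # Placeholder: return the Dice gain observed after training on `aux_name`.
    return random.random()

def ucb_select(aux_names, total_rounds=20, c=1.0):
    counts = {a: 0 for a in aux_names}
    means = {a: 0.0 for a in aux_names}
    for t in range(1, total_rounds + 1):
        # Pick the arm with the highest upper confidence bound.
        def ucb(a):
            if counts[a] == 0:
                return float("inf")  # force one pull of every arm first
            return means[a] + c * math.sqrt(2 * math.log(t) / counts[a])
        arm = max(aux_names, key=ucb)
        reward = finetune_and_eval(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean of the reward
    return max(aux_names, key=lambda a: means[a])

print(ucb_select(["aux_mri_a", "aux_mri_b", "aux_ct"]))
```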

Authors:Anuj Singh, Sayak Mukherjee, Ahmad Beirami, Hadi Jamali-Rad
Title: CoDe: Blockwise Control for Denoising Diffusion Models
Abstract:
Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving a competitive performance against the state-of-the-art baselines. Our code is available at: https://github.com/anujinho/code.
Chinese: 提出的受控去噪(CoDe)方法通过在推理过程中采用无需梯度的分块采样策略,使扩散模型能够与下游奖励对齐,无需模型微调或可微分指导,同时保持有竞争力的性能。
English: The proposed controlled denoising (CoDe) method enables alignment of diffusion models with downstream rewards through a gradient-free, blockwise sampling approach during inference, eliminating the need for model finetuning or differentiable guidance while maintaining competitive performance.
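A minimal sketch of the blockwise, gradient-free idea described above: at each block of denoising steps, several candidate continuations are drawn and the one with the highest reward is kept. The `denoise_block` and `reward` functions are hypothetical stand-ins, not CoDe's actual interfaces.

```python
# Schematic sketch of blockwise, gradient-free reward guidance in the spirit of CoDe.
# `denoise_block` and `reward` are hypothetical placeholders, not the paper's API.
import torch

def denoise_block(x, t_start, block_size):
    # Placeholder for running `block_size` denoising steps starting at timestep t_start.
    return x + 0.1 * torch.randn_like(x)

def reward(x):
    # Placeholder downstream reward (e.g., an aesthetic or CLIP score).
    return -x.pow(2).mean()

def code_sample(x_T, num_steps=50, block_size=5, num_candidates=4):
    x = x_T
    for t in range(num_steps, 0, -block_size):
        # Draw several candidate continuations for this block and keep the best one.
        candidates = [denoise_block(x.clone(), t, block_size) for _ in range(num_candidates)]
        scores = torch.stack([reward(c) for c in candidates])
        x = candidates[int(scores.argmax())]
    return x

sample = code_sample(torch.randn(1, 4, 8, 8))
print(sample.shape)
```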

Authors:Ye Mao, Weixun Luo, Junpeng Jing, Anlan Qiu, Krystian Mikolajczyk
Title: Hypo3D: Exploring Hypothetical Reasoning in 3D
Abstract:
The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers. Project website: https://matchlab-imperial.github.io/Hypo3D/

Authors:Can Jin, Ying Li, Mingyu Zhao, Shiyu Zhao, Zhenting Wang, Xiaoxiao He, Ligong Han, Tong Che, Dimitris N. Metaxas
Title: LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation
Abstract:
Visual prompting has gained popularity as a method for adapting pre-trained models to specific tasks, particularly in the realm of parameter-efficient tuning. However, existing visual prompting techniques often pad the prompt parameters around the image, limiting the interaction between the visual prompts and the original image to a small set of patches while neglecting the inductive bias present in shared information across different patches. In this study, we conduct a thorough preliminary investigation to identify and address these limitations. We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP), which enables shared and patch-specific information across rows and columns of image pixels. Extensive experiments across seven network architectures and four datasets demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods, achieving up to 6 times faster training times, utilizing 18 times fewer visual prompt parameters, and delivering a 3.1% improvement in performance. The code is available as https://github.com/jincan333/LoR-VP.
Chinese: 本研究提出LoR-VP,一种采用低秩矩阵乘法的新型视觉提示方法,增强了提示与图像间的交互,在效率和性能上均显著优于现有技术。
English: This study introduces LoR-VP, a novel visual prompting method using low-rank matrix multiplication to enhance interaction between prompts and images, achieving superior efficiency and performance improvements over existing techniques.
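The low-rank prompt can be pictured as a full-size additive map factored into row and column components. The sketch below is one plausible parameterization, assuming per-channel factors and a simple additive application to the input; the paper's exact design (initialization, where the prompt is applied) may differ.

```python
# Minimal sketch of a low-rank visual prompt added to the input image,
# following the idea of LoR-VP; the exact parameterization in the paper may differ.
import torch
import torch.nn as nn

class LowRankVisualPrompt(nn.Module):
    def __init__(self, height=224, width=224, channels=3, rank=4):
        super().__init__()
        # Row and column factors whose product forms a full-size prompt map per channel.
        self.A = nn.Parameter(torch.randn(channels, height, rank) * 0.01)
        self.B = nn.Parameter(torch.randn(channels, rank, width) * 0.01)

    def forward(self, x):
        prompt = torch.bmm(self.A, self.B)   # (C, H, W), rank-limited map
        return x + prompt.unsqueeze(0)       # broadcast over the batch

prompted = LowRankVisualPrompt()(torch.randn(2, 3, 224, 224))
print(prompted.shape)  # torch.Size([2, 3, 224, 224])
```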

Authors:Alireza Morsali, MohammadJavad Vaez, Mohammadhossein Soltani, Amirhossein Kazerouni, Babak Taati, Morteza Mohammad-Noori
Title: STAF: Sinusoidal Trainable Activation Functions for Implicit Neural Representation
Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful framework for modeling continuous signals. The spectral bias of ReLU-based networks is a well-established limitation, restricting their ability to capture fine-grained details in target signals. While previous works have attempted to mitigate this issue through frequency-based encodings or architectural modifications, these approaches often introduce additional complexity and do not fully address the underlying challenge of learning high-frequency components efficiently. We introduce Sinusoidal Trainable Activation Functions (STAF), designed to directly tackle this limitation by enabling networks to adaptively learn and represent complex signals with higher precision and efficiency. STAF inherently modulates its frequency components, allowing for self-adaptive spectral learning. This capability significantly improves convergence speed and expressivity, making STAF highly effective for both signal representations and inverse problems. Through extensive evaluations across a range of tasks, including signal representation (shape, image, audio) and inverse problems (super-resolution, denoising), as well as neural radiance fields (NeRF), we demonstrate that STAF consistently outperforms state-of-the-art methods in accuracy and reconstruction fidelity. These results establish STAF as a robust solution to spectral bias and the capacity--convergence tradeoff, with broad applicability in computer vision and graphics. Our codebase is publicly accessible at https://github.com/AlirezaMorsali/STAF.
中文: STAF通过引入正弦可训练激活函数,有效克服了ReLU网络的频谱偏差,实现了高频分量的自适应学习,在多种信号表示和逆问题应用中均展现出卓越的性能。
English: STAF introduces sinusoidal trainable activation functions to overcome the spectral bias of ReLU networks, enabling adaptive learning of high-frequency components and achieving superior performance in signal representation and inverse problems across various applications.
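One plausible form of a trainable sinusoidal activation is a sum of sines with learnable amplitudes, frequencies, and phases, as sketched below. The module name and the specific parameterization are assumptions for illustration; STAF's exact formulation may differ.

```python
# Sketch of a trainable sinusoidal activation as a sum of learnable sine terms;
# the precise STAF parameterization in the paper may differ.
import torch
import torch.nn as nn

class SinusoidalActivation(nn.Module):
    def __init__(self, num_terms=8):
        super().__init__()
        self.amplitude = nn.Parameter(torch.ones(num_terms) / num_terms)
        self.frequency = nn.Parameter(torch.arange(1, num_terms + 1, dtype=torch.float32))
        self.phase = nn.Parameter(torch.zeros(num_terms))

    def forward(self, x):
        # Broadcast x against each (amplitude, frequency, phase) triple and sum the terms.
        x = x.unsqueeze(-1)
        return (self.amplitude * torch.sin(self.frequency * x + self.phase)).sum(dim=-1)

act = SinusoidalActivation()
print(act(torch.linspace(-1, 1, 5)).shape)  # torch.Size([5])
```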

Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Title: BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts
Abstract:
Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy of these techniques crucially depend on the criteria used to make exit decisions. We propose a new decision criterion, BEEM, in which exit classifiers are treated as experts and their confidence scores are aggregated. The confidence scores are aggregated only if neighbouring experts are consistent in their predictions as the samples pass through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits, aiming to surpass the performance of conventional DNN inference. Experimental results on the COCO dataset for image captioning and GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving speed-ups by a factor of 1.5x to 2.1x. Compared to the final layer, accuracy is comparable on the harder image captioning task and improves on the easier language tasks. The source code for this work is publicly available at https://github.com/Div290/BEEM1/tree/main
中文: BEEM提出通过聚合相邻分类器在预测一致时的置信度作为早退新标准,在图像描述和语言任务中实现1.5-2.1倍加速,同时保持或提升模型精度。
English: BEEM introduces a novel early exit criterion that aggregates confidence scores from neighboring classifiers when predictions are consistent, achieving 1.5x-2.1x speed-up while maintaining comparable or improved accuracy across vision and language tasks.
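A toy sketch of the exit rule described above: confidence is accumulated only while consecutive exits agree on the predicted class, and the sample exits once the aggregate crosses a per-exit threshold. The function name and the thresholds here are illustrative; in BEEM the thresholds are set from the error rates of the intermediate exits.

```python
# Toy sketch of BEEM-style exit aggregation: confidences are accumulated only while
# consecutive exits agree on the predicted class, and the sample exits once the
# aggregate crosses a per-exit threshold. Thresholds here are illustrative.
def beem_exit(exit_outputs, thresholds):
    """exit_outputs: list of (predicted_class, confidence) per intermediate exit."""
    agg_conf, prev_class = 0.0, None
    for i, (cls, conf) in enumerate(exit_outputs):
        # Reset the aggregate whenever the current expert disagrees with the previous one.
        agg_conf = agg_conf + conf if cls == prev_class else conf
        prev_class = cls
        if agg_conf >= thresholds[i]:
            return i, cls          # exit early at layer i
    return len(exit_outputs) - 1, prev_class  # fall through to the final exit

layer, cls = beem_exit([(3, 0.55), (3, 0.60), (3, 0.58)], thresholds=[1.5, 1.1, 0.9])
print(layer, cls)  # exits at layer 1 with class 3
```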

Authors:Kosuke Sakurai, Ryotaro Shimizu, Masayuki Goto
Title: Vision and Language Reference Prompt into SAM for Few-shot Segmentation
Abstract:
Segment Anything Model (SAM) represents a large-scale segmentation model that enables powerful zero-shot capabilities with flexible prompts. While SAM can segment any object in zero-shot, it requires user-provided prompts for each target image and does not attach any label information to masks. Few-shot segmentation models addressed these issues by inputting annotated reference images as prompts to SAM and can segment specific objects in target images without user-provided prompts. Previous SAM-based few-shot segmentation models only use annotated reference images as prompts, resulting in limited accuracy due to a lack of reference information. In this paper, we propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM (VLP-SAM), that utilizes the visual information of the reference images and the semantic information of the text labels by inputting not only images but also language as reference information. In particular, VLP-SAM is a simple and scalable structure with minimal learnable parameters, which inputs prompt embeddings with vision-language information into SAM using a multimodal vision-language model. To demonstrate the effectiveness of VLP-SAM, we conducted experiments on the PASCAL-5i and COCO-20i datasets, and achieved high performance in the few-shot segmentation task, outperforming the previous state-of-the-art model by a large margin (6.3% and 9.5% in mIoU, respectively). Furthermore, VLP-SAM demonstrates its generality in unseen objects that are not included in the training data. Our code is available at https://github.com/kosukesakurai1/VLP-SAM.
中文: 本文提出VLP-SAM这一新型小样本分割模型,通过将视觉和语言提示信息共同输入到Segment Anything模型中,显著提升了分割精度,在基准数据集上大幅超越了现有最优方法。
English: The paper introduces VLP-SAM, a novel few-shot segmentation model that enhances accuracy by integrating both visual and language prompts into the Segment Anything Model, achieving state-of-the-art performance on benchmark datasets.

Authors:Yunuo Chen, Qian Li, Bing He, Donghui Feng, Ronghua Wu, Qi Wang, Li Song, Guo Lu, Wenjun Zhang
Title: S2CFormer: Revisiting the RD-Latency Trade-off in Transformer-based Learned Image Compression
Abstract:
Transformer-based Learned Image Compression (LIC) suffers from a suboptimal trade-off between decoding latency and rate-distortion (R-D) performance. Moreover, the critical role of the FeedForward Network (FFN)-based channel aggregation module has been largely overlooked. Our research reveals that efficient channel aggregation, rather than complex and time-consuming spatial operations, is the key to achieving competitive LIC models. Based on this insight, we initiate the "S2CFormer" paradigm, a general architecture that simplifies spatial operations and enhances channel operations to overcome the previous trade-off. We present two instances of the S2CFormer: S2C-Conv and S2C-Attention. Both models demonstrate state-of-the-art (SOTA) R-D performance and significantly faster decoding speed. Furthermore, we introduce S2C-Hybrid, an enhanced variant that maximizes the strengths of different S2CFormer instances to achieve a better performance-latency trade-off. This model outperforms all the existing methods on the Kodak, Tecnick, and CLIC Professional Validation datasets, setting a new benchmark for efficient and high-performance LIC. The code is at https://github.com/YunuoChen/S2CFormer.
Chinese: S2CFormer架构通过强化通道操作和简化空间处理,突破了学习型图像压缩中解码延迟与率失真性能的权衡,实现了最先进的压缩效果。
English: The S2CFormer architecture enhances channel operations and simplifies spatial processes to overcome the trade-off between decoding latency and rate-distortion performance, achieving state-of-the-art results in learned image compression.

Authors:Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
Title: TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion
Abstract:
Chronic liver disease represents a significant health challenge worldwide, and accurate prognostic evaluations are essential for personalized treatment plans. Recent evidence suggests that integrating multimodal data, such as computed tomography imaging, radiomic features, and clinical information, can provide more comprehensive prognostic information. However, modalities have an inherent heterogeneity, and incorporating additional modalities may exacerbate the challenges of heterogeneous data fusion. Moreover, existing multimodal fusion methods often struggle to adapt to richer medical modalities, making it difficult to capture inter-modal relationships. To overcome these limitations, we present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet). Specifically, we develop an Intra-Modality Aggregation module and a Triple-Modal Cross-Attention Fusion module, which are designed to eliminate intra-modality redundancy and extract cross-modal information, respectively. Furthermore, we design a Triple-Modal Feature Fusion loss function to align feature representations across modalities. Extensive experiments on the liver prognosis dataset demonstrate that our approach significantly outperforms existing state-of-the-art unimodal models and other multi-modal techniques. Our code is available at https://github.com/Mysterwll/liver.git.
中文:三重模态交互慢性肝网络(TMI-CLNet)通过专门设计的模块有效整合多模态数据,显著提升了慢性肝病预后评估的准确性,优于现有技术。
English: The Triple-Modal Interaction Chronic Liver Network (TMI-CLNet) effectively integrates multimodal data through specialized modules to enhance prognostic accuracy for chronic liver disease, outperforming existing methods.

Authors:Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
Title: Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective
Abstract:
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code is available at https://github.com/tvseg/dMoE.
Chinese: 本研究提出了分布感知专家混合模型(dMoE),通过整合人口统计学和临床因素来减轻医学图像分割中的偏差,在多个数据集上实现最优性能,并将控制理论与公平性学习相结合。
English: The study introduces Distribution-aware Mixture of Experts (dMoE), a method that integrates demographic and clinical factors to mitigate biases in medical image segmentation, achieving top performance across multiple datasets and bridging control theory with fairness learning.

Authors:Saarthak Kapse, Robin Betz, Srinivasan Sivanandan
Title: Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
Abstract:
State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from L sequential steps to log(L) parallel steps with respect to the number of input tokens (L). In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2x reduction in the number of parallel steps in the SSM block. Our model offers up to 72.5% speedup in inference speed compared to baseline Vision Mamba models on high-resolution (2048x2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection. Code is made available at https://github.com/insitro/FastVim
中文:FastVim通过跨Mamba块交替池化图像维度的标记,将SSM模块的并行步骤减半,在保持各类视觉任务顶尖性能的同时,实现高分辨率图像推理速度最高提升72.5%。
English: FastVim enhances Vision Mamba models by reducing recurrent steps through token pooling, achieving up to 72.5% faster inference on high-resolution images while maintaining top performance across vision tasks.
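A rough sketch of the alternating pooling idea, under stated assumptions: `mamba_block` is a placeholder for the SSM block, and the broadcast of the pooled sequence back onto the token grid is an illustrative fusion step, not the paper's mechanism. The point it shows is that pooling along one image dimension shortens the sequence the SSM scans.

```python
# Sketch of alternating spatial pooling of the token grid between Mamba blocks,
# in the spirit of FastVim; `mamba_block` is a hypothetical stand-in for the SSM block.
import torch

def mamba_block(tokens):
    # Placeholder SSM block operating on a (B, L, C) sequence.
    return tokens

def fastvim_stage(x, depth=4):
    """x: (B, H, W, C) token grid."""
    for i in range(depth):
        if i % 2 == 0:
            pooled = x.mean(dim=1)                    # pool along H -> (B, W, C)
        else:
            pooled = x.mean(dim=2)                    # pool along W -> (B, H, C)
        # The SSM now scans W (or H) tokens instead of H*W, shortening the parallel scan.
        pooled = mamba_block(pooled)
        # Broadcast the processed sequence back onto the grid (illustrative fusion only).
        x = x + (pooled.unsqueeze(1) if i % 2 == 0 else pooled.unsqueeze(2))
    return x

print(fastvim_stage(torch.randn(1, 14, 14, 64)).shape)  # torch.Size([1, 14, 14, 64])
```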

Authors:Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
Title: Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions
Abstract:
Emerging research has highlighted that artificial intelligence-based multimodal fusion of digital pathology and transcriptomic features can improve cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction. However, such direct fusion for joint decision is impractical in real clinical settings, where histopathology is still the gold standard for diagnosis and transcriptomic tests are rarely requested, at least in the public healthcare system. With our novel diffusion-based crossmodal generative AI model PathGen, we show that genomic expressions synthesized from digital histopathology jointly predict cancer grading and patient survival risk with high accuracy (state-of-the-art performance), certainty (through conformal coverage guarantee) and interpretability (through distributed attention maps). PathGen code is available for open use by the research community through GitHub at https://github.com/Samiran-Dey/PathGen.
中文摘要:PathGen模型通过基于扩散的生成式人工智能,从数字病理学中合成基因组表达,无需实际转录组测试即可实现对癌症分级和患者生存风险的高精度、可解释预测。
English Summary: The PathGen model uses diffusion-based generative AI to synthesize genomic expressions from digital pathology, enabling accurate and interpretable predictions of cancer grading and patient survival risk without requiring actual transcriptomic tests.

Authors:Renhao Lu
Title: Complex Wavelet Mutual Information Loss: A Multi-Scale Loss Function for Semantic Segmentation
Abstract:
Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel-wise, while regional and boundary-focused loss functions often incur high computational costs or are restricted to small-scale regions. To address this limitation, we propose the complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. The complex steerable pyramid captures features across multiple orientations and preserves structural similarity across scales. Meanwhile, mutual information is well-suited to capturing high-dimensional directional features and offers greater noise robustness. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel-wise accuracy and topological metrics compared to state-of-the-art methods, while introducing minimal computational overhead. Our code is available at https://github.com/lurenhaothu/CWMI
中文: 提出的复杂小波互信息损失函数通过复杂可控金字塔分解和互信息利用多尺度结构特征,相比现有方法以最小计算开销实现了更优的分割精度和拓扑度量。
English: The proposed complex wavelet mutual information (CWMI) loss function leverages multiscale structural features through complex steerable pyramid decomposition and mutual information, achieving superior segmentation accuracy and topological metrics with minimal computational overhead compared to existing methods.

Authors:Zaitian Wang, Jian He, Yu Liang, Xiyuan Hu, Tianhao Peng, Kaixin Wang, Jiakai Wang, Chenlong Zhang, Weili Zhang, Shuang Niu, Xiaoyang Xie
Title: Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition
Abstract:
Emotions play a crucial role in human behavior and decision-making, making emotion recognition a key area of interest in human-computer interaction (HCI). This study addresses the challenges of emotion recognition by integrating facial expression analysis with electroencephalogram (EEG) signals, introducing a novel multimodal framework, Milmer. The proposed framework employs a transformer-based fusion approach to effectively integrate visual and physiological modalities. It consists of an EEG preprocessing module, a facial feature extraction and balancing module, and a cross-modal fusion module. To enhance visual feature extraction, we fine-tune a pre-trained Swin Transformer on emotion-related datasets. Additionally, a cross-attention mechanism is introduced to balance token representation across modalities, ensuring effective feature integration. A key innovation of this work is the adoption of a multiple instance learning (MIL) approach, which extracts meaningful information from multiple facial expression images over time, capturing critical temporal dynamics often overlooked in previous studies. Extensive experiments conducted on the DEAP dataset demonstrate the superiority of the proposed framework, achieving a classification accuracy of 96.72% in the four-class emotion recognition task. Ablation studies further validate the contributions of each module, highlighting the significance of advanced feature extraction and fusion strategies in enhancing emotion recognition performance. Our code is available at https://github.com/liangyubuaa/Milmer.
中文摘要:本研究提出名为Milmer的新型多模态框架,通过融合面部表情与脑电信号,采用基于Transformer的融合方法和多示例学习,在DEAP数据集上实现了96.72%的情感识别准确率。
English Summary: This study introduces Milmer, a novel multimodal framework that integrates facial expressions and EEG signals using transformer-based fusion and multiple instance learning to achieve 96.72% emotion recognition accuracy on the DEAP dataset.

Authors:Mohammad Nazeri, Anuj Pokhrel, Alexandyr Card, Aniket Datar, Garrett Warnell, Xuesu Xiao
Title: VertiFormer: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility
Abstract:
Sophisticated learning architectures, e.g., Transformers, present a unique opportunity for robots to understand complex vehicle-terrain kinodynamic interactions for off-road mobility. While internet-scale data are available for Natural Language Processing (NLP) and Computer Vision (CV) tasks to train Transformers, real-world mobility data are difficult to acquire with physical robots navigating off-road terrain. Furthermore, training techniques specifically designed to process text and image data in NLP and CV may not apply to robot mobility. In this paper, we propose VertiFormer, a novel data-efficient multi-task Transformer model trained with only one hour of data to address such challenges of applying Transformer architectures for robot mobility on extremely rugged, vertically challenging, off-road terrain. Specifically, VertiFormer employs a new learnable masked modeling and next token prediction paradigm to predict the next pose, action, and terrain patch to enable a variety of off-road mobility tasks simultaneously, e.g., forward and inverse kinodynamics modeling. The non-autoregressive design mitigates computational bottlenecks and error propagation associated with autoregressive models. VertiFormer's unified modality representation also enhances learning of diverse temporal mappings and state representations, which, combined with multiple objective functions, further improves model generalization. Our experiments offer insights into effectively utilizing Transformers for off-road robot mobility with limited data and demonstrate our efficiently trained Transformer can facilitate multiple off-road mobility tasks onboard a physical mobile robot.
Chinese: VertiFormer是一种数据高效的多任务Transformer模型,仅用一小时数据即可预测机器人姿态、动作和地形区块,以解决越野移动中的数据稀缺和计算瓶颈问题。
English: VertiFormer is a data-efficient multi-task Transformer model that uses only one hour of data to predict robot poses, actions, and terrain patches for off-road mobility tasks, overcoming the limitations of data scarcity and computational bottlenecks.

Authors:David Oro, Carles Fernández, Xavier Martorell, Javier Hernando
Title: Work-Efficient Parallel Non-Maximum Suppression Kernels
Abstract:
In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our parallel NMS algorithm is capable of clustering 1024 simultaneous detected objects per frame in roughly 1 ms on both NVIDIA Tegra X1 and NVIDIA Tegra X2 on-die GPUs, while taking 2 ms on NVIDIA Tegra K1. Furthermore, our proposed parallel greedy NMS algorithm yields a 14x-40x speed up when compared to state-of-the-art NMS methods that require learning a CNN from annotated data.
中文: 本文提出了一种专为嵌入式GPU架构设计的高度可扩展并行非极大值抑制算法,在毫秒级时间内高效处理每帧多达1024个检测目标,相比现有方法实现了14至40倍的显著加速。
English: This paper introduces a highly scalable parallel Non-Maximum Suppression (NMS) algorithm designed for embedded GPU architectures, achieving significant speed improvements of 14x to 40x over existing methods while efficiently clustering up to 1024 detections per frame in milliseconds.
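For reference, the sequential greedy NMS that the proposed kernels parallelize can be written in a few lines. The sketch below only illustrates the suppression criterion shared by greedy NMS and its variants, not the work-efficient GPU mapping described in the paper.

```python
# Reference (sequential) greedy NMS; the paper's GPU kernels parallelize this procedure.
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Suppress all remaining boxes that overlap the current best too much.
        order = order[1:][iou(boxes[i], boxes[order[1:]]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
print(greedy_nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]
```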

Authors:David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
Title: Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions
Abstract:
Visual speech recognition remains an open research problem in which, without the auditory signal, challenges such as visual ambiguities, inter-personal variability among speakers, and the complex modeling of silence must be addressed. Nonetheless, remarkable results have recently been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. Moreover, languages other than English are now a focus of interest. This paper presents noticeable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve the best performance obtained to date for both databases. In addition, a thorough ablation study is carried out, examining how the different components of the architecture influence the quality of speech recognition. Then, a rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. Code and trained models are available at https://github.com/david-gimeno/evaluating-end2end-spanish-lipreading.
中文摘要:本文提出了一种基于混合CTC/注意力架构的西班牙语端到端唇读系统,通过全面实验在两个语料库上取得最优性能,并建立了新的西班牙语唇读基准。
English Summary: This paper presents an end-to-end lipreading system for Spanish using a hybrid CTC/Attention architecture, achieving state-of-the-art results on two corpora through comprehensive experiments and establishing a new Spanish lipreading benchmark.

Authors:Kihwan Ryoo, Hyungtae Lim, Hyun Myung
Title: MambaGlue: Fast and Robust Local Feature Matching With Mamba
Abstract:
In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To address this, we propose a novel Mamba-based local feature matching approach, called MambaGlue, where Mamba is an emerging state-of-the-art architecture rapidly gaining recognition for its superior speed in both training and inference, and promising performance compared with Transformer architectures. In particular, we propose two modules: a) MambaAttention mixer to simultaneously and selectively understand the local and global context through the Mamba-based self-attention structure and b) deep confidence score regressor, which is a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, our MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, we demonstrate that our MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available on https://github.com/url-kaist/MambaGlue
中文摘要:本文提出MambaGlue,一种基于Mamba架构的新型局部特征匹配方法,通过MambaAttention混合器实现局部与全局上下文理解,结合深度置信度回归器评估匹配精度,在保持快速推理的同时显著提升了匹配鲁棒性。
English Summary: This paper introduces MambaGlue, a novel local feature matching method leveraging the Mamba architecture to achieve robust performance and fast inference by incorporating a MambaAttention mixer for contextual understanding and a deep confidence score regressor for matching accuracy.

Authors:Chuc Man Duc, Hiromichi Fukui
Title: SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models
Abstract:
Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. In the Earth science and remote sensing communities, there is growing interest in transforming the use of Earth observation data, including satellite and aerial imagery, through foundation models. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images, and have demonstrated superior performance on various downstream tasks compared to traditional supervised models. These models are evolving rapidly, with capabilities to handle multispectral, multitemporal, and multisensor data. Most studies use masked autoencoders in combination with Vision Transformers (ViTs) as the backbone for pretraining. While the models showed promising performance, ViTs face challenges, such as quadratic computational scaling with input length, which may limit performance on multiband and multitemporal data with long sequences. This research aims to address these challenges by proposing SatMamba, a new pretraining framework that combines masked autoencoders with a State Space Model, offering linear computational scaling. Experiments on high-resolution imagery across various downstream tasks show promising results, paving the way for more efficient foundation models and unlocking the full potential of Earth observation data. The source code is available at https://github.com/mdchuc/HRSFM.
中文: 地球科学中的基础模型正快速发展以更高效处理复杂遥感数据,SatMamba通过结合状态空间模型提出线性计算扩展的新方法,在下游任务中展现出优越性能。
English: Foundation models in Earth science are evolving to process complex remote sensing data more efficiently, with SatMamba introducing a novel approach using State Space Models for linear computational scaling, enhancing performance on downstream tasks.

Authors:Xinle Cheng, Zhuoming Chen, Zhihao Jia
Title: CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models
Abstract:
Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge. By employing noise relative magnitude, we identify significant token changes across denoising iterations. Additionally, we enhance token selection by incorporating spatial clustering and ensuring distributional balance. Our experiments reveal a 50%-60% reduction in computational costs while preserving the performance of the model, thereby markedly increasing the efficiency of diffusion models. The code is available at https://github.com/ada-cheng/CAT-Pruning
中文摘要:本文提出了一种结合令牌级剪枝与缓存的新颖加速方法,将扩散模型的计算成本降低50%-60%,同时保持其性能表现。
English Summary: This paper introduces a novel acceleration method combining token-level pruning and caching to reduce diffusion models' computational costs by 50%-60% while maintaining performance.
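A toy illustration of the token-selection signal described above: the relative change of the predicted noise between consecutive denoising steps, with only the top-ranked tokens recomputed. The function name and `keep_ratio` are illustrative assumptions; the spatial clustering and distributional balancing steps from the paper are omitted.

```python
# Toy illustration of selecting tokens by relative noise magnitude between denoising
# steps, the core signal behind CAT Pruning (clustering and balancing are omitted).
import torch

def select_active_tokens(noise_prev, noise_curr, keep_ratio=0.4):
    """noise_*: (num_tokens, dim) predicted noise at consecutive steps."""
    # Relative change of the predicted noise per token.
    delta = (noise_curr - noise_prev).norm(dim=-1) / (noise_prev.norm(dim=-1) + 1e-6)
    k = max(1, int(keep_ratio * noise_curr.shape[0]))
    active = delta.topk(k).indices          # recompute only these tokens; cache the rest
    return active

idx = select_active_tokens(torch.randn(256, 64), torch.randn(256, 64))
print(idx.shape)  # torch.Size([102])
```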

Authors:JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan
Title: MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization
Abstract:
Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. Our MQuant effectively bridges the gap for efficient and accurate MLLM inference on resource-constrained devices. Code has been released at https://github.com/StiphyJay/MQuant.
中文:MQuant是一种后训练量化框架,专门解决多模态大语言模型的独特挑战,在精度损失低于1%的同时将推理延迟降低高达30%,实现在资源受限设备上的高效部署。
English: MQuant is a post-training quantization framework that addresses multimodal large language models' unique challenges, achieving near-floating-point accuracy with less than 1% degradation while cutting inference latency by up to 30% for efficient deployment on resource-limited devices.
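Modality-Specific Static Quantization can be pictured as one pre-calibrated static scale per modality instead of a shared or per-token dynamic scale. The sketch below is a simplified illustration: the function names and scale values are assumptions, whereas MQuant derives its scales from offline calibration and combines MSQ with AIFS and RMS.

```python
# Sketch of per-modality static INT8 quantization in the spirit of MSQ.
# Scales here are illustrative; in MQuant they come from offline calibration.
import torch

def quantize_int8(x, scale):
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def msq_quantize(tokens, is_visual, scale_visual=0.08, scale_text=0.02):
    """tokens: (L, C); is_visual: (L,) bool mask marking visual tokens."""
    q = torch.empty_like(tokens, dtype=torch.int8)
    q[is_visual] = quantize_int8(tokens[is_visual], scale_visual)
    q[~is_visual] = quantize_int8(tokens[~is_visual], scale_text)
    return q

tokens = torch.randn(16, 8)
mask = torch.tensor([True] * 10 + [False] * 6)
print(msq_quantize(tokens, mask).dtype)  # torch.int8
```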

Authors:Carolin Teuber, Anwai Archit, Constantin Pape
Title: Parameter Efficient Fine-Tuning of Segment Anything Model for Biomedical Imaging
Abstract:
Segmentation is an important analysis task for biomedical images, enabling the study of individual organelles, cells or organs. Deep learning has massively improved segmentation methods, but challenges remain in generalization to new conditions, requiring costly data annotation. Vision foundation models, such as Segment Anything Model (SAM), address this issue through improved generalization. However, these models still require finetuning on annotated data, although with less annotations, to achieve optimal results for new conditions. As a downside, they require more computational resources. This makes parameter-efficient finetuning (PEFT) relevant. We contribute the first comprehensive study of PEFT for SAM applied to biomedical images. We find that the placement of PEFT layers is more important for efficiency than the type of layer for vision transformers and we provide a recipe for resource-efficient finetuning. Our code is publicly available at https://github.com/computational-cell-analytics/peft-sam.
中文摘要:本研究首次系统分析了参数高效微调在生物医学图像分割中Segment Anything模型的应用,发现对于视觉变换器而言,层的位置策略比层类型更能提升效率,并提供了一种资源优化的微调方案。
English Summary: This study presents the first comprehensive analysis of parameter-efficient fine-tuning (PEFT) for the Segment Anything Model in biomedical image segmentation, revealing that strategic layer placement outweighs layer type for efficiency and offering a resource-optimized fine-tuning method.
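To illustrate the kind of placement choice the study ablates, here is a minimal standard LoRA wrapper applied to a subset of transformer blocks. The wrapper itself is the usual LoRA formulation; which layers to wrap (here, the qkv projections of the last two blocks) is purely illustrative and not the paper's recommended recipe.

```python
# Minimal LoRA wrapper to illustrate controlling PEFT layer placement in a ViT-style
# encoder; the placement below is illustrative, not the paper's recipe.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

# Example: freeze everything, then wrap only the qkv projection of the last two blocks.
blocks = nn.ModuleList([nn.ModuleDict({"qkv": nn.Linear(256, 768)}) for _ in range(12)])
for p in blocks.parameters():
    p.requires_grad = False
for blk in blocks[-2:]:
    blk["qkv"] = LoRALinear(blk["qkv"])
print(sum(p.numel() for p in blocks.parameters() if p.requires_grad))  # 8192 trainable params
```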

Authors:Titus Griebel, Anwai Archit, Constantin Pape
Title: Segment Anything for Histopathology
Abstract:
Nucleus segmentation is an important analysis task in digital pathology. However, methods for automatic segmentation often struggle with new data from a different distribution, requiring users to manually annotate nuclei and retrain data-specific models. Vision foundation models (VFMs), such as the Segment Anything Model (SAM), offer a more robust alternative for automatic and interactive segmentation. Despite their success in natural images, a foundation model for nucleus segmentation in histopathology is still missing. Initial efforts to adapt SAM have shown some success, but did not yet introduce a comprehensive model for diverse segmentation tasks. To close this gap, we introduce PathoSAM, a VFM for nucleus segmentation, based on training SAM on a diverse dataset. Our extensive experiments show that it is the new state-of-the-art model for automatic and interactive nucleus instance segmentation in histopathology. We also demonstrate how it can be adapted for other segmentation tasks, including semantic nucleus segmentation. For this task, we show that it yields results better than popular methods, while not yet beating the state-of-the-art, CellViT. Our models are open-source and compatible with popular tools for data annotation. We also provide scripts for whole-slide image segmentation. Our code and models are publicly available at https://github.com/computational-cell-analytics/patho-sam.
中文:PathoSAM是一种基于多样化数据集训练的新型视觉基础模型,在组织病理学中实现了自动和交互式细胞核实例分割的最新性能,同时也能适应其他分割任务。
English: PathoSAM is a new vision foundation model trained on diverse datasets that achieves state-of-the-art performance in automatic and interactive nucleus instance segmentation in histopathology while also being adaptable to other segmentation tasks.

Authors:Maximilian Leitenstern, Marko Alten, Christian Bolea-Schaser, Dominik Kulmer, Marcel Weinmann, Markus Lienkamp
Title: FlexCloud: Direct, Modular Georeferencing and Drift-Correction of Point Cloud Maps
Abstract:
Current software stacks for real-world applications of autonomous driving leverage map information to ensure reliable localization, path planning, and motion prediction. An important field of research is the generation of point cloud maps, referring to the topic of simultaneous localization and mapping (SLAM). As most recent developments do not include global position data, the resulting point cloud maps suffer from internal distortion and missing georeferencing, preventing their use for map-based localization approaches. Therefore, we propose FlexCloud for an automatic georeferencing of point cloud maps created from SLAM. Our approach is designed to work modularly with different SLAM methods, utilizing only the generated local point cloud map and its odometry. Using the corresponding GNSS positions enables direct georeferencing without additional control points. By leveraging a 3D rubber-sheet transformation, we can correct distortions within the map caused by long-term drift while maintaining its structure. Our approach enables the creation of consistent, globally referenced point cloud maps from data collected by a mobile mapping system (MMS). The source code of our work is available at https://github.com/TUMFTM/FlexCloud.
Chinese: FlexCloud 是一种模块化解决方案,利用GNSS数据自动对SLAM生成的点云地图进行地理配准,并通过三维橡皮筋变换校正畸变,从而为自动驾驶应用提供全局一致的地图。
English: FlexCloud is a modular solution that automatically georeferences SLAM-generated point cloud maps using GNSS data and corrects distortions via 3D rubber-sheet transformation, enabling globally consistent maps for autonomous driving applications.

Authors:Zhichao Sun, Yepeng Liu, Huachao Zhu, Yuliang Gu, Yuda Zou, Zelong Liu, Gui-Song Xia, Bo Du, Yongchao Xu
Title: RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes
Abstract:
Drones have become prevalent robotic platforms with diverse applications, showing significant potential in Embodied Artificial Intelligence (Embodied AI). Referring Expression Comprehension (REC) enables drones to locate objects based on natural language expressions, a crucial capability for Embodied AI. Despite advances in REC for ground-level scenes, aerial views introduce unique challenges including varying viewpoints, occlusions and scale variations. To address this gap, we introduce RefDrone, a REC benchmark for drone scenes. RefDrone reveals three key challenges in REC: 1) multi-scale and small-scale target detection; 2) multi-target and no-target samples; 3) complex environments with rich contextual expressions. To efficiently construct this dataset, we develop RDAgent (referring drone annotation framework with multi-agent system), a semi-automated annotation tool for REC tasks. RDAgent ensures high-quality contextual expressions and reduces annotation cost. Furthermore, we propose Number GroundingDINO (NGDINO), a novel method designed to handle multi-target and no-target cases. NGDINO explicitly learns and utilizes the number of objects referred to in the expression. Comprehensive experiments with state-of-the-art REC methods demonstrate that NGDINO achieves superior performance on both the proposed RefDrone and the existing gRefCOCO datasets. The dataset and code are publicly available at https://github.com/sunzc-sunny/refdrone.
Chinese: RefDrone基准通过引入专门数据集和创新NGDINO方法,解决了无人机视角下指代表达理解的特殊挑战,该方法通过显式数量定位在多个数据集上实现了最优性能。
English: The RefDrone benchmark addresses unique challenges in drone-based Referring Expression Comprehension (REC) by introducing a specialized dataset and the novel NGDINO method, which outperforms existing approaches through explicit number grounding.

Authors:Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, Vladislav Kurenkov
Title: Latent Action Learning Requires Supervision in the Presence of Distractors
Abstract:
Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
中文摘要:潜在动作策略(LAPO)在现实干扰物下表现不佳,但改进的LAOM方法将潜在动作质量提升8倍,且仅需2.5%的真实动作监督即可使下游任务性能提高4.2倍,这挑战了传统训练流程。
English Summary: Latent Action Policies (LAPO) struggle with real-world distractors, but the proposed LAOM modification improves latent action quality by 8x and incorporating minimal ground-truth supervision boosts downstream performance by 4.2x, challenging conventional training pipelines.

Authors:Anh-Kiet Duong, Petra Gomez-Krämer
Title: Scalable Framework for Classifying AI-Generated Content Across Modalities
Abstract:
The rapid growth of generative AI technologies has heightened the importance of effectively distinguishing between human and AI-generated content, as well as classifying outputs from diverse generative models. This paper presents a scalable framework that integrates perceptual hashing, similarity measurement, and pseudo-labeling to address these challenges. Our method enables the incorporation of new generative models without retraining, ensuring adaptability and robustness in dynamic scenarios. Comprehensive evaluations on the Defactify4 dataset demonstrate competitive performance in text and image classification tasks, achieving high accuracy across both distinguishing human and AI-generated content and classifying among generative methods. These results highlight the framework's potential for real-world applications as generative AI continues to evolve. Source codes are publicly available at https://github.com/ffyyytt/defactify4.
中文: 本文提出了一种可扩展框架,结合感知哈希、相似度测量和伪标记技术,无需重新训练即可有效区分人类与AI生成内容并分类不同生成模型的输出,在Defactify4数据集上的评估展现了高准确率。
English: This paper introduces a scalable framework that uses perceptual hashing, similarity measurement, and pseudo-labeling to effectively distinguish human from AI-generated content and classify outputs from various generative models without retraining, demonstrating high accuracy in evaluations on the Defactify4 dataset.
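A toy sketch of the hash-then-match idea described above: compute a perceptual hash for each image and assign the label of the nearest known hash. The simple average hash and the `classify` helper below are stand-ins, assumed for illustration, and are not the framework's actual hash or matching procedure.

```python
# Toy sketch of perceptual hashing plus nearest-neighbor label assignment.
# The average hash below is a simple stand-in for the framework's perceptual hash.
import numpy as np

def average_hash(img, hash_size=8):
    """img: 2-D grayscale array. Returns a flat boolean hash."""
    h, w = img.shape
    # Crude downscale by block averaging to hash_size x hash_size.
    small = img[: h - h % hash_size, : w - w % hash_size]
    small = small.reshape(hash_size, h // hash_size, hash_size, w // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def classify(query, reference_hashes, reference_labels):
    dists = [np.count_nonzero(query != r) for r in reference_hashes]  # Hamming distance
    return reference_labels[int(np.argmin(dists))]

rng = np.random.default_rng(0)
refs = [rng.random((64, 64)) for _ in range(3)]
labels = ["human", "model_a", "model_b"]
hashes = [average_hash(r) for r in refs]
print(classify(average_hash(refs[1] + 0.01), hashes, labels))  # model_a
```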

Authors:Zhixi Cai, Fucai Ke, Simindokht Jahangard, Maria Garcia de la Banda, Reza Haffari, Peter J. Stuckey, Hamid Rezatofighi
Title: NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
Abstract:
Visual Grounding (VG) tasks, such as referring expression detection and segmentation, are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning like human cognition. Recent advances in large language models (LLMs) and vision-language models (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning over language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance compared to recent end-to-end and compositional baselines. The code is available at https://github.com/ControlNet/NAVER.
Chinese: 本文提出NAVER组合式视觉定位方法,通过集成概率逻辑推理和自校正机制来增强鲁棒性和可解释性,在复杂推理任务中实现了最先进的性能。
English: This paper introduces NAVER, a compositional visual grounding method that integrates probabilistic logic reasoning and a self-correcting mechanism to enhance robustness and interpretability, achieving state-of-the-art performance in complex reasoning tasks.

Authors:Kai Liu, Kaicheng Yang, Zheng Chen, Zhiteng Li, Yong Guo, Wenbo Li, Linghe Kong, Yulun Zhang
Title: BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution
Abstract:
While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded by heavy memory and computation requirements. Recent research applies two kinds of methods to compress or accelerate the DM. One is to compress the DM to 1-bit, a.k.a. binarization, alleviating the storage and computation pressure. The other distills the multi-step DM into only one step, significantly speeding up the inference process. Nonetheless, it remains difficult to deploy DMs on resource-limited edge devices. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we propose a sparse matrix branch (SMB) and a low-rank matrix branch (LRMB). Both auxiliary branches pass the full-precision (FP) information but in different ways. SMB absorbs the extreme values and its output is high rank, carrying abundant FP information. The design of LRMB, in contrast, is inspired by LoRA and is initialized with the top-r SVD components, outputting a low-rank representation. The computation and storage overhead of our proposed branches is negligible. Comprehensive comparison experiments show that BiMaCoSR outperforms current state-of-the-art binarization methods and achieves competitive performance compared with the FP one-step model. BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to its FP counterpart. Our code and model are available at https://github.com/Kai-Liu001/BiMaCoSR.
Chinese: BiMaCoSR 结合二值化和一步蒸馏技术,通过引入稀疏矩阵分支和低秩矩阵分支保留全精度信息,实现了扩散模型超分辨率的极致压缩与加速,使其能在资源受限的边缘设备上高效部署。
English: BiMaCoSR combines binarization and one-step distillation to achieve extreme compression and acceleration for diffusion model-based super-resolution, overcoming deployment barriers on resource-limited devices with auxiliary branches that preserve full-precision information.
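A rough numpy sketch of the decomposition described above, under assumptions: a binarized main branch, a sparse branch that absorbs extreme values, and a low-rank branch initialized from the top-r SVD components of the binarization residual. The threshold, rank, and scaling scheme are illustrative choices, not the paper's configuration.
```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))

# Sparse branch: keep only the largest-magnitude entries (e.g., top 1%).
thresh = np.quantile(np.abs(W), 0.99)
S = np.where(np.abs(W) >= thresh, W, 0.0)

# Binarized main branch on the remainder, with a per-matrix scale.
R = W - S
alpha = np.abs(R).mean()
B = alpha * np.sign(R)

# Low-rank branch from the top-r SVD components of what binarization misses.
r = 8
U, s, Vt = np.linalg.svd(R - B, full_matrices=False)
L = (U[:, :r] * s[:r]) @ Vt[:r]

W_hat = B + S + L
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.3f}")
```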

Authors:Jihyeok Kim, Seongwoo Moon, Sungwon Nah, David Hyunchul Shim
Title: MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model
Abstract:
This paper proposes novel methods to enhance the performance of monocular 3D object detection models by leveraging the generalized feature extraction capabilities of a vision foundation model. Unlike traditional CNN-based approaches, which often suffer from inaccurate depth estimation and rely on multi-stage object detection pipelines, this study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation. It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner. Specifically, a hierarchical feature fusion block is introduced to extract richer visual features from the foundation model, further enhancing feature extraction capabilities. Depth estimation accuracy is further improved by incorporating a relative depth estimation model trained on large-scale data and fine-tuning it through transfer learning. Additionally, the use of queries in the transformer's decoder, which consider reference points and the dimensions of 2D bounding boxes, enhances recognition performance. The proposed model outperforms recent state-of-the-art methods, as demonstrated through quantitative and qualitative evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments. Code is available at https://github.com/JihyeokKim/MonoDINO-DETR.
中文: 本文提出了一种新颖的单目3D物体检测方法,采用视觉Transformer基础模型和DETR架构,通过分层特征融合提升了深度估计与检测性能,在基准数据集上取得了最优结果。
English: This paper introduces a novel monocular 3D object detection method using a Vision Transformer foundation model and DETR architecture, which improves depth estimation and detection performance through hierarchical feature fusion and achieves state-of-the-art results on benchmark datasets.

Authors:Binchi Zhang, Zaiyi Zheng, Zhengzhang Chen, Jundong Li
Title: Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion
Abstract:
Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on https://github.com/zhengzaiyi/RotationSymmetry.
中文摘要:本文提出旋转对称性,一种适用于Transformer的连续参数空间对称性,推广了离散的置换对称性,并通过最优匹配算法显著提升了跨语言与视觉任务的模型融合效果。
English Summary: The paper introduces rotation symmetry, a continuous parameter space symmetry for transformers that generalizes discrete permutation symmetry, and proposes an optimal matching algorithm to significantly enhance model fusion across language and vision tasks.
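A small numpy check of the core observation, assuming single-head attention: inserting an orthogonal matrix R into the query and key projections leaves the attention logits unchanged, since R R^T = I. This only verifies the symmetry; the paper's parameter matching algorithm is not reproduced here.
```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 32, 8, 5

X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

# Random orthogonal matrix via QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))

logits = (X @ W_Q) @ (X @ W_K).T
logits_rotated = (X @ (W_Q @ R)) @ (X @ (W_K @ R)).T

print(np.allclose(logits, logits_rotated))  # True: functionally equivalent
```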

Authors:Takumu Fujioka, Gouhei Tanaka
Title: Transformer-Based Vector Font Classification Using Different Font Formats: TrueType versus PostScript
Abstract:
Modern fonts adopt vector-based formats, which ensure scalability without loss of quality. While many deep learning studies on fonts focus on bitmap formats, deep learning for vector fonts remains underexplored. In studies involving deep learning for vector fonts, the choice of font representation has often been made conventionally. However, the font representation format is one of the factors that can influence the computational performance of machine learning models in font-related tasks. Here we show that font representations based on PostScript outlines outperform those based on TrueType outlines in Transformer-based vector font classification. TrueType outlines represent character shapes as sequences of points and their associated flags, whereas PostScript outlines represent them as sequences of commands. In previous research, PostScript outlines have been predominantly used when fonts are treated as part of vector graphics, while TrueType outlines are mainly employed when focusing on fonts alone. Whether to use PostScript or TrueType outlines has been mainly determined by file format specifications and precedent settings in previous studies, rather than performance considerations. To date, few studies have compared which outline format provides better embedding representations. Our findings suggest that information aggregation is crucial in Transformer-based deep learning for vector graphics, as in tokenization in language models and patch division in bitmap-based image recognition models. This insight provides valuable guidance for selecting outline formats in future research on vector graphics.
中文: 现代矢量字体中,基于PostScript轮廓的表示在基于Transformer的分类任务中优于TrueType格式,因其具有更出色的信息聚合能力。
English: Modern vector fonts using PostScript outlines outperform TrueType formats in Transformer-based classification due to superior information aggregation capabilities.
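An illustrative contrast, for a simple square contour, between the two encodings compared in the paper: TrueType-style point/flag sequences versus PostScript-style command sequences. The exact token layout fed to a Transformer is an assumption; this only shows the difference in data structure.
```python
# TrueType-style: a sequence of (x, y, flag) points, flag 1 = on-curve.
truetype_outline = [
    (0, 0, 1), (100, 0, 1), (100, 100, 1), (0, 100, 1),
]

# PostScript/Type-1-style: a sequence of drawing commands with arguments.
postscript_outline = [
    ("moveto", 0, 0),
    ("lineto", 100, 0),
    ("lineto", 100, 100),
    ("lineto", 0, 100),
    ("closepath",),
]

print(len(truetype_outline), "points vs", len(postscript_outline), "commands")
```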

Authors:Abdurrahim Yilmaz, Furkan Yuceyalcin, Ece Gokyayla, Donghee Choi, Ozan Erdem, Ali Anil Demircali, Rahmetullah Varol, Ufuk Gorkem Kirabali, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran
Title: DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
Abstract:
A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image--text pair datasets. We introduce DermaSynth, a dataset comprising 92,020 synthetic image--text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Leveraging a state-of-the-art LLM, Gemini 2.0, we used clinically relevant prompts and a self-instruct method to generate diverse and rich synthetic texts. Metadata from the datasets were incorporated into the input prompts to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate this dataset to support and accelerate AI research in dermatology. Data and code underlying this work are accessible at https://github.com/abdurrahimyilmaz/DermaSynth.
中文: 开发皮肤病学视觉大模型的主要障碍是缺乏大型图像-文本配对数据集,为此我们推出了DermaSynth,这是一个包含92,020对合成图像-文本的数据集,利用先进大语言模型和临床提示生成,旨在支持和加速皮肤病学人工智能研究。
English: The main challenge in developing vision large language models for dermatology is the scarcity of large image-text datasets, which is addressed by the introduction of DermaSynth, a comprehensive synthetic dataset of 92,020 image-text pairs generated using advanced LLMs and clinical prompts to support AI research in the field.

Authors:Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah
Title: ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition
Abstract:
Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both foreground and background bias. We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to the sampled static clip (where all the frames are the same) and aims to make its class probabilities uniform using a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss for regularization against the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% absolute on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 protocol for bias evaluation which provides a shortcut to predict actions and does not provide an accurate measure of the debiasing capability of a model. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches. Project Page: https://joefioresi718.github.io/ALBAR_webpage/

Authors:Rachel Mikulinsky, Morris Alper, Shai Gordin, Enrique Jiménez, Yoram Cohen, Hadar Averbuch-Elor
Title: ProtoSnap: Prototype Alignment for Cuneiform Signs
Abstract:
The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs. Our code, data, and trained models are available at the project page: https://tau-vailab.github.io/ProtoSnap/

Authors:Mateus de Souza Miranda, Ronny Hänsch, Valdivino Alexandre de Santiago Júnior, Thales Sehn Körting, Erison Carlos dos Santos Monteiro
Title: CerraData-4MM: A multimodal benchmark dataset on Cerrado for land use and land cover classification
Abstract:
The Cerrado faces increasing environmental pressures, necessitating accurate land use and land cover (LULC) mapping despite challenges such as class imbalance and visually similar categories. To address this, we present CerraData-4MM, a multimodal dataset combining Sentinel-1 Synthetic Aperture Radar (SAR) and Sentinel-2 MultiSpectral Imagery (MSI) with 10m spatial resolution. The dataset includes two hierarchical classification levels with 7 and 14 classes, respectively, focusing on the diverse Bico do Papagaio ecoregion. We highlight CerraData-4MM's capacity to benchmark advanced semantic segmentation techniques by evaluating a standard U-Net and a more sophisticated Vision Transformer (ViT) model. The ViT achieves superior performance in multimodal scenarios, with the highest macro F1-score of 57.60% and a mean Intersection over Union (mIoU) of 49.05% at the first hierarchical level. Both models struggle with minority classes, particularly at the second hierarchical level, where U-Net's performance drops to an F1-score of 18.16%. Class balancing improves representation for underrepresented classes but reduces overall accuracy, underscoring the trade-off in weighted training. CerraData-4MM offers a challenging benchmark for advancing deep learning models to handle class imbalance and multimodal data fusion. Code, trained models, and data are publicly available at https://github.com/ai4luc/CerraData-4MM.
中文: CerraData-4MM多模态数据集通过整合卫星影像应对塞拉多地区环境测绘挑战,为深度学习模型提供基准测试平台,其中视觉变换器模型表现优于U-Net,但在少数类别识别上仍存在困难。
English: The CerraData-4MM multimodal dataset addresses environmental mapping challenges in the Cerrado by combining satellite imagery to benchmark deep learning models, with Vision Transformers outperforming U-Net despite persistent difficulties with minority classes.

Authors:Bidossessi Emmanuel Agossou, Marius Pedersen, Kiran Raja, Anuja Vats, Pål Anders Floor
Title: Influence of color correction on pathology detection in Capsule Endoscopy
Abstract:
Pathology detection in Wireless Capsule Endoscopy (WCE) using deep learning has been explored in the recent past. However, deep learning models can be influenced by the color quality of the dataset used to train them, impacting detection, segmentation and classification tasks. In this work, we evaluate the impact of color correction on pathology detection using two prominent object detection models: Retinanet and YOLOv5. We first generate two color corrected versions of a popular WCE dataset (i.e., SEE-AI dataset) using two different color correction functions. We then evaluate the performance of the Retinanet and YOLOv5 on the original and color corrected versions of the dataset. The results reveal that color correction makes the models generate larger bounding boxes and larger intersection areas with the ground truth annotations. Furthermore, color correction leads to an increased number of false positives for certain pathologies. However, these effects do not translate into a consistent improvement in performance metrics such as F1-scores, IoU, and AP50. The code is available at https://github.com/agossouema2011/WCE2024. Keywords: Wireless Capsule Endoscopy, Color correction, Retinanet, YOLOv5, Detection
中文摘要:本研究评估了颜色校正对无线胶囊内窥镜病理检测的影响,发现尽管颜色校正会扩大检测框并增加假阳性,但并未持续提升关键性能指标。
English Summary: This study evaluates how color correction affects pathology detection in Wireless Capsule Endoscopy using Retinanet and YOLOv5, finding that while it enlarges bounding boxes and increases false positives, it does not consistently improve key performance metrics.

Authors:Dong-Hee Paek, Seung-Hyun Kong
Title: SpikingRTNH: Spiking Neural Network for 4D Radar Object Detection
Abstract:
Recently, 4D Radar has emerged as a crucial sensor for 3D object detection in autonomous vehicles, offering both stable perception in adverse weather and high-density point clouds for object shape recognition. However, processing such high-density data demands substantial computational resources and energy consumption. We propose SpikingRTNH, the first spiking neural network (SNN) for 3D object detection using 4D Radar data. By replacing conventional ReLU activation functions with leaky integrate-and-fire (LIF) spiking neurons, SpikingRTNH achieves significant energy efficiency gains. Furthermore, inspired by human cognitive processes, we introduce biological top-down inference (BTI), which processes point clouds sequentially from higher to lower densities. This approach effectively utilizes points with lower noise and higher importance for detection. Experiments on the K-Radar dataset demonstrate that SpikingRTNH with BTI significantly reduces energy consumption by 78% while achieving comparable detection performance to its ANN counterpart (51.1% AP 3D, 57.0% AP BEV). These results establish the viability of SNNs for energy-efficient 4D Radar-based object detection in autonomous driving systems. All codes are available at https://github.com/kaist-avelab/k-radar.
Chinese: SpikingRTNH是首个利用4D雷达数据进行3D物体检测的脉冲神经网络,通过仿生自上而下推理机制,在保持同等检测性能的同时实现了78%的能耗降低。
English: SpikingRTNH is the first spiking neural network for 3D object detection using 4D Radar data, achieving 78% energy reduction while maintaining comparable detection performance through biological top-down inference.
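A minimal leaky integrate-and-fire (LIF) neuron sketch, illustrating the activation that replaces ReLU in the abstract; the leak factor, threshold, and number of time steps are assumptions, not the paper's configuration.
```python
import numpy as np

def lif_forward(currents: np.ndarray, leak: float = 0.9, threshold: float = 1.0):
    """currents: [T, N] input per time step; returns binary spikes [T, N]."""
    T, N = currents.shape
    v = np.zeros(N)                    # membrane potentials
    spikes = np.zeros((T, N))
    for t in range(T):
        v = leak * v + currents[t]     # leaky integration
        fired = v >= threshold
        spikes[t] = fired
        v = np.where(fired, 0.0, v)    # hard reset after a spike
    return spikes

rng = np.random.default_rng(0)
out = lif_forward(rng.uniform(0, 0.5, size=(8, 4)))
print("spike rate per neuron:", out.mean(axis=0))
```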

Authors:Liudi Yang, Ruben Mascaro, Ignacio Alzugaray, Sai Manoj Prakhya, Marco Karrer, Ziyuan Liu, Margarita Chli
Title: LiDAR Loop Closure Detection using Semantic Graphs with Graph Attention Networks
Abstract:
In this paper, we propose a novel loop closure detection algorithm that uses graph attention neural networks to encode semantic graphs to perform place recognition and then uses semantic registration to estimate the 6 DoF relative pose constraint. Our place recognition algorithm has two key modules, namely, a semantic graph encoder module and a graph comparison module. The semantic graph encoder employs graph attention networks to efficiently encode spatial, semantic and geometric information from the semantic graph of the input point cloud. We then use a self-attention mechanism in both node-embedding and graph-embedding steps to create distinctive graph vectors. The graph vectors of the current scan and a keyframe scan are then compared in the graph comparison module to identify a possible loop closure. Specifically, employing the difference of the two graph vectors showed a significant improvement in performance, as shown in ablation studies. Lastly, we implemented a semantic registration algorithm that takes in loop closure candidate scans and estimates the relative 6 DoF pose constraint for the LiDAR SLAM system. Extensive evaluation on public datasets shows that our model is more accurate and robust, achieving a 13% improvement in maximum F1 score on the SemanticKITTI dataset, when compared to the baseline semantic graph algorithm. For the benefit of the community, we open-source the complete implementation of our proposed algorithm and custom implementation of semantic registration at https://github.com/crepuscularlight/SemanticLoopClosure
中文: 本文提出了一种新颖的回环检测算法,通过图注意力网络编码语义图进行位置识别,并利用语义配准估计6自由度位姿约束,在SemanticKITTI数据集上实现了F1分数13%的提升。
English: This paper introduces a loop closure detection algorithm using graph attention networks to encode semantic graphs for place recognition and semantic registration to estimate 6 DoF pose constraints, achieving a 13% improvement in F1 score on SemanticKITTI.

Authors:Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Title: Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
Abstract:
The remarkable progress in text-to-video diffusion models enables photorealistic generation, although the contents of the generated video often include unnatural movement or deformation, reverse playback, and motionless scenes. Recently, the alignment problem, where the output of diffusion models is steered using some measure of content quality, has attracted significant attention. Because there is considerable room for improving perceptual quality along the frame direction, we must address which metrics to optimize and how to optimize them in video generation. In this paper, we propose diffusion latent beam search with a lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality while maintaining alignment to prompts requires reward calibration by weighting existing metrics. This is because many previous metrics for quantifying the naturalness of video do not correlate well with evaluations by humans or vision-language models. We demonstrate that our method improves the perceptual quality evaluated on the calibrated reward, VLMs, and human assessment, without any model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling at a much lower computational cost. The experiments highlight that our method is beneficial to many capable generative models, and provide a practical guideline: inference-time compute should be allocated to lookahead steps for reward estimation rather than to search budget or denoising steps.
中文摘要:本文针对文本到视频扩散模型生成视频常出现运动不自然和对齐不佳的问题,提出了一种扩散潜在波束搜索方法,通过校准奖励机制优化视频质量,无需更新模型参数即可显著提升生成效果。
English Summary: Text-to-video diffusion models can now produce realistic videos but often suffer from unnatural motion and poor alignment with prompts, which this paper addresses by introducing a diffusion latent beam search method that optimizes video quality through calibrated rewards without updating model parameters.
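A toy sketch of beam search over diffusion latents with a lookahead reward estimate. The denoise_step update and the distance-based reward are stand-ins for the real sampler and calibrated video-quality reward; beam width, branching factor, and step count are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=16)                 # stands in for "high-reward content"

def denoise_step(x, t, noise_scale=0.3):
    """Toy stochastic update pulling the latent toward the target."""
    return x + 0.2 * (target - x) + noise_scale * rng.normal(size=x.shape)

def lookahead_reward(x, t, k=3):
    """Estimate the final reward by rolling the sampler forward k extra steps."""
    y = x.copy()
    for s in range(k):
        y = denoise_step(y, t - s, noise_scale=0.0)   # deterministic lookahead
    return -np.linalg.norm(y - target)

beam, width, branch, steps = [rng.normal(size=16)], 4, 4, 10
for t in range(steps, 0, -1):
    candidates = [denoise_step(x, t) for x in beam for _ in range(branch)]
    candidates.sort(key=lambda x: lookahead_reward(x, t), reverse=True)
    beam = candidates[:width]                        # keep the best latents

print("best final reward:", lookahead_reward(beam[0], 0))
```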

Authors:Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, Xingyu Zhu, Yanbin Hao
Title: Accelerating Diffusion Transformer via Error-Optimized Cache
Abstract:
Diffusion Transformer (DiT) is a crucial method for content generation. However, it needs a lot of time to sample. Many studies have attempted to use caching to reduce the time consumption of sampling. Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping calculations in the next, but they tend to locate and cache low-error modules without focusing on reducing caching-induced errors, resulting in a sharp decline in generated content quality when increasing caching intensity. To solve this problem, we propose the \textbf{E}rror-\textbf{O}ptimized \textbf{C}ache (\textbf{EOC}). This method introduces three key improvements: \textbf{(1)} Prior knowledge extraction: Extract and process the caching differences; \textbf{(2)} A judgment method for cache optimization: Determine whether certain caching steps need to be optimized; \textbf{(3)} Cache optimization: reduce caching errors. Experiments show that this algorithm significantly reduces the error accumulation caused by caching, especially excessive caching. On the ImageNet dataset, without substantially increasing the computational load, this method improves the FID of the generated images when the rule-based model FORA has a caching level of \textbf{75}\%, \textbf{50}\%, and \textbf{25}\%, and the training-based model Learning-to-cache has a caching level of \textbf{22}\%. Specifically, the FID values change from 30.454 to 21.690 (\textbf{28.8}\%), from 6.857 to 5.821 (\textbf{15.1}\%), from 3.870 to 3.692 (\textbf{4.6}\%), and from 3.539 to 3.451 (\textbf{2.5}\%) respectively. Code is available at https://github.com/qiujx0520/EOC_MM2025.git.
中文: 提出的误差优化缓存(EOC)方法通过引入先验知识提取、缓存优化判断和误差减少技术,有效降低了扩散变换器采样中的误差累积,在高缓存强度下显著提升了生成图像的FID等质量指标。
English: The proposed Error-Optimized Cache (EOC) method effectively reduces error accumulation in Diffusion Transformers during sampling by introducing prior knowledge extraction, cache optimization judgment, and error reduction techniques, significantly improving image quality metrics like FID under high caching intensities.

Authors:Yunfan Lu, Yanlin Qian, Ziyang Rao, Junren Xiao, Liming Chen, Hui Xiong
Title: RGB-Event ISP: The Dataset and Benchmark
Abstract:
Event-guided imaging has received significant attention due to its potential to revolutionize instant imaging systems. However, the prior methods primarily focus on enhancing RGB images in a post-processing manner, neglecting the challenges the image signal processor (ISP) faces in dealing with event sensors and the benefits events provide for reforming the ISP process. To achieve this, we conduct the first research on event-guided ISP. First, we present a new event-RAW paired dataset, collected with a novel but still confidential sensor that records pixel-level aligned events and RAW images. This dataset includes 3373 RAW images with 2248 x 3264 resolution and their corresponding events, spanning 24 scenes with 3 exposure modes and 3 lenses. Second, we propose a conventional ISP pipeline to generate good RGB frames as reference. This conventional ISP pipeline performs basic ISP operations, e.g., demosaicing, white balancing, denoising and color space transformation, with a ColorChecker as reference. Third, we classify the existing learnable ISP methods into 3 classes, and select multiple methods to train and evaluate on our new dataset. Lastly, since there is no prior work for reference, we propose a simple event-guided ISP method and test it on our dataset. We further put forward key technical challenges and future directions in RGB-Event ISP. In summary, to the best of our knowledge, this is the very first research focusing on event-guided ISP, and we hope it will inspire the community. The code and dataset are available at: https://github.com/yunfanLu/RGB-Event-ISP.
中文摘要:本研究开创性地探索事件引导的图像信号处理(ISP),通过构建新型对齐的事件-RAW数据集并提出基准方法,首次解决如何利用事件数据优化ISP流程而非仅用于后处理的问题。
English Summary: This research pioneers event-guided image signal processing (ISP) by introducing a novel aligned event-RAW dataset and proposing a baseline method, addressing the gap in leveraging events for ISP enhancement rather than just post-processing.

Authors:Zhengqin Lai, Xiaopeng Hong, Yabin Wang, Xiaobai Li
Title: A Benchmark for Incremental Micro-expression Recognition
Abstract:
Micro-expression recognition plays a pivotal role in understanding hidden emotions and has applications across various fields. Traditional recognition methods assume access to all training data at once, but real-world scenarios involve continuously evolving data streams. To respond to the requirement of adapting to new data while retaining previously learned knowledge, we introduce the first benchmark specifically designed for incremental micro-expression recognition. Our contributions include: Firstly, we formulate the incremental learning setting tailored for micro-expression recognition. Secondly, we organize sequential datasets with carefully curated learning orders to reflect real-world scenarios. Thirdly, we define two cross-evaluation-based testing protocols, each targeting distinct evaluation objectives. Finally, we provide six baseline methods and their corresponding evaluation results. This benchmark lays the groundwork for advancing incremental micro-expression recognition research. All source code used in this study will be publicly available at https://github.com/ZhengQinLai/IMER-benchmark.
Chinese: 本研究首次提出了增量式微表情识别的基准,通过制定学习协议、整理有序数据集和提供基线方法,解决了适应动态数据流的挑战,为该领域研究奠定了基础。
English: This study introduces the first benchmark for incremental micro-expression recognition, addressing the challenge of adapting to evolving data streams by establishing learning protocols, curated datasets, and baseline methods to advance research in this field.

Authors:Xingyu Miao, Haoran Duan, Yang Bai, Tejal Shah, Jun Song, Yang Long, Rajiv Ranjan, Ling Shao
Title: Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields
Abstract:
In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segmentation of 3D scenes using text. To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to enhance the accuracy of segmentation edges, this work presents a low-rank transient query attention mechanism. To ensure the consistency of segmentation for similar colors under different viewpoints, we convert the segmentation task into a classification task through label volume, which significantly improves the consistency of segmentation in color-similar areas. We also propose a simplified text augmentation strategy to alleviate the issue of ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art technologies in both training speed and performance. Our code is available on: https://github.com/xingy038/Laser.git.
Chinese: 本研究提出一种通过语言引导直接蒸馏密集CLIP特征的方法,利用自交叉训练和低秩注意力机制等策略,在训练速度和性能上均超越了现有先进技术,实现了高效的3D场景精确分割。
English: This work introduces a method that efficiently achieves precise 3D segmentation by directly distilling dense CLIP features with language guidance, incorporating strategies like self-cross-training and a low-rank attention mechanism to enhance performance and speed beyond current state-of-the-art technologies.

Authors:Dahye Kim, Deepti Ghadiyaram
Title: Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations
Abstract:
Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style) -- all during test time. Through extensive experiments, we demonstrate that our approach is very simple, requires no retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim5}$x faster than the current state-of-the-art. Code is available at: https://github.com/kim-dahye/steerers
中文: 本文提出了一种利用k稀疏自编码器的新框架,在推理过程中高效操控扩散模型中的概念,无需重新训练即可精确控制生成内容,同时提升安全性和速度。
English: This paper introduces a novel framework using k-sparse autoencoders to efficiently manipulate concepts in diffusion models during inference, enabling precise control over content generation without retraining while improving safety and speed.
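A compact sketch, under assumptions, of a k-sparse autoencoder and test-time steering: encode an embedding, keep the top-k latents, then attenuate or amplify one latent before decoding. The random weights, the value of k, and the steer helper are illustrative; the paper identifies interpretable concept latents from trained SAEs.
```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_latent, k = 64, 256, 8
W_enc = rng.normal(size=(d_embed, d_latent)) / np.sqrt(d_embed)
W_dec = rng.normal(size=(d_latent, d_embed)) / np.sqrt(d_latent)

def ksae_encode(x):
    z = x @ W_enc
    keep = np.argsort(np.abs(z))[-k:]      # indices of the top-k activations
    sparse = np.zeros_like(z)
    sparse[keep] = z[keep]
    return sparse

def steer(x, concept_idx, scale):
    """scale=0 removes the concept direction; scale>1 amplifies it."""
    z = ksae_encode(x)
    z[concept_idx] *= scale
    return z @ W_dec

x = rng.normal(size=d_embed)                   # a text-embedding stand-in
x_clean = steer(x, concept_idx=17, scale=0.0)  # erase a (hypothetical) concept latent
print(np.linalg.norm(x - x_clean))
```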

Authors:Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, Lianwen Jin
Title: RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs
Abstract:
Current Multimodal Large Language Model (MLLM) architectures face a critical tradeoff between performance and efficiency: decoder-only architectures achieve higher performance but lower efficiency, while cross-attention-based architectures offer greater efficiency but lower performance. The key distinction lies in how visual tokens are processed. Decoder-only architectures apply self-attention and FFN operations on visual tokens, while cross-attention architectures skip these computations. To investigate whether redundancy exists in this computationally expensive process, we propose a training-free framework for analyzing trained MLLMs. It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens, as well as a Layer Ranking Algorithm that prioritizes layers for these reductions. Extensive experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs, offering valuable insights for future MLLM architecture design. Furthermore, by leveraging our reduction framework as a training-free inference acceleration approach, we achieve performance comparable to or better than state-of-the-art methods while remaining compatible with them. Code will be publicly available at https://github.com/L-Hugh/RedundancyLens.
中文: 当前多模态大语言模型存在性能与效率的权衡,本研究提出无需训练的分析框架,发现仅解码器架构中存在显著冗余,并通过推理加速实现了可比甚至更优的性能。
English: Current MLLMs face a performance-efficiency tradeoff, with this study proposing a training-free framework that identifies substantial redundancy in decoder-only architectures and achieves comparable or better performance through inference acceleration.

Authors:Javier Montalvo, Pablo Carballeira, Álvaro García-Martín
Title: SynthmanticLiDAR: A Synthetic Dataset for Semantic Segmentation on LiDAR Imaging
Abstract:
Semantic segmentation on LiDAR imaging is increasingly gaining attention, as it can provide useful knowledge for perception systems and potential for autonomous driving. However, collecting and labeling real LiDAR data is an expensive and time-consuming task. While datasets such as SemanticKITTI have been manually collected and labeled, the introduction of simulation tools such as CARLA, has enabled the creation of synthetic datasets on demand. In this work, we present a modified CARLA simulator designed with LiDAR semantic segmentation in mind, with new classes, more consistent object labeling with their counterparts from real datasets such as SemanticKITTI, and the possibility to adjust the object class distribution. Using this tool, we have generated SynthmanticLiDAR, a synthetic dataset for semantic segmentation on LiDAR imaging, designed to be similar to SemanticKITTI, and we evaluate its contribution to the training process of different semantic segmentation algorithms by using a naive transfer learning approach. Our results show that incorporating SynthmanticLiDAR into the training process improves the overall performance of tested algorithms, proving the usefulness of our dataset, and therefore, our adapted CARLA simulator. The dataset and simulator are available in https://github.com/vpulab/SynthmanticLiDAR.
Chinese: 本研究推出了改进的CARLA模拟器,生成模拟SemanticKITTI的合成数据集SynthmanticLiDAR,通过迁移学习验证了该数据集能有效提升激光雷达语义分割算法的性能。
English: This study introduces a modified CARLA simulator that generates SynthmanticLiDAR, a synthetic dataset designed to mimic SemanticKITTI for LiDAR semantic segmentation, demonstrating improved algorithm performance through transfer learning.

Authors:Bo Lan, Pei Li, Jiaxi Yin, Yunpeng Song, Ge Wang, Han Ding, Jinsong Han, Fei Wang
Title: XRF V2: A Dataset for Action Summarization with Wi-Fi Signals, and IMUs in Phones, Watches, Earbuds, and Glasses
Abstract:
Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization using Wi-Fi and IMU signals in smart-home environments, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and achieves the best performance with an average mAP of 78.74, outperforming the recent WiFiTAD by 5.49 points in mAP@avg while using 35% fewer parameters. For action summarization, we introduce a new metric, Response Meaning Consistency (RMC), to evaluate summarization performance; XRFMamba achieves a mean Response Meaning Consistency (mRMC) of 0.802. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation models pre-training, synthetic data generation, and more. The data and code are available at https://github.com/aiotgroup/XRFV2.
中文: 本文提出XRF V2数据集和XRFMamba神经网络,通过融合Wi-Fi和IMU多模态数据实现室内行为定位与摘要,在取得最优性能的同时引入了新的评估指标。
English: This paper introduces the XRF V2 dataset and XRFMamba neural network for indoor action localization and summarization, achieving state-of-the-art performance with multimodal Wi-Fi and IMU data while proposing a new evaluation metric.

Authors:Yunpeng Qu, Kun Yuan, Jinhua Hao, Kai Zhao, Qizhi Xie, Ming Sun, Chao Zhou
Title: Visual Autoregressive Modeling for Image Super-Resolution
Abstract:
Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off issues between fidelity and realism, as well as computational complexity, have also posed limitations on their application. Building upon the tremendous success of autoregressive models in the language domain, we propose \textbf{VARSR}, a novel visual autoregressive modeling for ISR framework with the form of next-scale prediction. To effectively integrate and preserve semantic information in low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures and the diffusion refiner is utilized for modeling quantization residual loss to achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our codes will be released at https://github.com/qyp2000/VARSR.
中文摘要:VARSR框架提出了一种基于视觉自回归模型的图像超分辨率方法,通过下一尺度预测、前缀标记和尺度对齐位置编码等创新技术,在保证高保真度和真实感的同时,比基于扩散的方法更高效地生成优质图像。
English Summary: The VARSR framework introduces a visual autoregressive model for image super-resolution that uses next-scale prediction and innovative techniques like prefix tokens and scale-aligned positional encodings to efficiently generate high-fidelity, realistic images while outperforming diffusion-based methods in computational efficiency.

Authors:Zhengrui Guo, Qichen Sun, Jiabo Ma, Lishuang Feng, Jinzhuo Wang, Hao Chen
Title: Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images
Abstract:
Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images. While transformer architectures excel at modeling long-range correlations through self-attention, their quadratic computational complexity makes them impractical for computational pathology applications. Existing solutions like local-global or linear self-attention reduce computational costs but compromise the strong modeling capabilities of full self-attention. In this work, we propose Querent, i.e., the query-aware long contextual dynamic modeling framework, which achieves a theoretically bounded approximation of full self-attention while delivering practical efficiency. Our method adaptively predicts which surrounding regions are most relevant for each patch, enabling focused yet unrestricted attention computation only with potentially important contexts. By using efficient region-wise metadata computation and importance estimation, our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations. Through comprehensive experiments on biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across over 10 WSI datasets, our method demonstrates superior performance compared to the state-of-the-art approaches. Codes are available at https://github.com/dddavid4real/Querent.
中文: 提出的Querent框架通过自适应聚焦全切片图像中的相关区域,有效逼近完整自注意力机制,在保持实际计算效率的同时,在多种计算病理学任务中展现出卓越性能。
English: The proposed Querent framework efficiently approximates full self-attention by adaptively focusing on relevant regions in whole slide images, achieving superior performance across multiple computational pathology tasks while maintaining practical computational efficiency.
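A simplified numpy sketch of the query-aware idea, under assumptions: patches are grouped into regions, each region is summarized by its mean feature as lightweight metadata, and each query attends only to patches from its top-k most relevant regions. Region size, k, and the scoring rule are illustrative; the paper's importance estimation and approximation guarantees are more involved.
```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, patches_per_region, d = 16, 32, 64
k = 3

feats = rng.normal(size=(n_regions, patches_per_region, d))
region_meta = feats.mean(axis=1)                       # [n_regions, d] summaries

def query_aware_attention(q_region: int, q_idx: int) -> np.ndarray:
    q = feats[q_region, q_idx]                         # one query patch
    scores = region_meta @ q                           # importance per region
    top = np.argsort(scores)[-k:]                      # top-k relevant regions
    keys = feats[top].reshape(-1, d)                   # attend only within them
    attn = np.exp(keys @ q / np.sqrt(d))
    attn /= attn.sum()
    return attn @ keys                                 # attended output

out = query_aware_attention(q_region=0, q_idx=0)
print(out.shape)  # (64,)
```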

Authors:Yuchen Lin, Chenguo Lin, Jianjin Xu, Yadong Mu
Title: OmniPhysGS: 3D Constitutive Gaussians for General Physics-Based Dynamics Generation
Abstract:
Recently, significant advancements have been made in the reconstruction and generation of 3D assets, including static cases and those with physical interactions. To recover the physical properties of 3D assets, existing methods typically assume that all materials belong to a specific predefined category (e.g., elasticity). However, such assumptions ignore the complex composition of multiple heterogeneous objects in real scenarios and tend to render less physically plausible animation given a wider range of objects. We propose OmniPhysGS for synthesizing a physics-based 3D dynamic scene composed of more general objects. A key design of OmniPhysGS is treating each 3D asset as a collection of constitutive 3D Gaussians. For each Gaussian, its physical material is represented by an ensemble of 12 physical domain-expert sub-models (rubber, metal, honey, water, etc.), which greatly enhances the flexibility of the proposed model. In the implementation, we define a scene by user-specified prompts and supervise the estimation of material weighting factors via a pretrained video diffusion model. Comprehensive experiments demonstrate that OmniPhysGS achieves more general and realistic physical dynamics across a broader spectrum of materials, including elastic, viscoelastic, plastic, and fluid substances, as well as interactions between different materials. Our method surpasses existing methods by approximately 3% to 16% in metrics of visual quality and text alignment.

Authors:Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng
Title: LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Abstract:
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-trained with a large language model by generating image-level detailed captions for each image can further improve performance. To achieve the goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.
中文摘要:LLMDet模型通过与大语言模型协同训练,利用GroundingCap-1M数据集生成图像级描述,显著提升了开放词汇检测性能,并能构建更强大的多模态系统。
English Summary: The LLMDet model enhances open-vocabulary detection by co-training with a large language model using the GroundingCap-1M dataset, achieving superior performance and enabling stronger multimodal applications.

Authors:Anh Bui, Trang Vu, Long Vuong, Trung Le, Paul Montague, Tamas Abraham, Junae Kim, Dinh Phung
Title: Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them
Abstract:
Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. The common principle of previous works to remove a specific concept is to map it to a fixed generic concept, such as a neutral concept or just an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which \emph{dynamically} selects optimal target concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance. Our code is published at {https://github.com/tuananhbui89/Adaptive-Guided-Erasure}.
Chinese: 本文提出自适应引导擦除方法(AGE),通过动态选择最优目标概念来消除扩散模型中的不良概念,有效减少对其他概念的副作用,并在实验中显著优于现有擦除技术。
English: This paper introduces Adaptive Guided Erasure (AGE), a method that dynamically selects optimal target concepts for erasing undesirable ones in diffusion models, minimizing side effects on unrelated concepts and outperforming existing techniques.

Authors:Sunyong Seo, Huisu Yoon, Semin Kim, Jongha Lee
Title: Full-scale Representation Guided Network for Retinal Vessel Segmentation
Abstract:
The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full Scale Guided Network (FSG-Net), where the feature representation network with modernized convolution blocks extracts full-scale information and the guided convolution block refines that information. An attention-guided filter is introduced to the guided convolution block under the interpretation that the filter behaves like an unsharp mask filter. Passing full-scale information to the attention block allows for the generation of improved attention maps, which are then passed to the attention-guided filter, resulting in performance enhancement of the segmentation network. The structure preceding the guided convolution block can be replaced by any U-Net variant, which enhances the scalability of the proposed approach. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments also show that the proposed network demonstrates competitive results compared to current SOTA models on various public datasets. Ablation studies demonstrate that the proposed model is competitive with much smaller parameter sizes. Lastly, by applying the proposed model to facial wrinkle segmentation, we confirmed the potential for scalability to similar tasks in other domains. Our code is available on https://github.com/ZombaSY/FSG-Net-pytorch.
中文: 本研究提出FSG-Net网络,通过利用全尺度信息和注意力引导滤波器来改进视网膜血管分割,在参数更少的情况下实现了与先进模型相媲美的性能,并展现出在面部皱纹分割等其他领域的可扩展性。
English: The study introduces FSG-Net, a novel network that enhances retinal vessel segmentation by utilizing full-scale information and attention-guided filters, achieving competitive performance with fewer parameters and demonstrating scalability to other domains like facial wrinkle segmentation.

Authors:Tongda Xu, Xiyan Cai, Xinjie Zhang, Xingtong Ge, Dailan He, Ming Sun, Jingjing Liu, Ya-Qin Zhang, Jian Li, Yan Wang
Title: Rethinking Diffusion Posterior Sampling: From Conditional Score Estimator to Maximizing a Posterior
Abstract:
Recent advancements in diffusion models have been leveraged to address inverse problems without additional training, and Diffusion Posterior Sampling (DPS) (Chung et al., 2022a) is among the most popular approaches. Previous analyses suggest that DPS accomplishes posterior sampling by approximating the conditional score. In this paper, however, we demonstrate that the conditional score approximation employed by DPS is not as effective as previously assumed, but rather aligns more closely with the principle of maximizing a posterior (MAP). This assertion is substantiated through an examination of DPS on 512x512 ImageNet images, revealing that: 1) DPS's conditional score estimation significantly diverges from the score of a well-trained conditional diffusion model and is even inferior to the unconditional score; 2) The mean of DPS's conditional score estimation deviates significantly from zero, rendering it an invalid score estimation; 3) DPS generates high-quality samples with significantly lower diversity. In light of the above findings, we posit that DPS more closely resembles MAP than a conditional score estimator, and accordingly propose the following enhancements to DPS: 1) we explicitly maximize the posterior through multi-step gradient ascent and projection; 2) we utilize a light-weighted conditional score estimator trained with only 100 images and 8 GPU hours. Extensive experimental results indicate that these proposed improvements significantly enhance DPS's performance. The source code for these improvements is provided in https://github.com/tongdaxu/Rethinking-Diffusion-Posterior-Sampling-From-Conditional-Score-Estimator-to-Maximizing-a-Posterior.
中文: 本研究发现扩散后验采样(DPS)更接近最大化后验而非精确逼近条件得分,并提出通过显式后验最大化与轻量级条件得分估计器的改进方案,显著提升了其性能表现。
English: This study reveals that Diffusion Posterior Sampling (DPS) aligns more closely with maximizing a posterior rather than accurately approximating the conditional score, and proposes enhancements including explicit posterior maximization and a lightweight conditional score estimator to significantly improve its performance.
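A toy sketch of the maximize-a-posterior view for a linear inverse problem y = Ax + noise with a standard Gaussian prior: multi-step gradient ascent on log p(y|x) + log p(x). The dimensions, step size, and Gaussian prior are assumptions; the paper applies the idea inside the diffusion sampler with a learned prior and an additional projection step not shown here.
```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 32, 64, 0.5
A = rng.normal(size=(m, d)) / np.sqrt(d)
x_true = rng.normal(size=d)
y = A @ x_true + sigma * rng.normal(size=m)

x = np.zeros(d)
lr = 0.03
for _ in range(300):
    grad_lik = A.T @ (y - A @ x) / sigma**2   # gradient of log-likelihood
    grad_prior = -x                           # gradient of log N(0, I) prior
    x = x + lr * (grad_lik + grad_prior)      # gradient ascent on the log-posterior

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```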

Authors:Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, Chenliang Xu
Title: GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
Abstract:
Generating full-body human gestures from speech signals remains challenging in terms of quality and speed. Existing approaches model different body regions such as body, legs and hands separately, which fails to capture the spatial interactions between them and results in unnatural and disjointed movements. Additionally, their autoregressive/diffusion-based pipelines show slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow matching baseline, we propose latent shortcut learning and beta distribution time stamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining the spatial-temporal modeling and the improved flow-matching-based framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications. Project Page: https://andypinxinliu.github.io/GestureLSM
中文摘要:GestureLSM采用基于流匹配的时空建模方法,通过显式建模身体区域间的交互和优化潜在速度空间,实现了语音驱动全身手势的高质量生成与快速推理。
English Summary: GestureLSM introduces a flow-matching approach with spatial-temporal modeling to generate synchronized full-body gestures from speech, achieving superior quality and faster inference by capturing inter-region interactions and optimizing latent velocity space.
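A compact sketch of the flow-matching objective and few-step Euler sampling that underlie the approach, under assumptions: x_t = (1 - t) x0 + t x1, with the model regressed toward the velocity x1 - x0. The closed-form velocity_model below stands in for a trained network, and the latent shortcut learning and beta timestamp sampling from the paper are omitted.
```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 8, 256
x1 = rng.normal(size=(batch, dim)) + 3.0      # "data" (e.g., gesture latents)
x0 = rng.normal(size=(batch, dim))            # noise samples

t = rng.random((batch, 1))
x_t = (1 - t) * x0 + t * x1
target_velocity = x1 - x0

def velocity_model(x_t, t):
    """Stand-in predictor; a trained network would output this from (x_t, t)."""
    return np.full_like(x_t, 3.0)             # true mean displacement E[x1 - x0]

loss = np.mean((velocity_model(x_t, t) - target_velocity) ** 2)
print("flow-matching loss:", round(float(loss), 3))

# Few-step Euler sampling along the learned velocity field.
steps = 4
x = rng.normal(size=(1, dim))
for i in range(steps):
    t_i = np.full((1, 1), i / steps)
    x = x + (1.0 / steps) * velocity_model(x, t_i)
print("sample mean (target ~3):", round(float(x.mean()), 2))
```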

Authors:Jaesin Ahn, Heechul Jung
Title: Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models
Abstract:
Text-to-image diffusion models show remarkable generation performance following text prompts, but risk generating Not Safe For Work (NSFW) contents from unsafe prompts. Existing approaches, such as prompt filtering or concept unlearning, fail to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose a novel approach called Distorting Embedding Space (DES), a text encoder-based defense mechanism that effectively tackles these issues through innovative embedding space control. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe contents generation, while reproducing the original safe embeddings. DES also neutralizes the nudity embedding, extracted using prompt ``nudity", by aligning it with neutral embedding to enhance robustness against adversarial attacks. These methods ensure both robust defense and high-quality image generation. Additionally, DES can be adopted in a plug-and-play manner and requires zero inference overhead, facilitating its deployment. Extensive experiments on diverse attack types, including black-box and white-box scenarios, demonstrate DES's state-of-the-art performance in both defense capability and benign image generation quality. Our model is available at https://github.com/aei13/DES.
中文:提出的扭曲嵌入空间(DES)方法通过将不安全文本嵌入转换为安全区域,有效防止生成不良内容,同时保持高质量良性图像输出并增强对抗攻击的防御能力。
English: The proposed Distorting Embedding Space (DES) method effectively prevents NSFW content generation by transforming unsafe text embeddings into safe regions while maintaining high-quality benign image output and robust defense against adversarial attacks.
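To make the embedding-space idea tangible, here is a rough sketch of a DES-style training objective in PyTorch. The encoder interfaces, the precomputed safe anchor, and the equal loss weighting are all assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def des_style_loss(trainable_encoder, frozen_encoder, unsafe_ids, safe_ids, safe_anchor):
    # unsafe_ids / safe_ids: tokenized unsafe and safe prompts, shape (B, L).
    # safe_anchor: a precomputed embedding inside a "safe" region, shape (D,). Both are hypothetical inputs.
    unsafe_emb = trainable_encoder(unsafe_ids)            # embeddings to be distorted
    safe_emb = trainable_encoder(safe_ids)                # embeddings to be preserved
    with torch.no_grad():
        safe_ref = frozen_encoder(safe_ids)               # original (pre-defense) embeddings
    distort = F.mse_loss(unsafe_emb, safe_anchor.expand_as(unsafe_emb))  # push unsafe embeddings toward the safe region
    preserve = F.mse_loss(safe_emb, safe_ref)             # reproduce the original safe embeddings
    return distort + preserve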

Authors:Zhe Wang, Yuhua Ru, Fabian Bauer, Aladine Chetouani, Fang Chen, Liping Zhang, Didier Hans, Rachid Jennane, Mohamed Jarraya, Yung Hsin Chen
Title: Distillation-Driven Diffusion Model for Multi-Scale MRI Super-Resolution: Make 1.5T MRI Great Again
Abstract:
Magnetic Resonance Imaging (MRI) offers critical insights into microstructural details; however, the spatial resolution of standard 1.5T imaging systems is often limited. In contrast, 7T MRI provides significantly enhanced spatial resolution, enabling finer visualization of anatomical structures. Despite this, the high cost and limited availability of 7T MRI hinder its widespread use in clinical settings. To address this challenge, a novel Super-Resolution (SR) model is proposed to generate 7T-like MRI from standard 1.5T MRI scans. Our approach leverages a diffusion-based architecture, incorporating gradient nonlinearity correction and bias field correction data from 7T imaging as guidance. Moreover, to improve deployability, a progressive distillation strategy is introduced. Specifically, the student model progressively refines the 7T SR task, leveraging feature maps from the inference phase of the teacher model as guidance, aiming to reach 7T SR performance with a smaller, deployable model size. Experimental results demonstrate that our baseline teacher model achieves state-of-the-art SR performance. The student model, while lightweight, sacrifices minimal performance. Furthermore, the student model is capable of accepting MRI inputs at varying resolutions without the need for retraining, further enhancing deployment flexibility. The clinical relevance of our proposed method is validated using clinical data from Massachusetts General Hospital. Our code is available at https://github.com/ZWang78/SR.
中文: 本文提出了一种基于扩散模型的新型超分辨率方法,可将标准1.5T磁共振图像增强至类似7T的高分辨率,其轻量化学生模型在保持优异性能的同时,无需重新训练即可适应不同分辨率的输入,显著提升了临床部署灵活性。
English: A novel diffusion-based super-resolution model is proposed to generate high-resolution 7T-like MRI from standard 1.5T scans, achieving state-of-the-art performance with a lightweight student model that maintains flexibility across varying input resolutions without retraining.

Authors:Yansong Qu, Dian Chen, Xinyang Li, Xiaofan Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Title: Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting
Abstract:
Recent advancements in 3D scene editing have been propelled by the rapid development of generative models. Existing methods typically utilize generative models to perform text-guided editing on 3D representations, such as 3D Gaussian Splatting (3DGS). However, these methods are often limited to texture modifications and fail when addressing geometric changes, such as editing a character's head to turn around. Moreover, such methods lack accurate control over the spatial position of editing results, as language struggles to precisely describe the extent of edits. To overcome these limitations, we introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting. It enables users to conveniently specify the desired editing region and the desired dragging direction through the input of 3D masks and pairs of control points, thereby enabling precise control over the extent of editing. DYG integrates the strengths of the implicit triplane representation to establish the geometric scaffold of the editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model into our method through the proposed Drag-SDS loss function, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG conducts effective drag-based editing guided by control point prompts, surpassing other baselines in terms of editing effect and quality, both qualitatively and quantitatively. Visit our project page at https://quyans.github.io/Drag-Your-Gaussian.
Chinese: 近期基于生成模型的3D场景编辑方法仅限于纹理修改且缺乏几何控制的精确性,为此我们提出了DYG——一种针对3D高斯泼溅的拖拽式编辑方法,通过3D遮罩和控制点对实现精准的空间编辑控制。
English: Recent advancements in 3D scene editing using generative models are limited to texture changes and lack precise geometric control, prompting the introduction of DYG, a drag-based method for 3D Gaussian Splatting that enables accurate spatial manipulation through 3D masks and control points.

Authors:Ranjan Sapkota, Shaina Raza, Maged Shoman, Achyut Paudel, Manoj Karkee
Title: Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey
Abstract:
In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodal ones, for data augmentation to enhance generalization and combat overfitting in training deep convolutional neural networks. However, while existing surveys predominantly focus on ML and DL techniques or limited modalities (text or images), a gap remains in addressing the latest advancements and multi-modal applications of LLM-based methods. This survey fills that gap by exploring recent literature utilizing multimodal LLMs to augment image, text, and audio data, offering a comprehensive understanding of these processes. We outline various methods employed in LLM-based image, text, and speech augmentation, and discuss the limitations identified in current approaches. Additionally, we identify potential solutions to these limitations from the literature to enhance the efficacy of data augmentation practices using multimodal LLMs. This survey serves as a foundation for future research, aiming to refine and expand the use of multimodal LLMs in enhancing dataset quality and diversity for deep learning applications. (Surveyed Paper GitHub Repo: https://github.com/WSUAgRobotics/data-aug-multi-modal-llm. Keywords: LLM data augmentation, Grok text data augmentation, DeepSeek image data augmentation, Grok speech data augmentation, GPT audio augmentation, voice augmentation, DeepSeek for data augmentation, DeepSeek R1 text data augmentation, DeepSeek R1 image augmentation, Image Augmentation using LLM, Text Augmentation using LLM, LLM data augmentation for deep learning applications)
中文摘要:本综述填补了现有研究的空白,全面探讨了多模态大语言模型在图像、文本和音频数据增强方面的最新进展,旨在提升深度学习应用的数据质量和多样性。
English Summary: This survey addresses the gap in existing literature by comprehensively exploring recent advancements in multimodal Large Language Models for augmenting image, text, and audio data to enhance deep learning applications.

Authors:Rong Liu, Dylan Sun, Meida Chen, Yue Wang, Andrew Feng
Title: Deformable Beta Splatting
Abstract:
3D Gaussian Splatting (3DGS) has advanced radiance field reconstruction by enabling real-time rendering. However, its reliance on Gaussian kernels for geometry and low-order Spherical Harmonics (SH) for color encoding limits its ability to capture complex geometries and diverse colors. We introduce Deformable Beta Splatting (DBS), a deformable and compact approach that enhances both geometry and color representation. DBS replaces Gaussian kernels with deformable Beta Kernels, which offer bounded support and adaptive frequency control to capture fine geometric details with higher fidelity while achieving better memory efficiency. In addition, we extend the Beta Kernel to color encoding, which facilitates improved representation of diffuse and specular components, yielding superior results compared to SH-based methods. Furthermore, unlike prior densification techniques that depend on Gaussian properties, we mathematically prove that adjusting regularized opacity alone ensures distribution-preserved Markov chain Monte Carlo (MCMC), independent of the splatting kernel type. Experimental results demonstrate that DBS achieves state-of-the-art visual quality while utilizing only 45% of the parameters and rendering 1.5x faster than 3DGS-MCMC, highlighting the superior performance of DBS for real-time radiance field rendering. Interactive demonstrations and source code are available on our project website: https://rongliu-leo.github.io/beta-splatting/.
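To illustrate what a bounded, shape-controllable splatting kernel looks like in contrast to a Gaussian, the snippet below gives one plausible parameterization in PyTorch; the exact Beta Kernel used by DBS may differ.

import torch
import torch.nn.functional as F

def bounded_kernel(r, shape_param):
    # r: normalized distance from the primitive center, shape (N,).
    # shape_param: a learnable per-primitive tensor controlling the falloff sharpness.
    # Unlike a Gaussian, the response is exactly zero for r >= 1 (bounded support).
    b = F.softplus(shape_param)               # keep the exponent positive
    r = r.clamp(max=1.0)
    return (1.0 - r.pow(2)).pow(b)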

Authors:Xiangbo Gao, Runsheng Xu, Jiachen Li, Ziran Wang, Zhiwen Fan, Zhengzhong Tu
Title: STAMP: Scalable Task And Model-agnostic Collaborative Perception
Abstract:
Perception is crucial for autonomous driving, but single-agent perception is often constrained by sensors' physical limitations, leading to degraded performance under severe occlusion, adverse weather conditions, and when detecting distant objects. Multi-agent collaborative perception offers a solution, yet challenges arise when integrating heterogeneous agents with varying model architectures. To address these challenges, we propose STAMP, a scalable task- and model-agnostic, collaborative perception pipeline for heterogeneous agents. STAMP utilizes lightweight adapter-reverter pairs to transform Bird's Eye View (BEV) features between agent-specific and shared protocol domains, enabling efficient feature sharing and fusion. This approach minimizes computational overhead, enhances scalability, and preserves model security. Experiments on simulated and real-world datasets demonstrate STAMP's comparable or superior accuracy to state-of-the-art models with significantly reduced computational costs. As a first-of-its-kind task- and model-agnostic framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy. Our project page is at https://xiangbogaobarry.github.io/STAMP and the code is available at https://github.com/taco-group/STAMP.
中文: STAMP是一种可扩展、任务与模型无关的异构智能体协同感知框架,通过轻量级适配器-还原器对实现高效特征共享与融合,在显著降低计算成本的同时保持高精度和模型安全性。
English: STAMP is a scalable, task- and model-agnostic collaborative perception pipeline for heterogeneous agents that uses lightweight adapter-reverter pairs to enable efficient feature sharing and fusion, achieving high accuracy with low computational costs and enhanced security.
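The adapter-reverter idea can be pictured with a small PyTorch module like the one below; the channel counts, layer depths, and 1x1/3x3 layout are assumptions for illustration rather than STAMP's actual architecture.

import torch
import torch.nn as nn

class AdapterReverter(nn.Module):
    # The adapter maps an agent-specific BEV feature map into a shared protocol
    # space; the reverter maps fused protocol features back to the agent's space.
    def __init__(self, agent_ch, protocol_ch):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Conv2d(agent_ch, protocol_ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(protocol_ch, protocol_ch, kernel_size=3, padding=1))
        self.reverter = nn.Sequential(
            nn.Conv2d(protocol_ch, protocol_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(protocol_ch, agent_ch, kernel_size=1))

    def forward(self, bev):                       # bev: (B, agent_ch, H, W)
        shared = self.adapter(bev)                # features in the shared protocol domain
        return shared, self.reverter(shared)      # shared features plus round-trip reconstruction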

Authors:Vishal Thengane, Xiatian Zhu, Salim Bouzerdoum, Son Lam Phung, Yunpeng Li
Title: Foundational Models for 3D Point Clouds: A Survey and Outlook
Abstract:
The 3D point cloud representation plays a crucial role in preserving the geometric fidelity of the physical world, enabling more accurate modeling of complex 3D environments. While humans naturally comprehend the intricate relationships between objects and their variations through a multisensory system, artificial intelligence (AI) systems have yet to fully replicate this capacity. To bridge this gap, it becomes essential to incorporate multiple modalities. Models that can seamlessly integrate and reason across these modalities are known as foundation models (FMs). The development of FMs for 2D modalities, such as images and text, has seen significant progress, driven by the abundant availability of large-scale datasets. However, the 3D domain has lagged due to the scarcity of labelled data and high computational overheads. In response, recent research has begun to explore the potential of applying FMs to 3D tasks, overcoming these challenges by leveraging existing 2D knowledge. Additionally, language, with its capacity for abstract reasoning and description of the environment, offers a promising avenue for enhancing 3D understanding through large pre-trained language models (LLMs). Despite the rapid development and adoption of FMs for 3D vision tasks in recent years, there remains a gap in comprehensive and in-depth literature reviews. This article aims to address this gap by presenting a comprehensive overview of the state-of-the-art methods that utilize FMs for 3D visual understanding. We start by reviewing the strategies employed in building various 3D FMs. Then we categorize and summarize the use of different FMs for downstream tasks such as perception. Finally, the article offers insights into future directions for research and development in this field. To help the reader, we have curated a list of relevant papers on the topic: https://github.com/vgthengane/Awesome-FMs-in-3D.
中文摘要:本文全面综述了用于三维视觉理解的基础模型,通过分析其构建策略、在感知任务中的应用及未来研究方向,填补了该领域文献综述的空白。
English Summary: This article provides a comprehensive review of foundation models (FMs) for 3D visual understanding, addressing the gap in literature by examining their development strategies, applications in perception tasks, and future research directions.

Authors:Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra
Title: Diffusion Autoencoders are Scalable Image Tokenizers
Abstract:
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer which is supervised. DiTo achieves competitive or better quality than state-of-the-art in image reconstruction and downstream image generation tasks.
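The single-objective claim can be pictured with the following PyTorch sketch: an encoder produces compact tokens and a conditional denoiser is trained only with the diffusion L2 loss. The module interfaces and the cosine noise schedule are illustrative assumptions, not DiTo's exact design.

import torch
import torch.nn.functional as F

def dito_style_loss(encoder, denoiser, images, num_steps=1000):
    # encoder(images) -> compact visual tokens z; denoiser(noisy, t, z) -> predicted noise.
    z = encoder(images)
    t = torch.randint(0, num_steps, (images.shape[0],), device=images.device)
    noise = torch.randn_like(images)
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps).view(-1, 1, 1, 1) ** 2  # simple cosine schedule
    noisy = alpha_bar.sqrt() * images + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t, z)                  # decoder conditioned on the tokens
    return F.mse_loss(pred, noise)                # the only training objective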

Authors:Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink
Title: Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models
Abstract:
In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired works leveraging these models to enhance adaptation and generalization performances or adapting them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at https://github.com/donghao51/Awesome-Multimodal-Adaptation.
中文: 本综述全面回顾了多模态领域自适应与泛化的研究进展,涵盖从传统方法到基础模型的各类方法,并分析了关键挑战、应用场景及未来研究方向。
English: This survey comprehensively reviews multimodal domain adaptation and generalization, covering traditional methods to foundation models, analyzing key challenges, applications, and future research directions.

Authors:Shi Chen, Lefei Zhang, Liangpei Zhang
Title: HSRMamba: Contextual Spatial-Spectral State Space Model for Single Image Hyperspectral Super-Resolution
Abstract:
Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential in hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges as transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate our HSRMamba outperforms the state-of-the-art methods in quantitative quality and visual results. Code is available at: https://github.com/Tomchenshi/HSRMamba.
中文摘要:提出的HSRMamba模型通过局部空谱分区和全局谱重排序机制,解决了Mamba在超分辨率任务中忽略局部结构关系和对输入顺序敏感的问题,显著提升了图像重建质量。
English Summary: The proposed HSRMamba model overcomes Mamba's limitations in hyperspectral image super-resolution by implementing local spatial-spectral partitioning and global spectral reordering to better preserve structural relationships and enhance detail restoration.
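A simplified version of similarity-driven spectral reordering can be written as below in PyTorch; the choice of reference band and cosine similarity measure are illustrative assumptions rather than the paper's exact strategy.

import torch
import torch.nn.functional as F

def spectral_reorder(x, ref_band=0):
    # x: hyperspectral features (B, C, H, W). Bands are reordered by similarity
    # to a reference band so that spectrally similar bands become adjacent in the
    # 1D scan order consumed by a state space model.
    b, c, h, w = x.shape
    flat = x.flatten(2)                                    # (B, C, H*W)
    ref = flat[:, ref_band:ref_band + 1]                   # (B, 1, H*W)
    sim = F.cosine_similarity(flat, ref, dim=-1)           # (B, C) similarity per band
    order = sim.argsort(dim=1, descending=True)            # per-sample band ordering
    idx = order.view(b, c, 1, 1).expand(-1, -1, h, w)
    return x.gather(1, idx), order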

Authors:Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney
Title: Track-On: Transformer-based Online Point Tracking with Memory
Abstract:
In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules -- spatial memory and context memory -- to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. Project page: https://kuis-ai.github.io/track_on

Authors:Shiho Noda, Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa
Title: A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models
Abstract:
Out-of-distribution (OOD) detection is a task that detects OOD samples during inference to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community goes beyond specific benchmarks and includes more challenging conditions reflecting real-world scenarios. The code is available at https://github.com/hoshi23/OOD-X-Benchmarks.
中文: 本文提出了三个新的分布外检测基准,以解决传统测试中的性能饱和问题并更真实地反映实际条件,发现现有方法在这些挑战中表现均不稳定。
English: This paper introduces three new benchmarks for out-of-distribution detection to address performance saturation in conventional tests and better reflect real-world conditions, revealing that current methods struggle inconsistently across these challenges.

Authors:Amanturdieva Akmaral, Muhammad Hamza Zafar
Title: Efficient Transformer for High Resolution Image Motion Deblurring
Abstract:
This paper presents a comprehensive study and improvement of the Restormer architecture for high-resolution image motion deblurring. We introduce architectural modifications that reduce model complexity by 18.4% while maintaining or improving performance through optimized attention mechanisms. Our enhanced training pipeline incorporates additional transformations including color jitter, Gaussian blur, and perspective transforms to improve model robustness as well as a new frequency loss term. Extensive experiments on the RealBlur-R, RealBlur-J, and Ultra-High-Definition Motion blurred (UHDM) datasets demonstrate the effectiveness of our approach. The improved architecture shows better convergence behavior and reduced training time while maintaining competitive performance across challenging scenarios. We also provide detailed ablation studies analyzing the impact of our modifications on model behavior and performance. Our results suggest that thoughtful architectural simplification combined with enhanced training strategies can yield more efficient yet equally capable models for motion deblurring tasks. Code and Data Available at: https://github.com/hamzafer/image-deblurring
中文: 本研究通过优化注意力机制和增强训练流程,改进了Restormer架构用于图像运动去模糊,在降低18.4%复杂度的同时保持性能,并在多个数据集上验证了其高效性。
English: This study enhances the Restormer architecture for image motion deblurring, reducing complexity by 18.4% through optimized attention mechanisms and improving training with additional transformations and a frequency loss, achieving efficient performance across multiple datasets.
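The "frequency loss term" mentioned in the abstract can be realized in several ways; a minimal FFT-based variant in PyTorch is sketched below, with the norm and weighting as assumptions since the paper's exact definition is not reproduced here.

import torch

def frequency_loss(pred, target):
    # L1 distance between the complex 2D spectra of the restored and sharp images,
    # penalizing missing high-frequency detail directly in the frequency domain.
    pred_fft = torch.fft.rfft2(pred, norm="ortho")
    target_fft = torch.fft.rfft2(target, norm="ortho")
    return (pred_fft - target_fft).abs().mean()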

Authors:Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
Title: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Abstract:
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models. Code and data are available at: https://github.com/TsinghuaC3I/MedXpertQA
中文:MedXpertQA是一个包含4,460道跨17个专科医学难题的权威评测基准,通过严格筛选和多模态临床数据,专门评估超越传统问答的高级推理能力。
English: MedXpertQA is a challenging medical benchmark featuring 4,460 expert-level questions across 17 specialties, incorporating rigorous filtering and multimodal clinical data to evaluate advanced reasoning beyond traditional QA pairs.

Authors:Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai
Title: AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment
Abstract:
Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA-3k includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, an LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA-3k, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. The dataset and code are available at https://github.com/charlotte9524/AGAV-Rater.
中文: 本文提出了首个大规模AI生成音视频质量评估数据集AGAVQA-3k,并开发了多模态模型AGAV-Rater,该模型在评估和选择最佳音视频内容方面实现了最先进的性能表现。
English: This paper introduces AGAVQA-3k, the first large-scale dataset for assessing AI-generated audio-visual content quality, and proposes AGAV-Rater, a multimodal model that achieves state-of-the-art performance in evaluating and selecting optimal audio-visual outputs.

Authors:Siyuan Jiang, Yihan Hu, Wenjie Li, Pengcheng Zeng
Title: DeepFRC: An End-to-End Deep Learning Model for Functional Registration and Classification
Abstract:
Functional data - observations in the form of curves or trajectories - arise in diverse domains such as biomedical sensing, motion capture, and handwriting recognition. A core challenge in functional data analysis (FDA) is accounting for phase variability, where misaligned temporal patterns hinder accurate inference. We introduce DeepFRC, an end-to-end deep learning framework for joint functional registration and classification. Unlike conventional approaches that decouple alignment and prediction, DeepFRC integrates class-aware elastic warping and a learnable basis representation into a unified architecture. This design enables temporal alignment and dimensionality reduction to be jointly optimized with classification, improving both interpretability and accuracy. We establish the first theoretical connection between alignment quality and generalization error, and validate our model on synthetic and real-world benchmarks. DeepFRC consistently outperforms state-of-the-art methods, especially in scenarios with complex temporal misalignment. Code is available at: https://github.com/Drivergo-93589/DeepFRC.
Chinese: DeepFRC 是一个端到端的深度学习框架,通过整合类别感知的弹性扭曲和可学习基表示,联合优化功能配准与分类,在处理时间错位时提高了准确性和可解释性,并优于现有先进方法。
English: DeepFRC is an end-to-end deep learning framework that jointly optimizes functional registration and classification by integrating class-aware elastic warping and a learnable basis representation, improving accuracy and interpretability while outperforming state-of-the-art methods in handling temporal misalignment.

Authors:Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
Title: LLMs can see and hear without any training
Abstract:
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
中文: MILS是一种无需训练的方法,通过迭代生成和评分候选答案来增强LLM的多模态能力,在图像、视频和音频描述等任务中实现了最先进的性能。
English: MILS is a training-free method that enhances multimodal capabilities in LLMs through iterative candidate generation and scoring, achieving state-of-the-art performance in tasks like captioning and media generation.
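The iterative generate-score-feedback loop can be captured in a few lines of generic Python; the callables, round count, and feedback size below are illustrative assumptions, not MILS's exact prompts or scorers.

def mils_style_loop(generate_candidates, score, rounds=8, k=16):
    # generate_candidates(feedback, k) -> list[str]: LLM proposes k candidates,
    # optionally conditioned on the scored feedback from the previous round.
    # score(candidate) -> float: e.g. a CLIP similarity to the target image.
    feedback = []
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        candidates = generate_candidates(feedback, k)
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        feedback = scored[: k // 4]        # feed the top candidates back to the LLM
    return best, best_score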

Authors:Lei Cheng, Siyang Cao
Title: TransRAD: Retentive Vision Transformer for Enhanced Radar Object Detection
Abstract:
Despite significant advancements in environment perception capabilities for autonomous driving and intelligent robotics, cameras and LiDARs remain notoriously unreliable in low-light conditions and adverse weather, which limits their effectiveness. Radar serves as a reliable and low-cost sensor that can effectively complement these limitations. However, radar-based object detection has been underexplored due to the inherent weaknesses of radar data, such as low resolution, high noise, and lack of visual information. In this paper, we present TransRAD, a novel 3D radar object detection model designed to address these challenges by leveraging the Retentive Vision Transformer (RMT) to more effectively learn features from information-dense radar Range-Azimuth-Doppler (RAD) data. Our approach leverages the Retentive Manhattan Self-Attention (MaSA) mechanism provided by RMT to incorporate explicit spatial priors, thereby enabling more accurate alignment with the spatial saliency characteristics of radar targets in RAD data and achieving precise 3D radar detection across Range-Azimuth-Doppler dimensions. Furthermore, we propose Location-Aware NMS to effectively mitigate the common issue of duplicate bounding boxes in deep radar object detection. The experimental results demonstrate that TransRAD outperforms state-of-the-art methods in both 2D and 3D radar detection tasks, achieving higher accuracy, faster inference speed, and reduced computational complexity. Code is available at https://github.com/radar-lab/TransRAD
Chinese: TransRAD是一种新型3D雷达目标检测模型,通过采用保留视觉变换器和位置感知非极大值抑制技术,有效克服了雷达数据的固有缺陷,在精度、速度和计算效率上均超越了现有最优方法。
English: TransRAD is a novel 3D radar object detection model that addresses the limitations of radar data by leveraging the Retentive Vision Transformer and introducing Location-Aware NMS, achieving superior performance in accuracy, speed, and efficiency compared to state-of-the-art methods.

Authors:Ajinkya Khoche, Qingwen Zhang, Laura Pereira Sanchez, Aron Asefaw, Sina Sharif Mansouri, Patric Jensfelt
Title: SSF: Sparse Long-Range Scene Flow for Autonomous Driving
Abstract:
Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance in the long-range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to long-range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow, adopting a sparse convolution based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme, that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code will be released at https://github.com/KTH-RPL/SSF.git.
Chinese Summary: 本文提出稀疏场景流(SSF)方法,通过采用稀疏卷积特征提取和特征融合方案解决远距离场景流估计的扩展性难题,在Argoverse2数据集上实现了最优性能。
English Summary: The paper introduces Sparse Scene Flow (SSF), a novel pipeline that overcomes scalability limitations in long-range 3D motion estimation by employing sparse convolutions and feature fusion, achieving state-of-the-art performance on the Argoverse2 dataset.
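A range-wise evaluation metric of the kind described can be sketched as follows in PyTorch; the bucket edges and plain averaging are assumptions, since the paper's exact weighting is not reproduced here.

import torch

def range_bucketed_epe(pred_flow, gt_flow, points, edges=(0.0, 35.0, 50.0, 75.0, 100.0)):
    # End-point error is averaged separately per range bucket so that sparse,
    # faraway points are not drowned out by the dense near field.
    epe = (pred_flow - gt_flow).norm(dim=-1)          # (N,) per-point end-point error
    rng = points[:, :2].norm(dim=-1)                  # planar distance from the ego vehicle
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (rng >= lo) & (rng < hi)
        if mask.any():
            results[f"{int(lo)}-{int(hi)}m"] = epe[mask].mean().item()
    return results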

Authors:Ahmed Sharshar, Yasser Attia, Mohammad Yaqub, Mohsen Guizani
Title: PulmoFusion: Advancing Pulmonary Health with Efficient Multi-Modal Fusion
Abstract:
Traditional remote spirometry lacks the precision required for effective pulmonary monitoring. We present a novel, non-invasive approach using multimodal predictive models that integrate RGB or thermal video data with patient metadata. Our method leverages energy-efficient Spiking Neural Networks (SNNs) for the regression of Peak Expiratory Flow (PEF) and classification of Forced Expiratory Volume (FEV1) and Forced Vital Capacity (FVC), using lightweight CNNs to overcome SNN limitations in regression tasks. Multimodal data integration is improved with a Multi-Head Attention Layer, and we employ K-Fold validation and ensemble learning to boost robustness. Using thermal data, our SNN models achieve 92% accuracy on a breathing-cycle basis and 99.5% patient-wise. PEF regression models attain Relative RMSEs of 0.11 (thermal) and 0.26 (RGB), with an MAE of 4.52% for FEV1/FVC predictions, establishing state-of-the-art performance. Code and dataset can be found on https://github.com/ahmed-sharshar/RespiroDynamics.git
中文: 本研究提出了一种新型无创多模态方法,利用脉冲神经网络结合热成像或RGB视频数据,实现了对肺功能关键指标的高精度预测,在呼吸监测领域达到了领先性能水平。
English: This study introduces a non-invasive multimodal approach using spiking neural networks and thermal or RGB video data to accurately monitor pulmonary function, achieving state-of-the-art performance in predicting key respiratory metrics with high precision.

Authors:Sebastian Ægidius, Dennis Hadjivelichkov, Jianhao Jiao, Jonathan Embley-Riches, Dimitrios Kanoulas
Title: Watch Your STEPP: Semantic Traversability Estimation using Pose Projected Features
Abstract:
Understanding the traversability of terrain is essential for autonomous robot navigation, particularly in unstructured environments such as natural landscapes. Although traditional methods, such as occupancy mapping, provide a basic framework, they often fail to account for the complex mobility capabilities of some platforms such as legged robots. In this work, we propose a method for estimating terrain traversability by learning from demonstrations of human walking. Our approach leverages dense, pixel-wise feature embeddings generated using the DINOv2 vision Transformer model, which are processed through an encoder-decoder MLP architecture to analyze terrain segments. The averaged feature vectors, extracted from the masked regions of interest, are used to train the model in a reconstruction-based framework. By minimizing reconstruction loss, the network distinguishes between familiar terrain with a low reconstruction error and unfamiliar or hazardous terrain with a higher reconstruction error. This approach facilitates the detection of anomalies, allowing a legged robot to navigate more effectively through challenging terrain. We run real-world experiments on the ANYmal legged robot both indoors and outdoors to validate our proposed method. The code is open-source, while video demonstrations can be found on our website: https://rpl-cs-ucl.github.io/STEPP
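The reconstruction-based anomaly idea maps to a very small model; the sketch below is a plausible PyTorch version in which the layer widths and the DINOv2 feature dimension are assumptions for illustration.

import torch
import torch.nn as nn

class FeatureReconstructor(nn.Module):
    # An encoder-decoder MLP trained to reconstruct averaged DINOv2 patch features
    # of terrain that a human walked over; at test time, a high reconstruction
    # error marks unfamiliar (potentially hazardous) terrain.
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, 256), nn.ReLU(),
            nn.Linear(256, dim))

    def anomaly_score(self, feat):                 # feat: (N, dim) averaged segment features
        return (self.net(feat) - feat).pow(2).mean(dim=-1)   # per-segment reconstruction error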

Authors:Anh-Kiet Duong, Petra Gomez-Krämer
Title: Action Recognition Using Temporal Shift Module and Ensemble Learning
Abstract:
This paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the International Conference on Pattern Recognition (ICPR) 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. The proposed approach is built upon the Temporal Shift Module (TSM), a technique aimed at efficiently capturing temporal dynamics in video data, incorporating multiple data input types. Our strategy included transfer learning to leverage pre-trained models, followed by meticulous fine-tuning on the challenge's specific dataset to optimize performance for the 20 action classes. We carefully selected a backbone network to balance computational efficiency and recognition accuracy and further refined the model using an ensemble technique that integrates outputs from different modalities. This ensemble approach proved crucial in boosting the overall performance. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes. Our code is available online at https://github.com/ffyyytt/TSM-MMVPR.
中文: 本文提出了多模态动作识别挑战赛的冠军方案,采用基于TSM的集成方法并结合迁移学习,在20个动作类别上实现了100%的识别准确率。
English: This paper introduces the top-performing solution for the Multi-Modal Action Recognition Challenge, utilizing a TSM-based ensemble approach with transfer learning to achieve perfect accuracy across 20 action classes.

Authors:Matt C. Bendel, Saurav K. Shastri, Rizwan Ahmad, Philip Schniter
Title: Solving Inverse Problems using Diffusion with Iterative Colored Renoising
Abstract:
Imaging inverse problems can be solved in an unsupervised manner using pre-trained diffusion models, but doing so requires approximating the gradient of the measurement-conditional score function in the diffusion reverse process. We show that the approximations produced by existing methods are relatively poor, especially early in the reverse process, and so we propose a new approach that iteratively reestimates and "renoises" the estimate several times per diffusion step. This iterative approach, which we call Fast Iterative REnoising (FIRE), injects colored noise that is shaped to ensure that the pre-trained diffusion model always sees white noise, in accordance with how it was trained. We then embed FIRE into the DDIM reverse process and show that the resulting "DDfire" offers state-of-the-art accuracy and runtime on several linear inverse problems, as well as phase retrieval. Our implementation is at https://github.com/matt-bendel/DDfire
中文摘要:提出的FIRE方法通过在扩散过程中迭代重估和重噪估计,改进了用于解决成像逆问题的扩散模型,实现了最先进的精度和运行效率。
English Summary: The proposed FIRE method enhances diffusion models for solving imaging inverse problems by iteratively reestimating and renoising estimates within the diffusion process, achieving state-of-the-art accuracy and efficiency.

Authors:Shreya Shukla, Prajwal Gatti, Yogesh Kumar, Vikash Yadav, Anand Mishra
Title: Towards Making Flowchart Images Machine Interpretable
Abstract:
Computer programming textbooks and software documentations often contain flowcharts to illustrate the flow of an algorithm or procedure. Modern OCR engines often tag these flowcharts as graphics and ignore them in further processing. In this paper, we work towards making flowchart images machine-interpretable by converting them to executable Python codes. To this end, inspired by the recent success in natural language to code generation literature, we present a novel transformer-based framework, namely FloCo-T5. Our model is well-suited for this task, as it can effectively learn semantics, structure, and patterns of programming languages, which it leverages to generate syntactically correct code. We also used a task-specific pre-training objective to pre-train FloCo-T5 using a large number of logic-preserving augmented code samples. Further, to perform a rigorous study of this problem, we introduce the FloCo dataset that contains 11,884 flowchart images and their corresponding Python codes. Our experiments show promising results, and FloCo-T5 clearly outperforms related competitive baselines on code generation metrics. We make our dataset and implementation publicly available.

Authors:Chongyu Qu, Ritchie Zhao, Ye Yu, Bin Liu, Tianyuan Yao, Junchao Zhu, Bennett A. Landman, Yucheng Tang, Yuankai Huo
Title: Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines
Abstract:
Quantizing deep neural networks, reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods studied "fake quantization", which simulates lower precision operations during inference, but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real 3D low-bit quantization on modern GPUs is still unexplored. In this study, we introduce a real post-training quantization (PTQ) framework that successfully implements true 8-bit quantization on state-of-the-art (SOTA) 3D medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, ST-UNet, and VISTA3D. Our approach involves two main steps. First, we use TensorRT to perform fake quantization for both weights and activations with an unlabeled calibration dataset. Second, we convert this fake quantization into real quantization via the TensorRT engine on real GPUs, resulting in real-world reductions in model size and inference latency. Extensive experiments demonstrate that our framework effectively performs 8-bit quantization on GPUs without sacrificing model performance. This advancement enables the deployment of efficient deep learning models in medical imaging applications where computational resources are constrained. The code and models have been released, including U-Net and TransUNet pretrained on the BTCV dataset for abdominal (13-label) segmentation, UNesT pretrained on the Whole Brain Dataset for whole brain (133-label) segmentation, and nnU-Net, SegResNet, SwinUNETR and VISTA3D pretrained on TotalSegmentator V2 for full body (104-label) segmentation. https://github.com/hrlblab/PTQ.
Chinese: 本研究提出了一种真实的后训练量化框架,成功在先进的3D医学分割模型上实现真正的8位量化,在GPU上显著减小模型尺寸并降低推理延迟,同时保持模型性能。
English: This study introduces a real post-training quantization framework that successfully implements true 8-bit quantization on state-of-the-art 3D medical segmentation models, significantly reducing model size and inference latency without compromising performance on GPUs.

Authors:Hossein Mirzaei, Mojtaba Nafez, Moein Madadi, Arad Maleki, Mahdi Hajialilue, Zeinab Sadat Taghavi, Sepehr Rezaee, Ali Ansari, Bahar Dibaei Nia, Kian Shamsaie, Mohammadreza Salehi, Mackenzie W. Mathis, Mahdieh Soleymani Baghshah, Mohammad Sabokrou, Mohammad Hossein Rohban
Title: A Contrastive Teacher-Student Framework for Novelty Detection under Style Shifts
Abstract:
There have been several efforts to improve Novelty Detection (ND) performance. However, ND methods often suffer significant performance drops under minor distribution shifts caused by changes in the environment, known as style shifts. This challenge arises from the ND setup, where the absence of out-of-distribution (OOD) samples during training causes the detector to be biased toward the dominant style features in the in-distribution (ID) data. As a result, the model mistakenly learns to correlate style with core features, using this shortcut for detection. Robust ND is crucial for real-world applications like autonomous driving and medical imaging, where test samples may have different styles than the training data. Motivated by this, we propose a robust ND method that crafts an auxiliary OOD set with style features similar to the ID set but with different core features. Then, a task-based knowledge distillation strategy is utilized to distinguish core features from style features and help our model rely on core features for discriminating crafted OOD and ID sets. We verified the effectiveness of our method through extensive experimental evaluations on several datasets, including synthetic and real-world benchmarks, against nine different ND methods.
中文摘要:该研究提出了一种新颖性检测方法,通过构建风格特征相似但核心特征不同的辅助分布外数据集,并利用知识蒸馏技术区分风格与核心特征,从而提升模型在风格变化下的检测鲁棒性。
English Summary: The proposed robust novelty detection method creates an auxiliary out-of-distribution set with matching style features but different core features, then uses knowledge distillation to help the model distinguish between style and core features for improved detection performance.

Authors:Akash Kumar, Zsolt Kira, Yogesh Singh Rawat
Title: Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
Abstract:
In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multi-modal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach which is designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.

Authors:Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Ali Ansari, Sepehr Ghobadi, Masoud Hadi, Arshia Soltani Moakhar, Mohammad Azizmalayeri, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Title: RODEO: Robust Outlier Detection via Exposing Adaptive Out-of-Distribution Samples
Abstract:
In recent years, there have been significant improvements in various forms of image outlier detection. However, outlier detection performance under adversarial settings lags far behind that in standard settings. This is due to the lack of effective exposure to adversarial scenarios during training, especially on unseen outliers, leading to detection models failing to learn robust features. To bridge this gap, we introduce RODEO, a data-centric approach that generates effective outliers for robust outlier detection. More specifically, we show that incorporating outlier exposure (OE) and adversarial training can be an effective strategy for this purpose, as long as the exposed training outliers meet certain characteristics, including diversity, and both conceptual differentiability and analogy to the inlier samples. We leverage a text-to-image model to achieve this goal. We demonstrate both quantitatively and qualitatively that our adaptive OE method effectively generates "diverse" and "near-distribution" outliers, leveraging information from both text and image domains. Moreover, our experimental results show that utilizing our synthesized outliers significantly enhances the performance of the outlier detector, particularly in adversarial settings.
中文: 近年来图像异常检测在对抗性场景下表现不佳,为此我们提出RODEO方法,通过文本到图像模型生成多样且接近正常分布的异常样本,有效提升了检测器在对抗环境下的性能。
English: Recent advances in image outlier detection struggle with adversarial scenarios due to insufficient training exposure, prompting the introduction of RODEO, a data-centric method using text-to-image models to generate diverse and near-distribution outliers that significantly boost detector robustness.

Authors:Nikolaos Kaparinos, Vasileios Mezaris
Title: B-FPGM: Lightweight Face Detection via Bayesian-Optimized Soft FPGM Pruning
Abstract:
Face detection is a computer vision application that increasingly demands lightweight models to facilitate deployment on devices with limited computational resources. Neural network pruning is a promising technique that can effectively reduce network size without significantly affecting performance. In this work, we propose a novel face detection pruning pipeline that leverages Filter Pruning via Geometric Median (FPGM) pruning, Soft Filter Pruning (SFP) and Bayesian optimization in order to achieve a superior trade-off between size and performance compared to existing approaches. FPGM pruning is a structured pruning technique that allows pruning the least significant filters in each layer, while SFP iteratively prunes the filters and allows them to be updated in any subsequent training step. Bayesian optimization is employed in order to optimize the pruning rates of each layer, rather than relying on engineering expertise to determine the optimal pruning rates for each layer. In our experiments across all three subsets of the WIDER FACE dataset, our proposed approach B-FPGM consistently outperforms existing ones in balancing model size and performance. All our experiments were applied to EResFD, the currently smallest (in number of parameters) well-performing face detector of the literature; a small ablation study with a second small face detector, EXTD, is also reported. The source code and trained pruned face detection models can be found at: https://github.com/IDTITI/B-FPGM.
Chinese: 本研究提出了一种结合FPGM、SFP和贝叶斯优化的新型人脸检测剪枝流程,在WIDER FACE数据集上相比现有方法,能更好地平衡模型大小与性能。
English: This study introduces a novel face detection pruning pipeline combining FPGM, SFP, and Bayesian optimization to achieve an optimal balance between model size and performance, consistently outperforming existing methods on the WIDER FACE dataset.
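The FPGM criterion at the heart of the pipeline can be expressed compactly in PyTorch, as sketched below; the soft-pruning schedule and the Bayesian optimization of per-layer rates are omitted, and the ratio argument is an illustrative assumption.

import torch

def fpgm_filter_indices(conv_weight, prune_ratio):
    # conv_weight: (out_ch, in_ch, k, k). Filters whose summed distance to all other
    # filters in the layer is smallest lie closest to the geometric median, are the
    # most "replaceable", and are the ones selected for pruning.
    filters = conv_weight.flatten(1)                       # (out_ch, in_ch*k*k)
    dists = torch.cdist(filters, filters)                  # pairwise L2 distances
    scores = dists.sum(dim=1)                              # total distance to every other filter
    n_prune = int(prune_ratio * filters.shape[0])
    return torch.argsort(scores)[:n_prune]                 # indices of filters to zero out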

Authors:Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun
Title: RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
Abstract:
Cooperative perception enhances autonomous driving by leveraging Vehicle-to-Everything (V2X) communication for multi-agent sensor fusion. However, most existing methods rely on single-modal data sharing, limiting fusion performance, particularly in heterogeneous sensor settings involving both LiDAR and cameras across vehicles and roadside units (RSUs). To address this, we propose Radian Glue Attention (RG-Attn), a lightweight and generalizable cross-modal fusion module that unifies intra-agent and inter-agent fusion via transformation-based coordinate alignment and a unified sampling/inversion strategy. RG-Attn efficiently aligns features through a radian-based attention constraint, operating column-wise on geometrically consistent regions to reduce overhead and preserve spatial coherence, thereby enabling accurate and robust fusion. Building upon RG-Attn, we propose three cooperative architectures. The first, Paint-To-Puzzle (PTP), prioritizes communication efficiency but assumes all agents have LiDAR, optionally paired with cameras. The second, Co-Sketching-Co-Coloring (CoS-CoCo), offers maximal flexibility, supporting any sensor setup (e.g., LiDAR-only, camera-only, or both) and enabling strong cross-modal generalization for real-world deployment. The third, Pyramid-RG-Attn Fusion (PRGAF), aims for peak detection accuracy with the highest computational overhead. Extensive evaluations on simulated and real-world datasets show our framework delivers state-of-the-art detection accuracy with high flexibility and efficiency. GitHub Link: https://github.com/LantaoLi/RG-Attn
中文摘要:提出的Radian Glue Attention(RG-Attn)模块通过坐标对齐和统一采样策略实现高效跨模态融合,其三种架构分别在通信效率、传感器兼容性和检测精度方面提供不同优势。
English Summary: The proposed Radian Glue Attention (RG-Attn) module enables efficient cross-modal fusion for cooperative perception through coordinate alignment and unified sampling, with three architectures offering varying balances of communication efficiency, sensor flexibility, and detection accuracy.

Authors:Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
Title: DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Abstract:
Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptation of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.

Authors:Yinfeng Gao, Qichao Zhang, Da-wei Ding, Dongbin Zhao
Title: Dream to Drive with Predictive Individual World Model
Abstract:
Producing reactive driving behaviors in complex urban environments remains challenging, as road users' intentions are unknown. Model-based reinforcement learning (MBRL) offers great potential to learn a reactive policy by constructing a world model that can provide informative states and imagination training. However, a critical limitation in relevant research lies in the scene-level reconstruction representation learning, which may overlook key interactive vehicles and hardly models the interactive features among vehicles and their long-term intentions. Therefore, this paper presents a novel MBRL method with a predictive individual world model (PIWM) for autonomous driving. PIWM describes the driving environment from an individual-level perspective and captures vehicles' interactive relations and their intentions via a trajectory prediction task. Meanwhile, a behavior policy is learned jointly with PIWM. It is trained in PIWM's imagination and effectively navigates urban driving scenes by leveraging intention-aware latent states. The proposed method is trained and evaluated on simulation environments built upon real-world challenging interactive scenarios. Compared with popular model-free and state-of-the-art model-based reinforcement learning methods, experimental results show that the proposed method achieves the best performance in terms of safety and efficiency.
中文摘要:本文提出了一种新颖的基于模型强化学习的方法,通过预测个体世界模型(PIWM)从个体层面捕捉车辆交互关系和意图,在复杂城市驾驶场景中实现了最佳的安全性和效率表现。
English Summary: This paper introduces a novel model-based reinforcement learning method with a Predictive Individual World Model (PIWM) that enhances autonomous driving by capturing vehicle interactions and intentions through trajectory prediction, achieving superior safety and efficiency in urban environments.

Authors:Jianing Li, Ming Lu, Hao Wang, Chenyang Gu, Wenzhao Zheng, Li Du, Shanghang Zhang
Title: SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation
Abstract:
3D semantic occupancy prediction is a crucial task in visual perception, as it requires the simultaneous comprehension of both scene geometry and semantics. It plays a crucial role in understanding 3D scenes and has great potential for various applications, such as robotic vision perception and autonomous driving. Many existing works utilize planar-based representations such as Bird's Eye View (BEV) and Tri-Perspective View (TPV). These representations aim to simplify the complexity of 3D scenes while preserving essential object information, thereby facilitating efficient scene representation. However, in dense indoor environments with prevalent occlusions, directly applying these planar-based methods often leads to difficulties in capturing global semantic occupancy, ultimately degrading model performance. In this paper, we present a new vertical slice representation that divides the scene along the vertical axis and projects spatial point features onto the nearest pair of parallel planes. To utilize these slice features, we propose SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction. SliceOcc utilizes pairs of slice queries and cross-attention mechanisms to extract planar features from input images. These local planar features are then fused to form a global scene representation, which is employed for indoor occupancy prediction. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves a mIoU of 15.45% across 81 indoor categories, setting a new state-of-the-art performance among RGB camera-based models for indoor 3D semantic occupancy prediction. Code is available at https://github.com/NorthSummer/SliceOcc.
中文: 本文提出SliceOcc模型,通过垂直切片表示和交叉注意力机制,在室内三维语义占据预测任务中实现了最先进的性能,有效解决了平面表示方法在密集遮挡环境中的局限性。
English: This paper introduces SliceOcc, a novel RGB camera-based model that employs vertical slice representation and cross-attention mechanisms to achieve state-of-the-art performance in indoor 3D semantic occupancy prediction, addressing limitations of planar-based methods in dense environments.
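
A minimal sketch of the slice-projection idea behind SliceOcc, assuming uniform slicing along the height axis and linear weighting between the two nearest planes; for brevity it collapses the in-plane dimensions, whereas the actual model keeps full 2D plane features and slice queries:

    # Hypothetical sketch of vertical-slice feature projection: scatter per-point
    # features onto the nearest pair of parallel slice planes (axis choice and
    # weighting are assumptions, not the exact SliceOcc formulation).
    import torch

    def project_to_slices(points, feats, num_slices=8, z_min=0.0, z_max=3.0):
        """points: (N, 3) xyz; feats: (N, C). Returns (num_slices, C) plane features."""
        z = points[:, 2].clamp(z_min, z_max - 1e-6)
        # Continuous slice coordinate in [0, num_slices - 1]
        s = (z - z_min) / (z_max - z_min) * (num_slices - 1)
        lo = s.floor().long()                      # nearest lower plane
        hi = (lo + 1).clamp(max=num_slices - 1)    # nearest upper plane
        w_hi = (s - lo.float()).unsqueeze(1)       # linear weight toward upper plane
        w_lo = 1.0 - w_hi
        planes = feats.new_zeros(num_slices, feats.shape[1])
        planes.index_add_(0, lo, feats * w_lo)     # accumulate onto lower plane
        planes.index_add_(0, hi, feats * w_hi)     # accumulate onto upper plane
        return planes

    # Usage: 1024 random points with 32-dim features
    pts, f = torch.rand(1024, 3) * torch.tensor([5.0, 5.0, 3.0]), torch.randn(1024, 32)
    print(project_to_slices(pts, f).shape)  # torch.Size([8, 32])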

Authors:Shengyuan Liu, Zhen Chen, Qiushi Yang, Weihao Yu, Di Dong, Jiancong Hu, Yixuan Yuan
Title: Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion
Abstract:
Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms fail to accurately generate the details of polyp boundary regions and typically require medical priors to specify plausible locations and shapes of polyps, which limits the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first fully automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate state-of-the-art generation quality, and the synthetic images can improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zero-shot generalizability on other datasets. The source code is available at https://github.com/CUHK-AIM-Group/Polyp-Gen.
中文: Polyp-Gen是一种全自动基于扩散的框架,通过增强息肉边界细节和利用医学先验知识定位,生成逼真多样的内窥镜图像,以提升自动化诊断系统中的息肉检测能力,并展现出优异的泛化性能。
English: Polyp-Gen is a fully automated diffusion-based framework that generates realistic and diverse endoscopic images for automated diagnostic systems by enhancing polyp boundary details and using medical priors for localization, improving polyp detection and showing strong generalization.

Authors:Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng
Title: CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. Recent studies have attempted to mitigate this by applying Direct Preference Optimization (DPO) to multimodal scenarios using preference pairs from text-based responses. However, our analysis of representation distributions reveals that multimodal DPO struggles to align image and text representations and to distinguish between hallucinated and non-hallucinated descriptions. To address these challenges, we propose Cross-modal Hierarchical Direct Preference Optimization (CHiP). We introduce a visual preference optimization module within the DPO framework, enabling MLLMs to learn from both textual and visual preferences simultaneously. Furthermore, we propose a hierarchical textual preference optimization module that allows the model to capture preferences at multiple granular levels, including the response, segment, and token levels. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations. On the Object HalBench dataset, CHiP outperforms DPO in hallucination reduction, achieving relative improvements of 52.7% and 55.5% with the Muffin and LLaVA base models, respectively. We make all our datasets and code publicly available: https://github.com/LVUGAI/CHiP.
中文: 提出的跨模态分层直接偏好优化(CHiP)方法通过整合视觉和分层文本偏好,有效减少多模态模型中的幻觉现象,在多个基准测试中表现优于现有技术。
English: The proposed Cross-modal Hierarchical Direct Preference Optimization (CHiP) method enhances multimodal models by integrating visual and hierarchical textual preferences, significantly reducing hallucinations and outperforming existing techniques on benchmarks.
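
A rough sketch of how a textual and a visual DPO-style term could be combined in the spirit of CHiP; the relative weighting, the way log-probabilities are obtained, and the use of a corrupted image as the rejected visual condition are assumptions, and the hierarchical (segment/token-level) terms are omitted:

    # Minimal sketch of combining a textual and a visual DPO-style objective.
    import torch
    import torch.nn.functional as F

    def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        """Standard DPO loss given policy/reference log-probs of chosen (w) and rejected (l) responses."""
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -F.logsigmoid(margin).mean()

    def chip_style_loss(text_logps, visual_logps, lam=0.5):
        """text_logps / visual_logps: dicts of policy and reference log-probs.

        The textual term compares chosen vs. rejected responses for the same image;
        the visual term compares the same response conditioned on the original vs. a
        corrupted image, encouraging grounding in the visual input (an assumption).
        """
        l_text = dpo_term(**text_logps)
        l_vis = dpo_term(**visual_logps)
        return l_text + lam * l_vis

    # Usage with a dummy batch of log-probabilities
    b = {k: torch.randn(4) for k in ["logp_w", "logp_l", "ref_logp_w", "ref_logp_l"]}
    print(chip_style_loss(b, b).item())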

Authors:Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo
Title: CascadeV: An Implementation of Wurstchen Architecture for Video Generation
Abstract:
Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM), that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.
中文: 本文提出CascadeV,一种级联潜在扩散模型,通过采用时空交替网格注意力机制和实现更高压缩比,有效生成高质量2K视频并显著降低计算负担。
English: The paper introduces CascadeV, a cascaded latent diffusion model that efficiently generates high-resolution 2K videos by leveraging a spatiotemporal attention mechanism and achieving higher compression to reduce computational demands.

Authors:Xiaolei Liu, Yan Sun, Zhiliang Wang, Mark Nixon
Title: Unsupervised Domain Adaptation with Dynamic Clustering and Contrastive Refinement for Gait Recognition
Abstract:
Gait recognition is an emerging identification technology that distinguishes individuals at long distances by analyzing their walking patterns. Traditional techniques rely heavily on large-scale labeled datasets, which incur high costs and pose significant labeling challenges. Recently, researchers have explored unsupervised gait recognition with clustering-based unsupervised domain adaptation methods and achieved notable success. However, these methods directly use pseudo-labels generated by clustering and neglect the pseudo-label noise caused by domain differences, which degrades model training. To mitigate these issues, we propose a novel model called GaitDCCR, which aims to reduce the influence of noisy pseudo-labels on clustering and model training. Our approach consists of two main stages: a clustering stage and a training stage. In the clustering stage, we propose Dynamic Cluster Parameters (DCP) and Dynamic Weight Centroids (DWC) to improve the efficiency of clustering and obtain reliable cluster centroids. In the training stage, we employ the classical teacher-student structure and propose Confidence-based Pseudo-label Refinement (CPR) and a Contrastive Teacher Module (CTM) to encourage noisy samples to converge towards clusters containing their true identities. Extensive experiments on public gait datasets demonstrate that our simple and effective method significantly enhances the performance of unsupervised gait recognition, laying the foundation for its application in the real world. We will release the code at https://github.com/YanSun-github/GaitDCCR upon acceptance.
中文: 提出的GaitDCCR模型通过动态聚类优化和师生训练框架,有效解决了无监督步态识别中的伪标签噪声问题,在公开数据集上显著提升了识别性能。
English: The proposed GaitDCCR model addresses noisy pseudo-labels in unsupervised gait recognition through dynamic clustering optimization and a teacher-student training framework, significantly improving recognition accuracy on public datasets.
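
A hedged sketch of confidence-based pseudo-label refinement in a teacher-student setup, loosely following the CPR idea above; the threshold and the mixing rule are illustrative assumptions, not the exact GaitDCCR formulation:

    # Blend hard cluster pseudo-labels with teacher predictions and return
    # per-sample confidence weights that can down-weight noisy samples.
    import torch
    import torch.nn.functional as F

    def refine_pseudo_labels(teacher_logits, cluster_labels, num_clusters, tau=0.8):
        probs = F.softmax(teacher_logits, dim=1)              # (N, K) teacher beliefs
        conf, teacher_labels = probs.max(dim=1)               # per-sample confidence
        # Confident teacher predictions override the (possibly noisy) cluster labels
        refined = torch.where(conf > tau, teacher_labels, cluster_labels)
        one_hot = F.one_hot(refined, num_clusters).float()
        soft_targets = 0.5 * one_hot + 0.5 * probs            # soften toward teacher
        return soft_targets, conf                             # targets + loss weights

    logits = torch.randn(16, 10)
    clusters = torch.randint(0, 10, (16,))
    targets, w = refine_pseudo_labels(logits, clusters, num_clusters=10)
    print(targets.shape, w.shape)  # torch.Size([16, 10]) torch.Size([16])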

Authors:Tianyi Xie, Yiwei Zhao, Ying Jiang, Chenfanfu Jiang
Title: PhysAnimator: Physics-Guided Generative Cartoon Animation
Abstract:
Creating hand-drawn animation sequences is labor-intensive and demands professional expertise. We introduce PhysAnimator, a novel approach for generating physically plausible, anime-stylized animation from static anime illustrations. Our method seamlessly integrates physics-based simulations with data-driven generative models to produce dynamic and visually compelling animations. To capture the fluidity and exaggeration characteristic of anime, we perform image-space deformable body simulations on extracted mesh geometries. We enhance artistic control by introducing customizable energy strokes and incorporating rigging-point support, enabling the creation of tailored animation effects such as wind interactions. Finally, we extract and warp sketches from the simulation sequence, generating a texture-agnostic representation, and employ a sketch-guided video diffusion model to synthesize high-quality animation frames. The resulting animations exhibit temporal consistency and visual plausibility, demonstrating the effectiveness of our method in creating dynamic anime-style animations. See our project page for more demos: https://xpandora.github.io/PhysAnimator/

Authors:Oriol Barbany, Adrià Colomé, Carme Torras
Title: BiFold: Bimanual Cloth Folding with Language Guidance
Abstract:
Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.

Authors:Yash Yardi, Samuel Biruduganti, Lars Ankile
Title: Bridging the Sim2Real Gap: Vision Encoder Pre-Training for Visuomotor Policy Transfer
Abstract:
Simulation offers a scalable and efficient alternative to real-world data collection for learning visuomotor robotic policies. However, the simulation-to-reality, or Sim2Real distribution shift -- introduced by employing simulation-trained policies in real-world environments -- frequently prevents successful policy transfer. We present an offline framework to evaluate the performance of using large-scale pre-trained vision encoders to address the Sim2Real gap. We examine a diverse collection of encoders, assessing their ability to extract features necessary for robot control (Action Score) while remaining invariant to task-irrelevant environmental variations (Domain Invariance Score). Evaluating 23 encoders, we reveal patterns across architectures, pre-training datasets, and parameter scales. Our findings show that manipulation-pretrained encoders consistently achieve higher Action Scores, CNN-based encoders demonstrate stronger domain invariance than ViTs, and the best-performing models combine both properties, underscoring DIS and AS as complementary predictors of Sim2Real transferability.
中文: 本研究评估了大规模预训练视觉编码器以弥合Sim2Real差距,发现操作预训练模型在动作相关性上表现优异,CNN比ViT具有更强的域不变性,而最佳模型综合了这两种特性以实现有效的策略迁移。
English: This study evaluates large-scale pre-trained vision encoders to bridge the Sim2Real gap, finding that manipulation-pretrained models excel in action relevance, CNNs offer superior domain invariance over ViTs, and top performers combine both traits for effective policy transfer.
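
An illustrative sketch of a domain-invariance-style score: how stable an encoder's features remain under task-irrelevant visual changes. The exact metric in the paper may differ; pairing images by index and using mean cosine similarity are assumptions here:

    # Mean cosine similarity of encoder features across two visual domains of the
    # same scenes; higher means more domain-invariant.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def domain_invariance_score(encoder, imgs_a, imgs_b):
        """imgs_a / imgs_b: (N, 3, H, W) renders of the same scenes under two visual
        domains (e.g., different textures or lighting)."""
        fa = F.normalize(encoder(imgs_a).flatten(1), dim=1)
        fb = F.normalize(encoder(imgs_b).flatten(1), dim=1)
        return (fa * fb).sum(dim=1).mean().item()

    # Usage with a stand-in CNN encoder
    enc = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, 2), torch.nn.AdaptiveAvgPool2d(1))
    a, b = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
    print(domain_invariance_score(enc, a, b))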

Authors:Nicolas von Lützow, Matthias Nießner
Title: LinPrim: Linear Primitives for Differentiable Volumetric Rendering
Abstract:
Volumetric rendering has become central to modern novel view synthesis methods, which use differentiable rendering to optimize 3D scene representations directly from observed views. While many recent works build on NeRF or 3D Gaussians, we explore an alternative volumetric scene representation. More specifically, we introduce two new scene representations based on linear primitives - octahedra and tetrahedra - both of which define homogeneous volumes bounded by triangular faces. To optimize these primitives, we present a differentiable rasterizer that runs efficiently on GPUs, allowing end-to-end gradient-based optimization while maintaining real-time rendering capabilities. Through experiments on real-world datasets, we demonstrate comparable performance to state-of-the-art volumetric methods while requiring fewer primitives to achieve similar reconstruction fidelity. Our findings deepen the understanding of 3D representations by providing insights into the fidelity and performance characteristics of transparent polyhedra and suggest that adopting novel primitives can expand the available design space.

Authors:Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
Title: Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Abstract:
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba
中文: Mixture-of-Mamba通过模态感知稀疏性改进了状态空间模型,在文本、图像和语音的多模态预训练中,能以显著降低的计算成本实现同等性能。
English: Mixture-of-Mamba introduces modality-aware sparsity into State Space Models, enabling efficient multi-modal pretraining by achieving comparable performance with significantly reduced computational costs across text, image, and speech tasks.
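
A simplified sketch of modality-aware parameterization: each token is routed to a modality-specific projection while the rest of the block stays shared. This illustrates the sparsity principle rather than the exact Mamba block structure:

    # Route tokens to modality-specific linear projections by modality id.
    import torch
    import torch.nn as nn

    class ModalityAwareLinear(nn.Module):
        def __init__(self, dim, num_modalities=2):
            super().__init__()
            self.projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))

        def forward(self, x, modality_ids):
            """x: (B, L, D); modality_ids: (B, L) integers in [0, num_modalities)."""
            out = torch.zeros_like(x)
            for m, proj in enumerate(self.projs):
                mask = modality_ids == m             # tokens belonging to modality m
                if mask.any():
                    out[mask] = proj(x[mask])        # modality-specific weights
            return out

    layer = ModalityAwareLinear(dim=64, num_modalities=2)
    tokens = torch.randn(2, 10, 64)
    ids = torch.randint(0, 2, (2, 10))               # 0 = text, 1 = image (assumed)
    print(layer(tokens, ids).shape)  # torch.Size([2, 10, 64])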

Authors:Younggun Kim, Mohamed Abdel-Aty, Beomsik Cho, Seonghoon Ryoo, Soomok Lee
Title: MSCN: Multi-view Structural Convolution Network for Domain-Invariant Point Cloud Recognition of Autonomous Vehicles
Abstract:
Although LiDAR sensors have become indispensable for autonomous vehicles (AVs) due to their ability to provide accurate 3D scene understanding and robust perception under adverse weather conditions, the properties of LiDAR point clouds vary widely across sensor configurations and data acquisition domains, leading to severe performance degradation when models are transferred between heterogeneous sensors or from simulation to the real world. To address this challenge, we propose the Multi-view Structural Convolution Network (MSCN), a novel architecture designed to achieve domain-invariant recognition across diverse LiDAR configurations and environments. MSCN comprises Structural Convolution Layers (SCL) that extract local context geometric features from point clouds and Structural Aggregation Layers (SAL) that extract and aggregate both local and overall context features from point clouds. Furthermore, we incorporate an unseen domain generation strategy to mitigate domain gaps during training. Extensive experiments demonstrate that MSCN consistently outperforms state-of-the-art point cloud classification methods across all domain change scenarios. These results highlight MSCN as a scalable solution for deploying LiDAR-based perception systems of AVs. Our code is available at https://github.com/MLMLab/MSCN.
中文摘要:提出的多视角结构卷积网络(MSCN)通过特殊结构层和领域生成策略有效解决激光雷达领域适应问题,在不同传感器配置和环境中均展现出卓越性能。
English Summary: The proposed Multi-view Structural Convolution Network (MSCN) effectively addresses LiDAR domain adaptation challenges through specialized structural layers and domain generation strategies, demonstrating superior performance across varied sensor configurations and environments.

Authors:Li Pang, Jing Yao, Kaiyu Li, Xiangyong Cao
Title: SPECIAL: Zero-shot Hyperspectral Image Classification With CLIP
Abstract:
Hyperspectral image (HSI) classification aims at categorizing each pixel in an HSI into a specific land cover class, which is crucial for applications like remote sensing, environmental monitoring, and agriculture. Although deep learning-based HSI classification methods have achieved significant advancements, existing methods still rely on manually labeled data for training, which is both time-consuming and labor-intensive. To address this limitation, we introduce a novel zero-shot hyperspectral image classification framework based on CLIP (SPECIAL), aiming to eliminate the need for manual annotations. The SPECIAL framework consists of two main stages: (1) CLIP-based pseudo-label generation, and (2) noisy label learning. In the first stage, HSI is spectrally interpolated to produce RGB bands. These bands are subsequently classified using CLIP, resulting in noisy pseudo-labels that are accompanied by confidence scores. To improve the quality of these labels, we propose a scaling strategy that fuses predictions from multiple spatial scales. In the second stage, spectral information and a label refinement technique are incorporated to mitigate label noise and further enhance classification accuracy. Experimental results on three benchmark datasets demonstrate that our SPECIAL outperforms existing methods in zero-shot HSI classification, showing its potential for more practical applications. The code is available at https://github.com/LiPang/SPECIAL.
中文: 本文提出SPECIAL零样本高光谱图像分类框架,通过CLIP生成伪标签并结合多尺度融合与噪声处理技术消除人工标注需求,在基准数据集上实现了优越性能。
English: This paper introduces SPECIAL, a zero-shot hyperspectral image classification framework that eliminates manual annotations by generating CLIP-based pseudo-labels and refining them through multi-scale fusion and noise reduction techniques, achieving superior performance on benchmark datasets.
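
A hedged sketch of the multi-scale pseudo-label fusion idea: classify the RGB composite at several spatial scales and average the class probabilities. The `clip_classify` callable is a hypothetical stand-in for a CLIP-based zero-shot classifier, and the fusion rule is an assumption:

    # Fuse per-pixel class probabilities predicted at multiple spatial scales.
    import torch
    import torch.nn.functional as F

    def multiscale_pseudo_labels(rgb, clip_classify, scales=(1.0, 0.5, 0.25)):
        """rgb: (1, 3, H, W). clip_classify maps an image tensor to per-pixel class
        logits (1, K, h, w). Returns fused labels and confidences at full resolution."""
        _, _, H, W = rgb.shape
        probs_sum = None
        for s in scales:
            x = F.interpolate(rgb, scale_factor=s, mode="bilinear", align_corners=False)
            logits = clip_classify(x)                                   # (1, K, h, w)
            p = F.interpolate(logits.softmax(dim=1), size=(H, W),
                              mode="bilinear", align_corners=False)
            probs_sum = p if probs_sum is None else probs_sum + p
        probs = probs_sum / len(scales)
        conf, labels = probs.max(dim=1)                                 # per-pixel fusion
        return labels, conf

    # Usage with a dummy classifier producing 5 classes
    dummy = lambda x: torch.randn(1, 5, x.shape[-2] // 4, x.shape[-1] // 4)
    lbl, c = multiscale_pseudo_labels(torch.rand(1, 3, 64, 64), dummy)
    print(lbl.shape, c.shape)  # torch.Size([1, 64, 64]) torch.Size([1, 64, 64])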

Authors:Tatiana Taís Schein, Gustavo Pereira de Almeira, Stephanie Loi Brião, Rodrigo Andrade de Bem, Felipe Gomes de Oliveira, Paulo L. J. Drews-Jr
Title: UDBE: Unsupervised Diffusion-based Brightness Enhancement in Underwater Images
Abstract:
Activities in underwater environments are paramount in several scenarios, which drives the continuous development of underwater image enhancement techniques. A major challenge in this domain is the depth at which images are captured, with increasing depth resulting in a darker environment. Most existing methods for underwater image enhancement focus on noise removal and color adjustment, with few works dedicated to brightness enhancement. This work introduces a novel unsupervised learning approach to underwater image enhancement using a diffusion model. Our method, called UDBE, is based on conditional diffusion to maintain the brightness details of the unpaired input images. The input image is combined with a color map and a Signal-Noise Relation map (SNR) to ensure stable training and prevent color distortion in the output images. The results demonstrate that our approach achieves an impressive accuracy rate in the datasets UIEB, SUIM and RUIE, well-established underwater image benchmarks. Additionally, the experiments validate the robustness of our approach, regarding the image quality metrics PSNR, SSIM, UIQM, and UISM, indicating the good performance of the brightness enhancement process. The source code is available here: https://github.com/gusanagy/UDBE.
中文摘要:本文提出了一种名为UDBE的无监督扩散模型,用于水下图像增强,能有效提升亮度同时保留细节并防止色彩失真,在多个基准数据集上表现出优异的准确性和鲁棒性。
English Summary: This paper presents UDBE, an unsupervised diffusion model for underwater image enhancement that effectively improves brightness while preserving details and preventing color distortion, achieving high accuracy and robustness on benchmark datasets.

Authors:Zhongjin Luo, Yang Li, Mingrui Zhang, Senbo Wang, Han Yan, Xibin Song, Taizhang Shang, Wei Mao, Hongdong Li, Xiaoguang Han, Pan Ji
Title: BAG: Body-Aligned 3D Wearable Asset Generation
Abstract:
While recent advancements have shown remarkable progress in general 3D shape generation models, the challenge of leveraging these approaches to automatically generate wearable 3D assets remains unexplored. To this end, we present BAG, a Body-aligned Asset Generation method that outputs 3D wearable assets which can be automatically dressed on given 3D human bodies. This is achieved by controlling the 3D generation process using human body shape and pose information. Specifically, we first build a general single-image-to-consistent-multiview image diffusion model and train it on the large Objaverse dataset to achieve diversity and generalizability. Then we train a ControlNet to guide the multiview generator to produce body-aligned multiview images. The control signal utilizes the multiview 2D projections of the target human body, where pixel values represent the XYZ coordinates of the body surface in a canonical space. The body-conditioned multiview diffusion generates body-aligned multiview images, which are then fed into a native 3D diffusion model to produce the 3D shape of the asset. Finally, by recovering the similarity transformation using multiview silhouette supervision and addressing asset-body penetration with physics simulators, the 3D asset can be accurately fitted onto the target human body. Experimental results demonstrate significant advantages over existing methods in terms of image prompt-following capability, shape diversity, and shape quality. Our project page is available at https://bag-3d.github.io/.

Authors:Anh-Kiet Duong, Petra Gomez-Krämer
Title: Addressing Out-of-Label Hazard Detection in Dashcam Videos: Insights from the COOOL Challenge
Abstract:
This paper presents a novel approach for hazard analysis in dashcam footage, addressing the detection of driver reactions to hazards, the identification of hazardous objects, and the generation of descriptive captions. We first introduce a method for detecting driver reactions through speed and sound anomaly detection, leveraging unsupervised learning techniques. For hazard detection, we employ a set of heuristic rules as weak classifiers, which are combined using an ensemble method. This ensemble approach is further refined with differential privacy to mitigate overconfidence, ensuring robustness despite the lack of labeled data. Lastly, we use state-of-the-art vision-language models for hazard captioning, generating descriptive labels for the detected hazards. Our method achieved the highest scores in the Challenge on Out-of-Label in Autonomous Driving, demonstrating its effectiveness across all three tasks. Source codes are publicly available at https://github.com/ffyyytt/COOOL_2025.
中文: 本文提出了一种新颖的行车记录仪危险分析方法,通过速度和声音异常检测驾驶反应,采用集成启发式规则结合差分隐私识别危险物体,并利用先进视觉语言模型生成描述性标签,在COOOL 2025挑战赛中取得最佳成绩。
English: This paper introduces a novel method for analyzing hazards in dashcam footage by detecting driver reactions through speed and sound anomalies, identifying hazardous objects using an ensemble of heuristic rules enhanced with differential privacy, and generating descriptive captions with vision-language models, achieving top performance in the COOOL 2025 challenge.

Authors:Zhibo Ren, Pritthijit Nath, Pancham Shukla
Title: Improving Tropical Cyclone Forecasting With Video Diffusion Models
Abstract:
Tropical cyclone (TC) forecasting is crucial for disaster preparedness and mitigation. While recent deep learning approaches have shown promise, existing methods often treat TC evolution as a series of independent frame-to-frame predictions, limiting their ability to capture long-term dynamics. We present a novel application of video diffusion models for TC forecasting that explicitly models temporal dependencies through additional temporal layers. Our approach enables the model to generate multiple frames simultaneously, better capturing cyclone evolution patterns. We introduce a two-stage training strategy that significantly improves individual-frame quality and performance in low-data regimes. Experimental results show our method outperforms the previous approach of Nath et al. by 19.3% in MAE, 16.2% in PSNR, and 36.1% in SSIM. Most notably, we extend the reliable forecasting horizon from 36 to 50 hours. Through comprehensive evaluation using both traditional metrics and Fréchet Video Distance (FVD), we demonstrate that our approach produces more temporally coherent forecasts while maintaining competitive single-frame quality. Code accessible at https://github.com/Ren-creater/forecast-video-diffmodels.
中文摘要:本研究提出了一种新型视频扩散模型用于热带气旋预报,通过增强的时间依赖建模和两阶段训练策略,显著提升了预测的时间连贯性,并将可靠预报时长延长至50小时。
English Summary: This study introduces a novel video diffusion model for tropical cyclone forecasting that improves temporal coherence and extends the reliable prediction horizon to 50 hours through enhanced temporal modeling and a two-stage training strategy.

Authors:Michael Birsak, John Femiani, Biao Zhang, Peter Wonka
Title: MatCLIP: Light- and Shape-Insensitive Assignment of PBR Material Models
Abstract:
Assigning realistic materials to 3D models remains a significant challenge in computer graphics. We propose MatCLIP, a novel method that extracts shape- and lighting-insensitive descriptors of Physically Based Rendering (PBR) materials to assign plausible textures to 3D objects based on images, such as the output of Latent Diffusion Models (LDMs) or photographs. Matching PBR materials to static images is challenging because the PBR representation captures the dynamic appearance of materials under varying viewing angles, shapes, and lighting conditions. By extending an Alpha-CLIP-based model on material renderings across diverse shapes and lighting, and encoding multiple viewing conditions for PBR materials, our approach generates descriptors that bridge the domains of PBR representations with photographs or renderings, including LDM outputs. This enables consistent material assignments without requiring explicit knowledge of material relationships between different parts of an object. MatCLIP achieves a top-1 classification accuracy of 76.6%, outperforming state-of-the-art methods such as PhotoShape and MatAtlas by over 15 percentage points on publicly available datasets. Our method can be used to construct material assignments for 3D shape datasets such as ShapeNet, 3DCoMPaT++, and Objaverse. All code and data will be released.
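
A simplified sketch of light- and shape-insensitive material matching: average an embedding of each PBR material over many renderings under different shapes and lighting, then assign materials to query images by cosine similarity. The `embed` function is a hypothetical stand-in for the Alpha-CLIP-style encoder described above:

    # Build a descriptor bank per material and match queries by cosine similarity.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def build_material_bank(renders_per_material, embed):
        """renders_per_material: list of (R, 3, H, W) tensors, one per material."""
        bank = [F.normalize(embed(r).mean(dim=0), dim=0) for r in renders_per_material]
        return torch.stack(bank)                        # (M, D) descriptor per material

    @torch.no_grad()
    def assign_material(query_crop, bank, embed):
        q = F.normalize(embed(query_crop).squeeze(0), dim=0)
        sims = bank @ q                                 # cosine similarities
        return sims.argmax().item(), sims

    # Usage with a dummy encoder mapping images to 16-d embeddings
    embed = lambda x: torch.randn(x.shape[0], 16)
    bank = build_material_bank([torch.rand(5, 3, 32, 32) for _ in range(3)], embed)
    idx, sims = assign_material(torch.rand(1, 3, 32, 32), bank, embed)
    print(idx, sims.shape)  # e.g., 1 torch.Size([3])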

Authors:Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su
Title: Rethinking the Bias of Foundation Model under Long-tailed Distribution
Abstract:
Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances from pre-training affect long-tailed downstream tasks. Specifically, we identify the imbalance biases that foundation models inherit on downstream tasks as parameter imbalance and data imbalance. During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, we find that, unlike data imbalance, parameter imbalance cannot be effectively addressed during training by current re-balancing techniques such as adjusting the logits. To tackle both imbalances simultaneously, we build our method on causal learning and view the incomplete semantic factor as the confounder, which brings spurious correlations between input samples and labels. To resolve these negative effects, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Notably, we achieve an average performance increase of about $1.67\%$ on each dataset. Code is available: https://github.com/JiahaoChen1/Pre-train-Imbalance
中文摘要:本文研究了预训练数据不平衡对长尾下游任务的影响,识别出参数不平衡和数据不平衡是关键问题,并提出了一种基于因果学习的后门调整方法,在多个数据集上平均性能提升约1.67%。
English Summary: This paper investigates how imbalances in pre-training data affect long-tailed downstream tasks, identifying parameter and data imbalances as key issues, and proposes a novel causal learning-based backdoor adjustment method that achieves an average performance improvement of about 1.67% across datasets.

Authors:Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, Jiaming Liu
Title: Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Abstract:
Image-based virtual try-on (VTON) aims to generate a virtual try-on result by transferring an input garment onto a target person's image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. Also, it limits the ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. However, existing methods are typically limited to performing specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of our Any2AnyTryon and compare it with existing methods. The results show that Any2AnyTryon enables flexible, controllable, and high-quality image-based virtual try-on generation. https://logn-2024.github.io/Any2anyTryonProjectPage

Authors:Karahan Sarıtaş, Peter Dayan, Tingke Shen, Surabhi S Nath
Title: Complexity in Complexity: Understanding Visual Complexity Through Structure, Color, and Surprise
Abstract:
Understanding how humans perceive visual complexity is a key area of study in visual cognition. Previous approaches to modeling visual complexity assessments have often resulted in intricate, difficult-to-interpret algorithms that employ numerous features or sophisticated deep learning architectures. While these complex models achieve high performance on specific datasets, they often sacrifice interpretability, making it challenging to understand the factors driving human perception of complexity. Recently, Shen et al. (2024) proposed an interpretable segmentation-based model that accurately predicted complexity across various datasets, supporting the idea that complexity can be explained simply. In this work, we investigate the failure of their model to capture structural, color, and surprisal contributions to complexity. To this end, we propose the Multi-Scale Sobel Gradient (MSG), which measures spatial intensity variations; the Multi-Scale Unique Color (MUC) measure, which quantifies colorfulness across multiple scales; and surprise scores generated using a Large Language Model. We test our features on existing benchmarks and a novel dataset (Surprising Visual Genome) containing surprising images from Visual Genome. Our experiments demonstrate that modeling complexity accurately is not as simple as previously thought, requiring additional perceptual and semantic factors to address dataset biases. Our model improves predictive performance while maintaining interpretability, offering deeper insights into how visual complexity is perceived and assessed. Our code, analysis and data are available at https://github.com/Complexity-Project/Complexity-in-Complexity.
中文摘要:本研究通过引入多尺度梯度、色彩丰富度和语义惊奇度等特征,挑战了视觉复杂性可被简单解释的观点,在保持模型可解释性的同时提高了跨数据集的预测准确性。
English Summary: This study challenges the notion that visual complexity can be explained simply by introducing multi-scale gradient, colorfulness, and semantic surprise features, which improve prediction accuracy while maintaining model interpretability across diverse datasets.
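
A hedged re-implementation sketch of the two interpretable features described above: a multi-scale gradient-energy measure (approximating MSG with finite differences rather than an exact Sobel filter) and a multi-scale unique-color count in the spirit of MUC; scale choices, quantization, and normalization are assumptions:

    # Simple multi-scale gradient and unique-color features for an image.
    import numpy as np

    def multiscale_gradient(img_gray, num_scales=3):
        """img_gray: (H, W) float array in [0, 1]. Mean gradient magnitude over scales."""
        scores = []
        x = img_gray
        for _ in range(num_scales):
            gy, gx = np.gradient(x)                  # finite-difference gradients
            scores.append(np.sqrt(gx ** 2 + gy ** 2).mean())
            x = x[::2, ::2]                          # naive 2x downsampling
        return float(np.mean(scores))

    def multiscale_unique_colors(img_rgb, num_scales=3, bins=8):
        """img_rgb: (H, W, 3) uint8. Counts quantized unique colors at several scales."""
        counts = []
        x = img_rgb
        for _ in range(num_scales):
            q = (x // (256 // bins)).reshape(-1, 3)  # coarse color quantization
            counts.append(len(np.unique(q, axis=0)))
            x = x[::2, ::2]
        return float(np.mean(counts))

    rng = np.random.default_rng(0)
    gray = rng.random((128, 128))
    rgb = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
    print(multiscale_gradient(gray), multiscale_unique_colors(rgb))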

Authors:Adil Kaan Akan, Yucel Yemez
Title: Slot-Guided Adaptation of Pre-trained Diffusion Models for Object-Centric Learning and Compositional Generation
Abstract:
We present SlotAdapt, an object-centric learning method that combines slot attention with pretrained diffusion models by introducing adapters for slot-based conditioning. Our method preserves the generative power of pretrained diffusion models, while avoiding their text-centric conditioning bias. We also incorporate an additional guidance loss into our architecture to align cross-attention from adapter layers with slot attention. This enhances the alignment of our model with the objects in the input image without using external supervision. Experimental results show that our method outperforms state-of-the-art techniques in object discovery and image generation tasks across multiple datasets, including those with real images. Furthermore, we demonstrate through experiments that our method performs remarkably well on complex real-world images for compositional generation, in contrast to other slot-based generative methods in the literature. The project page can be found at https://kaanakan.github.io/SlotAdapt/.

Authors:Ruiqi Wu, Na Su, Chenran Zhang, Tengfei Ma, Tao Zhou, Zhiting Cui, Nianfeng Tang, Tianyu Mao, Yi Zhou, Wen Fan, Tianxing Wu, Shenqi Jing, Huazhu Fu
Title: MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining
Abstract:
Vision-language pretraining (VLP) has been investigated to generalize across diverse downstream tasks for fundus image analysis. Although recent methods showcase promising achievements, they significantly rely on large-scale private image-text data but pay less attention to the pretraining manner, which limits their further advancements. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. Then, we propose a novel fundus vision-language pretraining model, namely KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining is adopted to equip the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer, which is essentially based on a combination of global semantic concepts from contrastive learning and local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, delivering performance competitive to state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available via https://github.com/lxirich/MM-Retinal.
中文: 本文提出MM-Retinal V2多模态视网膜数据集和KeepFIT V2视觉语言预训练模型,该模型通过混合学习方法整合眼科专业知识,在不依赖大规模私有数据集的情况下显著提升了眼底图像分析性能。
English: This paper introduces MM-Retinal V2, a multimodal retinal dataset, and KeepFIT V2, a novel vision-language pretraining model that enhances fundus image analysis by integrating specialized ophthalmic knowledge through hybrid learning methods, achieving competitive performance without relying on large private datasets.

Authors:Karam Park, Jae Woong Soh, Nam Ik Cho
Title: Efficient Attention-Sharing Information Distillation Transformer for Lightweight Single Image Super-Resolution
Abstract:
Transformer-based Super-Resolution (SR) methods have demonstrated superior performance compared to convolutional neural network (CNN)-based SR approaches due to their capability to capture long-range dependencies. However, their high computational complexity necessitates the development of lightweight approaches for practical use. To address this challenge, we propose the Attention-Sharing Information Distillation (ASID) network, a lightweight SR network that integrates attention-sharing and an information distillation structure specifically designed for Transformer-based SR methods. We modify the information distillation scheme, originally designed for efficient CNN operations, to reduce the computational load of stacked self-attention layers, effectively addressing the efficiency bottleneck. Additionally, we introduce attention-sharing across blocks to further minimize the computational cost of self-attention operations. By combining these strategies, ASID achieves competitive performance with existing SR methods while requiring only around 300K parameters - significantly fewer than existing CNN-based and Transformer-based SR models. Furthermore, ASID outperforms state-of-the-art SR methods when the number of parameters is matched, demonstrating its efficiency and effectiveness. The code and supplementary material are available on the project page.
中文摘要:提出的注意力共享信息蒸馏(ASID)网络通过整合注意力共享和改进的信息蒸馏结构,有效解决了基于Transformer的超分辨率方法计算复杂度高的问题,仅用约30万参数即实现了与现有方法相媲美的性能。
English Summary: The proposed Attention-Sharing Information Distillation (ASID) network addresses the high computational complexity of Transformer-based super-resolution methods by integrating attention-sharing and modified information distillation, achieving competitive performance with only around 300K parameters.
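
A toy sketch of attention sharing across Transformer blocks: the attention map is computed once and reused by later blocks, which only project new values. This illustrates the cost-saving idea; the actual ASID block layout and its information distillation structure are more involved:

    # Blocks optionally reuse a previously computed attention map.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedAttentionBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.qk = nn.Linear(dim, 2 * dim)   # only used when computing a fresh map
            self.v = nn.Linear(dim, dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, attn=None):
            B, N, D = x.shape
            if attn is None:                                   # first block: full attention
                q, k = self.qk(x).chunk(2, dim=-1)
                attn = F.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1)
            v = self.v(x)                                      # later blocks reuse `attn`
            return x + self.proj(attn @ v), attn

    blocks = nn.ModuleList(SharedAttentionBlock(32) for _ in range(4))
    x, attn = torch.randn(2, 64, 32), None
    for i, blk in enumerate(blocks):
        x, attn = blk(x, attn if i % 2 else None)              # share within pairs of blocks
    print(x.shape)  # torch.Size([2, 64, 32])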

Authors:Jialun Cai, Mengyuan Liu, Hong Liu, Wenhao Li, Shuheng Zhou
Title: NanoHTNet: Nano Human Topology Network for Efficient 3D Human Pose Estimation
Abstract:
The widespread application of 3D human pose estimation (HPE) is limited by resource-constrained edge devices, requiring more efficient models. A key approach to enhancing efficiency involves designing networks based on the structural characteristics of input data. However, effectively utilizing the structural priors in human skeletal inputs remains challenging. To address this, we leverage both explicit and implicit spatio-temporal priors of the human body through innovative model design and a pre-training proxy task. First, we propose a Nano Human Topology Network (NanoHTNet), a tiny 3D HPE network with stacked Hierarchical Mixers to capture explicit features. Specifically, the spatial Hierarchical Mixer efficiently learns the human physical topology across multiple semantic levels, while the temporal Hierarchical Mixer with discrete cosine transform and low-pass filtering captures local instantaneous movements and global action coherence. Moreover, Efficient Temporal-Spatial Tokenization (ETST) is introduced to enhance spatio-temporal interaction and reduce computational complexity significantly. Second, PoseCLR is proposed as a general pre-training method based on contrastive learning for 3D HPE, aimed at extracting implicit representations of human topology. By aligning 2D poses from diverse viewpoints in the proxy task, PoseCLR aids 3D HPE encoders like NanoHTNet in more effectively capturing the high-dimensional features of the human body, leading to further performance improvements. Extensive experiments verify that NanoHTNet with PoseCLR outperforms other state-of-the-art methods in efficiency, making it ideal for deployment on edge devices like the Jetson Nano. Code and models are available at https://github.com/vefalun/NanoHTNet.
Chinese: 针对3D人体姿态估计在边缘设备上的应用限制,我们开发了NanoHTNet紧凑网络,通过分层混合器和ETST捕捉显式时空特征,并提出了PoseCLR预训练方法以增强隐式拓扑学习,经广泛实验验证实现了卓越的效率和性能。
English: To address the limitations of 3D human pose estimation on edge devices, we developed NanoHTNet, a compact network that captures explicit spatio-temporal features through hierarchical mixers and ETST, and introduced PoseCLR, a pre-training method that enhances implicit topology learning, achieving superior efficiency and performance validated by extensive experiments.

Authors:Ashim Dahal, Saydul Akbar Murad, Nick Rahimi
Title: Efficiency Bottlenecks of Convolutional Kolmogorov-Arnold Networks: A Comprehensive Scrutiny with ImageNet, AlexNet, LeNet and Tabular Classification
Abstract:
Algorithmic developments such as Convolutional Neural Networks, transformers, the attention mechanism, and Retrieval Augmented Generation have reshaped Artificial Intelligence. A recent such development is the Kolmogorov-Arnold Network, which challenges the fundamental design of neural networks and thus of Multilayer Perceptrons and Convolutional Neural Networks. These networks have been well received for scientific modeling, yet they have drawbacks in terms of efficiency. In this paper, we train Convolutional Kolmogorov-Arnold Networks (CKANs) on the ImageNet-1k dataset with 1.3 million images, the MNIST dataset with 60k images, and a tabular biological-science MoA dataset, and we assess the promise of CKANs in terms of FLOPS, inference time, number of trainable parameters, and training time against the accuracy, precision, recall, and F1 score they achieve relative to standard CNN models. We show that CKANs perform fairly, though more slowly than CNNs, on small datasets like MoA and MNIST, but are not nearly comparable as the dataset grows larger and more complex, as with ImageNet. The code implementation of this paper can be found at: https://github.com/ashimdahal/Study-of-Convolutional-Kolmogorov-Arnold-networks
中文摘要:本文评估了卷积柯尔莫哥洛夫-阿诺德网络(CKANs)与标准卷积神经网络的对比,结果表明CKANs在MNIST和MoA等小型数据集上表现尚可,但在ImageNet等大型复杂数据集上的效率和性能明显不足。
English Summary: This paper evaluates Convolutional Kolmogorov-Arnold Networks (CKANs) against standard CNNs, demonstrating that while CKANs perform fairly on smaller datasets like MNIST and MoA, they lag significantly in efficiency and performance on larger, complex datasets such as ImageNet.

Authors:Ayush Gupta, Rama Chellappa
Title: MimicGait: A Model Agnostic approach for Occluded Gait Recognition using Correlational Knowledge Distillation
Abstract:
Gait recognition is an important biometric technique over large distances. State-of-the-art gait recognition systems perform very well in controlled environments at close range. Recently, there has been an increased interest in gait recognition in the wild prompted by the collection of outdoor, more challenging datasets containing variations in terms of illumination, pitch angles, and distances. An important problem in these environments is that of occlusion, where the subject is partially blocked from camera view. While important, this problem has received little attention. Thus, we propose MimicGait, a model-agnostic approach for gait recognition in the presence of occlusions. We train the network using a multi-instance correlational distillation loss to capture both inter-sequence and intra-sequence correlations in the occluded gait patterns of a subject, utilizing an auxiliary Visibility Estimation Network to guide the training of the proposed mimic network. We demonstrate the effectiveness of our approach on challenging real-world datasets like GREW, Gait3D and BRIAR. We release the code in https://github.com/Ayush-00/mimicgait.
中文: MimicGait是一种模型无关的方法,通过多实例相关性蒸馏和可见性估计网络解决步态识别中的遮挡问题,并在真实数据集上验证了其有效性。
English: MimicGait is a model-agnostic approach that addresses occlusion challenges in gait recognition by using multi-instance correlational distillation and a visibility estimation network, demonstrating effectiveness on real-world datasets.

Authors:Vaclav Knapp, Matyas Bohacek
Title: Can Pose Transfer Models Generate Realistic Human Motion?
Abstract:
Recent pose-transfer methods aim to generate temporally consistent and fully controllable videos of human action where the motion from a reference video is reenacted by a new identity. We evaluate three state-of-the-art pose-transfer methods -- AnimateAnyone, MagicAnimate, and ExAvatar -- by generating videos with actions and identities outside the training distribution and conducting a participant study about the quality of these videos. In a controlled environment of 20 distinct human actions, we find that participants, presented with the pose-transferred videos, correctly identify the desired action only 42.92% of the time. Moreover, the participants find the actions in the generated videos consistent with the reference (source) videos only 36.46% of the time. These results vary by method: participants find the splatting-based ExAvatar more consistent and photorealistic than the diffusion-based AnimateAnyone and MagicAnimate.
Chinese: 一项针对三种姿态迁移方法的研究显示,参与者仅能识别42.92%的目标动作,对生成视频与参考视频动作一致性的认可度仅为36.46%,其中基于点云渲染的ExAvatar在真实感方面优于扩散模型方法。
English: A study evaluating three pose-transfer methods found that participants correctly identified target actions only 42.92% of the time and perceived motion consistency with reference videos in just 36.46% of cases, with ExAvatar outperforming diffusion-based alternatives in realism.

Authors:Jiajun Dong, Chengkun Wang, Wenzhao Zheng, Lei Chen, Jiwen Lu, Yansong Tang
Title: GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting
Abstract:
Effective image tokenization is crucial for both multi-modal understanding and generation tasks due to the necessity of alignment with discrete text data. To this end, existing approaches utilize vector quantization (VQ) to project pixels onto a discrete codebook and reconstruct images from the discrete representation. However, compared with the continuous latent space, the limited discrete codebook space significantly restricts the representational ability of these image tokenizers. In this paper, we propose GaussianToken, an effective image tokenizer with 2D Gaussian splatting, as a solution. We first represent the encoded samples as multiple flexible featured 2D Gaussians characterized by positions, rotation angles, scaling factors, and feature coefficients. We adopt standard quantization for the Gaussian features and then concatenate the quantization results with the other intrinsic Gaussian parameters before the corresponding splatting operation and the subsequent decoding module. In general, GaussianToken integrates the local influence of 2D Gaussian distributions into the discrete space and thus enhances the representation capability of the image tokenizer. Competitive reconstruction performance on CIFAR, Mini-ImageNet, and ImageNet-1K demonstrates the effectiveness of our framework. Our code is available at: https://github.com/ChrisDong-THU/GaussianToken.
中文摘要:GaussianToken通过将图像表示为灵活的二维高斯分布并量化其特征,有效提升了图像分词器的表示能力,克服了离散码本的局限性,在多个数据集上实现了优异的图像重建效果。
English Summary: GaussianToken enhances image tokenization by representing images with flexible 2D Gaussians and quantizing their features, overcoming the limitations of discrete codebooks and improving reconstruction performance across multiple datasets.

Authors:Chenglong Ma, Zilong Li, Yuanlin Li, Jing Han, Junping Zhang, Yi Zhang, Jiannan Liu, Hongming Shan
Title: Radiologist-in-the-Loop Self-Training for Generalizable CT Metal Artifact Reduction
Abstract:
Metal artifacts in computed tomography (CT) images can significantly degrade image quality and impede accurate diagnosis. Supervised metal artifact reduction (MAR) methods, trained using simulated datasets, often struggle to perform well on real clinical CT images due to a substantial domain gap. Although state-of-the-art semi-supervised methods use pseudo ground-truths generated by a prior network to mitigate this issue, their reliance on a fixed prior limits both the quality and quantity of these pseudo ground-truths, introducing confirmation bias and reducing clinical applicability. To address these limitations, we propose a novel Radiologist-In-the-loop SElf-training framework for MAR, termed RISE-MAR, which can integrate radiologists' feedback into the semi-supervised learning process, progressively improving the quality and quantity of pseudo ground-truths for enhanced generalization on real clinical CT images. For quality assurance, we introduce a clinical quality assessor model that emulates radiologist evaluations, effectively selecting high-quality pseudo ground-truths for semi-supervised training. For quantity assurance, our self-training framework iteratively generates additional high-quality pseudo ground-truths, expanding the clinical dataset and further improving model generalization. Extensive experimental results on multiple clinical datasets demonstrate the superior generalization performance of our RISE-MAR over state-of-the-art methods, advancing the development of MAR models for practical application. Code is available at https://github.com/Masaaki-75/rise-mar.
中文: 本文提出RISE-MAR框架,通过融入放射科医生反馈的自我训练机制,逐步提升伪真实标注的质量与数量,有效改善CT图像金属伪影消除效果,在真实临床数据上展现出优越的泛化性能。
English: This paper introduces RISE-MAR, a radiologist-in-the-loop self-training framework that enhances metal artifact reduction in CT images by progressively improving pseudo ground-truth quality and quantity through clinical feedback and iterative generation, demonstrating superior generalization on real clinical data.

Authors:Zhiyuan Fan, Weinong Wang, Xing Wu, Debing Zhang
Title: SedarEval: Automated Evaluation using Self-Adaptive Rubrics
Abstract:
The evaluation paradigm of LLM-as-judge gains popularity due to its significant reduction in human labor and time costs. This approach utilizes one or more large language models (LLMs) to assess the quality of outputs from other LLMs. However, existing methods rely on generic scoring rubrics that fail to consider the specificities of each question and its problem-solving process, compromising precision and stability in assessments. Inspired by human examination scoring processes, we propose a new evaluation paradigm based on self-adaptive rubrics. Specifically, we create detailed scoring rubrics for each question, capturing the primary and secondary criteria in a structured format of scoring and deduction points that mimic a human evaluator's analytical process. Building on this paradigm, we further develop a novel benchmark called SedarEval, which covers a range of domains including long-tail knowledge, mathematics, coding, and logical reasoning. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. To further streamline the evaluation, we train a specialized evaluator language model (evaluator LM) to supplant human graders. Using the same training data, our evaluator LM achieves a higher concordance rate with human grading results than other paradigms, including GPT-4, highlighting the superiority and efficiency of our approach. We release our dataset at https://github.com/wwn1233/sedareval.
Chinese: LLM作为评判者的评估范式虽降低了人工成本,但因通用评分标准缺乏精确性,故提出自适应评分标准和SedarEval基准,其专用评估语言模型在与人评分一致性上优于现有方法。
English: The LLM-as-judge evaluation paradigm reduces human effort but lacks precision due to generic rubrics, leading to the proposal of self-adaptive rubrics and the SedarEval benchmark with a specialized evaluator LM that outperforms existing methods in aligning with human grading.
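
An illustrative sketch of how a self-adaptive rubric might be represented and applied; the schema and the `judge` callable (which an evaluator LM would back in practice) are assumptions for illustration, not the released SedarEval format:

    # A per-question rubric with scoring and deduction points, applied by a judge.
    from dataclasses import dataclass, field

    @dataclass
    class RubricItem:
        description: str
        points: float          # positive = scoring point, negative = deduction point

    @dataclass
    class Rubric:
        question: str
        items: list = field(default_factory=list)

    def score_answer(answer, rubric, judge):
        """judge(answer, criterion) -> bool, e.g., backed by an evaluator LM."""
        total = 0.0
        for item in rubric.items:
            if judge(answer, item.description):
                total += item.points
        return max(total, 0.0)

    rubric = Rubric(
        question="Compute 12 * 13.",
        items=[RubricItem("States the correct final answer 156", 2.0),
               RubricItem("Shows an intermediate step (e.g., 12*10 + 12*3)", 1.0),
               RubricItem("Contains an arithmetic error", -1.0)],
    )
    naive_judge = lambda ans, crit: ("156" in ans) if "156" in crit else False
    print(score_answer("12 * 13 = 156", rubric, naive_judge))  # 2.0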

Authors:Romeo Sommerfeld, Christian Helms, Ralf Herbrich
Title: Approximate Message Passing for Bayesian Neural Networks
Abstract:
Bayesian neural networks (BNNs) offer the potential for reliable uncertainty quantification and interpretability, which are critical for trustworthy AI in high-stakes domains. However, existing methods often struggle with issues such as overconfidence, hyperparameter sensitivity, and posterior collapse, leaving room for alternative approaches. In this work, we advance message passing (MP) for BNNs and present a novel framework that models the predictive posterior as a factor graph. To the best of our knowledge, our framework is the first MP method that handles convolutional neural networks and avoids double-counting training data, a limitation of previous MP methods that causes overconfidence. We evaluate our approach on CIFAR-10 with a convolutional neural network of roughly 890k parameters and find that it can compete with the SOTA baselines AdamW and IVON, even having an edge in terms of calibration. On synthetic data, we validate the uncertainty estimates and observe a strong correlation (0.9) between posterior credible intervals and its probability of covering the true data-generating function outside the training range. While our method scales to an MLP with 5.6 million parameters, further improvements are necessary to match the scale and performance of state-of-the-art variational inference methods.
中文: 本研究为贝叶斯神经网络提出了一种新颖的消息传递框架,该框架能有效处理卷积网络并避免数据重复计算,在CIFAR-10数据集上展现出与先进基线相当的校准性能,并在合成数据上验证了可靠的不确定性量化能力,同时承认其在扩展性方面仍需改进。
English: This study introduces a novel message passing framework for Bayesian neural networks that effectively handles convolutional networks and prevents data double-counting, achieving competitive calibration on CIFAR-10 and demonstrating reliable uncertainty quantification on synthetic data while acknowledging scalability limitations compared to state-of-the-art methods.

Authors:Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan, Wenjun Wu, Lei Huang
Title: TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler
Abstract:
Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Though large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on over 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers. Furthermore, lightweight models face persistent challenges in effectively processing long visual sequences and in temporal understanding. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike traditional image-level resamplers, our approach effectively mitigates redundancy while enhancing temporal comprehension, leading to improved performance on video-based tasks. In addition, TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs. It surpasses several existing 7B-parameter models on multiple benchmarks. We believe this work provides a valuable foundation for future research on lightweight video understanding models. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.
中文摘要:本文提出TinyLLaVA-Video这一轻量级3.6B参数模型,通过创新的视频级分组重采样器有效减少视觉标记冗余并增强时序理解能力,在仅需少量训练资源的情况下实现了超越多个7B参数模型的性能表现。
English Summary: This paper introduces TinyLLaVA-Video, a lightweight 3.6B-parameter model that uses a novel video-level group resampler to reduce visual token redundancy and enhance temporal understanding, achieving superior performance over larger models while requiring minimal training resources.
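The abstract does not spell out the group resampler's internals, so the following is only a generic Perceiver-style token resampler: a fixed set of learnable queries cross-attends to all frame tokens and returns a length-independent summary. Class and parameter names are placeholders, and TinyLLaVA-Video's actual video-level grouping may differ.

```python
# Illustrative sketch only: a generic learnable-query resampler for long visual token sequences.

import torch
import torch.nn as nn

class TokenResampler(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_frames * tokens_per_frame, dim)
        b = visual_tokens.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(visual_tokens)
        out, _ = self.attn(q, kv, kv)          # cross-attention: queries attend to all tokens
        return out + self.mlp(out)             # fixed-length summary, independent of video length

if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 196, 768)     # 16 frames, 196 tokens each
    resampler = TokenResampler()
    print(resampler(tokens).shape)             # torch.Size([2, 64, 768])
```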

Authors:JiaKui Hu, Lujia Jin, Zhengjian Yao, Yanye Lu
Title: Universal Image Restoration Pre-training via Degradation Classification
Abstract:
This paper proposes the Degradation Classification Pre-Training (DCPT), which enables models to learn how to classify the degradation type of input images for universal image restoration pre-training. Unlike the existing self-supervised pre-training methods, DCPT utilizes the degradation type of the input image as an extremely weak supervision, which can be effortlessly obtained and is even intrinsic to all image restoration datasets. DCPT comprises two primary stages. Initially, image features are extracted from the encoder. Subsequently, a lightweight decoder, such as ResNet18, is leveraged to classify the degradation type of the input image solely based on the features extracted in the first stage, without utilizing the input image. The encoder, pre-trained with this straightforward yet potent DCPT, is then used to address universal image restoration and achieves outstanding performance. Following DCPT, both convolutional neural networks (CNNs) and transformers demonstrate performance improvements, with gains of up to 2.55 dB in the 10D all-in-one restoration task and 6.53 dB in the mixed degradation scenarios. Moreover, previous self-supervised pretraining methods, such as masked image modeling, discard the decoder after pre-training, while our DCPT utilizes the pre-trained parameters more effectively. This superiority arises from the degradation classifier acquired during DCPT, which facilitates transfer learning between models of identical architecture trained on diverse degradation types. Source code and models are available at https://github.com/MILab-PKU/dcpt.
Chinese: 本文提出退化分类预训练(DCPT)方法,通过将图像退化类型分类作为弱监督信号进行通用图像恢复预训练,在多种网络架构和任务中均实现了显著的性能提升。
English: This paper introduces Degradation Classification Pre-Training (DCPT), a method that uses degradation type classification as weak supervision to pre-train models for universal image restoration, achieving significant performance gains across various architectures and tasks.
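A minimal sketch of the pre-training signal described above, under placeholder architectures: a restoration encoder is optimized so that a lightweight classifier can predict the degradation type from the encoder's features alone. The label set and network shapes are assumptions, not the released code.

```python
# A minimal sketch of degradation-classification pre-training with placeholder networks.

import torch
import torch.nn as nn

NUM_DEGRADATIONS = 10   # e.g., noise, blur, rain, haze, ... (label set is an assumption)

encoder = nn.Sequential(                       # stand-in for a restoration encoder
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
)
classifier = nn.Sequential(                    # lightweight decoder used only for pre-training
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, NUM_DEGRADATIONS),
)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def pretrain_step(degraded_images: torch.Tensor, degradation_labels: torch.Tensor) -> float:
    feats = encoder(degraded_images)           # classify from features, not from the raw image
    loss = loss_fn(classifier(feats), degradation_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    x = torch.randn(4, 3, 64, 64)
    y = torch.randint(0, NUM_DEGRADATIONS, (4,))
    print(pretrain_step(x, y))
```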

Authors:Dan Song, Shumeng Huo, Wenhui Li, Lanjun Wang, Chao Xue, An-An Liu
Title: Domain Adaptation from Generated Multi-Weather Images for Unsupervised Maritime Object Classification
Abstract:
The classification and recognition of maritime objects are crucial for enhancing maritime safety, monitoring, and intelligent sea environment prediction. However, existing unsupervised methods for maritime object classification often struggle with the long-tail data distributions in both object categories and weather conditions. In this paper, we construct a dataset named AIMO produced by large-scale generative models with diverse weather conditions and balanced object categories, and collect a dataset named RMO with real-world images where the long-tail issue exists. We propose a novel domain adaptation approach that leverages AIMO (source domain) to address the problems of limited labeled data, unbalanced distribution, and domain shift in RMO (target domain), enhances the generalization of source features with vision-language models such as CLIP, and introduces a difficulty score for curriculum learning to optimize the training process. Experimental results show that the proposed method significantly improves the classification accuracy, particularly for samples within rare object categories and weather conditions. Datasets and codes will be publicly available at https://github.com/honoria0204/AIMO.
中文摘要:本文提出一种新颖的域自适应方法,利用生成数据集AIMO解决真实海域物体分类中的长尾分布和域偏移问题,通过CLIP增强特征和课程学习显著提升了稀有类别和恶劣天气条件下的分类准确率。
English Summary: This paper introduces a novel domain adaptation method using a generative dataset (AIMO) to address long-tail distribution and domain shift in real maritime object classification, significantly improving accuracy for rare categories and weather conditions through CLIP-enhanced features and curriculum learning.

Authors:Zhenkai Wu, Xiaowen Ma, Rongrong Lian, Kai Zheng, Mengting Ma, Wei Zhang, Siyang Song
Title: CD-Lamba: Boosting Remote Sensing Change Detection via a Cross-Temporal Locally Adaptive State Space Model
Abstract:
Mamba, with its advantages of global perception and linear complexity, has been widely applied to identify changes of the target regions within the remote sensing (RS) images captured under complex scenarios and varied conditions. However, existing remote sensing change detection (RSCD) approaches based on Mamba frequently struggle to effectively perceive the inherent locality of change regions as they directly flatten and scan RS images (i.e., the features of the same change region are not distributed continuously within the sequence but are mixed with features from other regions throughout the sequence). In this paper, we propose a novel locally adaptive SSM-based approach, termed CD-Lamba, which effectively enhances the locality of change detection while maintaining global perception. Specifically, our CD-Lamba includes a Locally Adaptive State-Space Scan (LASS) strategy for locality enhancement, a Cross-Temporal State-Space Scan (CTSS) strategy for bi-temporal feature fusion, and a Window Shifting and Perception (WSP) mechanism to enhance interactions across segmented windows. These strategies are integrated into a multi-scale Cross-Temporal Locally Adaptive State-Space Scan (CT-LASS) module to effectively highlight changes and refine change representations. CD-Lamba significantly enhances local-global spatio-temporal interactions in bi-temporal images, offering improved performance in RSCD tasks. Extensive experimental results show that CD-Lamba achieves state-of-the-art performance on four benchmark datasets with a satisfactory efficiency-accuracy trade-off. Our code is publicly available at https://github.com/xwmaxwma/rschange.
中文: 基于Mamba的遥感变化检测方法因扁平化扫描难以有效感知局部变化区域,因此提出的CD-Lamba通过局部自适应和跨时序扫描策略增强时空交互,在基准数据集上实现了最优性能。
English: Mamba-based remote sensing change detection methods often fail to effectively capture local change regions due to flattened scanning, so the proposed CD-Lamba introduces locally adaptive and cross-temporal scanning strategies to enhance local-global spatio-temporal interactions, achieving state-of-the-art performance on benchmark datasets.
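To make the locality argument concrete, the toy function below flattens a feature map window by window, so tokens from the same local region stay contiguous in the 1D sequence an SSM would consume, unlike a global row-major scan. It is a simplification of the LASS idea, not the released implementation.

```python
# Illustrative only: window-by-window flattening versus a global row-major scan.

import torch

def windowed_scan(feat: torch.Tensor, window: int = 4) -> torch.Tensor:
    """feat: (B, C, H, W) -> sequence (B, H*W, C) ordered window by window."""
    b, c, h, w = feat.shape
    assert h % window == 0 and w % window == 0
    x = feat.reshape(b, c, h // window, window, w // window, window)
    # (B, C, nH, win, nW, win) -> (B, nH, nW, win, win, C): windows first, pixels inside next
    x = x.permute(0, 2, 4, 3, 5, 1).contiguous()
    return x.reshape(b, h * w, c)

if __name__ == "__main__":
    feat = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)
    seq = windowed_scan(feat, window=4)
    print(seq[0, :16, 0])   # the first 16 tokens all come from the top-left 4x4 window
```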

Authors:Zengran Wang, Yanan Zhang, Jiaxin Chen, Di Huang
Title: Breaking the SSL-AL Barrier: A Synergistic Semi-Supervised Active Learning Framework for 3D Object Detection
Abstract:
To address the annotation burden in LiDAR-based 3D object detection, active learning (AL) methods offer a promising solution. However, traditional active learning approaches solely rely on a small amount of labeled data to train an initial model for data selection, overlooking the potential of leveraging the abundance of unlabeled data. Recently, attempts to integrate semi-supervised learning (SSL) into AL with the goal of leveraging unlabeled data have faced challenges in effectively resolving the conflict between the two paradigms, resulting in less satisfactory performance. To tackle this conflict, we propose a Synergistic Semi-Supervised Active Learning framework, dubbed as S-SSAL. Specifically, from the perspective of SSL, we propose a Collaborative PseudoScene Pre-training (CPSP) method that effectively learns from unlabeled data without introducing adverse effects. From the perspective of AL, we design a Collaborative Active Learning (CAL) method, which complements the uncertainty and diversity methods by model cascading. This allows us to fully exploit the potential of the CPSP pre-trained model. Extensive experiments conducted on KITTI and Waymo demonstrate the effectiveness of our S-SSAL framework. Notably, on the KITTI dataset, utilizing only 2% labeled data, S-SSAL can achieve performance comparable to models trained on the full dataset. The code has been released at https://github.com/LandDreamer/S_SSAL.
Chinese: 为解决激光雷达3D物体检测中的标注负担,S-SSAL框架通过协同伪场景预训练和协同主动学习方法,将半监督学习与主动学习有机结合,在KITTI数据集上仅使用2%标注数据即可达到全监督模型的性能水平。
English: To address the annotation burden in LiDAR-based 3D object detection, the proposed S-SSAL framework synergistically integrates semi-supervised learning with active learning through Collaborative PseudoScene Pre-training and Collaborative Active Learning, achieving performance comparable to fully supervised models using only 2% labeled data on KITTI.
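The abstract does not detail the CAL scoring rule, so the following is only a generic active-learning selection sketch that mixes predictive uncertainty with greedy farthest-point diversity in feature space; the uncertainty scores and features are placeholders for whatever the detector provides.

```python
# A generic active-learning selection sketch, not the paper's CAL module.

import torch

def select_for_annotation(uncertainty: torch.Tensor, features: torch.Tensor,
                          budget: int, alpha: float = 0.5) -> list[int]:
    """uncertainty: (N,), features: (N, D); returns indices of samples to annotate."""
    selected: list[int] = [int(uncertainty.argmax())]        # seed with the most uncertain sample
    for _ in range(budget - 1):
        # distance of every sample to its nearest already-selected sample (diversity term)
        d = torch.cdist(features, features[selected]).min(dim=1).values
        score = alpha * uncertainty + (1 - alpha) * d / (d.max() + 1e-8)
        score[selected] = -float("inf")                       # never re-select
        selected.append(int(score.argmax()))
    return selected

if __name__ == "__main__":
    torch.manual_seed(0)
    unc = torch.rand(100)
    feats = torch.randn(100, 16)
    print(select_for_annotation(unc, feats, budget=5))
```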

Authors:Han Wang, Rui Yang Tan, Roy Ka-Wei Lee
Title: Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection
Abstract:
Detecting hate speech in online content is essential to ensuring safer digital spaces. While significant progress has been made in text and meme modalities, video-based hate speech detection remains under-explored, hindered by a lack of annotated datasets and the high cost of video annotation. This gap is particularly problematic given the growing reliance on large models, which demand substantial amounts of training data. To address this challenge, we leverage meme datasets as both a substitution and an augmentation strategy for training hateful video detection models. Our approach introduces a human-assisted reannotation pipeline to align meme dataset labels with video datasets, ensuring consistency with minimal labeling effort. Using two state-of-the-art vision-language models, we demonstrate that meme data can substitute for video data in resource-scarce scenarios and augment video datasets to achieve further performance gains. Our results consistently outperform state-of-the-art benchmarks, showcasing the potential of cross-modal transfer learning for advancing hateful video detection. Dataset and code are available at https://github.com/Social-AI-Studio/CrossModalTransferLearning.
中文摘要:本研究提出一种跨模态迁移学习方法,利用表情包数据集训练仇恨视频检测模型,通过人工辅助标签对齐和数据增强策略实现了最先进的性能表现。
English Summary: This study introduces a cross-modal transfer learning approach that utilizes meme datasets to train hateful video detection models, achieving state-of-the-art performance through human-assisted label alignment and data augmentation strategies.

Authors:Hossein Mirzaei, Mojtaba Nafez, Jafar Habibi, Mohammad Sabokrou, Mohammad Hossein Rohban
Title: Mitigating Spurious Negative Pairs for Robust Industrial Anomaly Detection
Abstract:
Despite significant progress in Anomaly Detection (AD), the robustness of existing detection methods against adversarial attacks remains a challenge, compromising their reliability in critical real-world applications such as autonomous driving. This issue primarily arises from the AD setup, which assumes that training data is limited to a group of unlabeled normal samples, making the detectors vulnerable to adversarial anomaly samples during testing. Additionally, implementing adversarial training as a safeguard encounters difficulties, such as formulating an effective objective function without access to labels. An ideal objective function for adversarial training in AD should promote strong perturbations both within and between the normal and anomaly groups to maximize the margin between the normal and anomaly distributions. To address these issues, we first propose crafting a pseudo-anomaly group derived from normal group samples. Then, we demonstrate that adversarial training with contrastive loss could serve as an ideal objective function, as it creates both inter- and intra-group perturbations. However, we notice that spurious negative pairs prevent the conventional contrastive loss from achieving robust AD. Spurious negative pairs are those that should be closely mapped but are erroneously separated. These pairs introduce noise and misguide the direction of inter-group adversarial perturbations. To overcome the effect of spurious negative pairs, we define opposite pairs and adversarially pull them apart to strengthen inter-group perturbations. Experimental results demonstrate our superior performance in both clean and adversarial scenarios, with a 26.1% improvement in robust detection across various challenging benchmark datasets. The implementation of our work is available at: https://github.com/rohban-lab/COBRA.
中文摘要:本研究针对异常检测系统在对抗攻击下的脆弱性问题,提出了一种基于对比损失和伪异常样本生成的新型对抗训练方法,通过在基准测试中实现26.1%的鲁棒性提升,有效增强了系统防御能力。
English Summary: This study addresses the vulnerability of anomaly detection systems to adversarial attacks by proposing a novel adversarial training method using contrastive loss and pseudo-anomaly generation, which significantly enhances robustness by 26.1% across various benchmarks.
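As a rough illustration of the training signal (and only that), the snippet below implements a two-group SupCon-style contrastive loss in which embeddings of normal views attract each other and pseudo-anomaly embeddings are pushed away. The paper's adversarial perturbations, opposite pairs, and spurious-negative handling are deliberately omitted.

```python
# A highly simplified two-group contrastive loss; pseudo-anomaly construction is assumed.

import torch
import torch.nn.functional as F

def two_group_contrastive(z_normal: torch.Tensor, z_pseudo: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """z_normal: (N, D) embeddings of normal views; z_pseudo: (M, D) pseudo-anomaly embeddings."""
    z = F.normalize(torch.cat([z_normal, z_pseudo], dim=0), dim=1)
    labels = torch.cat([torch.zeros(len(z_normal)), torch.ones(len(z_pseudo))]).long()
    sim = z @ z.t() / temperature
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(eye, -float("inf"))                 # ignore self-similarity
    same_group = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability of landing on a same-group sample (SupCon-style)
    loss = -(same_group * log_prob).sum(dim=1) / same_group.sum(dim=1).clamp(min=1)
    return loss.mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    print(two_group_contrastive(torch.randn(8, 32), torch.randn(8, 32)).item())
```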

Authors:Liang Shang, William A. Sethares, Anusha Adluru, Andrew L. Alexander, Vivek Prabhakaran, Veena A. Nair, Nagesh Adluru
Title: Stroke Lesion Segmentation using Multi-Stage Cross-Scale Attention
Abstract:
Precise characterization of stroke lesions from MRI data has immense value in prognosticating clinical and cognitive outcomes following a stroke. Manual stroke lesion segmentation is time-consuming and requires the expertise of neurologists and neuroradiologists. Often, lesions are grossly characterized for their location and overall extent using bounding boxes without specific delineation of their boundaries. While such characterization provides some clinical value, to develop a precise mechanistic understanding of the impact of lesions on post-stroke vascular contributions to cognitive impairments and dementia (VCID), the stroke lesions need to be fully segmented with accurate boundaries. This work introduces the Multi-Stage Cross-Scale Attention (MSCSA) mechanism, applied to the U-Net family, to improve the mapping between brain structural features and lesions of varying sizes. Using the Anatomical Tracings of Lesions After Stroke (ATLAS) v2.0 dataset, MSCSA outperforms all baseline methods in both Dice and F1 scores on a subset focusing on small lesions, while maintaining competitive performance across the entire dataset. Notably, the ensemble strategy incorporating MSCSA achieves the highest scores for Dice and F1 on both the full dataset and the small lesion subset. These results demonstrate the effectiveness of MSCSA in segmenting small lesions and highlight its robustness across different training schemes for large stroke lesions. Our code is available at: https://github.com/nadluru/StrokeLesSeg.
中文: 多阶段跨尺度注意力机制应用于U-Net架构,通过增强脑部结构特征与不同尺寸病灶的映射关系,在ATLAS v2.0数据集中对小病灶分割表现优异,并在整体数据集上保持稳健性能。
English: The Multi-Stage Cross-Scale Attention (MSCSA) mechanism enhances U-Net-based stroke lesion segmentation by effectively mapping brain structural features to lesions of varying sizes, achieving superior performance on small lesions and robust results across the ATLAS v2.0 dataset.

Authors:Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, Jun Zhu
Title: Visual Generation Without Guidance
Abstract:
Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free. Code will be available at https://github.com/thu-ml/GFT.
Chinese: 提出的无引导训练(GFT)方法通过从头开始训练视觉模型,无需引导采样即可达到分类器自由引导的性能,同时在多个生成领域中保持相似的多样性-保真度权衡,并将计算成本减半。
English: The proposed Guidance-Free Training (GFT) method eliminates the need for guided sampling by training visual models directly from scratch, matching the performance of Classifier-Free Guidance while halving computational costs and maintaining similar diversity-fidelity trade-offs across multiple generative domains.

Authors:Siqi Fan, Yuguang Xie, Bowen Cai, Ailin Xie, Gaochao Liu, Mu Qiao, Jie Xing, Zaiqing Nie
Title: OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery
Abstract:
Understanding the chemical structure from a graphical representation of a molecule is a challenging image caption task that would greatly benefit molecule-centric scientific discovery. Variations in molecular images and caption subtasks pose a significant challenge in both image representation learning and task modeling. Yet, existing methods only focus on a specific caption task that translates a molecular image into its graph structure, i.e., OCSR. In this paper, we propose the Optical Chemical Structure Understanding (OCSU) task, which extends low-level recognition to multilevel understanding and aims to translate chemical structure diagrams into readable strings for both machine and chemist. To facilitate the development of OCSU technology, we explore both OCSR-based and OCSR-free paradigms. We propose DoubleCheck to enhance OCSR performance via attentive feature enhancement for local ambiguous atoms. It can be cascaded with existing SMILES-based molecule understanding methods to achieve OCSU. Meanwhile, Mol-VL is a vision-language model end-to-end optimized for OCSU. We also construct Vis-CheBI20, the first large-scale OCSU dataset. Through comprehensive experiments, we demonstrate the proposed approaches excel at providing chemist-readable caption for chemical structure diagrams, which provide solid baselines for further research. Our code, model, and data are open-sourced at https://github.com/PharMolix/OCSU.
中文摘要:本文提出光学化学结构理解(OCSU)任务,将分子图像转化为机器可读且化学家可理解的描述,通过两种创新方法和新构建的数据集显著提升了分子图像解析性能。
English Summary: This paper introduces the Optical Chemical Structure Understanding (OCSU) task to translate molecular diagrams into machine-readable and chemist-friendly captions, proposing two innovative methods and a new dataset that outperform existing approaches.

Authors:Zhiming Wang, Lin Gu, Feng Lu
Title: TdAttenMix: Top-Down Attention Guided Mixup
Abstract:
CutMix is a data augmentation strategy that cuts and pastes image patches to mix up training data. Existing methods pick either random or salient areas that are often inconsistent with the labels, thus misguiding model training. To the best of our knowledge, we are the first to integrate human gaze to guide CutMix. Since human attention is driven by both high-level recognition and low-level clues, we propose a controllable Top-down Attention Guided Module to obtain a general artificial attention that balances top-down and bottom-up attention. The proposed TdAttenMix then picks patches and adjusts the label mixing ratio to focus on regions relevant to the current label. Experimental results demonstrate that our TdAttenMix outperforms existing state-of-the-art mixup methods across eight different benchmarks. Additionally, we introduce a new metric based on the human gaze and use this metric to investigate the issue of image-label inconsistency. Project page: https://github.com/morning12138/TdAttenMix
中文: TdAttenMix是一种新型CutMix数据增强方法,首次通过自上而下的注意力模块整合人类注视信息,选择与标签一致的图像区域并调整混合比例,在八个基准测试中实现最优性能,同时解决了图像标签不一致问题。
English: TdAttenMix is a novel CutMix data augmentation method that integrates human gaze through a top-down attention module to select label-consistent image patches and adjust mixing ratios, achieving superior performance across eight benchmarks while addressing image-label inconsistency.
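A hedged sketch of attention-guided mixing: place the cut box where the source attention map is strongest and derive the label mixing ratio from the attention mass that was actually pasted rather than from the raw box area. The attention maps are treated as given inputs; the exact TdAttenMix formulation may differ.

```python
# Illustrative sketch under assumptions: attention-guided CutMix with attention-based label ratio.

import torch
import torch.nn.functional as F

def attention_guided_cutmix(img_a, img_b, attn_a, attn_b, box: int = 64):
    """img_*: (C, H, W); attn_*: (H, W) non-negative attention maps summing to 1."""
    c, h, w = img_a.shape
    # slide a box x box window over attn_b and pick the location with maximal attention mass
    mass = F.avg_pool2d(attn_b[None, None], box, stride=1) * box * box
    idx = mass.flatten().argmax()
    y0, x0 = int(idx // mass.shape[-1]), int(idx % mass.shape[-1])
    mixed = img_a.clone()
    mixed[:, y0:y0 + box, x0:x0 + box] = img_b[:, y0:y0 + box, x0:x0 + box]
    # label weight for image b = pasted attention mass of b vs. remaining attention mass of a
    mass_b = attn_b[y0:y0 + box, x0:x0 + box].sum()
    mass_a = attn_a.sum() - attn_a[y0:y0 + box, x0:x0 + box].sum()
    lam_b = (mass_b / (mass_a + mass_b)).item()
    return mixed, 1.0 - lam_b, lam_b   # mixed image, weight for label a, weight for label b

if __name__ == "__main__":
    torch.manual_seed(0)
    a, b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
    attn_a = torch.softmax(torch.rand(224 * 224), 0).reshape(224, 224)
    attn_b = torch.softmax(torch.rand(224 * 224), 0).reshape(224, 224)
    _, wa, wb = attention_guided_cutmix(a, b, attn_a, attn_b)
    print(wa, wb)
```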

Authors:Long Yang, Lianqing Zheng, Wenjin Ai, Minghao Liu, Sen Li, Qunshu Lin, Shengyu Yan, Jie Bai, Zhixiong Ma, Tao Huang, Xichan Zhu
Title: MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies
Abstract:
Robust 3D occupancy prediction is essential for autonomous driving, particularly under adverse weather conditions where traditional vision-only systems struggle. While the fusion of surround-view 4D radar and cameras offers a promising low-cost solution, effectively extracting and integrating features from these heterogeneous sensors remains challenging. This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction that leverages both multi-view 4D radar and images. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module that enhances vertical spatial reasoning and feature extraction. Additionally, a Hierarchical Multi-scale Multi-modal Fusion strategy is developed to perform adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignments and enriching fused feature representations. To reduce reliance on expensive point cloud annotations, we further propose a pseudo-label generation pipeline based on an open-set segmentor. This enables a semi-supervised strategy that achieves 90% of the fully supervised performance using only 50% of the ground truth labels, offering an effective trade-off between annotation cost and accuracy. Extensive experiments demonstrate that MetaOcc under full supervision achieves state-of-the-art performance, outperforming previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset, and by +1.16 SC IoU and +1.24 mIoU on the SurroundOcc-nuScenes dataset. These results demonstrate the scalability and robustness of MetaOcc across sensor domains and training conditions, paving the way for practical deployment in real-world autonomous systems. Code and data are available at https://github.com/LucasYang567/MetaOcc.
中文摘要:本文提出MetaOcc多模态框架,通过融合4D雷达与相机数据实现自动驾驶中的全方位3D占据预测,采用创新的特征提取与分层融合策略,在减少标注成本的同时达到最优性能表现。
English Summary: This paper introduces MetaOcc, a multi-modal framework that combines 4D radar and camera data for robust 3D occupancy prediction in autonomous driving, achieving state-of-the-art performance through novel feature extraction and fusion techniques while reducing annotation costs via semi-supervised learning.

Authors:Chuanyang Zheng
Title: iFormer: Integrating ConvNet and Transformer for Mobile Application
Abstract:
We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.
中文: iFormer是一种新型移动混合视觉网络,通过结合卷积的局部处理能力和自注意力的全局建模优势,在移动设备上实现了卓越的精度与低延迟,在ImageNet分类和COCO目标检测等任务中超越了现有模型。
English: iFormer is a novel mobile hybrid vision network that integrates convolution's local processing with self-attention's global modeling to achieve superior accuracy and low latency on mobile devices, outperforming existing models in tasks like ImageNet classification and COCO object detection.

Authors:Jiajie Li, Brian R Quaranto, Chenhui Xu, Ishan Mishra, Ruiyang Qin, Dancheng Liu, Peter C W Kim, Jinjun Xiong
Title: Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data
Abstract:
We present RASO, a foundation model designed to Recognize Any Surgical Object, offering robust open-set recognition capabilities across a broad range of surgical procedures and object classes, in both surgical images and videos. RASO leverages a novel weakly-supervised learning framework that generates tag-image-text pairs automatically from large-scale unannotated surgical lecture videos, significantly reducing the need for manual annotations. Our scalable data generation pipeline gathers 2,200 surgical procedures and produces 3.6 million tag annotations across 2,066 unique surgical tags. Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks, respectively, in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. Code, model, and demo are available at https://ntlm1686.github.io/raso.
Chinese: RASO是一种基础模型,通过弱监督学习框架在手术图像和视频中稳健识别各类手术对象,在多个标准手术基准的零样本和监督任务中均实现了显著性能提升。
English: RASO is a foundation model that robustly recognizes any surgical object in images and videos using a weakly-supervised learning framework, achieving significant performance improvements in zero-shot and supervised tasks across multiple surgical benchmarks.

Authors:Hao Tang, Siyue Yu, Jian Pang, Bingfeng Zhang
Title: A Training-free Synthetic Data Selection Method for Semantic Segmentation
Abstract:
Training a semantic segmenter with synthetic data has been attracting great attention due to its easy accessibility and huge quantities. Most previous methods focused on producing large-scale synthetic image-annotation samples and then training the segmenter with all of them. However, such a solution faces a major challenge: poor-quality samples are unavoidable, and using them to train the model harms the training process. In this paper, we propose a training-free Synthetic Data Selection (SDS) strategy with CLIP to select high-quality samples for building a reliable synthetic dataset. Specifically, given massive synthetic image-annotation pairs, we first design a Perturbation-based CLIP Similarity (PCS) to measure the reliability of synthetic images, thus removing samples with low-quality images. Then we propose a class-balanced Annotation Similarity Filter (ASF), which compares the synthetic annotation with the response of CLIP to remove samples related to low-quality annotations. The experimental results show that using our method significantly reduces the data size by half, while the trained segmenter achieves higher performance. The code is released at https://github.com/tanghao2000/SDS.
中文摘要:本文提出了一种基于CLIP的无训练合成数据选择方法,通过双重质量筛选机制构建可靠数据集,在将数据量减半的同时显著提升了分割模型的性能。
English Summary: The paper introduces a training-free Synthetic Data Selection (SDS) method using CLIP to filter high-quality synthetic data for semantic segmentation, which halves dataset size while improving model performance.
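The PCS/ASF formulas are not given in the abstract, so the snippet below only illustrates the general flavour of CLIP-based filtering: score each synthetic image by how well its CLIP similarity to the class prompt survives a simple perturbation (Gaussian blur here, an assumption), and keep the top-scoring fraction.

```python
# A hedged sketch of CLIP-based filtering; the exact PCS/ASF formulations are not reproduced.

import torch
from PIL import Image, ImageFilter
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.t()).item())

@torch.no_grad()
def perturbation_score(image: Image.Image, prompt: str) -> float:
    blurred = image.filter(ImageFilter.GaussianBlur(radius=3))
    # reliable images stay class-consistent under the perturbation
    return min(clip_score(image, prompt), clip_score(blurred, prompt))

def filter_samples(samples: list[tuple[Image.Image, str]], keep_ratio: float = 0.5):
    """samples: (image, class prompt) pairs; returns the highest-scoring fraction."""
    scored = sorted(samples, key=lambda s: perturbation_score(*s), reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]
```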

Authors:Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, Yuxin Peng
Title: Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
Abstract:
Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR, namely object information extraction, category knowledge reserve, and object-category alignment, and position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances the model's FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously and use examples from similar but incorrect categories as hard negatives, naturally bringing representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
Chinese: 本研究提出了Finedefics多模态大语言模型,通过引入物体属性描述并采用包含困难负样本的对比学习,显著提升了细粒度视觉识别能力,在多个数据集上展现出优越性能。
English: The study introduces Finedefics, a multi-modal large language model that enhances fine-grained visual recognition by incorporating object attribute descriptions and using contrastive learning with hard negatives, achieving superior performance on multiple datasets.
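A minimal sketch of contrastive alignment with hard negatives, with all shapes and names assumed: each visual object embedding is pulled toward its attribute/category text embedding while embeddings of similar-but-incorrect categories serve as explicit negatives in an InfoNCE objective.

```python
# A minimal InfoNCE-with-hard-negatives sketch; shapes and embedding sources are assumptions.

import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(obj: torch.Tensor, pos_text: torch.Tensor,
                                neg_text: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """obj: (B, D); pos_text: (B, D); neg_text: (B, K, D) hard-negative category embeddings."""
    obj = F.normalize(obj, dim=-1)
    pos = F.normalize(pos_text, dim=-1)
    neg = F.normalize(neg_text, dim=-1)
    pos_sim = (obj * pos).sum(-1, keepdim=True)               # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", obj, neg)            # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau
    target = torch.zeros(obj.size(0), dtype=torch.long)       # the positive is always index 0
    return F.cross_entropy(logits, target)

if __name__ == "__main__":
    torch.manual_seed(0)
    print(infonce_with_hard_negatives(torch.randn(4, 256), torch.randn(4, 256),
                                      torch.randn(4, 8, 256)).item())
```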

Authors:Shreya Shukla, Nakul Sharma, Manish Gupta, Anand Mishra
Title: PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures
Abstract:
Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial to effective knowledge sharing and enabling the replication and protection of intellectual property. However, automation of this task has been largely overlooked by the research community. To this end, we introduce PatentDesc-355K, a novel large-scale dataset containing ~355K patent figures along with their brief and detailed textual descriptions extracted from more than 60K US patent documents. In addition, we propose PatentLMM - a novel multimodal large language model specifically tailored to generate high-quality descriptions of patent figures. Our proposed PatentLMM comprises two key components: (i) PatentMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts the performance, generating coherent descriptions compared to fine-tuning similar-sized off-the-shelf multimodal models. PatentDesc-355K and PatentLMM pave the way for automating the understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents. We make the code and data publicly available.

Authors:Mengshi Qi, Xiaoyang Bi, Pengfei Zhu, Huadong Ma
Title: Towards Robust Unsupervised Attention Prediction in Autonomous Driving
Abstract:
Robustly predicting attention regions of interest for self-driving systems is crucial for driving safety but presents significant challenges due to the labor-intensive nature of obtaining large-scale attention labels and the domain gap between self-driving scenarios and natural scenes. These challenges are further exacerbated by complex traffic environments, including camera corruption under adverse weather, noise interferences, and central bias from long-tail distributions. To address these issues, we propose a robust unsupervised attention prediction method. An Uncertainty Mining Branch refines predictions by analyzing commonalities and differences across multiple pre-trained models on natural scenes, while a Knowledge Embedding Block bridges the domain gap by incorporating driving knowledge to adaptively enhance pseudo-labels. Additionally, we introduce RoboMixup, a novel data augmentation method that improves robustness against corruption through soft attention and dynamic augmentation, and mitigates central bias by integrating random cropping into Mixup as a regularizer. To systematically evaluate robustness in self-driving attention prediction, we introduce the DriverAttention-C benchmark, comprising over 100k frames across three subsets: BDD-A-C, DR(eye)VE-C, and DADA-2000-C. Our method achieves performance equivalent to or surpassing fully supervised state-of-the-art approaches on three public datasets and the proposed robustness benchmark, reducing relative corruption degradation by 58.8% and 52.8%, and improving central bias robustness by 12.4% and 11.4% in KLD and CC metrics, respectively. Code and data are available at https://github.com/zaplm/DriverAttention.
Chinese: 本研究提出了一种鲁棒的无监督自动驾驶注意力预测方法,通过不确定性挖掘和知识嵌入弥合领域差异,并采用RoboMixup数据增强提升抗干扰能力和缓解中心偏差,在新型DriverAttention-C基准测试中表现优于全监督方法。
English: This study introduces a robust unsupervised method for predicting attention in self-driving systems, utilizing uncertainty mining and knowledge embedding to bridge domain gaps and employing RoboMixup data augmentation to enhance corruption resistance and reduce central bias, validated by the new DriverAttention-C benchmark showing superior performance over supervised approaches.

Authors:Kaixun Jiang, Zhaoyu Chen, Jiyuan Fu, Lingyi Hong, Jinglun Li, Wenqiang Zhang
Title: VideoPure: Diffusion-based Adversarial Purification for Video Recognition
Abstract:
Recent work indicates that video recognition models are vulnerable to adversarial examples, posing a serious security risk to downstream applications. However, current research has primarily focused on adversarial attacks, with limited work exploring defense mechanisms. Furthermore, due to the spatial-temporal complexity of videos, existing video defense methods face issues of high cost, overfitting, and limited defense performance. Recently, diffusion-based adversarial purification methods have achieved robust defense performance in the image domain. However, due to the additional temporal dimension in videos, directly applying these diffusion-based adversarial purification methods to the video domain suffers performance and efficiency degradation. To achieve an efficient and effective video adversarial defense method, we propose the first diffusion-based video purification framework to improve video recognition models' adversarial robustness: VideoPure. Given an adversarial example, we first employ temporal DDIM inversion to transform the input distribution into a temporally consistent and trajectory-defined distribution, covering adversarial noise while preserving more video structure. Then, during DDIM denoising, we leverage intermediate results at each denoising step and conduct guided spatial-temporal optimization, removing adversarial noise while maintaining temporal consistency. Finally, we input the list of optimized intermediate results into the video recognition model for multi-step voting to obtain the predicted class. We investigate the defense performance of our method against black-box, gray-box, and adaptive attacks on benchmark datasets and models. Compared with other adversarial purification methods, our method overall demonstrates better defense performance against different attacks. Our code is available at https://github.com/deep-kaixun/VideoPure.
中文摘要:本研究提出了VideoPure,首个基于扩散的视频净化框架,通过时序DDIM反演和引导式时空优化,在保持时序一致性的同时有效去除对抗噪声,显著提升了视频识别模型对抗不同攻击的防御性能。
English Summary: The study introduces VideoPure, a novel diffusion-based video purification framework designed to enhance adversarial robustness in video recognition models by employing temporal DDIM inversion and guided spatial-temporal optimization for effective noise removal while maintaining temporal consistency.

Authors:Md. Kamrul Hasan, Guang Yang, Choon Hwai Yap
Title: Motion-enhanced Cardiac Anatomy Segmentation via an Insertable Temporal Attention Module
Abstract:
Cardiac anatomy segmentation is useful for clinical assessment of cardiac morphology to inform diagnosis and intervention. Deep learning (DL), especially with motion information, has improved segmentation accuracy. However, existing techniques for motion enhancement are not yet optimal, and they have high computational costs due to increased dimensionality or reduced robustness due to suboptimal approaches that use non-DL motion registration, non-attention models, or single-headed attention. They further have limited adaptability and are inconvenient for incorporation into existing networks where motion awareness is desired. Here, we propose a novel, computationally efficient Temporal Attention Module (TAM) that offers robust motion enhancement, modeled as a small, multi-headed, cross-temporal attention module. TAM's uniqueness is that it is a lightweight, plug-and-play module that can be inserted into a broad range of segmentation networks (CNN-based, Transformer-based, or hybrid) for motion enhancement without requiring substantial changes in the network's backbone. This feature enables high adaptability and ease of integration for enhancing both existing and future networks. Extensive experiments on multiple 2D and 3D cardiac ultrasound and MRI datasets confirm that TAM consistently improves segmentation across a range of networks while maintaining computational efficiency and improving on currently reported performance. The evidence demonstrates that it is a robust, generalizable solution for motion-awareness enhancement that is scalable (such as from 2D to 3D).
中文: 该摘要提出了一种新型时序注意力模块(TAM),能高效增强多种网络架构的运动感知心脏分割性能,大量实验证实其具有更优的运算效率和泛化能力。
English: This abstract introduces a novel Temporal Attention Module (TAM) that efficiently enhances motion-aware cardiac segmentation across various network architectures, demonstrating improved performance and computational efficiency in extensive experiments.
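Since TAM is described as a small, plug-and-play, multi-headed cross-temporal attention module, the sketch below shows one plausible shape for such a block: attention is applied over the time axis independently at every spatial location, with a residual connection so it can be inserted between stages of an existing 2D backbone. This is illustrative, not the authors' code.

```python
# An illustrative plug-in temporal attention block; architecture details are assumptions.

import torch
import torch.nn as nn

class TemporalAttentionModule(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -> attend over T at every (h, w) location
        b, t, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)   # (B*H*W, T, C)
        y = self.norm(seq)
        y, _ = self.attn(y, y, y)
        seq = seq + y                                             # residual keeps the module insertable
        return seq.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

if __name__ == "__main__":
    tam = TemporalAttentionModule(channels=64)
    frames = torch.randn(2, 5, 64, 16, 16)    # a 5-frame cine clip
    print(tam(frames).shape)                  # torch.Size([2, 5, 64, 16, 16])
```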

Authors:Taewoong Lee, Sarah Frisken, Nazim Haouchine
Title: 3D/2D Registration of Angiograms using Silhouette-based Differentiable Rendering
Abstract:
We present a method for 3D/2D registration of Digital Subtraction Angiography (DSA) images to provide valuable insight into brain hemodynamics and angioarchitecture. Our approach formulates the registration as a pose estimation problem, leveraging both anteroposterior and lateral DSA views and employing differentiable rendering. Preliminary experiments on real and synthetic datasets demonstrate the effectiveness of our method, with both qualitative and quantitative evaluations highlighting its potential for clinical applications. The code is available at https://github.com/taewoonglee17/TwoViewsDSAReg.
中文: 本研究提出了一种利用双视角输入和可微分渲染的3D/2D DSA图像配准方法,实验结果表明其在临床应用中的潜力。
English: This study introduces a 3D/2D registration method for DSA images using pose estimation with dual-view inputs and differentiable rendering, showing promising results in experiments for clinical use.
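To illustrate "registration as pose estimation" with two views, the toy below projects a 3D point cloud into anteroposterior and lateral views under a parameterized rigid pose and recovers the pose by gradient descent on a simple reprojection loss. The paper's silhouette-based differentiable renderer is replaced here by this much cruder point projection.

```python
# A toy two-view pose-estimation sketch, not the paper's silhouette renderer.

import torch

def rotation_matrix(angles: torch.Tensor) -> torch.Tensor:
    """Euler-angle rotation (ZYX) built from three scalars, differentiable w.r.t. angles."""
    rx, ry, rz = angles
    cx, sx, cy, sy, cz, sz = rx.cos(), rx.sin(), ry.cos(), ry.sin(), rz.cos(), rz.sin()
    one, zero = torch.ones(()), torch.zeros(())
    Rx = torch.stack([torch.stack([one, zero, zero]),
                      torch.stack([zero, cx, -sx]),
                      torch.stack([zero, sx, cx])])
    Ry = torch.stack([torch.stack([cy, zero, sy]),
                      torch.stack([zero, one, zero]),
                      torch.stack([-sy, zero, cy])])
    Rz = torch.stack([torch.stack([cz, -sz, zero]),
                      torch.stack([sz, cz, zero]),
                      torch.stack([zero, zero, one])])
    return Rz @ Ry @ Rx

def project(points, angles, trans):
    p = points @ rotation_matrix(angles).t() + trans
    ap = p[:, [0, 2]]       # anteroposterior view: drop the depth axis
    lat = p[:, [1, 2]]      # lateral view: drop the other horizontal axis
    return ap, lat

if __name__ == "__main__":
    torch.manual_seed(0)
    pts = torch.randn(200, 3)                                   # stand-in for a 3D vessel model
    gt_angles, gt_trans = torch.tensor([0.2, -0.1, 0.3]), torch.tensor([1.0, -0.5, 0.2])
    target_ap, target_lat = project(pts, gt_angles, gt_trans)   # "observed" 2D projections
    angles = torch.zeros(3, requires_grad=True)
    trans = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([angles, trans], lr=0.05)
    for _ in range(300):
        ap, lat = project(pts, angles, trans)
        loss = ((ap - target_ap) ** 2).mean() + ((lat - target_lat) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(loss.item(), angles.detach(), trans.detach())
```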

Authors:Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Title: HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Abstract:
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
Chinese: HERMES提出了一种统一的驾驶世界模型,通过鸟瞰图表示和世界查询将3D场景理解与未来场景生成相结合,在生成误差降低32.4%和理解指标提升8.0%的基础上实现了最先进的性能。
English: HERMES introduces a unified Driving World Model that integrates 3D scene understanding and future scene generation using Bird's-Eye View representation and world queries, achieving state-of-the-art performance with a 32.4% reduction in generation error and 8.0% improvement in understanding metrics.

Authors:Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, Roman Lubachersky, Chenglei Wu, Javier Romero, Jason Saragih, Michael Zollhoefer, Andreas Geiger, Siyu Tang, Shunsuke Saito
Title: Relightable Full-Body Gaussian Codec Avatars
Abstract:
We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relightable full-body avatars with fine-grained details including face and hands. The unique challenge for relighting full-body avatars lies in the large deformations caused by body articulation and the resulting impact on appearance caused by light transport. Changes in body pose can dramatically change the orientation of body surfaces with respect to lights, resulting in both local appearance changes due to changes in local light transport functions, as well as non-local changes due to occlusion between body parts. To address this, we decompose the light transport into local and non-local effects. Local appearance changes are modeled using learnable zonal harmonics for diffuse radiance transfer. Unlike spherical harmonics, zonal harmonics are highly efficient to rotate under articulation. This allows us to learn diffuse radiance transfer in a local coordinate frame, which disentangles the local radiance transfer from the articulation of the body. To account for non-local appearance changes, we introduce a shadow network that predicts shadows given precomputed incoming irradiance on a base mesh. This facilitates the learning of non-local shadowing between the body parts. Finally, we use a deferred shading approach to model specular radiance transfer and better capture reflections and highlights such as eye glints. We demonstrate that our approach successfully models both the local and non-local light transport required for relightable full-body avatars, with a superior generalization ability under novel illumination conditions and unseen poses.

Authors:Rongzhao He, Weihao Zheng, Leilei Zhao, Ying Wang, Dalin Zhu, Dan Wu, Bin Hu
Title: Surface Vision Mamba: Leveraging Bidirectional State Space Model for Efficient Spherical Manifold Representation
Abstract:
Attention-based methods have demonstrated exceptional performance in modelling long-range dependencies on spherical cortical surfaces, surpassing traditional Geometric Deep Learning (GDL) models. However, their extensive inference time and high memory demands pose challenges for application to large datasets with limited computing resources. Inspired by the state space model in computer vision, we introduce the attention-free Vision Mamba (Vim) to spherical surfaces, presenting a domain-agnostic architecture for analyzing data on spherical manifolds. Our method achieves surface patching by representing spherical data as a sequence of triangular patches derived from a subdivided icosphere. The proposed Surface Vision Mamba (SiM) is evaluated on multiple neurodevelopmental phenotype regression tasks using cortical surface metrics from neonatal brains. Experimental results demonstrate that SiM outperforms both attention- and GDL-based methods, delivering 4.8 times faster inference and achieving 91.7% lower memory consumption compared to the Surface Vision Transformer (SiT) under the Ico-4 grid partitioning. Sensitivity analysis further underscores the potential of SiM to identify subtle cognitive developmental patterns. The code is available at https://github.com/Rongzhao-He/surface-vision-mamba.
Chinese: 提出的表面视觉Mamba(SiM)方法在神经发育分析中超越了基于注意力和几何深度学习的方法,实现了推理速度提升4.8倍、内存消耗降低91.7%,并能有效识别细微的认知发育模式。
English: The proposed Surface Vision Mamba (SiM) method outperforms attention-based and geometric deep learning approaches in neurodevelopmental analysis by achieving 4.8x faster inference and 91.7% lower memory consumption while effectively identifying subtle cognitive patterns.

Authors:Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, Chen Change Loy
Title: MatAnyone: Stable Video Matting with Consistent Memory Propagation
Abstract:
Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.
中文摘要:MatAnyone是一种无需辅助信息的鲁棒视频抠图框架,通过自适应记忆融合确保核心区域语义稳定性并保留边界细节,结合新数据集和训练策略在多样化场景中实现卓越性能。
English Summary: MatAnyone is a robust auxiliary-free video matting framework that ensures semantic stability and preserves fine details through adaptive memory fusion, supported by a novel dataset and training strategy for superior performance in diverse scenarios.
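The abstract's region-adaptive memory fusion is not specified in detail; the sketch below shows one simple way such a module could look: a small gating head predicts, per pixel, how much of the previous frame's memory to keep versus refresh from the current frame. Treat it as a schematic, not MatAnyone's module.

```python
# A schematic per-pixel memory-fusion gate; the real module's design is an assumption.

import torch
import torch.nn as nn

class MemoryFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid(),
        )

    def forward(self, memory_prev: torch.Tensor, feat_cur: torch.Tensor) -> torch.Tensor:
        # memory_prev, feat_cur: (B, C, H, W)
        g = self.gate(torch.cat([memory_prev, feat_cur], dim=1))   # (B, 1, H, W) keep-probability
        return g * memory_prev + (1.0 - g) * feat_cur              # per-pixel adaptive blend

if __name__ == "__main__":
    fuse = MemoryFusion(64)
    print(fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)).shape)
```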

Authors:Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, Jian-Fang Hu
Title: ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
Abstract:
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% J&F on Ref-YouTube-VOS) with real-time inference speed (51 FPS).
中文:ReferDINO是一种新颖的参照视频目标分割模型,通过结合定位引导的掩码解码和对象一致的时间增强技术,实现了卓越性能和实时处理速度。
English: ReferDINO is a novel referring video object segmentation model that achieves superior performance and real-time speed by integrating grounding-guided mask decoding and object-consistent temporal enhancement.

Authors:Jiazhen Zhang, Yuexi Du, Nicha C. Dvornek, John A. Onofrey
Title: Improved Vessel Segmentation with Symmetric Rotation-Equivariant U-Net
Abstract:
Automated segmentation plays a pivotal role in medical image analysis and computer-assisted interventions. Despite the promising performance of existing methods based on convolutional neural networks (CNNs), they neglect useful equivariant properties for images, such as rotational and reflection equivariance. This limitation can decrease performance and lead to inconsistent predictions, especially in applications like vessel segmentation where explicit orientation is absent. While existing equivariant learning approaches attempt to mitigate these issues, they substantially increase learning cost, model size, or both. To overcome these challenges, we propose a novel application of an efficient symmetric rotation-equivariant (SRE) convolutional (SRE-Conv) kernel implementation to the U-Net architecture, to learn rotation and reflection-equivariant features, while also reducing the model size dramatically. We validate the effectiveness of our method through improved segmentation performance on retina vessel fundus imaging. Our proposed SRE U-Net not only significantly surpasses standard U-Net in handling rotated images, but also outperforms existing equivariant learning methods and does so with a reduced number of trainable parameters and smaller memory cost. The code is available at https://github.com/OnofreyLab/sre_conv_segm_isbi2025.
中文摘要:本研究提出的SRE U-Net在U-Net架构中引入了高效的对称旋转等变卷积核,不仅显著提升了医学图像旋转分割性能,还大幅减少了模型参数量和计算成本,优于现有等变学习方法。
English Summary: The proposed SRE U-Net introduces an efficient symmetric rotation-equivariant convolutional kernel to the U-Net architecture, achieving superior segmentation performance on rotated medical images while significantly reducing model size and computational costs compared to existing methods.
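A simplified sketch of the symmetric-kernel idea: convolution weights are tied by their rounded distance from the kernel centre, so each kernel is invariant to 90-degree rotations and reflections and the parameter count drops sharply. The released SRE-Conv implementation may differ in how rings are defined.

```python
# A simplified ring-tied convolution; the actual SRE-Conv kernel design is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 5):
        super().__init__()
        k = kernel_size
        yy, xx = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        radius = ((yy - k // 2) ** 2 + (xx - k // 2) ** 2).float().sqrt().round().long()
        self.register_buffer("ring_index", radius)                 # (k, k) ring id per position
        self.num_rings = int(radius.max()) + 1
        # one learnable weight per (out, in, ring) instead of per (out, in, y, x)
        self.ring_weight = nn.Parameter(torch.randn(out_ch, in_ch, self.num_rings) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.pad = k // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kernel = self.ring_weight[:, :, self.ring_index]            # expand rings into a full kernel
        return F.conv2d(x, kernel, self.bias, padding=self.pad)

if __name__ == "__main__":
    conv = SymmetricConv2d(3, 8)
    x = torch.randn(1, 3, 32, 32)
    y = conv(x)
    y_rot = conv(torch.rot90(x, 1, dims=(2, 3)))                    # 90-degree rotated input
    print(torch.allclose(torch.rot90(y, 1, dims=(2, 3)), y_rot, atol=1e-5))  # True: equivariant
```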

Authors:Yoni Schirris, Rosie Voorthuis, Mark Opdam, Marte Liefaard, Gabe S Sonke, Gwen Dackus, Vincent de Jong, Yuwei Wang, Annelot Van Rossum, Tessa G Steenbruggen, Lars C Steggink, Liesbeth G. E. de Vries, Marc van de Vijver, Roberto Salgado, Efstratios Gavves, Paul J van Diest, Sabine C Linn, Jonas Teuwen, Renee Menezes, Marleen Kok, Hugo Horlings
Title: ECTIL: Label-efficient Computational Tumour Infiltrating Lymphocyte (TIL) assessment in breast cancer: Multicentre validation in 2,340 patients with breast cancer
Abstract:
The level of tumour-infiltrating lymphocytes (TILs) is a prognostic factor for patients with (triple-negative) breast cancer (BC). Computational TIL assessment (CTA) has the potential to assist pathologists in this labour-intensive task, but current CTA models rely heavily on many detailed annotations. We propose and validate a fundamentally simpler deep learning based CTA that can be trained in only ten minutes on hundredfold fewer pathologist annotations. We collected whole slide images (WSIs) with TILs scores and clinical data of 2,340 patients with BC from six cohorts including three randomised clinical trials. Morphological features were extracted from whole slide images (WSIs) using a pathology foundation model. Our label-efficient Computational stromal TIL assessment model (ECTIL) directly regresses the TILs score from these features. ECTIL trained on only a few hundred samples (ECTIL-TCGA) showed concordance with the pathologist over five heterogeneous external cohorts (r=0.54-0.74, AUROC=0.80-0.94). Training on all slides of five cohorts (ECTIL-combined) improved results on a held-out test set (r=0.69, AUROC=0.85). Multivariable Cox regression analyses indicated that every 10% increase of ECTIL scores was associated with improved overall survival independent of clinicopathological variables (HR 0.86, p<0.01), similar to the pathologist score (HR 0.87, p<0.001). We demonstrate that ECTIL is highly concordant with an expert pathologist and obtains a similar hazard ratio. ECTIL has a fundamentally simpler design than existing methods and can be trained on orders of magnitude fewer annotations. Such a CTA may be used to pre-screen patients for, e.g., immunotherapy clinical trial inclusion, or as a tool to assist clinicians in the diagnostic work-up of patients with BC. Our model is available under an open source licence (https://github.com/nki-ai/ectil).
中文: ECTIL是一种新型深度学习模型,能以极少的标注高效预测乳腺癌中肿瘤浸润淋巴细胞评分,与病理学家评估高度一致,并对患者生存具有重要预后价值。
English: ECTIL, a novel deep learning model, efficiently predicts tumor-infiltrating lymphocyte scores in breast cancer with minimal annotations, demonstrating high concordance with pathologists and significant prognostic value for patient survival.
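Because ECTIL directly regresses the TILs score from foundation-model features, a minimal version of that pipeline can be sketched as below: mean-pool precomputed tile embeddings into a slide embedding and regress a 0-100 score with a tiny MLP. The pooling choice, head design, and training recipe are assumptions, not the released model.

```python
# A minimal slide-level TILs regression sketch on precomputed tile embeddings (assumptions).

import torch
import torch.nn as nn

class TILRegressor(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, tile_feats: torch.Tensor) -> torch.Tensor:
        # tile_feats: (num_tiles, feat_dim) embeddings for one whole-slide image
        slide_embedding = tile_feats.mean(dim=0)                    # simple attention-free pooling
        return torch.sigmoid(self.head(slide_embedding)) * 100.0    # TILs score in [0, 100]

if __name__ == "__main__":
    model = TILRegressor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    feats, target = torch.randn(500, 768), torch.tensor([35.0])     # pathologist TILs score
    for _ in range(10):
        pred = model(feats)
        loss = nn.functional.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()
    print(pred.item())
```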

Authors:Weicai Yan, Ye Wang, Wang Lin, Zirun Guo, Zhou Zhao, Tao Jin
Title: Low-rank Prompt Interaction for Continual Vision-Language Retrieval
Abstract:
Research on continual learning in multi-modal tasks has been receiving increasing attention. However, most existing work overlooks the explicit cross-modal and cross-task interactions. In this paper, we innovatively propose the Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, which considers both cross-modal and cross-task interactions. Specifically, as for the former, we employ multi-modal correlation modules for corresponding Transformer layers. Considering that the training parameters scale with the number of layers and tasks, we propose low-rank interaction-augmented decomposition to avoid memory explosion while enhancing the cross-modal association through sharing and separating common-specific low-rank factors. In addition, due to the multi-modal semantic differences carried by the low-rank initialization, we adopt hierarchical low-rank contrastive learning to ensure training robustness. As for the latter, we initially employ a visual analysis and identify that different tasks have clear distinctions in proximity. Therefore, we introduce explicit task contrastive constraints in the prompt learning process based on task semantic distances. Experiments on two retrieval tasks show performance improvements with the introduction of a minimal number of parameters, demonstrating the effectiveness of our method. Code is available at https://github.com/Kelvin-ywc/LPI.
Chinese: 本文提出低秩提示交互(LPI)方法,通过低秩分解和对比学习解决多模态持续学习中的跨模态与跨任务交互问题,以少量参数显著提升了检索任务的性能表现。
English: This paper introduces Low-rank Prompt Interaction (LPI), a novel method that enhances multi-modal continual learning by addressing cross-modal and cross-task interactions through low-rank decomposition and contrastive learning, achieving improved performance with minimal parameters.
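A schematic of the shared-plus-specific low-rank factorization mentioned above, with all shapes and names assumed: each task's cross-modal interaction matrix is composed from a shared low-rank factor pair plus a small task-specific pair, so parameters grow slowly with the number of tasks and layers.

```python
# A schematic low-rank interaction sketch; the actual LPI module design is an assumption.

import torch
import torch.nn as nn

class LowRankInteraction(nn.Module):
    def __init__(self, dim: int = 512, rank: int = 8, num_tasks: int = 5):
        super().__init__()
        self.shared_u = nn.Parameter(torch.randn(dim, rank) * 0.02)           # common factors
        self.shared_v = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.task_u = nn.Parameter(torch.randn(num_tasks, dim, rank) * 0.02)  # task-specific factors
        self.task_v = nn.Parameter(torch.randn(num_tasks, rank, dim) * 0.02)

    def forward(self, visual: torch.Tensor, text: torch.Tensor, task_id: int) -> torch.Tensor:
        # visual: (B, D), text: (B, D); returns a cross-modal interaction score per pair
        W = self.shared_u @ self.shared_v + self.task_u[task_id] @ self.task_v[task_id]
        return torch.einsum("bd,de,be->b", visual, W, text)

if __name__ == "__main__":
    lpi = LowRankInteraction()
    v, t = torch.randn(4, 512), torch.randn(4, 512)
    print(lpi(v, t, task_id=2).shape)    # torch.Size([4])
```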

Authors:Taha Emre, Teresa Araújo, Marzieh Oghbaie, Dmitrii Lachinov, Guilherme Aresta, Hrvoje Bogunović
Title: Automatic detection and prediction of nAMD activity change in retinal OCT using Siamese networks and Wasserstein Distance for ordinality
Abstract:
Neovascular age-related macular degeneration (nAMD) is a leading cause of vision loss among older adults, where disease activity detection and progression prediction are critical for nAMD management in terms of timely drug administration and improving patient outcomes. Recent advancements in deep learning offer a promising solution for predicting changes in AMD from optical coherence tomography (OCT) retinal volumes. In this work, we proposed deep learning models for the two tasks of the public MARIO Challenge at MICCAI 2024, designed to detect and forecast changes in nAMD severity with longitudinal retinal OCT. For the first task, we employ a Vision Transformer (ViT) based Siamese Network to detect changes in AMD severity by comparing scan embeddings of a patient from different time points. To train a model to forecast the change after 3 months, we exploit, for the first time, an Earth Mover (Wasserstein) Distance-based loss to harness the ordinal relation within the severity change classes. Both models ranked high on the preliminary leaderboard, demonstrating that their predictive capabilities could facilitate nAMD treatment management.
中文: 本研究提出基于视觉变换器的孪生网络模型,结合Wasserstein距离损失函数,能够通过纵向OCT扫描准确检测和预测新生血管性年龄相关性黄斑变性的严重程度变化,为临床治疗管理提供有效支持。
English: This study introduces deep learning models using Vision Transformers and Wasserstein Distance loss to effectively detect and predict neovascular AMD severity changes from longitudinal OCT scans, showing strong potential for improving treatment management.
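The Earth Mover's (Wasserstein) style loss for ordinal classes can be written in a few lines: compare the cumulative distributions of the predicted softmax and the one-hot target, so predictions that miss the true severity-change grade by one step are penalised less than those that miss by several. The exact variant used for the challenge may differ.

```python
# A small squared-EMD loss for ordered classes; the challenge model's exact loss may differ.

import torch
import torch.nn.functional as F

def emd_squared_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """logits: (B, K) over K ordered classes; target: (B,) integer labels."""
    probs = F.softmax(logits, dim=1)
    target_onehot = F.one_hot(target, num_classes=logits.size(1)).float()
    cdf_pred = probs.cumsum(dim=1)
    cdf_target = target_onehot.cumsum(dim=1)
    return ((cdf_pred - cdf_target) ** 2).sum(dim=1).mean()

if __name__ == "__main__":
    logits = torch.tensor([[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]])   # 3 ordered severity-change classes
    print(emd_squared_loss(logits, torch.tensor([0, 2])).item())
    # missing the target by one grade costs less than missing by two grades:
    print(emd_squared_loss(torch.tensor([[0., 5., 0.]]), torch.tensor([0])).item())
    print(emd_squared_loss(torch.tensor([[0., 0., 5.]]), torch.tensor([0])).item())
```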

Authors:Xiaohao Xu, Tianyi Zhang, Shibo Zhao, Xiang Li, Sibo Wang, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Sebastian Scherer, Xiaonan Huang
Title: Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video
Abstract:
We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: the reliance on noise-free data in existing models. While such sanitized conditions simplify evaluation, they fail to capture the unpredictable, noisy complexities of real-world environments. Dynamic motion, sensor imperfections, and synchronization perturbations lead to sharp performance declines when these models are deployed in practice, revealing an urgent need for frameworks that embrace and excel under real-world noise. To bridge this gap, we tackle three core challenges: scalable data generation, comprehensive benchmarking, and model robustness enhancement. First, we introduce a scalable noisy data synthesis pipeline that generates diverse datasets simulating complex motion, sensor imperfections, and synchronization errors. Second, we leverage this pipeline to create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation, highlighting the limitations of current learning-based methods in ego-motion accuracy and 3D reconstruction quality. Third, we propose Correspondence-guided Gaussian Splatting (CorrGS), a novel test-time adaptation method that progressively refines an internal clean 3D representation by aligning noisy observations with rendered RGB-D frames from clean 3D map, enhancing geometric alignment and appearance restoration through visual correspondence. Extensive experiments on synthetic and real-world data demonstrate that CorrGS consistently outperforms prior state-of-the-art methods, particularly in scenarios involving rapid motion and dynamic illumination.
中文摘要:本研究针对现有运动估计和三维重建模型难以应对现实世界噪声的局限,提出了包含合成数据生成、新基准测试和测试时自适应方法的完整框架,显著超越了现有技术性能。
English Summary: This research addresses the limitations of existing ego-motion estimation and 3D reconstruction models that struggle with real-world noise by introducing a comprehensive framework including synthetic data generation, a new benchmark, and a test-time adaptation method that significantly outperforms current approaches.

Authors:JongMin Lee, Sungjoo Yoo
Title: Dense-SfM: Structure from Motion with Dense Consistent Matching
Abstract:
We present Dense-SfM, a novel Structure from Motion (SfM) framework designed for dense and accurate 3D reconstruction from multi-view images. Sparse keypoint matching, which traditional SfM methods often rely on, limits both accuracy and point density, especially in texture-less areas. Dense-SfM addresses this limitation by integrating dense matching with a Gaussian Splatting (GS) based track extension which gives more consistent, longer feature tracks. To further improve reconstruction accuracy, Dense-SfM is equipped with a multi-view kernelized matching module leveraging transformer and Gaussian Process architectures, for robust track refinement across multi-views. Evaluations on the ETH3D and Texture-Poor SfM datasets show that Dense-SfM offers significant improvements in accuracy and density over state-of-the-art methods. Project page: https://icetea-cv.github.io/densesfm/.

Authors:Huayi Zhou, Ruixiang Wang, Yunxin Tai, Yueci Deng, Guiliang Liu, Kui Jia
Title: You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations
Abstract:
Bimanual robotic manipulation is a long-standing challenge of embodied intelligence due to its characteristics of dual-arm spatial-temporal coordination and high-dimensional action spaces. Previous studies rely on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, often making them lack simplicity, versatility and scalability. Differently, we believe that the most effective and efficient way for teaching bimanual manipulation is learning from human demonstrated videos, where rich features such as spatial-temporal positions, dynamic postures, interaction states and dexterous transitions are available almost for free. In this work, we propose the YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as few as a single binocular observation of hand movements, and teach dual robot arms various complex tasks. Furthermore, based on keyframes-based motion trajectories, we devise a subtle solution for rapidly generating training demonstrations with diverse variations of manipulated objects and their locations. These data can then be used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. In experiments, YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks, possesses strong generalization under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency. Our project link is https://hnuzhy.github.io/projects/YOTO.

Authors:Runyi Hu, Jie Zhang, Yiming Li, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
Title: VideoShield: Regulating Diffusion-based Video Generation Models via Watermarking
Abstract:
Artificial Intelligence Generated Content (AIGC) has advanced significantly, particularly with the development of video generation models such as text-to-video (T2V) models and image-to-video (I2V) models. However, like other AIGC types, video generation requires robust content control. A common approach is to embed watermarks, but most research has focused on images, with limited attention given to videos. Traditional methods, which embed watermarks frame-by-frame in a post-processing manner, often degrade video quality. In this paper, we propose VideoShield, a novel watermarking framework specifically designed for popular diffusion-based video generation models. Unlike post-processing methods, VideoShield embeds watermarks directly during video generation, eliminating the need for additional training. To ensure video integrity, we introduce a tamper localization feature that can detect changes both temporally (across frames) and spatially (within individual frames). Our method maps watermark bits to template bits, which are then used to generate watermarked noise during the denoising process. Using DDIM Inversion, we can reverse the video to its original watermarked noise, enabling straightforward watermark extraction. Additionally, template bits allow precise detection of potential temporal and spatial modifications. Extensive experiments across various video models (both T2V and I2V models) demonstrate that our method effectively extracts watermarks and detects tampering without compromising video quality. Furthermore, we show that this approach is applicable to image generation models, enabling tamper detection in generated images as well. Codes and models are available at https://github.com/hurunyi/VideoShield.
中文: VideoShield是一种创新的水印框架,直接在基于扩散的视频生成过程中嵌入水印,既能保持视频质量又能实现精准篡改检测,且无需额外训练。
English: VideoShield is a novel watermarking framework that embeds watermarks directly during diffusion-based video generation, enabling robust tamper detection without compromising quality or requiring additional training.
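As a rough illustration of bit-conditioned initial noise in general (not VideoShield's actual mapping), the sketch below draws Gaussian noise whose element-wise signs encode a bit pattern, so the bits can be read back from noise recovered by inversion; the tensor shape and the sign-based encoding are assumptions made for this example.

    import torch

    def bits_to_watermarked_noise(bits, shape):
        # Illustrative encoding: each noise element's sign carries one bit.
        noise = torch.randn(shape)
        signs = bits.float() * 2 - 1              # {0, 1} -> {-1, +1}
        return noise.abs() * signs.view(shape)

    bits = torch.randint(0, 2, (1 * 4 * 8 * 8,))
    z = bits_to_watermarked_noise(bits, (1, 4, 8, 8))
    recovered = (z > 0).long().flatten()          # read the bits back from the noise
    bit_accuracy = (recovered == bits).float().mean()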

Authors:Mojtaba Safari, Zach Eidex, Chih-Wei Chang, Richard L. J. Qiu, Xiaofeng Yang
Title: Advancing MRI Reconstruction: A Systematic Review of Deep Learning and Compressed Sensing Integration
Abstract:
Magnetic resonance imaging (MRI) is a non-invasive imaging modality and provides comprehensive anatomical and functional insights into the human body. However, its long acquisition times can lead to patient discomfort, motion artifacts, and limit real-time applications. To address these challenges, strategies such as parallel imaging have been applied, which utilize multiple receiver coils to speed up the data acquisition process. Additionally, compressed sensing (CS) is a method that facilitates image reconstruction from sparse data, significantly reducing image acquisition time by minimizing the amount of data collection needed. Recently, deep learning (DL) has emerged as a powerful tool for improving MRI reconstruction. It has been integrated with parallel imaging and CS principles to achieve faster and more accurate MRI reconstructions. This review comprehensively examines DL-based techniques for MRI reconstruction. We categorize and discuss various DL-based methods, including end-to-end approaches, unrolled optimization, and federated learning, highlighting their potential benefits. Our systematic review highlights significant contributions and underscores the potential of DL in MRI reconstruction. Additionally, we summarize key results and trends in DL-based MRI reconstruction, including quantitative metrics, datasets, acceleration factors, and the progress of and research interest in DL techniques over time. Finally, we discuss potential future directions and the importance of DL-based MRI reconstruction in advancing medical imaging. To facilitate further research in this area, we provide a GitHub repository that includes up-to-date DL-based MRI reconstruction publications and public datasets: https://github.com/mosaf/Awesome-DL-based-CS-MRI.
中文: 深度学习技术通过与并行成像和压缩感知相结合,正在革新磁共振成像重建过程,显著缩短采集时间并提升图像精度;本综述系统分类了各类方法并展望未来方向,同时提供了资源库支持后续研究。
English: Deep learning techniques are revolutionizing MRI reconstruction by integrating with parallel imaging and compressed sensing to significantly reduce acquisition times and improve image accuracy, with this review systematically categorizing methods and highlighting future directions while providing a resource repository.

Authors:Sneh Pandya, Purvik Patel, Brian D. Nord, Mike Walmsley, Aleksandra Ćiprijanović
Title: SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks
Abstract:
Modern neural networks (NNs) often do not generalize well in the presence of a "covariate shift"; that is, in situations where the training and test data distributions differ, but the conditional distribution of classification labels remains unchanged. In such cases, NN generalization can be reduced to a problem of learning more domain-invariant features. Domain adaptation (DA) methods include a range of techniques aimed at achieving this; however, these methods have struggled with the need for extensive hyperparameter tuning, which then incurs significant computational costs. In this work, we introduce SIDDA, an out-of-the-box DA training algorithm built upon the Sinkhorn divergence, that can achieve effective domain alignment with minimal hyperparameter tuning and computational overhead. We demonstrate the efficacy of our method on multiple simulated and real datasets of varying complexity, including simple shapes, handwritten digits, and real astronomical observations. SIDDA is compatible with a variety of NN architectures, and it works particularly well in improving classification accuracy and model calibration when paired with equivariant neural networks (ENNs). We find that SIDDA enhances the generalization capabilities of NNs, achieving up to a $\approx40\%$ improvement in classification accuracy on unlabeled target data. We also study the efficacy of DA on ENNs with respect to the varying group orders of the dihedral group $D_N$, and find that the model performance improves as the degree of equivariance increases. Finally, we find that SIDDA enhances model calibration on both source and target data--achieving over an order of magnitude improvement in the ECE and Brier score. SIDDA's versatility, combined with its automated approach to domain alignment, has the potential to advance multi-dataset studies by enabling the development of highly generalizable models.
中文: SIDDA是一种基于Sinkhorn散度的高效域自适应算法,能以最少的超参数调优实现域对齐,显著提升神经网络在不同数据集上的泛化能力和模型校准效果。
English: SIDDA is an efficient domain adaptation algorithm that uses Sinkhorn divergence to achieve domain alignment with minimal hyperparameter tuning, significantly improving neural network generalization and model calibration across various datasets.
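For context, the following is a minimal, self-contained PyTorch sketch of an entropic-regularized Sinkhorn transport cost and the debiased Sinkhorn divergence between a source and a target feature batch, the kind of quantity a method like SIDDA could add to the classification loss as a domain-alignment penalty. The cost rescaling, the regularization strength eps, and the iteration count are illustrative choices, not the paper's settings.

    import torch

    def sinkhorn_cost(x, y, eps=0.1, n_iters=100):
        # Entropic-OT transport cost between two feature clouds with uniform weights.
        C = torch.cdist(x, y) ** 2
        C = C / (C.max() + 1e-8)                  # rescale cost for numerical stability
        K = torch.exp(-C / eps)
        u = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)
        v = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)
        r, c = torch.ones_like(u), torch.ones_like(v)
        for _ in range(n_iters):                  # Sinkhorn fixed-point updates
            r = u / (K @ c + 1e-12)
            c = v / (K.t() @ r + 1e-12)
        P = r.unsqueeze(1) * K * c.unsqueeze(0)   # transport plan
        return (P * C).sum()

    def sinkhorn_divergence(x, y):
        # Debiased divergence, usable as a feature-alignment penalty.
        return sinkhorn_cost(x, y) - 0.5 * (sinkhorn_cost(x, x) + sinkhorn_cost(y, y))

    penalty = sinkhorn_divergence(torch.randn(32, 128), torch.randn(32, 128))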

Authors:Andrey Palaev, Adil Khan, Syed M. Ahsan Kazmi
Title: LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps
Abstract:
The advancement of text-to-image synthesis has introduced powerful generative models capable of creating realistic images from textual prompts. However, precise control over image attributes remains challenging, especially at the instance level. While existing methods offer some control through fine-tuning or auxiliary information, they often face limitations in flexibility and accuracy. To address these challenges, we propose a pipeline leveraging Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps and intermediate activations of diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks. By incorporating cross-attention maps, our approach ensures coherence in manipulated images while controlling object positions. Our method enables precise manipulations at the instance level without fine-tuning or auxiliary information such as masks or bounding boxes. Code is available at https://github.com/Palandr123/DiffusionU-NetLLM
Chinese Summary: 本文提出了一种新颖方法,利用大型语言模型、开放词汇检测器和扩散U-Net组件,无需微调或额外输入掩码即可实现基于文本提示的实例级图像精确操控。
English Summary: This paper introduces a novel pipeline that utilizes Large Language Models, open-vocabulary detectors, and diffusion U-Net components to achieve precise instance-level image manipulation from text prompts without requiring fine-tuning or additional input masks.

Authors:Luqi Zhang, Haiping Wang, Chong Liu, Zhen Dong, Bisheng Yang
Title: ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection
Abstract:
The point clouds collected by the Airborne Laser Scanning (ALS) system provide accurate 3D information of urban land covers. By utilizing multi-temporal ALS point clouds, semantic changes in urban areas can be captured, demonstrating significant potential in urban planning, emergency management, and infrastructure maintenance. Existing 3D change detection methods struggle to efficiently extract multi-class semantic information and change features, still facing the following challenges: (1) the difficulty of accurately modeling cross-temporal point cloud spatial relationships for effective change feature extraction; (2) class imbalance of change samples, which hinders distinguishability of semantic features; (3) the lack of real-world datasets for 3D semantic change detection. To resolve these challenges, we propose the Multi-task Enhanced Cross-temporal Point Transformer (ME-CPT) network. ME-CPT establishes spatiotemporal correspondences between point clouds across different epochs and employs attention mechanisms to jointly extract semantic change features, facilitating information exchange and change comparison. Additionally, we incorporate a semantic segmentation task and, through the multi-task training strategy, further enhance the distinguishability of semantic features, reducing the impact of class imbalance in change types. Moreover, we release a 22.5 $km^2$ 3D semantic change detection dataset, offering diverse scenes for comprehensive evaluation. Experiments on multiple datasets show that the proposed ME-CPT achieves superior performance compared to existing state-of-the-art methods. The source code and dataset will be released upon acceptance at https://github.com/zhangluqi0209/ME-CPT.
中文: ME-CPT网络通过建立跨时空点云的时空对应关系并利用注意力机制,有效从多时机ALS点云中提取城市语义变化特征,同时通过多任务训练缓解类别不平衡问题,并发布了新的评估数据集。
English: The ME-CPT network effectively captures semantic changes in urban areas from multi-temporal ALS point clouds by establishing spatiotemporal correspondences and using attention mechanisms, while also addressing class imbalance through multi-task training and providing a new dataset for evaluation.

Authors:Ioannis Nasios
Title: Enhancing kelp forest detection in remote sensing images using crowdsourced labels with Mixed Vision Transformers and ConvNeXt segmentation models
Abstract:
Kelp forests, as foundation species, are vital to marine ecosystems, providing essential food and habitat for numerous organisms. This study explores the integration of crowdsourced labels with advanced artificial intelligence models to develop a fast and accurate kelp canopy detection pipeline using Landsat images. Building on the success of a machine learning competition, where this approach ranked third and performed consistently well on both local validation and public and private leaderboards, the research highlights the effectiveness of combining Mixed Vision Transformers (MIT) with ConvNeXt models. Training these models on various image sizes significantly enhanced the accuracy of the ensemble results. U-Net emerged as the best segmentation architecture, with UpperNet also contributing to the final ensemble. Key Landsat bands, such as ShortWave InfraRed (SWIR1) and Near-InfraRed (NIR), were crucial while altitude data was used in postprocessing to eliminate false positives on land. The methodology achieved a high detection rate, accurately identifying about three out of four pixels containing kelp canopy while keeping false positives low. Despite the medium resolution of Landsat satellites, their extensive historical coverage makes them effective for studying kelp forests. This work also underscores the potential of combining machine learning models with crowdsourced data for effective and scalable environmental monitoring. All running code for training all models and inference can be found at https://github.com/IoannisNasios/Kelp_Forests.
中文摘要:本研究通过整合众包标签与混合视觉转换器和ConvNeXt等人工智能模型,利用Landsat图像开发出高效的海藻冠层检测方法,在中等卫星分辨率下仍能实现精确识别。
English Summary: This study demonstrates a highly effective method for detecting kelp canopies using Landsat imagery by integrating crowdsourced labels with AI models like Mixed Vision Transformers and ConvNeXt, achieving accurate results despite medium satellite resolution.
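The altitude-based post-processing mentioned above is simple to express; a minimal NumPy sketch follows, with an assumed sea-level cutoff and probability threshold.

    import numpy as np

    def mask_out_land(kelp_prob, altitude, sea_level=0.0, thresh=0.5):
        # Zero out kelp predictions wherever the elevation layer says the pixel is land.
        kelp = (kelp_prob >= thresh).astype(np.uint8)
        kelp[altitude > sea_level] = 0
        return kelp

    pred = mask_out_land(np.random.rand(256, 256), np.random.randn(256, 256) * 5)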

Authors:Yi Yang, Zhang Zhang, Liang Wang
Title: MCRL4OR: Multimodal Contrastive Representation Learning for Off-Road Environmental Perception
Abstract:
Most studies on environmental perception for autonomous vehicles (AVs) focus on urban traffic environments, where the objects/stuff to be perceived are mainly from man-made scenes and scalable datasets with dense annotations can be used to train supervised learning models. By contrast, it is hard to densely annotate a large-scale off-road driving dataset manually due to the inherently unstructured nature of off-road environments. In this paper, we propose a Multimodal Contrastive Representation Learning approach for Off-Road environmental perception, namely MCRL4OR. This approach aims to jointly learn three encoders for processing visual images, locomotion states, and control actions by aligning the locomotion states with the fused features of visual images and control actions within a contrastive learning framework. The rationale behind this alignment strategy is that the inertial locomotion state is the result of taking a certain control action under the current landform/terrain condition perceived by visual sensors. In experiments, we pre-train the MCRL4OR with a large-scale off-road driving dataset and adopt the learned multimodal representations for various downstream perception tasks in off-road driving scenarios. The superior performance in downstream tasks demonstrates the advantages of the pre-trained multimodal representations. The code can be found at https://github.com/1uciusy/MCRL4OR.
中文: 本文提出MCRL4OR多模态对比学习方法,通过将运动状态与视觉和控制特征融合对齐来提升越野环境感知能力,在下游任务中展现出优越性能。
English: This paper introduces MCRL4OR, a multimodal contrastive learning method that aligns locomotion states with fused visual and control features to enhance off-road environmental perception, demonstrating superior performance in downstream tasks.
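A common way to realize this kind of cross-modal alignment is a symmetric InfoNCE objective in which the locomotion embedding and the fused visual-plus-action embedding of the same clip form the positive pair and all other pairings in the batch act as negatives; a minimal PyTorch sketch is below. The encoders themselves and the temperature value are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def infonce_alignment(loco_emb, fused_emb, temperature=0.07):
        # loco_emb, fused_emb: (B, D) embeddings of the same B clips, row-aligned.
        z1 = F.normalize(loco_emb, dim=1)
        z2 = F.normalize(fused_emb, dim=1)
        logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
        labels = torch.arange(z1.size(0), device=z1.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    loss = infonce_alignment(torch.randn(16, 256), torch.randn(16, 256))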

Authors:Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Title: Attribute-based Visual Reprogramming for Vision-Language Models
Abstract:
Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the $k$-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.
Chinese Summary: 本文提出基于属性的视觉重编程方法(AttrVR),通过利用描述性和区分性属性动态优化输入模式,有效降低类内差异并提升CLIP模型在12个下游任务中的分类性能。
English Summary: This paper introduces Attribute-based Visual Reprogramming (AttrVR) for CLIP, which leverages descriptive and distinctive attributes to dynamically optimize input patterns, reducing intra-class variance and improving classification performance across 12 downstream tasks.
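As a rough illustration of per-sample attribute selection, the sketch below picks, for one image embedding, the k attribute text embeddings with the highest cosine similarity; the value of k and the random placeholder features are assumptions, and AttrVR's iterative pattern refinement is not shown.

    import torch
    import torch.nn.functional as F

    def k_nearest_attributes(image_feat, attr_feats, k=3):
        # image_feat: (D,) one image embedding; attr_feats: (N, D) attribute text embeddings.
        sims = F.cosine_similarity(image_feat.unsqueeze(0), attr_feats, dim=1)
        topk = sims.topk(k)
        return topk.indices, topk.values

    idx, vals = k_nearest_attributes(torch.randn(512), torch.randn(40, 512))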

Authors:Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, Hongsheng Li, Pheng-Ann Heng
Title: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Abstract:
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image, which is the first to incorporate reflection in autoregressive image generation. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
中文摘要:本研究首次将思维链推理应用于自回归图像生成,通过结合验证扩展、偏好对齐及新型奖励模型(PARM/PARM++),显著提升了生成性能,在基准测试中实现了24%的突破性改进。
English Summary: This study pioneers the application of Chain-of-Thought reasoning to autoregressive image generation, demonstrating that combining verification scaling, preference alignment, and novel reward models (PARM/PARM++) significantly enhances performance, achieving a 24% improvement on benchmarks.

Authors:Hao Dong, Eleni Chatzi, Olga Fink
Title: Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization
Abstract:
Test-time adaptation (TTA) has demonstrated significant potential in addressing distribution shifts between training and testing data. Open-set test-time adaptation (OSTTA) aims to adapt a source pre-trained model online to an unlabeled target domain that contains unknown classes. This task becomes more challenging when multiple modalities are involved. Existing methods have primarily focused on unimodal OSTTA, often filtering out low-confidence samples without addressing the complexities of multimodal data. In this work, we present Adaptive Entropy-aware Optimization (AEO), a novel framework specifically designed to tackle Multimodal Open-set Test-time Adaptation (MM-OSTTA) for the first time. Our analysis shows that the entropy difference between known and unknown samples in the target domain strongly correlates with MM-OSTTA performance. To leverage this, we propose two key components: Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP). These components enhance the ability of the model to distinguish unknown class samples during online adaptation by amplifying the entropy difference between known and unknown samples. To thoroughly evaluate our proposed methods in the MM-OSTTA setting, we establish a new benchmark derived from existing datasets. This benchmark includes two downstream tasks and incorporates five modalities. Extensive experiments across various domain shift situations demonstrate the efficacy and versatility of the AEO framework. Additionally, we highlight the strong performance of AEO in long-term and continual MM-OSTTA settings, both of which are challenging and highly relevant to real-world applications. Our source code is available at https://github.com/donghao51/AEO.
中文: AEO框架通过自适应熵优化增强多模态开放集测试时适应能力,利用已知与未知样本间的熵差异提升模型对未知类别的识别性能。
English: The AEO framework introduces adaptive entropy optimization to enhance multimodal open-set test-time adaptation by distinguishing unknown classes through amplified entropy differences between known and unknown samples.
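To make the entropy-gap idea concrete, the sketch below computes per-sample prediction entropy and a simple objective that lowers it for samples flagged as known while raising it for samples flagged as unknown. How samples are flagged and how AEO adaptively weights them are not shown; this is an illustration of the underlying quantity only.

    import torch
    import torch.nn.functional as F

    def prediction_entropy(logits):
        # Per-sample entropy of the softmax prediction.
        p = F.softmax(logits, dim=1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=1)

    def entropy_separation_objective(logits, known_mask):
        # Minimize entropy on presumed-known samples, maximize it on presumed-unknown ones.
        ent = prediction_entropy(logits)
        return ent[known_mask].mean() - ent[~known_mask].mean()

    mask = torch.tensor([True, True, True, True, False, False, False, False])
    objective = entropy_separation_objective(torch.randn(8, 10), mask)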

Authors:Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
Abstract:
With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed the IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
中文:随着FLUX.1和Ideogram2.0等扩散模型的突破,文本到图像模型在多领域展现出通用化潜力,但现有评估体系尚不完善,为此开发的IMAGINE-E基准通过五大关键维度对六款主流模型进行了系统评估,揭示了其作为基础AI工具的发展前景。
English: Recent advances in diffusion-based text-to-image models like FLUX.1 and Ideogram2.0 demonstrate expanding capabilities across multiple domains, though current evaluation frameworks remain inadequate for comprehensive assessment, prompting the development of the IMAGINE-E benchmark to systematically evaluate six leading models across five key domains.

Authors:Rui Li, Xiaohan Wang, Yuhui Zhang, Orr Zohar, Zeyu Wang, Serena Yeung-Levy
Title: Temporal Preference Optimization for Long-Form Video Understanding
Abstract:
Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.
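For reference, preference-based post-training of this kind typically builds on a Direct Preference Optimization style objective over pairs of preferred and dispreferred responses; a minimal sketch over sequence log-probabilities is below. The beta value and all inputs are illustrative, and this is not TPO's exact loss.

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
        # Push the policy to prefer the well-grounded response (w) over the less
        # accurate one (l), relative to a frozen reference model.
        margin = (logp_w_policy - logp_l_policy) - (logp_w_ref - logp_l_ref)
        return -F.logsigmoid(beta * margin).mean()

    loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))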

Authors:Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Title: Improving Video Generation with Human Feedback
Abstract:
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
中文摘要:本研究通过构建大规模人类偏好数据集、开发VideoReward奖励模型,并实施三种对齐算法,建立了一个利用人类反馈的系统化流程,显著提升了视频生成的流畅度和提示词对齐效果。
English Summary: This study introduces a systematic pipeline using human feedback to enhance video generation by creating a large-scale preference dataset, developing the VideoReward model, and implementing three alignment algorithms that significantly improve motion smoothness and prompt alignment.

Authors:Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li
Title: PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection
Abstract:
With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on these views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model's ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch, and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at https://github.com/ZpyWHU/PointOBB-v3.
中文: PointOBB-v3是一种更强大的单点监督定向目标检测框架,无需额外先验即可生成伪旋转框,通过多视图集成和尺度角度增强模块,在多个数据集上平均精度提升3.56%。
English: PointOBB-v3 is an advanced single point-supervised oriented object detection framework that generates pseudo rotated boxes without extra priors, integrates multi-view processing with scale and angle enhancement modules, and achieves a 3.56% average accuracy improvement across multiple datasets.

Authors:Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Title: Ensuring Medical AI Safety: Interpretability-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data
Abstract:
Deep neural networks are increasingly employed in high-stakes medical applications, despite their tendency for shortcut learning in the presence of spurious correlations, which can have potentially fatal consequences in practice. Whereas a multitude of works address either the detection or mitigation of such shortcut behavior in isolation, the Reveal2Revise approach provides a comprehensive bias mitigation framework combining these steps. However, effectively addressing these biases often requires substantial labeling efforts from domain experts. In this work, we review the steps of the Reveal2Revise framework and enhance it with semi-automated interpretability-based bias annotation capabilities. This includes methods for the sample- and feature-level bias annotation, providing valuable information for bias mitigation methods to unlearn the undesired shortcut behavior. We show the applicability of the framework using four medical datasets across two modalities, featuring controlled and real-world spurious correlations caused by data artifacts. We successfully identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision Transformer models, ultimately increasing their robustness and applicability for real-world medical tasks. Our code is available at https://github.com/frederikpahde/medical-ai-safety.
中文:Reveal2Revise框架通过半自动化的基于可解释性的偏差标注功能得到增强,能有效识别和缓解医学AI模型中的捷径学习问题,从而提升了模型在多种数据集和架构中的鲁棒性。
English: The Reveal2Revise framework is enhanced with semi-automated interpretability-based bias annotation to effectively identify and mitigate shortcut learning in medical AI models, improving their robustness across various datasets and architectures.

Authors:Yizhe Lv, Tingting Zhang, Zhijian Wang, Yunpeng Song, Han Ding, Jinsong Han, Fei Wang
Title: mmEgoHand: Egocentric Hand Pose Estimation and Gesture Recognition with Head-mounted Millimeter-wave Radar and IMU
Abstract:
Recent advancements in millimeter-wave (mmWave) radar have demonstrated its potential for human action recognition and pose estimation, offering privacy-preserving advantages over conventional cameras while maintaining occlusion robustness, with promising applications in human-computer interaction and wellness care. However, existing mmWave systems typically employ fixed-position configurations, restricting user mobility to predefined zones and limiting practical deployment scenarios. We introduce mmEgoHand, a head-mounted egocentric system for hand pose estimation to support applications such as gesture recognition, VR interaction, skill digitization and assessment, and robotic teleoperation. mmEgoHand synergistically integrates mmWave radar with inertial measurement units (IMUs) to enable dynamic perception. The IMUs actively compensate for radar interference induced by head movements, while our novel end-to-end Transformer architecture simultaneously estimates 3D hand keypoint coordinates through multi-modal sensor fusion. This dual-modality framework achieves spatial-temporal alignment of mmWave heatmaps with IMUs, overcoming viewpoint instability inherent in egocentric sensing scenarios. We further demonstrate that intermediate hand pose representations substantially improve performance in downstream tasks, e.g., VR gesture recognition. Extensive evaluations with 10 subjects performing 8 gestures across 3 distinct postures -- standing, sitting, lying -- achieve 90.8% recognition accuracy, outperforming state-of-the-art solutions by a large margin. Dataset and code are available at https://github.com/WhisperYi/mmVR.
中文:mmEgoHand是一种头戴式系统,结合毫米波雷达和惯性测量单元实现动态手部姿态估计,通过新型Transformer架构和多模态融合,在不同姿势下实现了高精度的手势识别。
English: mmEgoHand is a head-mounted system combining mmWave radar and IMUs for dynamic hand pose estimation, achieving high accuracy in gesture recognition across various postures through a novel Transformer architecture and multi-modal fusion.

Authors:Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Title: A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation
Abstract:
In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images, from global characteristics to finer and local details (e.g., StyleGAN, NVAE), emerging as powerful tools for diverse applications. Yet their generative dynamics remain only empirically observed, without a systematic understanding of each latent variable's impact. In this work, we propose a novel framework that quantifies the contribution of each latent variable using Mutual Information (MI) as a metric. Our analysis reveals that current MLVGMs often underutilize some latent variables, and provides actionable insights for their use in downstream applications. With this foundation, we introduce a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL). By leveraging the hierarchical and disentangled variables of MLVGMs, our approach produces diverse and semantically meaningful views without the need for real image data. Additionally, we introduce a Continuous Sampling (CS) strategy, where the generator dynamically creates new samples during SSCRL training, greatly increasing data variability. Our comprehensive experiments demonstrate the effectiveness of these contributions, showing that MLVGMs' generated views compete on par with or even surpass views generated from real data. This work establishes a principled approach to understanding and exploiting MLVGMs, advancing both generative modeling and self-supervised learning. Code and pre-trained models at: https://github.com/SerezD/mi_ml_gen.
中文摘要:本研究提出一种基于互信息的框架来量化多潜在变量生成模型中各变量的作用,发现其利用不足的问题,并开发了无需真实图像即可生成媲美真实数据的自监督学习合成数据方法。
English Summary: This study introduces a framework to quantify the impact of latent variables in MLVGMs using mutual information, revealing their underutilization and enabling synthetic data generation for self-supervised learning that rivals real data performance.

Authors:Abdulrahman Oladipupo Ibraheem
Title: Regularizing cross entropy loss via minimum entropy and K-L divergence
Abstract:
I introduce two novel loss functions for classification in deep learning. The two loss functions extend standard cross entropy loss by regularizing it with minimum entropy and Kullback-Leibler (K-L) divergence terms. The first of the two novel loss functions is termed mixed entropy loss (MIX-ENT for short), while the second one is termed minimum entropy regularized cross-entropy loss (MIN-ENT for short). The MIX-ENT function introduces a regularizer that can be shown to be equivalent to the sum of a minimum entropy term and a K-L divergence term. However, it should be noted that the K-L divergence term here is different from that in the standard cross-entropy loss function, in the sense that it swaps the roles of the target probability and the hypothesis probability. The MIN-ENT function simply adds a minimum entropy regularizer to the standard cross entropy loss function. In both MIX-ENT and MIN-ENT, the minimum entropy regularizer minimizes the entropy of the hypothesis probability distribution, which is output by the neural network. Experiments on the EMNIST-Letters dataset show that my implementation of MIX-ENT and MIN-ENT lets the VGG model climb from its previous 3rd position on the paperswithcode leaderboard to reach the 2nd position on the leaderboard, outperforming the Spinal-VGG model in so doing. Specifically, using standard cross-entropy, VGG achieves 95.86% while Spinal-VGG achieves 95.88% classification accuracies, whereas using VGG (without Spinal-VGG) our MIN-ENT achieved 95.933%, while our MIX-ENT achieved 95.927% accuracies. The pre-trained models for both MIX-ENT and MIN-ENT are at https://github.com/rahmanoladi/minimum entropy project.
Chinese: 本文提出了两种新的损失函数MIX-ENT和MIN-ENT,通过引入最小熵和K-L散度正则化改进标准交叉熵,在EMNIST-Letters数据集上提升了VGG模型的分类准确率,使其在排行榜上超越原有排名。
English: This paper introduces two novel loss functions, MIX-ENT and MIN-ENT, which enhance standard cross-entropy by incorporating minimum entropy and K-L divergence regularization, leading to improved classification accuracy on the EMNIST-Letters dataset and advancing VGG's leaderboard position.
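The MIN-ENT variant is straightforward to express: standard cross-entropy plus a minimum-entropy regularizer on the predicted (hypothesis) distribution. A minimal PyTorch sketch follows, with an assumed weighting hyperparameter lam; the paper's exact weighting and the MIX-ENT formulation are not reproduced here.

    import torch
    import torch.nn.functional as F

    def min_ent_loss(logits, targets, lam=0.1):
        # Cross-entropy plus an entropy penalty on the network's own predictions.
        ce = F.cross_entropy(logits, targets)
        probs = F.softmax(logits, dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
        return ce + lam * entropy

    loss = min_ent_loss(torch.randn(16, 26), torch.randint(0, 26, (16,)))  # 26 EMNIST letters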

Authors:Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang
Title: MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Abstract:
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we introduce a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of our proposed modules. The code is available at https://github.com/rongfu-dsb/MPG-SAM2.
中文:提出的MPG-SAM 2框架通过多模态编码器实现语义对齐,并采用分层聚合器获取全局上下文,显著提升了离线参考视频对象分割的性能,在多个基准测试中表现优异。
English: The proposed MPG-SAM 2 framework enhances offline referring video object segmentation by integrating multimodal encoders for semantic alignment and a hierarchical aggregator for global context, achieving superior performance on benchmarks.

Authors:Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
Abstract:
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to consistently preserve subject identity across generations, as required for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.
Chinese Summary: 本文提出“1Prompt1Story”方法,通过合并提示词并采用奇异值重加权和身份保持交叉注意力技术,无需训练即可实现一致性文本到图像生成,在实验中优于现有方法。
English Summary: The paper introduces "1Prompt1Story," a training-free method that enhances consistent text-to-image generation by combining prompts and refining outputs with Singular-Value Reweighting and Identity-Preserving Cross-Attention, outperforming existing approaches.

Authors:Chenxu Wu, Qingpeng Kong, Zihang Jiang, S. Kevin Zhou
Title: Self-Supervised Diffusion MRI Denoising via Iterative and Stable Refinement
Abstract:
Magnetic Resonance Imaging (MRI), including diffusion MRI (dMRI), serves as a "microscope" for anatomical structures and routinely mitigates the influence of low signal-to-noise ratio scans by compromising temporal or spatial resolution. However, these compromises fail to meet clinical demands for both efficiency and precision. Consequently, denoising is a vital preprocessing step, particularly for dMRI, where clean data is unavailable. In this paper, we introduce Di-Fusion, a fully self-supervised denoising method that leverages the later diffusion steps and an adaptive sampling process. Unlike previous approaches, our single-stage framework achieves efficient and stable training without extra noise model training and offers adaptive and controllable results in the sampling process. Our thorough experiments on real and simulated data demonstrate that Di-Fusion achieves state-of-the-art performance in microstructure modeling, tractography tracking, and other downstream tasks. Code is available at https://github.com/FouierL/Di-Fusion.
中文: Di-Fusion是一种无需额外噪声训练的自监督去噪方法,通过自适应采样提升MRI效率与精度,在微观结构建模和纤维束追踪中表现卓越。
English: Di-Fusion is a self-supervised denoising method that improves MRI efficiency and precision without extra noise training, achieving top performance in microstructure modeling and tractography.

Authors:Xuerui Qiu, Malu Zhang, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Haizhou Li
Title: Quantized Spike-driven Transformer
Abstract:
Spiking neural networks are emerging as a promising energy-efficient alternative to traditional artificial neural networks due to their spike-driven paradigm. However, recent research in the SNN domain has mainly focused on enhancing accuracy by designing large-scale Transformer structures, which typically rely on substantial computational resources, limiting their deployment on resource-constrained devices. To overcome this challenge, we propose a quantized spike-driven Transformer baseline (QSD-Transformer), which achieves reduced resource demands by utilizing a low bit-width parameter. Regrettably, the QSD-Transformer often suffers from severe performance degradation. In this paper, we first conduct empirical analysis and find that the bimodal distribution of quantized spike-driven self-attention (Q-SDSA) leads to spike information distortion (SID) during quantization, causing significant performance degradation. To mitigate this issue, we take inspiration from mutual information entropy and propose a bi-level optimization strategy to rectify the information distribution in Q-SDSA. Specifically, at the lower level, we introduce an information-enhanced LIF to rectify the information distribution in Q-SDSA. At the upper level, we propose a fine-grained distillation scheme for the QSD-Transformer to align the distribution in Q-SDSA with that in the counterpart ANN. By integrating the bi-level optimization strategy, the QSD-Transformer can attain enhanced energy efficiency without sacrificing its high-performance advantage. For instance, when compared to the prior SNN benchmark on ImageNet, the QSD-Transformer achieves 80.3% top-1 accuracy, accompanied by significant reductions of 6.0$\times$ and 8.1$\times$ in power consumption and model size, respectively. Code is available at https://github.com/bollossom/QSD-Transformer.
中文: 本文提出量化脉冲驱动Transformer(QSD-Transformer),通过双级优化策略解决脉冲信息失真导致的性能下降问题,在ImageNet上实现了高精度与显著提升的能效。
English: This paper proposes a quantized spike-driven Transformer (QSD-Transformer) that addresses performance degradation caused by spike information distortion through a bi-level optimization strategy, achieving enhanced energy efficiency and high accuracy on ImageNet.

Authors:Yuliang Gu, Weilun Tsao, Bo Du, Thierry Géraud, Yongchao Xu
Title: Leveraging Textual Anatomical Knowledge for Class-Imbalanced Semi-Supervised Multi-Organ Segmentation
Abstract:
Annotating 3D medical images demands substantial time and expertise, driving the adoption of semi-supervised learning (SSL) for segmentation tasks. However, the complex anatomical structures of organs often lead to significant class imbalances, posing major challenges for deploying SSL in real-world scenarios. Despite the availability of valuable prior information, such as inter-organ relative positions and organ shape priors, existing SSL methods have yet to fully leverage these insights. To address this gap, we propose a novel approach that integrates textual anatomical knowledge (TAK) into the segmentation model. Specifically, we use GPT-4o to generate textual descriptions of anatomical priors, which are then encoded using a CLIP-based model. These encoded priors are injected into the segmentation model as parameters of the segmentation head. Additionally, contrastive learning is employed to enhance the alignment between textual priors and visual features. Extensive experiments demonstrate the superior performance of our method, significantly surpassing state-of-the-art approaches. The source code will be available at: https://github.com/Lunn88/TAK-Semi.
中文摘要:本研究提出了一种新颖的半监督学习方法,通过CLIP编码和对比学习将GPT-4o生成的文本解剖知识融入医学图像分割,显著超越了现有方法的性能表现。
English Summary: This study introduces a novel semi-supervised learning method that integrates GPT-4o-generated textual anatomical knowledge into medical image segmentation through CLIP encoding and contrastive learning, significantly outperforming existing approaches.
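One simple way to realize injecting encoded text priors as parameters of a segmentation head is to use one projected text embedding per organ class as the weights of a 1x1-convolution classifier, so class logits become dot products between pixel features and class text vectors. The sketch below shows this reading; the shapes, the absence of a bias term, and the direct use of raw CLIP-style embeddings are assumptions rather than the paper's exact design.

    import torch
    import torch.nn.functional as F

    def text_conditioned_seg_head(features, text_embeds):
        # features: (B, C, H, W) visual features; text_embeds: (K, C), one per organ class.
        weight = text_embeds.unsqueeze(-1).unsqueeze(-1)   # (K, C, 1, 1) 1x1-conv weights
        return F.conv2d(features, weight)                  # (B, K, H, W) class logits

    logits = text_conditioned_seg_head(torch.randn(2, 512, 32, 32), torch.randn(5, 512))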

Authors:Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu
Title: Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
Abstract:
Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory system to efficiently process and compress video features over extended sequences, enabling real-time, multi-turn dialogue. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications. Furthermore, we introduce StreamBench, a versatile benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios, including multi-turn interactions and complex reasoning tasks. Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. Code is available at https://github.com/hmxiong/StreamChat.
中文摘要:StreamChat是一种无需训练的框架,通过分层记忆系统和并行调度技术,实现了流媒体视频的实时多轮对话理解,在准确性和响应速度上显著优于现有先进模型。
English Summary: StreamChat is a training-free framework that uses a hierarchical memory system and parallel scheduling to enable real-time, multi-turn dialogue for streaming video understanding, outperforming existing models in accuracy and speed.

Authors:Samer Attrah
Title: Emotion estimation from video footage with LSTM
Abstract:
Emotion estimation is a long-studied field, and several machine learning approaches exist. In this paper, we present an LSTM model that processes the blend-shapes produced by the MediaPipe library for a face detected in a live camera stream to estimate the main emotion from facial expressions. The model is trained on the FER2013 dataset and achieves 71% accuracy and a 62% F1-score, meeting the accuracy benchmark of the FER2013 dataset with significantly reduced computation costs. Code is available at https://github.com/Samir-atra/Emotion_estimation_from_video_footage_with_LSTM_ML_algorithm
中文: 本文提出一种LSTM模型,通过处理实时视频中MediaPipe生成的面部混合形状来估算情绪,在FER2013数据集上达到71%准确率和62% F1分数,且显著降低了计算成本。
English: This paper introduces an LSTM model that analyzes MediaPipe-generated facial blend-shapes from live video to estimate emotions, achieving 71% accuracy and 62% F1-score on the FER2013 dataset with lower computational costs.
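A minimal PyTorch sketch of such a classifier is given below: an LSTM over per-frame blend-shape coefficients followed by a linear head. The 52-coefficient input and 7 emotion classes are assumptions for illustration, not the author's documented configuration.

    import torch
    import torch.nn as nn

    class BlendshapeLSTM(nn.Module):
        # Classify an emotion from a sequence of per-frame face blend-shape coefficients.
        def __init__(self, n_blendshapes=52, hidden=128, n_classes=7):
            super().__init__()
            self.lstm = nn.LSTM(n_blendshapes, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):              # x: (batch, frames, n_blendshapes)
            _, (h, _) = self.lstm(x)       # final hidden state summarizes the clip
            return self.head(h[-1])        # emotion logits

    logits = BlendshapeLSTM()(torch.randn(4, 30, 52))   # e.g. 30 frames per clip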

Authors:Jian Wang, Xiaokang Zhang, Xianping Ma, Weikang Yu, Pedram Ghamisi
Title: Auto-Prompting SAM for Weakly Supervised Landslide Extraction
Abstract:
Weakly supervised landslide extraction aims to identify landslide regions from remote sensing data using models trained with weak labels, particularly image-level labels. However, it is often challenged by the imprecise boundaries of the extracted objects due to the lack of pixel-wise supervision and the properties of landslide objects. To tackle these issues, we propose a simple yet effective method by auto-prompting the Segment Anything Model (SAM), i.e., APSAM. Instead of depending on high-quality class activation maps (CAMs) for pseudo-labeling or fine-tuning SAM, our method directly yields fine-grained segmentation masks from SAM inference through prompt engineering. Specifically, it adaptively generates hybrid prompts from the CAMs obtained by an object localization network. To provide sufficient information for SAM prompting, an adaptive prompt generation (APG) algorithm is designed to fully leverage the visual patterns of CAMs, enabling the efficient generation of pseudo-masks for landslide extraction. These informative prompts are able to identify the extent of landslide areas (box prompts) and denote the centers of landslide objects (point prompts), guiding SAM in landslide segmentation. Experimental results on high-resolution aerial and satellite datasets demonstrate the effectiveness of our method, achieving improvements of at least 3.0% in F1 score and 3.69% in IoU compared to other state-of-the-art methods. The source code and datasets will be available at https://github.com/zxk688.
中文: 本研究提出APSAM方法,通过从类别激活图中自适应生成混合提示来引导Segment Anything模型,有效提升了弱监督滑坡提取在遥感数据上的分割精度。
English: This study introduces APSAM, a novel method that enhances weakly supervised landslide extraction by adaptively generating hybrid prompts from class activation maps to guide the Segment Anything Model, achieving significant improvements in segmentation accuracy on remote sensing datasets.
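To make the prompt-engineering step concrete, here is a simplified sketch (an assumption, not the exact APG algorithm) of how box and point prompts could be derived from a class activation map before being passed to SAM; the 0.5 threshold and the use of connected components are illustrative choices.

    import numpy as np
    from scipy import ndimage

    def cam_to_prompts(cam, thresh=0.5):
        """Sketch: turn a CAM (H, W) into SAM-style box and point prompts."""
        mask = cam >= thresh * cam.max()           # keep strongly activated pixels
        labels, n = ndimage.label(mask)            # connected components = candidate regions
        boxes, points = [], []
        for region in range(1, n + 1):
            ys, xs = np.where(labels == region)
            boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])  # box prompt (x0, y0, x1, y1)
            cy, cx = ndimage.center_of_mass(cam, labels, region)    # CAM-weighted center
            points.append([cx, cy])                                  # point prompt (x, y)
        return np.array(boxes), np.array(points)

    boxes, points = cam_to_prompts(np.random.rand(256, 256))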

Authors:Jinghan You, Shanglin Li, Yuanrui Sun, Jiangchuan Wei, Mingyu Guo, Chao Feng, Jiao Ran
Title: LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition
Abstract:
Vision Transformers (ViTs) have revolutionized large-scale visual modeling, yet remain underexplored in face recognition (FR) where CNNs still dominate. We identify a critical bottleneck: CNN-inspired training paradigms fail to unlock ViT's potential, leading to suboptimal performance and convergence instability. To address this challenge, we propose LVFace, a ViT-based FR model that integrates Progressive Cluster Optimization (PCO) to achieve superior results. Specifically, PCO sequentially applies negative class sub-sampling (NCS) for robust and fast feature alignment from random initialization, feature expectation penalties for centroid stabilization, and cluster boundary refinement through full-batch training without NCS constraints. LVFace establishes a new state-of-the-art face recognition baseline, surpassing leading approaches such as UniFace and TopoFR across multiple benchmarks. Extensive experiments demonstrate that LVFace delivers consistent performance gains, while exhibiting scalability to large-scale datasets and compatibility with mainstream VLMs and LLMs. Notably, LVFace secured 1st place in the ICCV 2021 Masked Face Recognition (MFR)-Ongoing Challenge (March 2025), proving its efficacy in real-world scenarios. Project is available at https://github.com/bytedance/LVFace.
中文: LVFace是一种基于视觉变换器的人脸识别模型,通过渐进式聚类优化克服训练瓶颈,在多个基准测试中创下新纪录并赢得ICCV 2021 MFR挑战赛,展现了卓越性能和可扩展性。
English: LVFace, a Vision Transformer-based face recognition model with Progressive Cluster Optimization, overcomes training bottlenecks to set new benchmarks and win the ICCV 2021 MFR Challenge, demonstrating superior performance and scalability.

Authors:Gabrielle Hoyer, Michelle W Tong, Rupsa Bhattacharjee, Valentina Pedoia, Sharmila Majumdar
Title: Clinical Utility of Foundation Segmentation Models in Musculoskeletal MRI: Biomarker Fidelity and Predictive Outcomes
Abstract:
Effective segmentation is fundamental for quantitative medical imaging; however, foundation segmentation models remain insufficiently evaluated for accuracy and biomarker fidelity across the diverse anatomical contexts and imaging protocols encountered in musculoskeletal (MSK) MRI. We evaluate three widely used segmentation models (SAM, SAM2, MedSAM) across eleven MSK MRI datasets spanning the knee, hip, spine, shoulder, and thigh. Our framework assesses both zero-shot and finetuned performance, with attention to segmentation accuracy, generalizability across imaging protocols, and reliability of derived quantitative biomarkers. Finetuned models showed consistent agreement with expert measurements for biomarkers including cartilage thickness, disc height, muscle volume, and compositional T1rho/T2 values. Automated prompting through the AutoLabel system enabled scalable segmentation, with moderate trade-offs in accuracy. As proof of concept, we applied the validated system to (i) a three-stage knee MRI triage cascade and (ii) a longitudinal landmark model that predicts total knee replacement and incident osteoarthritis. The framework offers a transparent method for benchmarking segmentation tools and connecting model performance to clinical imaging priorities.
中文摘要:本研究在多种肌肉骨骼MRI数据集上评估三种分割模型,发现微调后的模型能获得可靠的生物标志物测量结果,并通过膝关节MRI分诊和骨关节炎预测系统验证了其临床应用价值。
English Summary: This study evaluates three segmentation models across diverse musculoskeletal MRI datasets, finding that fine-tuned models achieve reliable biomarker measurements and demonstrating their clinical applicability through knee MRI triage and osteoarthritis prediction systems.

Authors:Peirong Liu, Ana Lawry Aguila, Juan E. Iglesias
Title: Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization
Abstract:
Data-driven machine learning has made significant strides in medical image analysis. However, most existing methods are tailored to specific modalities and assume a particular resolution (often isotropic). This limits their generalizability in clinical settings, where variations in scan appearance arise from differences in sequence parameters, resolution, and orientation. Furthermore, most general-purpose models are designed for healthy subjects and suffer from performance degradation when pathology is present. We introduce UNA (Unraveling Normal Anatomy), the first modality-agnostic learning approach for normal brain anatomy reconstruction that can handle both healthy scans and cases with pathology. We propose a fluid-driven anomaly randomization method that generates an unlimited number of realistic pathology profiles on-the-fly. UNA is trained on a combination of synthetic and real data, and can be applied directly to real images with potential pathology without the need for fine-tuning. We demonstrate UNA's effectiveness in reconstructing healthy brain anatomy and showcase its direct application to anomaly detection, using both simulated and real images from 3D healthy and stroke datasets, including CT and MRI scans. By bridging the gap between healthy and diseased images, UNA enables the use of general-purpose models on diseased images, opening up new opportunities for large-scale analysis of uncurated clinical images in the presence of pathology. Code is available at https://github.com/peirong26/UNA.
中文: UNA提出了一种与模态无关的正常脑解剖重建方法,通过流体驱动的异常随机化技术,无需微调即可直接应用于包含病理的真实图像,有效处理健康和病变扫描。
English: UNA introduces a modality-agnostic approach for reconstructing normal brain anatomy, capable of handling both healthy and pathological scans through fluid-driven anomaly randomization and direct application to real images without fine-tuning.

Authors:Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Bowen Peng, Yafei Song, Xuying Xiong, Wei Yang, Tianpeng Liu, Zhen Liu, Xiang Li
Title: ATRNet-STAR: A Large Dataset and Benchmark Towards Remote Sensing Object Recognition in the Wild
Abstract:
The absence of publicly available, large-scale, high-quality datasets for Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has significantly hindered the application of rapidly advancing deep learning techniques, which hold huge potential to unlock new capabilities in this field. This is primarily because collecting large volumes of diverse target samples from SAR images is prohibitively expensive, largely due to privacy concerns, the characteristics of microwave radar imagery perception, and the need for specialized expertise in data annotation. Throughout the history of SAR ATR research, there have been only a handful of small datasets, mainly covering targets such as ships, airplanes, and buildings. The only vehicle dataset, MSTAR, was collected in the 1990s and has long been a valuable source for SAR ATR. To fill this gap, this paper introduces a new large-scale dataset named ATRNet-STAR with 40 different vehicle categories collected under various realistic imaging conditions and scenes. It marks a substantial advancement in dataset scale and diversity, comprising over 190,000 well-annotated samples, 10 times larger than its predecessor, the famous MSTAR. Building such a large dataset is a challenging task, and we detail the data collection scheme. Secondly, we illustrate the value of ATRNet-STAR by extensively evaluating the performance of 15 representative methods with 7 different experimental settings on challenging classification and detection benchmarks derived from the dataset. Finally, based on our extensive experiments, we identify valuable insights for SAR ATR and discuss potential future research directions in this field. We hope that the scale, diversity, and benchmark of ATRNet-STAR can significantly facilitate the advancement of SAR ATR.
中文: 针对合成孔径雷达自动目标识别领域缺乏大规模高质量数据集的问题,本文推出了ATRNet-STAR数据集,包含40类车辆目标和19万余标注样本,通过全面基准测试显著推动了该领域的发展。
English: The lack of large-scale, high-quality datasets has limited the application of deep learning in Synthetic Aperture Radar Automatic Target Recognition (SAR ATR), prompting the introduction of ATRNet-STAR, a new dataset with 40 vehicle categories and over 190,000 annotated samples, which significantly advances the field through extensive benchmark evaluations.

Authors:Daeun Jung, Jaehyeok Jang, Sooyoung Jang, Yu Rang Park
Title: MEDFORM: A Foundation Model for Contrastive Learning of CT Imaging and Clinical Numeric Data in Multi-Cancer Analysis
Abstract:
Computed tomography (CT) and clinical numeric data are essential modalities for cancer evaluation, but building large-scale multimodal training datasets for developing medical foundation models remains challenging due to the structural complexity of multi-slice CT data and the high cost of expert annotation. In this study, we propose MEDFORM, a multimodal pre-training strategy that guides CT image representation learning using complementary information from clinical data for medical foundation model development. MEDFORM efficiently processes CT slices through multiple instance learning (MIL) and adopts a dual pre-training strategy: first pretraining the CT slice feature extractor using SimCLR-based self-supervised learning, then aligning CT and clinical modalities through cross-modal contrastive learning. Our model was pre-trained on three different cancer types: lung cancer (141,171 slices), breast cancer (8,100 slices), and colorectal cancer (10,393 slices). The experimental results demonstrated that this dual pre-training strategy improves cancer classification performance and maintains robust performance in few-shot learning scenarios. Code is available at https://github.com/DigitalHealthcareLab/25MultiModalFoundationModel.git
中文:MEDFORM提出了一种多模态预训练策略,通过自监督和跨模态对比学习将CT影像与临床数据结合,有效提升了癌症分类和小样本学习的能力。
English: MEDFORM introduces a multimodal pre-training approach that combines CT imaging with clinical data using self-supervised and cross-modal contrastive learning to enhance cancer classification and few-shot learning performance.
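The second pre-training stage (cross-modal contrastive learning) can be sketched as a symmetric InfoNCE loss between paired CT-level embeddings (e.g., MIL-pooled slice features) and clinical-data embeddings; the embedding size and temperature below are assumptions, not MEDFORM's actual settings.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(ct_emb, clin_emb, temperature=0.07):
        """Symmetric InfoNCE between paired CT and clinical embeddings of shape (batch, dim)."""
        ct = F.normalize(ct_emb, dim=-1)
        clin = F.normalize(clin_emb, dim=-1)
        logits = ct @ clin.t() / temperature     # (batch, batch) similarity matrix
        targets = torch.arange(ct.size(0))       # matching pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))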

Authors:Adam Tupper, Christian Gagné
Title: Revisiting Data Augmentation for Ultrasound Images
Abstract:
Data augmentation is a widely used and effective technique to improve the generalization performance of deep neural networks. Yet, despite often facing limited data availability when working with medical images, it is frequently underutilized. This appears to come from a gap in our collective understanding of the efficacy of different augmentation techniques across different tasks and modalities. One modality where this is especially true is ultrasound imaging. This work addresses this gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To achieve this, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources and covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for tasks on natural images are also effective on ultrasound images, even more so than augmentations developed specifically for ultrasound images in some cases. We also show that diverse augmentation using TrivialAugment, which is widely used for natural images, is also effective for ultrasound images. Moreover, our proposed methodology represents a structured approach for assessing various data augmentations that can be applied to other contexts and modalities.
中文: 本研究评估了多种数据增强技术在超声图像分析中的效果,发现常用于自然图像的方法常优于超声专用增强技术,并提出了标准化基准以指导未来研究。
English: This study evaluates the effectiveness of various data augmentation techniques for ultrasound image analysis, revealing that methods commonly used for natural images often outperform ultrasound-specific augmentations, and introduces a standardized benchmark to guide future research.
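The reported transfer of TrivialAugment to ultrasound can be tried with the off-the-shelf torchvision implementation; the input size and the rest of the pipeline below are placeholders rather than the benchmark's actual settings.

    from torchvision import transforms

    # Sketch: a training transform for ultrasound images using TrivialAugment.
    train_transform = transforms.Compose([
        transforms.Resize((224, 224)),    # placeholder input size
        transforms.TrivialAugmentWide(),  # diverse augmentation originally tuned for natural images
        transforms.ToTensor(),
    ])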

Authors:Matthew Gwilliam, Han Cai, Di Wu, Abhinav Shrivastava, Zhiyu Cheng
Title: Accelerate High-Quality Diffusion Models with Inner Loop Feedback
Abstract:
We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models' inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward pass can be tempered with a learnable scaling factor initialized at zero. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone, training only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4 steps), our emphasis is on matching best-case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with diffusion transformer (DiT) and text-to-image generation with DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF's 1.7x-1.8x speedups is confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons. Project information is available at https://mgwillia.github.io/ilf.
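A minimal sketch of the feedback idea as described: a copy of a backbone block is reused as the feedback module, and its output is added through a zero-initialized learnable scale so that the diffusion forward pass is unchanged at the start of training. The wrapper below is an illustration, not the released implementation.

    import copy
    import torch
    import torch.nn as nn

    class InnerLoopFeedback(nn.Module):
        """Sketch: reuse a backbone block as a feedback module with a zero-init scale."""
        def __init__(self, backbone_block: nn.Module):
            super().__init__()
            self.feedback = copy.deepcopy(backbone_block)  # same architecture, settings copied
            self.scale = nn.Parameter(torch.zeros(1))      # zero init: no influence at first

        def forward(self, block_output):
            # Predict features for the next denoising step from the current block output.
            return block_output + self.scale * self.feedback(block_output)

    # Toy usage with a stand-in "block"; in practice this would be a diffusion transformer block.
    ilf = InnerLoopFeedback(nn.Linear(64, 64))
    out = ilf(torch.randn(2, 64))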

Authors:Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
Title: VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Abstract:
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) Vision Encoder Adaptation, which enables the vision encoder to accept images of variable resolutions as input; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) Video-centric Fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into vision tokens with corresponding numbers, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos will be more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
中文摘要:VideoLLaMA3提出了一种以视觉为中心的多模态基础模型,通过高质量图文数据和自适应视觉标记处理,在图像与视频理解任务中均表现出色。
English Summary: VideoLLaMA3 introduces a vision-centric multimodal foundation model that prioritizes high-quality image-text data and adaptive visual token processing to excel in both image and video understanding tasks.
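One way to read the similarity-based video token reduction is to drop tokens that are nearly identical to the token at the same spatial position in the previous frame; the cosine-similarity threshold and keep/drop rule below are illustrative assumptions, not VideoLLaMA3's exact procedure.

    import torch
    import torch.nn.functional as F

    def prune_similar_tokens(tokens, threshold=0.9):
        """Sketch: tokens is (frames, patches, dim); keep all frame-0 tokens and any later
        token whose cosine similarity to the same patch in the previous frame is low."""
        keep = [tokens[0]]
        for t in range(1, tokens.size(0)):
            sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)  # (patches,)
            keep.append(tokens[t][sim < threshold])                       # drop redundant patches
        return torch.cat(keep, dim=0)                                     # compact token sequence

    compact = prune_similar_tokens(torch.randn(8, 196, 1024))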

Authors:Jiachen Lei, Julius Berner, Jiongxiao Wang, Zhongzhu Chen, Zhongjia Ba, Kui Ren, Jun Zhu, Anima Anandkumar
Title: Robust Representation Consistency Model via Contrastive Denoising
Abstract:
Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples before making predictions with a standard classifier. While these methods excel at small perturbation radii, they struggle with larger perturbations and incur a significant computational overhead during inference compared to classical methods. To address this, we reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space. Specifically, we use instance discrimination to achieve consistent representations along the trajectories by aligning temporally adjacent points. After fine-tuning based on the learned representations, our model enables implicit denoising-then-classification via a single prediction, substantially reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with minimal computation budget during inference. For example, our method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 5.3% on average, with up to 11.6% at larger radii, while reducing inference costs by 85x on average. Codes are available at: https://github.com/jiachenlei/rRCM.
中文: 本研究提出了一种新方法,将生成式扩散模型重构为潜在空间中的判别任务,显著提升了对抗扰动的鲁棒性,同时将推理成本降低85倍,并在多个数据集上实现了最先进的认证精度。
English: This study introduces a novel method that reformulates generative diffusion modeling as a discriminative task in latent space, significantly enhancing robustness against adversarial perturbations while reducing inference costs by 85 times and achieving state-of-the-art certified accuracy across various datasets.

Authors:Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo
Title: LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
Abstract:
In commonly used sub-quadratic complexity modules, linear attention benefits from simplicity and high parallelism, making it promising for image synthesis tasks. However, the architectural design and learning strategy for linear attention remain underexplored in this field. In this paper, we offer a suite of ready-to-use solutions for efficient linear diffusion Transformers. Our core contributions include: (1) Simplified Linear Attention using few heads, observing the free-lunch effect of performance without latency increase. (2) Weight inheritance from a fully pre-trained diffusion Transformer: initializing the linear Transformer from a pre-trained diffusion Transformer and loading all parameters except those related to linear attention. (3) Hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to help the training of the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), an efficient text-to-image Transformer that can be deployed offline on a laptop. Experiments show that on the class-conditional 256x256 and 512x512 ImageNet benchmarks, LiT achieves highly competitive FID while reducing training steps by 80% and 77% compared to DiT. LiT also rivals methods based on Mamba or Gated Linear Attention. In addition, for text-to-image generation, LiT allows for the rapid synthesis of photorealistic images at up to 1K resolution. Project page: https://techmonsterwang.github.io/LiT/.
中文: 本文提出了线性扩散变换器(LiT),通过简化线性注意力、权重继承和混合知识蒸馏三项策略将预训练DiT简化为高效模型,仅需少量训练步骤即可实现相当的图像生成性能。
English: This paper introduces the Linear Diffusion Transformer (LiT), which streamlines DiT via simplified linear attention, weight inheritance from a pre-trained model, and hybrid knowledge distillation, achieving comparable image generation quality with significantly fewer training steps.
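Contribution (1), simplified linear attention with few heads, can be sketched as below; the elu+1 feature map and the two-head setting are common linear-attention choices assumed here, not necessarily LiT's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimplifiedLinearAttention(nn.Module):
        """Sketch: linear attention with a small head count, O(N) in token count."""
        def __init__(self, dim, heads=2):
            super().__init__()
            self.heads = heads
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                       # x: (batch, tokens, dim)
            b, n, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
            q, k = F.elu(q) + 1, F.elu(k) + 1       # positive feature map
            kv = torch.einsum('bhnd,bhne->bhde', k, v)                 # summarize keys/values
            z = 1 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)
            out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)
            return self.proj(out.transpose(1, 2).reshape(b, n, d))

    y = SimplifiedLinearAttention(64)(torch.randn(2, 256, 64))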

Authors:Ruisi Zhao, Zechuan Zhang, Zongxin Yang, Yi Yang
Title: 3D Object Manipulation in a Single Image using Generative Models
Abstract:
Object manipulation in images aims to not only edit the object's presentation but also endow objects with motion. Previous methods encountered challenges in concurrently handling static editing and dynamic generation, while also struggling to achieve fidelity in object appearance and scene lighting. In this work, we introduce OMG3D, a novel framework that integrates precise geometric control with the generative power of diffusion models, thus achieving significant enhancements in visual performance. Our framework first converts 2D objects into 3D, enabling user-directed modifications and lifelike motions at the geometric level. To address texture realism, we propose CustomRefiner, a texture refinement module that pre-trains a customized diffusion model to align the details and style of coarse renderings of the rough 3D model with the original image, further refining the texture. Additionally, we introduce IllumiCombiner, a lighting processing module that estimates and corrects background lighting to match human visual perception, resulting in more realistic shadow effects. Extensive experiments demonstrate the outstanding visual performance of our approach in both static and dynamic scenarios. Remarkably, all these steps can be done on a single NVIDIA 3090. Project page is at https://whalesong-zrs.github.io/OMG3D-projectpage/
中文: OMG3D框架通过将3D几何控制与扩散模型相结合,实现了图像中物体的精确编辑和生动运动,其定制优化器和光照合成器模块分别提升了纹理真实感与光影效果,在静态和动态场景中均展现出卓越的视觉表现。
English: The OMG3D framework enhances image object manipulation by integrating 3D geometric control with diffusion models, enabling precise editing and realistic motion, while its CustomRefiner and IllumiCombiner modules improve texture and lighting for superior visual fidelity in both static and dynamic scenarios.

Authors:Kaiyu Li, Xiangyong Cao, Yupeng Deng, Chao Pang, Zepeng Xin, Deyu Meng, Zhi Wang
Title: DynamicEarth: How Far are We from Open-Vocabulary Change Detection?
Abstract:
Monitoring Earth's evolving land covers requires methods capable of detecting changes across a wide range of categories and contexts. Existing change detection methods are hindered by their dependency on predefined classes, reducing their effectiveness in open-world applications. To address this issue, we introduce open-vocabulary change detection (OVCD), a novel task that bridges vision and language to detect changes across any category. Considering the lack of high-quality data and annotation, we propose two training-free frameworks, M-C-I and I-M-C, which leverage and integrate off-the-shelf foundation models for the OVCD task. The insight behind the M-C-I framework is to discover all potential changes and then classify these changes, while the insight of the I-M-C framework is to identify all targets of interest and then determine whether their states have changed. Based on these two frameworks, we instantiate several methods, e.g., SAM-DINOv2-SegEarth-OV and Grounding-DINO-SAM2-DINO. Extensive evaluations on 5 benchmark datasets demonstrate the superior generalization and robustness of our OVCD methods over existing supervised and unsupervised methods. To support continued exploration, we release DynamicEarth, a dedicated codebase designed to advance research and application of OVCD. https://likyoo.github.io/DynamicEarth
中文: 为解决传统变化检测方法依赖预定义类别的问题,本研究提出开放词汇变化检测(OVCD)任务,并开发了无需训练的M-C-I和I-M-C框架,通过整合基础模型在多个基准测试中展现出卓越的泛化能力和鲁棒性。
English: To overcome the limitations of predefined classes in change detection, this study introduces open-vocabulary change detection (OVCD) and proposes two training-free frameworks, M-C-I and I-M-C, which integrate foundation models to achieve superior generalization and robustness across diverse benchmarks.

Authors:Ruicheng Zhang, Haowei Guo, Zeyu Zhang, Puxin Yan, Shen Zhao
Title: GAMED-Snake: Gradient-aware Adaptive Momentum Evolution Deep Snake Model for Multi-organ Segmentation
Abstract:
Multi-organ segmentation is a critical yet challenging task due to complex anatomical backgrounds, blurred boundaries, and diverse morphologies. This study introduces the Gradient-aware Adaptive Momentum Evolution Deep Snake (GAMED-Snake) model, which establishes a novel paradigm for contour-based segmentation by integrating gradient-based learning with adaptive momentum evolution mechanisms. The GAMED-Snake model incorporates three major innovations: First, the Distance Energy Map Prior (DEMP) generates a pixel-level force field that effectively attracts contour points towards the true boundaries, even in scenarios with complex backgrounds and blurred edges. Second, the Differential Convolution Inception Module (DCIM) precisely extracts comprehensive energy gradients, significantly enhancing segmentation accuracy. Third, the Adaptive Momentum Evolution Mechanism (AMEM) employs cross-attention to establish dynamic features across different iterations of evolution, enabling precise boundary alignment for diverse morphologies. Experimental results on four challenging multi-organ segmentation datasets demonstrate that GAMED-Snake improves the mDice metric by approximately 2% compared to state-of-the-art methods. Code will be available at https://github.com/SYSUzrc/GAMED-Snake.
Chinese: 本研究提出的GAMED-Snake模型通过融合梯度学习和自适应动量演化机制,改进了多器官分割中的边界对齐和精度,在四个数据集上的实验表明其mDice指标比现有最优方法提升约2%。
English: This study introduces the GAMED-Snake model, which integrates gradient-based learning with adaptive momentum evolution to enhance multi-organ segmentation by improving boundary alignment and accuracy, achieving a 2% mDice improvement over state-of-the-art methods.

Authors:Ruicheng Zhang, Kanghui Tian, Zeyu Zhang, Qixiang Liu, Zhi Jin
Title: FDG-Diff: Frequency-Domain-Guided Diffusion Framework for Compressed Hazy Image Restoration
Abstract:
In this study, we reveal that the interaction between haze degradation and JPEG compression introduces complex joint loss effects, which significantly complicate image restoration. Existing dehazing models often neglect compression effects, which limits their effectiveness in practical applications. To address these challenges, we introduce three key contributions. First, we design FDG-Diff, a novel frequency-domain-guided dehazing framework that improves JPEG image restoration by leveraging frequency-domain information. Second, we introduce the High-Frequency Compensation Module (HFCM), which enhances spatial-domain detail restoration by incorporating frequency-domain augmentation techniques into a diffusion-based restoration framework. Lastly, the introduction of the Degradation-Aware Denoising Timestep Predictor (DADTP) module further enhances restoration quality by enabling adaptive region-specific restoration, effectively addressing regional degradation inconsistencies in compressed hazy images. Experimental results across multiple compressed dehazing datasets demonstrate that our method consistently outperforms the latest state-of-the-art approaches. Code is available at https://github.com/SYSUzrc/FDG-Diff.
中文摘要:本研究提出FDG-Diff框架,通过频域引导的去雾方法和创新模块有效解决雾霾与JPEG压缩的共同退化问题,在图像恢复效果上超越现有先进方法。
English Summary: This study introduces FDG-Diff, a frequency-domain-guided dehazing framework that effectively addresses the joint degradation effects of haze and JPEG compression through innovative modules for enhanced image restoration, outperforming existing methods.

Authors:Haocheng Luo, Tuan Truong, Tung Pham, Mehrtash Harandi, Dinh Phung, Trung Le
Title: Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization
Abstract:
Sharpness-Aware Minimization (SAM) has attracted significant attention for its effectiveness in improving generalization across various tasks. However, its underlying principles remain poorly understood. In this work, we analyze SAM's training dynamics using the maximum eigenvalue of the Hessian as a measure of sharpness, and propose a third-order stochastic differential equation (SDE), which reveals that the dynamics are driven by a complex mixture of second- and third-order terms. We show that alignment between the perturbation vector and the top eigenvector is crucial for SAM's effectiveness in regularizing sharpness, but find that this alignment is often inadequate in practice, limiting SAM's efficiency. Building on these insights, we introduce Eigen-SAM, an algorithm that explicitly aims to regularize the top Hessian eigenvalue by aligning the perturbation vector with the leading eigenvector. We validate the effectiveness of our theory and the practical advantages of our proposed approach through comprehensive experiments. Code is available at https://github.com/RitianLuo/EigenSAM.
中文: 本研究分析了锐度感知最小化(SAM)方法,发现其有效性依赖于扰动向量与海森矩阵主特征向量的对齐,据此提出了Eigen-SAM算法,通过显式优化这种对齐来提升正则化效果和计算效率。
English: This study analyzes Sharpness-Aware Minimization (SAM) and reveals that its effectiveness depends on aligning the perturbation vector with the Hessian's top eigenvector, leading to the development of Eigen-SAM, which explicitly optimizes this alignment to improve regularization and efficiency.
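The key quantity in Eigen-SAM, the top Hessian eigenvector, can be estimated with power iteration over Hessian-vector products computed by autograd, as sketched below; mixing this direction into the SAM perturbation is omitted and the iteration count is arbitrary, so this illustrates the ingredient rather than the authors' exact algorithm.

    import torch

    def top_hessian_eigenvector(loss, params, iters=10):
        """Sketch: power iteration with Hessian-vector products via autograd."""
        grads = torch.autograd.grad(loss, params, create_graph=True)
        v = [torch.randn_like(p) for p in params]
        for _ in range(iters):
            norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
            v = [vi / norm for vi in v]
            gv = sum((g * vi).sum() for g, vi in zip(grads, v))        # gradient dot v
            hv = torch.autograd.grad(gv, params, retain_graph=True)    # Hessian-vector product
            v = [h.detach() for h in hv]
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        return [vi / norm for vi in v]   # approximate leading eigenvector of the Hessian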

Authors:Jingwei Yi, Junhao Yin, Ju Xu, Peng Bao, Yongliang Wang, Wei Fan, Hao Wang
Title: ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.
中文: 本文提出了情境图像引用能力,使视觉语言模型能够基于对话上下文恰当引用检索文档中的相关图像,并通过指令微调方法ImageRef-VL显著提升了开源模型在此任务上的性能表现。
English: This paper introduces Contextual Image Reference, a novel capability for Vision-Language Models to appropriately reference relevant images from retrieval documents based on conversation context, and presents ImageRef-VL, a method that significantly enhances open-source VLMs' performance in this task through instruction fine-tuning.

Authors:Chao Feng, Ziyang Chen, Aleksander Holynski, Alexei A. Efros, Andrew Owens
Title: GPS as a Control Signal for Image Generation
Abstract:
We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
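The abstract does not say how GPS coordinates are injected; a common way to condition a generator on coordinates is to map latitude/longitude to Fourier features and project them into the conditioning space, as in this hedged sketch (all sizes and the input normalization are assumptions, not the paper's design).

    import math
    import torch
    import torch.nn as nn

    class GPSEmbedding(nn.Module):
        """Sketch: Fourier-feature embedding of (lat, lon) for conditioning a diffusion model."""
        def __init__(self, n_freqs=16, out_dim=768):
            super().__init__()
            self.register_buffer('freqs', 2.0 ** torch.arange(n_freqs) * math.pi)
            self.proj = nn.Linear(2 * 2 * n_freqs, out_dim)

        def forward(self, latlon):                   # latlon: (batch, 2), normalized to [-1, 1]
            angles = latlon[..., None] * self.freqs  # (batch, 2, n_freqs)
            feats = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
            return self.proj(feats)                  # (batch, out_dim), e.g., appended to text tokens

    emb = GPSEmbedding()(torch.tensor([[0.71, -0.43]]))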

Authors:Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, Limin Wang
Title: InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Abstract:
This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version, InternVideo2.5, with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLMs on mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original) and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLMs' innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
中文: 本文提出InternVideo2.5模型,通过长视频上下文建模结合自适应令牌压缩和密集标注技术,显著提升了视频多模态大语言模型对细节的感知能力和长时序理解能力。
English: This paper introduces InternVideo2.5, which enhances video multimodal large language models by incorporating long and rich context modeling to improve fine-grained perception and long-term temporal understanding through adaptive token compression and dense annotations.

Authors:Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu
Title: Parallel Sequence Modeling via Generalized Spatial Propagation Network
Abstract:
We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to sqrt(N) for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over 84x when generating 16K images.

Authors:Vishagar Arunan, Saeedha Nazar, Hashiru Pramuditha, Vinasirajan Viruthshaan, Sameera Ramasinghe, Simon Lucey, Ranga Rodrigo
Title: DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions
Abstract:
Splatting-based 3D reconstruction methods have gained popularity with the advent of 3D Gaussian Splatting, efficiently synthesizing high-quality novel views. These methods commonly resort to using exponential family functions, such as the Gaussian function, as reconstruction kernels due to their anisotropic nature, ease of projection, and differentiability in rasterization. However, the field remains restricted to variations within the exponential family, leaving generalized reconstruction kernels largely underexplored, partly due to the lack of easy integrability in 3D to 2D projections. In this light, we show that a class of decaying anisotropic radial basis functions (DARBFs), which are non-negative functions of the Mahalanobis distance, supports splatting by approximating the Gaussian function's closed-form integration advantage. With this fresh perspective, we demonstrate up to 34% faster convergence during training and a 45% reduction in memory consumption across various DARB reconstruction kernels, while maintaining comparable PSNR, SSIM, and LPIPS results. We will make the code available.
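The core object here, a decaying radial basis function of the Mahalanobis distance to an anisotropic splat, is easy to write down; the sketch below compares the standard Gaussian kernel with one alternative decaying kernel (inverse multiquadric), which is an illustrative choice and not necessarily among the kernels the paper evaluates.

    import numpy as np

    def mahalanobis_sq(x, mu, cov_inv):
        """Squared Mahalanobis distance of points x (N, 3) to a splat center mu."""
        d = x - mu
        return np.einsum('ni,ij,nj->n', d, cov_inv, d)

    def gaussian_kernel(m2):
        return np.exp(-0.5 * m2)                # the usual 3DGS reconstruction kernel

    def inverse_multiquadric_kernel(m2, beta=1.0):
        return 1.0 / np.sqrt(1.0 + beta * m2)   # another non-negative, decaying anisotropic RBF

    x = np.random.randn(1000, 3)
    cov_inv = np.linalg.inv(np.diag([0.5, 1.0, 2.0]))  # anisotropic covariance
    m2 = mahalanobis_sq(x, np.zeros(3), cov_inv)
    w_gauss, w_imq = gaussian_kernel(m2), inverse_multiquadric_kernel(m2)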

Authors:Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
Title: InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Abstract:
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer/tree/main/InternLM-XComposer-2.5-Reward
Chinese: 针对大型视觉语言模型公开多模态奖励模型稀缺的问题,我们提出了InternLM-XComposer2.5-Reward这一简单高效的多模态奖励模型,通过在多样化多模态偏好数据上训练,该模型在基准测试中表现优异,并成功应用于强化学习训练和响应选择等关键场景。
English: To address the scarcity of public multi-modal reward models for Large Vision Language Models (LVLMs), we introduce InternLM-XComposer2.5-Reward, a simple yet effective model trained on a diverse multi-modal preference corpus, which achieves top performance on benchmarks and enables key applications like RL training and response selection.
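Application (2), test-time scaling by picking the best candidate, reduces to best-of-N selection with the reward model as the scorer; in the sketch below, policy and reward_model are hypothetical callables standing in for the chat model and IXC-2.5-Reward, not the released API.

    def best_of_n(policy, reward_model, image, prompt, n=8):
        """Sketch of reward-guided test-time scaling: sample n responses, keep the best-scored one.
        `policy` and `reward_model` are hypothetical callables, not the released interfaces."""
        candidates = [policy(image, prompt) for _ in range(n)]
        scores = [reward_model(image, prompt, c) for c in candidates]
        best = max(range(n), key=lambda i: scores[i])
        return candidates[best], scores[best]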

Authors:Xianwei Zhuang, Yuxin Xie, Yufan Deng, Liming Liang, Jinghan Ru, Yuguo Yin, Yuexian Zou
Title: VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Abstract:
We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT innovatively extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while seamlessly accommodating mixed-modal input and output within a single model framework. Our VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. The unified training stages are designed to achieve alignment between visual and textual features, enhance instruction following for both understanding and generation, and improve visual generation quality, respectively. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports capabilities in autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks. Project page is at: https://vargpt-1.github.io/

Authors:Jiacheng Zuo, Haibo Hu, Zikang Zhou, Yufei Cui, Ziquan Liu, Jianping Wang, Nan Guan, Jin Wang, Chun Jason Xue
Title: RALAD: Bridging the Real-to-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning
Abstract:
In the pursuit of robust autonomous driving systems, models trained on real-world datasets often struggle to adapt to new environments, particularly when confronted with corner cases such as extreme weather conditions. Collecting these corner cases in the real world is non-trivial, which necessitates the use of simulators for validation. However, the high computational cost and the domain gap in data distribution have hindered the seamless transition between real and simulated driving scenarios. To tackle this challenge, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap at a low cost. RALAD features three primary designs, including (1) domain adaptation via an enhanced Optimal Transport (OT) method that accounts for both individual and grouped image distances, (2) a simple and unified framework that can be applied to various models, and (3) efficient fine-tuning techniques that freeze the computationally expensive layers while maintaining robustness. Experimental results demonstrate that RALAD compensates for the performance degradation in simulated environments while maintaining accuracy in real-world scenarios across three different models. Taking Cross View as an example, the mIOU and mAP metrics in real-world scenarios remain stable before and after RALAD fine-tuning, while in simulated environments, the mIOU and mAP metrics are improved by 10.30% and 12.29%, respectively. Moreover, the re-training cost of our approach is reduced by approximately 88.1%. Our code is available at https://github.com/JiachengZuo/RALAD.git.
中文:RALAD框架通过增强的领域适应和高效微调技术,有效弥合自动驾驶中真实与模拟环境的差距,在保持现实场景精度的同时显著提升模拟环境性能,并将重新训练成本降低约88.1%。
English: The proposed RALAD framework effectively bridges the real-to-sim gap in autonomous driving by employing enhanced domain adaptation and efficient fine-tuning, significantly improving performance in simulated environments while maintaining real-world accuracy and reducing retraining costs by 88.1%.

Authors:Wenxin Ma, Qingsong Yao, Xiang Zhang, Zhelong Huang, Zihang Jiang, S. Kevin Zhou
Title: Towards Accurate Unified Anomaly Segmentation
Abstract:
Unsupervised anomaly detection (UAD) from images strives to model normal data distributions, creating discriminative representations to distinguish and precisely localize anomalies. Despite recent advancements in the efficient and unified one-for-all scheme, challenges persist in accurately segmenting anomalies for further monitoring. Moreover, this problem is obscured by the widely-used AUROC metric under imbalanced UAD settings. This motivates us to emphasize the significance of precise segmentation of anomaly pixels using pAP and DSC as metrics. To address the unsolved segmentation task, we introduce the Unified Anomaly Segmentation (UniAS). UniAS presents a multi-level hybrid pipeline that progressively enhances normal information from coarse to fine, incorporating a novel multi-granularity gated CNN (MGG-CNN) into Transformer layers to explicitly aggregate local details from different granularities. UniAS achieves state-of-the-art anomaly segmentation performance, attaining 65.12/59.33 and 40.06/32.50 in pAP/DSC on the MVTec-AD and VisA datasets, respectively, surpassing previous methods significantly. The codes are shared at https://github.com/Mwxinnn/UniAS.
Chinese: 本文提出统一异常分割(UniAS),这是一种多层级混合框架,通过将新型多粒度门控CNN与Transformer层结合,从粗到细逐步增强正常信息,在MVTec-AD和VisA数据集上分别以65.12/59.33和40.06/32.50的pAP/DSC指标实现了最先进的异常分割性能。
English: The paper introduces Unified Anomaly Segmentation (UniAS), a multi-level hybrid pipeline that enhances normal information from coarse to fine using a novel multi-granularity gated CNN integrated with Transformer layers, achieving state-of-the-art performance in anomaly segmentation on MVTec-AD and VisA datasets with pAP/DSC scores of 65.12/59.33 and 40.06/32.50 respectively.

Authors:Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves
Title: CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Abstract:
The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.
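The two-stage prompting strategy can be sketched as follows; lvlm and retrieve_examples are hypothetical callables (LVLM inference and in-context example retrieval), and the prompt wording is illustrative rather than the paper's templates.

    def cbvlm_predict(lvlm, retrieve_examples, image, concepts, classes):
        """Sketch of the two-stage CBVLM strategy with hypothetical helper callables."""
        # Stage 1: ask the LVLM whether each human-interpretable concept is present.
        concept_preds = {}
        for concept in concepts:
            demos = retrieve_examples(image, concept)   # few-shot in-context examples
            prompt = f"Is the concept '{concept}' present in this image? Answer yes or no."
            concept_preds[concept] = lvlm(image, prompt, demos).strip().lower() == "yes"

        # Stage 2: classify the image grounded on the predicted concepts.
        present = [c for c, p in concept_preds.items() if p]
        demos = retrieve_examples(image, "diagnosis")
        prompt = (f"The image shows these concepts: {', '.join(present) or 'none'}. "
                  f"Which class best describes it: {', '.join(classes)}?")
        return lvlm(image, prompt, demos), concept_preds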

Authors:Yihang Chen, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, Jianfei Cai
Title: HAC++: Towards 100X Compression of 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To achieve a compact size, we propose HAC++, which leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100X compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20X size reduction compared to Scaffold-GS. Our code is available at https://github.com/YihangChen-ee/HAC-plus.
Chinese: HAC++ 是一种针对 3D 高斯泼溅的新型压缩方法,利用锚点与哈希网格间的互信息并捕捉锚点内部关联,相比原始 3DGS 实现了超过 100 倍的体积压缩且保真度更高。
English: HAC++ is a novel compression method for 3D Gaussian Splatting that leverages mutual information between anchors and a hash grid while capturing intra-anchor relationships, achieving over 100x size reduction with improved fidelity compared to vanilla 3DGS.
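The entropy-coding ingredient named in the abstract, estimating the probability of each quantized attribute with a Gaussian model, can be sketched as below; the context-predicted mean/scale and the quantization step are placeholders, so this shows the general technique rather than HAC++'s exact model.

    import torch
    from torch.distributions import Normal

    def estimated_bits(attr, mu, sigma, step=1.0):
        """Sketch: bit cost of entropy-coding attributes quantized with bin width `step`,
        under a per-attribute Gaussian model (mu, sigma) predicted from context."""
        q = torch.round(attr / step) * step                    # quantized attribute values
        dist = Normal(mu, sigma.clamp(min=1e-6))
        p = dist.cdf(q + step / 2) - dist.cdf(q - step / 2)    # probability mass of each bin
        return -torch.log2(p.clamp(min=1e-9)).sum()            # total estimated bits

    bits = estimated_bits(torch.randn(1000), torch.zeros(1000), torch.ones(1000))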

Authors:Junyu Xia, Jiesong Bai, Yihang Dong
Title: DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains
Abstract:
Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortions. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Traditional enhancement techniques, such as multi-scale fusion and histogram equalization, fail to preserve fine details and often struggle with maintaining the natural appearance of enhanced images under complex lighting conditions. Although the Retinex theory provides a foundation for image decomposition, it often amplifies noise, leading to suboptimal image quality. In this paper, we propose the Dual Light Enhance Network (DLEN), a novel architecture that incorporates two distinct attention mechanisms, considering both spatial and frequency domains. Our model introduces a learnable wavelet transform module in the illumination estimation phase, preserving high- and low-frequency components to enhance edge and texture details. Additionally, we design a dual-branch structure that leverages the power of the Transformer architecture to enhance both the illumination and structural components of the image. Through extensive experiments, our model outperforms state-of-the-art methods on standard benchmarks. Code is available here: https://github.com/LaLaLoXX/DLEN
Chinese: 本文提出的双重光增强网络(DLEN)融合了空间与频域注意力机制,采用可学习小波变换和双分支Transformer结构,通过优化光照并保留细节来有效增强低光图像,其性能超越了现有先进方法。
English: The paper introduces the Dual Light Enhance Network (DLEN), which integrates spatial and frequency domain attention with a learnable wavelet transform and dual-branch Transformer structure to effectively enhance low-light images by improving illumination and preserving details, outperforming existing methods.

Authors:Inès Hyeonsu Kim, Seokju Cho, Jiahui Huang, Jung Yi, Joon-Young Lee, Seungryong Kim
Title: Exploring Temporally-Aware Features for Point Tracking
Abstract:
Point tracking in videos is a fundamental task with applications in robotics, video editing, and more. While many vision tasks benefit from pre-trained feature backbones to improve generalizability, point tracking has primarily relied on simpler backbones trained from scratch on synthetic data, which may limit robustness in real-world scenarios. Additionally, point tracking requires temporal awareness to ensure coherence across frames, but using temporally-aware features is still underexplored. Most current methods employ a two-stage process: an initial coarse prediction followed by a refinement stage to inject temporal information and correct errors from the coarse stage. This approach, however, is computationally expensive and potentially redundant if the feature backbone itself captures sufficient temporal information. In this work, we introduce Chrono, a feature backbone specifically designed for point tracking with built-in temporal awareness. Leveraging pre-trained representations from the self-supervised learner DINOv2 and enhanced with a temporal adapter, Chrono effectively captures long-term temporal context, enabling precise prediction even without the refinement stage. Experimental results demonstrate that Chrono achieves state-of-the-art performance in a refiner-free setting on the TAP-Vid-DAVIS and TAP-Vid-Kinetics datasets, among common feature backbones used in point tracking as well as DINOv2, with exceptional efficiency. Project page: https://cvlab-kaist.github.io/Chrono/

Authors:Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas
Title: PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models often generate descriptions containing objects or details that are absent in the input image, a phenomenon commonly known as hallucination. Our work investigates the key reasons behind this issue by analyzing the pattern of self-attention in transformer layers. We find that hallucinations often arise from the progressive weakening of attention weight to visual tokens in the deeper layers of the LLM. Some previous works naively boost the attention of all visual tokens to mitigate this issue, resulting in suboptimal hallucination reduction. To address this, we identify two critical sets of visual tokens that facilitate the transfer of visual information from the vision encoder to the LLM. Local tokens encode grounded information about objects present in an image, while summary tokens capture the overall aggregated representation of the image. Importantly, these two sets of tokens require different levels of weight enhancement. To this end, we propose PAINT (Paying Attention to INformed Tokens), a plug-and-play framework that intervenes in the self-attention mechanism of the LLM, selectively boosting the attention weights of local and summary tokens with experimentally learned margins. Evaluation on the MSCOCO image captioning dataset demonstrates that our approach reduces hallucination rates by up to 62.3% compared to baseline models while maintaining accuracy. Code is available at https://github.com/hasanar1f/PAINT
中文:PAINT框架通过选择性增强局部和摘要视觉标记的注意力权重,将LVLM的幻觉率降低高达62.3%,同时保持准确性。
English: The PAINT framework selectively enhances attention weights for local and summary visual tokens in LVLMs to reduce hallucinations by up to 62.3% while preserving accuracy.
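The intervention itself is simple to sketch: scale up the post-softmax attention mass on the chosen local and summary visual tokens, then renormalize each row. The margins and index ranges below are placeholders for the experimentally learned values the abstract mentions.

    import torch

    def boost_attention(attn, local_idx, summary_idx, local_margin=0.1, summary_margin=0.2):
        """Sketch: attn is (heads, queries, keys) post-softmax attention; selectively
        boost the columns of informed visual tokens, then renormalize each row."""
        boosted = attn.clone()
        boosted[..., local_idx] = boosted[..., local_idx] * (1 + local_margin)
        boosted[..., summary_idx] = boosted[..., summary_idx] * (1 + summary_margin)
        return boosted / boosted.sum(dim=-1, keepdim=True)

    attn = torch.softmax(torch.randn(16, 64, 128), dim=-1)
    new_attn = boost_attention(attn, local_idx=torch.arange(5, 30), summary_idx=torch.tensor([0, 1]))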

Authors:Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, Song Zhang, Yang Liu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo
Title: Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Abstract:
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including both open-source and closed-source models, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2
中文摘要:Hunyuan3D 2.0是一个先进的3D合成系统,包含形状生成和纹理合成两大基础模型,在几何细节与纹理质量上超越现有技术,已开源以填补大规模3D生成模型的空白。
English Summary: Hunyuan3D 2.0 is an advanced 3D synthesis system featuring a shape generation model and texture synthesis model that outperforms existing models in geometry and texture quality, released publicly to advance open-source 3D generative models.

Authors:Sean Man, Guy Ohayon, Ron Raphaeli, Michael Elad
Title: Proxies for Distortion and Consistency with Applications for Real-World Image Restoration
Abstract:
Real-world image restoration deals with the recovery of images suffering from an unknown degradation. This task is typically addressed given only degraded images, without their corresponding ground-truth versions. In this hard setting, designing and evaluating restoration algorithms becomes highly challenging. This paper offers a suite of tools that can serve both the design and assessment of real-world image restoration algorithms. Our work starts by proposing a trained model that predicts the chain of degradations a given real-world measured input has gone through. We show how this estimator can be used to approximate the consistency -- the match between the measurements and any proposed recovered image. We also use this estimator as a guiding force for the design of a simple and highly-effective plug-and-play real-world image restoration algorithm, leveraging a pre-trained diffusion-based image prior. Furthermore, this work proposes no-reference proxy measures of MSE and LPIPS, which, without access to the ground-truth images, allow ranking of real-world image restoration algorithms according to their (approximate) MSE and LPIPS. The proposed suite provides a versatile, first-of-its-kind framework for evaluating and comparing blind image restoration algorithms in real-world scenarios.
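
As a rough illustration of the consistency idea, assuming a hypothetical degradation estimator that returns the predicted chain as a list of callable operators, one could score a candidate restoration like this (a sketch, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def consistency_proxy(measurement, restored, predicted_chain):
    """Sketch of a consistency proxy: re-apply the predicted degradation chain
    to the candidate restoration and compare against the actual measurement.
    `predicted_chain` is a list of callables produced by a (hypothetical here)
    degradation estimator; lower output means better consistency."""
    re_degraded = restored
    for op in predicted_chain:
        re_degraded = op(re_degraded)
    return F.mse_loss(re_degraded, measurement).item()

# toy usage with a made-up 2x-downsampling degradation
chain = [lambda x: F.avg_pool2d(x, kernel_size=2)]
y = torch.rand(1, 3, 64, 64)        # degraded measurement
x_hat = torch.rand(1, 3, 128, 128)  # candidate restoration
print(consistency_proxy(y, x_hat, chain))
```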

Authors:Liam Chalcroft, Jenny Crinion, Cathy J. Price, John Ashburner
Title: Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning
Abstract:
Self-supervised deep learning has accelerated 2D natural image analysis but remains difficult to translate into 3D MRI, where data are scarce and pre-trained 2D backbones cannot capture volumetric context. We present a \emph{sequence-invariant} self-supervised framework leveraging quantitative MRI (qMRI). By simulating multiple MRI contrasts from a single 3D qMRI scan and enforcing consistent representations across these contrasts, we learn anatomy-centric rather than sequence-specific features. The result is a single 3D encoder that excels across tasks and protocols. Experiments on healthy brain segmentation (IXI), stroke lesion segmentation (ARC), and MRI denoising show significant gains over baseline SSL approaches, especially in low-data settings (up to +8.3\% Dice, +4.2 dB PSNR). It also generalises to unseen sites, supporting scalable clinical use. Code and trained models are publicly available at https://github.com/liamchalcroft/contrast-squared
中文: 本研究提出了一种序列不变的自监督框架,利用定量MRI模拟多种对比度,使单个3D编码器能够学习解剖学特征,在多种任务和协议中表现卓越,尤其在数据稀缺情况下优势显著。
English: This study introduces a sequence-invariant self-supervised framework that uses quantitative MRI to simulate multiple contrasts, enabling a single 3D encoder to learn anatomy-focused features and achieve superior performance across various tasks and protocols, particularly in low-data scenarios.
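
A minimal sketch of the core training signal, assuming a hypothetical contrast simulator and any 3D encoder; the real framework's simulator, augmentations, and loss details differ:

```python
import torch
import torch.nn.functional as F

def sequence_invariance_loss(encoder, qmri_volume, simulate_contrast):
    """Sketch of the core idea: render two different MR contrasts from the same
    quantitative MRI volume and require the 3D encoder to embed them alike.
    `simulate_contrast` stands in for a physics-based simulator with random
    sequence parameters; the real loss and augmentations differ."""
    view_a = simulate_contrast(qmri_volume)   # e.g. a T1w-like rendering
    view_b = simulate_contrast(qmri_volume)   # e.g. a T2w-like rendering
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)
    return (1.0 - (z_a * z_b).sum(dim=-1)).mean()   # cosine consistency

# toy usage with stand-ins for the encoder and the simulator
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(512, 32))
qmri = torch.rand(2, 1, 8, 8, 8)
simulate = lambda v: v * torch.rand(1) + 0.05 * torch.randn_like(v)
print(sequence_invariance_loss(encoder, qmri, simulate))
```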

Authors:Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee
Title: LASER: Lip Landmark Assisted Speaker Detection for Robustness
Abstract:
Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at \url{https://github.com/plnguyen2908/LASER_ASD}.
中文摘要:提出的LASER模型通过整合唇部关键点和辅助一致性损失,显著提升了主动说话人检测的鲁棒性,在多种数据集中尤其在视听不同步场景下表现卓越。
English Summary: The proposed LASER model enhances Active Speaker Detection by integrating lip landmarks and an auxiliary consistency loss, achieving superior robustness and performance in handling audio-visual desynchronization across diverse datasets.
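
One plausible way to write the auxiliary consistency objective described above, assuming two classification heads (face-only and lip-aware); the exact loss form and weighting in LASER may differ:

```python
import torch
import torch.nn.functional as F

def laser_style_loss(face_logits, lip_logits, labels, lambda_cons=1.0):
    """Sketch of an auxiliary consistency objective: supervise both the
    face-only and the lip-landmark-aware heads, and pull their predictions
    together so the model stays reliable when landmarks are missing.
    The loss form and weighting are assumptions, not LASER's exact choices."""
    loss_face = F.cross_entropy(face_logits, labels)
    loss_lip = F.cross_entropy(lip_logits, labels)
    consistency = F.kl_div(F.log_softmax(face_logits, dim=-1),
                           F.softmax(lip_logits, dim=-1),
                           reduction="batchmean")
    return loss_face + loss_lip + lambda_cons * consistency

# toy usage: 4 face tracks, speaking / not-speaking logits
face, lip = torch.randn(4, 2), torch.randn(4, 2)
labels = torch.randint(0, 2, (4,))
print(laser_style_loss(face, lip, labels))
```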

Authors:Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, Maosong Sun
Title: EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Abstract:
Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and not diverse enough, which do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories: navigation, object interaction, social interaction, attribute question answering, and spatial question answering to assess different capabilities of the agents. We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development. We open-source all evaluation data and simulation framework at https://github.com/thunlp/EmbodiedEval.
Chinese: EmbodiedEval 是一个全面的评估基准,通过125个多样化3D场景中的328项交互任务来测试多模态大语言模型,揭示了其在具身能力方面相比人类表现存在的显著不足。
English: EmbodiedEval is a comprehensive benchmark designed to assess Multimodal Large Language Models (MLLMs) through 328 interactive tasks across 125 diverse 3D scenes, revealing their significant limitations in embodied capabilities compared to human performance.

Authors:Ron Raphaeli, Sean Man, Michael Elad
Title: SILO: Solving Inverse Problems with Latent Operators
Abstract:
Consistent improvement of image priors over the years has led to the development of better inverse problem solvers. Diffusion models are the newcomers to this arena, posing the strongest known prior to date. Recently, such models operating in a latent space have become increasingly predominant due to their efficiency. In recent works, these models have been applied to solve inverse problems. Working in the latent space typically requires multiple applications of an Autoencoder during the restoration process, which leads to both computational and restoration quality challenges. In this work, we propose a new approach for handling inverse problems with latent diffusion models, where a learned degradation function operates within the latent space, emulating a known image space degradation. Usage of the learned operator reduces the dependency on the Autoencoder to only the initial and final steps of the restoration process, facilitating faster sampling and superior restoration quality. We demonstrate the effectiveness of our method on a variety of image restoration tasks and datasets, achieving significant improvements over prior art.
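
A toy sketch of training a latent degradation operator so it emulates a known image-space degradation inside the latent space, assuming stand-in encoder, operator, and degradation modules (not the paper's code):

```python
import torch
import torch.nn.functional as F

def latent_operator_loss(encoder, latent_op, clean_images, degrade):
    """Sketch of learning a latent-space degradation operator: it should map
    the latent of a clean image to the latent of its degraded version, so the
    Autoencoder is only needed at the first and last restoration steps.
    All modules here are tiny stand-ins, not the paper's architecture."""
    with torch.no_grad():
        z_clean = encoder(clean_images)
        z_degraded = encoder(degrade(clean_images))
    return F.mse_loss(latent_op(z_clean), z_degraded)

# toy usage
encoder = torch.nn.Conv2d(3, 4, 3, stride=2, padding=1)
latent_op = torch.nn.Conv2d(4, 4, 3, padding=1)
degrade = lambda x: F.interpolate(F.avg_pool2d(x, 2), scale_factor=2)
imgs = torch.rand(2, 3, 32, 32)
print(latent_operator_loss(encoder, latent_op, imgs, degrade))
```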

Authors:Anwai Archit, Luca Freckmann, Constantin Pape
Title: MedicoSAM: Towards foundation models for medical image segmentation
Abstract:
Medical image segmentation is an important analysis task in clinical practice and research. Deep learning has massively advanced the field, but current approaches are mostly based on models trained for a specific task. Training such models or adapting them to a new condition is costly due to the need for (manually) labeled data. The emergence of vision foundation models, especially Segment Anything, offers a path to universal segmentation for medical images, overcoming these issues. Here, we study how to improve Segment Anything for medical images by comparing different finetuning strategies on a large and diverse dataset. We evaluate the finetuned models on a wide range of interactive and (automatic) semantic segmentation tasks. We find that the performance can be clearly improved for interactive segmentation. However, semantic segmentation does not benefit from pretraining on medical images. Our best model, MedicoSAM, is publicly available at https://github.com/computational-cell-analytics/medico-sam. We show that it is compatible with existing tools for data annotation and believe that it will be of great practical value.
中文摘要:本研究通过多种微调策略优化Segment Anything模型在医学图像上的应用,显著提升了交互式分割性能,并发布MedicoSAM作为实用的数据标注工具。
English Summary: The study enhances the Segment Anything model for medical images through diverse finetuning strategies, achieving significant improvements in interactive segmentation and releasing MedicoSAM as a practical tool for data annotation.

Authors:Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Abstract:
Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
中文: Mobile-Agent-E是一种分层多智能体框架,通过长期记忆和专门下属代理实现自我进化,解决了现有移动代理的不足,性能比先前最优方法提升了22%。
English: Mobile-Agent-E is a hierarchical multi-agent framework that addresses limitations in current mobile agents by enabling self-evolution through long-term memory and specialized subordinate agents, achieving a 22% performance improvement over prior methods.

Authors:Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong
Title: Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution
Abstract:
With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising performance in linguistic quality description. However, current methods still fall short in accurately scoring image quality. In this work, we aim to leverage MLLMs to regress accurate quality scores. A key challenge is that the quality score is inherently continuous, typically modeled as a Gaussian distribution, whereas MLLMs generate discrete token outputs. This mismatch necessitates score discretization. Previous approaches discretize the mean score into a one-hot label, resulting in information loss and failing to capture inter-image relationships. We propose a distribution-based approach that discretizes the score distribution into a soft label. This method preserves the characteristics of the score distribution, achieving high accuracy and maintaining inter-image relationships. Moreover, to address dataset variation, where different IQA datasets exhibit various distributions, we introduce a fidelity loss based on Thurstone's model. This loss captures intra-dataset relationships, facilitating co-training across multiple IQA datasets. With these designs, we develop the distribution-based Depicted image Quality Assessment model for Score regression (DeQA-Score). Experiments across multiple benchmarks show that DeQA-Score consistently outperforms baselines in score regression. Also, DeQA-Score can predict the score distribution that closely aligns with human annotations. Codes and model weights have been released at https://depictqa.github.io/deqa-score/.
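
A small sketch of the distribution-based discretization described above: the Gaussian score distribution is turned into a soft label by integrating its mass over each discrete level's bin (the bin edges and level count below are assumptions):

```python
import numpy as np
from scipy.stats import norm

def soft_label_from_gaussian(mean, std, level_edges):
    """Sketch of distribution-based discretization: instead of a one-hot label
    at the nearest level, give each discrete quality level the probability
    mass of N(mean, std^2) falling in its bin. Bin edges are an assumption."""
    cdf = norm.cdf(level_edges, loc=mean, scale=std)
    probs = np.diff(cdf)
    return probs / probs.sum()   # renormalize mass clipped at the boundaries

# toy usage: five quality levels covering scores 1..5
edges = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
print(soft_label_from_gaussian(mean=3.2, std=0.6, level_edges=edges))
```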

Authors:Zixuan Chen, Yujin Wang, Xin Cai, Zhiyuan You, Zheming Lu, Fan Zhang, Shi Guo, Tianfan Xue
Title: UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
Abstract:
Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use exposure fusion, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with a limited exposure difference, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, this approach often fails due to incorrect alignment or inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with a 9-stop exposure difference. The key idea is that we model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing information of over-exposed highlights in the over-exposed regions. Using the under-exposed image as soft guidance, instead of a hard constraint, our model is robust to potential alignment issues or lighting variations. Moreover, by utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scenes, we capture a new real-world exposure fusion benchmark, the UltraFusion dataset, with exposure differences of up to 9 stops, and experiments show that UltraFusion can generate beautiful and high-quality fusion results under various scenarios. Code and data will be available at https://openimaginglab.github.io/UltraFusion.

Authors:Jiebin Yan, Jiale Rao, Junjie Chen, Ziwen Tan, Weide Liu, Yuming Fang
Title: Multitask Auxiliary Network for Perceptual Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
Abstract:
Omnidirectional image quality assessment (OIQA) has been widely investigated in the past few years and achieved much success. However, most existing studies are dedicated to solving the uniform distortion problem in OIQA, which has a natural gap with the non-uniform distortion problem, and their ability to capture non-uniform distortion is far from satisfactory. To narrow this gap, in this paper, we propose a multitask auxiliary network for non-uniformly distorted omnidirectional images, where the parameters are optimized by jointly training the main task and other auxiliary tasks. The proposed network mainly consists of three parts: a backbone for extracting multiscale features from the viewport sequence, a multitask feature selection module for dynamically allocating specific features to different tasks, and auxiliary sub-networks for guiding the proposed model to capture local distortion and global quality change. Extensive experiments conducted on two large-scale OIQA databases demonstrate that the proposed model outperforms other state-of-the-art OIQA metrics, and these auxiliary sub-networks contribute to improving the performance of the proposed model. The source code is available at https://github.com/RJL2000/MTAOIQA.
Chinese: 本文提出了一种多任务辅助网络,通过联合训练主任务与辅助任务来改善非均匀失真全向图像的质量评估,其性能优于现有最先进方法。
English: This paper introduces a multitask auxiliary network that enhances omnidirectional image quality assessment by addressing non-uniform distortions through joint training of main and auxiliary tasks, outperforming existing methods.

Authors:Jiebin Yan, Jiale Rao, Xuelin Liu, Yuming Fang, Yifan Zuo, Weide Liu
Title: Subjective and Objective Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
Abstract:
Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the ``same amount'' of noise, while ignoring the non-uniform distortion issue, i.e., partial regions undergo a ``different amount'' of perturbation from the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we elaborately explore this topic from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera lens(es). Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods. The source code is available at https://github.com/RJL2000/OIQAND.
中文: 本研究通过构建包含大量非均匀失真全景图像的数据集,并开发一种模拟用户观看行为的感知引导模型,有效解决了全景图像质量评估中的局限性,其性能优于现有方法。
English: This study addresses the limitations in omnidirectional image quality assessment (OIQA) by creating a large database of non-uniformly distorted images and developing a perception-guided model that outperforms existing methods by simulating user viewing behavior.

Authors:Shu Zou, Xinyu Tian, Qinyu Zhao, Zhaoyuan Yang, Jing Zhang
Title: SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models
Abstract:
Detecting out-of-distribution (OOD) data is crucial in real-world machine learning applications, particularly in safety-critical domains. Existing methods often leverage language information from vision-language models (VLMs) to enhance OOD detection by improving confidence estimation through rich class-wise text information. However, when building an OOD detection score upon in-distribution (ID) text-image affinity, existing works focus either on each ID class or on the whole ID label set, overlooking the inherent connections among ID classes. We find that the semantic information across different ID classes is beneficial for effective OOD detection. We thus investigate the ability of VLMs to comprehend image-text relations among semantically related ID labels and propose a novel post-hoc strategy called SimLabel. SimLabel enhances the separability between ID and OOD samples by establishing a more robust image-class similarity metric that considers consistency over a set of similar class labels. Extensive experiments demonstrate the superior performance of SimLabel on various zero-shot OOD detection benchmarks. The proposed model is also extended to various VLM backbones, demonstrating its good generalization ability. Our demonstration and implementation codes are available at: https://github.com/ShuZou-1/SimLabel.
中文摘要:提出的SimLabel方法通过利用分布内类别之间的语义关系建立更鲁棒的图像类别相似度度量,从而提升视觉语言模型中的分布外检测性能,在多个基准测试中展现出优越表现。
English Summary: The proposed SimLabel method improves out-of-distribution detection in vision-language models by leveraging semantic relationships between in-distribution classes to create a more robust image-class similarity metric, demonstrating superior performance across multiple benchmarks.
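
A rough sketch of a SimLabel-flavored score under the assumption that each ID class is associated with a precomputed set of semantically similar labels; the actual aggregation and normalization in the paper may differ:

```python
import torch
import torch.nn.functional as F

def simlabel_score(image_feat, text_feats, similar_sets, temperature=0.01):
    """Sketch of a SimLabel-flavored score: an image counts as in-distribution
    when it is consistently similar to a class and to that class's semantically
    related labels. `similar_sets[c]` lists label indices related to class c
    (assumed to be built offline from label embeddings)."""
    sims = F.normalize(image_feat, dim=-1) @ F.normalize(text_feats, dim=-1).T
    probs = torch.softmax(sims / temperature, dim=-1)
    class_scores = torch.stack(
        [probs[:, idxs].mean(dim=-1) for idxs in similar_sets], dim=-1)
    return class_scores.max(dim=-1).values   # higher -> more likely ID

# toy usage: 2 images, 4 ID labels, each class paired with one related label
img, txt = torch.randn(2, 512), torch.randn(4, 512)
groups = [[0, 1], [1, 0], [2, 3], [3, 2]]
print(simlabel_score(img, txt, groups))
```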

Authors:Zibin Wang, Zhiyuan Ouyang, Xiangyun Zhang
Title: Block Flow: Learning Straight Flow on Data Blocks
Abstract:
Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter scale. Code is available at \url{https://github.com/wpp13749/block_flow}.
中文摘要:本研究提出块匹配方法,通过将数据按标签分区并与参数化先验对齐来降低流匹配模型中的曲率,借助方差调节策略平衡生成多样性与数值误差,实现了优异性能。
English Summary: The study introduces block matching to reduce curvature in flow-matching models by partitioning data into labeled blocks and aligning them with a parameterized prior, achieving optimal generation through variance-controlled regularization.
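
A compact sketch of the block-matching objective, assuming per-class prior means and a fixed prior variance; the actual parameterization and regularization strategies are more flexible than shown:

```python
import torch
import torch.nn.functional as F

def block_matching_loss(velocity_net, x, labels, class_means, sigma=0.5):
    """Sketch of block matching: sample the prior from a label-parameterized
    Gaussian N(mu_y, sigma^2 I) so each data block is matched with its own
    region of the prior, which tends to straighten the learned flow.
    The network interface and sigma are assumptions."""
    z = class_means[labels] + sigma * torch.randn_like(x)   # block-wise prior
    t = torch.rand(x.size(0), 1)
    x_t = (1 - t) * z + t * x                               # straight path
    target_velocity = x - z
    return F.mse_loss(velocity_net(x_t, t, labels), target_velocity)

# toy usage with a stand-in velocity network
net = lambda x_t, t, y: torch.zeros_like(x_t)
data, labels, means = torch.randn(8, 2), torch.randint(0, 3, (8,)), torch.randn(3, 2)
print(block_matching_loss(net, data, labels, means))
```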

Authors:Ruojun Xu, Weijie Xi, Xiaodi Wang, Yongbo Mao, Zach Cheng
Title: StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer
Abstract:
Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of original content and content leakage from style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate the content leakage from style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing the content leakage from style image. Project page: https://github.com/bytedance/StyleSSP.
中文摘要:提出的StyleSSP方法通过频率操作和反向引导优化采样起点,有效解决了无训练风格迁移中的内容布局改变和风格图像内容泄露问题。
English Summary: The proposed StyleSSP method enhances training-free style transfer by improving the sampling startpoint through frequency manipulation and negative guidance, effectively preserving content layout and reducing style leakage.
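
A sketch of the frequency-manipulation component applied to an inverted latent, assuming a simple radial low-frequency mask with illustrative cutoff and attenuation values (the paper's exact manipulation may differ):

```python
import torch

def suppress_low_frequencies(latent, cutoff=0.1, strength=0.5):
    """Sketch of the frequency-manipulation step: damp the low-frequency part
    of an inverted DDIM latent so sampling attends more to the content layout.
    The radial cutoff and attenuation strength are illustrative values."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    h, w = latent.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    radius = torch.sqrt(yy ** 2 + xx ** 2)
    mask = torch.where(radius < cutoff,
                       torch.full_like(radius, strength),
                       torch.ones_like(radius))
    damped = freq * mask
    return torch.fft.ifft2(torch.fft.ifftshift(damped, dim=(-2, -1))).real

# toy usage on a 4-channel latent
print(suppress_low_frequencies(torch.randn(1, 4, 64, 64)).shape)
```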

Authors:Ziheng Zhang, Jianyang Gu, Arpita Chowdhury, Zheda Mai, David Carlyn, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao
Title: Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
Abstract:
Class activation map (CAM) has been widely used to highlight image regions that contribute to class predictions. Despite its simplicity and computational efficiency, CAM often struggles to identify discriminative regions that distinguish visually similar fine-grained classes. Prior efforts address this limitation by introducing more sophisticated explanation processes, but at the cost of extra complexity. In this paper, we propose Finer-CAM, a method that retains CAM's efficiency while achieving precise localization of discriminative regions. Our key insight is that the deficiency of CAM lies not in "how" it explains, but in "what" it explains. Specifically, previous methods attempt to identify all cues contributing to the target class's logit value, which inadvertently also activates regions predictive of visually similar classes. By explicitly comparing the target class with similar classes and spotting their differences, Finer-CAM suppresses features shared with other classes and emphasizes the unique, discriminative details of the target class. Finer-CAM is easy to implement, compatible with various CAM methods, and can be extended to multi-modal models for accurate localization of specific concepts. Additionally, Finer-CAM allows adjustable comparison strength, enabling users to selectively highlight coarse object contours or fine discriminative details. Quantitatively, we show that masking out the top 5% of activated pixels by Finer-CAM results in a larger relative confidence drop compared to baselines. The source code and demo are available at https://github.com/Imageomics/Finer-CAM.
Chinese: Finer-CAM通过显式比较相似类别来抑制共有特征并突出目标类的独特细节,在保持CAM效率的同时实现了对区分性区域的精确定位。
English: Finer-CAM improves upon class activation maps by efficiently highlighting unique discriminative regions through explicit comparison with similar classes, achieving precise localization without added complexity.
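
To make the "compare with a similar class" idea concrete, here is a sketch on top of a plain weight-based CAM: the map explains the difference of logits rather than the target logit alone (shapes and the alpha value are illustrative):

```python
import numpy as np

def finer_cam(feature_maps, fc_weights, target_cls, similar_cls, alpha=1.0):
    """Sketch of the 'spot the difference' idea on a plain weight-based CAM:
    explain (logit_target - alpha * logit_similar) instead of the target logit
    alone, which cancels features shared with the visually similar class.
    Shapes and alpha are illustrative, not the released implementation."""
    # feature_maps: (C, H, W); fc_weights: (num_classes, C)
    diff_w = fc_weights[target_cls] - alpha * fc_weights[similar_cls]
    cam = np.tensordot(diff_w, feature_maps, axes=([0], [0]))   # (H, W)
    cam = np.maximum(cam, 0)
    return cam / (cam.max() + 1e-8)

# toy usage
fmap, weights = np.random.rand(16, 7, 7), np.random.rand(10, 16)
print(finer_cam(fmap, weights, target_cls=3, similar_cls=5).shape)
```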

Authors:Yepeng Liu, Zhichao Sun, Baosheng Yu, Yitian Zhao, Bo Du, Yongchao Xu, Jun Cheng
Title: MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching
Abstract:
Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because the descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. In this way, our approach generates robust and invariant features across diverse and unknown modalities. We validate our method with recent keypoint detection and description methods on three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant features for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The code will be released at https://github.com/lyp-deeplearning/MIFNet.
中文: 提出的MIFNet通过仅使用单模态训练数据学习模态不变特征,结合新型聚合模块和预训练Stable Diffusion模型增强基础描述符,无需对齐多模态数据即可在不同数据集中实现鲁棒的跨模态图像匹配。
English: The proposed MIFNet addresses the challenge of multimodal image matching by learning modality-invariant features using only single-modality training data, enhanced through novel aggregation modules and pre-trained Stable Diffusion features, achieving robust performance across diverse datasets without requiring aligned multimodal data.

Authors:Yanchao Wang, Dawei Zhang, Run Li, Zhonglong Zheng, Minglu Li
Title: PD-SORT: Occlusion-Robust Multi-Object Tracking Using Pseudo-Depth Cues
Abstract:
Multi-object tracking (MOT) is a rising topic in video processing technologies and has important application value in consumer electronics. Currently, tracking-by-detection (TBD) is the dominant paradigm for MOT, which performs target detection and association frame by frame. However, the association performance of TBD methods degrades in complex scenes with heavy occlusions, which hinders the application of such methods in real-world scenarios. To this end, we incorporate pseudo-depth cues to enhance the association performance and propose Pseudo-Depth SORT (PD-SORT). First, we extend the Kalman filter state vector with pseudo-depth states. Second, we introduce a novel depth volume IoU (DVIoU) by combining the conventional 2D IoU with pseudo-depth. Furthermore, we develop a quantized pseudo-depth measurement (QPDM) strategy for more robust data association. Besides, we also integrate camera motion compensation (CMC) to handle dynamic camera situations. With the above designs, PD-SORT significantly alleviates the occlusion-induced ambiguous associations and achieves leading performances on DanceTrack, MOT17, and MOT20. Note that the improvement is especially obvious on DanceTrack, where objects show complex motions, similar appearances, and frequent occlusions. The code is available at https://github.com/Wangyc2000/PD_SORT.
中文摘要:PD-SORT通过将伪深度信息融入卡尔曼滤波器并引入深度感知度量方法,有效解决了复杂场景中的遮挡问题,在多个基准数据集上实现了领先的跟踪性能。
English Summary: PD-SORT enhances multi-object tracking by incorporating pseudo-depth cues into the Kalman filter and introducing depth-aware metrics to address occlusion challenges, achieving state-of-the-art performance on benchmark datasets.
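
Since the abstract does not spell out the DVIoU formula, the following is a hypothetical sketch of how a depth-volume IoU could combine 2D box overlap with pseudo-depth interval overlap; PD-SORT's actual definition may differ:

```python
def depth_volume_iou(box_a, box_b, depth_a, depth_b):
    """Hypothetical depth-volume IoU: combine plain 2D IoU with the overlap of
    pseudo-depth intervals, so boxes that overlap in the image but sit at very
    different pseudo-depths associate less. PD-SORT's exact formula may differ.
    box: (x1, y1, x2, y2); depth: (d_low, d_high) pseudo-depth interval."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou_2d = inter / (area_a + area_b - inter + 1e-8)

    d_inter = max(0.0, min(depth_a[1], depth_b[1]) - max(depth_a[0], depth_b[0]))
    d_union = max(depth_a[1], depth_b[1]) - min(depth_a[0], depth_b[0]) + 1e-8
    return iou_2d * (d_inter / d_union)

print(depth_volume_iou((0, 0, 10, 10), (5, 5, 15, 15), (2.0, 4.0), (3.0, 5.0)))
```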

Authors:Xiangyang Hu, Xiangyu Shen, Yifei Sun, Xuhao Shan, Wenwen Min, Liyilei Su, Xiaomao Fan, Ahmed Elazab, Ruiquan Ge, Changmiao Wang, Xiaopeng Fan
Title: ITCFN: Incomplete Triple-Modal Co-Attention Fusion Network for Mild Cognitive Impairment Conversion Prediction
Abstract:
Alzheimer's disease (AD) is a common neurodegenerative disease among the elderly. Early prediction and timely intervention of its prodromal stage, mild cognitive impairment (MCI), can decrease the risk of advancing to AD. Combining information from various modalities can significantly improve predictive accuracy. However, challenges such as missing data and heterogeneity across modalities complicate multimodal learning methods as adding more modalities can worsen these issues. Current multimodal fusion techniques often fail to adapt to the complexity of medical data, hindering the ability to identify relationships between modalities. To address these challenges, we propose an innovative multimodal approach for predicting MCI conversion, focusing specifically on the issues of missing positron emission tomography (PET) data and integrating diverse medical information. The proposed incomplete triple-modal MCI conversion prediction network is tailored for this purpose. Through the missing modal generation module, we synthesize the missing PET data from the magnetic resonance imaging and extract features using specifically designed encoders. We also develop a channel aggregation module and a triple-modal co-attention fusion module to reduce feature redundancy and achieve effective multimodal data fusion. Furthermore, we design a loss function to handle missing modality issues and align cross-modal features. These components collectively harness multimodal data to boost network performance. Experimental results on the ADNI1 and ADNI2 datasets show that our method significantly surpasses existing unimodal and other multimodal models. Our code is available at https://github.com/justinhxy/ITFC.
中文摘要:本研究提出了一种创新的多模态网络,通过生成缺失的PET数据并整合多样化医疗信息来提升轻度认知障碍向阿尔茨海默病转化的预测准确性,实验证明其性能显著优于现有模型。
English Summary: This study introduces an innovative multimodal network that addresses missing PET data and integrates diverse medical information to improve the prediction of mild cognitive impairment conversion to Alzheimer's disease, demonstrating superior performance over existing models.

Authors:Hongwei Sha, Muchen Dong, Quanyou Luo, Ming Lu, Hao Chen, Zhan Ma
Title: Towards Loss-Resilient Image Coding for Unstable Satellite Networks
Abstract:
Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality with unpredictable errors. By integrating the Gilbert-Elliot model into the training process, we enhance the model's ability to generalize in real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in terms of compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments. Code is available at https://github.com/NJUVISION/LossResilientLIC.
中文摘要:本文针对地球静止轨道卫星通信中的图像传输问题,提出了一种抗丢损的图像编码方法,通过端到端优化结合空间-通道重排和掩码条件聚合技术,在不可预测的丢包环境下仍能保持稳定的渐进式传输性能。
English Summary: This paper introduces a loss-resilient image coding method for GEO satellite communications that uses end-to-end optimization and integrates spatial-channel rearrangement with mask conditional aggregation to maintain robust progressive transmission under unpredictable packet loss.
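
The Gilbert-Elliot channel mentioned above is a standard two-state Markov model for bursty packet loss; a minimal simulator that could be used to drop packets during training might look as follows (transition and loss rates are illustrative, not the paper's settings):

```python
import numpy as np

def gilbert_elliott_mask(num_packets, p_g2b=0.05, p_b2g=0.3,
                         loss_good=0.01, loss_bad=0.5, rng=None):
    """Two-state Markov (Gilbert-Elliott) packet-loss simulator: the channel
    switches between a Good and a Bad state, each with its own loss rate,
    which produces bursty losses. Rates are illustrative placeholders."""
    rng = rng or np.random.default_rng()
    mask = np.ones(num_packets, dtype=bool)   # True = packet received
    state = "good"
    for i in range(num_packets):
        loss_p = loss_good if state == "good" else loss_bad
        mask[i] = rng.random() >= loss_p
        flip_p = p_g2b if state == "good" else p_b2g
        if rng.random() < flip_p:
            state = "bad" if state == "good" else "good"
    return mask

print(gilbert_elliott_mask(20).astype(int))
```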

Authors:Tuo Feng, Wenguan Wang, Yi Yang
Title: A Survey of World Models for Autonomous Driving
Abstract:
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling, fundamentally transforming how vehicles interpret dynamic scenes and execute safe decision-making. World models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving, proposing a three-tiered taxonomy: (i) Generation of Future Physical World, covering Image-, BEV-, OG-, and PC-based generation methods that enhance scene evolution modeling through diffusion models and 4D occupancy forecasting; (ii) Behavior Planning for Intelligent Agents, combining rule-driven and learning-based paradigms with cost map optimization and reinforcement learning for trajectory generation in complex traffic conditions; (iii) Interaction between Prediction and Planning, achieving multi-agent collaborative decision-making through latent space diffusion and memory-augmented architectures. The study further analyzes training paradigms, including self-supervised learning, multimodal pretraining, and generative data augmentation, while evaluating world models' performance in scene understanding and motion prediction tasks. Future research must address key challenges in self-supervised representation learning, multimodal fusion, and advanced simulation to advance the practical deployment of world models in complex urban environments. Overall, the comprehensive analysis provides a technical roadmap for harnessing the transformative potential of world models in advancing safe and reliable autonomous driving solutions.
中文: 世界建模的最新进展通过增强场景解读和决策制定,彻底改变了自动驾驶技术;本文系统综述了其分类体系、训练范式及未来挑战,为开发安全可靠的自动驾驶解决方案提供了技术路线图。
English: Recent advances in world modeling have revolutionized autonomous driving by enhancing scene interpretation and decision-making, with this paper systematically reviewing their taxonomy, training paradigms, and future challenges to guide the development of safe driving solutions.

Authors:Tal Zeevi, Lawrence H. Staib, John A. Onofrey
Title: Enhancing Uncertainty Estimation in Semantic Segmentation via Monte-Carlo Frequency Dropout
Abstract:
Monte-Carlo (MC) Dropout provides a practical solution for estimating predictive distributions in deterministic neural networks. Traditional dropout, applied within the signal space, may fail to account for frequency-related noise common in medical imaging, leading to biased predictive estimates. A novel approach extends Dropout to the frequency domain, allowing stochastic attenuation of signal frequencies during inference. This creates diverse global textural variations in feature maps while preserving structural integrity -- a factor we hypothesize and empirically show is contributing to accurately estimating uncertainties in semantic segmentation. We evaluated traditional MC-Dropout and the MC-frequency Dropout in three segmentation tasks involving different imaging modalities: (i) prostate zones in biparametric MRI, (ii) liver tumors in contrast-enhanced CT, and (iii) lungs in chest X-ray scans. Our results show that MC-Frequency Dropout improves calibration, convergence, and semantic uncertainty, thereby improving prediction scrutiny, boundary delineation, and has the potential to enhance medical decision-making.
中文:MC频率Dropout将dropout扩展至频域,通过增强校准、收敛和边界划分,在多种医学影像分割任务中改进了不确定性估计。
English: MC-Frequency Dropout extends dropout to the frequency domain, improving uncertainty estimation in medical image segmentation by enhancing calibration, convergence, and boundary delineation across various imaging modalities.
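
A minimal sketch of frequency-domain dropout applied to a feature map, with repeated stochastic passes providing Monte-Carlo samples; the masking granularity and rates here are assumptions:

```python
import torch

def mc_frequency_dropout(feature_map, drop_prob=0.1):
    """Sketch of frequency-domain dropout: zero out a random subset of spatial
    frequencies of a feature map and transform back, giving global textural
    perturbations while largely preserving structure. Repeating this at
    inference yields Monte-Carlo samples for uncertainty estimation."""
    freq = torch.fft.rfft2(feature_map)
    keep = (torch.rand_like(freq.real) > drop_prob).float()
    return torch.fft.irfft2(freq * keep, s=feature_map.shape[-2:])

# toy Monte-Carlo loop over a single feature map
x = torch.randn(1, 8, 32, 32)
samples = torch.stack([mc_frequency_dropout(x) for _ in range(5)])
print(samples.shape, samples.var(dim=0).mean())
```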

Authors:Yassir Bendou, Amine Ouasfi, Vincent Gripon, Adnane Boukhayma
Title: ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
Abstract:
The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed form solution and achieves state-of-the-art performances across 11 datasets in the standard few-shot adaptation benchmark.
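
A sketch of the general recipe the abstract describes: a closed-form kernel ridge regression fitted to the residual between few-shot labels and the zero-shot CLIP prediction, i.e., regularized toward the base learner. The kernel and hyper-parameters below are assumptions rather than ProKeR's exact formulation:

```python
import torch

def kernel_adapter_predict(test_feat, train_feat, train_labels_onehot,
                           zero_shot_fn, lam=1.0, gamma=5.0):
    """Sketch of a kernel ridge regression adapter regularized toward the
    zero-shot CLIP classifier: fit the residual between few-shot labels and the
    zero-shot prediction in closed form with an RBF kernel. The kernel choice
    and hyper-parameters are assumptions, not ProKeR's exact formulation."""
    def rbf(a, b):
        return torch.exp(-gamma * torch.cdist(a, b) ** 2)

    prior_train = zero_shot_fn(train_feat)   # base learner on the support set
    residual = train_labels_onehot - prior_train
    K = rbf(train_feat, train_feat)
    alpha = torch.linalg.solve(K + lam * torch.eye(K.size(0)), residual)
    return zero_shot_fn(test_feat) + rbf(test_feat, train_feat) @ alpha

# toy usage: 16 support samples, 4 classes, 32-d stand-in features
W = torch.randn(32, 4)
zero_shot = lambda f: torch.softmax(f @ W, dim=-1)
support = torch.randn(16, 32)
labels = torch.eye(4)[torch.randint(0, 4, (16,))]
print(kernel_adapter_predict(torch.randn(3, 32), support, labels, zero_shot).shape)
```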

Authors:Konrad Lis, Tomasz Kryjak, Marek Gorgon
Title: LiFT: Lightweight, FPGA-tailored 3D object detection based on LiDAR data
Abstract:
This paper presents LiFT, a lightweight, fully quantized 3D object detection algorithm for LiDAR data, optimized for real-time inference on FPGA platforms. Through an in-depth analysis of FPGA-specific limitations, we identify a set of FPGA-induced constraints that shape the algorithm's design. These include a computational complexity limit of 30 GMACs (billion multiply-accumulate operations), INT8 quantization for weights and activations, 2D cell-based processing instead of 3D voxels, and minimal use of skip connections. To meet these constraints while maximizing performance, LiFT combines novel mechanisms with state-of-the-art techniques such as reparameterizable convolutions and fully sparse architecture. Key innovations include the Dual-bound Pillar Feature Net, which boosts performance without increasing complexity, and an efficient scheme for INT8 quantization of input features. With a computational cost of just 20.73 GMACs, LiFT stands out as one of the few algorithms targeting minimal-complexity 3D object detection. Among comparable methods, LiFT ranks first, achieving an mAP of 51.84% and an NDS of 61.01% on the challenging NuScenes validation dataset. The code will be available at https://github.com/vision-agh/lift.
中文: 本文提出LiFT,一种轻量级全量化激光雷达3D目标检测算法,专为FPGA实时推理设计,在NuScenes数据集上以极低计算复杂度实现了最优性能。
English: This paper introduces LiFT, a lightweight and fully quantized 3D object detection algorithm for LiDAR data, designed for real-time FPGA execution with minimal computational complexity and state-of-the-art performance on the NuScenes dataset.

Authors:William Doherty, Anton Lee, Heitor Murilo Gomes
Title: CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning
Abstract:
The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially-generated images. A significant challenge for these classifiers emerges when they encounter images from generative models that are not represented in their training data, usually resulting in diminished performance. A typical approach is to periodically update the classifier's training data with images from the new generative models and then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may be required to rapidly adapt. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methodologies. In doing this, we set a baseline on our novel dataset using three foundational continual learning methods -- EWC, GEM, and Experience Replay -- and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a Naive baseline. The dataset and code to run the experiments can be accessed from the following GitHub repository: https://github.com/Will-Doherty/CLOFAI.
Chinese: CLOFAI数据集被提出作为持续学习方法的基准,旨在解决对未见过的生成式AI模型图像分类的挑战,其中GEM和经验回放方法相较于传统方法展现出更优的性能。
English: The CLOFAI dataset is introduced as a benchmark for continual learning methods to address the challenge of classifying images from unseen generative AI models, with GEM and Experience Replay showing promising results compared to traditional approaches.

Authors:Zhipeng Yu, Qianqian Xu, Yangbangyan Jiang, Yingfei Sun, Qingming Huang
Title: Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection
Abstract:
The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving the robustness towards noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains under-explored. Existing noisy label learning methods designed for DML mainly discard suspicious noisy samples, resulting in a waste of the training data. To address this issue, we propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS), which constructs reliable positive pairs for noisy samples to enhance the sample utilization. Specifically, SGPS first effectively identifies clean and noisy samples by a probability-based clean sample selection strategy. To further utilize the remaining noisy samples, we discover their potential similar samples based on the subgroup information given by a subgroup generation module and then aggregate them into informative positive prototypes for each noisy sample via a positive prototype generation module. Afterward, a new contrastive loss is tailored for the noisy samples with their selected positive pairs. SGPS can be easily integrated into the training process of existing pair-wise DML tasks, like image retrieval and face recognition. Extensive experiments on multiple synthetic and real-world large-scale label noise datasets demonstrate the effectiveness of our proposed method. Without any bells and whistles, our SGPS framework outperforms the state-of-the-art noisy label DML methods. Code is available at \url{https://github.com/smuelpeng/SGPS-NoiseFreeDML}.
中文: 本文提出了一种名为SGPS的抗噪声深度度量学习框架,通过为噪声样本构建可靠的正样本对来提高数据利用率,在多个数据集上超越了现有方法。
English: This paper introduces a noise-robust deep metric learning framework called SGPS that constructs reliable positive pairs for noisy samples to improve data utilization, outperforming existing methods on multiple datasets.

Authors:Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, Jaejun Yoo
Title: BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
Abstract:
While prior methods in Continuous Spatial-Temporal Video Super-Resolution (C-STVSR) employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and pre-trained optical flow networks for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve--and even degrades--performance. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video: 1) B-spline Mapper for smooth temporal interpolation, and 2) Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art performance in various metrics, including PSNR and SSIM, showing enhanced spatial details and natural temporal consistency. Our code is available at https://github.com/Eunjnnn/bfstvsr.
中文: 先前基于隐式神经表示的连续时空视频超分辨率方法难以充分建模视频复杂性,而我们提出的BF-STVSR框架通过B样条映射器和傅里叶映射器更好地捕捉时空特征,实现了最先进的性能表现。
English: Previous C-STVSR methods using Implicit Neural Representation often fail to adequately model video complexity, but our proposed BF-STVSR framework with B-spline and Fourier mappers achieves state-of-the-art performance by better capturing spatial and temporal characteristics.

Authors:Congcong Li, Jin Wang, Xiaomeng Wang, Xingchen Zhou, Wei Wu, Yuzhi Zhang, Tongyi Cao
Title: Car-GS: Addressing Reflective and Transparent Surface Challenges in 3D Car Reconstruction
Abstract:
3D car modeling is crucial for applications in autonomous driving systems, virtual and augmented reality, and gaming. However, due to the distinctive properties of cars, such as highly reflective and transparent surface materials, existing methods often struggle to achieve accurate 3D car reconstruction. To address these limitations, we propose Car-GS, a novel approach designed to mitigate the effects of specular highlights and the coupling of RGB and geometry in 3D Gaussian Splatting (3DGS). Our method incorporates three key innovations: First, we introduce view-dependent Gaussian primitives to effectively model surface reflections. Second, we identify the limitations of using a shared opacity parameter for both image rendering and geometric attributes when modeling transparent objects. To overcome this, we assign a learnable geometry-specific opacity to each 2D Gaussian primitive, dedicated solely to rendering depth and normals. Third, we observe that reconstruction errors are most prominent when the camera view is nearly orthogonal to glass surfaces. To address this issue, we develop a quality-aware supervision module that adaptively leverages normal priors from a pre-trained large-scale normal model. Experimental results demonstrate that Car-GS achieves precise reconstruction of car surfaces and significantly outperforms prior methods. The project page is available at https://lcc815.github.io/Car-GS.

Authors:Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou
Title: Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Abstract:
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://github.com/SakuraTroyChen/PyPE.
中文: 提出的金字塔式视觉位置编码(PyPE)通过从外围到中心的视觉位置索引和逐步扩展中心感受野,增强了视觉语言模型的多粒度感知能力,有效提升了不同规模模型的综合性能。
English: The proposed Pyramid-descent Visual Position Encoding (PyPE) enhances vision-language models by rationally encoding visual positions to improve multi-granularity perception and attention allocation, consistently boosting model performance across various scales.
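
To illustrate the periphery-to-center indexing idea only, here is a toy assignment of ring-based position indices over a patch grid; how PyPE orders tokens within a ring and integrates with RoPE is not reproduced:

```python
import torch

def periphery_to_center_indices(height, width):
    """Toy assignment of ring-based position indices over a patch grid: each
    patch's index is its distance to the nearest border, so peripheral patches
    come first and the center last. How PyPE orders patches within a ring and
    combines the indices with RoPE is not reproduced here."""
    rows = torch.arange(height).unsqueeze(1).expand(height, width)
    cols = torch.arange(width).unsqueeze(0).expand(height, width)
    ring = torch.minimum(torch.minimum(rows, cols),
                         torch.minimum(height - 1 - rows, width - 1 - cols))
    return ring   # (height, width) integer position indices

print(periphery_to_center_indices(5, 5))
```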

Authors:Sijun Dong, Fangcheng Zuo, Geng Chen, Siming Fu, Xiaoliang Meng
Title: A Remote Sensing Image Change Detection Method Integrating Layer Exchange and Channel-Spatial Differences
Abstract:
Change detection in remote sensing imagery is a critical technique for Earth observation, primarily focusing on pixel-level segmentation of change regions between bi-temporal images. The essence of pixel-level change detection lies in determining whether corresponding pixels in bi-temporal images have changed. In deep learning, the spatial and channel dimensions of feature maps represent different information from the original images. In this study, we found that in change detection tasks, difference information can be computed not only from the spatial dimension of bi-temporal features but also from the channel dimension. Therefore, we designed the Channel-Spatial Difference Weighting (CSDW) module as an aggregation-distribution mechanism for bi-temporal features in change detection. This module enhances the sensitivity of the change detection model to difference features. Additionally, bi-temporal images share the same geographic location and exhibit strong inter-image correlations. To construct the correlation between bi-temporal images, we designed a decoding structure based on the Layer-Exchange (LE) method to enhance the interaction of bi-temporal features. Comprehensive experiments on the CLCD, PX-CLCD, LEVIR-CD, and S2Looking datasets demonstrate that the proposed LENet model significantly improves change detection performance. The code and pre-trained models will be available at: https://github.com/dyzy41/lenet.
中文摘要:本研究提出通道-空间差异加权模块和层交换解码结构,通过增强双时相特征交互来改进遥感变化检测,所开发的LENet模型在多个数据集上展现出卓越性能。
English Summary: This study introduces the Channel-Spatial Difference Weighting (CSDW) module and a Layer-Exchange (LE) decoding structure to enhance bi-temporal feature interaction in remote sensing change detection, with the proposed LENet model demonstrating superior performance across multiple datasets.

Authors:Young Seok Jeon, Hongfei Yang, Huazhu Fu, Mengling Feng
Title: No More Sliding Window: Efficient 3D Medical Image Segmentation with Differentiable Top-k Patch Sampling
Abstract:
3D models surpass 2D models in CT/MRI segmentation by effectively capturing inter-slice relationships. However, the added depth dimension substantially increases memory consumption. While patch-based training alleviates memory constraints, it significantly slows down the inference speed due to the sliding window (SW) approach. We propose No-More-Sliding-Window (NMSW), a novel end-to-end trainable framework that enhances the efficiency of a generic 3D segmentation backbone during inference by eliminating the need for SW. NMSW employs a differentiable Top-k module to selectively sample only the most relevant patches, thereby minimizing redundant computations. When patch-level predictions are insufficient, the framework intelligently leverages coarse global predictions to refine results. Evaluated across 3 tasks using 3 segmentation backbones, NMSW achieves competitive accuracy compared to SW inference while significantly reducing computational complexity by 91% (88.0 to 8.00 TMACs). Moreover, it delivers a 9.1x faster inference on the H100 GPU (99.0 to 8.3 sec) and a 11.1x faster inference on the Xeon Gold CPU (2110 to 189 sec). NMSW is model-agnostic, further boosting efficiency when integrated with any existing efficient segmentation backbone. The code is available at: https://github.com/Youngseok0001/open_nmsw.
中文摘要:提出的NMSW框架通过选择性采样关键切片并融合全局预测,在保持精度的同时消除了3D医学图像分割中的滑动窗口操作,实现了91%的计算量削减和超过9倍的推理加速。
English Summary: The proposed NMSW framework eliminates sliding window inference in 3D medical image segmentation by selectively sampling relevant patches and leveraging global predictions, achieving 91% computational reduction and over 9x faster inference while maintaining competitive accuracy.

Authors:Pengcheng Zhao, Zhixian He, Fuwei Zhang, Shujin Lin, Fan Zhou
Title: LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
Abstract:
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information. However, existing methods face several issues: (1) Overlapping semantic information between different samples in the dataset hinders the model's multimodal aligning performance; (2) Existing models are not able to efficiently extract local features of the video; (3) The Transformer Decoder used by existing models cannot adequately decode multimodal features. To address the above issues, we propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distill the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we design a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we feed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluated LD-DETR on four public benchmarks and conducted extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms the State-Of-The-Art models on the QVHighlight, Charades-STA and TACoS datasets. Our code is available at https://github.com/qingchen239/ld-detr.
Chinese: LD-DETR模型通过将相似度矩阵蒸馏为恒等矩阵缓解语义重叠问题,利用卷积层高效提取局部特征,并通过Transformer解码器自反馈充分解码多模态信息,在多个公开数据集上实现了最先进的视频片段检索和高亮检测性能。
English: The LD-DETR model addresses key challenges in video moment retrieval and highlight detection by mitigating overlapping semantic information through similarity matrix distillation, enhancing local feature extraction with convolutional layers, and improving multimodal decoding via Transformer Decoder feedback, achieving state-of-the-art performance on multiple benchmarks.
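As a rough illustration of the loop-decoder idea (feeding the Transformer Decoder output back in as its own query input), here is a minimal sketch; the dimensions, number of loops, and layer configuration are illustrative assumptions rather than the authors' released code.

```python
# Assumed illustration of a "loop decoder": re-enter the same decoder with its own output.
import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

def loop_decode(queries: torch.Tensor, memory: torch.Tensor, n_loops: int = 3) -> torch.Tensor:
    out = queries
    for _ in range(n_loops):        # feed the decoder output back in as the next query set
        out = decoder(out, memory)
    return out

queries = torch.randn(2, 10, 256)   # moment/highlight queries
memory = torch.randn(2, 75, 256)    # fused video-text features
print(loop_decode(queries, memory).shape)   # torch.Size([2, 10, 256])
```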

Authors:Xinjie Liang, Xiangyu Li, Fanding Li, Jie Jiang, Qing Dong, Wei Wang, Kuanquan Wang, Suyu Dong, Gongning Luo, Shuo Li
Title: MedFILIP: Medical Fine-grained Language-Image Pre-training
Abstract:
Medical vision-language pretraining (VLP) that leverages naturally-paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports; it excels at extracting disease details through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, information-richer labels and thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy improved by up to 6.69%. The code is available at https://github.com/PerceptionComputingLab/MedFILIP.
Chinese: MedFILIP是一种细粒度视觉语言预训练模型,通过对比学习增强医学图像与疾病的关联性,在多个数据集上实现最高6.69%准确率提升的领先性能。
English: MedFILIP is a fine-grained vision-language pretraining model that enhances medical image-disease associations through contrastive learning, achieving state-of-the-art performance across multiple datasets with up to 6.69% accuracy improvement.
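One way to read point 3) above is contrastive alignment with soft targets: the fine-grained semantic similarity matrix replaces the usual identity-matrix target. The snippet below sketches that reading (PyTorch 1.10+ accepts probabilistic targets in cross_entropy); the loss form and all names are assumptions, not MedFILIP's released code.

```python
# Hedged sketch: image-text contrastive loss with a soft semantic-similarity target.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(img_emb, txt_emb, sim_target, temperature=0.07):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings; sim_target: (B, B), rows sum to 1."""
    logits = img_emb @ txt_emb.t() / temperature
    return F.cross_entropy(logits, sim_target)   # soft-label cross entropy (PyTorch >= 1.10)

B, D = 8, 256
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
target = F.softmax(torch.randn(B, B), dim=-1)    # e.g. derived from fine-grained disease annotations
print(soft_contrastive_loss(img, txt, target).item())
```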

Authors:Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, Risheng Liu
Title: Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption
Abstract:
Infrared-visible image fusion (IVIF) is a critical task in computer vision, aimed at integrating the unique features of both infrared and visible spectra into a unified representation. Since 2018, the field has entered the deep learning era, with an increasing variety of approaches introducing a range of networks and loss functions to enhance visual performance. However, challenges such as data compatibility, perception accuracy, and efficiency remain. Unfortunately, there is a lack of recent comprehensive surveys that address this rapidly expanding domain. This paper fills that gap by providing a thorough survey covering a broad range of topics. We introduce a multi-dimensional framework to elucidate common learning-based IVIF methods, from visual enhancement strategies to data compatibility and task adaptability. We also present a detailed analysis of these approaches, accompanied by a lookup table clarifying their core ideas. Furthermore, we summarize performance comparisons, both quantitatively and qualitatively, focusing on registration, fusion, and subsequent high-level tasks. Beyond technical analysis, we discuss potential future directions and open issues in this area. For further details, visit our GitHub repository: https://github.com/RollingPlain/IVIF_ZOO.
中文摘要:本文对红外-可见光图像融合领域进行了全面综述,系统分析了深度学习方法、性能比较及未来研究方向。
English Summary: This paper provides a comprehensive survey of infrared-visible image fusion, analyzing deep learning methods, performance comparisons, and future directions in the field.

Authors:Weihang Zhang, Jihao Li, Shuoke Li, Ziqing Niu, Jialiang Chen, Wenkai Zhang
Title: A Resource-Efficient Training Framework for Remote Sensing Text-Image Retrieval
Abstract:
Remote sensing text-image retrieval (RSTIR) aims to retrieve the matched remote sensing (RS) images from the database according to the descriptive text. Recently, the rapid development of large visual-language pre-training models provides new insights for RSTIR. Nevertheless, as the complexity of models grows in RSTIR, the previous studies suffer from suboptimal resource efficiency during transfer learning. To address this issue, we propose a computation and memory-efficient retrieval (CMER) framework for RSTIR. To reduce the training memory consumption, we propose the Focus-Adapter module, which adopts a side branch structure. Its focus layer suppresses the interference of background pixels for small targets. Simultaneously, to enhance data efficacy, we regard the RS scene category as the metadata and design a concise augmentation technique. The scene label augmentation leverages the prior knowledge from land cover categories and shrinks the search space. We propose the negative sample recycling strategy to make the negative sample pool decoupled from the mini-batch size. It improves the generalization performance without introducing additional encoders. We have conducted quantitative and qualitative experiments on public datasets and expanded the benchmark with some advanced approaches, which demonstrates the competitiveness of the proposed CMER. Compared with the recent advanced methods, the overall retrieval performance of CMER is 2%-5% higher on RSITMD. Moreover, our proposed method reduces memory consumption by 49% and has a 1.4x data throughput during training. The code of the CMER and the dataset will be released at https://github.com/ZhangWeihang99/CMER.
中文: 本研究提出了一种用于遥感图文检索的计算与内存高效检索(CMER)框架,通过创新模块和策略,在提升检索性能2%–5%的同时,将内存消耗降低49%并实现1.4倍的数据处理效率提升。
English: The study introduces a computation and memory-efficient retrieval (CMER) framework for remote sensing text-image retrieval, which enhances retrieval performance by 2%–5%, reduces memory usage by 49%, and increases data throughput by 1.4 times through innovative modules and strategies.

Authors:Mehrad Mortazavi, David J. Cappelleri, Reza Ehsani
Title: RoMu4o: A Robotic Manipulation Unit For Orchard Operations Automating Proximal Hyperspectral Leaf Sensing
Abstract:
Driven by the need to address labor shortages and meet the demands of a rapidly growing population, robotic automation has become a critical component in precision agriculture. Leaf-level hyperspectral spectroscopy is shown to be a powerful tool for phenotyping, monitoring crop health, identifying essential nutrients within plants as well as detecting diseases and water stress. This work introduces RoMu4o, a robotic manipulation unit for orchard operations offering an automated solution for proximal hyperspectral leaf sensing. This ground robot is equipped with a 6DOF robotic arm and vision system for real-time deep learning-based image processing and motion planning. We developed robust perception and manipulation pipelines that enable the robot to successfully grasp target leaves and perform spectroscopy. These frameworks operate synergistically to identify and extract the 3D structure of leaves from an observed batch of foliage, propose 6D poses, and generate collision-free constraint-aware paths for precise leaf manipulation. The end-effector of the arm features a compact design that integrates an independent lighting source with a hyperspectral sensor, enabling high-fidelity data acquisition while streamlining the calibration process for accurate measurements. Our ground robot is engineered to operate in unstructured orchard environments. However, the performance of the system is evaluated in both indoor and outdoor plant models. The system demonstrated reliable performance for 1-LPB hyperspectral sampling, achieving 95% success rate in lab trials and 79% in field trials. Field experiments revealed an overall success rate of 70% for autonomous leaf grasping and hyperspectral measurement in a pistachio orchard. The open-source repository is available at: https://github.com/mehradmrt/UCM-AgBot-ROS2
中文: 为解决劳动力短缺和满足人口增长需求,本研究开发了RoMu4o果园机器人系统,通过先进的视觉与操控技术实现叶片抓取和高光谱检测,在实验室和田间试验中均展现出较高的成功率。
English: To tackle labor shortages and support growing populations, this research presents RoMu4o, a robotic system for orchard automation that uses advanced vision and manipulation to grasp leaves and perform hyperspectral sensing with high success rates in both lab and field conditions.

Authors:Delin An, Pan Du, Pengfei Gu, Jian-Xun Wang, Chaoli Wang
Title: Hierarchical LoG Bayesian Neural Network for Enhanced Aorta Segmentation
Abstract:
Accurate segmentation of the aorta and its associated arch branches is crucial for diagnosing aortic diseases. While deep learning techniques have significantly improved aorta segmentation, they remain challenging due to the intricate multiscale structure and the complexity of the surrounding tissues. This paper presents a novel approach for enhancing aorta segmentation using a Bayesian neural network-based hierarchical Laplacian of Gaussian (LoG) model. Our model consists of a 3D U-Net stream and a hierarchical LoG stream: the former provides an initial aorta segmentation, and the latter enhances blood vessel detection across varying scales by learning suitable LoG kernels, enabling self-adaptive handling of different parts of the aorta vessels with significant scale differences. We employ a Bayesian method to parameterize the LoG stream and provide confidence intervals for the segmentation results, ensuring robustness and reliability of the prediction for vascular medical image analysts. Experimental results show that our model can accurately segment main and supra-aortic vessels, yielding at least a 3% gain in the Dice coefficient over state-of-the-art methods across multiple volumes drawn from two aorta datasets, and can provide reliable confidence intervals for different parts of the aorta. The code is available at https://github.com/adlsn/LoGBNet.
中文: 本文提出一种基于贝叶斯神经网络的分层拉普拉斯高斯模型,通过结合三维U-Net与自适应多尺度血管检测来提升主动脉分割精度,在多个数据集上实现Dice系数至少3%的提升,并能为不同血管部位提供可靠置信区间。
English: This paper introduces a Bayesian neural network-based hierarchical Laplacian of Gaussian model that improves aorta segmentation by combining 3D U-Net with adaptive multi-scale vessel detection, achieving at least 3% higher Dice scores and providing reliable confidence intervals.
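To make the learned LoG stream more concrete, the sketch below builds a 3D Laplacian-of-Gaussian kernel whose scale sigma is a learnable parameter, so the filter can adapt to vessels of different calibers. Kernel size, normalization, and the single-scale setup are my assumptions; the paper's hierarchical, Bayesian parameterization is not reproduced.

```python
# Assumed sketch: a 3D LoG kernel that is differentiable with respect to its scale sigma.
import torch
import torch.nn.functional as F

def log_kernel_3d(sigma: torch.Tensor, size: int = 7) -> torch.Tensor:
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    z, y, x = torch.meshgrid(ax, ax, ax, indexing="ij")
    r2 = x**2 + y**2 + z**2
    g = torch.exp(-r2 / (2 * sigma**2))
    g = g / g.sum()
    return g * (r2 - 3 * sigma**2) / sigma**4     # Laplacian of a 3D Gaussian, up to scale

sigma = torch.nn.Parameter(torch.tensor(1.5))     # learnable scale, e.g. one per hierarchy level
kernel = log_kernel_3d(sigma)                     # (7, 7, 7), gradients flow into sigma
volume = torch.randn(1, 1, 32, 32, 32)            # toy CT sub-volume
response = F.conv3d(volume, kernel[None, None], padding=3)
print(response.shape)                             # torch.Size([1, 1, 32, 32, 32])
```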

Authors:Ruixuan Zhang, Beichen Wang, Juexiao Zhang, Zilin Bian, Chen Feng, Kaan Ozbay
Title: When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
Abstract:
The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great potential of increasing the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at https://github.com/ai4ce/SeeUnsafe.
中文: SeeUnsafe框架通过多模态大语言模型代理,将传统交通事故视频分析转变为交互式对话处理,利用基于严重程度的聚合策略自动完成分类和视觉定位任务,显著提升了处理效率与场景适应性。
English: The SeeUnsafe framework utilizes Multimodal Large Language Model agents to revolutionize traffic accident analysis by enabling interactive, conversational processing of surveillance videos, automating classification and visual grounding tasks while adapting to diverse scenarios through a severity-based aggregation strategy.

Authors:Xiaolu Hou, Mingcheng Li, Dingkang Yang, Jiawei Chen, Ziyun Qian, Xiao Zhao, Yue Jiang, Jinjie Wei, Qingyao Xu, Lihua Zhang
Title: BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation
Abstract:
With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.
中文: 本文提出BloomScene,一种轻量级3D高斯溅射方法,通过跨模态渐进生成、分层深度正则化和结构化压缩机制,从文本或图像输入生成高质量3D场景,有效减少存储需求并提升场景真实感。
English: The abstract introduces BloomScene, a lightweight 3D Gaussian splatting method that generates high-quality 3D scenes from text or image inputs using crossmodal progressive generation, hierarchical depth regularization, and structured compression to reduce storage and enhance realism.

Authors:Kartik Narayan, Vibashan VS, Vishal M. Patel
Title: FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs' face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o and GeminiPro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding. Code: https://github.com/Kartik-3004/facexbench
中文: FaceXBench 是一个综合性基准测试,旨在系统评估多模态大语言模型在复杂人脸理解任务中的能力,揭示了尽管这些模型具备广泛的问题解决能力,但在该领域仍有显著提升空间。
English: FaceXBench is a comprehensive benchmark introduced to systematically evaluate Multimodal Large Language Models' capabilities in complex face understanding, revealing significant room for improvement despite their broad problem-solving skills.

Authors:Xuange Zhang, Dengjie Li, Bo Liu, Zenghao Bao, Yao Zhou, Baisong Yang, Zhongying Liu, Yujie Zhong, Zheng Zhao, Tongtong Yuan
Title: HiMix: Reducing Computational Complexity in Large Vision-Language Models
Abstract:
Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, the excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. Then, we propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages within each language decoder layer. It is striking that our approach significantly reduces computational complexity with minimal performance loss. Specifically, HiMix achieves a 10x reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. This highlights the advantages of our method, and we hope our research brings new perspectives to the field of vision-language understanding. Project Page: https://xuange923.github.io/HiMix
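To picture the hierarchical vision injection, here is a hedged sketch of one decoder layer in which only the language tokens are updated, while vision tokens are injected once per layer as keys and values; module names, dimensions, and the injection point are assumptions rather than the released HiMix code.

```python
# Assumed sketch of a mixture-attention decoder layer: language runs the full forward,
# vision is only injected as keys/values at one stage of the layer.
import torch
import torch.nn as nn

class MixAttnLayer(nn.Module):
    def __init__(self, d: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, lang: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        h = self.n1(lang)
        lang = lang + self.self_attn(h, h, h)[0]                            # language-only self-attention
        lang = lang + self.vision_attn(self.n2(lang), vision, vision)[0]    # vision injected as keys/values
        return lang + self.ffn(self.n3(lang))                               # vision tokens are never updated

layer = MixAttnLayer()
lang, vision = torch.randn(2, 32, 512), torch.randn(2, 576, 512)
print(layer(lang, vision).shape)   # torch.Size([2, 32, 512])
```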

Authors:Chengwei Zheng, Lixin Xue, Juan Zarate, Jie Song
Title: GauSTAR: Gaussian Surface Tracking and Reconstruction
Abstract:
3D Gaussian Splatting techniques have enabled efficient photo-realistic rendering of static scenes. Recent works have extended these approaches to support surface reconstruction and tracking. However, tracking dynamic surfaces with 3D Gaussians remains challenging due to complex topology changes, such as surfaces appearing, disappearing, or splitting. To address these challenges, we propose GauSTAR, a novel method that achieves photo-realistic rendering, accurate surface reconstruction, and reliable 3D tracking for general dynamic scenes with changing topology. Given multi-view captures as input, GauSTAR binds Gaussians to mesh faces to represent dynamic objects. For surfaces with consistent topology, GauSTAR maintains the mesh topology and tracks the meshes using Gaussians. For regions where topology changes, GauSTAR adaptively unbinds Gaussians from the mesh, enabling accurate registration and generation of new surfaces based on these optimized Gaussians. Additionally, we introduce a surface-based scene flow method that provides robust initialization for tracking between frames. Experiments demonstrate that our method effectively tracks and reconstructs dynamic surfaces, enabling a range of applications. Our project page with the code release is available at https://eth-ait.github.io/GauSTAR/.

Authors:Xi Yang, Haoyuan Shi, Zihan Wang, Nannan Wang, Xinbo Gao
Title: CSHNet: A Novel Information Asymmetric Image Translation Method
Abstract:
Despite advancements in cross-domain image translation, challenges persist in asymmetric tasks such as SAR-to-Optical and Sketch-to-Instance conversions, which involve transforming data from a less detailed domain into one with richer content. Traditional CNN-based methods are effective at capturing fine details but struggle with global structure, leading to unwanted merging of image regions. To address this, we propose the CNN-Swin Hybrid Network (CSHNet), which combines two key modules: Swin Embedded CNN (SEC) and CNN Embedded Swin (CES), forming the SEC-CES-Bottleneck (SCB). SEC leverages CNN's detailed feature extraction while integrating the Swin Transformer's structural bias. CES, in turn, preserves the Swin Transformer's global integrity, compensating for CNN's lack of focus on structure. Additionally, CSHNet includes two components designed to enhance cross-domain information retention: the Interactive Guided Connection (IGC), which enables dynamic information exchange between SEC and CES, and Adaptive Edge Perception Loss (AEPL), which maintains structural boundaries during translation. Experimental results show that CSHNet outperforms existing methods in both visual quality and performance metrics across scene-level and instance-level datasets. Our code is available at: https://github.com/XduShi/CSHNet.
中文摘要:提出的CNN-Swin混合网络(CSHNet)通过结合局部特征提取与全局结构保持的创新模块及损失函数,有效解决了非对称图像转换中的结构失真问题,在多个数据集上实现了最优性能。
English Summary: The proposed CNN-Swin Hybrid Network (CSHNet) effectively addresses structural limitations in asymmetric image translation by integrating detailed feature extraction with global structural preservation through novel modules and loss functions, achieving superior performance across multiple datasets.

Authors:Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, Yu Qiao
Title: DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
Abstract:
Diffusion models have demonstrated exceptional capabilities in image restoration, yet their application to video super-resolution (VSR) faces significant challenges in balancing fidelity with temporal consistency. Our evaluation reveals a critical gap: existing approaches consistently fail on severely degraded videos--precisely where diffusion models' generative capabilities are most needed. We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead. Experiments demonstrate that our approach excels in scenarios where competing methods struggle, particularly on severely degraded videos. Our work reveals that addressing the learning strategy, rather than focusing solely on architectural complexity, is the critical path toward robust real-world video super-resolution with diffusion models.
中文摘要:本研究提出的DiffVSR框架通过渐进式学习策略和交织潜在转换技术,系统化解除了扩散模型在视频超分辨率中的学习负担,在严重退化视频上实现了卓越性能,揭示了学习策略优化比架构复杂性更关键的本质。
English Summary: The study introduces DiffVSR, a novel diffusion-based video super-resolution framework that overcomes existing limitations through a Progressive Learning Strategy and Interweaved Latent Transition technique, achieving superior performance on severely degraded videos by systematically addressing the learning burden.

Authors:Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
Title: Universal Actions for Enhanced Embodied Foundation Models
Abstract:
Training on diverse, internet-scale data is a key factor in the success of recent large foundation models. Yet, using the same recipe for building embodied agents has faced noticeable difficulties. Despite the availability of many crowd-sourced embodied datasets, their action spaces often exhibit significant heterogeneity due to distinct physical embodiment and control interfaces for different robots, causing substantial challenges in developing embodied foundation models using cross-domain data. In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in a Universal Action Space. Our learned universal actions capture the generic atomic behaviors across diverse robots by exploiting their shared structural features, and enable enhanced cross-domain data utilization and cross-embodiment generalizations by eliminating the notorious heterogeneity. The universal actions can be efficiently translated back to heterogeneous actionable commands by simply adding embodiment-specific details, from which fast adaptation to new robots becomes simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots, showcasing exceptional cross-embodiment control and adaptation capability, highlighting the crucial benefit of adopting universal actions. Project page: https://github.com/2toinf/UniAct
中文摘要:UniAct提出通用动作空间框架,通过提取不同机器人的共享结构特征学习通用原子行为,有效消除异构性,在跨领域泛化和跨具身控制方面展现出卓越性能,以更小模型规模超越现有大型具身基础模型。
English Summary: UniAct introduces a universal action space framework that overcomes heterogeneity in embodied AI by learning shared atomic behaviors across robots, enabling superior cross-domain generalization and outperforming larger models in real-world and simulation evaluations.

Authors:Michael Schwingshackl, Fabio Francisco Oberweger, Markus Murschitz
Title: Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks
Abstract:
This paper proposes a novel approach to few-shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. By providing 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck-mounted loading crane, achieves effective segmentation across various levels of detail. Training times are kept under five minutes on consumer GPUs. The model demonstrates robust generalization to real data, achieving a qualitative synthetic-to-real generalization with a J&F score of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a J&F score of 71.5 in semi-supervised video segmentation with three support samples. This method's fast training times and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few-shot segmentation tasks.
中文: 本文提出了一种新颖的少样本机械部件语义分割方法,通过结合CLIPSeg、SAM、SuperPoint和图卷积网络,实现了快速训练和仅需少量标注样本即可有效泛化到真实数据的能力。
English: This paper introduces a novel few-shot semantic segmentation method for machinery parts by integrating CLIPSeg, SAM, SuperPoint, and GCN, achieving rapid training and robust generalization to real data with minimal annotated samples.

Authors:Ali Can Karaca, M. Enes Ozelbas, Saadettin Berber, Orkhan Karimli, Turabi Yildirim, M. Fatih Amasyali
Title: Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework
Abstract:
Remote sensing change captioning (RSICC) aims to describe changes between bitemporal images in natural language. Existing methods often fail under challenges such as illumination differences, viewpoint changes, and blur effects, leading to inaccuracies, especially in no-change regions. Moreover, images acquired at different spatial resolutions or with registration errors tend to degrade the captions. To address these issues, we introduce SECOND-CC, a novel RSICC dataset featuring high-resolution RGB image pairs, semantic segmentation maps, and diverse real-world scenarios. SECOND-CC contains 6,041 pairs of bitemporal RS images and 30,205 sentences describing the differences between images. Additionally, we propose MModalCC, a multimodal framework that integrates semantic and visual data using advanced attention mechanisms, including Cross-Modal Cross Attention (CMCA) and Multimodal Gated Cross Attention (MGCA). Detailed ablation studies and attention visualizations further demonstrate its effectiveness and ability to address RSICC challenges. Comprehensive experiments show that MModalCC outperforms state-of-the-art RSICC methods, including RSICCformer, Chg2Cap, and PSNet, with a +4.6% improvement in BLEU4 score and a +9.6% improvement in CIDEr score. We will make our dataset and codebase publicly available to facilitate future research at https://github.com/ChangeCapsInRS/SecondCC
Chinese: 本研究提出了高分辨率遥感数据集SECOND-CC和融合语义与视觉信息的多模态框架MModalCC,通过先进的注意力机制显著提升了变化描述准确性,在BLEU4和CIDEr指标上分别实现了4.6%和9.6%的性能提升。
English: The study introduces SECOND-CC, a high-resolution remote sensing dataset, and MModalCC, a multimodal framework that integrates semantic and visual data to improve change captioning accuracy, achieving state-of-the-art performance with significant gains in BLEU4 and CIDEr scores.

Authors:Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Title: FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization
Abstract:
Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.
中文: FiLo++方法通过大语言模型生成针对特定类别的精确异常描述,并结合可变形定位模块准确识别各种形状大小的异常,在多个数据集上实现了优于现有方法的零样本和少样本异常检测性能。
English: The FiLo++ method enhances zero-shot and few-shot anomaly detection by generating precise, category-specific anomaly descriptions using large language models and employing a deformable localization module for accurate identification of diverse anomalies, achieving superior performance across multiple datasets.

Authors:Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi
Title: One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression
Abstract:
Current image tokenization methods require a large number of tokens to capture the information contained within images. Although the amount of information varies across images, most image tokenizers only support fixed-length tokenization, leading to inefficiency in token allocation. In this study, we introduce One-D-Piece, a discrete image tokenizer designed for variable-length tokenization with a quality-controllable mechanism. To enable variable compression rates, we introduce a simple but effective regularization mechanism named "Tail Token Drop" into discrete one-dimensional image tokenizers. This method encourages critical information to concentrate at the head of the token sequence, enabling variable-length tokenization while preserving state-of-the-art reconstruction quality. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods, including JPEG and WebP, at smaller byte sizes. Furthermore, we assess our tokenizer on various downstream computer vision tasks, including image classification, object detection, semantic segmentation, and depth estimation, confirming its adaptability to numerous applications compared to other variable-rate methods. Our approach demonstrates the versatility of variable-length discrete image tokenization, establishing a new paradigm in both compression efficiency and reconstruction performance. Finally, we validate the effectiveness of tail token drop via detailed analysis of tokenizers.
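The "Tail Token Drop" regularization admits a very small sketch: during training, a random-length tail of each one-dimensional token sequence is masked out, so reconstruction pressure pushes critical information toward the head. The function below is an assumed illustration on continuous embeddings; the actual tokenizer operates on discrete codes and its drop schedule may differ.

```python
# Assumed sketch of Tail Token Drop: zero out a random-length tail of each token sequence.
import torch

def tail_token_drop(tokens: torch.Tensor, min_keep: int = 16) -> torch.Tensor:
    """tokens: (B, L, D) one-dimensional latent token sequence (embedded)."""
    B, L, _ = tokens.shape
    keep = torch.randint(min_keep, L + 1, (B,))               # per-sample number of kept tokens
    pos = torch.arange(L).unsqueeze(0)                         # (1, L) position index
    mask = (pos < keep.unsqueeze(1)).unsqueeze(-1).float()     # 1 for the head, 0 for the dropped tail
    return tokens * mask

tokens = torch.randn(4, 256, 12)
print(tail_token_drop(tokens).shape)   # same shape, tails zeroed at random lengths
```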

Authors:Di Chang, Hongyi Xu, You Xie, Yipeng Gao, Zhengfei Kuang, Shengqu Cai, Chenxu Zhang, Guoxian Song, Chao Wang, Yichun Shi, Zeyuan Chen, Shijie Zhou, Linjie Luo, Gordon Wetzstein, Mohammad Soleymani
Title: X-Dyna: Expressive Dynamic Human Image Animation
Abstract:
We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key shortcomings causing the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations. The code is available at https://github.com/bytedance/X-Dyna.
中文: X-Dyna 是一种基于零样本扩散的动画生成方法,通过驱动视频中的面部表情和身体动作来激活单张人物图像,利用动态适配器和局部控制模块实现人物与环境的逼真动态效果。
English: X-Dyna is a zero-shot diffusion-based pipeline that animates a single human image by transferring facial expressions and body movements from a driving video, generating realistic dynamics for both the subject and environment through its Dynamics-Adapter and local control modules.

Authors:Chao He, Jianqiang Ren, Yuan Dong, Jianjing Xiang, Xiejie Shen, Weihao Yuan, Liefeng Bo
Title: Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions
Abstract:
The 2D cartoon style is a prominent art form in digital character creation, particularly popular among younger audiences. While advancements in digital human technology have spurred extensive research into photorealistic digital humans and 3D characters, interactive 2D cartoon characters have received comparatively less attention. Unlike 3D counterparts, which require sophisticated construction and resource-intensive rendering, Live2D, a widely-used format for 2D cartoon characters, offers a more efficient alternative that allows 2D characters to be animated in a manner that simulates 3D movement without the need to build a complete 3D model. Furthermore, Live2D employs lightweight HTML5 (H5) rendering, improving both accessibility and efficiency. In this technical report, we introduce Textoon, an innovative method for generating diverse 2D cartoon characters in the Live2D format based on text descriptions. Textoon leverages cutting-edge language and vision models to comprehend textual intentions and generate 2D appearances, and is capable of creating a wide variety of stunning and interactive 2D characters within one minute. The project homepage is https://human3daigc.github.io/Textoon_webpage/.

Authors:Xigui Li, Yuanye Zhou, Feiyang Xiao, Xin Guo, Yichi Zhang, Chen Jiang, Jianchao Ge, Xiansheng Wang, Qimeng Wang, Taiwei Zhang, Chensen Lin, Yuan Cheng, Yuan Qi
Title: Aneumo: A Large-Scale Comprehensive Synthetic Dataset of Aneurysm Hemodynamics
Abstract:
Intracranial aneurysm (IA) is a common cerebrovascular disease that is usually asymptomatic but may cause severe subarachnoid hemorrhage (SAH) if ruptured. Although clinical practice is usually based on individual factors and morphological features of the aneurysm, its pathophysiology and hemodynamic mechanisms remain controversial. To address the limitations of current research, this study constructed a comprehensive hemodynamic dataset of intracranial aneurysms. The dataset is based on 466 real aneurysm models, and 10,000 synthetic models were generated by resection and deformation operations, including 466 aneurysm-free models and 9,534 deformed aneurysm models. The dataset also provides medical image-like segmentation mask files to support insightful analysis. In addition, the dataset contains hemodynamic data measured at eight steady-state flow rates (0.001 to 0.004 kg/s), including critical parameters such as flow velocity, pressure, and wall shear stress, providing a valuable resource for investigating aneurysm pathogenesis and clinical prediction. This dataset will help advance the understanding of the pathologic features and hemodynamic mechanisms of intracranial aneurysms and support in-depth research in related fields. Dataset hosted at https://github.com/Xigui-Li/Aneumo.
中文: 本研究基于466个真实和1万个合成颅内动脉瘤模型构建了全面的血流动力学数据集,提供流速、压力和壁面剪应力等关键参数,将推动对动脉瘤发病机制和血流动力学机理的研究。
English: This study developed a comprehensive hemodynamic dataset using 466 real and 10,000 synthetic intracranial aneurysm models, providing flow velocity, pressure, and wall shear stress data to advance understanding of aneurysm pathogenesis and mechanisms.

Authors:Xiangyue Liu, Kunming Luo, Heng Li, Qi Zhang, Yuan Liu, Li Yi, Ping Tan
Title: GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
Abstract:
We introduce GaussianAvatar-Editor, an innovative framework for text-driven editing of animatable Gaussian head avatars that can be fully controlled in expression, pose, and viewpoint. Unlike static 3D Gaussian editing, editing animatable 4D Gaussian avatars presents challenges related to motion occlusion and spatial-temporal inconsistency. To address these issues, we propose the Weighted Alpha Blending Equation (WABE). This function enhances the blending weight of visible Gaussians while suppressing the influence on non-visible Gaussians, effectively handling motion occlusion during editing. Furthermore, to improve editing quality and ensure 4D consistency, we incorporate conditional adversarial learning into the editing process. This strategy helps to refine the edited results and maintain consistency throughout the animation. By integrating these methods, our GaussianAvatar-Editor achieves photorealistic and consistent results in animatable 4D Gaussian editing. We conduct comprehensive experiments across various subjects to validate the effectiveness of our proposed techniques, which demonstrates the superiority of our approach over existing methods. More results and code are available at: [Project Link](https://xiangyueliu.github.io/GaussianAvatar-Editor/).
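The Weighted Alpha Blending Equation can be pictured as front-to-back alpha compositing in which per-Gaussian visibility weights rescale the opacities, boosting visible Gaussians and suppressing occluded ones during editing. The snippet below is a schematic reading of that idea; the exact weighting function in the paper is not reproduced.

```python
# Schematic (assumed) weighted alpha blending over Gaussians sorted front-to-back.
import torch

def weighted_alpha_blend(colors: torch.Tensor, alphas: torch.Tensor, visibility: torch.Tensor) -> torch.Tensor:
    """colors: (N, 3); alphas, visibility: (N,) in [0, 1]; Gaussians sorted front-to-back."""
    w_alpha = alphas * visibility                                                 # suppress non-visible Gaussians
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - w_alpha[:-1]]), dim=0)    # accumulated transmittance
    weights = w_alpha * trans                                                     # standard compositing weights
    return (weights[:, None] * colors).sum(dim=0)                                 # blended pixel color

colors = torch.rand(100, 3)
alphas = torch.rand(100) * 0.5
visibility = (torch.rand(100) > 0.3).float()   # e.g. from a motion-occlusion test
print(weighted_alpha_blend(colors, alphas, visibility))
```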

Authors:Xiaoyun Zheng, Liwei Liao, Jianbo Jiao, Feng Gao, Ronggang Wang
Title: Surface-SOS: Self-Supervised Object Segmentation via Neural Surface Representation
Abstract:
Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. Under conditions of multi-camera inputs, the structural, textural and geometrical consistency among each view can be leveraged to achieve fine-grained object segmentation. To make better use of the above information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework to segment objects for each view by 3D surface representation from multi-view images of a scene. To model high-quality geometry surfaces for complex scenes, we design a novel scene representation scheme, which decomposes the scene into two complementary neural representation modules respectively with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images, by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS always yields finer object masks than its NeRF-based counterparts and surpasses supervised single-view baselines remarkably. Code is available at: https://github.com/zhengxyun/Surface-SOS.
中文: Surface-SOS是一种创新的自监督物体分割框架,通过多视角图像的3D表面表示实现无需标注的精细分割,在多个基准测试中显著优于现有方法。
English: Surface-SOS is a novel self-supervised object segmentation framework that utilizes 3D surface representations from multi-view images to achieve fine-grained segmentation without annotations, outperforming existing methods on multiple benchmarks.

Authors:Yixiang Zhuang, Chunshan Ma, Yao Cheng, Xuan Cheng, Jing Liao, Juncong Lin
Title: TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation
Abstract:
Although significant progress has been made in the field of speech-driven 3D facial animation recently, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked by recent research. This is primarily due to the weak correlation between speech and eye gaze, as well as the scarcity of audio-gaze data, making it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method which can generate diverse 3D eye gaze motions in harmony with the speech. To achieve this, we first construct an audio-gaze dataset that contains about 14 hours of audio-mesh sequences featuring high-quality eye gaze motion, head motion and facial motion simultaneously. The motion data is acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which the head motions and eye gaze motions are jointly generated from speech but are modeled in two separate latent spaces. This design stems from the physiological knowledge that the rotation range of the eyeballs is smaller than that of the head. Through mapping the speech embedding into the two latent spaces, the difficulty in modeling the weak correlation between speech and non-verbal motion is thus attenuated. Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/

Authors:Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, Stan Birchfield
Title: FoundationStereo: Zero-Shot Stereo Matching
Abstract:
Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation. Project page: https://nvlabs.github.io/FoundationStereo/

Authors:Fausto German, Brian Keith, Mauricio Matus, Diego Urrutia, Claudio Meneses
Title: Semi-Supervised Image-Based Narrative Extraction: A Case Study with Historical Photographic Records
Abstract:
This paper presents a semi-supervised approach to extracting narratives from historical photographic records using an adaptation of the narrative maps algorithm. We extend the original unsupervised text-based method to work with image data, leveraging deep learning techniques for visual feature extraction and similarity computation. Our method is applied to the ROGER dataset, a collection of photographs from the 1928 Sacambaya Expedition in Bolivia captured by Robert Gerstmann. We compare our algorithmically extracted visual narratives with expert-curated timelines of varying lengths (5 to 30 images) to evaluate the effectiveness of our approach. In particular, we use the Dynamic Time Warping (DTW) algorithm to match the extracted narratives with the expert-curated baseline. In addition, we asked an expert on the topic to qualitatively evaluate a representative example of the resulting narratives. Our findings show that the narrative maps approach generally outperforms random sampling for longer timelines (10+ images, p < 0.05), with expert evaluation confirming the historical accuracy and coherence of the extracted narratives. This research contributes to the field of computational analysis of visual cultural heritage, offering new tools for historians, archivists, and digital humanities scholars to explore and understand large-scale image collections. The method's ability to generate meaningful narratives from visual data opens up new possibilities for the study and interpretation of historical events through photographic evidence.
中文: 本文提出一种半监督方法,通过改进叙事地图算法并利用深度学习技术从历史照片中提取视觉叙事,在ROGER数据集上的实验表明该方法在长序列中优于随机采样,且专家评估确认了叙事的历史准确性与连贯性。
English: This paper introduces a semi-supervised method that adapts narrative maps to extract visual narratives from historical photos using deep learning, validated on the ROGER dataset where it outperforms random sampling for longer sequences and gains expert approval for historical coherence.
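Since the quantitative evaluation rests on Dynamic Time Warping between the extracted narrative and the expert timeline, a compact DTW over image embeddings is sketched below; the embedding dimensionality and the Euclidean frame cost are assumptions made for illustration.

```python
# Minimal DTW distance between two sequences of image embeddings (illustrative assumptions).
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (n, d), b: (m, d) sequences of per-image feature vectors."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])                    # frame-to-frame cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

extracted = np.random.rand(10, 128)   # algorithmically extracted narrative
expert = np.random.rand(15, 128)      # expert-curated timeline
print(dtw_distance(extracted, expert))
```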

Authors:Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, Ziwei Liu
Title: SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
Abstract:
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).
中文摘要:本研究通过整合40个数据集扩展训练数据并采用视觉变换器进行模型升级,建立了表达性人体姿态与形状估计的基础模型,在多个基准测试中实现最优性能且展现出卓越的迁移能力。
English Summary: This research scales up expressive human pose and shape estimation through data expansion across 40 datasets and model scaling using vision transformers, establishing foundation models that achieve state-of-the-art performance across multiple benchmarks while demonstrating strong transferability.

Authors:Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
Title: VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Abstract:
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.
中文: 本研究表明,仅通过未标记视频数据训练的深度生成模型VideoWorld能够有效获取规则、推理和规划等复杂知识,无需文本输入即可在围棋和机器人控制任务中达到专业水平。
English: This research demonstrates that VideoWorld, a deep generative model trained solely on unlabeled video data, can effectively acquire complex knowledge such as rules, reasoning, and planning abilities without text input, achieving professional-level performance in Go and robotic control tasks.

Authors:Sumit Chaturvedi, Mengwei Ren, Yannick Hold-Geoffroy, Jingyuan Liu, Julie Dorsey, Zhixin Shu
Title: SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
Abstract:
We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels are transformed in response to changes in environmental lighting conditions. Using a physically-based rendering engine, we synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project Page: https://vrroom.github.io/synthlight/

Authors:Yuexi Du, Jiazhen Zhang, Tal Zeevi, Nicha C. Dvornek, John A. Onofrey
Title: SRE-Conv: Symmetric Rotation Equivariant Convolution for Biomedical Image Classification
Abstract:
Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN demonstrated improved classification accuracy on rotated images across all 16 test datasets in both 2D and 3D images, all while increasing efficiency with fewer parameters and reduced memory footprint. The code is available at https://github.com/XYPB/SRE-Conv.
中文摘要:本研究提出了一种对称旋转等变(SRE)卷积核,能在学习旋转不变特征的同时压缩模型规模,经生物医学图像数据集验证,其在提升分类精度的同时显著提高了计算效率。
English Summary: The study introduces a Symmetric Rotation-Equivariant (SRE) Convolution kernel that enhances CNN performance by learning rotation-invariant features while reducing model size, validated through improved accuracy and efficiency on biomedical image datasets.
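One simple way to obtain a symmetric, rotation-robust kernel while also compressing parameters is to share weights across rings of equal radius; the sketch below illustrates that construction under my own assumptions and should not be taken as the released SRE-Conv implementation.

```python
# Assumed sketch: a 2D convolution whose kernel shares one weight per radius ring.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RingSharedConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, size: int = 5):
        super().__init__()
        ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
        yy, xx = torch.meshgrid(ax, ax, indexing="ij")
        rings = torch.sqrt(xx**2 + yy**2).round().long()     # ring index for each kernel cell
        self.register_buffer("rings", rings)
        n_rings = int(rings.max().item()) + 1                 # far fewer parameters than size*size
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, n_rings) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kernel = self.weight[..., self.rings]                 # (out, in, size, size), rings share weights
        return F.conv2d(x, kernel, padding=kernel.shape[-1] // 2)

conv = RingSharedConv2d(3, 8, size=5)
print(conv(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 8, 32, 32])
```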

Authors:Qingyun Li, Yushi Chen, Xinya Shu, Dong Chen, Xin He, Yi Yu, Xue Yang
Title: A Simple Aerial Detection Baseline of Multimodal Language Models
Abstract:
The multimodal language models (MLMs) based on generative pre-trained Transformer are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.
中文:本文提出了LMMRotate这一创新基线,通过将检测输出标准化为文本并实现与传统检测器的公平比较,首次将多模态语言模型应用于航空检测任务,取得了具有竞争力的性能。
English: This paper introduces LMMRotate, a novel baseline that adapts multimodal language models for aerial detection by normalizing detection outputs into text and enabling fair comparison with conventional detectors, achieving competitive performance.
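As a rough illustration of the kind of output normalization described above, the snippet below converts oriented boxes into a quantized text string that an autoregressive model could be trained to emit; the token format and bin count are assumptions, not the LMMRotate specification.

```python
# Illustrative only: serialize oriented-box detections as quantized text tokens.
def boxes_to_text(boxes, categories, img_w, img_h, n_bins=1000):
    """boxes: iterable of (cx, cy, w, h, angle_deg); categories: class names."""
    parts = []
    for (cx, cy, w, h, a), cat in zip(boxes, categories):
        # Normalize to [0, 1], then quantize so every value becomes a small
        # integer token from a fixed vocabulary.
        vals = [cx / img_w, cy / img_h, w / img_w, h / img_h, (a % 360) / 360.0]
        bins = [int(round(v * (n_bins - 1))) for v in vals]
        parts.append(f"{cat} " + " ".join(f"<{b}>" for b in bins))
    return " ; ".join(parts)

print(boxes_to_text([(512, 256, 100, 40, 30.0)], ["plane"], 1024, 1024))
# e.g. "plane <500> <250> <98> <39> <83>"
```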

Authors:Juan C. Benito, Daniel Feijoo, Alvaro Garcia, Marcos V. Conde
Title: FLOL: Fast Baselines for Real-World Low-Light Enhancement
Abstract:
Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured during night or in dark environments has been well-studied in the image signal processing literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g. scenes with noise, saturated pixels, bad illumination). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our method, FLOL+, is one of the fastest models for this task, achieving state-of-the-art results on popular real scenes datasets such as LOL and LSRW. Moreover, we are able to process 1080p images under 12ms. Code and models at https://github.com/cidautai/FLOL
中文:提出的FLOL+模型是一种轻量级神经网络,结合频域和空间域进行图像处理,在真实场景数据集上取得领先性能,同时成为最快的方法之一,可在12毫秒内处理1080p图像。
English: The proposed FLOL+ model is a lightweight neural network that processes images in both frequency and spatial domains, achieving state-of-the-art results on real-world datasets while being one of the fastest methods capable of processing 1080p images in under 12ms.

Authors:Hongbo Zhao, Fei Zhu, Bolin Ni, Feng Zhu, Gaofeng Meng, Zhaoxiang Zhang
Title: Practical Continual Forgetting for Pre-trained Vision Models
Abstract:
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Codes have been released on https://github.com/bjzhb666/GS-LoRA.
中文: 针对预训练视觉模型需持续删除特定信息的需求,本文提出GS-LoRA及其增强版GS-LoRA++,通过任务独立的LoRA模块结合组稀疏正则化和原型监督机制,在有效擦除目标知识的同时最大程度保持模型其余性能。
English: To address the need for continuous removal of specific information from pre-trained vision models while preserving overall performance, this paper introduces GS-LoRA and its enhanced version GS-LoRA++, which utilize task-specific LoRA modules with group sparse regularization and prototype supervision to effectively erase targeted knowledge with minimal impact on retained capabilities.
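The group sparse regularization can be pictured as a group-lasso penalty over each LoRA pair, as in the hedged sketch below; the function name and weighting are illustrative and not the released GS-LoRA code.

```python
# Minimal sketch of a group-lasso penalty over LoRA groups: the sum of per-group
# L2 norms encourages entire (A, B) pairs to be driven to zero.
import torch

def group_sparse_penalty(lora_pairs, alpha=1e-3):
    """lora_pairs: list of (A, B) parameter tensors, one pair per LoRA group."""
    penalty = 0.0
    for A, B in lora_pairs:
        penalty = penalty + torch.sqrt((A ** 2).sum() + (B ** 2).sum() + 1e-12)
    return alpha * penalty

# Usage inside a training step (task_loss computed elsewhere):
#   loss = task_loss + group_sparse_penalty(lora_pairs)
```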

Authors:Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, Dongsheng Li
Title: Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Abstract:
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples. Our implementation is available at https://github.com/zhyang2226/OPA-DPO.
中文: OPA-DPO框架通过采用基于专家反馈的在线策略对齐方法,有效减少大型视觉语言模型的幻觉问题,仅用少量数据即在多个基准测试中显著超越现有最优算法的性能表现。
English: The OPA-DPO framework addresses hallucination in LVLMs by using on-policy alignment with expert feedback, achieving significant reductions in hallucination rates with minimal data compared to existing methods.
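For readers unfamiliar with the underlying objective, the following is a generic sketch of the standard DPO loss that OPA-DPO builds on (summed response log-probabilities are assumed to be computed elsewhere); it is not the paper's exact implementation.

```python
# Generic DPO loss sketch: prefer the less-hallucinated (chosen) response over
# the rejected one, relative to a frozen reference policy.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(float(loss))  # roughly 0.598 for this toy margin
```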

Authors:Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Title: Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis
Abstract:
Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at https://github.com/CAMMA-public/Surg-FTDA
Chinese: Surg-FTDA提出了一种文本驱动的适应方法,通过弥合模态差距,仅需少量标注数据即可执行多种手术流程分析任务,在生成式和判别式任务评估中均优于基线模型。
English: Surg-FTDA introduces a text-driven adaptation method that bridges the modality gap and performs multiple surgical workflow analysis tasks with minimal annotated data, outperforming baselines in generative and discriminative evaluations.

Authors:Junjie He, Yuxiang Tuo, Binghui Chen, Chongyang Zhong, Yifeng Geng, Liefeng Bo
Title: AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
Abstract:
Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images with specific subjects still presents challenges, especially in cases involving multiple subjects. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory not only achieves high-fidelity personalization for single subjects, but also for multiple subjects, without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an "encode-then-route" manner. In the encoding step, AnyStory utilizes a universal and powerful image encoder, i.e., ReferenceNet, in conjunction with CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory utilizes a decoupled instance-aware subject router to accurately perceive and predict the potential location of the corresponding subject in the latent space, and guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning text descriptions, and personalizing for multiple subjects. The project page is at https://aigcdesigngroup.github.io/AnyStory/ .

Authors:Zixun Fang, Zhiheng Liu, Kai Zhu, Yu Liu, Ka Leong Cheng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Title: VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization
Abstract:
Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, and user studies, demonstrate that VanGogh achieves superior temporal consistency and color fidelity. Project page: https://becauseimbatman0.github.io/VanGogh.
中文: VanGogh是一种基于多模态扩散的视频着色框架,通过融合多种模态特征并采用深度引导生成和光流损失等技术,有效防止颜色溢出和闪烁,确保时间一致性,使用户能够全面控制生成过程,获得高质量着色视频。
English: VanGogh is a multimodal diffusion-based framework that enhances video colorization by integrating multiple modalities and employing techniques like depth guidance and optical flow loss to prevent color bleeding and ensure temporal consistency, offering users comprehensive control for high-quality results.

Authors:Jan Skvrna, Lukas Neumann
Title: MonoSOWA: Scalable monocular 3D Object detector Without human Annotations
Abstract:
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the sources of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where despite using no human labels, it outperforms prior work by a significant margin. It also shows its versatility as a pre-training tool for fully-supervised training and shows that combining pseudo-labels from multiple datasets can achieve comparable accuracy to using human labels from a single dataset. The source code and model are available at https://github.com/jskvrna/MonoSOWA.
中文: 本文提出了一种仅使用单目RGB相机图像且无需人工标注来训练3D物体检测器的新方法,在性能和效率上均显著优于现有技术。
English: This paper introduces a novel method for training a 3D object detector using only single RGB camera images without human annotations, achieving superior performance and efficiency over previous approaches.

Authors:Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, Rui Huang
Title: DEFOM-Stereo: Depth Foundation Model Based Stereo Matching
Abstract:
Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges like occlusion and non-texture hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth foundation model-based stereo-matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. DEFOM-Stereo is verified to have much stronger zero-shot generalization compared with SOTA methods. Moreover, DEFOM-Stereo achieves top performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks, ranking $1^{st}$ on many metrics. In the joint evaluation under the robust vision challenge, our model simultaneously outperforms previous models on the individual benchmarks, further demonstrating its outstanding capabilities.

Authors:Ji Shi, Xianghua Ying, Ruohao Guo, Bowei Xing, Wenzhen Yue
Title: Normal-NeRF: Ambiguity-Robust Normal Estimation for Highly Reflective Scenes
Abstract:
Neural Radiance Fields (NeRF) often struggle with reconstructing and rendering highly reflective scenes. Recent advancements have developed various reflection-aware appearance models to enhance NeRF's capability to render specular reflections. However, the robust reconstruction of highly reflective scenes is still hindered by the inherent shape ambiguity on specular surfaces. Existing methods typically rely on additional geometry priors to regularize the shape prediction, but this can lead to oversmoothed geometry in complex scenes. Observing the critical role of surface normals in parameterizing reflections, we introduce a transmittance-gradient-based normal estimation technique that remains robust even under ambiguous shape conditions. Furthermore, we propose a dual activated densities module that effectively bridges the gap between smooth surface normals and sharp object boundaries. Combined with a reflection-aware appearance model, our proposed method achieves robust reconstruction and high-fidelity rendering of scenes featuring both highly specular reflections and intricate geometric structures. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on various datasets.
Chinese: 该方法通过引入基于透射率梯度的法线估计技术和双重激活密度模块,显著提升了神经辐射场(NeRF)对高反射场景的重建鲁棒性和渲染保真度,有效处理复杂几何结构下的镜面反射问题。
English: The proposed method enhances Neural Radiance Fields (NeRF) by introducing a transmittance-gradient-based normal estimation technique and a dual activated densities module, enabling robust reconstruction and high-fidelity rendering of highly reflective scenes with complex geometries.

Authors:Zeyu Wang, Cihang Xie, Brian Bartoldson, Bhavya Kailkhura
Title: Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness
Abstract:
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel ``double visual defense'' to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $\Delta$CLIP and $\Delta^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $\Delta$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $\Delta^2$LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.
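As background for the adversarial pre-training stage, here is a generic projected gradient descent (PGD) attack sketch of the kind typically used to craft visual perturbations during adversarial training; the budget, step size, and loss are placeholders rather than the paper's settings.

```python
# Generic PGD attack sketch: iteratively ascend the loss, projecting back into
# an epsilon ball around the clean images and into valid pixel range.
import torch

def pgd_attack(model, images, targets, loss_fn, eps=4/255, alpha=1/255, steps=10):
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv), targets)
        grad = torch.autograd.grad(loss, adv)[0]
        # Take a signed ascent step, then project onto the epsilon ball and [0, 1].
        adv = adv.detach() + alpha * grad.sign()
        adv = images + (adv - images).clamp(-eps, eps)
        adv = adv.clamp(0.0, 1.0)
    return adv.detach()
```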

Authors:Tim J. M. Jaspers, Ronald L. P. D. de Jong, Yiping Li, Carolus H. J. Kusters, Franciscus H. A. Bakker, Romy C. van Jaarsveld, Gino M. Kuiper, Richard van Hillegersberg, Jelle P. Ruurda, Willem M. Brinkman, Josien P. W. Pluim, Peter H. N. de With, Marcel Breeuwer, Yasmina Al Khalil, Fons van der Sommen
Title: Scaling up self-supervised learning for improved surgical foundation models
Abstract:
Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: https://github.com/TimJaspers0801/SurgeNet.
中文:SurgeNetXL作为突破性的手术基础模型,基于470万视频帧训练,在多项手术任务和流程中实现顶尖性能,显著超越现有模型的同时为手术计算机视觉发展提供了关键见解。
English: SurgeNetXL is a groundbreaking surgical foundation model trained on 4.7 million video frames that achieves state-of-the-art performance across multiple surgical tasks and procedures, significantly outperforming existing models while providing key insights for advancing surgical computer vision.

Authors:Hwan Heo, Jangyeong Kim, Seongyeong Lee, Jeong A Wi, Junyoung Choi, Sangjun Ahn
Title: CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
Abstract:
The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce \textbf{CaPa}, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.

Authors:Veronika Spieker, Hannah Eichhorn, Wenqi Huang, Jonathan K. Stelter, Tabita Catalan, Rickmer F. Braren, Daniel Rueckert, Francisco Sahli Costabal, Kerstin Hammernik, Dimitrios C. Karampinos, Claudia Prieto, Julia A. Schnabel
Title: PISCO: Self-Supervised k-Space Regularization for Improved Neural Implicit k-Space Representations of Dynamic MRI
Abstract:
Neural implicit k-space representations (NIK) have shown promising results for dynamic magnetic resonance imaging (MRI) at high temporal resolutions. Yet, reducing acquisition time, and thereby available training data, results in severe performance drops due to overfitting. To address this, we introduce a novel self-supervised k-space loss function $\mathcal{L}_\mathrm{PISCO}$, applicable for regularization of NIK-based reconstructions. The proposed loss function is based on the concept of parallel imaging-inspired self-consistency (PISCO), enforcing a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK representations. Particularly for high acceleration factors (R$\geq$54), NIK with PISCO achieves superior spatio-temporal reconstruction quality compared to state-of-the-art methods. Furthermore, an extensive analysis of the loss assumptions and stability shows PISCO's potential as versatile self-supervised k-space loss function for further applications and architectures. Code is available at: https://github.com/compai-lab/2025-pisco-spieker
中文: 本研究提出了一种自监督k空间损失函数PISCO,通过强制全局邻域一致性且无需额外数据,显著提升了动态MRI中神经隐式k空间表示的性能,在高加速因子下实现了卓越的时空重建质量。
English: The study introduces a self-supervised k-space loss function, PISCO, which enhances neural implicit k-space representations for dynamic MRI by enforcing global neighborhood consistency without extra data, achieving superior reconstruction quality at high accelerations.

Authors:Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao
Title: Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis
Abstract:
We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. Pre-trained ViTs, such as DINO, have demonstrated remarkable capabilities in extracting localized, discriminative features. However, saliency maps like Grad-CAM often fail to identify these traits, producing blurred, coarse heatmaps that highlight entire objects instead. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this limitation. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To correctly classify an image, the true-class prompt must attend to unique image patches not present in other classes' images (i.e., traits). As a result, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a ``free lunch,'' requiring only a modification to the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM easy to train and apply, in stark contrast to other interpretable methods that require designing specific models and training processes. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate the superior interpretation capability of Prompt-CAM. The source code and demo are available at https://github.com/Imageomics/Prompt_CAM.
Chinese: 本文提出Prompt-CAM方法,通过生成类别特定的注意力图来提升预训练视觉Transformer的可解释性,能精准识别并定位视觉相似类别间的区分性特征。
English: This paper introduces Prompt-CAM, a simple method that enhances the interpretability of pre-trained Vision Transformers by generating class-specific attention maps to precisely identify and localize distinguishing traits between visually similar categories.
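A hedged sketch of how class-prompt attention can be turned into trait maps: take the attention weights from the true-class prompt token to the patch tokens and reshape them per head into a spatial grid. The tensor layout and names are assumptions for illustration, not the Prompt-CAM code.

```python
# Illustrative only: per-head attention of one class prompt over patch tokens,
# reshaped into a spatial map that highlights candidate traits.
import torch

def prompt_attention_maps(attn, class_idx, n_patches_side=14):
    """attn: (heads, n_prompts + n_patches, n_prompts + n_patches) attention
    from the last ViT block; class prompts occupy the first n_prompts rows."""
    n_prompts = attn.shape[1] - n_patches_side ** 2
    maps = attn[:, class_idx, n_prompts:]                     # (heads, P)
    return maps.reshape(-1, n_patches_side, n_patches_side)

maps = prompt_attention_maps(torch.rand(12, 200 + 196, 396), class_idx=3)
print(maps.shape)  # torch.Size([12, 14, 14])
```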

Authors:Alper Kayabasi, Anil Kumar Vadathya, Guha Balakrishnan, Vishwanath Saragadam
Title: Bias for Action: Video Implicit Neural Representations with Bias Modulation
Abstract:
We propose a new continuous video modeling framework based on implicit neural representations (INRs) called ActINR. At the core of our approach is the observation that INRs can be considered as a learnable dictionary, with the shapes of the basis functions governed by the weights of the INR, and their locations governed by the biases. Given compact non-linear activation functions, we hypothesize that an INR's biases are suitable to capture motion across images, and facilitate compact representations for video sequences. Using these observations, we design ActINR to share INR weights across frames of a video sequence, while using unique biases for each frame. We further model the biases as the output of a separate INR conditioned on time index to promote smoothness. By training the video INR and this bias INR together, we demonstrate unique capabilities, including $10\times$ video slow motion, 4x spatial super resolution along with 2x slow motion, denoising, and video inpainting. ActINR performs remarkably well across numerous video processing tasks (often achieving more than 6dB improvement), setting a new standard for continuous modeling of videos.
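The weight-sharing/bias-specialization idea can be sketched as a small coordinate MLP whose weight matrices are shared across all frames while each frame indexes its own bias vectors; this toy SIREN-style network is an illustration under our own assumptions, not the ActINR code.

```python
# Toy sketch: one coordinate MLP with shared weights and per-frame biases.
import torch
import torch.nn as nn

class SharedWeightINR(nn.Module):
    def __init__(self, n_frames, hidden=64, layers=3):
        super().__init__()
        dims = [2] + [hidden] * layers + [3]          # (x, y) -> RGB
        self.weights = nn.ParameterList([
            nn.Parameter(torch.randn(o, i) / i ** 0.5)
            for i, o in zip(dims[:-1], dims[1:])])
        # One bias vector per layer *per frame*; this is where motion lives.
        self.biases = nn.ParameterList([
            nn.Parameter(torch.zeros(n_frames, o)) for o in dims[1:]])

    def forward(self, coords, frame_idx):
        h = coords
        for k, (W, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ W.t() + b[frame_idx]
            if k < len(self.weights) - 1:
                h = torch.sin(30.0 * h)               # SIREN-style activation
        return h

inr = SharedWeightINR(n_frames=16)
print(inr(torch.rand(1024, 2), frame_idx=5).shape)  # torch.Size([1024, 3])
```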

Authors:Shuo Chen, Yijin Li, Guofeng Zhang
Title: OpticFusion: Multi-Modal Neural Implicit 3D Reconstruction of Microstructures by Fusing White Light Interferometry and Optical Microscopy
Abstract:
White Light Interferometry (WLI) is a precise optical tool for measuring the 3D topography of microstructures. However, conventional WLI cannot capture the natural color of a sample's surface, which is essential for many microscale research applications that require both 3D geometry and color information. Previous methods have attempted to overcome this limitation by modifying WLI hardware and analysis software, but these solutions are often costly. In this work, we address this challenge from a computer vision multi-modal reconstruction perspective for the first time. We introduce OpticFusion, a novel approach that uses an additional digital optical microscope (OM) to achieve 3D reconstruction with natural color textures using multi-view WLI and OM images. Our method employs a two-step data association process to obtain the poses of WLI and OM data. By leveraging the neural implicit representation, we fuse multi-modal data and apply color decomposition technology to extract the sample's natural color. Tested on our multi-modal dataset of various microscale samples, OpticFusion achieves detailed 3D reconstructions with color textures. Our method provides an effective tool for practical applications across numerous microscale research fields. The source code and our real-world dataset are available at https://github.com/zju3dv/OpticFusion.
中文: 本研究提出了OpticFusion这一创新计算机视觉方法,通过结合白光干涉仪和光学显微镜,实现了微结构的三维重建并保留其自然色彩,为微观尺度研究提供了一种经济有效的解决方案。
English: This study introduces OpticFusion, a novel computer vision method that integrates White Light Interferometry (WLI) with an optical microscope (OM) to enable precise 3D reconstructions of microstructures while capturing their natural color, offering a cost-effective solution for microscale research applications.

Authors:Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, Bolei Zhou
Title: Embodied Scene Understanding for Vision Language Models via MetaVQA
Abstract:
Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observation. Code and data will be publicly available at https://metadriverse.github.io/metavqa .

Authors:Edward R Criscuolo, Yao Hao, Zhendong Zhang, Trevor McKeown, Deshan Yang
Title: A Vessel Bifurcation Landmark Pair Dataset for Abdominal CT Deformable Image Registration (DIR) Validation
Abstract:
Deformable image registration (DIR) is an enabling technology in many diagnostic and therapeutic tasks. Despite this, DIR algorithms have limited clinical use, largely due to a lack of benchmark datasets for quality assurance during development. To support future algorithm development, here we introduce our first-of-its-kind abdominal CT DIR benchmark dataset, comprising large numbers of highly accurate landmark pairs on matching blood vessel bifurcations. Abdominal CT image pairs of 30 patients were acquired from several public repositories as well as the authors' institution with IRB approval. The two CTs of each pair were originally acquired for the same patient on different days. An image processing workflow was developed and applied to each image pair: 1) Abdominal organs were segmented with a deep learning model, and image intensity within organ masks was overwritten. 2) Matching image patches were manually identified between the two CTs of each image pair. 3) Vessel bifurcation landmarks were labeled on one image of each image patch pair. 4) Image patches were deformably registered, and landmarks were projected onto the second image. 5) Landmark pair locations were refined manually or with an automated process. This workflow resulted in 1895 total landmark pairs, or 63 per case on average. Estimates of the landmark pair accuracy using digital phantoms were 0.7+/-1.2 mm. The data is published on Zenodo at https://doi.org/10.5281/zenodo.14362785. Instructions for use can be found at https://github.com/deshanyang/Abdominal-DIR-QA. This dataset is a first-of-its-kind for abdominal DIR validation. The number, accuracy, and distribution of landmark pairs will allow for robust validation of DIR algorithms with precision beyond what is currently available.
中文: 本研究首次推出腹部CT可变形图像配准基准数据集,包含1,895对高精度标志点(精度达0.7±1.2毫米),能为DIR算法提供超越现有水平的精准验证。
English: This study introduces the first abdominal CT deformable image registration benchmark dataset, featuring 1,895 highly accurate landmark pairs with an estimated accuracy of 0.7±1.2mm, enabling robust validation of DIR algorithms beyond current capabilities.

Authors:Huiyu Li, Nicholas Ayache, Hervé Delingette
Title: Generative Medical Image Anonymization Based on Latent Code Projection and Optimization
Abstract:
Medical image anonymization aims to protect patient privacy by removing identifying information, while preserving the data utility to solve downstream tasks. In this paper, we address the medical image anonymization problem with a two-stage solution: latent code projection and optimization. In the projection stage, we design a streamlined encoder to project input images into a latent space and propose a co-training scheme to enhance the projection process. In the optimization stage, we refine the latent code using two deep loss functions designed to address the trade-off between identity protection and data utility dedicated to medical images. Through a comprehensive set of qualitative and quantitative experiments, we showcase the effectiveness of our approach on the MIMIC-CXR chest X-ray dataset by generating anonymized synthetic images that can serve as training set for detecting lung pathologies. Source codes are available at https://github.com/Huiyu-Li/GMIA.
中文摘要:本文提出一种两阶段医学图像匿名化方法,通过潜在空间投影与优化在保护患者隐私的同时保留数据实用性,并在胸部X光数据上验证了其检测肺部病变的有效性。
English Summary: This paper proposes a two-stage medical image anonymization method using latent code projection and optimization to balance privacy protection with data utility, validated on chest X-ray data for lung pathology detection.

Authors:Zheng-An Zhu, Hsin-Che Chien, Chen-Kuo Chiang
Title: TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification
Abstract:
This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address the patch noise and feature inconsistency in unsupervised person re-identification works. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory bank based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at \href{https://github.com/andy412510/TCMM}{https://github.com/andy412510/TCMM}.
中文: 本文提出ViT令牌约束与多尺度记忆库方法,通过减少补丁噪声和增强特征一致性,在无监督行人重识别中有效利用离群样本并提升模型性能,实现了最优的基准测试结果。
English: This paper introduces the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to mitigate patch noise and feature inconsistency in unsupervised person re-identification, achieving state-of-the-art results by enhancing outlier sample utilization and feature stability.

Authors:Jianzi Xiang, Cailu Wan, Zhu Cao
Title: Pseudolabel guided pixels contrast for domain adaptive semantic segmentation
Abstract:
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels. Some recent works use contrastive learning, which is a powerful method for self-supervised learning, to help with this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at https://github.com/embar111/pgpc
中文: 语义分割依赖昂贵的像素级标注,无监督域自适应结合对比学习虽能辅助,但现有方法忽略了类内特征多样性,导致预测错误;我们提出的伪标签引导像素对比(PGPC)框架通过有效利用目标图像信息解决了这一问题,在标准基准测试中显著提升性能且未增加模型复杂度。
English: Semantic segmentation requires costly pixel-level annotations, and while unsupervised domain adaptation with contrastive learning helps, existing methods overlook intra-class feature diversity, leading to prediction errors; our proposed Pseudo-label Guided Pixel Contrast (PGPC) framework addresses this by leveraging target image information effectively, achieving significant improvements on standard benchmarks without added model complexity.
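As a simplified illustration of pseudo-label guided pixel contrast, the sketch below pulls each pixel embedding toward the prototype of its pseudo-labelled class and pushes it away from the other prototypes via a prototype-level InfoNCE; the prototype construction and temperature are assumptions, not the PGPC formulation.

```python
# Simplified sketch: prototype-level InfoNCE over pixel embeddings, with
# positives defined by pseudo-labels.
import torch
import torch.nn.functional as F

def pixel_prototype_contrast(feats, pseudo_labels, prototypes, tau=0.1):
    """feats: (N, D) pixel embeddings; pseudo_labels: (N,) class ids in [0, C);
    prototypes: (C, D) running class prototypes."""
    feats = F.normalize(feats, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = feats @ prototypes.t() / tau       # (N, C)
    return F.cross_entropy(logits, pseudo_labels)

loss = pixel_prototype_contrast(torch.randn(512, 128),
                                torch.randint(0, 19, (512,)),
                                torch.randn(19, 128))
print(float(loss))
```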

Authors:Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos
Title: Do generative video models understand physical principles?
Abstract:
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
中文: AI视频生成在视觉真实感上取得显著进展,但通过Physics-IQ基准测试发现,现有模型对物理原理的理解仍非常有限,表明视觉逼真度并不等同于物理认知能力。
English: AI video generation achieves high visual realism but lacks deep physical understanding, as demonstrated by the Physics-IQ benchmark, which shows current models struggle with fundamental physical principles despite some progress.

Authors:Ruixiang Jiang, Changwen Chen
Title: Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Abstract:
The rapid technical progress of generative art (GenArt) has democratized the creation of visually appealing imagery. However, achieving genuine artistic impact - the kind that resonates with viewers on a deeper, more meaningful level - remains formidable as it requires a sophisticated aesthetic sensibility. This sensibility involves a multifaceted cognitive process extending beyond mere visual appeal, which is often overlooked by current computational methods. This paper pioneers an approach to capture this complex process by investigating how the reasoning capabilities of Multimodal LLMs (MLLMs) can be effectively elicited to perform aesthetic judgment. Our analysis reveals a critical challenge: MLLMs exhibit a tendency towards hallucinations during aesthetic reasoning, characterized by subjective opinions and unsubstantiated artistic interpretations. We further demonstrate that these hallucinations can be suppressed by employing an evidence-based and objective reasoning process, as substantiated by our proposed baseline, ArtCoT. MLLMs prompted by this principle produce multifaceted, in-depth aesthetic reasoning that aligns significantly better with human judgment. These findings have direct applications in areas such as AI art tutoring and as reward models for image generation. Ultimately, we hope this work paves the way for AI systems that can truly understand, appreciate, and contribute to art that aligns with human aesthetic values. Project homepage: https://github.com/songrise/MLLM4Art.
中文: 本文提出ArtCoT方法,通过基于证据的推理过程抑制多模态大语言模型在审美判断中的幻觉,使其评估更符合人类审美标准,可应用于AI艺术辅导和图像生成领域。
English: This paper introduces ArtCoT, a method that leverages Multimodal LLMs' reasoning to perform aesthetic judgments by suppressing hallucinations through evidence-based processes, resulting in evaluations that better align with human values and enabling applications in AI art tutoring and image generation.

Authors:Aditya Bhat, Rupak Bose, Chinedu Innocent Nwoye, Nicolas Padoy
Title: SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation
Abstract:
Acquiring and annotating surgical data is often resource-intensive, ethically constrained, and reliant on significant expert involvement. While generative AI models like text-to-image can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between continuous image and discrete mask distributions. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets assessed on image and semantic inception distance metrics. An ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest generated image-mask pairs are usable if regulations limit human data release for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.

Authors:Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu
Title: RepVideo: Rethinking Cross-Layer Representation for Video Generation
Abstract:
Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
中文摘要:本文提出RepVideo框架,通过聚合相邻层特征来增强文本到视频扩散模型的语义表示稳定性,显著提升了生成视频的空间准确性(如复杂物体关系)和时间连贯性。
English Summary: This paper introduces RepVideo, a representation framework that enhances text-to-video diffusion models by aggregating features from adjacent layers to achieve more stable semantic representations and improve both spatial accuracy and temporal coherence in generated videos.
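A toy sketch of the cross-layer feature accumulation idea: average the hidden states of a window of neighbouring layers before they are fed to attention. The window size and plain averaging are illustrative assumptions, not the RepVideo design details.

```python
# Toy sketch: average each layer's hidden states with its neighbours to form a
# more stable representation per layer.
import torch

def aggregate_neighbor_layers(layer_feats, window=3):
    """layer_feats: list of (B, T, D) hidden states, one per layer. Returns a
    list of the same length, each averaged over a centered window of layers."""
    n = len(layer_feats)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        out.append(torch.stack(layer_feats[lo:hi]).mean(0))
    return out

feats = [torch.randn(2, 16, 64) for _ in range(8)]
agg = aggregate_neighbor_layers(feats)
print(len(agg), agg[0].shape)  # 8 torch.Size([2, 16, 64])
```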

Authors:Tengpeng Li, Hanli Wang, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
Title: Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving
Abstract:
Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods still face challenges in visual understanding, decision reasoning, and scene generalization. To solve these issues, a generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's eye view. On the other hand, a cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performance compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems. Code is available at https://github.com/ltp1995/GPVL
Chinese: 提出的GPVL模型通过结合3D视觉语言预训练来弥合感知与语言理解之间的差距,并采用跨模态语言模型生成驾驶决策和轨迹,在nuScenes数据集上实现了卓越的性能和泛化能力。
English: The proposed GPVL model enhances autonomous driving by integrating 3D-vision language pre-training to bridge perception and linguistic understanding, and a cross-modal language model for generating driving decisions and trajectories, achieving superior performance and generalization on the nuScenes dataset.

Authors:Olga Zatsarynna, Emad Bahrami, Yazan Abu Farha, Gianpiero Francesca, Juergen Gall
Title: MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation
Abstract:
Long-term dense action anticipation is very challenging since it requires predicting actions and their durations several minutes into the future based on provided video observations. To model the uncertainty of future outcomes, stochastic models predict several potential future action sequences for the same observation. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101 - while also significantly improving computational and memory efficiency. Our code is available at https://github.com/olga-zats/DIFF_MANTA .
中文: MANTA网络通过实现线性复杂度的有效长程时序建模,解决了长期密集动作预测的挑战,在多个数据集上取得了最先进的结果,同时显著提升了计算效率。
English: The MANTA network addresses the challenge of long-term dense action anticipation by enabling effective long-range temporal modeling with linear complexity, achieving state-of-the-art results on multiple datasets while improving computational efficiency.

Authors:Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
Title: MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Abstract:
Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the latter assesses the ability of detecting specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set effectively enhances the performance of multimodal document retrieval and (iii) text retrievers leveraging VLM-text significantly outperform retrievers relying on OCR-text. Our dataset is available at https://mmdocrag.github.io/MMDocIR/.
中文: 本研究提出了MMDocIR这一综合性基准,用于评估多模态文档检索,包含页面级和布局级任务,结合专家标注和自举数据集,有效提升了系统性能评估与训练效果。
English: This work introduces MMDocIR, a comprehensive benchmark for multimodal document retrieval that includes page-level and layout-level tasks, featuring expert-annotated and bootstrapped datasets to enhance system performance evaluation and training.
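For context, page-level retrieval of this kind is typically scored with metrics such as recall@k; a minimal sketch assuming a single relevant page per question is given below (the benchmark's exact protocol may differ).

```python
# Minimal recall@k sketch for page-level retrieval, assuming one ground-truth
# page per query.
import torch

def recall_at_k(scores, gt_page, k=5):
    """scores: (Q, P) similarity of each query to every page;
    gt_page: (Q,) index of the relevant page per query."""
    topk = scores.topk(k, dim=1).indices                 # (Q, k)
    hits = (topk == gt_page.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

print(recall_at_k(torch.randn(100, 50), torch.randint(0, 50, (100,))))
```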

Authors:Shao-Hao Lu, Ren Wang, Ching-Chun Huang, Wei-Chen Chiu
Title: Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution
Abstract:
Recently, diffusion-based blind super-resolution (SR) methods have shown great ability to generate high-resolution images with abundant high-frequency detail, but the detail is often achieved at the expense of fidelity. Meanwhile, another line of research focusing on rectifying the reverse process of diffusion models (i.e., diffusion guidance), has demonstrated the power to generate high-fidelity results for non-blind SR. However, these methods rely on known degradation kernels, making them difficult to apply to blind SR. To address these issues, we present DADiff in this paper. DADiff incorporates degradation-aware models into the diffusion guidance framework, eliminating the need to know degradation kernels. Additionally, we propose two novel techniques: input perturbation and guidance scalar, to further improve our performance. Extensive experimental results show that our proposed method has superior performance over state-of-the-art methods on blind SR benchmarks.
English Summary: Diffusion-based blind super-resolution methods produce detailed images but often lack fidelity, while existing guidance techniques require known degradation kernels, limiting their use; the proposed DADiff method overcomes these by integrating degradation-aware models and novel techniques to achieve superior performance without needing prior kernel knowledge.

Authors:Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang
Title: IDEA: Image Description Enhanced CLIP-Adapter
Abstract:
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it can match or even exceed state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at https://github.com/FourierAI/IDEA.
中文: IDEA方法通过结合视觉特征和文本描述,无需训练即可提升CLIP在小样本图像分类中的性能,其可训练扩展T-IDEA利用生成的图像-文本对在11个数据集上取得了领先成果。
English: The IDEA method enhances CLIP for few-shot image classification by integrating visual features and textual descriptions, achieving competitive or superior performance without training, while its trainable extension T-IDEA sets new benchmarks on 11 datasets using generated image-text pairs.
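A training-free CLIP adapter of this family can be sketched as a key-value cache over few-shot features blended with zero-shot text logits, as below; this follows the general cache-adapter recipe and its hyperparameters are assumptions, not IDEA's exact formulation.

```python
# Hypothetical cache-adapter sketch: few-shot image features act as keys, their
# one-hot labels as values, and the cache logits are blended with zero-shot
# text logits.
import torch
import torch.nn.functional as F

def cache_adapter_logits(q, cache_keys, cache_values, text_weights,
                         alpha=1.0, beta=5.0):
    """q: (B, D) query image features; cache_keys: (N, D) few-shot features;
    cache_values: (N, C) one-hot labels; text_weights: (C, D) class text feats."""
    q = F.normalize(q, dim=1)
    cache_keys = F.normalize(cache_keys, dim=1)
    text_weights = F.normalize(text_weights, dim=1)
    zero_shot = 100.0 * q @ text_weights.t()                    # (B, C)
    affinity = torch.exp(-beta * (1.0 - q @ cache_keys.t()))    # (B, N)
    return zero_shot + alpha * affinity @ cache_values          # (B, C)

logits = cache_adapter_logits(torch.randn(4, 512), torch.randn(80, 512),
                              F.one_hot(torch.randint(0, 10, (80,)), 10).float(),
                              torch.randn(10, 512))
print(logits.shape)  # torch.Size([4, 10])
```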

Authors:Shiyu Wu, Jing Liu, Jing Li, Yequan Wang
Title: Few-Shot Learner Generalizes Across AI-Generated Image Detection
Abstract:
Current fake image detectors trained on large synthetic image datasets perform satisfactorily on limited studied generative models. However, these detectors suffer a notable performance decline over unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space for effectively distinguishing unseen fake images using very few samples. Experiments show that FSD achieves state-of-the-art performance by $+11.6\%$ average accuracy on the GenImage dataset with only $10$ additional samples. More importantly, our method is better capable of capturing the intra-category commonality in unseen images without further training. Our code is available at https://github.com/teheperinko541/Few-Shot-AIGI-Detector.
Chinese: 提出的少样本检测器(FSD)通过学习专门的度量空间,仅需少量样本即可有效识别未知生成模型的伪造图像,在GenImage数据集上以11.6%的准确率提升达到领先性能。
English: The Few-Shot Detector (FSD) effectively identifies fake images from unseen generative models using minimal samples by learning a specialized metric space, achieving state-of-the-art performance with an 11.6% accuracy boost on the GenImage dataset.
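The few-shot metric-space idea can be illustrated with a prototype classifier: embed the handful of labelled real/fake samples, average per class, and assign each query to its nearest prototype. The embedding network and distance choice here are assumptions for this sketch, not the FSD implementation.

```python
# Illustrative prototype-based classification of unseen fakes from a few samples.
import torch
import torch.nn.functional as F

def prototype_classify(support_feats, support_labels, query_feats):
    """support_feats: (S, D), support_labels: (S,) in {0=real, 1=fake},
    query_feats: (Q, D); all assumed to come from a trained embedding net."""
    protos = torch.stack([support_feats[support_labels == c].mean(0)
                          for c in (0, 1)])                       # (2, D)
    dists = torch.cdist(F.normalize(query_feats, dim=1),
                        F.normalize(protos, dim=1))               # (Q, 2)
    return dists.argmin(dim=1)                                    # 0=real, 1=fake

support_labels = torch.tensor([0] * 10 + [1] * 10)
preds = prototype_classify(torch.randn(20, 256), support_labels, torch.randn(8, 256))
print(preds)
```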

Authors:Jaemyung Yu, Jaehyun Choi, Dong-Jae Lee, HyeongGwon Hong, Junmo Kim
Title: Self-supervised Transformation Learning for Equivariant Representations
Abstract:
Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at https://github.com/jaemyung-u/stl.
中文: 提出的自监督变换学习(STL)方法用图像不变的变换表示替代变换标签,在不增加批次复杂度的前提下实现有效的等变学习,并在多个基准测试中超越现有方法。
English: The proposed Self-supervised Transformation Learning (STL) method replaces transformation labels with image-invariant transformation representations, enabling effective equivariant learning without increasing batch complexity and outperforming existing methods across multiple benchmarks.

Authors:Anant Khandelwal
Title: FlexiClip: Locality-Preserving Free-Form Character Animation
Abstract:
Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: https://creative-gen.github.io/flexiclip.github.io/

Authors:Dongzhihan Wang, Yang Yang, Liang Xu
Title: BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module
Abstract:
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 259% in low-light conditions, outperforming existing methods. For widespread use and further development, the research work is fully open-source at https://github.com/Anastasiawd/BrightVO.
中文: BrightVO是一种基于Transformer架构的新型视觉里程计模型,它融合了IMU数据和姿态图优化技术,显著提升了姿态估计精度——在低光照环境下性能提升达259%,并配套开源了KiC4R数据集用于训练与评估。
English: BrightVO is a novel Transformer-based visual odometry model that integrates IMU data and pose graph optimization to significantly enhance pose estimation accuracy, particularly achieving a 259% improvement in low-light conditions, and is supported by the open-source KiC4R dataset for training and evaluation.

Authors:Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang
Title: ZeroStereo: Zero-shot Stereo Matching from Single Images
Abstract:
State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets with only a dataset volume comparable to Scene Flow. Code: https://github.com/Windsrain/ZeroStereo.
Chinese: 提出的ZeroStereo流程通过伪视差和扩散修复技术从单张图像生成高质量立体图像,仅需少量训练数据即可实现最先进的零样本泛化能力。
English: The proposed ZeroStereo pipeline synthesizes high-quality stereo images from single inputs using pseudo disparities and diffusion inpainting, achieving state-of-the-art zero-shot generalization with minimal training data.
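
The stereo synthesis step can be sketched as a simple forward warp: each left-image pixel is shifted left by its pseudo disparity, and the pixels that remain unfilled are the dis-occluded regions that ZeroStereo hands to a diffusion inpainting model. The NumPy toy below assumes integer-rounded disparities and nearest-pixel splatting, which the actual pipeline handles more carefully.

import numpy as np

def forward_warp_right(left, disparity):
    # left: (H, W) or (H, W, 3); disparity: (H, W) non-negative pseudo disparity.
    h, w = disparity.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    # Process far-to-near so nearer pixels (larger disparity) win conflicts.
    order = np.argsort(disparity, axis=None)
    ys, xs = np.unravel_index(order, disparity.shape)
    xt = xs - np.round(disparity[ys, xs]).astype(int)
    valid = (xt >= 0) & (xt < w)
    right[ys[valid], xt[valid]] = left[ys[valid], xs[valid]]
    filled[ys[valid], xt[valid]] = True
    return right, ~filled   # warped right view and mask of holes to inpaint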

Authors:Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, Xiu Li
Title: Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation
Abstract:
In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at https://github.com/jiaqihuang01/DETRIS.
中文: DETRIS提出了一种参数高效调优框架,通过增强多模态特征交互来适配未对齐编码器,仅需少量参数更新即可实现卓越性能。
English: DETRIS introduces a parameter-efficient tuning framework that enhances multimodal feature interaction for misaligned encoders, achieving superior performance with minimal parameter updates.

Authors:Donghwi Jung, Keonwoo Kim, Seong-Woo Kim
Title: GOTPR: General Outdoor Text-based Place Recognition Using Scene Graph Retrieval with OpenStreetMap
Abstract:
We propose GOTPR, a robust place recognition method designed for outdoor environments where GPS signals are unavailable. Unlike existing approaches that use point cloud maps, which are large and difficult to store, GOTPR leverages scene graphs generated from text descriptions and maps for place recognition. This method improves scalability by replacing point clouds with compact data structures, allowing robots to efficiently store and utilize extensive map data. In addition, GOTPR eliminates the need for custom map creation by using publicly available OpenStreetMap data, which provides global spatial information. We evaluated its performance using the KITTI360Pose dataset with corresponding OpenStreetMap data, comparing it to existing point cloud-based place recognition methods. The results show that GOTPR achieves comparable accuracy while significantly reducing storage requirements. In city-scale tests, it completed processing within a few seconds, making it highly practical for real-world robotics applications. More information can be found at https://donghwijung.github.io/GOTPR_page/.

Authors:Oscar Ramos-Soto, Jorge Ramos-Frutos, Ezequiel Perez-Zarate, Diego Oliva, Sandra E. Balderas-Mata
Title: MIAFEx: An Attention-based Feature Extraction Method for Medical Image Classification
Abstract:
Feature extraction techniques are crucial in medical image classification; however, classical feature extractors, in addition to traditional machine learning classifiers, often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The MIAFEx output feature quality is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating their superiority in accuracy and robustness across multiple complex medical imaging datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at https://github.com/Oscar-RamosS/Medical-Image-Attention-based-Feature-Extractor-MIAFEx
Chinese: 提出的MIAFEx方法通过可学习的优化机制改进Transformer架构中的分类标记,在医学图像特征提取中显著优于传统和现代模型,尤其在数据有限时展现出更优的准确性与鲁棒性。
English: The proposed MIAFEx method enhances Transformer-based feature extraction for medical images by refining classification tokens with a learnable mechanism, outperforming both classical and modern models in accuracy and robustness, especially with limited data.
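
The abstract does not spell out the refinement mechanism, so the PyTorch snippet below is only one plausible reading: an element-wise, learnable gate on the Transformer's classification token, initialized to the identity and applied before the classifier head. Both the module and its design are hypothetical, not MIAFEx's actual implementation.

import torch
import torch.nn as nn

class ClsTokenRefiner(nn.Module):
    # Hypothetical refinement: rescale each channel of the [CLS] token by a
    # learned gate that starts as the identity mapping at initialization.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # cls_token: (batch, dim) classification token from a ViT encoder.
        return cls_token * (1.0 + torch.tanh(self.gate))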

Authors:Hyeonwoo Kim, Sangwon Baik, Hanbyul Joo
Title: DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models
Abstract:
Modeling how humans interact with objects is crucial for AI to effectively assist or mimic human behaviors. Existing studies for learning such ability primarily focus on static human-object interaction (HOI) patterns, such as contact and spatial relationships, while dynamic HOI patterns, capturing the movement of humans and objects over time, remain relatively underexplored. In this paper, we present a novel framework for learning Dynamic Affordance across various target object categories. To address the scarcity of 4D HOI datasets, our method learns the 3D dynamic affordance from synthetically generated 4D HOI samples. Specifically, we propose a pipeline that first generates 2D HOI videos from a given 3D target object using a pre-trained video diffusion model, then lifts them into 3D to generate 4D HOI samples. Leveraging these synthesized 4D HOI samples, we train DAViD, our generative 4D human-object interaction model, which is composed of two key components: (1) a human motion diffusion model (MDM) with Low-Rank Adaptation (LoRA) module to fine-tune a pre-trained MDM to learn the HOI motion concepts from limited HOI motion samples, (2) a motion diffusion model for 4D object poses conditioned by produced human interaction motions. Interestingly, DAViD can integrate newly learned HOI motion concepts with pre-trained human motions to create novel HOI motions, even for multiple HOI motion concepts, demonstrating the advantage of our pipeline with LoRA in integrating dynamic HOI concepts. Through extensive experiments, we demonstrate that DAViD outperforms baselines in synthesizing HOI motion.

Authors:Zhiheng Liu, Ka Leong Cheng, Xi Chen, Jie Xiao, Hao Ouyang, Kai Zhu, Yu Liu, Yujun Shen, Qifeng Chen, Ping Luo
Title: MangaNinja: Line Art Colorization with Precise Reference Following
Abstract:
Derived from diffusion models, MangaNinja specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases such as cross-character colorization and multi-reference harmonization, which are beyond the reach of existing algorithms.

Authors:Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, Ning Yu
Title: Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
Abstract:
Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models. Video results are available on our webpage: https://eyeline-labs.github.io/Go-with-the-Flow. Source code and model checkpoints are available on GitHub: https://github.com/Eyeline-Labs/Go-with-the-Flow.
中文: 本研究提出一种噪声扭曲算法,通过利用光流生成的结构化噪声实现对视频扩散模型的实时运动控制,且无需改变模型架构或训练流程。
English: This study introduces a noise warping algorithm that enables real-time motion control in video diffusion models by using structured noise derived from optical flow, without altering model architecture or training processes.
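
A toy version of flow-correlated noise, assuming a backward optical-flow field in pixels and nearest-neighbor lookup: each noise pixel of the new frame is pulled from where the flow points in the previous frame's noise, and out-of-frame lookups are re-drawn i.i.d. so the per-frame marginal stays Gaussian. The paper's real-time warping algorithm preserves spatial Gaussianity far more rigorously than this sketch.

import numpy as np

def warp_noise(prev_noise, flow, rng=None):
    # prev_noise: (H, W) Gaussian noise of the previous frame.
    # flow: (H, W, 2) backward flow in pixels (dx, dy).
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = prev_noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.round(xs + flow[..., 0]).astype(int)
    src_y = np.round(ys + flow[..., 1]).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = rng.standard_normal((h, w))      # fresh i.i.d. noise for holes
    warped[inside] = prev_noise[src_y[inside], src_x[inside]]
    return warped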

Authors:Yufei Ye, Yao Feng, Omid Taheri, Haiwen Feng, Shubham Tulsiani, Michael J. Black
Title: Predicting 4D Hand Trajectory from Monocular Videos
Abstract:
We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity of annotated video data. To address these issues, we repurpose a state-of-the-art image-based transformer to take in multiple frames and directly predict a coherent trajectory. We introduce two types of lightweight attention layers: cross-view self-attention to fuse temporal information, and global cross-attention to bring in larger spatial context. Our method infers 4D hand trajectories similar to the ground truth while maintaining strong 2D reprojection alignment. We apply the method to both egocentric and allocentric videos. It significantly outperforms existing methods in global trajectory accuracy while being comparable to the state-of-the-art in single-image pose estimation. Project website: https://judyye.github.io/haptic-www

Authors:Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma
Title: Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Abstract:
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.

Authors:Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
Title: GameFactory: Creating New Games with Generative Interactive Videos
Abstract:
Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation.
中文摘要:GameFactory通过利用开放域生成先验和多阶段训练策略,提出了一个能实现动作可控和场景泛化的游戏视频生成框架,突破了固定游戏风格限制,可创建无限长度的交互式内容。
English Summary: GameFactory introduces a framework for generating action-controllable and scene-generalizable game videos by leveraging open-domain priors and a multi-phase training strategy, enabling unlimited-length interactive content creation beyond fixed game styles.

Authors:MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
Title: MiniMax-01: Scaling Foundation Models with Lightning Attention
Abstract:
We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
Chinese: MiniMax-01系列模型通过闪电注意力和专家混合架构实现了与顶尖模型相媲美的性能,同时支持超长上下文处理,训练和推理成本显著降低。
English: The MiniMax-01 series combines lightning attention with a Mixture-of-Experts architecture to match top-tier models while supporting far longer context windows at an affordable training and inference cost.

Authors:Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Title: Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
Abstract:
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at https://github.com/Sta8is/FUTURIST .
中文: 本文提出FUTURIST方法,采用统一变换器架构结合新型掩码机制和无VAE分层标记化,在Cityscapes数据集上实现了短期和中期未来语义预测的最先进性能。
English: This paper presents FUTURIST, a multimodal future semantic prediction method using a unified transformer architecture with a novel masking mechanism and VAE-free tokenization, achieving state-of-the-art performance on Cityscapes for short- and mid-term forecasting.

Authors:Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu
Title: LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
Abstract:
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC) and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data and benchmark will be released at https://github.com/appletea233/LLaVA-ST.
中文摘要:该摘要介绍了LLaVA-ST模型,它通过创新的语言对齐位置嵌入和时空特征压缩方法,解决了多模态理解中时空定位的难题,并在多个基准测试中取得了优异表现。
English Summary: The abstract introduces LLaVA-ST, a multimodal large language model designed to overcome challenges in simultaneous spatial-temporal localization by using innovative embedding and feature compression techniques, achieving top performance across multiple benchmarks.

Authors:Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, Houqiang Li
Title: SmartEraser: Remove Anything from Images using Masked-Region Guidance
Abstract:
Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.
中文摘要:SmartEraser提出了一种新颖的掩码区域引导范式,将掩码区域保留为输入引导,既能精确去除目标物体又能保持上下文,并通过新数据集和实验验证了其卓越性能。
English Summary: SmartEraser introduces a novel Masked-Region Guidance paradigm that retains masked regions as input guidance, enabling precise object removal while preserving context, and demonstrates superior performance through a new dataset and experiments.
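
The paradigm shift is easiest to see at the level of model inputs. The small NumPy sketch below (shapes and channel layout are assumptions for illustration) contrasts the conventional mask-and-inpaint input, where the target region is blanked out, with masked-region guidance, where the pixels are kept and the mask is appended as an extra channel.

import numpy as np

def build_removal_inputs(image, mask):
    # image: (H, W, 3) float array; mask: (H, W) bool array, True = remove.
    m = mask.astype(image.dtype)[..., None]
    inpaint_style = np.concatenate([image * (1.0 - m), m], axis=-1)   # region erased
    guidance_style = np.concatenate([image, m], axis=-1)              # region retained
    return inpaint_style, guidance_style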

Authors:Rui Daniel, M. Rita Verdelho, Catarina Barata, Carlos Santiago
Title: Continual Deep Active Learning for Medical Imaging: Replay-Base Architecture for Context Adaptation
Abstract:
Deep Learning for medical imaging faces challenges in adapting and generalizing to new contexts. Additionally, it often lacks sufficient labeled data for specific tasks requiring significant annotation effort. Continual Learning (CL) tackles adaptability and generalizability by enabling lifelong learning from a data stream while mitigating forgetting of previously learned knowledge. Active Learning (AL) reduces the number of required annotations for effective training. This work explores both approaches (CAL) to develop a novel framework for robust medical image analysis. Based on the automatic recognition of shifts in image characteristics, Replay-Base Architecture for Context Adaptation (RBACA) employs a CL rehearsal method to continually learn from diverse contexts, and an AL component to select the most informative instances for annotation. A novel approach to evaluate CAL methods is established using a metric termed the IL-Score, which allows for the simultaneous assessment of transfer learning, forgetting, and final model performance. We show that RBACA works in domain and class-incremental learning scenarios, by assessing its IL-Score on the segmentation and diagnosis of cardiac images. The results show that RBACA outperforms a baseline framework without CAL, and a state-of-the-art CAL method across various memory sizes and annotation budgets. Our code is available at https://github.com/RuiDaniel/RBACA.
中文摘要:本研究提出RBACA框架,融合持续学习与主动学习,通过最小化标注需求适应新场景,在心脏图像分割与诊断中展现出卓越性能。
English Summary: This study introduces RBACA, a novel framework combining continual and active learning to enhance medical image analysis by adapting to new contexts with minimal annotation, achieving superior performance in cardiac image segmentation and diagnosis.

Authors:Yabo Zhang, Xinpeng Zhou, Yihan Zeng, Hang Xu, Hui Li, Wangmeng Zuo
Title: FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
Abstract:
Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, and thus necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, thereby inheriting powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various editing signals: it decisively outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, e.g., automatically adjusting the reflection of a cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, e.g., transforming a clownfish into a shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.
Chinese Summary: FramePainter将交互式图像编辑重构为图像到视频生成问题,利用视频扩散先验知识,通过轻量级控制编码器实现高质量、时序一致的编辑效果,大幅减少了训练数据需求。
English Summary: FramePainter reformulates interactive image editing as an image-to-video generation task, leveraging video diffusion priors to achieve high-quality, temporally consistent edits with minimal training data and a lightweight control encoder.

Authors:Qian Zeng, Jie Song, Han Zheng, Hao Jiang, Mingli Song
Title: D$^2$-DPM: Dual Denoising for Quantized Diffusion Probabilistic Models
Abstract:
Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
中文: D2-DPM提出双重去噪机制,有效抵消扩散模型量化噪声影响,在实现显著压缩和加速的同时提升了生成质量。
English: D2-DPM introduces a dual denoising mechanism to counteract quantization noise effects in diffusion models, achieving enhanced generation quality with significant compression and acceleration.

Authors:Marcel Rogge, Didier Stricker
Title: Object-Centric 2D Gaussian Splatting: Background Removal and Occlusion-Aware Pruning for Compact Object Models
Abstract:
Current Gaussian Splatting approaches are effective for reconstructing entire scenes but lack the option to target specific objects, making them computationally expensive and unsuitable for object-specific applications. We propose a novel approach that leverages object masks to enable targeted reconstruction, resulting in object-centric models. Additionally, we introduce an occlusion-aware pruning strategy to minimize the number of Gaussians without compromising quality. Our method reconstructs compact object models, yielding object-centric Gaussian and mesh representations that are up to 96% smaller and up to 71% faster to train compared to the baseline while retaining competitive quality. These representations are immediately usable for downstream applications such as appearance editing and physics simulation without additional processing.
中文摘要:该方法利用物体遮罩和遮挡感知剪枝策略,构建了紧凑的物体中心高斯与网格表示,在保持质量的同时显著减小了模型体积并加速训练,可直接应用于外观编辑和物理模拟等下游任务。
English Summary: The proposed method uses object masks and occlusion-aware pruning to create compact, object-centric Gaussian and mesh representations that are significantly smaller and faster to train while maintaining quality, enabling direct use in applications like appearance editing and physics simulation.

Authors:Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Déforges, Kassem Kallas
Title: Energy Backdoor Attack to Deep Neural Networks
Abstract:
The rise of deep learning (DL) has increased computing complexity and energy use, prompting the adoption of application specific integrated circuits (ASICs) for energy-efficient edge and mobile deployment. However, recent studies have demonstrated the vulnerability of these accelerators to energy attacks. Despite the development of various inference time energy attacks in prior research, backdoor energy attacks remain unexplored. In this paper, we design an innovative energy backdoor attack against deep neural networks (DNNs) operating on sparsity-based accelerators. Our attack is carried out in two distinct phases: backdoor injection and backdoor stealthiness. Experimental results using ResNet-18 and MobileNet-V2 models trained on CIFAR-10 and Tiny ImageNet datasets show the effectiveness of our proposed attack in increasing energy consumption on trigger samples while preserving the model's performance for clean/regular inputs. This demonstrates the vulnerability of DNNs to energy backdoor attacks. The source code of our attack is available at: https://github.com/hbrachemi/energy_backdoor.
中文摘要:本文针对基于稀疏性加速器的深度神经网络提出了一种新型能量后门攻击,该攻击能在保持模型正常性能的同时,显著增加触发样本的能耗。
English Summary: This paper introduces a novel energy backdoor attack targeting deep neural networks on sparsity-based accelerators, which significantly increases energy consumption for trigger inputs while maintaining normal model performance.

Authors:Yunzhi Zhuge, Hongyu Gu, Lu Zhang, Jinqing Qi, Huchuan Lu
Title: Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation
Abstract:
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating effective inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross-modality knowledge to robustly and efficiently localize and track the primary object across various challenging scenarios. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method's robust versatility and its adeptness in adapting to a range of segmentation tasks. Source code is available at https://github.com/hy0523/MTNet.
中文: 本文提出MTNet算法,通过统一框架整合运动与时间线索进行无监督视频目标分割,能在复杂场景中准确定位主要目标,并取得领先性能。
English: This paper introduces MTNet, an efficient unsupervised video object segmentation algorithm that integrates motion and temporal cues within a unified framework to robustly localize primary objects across challenging scenarios, achieving state-of-the-art performance.

Authors:Farnoosh Koleini, Muhammad Usama Saleem, Pu Wang, Hongfei Xue, Ahmed Helmy, Abbey Fenwick
Title: BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos
Abstract:
Recent advancements in 3D human pose estimation from single-camera images and videos have relied on parametric models, like SMPL. However, these models oversimplify anatomical structures, limiting their accuracy in capturing true joint locations and movements, which reduces their applicability in biomechanics, healthcare, and robotics. Biomechanically accurate pose estimation, on the other hand, typically requires costly marker-based motion capture systems and optimization techniques in specialized labs. To bridge this gap, we propose BioPose, a novel learning-based framework for predicting biomechanically accurate 3D human pose directly from monocular videos. BioPose includes three key components: a Multi-Query Human Mesh Recovery model (MQ-HMR), a Neural Inverse Kinematics (NeurIK) model, and a 2D-informed pose refinement technique. MQ-HMR leverages a multi-query deformable transformer to extract multi-scale fine-grained image features, enabling precise human mesh recovery. NeurIK treats the mesh vertices as virtual markers, applying a spatial-temporal network to regress biomechanically accurate 3D poses under anatomical constraints. To further improve 3D pose estimations, a 2D-informed refinement step optimizes the query tokens during inference by aligning the 3D structure with 2D pose observations. Experiments on benchmark datasets demonstrate that BioPose significantly outperforms state-of-the-art methods. Project website: https://m-usamasaleem.github.io/publication/BioPose/BioPose.html.

Authors:Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai
Title: Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
Abstract:
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.
中文总结:提出的参数倒置图像金字塔网络通过使用更小的网络分支处理高分辨率图像来降低计算成本,同时利用跨分支特征交互保持性能,在多项视觉任务中以显著减少的计算量实现了更优表现。
English Summary: The proposed Parameter-Inverted Image Pyramid (PIIP) network processes higher-resolution images with smaller network branches to reduce computational costs while maintaining performance through cross-branch feature interactions, achieving superior results across various vision tasks with significantly lower computation.
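
A minimal, self-contained PyTorch sketch of the parameter-inverted idea: route the full-resolution image through a narrow, cheap branch and a downsampled copy through a wide, expensive branch, then fuse. The real PIIP uses pretrained ViT/CNN branches and a dedicated cross-branch feature-interaction mechanism; the layer widths and the pool-and-concatenate fusion below are purely illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPIIP(nn.Module):
    def __init__(self, narrow=16, wide=64, num_classes=10):
        super().__init__()
        # Narrow branch sees the high-resolution input (cheap per pixel).
        self.hi_branch = nn.Sequential(
            nn.Conv2d(3, narrow, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(narrow, narrow, 3, stride=2, padding=1))
        # Wide branch sees the half-resolution input (expensive per pixel).
        self.lo_branch = nn.Sequential(
            nn.Conv2d(3, wide, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(wide, wide, 3, stride=2, padding=1))
        self.head = nn.Linear(narrow + wide, num_classes)

    def forward(self, image):
        lo = F.interpolate(image, scale_factor=0.5,
                           mode="bilinear", align_corners=False)
        f_hi = self.hi_branch(image)
        f_lo = self.lo_branch(lo)
        # Bring both feature maps to the smaller spatial size and fuse.
        f_hi = F.adaptive_avg_pool2d(f_hi, f_lo.shape[-2:])
        fused = torch.cat([f_hi, f_lo], dim=1).mean(dim=(-2, -1))
        return self.head(fused)

For example, ToyPIIP()(torch.randn(2, 3, 64, 64)) returns a (2, 10) logit tensor.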

Authors:Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen
Title: Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Abstract:
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
中文: 本研究提出了TA-TiTok图像分词器,通过在解码阶段融合文本信息提升性能与扩展性,并基于此开发了仅使用开放数据即可达到媲美私有数据训练效果的MaskGen生成模型。
English: The study introduces TA-TiTok, an efficient image tokenizer that integrates text during decoding to improve performance and scalability, and MaskGen models that achieve competitive results using only open data.

Authors:Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen
Title: Dataset Distillation via Committee Voting
Abstract:
Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets. In this work, we introduce Committee Voting for Dataset Distillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets. We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. By integrating distributions and predictions from a committee of models while generating high-quality soft labels, our method captures a wider spectrum of data features, reduces model-specific biases and the adverse effects of distribution shifts, leading to significant improvements in generalization. This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks. Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation. Code is available at: https://github.com/Jiacheng8/CV-DD.
中文: 本文提出委员会投票数据集蒸馏(CV-DD)新方法,通过整合多个模型的预测分布生成高质量软标签,有效增强特征多样性、减少模型偏差,在多个数据集上显著提升泛化性能。
English: This paper introduces Committee Voting for Dataset Distillation (CV-DD), a novel method that synthesizes high-quality distilled datasets by leveraging collective model predictions to enhance feature diversity, reduce bias, and improve generalization across various datasets.
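
One concrete way to realize the committee vote over soft labels, assuming each teacher model has already produced logits for the distilled images; the array layout and the temperature are illustrative rather than the paper's exact recipe.

import numpy as np

def committee_soft_labels(logits_per_model, temperature=1.0):
    # logits_per_model: (M, N, C) logits from M committee members for N images.
    z = logits_per_model / temperature
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.mean(axis=0)                    # (N, C) averaged soft labels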

Authors:Xinyu Zhang, Zicheng Duan, Dong Gong, Lingqiao Liu
Title: Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss
Abstract:
In this paper, we address the challenge of generating temporally consistent videos with motion guidance. While many existing methods depend on additional control modules or inference-time fine-tuning, recent studies suggest that effective motion guidance is achievable without altering the model architecture or requiring extra training. Such approaches offer promising compatibility with various video generation foundation models. However, existing training-free methods often struggle to maintain consistent temporal coherence across frames or to follow guided motion accurately. In this work, we propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being our key innovation. Specifically, we capture the inter-frame feature correlation patterns of intermediate features from a video diffusion model to represent the motion pattern of the reference video. We then design a motion consistency loss to maintain similar feature correlation patterns in the generated video, using the gradient of this loss in the latent space to guide the generation process for precise motion control. This approach improves temporal consistency across various motion control tasks while preserving the benefits of a training-free setup. Extensive experiments show that our method sets a new standard for efficient, temporally coherent video generation.
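
A schematic NumPy version of the motion consistency loss described above: represent motion as the correlation pattern between consecutive frames' intermediate diffusion features and penalize the squared difference between the generated and reference patterns. In the actual method this loss is differentiated with respect to the latents to steer sampling; here it is only evaluated, and the (frames, tokens, channels) layout is an assumption.

import numpy as np

def frame_correlation(feats):
    # feats: (T, N, C) intermediate features for T frames with N spatial tokens.
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    # Correlation between tokens of consecutive frames: (T-1, N, N).
    return np.einsum("tnc,tmc->tnm", f[:-1], f[1:])

def motion_consistency_loss(gen_feats, ref_feats):
    diff = frame_correlation(gen_feats) - frame_correlation(ref_feats)
    return float(np.mean(diff ** 2))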

Authors:Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, Xiaowei Zhou
Title: MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
Abstract:
Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence analysis and beyond.
中文: 基于深度学习的图像匹配在速度和精度上表现优异,但面对跨模态图像时因训练数据不足而受限;我们提出的预训练框架利用合成数据解决了这一问题,显著提升了模型在多种任务中的泛化能力。
English: Deep learning-based image matching excels in speed and accuracy but struggles with cross-modal images due to limited training data, a challenge addressed by our pre-training framework using synthetic data to boost generalizability across diverse tasks.

Authors:Varun Biyyala, Bharat Chanderprakash Kathuria, Jialu Li, Youshan Zhang
Title: SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
Abstract:
Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on Semantic, Spatial, and Temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available at https://github.com/custommetrics-sst/SST_CustomEvaluationMetrics.git.
中文摘要:SST-EM是一种创新的视频编辑评估框架,通过整合语义提取、目标追踪、精细化处理和时序一致性检测,全面评估视频的语义保真度与时间连贯性。
English Summary: SST-EM is a novel video editing evaluation framework that integrates semantic extraction, object tracking, refinement, and temporal consistency checks to comprehensively assess semantic fidelity and temporal smoothness.

Authors:Shiman Zhang, Lakshmikar Reddy Polamreddy, Youshan Zhang
Title: Confident Pseudo-labeled Diffusion Augmentation for Canine Cardiomegaly Detection
Abstract:
Canine cardiomegaly, marked by an enlarged heart, poses serious health risks if undetected, requiring accurate diagnostic methods. Current detection models often rely on small, poorly annotated datasets and struggle to generalize across diverse imaging conditions, limiting their real-world applicability. To address these issues, we propose a Confident Pseudo-labeled Diffusion Augmentation (CDA) model for identifying canine cardiomegaly. Our approach addresses the challenge of limited high-quality training data by employing diffusion models to generate synthetic X-ray images and annotate Vertebral Heart Score key points, thereby expanding the dataset. We also employ a pseudo-labeling strategy with Monte Carlo Dropout to select high-confidence labels, refine the synthetic dataset, and improve accuracy. Iteratively incorporating these labels enhances the model's performance, overcoming the limitations of existing approaches. Experimental results show that the CDA model outperforms traditional methods, achieving state-of-the-art accuracy in canine cardiomegaly detection. The code implementation is available at https://github.com/Shira7z/CDA.
中文: CDA模型通过扩散模型生成合成X光图像并采用置信伪标注策略,有效解决了标注数据稀缺问题,在犬类心脏肥大检测中实现了最先进的准确率。
English: The CDA model introduces diffusion-based synthetic X-ray generation and confident pseudo-labeling to overcome limited annotated data, achieving state-of-the-art accuracy in detecting canine cardiomegaly.
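
The confidence filter can be sketched independently of any particular network: run the model with dropout active T times on the synthetic images, keep samples whose predictive variance is small, and use the mean prediction as the pseudo-label. The variance threshold and the (T, N, D) layout below are assumptions for illustration.

import numpy as np

def confident_pseudo_labels(mc_predictions, var_threshold=0.05):
    # mc_predictions: (T, N, D) outputs of T Monte Carlo Dropout passes over
    # N synthetic images (D = predicted keypoint coordinates or class probs).
    mean = mc_predictions.mean(axis=0)
    uncertainty = mc_predictions.var(axis=0).mean(axis=-1)   # per-sample variance
    keep = uncertainty < var_threshold
    return mean, keep   # pseudo-labels and boolean mask of confident samples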

Authors:Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas
Title: RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment
Abstract:
Automated chest radiograph interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods' 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at https://github.com/difeigu/RadAlign.
Chinese: RadAlign是一种创新框架,通过结合视觉语言模型进行精确疾病分类和大型语言模型生成详细放射学报告,以0.885的平均AUC和0.678的GREEN评分实现卓越性能,同时提升临床可解释性。
English: RadAlign is a novel framework that integrates vision-language models for accurate disease classification and large language models for generating detailed radiology reports, achieving superior performance with an AUC of 0.885 and a GREEN score of 0.678 while enhancing clinical interpretability.

Authors:Yaqing Ding, Viktor Kocur, Zuzana Berger Haladová, Qianliang Wu, Shen Cai, Jian Yang, Zuzana Kukelova
Title: Three-view Focal Length Recovery From Homographies
Abstract:
In this paper, we propose a novel approach for recovering focal lengths from three-view homographies. By examining the consistency of normal vectors between two homographies, we derive new explicit constraints between the focal lengths and homographies using an elimination technique. We demonstrate that three-view homographies provide two additional constraints, enabling the recovery of one or two focal lengths. We discuss four possible cases, including three cameras having an unknown equal focal length, three cameras having two different unknown focal lengths, three cameras where one focal length is known, and the other two cameras have equal or different unknown focal lengths. All the problems can be converted into solving polynomials in one or two unknowns, which can be efficiently solved using Sturm sequence or hidden variable technique. Evaluation using both synthetic and real data shows that the proposed solvers are both faster and more accurate than methods relying on existing two-view solvers. The code and data are available on https://github.com/kocurvik/hf
中文: 本文提出了一种通过分析三视图单应矩阵中法向量一致性并利用消元技术推导新约束的新方法,能够高效求解多项式以恢复焦距,实验证明其比现有方法更快更精确。
English: This paper introduces a novel method for recovering focal lengths from three-view homographies by deriving explicit constraints through normal vector consistency and elimination techniques, enabling efficient polynomial solving with demonstrated superior speed and accuracy over existing methods.

Authors:Wenping Jin, Li Zhu, Jing Sun
Title: Aligning First, Then Fusing: A Novel Weakly Supervised Multimodal Violence Detection Method
Abstract:
Weakly supervised violence detection refers to the technique of training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach: leveraging the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies the suitable non-zero feature matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, achieving an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at https://github.com/xjpp2016/MAVD.
Chinese Summary: 本文提出一种新颖的弱监督暴力检测方法,通过将信息量较少的模态特征稀疏映射至RGB语义空间实现多模态特征对齐,结合单模态学习、特征对齐、多模态融合与检测的框架,在XD-Violence数据集上达到86.07%的平均精度。
English Summary: This paper introduces a novel weakly supervised violence detection method that aligns multimodal semantic features by mapping less informative modalities into the RGB feature space, achieving 86.07% AP on the XD-Violence dataset through a framework combining unimodal learning, alignment, fusion, and detection.

Authors:Daniel Steininger, Julia Simon, Andreas Trondl, Markus Murschitz
Title: TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations
Abstract:
Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments, posing significant safety risks. Progressively automating these tasks has the potential to increase their efficiency as well as safety, but requires an accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.
中文: TimberVision数据集通过提供包含5.1万个树干部件的2000多张标注图像,解决了林业自动化中专业数据匮乏的问题,利用先进的计算机视觉技术实现了对原木和树木的精准检测与追踪。
English: The TimberVision dataset addresses the scarcity of specialized data for automating forestry tasks by providing over 2,000 annotated images with 51,000 trunk components, enabling robust detection and tracking of logs and trees through advanced computer vision techniques.

Authors:Zhimeng Xin, Tianxu Wu, Shiming Chen, Shuo Ye, Zijing Xie, Yixiong Zou, Xinge You, Yufei Guo
Title: Toward Realistic Camouflaged Object Detection: Benchmarks and Method
Abstract:
Camouflaged object detection (COD) primarily relies on semantic or instance segmentation methods. While these methods have made significant advancements in identifying the contours of camouflaged objects, they may be inefficient or not cost-effective for tasks that only require the specific location of the object. Object detection algorithms offer an optimized solution for Realistic Camouflaged Object Detection (RCOD) in such cases. However, detecting camouflaged objects remains a formidable challenge due to the high degree of similarity between the features of the objects and their backgrounds. Unlike segmentation methods that perform pixel-wise comparisons to differentiate between foreground and background, object detectors omit this analysis, further aggravating the challenge. To solve this problem, we propose a camouflage-aware feature refinement (CAFR) strategy. Since camouflaged objects are not rare categories, CAFR fully utilizes a clear perception of the current object within the prior knowledge of large models to assist detectors in deeply understanding the distinctions between background and foreground. Specifically, in CAFR, we introduce the Adaptive Gradient Propagation (AGP) module that fine-tunes all feature extractor layers in large detection models to fully refine class-specific features from camouflaged contexts. We then design the Sparse Feature Refinement (SFR) module that optimizes the transformer-based feature extractor to focus primarily on capturing class-specific features in camouflaged scenarios. To facilitate the assessment of RCOD tasks, we manually annotate the labels required for detection on three existing segmentation COD datasets, creating a new benchmark for RCOD tasks. Code and datasets are available at: https://github.com/zhimengXin/RCOD.
中文: 提出的伪装感知特征细化(CAFR)策略利用大模型的先验知识优化特征提取,有效区分伪装物体与背景,解决了分割方法在定位任务中的不足,并为现实伪装物体检测建立了新基准。
English: The proposed camouflage-aware feature refinement (CAFR) strategy enhances object detection for camouflaged objects by leveraging large models' prior knowledge and refining class-specific features, addressing the limitations of segmentation methods and establishing a new benchmark for realistic camouflaged object detection.

Authors:Li Liang, Naveed Akhtar, Jordan Vice, Xiangrui Kong, Ajmal Saeed Mian
Title: Skip Mamba Diffusion for Monocular 3D Semantic Scene Completion
Abstract:
3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to the challenging real-world conditions, this task usually demands complex models that process multi-modal data to achieve acceptable performance. We propose a unique neural model, leveraging advances from the state space and diffusion generative modeling to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long-sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench-KITTI360 datasets shows that our approach not only outperforms other monocular techniques by a large margin but also achieves competitive performance against stereo methods. The code is available at https://github.com/xrkong/skimba
中文摘要:本文提出了一种新颖的神经网络模型,利用状态空间和扩散生成建模技术,仅通过单目图像输入即可实现卓越的3D语义场景补全效果,大幅超越现有单目方法并与立体方法相媲美。
English Summary: This paper introduces a novel neural model that employs state space and diffusion generative modeling to achieve superior 3D semantic scene completion from monocular images, significantly outperforming existing monocular methods and competing with stereo approaches.

Authors:Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, Bilge Soran
Title: EdgeTAM: On-Device Track Anything Model
Abstract:
On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.
Chinese: SAM 2通过内存库机制将SAM扩展至视频分割,而EdgeTAM采用新型2D空间感知器和蒸馏技术优化其效率,使其在iPhone 15 Pro Max上以16 FPS运行并保持高性能。
English: SAM 2 enhances video segmentation by extending SAM with a memory bank mechanism, and EdgeTAM optimizes its efficiency for mobile devices using a 2D Spatial Perceiver and distillation, achieving high performance at 16 FPS on iPhone 15 Pro Max.
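The latency saving in EdgeTAM comes from attending to a small set of learned latents instead of the dense frame-level memories. A rough PyTorch sketch of such a Perceiver-style compressor follows; the dimensions, the single attention layer, and the omission of the global/patch query split are simplifications for illustration, not EdgeTAM's actual module.

```python
import torch
import torch.nn as nn

class MemoryCompressor(nn.Module):
    """Compress dense frame memories into a fixed number of latent tokens.

    Learnable queries cross-attend to the (H*W) memory tokens of a frame,
    so later memory attention only sees num_queries tokens per frame.
    """
    def __init__(self, dim=256, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens):            # (B, H*W, dim)
        b = memory_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.attn(q, memory_tokens, memory_tokens)
        return self.norm(compressed)             # (B, num_queries, dim)

out = MemoryCompressor()(torch.randn(2, 64 * 64, 256))   # -> (2, 64, 256)
```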

Authors:Han Liu, Yinwei Wei, Fan Liu, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
Title: Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation
Abstract:
Multimodal information (e.g., visual, acoustic, and textual) has been widely used to enhance representation learning for micro-video recommendation. For integrating multimodal information into a joint representation of micro-video, multimodal fusion plays a vital role in the existing micro-video recommendation approaches. However, the static multimodal fusion used in previous studies is insufficient to model the various relationships among multimodal information of different micro-videos. In this paper, we develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF), which dynamically assigns parameters to the multimodal fusion function for each micro-video during its representation learning. Specifically, MetaMMF regards the multimodal fusion of each micro-video as an independent task. Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner. We perform extensive experiments on three benchmark datasets, demonstrating the significant improvements over several state-of-the-art multimodal recommendation models, like MMGCN, LATTICE, and InvRL. Furthermore, we lighten our model by adopting canonical polyadic decomposition to improve the training efficiency, and validate its effectiveness through experimental results. Codes are available at https://github.com/hanliu95/MetaMMF.
中文: 本文提出MetaMMF框架,通过元学习动态融合各微视频的多模态信息以提升推荐性能,在超越现有先进模型的同时保持了训练效率。
English: This paper introduces MetaMMF, a dynamic meta-learning framework that adaptively fuses multimodal information for each micro-video to improve recommendation accuracy, outperforming existing models while maintaining training efficiency.
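The core idea, fusion parameters generated per item by a meta learner, can be illustrated with a tiny hypernetwork. The sketch below is a toy version with made-up feature sizes; the paper's meta learner, its task formulation, and the CP-decomposed lightweight variant are richer than this.

```python
import torch
import torch.nn as nn

class MetaFusion(nn.Module):
    """Toy dynamic fusion: a meta learner emits per-item fusion weights.

    Visual/acoustic/textual features are concatenated into meta information,
    from which a small hypernetwork predicts the weight matrix of a linear
    fusion function applied to that same item. All sizes are illustrative.
    """
    def __init__(self, d_modal=64, d_out=64):
        super().__init__()
        self.d_in, self.d_out = 3 * d_modal, d_out
        self.meta_learner = nn.Sequential(
            nn.Linear(self.d_in, 128), nn.ReLU(),
            nn.Linear(128, self.d_in * d_out),   # parameters of the fusion function
        )

    def forward(self, v, a, t):                  # each (B, d_modal)
        x = torch.cat([v, a, t], dim=-1)         # (B, d_in)
        w = self.meta_learner(x).view(-1, self.d_out, self.d_in)
        return torch.bmm(w, x.unsqueeze(-1)).squeeze(-1)   # item-specific fusion
```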

Authors:Zhen Xiong, Yuqi Li, Chuanguang Yang, Tiao Tan, Zhihong Zhu, Siyuan Li, Yue Ma
Title: Enhancing Image Generation Fidelity via Progressive Prompts
Abstract:
The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. By using the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.
中文: 本文提出了一种从粗到细的生成流程,利用大语言模型和区域交叉注意力控制来增强基于扩散变换器的图像生成可控性,从而优化图像的高级内容和低级细节表现。
English: This paper introduces a coarse-to-fine pipeline using large language models and regional cross-attention control to enhance the controllability of diffusion transformer-based image generation, improving both high-level content and low-level details.

Authors:Minhui Xie, Hao Peng, Pu Li, Guangjie Zeng, Shuhai Wang, Jia Wu, Peng Li, Philip S. Yu
Title: Hierarchical Superpixel Segmentation via Structural Information Theory
Abstract:
Superpixel segmentation is a foundation for many higher-level computer vision tasks, such as image segmentation, object recognition, and scene understanding. Existing graph-based superpixel segmentation methods typically concentrate on the relationships between a given pixel and its directly adjacent pixels while overlooking the influence of non-adjacent pixels. These approaches do not fully leverage the global information in the graph, leading to suboptimal segmentation quality. To address this limitation, we present SIT-HSS, a hierarchical superpixel segmentation method based on structural information theory. Specifically, we first design a novel graph construction strategy that incrementally explores the pixel neighborhood to add edges based on 1-dimensional structural entropy (1D SE). This strategy maximizes the retention of graph information while avoiding an overly complex graph structure. Then, we design a new 2D SE-guided hierarchical graph partitioning method, which iteratively merges pixel clusters layer by layer to reduce the graph's 2D SE until a predefined segmentation scale is achieved. Experimental results on three benchmark datasets demonstrate that the SIT-HSS performs better than state-of-the-art unsupervised superpixel segmentation algorithms. The source code is available at \url{https://github.com/SELGroup/SIT-HSS}.
中文: 提出的SIT-HSS方法通过结构信息理论探索相邻和非相邻像素关系,改进了超像素分割,在性能上超越了现有无监督算法。
English: The proposed SIT-HSS method enhances superpixel segmentation by incorporating structural information theory to explore both adjacent and non-adjacent pixel relationships, achieving superior performance over existing unsupervised algorithms.
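For readers unfamiliar with structural information theory, the one-dimensional structural entropy that guides SIT-HSS's graph construction is the entropy of the degree distribution normalized by the graph's edge volume. A minimal, self-contained computation is shown below; the paper's incremental neighborhood expansion and 2D SE-guided partitioning are not reproduced here.

```python
import math

def one_dim_structural_entropy(edges):
    """1D structural entropy of an undirected weighted graph.

    edges: iterable of (u, v, w) tuples. H1 = -sum_i (d_i / vol) * log2(d_i / vol),
    where d_i is the weighted degree of node i and vol is the total degree volume.
    """
    degree, volume = {}, 0.0
    for u, v, w in edges:
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
        volume += 2 * w
    return -sum((d / volume) * math.log2(d / volume) for d in degree.values())

# Example: a unit-weight triangle has H1 = log2(3) ≈ 1.585 bits.
print(one_dim_structural_entropy([("a", "b", 1), ("b", "c", 1), ("a", "c", 1)]))
```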

Authors:Xuhui Guo, Tanmoy Dam, Rohan Dhamdhere, Gourav Modanwal, Anant Madabhushi
Title: UNetVL: Enhancing 3D Medical Image Segmentation with Chebyshev KAN Powered Vision-LSTM
Abstract:
3D medical image segmentation has progressed considerably due to Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), yet these methods struggle to balance long-range dependency acquisition with computational efficiency. To address this challenge, we propose UNETVL (U-Net Vision-LSTM), a novel architecture that leverages recent advancements in temporal information processing. UNETVL incorporates Vision-LSTM (ViL) for improved scalability and memory functions, alongside an efficient Chebyshev Kolmogorov-Arnold Networks (KAN) to handle complex and long-range dependency patterns more effectively. We validated our method on the ACDC and AMOS2022 (post challenge Task 2) benchmark datasets, showing a significant improvement in mean Dice score compared to recent state-of-the-art approaches, especially over its predecessor, UNETR, with increases of 7.3% on ACDC and 15.6% on AMOS, respectively. Extensive ablation studies were conducted to demonstrate the impact of each component in UNETVL, providing a comprehensive understanding of its architecture. Our code is available at https://github.com/tgrex6/UNETVL, facilitating further research and applications in this domain.
中文: 提出的UNETVL架构通过融合Vision-LSTM和高效切比雪夫KAN网络,显著提升了三维医学图像分割中长程依赖关系的建模能力,在ACDC和AMOS数据集上的Dice分数分别较现有最优方法提高了7.3%和15.6%。
English: The proposed UNETVL architecture enhances 3D medical image segmentation by integrating Vision-LSTM and efficient Chebyshev KAN to better capture long-range dependencies, achieving significant Dice score improvements of 7.3% on ACDC and 15.6% on AMOS datasets compared to previous methods.
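A Chebyshev KAN layer replaces fixed activations with learnable polynomials of the inputs, built from the recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x). The sketch below is a minimal PyTorch layer with illustrative sizes and degree; it shows the mechanism, not UNETVL's exact block.

```python
import torch
import torch.nn as nn

class ChebyKANLayer(nn.Module):
    """Minimal Chebyshev KAN layer: learnable per-(input, output) polynomials."""
    def __init__(self, in_dim, out_dim, degree=4):
        super().__init__()
        self.degree = degree
        self.coeffs = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * 0.1)

    def forward(self, x):                        # (B, in_dim)
        x = torch.tanh(x)                        # map inputs into [-1, 1]
        basis = [torch.ones_like(x), x]          # T0, T1
        for _ in range(2, self.degree + 1):
            basis.append(2 * x * basis[-1] - basis[-2])   # T_{k+1} = 2x T_k - T_{k-1}
        basis = torch.stack(basis, dim=-1)       # (B, in_dim, degree+1)
        return torch.einsum("bid,iod->bo", basis, self.coeffs)

y = ChebyKANLayer(16, 32)(torch.randn(8, 16))    # -> (8, 32)
```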

Authors:Binyu Zhang, Shichao Li, Junpeng Jian, Zhu Meng, Limei Guo, Zhicheng Zhao
Title: A Multi-Modal Deep Learning Framework for Pan-Cancer Prognosis
Abstract:
The prognostic task is of great importance as it is closely related to the survival analysis of patients, the optimization of treatment plans and the allocation of resources. The existing prognostic models have shown promising results on specific datasets, but there are limitations in two aspects. On the one hand, they merely explore certain types of modal data, such as patient histopathology WSI and gene expression analysis. On the other hand, they adopt the per-cancer-per-model paradigm, which means the trained models can only predict the prognostic effect of a single type of cancer, resulting in weak generalization ability. In this paper, a deep-learning based model, named UMPSNet, is proposed. Specifically, to comprehensively understand the condition of patients, in addition to constructing encoders for histopathology images and genomic expression profiles respectively, UMPSNet further integrates four types of important meta data (demographic information, cancer type information, treatment protocols, and diagnosis results) into text templates, and then introduces a text encoder to extract textual features. In addition, an optimal transport (OT)-based attention mechanism is utilized to align and fuse features of different modalities. Furthermore, a guided soft mixture of experts (GMoE) mechanism is introduced to effectively address the issue of distribution differences among multiple cancer datasets. By incorporating the multi-modality of patient data and joint training, UMPSNet outperforms all SOTA approaches, and moreover, it demonstrates the effectiveness and generalization ability of the proposed learning paradigm of a single model for multiple cancer types. The code of UMPSNet is available at https://github.com/binging512/UMPSNet.
中文:提出的UMPSNet模型通过整合病理图像、基因组数据和临床文本等多模态信息,采用基于最优运输的注意力机制和引导混合专家方法,在多种癌症类型的预后预测中实现了优于现有单模态单癌症模型的性能与泛化能力。
English: The proposed UMPSNet model integrates multiple data modalities, including histopathology images, genomic profiles, and clinical metadata, using optimal transport-based attention and guided mixture of experts mechanisms to achieve superior prognostic performance and generalization across multiple cancer types compared to existing single-modality, single-cancer models.

Authors:Wojciech Zielonka, Stephan J. Garbin, Alexandros Lattas, George Kopanas, Paulo Gotardo, Thabo Beeler, Justus Thies, Timo Bolkart
Title: Synthetic Prior for Few-Shot Drivable Head Avatar Inversion
Abstract:
We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle three major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, the use of real data is strictly regulated (e.g., under the General Data Protection Regulation, which mandates frequent deletion of models and data to accommodate a situation when a participant's consent is withdrawn). Synthetic data, free from these constraints, is an appealing alternative. Third, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to SOTA monocular and GAN-based methods, SynShot significantly improves novel view and expression synthesis.

Authors:Liyan Chen, Huangying Zhan, Kevin Chen, Xiangyu Xu, Qingan Yan, Changjiang Cai, Yi Xu
Title: ActiveGAMER: Active GAussian Mapping through Efficient Rendering
Abstract:
We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian Splatting (3DGS) to achieve high-quality, real-time scene mapping and exploration. Unlike traditional NeRF-based methods, which are computationally demanding and restrict active mapping performance, our approach leverages the efficient rendering capabilities of 3DGS, allowing effective and efficient exploration in complex environments. The core of our system is a rendering-based information gain module that dynamically identifies the most informative viewpoints for next-best-view planning, enhancing both geometric and photometric reconstruction accuracy. ActiveGAMER also integrates a carefully balanced framework, combining coarse-to-fine exploration, post-refinement, and a global-local keyframe selection strategy to maximize reconstruction completeness and fidelity. Our system autonomously explores and reconstructs environments with state-of-the-art geometric and photometric accuracy and completeness, significantly surpassing existing approaches in both aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D highlight ActiveGAMER's effectiveness in active mapping tasks.
中文:ActiveGAMER 是一种利用 3D 高斯泼溅技术的主动建图系统,能够实现实时场景探索与重建,在精度和效率上均优于传统方法。
English: ActiveGAMER is an active mapping system that uses 3D Gaussian Splatting for real-time scene exploration and reconstruction, outperforming traditional methods in accuracy and efficiency.

Authors:Haojun Yu, Di Dai, Ziwei Zhao, Di He, Han Hu, Liwei Wang
Title: LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier
Abstract:
Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large-scale mask labels is labour-intensive and time-consuming. Recently, language-guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out-of-distribution categories. In this paper, we propose a new large vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models as large-vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image-level, while for segmentation we need to predict the label at pixel level. To address this issue, we first propose a general baseline framework to incorporate image-level supervision into the training process of a pixel-level segmentation model, making the trained network perform semantic segmentation on newly introduced categories in the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category-wise attentive classifier to apply supervision to the precise regions of corresponding categories to improve the model performance. Extensive experiments demonstrate that LarvSeg significantly improves the large vocabulary semantic segmentation performance, especially in the categories without mask labels. For the first time, we provide a 21K-category semantic segmentation model with the help of ImageNet21K. The code is available at https://github.com/HaojunYu1998/large_voc_seg.
中文: 提出的LarvSeg框架通过利用图像分类数据和类别注意力分类器,显著提升大词汇量语义分割性能,无需大量掩码标注即可扩展至21K个类别。
English: The proposed LarvSeg framework enhances large vocabulary semantic segmentation by utilizing image classification data and a category-wise attentive classifier, effectively scaling to 21K categories without requiring extensive mask annotations.
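The key trick, supervising a pixel-level model with image-level labels by letting each category attend to its own regions, can be sketched as attention-weighted pooling of per-pixel class scores. The function below is a hedged illustration with hypothetical tensor names, not the released implementation.

```python
import torch
import torch.nn.functional as F

def category_attentive_logits(pixel_feats, class_embeds):
    """Pool pixel features per category, attending to that category's own regions.

    pixel_feats: (B, C, H, W) decoder features; class_embeds: (K, C) classifier
    weights. Pixel-level scores act as spatial attention to pool one image-level
    logit per class, so image-level labels can supervise a pixel-level model.
    """
    b, c, h, w = pixel_feats.shape
    feats = pixel_feats.flatten(2).transpose(1, 2)           # (B, HW, C)
    pixel_logits = feats @ class_embeds.t()                  # (B, HW, K)
    attn = F.softmax(pixel_logits, dim=1)                    # where each class looks
    image_logits = (attn * pixel_logits).sum(dim=1)          # (B, K)
    return image_logits  # train with cross-entropy against the image-level label
```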

Authors:Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Title: A General Framework for Inference-time Scaling and Steering of Diffusion Models
Abstract:
Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman-Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models - even with off-the-shelf rewards - can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering .
中文: 本文提出Feynman-Kac引导方法,这是一种无需训练即可通过奖励函数在推理时指导扩散模型的框架,在文本到图像生成和文本扩散任务中,其效果优于经过微调的模型,显著提升了样本质量和可控性。
English: This paper introduces Feynman-Kac (FK) steering, an inference-time framework that guides diffusion models using reward functions to enhance sample quality and controllability without requiring training, achieving superior results in text-to-image generation and text diffusion tasks compared to fine-tuned models.
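The resampling loop at the core of FK steering resembles a sequential Monte Carlo step: advance all particles, score their intermediate states, and resample in proportion to a potential. The sketch below uses a simple exponential-tilt potential and placeholder `denoise_step`/`reward_fn` callables; the paper studies several choices of potentials, intermediate rewards, and samplers.

```python
import torch

def fk_resample_step(particles, denoise_step, reward_fn, t, lam=1.0):
    """One Feynman-Kac-style steering step over a batch of diffusion particles.

    particles: (N, ...) current latents. denoise_step advances them one step;
    reward_fn scores the intermediate states. The exponential tilt
    exp(lam * reward) is only one possible potential, used here for illustration.
    """
    particles = denoise_step(particles, t)
    rewards = reward_fn(particles)                       # (N,)
    weights = torch.softmax(lam * rewards, dim=0)        # normalized potentials
    idx = torch.multinomial(weights, num_samples=particles.shape[0],
                            replacement=True)            # keep promising particles
    return particles[idx]
```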

Authors:Du Chen, Liyi Chen, Zhengqiang Zhang, Lei Zhang
Title: Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution
Abstract:
Implicit Neural Representations (INR) have been successfully employed for Arbitrary-scale Super-Resolution (ASR). However, INR-based models need to query the multi-layer perceptron module numerous times and render a pixel in each query, resulting in insufficient representation capability and low computational efficiency. Recently, Gaussian Splatting (GS) has shown its advantages over INR in both visual quality and rendering speed in 3D tasks, which motivates us to explore whether GS can be employed for the ASR task. However, directly applying GS to ASR is exceptionally challenging because the original GS is an optimization-based method through overfitting each single scene, while in ASR we aim to learn a single model that can generalize to different images and scaling factors. We overcome these challenges by developing two novel techniques. Firstly, to generalize GS for ASR, we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Each Gaussian can fit the shape and direction of an area of complex textures, showing powerful representation capability. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted continuous Gaussians. Via end-to-end training, our optimized network, namely GSASR, can perform ASR for any image and unseen scaling factors. Extensive experiments validate the effectiveness of our proposed method. The code and models are available at https://github.com/ChrisDud0257/GSASR.
中文摘要:本文提出GSASR方法,通过设计图像条件化的高斯预测架构和高效的尺度感知渲染技术,将高斯溅射成功应用于任意尺度超分辨率任务,解决了泛化性难题。
English Summary: The paper introduces GSASR, a novel method that adapts Gaussian Splatting for Arbitrary-scale Super-Resolution by designing an image-conditioned Gaussian prediction architecture and an efficient scale-aware rasterization technique to overcome generalization challenges.
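Conceptually, rendering in GSASR amounts to evaluating the predicted 2D Gaussians at pixel centers and accumulating their colors. The naive reference below makes that explicit; the actual method uses a differentiable, scale-aware GPU/CUDA rasterizer and a feed-forward network that predicts the Gaussian parameters, neither of which is shown here.

```python
import torch

def render_2d_gaussians(means, inv_covs, colors, opacities, height, width):
    """Naive 2D Gaussian splatting: evaluate every Gaussian at every pixel.

    means: (N, 2) in pixel units, inv_covs: (N, 2, 2), colors: (N, 3),
    opacities: (N,). A reference implementation for clarity only; it is far
    slower and more memory-hungry than a tiled rasterizer.
    """
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)         # (P, 2)
    d = pix[:, None, :] - means[None, :, :]                    # (P, N, 2)
    maha = torch.einsum("pni,nij,pnj->pn", d, inv_covs, d)     # squared distances
    w = opacities[None, :] * torch.exp(-0.5 * maha)            # (P, N)
    img = (w[..., None] * colors[None, :, :]).sum(dim=1)       # accumulate colors
    return img.reshape(height, width, 3)
```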

Authors:Minglong Xue, Shuaibin Fan, Shivakumara Palaiahnakote, Mingliang Zhou
Title: UR2P-Dehaze: Learning a Simple Image Dehaze Enhancer via Unpaired Rich Physical Prior
Abstract:
Image dehazing techniques aim to enhance contrast and restore details, which are essential for preserving visual information and improving image processing accuracy. Existing methods rely on a single manual prior, which cannot effectively reveal image details. To overcome this limitation, we propose an unpaired image dehazing network, called the Simple Image Dehaze Enhancer via Unpaired Rich Physical Prior (UR2P-Dehaze). First, to accurately estimate the illumination, reflectance, and color information of the hazy image, we design a shared prior estimator (SPE) that is iteratively trained to ensure the consistency of illumination and reflectance, generating clear, high-quality images. Additionally, a self-monitoring mechanism is introduced to eliminate undesirable features, providing reliable priors for image reconstruction. Next, we propose Dynamic Wavelet Separable Convolution (DWSC), which effectively integrates key features across both low and high frequencies, significantly enhancing the preservation of image details and ensuring global consistency. Finally, to effectively restore the color information of the image, we propose an Adaptive Color Corrector that addresses the problem of unclear colors. The PSNR, SSIM, LPIPS, FID and CIEDE2000 metrics on the benchmark dataset show that our method achieves state-of-the-art performance. It also contributes to the performance improvement of downstream tasks. The project code will be available at https://github.com/Fan-pixel/UR2P-Dehaze.
中文摘要:UR2P-Dehaze网络通过共享先验估计器和动态小波卷积整合非配对丰富物理先验,有效克服单先验方法的局限,在去雾性能、细节保留和色彩恢复方面达到最优水平。
English Summary: The proposed UR2P-Dehaze network overcomes limitations of single-prior methods by integrating unpaired rich physical priors through shared estimation and dynamic wavelet convolution, achieving state-of-the-art dehazing performance with enhanced detail preservation and color restoration.

Authors:Keyan Chen, Jiafan Zhang, Chenyang Liu, Zhengxia Zou, Zhenwei Shi
Title: RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models
Abstract:
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding through free-format textual input, enabling enhanced scene and object extraction in remote sensing applications. Current research primarily utilizes pre-trained language models to encode textual descriptions and align them with visual modalities, thereby facilitating the expression of relevant visual features. However, these approaches often struggle to establish robust alignments between fine-grained semantic concepts, leading to inconsistent representations across textual and visual information. To address these limitations, we introduce a referring remote sensing image segmentation foundational model, RSRefSeg. RSRefSeg leverages CLIP for visual and textual encoding, employing both global and local textual semantics as filters to generate referring-related visual activation features in the latent space. These activated features then serve as input prompts for SAM, which refines the segmentation masks through its robust visual generalization capabilities. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods, underscoring the effectiveness of foundational models in enhancing multimodal task comprehension. The code is available at \url{https://github.com/KyanChen/RSRefSeg}.
中文摘要:RSRefSeg模型通过结合CLIP的多模态编码与SAM的分割优化能力,显著提升了遥感图像指代分割的精度,在基准测试中凭借更精准的图文特征对齐优于现有方法。
English Summary: The RSRefSeg model enhances referring remote sensing image segmentation by integrating CLIP's multimodal encoding with SAM's segmentation refinement, achieving superior performance on benchmark datasets through improved alignment of textual and visual features.

Authors:Mahmoud Ahmed, Xiang Li, Arpit Prajapati, Mohamed Elhoseiny
Title: 3DCoMPaT200: Language-Grounded Compositional Understanding of Parts and Materials of 3D Shapes
Abstract:
Understanding objects in 3D at the part level is essential for humans and robots to navigate and interact with the environment. Current datasets for part-level 3D object understanding encompass a limited range of categories. For instance, the ShapeNet-Part and PartNet datasets only include 16 and 24 object categories, respectively. The 3DCoMPaT dataset, specifically designed for compositional understanding of parts and materials, contains only 42 object categories. To foster richer and fine-grained part-level 3D understanding, we introduce 3DCoMPaT200, a large-scale dataset tailored for compositional understanding of object parts and materials, with 200 object categories, an object vocabulary $\approx$5 times larger than 3DCoMPaT's, and $\approx$4 times more part categories. Concretely, 3DCoMPaT200 significantly expands upon 3DCoMPaT, featuring 1,031 fine-grained part categories and 293 distinct material classes for compositional application to 3D object parts. Additionally, to address the complexities of compositional 3D modeling, we propose a novel task of Compositional Part Shape Retrieval using ULIP to provide a strong 3D foundational model for 3D Compositional Understanding. This method evaluates the model's shape retrieval performance given one, three, or six parts described in text format. The results show that the model's performance improves with an increasing number of style compositions, highlighting the critical role of the compositional dataset. Such results underscore the dataset's effectiveness in enhancing models' capability to understand complex 3D shapes from a compositional perspective. Code and data can be found at http://github.com/3DCoMPaT200/3DCoMPaT200
中文: 为推进部件级三维物体理解,研究者提出了大规模数据集3DCoMPaT200,涵盖200个物体类别并显著扩展了部件与材料分类,同时通过ULIP框架创新性地引入组合式部件检索任务,证明模型性能随组合复杂度提升而增强。
English: To advance part-level 3D object understanding, the authors introduce 3DCoMPaT200, a large-scale dataset with 200 object categories, significantly expanding part and material classes, and propose a compositional part shape retrieval task using ULIP to enhance model performance with increasing compositional complexity.

Authors:Ji Soo Lee, Jongha Kim, Jeehye Na, Jinyoung Park, Hyunwoo J. Kim
Title: VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
Abstract:
Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decompose a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision to each task that is well-aligned with metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task. Code is available at \url{https://github.com/mlvlab/VidChain}.
中文摘要:提出的VidChain框架通过将复杂任务分解为子任务并将模型训练与评估指标对齐,增强了视频大语言模型的细粒度时序理解能力,在密集视频描述基准测试中实现了更优性能。
English Summary: The proposed VidChain framework enhances VideoLLMs' fine-grained temporal understanding by decomposing complex tasks into sub-tasks and aligning model training with evaluation metrics, achieving superior performance on Dense Video Captioning benchmarks.
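M-DPO builds on Direct Preference Optimization, with preferences derived from evaluation metrics rather than human labels. The sketch below shows the standard DPO objective on one metric-ranked output pair; treating DVC metrics as the preference signal is an assumption of this illustration, and the paper's task-specific application is omitted.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss on a metric-ranked output pair.

    logp_* are sequence log-probabilities under the policy and a frozen
    reference model for the preferred (w) and dispreferred (l) outputs, where
    the preference would come from an evaluation metric in an M-DPO-style setup.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```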

Authors:Yuxin Wang, Qianyi Wu, Dan Xu
Title: F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Aggregative Gaussian Splatting
Abstract:
This paper tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet. The key challenge of this task is learning a robust 3D-aware representation without multi-view or dynamic data, while ensuring consistent texture and geometry across different viewpoints. Although some baseline methods are capable of 3D-aware generation, the quality of the generated images still lags behind state-of-the-art 2D generation approaches, which excel in producing high-quality, detailed images. To address this severe limitation, we propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting, coined as F3D-Gaus, which can produce more realistic and reliable 3D renderings from monocular inputs. In addition, we introduce a self-supervised cycle-aggregative constraint to enforce cross-view consistency in the learned 3D representation. This training strategy naturally allows aggregation of multiple aligned Gaussian primitives and significantly alleviates the interpolation limitations inherent in single-view pixel-aligned Gaussian Splatting. Furthermore, we incorporate video model priors to perform geometry-aware refinement, enhancing the generation of fine details in wide-viewpoint scenarios and improving the model's capability to capture intricate 3D textures. Extensive experiments demonstrate that our approach not only achieves high-quality, multi-view consistent 3D-aware generation from monocular datasets, but also significantly improves training and inference efficiency.

Authors:Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang
Title: Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Abstract:
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture ($\text{C}^3\text{VG}$), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of $\text{C}^3\text{VG}$, which significantly outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at \url{https://github.com/Dmmm1997/C3VG}.
中文: 本文提出C³VG架构,通过粗到精的一致性约束,结合隐式和显式建模解决多任务视觉定位中的歧义问题,在基准数据集上显著优于现有最优方法。
English: The paper introduces C³VG, a coarse-to-fine architecture with consistency constraints that enhances multi-task visual grounding by integrating implicit and explicit modeling to resolve ambiguities between localization and segmentation, achieving state-of-the-art performance on benchmark datasets.

Authors:Ziyang Xie, Zhizheng Liu, Zhenghao Peng, Wayne Wu, Bolei Zhou
Title: Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation
Abstract:
Sim-to-real gap has long posed a significant challenge for robot learning in simulation, preventing the deployment of learned models in the real world. Previous work has primarily focused on domain randomization and system identification to mitigate this gap. However, these methods are often limited by the inherent constraints of the simulation and graphics engines. In this work, we propose Vid2Sim, a novel framework that effectively bridges the sim2real gap through a scalable and cost-efficient real2sim pipeline for neural 3D scene reconstruction and simulation. Given a monocular video as input, Vid2Sim can generate photorealistic and physically interactable 3D simulation environments to enable the reinforcement learning of visual navigation agents in complex urban environments. Extensive experiments demonstrate that Vid2Sim significantly improves the performance of urban navigation in the digital twins and real world by 31.2% and 68.3% in success rate compared with agents trained with prior simulation methods.

Authors:Hengyuan Zhang, David Paz, Yuliang Guo, Xinyu Huang, Henrik I. Christensen, Liu Ren
Title: MapGS: Generalizable Pretraining and Data Augmentation for Online Mapping via Novel View Synthesis
Abstract:
Online mapping reduces the reliance of autonomous vehicles on high-definition (HD) maps, significantly enhancing scalability. However, recent advancements often overlook cross-sensor configuration generalization, leading to performance degradation when models are deployed on vehicles with different camera intrinsics and extrinsics. With the rapid evolution of novel view synthesis methods, we investigate the extent to which these techniques can be leveraged to address the sensor configuration generalization challenge. We propose a novel framework leveraging Gaussian splatting to reconstruct scenes and render camera images in target sensor configurations. The target config sensor data, along with labels mapped to the target config, are used to train online mapping models. Our proposed framework on the nuScenes and Argoverse 2 datasets demonstrates a performance improvement of 18% through effective dataset augmentation, achieves faster convergence and efficient training, and exceeds state-of-the-art performance when using only 25% of the original training data. This enables data reuse and reduces the need for laborious data labeling. Project page at https://henryzhangzhy.github.io/mapgs.

Authors:Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Shanle Yao, Hamed Tabkhi
Title: Exploring Pose-Based Anomaly Detection for Retail Security: A Real-World Shoplifting Dataset and Benchmark
Abstract:
Shoplifting poses a significant challenge for retailers, resulting in billions of dollars in annual losses. Traditional security measures often fall short, highlighting the need for intelligent solutions capable of detecting shoplifting behaviors in real time. This paper frames shoplifting detection as an anomaly detection problem, focusing on the identification of deviations from typical shopping patterns. We introduce PoseLift, a privacy-preserving dataset specifically designed for shoplifting detection, addressing challenges such as data scarcity, privacy concerns, and model biases. PoseLift is built in collaboration with a retail store and contains anonymized human pose data from real-world scenarios. By preserving essential behavioral information while anonymizing identities, PoseLift balances privacy and utility. We benchmark state-of-the-art pose-based anomaly detection models on this dataset, evaluating performance using a comprehensive set of metrics. Our results demonstrate that pose-based approaches achieve high detection accuracy while effectively addressing privacy and bias concerns inherent in traditional methods. As one of the first datasets capturing real-world shoplifting behaviors, PoseLift offers researchers a valuable tool to advance computer vision ethically and will be publicly available to foster innovation and collaboration. The dataset is available at https://github.com/TeCSAR-UNCC/PoseLift.
中文: 本文提出PoseLift这一隐私保护数据集,通过匿名化人体姿态数据将商店盗窃行为作为异常检测问题处理,在解决隐私和偏见问题的同时实现了高检测准确率。
English: This paper introduces PoseLift, a privacy-preserving dataset using anonymized human pose data to detect shoplifting as an anomaly detection problem, achieving high accuracy while addressing privacy and bias concerns.

Authors:Xianwei Zhuang, Zhihong Zhu, Yuxin Xie, Liming Liang, Yuexian Zou
Title: VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
Abstract:
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which significantly impedes their real-world usage. To alleviate VH, various decoding strategies have been proposed to enhance visual information. However, many of these methods may require secondary decoding and rollback, which significantly reduces inference speed. In this work, we propose an efficient plug-and-play decoding algorithm via Visual-Aware Sparsification (VASparse) from the perspective of token sparsity for mitigating VH. VASparse is inspired by empirical observations: (1) the sparse activation of attention in LVLMs, and (2) visual-agnostic tokens sparsification exacerbates VH. Based on these insights, we propose a novel token sparsification strategy that balances efficiency and trustworthiness. Specifically, VASparse implements a visual-aware token selection strategy during decoding to reduce redundant tokens while preserving visual context effectively. Additionally, we innovatively introduce a sparse-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs without the time overhead associated with secondary decoding. Subsequently, VASparse recalibrates attention scores to penalize attention sinking of LVLMs towards text tokens. Extensive experiments across four popular benchmarks confirm the effectiveness of VASparse in mitigating VH across different LVLM families without requiring additional training or post-processing. Impressively, VASparse achieves state-of-the-art performance for mitigating VH while maintaining competitive decoding speed. Code is available at https://github.com/mengchuang123/VASparse-github.
中文: VASparse是一种高效的即插即用解码算法,通过视觉感知的令牌稀疏化和对比解码来减少大型视觉语言模型中的视觉幻觉,在不影响推理速度的情况下实现了最先进的性能。
English: VASparse is an efficient plug-and-play decoding algorithm that mitigates visual hallucinations in large vision-language models by implementing visual-aware token sparsification and contrastive decoding, achieving state-of-the-art performance without compromising inference speed.
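One way to picture visual-aware sparsification is to rank tokens by the attention mass they receive and prune the least-attended visual tokens while keeping text tokens intact. The helper below is such a sketch with an illustrative scoring rule and keep ratio; VASparse's actual selection criterion, its sparse contrastive decoding, and its attention recalibration are not reproduced.

```python
import torch

def select_visual_tokens(attn_maps, visual_mask, keep_ratio=0.5):
    """Rank visual tokens by received attention and keep the top fraction.

    attn_maps: (heads, L, L) attention weights of the current layer;
    visual_mask: (L,) boolean marking image tokens. Returns the positions of
    the visual tokens to retain; the scoring and ratio are illustrative.
    """
    received = attn_maps.mean(dim=0).sum(dim=0)          # attention mass per token
    vis_idx = visual_mask.nonzero(as_tuple=True)[0]
    k = max(1, int(keep_ratio * vis_idx.numel()))
    top = received[vis_idx].topk(k).indices
    return vis_idx[top]                                  # retained visual positions
```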

Authors:Yiheng Li, Yang Yang, Zhen Lei
Title: CoreNet: Conflict Resolution Network for Point-Pixel Misalignment and Sub-Task Suppression of 3D LiDAR-Camera Object Detection
Abstract:
Fusing multi-modality inputs from different sensors is an effective way to improve the performance of 3D object detection. However, current methods overlook two important conflicts: point-pixel misalignment and sub-task suppression. The former means a pixel feature from the opaque object is projected to multiple point features of the same ray in the world space, and the latter means the classification prediction and bounding box regression may cause mutual suppression. In this paper, we propose a novel method named Conflict Resolution Network (CoreNet) to address the aforementioned issues. Specifically, we first propose a dual-stream transformation module to tackle point-pixel misalignment. It consists of ray-based and point-based 2D-to-BEV transformations. Both of them achieve approximately unique mapping from the image space to the world space. Moreover, we introduce a task-specific predictor to tackle sub-task suppression. It uses the dual-branch structure which adopts class-specific query and Bbox-specific query to corresponding sub-tasks. Each task-specific query is constructed of task-specific feature and general feature, which allows the heads to adaptively select information of interest based on different sub-tasks. Experiments on the large-scale nuScenes dataset demonstrate the superiority of our proposed CoreNet, by achieving 75.6\% NDS and 73.3\% mAP on the nuScenes test set without test-time augmentation and model ensemble techniques. The ample ablation study also demonstrates the effectiveness of each component. The code is released on https://github.com/liyih/CoreNet.
Chinese: 提出的冲突解决网络(CoreNet)通过双流变换模块和任务特定预测器,有效解决了多模态3D物体检测中的点像素不对齐和子任务抑制问题,在nuScenes数据集上取得了最优性能。
English: The proposed Conflict Resolution Network (CoreNet) effectively addresses point-pixel misalignment and sub-task suppression in multi-modality 3D object detection through dual-stream transformation and task-specific predictors, achieving state-of-the-art performance on the nuScenes dataset.

Authors:Maomao Li, Lijian Lin, Yunfei Liu, Ye Zhu, Yu Li
Title: Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning
Abstract:
This paper presents Qffusion, a dual-frame-guided framework for portrait video editing. Specifically, we consider a design principle of ``animation for editing'', and train Qffusion as a general animation framework from two still reference images while we can use it for portrait video editing easily by applying modified start and end frames as references during inference. Leveraging the powerful generative power of Stable Diffusion, we propose a Quadrant-grid Arrangement (QGA) scheme for latent re-arrangement, which arranges the latent codes of two reference images and that of four facial conditions into a four-grid fashion, separately. Then, we fuse features of these two modalities and use self-attention for both appearance and temporal learning, where representations at different times are jointly modeled under QGA. Our Qffusion can achieve stable video editing without additional networks or complex training stages, where only the input format of Stable Diffusion is modified. Further, we propose a Quadrant-grid Propagation (QGP) inference strategy, which enjoys a unique advantage on stable arbitrary-length video generation by processing reference and condition frames recursively. Through extensive experiments, Qffusion consistently outperforms state-of-the-art techniques on portrait video editing. Project page: https://qffusion.github.io/page/.

Authors:Mohammad Amin Mirzaee, Hung-Jui Huang, Wenzhen Yuan
Title: GelBelt: A Vision-based Tactile Sensor for Continuous Sensing of Large Surfaces
Abstract:
Scanning large-scale surfaces is widely demanded in surface reconstruction applications and in defect detection during industrial quality control and maintenance. Traditional vision-based tactile sensors have shown promising performance in high-resolution shape reconstruction while suffering limitations such as small sensing areas or susceptibility to damage when slid across surfaces, making them unsuitable for continuous sensing on large surfaces. To address these shortcomings, we introduce a novel vision-based tactile sensor designed for continuous surface sensing applications. Our design uses an elastomeric belt and two wheels to continuously scan the target surface. The proposed sensor showed promising results in both shape reconstruction and surface fusion, indicating its applicability. The dot product of the estimated and reference surface normal maps is reported over the sensing area and for different scanning speeds. Results indicate that the proposed sensor can rapidly scan large-scale surfaces with high accuracy at speeds up to 45 mm/s.
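The reported accuracy metric, the dot product between estimated and reference normal maps averaged over the sensing area, is straightforward to compute; a minimal NumPy version follows as an illustration.

```python
import numpy as np

def mean_normal_agreement(est_normals, ref_normals, eps=1e-8):
    """Mean dot product between estimated and reference surface normal maps.

    Both arrays are (H, W, 3); normals are re-normalized before comparison.
    A value close to 1 means the reconstructed surface orientation matches
    the reference almost everywhere in the sensing area.
    """
    est = est_normals / (np.linalg.norm(est_normals, axis=-1, keepdims=True) + eps)
    ref = ref_normals / (np.linalg.norm(ref_normals, axis=-1, keepdims=True) + eps)
    return float(np.mean(np.sum(est * ref, axis=-1)))
```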

Authors:Huaiguang Cai
Title: CAMs as Shapley Value-based Explainers
Abstract:
Class Activation Mapping (CAM) methods are widely used to visualize neural network decisions, yet their underlying mechanisms remain incompletely understood. To enhance the understanding of CAM methods and improve their explainability, we introduce the Content Reserved Game-theoretic (CRG) Explainer. This framework clarifies the theoretical foundations of GradCAM and HiResCAM by modeling the neural network prediction process as a cooperative game. Within this framework, we develop ShapleyCAM, a new method that leverages gradients and the Hessian matrix to provide more precise and theoretically grounded visual explanations. Due to the computational infeasibility of exact Shapley value calculation, ShapleyCAM employs a second-order Taylor expansion of the cooperative game's utility function to derive a closed-form expression. Additionally, we propose the Residual Softmax Target-Class (ReST) utility function to address the limitations of pre-softmax and post-softmax scores. Extensive experiments across 12 popular networks on the ImageNet validation set demonstrate the effectiveness of ShapleyCAM and its variants. Our findings not only advance CAM explainability but also bridge the gap between heuristic-driven CAM methods and compute-intensive Shapley value-based methods. The code is available at \url{https://github.com/caihuaiguang/pytorch-shapley-cam}.
中文摘要:本研究提出CRG理论框架和ShapleyCAM方法,通过将神经网络预测建模为合作博弈,并采用梯度-海森矩阵近似计算沙普利值,显著提升了类激活映射技术的可解释性。
English Summary: The study introduces the CRG theoretical framework and ShapleyCAM method to enhance the explainability of CAM techniques by modeling neural network predictions as cooperative games and using gradient-Hessian approximations for efficient Shapley value computation.
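A hedged sketch of the second-order ingredient described above: channel weights for a CAM built from both the gradient and a Hessian-vector product of a softmax-normalized class score with respect to the activations, obtained by double backpropagation. The tiny CNN, the choice of score, and the exact weighting formula are illustrative assumptions rather than the authors' closed-form ShapleyCAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        a = self.features(x)                    # (B, 32, H, W) activations
        logits = self.head(a.mean(dim=(2, 3)))  # global average pooling
        return logits, a

def second_order_cam(model, image, target_class):
    logits, acts = model(image)
    # A softmax-normalized target score keeps the utility nonlinear in the
    # activations, so second-order terms are non-trivial (an assumption).
    score = F.log_softmax(logits, dim=1)[0, target_class]
    grad = torch.autograd.grad(score, acts, create_graph=True)[0]
    # Hessian-vector product H @ acts via double backprop (vector is detached).
    hvp = torch.autograd.grad((grad * acts.detach()).sum(), acts)[0]
    weights = (grad + 0.5 * hvp).mean(dim=(2, 3), keepdim=True)  # channel weights
    cam = F.relu((weights.detach() * acts.detach()).sum(dim=1))  # (B, H, W)
    return cam / (cam.max() + 1e-8)

if __name__ == "__main__":
    model = TinyCNN()
    img = torch.randn(1, 3, 32, 32)
    cam = second_order_cam(model, img, target_class=3)
    print(cam.shape)  # torch.Size([1, 32, 32])
```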

Authors:Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, Yizhi Song, Liu He, Jing Bi, Mingqian Feng, Xinyang Li, Zeliang Zhang, Chenliang Xu
Title: Generative AI for Cel-Animation: A Survey
Abstract:
Traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation
中文摘要:生成式人工智能通过自动完成中间帧生成、上色等繁重工序,正在革新传统赛璐珞动画制作流程,不仅大幅提升效率、降低技术门槛,还让艺术家能更专注于创意表达,尽管在视觉一致性与伦理规范方面仍存在挑战。
English Summary: Generative AI is revolutionizing traditional Cel-Animation by automating labor-intensive processes like inbetweening and colorization, thereby enhancing efficiency and accessibility while allowing artists to focus on creativity, despite ongoing challenges with visual consistency and ethical concerns.

Authors:Mills Staylor, Amirreza Dolatpour Fathkouhi, Md Khairul Islam, Kaleigh O'Hara, Ryan Ghiles Goudjil, Geoffrey Fox, Judy Fox
Title: Scalable Cosmic AI Inference using Cloud Serverless Computing with FMI
Abstract:
Large-scale astronomical image data processing and prediction is essential for astronomers, providing crucial insights into celestial objects, the universe's history, and its evolution. While modern deep learning models offer high predictive accuracy, they often demand substantial computational resources, making them resource-intensive and limiting accessibility. We introduce the Cloud-based Astronomy Inference (CAI) framework to address these challenges. This scalable solution integrates pre-trained foundation models with serverless cloud infrastructure through a Function-as-a-Service (FaaS) Message Interface (FMI). CAI enables efficient and scalable inference on astronomical images without extensive hardware. Using a foundation model for redshift prediction as a case study, our extensive experiments cover user devices, HPC (High-Performance Computing) servers, and Cloud. CAI's significant scalability improvement on large data sizes provides an accessible and effective tool for the astronomy community. The code is accessible at https://github.com/UVA-MLSys/AI-for-Astronomy.
中文: CAI框架利用无服务器云基础设施和预训练基础模型,为天文学图像处理提供了高度可扩展、成本低廉且高效的解决方案,其速度和吞吐量远超传统高性能计算方法。
English: The CAI framework leverages serverless cloud infrastructure and pre-trained foundation models to deliver a highly scalable, cost-effective, and efficient solution for processing astronomical images, significantly outperforming traditional HPC methods in speed and throughput.

Authors:Nirit Alkalay, Roy Orfaig, Ben-Zion Bobrovsky
Title: NextStop: An Improved Tracker For Panoptic LIDAR Segmentation Data
Abstract:
4D panoptic LiDAR segmentation is essential for scene understanding in autonomous driving and robotics, combining semantic and instance segmentation with temporal consistency. Current methods, like 4D-PLS and 4D-STOP, use a tracking-by-detection methodology, employing deep learning networks to perform semantic and instance segmentation on each frame. To maintain temporal consistency, large-size instances detected in the current frame are compared and associated with instances within a temporal window that includes the current and preceding frames. However, their reliance on short-term instance detection, lack of motion estimation, and exclusion of small-sized instances lead to frequent identity switches and reduced tracking performance. We address these issues with the NextStop tracker, which integrates Kalman filter-based motion estimation, data association, and lifespan management, along with a tracklet state concept to improve prioritization. Evaluated using the LiDAR Segmentation and Tracking Quality (LSTQ) metric on the SemanticKITTI validation set, NextStop demonstrated enhanced tracking performance, particularly for small-sized objects like people and bicyclists, with fewer ID switches, earlier tracking initiation, and improved reliability in complex environments. The source code is available at https://github.com/AIROTAU/NextStop.
中文摘要:NextStop跟踪器通过集成运动估计和生命周期管理,改进了4D全景激光雷达分割,在复杂环境中对小尺寸物体实现了更少的身份切换和更优的跟踪性能。
English Summary: The NextStop tracker improves 4D panoptic LiDAR segmentation by incorporating motion estimation and lifespan management, achieving better tracking performance with fewer identity switches for small objects in complex environments.
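A minimal sketch of the kind of Kalman filter-based motion estimation such a tracker integrates: a constant-velocity filter on a 3D instance centroid, whose prediction can be used for data association before the update step. The state layout and noise magnitudes are assumptions.

```python
import numpy as np

class CentroidKalman:
    """Constant-velocity Kalman filter on a 3D instance centroid.

    State x = [px, py, pz, vx, vy, vz]; only the position is observed.
    Noise magnitudes are illustrative assumptions.
    """
    def __init__(self, init_pos, dt=0.1):
        self.x = np.concatenate([init_pos, np.zeros(3)])
        self.P = np.eye(6)
        self.F = np.eye(6); self.F[:3, 3:] = dt * np.eye(3)   # motion model
        self.H = np.zeros((3, 6)); self.H[:, :3] = np.eye(3)  # observe position
        self.Q = 0.01 * np.eye(6)                             # process noise
        self.R = 0.05 * np.eye(3)                             # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P

if __name__ == "__main__":
    kf = CentroidKalman(init_pos=np.array([0.0, 0.0, 0.0]))
    for t in range(5):
        pred = kf.predict()                          # predicted centroid for association
        meas = np.array([0.1 * (t + 1), 0.0, 0.0])   # detected centroid at frame t
        kf.update(meas)
    print(kf.x[:3])
```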

Authors:Gent Wu
Title: Powerful Design of Small Vision Transformer on CIFAR10
Abstract:
Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark. We systematically evaluate the impact of data augmentation, patch token initialization, low-rank compression, and multi-class token strategies on model performance. Our experiments reveal that low-rank compression of queries in Multi-Head Latent Attention (MLA) incurs minimal performance loss, indicating redundancy in ViTs. Additionally, introducing multiple CLS tokens improves global representation capacity, boosting accuracy. These findings provide a comprehensive framework for optimizing Tiny ViTs, offering practical insights for efficient and effective designs. Code is available at https://github.com/erow/PoorViTs.
中文: 本文通过评估低秩压缩和多类别令牌等策略,系统优化了适用于小数据集的微型视觉变换器,发现压缩查询带来的性能损失极小,且引入多个CLS令牌能提升全局表示能力与准确率。
English: This paper systematically optimizes Tiny Vision Transformers for small datasets by evaluating strategies like low-rank compression and multi-class tokens, finding minimal performance loss with compressed queries and enhanced accuracy through multiple CLS tokens.
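A small sketch of the low-rank query compression finding: a single-head self-attention block whose query projection passes through a rank-r bottleneck, so the query path stores far fewer parameters. The dimensions and the single-head simplification are assumptions; the paper's Multi-Head Latent Attention details may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankQueryAttention(nn.Module):
    """Single-head self-attention whose query projection is factorized
    through a rank-r bottleneck (dim -> r -> dim)."""
    def __init__(self, dim=192, rank=32):
        super().__init__()
        self.q_down = nn.Linear(dim, rank, bias=False)   # compress queries
        self.q_up = nn.Linear(rank, dim, bias=False)     # expand back
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x):                 # x: (B, N, dim) patch/CLS tokens
        q = self.q_up(self.q_down(x))     # low-rank query path
        attn = F.softmax(q @ self.k(x).transpose(-2, -1) * self.scale, dim=-1)
        return attn @ self.v(x)

if __name__ == "__main__":
    tokens = torch.randn(2, 65, 192)      # e.g. 64 patches + 1 CLS token
    out = LowRankQueryAttention()(tokens)
    print(out.shape)                      # torch.Size([2, 65, 192])
```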

Authors:Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov
Title: Multi-subject Open-set Personalization in Video Generation
Abstract:
Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
中文摘要:Video Alchemist 是一种创新的视频个性化模型,通过扩散Transformer架构和创新的数据流程,实现了无需测试时优化的多主体开放集视频定制,有效解决了数据集构建和评估难题。
English Summary: Video Alchemist is a novel video personalization model that enables multi-subject, open-set customization of both foreground and background elements without test-time optimization, utilizing a Diffusion Transformer architecture and an innovative data pipeline to overcome dataset and evaluation challenges.

Authors:Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang
Title: VideoAuteur: Towards Long Narrative Video Generation
Abstract:
Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/

Authors:Cecilia Curreli, Dominik Muhle, Abhishek Saroha, Zhenzhang Ye, Riccardo Marin, Daniel Cremers
Title: Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction
Abstract:
Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on real-world datasets, outperforming various baselines across multiple evaluation metrics. Visit our project page at https://ceveloper.github.io/publications/skeletondiffusion/ .
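A hedged sketch of nonisotropic diffusion noise for a skeleton: instead of i.i.d. per-joint Gaussians, noise is drawn with a joint-joint covariance built from the kinematic adjacency, so connected joints receive correlated perturbations. The covariance construction is an illustrative assumption, not the paper's exact formulation.

```python
import torch

def skeleton_covariance(adjacency, strength=0.5):
    """Build a joint-joint covariance that correlates connected joints.

    `adjacency` is a (J, J) 0/1 matrix of the kinematic tree; the result is
    shrunk toward the identity until it is positive definite.
    """
    J = adjacency.shape[0]
    cov = torch.eye(J) + strength * (adjacency + adjacency.T) / 2
    while torch.linalg.eigvalsh(cov).min() <= 0:
        cov = 0.5 * (cov + torch.eye(J))
    return cov

def nonisotropic_noise(batch, cov, coord_dim=3):
    """Sample (batch, J, coord_dim) Gaussian noise with joint covariance `cov`."""
    L = torch.linalg.cholesky(cov)                        # (J, J)
    white = torch.randn(batch, cov.shape[0], coord_dim)   # i.i.d. noise
    return torch.einsum("ij,bjc->bic", L, white)          # correlate joints

if __name__ == "__main__":
    adj = torch.zeros(4, 4)
    for a, b in [(0, 1), (1, 2), (2, 3)]:                 # a tiny 4-joint chain
        adj[a, b] = 1
    cov = skeleton_covariance(adj)
    eps = nonisotropic_noise(batch=8, cov=cov)
    print(eps.shape)                                      # torch.Size([8, 4, 3])
```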

Authors:Oindrila Saha, Logan Lawrence, Grant Van Horn, Subhransu Maji
Title: Generate, Transduct, Adapt: Iterative Transduction with VLMs
Abstract:
Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP yields an average performance improvement of 8.6% and 3.7% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively, in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by transductive learning.
中文摘要:GTA-CLIP通过语言模型监督实现语言与视觉空间的联合转导,采用迭代式属性增强和编码器微调方法,在零样本学习中显著提升了分类准确率。
English Summary: GTA-CLIP introduces iterative language-vision transduction with language model supervision, achieving significant accuracy improvements in zero-shot learning by augmenting attributes and fine-tuning encoders.

Authors:Hongruixuan Chen, Jian Song, Olivier Dietrich, Clifford Broni-Bediako, Weihao Xuan, Junjue Wang, Xinlei Shao, Yimin Wei, Junshi Xia, Cuiling Lan, Konrad Schindler, Naoto Yokoya
Title: BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response
Abstract:
Disaster events occur around the world and cause significant damage to human life and property. Earth observation (EO) data enables rapid and comprehensive building damage assessment (BDA), an essential capability in the aftermath of a disaster to reduce human casualties and to inform disaster relief efforts. Recent research focuses on the development of AI models to achieve accurate mapping of unseen disaster events, mostly using optical EO data. However, solutions based on optical data are limited to clear skies and daylight hours, preventing a prompt response to disasters. Integrating multimodal (MM) EO data, particularly the combination of optical and SAR imagery, makes it possible to provide all-weather, day-and-night disaster responses. Despite this potential, the development of robust multimodal AI models has been constrained by the lack of suitable benchmark datasets. In this paper, we present a BDA dataset using veRy-hIGH-resoluTion optical and SAR imagery (BRIGHT) to support AI-based all-weather disaster response. To the best of our knowledge, BRIGHT is the first open-access, globally distributed, event-diverse MM dataset specifically curated to support AI-based disaster response. It covers five types of natural disasters and two types of man-made disasters across 14 regions worldwide, with a particular focus on developing countries where external assistance is most needed. The optical and SAR imagery in BRIGHT, with a spatial resolution between 0.3 and 1 meter, provides detailed representations of individual buildings, making it ideal for precise BDA. In our experiments, we have tested seven advanced AI models trained on BRIGHT to validate their transferability and robustness. The dataset and code are available at https://github.com/ChenHongruixuan/BRIGHT. BRIGHT also serves as the official dataset for the 2025 IEEE GRSS Data Fusion Contest.
中文:BRIGHT数据集通过提供高分辨率光学与SAR影像,支持全天候建筑物损毁评估,克服了单一光学数据的局限,为基于人工智能的全球多样化灾害响应提供了关键数据支撑。
English: The BRIGHT dataset provides high-resolution optical and SAR imagery to enable all-weather building damage assessment, addressing the limitations of optical-only methods and supporting AI-driven disaster response across diverse global events.

Authors:David Bojanić, Stefanie Wuhrer, Tomislav Petković, Tomislav Pribanić
Title: Pose-independent 3D Anthropometry from Sparse Data
Abstract:
3D digital anthropometry is the study of estimating human body measurements from 3D scans. Precise body measurements are important health indicators in the medical industry, and guiding factors in the fashion, ergonomic and entertainment industries. The measuring protocol consists of scanning the whole subject in the static A-pose, which is maintained without breathing or movement during the scanning process. However, the A-pose is not easy to maintain during the whole scanning process, which can last even up to a couple of minutes. This constraint affects the final quality of the scan, which in turn affects the accuracy of the estimated body measurements obtained from methods that rely on dense geometric data. Additionally, this constraint makes it impossible to develop a digital anthropometry method for subjects unable to assume the A-pose, such as those with injuries or disabilities. We propose a method that can obtain body measurements from sparse landmarks acquired in any pose. We make use of the sparse landmarks of the posed subject to create pose-independent features, and train a network to predict the body measurements as taken from the standard A-pose. We show that our method achieves comparable results to competing methods that use dense geometry in the standard A-pose, but has the capability of estimating the body measurements from any pose using sparse landmarks only. Finally, we address the lack of open-source 3D anthropometry methods by making our method available to the research community at https://github.com/DavidBoja/pose-independent-anthropometry.
中文: 本研究提出了一种姿态无关的人体测量方法,利用任意姿态下的稀疏关键点估算身体尺寸,其精度与标准A姿态方法相当,可适用于行动不便者,并将该方法开源供研究社区使用。
English: This study introduces a pose-independent method for estimating human body measurements using sparse landmarks from any pose, achieving accuracy comparable to standard A-pose techniques while enabling applications for individuals with mobility limitations and making the approach publicly available.
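A minimal sketch of the overall recipe: turn sparse 3D landmarks from an arbitrary pose into pose-independent features and regress A-pose body measurements with a small network. Pairwise landmark distances are used here as a simple illustrative choice of pose-independent features; the landmark count, measurement count, and architecture are assumptions.

```python
import torch
import torch.nn as nn

def pairwise_distance_features(landmarks):
    """landmarks: (B, L, 3) 3D landmark positions in any pose.
    Returns (B, L*(L-1)/2) pairwise distances, which are invariant to the
    global rigid pose (though not to articulation; a simplifying assumption)."""
    d = torch.cdist(landmarks, landmarks)                  # (B, L, L)
    iu = torch.triu_indices(d.shape[1], d.shape[2], offset=1)
    return d[:, iu[0], iu[1]]

class MeasurementRegressor(nn.Module):
    def __init__(self, num_landmarks=20, num_measurements=16):
        super().__init__()
        in_dim = num_landmarks * (num_landmarks - 1) // 2
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_measurements),              # A-pose measurements
        )

    def forward(self, landmarks):
        return self.net(pairwise_distance_features(landmarks))

if __name__ == "__main__":
    lm = torch.randn(4, 20, 3)
    print(MeasurementRegressor()(lm).shape)                # torch.Size([4, 16])
```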

Authors:Sauda Adiv Hanum, Ashim Dey, Muhammad Ashad Kabir
Title: An Attention-Guided Deep Learning Approach for Classifying 39 Skin Lesion Types
Abstract:
The skin, as the largest organ of the human body, is vulnerable to a diverse array of conditions collectively known as skin lesions, which encompass various dermatoses. Diagnosing these lesions presents significant challenges for medical practitioners due to the subtle visual differences that are often imperceptible to the naked eye. While not all skin lesions are life-threatening, certain types can act as early indicators of severe diseases, including skin cancers, underscoring the critical need for timely and accurate diagnostic methods. Deep learning algorithms have demonstrated remarkable potential in facilitating the early detection and prognosis of skin lesions. This study advances the field by curating a comprehensive and diverse dataset comprising 39 categories of skin lesions, synthesized from five publicly available datasets. Using this dataset, the performance of five state-of-the-art deep learning models -- MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer -- is rigorously evaluated. To enhance the accuracy and robustness of these models, attention mechanisms such as the Efficient Channel Attention (ECA) and the Convolutional Block Attention Module (CBAM) are incorporated into their architectures. Comprehensive evaluation across multiple performance metrics reveals that the Vision Transformer model integrated with CBAM outperforms others, achieving an accuracy of 93.46%, precision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%. These results underscore the significant potential of the proposed system in supporting medical professionals with accurate and efficient prognostic tools for diagnosing a broad spectrum of skin lesions. The dataset and code used in this study can be found at https://github.com/akabircs/Skin-Lesions-Classification.
Chinese: 本研究通过将注意力机制与五种先进模型相结合,开发了一个用于诊断39种皮肤病变的深度学习系统,结果表明集成CBAM的Vision Transformer模型表现最佳,准确率达93.46%。
English: This study develops a deep learning system for diagnosing 39 types of skin lesions by integrating attention mechanisms with five advanced models, demonstrating that the Vision Transformer with CBAM achieves superior performance with 93.46% accuracy.
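For reference, a compact implementation of the CBAM block mentioned above (channel attention followed by spatial attention, in the spirit of Woo et al.); where exactly the module is inserted into each backbone in this study is not specified here, so the usage below is only illustrative.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by
    spatial attention, applied to a (B, C, H, W) feature map."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

if __name__ == "__main__":
    feat = torch.randn(2, 64, 14, 14)
    print(CBAM(64)(feat).shape)      # torch.Size([2, 64, 14, 14])
```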

Authors:Kuan Liu, Zongyuan Ying, Jie Jin, Dongyan Li, Ping Huang, Wenjian Wu, Zhe Chen, Jin Qi, Yong Lu, Lianfu Deng, Bo Chen
Title: Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers
Abstract:
The conversion from 2D X-ray to 3D shape holds significant potential for improving diagnostic efficiency and safety. However, existing reconstruction methods often rely on hand-crafted features, manual intervention, and prior knowledge, resulting in unstable shape errors and additional processing costs. In this paper, we introduce Swin-X2S, an end-to-end deep learning method for directly reconstructing 3D segmentation and labeling from 2D biplanar orthogonal X-ray images. Swin-X2S employs an encoder-decoder architecture: the encoder leverages a 2D Swin Transformer for X-ray information extraction, while the decoder employs 3D convolution with cross-attention to integrate structural features from orthogonal views. A dimension-expanding module is introduced to bridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to 3D voxels. We evaluate the proposed method through extensive qualitative and quantitative experiments across nine publicly available datasets covering four anatomies (femur, hip, spine, and rib), with a total of 54 categories. Significant improvements over previous methods have been observed not only in the segmentation and labeling metrics but also in the clinically relevant parameters that are of primary concern in practical applications, which demonstrates the promise of Swin-X2S to provide an effective option for anatomical shape reconstruction in clinical scenarios. Code implementation is available at: \url{https://github.com/liukuan5625/Swin-X2S}.
中文: Swin-X2S是一种端到端的深度学习方法,可直接从二维X射线图像重建三维解剖结构,在多个数据集的细分和临床指标上均表现出卓越性能。
English: Swin-X2S is an end-to-end deep learning method that directly reconstructs 3D anatomical shapes from 2D X-ray images, achieving superior performance in segmentation and clinical metrics across multiple datasets.

Authors:Naval Kishore Mehta, Arvind, Himanshu Kumar, Abeer Banerjee, Sumeet Saurav, Sanjay Singh
Title: A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction
Abstract:
Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows -- key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from https://github.com/navalkishoremehta95/MIAM/.
Chinese: 该研究提出了一种新颖的多模态工业活动监测(MIAM)数据集及多模态网络,旨在提升复杂工业流程中操作员行为、参与度和物体交互的检测能力,从而增强人机协作的准确性。
English: The study introduces a novel Multimodal Industrial Activity Monitoring (MIAM) dataset and a multimodal network to enhance the detection of operator actions, engagement, and object interactions in complex industrial workflows, improving accuracy in human-robot collaboration.

Authors:Ziheng Wu, Zhenghao Chen, Ruipu Luo, Can Zhang, Yuan Gao, Zhentao He, Xian Wang, Haoran Lin, Minghui Qiu
Title: Valley2: Exploring Multimodal Models with Scalable Vision-Language Design
Abstract:
Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.
中文: Valley2是一种新型多模态大语言模型,在电商基准测试中取得最优性能,并在少于100亿参数的模型中排名第二,其代码和模型权重已开源。
English: Valley2 is a new multimodal large language model that achieves state-of-the-art performance in e-commerce benchmarks and ranks second among models with fewer than 10B parameters, with its code and model weights being open-sourced.

Authors:Zhifan Song, Yuan Zhang, Abd Al Rahman M. Abu Ebayyeh
Title: EDNet: Edge-Optimized Small Target Detection in UAV Imagery -- Faster Context Attention, Better Feature Fusion, and Hardware Acceleration
Abstract:
Detecting small targets in drone imagery is challenging due to low resolution, complex backgrounds, and dynamic scenes. We propose EDNet, a novel edge-target detection framework built on an enhanced YOLOv10 architecture, optimized for real-time applications without post-processing. EDNet incorporates an XSmall detection head and a Cross Concat strategy to improve feature fusion and multi-scale context awareness for detecting tiny targets in diverse environments. Our unique C2f-FCA block employs Faster Context Attention to enhance feature extraction while reducing computational complexity. The WIoU loss function is employed for improved bounding box regression. With seven model sizes ranging from Tiny to XL, EDNet accommodates various deployment environments, enabling local real-time inference and ensuring data privacy. Notably, EDNet achieves up to a 5.6% gain in mAP@50 with significantly fewer parameters. On an iPhone 12, EDNet variants operate at speeds ranging from 16 to 55 FPS, providing a scalable and efficient solution for edge-based object detection in challenging drone imagery. The source code and pre-trained models are available at: https://github.com/zsniko/EDNet.
中文摘要:EDNet是基于YOLOv10优化的实时目标检测框架,通过改进特征融合和多尺度上下文感知技术,有效提升无人机图像中小目标检测性能,在多种模型尺寸下以更少参数实现更高精度。
English Summary: EDNet is an optimized real-time object detection framework based on YOLOv10 that enhances small target detection in drone imagery through improved feature fusion and multi-scale context awareness, achieving higher accuracy with fewer parameters across multiple model sizes.

Authors:Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Title: VideoRAG: Retrieval-Augmented Generation over Video Corpus
Abstract:
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these issues, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, since the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
中文: VideoRAG是一种创新框架,通过动态检索相关视频并利用大型视频语言模型处理多模态内容来增强回答生成,它通过引入帧选择机制和文本提取策略弥补了现有方法的不足,从而提高了事实准确性。
English: VideoRAG is a novel framework that dynamically retrieves relevant videos and leverages their multimodal content through Large Video Language Models to enhance response generation, addressing gaps in existing methods by incorporating frame selection and text extraction for improved accuracy.
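A toy sketch of the frame-selection idea: score each frame embedding against the query embedding and keep only the top-k most relevant frames, in temporal order, before feeding them to the model. The random embeddings and cosine-similarity scoring are stand-in assumptions, not the system's actual LVLM features.

```python
import torch

def select_frames(frame_embs, query_emb, k=8):
    """frame_embs: (T, D) per-frame embeddings; query_emb: (D,).
    Returns indices of the k frames most similar to the query, in temporal
    order. Cosine similarity is an illustrative scoring choice."""
    frames = torch.nn.functional.normalize(frame_embs, dim=-1)
    query = torch.nn.functional.normalize(query_emb, dim=-1)
    scores = frames @ query
    topk = torch.topk(scores, k=min(k, frame_embs.shape[0])).indices
    return torch.sort(topk).values

if __name__ == "__main__":
    frame_embs = torch.randn(120, 512)    # stand-in for per-frame features
    query_emb = torch.randn(512)          # stand-in for the encoded question
    print(select_frames(frame_embs, query_emb, k=8))
```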

Authors:Xinting Hu, Haoran Wang, Jan Eric Lenssen, Bernt Schiele
Title: PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation
Abstract:
We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. While existing PFD models have advanced significantly, they often overemphasize facial features at the expense of full-body coherence. To address this, PersonaHOI introduces an additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both latent and residual levels, PersonaHOI preserves personalized facial details while ensuring coherent interaction in non-facial regions. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face generation with HOI. Our code will be available at https://github.com/JoyHuYY1412/PersonaHOI
中文: PersonaHOI是一种无需训练的框架,通过将通用StableDiffusion与个性化人脸扩散模型相结合,在保持全身协调性的同时生成身份一致的人物交互图像,其交叉注意力约束和空间融合技术确保了面部细节与交互区域的真实感。
English: PersonaHOI is a training-free framework that integrates StableDiffusion with personalized face diffusion models to generate identity-consistent human-object interaction images while maintaining full-body coherence through cross-attention constraints and spatial merging.

Authors:Sunwoo Kim, Minkyu Kim, Dongmin Park
Title: Test-time Alignment of Diffusion Models without Reward Over-optimization
Abstract:
Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free, test-time method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. Code is available at https://github.com/krafton-ai/DAS.
Chinese: 本文提出了一种无需训练、基于序贯蒙特卡洛的测试时方法,能在保持扩散模型多样性和泛化能力的同时有效对齐目标奖励,其性能优于现有微调方法。
English: This paper introduces a training-free, test-time Sequential Monte Carlo method that effectively aligns diffusion models with target rewards while preserving their diversity and generalization, outperforming existing fine-tuning approaches.
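A toy sketch of the Sequential Monte Carlo ingredient: candidate samples (particles) are weighted by a tempered reward and resampled, so high-reward candidates survive without any fine-tuning of the generator. The transition and the reward below are trivial stand-ins; the paper's sampler is tailored to diffusion sampling and uses tempering schedules.

```python
import torch

def smc_resample_step(particles, reward_fn, temperature=1.0):
    """One reward-weighted resampling step over a set of particles.

    particles: (N, ...) candidate latents; reward_fn maps them to (N,) scores.
    Weights use a tempered softmax; multinomial resampling duplicates
    high-reward particles.
    """
    rewards = reward_fn(particles)                            # (N,)
    weights = torch.softmax(rewards / temperature, dim=0)
    idx = torch.multinomial(weights, num_samples=particles.shape[0],
                            replacement=True)
    return particles[idx]

if __name__ == "__main__":
    torch.manual_seed(0)
    particles = torch.randn(64, 2)                            # toy 2-D "latents"
    target = torch.tensor([1.0, -1.0])
    reward = lambda x: -((x - target) ** 2).sum(dim=-1)       # closer is better
    for _ in range(5):
        particles = particles + 0.1 * torch.randn_like(particles)  # toy transition
        particles = smc_resample_step(particles, reward, temperature=0.5)
    print(particles.mean(dim=0))                              # drifts toward target
```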

Authors:You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
Title: Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Abstract:
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.
中文: 该研究提出了首个多图像精准定位模型Migician,并配套新数据集和基准测试,性能超越现有最佳模型24.94%。
English: The study introduces Migician, the first model for precise multi-image grounding, supported by a new dataset and benchmark, achieving a 24.94% performance improvement over existing models.

Authors:Shangjin Zhai, Zhichao Ye, Jialin Liu, Weijian Xie, Jiaqi Hu, Zhen Peng, Hua Xue, Danpeng Chen, Xiaomeng Wang, Lei Yang, Nan Wang, Haomin Liu, Guofeng Zhang
Title: StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
Abstract:
Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods. Project page: https://zju3dv.github.io/StarGen.

Authors:Seungjoo Shin, Jaesik Park, Sunghyun Cho
Title: Locality-aware Gaussian Compression for Fast and High-quality Rendering
Abstract:
We present LocoGS, a locality-aware 3D Gaussian Splatting (3DGS) framework that exploits the spatial coherence of 3D Gaussians for compact modeling of volumetric scenes. To this end, we first analyze the local coherence of 3D Gaussian attributes, and propose a novel locality-aware 3D Gaussian representation that effectively encodes locally-coherent Gaussian attributes using a neural field representation with a minimal storage requirement. On top of the novel representation, LocoGS is carefully designed with additional components such as dense initialization, an adaptive spherical harmonics bandwidth scheme and different encoding schemes for different Gaussian attributes to maximize compression performance. Experimental results demonstrate that our approach surpasses the rendering quality of existing compact Gaussian representations for representative real-world 3D datasets while achieving 54.6$\times$ to 96.6$\times$ smaller storage size and 2.1$\times$ to 2.4$\times$ faster rendering than 3DGS. Our approach also demonstrates an average 2.4$\times$ higher rendering speed than the state-of-the-art compression method with comparable compression performance.

Authors:Sehyung Kim, Chanhyeong Yang, Jihwan Park, Taehoon Song, Hyunwoo J. Kim
Title: Super-class guided Transformer for Zero-Shot Attribute Classification
Abstract:
Attribute classification is crucial for identifying specific characteristics within image regions. Vision-Language Models (VLMs) have been effective in zero-shot tasks by leveraging their general knowledge from large-scale datasets. Recent studies demonstrate that transformer-based models with class-wise queries can effectively address zero-shot multi-label classification. However, poor utilization of the relationship between seen and unseen attributes limits the model's generalizability. Additionally, attribute classification generally involves many attributes, making it difficult to maintain the model's scalability. To address these issues, we propose Super-class guided transFormer (SugaFormer), a novel framework that leverages super-classes to enhance scalability and generalizability for zero-shot attribute classification. SugaFormer employs Super-class Query Initialization (SQI) to reduce the number of queries, utilizing common semantic information from super-classes, and incorporates Multi-context Decoding (MD) to handle diverse visual cues. To strengthen generalizability, we introduce two knowledge transfer strategies that utilize VLMs. During training, Super-class guided Consistency Regularization (SCR) aligns the model's features with those of VLMs using super-class guided prompts, and during inference, Zero-shot Retrieval-based Score Enhancement (ZRSE) refines predictions for unseen attributes. Extensive experiments demonstrate that SugaFormer achieves state-of-the-art performance across three widely-used attribute classification benchmarks under zero-shot and cross-dataset transfer settings. Our code is available at https://github.com/mlvlab/SugaFormer.
中文摘要:提出的SugaFormer框架通过利用超类减少查询数量提升可扩展性,并借助视觉语言模型的知识迁移增强泛化能力,从而改进零样本属性分类性能。
English Summary: The proposed SugaFormer framework enhances zero-shot attribute classification by leveraging super-classes to improve scalability through query reduction and generalizability via knowledge transfer from Vision-Language Models.

Authors:Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das
Title: From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
Abstract:
Large Vision Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. To address this, we aim to leverage the complementary nature of egocentric views to enhance LVLMs' understanding of exocentric ADL videos. Consequently, we propose ego2exo knowledge distillation to learn ego-augmented exo representations. While effective, this approach requires paired ego-exo videos, which are impractical to collect at scale. To address this, we propose Skeleton-guided Synthetic Ego Generation (SK-EGO), which leverages human skeleton motion to generate synthetic ego views from exocentric videos. To enhance the ego representation of LVLMs trained on synthetic data, we develop a domain-agnostic bootstrapped ego2exo strategy that effectively transfers knowledge from real ego-exo pairs to synthetic ego-exo pairs, while mitigating domain misalignment. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed Ego-in-Exo PerceptionMCQ benchmark designed specifically to assess egocentric understanding from exocentric videos. Code, models, and data will be open-sourced at https://github.com/dominickrei/EgoExo4ADL.
中文: 本研究通过引入自我到外部知识蒸馏和基于骨骼的合成自我视角生成方法,增强大型视觉语言模型对日常生活活动的理解能力,有效解决了配对视频数据难以大规模获取的问题,并通过全面评估验证了从外部视角视频中提取自我视角线索的有效性。
English: This study enhances Large Vision Language Models' understanding of Activities of Daily Living by introducing ego2exo knowledge distillation and a skeleton-guided synthetic ego generation method to overcome the limitation of requiring paired ego-exo videos, with comprehensive evaluations demonstrating improved egocentric cue extraction from exocentric videos.

Authors:Shuolong Chen, Xingxing Li, Liu Yuan, Ziao Liu
Title: eKalibr: Dynamic Intrinsic Calibration for Event Cameras From First Principles of Events
Abstract:
The bio-inspired event camera has garnered extensive research attention in recent years, owing to its significant potential derived from its high dynamic range and low latency characteristics. Similar to the standard camera, the event camera requires precise intrinsic calibration to facilitate further high-level visual applications, such as pose estimation and mapping. While several calibration methods for event cameras have been proposed, most of them are either (i) engineering-driven, heavily relying on conventional image-based calibration pipelines, or (ii) inconvenient, requiring complex instrumentation. To this end, we propose an accurate and convenient intrinsic calibration method for event cameras, named eKalibr, which builds upon a carefully designed event-based circle grid pattern recognition algorithm. To extract target patterns from events, we perform event-based normal flow estimation to identify potential events generated by circle edges, and cluster them spatially. Subsequently, event clusters associated with the same grid circles are matched and grouped using normal flows, for subsequent time-varying ellipse estimation. Fitted ellipse centers are time-synchronized, for final grid pattern recognition. We conducted extensive experiments to evaluate the performance of eKalibr in terms of pattern extraction and intrinsic calibration. The implementation of eKalibr is open-sourced at (https://github.com/Unsigned-Long/eKalibr) to benefit the research community.
Chinese: 本文提出了一种名为eKalibr的事件相机内参标定方法,通过精心设计的事件驱动圆形网格识别算法,实现了精确便捷的标定,解决了现有方法依赖传统流程或需要复杂设备的问题。
English: The paper introduces eKalibr, an accurate and convenient intrinsic calibration method for event cameras that uses a novel event-based circle grid pattern recognition algorithm to overcome the limitations of existing engineering-heavy or complex approaches.

Authors:Anant Mehta, Bryant McArthur, Nagarjuna Kolloju, Zhengzhong Tu
Title: HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection
Abstract:
The rapid progress in deep generative models has led to the creation of incredibly realistic synthetic images that are becoming increasingly difficult to distinguish from real-world data. The widespread use of Variational Models, Diffusion Models, and Generative Adversarial Networks has made it easier to generate convincing fake images and videos, which poses significant challenges for detecting and mitigating the spread of misinformation. As a result, developing effective methods for detecting AI-generated fakes has become a pressing concern. In our research, we propose HFMF, a comprehensive two-stage deepfake detection framework that leverages both hierarchical cross-modal feature fusion and multi-stream feature extraction to enhance detection performance against imagery produced by state-of-the-art generative AI models. The first component of our approach integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. The second component of our framework combines object-level information and a fine-tuned convolutional net model. We then fuse the outputs from both components via an ensemble deep neural net, enabling robust classification performances. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks while maintaining calibration and interoperability.
中文: 深度生成模型已能制作高度逼真的合成图像,给检测带来挑战,而我们提出的HFMF框架通过分层跨模态融合和多流特征提取增强了伪造检测能力,在多种基准测试中实现了卓越性能。
English: Deep generative models have advanced to create highly realistic synthetic images, posing challenges for detection, but our proposed HFMF framework enhances fake detection through hierarchical cross-modal fusion and multi-stream feature extraction, achieving superior performance across benchmarks.

Authors:Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
Title: OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Abstract:
Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
中文: 时间感知能力是在线视频大语言模型区别于离线模型的关键特征,新推出的OVO-Bench通过12项任务和三种场景对此进行评估,揭示了现有模型与人类表现之间的显著差距。
English: Temporal awareness, which enables dynamic reasoning based on question timestamps, is a critical feature distinguishing online video LLMs from offline models, and the newly introduced OVO-Bench evaluates this capability through 12 tasks across three scenarios to reveal current models' significant shortcomings compared to humans.

Authors:Jingyuan Tang, Yuhuan Zhao, Songlin Sun, Yangang Cai
Title: Implicit Guidance and Explicit Representation of Semantic Information in Points Cloud: A Survey
Abstract:
Point clouds, a prominent method of 3D representation, are extensively utilized across industries such as autonomous driving, surveying, electricity, architecture, and gaming, and have been rigorously investigated for their accuracy and resilience. The extraction of semantic information from scenes enhances both human understanding and machine perception. By integrating semantic information from two-dimensional scenes with three-dimensional point clouds, researchers aim to improve the precision and efficiency of various tasks. This paper provides a comprehensive review of the diverse applications and recent advancements in the integration of semantic information within point clouds. We explore the dual roles of semantic information in point clouds, encompassing both implicit guidance and explicit representation, across traditional and emerging tasks. Additionally, we offer a comparative analysis of publicly available datasets tailored to specific tasks and present notable observations. In conclusion, we discuss several challenges and potential issues that may arise in the future when fully utilizing semantic information in point clouds, providing our perspectives on these obstacles. The classified and organized articles related to semantic-based point cloud tasks, together with continuously updated coverage of relevant achievements in different fields, can be accessed at https://github.com/Jasmine-tjy/Semantic-based-Point-Cloud-Tasks.
Chinese: 本文综述了将语义信息与三维点云融合以提高任务精度和效率的研究,探讨了其在传统和新兴应用中的双重作用,同时指出了未来挑战并提供了相关研究的整理资源。
English: This paper reviews the integration of semantic information with 3D point clouds to enhance task precision and efficiency, exploring its dual roles in traditional and emerging applications while addressing future challenges and providing a curated resource of related studies.

Authors:Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan
Title: Efficiently Serving Large Multimodal Models Using EPD Disaggregation
Abstract:
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90-100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.
Chinese: Encode-Prefill-Decode(EPD)解耦框架将多模态模型各处理阶段分配至专用资源,相比集成系统显著提升了内存效率、批处理规模和服务性能。
English: The Encode-Prefill-Decode (EPD) Disaggregation framework separates multimodal model stages onto dedicated resources, significantly improving memory efficiency, batch sizes, and service performance compared to integrated systems.

Authors:Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik
Title: An Empirical Study of Autoregressive Pre-training from Videos
Abstract:
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
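A miniature sketch of the pre-training objective described above: treat a video as a sequence of discrete visual tokens and train a causal transformer with next-token cross-entropy. The model sizes, vocabulary, and tokenization are toy stand-ins, not the Toto configuration.

```python
import torch
import torch.nn as nn

class TinyVideoGPT(nn.Module):
    """A miniature causal transformer over discrete visual tokens, trained to
    predict the next token. Sizes are toy stand-ins for illustration."""
    def __init__(self, vocab=1024, dim=128, layers=2, heads=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                        # tokens: (B, T) int64
        T = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # (B, T, vocab) logits

if __name__ == "__main__":
    model = TinyVideoGPT()
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    tokens = torch.randint(0, 1024, (2, 64))           # tokenized video frames
    logits = model(tokens[:, :-1])                      # predict token t+1 from <= t
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1024),
                                       tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```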

Authors:Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang
Title: ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Abstract:
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python codes to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve the performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers a better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
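A hedged sketch of the kind of image-editing tools such generated code could call: drawing a box, masking a region, and highlighting an area with PIL. The function names and signatures are illustrative assumptions, not the paper's actual tool API.

```python
from PIL import Image, ImageDraw

def draw_box(img, box, color="red", width=3):
    """Draw a rectangle (left, top, right, bottom) to direct attention."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out

def mask_out(img, box, fill=(255, 255, 255)):
    """Cover an irrelevant region so it no longer distracts the model."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill=fill)
    return out

def highlight(img, box, color=(255, 255, 0), alpha=0.35):
    """Blend a translucent color over a region to emphasize it."""
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=color + (int(255 * alpha),))
    return Image.alpha_composite(out, overlay).convert("RGB")

if __name__ == "__main__":
    chart = Image.new("RGB", (320, 200), "white")      # stand-in for a chart image
    step1 = draw_box(chart, (20, 30, 150, 120))        # focus on one column
    step2 = mask_out(step1, (160, 30, 300, 120))       # hide an unrelated column
    step3 = highlight(step2, (20, 30, 150, 60))        # emphasize the header cells
    step3.save("refocus_demo.png")
```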

Authors:Yifan Yu, Shaohui Liu, Rémi Pautrat, Marc Pollefeys, Viktor Larsson
Title: Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
Abstract:
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity from the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances on both modules. Code is available at https://github.com/MarkYu98/madpose.
Chinese: 本文提出了新的求解器和混合流程,有效处理单目深度预测中的仿射模糊性问题,显著提升了相对姿态估计的精度,在多种数据集和设置下均优于传统的基于关键点的方法。
English: This paper introduces novel solvers and a hybrid pipeline that effectively address affine ambiguities in monocular depth predictions to significantly improve relative pose estimation, outperforming traditional keypoint-based methods across various datasets and setups.
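The scale-and-shift ambiguity the solvers account for can be illustrated in isolation: given a monocular depth map and a few reference depths, a closed-form least-squares fit recovers an affine correction. This is a generic alignment sketch on synthetic data, not the paper's pose solvers:

```python
import numpy as np

def fit_affine_depth(pred, ref):
    """Least-squares scale a and shift b such that a * pred + b ≈ ref."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, ref, rcond=None)
    return a, b

rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=200)
# Monocular prediction: correct only up to an unknown scale/shift, plus noise.
pred_depth = 0.4 * true_depth + 1.5 + 0.05 * rng.normal(size=200)

a, b = fit_affine_depth(pred_depth, true_depth)
corrected = a * pred_depth + b
print(f"scale={a:.3f} shift={b:.3f} "
      f"rmse={np.sqrt(np.mean((corrected - true_depth) ** 2)):.4f}")
```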

Authors:Runjie Yan, Yinbo Chen, Xiaolong Wang
Title: Consistent Flow Distillation for Text-to-3D Generation
Abstract:
Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.

Authors:Aniruddha Mahapatra, Long Mai, David Bourgin, Yitian Zhang, Feng Liu
Title: Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
Abstract:
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation of video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to directly training the full model. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a significantly reduced token budget.

Authors:Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin
Title: The GAN is dead; long live the GAN! A Modern GAN Baseline
Abstract:
There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
中文: 本研究通过引入一种原理性的正则化相对损失,反驳了GAN难以训练的普遍观点,该损失无需经验性技巧并支持现代架构,由此产生的极简基线R3GAN在多个数据集上超越了StyleGAN2,并与顶尖模型相媲美。
English: The study refutes the common belief that GANs are hard to train by introducing a principled, regularized relativistic loss that eliminates the need for empirical tricks and enables the use of modern architectures, resulting in a minimalist baseline, R3GAN, which outperforms StyleGAN2 and competes with top models across multiple datasets.
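The relativistic pairing at the heart of the loss compares real and fake discriminator outputs against each other rather than against a fixed target. A minimal sketch of that pairing with a softplus saturation is below; it omits the zero-centered gradient penalties that the paper's regularized loss adds, so it should be read as an illustration of the relativistic form rather than the exact R3GAN objective:

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    # Discriminator wants real logits to exceed paired fake logits.
    return F.softplus(-(real_logits - fake_logits)).mean()

def relativistic_g_loss(real_logits, fake_logits):
    # Generator wants fake logits to exceed paired real logits.
    return F.softplus(-(fake_logits - real_logits)).mean()

real = torch.randn(8, 1) + 1.0   # stand-in discriminator outputs on real data
fake = torch.randn(8, 1) - 1.0   # stand-in discriminator outputs on generated data
print(relativistic_d_loss(real, fake).item(), relativistic_g_loss(real, fake).item())
```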

Authors:Shuliang Ning, Yipeng Qin, Xiaoguang Han
Title: 1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On
Abstract:
Virtual Try-On (VTON) has become a crucial tool in e-commerce, enabling the realistic simulation of garments on individuals while preserving their original appearance and pose. Early VTON methods relied on single generative networks, but challenges remain in preserving fine-grained garment details due to limitations in feature extraction and fusion. To address these issues, recent approaches have adopted a dual-network paradigm, incorporating a complementary "ReferenceNet" to enhance garment feature extraction and fusion. While effective, this dual-network approach introduces significant computational overhead, limiting its scalability for high-resolution and long-duration image/video VTON applications. In this paper, we challenge the dual-network paradigm by proposing a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image and video inputs, enabling them to share the same attention layers in a VTON network. Extensive experimental results demonstrate the effectiveness of our approach, showing that it consistently achieves higher-quality, more detailed results for both image and video VTON tasks. Our results suggest that the single-network paradigm can rival the performance of dual-network approaches, offering a more efficient alternative for high-quality, scalable VTON applications.

Authors:Xinzi Cao, Xiawu Zheng, Guanhong Wang, Weijiang Yu, Yunhang Shen, Ke Li, Yutong Lu, Yonghong Tian
Title: Solving the Catastrophic Forgetting Problem in Generalized Category Discovery
Abstract:
Generalized Category Discovery (GCD) aims to identify a mix of known and novel categories within unlabeled datasets, providing a more realistic setting for image recognition. Essentially, GCD needs to remember existing patterns thoroughly to recognize novel categories. The recent state-of-the-art method SimGCD transfers the knowledge from known-class data to the learning of novel classes through debiased learning. However, some patterns are catastrophically forgotten during adaptation, leading to poor performance in novel category classification. To address this issue, we propose a novel learning approach, LegoGCD, which is seamlessly integrated into previous methods to enhance the discrimination of novel classes while maintaining performance on previously encountered known classes. Specifically, we design two techniques, termed Local Entropy Regularization (LER) and Dual-views Kullback-Leibler divergence constraint (DKL). The LER optimizes the distribution of potential known class samples in unlabeled data, thus ensuring the preservation of knowledge related to known categories while learning novel classes. Meanwhile, DKL introduces Kullback-Leibler divergence to encourage the model to produce similar prediction distributions for two view samples from the same image. In this way, it successfully avoids mismatched predictions and generates more reliable potential known class samples simultaneously. Extensive experiments validate that the proposed LegoGCD effectively addresses the known category forgetting issue across all datasets, e.g., delivering a 7.74% and 2.51% accuracy boost on known and novel classes in CUB, respectively. Our code is available at: https://github.com/Cliffia123/LegoGCD.
中文: 提出的LegoGCD方法通过局部熵正则化和双视图KL散度约束,在广义类别发现中增强新类别的区分能力同时防止已知类别被遗忘,在多个基准数据集上实现了显著的准确率提升。
English: The proposed LegoGCD method enhances novel class discrimination in Generalized Category Discovery by integrating Local Entropy Regularization and Dual-views KL divergence to prevent catastrophic forgetting of known categories, achieving significant accuracy improvements on benchmark datasets.
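The Dual-views constraint amounts to matching the prediction distributions of two augmented views of the same image. A sketch using a symmetric KL on softmax outputs is below; the exact direction and weighting of the divergence in LegoGCD, and the Local Entropy Regularization term, are not reproduced here:

```python
import torch
import torch.nn.functional as F

def dual_view_kl(logits_v1, logits_v2):
    """Symmetric KL between the prediction distributions of two views."""
    p1, p2 = F.softmax(logits_v1, dim=-1), F.softmax(logits_v2, dim=-1)
    kl_12 = F.kl_div(p2.log(), p1, reduction="batchmean")  # KL(p1 || p2)
    kl_21 = F.kl_div(p1.log(), p2, reduction="batchmean")  # KL(p2 || p1)
    return 0.5 * (kl_12 + kl_21)

# Stand-in classifier outputs for two augmentations of the same batch.
logits_a = torch.randn(4, 10, requires_grad=True)
logits_b = logits_a + 0.1 * torch.randn(4, 10)
loss = dual_view_kl(logits_a, logits_b)
loss.backward()
print(loss.item())
```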

Authors:Fabian Hörst, Moritz Rempe, Helmut Becker, Lukas Heine, Julius Keyl, Jens Kleesiek
Title: CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models
Abstract:
Digital Pathology is a cornerstone in the diagnosis and treatment of diseases. A key task in this field is the identification and segmentation of cells in hematoxylin and eosin-stained images. Existing methods for cell segmentation often require extensive annotated datasets for training and are limited to a predefined cell classification scheme. To overcome these limitations, we propose CellViT++, a framework for generalized cell segmentation in digital pathology. CellViT++ utilizes Vision Transformers with foundation models as encoders to compute deep cell features and segmentation masks simultaneously. To adapt to unseen cell types, we rely on a computationally efficient approach. It requires minimal data for training and leads to a drastically reduced carbon footprint. We demonstrate excellent performance on seven different datasets, covering a broad spectrum of cell types, organs, and clinical settings. The framework achieves remarkable zero-shot segmentation and data-efficient cell-type classification. Furthermore, we show that CellViT++ can leverage immunofluorescence stainings to generate training datasets without the need for pathologist annotations. The automated dataset generation approach surpasses the performance of networks trained on manually labeled data, demonstrating its effectiveness in creating high-quality training datasets without expert annotations. To advance digital pathology, CellViT++ is available as an open-source framework featuring a user-friendly, web-based interface for visualization and annotation. The code is available under https://github.com/TIO-IKIM/CellViT-plus-plus.
中文: CellViT++框架采用基于基础模型的视觉变换器,实现了数字病理学中的通用细胞分割,在多种数据集上表现出色,仅需少量训练数据即可完成零样本分割。
English: The proposed CellViT++ framework utilizes Vision Transformers with foundation models to achieve generalized cell segmentation in digital pathology, demonstrating excellent performance across diverse datasets while requiring minimal training data and enabling zero-shot segmentation.

Authors:Tianci Wen, Zhiang Liu, Yongchun Fang
Title: SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding
Abstract:
3D Gaussian splatting (3D-GS) has recently revolutionized novel view synthesis in the simultaneous localization and mapping (SLAM) problem. However, most existing algorithms fail to fully capture the underlying structure, resulting in structural inconsistency. Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian Splatting SLAM, which achieves high-quality photorealistic mapping. Our main contributions are two-fold. First, we propose a structure-enhanced photorealistic mapping (SEPM) framework that, for the first time, leverages highly structured point clouds to initialize structured 3D Gaussians, leading to significant improvements in rendering quality. Second, we propose Appearance-from-Motion embedding (AfME), enabling 3D Gaussians to better model image appearance variations across different camera poses. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that SEGS-SLAM significantly outperforms state-of-the-art (SOTA) methods in photorealistic mapping quality, e.g., an improvement of 19.86% in PSNR over MonoGS on the TUM RGB-D dataset for monocular cameras. The project page is available at https://segs-slam.github.io/.

Authors:Xueyi Ke, Satoshi Tsutsui, Yayun Zhang, Bihan Wen
Title: Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
Abstract:
Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic skills. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a recently published model in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We perform neuron labeling to identify visual concept neurons hidden in the model's internal representations. We then demonstrate that these neurons can recognize objects beyond the model's original vocabulary. Furthermore, we compare the differences in representation between infant models and those in modern computer vision models, such as CLIP and ImageNet pre-trained models. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs. Project page is available at https://kexueyi.github.io/webpage-discover-hidden-visual-concepts.

Authors:Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Linfeng Zhang, Siteng Huang, Honggang Chen
Title: Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
Abstract:
Large vision-language models (LVLMs) excel at visual understanding, but face efficiency challenges due to quadratic complexity in processing long multi-modal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of high-resolution LVLMs with dynamic cropping. Existing methods treat all tokens uniformly, but our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary nature between thumbnails and crops, and the distinctive characteristics across different crops. Based on our observations, we propose "Global Compression Commander" (GlobalCom²), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom² leverages the thumbnail as the "commander" to guide the compression of local crops, adaptively preserving informative details while eliminating redundancy. Extensive experiments show that GlobalCom² maintains over 90% performance while compressing 90% of visual tokens, reducing FLOPs and peak memory to 9.1% and 60%. Our code is available at https://github.com/xuyang-liu16/GlobalCom2.
Chinese: 本文提出GlobalCom²框架,通过全局缩略图指导高分辨率视觉语言模型中局部裁剪区域的动态压缩,在保留90%以上性能的同时将计算成本降低高达90%。
English: The paper introduces GlobalCom², a token compression framework that uses global thumbnails to guide adaptive compression of local crops in high-resolution vision-language models, achieving over 90% performance retention while reducing computational costs by up to 90%.
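The thumbnail-as-commander idea can be sketched as scoring each local-crop token against a pooled thumbnail representation and keeping only the top fraction. The cosine-similarity scoring and the 10% keep ratio below are illustrative assumptions, not the paper's exact informativeness measure:

```python
import torch

def prune_crop_tokens(crop_tokens, thumb_tokens, keep_ratio=0.1):
    """Keep the crop tokens most similar to the global thumbnail context."""
    thumb_ctx = thumb_tokens.mean(dim=0)                      # pooled thumbnail, (d,)
    scores = torch.nn.functional.cosine_similarity(
        crop_tokens, thumb_ctx.unsqueeze(0), dim=-1)          # one score per token
    k = max(1, int(keep_ratio * crop_tokens.size(0)))
    keep = scores.topk(k).indices.sort().values               # preserve spatial order
    return crop_tokens[keep], keep

crop_tokens = torch.randn(576, 64)    # tokens from one high-resolution crop
thumb_tokens = torch.randn(576, 64)   # tokens from the global thumbnail
kept, idx = prune_crop_tokens(crop_tokens, thumb_tokens, keep_ratio=0.1)
print(kept.shape, idx[:5])
```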

Authors:Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
Title: 3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
Abstract:
The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
中文:3DIS-FLUX框架通过集成FLUX模型扩展了原有3DIS方法,将多实例生成解耦为场景构建与细节渲染两个阶段,在性能和图像质量上均超越了当前基于适配器的先进方法。
English: The 3DIS-FLUX framework extends the original 3DIS method by integrating the FLUX model for enhanced multi-instance generation, achieving superior performance and image quality compared to existing adapter-based approaches by decoupling scene construction and detail rendering.

Authors:Haoyi Xiu, Xin Liu, Taehoon Kim, Kyoung-Sook Kim
Title: Advancing ALS Applications with Large-Scale Pre-training: Dataset Development and Downstream Assessment
Abstract:
The pre-training and fine-tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey's 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The pre-trained models are subsequently fine-tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre-trained models significantly outperform their scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre-training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre-training and fine-tuning paradigm. The source code and pre-trained models will be made publicly available at https://github.com/martianxiu/ALS_pretraining.
中文: 本研究通过创新的地理空间采样方法构建了一个大规模机载激光扫描数据集,并证明在该数据集上预训练的模型在树种分类和语义分割等下游任务中性能显著提升。
English: This study develops a large-scale airborne laser scanning dataset using a novel geospatial sampling method and demonstrates that pre-training models on it significantly enhances performance in downstream tasks like tree species classification and semantic segmentation.

Authors:Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing
Title: ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
Abstract:
The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code are available at https://github.com/Rh-Dang/ECBench.
中文: ECBench是一个新颖的基准,旨在通过多样化的第一人称视角视频和30个认知维度,系统评估大型视觉语言模型的具身认知能力,弥补现有数据集的不足,并通过精细人工标注和综合评估体系确保公平性。
English: ECBench is a novel benchmark designed to systematically evaluate the embodied cognitive abilities of large vision-language models using diverse egocentric videos and 30 cognitive dimensions, addressing gaps in current datasets while ensuring fairness through meticulous human annotation and a comprehensive evaluation system.

Authors:Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo
Title: Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
Abstract:
Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via various motion representations, while they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct 3D-aware motion representation from a reference image, manipulate it based on interpreted user instructions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive and consistent visual changes. Then, our framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed approach. For more details and qualitative results, please refer to our anonymous project webpage: https://chen-yingjie.github.io/projects/Perception-as-Control.

Authors:Xiaojie Li, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang
Title: Continuous Knowledge-Preserving Decomposition with Adaptive Layer Selection for Few-Shot Class-Incremental Learning
Abstract:
Few-Shot Class-Incremental Learning (FSCIL) faces a critical challenge: balancing the retention of prior knowledge with the acquisition of new classes. Existing methods either freeze the backbone to prevent catastrophic forgetting, sacrificing plasticity, or add new modules, incurring high costs. These approaches treat pretrained models as black boxes, overlooking two key opportunities to exploit their internal capacity: reusing redundant representational space within layers and selectively adapting layers based on their sensitivity to forgetting. We propose CKPD-FSCIL, a unified framework that unlocks the underutilized capacity of pretrained weights, achieving a superior stability-plasticity balance with zero inference overhead. Our design integrates two continuously adapting mechanisms: At the weight level, a Continuous Knowledge-Preserving Decomposition mechanism uses feature covariance to split each weight matrix into a frozen subspace that safeguards prior knowledge and a learnable, redundant subspace for new tasks. At the layer level, a Continuous Adaptive Layer Selection mechanism leverages an Adapter Sensitivity Ratio to automatically select layers with the highest redundant capacity and lowest forgetting risk for adaptation. By targeting only safe, high-potential subspaces and layers, CKPD-FSCIL enables efficient adaptation. After each session, the learned adapters are merged back into the original weights, ensuring zero additional parameters or FLOPs during inference. Extensive experiments on multiple FSCIL benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both adaptability and knowledge retention. The code is available at https://github.com/xiaojieli0903/CKPD-FSCIL.
中文: 本文提出的CKPD-FSCIL框架通过自适应复用预训练权重中的冗余子空间和选择性更新网络层,解决了小样本类增量学习中的稳定性与可塑性平衡难题,在零推理开销下实现了更优的性能。
English: The proposed CKPD-FSCIL framework addresses the stability-plasticity dilemma in Few-Shot Class-Incremental Learning by adaptively reusing redundant subspaces in pretrained weights and selectively updating layers, achieving superior performance with zero inference overhead.
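A covariance-guided weight split of the kind described above can be sketched by eigendecomposing the feature covariance and dividing a weight matrix's input space into a protected principal subspace and a low-energy residual subspace. The shapes, energy threshold, and projector construction are assumptions for illustration, not the paper's exact decomposition:

```python
import torch

def split_weight_by_covariance(W, feat_cov, keep_energy=0.95):
    """Split W (out x in) into a frozen part acting on the principal feature
    subspace and a trainable part acting on the low-energy residual subspace."""
    evals, evecs = torch.linalg.eigh(feat_cov)          # ascending eigenvalues
    evals, evecs = evals.flip(0), evecs.flip(1)         # sort descending
    energy = torch.cumsum(evals, 0) / evals.sum()
    r = int((energy < keep_energy).sum()) + 1
    P_keep = evecs[:, :r] @ evecs[:, :r].T              # projector: preserved space
    P_free = torch.eye(W.size(1)) - P_keep              # projector: redundant space
    return W @ P_keep, W @ P_free                       # frozen part, adaptable part

feats = torch.randn(512, 64)                            # features from known classes
cov = feats.T @ feats / feats.size(0)
W = torch.randn(128, 64)
W_frozen, W_adapt = split_weight_by_covariance(W, cov)
# W_frozen stays fixed; only W_adapt (the redundant subspace) is tuned on new
# classes, and the sum reconstructs the original mapping before any adaptation.
print(torch.allclose(W, W_frozen + W_adapt, atol=1e-4))
```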

Authors:Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Anna Choromanska
Title: AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data
Abstract:
As opposed to human drivers, current autonomous driving systems still require vast amounts of labeled data to train. Recently, world models have been proposed to simultaneously enhance autonomous driving capabilities by improving the way these systems understand complex real-world environments and reduce their data demands via self-supervised pre-training. In this paper, we present AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a novel self-supervised pre-training framework for autonomous driving with LiDAR data that, as opposed to existing methods, is neither generative nor contrastive. Our method learns spatial world models with a joint embedding predictive architecture. Instead of explicitly generating masked unknown regions, our self-supervised world models predict Bird's Eye View (BEV) embeddings to represent the diverse nature of autonomous driving scenes. Our approach furthermore eliminates the need to manually create positive and negative pairs, as is the case in contrastive learning. AD-L-JEPA leads to simpler implementation and enhanced learned representations. We qualitatively and quantitatively demonstrate the high quality of the embeddings learned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of AD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and associated transfer learning. Our experimental evaluation demonstrates that AD-L-JEPA is a plausible approach for self-supervised pre-training in autonomous driving applications and is the best available approach, outperforming SOTA methods including the recently proposed Occupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at https://github.com/HaoranZhuExplorer/AD-L-JEPA-Release.
中文: 本文提出AD-L-JEPA这一新型自监督预训练框架,通过联合嵌入预测架构从激光雷达数据中学习空间世界模型,无需生成或对比方法,在精度和标签效率上超越了现有最优方法。
English: This paper introduces AD-L-JEPA, a novel self-supervised pre-training framework for autonomous driving that uses a joint embedding predictive architecture to learn spatial world models from LiDAR data, eliminating the need for generative or contrastive methods and outperforming state-of-the-art approaches in accuracy and label efficiency.
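The joint-embedding-predictive recipe, predicting target BEV embeddings for masked regions instead of reconstructing them, can be sketched with an online encoder, an EMA target encoder, and a small predictor. The MLP encoders, mask ratio, and momentum value below are toy assumptions, not the paper's network:

```python
import copy
import torch
import torch.nn as nn

d = 128
context_encoder = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
target_encoder = copy.deepcopy(context_encoder)          # updated by EMA, no grads
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

bev = torch.randn(4, 200, d)                              # toy BEV cell features
mask = torch.rand(4, 200) < 0.5                           # cells hidden from context

ctx = context_encoder(bev * (~mask).unsqueeze(-1))        # visible cells only
with torch.no_grad():
    tgt = target_encoder(bev)                             # full-scene target embeddings
pred = predictor(ctx)
loss = ((pred - tgt)[mask]).pow(2).mean()                 # predict masked embeddings
loss.backward()

# EMA update of the target encoder (momentum value is an assumption).
with torch.no_grad():
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.mul_(0.996).add_(pc, alpha=0.004)
print(loss.item())
```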

Authors:Lei Li, Xinglin Zhang, Jun Liang, Tao Chen
Title: Addressing Domain Shift via Imbalance-Aware Domain Adaptation in Embryo Development Assessment
Abstract:
Deep learning models in medical imaging face dual challenges: domain shift, where models perform poorly when deployed in settings different from their training environment, and class imbalance, where certain disease conditions are naturally underrepresented. We present Imbalance-Aware Domain Adaptation (IADA), a novel framework that simultaneously tackles both challenges through three key components: (1) adaptive feature learning with class-specific attention mechanisms, (2) balanced domain alignment with dynamic weighting, and (3) adaptive threshold optimization. Our theoretical analysis establishes convergence guarantees and complexity bounds. Through extensive experiments on embryo development assessment across four imaging modalities, IADA demonstrates significant improvements over existing methods, achieving up to 25.19% higher accuracy while maintaining balanced performance across classes. In challenging scenarios with low-quality imaging systems, IADA shows robust generalization with AUC improvements of up to 12.56%. These results demonstrate IADA's potential for developing reliable and equitable medical imaging systems for diverse clinical settings. The code is made publicly available at https://github.com/yinghemedical/imbalance-aware_domain_adaptation.
中文:IADA框架通过自适应特征学习、平衡域对齐和阈值优化,有效解决了医学影像中的域偏移和类别不平衡问题,在多种临床场景中实现了高达25.19%的准确率提升和稳健的泛化性能。
English: The IADA framework effectively addresses domain shift and class imbalance in medical imaging through adaptive feature learning, balanced domain alignment, and threshold optimization, achieving up to 25.19% higher accuracy and robust generalization across diverse clinical settings.

Authors:Yapeng Li, Yong Luo, Lefei Zhang, Zengmao Wang, Bo Du
Title: MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification
Abstract:
Transformer has been extensively explored for hyperspectral image (HSI) classification. However, transformer poses challenges in terms of speed and memory usage because of its quadratic computational complexity. Recently, the Mamba model has emerged as a promising approach, which has strong long-distance modeling capabilities while maintaining a linear computational complexity. However, representing the HSI is challenging for the Mamba due to the requirement for an integrated spatial and spectral understanding. To remedy these drawbacks, we propose a novel HSI classification model based on a Mamba model, named MambaHSI, which can simultaneously model long-range interaction of the whole image and integrate spatial and spectral information in an adaptive manner. Specifically, we design a spatial Mamba block (SpaMB) to model the long-range interaction of the whole image at the pixel-level. Then, we propose a spectral Mamba block (SpeMB) to split the spectral vector into multiple groups, mine the relations across different spectral groups, and extract spectral features. Finally, we propose a spatial-spectral fusion module (SSFM) to adaptively integrate spatial and spectral features of an HSI. To the best of our knowledge, this is the first image-level HSI classification model based on Mamba. We conduct extensive experiments on four diverse HSI datasets. The results demonstrate the effectiveness and superiority of the proposed model for HSI classification. This reveals the great potential of Mamba to be the next-generation backbone for HSI models. Code is available at https://github.com/li-yapeng/MambaHSI.
Chinese: 提出的MambaHSI模型通过空间和光谱模块及融合机制,克服了Transformer和Mamba在高光谱图像分类中的不足,实现了卓越性能,展现出作为下一代骨干网络的巨大潜力。
English: The proposed MambaHSI model addresses the limitations of Transformers and Mamba in hyperspectral image classification by integrating spatial and spectral features through specialized blocks and a fusion module, demonstrating superior performance and potential as a next-generation backbone.

Authors:Guannan Lai, Yihui Feng, Xin Yang, Xiaoyu Deng, Hao Yu, Shuyin Xia, Guoyin Wang, Tianrui Li
Title: A New Perspective on Privacy Protection in Federated Learning with Granular-Ball Computing
Abstract:
Federated Learning (FL) facilitates collaborative model training while prioritizing privacy by avoiding direct data sharing. However, most existing articles attempt to address challenges within the model's internal parameters and corresponding outputs, while neglecting to solve them at the input level. To address this gap, we propose a novel framework called Granular-Ball Federated Learning (GrBFL) for image classification. GrBFL diverges from traditional methods that rely on the finest-grained input data. Instead, it segments images into multiple regions with optimal coarse granularity, which are then reconstructed into a graph structure. We designed a two-dimensional binary search segmentation algorithm based on variance constraints for GrBFL, which effectively removes redundant information while preserving key representative features. Extensive theoretical analysis and experiments demonstrate that GrBFL not only safeguards privacy and enhances efficiency but also maintains robust utility, consistently outperforming other state-of-the-art FL methods. The code is available at https://github.com/AIGNLAI/GrBFL.
中文: 本文提出了一种名为粒度球联邦学习(GrBFL)的新框架,通过粗粒度分割和图像重构为图结构,在保护隐私的同时提高了联邦学习的效率和性能,优于现有先进方法。
English: This paper introduces Granular-Ball Federated Learning (GrBFL), a novel framework that processes images through coarse-grained segmentation and graph reconstruction to enhance privacy, efficiency, and utility in federated learning, outperforming existing methods.
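The variance-constrained segmentation can be approximated by recursively bisecting a region along its longer side until the pixel variance drops below a threshold, so each surviving region becomes one coarse granule. This is a simplified reading of the paper's two-dimensional binary-search algorithm, with an arbitrary threshold and a toy image:

```python
import numpy as np

def split_regions(img, y0, y1, x0, x1, var_thresh, out):
    """Recursively bisect a region along its longer side until pixel variance
    falls below var_thresh; each surviving region is one coarse granule."""
    patch = img[y0:y1, x0:x1]
    if patch.var() <= var_thresh or min(y1 - y0, x1 - x0) <= 2:
        out.append(((y0, y1, x0, x1), float(patch.mean())))
        return
    if (y1 - y0) >= (x1 - x0):
        mid = (y0 + y1) // 2
        split_regions(img, y0, mid, x0, x1, var_thresh, out)
        split_regions(img, mid, y1, x0, x1, var_thresh, out)
    else:
        mid = (x0 + x1) // 2
        split_regions(img, y0, y1, x0, mid, var_thresh, out)
        split_regions(img, y0, y1, mid, x1, var_thresh, out)

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0 + 0.05 * rng.normal(size=(32, 32))   # one textured object

granules = []
split_regions(img, 0, 64, 0, 64, var_thresh=0.01, out=granules)
print(len(granules), "granules; e.g.", granules[0])
# Each (bounds, mean) pair becomes a node of the graph used for training,
# which is far coarser than sharing raw pixel-level inputs.
```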

Authors:Sun-Hyuk Choi, Hayoung Jo, Seong-Whan Lee
Title: Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation
Abstract:
Referring video object segmentation aims to segment objects within a video corresponding to a given text description. Existing transformer-based temporal modeling approaches face challenges related to query inconsistency and the limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of the video. The limited consideration of context leads to the segmentation of incorrect objects by failing to adequately account for the relationship between the given text and instances. To address these issues, we propose the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text-relevant queries by considering multi-context. We applied MTCM to four different models, increasing performance across all of them, particularly achieving 47.6 J&F on the MeViS. Code is available at https://github.com/Choi58/MTCM.
Chinese: 提出的多上下文时序一致性模块(MTCM)通过对齐查询和增强多上下文理解,解决了指代视频对象分割中的查询不一致性和上下文考虑不足问题,在多个模型中提升了性能。
English: The proposed Multi-context Temporal Consistency Module (MTCM) addresses query inconsistency and limited contextual consideration in referring video object segmentation by aligning queries and enhancing multi-context understanding, improving performance across multiple models.

Authors:Zhenghui Zhao, Chen Wu, Lixiang Ru, Di Wang, Hongruixuan Chen, Cuiqun Chen
Title: Plug-and-Play DISep: Separating Dense Instances for Scene-to-Pixel Weakly-Supervised Change Detection in High-Resolution Remote Sensing Images
Abstract:
Existing Weakly-Supervised Change Detection (WSCD) methods often encounter the problem of "instance lumping" under scene-level supervision, particularly in scenarios with a dense distribution of changed instances (i.e., changed objects). In these scenarios, unchanged pixels between changed instances are also mistakenly identified as changed, causing multiple changes to be mistakenly viewed as one. In practical applications, this issue prevents the accurate quantification of the number of changes. To address this issue, we propose a Dense Instance Separation (DISep) method as a plug-and-play solution, refining pixel features from a unified instance perspective under scene-level supervision. Specifically, our DISep comprises a three-step iterative training process: 1) Instance Localization: We locate instance candidate regions for changed pixels using high-pass class activation maps. 2) Instance Retrieval: We identify and group these changed pixels into different instance IDs through connectivity searching. Then, based on the assigned instance IDs, we extract corresponding pixel-level features on a per-instance basis. 3) Instance Separation: We introduce a separation loss to enforce intra-instance pixel consistency in the embedding space, thereby ensuring separable instance feature representations. The proposed DISep adds only minimal training cost and no inference cost. It can be seamlessly integrated to enhance existing WSCD methods. We achieve state-of-the-art performance by enhancing three Transformer-based and four ConvNet-based methods on the LEVIR-CD, WHU-CD, DSIFN-CD, SYSU-CD, and CDD datasets. Additionally, our DISep can be used to improve fully-supervised change detection methods. Code is available at https://github.com/zhenghuizhao/Plug-and-Play-DISep-for-Change-Detection.
Chinese: 提出的密集实例分离(DISep)方法通过三步迭代训练过程从统一实例角度优化像素特征,有效解决了弱监督变化检测中的“实例堆积”问题,在多个数据集上以最小额外成本实现了最先进的性能。
English: The proposed Dense Instance Separation (DISep) method effectively addresses the "instance lumping" issue in Weakly-Supervised Change Detection by refining pixel features through a three-step iterative process, achieving state-of-the-art performance across multiple datasets with minimal additional cost.
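Steps 2 and 3 of the pipeline, instance retrieval by connectivity and a separation loss pulling each instance's pixel embeddings toward their mean, can be sketched as follows. The threshold, feature shapes, and the plain pull-to-mean loss are illustrative simplifications of the paper's formulation:

```python
import torch
from scipy import ndimage

def instance_separation_loss(pixel_feats, change_prob, thresh=0.5):
    """Pull the embeddings of each changed instance toward that instance's mean."""
    mask = (change_prob > thresh).cpu().numpy()
    labels, n = ndimage.label(mask)                     # instance retrieval
    labels = torch.from_numpy(labels).to(pixel_feats.device)
    loss, count = pixel_feats.new_zeros(()), 0
    for inst_id in range(1, n + 1):
        inst = labels == inst_id
        feats = pixel_feats[inst]                       # (n_pix, d) for one instance
        loss = loss + (feats - feats.mean(0, keepdim=True)).pow(2).mean()
        count += 1
    return loss / max(count, 1)

H, W, d = 32, 32, 16
pixel_feats = torch.randn(H, W, d, requires_grad=True)
change_prob = torch.zeros(H, W)
change_prob[4:10, 4:10] = 0.9                            # two separate changed objects
change_prob[20:26, 18:30] = 0.8
loss = instance_separation_loss(pixel_feats, change_prob)
loss.backward()
print(loss.item())
```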

Authors:Golriz Hosseinimanesh, Farnoosh Ghadiri, Francois Guibault, Farida Cheriet, Julia Keren
Title: From Mesh Completion to AI Designed Crown
Abstract:
Designing a dental crown is a time-consuming and labor-intensive process. Our goal is to simplify crown design and minimize the tediousness of making manual adjustments while still ensuring the highest level of accuracy and consistency. To this end, we present a new end-to-end deep learning approach, coined Dental Mesh Completion (DMC), to generate a crown mesh conditioned on a point cloud context. The dental context includes the tooth prepared to receive a crown and its surroundings, namely the two adjacent teeth and the three closest teeth in the opposing jaw. We formulate crown generation in terms of completing this point cloud context. A feature extractor first converts the input point cloud into a set of feature vectors that represent local regions in the point cloud. The set of feature vectors is then fed into a transformer to predict a new set of feature vectors for the missing region (crown). Subsequently, a point reconstruction head, followed by a multi-layer perceptron, is used to predict a dense set of points with normals. Finally, a differentiable point-to-mesh layer serves to reconstruct the crown surface mesh. We compare our DMC method to a graph-based convolutional neural network which learns to deform a crown mesh from a generic crown shape to the target geometry. Extensive experiments on our dataset demonstrate the effectiveness of our method, which attains an average Chamfer Distance of 0.062. The code is available at: https://github.com/Golriz-code/DMC.gi
中文: 本文提出DMC深度学习模型,通过点云数据自动生成牙冠网格,在保证精度的同时显著简化了传统牙冠设计的繁琐流程。
English: This paper introduces Dental Mesh Completion (DMC), an end-to-end deep learning method that automatically generates dental crown meshes from point cloud data to streamline design while maintaining high accuracy.

Authors:Hafiz Mughees Ahmad, Dario Morle, Afshin Rahimi
Title: LayerMix: Enhanced Data Augmentation through Fractal Integration for Robust Deep Learning
Abstract:
Deep learning models have demonstrated remarkable performance across various computer vision tasks, yet their vulnerability to distribution shifts remains a critical challenge. Despite sophisticated neural network architectures, existing models often struggle to maintain consistent performance when confronted with Out-of-Distribution (OOD) samples, including natural corruptions, adversarial perturbations, and anomalous patterns. We introduce LayerMix, an innovative data augmentation approach that systematically enhances model robustness through structured fractal-based image synthesis. By meticulously integrating structural complexity into training datasets, our method generates semantically consistent synthetic samples that significantly improve neural network generalization capabilities. Unlike traditional augmentation techniques that rely on random transformations, LayerMix employs a structured mixing pipeline that preserves original image semantics while introducing controlled variability. Extensive experiments across multiple benchmark datasets, including CIFAR-10, CIFAR-100, ImageNet-200, and ImageNet-1K, demonstrate LayerMix's superior classification accuracy and substantial gains in critical Machine Learning (ML) safety metrics, including resilience to natural image corruptions, robustness against adversarial attacks, improved model calibration, and enhanced prediction consistency. LayerMix represents a significant advancement toward developing more reliable and adaptable artificial intelligence systems by addressing the fundamental challenges of deep learning generalization. The code is available at https://github.com/ahmadmughees/layermix.
中文: LayerMix是一种基于分形结构的创新数据增强方法,通过生成语义一致的合成图像系统提升深度学习模型的泛化能力,在多个基准测试中显著提高了分类准确率和机器学习安全指标。
English: LayerMix is a structured fractal-based data augmentation method that enhances deep learning model robustness by generating semantically consistent synthetic images, significantly improving generalization and safety metrics across multiple benchmarks.

Authors:Yachuan Li, Xavier Soria Poma, Yun Bai, Qian Xiao, Chaozhi Yang, Guanlin Li, Zongmin Li
Title: EDMB: Edge Detector with Mamba
Abstract:
Transformer-based models have made significant progress in edge detection, but their high computational cost is prohibitive. Recently, vision Mamba has shown an excellent ability to efficiently capture long-range dependencies. Drawing inspiration from this, we propose a novel edge detector with Mamba, termed EDMB, to efficiently generate high-quality multi-granularity edges. In EDMB, Mamba is combined with a global-local architecture, so it can focus on both global information and fine-grained cues. The fine-grained cues play a crucial role in edge detection, but are usually ignored by ordinary Mamba. We design a novel decoder to construct learnable Gaussian distributions by fusing global features and fine-grained features. The multi-granularity edges are then generated by sampling from the distributions. In order to make multi-granularity edges applicable to single-label data, we introduce an Evidence Lower Bound loss to supervise the learning of the distributions. On the multi-label dataset BSDS500, our proposed EDMB achieves competitive single-granularity ODS 0.837 and multi-granularity ODS 0.851 without multi-scale testing or extra PASCAL-VOC data. Remarkably, EDMB can be extended to single-label datasets such as NYUDv2 and BIPED. The source code is available at https://github.com/Li-yachuan/EDMB.
中文: 提出的EDMB模型结合视觉Mamba与全局-局部架构,能高效生成高质量多粒度边缘,在BSDS500上表现优异,并可有效扩展至单标签数据集。
English: The proposed EDMB model leverages vision Mamba with a global-local architecture to efficiently produce high-quality multi-granularity edges, achieving competitive performance on BSDS500 and extending effectively to single-label datasets.
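The decoder's learnable Gaussian distributions can be sketched as per-pixel mean and log-variance heads, reparameterized sampling to obtain multiple edge maps, and a KL term as in an evidence lower bound. The layer sizes, KL weight, and plain BCE reconstruction term are assumptions; the Mamba backbone and fusion design are omitted entirely:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEdgeHead(nn.Module):
    """Predict per-pixel edge distributions and sample multi-granularity edges."""
    def __init__(self, in_ch=32):
        super().__init__()
        self.mu = nn.Conv2d(in_ch, 1, 1)
        self.logvar = nn.Conv2d(in_ch, 1, 1)

    def forward(self, fused_feats, n_samples=3):
        mu, logvar = self.mu(fused_feats), self.logvar(fused_feats)
        std = torch.exp(0.5 * logvar)
        samples = [torch.sigmoid(mu + std * torch.randn_like(std))
                   for _ in range(n_samples)]            # multi-granularity edge maps
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).mean()
        return samples, kl

head = GaussianEdgeHead()
fused = torch.randn(2, 32, 64, 64)                        # stand-in fused features
edge_label = (torch.rand(2, 1, 64, 64) > 0.9).float()
edges, kl = head(fused)
recon = sum(F.binary_cross_entropy(e, edge_label) for e in edges) / len(edges)
loss = recon + 1e-3 * kl                                  # ELBO-style objective
loss.backward()
print(loss.item())
```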

Authors:Jiteng Mu, Nuno Vasconcelos, Xiaolong Wang
Title: EditAR: Unified Conditional Generation with Autoregressive Models
Abstract:
Recent progress in controllable image generation and editing is largely driven by diffusion-based methods. Although diffusion models perform exceptionally well in specific tasks with tailored designs, establishing a unified model is still challenging. In contrast, autoregressive models inherently feature a unified tokenized representation, which simplifies the creation of a single foundational model for various tasks. In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image. The model takes both images and instructions as inputs, and predicts the edited images tokens in a vanilla next-token paradigm. To enhance the text-to-image alignment, we further propose to distill the knowledge from foundation models into the autoregressive modeling process. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance to various state-of-the-art task-specific methods. Project page: https://jitengmu.github.io/EditAR/

Authors:Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai
Title: ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Abstract:
Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges for this task: 1) the identity decoupling issue, where directly adopting existing customization methods inevitably mix identity attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which is crucial for training a model that can well represent and decouple various customized concepts in video generation. To address these challenges, we introduce ConceptMaster, a novel framework that effectively addresses the identity decoupling issues while maintaining concept fidelity in video customization. Specifically, we propose to learn decoupled multi-concept embeddings and inject them into diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To overcome the scarcity of high-quality MCVC data, we establish a data construction pipeline, which enables collection of high-quality multi-concept video-entity data pairs across diverse scenarios. A multi-concept video evaluation set is further devised to comprehensively validate our method from three dimensions, including concept fidelity, identity decoupling ability, and video generation quality, across six different concept composition scenarios. Extensive experiments demonstrate that ConceptMaster significantly outperforms previous methods for video customization tasks, showing great potential to generate personalized and semantically accurate content for video diffusion models.
Chinese: ConceptMaster通过解耦多概念嵌入学习和构建高质量视频实体对,有效解决了多概念视频定制中的身份解耦问题和数据稀缺问题,显著优于现有方法。
English: ConceptMaster is a novel framework that overcomes identity decoupling issues and data scarcity in multi-concept video customization by learning decoupled embeddings and constructing high-quality video-entity pairs, significantly outperforming previous methods.

Authors:Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, Tolga Birdal
Title: Grokking at the Edge of Numerical Stability
Abstract:
Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model's predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and ⊥Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods. Code for this paper is available at https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.
中文: 本研究揭示了浮点误差导致的Softmax崩溃是阻碍深度学习模型"顿悟"现象的关键因素,并提出了StableMax激活函数和⊥Grad训练算法,可在无需正则化的情况下实现顿悟。
English: This study identifies Softmax Collapse caused by floating point errors as the key factor preventing grokking in deep learning models and introduces StableMax activation function and ⊥Grad training algorithm to enable grokking without regularization.
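One way to read the ⊥Grad idea is as a projection that removes from each parameter's gradient the component along the parameter itself, so updates cannot reduce the loss merely by rescaling the logits. The sketch below implements that reading and is not the authors' released code:

```python
import torch

def project_out_nlm(model):
    """Remove from each gradient its component along the current weight direction."""
    for p in model.parameters():
        if p.grad is None:
            continue
        w, g = p.detach().flatten(), p.grad.flatten()
        coeff = torch.dot(g, w) / (torch.dot(w, w) + 1e-12)
        p.grad -= (coeff * w).view_as(p.grad)

model = torch.nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
project_out_nlm(model)                                   # then call optimizer.step()
g = model.weight.grad.flatten()
print(torch.dot(g, model.weight.detach().flatten()).item())  # ~0 after projection
```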

Authors:Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi
Title: Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios
Abstract:
While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables VLMs to reason using visual crops corresponding to these relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that our method effectively scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, outperforming CoT prompting.

Authors:Weitian Zhang, Yichao Yan, Sijing Wu, Manwen Liao, Xiaokang Yang
Title: Disentangled Clothed Avatar Generation with Layered Representation
Abstract:
Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. Previous methods have achieved success in generating diverse digital avatars; however, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, the first feed-forward diffusion-based method for generating component-disentangled clothed avatars. To achieve this, we first propose a layered UV feature plane representation, where components are distributed in different layers of the Gaussian-based UV feature plane with corresponding semantic labels. This representation supports high-resolution and real-time rendering, as well as expressive animation including controllable gestures and facial expressions. Based on the well-designed representation, we train a single-stage diffusion model and introduce constraint terms to address the severe occlusion problem of the innermost human body layer. Extensive experiments demonstrate the impressive performance of our method in generating disentangled clothed avatars, and we further explore its applications in component transfer. The project page is available at: https://olivia23333.github.io/LayerAvatar/
中文: LayerAvatar首次提出基于前馈扩散的分层UV特征平面表示方法,通过单阶段扩散模型生成组件解耦的着装虚拟人,在解决严重遮挡问题的同时支持高分辨率实时渲染与富有表现力的动画。
English: LayerAvatar introduces a novel feed-forward diffusion-based approach that utilizes a layered UV feature plane representation to generate component-disentangled clothed avatars, enabling high-resolution rendering and expressive animations while addressing occlusion challenges.

Authors:Han Huang, Yulun Wu, Chao Deng, Ge Gao, Ming Gu, Yu-Shen Liu
Title: FatesGS: Fast and Accurate Sparse-View Surface Reconstruction using Gaussian Splatting with Depth-Feature Consistency
Abstract:
Recently, Gaussian Splatting has sparked a new trend in the field of computer vision. Apart from novel view synthesis, it has also been extended to the area of multi-view reconstruction. The latest methods facilitate complete, detailed surface reconstruction while ensuring fast training speed. However, these methods still require dense input views, and their output quality significantly degrades with sparse views. We observed that the Gaussian primitives tend to overfit the few training views, leading to noisy floaters and incomplete reconstruction surfaces. In this paper, we present an innovative sparse-view reconstruction framework that leverages intra-view depth and multi-view feature consistency to achieve remarkably accurate surface reconstruction. Specifically, we utilize monocular depth ranking information to supervise the consistency of depth distribution within patches and employ a smoothness loss to enhance the continuity of the distribution. To achieve finer surface reconstruction, we optimize the absolute position of depth through multi-view projection features. Extensive experiments on DTU and BlendedMVS demonstrate that our method outperforms state-of-the-art methods with a speedup of 60x to 200x, achieving swift and fine-grained mesh reconstruction without the need for costly pre-training.
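The monocular depth-ranking supervision and the smoothness term can be sketched as a pairwise hinge loss over randomly sampled pixel pairs plus a first-difference penalty on the rendered depth. The margin, pair count, and loss weights are illustrative assumptions, not the paper's exact losses:

```python
import torch

def depth_ranking_loss(rendered, mono_prior, n_pairs=1024, margin=1e-3):
    """Penalize rendered-depth pairs whose ordering disagrees with the prior."""
    flat_r, flat_m = rendered.flatten(), mono_prior.flatten()
    i = torch.randint(0, flat_r.numel(), (n_pairs,))
    j = torch.randint(0, flat_r.numel(), (n_pairs,))
    sign = torch.sign(flat_m[i] - flat_m[j])              # ordering from the prior
    return torch.clamp(margin - sign * (flat_r[i] - flat_r[j]), min=0).mean()

def depth_smoothness_loss(rendered):
    """Encourage locally continuous rendered depth."""
    dx = (rendered[:, 1:] - rendered[:, :-1]).abs().mean()
    dy = (rendered[1:, :] - rendered[:-1, :]).abs().mean()
    return dx + dy

rendered = torch.rand(64, 64, requires_grad=True)          # depth from splatting
mono_prior = torch.rand(64, 64)                            # monocular depth estimate
loss = depth_ranking_loss(rendered, mono_prior) + 0.1 * depth_smoothness_loss(rendered)
loss.backward()
print(loss.item())
```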

Authors:Eric Chen, Xi Chen, Arian Maleki, Shirin Jalali
Title: Comprehensive Examination of Unrolled Networks for Solving Linear Inverse Problems
Abstract:
Unrolled networks have become prevalent in various computer vision and imaging tasks. Although they have demonstrated remarkable efficacy in solving specific computer vision and computational imaging tasks, their adaptation to other applications presents considerable challenges. This is primarily due to the multitude of design decisions that practitioners working on new applications must navigate, each potentially affecting the network's overall performance. These decisions include selecting the optimization algorithm, defining the loss function, and determining the number of convolutional layers, among others. Compounding the issue, evaluating each design choice requires time-consuming simulations to train, fine-tune the neural network, and optimize for its performance. As a result, the process of exploring multiple options and identifying the optimal configuration becomes time-consuming and computationally demanding. The main objectives of this paper are (1) to unify some ideas and methodologies used in unrolled networks to reduce the number of design choices a user has to make, and (2) to report a comprehensive ablation study to discuss the impact of each of the choices involved in designing unrolled networks and present practical recommendations based on our findings. We anticipate that this study will help scientists and engineers design unrolled networks for their applications and diagnose problems within their networks efficiently.
Chinese: 本文旨在通过统一方法和提供全面的消融研究及实用建议,简化展开网络的设计过程,从而降低其在新应用中适应时的复杂性和计算需求。
English: This paper aims to simplify the design of unrolled networks by unifying methodologies and providing a comprehensive ablation study with practical recommendations to reduce the complexity and computational demands of adapting them to new applications.
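To make the design space discussed in this abstract concrete, the following is a generic unrolled proximal-gradient network sketch, not the paper's exact architecture: each iteration applies a data-consistency step for a linear forward operator followed by a small learned denoiser. The iteration count, per-iteration weights, and step-size parameterization are exactly the kinds of choices the ablation study covers; A and AT are user-supplied forward and adjoint operators.

import torch
import torch.nn as nn

class UnrolledPGD(nn.Module):
    """Generic unrolled proximal gradient descent: a data-consistency step
    followed by a small learned denoiser, repeated K times."""
    def __init__(self, A, AT, K=8, channels=1):
        super().__init__()
        self.A, self.AT, self.K = A, AT, K
        self.step = nn.Parameter(torch.full((K,), 0.1))   # learned step sizes
        self.denoisers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(32, channels, 3, padding=1))
            for _ in range(K)                              # per-iteration weights
        ])

    def forward(self, y, x0):
        x = x0
        for k in range(self.K):
            x = x - self.step[k] * self.AT(self.A(x) - y)  # gradient of ||Ax - y||^2 / 2
            x = x + self.denoisers[k](x)                   # residual learned prior
        return x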

Authors:Boyang Sun, Hanzhi Chen, Stefan Leutenegger, Cesar Cadena, Marc Pollefeys, Hermann Blum
Title: FrontierNet: Learning Visual Cues to Explore
Abstract:
Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for different tasks, such as mapping, object discovery, and environmental assessment. Existing solutions, such as frontier-based exploration approaches, rely heavily on 3D map operations, which are limited by map quality and, more critically, often overlook valuable context from visual cues. This work aims at leveraging 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose a visual-only frontier-based exploration system, with FrontierNet as its core component. FrontierNet is a learning-based model that (i) proposes frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent goal-extraction approaches, achieving a 15% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments. The project is available at https://github.com/cvg/FrontierNet.
中文摘要:本研究提出FrontierNet视觉探索系统,仅通过RGB图像和深度先验识别边界并预测信息增益,在早期探索阶段比依赖三维地图的方法效率提升15%。
English Summary: This research introduces FrontierNet, a visual-only exploration system that uses posed RGB images and depth priors to identify frontiers and predict their information gain, achieving 15% higher early-stage efficiency than 3D map-based methods.

Authors:Zhi Jin, Yuwei Qiu, Kaihao Zhang, Hongdong Li, Wenhan Luo
Title: MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration
Abstract:
Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention poses a significant limitation on its extensive application in image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant leverages the Taylor expansion to approximate the Softmax-attention and utilizes the concept of norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in a linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named the second version of Taylor formula expansion-based Transformer (MB-TaylorFormer V2 for short), has the capability to concurrently process coarse-to-fine features, capture long-distance pixel interactions with limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead. The source code is available at https://github.com/FVL2020/MB-TaylorFormerV2.
中文: 提出的MB-TaylorFormer V2 Transformer变体采用泰勒展开近似Softmax注意力实现线性复杂度,结合多分支架构进行多尺度处理,以极低计算成本在多种图像复原任务中达到最优性能。
English: The proposed MB-TaylorFormer V2 Transformer variant uses Taylor expansion to approximate Softmax-attention with linear complexity and a multi-branch architecture for multi-scale processing, achieving state-of-the-art results in various image restoration tasks with minimal computational cost.
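The core idea of replacing Softmax-attention with its first-order Taylor expansion can be sketched in a few lines: exp(q·k) is approximated by 1 + q·k, so the sums over keys can be precomputed and the cost becomes linear in sequence length. This is a single-head sketch that omits the paper's norm-preserving remainder correction and multi-branch design; the normalization used to keep 1 + q·k non-negative is an assumption.

import torch
import torch.nn.functional as F

def taylor_linear_attention(q, k, v, eps=1e-6):
    """First-order Taylor approximation of softmax attention with linear cost.
    q, k, v: tensors of shape (B, N, D)."""
    q = F.normalize(q, dim=-1)                      # keep 1 + q.k non-negative
    k = F.normalize(k, dim=-1)
    kv = torch.einsum("bnd,bne->bde", k, v)         # sum_j k_j v_j^T
    k_sum = k.sum(dim=1)                            # sum_j k_j
    num = v.sum(dim=1, keepdim=True) + torch.einsum("bnd,bde->bne", q, kv)
    den = k.shape[1] + torch.einsum("bnd,bd->bn", q, k_sum)
    return num / (den.unsqueeze(-1) + eps)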

Authors:Xin Zhang, Xue Yang, Yuxuan Li, Jian Yang, Ming-Ming Cheng, Xiang Li
Title: RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark
Abstract:
Rotated object detection has made significant progress in optical remote sensing. However, advancements in the Synthetic Aperture Radar (SAR) field lag behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, existing weakly supervised models exhibit limited accuracy in predicting the object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and expose that they share the same shortcomings: these methods overlook the unit cycle constraint inherent in these encodings, easily leading to prediction biases. To address this issue, we propose the Unit Cycle Resolver, which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that our UCR enhances angle prediction accuracy. Our dataset and code can be found at: https://github.com/zhasion/RSAR.
中文: 本文提出单位圆解析器(UCR),通过引入单位圆约束损失来解决弱监督旋转目标检测中的角度预测偏差问题,不仅提升了预测精度,还建立了迄今最大的多类别旋转SAR数据集RSAR。
English: This paper introduces the Unit Cycle Resolver (UCR) to address angle prediction biases in weakly supervised rotated object detection by incorporating a unit circle constraint loss, which improves accuracy and enables the creation of the largest multi-class rotated SAR dataset, RSAR.
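The unit circle constraint named in the abstract has a direct expression: a predicted (cos, sin) angle encoding is penalized whenever it leaves the unit circle. This is a minimal sketch of that constraint term only; the full resolver, its weighting, and how it combines with the detection loss are not reproduced here.

import torch

def unit_circle_loss(pred):
    """pred: (..., 2) tensor of predicted (cos, sin) angle encodings.
    Penalizes encodings that drift off the unit circle."""
    cos, sin = pred[..., 0], pred[..., 1]
    return ((cos ** 2 + sin ** 2 - 1.0) ** 2).mean()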

Authors:Michal Nohel, Constantin Ulrich, Jonathan Suprijadi, Tassilo Wald, Klaus Maier-Hein
Title: A Unified Framework for Foreground and Anonymization Area Segmentation in CT and MRI Data
Abstract:
This study presents an open-source toolkit to address critical challenges in preprocessing data for self-supervised learning (SSL) for 3D medical imaging, focusing on data privacy and computational efficiency. The toolkit comprises two main components: a segmentation network that delineates foreground regions to optimize data sampling and thus reduce training time, and a segmentation network that identifies anonymized regions, preventing erroneous supervision in reconstruction-based SSL methods. Experimental results demonstrate high robustness, with mean Dice scores exceeding 98.5 across all anonymization methods and surpassing 99.5 for foreground segmentation tasks, highlighting the efficacy of the toolkit in supporting SSL applications in 3D medical imaging for both CT and MRI images. The weights and code are available at https://github.com/MIC-DKFZ/Foreground-and-Anonymization-Area-Segmentation.
Chinese: 该开源工具包通过优化数据采样和匿名化处理,提升了3D医学影像的自监督学习效率,在CT和MRI应用中平均Dice分数超过98.5,表现出卓越的鲁棒性。
English: This open-source toolkit enhances self-supervised learning for 3D medical imaging by optimizing data sampling and anonymization, achieving over 98.5 mean Dice scores for robust performance in CT and MRI applications.

Authors:Xueqiang Ouyang, Jia Wei, Wenjie Huo, Xiaocong Wang, Rui Li, Jianlong Zhou
Title: DeFusion: An Effective Decoupling Fusion Network for Multi-Modal Pregnancy Prediction
Abstract:
Temporal embryo images and parental fertility table indicators are both valuable for pregnancy prediction in in vitro fertilization embryo transfer (IVF-ET). However, current machine learning models cannot make full use of the complementary information between the two modalities to improve pregnancy prediction performance. In this paper, we propose a Decoupling Fusion Network called DeFusion to effectively integrate the multi-modal information for IVF-ET pregnancy prediction. Specifically, we propose a decoupling fusion module that decouples the information from the different modalities into related and unrelated information, thereby achieving a more delicate fusion. We fuse temporal embryo images with a spatial-temporal position encoding and extract fertility table indicator information with a table transformer. To evaluate the effectiveness of our model, we use a new dataset including 4046 cases collected from Southern Medical University. The experiments show that our model outperforms state-of-the-art methods. Meanwhile, the performance on the eye disease prediction dataset reflects the model's good generalization. Our code is available at https://github.com/Ou-Young-1999/DFNet.
中文: 提出的DeFusion网络通过解耦融合和专门编码,有效整合胚胎时序图像与生育表格数据,在IVF-ET妊娠预测中实现优越性能并展现良好泛化能力。
English: The proposed DeFusion network effectively integrates temporal embryo images and fertility table data through decoupled fusion and specialized encoding, achieving superior IVF-ET pregnancy prediction performance with demonstrated generalization capability.

Authors:Clément Fuchs, Maxime Zanella, Christophe De Vleeschouwer
Title: Online Gaussian Test-Time Adaptation of Vision-Language Models
Abstract:
Online test-time adaptation (OTTA) of vision-language models (VLMs) has recently garnered increased attention to take advantage of data observed along a stream to improve future predictions. Unfortunately, existing methods rely on dataset-specific hyperparameters, significantly limiting their adaptability to unseen tasks. In response, we propose Online Gaussian Adaptation (OGA), a novel method that models the likelihoods of visual features using Gaussian distributions and incorporates zero-shot priors into an interpretable Maximum A Posteriori (MAP) estimation framework with fixed hyper-parameters across all datasets. We demonstrate that OGA outperforms state-of-the-art methods on most datasets and runs. Additionally, we show that combining OTTA with popular few-shot techniques (a practical yet overlooked setting in prior research) is highly beneficial. Furthermore, our experimental study reveals that common OTTA evaluation protocols, which average performance over at most three runs per dataset, are inadequate due to the substantial variability observed across runs for all OTTA methods. Therefore, we advocate for more rigorous evaluation practices, including increasing the number of runs and considering additional quantitative metrics, such as our proposed Expected Tail Accuracy (ETA), calculated as the average accuracy in the worst 10% of runs. We hope these contributions will encourage more rigorous and diverse evaluation practices in the OTTA community. Code is available at https://github.com/cfuchs2023/OGA .
Chinese: 在线视觉语言模型的测试时自适应方法通过提出的在线高斯自适应得到改进,该方法采用固定超参数并融入零样本先验,在多数数据集上优于现有技术,同时因运行间性能差异显著,强调了需要更严格评估实践的必要性。
English: Online test-time adaptation for vision-language models is advanced by the proposed Online Gaussian Adaptation method, which uses fixed hyperparameters and integrates zero-shot priors, outperforming existing techniques and highlighting the need for more rigorous evaluation practices due to significant performance variability across runs.
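The Expected Tail Accuracy (ETA) metric defined in the abstract, the average accuracy over the worst 10% of runs, has a one-function implementation. A minimal sketch, assuming per-run accuracies are collected in a plain list; the function name is illustrative.

import numpy as np

def expected_tail_accuracy(run_accuracies, tail=0.10):
    """Average accuracy over the worst `tail` fraction of runs
    (worst 10% by default), matching the abstract's definition of ETA."""
    accs = np.sort(np.asarray(run_accuracies, dtype=float))
    k = max(1, int(np.ceil(tail * accs.size)))
    return float(accs[:k].mean())

# e.g. over 100 runs, expected_tail_accuracy(accs) averages the 10 lowest accuracies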

Authors:Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
Title: Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts
Abstract:
Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations below 3B parameters, Eve distinctly outperforms on language benchmarks and achieves state-of-the-art results of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model. Our code is available at https://github.com/rangmiao/Eve.
Chinese: 创新的高效视觉语言模型与弹性视觉专家(Eve)框架,在仅1.8B参数下实现了语言能力与多模态性能的平衡,在语言基准和视觉语言模型基准中均取得卓越成果。
English: The innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve) balances linguistic preservation and multimodal enhancement, achieving superior performance with only 1.8B parameters in both language and VLM benchmarks.

Authors:Hyogon Ryu, NaHyeon Park, Hyunjung Shim
Title: DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion Models
Abstract:
Despite the widespread use of text-to-image diffusion models across various tasks, their computational and memory demands limit practical applications. To mitigate this issue, quantization of diffusion models has been explored. It reduces memory usage and computational costs by compressing weights and activations into lower-bit formats. However, existing methods often struggle to preserve both image quality and text-image alignment, particularly in lower-bit (<8 bits) quantization. In this paper, we analyze the challenges associated with quantizing text-to-image diffusion models from a distributional perspective. Our analysis reveals that activation outliers play a crucial role in determining image quality. Additionally, we identify distinctive patterns in cross-attention scores, which significantly affect text-image alignment. To address these challenges, we propose Distribution-aware Group Quantization (DGQ), a method that identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality. Furthermore, DGQ applies prompt-specific logarithmic quantization scales to maintain text-image alignment. Our method demonstrates remarkable performance on datasets such as MS-COCO and PartiPrompts. We are the first to successfully achieve low-bit quantization of text-to-image diffusion models without requiring additional fine-tuning of weight quantization parameters. Code is available at https://github.com/ugonfor/DGQ.
中文: 本文提出的分布感知分组量化(DGQ)方法通过自适应处理激活异常值并采用提示特定的对数量化尺度,成功解决了文本到图像扩散模型在低位量化中保持图像质量和图文一致性的难题,无需额外微调即可实现高效压缩。
English: This paper introduces Distribution-aware Group Quantization (DGQ), a novel method that effectively addresses the challenges of low-bit quantization in text-to-image diffusion models by adaptively handling activation outliers and applying prompt-specific logarithmic scales to preserve both image quality and text-image alignment without additional fine-tuning.
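To illustrate the general idea of outlier-aware group quantization that DGQ builds on, the toy sketch below groups channels, gives each group its own scale, and keeps the most extreme channels in full precision. This is not the paper's method: DGQ handles pixel-wise as well as channel-wise outliers and uses prompt-specific logarithmic scales, none of which are reproduced here; the grouping, bit width, and outlier criterion are illustrative.

import torch

def group_quantize(x, n_bits=4, group_size=16, outlier_pct=0.01):
    """Toy channel-group quantization: each group of channels gets its own
    absmax scale, and the most extreme channels are left unquantized."""
    C = x.shape[0]                                   # treat dim 0 as channels
    qmax = 2 ** (n_bits - 1) - 1
    flat = x.reshape(C, -1)
    n_out = max(1, int(outlier_pct * C))
    outliers = torch.topk(flat.abs().amax(dim=1), n_out).indices
    keep = torch.ones(C, dtype=torch.bool)
    keep[outliers] = False
    out = torch.empty_like(flat)
    out[outliers] = flat[outliers]                   # outlier channels kept in full precision
    for start in range(0, C, group_size):
        idx = torch.arange(start, min(start + group_size, C))
        idx = idx[keep[idx]]
        if idx.numel() == 0:
            continue
        scale = flat[idx].abs().amax().clamp(min=1e-8) / qmax
        out[idx] = torch.round(flat[idx] / scale).clamp(-qmax, qmax) * scale
    return out.reshape_as(x)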

Authors:Hyungjin Chung, Dohun Lee, Zihui Wu, Byung-Hoon Kim, Katherine L. Bouman, Jong Chul Ye
Title: ContextMRI: Enhancing Compressed Sensing MRI through Metadata Conditioning
Abstract:
Compressed sensing MRI seeks to accelerate MRI acquisition processes by sampling fewer k-space measurements and then reconstructing the missing data algorithmically. The success of these approaches often relies on strong priors or learned statistical models. While recent diffusion model-based priors have shown great potential, previous methods typically ignore clinically available metadata (e.g. patient demographics, imaging parameters, slice-specific information). In practice, metadata contains meaningful cues about the anatomy and acquisition protocol, suggesting it could further constrain the reconstruction problem. In this work, we propose ContextMRI, a text-conditioned diffusion model for MRI that integrates granular metadata into the reconstruction process. We train a pixel-space diffusion model directly on minimally processed, complex-valued MRI images. During inference, metadata is converted into a structured text prompt and fed to the model via CLIP text embeddings. By conditioning the prior on metadata, we unlock more accurate reconstructions and show consistent gains across multiple datasets, acceleration factors, and undersampling patterns. Our experiments demonstrate that increasing the fidelity of metadata, ranging from slice location and contrast to patient age, sex, and pathology, systematically boosts reconstruction performance. This work highlights the untapped potential of leveraging clinical context for inverse problems and opens a new direction for metadata-driven MRI reconstruction.
中文: ContextMRI是一种新型扩散模型,通过将临床元数据转化为文本提示来增强MRI重建效果,在多种数据集和加速因子下均实现了更精准的图像恢复。
English: ContextMRI is a novel diffusion model that enhances MRI reconstruction by incorporating clinical metadata as text prompts, achieving superior accuracy across various datasets and acceleration factors.
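The metadata conditioning described in the abstract amounts to serializing clinical fields into a structured text prompt before encoding it with CLIP. A minimal sketch of such a serializer; the field names and phrasing are illustrative and not the authors' template.

def metadata_to_prompt(meta):
    """Turn a metadata record (dict) into a structured text prompt."""
    parts = []
    if meta.get("contrast"):
        parts.append(f"{meta['contrast']}-weighted")
    if meta.get("anatomy"):
        parts.append(f"{meta['anatomy']} MRI")
    if meta.get("slice_index") is not None:
        parts.append(f"slice {meta['slice_index']}")
    if meta.get("age") is not None and meta.get("sex"):
        parts.append(f"{meta['age']} year old {meta['sex']} patient")
    if meta.get("pathology"):
        parts.append(f"finding: {meta['pathology']}")
    return ", ".join(parts)

# metadata_to_prompt({"contrast": "T2", "anatomy": "brain", "slice_index": 12,
#                     "age": 54, "sex": "female", "pathology": "no visible lesion"})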

Authors:Yuze Wang, Rong Xiao, Haifeng Li, Mariana Belgiu, Chao Tao
Title: Enhancing Scene Classification in Cloudy Image Scenarios: A Collaborative Transfer Method with Information Regulation Mechanism using Optical Cloud-Covered and SAR Remote Sensing Images
Abstract:
In remote sensing scene classification, leveraging transfer methods with well-trained optical models is an efficient way to overcome label scarcity. However, cloud contamination leads to optical information loss and significant impacts on feature distribution, challenging the reliability and stability of transferred target models. Common solutions include cloud removal for optical data or directly using Synthetic Aperture Radar (SAR) data in the target domain. However, cloud removal requires substantial auxiliary data for support and pre-training, while directly using SAR disregards the unobstructed portions of optical data. This study presents a scene classification transfer method that synergistically combines multi-modality data, which aims to transfer the source domain model trained on cloud-free optical data to the target domain that includes both cloudy optical and SAR data at low cost. Specifically, the framework incorporates two parts: (1) the collaborative transfer strategy, based on knowledge distillation, enables the efficient prior knowledge transfer across heterogeneous data; (2) the information regulation mechanism (IRM) is proposed to address the modality imbalance issue during transfer. It employs auxiliary models to measure the contribution discrepancy of each modality, and automatically balances the information utilization of modalities during the target model learning process at the sample-level. The transfer experiments were conducted on simulated and real cloud datasets, demonstrating the superior performance of the proposed method compared to other solutions in cloud-covered scenarios. We also verified the importance and limitations of IRM, and further discussed and visualized the modality imbalance problem during the model transfer. Code is available at https://github.com/wangyuze-csu/ESCCS
中文摘要:本研究提出一种遥感场景分类的迁移学习方法,通过协同利用多模态数据,结合知识蒸馏策略和信息调节机制,有效解决云层覆盖下光学与SAR数据的模态不平衡问题,在云污染场景中展现出优越性能。
English Summary: This study introduces a transfer learning method for remote sensing scene classification that synergistically utilizes multi-modal data, combining cloudy optical and SAR imagery through knowledge distillation and an information regulation mechanism to address modality imbalance and enhance performance in cloud-covered scenarios.

Authors:Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman
Title: MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
Abstract:
Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
Chinese: MM-Gen是一种可扩展的方法,通过为图像生成高质量的合成文本来提升视觉语言模型在专业任务上的性能,例如使Llava-1.5在空间推理和图表理解方面分别实现了29%和15%的显著提升。
English: MM-Gen is a scalable method that generates high-quality synthetic text for images to enhance vision-language models' performance on specialized tasks, achieving significant improvements such as 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5.

Authors:Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu
Title: Chirpy3D: Creative Fine-grained 3D Object Fabrication via Part Sampling
Abstract:
We present Chirpy3D, a novel approach for fine-grained 3D object generation, tackling the challenging task of synthesizing creative 3D objects in a zero-shot setting, with access only to unposed 2D images of seen categories. Without structured supervision -- such as camera poses, 3D part annotations, or object-specific labels -- the model must infer plausible 3D structures, capture fine-grained details, and generalize to novel objects using only category-level labels from seen categories. To address this, Chirpy3D introduces a multi-view diffusion model that decomposes training objects into anchor parts in an unsupervised manner, representing the latent space of both seen and unseen parts as continuous distributions. This allows smooth interpolation and flexible recombination of parts to generate entirely new objects with species-specific details. A self-supervised feature consistency loss further ensures structural and semantic coherence. The result is the first system capable of generating entirely novel 3D objects with species-specific fine-grained details through flexible part sampling and composition. Our experiments demonstrate that Chirpy3D surpasses existing methods in generating creative 3D objects with higher quality and fine-grained details. Code will be released at https://github.com/kamwoh/chirpy3d.
中文:Chirpy3D是一种零样本三维物体生成方法,通过无监督部件分解和多视角扩散模型,能够生成具有精细细节的全新三维物体,在质量和创造性上超越了现有方法。
English: Chirpy3D is a zero-shot 3D object generation method that uses unsupervised part decomposition and a multi-view diffusion model to create novel 3D objects with fine-grained details, outperforming existing approaches in quality and creativity.

Authors:Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
Title: Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Abstract:
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
中文: Sa2VA是首个统一理解图像和视频密集基础信息的模型,结合SAM-2和LLaVA,通过少量指令调整即可支持多种任务,并在指代视频对象分割等复杂应用中达到领先性能。
English: Sa2VA is the first unified model for dense grounded understanding of images and videos, integrating SAM-2 and LLaVA to support various tasks with minimal instruction tuning and achieving state-of-the-art performance, especially in referring video object segmentation.

Authors:Sagi Polaczek, Yuval Alaluf, Elad Richardson, Yael Vinker, Daniel Cohen-Or
Title: NeuralSVG: An Implicit Representation for Text-to-Vector Generation
Abstract:
Vector graphics are essential in design, providing artists with a versatile medium for creating resolution-independent and highly editable visual content. Recent advancements in vision-language and diffusion models have fueled interest in text-to-vector graphics generation. However, existing approaches often suffer from over-parameterized outputs or treat the layered structure - a core feature of vector graphics - as a secondary goal, diminishing their practical use. Recognizing the importance of layered SVG representations, we propose NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), NeuralSVG encodes the entire scene into the weights of a small MLP network, optimized using Score Distillation Sampling (SDS). To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape. We additionally demonstrate that utilizing a neural representation provides an added benefit of inference-time control, enabling users to dynamically adapt the generated SVG based on user-provided inputs, all with a single learned representation. Through extensive qualitative and quantitative evaluations, we demonstrate that NeuralSVG outperforms existing methods in generating structured and flexible SVG.

Authors:Satchel French, Faith Zhu, Amish Jain, Naimul Khan
Title: Temporal Feature Weaving for Neonatal Echocardiographic Viewpoint Video Classification
Abstract:
Automated viewpoint classification in echocardiograms can help under-resourced clinics and hospitals in providing faster diagnosis and screening when expert technicians may not be available. We propose a novel approach towards echocardiographic viewpoint classification. We show that treating viewpoint classification as video classification rather than image classification yields an advantage. We propose a CNN-GRU architecture with a novel temporal feature weaving method, which leverages both spatial and temporal information to yield a 4.33% increase in accuracy over baseline image classification while using only four consecutive frames. The proposed approach incurs minimal computational overhead. Additionally, we publish the Neonatal Echocardiogram Dataset (NED), a professionally-annotated dataset providing sixteen viewpoints and associated echocardiography videos to encourage future work and development in this field. Code available at: https://github.com/satchelfrench/NED
中文摘要:本研究提出了一种结合CNN-GRU架构与时序特征融合的新方法,通过视频分类实现超声心动图视角识别,准确率较基线提升4.33%,并发布了带专业标注的新生儿超声心动图数据集以推动相关研究。
English Summary: This study introduces a novel CNN-GRU architecture with temporal feature weaving for echocardiographic viewpoint classification, achieving a 4.33% accuracy improvement over baseline methods while publishing the annotated Neonatal Echocardiogram Dataset to advance the field.
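The overall shape of the pipeline, a per-frame CNN encoder followed by a GRU over a short clip, can be sketched compactly. This minimal sketch assumes single-channel ultrasound frames, a 4-frame clip, and 16 viewpoint classes; the temporal feature weaving step itself is not reproduced, and all layer sizes are illustrative.

import torch
import torch.nn as nn

class CnnGruClassifier(nn.Module):
    """Per-frame CNN encoder followed by a GRU over the clip, then a
    viewpoint classification head."""
    def __init__(self, n_views=16, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, n_views)

    def forward(self, clip):                 # clip: (B, T=4, 1, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        _, h = self.gru(feats)               # h: (1, B, feat_dim)
        return self.head(h[-1])              # viewpoint logits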

Authors:Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia
Title: Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
Abstract:
We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/
中文: Magic Mirror 是一种创新框架,通过双分支面部特征提取器、轻量级跨模态适配器和两阶段训练策略,生成具有动态运动且保持身份一致的高质量视频,在身份保持与运动多样性平衡方面优于现有方法。
English: Magic Mirror is a novel framework that generates high-quality, identity-preserved videos with dynamic motion by integrating a dual-branch facial feature extractor, a lightweight cross-modal adapter, and a two-stage training strategy, outperforming existing methods in balancing identity consistency and motion diversity.
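One plausible reading of the Conditioned Adaptive Normalization named in the abstract is a LayerNorm whose scale and shift are predicted from the identity embedding. The sketch below follows that reading only; the dimensions, the zero-initialization, and the exact place it sits inside the adapter are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ConditionedAdaptiveNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from an identity embedding."""
    def __init__(self, dim, id_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(id_dim, 2 * dim)
        nn.init.zeros_(self.to_scale_shift.weight)   # start as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x, id_embed):                  # x: (B, N, dim), id_embed: (B, id_dim)
        scale, shift = self.to_scale_shift(id_embed).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)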

Authors:Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, Bowen Zhou
Title: Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback
Abstract:
The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we introduce Dolphin, a closed-loop LLM-driven framework to enhance the automation level of scientific research. Dolphin first generates novel ideas based on feedback from previous experiments and relevant papers ranked by the topic and task attributes. Then, the generated ideas can be implemented using a code template refined and debugged with the designed exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and a subset of MLE-bench. Results show that Dolphin can continuously improve the performance of the input topic in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 3D point classification.
Chinese: Dolphin框架提出了一种闭环、大语言模型驱动的系统,通过生成想法、实施代码和分析结果来自动化科学研究,在3D点分类等任务中展现出与最先进方法相媲美的性能提升。
English: The Dolphin framework introduces a closed-loop, LLM-driven system that automates scientific research by generating ideas, implementing code, and analyzing results, demonstrating performance improvements comparable to state-of-the-art methods in tasks like 3D point classification.

Authors:Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
Title: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Abstract:
The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
中文: LLaVA-Mini是一种高效的大型多模态模型,通过模态预融合将视觉令牌压缩至仅一个,在显著降低计算开销的同时,保持了在图像和视频基准测试中的优异性能。
English: LLaVA-Mini is an efficient large multimodal model that drastically reduces computational overhead by compressing vision tokens to just one through modality pre-fusion, while maintaining high performance across image and video benchmarks.
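The modality pre-fusion idea can be illustrated with a toy module: text tokens attend to the vision tokens once, outside the LLM, and the vision stream is then collapsed into a single token by a learned query. This is a hedged sketch of the concept only; the real model's layer counts, dimensions, and compression mechanism may differ.

import torch
import torch.nn as nn

class ModalityPreFusion(nn.Module):
    """Toy modality pre-fusion: fuse vision into text before the LLM, then
    compress all vision tokens into one token for the LLM context."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.prefuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.compress_query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.compress = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, vision_tokens):
        # fuse visual information into text tokens in advance
        fused_text, _ = self.prefuse(text_tokens, vision_tokens, vision_tokens)
        text_tokens = text_tokens + fused_text
        # compress the full set of vision tokens (e.g. 576) into one token
        q = self.compress_query.expand(vision_tokens.size(0), -1, -1)
        one_vision_token, _ = self.compress(q, vision_tokens, vision_tokens)
        return text_tokens, one_vision_token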

Authors:Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
Title: Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Abstract:
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
中文:Diffusion as Shader (DaS) 提出了一种以三维追踪视频作为控制输入的通用架构,能够实现相机控制和物体操纵等多种视频生成任务,同时以极少的训练量确保时间连贯性。
English: Diffusion as Shader (DaS) introduces a unified architecture using 3D tracking videos as control inputs, enabling versatile video generation tasks like camera control and object manipulation while ensuring temporal consistency with minimal training.

Authors:Eduarda Caldeira, Guray Ozgur, Tahar Chettaoui, Marija Ivanovska, Peter Peer, Fadi Boutros, Vitomir Struc, Naser Damer
Title: MADation: Face Morphing Attack Detection with Foundation Models
Abstract:
Despite the considerable performance improvements of face recognition algorithms in recent years, the same scientific advances responsible for this progress can also be used to create efficient ways to attack them, posing a threat to their secure deployment. Morphing attack detection (MAD) systems aim to detect a specific type of threat, morphing attacks, at an early stage, preventing them from being considered for verification in critical processes. Foundation models (FM) learn from extensive amounts of unlabelled data, achieving remarkable zero-shot generalization to unseen domains. Although this generalization capacity might be weak when dealing with domain-specific downstream tasks such as MAD, FMs can easily adapt to these settings while retaining the built-in knowledge acquired during pre-training. In this work, we recognize the potential of FMs to perform well in the MAD task when properly adapted to its specificities. To this end, we adapt FM CLIP architectures with LoRA weights while simultaneously training a classification head. The proposed framework, MADation, surpasses our alternative FM and transformer-based frameworks and constitutes the first adaptation of FMs to the MAD task. MADation presents competitive results with current MAD solutions in the literature and even surpasses them in several evaluation scenarios. To encourage reproducibility and facilitate further research in MAD, we publicly release the implementation of MADation at https://github.com/gurayozgur/MADation
中文: 尽管人脸识别技术的进步带来了新的攻击风险,本研究提出的MADation框架通过LoRA适配基础模型,在篡改攻击检测任务中展现出卓越性能,在多个评估场景下达到甚至超越了现有解决方案的水平。
English: While face recognition advances also enable new attack methods, this work introduces MADation, a framework adapting foundation models with LoRA to effectively detect morphing attacks, achieving competitive and even superior performance in various scenarios.
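The LoRA adaptation the abstract describes follows the standard construction: the pre-trained linear projection is frozen and a trainable low-rank update is added on top, with a classification head trained alongside. A minimal sketch under that reading; the rank, scaling, and `embed_dim` of the head are illustrative values, not the paper's configuration.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (standard LoRA)."""
    def __init__(self, base: nn.Linear, rank=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

# e.g. wrap a CLIP projection layer and train only A, B together with a
# binary bona-fide / morph head: head = nn.Linear(embed_dim, 2)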

Authors:Xinbin Yuan, Zhaohui Zheng, Yuxuan Li, Xialei Liu, Li Liu, Xiang Li, Qibin Hou, Ming-Ming Cheng
Title: Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection
Abstract:
Despite rapid development, remote sensing object detection remains challenging for high aspect ratio objects. This paper shows that large strip convolutions are good feature representation learners for remote sensing object detection and can detect objects of various aspect ratios well. Based on large strip convolutions, we build a new network architecture called Strip R-CNN, which is simple, efficient, and powerful. Unlike recent remote sensing object detectors that leverage large-kernel convolutions with square shapes, our Strip R-CNN takes advantage of sequential orthogonal large strip convolutions in our backbone network StripNet to capture spatial information. In addition, we improve the localization capability of remote-sensing object detectors by decoupling the detection heads and equipping the localization branch with strip convolutions in our strip head. Extensive experiments on several benchmarks, for example DOTA, FAIR1M, HRSC2016, and DIOR, show that our Strip R-CNN can greatly improve previous work. In particular, our 30M model achieves 82.75% mAP on DOTA-v1.0, setting a new state-of-the-art record. Code is available at https://github.com/YXB-NKU/Strip-R-CNN.
中文摘要:本文提出Strip R-CNN网络架构,采用大型条带卷积有效检测遥感图像中不同长宽比的物体,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces Strip R-CNN, a novel network architecture using large strip convolutions to effectively detect objects of various aspect ratios in remote sensing, achieving state-of-the-art performance on multiple benchmarks.
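Sequential orthogonal large strip convolutions are straightforward to express: a 1 x k convolution followed by a k x 1 convolution, typically depthwise to keep the cost low. The block below is a hedged sketch of that pattern, not the exact StripNet block; the kernel length, the depthwise choice, and the residual wiring are assumptions.

import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    """Sequential orthogonal strip convolutions (1 x k then k x 1), depthwise,
    followed by a pointwise mix and a residual connection."""
    def __init__(self, channels, k=19):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x + self.pointwise(self.vertical(self.horizontal(x)))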

Authors:Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer, Ismail Ben Ayed
Title: Realistic Test-Time Adaptation of Vision-Language Models
Abstract:
The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios, and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models' initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that could handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code available at https://github.com/MaxZanella/StatA.
中文: 本研究挑战了视觉语言模型中转导学习和测试时适应方法的理想假设,提出了包含可变类别数和非独立同分布测试批次的现实评估框架,同时引入StatA方法——通过新型正则化项在多样化部署场景中保持模型初始鲁棒性的通用解决方案。
English: This study challenges the favorable assumptions in transductive and test-time adaptation methods for Vision-Language Models by introducing a realistic evaluation framework with variable class numbers and non-i.i.d. test batches, while proposing StatA—a versatile method with a novel regularization term that preserves initial model robustness across diverse deployment scenarios.

Authors:Lezhong Wang, Duc Minh Tran, Ruiqi Cui, Thomson TG, Anders Bjorholm Dahl, Siavash Arjomand Bigdeli, Jeppe Revall Frisvad, Manmohan Chandraker
Title: Materialist: Physically Based Editing Using Single-Image Inverse Rendering
Abstract:
Achieving physically consistent image editing remains a significant challenge in computer vision. Existing image editing methods typically rely on neural networks, which struggle to accurately handle shadows and refractions. Conversely, physics-based inverse rendering often requires multi-view optimization, limiting its practicality in single-image scenarios. In this paper, we propose Materialist, a method combining a learning-based approach with physically based progressive differentiable rendering. Given an image, our method leverages neural networks to predict initial material properties. Progressive differentiable rendering is then used to optimize the environment map and refine the material properties with the goal of closely matching the rendered result to the input image. Our approach enables a range of applications, including material editing, object insertion, and relighting, while also introducing an effective method for editing material transparency without requiring full scene geometry. Furthermore, our environment map estimation method also achieves state-of-the-art performance, further enhancing the accuracy of image editing tasks. Experiments demonstrate strong performance across synthetic and real-world datasets, excelling even on challenging out-of-domain images. Project website: https://lez-s.github.io/materialist_project/

Authors:Sangwoon Kwak, Joonsoo Kim, Jun Young Jeong, Won-Sik Cheong, Jihyong Oh, Munchurl Kim
Title: MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts focused on adapting it for dynamic scenes. Despite delivering remarkable rendering quality and speed, existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework designed for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending static Scaffold representation to dynamic video reconstruction. For Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions, by directly deforming the implicit Scaffold attributes which are anchor position, offset, and local context features. Next, we finely adjust local motions via the Local Gaussian Deformation (LGD) of Local CS explicitly. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods for dynamic 3D Gaussians from real-world dynamic videos while maintaining or even improving rendering quality.

Authors:Mengshi Qi, Hao Ye, Jiaxuan Peng, Huadong Ma
Title: Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression
Abstract:
Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid movement and the corresponding visual appearance variances are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving in sports, are usually divided into multiple sub-actions, each of which contains different durations. However, existing methods focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions resulting in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression. Firstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network is introduced to separate different sub-actions and obtain segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics structural priors, to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the current low-quality human pose labels. In experiments, the results on FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe.
中文: 本文提出了一种新颖的分层姿态引导多阶段对比回归方法,通过多尺度特征编码和过程分割解决动作质量评估中细粒度姿态变化捕捉和时序连续性的难题,并在新数据集和现有数据集上验证了其有效性。
English: This paper proposes a novel hierarchically pose-guided multi-stage contrastive regression method for action quality assessment, addressing challenges in capturing fine-grained pose variations and temporal continuity through multi-scale feature encoding and procedure segmentation, validated on new and existing datasets.

Authors:Jiaxuan Li, Qing Xu, Xiangjian He, Ziyu Liu, Daokun Zhang, Ruili Wang, Rong Qu, Guoping Qiu
Title: CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature Fusion for Improved Segmentation of Heterogeneous Medical Images
Abstract:
Medical image segmentation plays an important role in computer-aided diagnosis. Existing methods mainly utilize spatial attention to highlight the region of interest. However, due to limitations of medical imaging devices, medical images exhibit significant heterogeneity, posing challenges for segmentation. Ultrasound images, for instance, often suffer from speckle noise, low resolution, and poor contrast between target tissues and background, which may lead to inaccurate boundary delineation. To address these challenges caused by heterogeneous image quality, we propose a hybrid CNN-Transformer model, called CFFormer, which leverages effective channel feature extraction to enhance the model's ability to accurately identify tissue regions by capturing rich contextual information. The proposed architecture contains two key components: the Cross Feature Channel Attention (CFCA) module and the X-Spatial Feature Fusion (XFF) module. The model incorporates dual encoders, with the CNN encoder focusing on capturing local features and the Transformer encoder modeling global features. The CFCA module filters and facilitates interactions between the channel features from the two encoders, while the XFF module effectively reduces the significant semantic information differences in spatial features, enabling a smooth and cohesive spatial feature fusion. We evaluate our model across eight datasets covering five modalities to test its generalization capability. Experimental results demonstrate that our model outperforms current state-of-the-art methods and maintains accurate tissue region segmentation across heterogeneous medical image datasets. The code is available at https://github.com/JiaxuanFelix/CFFormer.
中文: 提出的CFFormer模型融合CNN与Transformer架构,通过特殊模块增强通道和空间特征整合,在包含噪声和低对比度等挑战的多种医学影像数据集中实现了更优的分割精度。
English: The proposed CFFormer model combines CNN and Transformer architectures with specialized modules to enhance channel and spatial feature integration, achieving superior segmentation accuracy across diverse medical imaging datasets despite challenges like noise and low contrast.

Authors:NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski
Title: Cosmos World Foundation Model Platform for Physical AI
Abstract:
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.
Chinese: 物理AI需要自身和世界的数字孪生,Cosmos平台通过可定制的世界基础模型及开源工具提供支持,以解决社会关键问题。
English: Physical AI requires digital twins of itself and the world, which the Cosmos platform provides through customizable world foundation models and open-source tools to address societal challenges.

Authors:Zetian Feng, Dong Ni, Yi Wang
Title: Salient Region Matching for Fully Automated MR-TRUS Registration
Abstract:
Prostate cancer is a leading cause of cancer-related mortality in men. The registration of magnetic resonance (MR) and transrectal ultrasound (TRUS) can provide guidance for the targeted biopsy of prostate cancer. In this study, we propose a salient region matching framework for fully automated MR-TRUS registration. The framework consists of prostate segmentation, rigid alignment and deformable registration. Prostate segmentation is performed using two segmentation networks on MR and TRUS respectively, and the predicted salient regions are used for the rigid alignment. The rigidly-aligned MR and TRUS images serve as initialization for the deformable registration. The deformable registration network has a dual-stream encoder with cross-modal spatial attention modules to facilitate multi-modality feature learning, and a salient region matching loss to consider both structure and intensity similarity within the prostate region. Experiments on a public MR-TRUS dataset demonstrate that our method achieves satisfactory registration results, outperforming several cutting-edge methods. The code is publicly available at https://github.com/mock1ngbrd/salient-region-matching.
中文: 本研究提出了一种用于前列腺癌活检中磁共振与经直肠超声图像全自动配准的框架,通过分割、刚性对齐和带跨模态注意力的双流形变网络,在公共数据集上取得了优于现有方法的配准效果。
English: This study introduces a fully automated framework for MR-TRUS registration in prostate cancer biopsy, utilizing segmentation, rigid alignment, and a dual-stream deformable network with cross-modal attention to achieve superior results over existing methods.

Authors:Fatemeh Ghofrani, Pooyan Jamshidi
Title: An Empirical Study of Accuracy-Robustness Tradeoff and Training Efficiency in Self-Supervised Learning
Abstract:
Self-supervised learning (SSL) has significantly advanced image representation learning, yet efficiency challenges persist, particularly with adversarial training. Many SSL methods require extensive epochs to achieve convergence, a demand further amplified in adversarial settings. To address this inefficiency, we revisit the robust EMP-SSL framework, emphasizing the importance of increasing the number of crops per image to accelerate learning. Unlike traditional contrastive learning, robust EMP-SSL leverages multi-crop sampling, integrates an invariance term and regularization, and reduces training epochs, enhancing time efficiency. Evaluated with both standard linear classifiers and multi-patch embedding aggregation, robust EMP-SSL provides new insights into SSL evaluation strategies. Our results show that robust crop-based EMP-SSL not only accelerates convergence but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming multi-crop embedding aggregation. Additionally, we extend this approach with free adversarial training in Multi-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the effectiveness of free adversarial training in reducing training time while simultaneously improving clean accuracy and adversarial robustness. These findings underscore the potential of CF-AMC-SSL for practical SSL applications. Our code is publicly available at https://github.com/softsys4ai/CF-AMC-SSL.
中文摘要:鲁棒EMP-SSL框架通过增加图像裁剪数量和减少训练周期来加速自监督学习,而扩展的CF-AMC-SSL方法通过免费对抗训练进一步提升了效率与鲁棒性。
English Summary: The robust EMP-SSL framework accelerates self-supervised learning by increasing image crops and reducing training epochs, while the extended CF-AMC-SSL method further enhances efficiency and robustness through free adversarial training.
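The "free" adversarial training that CF-AMC-SSL builds on reuses the input gradient from the same backward pass that updates the weights, replaying each minibatch several times. The sketch below shows that training loop for a generic supervised loss; CF-AMC-SSL applies the idea to its multi-crop self-supervised objective, and the epsilon, replay count, and function names here are illustrative.

import torch

def free_adv_epoch(model, loader, loss_fn, opt, eps=8 / 255, replays=4):
    """One epoch of 'free' adversarial training: each minibatch is replayed
    `replays` times; the same backward pass updates the weights and supplies
    the input gradient that updates the perturbation."""
    delta = None
    for x, y in loader:
        if delta is None or delta.shape != x.shape:
            delta = torch.zeros_like(x)
        for _ in range(replays):
            delta.requires_grad_(True)
            loss = loss_fn(model(x + delta), y)
            opt.zero_grad()
            loss.backward()
            opt.step()                                    # weight update
            grad = delta.grad.detach()
            delta = (delta.detach() + eps * grad.sign()).clamp(-eps, eps)  # perturbation update at no extra cost
    return model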

Authors:Chuang Niu, Wenjun Xia, Hongming Shan, Ge Wang
Title: Information-Maximized Soft Variable Discretization for Self-Supervised Image Representation Learning
Abstract:
Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today's vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-trivial, and redundancy-minimized representation features. We then derive a joint-cross entropy loss function for self-supervised image representation learning, which theoretically enjoys superiority over the existing methods in reducing feature redundancy. Notably, our non-contrastive IMSVD method statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to our variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at https://github.com/niuchuangnn/IMSVD.
中文: 本研究提出信息最大化软变量离散化(IMSVD)这一新型自监督学习方法,通过对潜在变量进行软离散化来减少冗余并增强图像表征的可解释性,在无需标注的情况下于多项任务中展现出卓越性能。
English: This study introduces Information-Maximized Soft Variable Discretization (IMSVD), a novel self-supervised learning method that discretizes latent variables to minimize redundancy and enhance explainability in image representation, demonstrating superior performance across various tasks without requiring annotations.
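The core mechanism, soft discretization of each latent variable over a set of bins, can be sketched in a few lines. The bin count, temperature, and the particular cross-entropy/entropy terms below are illustrative assumptions, not the official IMSVD objective.

```python
import torch
import torch.nn.functional as F

def soft_discretize(z, temperature=1.0):
    """Softly discretize each latent variable over K bins.

    z: [batch, variables, K] raw logits (one K-way soft assignment per variable).
    Returns per-sample probabilities p[b, v, k] = P(variable v falls in bin k).
    A minimal sketch of the idea, not the released IMSVD implementation.
    """
    return F.softmax(z / temperature, dim=-1)

def joint_cross_entropy(p1, p2, eps=1e-8):
    """Cross-entropy between the soft bin assignments of two augmented views.

    Averaged over the batch this acts as an invariance term: each variable of
    view 1 should predict the bin of the same variable in view 2.
    """
    return -(p2 * torch.log(p1 + eps)).sum(dim=-1).mean()

def batch_entropy(p, eps=1e-8):
    """Entropy of the batch-averaged bin distribution per variable.

    Maximizing it pushes every variable to use all bins, discouraging collapse
    and (together with the cross-entropy term) reducing redundancy.
    """
    p_bar = p.mean(dim=0)                       # [variables, K]
    return -(p_bar * torch.log(p_bar + eps)).sum(dim=-1).mean()

# toy usage with a hypothetical encoder producing [batch, V, K] logits for two views:
# p1, p2 = soft_discretize(logits1), soft_discretize(logits2)
# loss = joint_cross_entropy(p1, p2) + joint_cross_entropy(p2, p1) - lam * batch_entropy(p1)
```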

Authors:Xiao Wang, Fuling Wang, Haowen Wang, Bo Jiang, Chuanfu Li, Yaowei Wang, Yonghong Tian, Jin Tang
Title: Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation
Abstract:
X-ray image-based medical report generation has achieved significant progress in recent years with the help of large language models; however, these models have not fully exploited the effective information in visual image regions, resulting in reports that are linguistically sound but insufficient in describing key diseases. In this paper, we propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. It considers both the mining of global and local visual information and associates historical report information to better complete the writing of the current report. Specifically, given an X-ray image, we first utilize a classification model along with its activation maps to accomplish the mining of visual regions highly associated with diseases and the learning of disease query tokens. Then, we employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information. This process facilitates the generation of high-quality reports based on a large language model and achieves state-of-the-art performance on multiple benchmark datasets, including the IU X-ray, MIMIC-CXR, and Chexpert Plus. The source code of this work is released on \url{https://github.com/Event-AHU/Medical_Image_Analysis}.
中文摘要:本文提出了一种联想记忆增强的X光报告生成模型,通过结合全局与局部视觉信息并关联历史报告数据来模拟医生诊断过程,在多个医学影像基准测试中实现了最优性能。
English Summary: This paper introduces an associative memory-enhanced X-ray report generation model that mimics doctors' diagnostic processes by integrating global and local visual information with historical report data, achieving state-of-the-art performance on multiple medical imaging benchmarks.
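The Hopfield-network retrieval used for the visual and report memories can be pictured as softmax attention over stored patterns. The sketch below shows a generic continuous Hopfield retrieval step; the dimensions and the `hopfield_retrieve` helper are hypothetical and do not reproduce the paper's exact module.

```python
import torch
import torch.nn.functional as F

def hopfield_retrieve(queries, memory, beta=1.0):
    """Continuous Hopfield retrieval step (softmax attention over stored patterns).

    queries: [num_queries, dim]   e.g. disease query tokens
    memory:  [num_patterns, dim]  e.g. stored disease-token or report features
    One update pulls each query toward a convex combination of memory patterns.
    """
    attn = F.softmax(beta * queries @ memory.t(), dim=-1)   # [Q, P]
    return attn @ memory                                    # retrieved patterns

# toy usage with hypothetical sizes
disease_queries = torch.randn(14, 256)     # one query token per disease
visual_memory = torch.randn(100, 256)      # stored visual features
retrieved = hopfield_retrieve(disease_queries, visual_memory, beta=2.0)
```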

Authors:Soonbin Lee, Fangwen Shu, Yago Sanchez, Thomas Schierl, Cornelius Hellge
Title: Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs
Abstract:
3D Gaussian Splatting is a recognized method for 3D scene representation, known for its high rendering quality and speed. However, its substantial data requirements present challenges for practical applications. In this paper, we introduce an efficient compression technique that significantly reduces storage overhead by using compact representation. We propose a unified architecture that combines point cloud data and feature planes through a progressive tri-plane structure. Our method utilizes 2D feature planes, enabling continuous spatial representation. To further optimize these representations, we incorporate entropy modeling in the frequency domain, specifically designed for standard video codecs. We also propose channel-wise bit allocation to achieve a better trade-off between bitrate consumption and feature plane representation. Consequently, our model effectively leverages spatial correlations within the feature planes to enhance rate-distortion performance using standard, non-differentiable video codecs. Experimental results demonstrate that our method outperforms existing methods in data compactness while maintaining high rendering quality. Our project page is available at https://fraunhoferhhi.github.io/CodecGS

Authors:Xuyang Wang, Ziang Cheng, Zhenyu Li, Jiayu Yang, Haorui Ji, Pan Ji, Mehrtash Harandi, Richard Hartley, Hongdong Li
Title: DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Texture Generation on 3D Meshes
Abstract:
This paper addresses the problem of generating textures for 3D mesh assets. Existing approaches often rely on image diffusion models to generate multi-view image observations, which are then transformed onto the mesh surface to produce a single texture. However, due to the gap between multi-view images and 3D space, such a process is susceptible to a range of issues such as geometric inconsistencies, visibility occlusion, and baking artifacts. To overcome this problem, we propose a novel approach that directly generates texture on 3D meshes. Our approach leverages heat dissipation diffusion, which serves as an efficient operator that propagates features on the geometric surface of a mesh, while remaining insensitive to the specific layout of the wireframe. By integrating this technique into a generative diffusion pipeline, we significantly improve the efficiency of texture generation compared to existing texture generation methods. We term our approach DoubleDiffusion, as it combines heat dissipation diffusion with denoising diffusion to enable native generative learning on 3D mesh surfaces.
中文: 本文提出DoubleDiffusion方法,通过结合热耗散扩散和去噪扩散直接在三维网格表面生成纹理,有效解决了传统多视图图像方法存在的几何不一致性问题,并显著提升了纹理生成效率。
English: This paper introduces DoubleDiffusion, a novel method that directly generates textures on 3D meshes by combining heat dissipation diffusion with denoising diffusion, overcoming geometric inconsistencies and improving efficiency compared to traditional multi-view image approaches.
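Heat dissipation diffusion on a mesh amounts to repeatedly applying a surface Laplacian to per-vertex features. A minimal sketch, assuming an explicit-Euler scheme and a uniform graph Laplacian (the paper's operator may instead be a cotangent Laplacian):

```python
import numpy as np
import scipy.sparse as sp

def mesh_heat_diffusion(features, laplacian, dt=1e-3, steps=10):
    """Propagate per-vertex features over a mesh surface by explicit heat diffusion.

    features:  [num_vertices, channels] array of per-vertex features.
    laplacian: sparse [V, V] mesh Laplacian.
    Each step applies x <- x - dt * L x, smoothing features along the surface
    regardless of the wireframe layout. A sketch of the operator, not the paper's code.
    """
    x = features.copy()
    for _ in range(steps):
        x = x - dt * (laplacian @ x)
    return x

def uniform_laplacian(num_vertices, edges):
    """Uniform graph Laplacian L = D - A built from mesh edges ([E, 2] vertex indices)."""
    rows = np.concatenate([edges[:, 0], edges[:, 1]])
    cols = np.concatenate([edges[:, 1], edges[:, 0]])
    adj = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(num_vertices, num_vertices))
    deg = sp.diags(np.asarray(adj.sum(axis=1)).ravel())
    return (deg - adj).tocsr()
```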

Authors:Jathushan Rajasegaran, Xinlei Chen, Rulilong Li, Christoph Feichtenhofer, Jitendra Malik, Shiry Ginosar
Title: Gaussian Masked Autoencoders
Abstract:
This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at https://brjathu.github.io/gmae

Authors:Tingyang Zhang, Chen Wang, Zhiyang Dou, Qingzhe Gao, Jiahui Lei, Baoquan Chen, Lingjie Liu
Title: ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking
Abstract:
We propose ProTracker, a novel framework for accurate and robust long-term dense tracking of arbitrary points in videos. Previous methods relying on global cost volumes effectively handle large occlusions and scene changes but lack precision and temporal awareness. In contrast, local iteration-based methods accurately track smoothly transforming scenes but face challenges with occlusions and drift. To address these issues, we propose a probabilistic framework that marries the strengths of both paradigms by leveraging local optical flow for predictions and refined global heatmaps for observations. This design effectively combines global semantic information with temporally aware low-level features, enabling precise and robust long-term tracking of arbitrary points in videos. Extensive experiments demonstrate that ProTracker attains state-of-the-art performance among optimization-based approaches and surpasses supervised feed-forward methods on multiple benchmarks. The code and model will be released after publication.
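The probabilistic integration of a local-flow prediction with a global-heatmap observation can be read as a precision-weighted (Kalman-style) fusion of two Gaussians. A minimal sketch of that idea, with isotropic variances assumed purely for illustration:

```python
import numpy as np

def fuse_gaussian_estimates(mu_pred, var_pred, mu_obs, var_obs):
    """Fuse a flow-based prediction with a heatmap observation of a 2D point.

    Both estimates are modeled as isotropic Gaussians over the point location;
    the product of the two densities gives a precision-weighted mean, as in a
    Kalman update. A generic sketch of the probabilistic-integration idea.
    """
    w_pred = 1.0 / var_pred           # precision of the local-flow prediction
    w_obs = 1.0 / var_obs             # precision of the global-heatmap observation
    mu = (w_pred * mu_pred + w_obs * mu_obs) / (w_pred + w_obs)
    var = 1.0 / (w_pred + w_obs)
    return mu, var

# toy usage: an occluded frame gives a flat heatmap (large var_obs), so the
# fused estimate falls back on the optical-flow prediction.
mu, var = fuse_gaussian_estimates(np.array([120.0, 64.0]), 4.0,
                                  np.array([118.0, 66.0]), 1.0)
```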

Authors:Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Abstract:
Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interactions in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities: Decision and Reaction require contrary Perception scales and granularity, and the autoregressive decoding blocks real-time Perception and Decision during the Reaction. To unify the conflicting capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at \url{https://github.com/Mark12Ding/Dispider}.
中文: Dispider通过解耦感知、决策和反应能力,并采用异步设计,实现了视频大语言模型的主动实时交互,确保在流媒体视频分析中提供高效、及时的响应。
English: Dispider introduces a disentangled and asynchronous system that enables active real-time interaction with video LLMs by separating perception, decision, and reaction capabilities, ensuring efficient and timely responses for streaming video analysis.

Authors:Valery Istomin, Oleg Pereziabov, Ilya Afanasyev
Title: Geometry Restoration and Dewarping of Camera-Captured Document Images
Abstract:
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
中文: 本研究提出了一种结合深度学习轮廓检测与计算机视觉技术的文档复原方法,通过非线性畸变校正显著提升了OCR可读性和几何复原指标,性能优于当前主流方案。
English: This study introduces an efficient document restoration method combining deep learning for outline detection with computer vision techniques to correct distortions, demonstrating superior performance in OCR readability and geometry metrics compared to existing solutions.
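A stripped-down version of the grid-remapping step, assuming the top and bottom document edges have already been detected, fits cubic polynomials to those edges and resamples the image with cv2.remap. The helper name and the simple vertical blend are illustrative; the full pipeline also models the side edges and finer nonlinear distortions.

```python
import cv2
import numpy as np

def dewarp_between_curves(image, top_pts, bottom_pts, out_h=800, out_w=600):
    """Rectify a page region bounded by two roughly horizontal outline curves.

    top_pts / bottom_pts: Nx2 arrays of (x, y) points sampled on the detected
    top and bottom document edges. Cubic polynomials are fitted to each edge,
    a source-coordinate grid is interpolated between them, and cv2.remap
    resamples the image. A simplified sketch of the grid-remapping idea only.
    """
    top_poly = np.polyfit(top_pts[:, 0], top_pts[:, 1], deg=3)
    bot_poly = np.polyfit(bottom_pts[:, 0], bottom_pts[:, 1], deg=3)

    x_min, x_max = top_pts[:, 0].min(), top_pts[:, 0].max()
    us = np.linspace(x_min, x_max, out_w)                 # source x per output column
    y_top = np.polyval(top_poly, us)
    y_bot = np.polyval(bot_poly, us)

    vs = np.linspace(0.0, 1.0, out_h)[:, None]            # vertical blend factor
    map_x = np.broadcast_to(us[None, :], (out_h, out_w)).astype(np.float32)
    map_y = ((1.0 - vs) * y_top[None, :] + vs * y_bot[None, :]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```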

Authors:Yuxiang Bao, Guoliang Kang, Linlin Yang, Xiaoyue Duan, Bo Zhao, Baochang Zhang
Title: Normalizing Batch Normalization for Long-Tailed Recognition
Abstract:
In real-world scenarios, the number of training samples across classes usually follows a long-tailed distribution. The conventionally trained network may achieve unexpectedly inferior performance on the rare class compared to the frequent class. Most previous works attempt to rectify the network bias at the data level or at the classifier level. Differently, in this paper, we identify that the bias towards the frequent class may be encoded into features, i.e., the rare-specific features which play a key role in discriminating the rare class are much weaker than the frequent-specific features. Based on such an observation, we introduce a simple yet effective approach, normalizing the parameters of the Batch Normalization (BN) layer to explicitly rectify the feature bias. To this end, we represent the Weight/Bias parameters of a BN layer as a vector, normalize it into a unit one and multiply the unit vector by a scalar learnable parameter. By decoupling the direction and magnitude of the BN-layer parameters during learning, the Weight/Bias exhibits a more balanced distribution and thus the strength of features becomes more even. Extensive experiments on various long-tailed recognition benchmarks (i.e., CIFAR-10/100-LT, ImageNet-LT and iNaturalist 2018) show that our method outperforms previous state-of-the-arts remarkably. The code and checkpoints are available at https://github.com/yuxiangbao/NBN.
中文: 本文针对长尾数据集中的类别不平衡问题,提出通过归一化批归一化层参数来纠正特征偏差的方法,在多个基准测试中显著提升了识别性能。
English: This paper addresses class imbalance in long-tailed datasets by proposing a method to normalize Batch Normalization parameters, which rectifies feature bias and improves recognition performance across various benchmarks.
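The re-parameterization itself is a small change to a standard BatchNorm layer: treat the per-channel Weight/Bias vectors as directions and learn one scalar magnitude for each. A minimal PyTorch sketch (the initialization choices are assumptions made so the layer starts out close to ordinary BN, not the released NBN code):

```python
import torch
import torch.nn as nn

class NormalizedBatchNorm2d(nn.BatchNorm2d):
    """BatchNorm whose affine Weight/Bias are re-parameterized as
    (unit direction) x (learnable scalar magnitude).

    Decoupling direction and magnitude keeps the per-channel affine parameters
    on a comparable scale, which evens out feature strength between frequent
    and rare classes. A minimal sketch of the idea described in the abstract.
    """

    def __init__(self, num_features, **kwargs):
        super().__init__(num_features, affine=True, **kwargs)
        # scalar magnitudes; weight_scale is set so the effective weight starts at 1
        self.weight_scale = nn.Parameter(torch.tensor(float(num_features) ** 0.5))
        self.bias_scale = nn.Parameter(torch.zeros(1))
        nn.init.normal_(self.bias, mean=0.0, std=0.01)  # small init so its direction is defined

    def forward(self, x):
        # normalize the whole weight/bias vectors, then rescale by one scalar each
        w_dir = self.weight / (self.weight.norm() + 1e-8)
        b_dir = self.bias / (self.bias.norm() + 1e-8)
        return nn.functional.batch_norm(
            x, self.running_mean, self.running_var,
            weight=w_dir * self.weight_scale,
            bias=b_dir * self.bias_scale,
            training=self.training, momentum=self.momentum, eps=self.eps)
```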

Authors:Haojin Li, Heng Li, Jianyu Chen, Rihan Zhong, Ke Niu, Huazhu Fu, Jiang Liu
Title: AIF-SFDA: Autonomous Information Filter-driven Source-Free Domain Adaptation for Medical Image Segmentation
Abstract:
Decoupling domain-variant information (DVI) from domain-invariant information (DII) serves as a prominent strategy for mitigating domain shifts in the practical implementation of deep learning algorithms. However, in medical settings, concerns surrounding data collection and privacy often restrict access to both training and test data, hindering the empirical decoupling of information by existing methods. To tackle this issue, we propose an Autonomous Information Filter-driven Source-free Domain Adaptation (AIF-SFDA) algorithm, which leverages a frequency-based learnable information filter to autonomously decouple DVI and DII. Information Bottleneck (IB) and Self-supervision (SS) are incorporated to optimize the learnable frequency filter. The IB governs the information flow within the filter to diminish redundant DVI, while SS preserves DII in alignment with the specific task and image modality. Thus, the autonomous information filter can overcome domain shifts relying solely on target data. A series of experiments covering various medical image modalities and segmentation tasks were conducted to demonstrate the benefits of AIF-SFDA through comparisons with leading algorithms and ablation studies. The code is available at https://github.com/JingHuaMan/AIF-SFDA.
中文: 提出的AIF-SFDA算法通过基于频率的可学习信息过滤器,结合信息瓶颈与自监督优化,实现了无需源数据的域自适应,在多种医学影像分割任务中有效克服了域偏移问题。
English: The proposed AIF-SFDA algorithm autonomously decouples domain-variant and domain-invariant information using a frequency-based filter optimized by Information Bottleneck and Self-supervision, effectively overcoming domain shifts in medical image segmentation without requiring source data access.
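The learnable frequency filter can be sketched as a sigmoid gate applied to the 2D FFT of the input, with an information-bottleneck-style sparsity penalty on the gate. The module below is a schematic of that mechanism under these assumptions, not the released AIF-SFDA code.

```python
import torch
import torch.nn as nn

class LearnableFrequencyFilter(nn.Module):
    """Learnable soft mask applied in the 2D frequency domain.

    The image (or feature map) is transformed with an FFT, multiplied by a
    per-frequency sigmoid gate, and transformed back. Training the gate with an
    information-bottleneck penalty (keep the gate small) plus a task-specific
    self-supervised loss captures the spirit of the autonomous filter.
    """

    def __init__(self, height, width):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(height, width))   # one gate per frequency

    def forward(self, x):
        gate = torch.sigmoid(self.logits)                        # (H, W) in [0, 1]
        freq = torch.fft.fft2(x, dim=(-2, -1))
        filtered = torch.fft.ifft2(freq * gate, dim=(-2, -1)).real
        return filtered, gate

# an information-bottleneck-style sparsity penalty on the gate could be:
# ib_loss = gate.mean()
```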

Authors:Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak
Title: Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
Abstract:
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked-cross attention objective, integrating object-specific prompts into corresponding latent space regions and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new, challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at https://guyyariv.github.io/TTM/.

Authors:Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
Title: TransPixeler: Advancing Text-to-Video Generation with Transparency
Abstract:
Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.

Authors:Xianhao Zhou, Jianghao Wu, Huangxuan Zhao, Lei Chen, Shaoting Zhang, Guotai Wang
Title: GLFC: Unified Global-Local Feature and Contrast Learning with Mamba-Enhanced UNet for Synthetic CT Generation from CBCT
Abstract:
Generating synthetic Computed Tomography (CT) images from Cone Beam Computed Tomography (CBCT) is desirable for improving the image quality of CBCT. Existing synthetic CT (sCT) generation methods using Convolutional Neural Networks (CNNs) and Transformers often face difficulties in effectively capturing both global and local features and contrasts for high-quality sCT generation. In this work, we propose a Global-Local Feature and Contrast learning (GLFC) framework for sCT generation. First, a Mamba-Enhanced UNet (MEUNet) is introduced by integrating Mamba blocks into the skip connections of a high-resolution UNet for effective global and local feature learning. Second, we propose a Multiple Contrast Loss (MCL) that calculates the synthesis loss at different intensity windows to improve quality for both soft tissues and bone regions. Experiments on the SynthRAD2023 dataset demonstrate that GLFC improved the SSIM of sCT from 77.91% to 91.50% compared with the original CBCT, and significantly outperformed several existing methods for sCT generation. The code is available at https://github.com/HiLab-git/GLFC
中文摘要:本研究提出了一种全局-局部特征与对比学习框架,通过结合Mamba增强的UNet架构和多窗口对比损失,显著提升了从锥形束CT生成合成CT的图像质量,在SynthRAD2023数据集上实现了91.50%的结构相似性指数。
English Summary: This study introduces a Global-Local Feature and Contrast learning (GLFC) framework that combines Mamba-enhanced UNet architecture with multi-window contrast loss to significantly improve synthetic CT image quality from CBCT data, achieving 91.50% SSIM on SynthRAD2023 benchmark.
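The Multiple Contrast Loss idea, computing the synthesis loss under several intensity windows so that soft tissue and bone are both weighted fairly, can be sketched as follows. The specific window centers/widths and the L1 form are assumptions for illustration, not the official GLFC configuration.

```python
import torch

def windowed(img, center, width):
    """Clip a CT-like intensity image to a given window and rescale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (img.clamp(lo, hi) - lo) / (hi - lo)

def multiple_contrast_loss(pred_hu, target_hu, windows=((40, 400), (300, 1500))):
    """L1 synthesis loss evaluated under several intensity windows.

    `windows` lists (center, width) pairs in Hounsfield units, e.g. a soft-tissue
    window and a bone window, so errors in both regions contribute comparably.
    """
    loss = 0.0
    for center, width in windows:
        loss = loss + (windowed(pred_hu, center, width)
                       - windowed(target_hu, center, width)).abs().mean()
    return loss / len(windows)
```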

Authors:Jiexi Zhong, Zhiheng Li, Yubo Cui, Zheng Fang
Title: 4D-CS: Exploiting Cluster Prior for 4D Spatio-Temporal LiDAR Semantic Segmentation
Abstract:
Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches explore spatio-temporal information of multi-scan to identify the semantic classes and motion states for each point. However, these methods often overlook the segmentation consistency in space and time, which may result in point clouds within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that can reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current feature through temporal fusion on multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point-wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore missing features due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to optimize segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on the multi-scan semantic and moving object segmentation on SemanticKITTI and nuScenes datasets. The code will be available at https://github.com/NEU-REAL/4D-CS.git.
中文: 提出的4D-CS方法通过结合基于点和基于聚类的双分支网络,利用多帧聚类标签提升激光雷达语义分割的时空一致性,在自动驾驶数据集上实现了最优性能。
English: The proposed 4D-CS method enhances LiDAR semantic segmentation consistency by integrating point-based and cluster-based branches with multi-frame cluster labels, achieving state-of-the-art performance on autonomous driving datasets.

Authors:Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Fadi Boutros, Raghavendra Ramachandra, Naser Damer
Title: FoundPAD: Foundation Models Reloaded for Face Presentation Attack Detection
Abstract:
Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown scenarios and large training data requirements. Foundation models (FM) are pre-trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task-specific adaptation even when little training data are available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification head. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at https://github.com/gurayozgur/FoundPAD.
中文: 本研究提出FoundPAD,这是一种高度泛化的呈现攻击检测系统,通过适配基础模型,能在数据有限的情况下有效应对未知场景的攻击威胁。
English: This work introduces FoundPAD, a highly generalizable presentation attack detection system that adapts foundation models to effectively counter unseen domain threats with minimal data requirements.

Authors:Asma Alkalbani, Muhammad Saqib, Ahmed Salim Alrawahi, Abbas Anwar, Chandarnath Adak, Saeed Anwar
Title: RDD4D: 4D Attention-Guided Road Damage Detection And Classification
Abstract:
Road damage detection and assessment are crucial components of infrastructure maintenance. However, current methods often struggle with detecting multiple types of road damage in a single image, particularly at varying scales. This is due to the lack of road datasets with various damage types having varying scales. To overcome this deficiency, first, we present a novel dataset called Diverse Road Damage Dataset (DRDD) for road damage detection that captures the diverse road damage types in individual images, addressing a crucial gap in existing datasets. Then, we provide our model, RDD4D, that exploits Attention4D blocks, enabling better feature refinement across multiple scales. The Attention4D module processes feature maps through an attention mechanism combining positional encoding and "Talking Head" components to capture local and global contextual information. In our comprehensive experimental analysis comparing various state-of-the-art models on our proposed dataset, our enhanced model demonstrated superior performance in detecting large-sized road cracks with an Average Precision (AP) of 0.458 and maintained competitive performance with an overall AP of 0.445. Moreover, we also provide results on the CrackTinyNet dataset; our model achieved around a 0.21 increase in performance. The code, model weights, dataset, and our results are available on \href{https://github.com/msaqib17/Road_Damage_Detection}{https://github.com/msaqib17/Road\_Damage\_Detection}.
Chinese: 本研究提出了新型多样化道路损伤数据集(DRDD)和采用Attention4D模块的RDD4D模型,显著提升了多尺度道路损伤检测能力,在大尺寸裂缝识别和整体损伤检测方面均表现出优越性能。
English: This study introduces a novel Diverse Road Damage Dataset (DRDD) and the RDD4D model with Attention4D blocks, which significantly improves multi-scale road damage detection and achieves superior performance in identifying large cracks and overall damage.

Authors:Niloufar Eghbali, Hassan Bagher-Ebadian, Tuka Alhanai, Mohammad M. Ghassemi
Title: GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic Features for Medical Image Segmentation
Abstract:
Vision Transformers (ViTs) have shown promise in medical image semantic segmentation (MISS) by capturing long-range correlations. However, ViTs often struggle to model local spatial information effectively, which is essential for accurately segmenting fine anatomical details, particularly when applied to small datasets without extensive pre-training. We introduce Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture enhancing Transformer-based models by incorporating learnable radiomic features. This approach integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, enhancing the feature representation processed by the Transformer model. Our method uniquely combines the long-range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features. Evaluated on the Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet demonstrates significant improvements over state-of-the-art models, achieving a 1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal computational overhead (only 15 and 30 additional parameters, respectively). GLoG-CSUnet's flexible design allows integration with various base models, offering a promising approach for incorporating radiomics-inspired feature extraction in Transformer architectures for medical image analysis. The code implementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.
中文: GLoG-CSUnet架构通过整合自适应Gabor和拉普拉斯高斯滤波器来增强视觉Transformer在医学图像分割中的表现,能有效捕捉细微纹理特征,在基准数据集上以极小的计算开销实现了性能突破。
English: The GLoG-CSUnet architecture enhances Vision Transformers for medical image segmentation by integrating adaptive Gabor and Laplacian of Gaussian filters to capture fine texture details, achieving superior performance on benchmark datasets with minimal computational overhead.
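A learnable Gabor layer regenerates its convolution kernels from a handful of trainable parameters per filter, which is how the method keeps its overhead to a few dozen parameters. The sketch below shows one way such a layer could look; it assumes single-channel input and omits the Laplacian-of-Gaussian branch and the Swin UNet it plugs into.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGaborConv2d(nn.Module):
    """Convolution whose kernels are Gabor filters with learnable parameters.

    Each output channel owns learnable (theta, sigma, lambda, psi, gamma); the
    kernel is regenerated from these parameters at every forward pass, so only a
    handful of parameters are trained per filter. A sketch of the adaptive Gabor idea.
    """

    def __init__(self, out_channels, kernel_size=7):
        super().__init__()
        self.kernel_size = kernel_size
        self.theta = nn.Parameter(torch.rand(out_channels) * math.pi)
        self.sigma = nn.Parameter(torch.full((out_channels,), 2.0))
        self.lambd = nn.Parameter(torch.full((out_channels,), 4.0))
        self.psi = nn.Parameter(torch.zeros(out_channels))
        self.gamma = nn.Parameter(torch.ones(out_channels))

    def _kernels(self):
        k, half = self.kernel_size, (self.kernel_size - 1) / 2.0
        coords = torch.linspace(-half, half, k, device=self.theta.device)
        ys, xs = torch.meshgrid(coords, coords, indexing="ij")
        xs, ys = xs[None], ys[None]                                   # [1, k, k]
        theta = self.theta[:, None, None]
        x_r = xs * torch.cos(theta) + ys * torch.sin(theta)
        y_r = -xs * torch.sin(theta) + ys * torch.cos(theta)
        sigma = self.sigma[:, None, None].abs() + 1e-3
        envelope = torch.exp(-(x_r ** 2 + (self.gamma[:, None, None] * y_r) ** 2)
                             / (2 * sigma ** 2))
        carrier = torch.cos(2 * math.pi * x_r / (self.lambd[:, None, None].abs() + 1e-3)
                            + self.psi[:, None, None])
        return (envelope * carrier)[:, None]                          # [out, 1, k, k]

    def forward(self, x):
        # expects single-channel input [B, 1, H, W] in this simplified sketch
        return F.conv2d(x, self._kernels(), padding=self.kernel_size // 2)
```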

Authors:Binyu Zhang, Zhu Meng, Junhao Dong, Fei Su, Zhicheng Zhao
Title: ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction
Abstract:
Survival prediction is a crucial task in the medical field and is essential for optimizing treatment options and resource allocation. However, current methods often rely on limited data modalities, resulting in suboptimal performance. In this paper, we propose an Integrated Cross-modal Fusion Network (ICFNet) that integrates histopathology whole slide images, genomic expression profiles, patient demographics, and treatment protocols. Specifically, three types of encoders, a residual orthogonal decomposition module and a unification fusion module are employed to merge multi-modal features to enhance prediction accuracy. Additionally, a balanced negative log-likelihood loss function is designed to ensure fair training across different patients. Extensive experiments demonstrate that our ICFNet outperforms state-of-the-art algorithms on five public TCGA datasets, including BLCA, BRCA, GBMLGG, LUAD, and UCEC, and shows its potential to support clinical decision-making and advance precision medicine. The codes are available at: https://github.com/binging512/ICFNet.
中文: 本文提出的ICFNet通过整合病理图像、基因组数据和临床信息的多模态融合网络,在五种癌症数据集上显著提升了生存预测准确率,展现出良好的临床应用前景。
English: This paper introduces ICFNet, a novel multi-modal fusion network that integrates histopathology images, genomic data, and clinical information to significantly improve survival prediction accuracy across five cancer datasets, demonstrating strong potential for clinical application.

Authors:Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong
Title: Visual Large Language Models for Generalized and Specialized Applications
Abstract:
Visual-language models (VLM) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining increasing attention for building general-purpose VLMs. Despite the significant progress made in VLLMs, the related literature remains limited, particularly from a comprehensive application perspective, encompassing generalized and specialized applications across vision (image, video, depth), action, and language modalities. In this survey, we focus on the diverse applications of VLLMs, examining their usage scenarios, identifying ethical considerations and challenges, and discussing future directions for their development. By synthesizing this content, we aim to provide a comprehensive guide that will pave the way for future innovations and broader applications of VLLMs. The paper list repository is available at: https://github.com/JackYFL/awesome-VLLMs.
Chinese: 本综述全面探讨了视觉大语言模型(VLLMs)在多模态领域的广泛应用,分析其使用场景、伦理考量、挑战及未来发展方向,旨在为后续创新提供系统指导。
English: This survey comprehensively examines the diverse applications of visual large language models (VLLMs) across multiple modalities, addressing their usage scenarios, ethical considerations, challenges, and future directions to guide further innovation.

Authors:Xiaojiao Guo, Xuhang Chen, Shuqiang Wang, Chi-Man Pun
Title: Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis
Abstract:
Underwater imaging grapples with challenges from light-water interactions, leading to color distortions and reduced clarity. In response to these challenges, we propose a novel Color Balance Prior \textbf{Guided} \textbf{Hyb}rid \textbf{Sens}e \textbf{U}nderwater \textbf{I}mage \textbf{R}estoration framework (\textbf{GuidedHybSensUIR}). This framework operates on multiple scales, employing the proposed \textbf{Detail Restorer} module to restore low-level detailed features at finer scales and utilizing the proposed \textbf{Feature Contextualizer} module to capture long-range contextual relations of high-level general features at a broader scale. The hybridization of these different scales of sensing results effectively addresses color casts and restores blurry details. In order to effectively point out the evolutionary direction for the model, we propose a novel \textbf{Color Balance Prior} as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase. We construct a comprehensive benchmark using paired training data from three real-world underwater datasets and evaluate on six test sets, including three paired and three unpaired, sourced from four real-world underwater datasets. Subsequently, we tested 14 traditional methods and retrained 23 existing deep learning underwater image restoration methods on this benchmark, obtaining metric results for each approach. This effort aims to furnish a valuable benchmark dataset as a standard basis for comparison. The extensive experiment results demonstrate that our method outperforms 37 other state-of-the-art methods overall on various benchmark datasets and metrics, despite not achieving the best results in certain individual cases. The code and dataset are available at \href{https://github.com/CXH-Research/GuidedHybSensUIR}{https://github.com/CXH-Research/GuidedHybSensUIR}.
Chinese: 提出的GuidedHybSensUIR框架通过结合多尺度细节恢复和新型色彩平衡先验,有效修复水下图像,在多个基准测试中优于37种现有最优方法。
English: The proposed GuidedHybSensUIR framework effectively restores underwater images by combining multi-scale detail restoration with a novel Color Balance Prior, outperforming 37 state-of-the-art methods across various benchmarks.

Authors:Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
Title: GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
Abstract:
4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.
中文: GS-DiT框架通过引入伪4D高斯场实现了先进的四维视频控制,能够生成具有多样化相机参数和电影特效的动态内容,有效克服了现有视频生成方法的局限性。
English: The GS-DiT framework introduces pseudo 4D Gaussian fields to enable advanced 4D video control, allowing dynamic content generation with diverse camera parameters and cinematic effects, overcoming the limitations of existing video generation methods.

Authors:Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang
Title: DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
Abstract:
Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
Chinese: 提出的DepthMaster模型通过特征对齐模块增强语义表示和傅里叶增强模块保留细节,在单步扩散框架中实现了卓越的泛化能力和推理效率,显著提升了单目深度估计性能。
English: The proposed DepthMaster model enhances monocular depth estimation by integrating a Feature Alignment module for semantic enrichment and a Fourier Enhancement module for detail preservation, achieving superior generalization and inference efficiency in a single-step diffusion framework.

Authors:Yibo Zhang
Title: KM-UNet KAN Mamba UNet for medical image segmentation
Abstract:
Medical image segmentation is a critical task in medical imaging analysis. Traditional CNN-based methods struggle with modeling long-range dependencies, while Transformer-based models, despite their success, suffer from quadratic computational complexity. To address these limitations, we propose KM-UNet, a novel U-shaped network architecture that combines the strengths of Kolmogorov-Arnold Networks (KANs) and state-space models (SSMs). KM-UNet leverages the Kolmogorov-Arnold representation theorem for efficient feature representation and SSMs for scalable long-range modeling, achieving a balance between accuracy and computational efficiency. We evaluate KM-UNet on five benchmark datasets: ISIC17, ISIC18, CVC, BUSI, and GLAS. Experimental results demonstrate that KM-UNet achieves competitive performance compared to state-of-the-art methods in medical image segmentation tasks. To the best of our knowledge, KM-UNet is the first medical image segmentation framework integrating KANs and SSMs. This work provides a valuable baseline and new insights for the development of more efficient and interpretable medical image segmentation systems. The code is open source at https://github.com/2760613195/KM_UNet
Keywords: KAN, Mamba, state-space models, UNet, medical image segmentation, deep learning
中文:KM-UNet是一种新颖的U型网络,结合Kolmogorov-Arnold网络与状态空间模型,在多个基准数据集上实现了高效精准的医学图像分割,展现出卓越性能。
English: KM-UNet is a novel U-shaped network that integrates Kolmogorov-Arnold Networks and state-space models to achieve efficient and accurate medical image segmentation, demonstrating competitive performance on multiple benchmark datasets.

Authors:Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang, Sungchul Choi
Title: Multi-LLM Collaborative Caption Generation in Scientific Documents
Abstract:
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
中文摘要:MLBCAP框架通过协同多个专用大语言模型,实现了训练数据质量评估、多样化描述生成与最优描述筛选的三步优化,在科学图表标注任务中生成比人工撰写更优质的结果。
English Summary: The MLBCAP framework enhances scientific figure captioning by employing multiple specialized LLMs to filter low-quality data, generate diverse candidate captions, and select the most accurate description, outperforming human-written captions in evaluations.

Authors:Dawei Dai, Mingming Jia, Yinxiu Zhou, Hang Xing, Chenghang Li
Title: Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation
Abstract:
Facial images have extensive practical applications. Although the current large-scale text-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only a text prompt. Image prompts are a logical choice. However, current methods of this type generally focus on the general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we built a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features for the facial image, integrating these into the diffusion model to enhance the preservation of facial identity features. Validation on two face-related test datasets demonstrates that our Face-MakeUp can achieve the best comprehensive performance. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/Face-MakeUp
中文: 本文提出Face-MakeUp模型,通过提取参考图像的多尺度内容和姿态特征并融入扩散模型,基于新建的400万高质量人脸图像-文本数据集进行训练,有效提升生成面部图像的身份一致性,在测试中表现最佳。
English: This paper introduces Face-MakeUp, a model that enhances facial image generation by integrating multi-scale content and pose features from reference images, trained on a new dataset of 4 million face image-text pairs to improve identity consistency and achieve top performance.

Authors:Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim
Title: Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection
Abstract:
The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR
中文摘要:本文提出了一种视频上下文感知关键词注意力模块,通过捕捉关键词在整体视频语境中的变化并利用上下文聚类和对比学习增强视觉与文本特征的细粒度对齐,在多个基准测试中显著提升了视频片段检索和高亮检测的性能。
English Summary: This paper introduces a Video Context-aware Keyword Attention module that enhances video moment retrieval and highlight detection by capturing keyword variations and improving visual-text alignment through context clustering and contrastive learning, achieving superior performance on benchmark datasets.

Authors:Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang
Title: GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection
Abstract:
Collaborative perception significantly enhances autonomous driving safety by extending each vehicle's perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect malicious agents based on single-shot outliers, they overlook temporal message correlations, which can be circumvented by subtle yet harmful perturbations in model input and output spaces. This paper reveals a novel blind area confusion (BAC) attack that compromises existing single-shot outlier-based detection methods. As a countermeasure, we propose GCP, a Guarded Collaborative Perception framework based on spatial-temporal aware malicious agent detection, which maintains single-shot spatial consistency through a confidence-scaled spatial concordance loss, while simultaneously examining temporal anomalies by reconstructing historical bird's eye view motion flows in low-confidence regions. We also employ a joint spatial-temporal Benjamini-Hochberg test to synthesize dual-domain anomaly results for reliable malicious agent detection. Extensive experiments demonstrate GCP's superior performance under diverse attack scenarios, achieving up to 34.69% improvements in AP@0.5 compared to the state-of-the-art CP defense strategies under BAC attacks, while maintaining consistent 5-8% improvements under other typical attacks. Code will be released at https://github.com/CP-Security/GCP.git.
中文: 本文揭示了一种新型盲区混淆攻击对协同感知系统的威胁,并提出基于时空检测的GCP防御框架,该框架在多种攻击场景下均展现出优越性能,检测精度提升最高达34.69%。
English: This paper introduces a novel blind area confusion attack that exploits vulnerabilities in collaborative perception systems and proposes GCP, a guarded framework using spatial-temporal detection to effectively counter such attacks with significant performance improvements.
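The joint spatial-temporal test is, at its core, a Benjamini-Hochberg procedure over per-agent p-values drawn from the two anomaly domains. A textbook BH sketch follows; the way the spatial and temporal p-values are combined in the toy usage is an assumption for illustration, not the paper's exact rule.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure: return a boolean mask of rejected hypotheses.

    Here each hypothesis is 'agent i is benign'; rejecting it flags the agent
    as malicious while controlling the false discovery rate at level alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

# toy usage: combine a spatial and a temporal p-value per agent with a simple
# Bonferroni-style rule, then flag malicious agents.
spatial_p = np.array([0.60, 0.02, 0.45])
temporal_p = np.array([0.55, 0.01, 0.40])
joint_p = np.minimum(1.0, 2 * np.minimum(spatial_p, temporal_p))
malicious = benjamini_hochberg(joint_p, alpha=0.05)
```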

Authors:Sichao Wang, Ming Yuan, Chuang Zhang, Qing Xu, Lei He, Jianqiang Wang
Title: V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection
Abstract:
In V2X collaborative perception, the domain gaps between heterogeneous nodes pose a significant challenge for effective information fusion. Pose errors arising from latency and GPS localization noise further exacerbate the issue by leading to feature misalignment. To overcome these challenges, we propose V2X-DGPE, a high-accuracy and robust V2X feature-level collaborative perception framework. V2X-DGPE employs a Knowledge Distillation Framework and a Feature Compensation Module to learn domain-invariant representations from multi-source data, effectively reducing the feature distribution gap between vehicles and roadside infrastructure. Historical information is utilized to provide the model with a more comprehensive understanding of the current scene. Furthermore, a Collaborative Fusion Module leverages a heterogeneous self-attention mechanism to extract and integrate heterogeneous representations from vehicles and infrastructure. To address pose errors, V2X-DGPE introduces a deformable attention mechanism, enabling the model to adaptively focus on critical parts of the input features by dynamically offsetting sampling points. Extensive experiments on the real-world DAIR-V2X dataset demonstrate that the proposed method outperforms existing approaches, achieving state-of-the-art detection performance. The code is available at https://github.com/wangsch10/V2X-DGPE.
中文:提出的V2X-DGPE框架通过知识蒸馏、特征补偿和可变形注意力机制解决V2X协同感知中的域差异和位姿误差问题,在真实场景数据集中实现了最优性能。
English: The proposed V2X-DGPE framework addresses domain gaps and pose errors in V2X collaborative perception through knowledge distillation, feature compensation, and deformable attention mechanisms, achieving state-of-the-art performance on real-world datasets.

Authors:Liye Jia, Runwei Guan, Haocheng Zhao, Qiuchi Zhao, Ka Lok Man, Jeremy Smith, Limin Yu, Yutao Yue
Title: RadarNeXt: Real-Time and Reliable 3D Object Detector Based On 4D mmWave Imaging Radar
Abstract:
3D object detection is crucial for Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). However, most 3D detectors prioritize detection accuracy, often overlooking network inference speed in practical applications. In this paper, we propose RadarNeXt, a real-time and reliable 3D object detector based on 4D mmWave radar point clouds. It leverages re-parameterizable neural networks to capture multi-scale features, reduce memory cost, and accelerate inference. Moreover, to highlight the irregular foreground features of radar point clouds and suppress background clutter, we propose a Multi-path Deformable Foreground Enhancement Network (MDFEN), ensuring detection accuracy while minimizing the sacrifice in speed and avoiding an excessive number of parameters. Experimental results on the View-of-Delft and TJ4DRadSet datasets validate the exceptional performance and efficiency of RadarNeXt, achieving 50.48 and 32.30 mAP with the variant using our proposed MDFEN. Notably, our RadarNeXt variants achieve inference speeds of over 67.10 FPS on the RTX A4000 GPU and 28.40 FPS on the Jetson AGX Orin. This research demonstrates that RadarNeXt brings a novel and effective paradigm for 3D perception based on 4D mmWave radar.
中文: RadarNeXt是一种基于4D毫米波雷达的实时3D物体检测器,通过多尺度特征提取和前景增强网络在精度与速度间取得平衡,在自动驾驶数据集中表现卓越。
English: RadarNeXt is a real-time 3D object detector using 4D mmWave radar that balances accuracy and speed through multi-scale feature extraction and a foreground enhancement network, achieving high performance on autonomous driving datasets.

Authors:Mengting Wei, Tuomas Varanka, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Title: MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
Abstract:
We address the problem of facial expression editing by controlling the relative variation of facial action units (AUs) of the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of the current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.
中文: MagicFace是一种基于扩散模型的精细化面部表情编辑方法,通过控制动作单元变化并利用ID编码器和属性控制器,在保持身份、姿态和背景一致性的同时实现连续可解释的表情编辑。
English: MagicFace is a diffusion model that enables fine-grained facial expression editing by controlling action-unit variations while preserving identity, pose, and background through an ID encoder and attribute controller.

Authors:Mian Zou, Baosheng Yu, Yibing Zhan, Kede Ma
Title: Self-Supervised Learning for Detecting AI-Generated Faces as Anomalies
Abstract:
The detection of AI-generated faces is commonly approached as a binary classification task. Nevertheless, the resulting detectors frequently struggle to adapt to novel AI face generators, which evolve rapidly. In this paper, we describe an anomaly detection method for AI-generated faces by leveraging self-supervised learning of camera-intrinsic and face-specific features purely from photographic face images. The success of our method lies in designing a pretext task that trains a feature extractor to rank four ordinal exchangeable image file format (EXIF) tags and classify artificially manipulated face images. Subsequently, we model the learned feature distribution of photographic face images using a Gaussian mixture model. Faces with low likelihoods are flagged as AI-generated. Both quantitative and qualitative experiments validate the effectiveness of our method. Our code is available at \url{https://github.com/MZMMSEC/AIGFD_EXIF.git}.
中文: 本文提出一种基于自监督学习的异常检测方法,通过从摄影人脸图像中学习相机固有和面部特征,有效识别AI生成的人脸,实验验证了其有效性。
English: This paper introduces an anomaly detection method for AI-generated faces using self-supervised learning of camera and facial features from photographic images, validated by experiments to effectively identify synthetic faces.
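The final anomaly-detection stage, fitting a Gaussian mixture to features of photographic faces and flagging low-likelihood faces, can be sketched with scikit-learn. The feature files, component count, and threshold rule below are hypothetical; the pretext-task feature extractor (EXIF-tag ranking plus manipulation classification) is assumed to exist already.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a Gaussian mixture to self-supervised features of *photographic* faces only,
# then flag test faces whose log-likelihood falls below a validation threshold.
train_feats = np.load("photographic_face_features.npy")      # hypothetical [N, D] file
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(train_feats)

val_feats = np.load("photographic_val_features.npy")          # hypothetical held-out reals
threshold = np.percentile(gmm.score_samples(val_feats), 1)    # ~1% false-positive rate

test_feats = np.load("test_face_features.npy")                # hypothetical test set
is_ai_generated = gmm.score_samples(test_feats) < threshold
```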

Authors:Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, Guangyao Shi
Title: A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
Abstract:
Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification [93]. With their rapid advancements in research and growing popularity in various applications, we provide a comprehensive survey of VLMs. Specifically, we provide a systematic overview of VLMs in the following aspects: [1] model information of the major VLMs developed up to 2025; [2] the transition of VLM architectures and the newest VLM alignment methods; [3] summary and categorization of the popular benchmarks and evaluation metrics of VLMs; [4] the challenges and issues faced by current VLMs such as hallucination, alignment, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Vision-Language-Models-Overview.
中文: 多模态视觉语言模型融合计算机视觉与自然语言处理,使机器能通过视觉和文本模态感知与推理,本综述系统梳理了其发展历程、架构、评估基准及幻觉与安全等挑战。
English: Multimodal Vision Language Models (VLMs) integrate computer vision and natural language processing to enable machines to perceive and reason through visual and textual data, with this survey systematically reviewing their development, architectures, benchmarks, and challenges like hallucination and safety.

Authors:Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
Title: Joint Optimization for 4D Human-Scene Reconstruction in the Wild
Abstract:
Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods by only training with labels predicted from JOSH, further demonstrating its accuracy and generalization ability.

Authors:Yuanpeng Tu, Xi Chen, Ser-Nam Lim, Hengshuang Zhao
Title: DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data
Abstract:
Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world. Despite claims of robust generalization, we find that the advancements of previous works are attributed mainly to trained categories, exposing a lack of generalization to novel classes. In this paper, we explore boosting existing models from a data-centric perspective. We propose DreamMask, which systematically explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. For the first part, we propose an automatic data generation pipeline with off-the-shelf models. We propose crucial designs for vocabulary expansion, layout arrangement, data filtering, etc. Equipped with these techniques, our generated data could significantly outperform the manually collected web data. To train the model with generated data, a synthetic-real alignment loss is designed to bridge the representation gap, bringing noticeable improvements across multiple benchmarks. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
Chinese: DreamMask提出了一种以数据为中心的方法,通过生成合成训练数据并与真实数据对齐来增强开放词汇全景分割,显著提升了模型的泛化能力,并在ADE20K等基准测试中大幅超越先前方法。
English: DreamMask introduces a data-centric approach to enhance open-vocabulary panoptic segmentation by generating synthetic training data and aligning it with real data, significantly improving model generalization and outperforming prior methods on benchmarks like ADE20K.

Authors:Juntao Zhang, Shaogeng Liu, Kun Bian, You Zhou, Pei Zhang, Jianning Liu, Jun Zhou, Bingyan Liu
Title: A Separable Self-attention Inspired by the State Space Model for Computer Vision
Abstract:
Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, introducing for the first time some of Mamba's excellent design concepts into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet achieves competitive results on image classification and high-resolution dense prediction tasks. Code is available at: https://github.com/yws-wxs/VMINet.
中文: VMINet提出了一种新颖的可分离自注意力方法,首次将Mamba的优秀设计理念融入其中,在图像分类和高分辨率密集预测任务中取得了有竞争力的结果,其架构简单且不同于传统Transformer。
English: VMINet introduces a novel separable self-attention method that incorporates Mamba's design concepts, achieving competitive performance in image classification and dense prediction tasks with a simple, non-Transformer architecture.

Authors:Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li
Title: Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models
Abstract:
With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term ``safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at \url{https://github.com/Ziwei-Zheng/SAHs}.
中文: 大型视觉语言模型比纯文本模型更易受安全风险威胁,但其内部的"安全头部"能在首个令牌生成时有效识别恶意提示,从而以最小计算成本实现高效防护。
English: Large vision-language models are more vulnerable to safety risks than text-only models, but their internal "safety heads" can detect malicious prompts during the first token generation, enabling effective protection with minimal computational cost.
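A hedged sketch of the detector idea described above: concatenate activations from a few presumed "safety heads" at the first generated token and train a logistic-regression classifier on them. The activation shapes, head count, and synthetic data below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
head_dim, n_safety_heads = 64, 3
n_benign, n_malicious = 200, 200

# Placeholder activations: in practice these would be collected from the LVLM's
# selected attention heads at the first generated token.
benign_acts = rng.normal(0.0, 1.0, size=(n_benign, n_safety_heads * head_dim))
malicious_acts = rng.normal(0.5, 1.0, size=(n_malicious, n_safety_heads * head_dim))

X = np.vstack([benign_acts, malicious_acts])
y = np.concatenate([np.zeros(n_benign), np.ones(n_malicious)])

detector = LogisticRegression(max_iter=1000).fit(X, y)
risk_score = detector.predict_proba(X[:1])[0, 1]   # probability the prompt is malicious
print(f"risk score: {risk_score:.3f}")
```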

Authors:Delin An, Chaoli Wang
Title: SurfPatch: Enabling Patch Matching for Exploratory Stream Surface Visualization
Abstract:
Unlike their line-based counterparts, surface-based techniques have yet to be thoroughly investigated in flow visualization due to their significant placement, speed, perception, and evaluation challenges. This paper presents SurfPatch, a novel framework supporting exploratory stream surface visualization. To begin with, we translate the issue of surface placement to surface selection and trace a large number of stream surfaces from a given flow field dataset. Then, we introduce a three-stage process: vertex-level classification, patch-level matching, and surface-level clustering that hierarchically builds the connection between vertices and patches and between patches and surfaces. This bottom-up approach enables fine-grained, multiscale patch-level matching, sharply contrasts surface-level matching offered by existing works, and provides previously unavailable flexibility during querying. We design an intuitive visual interface for users to conveniently visualize and analyze the underlying collection of stream surfaces in an exploratory manner. SurfPatch is not limited to stream surfaces traced from steady flow datasets. We demonstrate its effectiveness through experiments on stream surfaces produced from steady and unsteady flows as well as isosurfaces extracted from scalar fields. The code is available at https://github.com/adlsn/SurfPatch.
中文摘要:本文提出SurfPatch这一创新框架,通过分层多尺度面片匹配解决了基于表面的流场可视化难题,并借助直观的可视界面实现了灵活的探索式分析。
English Summary: This paper introduces SurfPatch, a novel framework that addresses the challenges in surface-based flow visualization by enabling hierarchical, multiscale patch-level matching and providing flexible querying through an intuitive visual interface.

Authors:Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
Title: FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models
Abstract:
The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily prune tokens based on importance metrics, such as cumulative attention scores. However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. We conduct a thorough study on token similarity characteristics, revealing three key insights: (1) spatially corresponding visual tokens between adjacent frames have higher cosine similarities compared to other token pairs; (2) high token similarities prominently decrease in deeper model layers; and (3) token similarity rankings are highly consistent across different layers. Guided by these observations, FrameFusion computes token similarities exclusively between corresponding visual tokens from adjacent frames, applies token merging at initial successive layers followed by pruning in deeper layers, and adopts a cascaded merging strategy to further enhance efficiency. We evaluate FrameFusion comprehensively across six diverse LVLMs, ranging from 2B to 72B parameters, using five video benchmarks encompassing video retrieval, question-answering, and spatial-temporal understanding tasks. Experiments show that FrameFusion reduces visual tokens by 70%, achieving 1.6-3.6x end-to-end speedups, with an average performance impact of less than 3%. Our code is available at: https://github.com/thu-nics/FrameFusion.
中文:FrameFusion通过结合基于相似性的令牌合并和基于重要性的剪枝,将大型视觉语言模型中的视觉令牌减少70%,在实现显著加速的同时保持了性能损失最小。
English: FrameFusion combines similarity-based token merging with importance-based pruning to reduce visual tokens by 70% in Large Vision-Language Models, achieving significant speed improvements with minimal performance loss.
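The core redundancy check can be sketched roughly as follows: cosine similarity is computed only between spatially corresponding tokens of adjacent frames, and highly similar tokens in the later frame are treated as redundant. The cascaded merging schedule and importance-based pruning from the paper are omitted; tensor shapes and the threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def reduce_video_tokens(tokens, threshold=0.95):
    """tokens: (num_frames, tokens_per_frame, dim) visual tokens.
    Frame 0 is kept in full; in later frames, a token whose cosine similarity to the
    spatially corresponding token of the previous frame exceeds the threshold is
    treated as redundant and dropped (represented by its earlier counterpart)."""
    kept = [tokens[0]]
    for t in range(1, tokens.shape[0]):
        sim = F.cosine_similarity(tokens[t - 1], tokens[t], dim=-1)  # per-position similarity
        kept.append(tokens[t][sim < threshold])                      # keep only dissimilar tokens
    return kept

# Toy video tokens: adjacent frames are correlated, as observed in the paper.
base = torch.randn(256, 1024)
video_tokens = base.unsqueeze(0) + 0.25 * torch.randn(8, 256, 1024)
reduced = reduce_video_tokens(video_tokens)
print([frame.shape[0] for frame in reduced])   # surviving token count per frame
```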

Authors:Atharva Divekar, Atharva Sonawane
Title: Leveraging AI for Automatic Classification of PCOS Using Ultrasound Imaging
Abstract:
The AUTO-PCOS Classification Challenge seeks to advance the diagnostic capabilities of artificial intelligence (AI) in identifying Polycystic Ovary Syndrome (PCOS) through automated classification of healthy and unhealthy ultrasound frames. This report outlines our methodology for building a robust AI pipeline utilizing transfer learning with the InceptionV3 architecture to achieve high accuracy in binary classification. Preprocessing steps ensured the dataset was optimized for training, validation, and testing, while interpretability methods like LIME and saliency maps provided valuable insights into the model's decision-making. Our approach achieved an accuracy of 90.52%, with precision, recall, and F1-score metrics exceeding 90% on validation data, demonstrating its efficacy. The project underscores the transformative potential of AI in healthcare, particularly in addressing diagnostic challenges like PCOS. Key findings, challenges, and recommendations for future enhancements are discussed, highlighting the pathway for creating reliable, interpretable, and scalable AI-driven medical diagnostic tools.
Chinese: AUTO-PCOS分类挑战通过InceptionV3迁移学习构建的AI模型在多囊卵巢综合征超声图像分类中准确率超过90%,展现了人工智能在医疗诊断领域的变革潜力。
English: The AUTO-PCOS Classification Challenge developed an AI pipeline using InceptionV3 transfer learning that achieved over 90% accuracy in classifying PCOS ultrasound images, demonstrating AI's potential to transform medical diagnostics.
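As a rough sketch of the transfer-learning recipe described above (the report's exact framework and hyperparameters may differ), one can freeze a pretrained InceptionV3 backbone and train a new binary classification head:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained InceptionV3 and freeze the backbone.
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # new binary head: healthy vs. unhealthy frame

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch (real inputs: 299x299 ultrasound frames).
images = torch.randn(4, 3, 299, 299)
labels = torch.tensor([0, 1, 0, 1])
outputs = model(images)
logits = outputs.logits if hasattr(outputs, "logits") else outputs   # drop the aux head output
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```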

Authors:Jiahao Qin, Feng Liu
Title: GAF-FusionNet: Multimodal ECG Analysis via Gramian Angular Fields and Split Attention
Abstract:
Electrocardiogram (ECG) analysis plays a crucial role in diagnosing cardiovascular diseases, but accurate interpretation of these complex signals remains challenging. This paper introduces a novel multimodal framework (GAF-FusionNet) for ECG classification that integrates time-series analysis with image-based representation using Gramian Angular Fields (GAF). Our approach employs a dual-layer cross-channel split attention module to adaptively fuse temporal and spatial features, enabling nuanced integration of complementary information. We evaluate GAF-FusionNet on three diverse ECG datasets: ECG200, ECG5000, and the MIT-BIH Arrhythmia Database. Results demonstrate significant improvements over state-of-the-art methods, with our model achieving 94.5\%, 96.9\%, and 99.6\% accuracy on the respective datasets. Our code will soon be available at https://github.com/Cross-Innovation-Lab/GAF-FusionNet.git.
中文摘要:本文提出GAF-FusionNet多模态心电图分类框架,通过格拉米角场融合时序分析与图像表征,在三个基准数据集上实现了最先进的分类准确率。
English Summary: This paper presents GAF-FusionNet, a multimodal ECG classification framework that combines time-series analysis with image-based representations using Gramian Angular Fields, achieving state-of-the-art accuracy across three benchmark datasets.
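For readers unfamiliar with Gramian Angular Fields, the summation-field variant can be computed with a few lines of NumPy. This standalone sketch only illustrates the time-series-to-image encoding; the fusion network itself is not reproduced, and the input signal is synthetic.

```python
import numpy as np

def gramian_angular_field(x):
    """Compute the Gramian Angular Summation Field (GASF) of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1     # rescale to [-1, 1]
    phi = np.arccos(np.clip(x_scaled, -1.0, 1.0))              # angular encoding
    return np.cos(phi[:, None] + phi[None, :])                 # GASF image

ecg_beat = np.sin(np.linspace(0, 4 * np.pi, 128)) + 0.05 * np.random.randn(128)
gaf_image = gramian_angular_field(ecg_beat)
print(gaf_image.shape)   # (128, 128) image fed to the image branch of such a model
```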

Authors:Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Title: VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Abstract:
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.
中文: 本文提出一种多阶段训练方法,使多模态大语言模型具备强大的视觉与语音交互能力,无需独立语音模块即可实现高效端到端对话,同时在视觉与语音任务中保持优异性能。
English: This paper introduces a multi-stage training method that equips multimodal large language models with strong visual and speech interaction capabilities, enabling efficient end-to-end dialogue without separate speech modules while maintaining high performance across vision and speech benchmarks.

Authors:Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li
Title: Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding
Abstract:
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding and attention rectification, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from spurious inter-modality correlations. In this paper, we propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination through a novel contrastive decoding mechanism. During the estimation of the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, addressing both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Content-Driven Attention Refinement (CDAR) module refines cross-modal attention weights, guiding LVLMs to focus on important visual content. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLM text generation. Our code will be available at https://github.com/lijm48/IMCCD.
Chinese: 本文提出了一种跨模态相关性校准解码方法(IMCCD),通过跨模态对比解码和注意力优化来减少大型视觉语言模型中的幻觉问题,并在多个基准测试中验证了其优越性。
English: The paper introduces an Inter-Modality Correlation Calibration Decoding (IMCCD) method to reduce hallucinations in large vision-language models by employing cross-modal contrastive decoding and attention refinement, validated as superior on benchmarks.
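The contrastive-decoding backbone of such methods can be sketched generically: logits from the full model are contrasted against logits from a distorted forward pass (here assumed to come from masking high-attention value vectors), and the next token is drawn from the adjusted distribution. This is a generic illustration, not the CMVED/CDAR implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_decode(logits_full, logits_distorted, alpha=1.0):
    """Generic contrastive decoding: sharpen what the full model predicts relative to a
    distorted forward pass, then renormalize into a probability distribution."""
    adjusted = (1 + alpha) * logits_full - alpha * logits_distorted
    return F.softmax(adjusted, dim=-1)

vocab_size = 32000
logits_full = torch.randn(vocab_size)        # logits from the unmodified LVLM
logits_distorted = torch.randn(vocab_size)   # logits from the value-masked pass (placeholder)
probs = contrastive_decode(logits_full, logits_distorted, alpha=1.0)
next_token = int(torch.argmax(probs))
print(next_token)
```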

Authors:Brandon Y. Feng, Rodrigo Ferrer-Chávez, Aviad Levis, Jason J. Wang, Katherine L. Bouman, William T. Freeman
Title: Exoplanet Detection via Differentiable Rendering
Abstract:
Direct imaging of exoplanets is crucial for advancing our understanding of planetary systems beyond our solar system, but it faces significant challenges due to the high contrast between host stars and their planets. Wavefront aberrations introduce speckles in the telescope science images, which are patterns of diffracted starlight that can mimic the appearance of planets, complicating the detection of faint exoplanet signals. Traditional post-processing methods, operating primarily in the image intensity domain, do not integrate wavefront sensing data. These data, measured mainly for adaptive optics corrections, have been overlooked as a potential resource for post-processing, partly due to the challenge of the evolving nature of wavefront aberrations. In this paper, we present a differentiable rendering approach that leverages these wavefront sensing data to improve exoplanet detection. Our differentiable renderer models wave-based light propagation through a coronagraphic telescope system, allowing gradient-based optimization to significantly improve starlight subtraction and increase sensitivity to faint exoplanets. Simulation experiments based on the James Webb Space Telescope configuration demonstrate the effectiveness of our approach, achieving substantial improvements in contrast and planet detection limits. Our results showcase how the computational advancements enabled by differentiable rendering can revitalize previously underexploited wavefront data, opening new avenues for enhancing exoplanet imaging and characterization.

Authors:Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Abstract:
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
中文摘要:本文通过用少量文本长思考数据微调多模态大语言模型,开发了多模态慢思考系统Virgo,证明基于语言的推理能有效迁移慢思考能力,且文本推理数据比视觉数据更能激发模型的深度思考潜能。
English Summary: This paper introduces Virgo, a multimodal slow-thinking system created by fine-tuning a multimodal large language model with textual long-form thought data, demonstrating that language-based reasoning can effectively transfer slow-thinking capabilities across modalities.

Authors:Huaxiang Zhang, Kai Liu, Zhongxue Gan, Guo-Niu Zhu
Title: UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery
Abstract:
Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1\% and $\text{AP}_{50}$ by 4.2\% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page: https://github.com/ValiantDiligent/UAV-DETR
Chinese: 本文提出专为无人机图像设计的UAV-DETR检测框架,通过多尺度频率增强特征融合和语义对齐模块,在多个无人机数据集上显著提升了检测性能,相比基线方法AP指标提升超过3%。
English: This paper introduces UAV-DETR, an efficient detection transformer framework specifically designed for UAV imagery, incorporating multi-scale feature fusion with frequency enhancement and semantic alignment modules to significantly improve object detection performance on drone datasets.

Authors:Hu Ding, Yan Yan, Yang Lu, Jing-Hao Xue, Hanzi Wang
Title: Uncertainty-Aware Label Refinement on Hypergraphs for Personalized Federated Facial Expression Recognition
Abstract:
Most facial expression recognition (FER) models are trained on large-scale expression data with centralized learning. Unfortunately, collecting a large amount of centralized expression data is difficult in practice due to privacy concerns of facial images. In this paper, we investigate FER under the framework of personalized federated learning, which is a valuable and practical decentralized setting for real-world applications. To this end, we develop a novel uncertainty-Aware label refineMent on hYpergraphs (AMY) method. For local training, each local model consists of a backbone, an uncertainty estimation (UE) block, and an expression classification (EC) block. In the UE block, we leverage a hypergraph to model complex high-order relationships between expression samples and incorporate these relationships into uncertainty features. A personalized uncertainty estimator is then introduced to estimate reliable uncertainty weights of samples in the local client. In the EC block, we perform label propagation on the hypergraph, obtaining high-quality refined labels for retraining an expression classifier. Based on the above, we effectively alleviate heterogeneous sample uncertainty across clients and learn a robust personalized FER model in each client. Experimental results on two challenging real-world facial expression databases show that our proposed method consistently outperforms several state-of-the-art methods. This indicates the superiority of hypergraph modeling for uncertainty estimation and label refinement on the personalized federated FER task. The source code will be released at https://github.com/mobei1006/AMY.
中文: 本文提出AMY方法,通过超图建模实现标签优化和样本不确定性评估,在保护数据隐私的个性化联邦学习框架下显著提升了面部表情识别性能。
English: This paper introduces AMY, a personalized federated learning method that uses hypergraph modeling to refine labels and estimate sample uncertainty, achieving superior facial expression recognition performance while addressing data privacy concerns.

Authors:Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
Title: Ingredients: Blending Custom Photos with Video Diffusion Transformers
Abstract:
This paper presents a powerful framework to customize video creations by incorporating multiple specific identity (ID) photos, with video diffusion Transformers, referred to as Ingredients. Generally, our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image queries in video diffusion transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures, compared to existing methods. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients.
中文摘要:本文提出了名为Ingredients的框架,通过面部提取、多尺度投影和身份路由三个核心模块,利用视频扩散Transformer将多张定制照片转化为动态个性化视频内容,在生成视频控制方面实现了显著进步。
English Summary: This paper introduces a framework called Ingredients that uses video diffusion Transformers to create customized videos from multiple identity photos through three specialized modules for facial feature extraction, embedding projection, and dynamic ID allocation.

Authors:Er Jin, Qihui Feng, Yongli Mou, Stefan Decker, Gerhard Lakemeyer, Oliver Simons, Johannes Stegmaier
Title: LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction
Abstract:
Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image's visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on the public benchmark MVTec LOCO AD, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of anomalies, outperforming the existing SOTA method by a large margin.

Authors:Ruikang Chen, Yan Yan, Jing-Hao Xue, Yang Lu, Hanzi Wang
Title: Augmentation Matters: A Mix-Paste Method for X-Ray Prohibited Item Detection under Noisy Annotations
Abstract:
Automatic X-ray prohibited item detection is vital for public safety. Existing deep learning-based methods all assume that the annotations of training X-ray images are correct. However, obtaining correct annotations is extremely hard, if not impossible, for large-scale X-ray images, where item overlapping is ubiquitous. As a result, X-ray images are easily contaminated with noisy annotations, leading to performance deterioration of existing methods. In this paper, we address the challenging problem of training a robust prohibited item detector under noisy annotations (including both category noise and bounding box noise) from a novel perspective of data augmentation, and propose an effective label-aware mixed patch paste augmentation method (Mix-Paste). Specifically, for each item patch, we mix several item patches with the same category label from different images and replace the original patch in the image with the mixed patch. In this way, the probability of containing the correct prohibited item within the generated image is increased. Meanwhile, the mixing process mimics item overlapping, enabling the model to learn the characteristics of X-ray images. Moreover, we design an item-based large-loss suppression (LLS) strategy to suppress the large losses corresponding to potentially positive predictions of additional items due to the mixing operation. We show the superiority of our method on X-ray datasets under noisy annotations. In addition, we evaluate our method on the noisy MS-COCO dataset to showcase its generalization ability. These results clearly indicate the great potential of data augmentation to handle noisy annotations. The source code is released at https://github.com/wscds/Mix-Paste.
中文摘要:本文提出Mix-Paste方法,通过混合同类物品图像块的数据增强技术,有效解决X射线图像在噪声标注下的违禁品检测问题,并采用损失抑制策略提升模型在标注噪声环境中的鲁棒性。
English Summary: This paper introduces Mix-Paste, a label-aware data augmentation method that enhances robust prohibited item detection in X-ray images by mixing item patches to handle noisy annotations and mimic item overlapping, complemented by a loss suppression strategy to improve performance under noisy training conditions.
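A toy sketch of the patch-mixing idea, under the assumption that candidate patches are already cropped and resized to the target box; the element-wise mean used here is only a stand-in for whatever blending the paper actually uses to mimic X-ray item overlap.

```python
import numpy as np

def mix_paste(image, box, same_class_patches):
    """Replace the item patch inside `box` with a mix of same-category patches.
    The element-wise mean below is an illustrative stand-in for the paper's blending."""
    x0, y0, x1, y1 = box
    original = image[y0:y1, x0:x1]
    mixed = np.mean(np.stack([original] + same_class_patches, axis=0), axis=0)
    out = image.copy()
    out[y0:y1, x0:x1] = mixed
    return out

rng = np.random.default_rng(0)
image = rng.random((512, 512))                          # toy grayscale X-ray image
box = (100, 100, 164, 164)                              # (x0, y0, x1, y1) of a labeled item
patches = [rng.random((64, 64)) for _ in range(3)]      # same-category patches, assumed pre-resized
augmented = mix_paste(image, box, patches)
print(augmented.shape)
```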

Authors:Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian
Title: AR4D: Autoregressive 4D Generation from Monocular Videos
Abstract:
Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on its previous frame's representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
中文: AR4D框架提出了一种无需SDS的动态3D内容生成新范式,通过自回归生成和渐进式视角采样技术,实现了更优的多样性、时空一致性和文本对齐效果。
English: The AR4D framework introduces an SDS-free paradigm for dynamic 3D content creation, employing autoregressive generation and progressive view sampling to achieve superior diversity, consistency, and prompt alignment.

Authors:Jiajun Cao, Yuan Zhang, Tao Huang, Ming Lu, Qizhe Zhang, Ruichuan An, Ningning MA, Shanghang Zhang
Title: MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
Abstract:
Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. Our code is available at: https://github.com/hey-cjj/MoVE-KD.
中文摘要:本文提出MoVE-KD框架,通过低秩适应和专家混合机制选择性地激活不同视觉编码器的专长知识,并采用基于注意力的蒸馏策略自适应权衡编码器价值,将多种视觉编码器的优势高效蒸馏到单一模型中。
English Summary: The paper introduces MoVE-KD, a framework that distills the diverse capabilities of multiple visual encoders into a single efficient model using LoRA and mixture-of-experts to selectively activate specialized knowledge while employing attention-based distillation to adaptively weigh encoders and emphasize valuable visual tokens.

Authors:Jina Kim, Jihoo Lee, Je-Won Kang
Title: SNeRV: Spectra-preserving Neural Representation for Video
Abstract:
Neural representation for video (NeRV), which employs a neural network to parameterize video signals, introduces a novel methodology in video representations. However, existing NeRV-based methods have difficulty in capturing fine spatial details and motion patterns due to spectral bias, in which a neural network learns high-frequency (HF) components at a slower rate than low-frequency (LF) components. In this paper, we propose spectra-preserving NeRV (SNeRV) as a novel approach to enhance implicit video representations by efficiently handling various frequency components. SNeRV uses the 2D discrete wavelet transform (DWT) to decompose video into LF and HF features, preserving spatial structures and directly addressing the spectral bias issue. To balance compactness, we encode only the LF components, while HF components that include fine textures are generated by a decoder. Specialized modules, including a multi-resolution fusion unit (MFU) and a high-frequency restorer (HFR), are integrated into a backbone to facilitate the representation. Furthermore, we extend SNeRV to effectively capture temporal correlations between adjacent video frames, by casting the extension as an additional frequency decomposition along the temporal domain. This approach allows us to embed spatio-temporal LF features into the network, using temporally extended up-sampling blocks (TUBs). Experimental results demonstrate that SNeRV outperforms existing NeRV models in capturing fine details and achieves enhanced reconstruction, making it a promising approach in the field of implicit video representations. The code is available at https://github.com/qwertja/SNeRV.
中文摘要:SNeRV提出了一种新颖的隐式视频表示方法,通过二维小波变换和专用模块解决频谱偏差问题,能更好地捕捉细节和运动模式,在重建质量上超越了现有NeRV模型。
English Summary: SNeRV introduces a novel implicit video representation method that uses 2D wavelet transform and specialized modules to overcome spectral bias, enabling better capture of fine details and motion patterns while outperforming existing NeRV models in reconstruction quality.
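The frequency decomposition at the heart of this approach corresponds to a standard 2-D discrete wavelet transform, which can be sketched with PyWavelets; the encoder, MFU, HFR, and TUB modules are not reproduced here, and the input frame is synthetic.

```python
import numpy as np
import pywt

frame = np.random.rand(1080, 1920)                    # toy grayscale video frame
low_freq, (horiz, vert, diag) = pywt.dwt2(frame, "haar")
print(low_freq.shape, horiz.shape)                    # each band is half resolution: (540, 960)

# In SNeRV's design, only the LF band is encoded compactly, while the HF detail
# bands (fine textures) are regenerated by the decoder.
```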

Authors:Tengfei Wang, Xin Wang, Yongmao Hou, Yiwei Xu, Wendi Zhang, Zongqian Zhan
Title: PG-SAG: Parallel Gaussian Splatting for Fine-Grained Large-Scale Urban Buildings Reconstruction via Semantic-Aware Grouping
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a transformative method in the field of real-time novel view synthesis. Based on 3DGS, recent advancements cope with large-scale scenes via a spatial partition strategy to reduce video memory and optimization time costs. In this work, we introduce a parallel Gaussian splatting method, termed PG-SAG, which fully exploits semantic cues for both partitioning and Gaussian kernel optimization, enabling fine-grained building surface reconstruction of large-scale urban areas without downsampling the original image resolution. First, the cross-modal model Language Segment Anything is leveraged to segment building masks. Then, the segmented building regions are grouped into sub-regions according to a visibility check across registered images. The Gaussian kernels for these sub-regions are optimized in parallel with masked pixels. In addition, the normal loss is re-formulated for the detected edges of masks to alleviate the ambiguities in normal vectors on edges. Finally, to improve the optimization of 3D Gaussians, we introduce a gradient-constrained balance-load loss that accounts for the complexity of the corresponding scenes, effectively minimizing the thread waiting time in the pixel-parallel rendering stage as well as the reconstruction loss. Extensive experiments on various urban datasets demonstrate the superior performance of our PG-SAG on building surface reconstruction compared to several state-of-the-art 3DGS-based methods. Project Web: https://github.com/TFWang-9527/PG-SAG.
中文:PG-SAG是一种利用语义线索进行分区和优化的并行高斯溅射方法,能够高效精确地实现大规模城市场景的高分辨率重建。
English: PG-SAG is a parallel Gaussian splatting method that leverages semantic cues for partitioning and optimization, enabling high-resolution reconstruction of large-scale urban scenes with improved efficiency and accuracy.

Authors:Kang Yi, Haoran Tang, Yumeng Li, Jing Xu, Jun Zhang
Title: Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection
Abstract:
RGB-D salient object detection (SOD), aiming to highlight prominent regions of a given scene by jointly modeling RGB and depth information, is one of the challenging pixel-level prediction tasks. Recently, dual-attention mechanisms have been applied to this area due to their ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm without considering the inherent discrepancy between RGB and depth, which may lead to a reduction in performance. Moreover, the long-range dependencies derived from global and local information make it difficult to leverage a unified efficient fusion strategy. Hence, in this paper, we propose GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in the spatial and channel dimensions. Besides, we adopt an efficient decoder based on cascade transformer-infused reconstruction to integrate multi-level fusion features jointly. Extensive experiments on six benchmark datasets demonstrate that our proposed GL-DMNet performs better than 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics compared to the second-best model (S3Net). Codes and results are available at https://github.com/kingkung2016/GL-DMNet.
中文: 本文提出GL-DMNet,一种具有全局-局部感知的双重互学习网络,通过位置和通道互融合模块解决RGB-D显著目标检测中的跨模态特征融合难题,在六个基准数据集上超越24种现有方法,四项指标平均提升约3%。
English: This paper introduces GL-DMNet, a novel dual mutual learning network with global-local awareness that addresses challenges in RGB-D salient object detection by employing position and channel mutual fusion modules to effectively integrate cross-modality features, outperforming 24 existing methods with an average 3% improvement across four metrics.

Authors:Zihao Wang, Yuxiang Wei, Fan Li, Renjing Pei, Hang Xu, Wangmeng Zuo
Title: ACE: Anti-Editing Concept Erasure in Text-to-Image Models
Abstract:
Recent advances in text-to-image diffusion models have significantly facilitated the generation of high-quality images, but they also raise concerns about the illegal creation of harmful content, such as copyrighted images. Existing concept erasure methods achieve superior results in preventing the production of erased concepts from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. Specifically, we propose to inject the erasure guidance into both the conditional and the unconditional noise prediction, enabling the model to effectively prevent the creation of erased concepts during both editing and generation. Furthermore, a stochastic correction guidance is introduced during training to address the erosion of unrelated concepts. We conducted erasure editing experiments with representative editing methods (i.e., LEDITS++ and MasaCtrl) to erase IP characters, and the results indicate that our ACE effectively filters out target concepts in both types of edits. Additional experiments on erasing explicit concepts and artistic styles further demonstrate that our ACE performs favorably against state-of-the-art methods. Our code will be publicly available at https://github.com/120L020904/ACE.
中文: 提出的抗编辑概念擦除(ACE)方法通过在条件和非条件噪声预测中融入擦除指导,有效防止有害或受版权保护内容的生成与编辑,在多种场景下优于现有技术。
English: The proposed Anti-Editing Concept Erasure (ACE) method effectively prevents the generation and editing of harmful or copyrighted content by integrating erasure guidance into both conditional and unconditional noise predictions, outperforming existing techniques in various scenarios.

Authors:Yun Zhu, Dong Zhang, Yi Lin, Yifei Feng, Jinhui Tang
Title: Merging Context Clustering with Visual State Space Models for Medical Image Segmentation
Abstract:
Medical image segmentation demands the aggregation of global and local feature representations, posing a challenge for current methodologies in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexities by excelling at long-range feature interactions with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information. To address these challenges, we introduce a simple yet effective method named context clustering ViM (CCViM), which incorporates a context clustering module within the existing ViM models to segment image tokens into distinct windows for adaptable local clustering. Our method effectively combines long-range and short-range feature interactions, thereby enhancing spatial contextual representations for medical image segmentation tasks. Extensive experimental evaluations on diverse public datasets, i.e., Kumar, CPM17, ISIC17, ISIC18, and Synapse, demonstrate the superior performance of our method compared to current state-of-the-art methods. Our code can be found at https://github.com/zymissy/CCViM.
中文: 提出的CCViM方法通过在视觉Mamba模型中引入上下文聚类模块,有效结合长程与短程特征交互,在多个数据集上显著提升了医学图像分割的空间上下文表征能力。
English: The proposed CCViM method enhances medical image segmentation by integrating a context clustering module into vision mamba models, effectively combining long-range and short-range feature interactions to improve spatial contextual representations across multiple datasets.

Authors:Yao Ding, Weijie Kang, Aitao Yang, Zhili Zhang, Junyang Zhao, Jie Feng, Danfeng Hong, Qinhe Zheng
Title: Adaptive Homophily Clustering: Structure Homophily Graph Learning with Adaptive Filter for Hyperspectral Image
Abstract:
Hyperspectral image (HSI) clustering has been a fundamental but challenging task with zero training labels. Currently, some deep graph clustering methods have been successfully explored for HSI due to their outstanding performance in effective spatial structural information encoding. Nevertheless, insufficient structural information utilization, poor feature presentation ability, and weak graph update capability limit their performance. Thus, in this paper, a homophily structure graph learning with an adaptive filter clustering method (AHSGC) for HSI is proposed. Specifically, homogeneous region generation is first developed for HSI processing and constructing the original graph. Afterward, an adaptive filter graph encoder is designed to adaptively capture the high- and low-frequency features on the graph for subsequent processing. Then, a graph embedding clustering self-training decoder is developed with KL divergence, with which pseudo-labels are generated for network training. Meanwhile, homophily-enhanced structure learning is introduced to update the graph according to the clustering task, in which orient correlation estimation is adopted to estimate the node connections, and graph edge sparsification is designed to adjust the edges in the graph dynamically. Finally, a joint network optimization is introduced to achieve network self-training and update the graph. K-means is adopted to express the latent features. Extensive experiments and repeated comparative analysis have verified that our AHSGC achieves high clustering accuracy, low computational complexity, and strong robustness. The source code will be available at https://github.com/DY-HYX.
中文: 本文提出了一种用于高光谱图像聚类的同质性结构图学习方法AHSGC,通过自适应图滤波和结构优化,显著提升了聚类精度和鲁棒性,同时降低了计算复杂度。
English: This paper introduces AHSGC, a novel homophily structure graph learning method with adaptive filtering for hyperspectral image clustering, which enhances feature representation and graph updating to achieve high accuracy, low complexity, and strong robustness.

Authors:Honghu Chen, Bo Peng, Yunfan Tao, Juyong Zhang
Title: D$^3$-Human: Dynamic Disentangled Digital Human from Monocular Video
Abstract:
We introduce D$^3$-Human, a method for reconstructing Dynamic Disentangled Digital Human geometry from monocular videos. Past monocular video human reconstruction primarily focuses on reconstructing undecoupled clothed human bodies or only reconstructing clothing, making it difficult to apply directly in applications such as animation production. The challenge in reconstructing decoupled clothing and body lies in the occlusion caused by clothing over the body. To this end, the details of the visible area and the plausibility of the invisible area must be ensured during the reconstruction process. Our proposed method combines explicit and implicit representations to model the decoupled clothed human body, leveraging the robustness of explicit representations and the flexibility of implicit representations. Specifically, we reconstruct the visible region as SDF and propose a novel human manifold signed distance field (hmSDF) to segment the visible clothing and visible body, and then merge the visible and invisible body. Extensive experimental results demonstrate that, compared with existing reconstruction schemes, D$^3$-Human can achieve high-quality decoupled reconstruction of the human body wearing different clothing, and can be directly applied to clothing transfer and animation.

Authors:Lihao Wang
Title: Click-Calib: A Robust Extrinsic Calibration Method for Surround-View Systems
Abstract:
Surround-View System (SVS) is an essential component in Advanced Driver Assistance System (ADAS) and requires precise calibrations. However, conventional offline extrinsic calibration methods are cumbersome and time-consuming as they rely heavily on physical patterns. Additionally, these methods primarily focus on short-range areas surrounding the vehicle, resulting in lower calibration quality in more distant zones. To address these limitations, we propose Click-Calib, a pattern-free approach for offline SVS extrinsic calibration. Without requiring any special setup, the user only needs to click a few keypoints on the ground in natural scenes. Unlike other offline calibration approaches, Click-Calib optimizes camera poses over a wide range by minimizing reprojection distance errors of keypoints, thereby achieving accurate calibrations at both short and long distances. Furthermore, Click-Calib supports both single-frame and multiple-frame modes, with the latter offering even better results. Evaluations on our in-house dataset and the public WoodScape dataset demonstrate its superior accuracy and robustness compared to baseline methods. Code is available at https://github.com/lwangvaleo/click_calib.
Chinese: Click-Calib是一种无需标定板的环视系统离线外参标定方法,用户只需在自然场景地面点击少量关键点,即可实现短距和长距的精确标定,在精度和鲁棒性上均优于基线方法。
English: Click-Calib is a pattern-free, offline extrinsic calibration method for Surround-View Systems that requires only a few user-clicked keypoints on the ground to achieve accurate calibration across both short and long distances, outperforming baseline methods in accuracy and robustness.
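In the same spirit, pose refinement by minimizing keypoint reprojection error can be sketched with a simple pinhole model and a nonlinear least-squares solver; Click-Calib's fisheye camera model, multi-camera coupling, and multi-frame mode are not reproduced, and all numbers below are toy values.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[800.0, 0, 640], [0, 800.0, 480], [0, 0, 1]])   # assumed pinhole intrinsics

def project(params, points_world):
    """Project 3-D points into the image for a pose given as (rotvec, translation)."""
    rvec, tvec = params[:3], params[3:]
    cam_pts = Rotation.from_rotvec(rvec).apply(points_world) + tvec
    uv = (K @ cam_pts.T).T
    return uv[:, :2] / uv[:, 2:3]

def residuals(params, points_world, clicked_uv):
    return (project(params, points_world) - clicked_uv).ravel()

# Ground keypoints (z = 0 in the vehicle frame) and their clicked pixel locations (toy values).
ground_pts = np.array([[2.0, 1.0, 0.0], [3.0, -1.0, 0.0], [5.0, 0.5, 0.0], [4.0, 2.0, 0.0]])
true_pose = np.array([0.1, -0.05, 0.02, 0.3, -0.1, 1.5])
clicked_uv = project(true_pose, ground_pts)

init_pose = np.zeros(6)
init_pose[5] = 1.4                                             # rough initial guess
result = least_squares(residuals, init_pose, args=(ground_pts, clicked_uv))
print(np.round(result.x, 3), result.cost)                      # refined pose and residual cost
```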

Authors:Xincheng Shuai, Henghui Ding, Zhenyuan Qin, Hao Luo, Xingjun Ma, Dacheng Tao
Title: Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
Abstract:
Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both camera and objects in a 3D-aware manner, resulting in limited controllability over generated contents. To address this issue and facilitate the research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models learning to disentangle the motion effects from objects and the camera in a video. To provide precise 3D-aware motion control, we further propose a method trained on SynFMC, Free-Form Motion Control (FMC). FMC can control the 6D poses of objects and camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.

Authors:Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman
Title: Object-level Visual Prompts for Compositional Image Generation
Abstract:
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
中文: 本文提出一种KV混合交叉注意力机制,通过分别从布局编码器和外观编码器学习键值对,使文本到图像扩散模型能够在使用物体级视觉提示时保持对象身份,同时生成多样化的场景构图。
English: This paper introduces a KV-mixed cross-attention mechanism for text-to-image diffusion models, enabling object-level visual prompts to maintain identity while generating diverse scene compositions through separate key and value learning from layout and appearance encoders.
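Schematically, the KV-mixed cross-attention amounts to drawing keys and values from two different encodings of the visual prompt, as in the toy snippet below; encoder details, dimensions, and projections are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn.functional as F

B, n_img, n_prompt, d = 1, 64, 16, 320
image_tokens = torch.randn(B, n_img, d)         # queries: tokens of the image being generated
layout_feats = torch.randn(B, n_prompt, d)      # small-bottleneck encoding of the visual prompt
appearance_feats = torch.randn(B, n_prompt, d)  # large-bottleneck encoding of the same prompt

to_q = torch.nn.Linear(d, d)
to_k = torch.nn.Linear(d, d)
to_v = torch.nn.Linear(d, d)

q = to_q(image_tokens)
k = to_k(layout_feats)                           # keys control where the subject goes
v = to_v(appearance_feats)                       # values carry fine-grained identity details
attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
out = attn @ v                                   # (B, n_img, d) mixed-source attention output
print(out.shape)
```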

Authors:Jingfeng Yao, Bin Yang, Xinggang Wang
Title: Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Abstract:
Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs--representing an over 21 times convergence speedup compared to the original DiT. Models and codes are available at: https://github.com/hustvl/LightningDiT.
中文: 该研究揭示了潜在扩散模型中视觉分词器质量与计算效率之间的优化困境,提出VA-VAE方法将潜在空间与预训练视觉模型对齐,通过集成LightningDiT系统实现了最先进的图像生成性能,并显著加快了收敛速度。
English: The study identifies an optimization dilemma in latent diffusion models where improving visual tokenizer quality conflicts with computational efficiency, and proposes VA-VAE to align latent spaces with pre-trained vision models, achieving state-of-the-art image generation with significantly faster convergence through the integrated LightningDiT system.
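Conceptually, the alignment idea adds a term that pulls tokenizer latents toward frozen vision-foundation-model features; the sketch below uses a cosine-similarity penalty with placeholder shapes, weights, and a dummy reconstruction term, and is not the released VA-VAE training code.

```python
import torch
import torch.nn.functional as F

def alignment_loss(vae_latents, foundation_feats, proj):
    """Encourage tokenizer latents to stay close to a pretrained vision model's features."""
    z = proj(vae_latents.flatten(2).transpose(1, 2))   # (B, N, D) projected latents
    return 1.0 - F.cosine_similarity(z, foundation_feats, dim=-1).mean()

B, C, H, W, D = 2, 16, 16, 16, 768
vae_latents = torch.randn(B, C, H, W, requires_grad=True)
foundation_feats = torch.randn(B, H * W, D)            # frozen DINO/CLIP-style features (assumed
                                                       # to match the latent grid size)
proj = torch.nn.Linear(C, D)

recon_loss = torch.tensor(0.0)                         # stand-in for the usual tokenizer losses
total = recon_loss + 0.5 * alignment_loss(vae_latents, foundation_feats, proj)
total.backward()
```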

Authors:Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys
Title: R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization
Abstract:
Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, it remains a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10$\times$ more accurate than previous SCR methods with similar map sizes and require at least 5$\times$ smaller map sizes than any other SCR method while still delivering superior accuracy. Code is available at: https://github.com/cvg/scrstudio .
中文: 本研究通过引入基于共视图的全局编码学习、数据增强策略及深度调整重投影损失,提升了场景坐标回归在视觉定位中的性能,在复杂数据集上以更小的地图尺寸实现了顶尖的精度。
English: This study enhances scene coordinate regression for visual localization by introducing a covisibility graph-based global encoding, data augmentation, and a depth-adjusted reprojection loss, achieving state-of-the-art accuracy with significantly reduced map sizes on challenging datasets.

Authors:Yoshitomo Matsubara, Matteo Mendula, Marco Levorato
Title: A Multi-task Supervised Compression Model for Split Computing
Abstract:
Split computing ($\neq$ split learning) is a promising approach for deploying deep learning models on resource-constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State-of-the-art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation. The application of existing methods to multi-task problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi-task-head supervised compression model for multi-task split computing. Experimental results show that the multi-task supervised compression model either outperformed or rivaled strong lightweight baseline models in terms of predictive performance on the ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduced end-to-end latency (by up to 95.4%) and energy consumption of mobile devices (by up to 88.2%) in multi-task split computing scenarios.
中文: Ladon是首个用于分割计算的多任务监督压缩模型,它在多个数据集上提升预测性能的同时,大幅降低了移动设备的延迟和能耗。
English: Ladon is the first multi-task supervised compression model for split computing that enhances predictive performance on multiple datasets while significantly reducing latency and energy consumption on mobile devices.
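As a rough illustration of the split-computing setup the abstract describes (an on-device encoder that emits a compressed representation at early layers, with multiple task heads on the server), here is a minimal PyTorch sketch; the layer sizes, bottleneck width, and task heads are placeholders, not the Ladon architecture.

```python
import torch
import torch.nn as nn

class MobileEncoder(nn.Module):
    """Runs on the sensor device: early layers emitting a small, compressed tensor."""
    def __init__(self, bottleneck_ch: int = 12):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, bottleneck_ch, 3, stride=2, padding=1),  # compressed representation
        )
    def forward(self, x):
        return self.layers(x)

class ServerMultiTaskModel(nn.Module):
    """Runs on the edge server: shared trunk plus one head per task."""
    def __init__(self, bottleneck_ch: int = 12, num_classes: int = 1000, num_seg_classes: int = 21):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(bottleneck_ch, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.cls_head = nn.Linear(64 * 8 * 8, num_classes)
        self.seg_head = nn.Conv2d(64, num_seg_classes, 1)
    def forward(self, z):
        h = self.shared(z)
        return self.cls_head(h.flatten(1)), self.seg_head(h)

if __name__ == "__main__":
    z = MobileEncoder()(torch.randn(1, 3, 224, 224))  # tensor sent over the wireless channel
    logits, seg = ServerMultiTaskModel()(z)
    print(z.shape, logits.shape, seg.shape)
```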

Authors:Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
Title: Nested Attention: Semantic-aware Attention Values for Concept Personalization
Abstract:
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model's prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.
Chinese Summary: 嵌套注意力是一种创新机制,通过将丰富的图像表征注入交叉注意力层,在保持身份特征与提示对齐之间实现更优平衡,同时支持多主体图像合成。
English Summary: Nested Attention is a novel mechanism that enhances text-to-image personalization by injecting rich image representations into cross-attention layers, achieving superior balance between identity preservation and prompt alignment while enabling multi-subject composition.
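The abstract's key mechanism, query-dependent subject values, can be illustrated with a small attention block in which each host cross-attention query selects from subject-encoder features. The dimensions and the way the resulting values would be injected into the host layer are assumptions; this is a sketch of the idea, not the paper's code.

```python
import torch
import torch.nn as nn

class NestedSubjectValues(nn.Module):
    """Sketch: a nested attention lets each generated-image query select from
    subject-encoder features, producing one subject value vector per query."""

    def __init__(self, query_dim: int, subject_dim: int, value_dim: int):
        super().__init__()
        self.to_k = nn.Linear(subject_dim, query_dim, bias=False)
        self.to_v = nn.Linear(subject_dim, value_dim, bias=False)

    def forward(self, queries: torch.Tensor, subject_feats: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, query_dim) from the host cross-attention layer.
        # subject_feats: (B, S, subject_dim) from the personalization encoder.
        k = self.to_k(subject_feats)
        v = self.to_v(subject_feats)
        attn = (queries @ k.transpose(1, 2)) / queries.shape[-1] ** 0.5
        return attn.softmax(dim=-1) @ v  # (B, N, value_dim): one value per query

if __name__ == "__main__":
    m = NestedSubjectValues(query_dim=320, subject_dim=768, value_dim=320)
    print(m(torch.randn(2, 4096, 320), torch.randn(2, 257, 768)).shape)
```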

Authors:Yidi Shao, Chen Change Loy, Bo Dai
Title: Learning 3D Garment Animation from Trajectories of A Piece of Cloth
Abstract:
Garment animation is ubiquitous in various applications, such as virtual reality, gaming, and film production. Recently, learning-based approaches achieve compelling performance in animating diverse garments under versatile scenarios. Nevertheless, to mimic the deformations of the observed garments, data-driven methods require large-scale garment data, which are both expensive and time-consuming to acquire. In addition, forcing models to match the dynamics of observed garment animation may hinder their potential to generalize to unseen cases. In this paper, instead of using garment-wise supervised learning, we adopt a disentangled scheme to learn how to animate observed garments: 1) learning constitutive behaviors from the observed cloth; and 2) dynamically animating various garments constrained by the learned constitutive laws. Specifically, we propose Energy Unit network (EUNet) to model the constitutive relations in the format of energy. Without the priors from analytical physics models and differentiable simulation engines, EUNet is able to directly capture the constitutive behaviors from the observed piece of cloth and uniformly describe the change of energy caused by deformations, such as stretching and bending. We further apply the pre-trained EUNet to animate various garments based on energy optimizations. The disentangled scheme alleviates the need for garment data and enables us to utilize the dynamics of a piece of cloth for animating garments. Experiments show that while EUNet effectively delivers the energy gradients due to the deformations, models constrained by EUNet achieve more stable and physically plausible performance compared with those trained in a garment-wise supervised manner. Code is available at https://github.com/ftbabi/EUNet_NeurIPS2024.git.
Chinese: 本文提出EUNet这一解耦学习方案,通过能量动态模拟服装本构行为,无需大量服装专用数据即可实现稳定且符合物理规律的动画效果。
English: This paper introduces EUNet, a disentangled learning approach that models garment constitutive behaviors from energy dynamics, enabling stable and physically plausible animation without requiring extensive garment-specific data.

Authors:Xuyin Qi, Zeyu Zhang, Aaron Berliano Handoko, Huazhan Zheng, Mingxi Chen, Ta Duc Huy, Vu Minh Hieu Phan, Lei Zhang, Linqi Cheng, Shiyu Jiang, Zhiwei Zhang, Zhibin Liao, Yang Zhao, Minh-Son To
Title: ProjectedEx: Enhancing Generation in Explainable AI for Prostate Cancer
Abstract:
Prostate cancer, a growing global health concern, necessitates precise diagnostic tools, with Magnetic Resonance Imaging (MRI) offering high-resolution soft tissue imaging that significantly enhances diagnostic accuracy. Recent advancements in explainable AI and representation learning have significantly improved prostate cancer diagnosis by enabling automated and precise lesion classification. However, existing explainable AI methods, particularly those based on frameworks like generative adversarial networks (GANs), are predominantly developed for natural image generation, and their application to medical imaging often leads to suboptimal performance due to the unique characteristics and complexity of medical images. To address these challenges, our paper introduces three key contributions. First, we propose ProjectedEx, a generative framework that provides interpretable, multi-attribute explanations, effectively linking medical image features to classifier decisions. Second, we enhance the encoder module by incorporating feature pyramids, which enables multiscale feedback to refine the latent space and improves the quality of generated explanations. Additionally, we conduct comprehensive experiments on both the generator and classifier, demonstrating the clinical relevance and effectiveness of ProjectedEx in enhancing interpretability and supporting the adoption of AI in medical settings. Code will be released at https://github.com/Richardqiyi/ProjectedEx
中文: 本文提出了ProjectedEx这一可解释AI框架,通过生成多属性医学图像解释并整合特征金字塔技术,显著提升了前列腺癌诊断的临床可解释性和实用性。
English: This paper introduces ProjectedEx, an explainable AI framework that enhances prostate cancer diagnosis by generating multi-attribute medical image explanations and incorporating feature pyramids for improved interpretability in clinical settings.

Authors:Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Fei Xiao, Chen Change Loy, Lu Jiang
Title: SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
Abstract:
Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.

Authors:Leandro Di Bella, Yangxintong Lyu, Bruno Cornelis, Adrian Munteanu
Title: HybridTrack: A Hybrid Approach for Robust Multi-Object Tracking
Abstract:
The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose a novel 3D multi-object tracking approach for vehicles, HybridTrack, which integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.72% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency. The code is publicly available at: https://github.com/leandro-svg/HybridTrack.
Chinese: HybridTrack提出了一种创新的3D多目标跟踪方法,通过集成数据驱动的卡尔曼滤波器消除了手动运动建模需求,在KITTI数据集上实现了卓越的准确性和实时处理能力。
English: HybridTrack introduces a novel 3D multi-object tracking method that integrates a data-driven Kalman Filter to eliminate manual motion modeling, achieving superior accuracy and real-time performance on the KITTI dataset.
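A minimal sketch of the data-driven Kalman-filter idea (learning a transition residual and a Kalman gain from data instead of hand-tuning motion and noise models) might look as follows; the state layout, MLP sizes, and constant-velocity prior are illustrative assumptions rather than the HybridTrack design.

```python
import torch
import torch.nn as nn

class LearnedKalmanStep(nn.Module):
    """Sketch: constant-velocity prediction plus a learned transition residual,
    and a learned gain that fuses the prediction with the current detection."""

    def __init__(self, state_dim: int = 8, meas_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.residual_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim)
        )
        self.gain_net = nn.Sequential(
            nn.Linear(state_dim + meas_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim * meas_dim),
        )
        self.state_dim, self.meas_dim = state_dim, meas_dim

    def forward(self, state: torch.Tensor, detection: torch.Tensor) -> torch.Tensor:
        # state: (B, 8) = [x, y, w, h, vx, vy, vw, vh]; detection: (B, 4) = [x, y, w, h]
        pos = state[:, :4] + state[:, 4:]                     # constant-velocity transition
        pred = torch.cat([pos, state[:, 4:]], dim=-1) + self.residual_net(state)
        innovation = detection - pred[:, : self.meas_dim]
        K = self.gain_net(torch.cat([pred, detection], dim=-1))
        K = K.view(-1, self.state_dim, self.meas_dim)         # learned Kalman gain
        return pred + torch.bmm(K, innovation.unsqueeze(-1)).squeeze(-1)

if __name__ == "__main__":
    print(LearnedKalmanStep()(torch.randn(2, 8), torch.randn(2, 4)).shape)  # (2, 8)
```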

Authors:Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao
Title: SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
Abstract:
Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., "salto backward tucked with 1 turn"). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model's predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by our SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.
Chinese: 本文提出SeFAR框架,通过双级时间元素和自适应调节的半监督学习方法,在细粒度动作识别任务中实现了最优性能,并有效提升了多模态基础模型的领域语义理解能力。
English: This paper introduces SeFAR, a semi-supervised learning framework designed for Fine-grained Action Recognition that incorporates dual-level temporal elements and adaptive regulation to achieve state-of-the-art performance on specialized datasets.

Authors:Zhiyao Wang, Xu Chen, Chengming Xu, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Chengjie Wang, Yuqi Liu, Yiyi Zhou, Rongrong Ji
Title: SVFR: A Unified Framework for Generalized Video Face Restoration
Abstract:
Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video BFR, inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed as stable video face restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage the shared feature representation learning among different subtasks. To further enhance the restoration quality and temporal stability, we introduce the facial prior learning and the self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state-of-the-art in video FR and establishes a new paradigm for generalized video face restoration. Code and video demo are available at https://github.com/wangzhiyaoo/SVFR.git.
中文: 本文提出了一种新颖的通用视频人脸修复框架,通过整合修复与着色等多任务,利用稳定视频扩散先验和统一潜在正则化方法,显著提升了视频修复的时间一致性与质量。
English: This paper introduces a novel generalized video face restoration framework that integrates multiple tasks like inpainting and colorization using Stable Video Diffusion priors, enhancing temporal consistency and restoration quality through unified latent regularization and auxiliary strategies.

Authors:Amil Bhagat, Milind Jain, A. V. Subramanyam
Title: Conditional Consistency Guided Image Translation and Enhancement
Abstract:
Consistency models have emerged as a promising alternative to diffusion models, offering high-quality generative capabilities through single-step sample generation. However, their application to multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement remains largely unexplored. In this paper, we introduce Conditional Consistency Models (CCMs) for multi-domain image translation by incorporating additional conditional inputs. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process, ensuring that the generated outputs retain structural and contextual information from the corresponding input domain. We evaluate CCMs on 10 different datasets demonstrating their effectiveness in producing high-quality translated images across multiple domains. Code is available at https://github.com/amilbhagat/Conditional-Consistency-Models.
中文: 条件一致性模型通过引入任务特定的条件输入,实现了跨多个领域的高质量单步图像翻译,并在10个不同数据集上验证了其有效性。
English: Conditional Consistency Models (CCMs) introduce task-specific conditional inputs to enable high-quality, single-step image translation across multiple domains, demonstrating effectiveness on 10 diverse datasets.
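One plausible reading of "task-specific conditional inputs that guide the denoising process" is channel-wise concatenation of the condition image with the noisy sample inside the consistency network; the toy module below sketches that, with a placeholder architecture and a broadcast noise-level channel that are assumptions, not the CCM implementation.

```python
import torch
import torch.nn as nn

class ConditionalConsistencyNet(nn.Module):
    """Sketch: a consistency-style network that maps (noisy sample, condition image,
    noise level) to a denoised sample in a single step."""

    def __init__(self, img_ch: int = 3, cond_ch: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_ch + cond_ch + 1, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1),
        )

    def forward(self, noisy: torch.Tensor, cond: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        # Broadcast the noise level as an extra input channel.
        s = sigma.view(-1, 1, 1, 1).expand(-1, 1, *noisy.shape[-2:])
        return self.net(torch.cat([noisy, cond, s], dim=1))

if __name__ == "__main__":
    model = ConditionalConsistencyNet()
    x = torch.randn(2, 3, 64, 64)     # noise at the initial sigma
    cond = torch.randn(2, 3, 64, 64)  # source-domain image (e.g., low-light input)
    print(model(x, cond, torch.full((2,), 80.0)).shape)  # one-step translation
```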

Authors:Yitong Zhu, Zhuowen Liang, Yiming Wu, Tangyao Li, Yuyang Wang
Title: Towards Consumer-Grade Cybersickness Prediction: Multi-Model Alignment for Real-Time Vision-Only Inference
Abstract:
Cybersickness remains a major obstacle to the widespread adoption of immersive virtual reality (VR), particularly in consumer-grade environments. While prior methods rely on invasive signals such as electroencephalography (EEG) for high predictive accuracy, these approaches require specialized hardware and are impractical for real-world applications. In this work, we propose a scalable, deployable framework for personalized cybersickness prediction leveraging only non-invasive signals readily available from commercial VR headsets, including head motion, eye tracking, and physiological responses. Our model employs a modality-specific graph neural network enhanced with a Difference Attention Module to extract temporal-spatial embeddings capturing dynamic changes across modalities. A cross-modal alignment module jointly trains the video encoder to learn personalized traits by aligning video features with sensor-derived representations. Consequently, the model accurately predicts individual cybersickness using only video input during inference. Experimental results show our model achieves 88.4\% accuracy, closely matching EEG-based approaches (89.16\%), while reducing deployment complexity. With an average inference latency of 90ms, our framework supports real-time applications, ideal for integration into consumer-grade VR platforms without compromising personalization or performance. The code will be released at https://github.com/U235-Aurora/PTGNN.
中文摘要:本研究提出一种利用商用VR头显非侵入式信号的个性化晕动症预测框架,通过新型图神经网络设计实现接近脑电图方法的准确率,并具备实时应用能力。
English Summary: This study introduces a scalable framework for predicting cybersickness in VR using non-invasive signals from commercial headsets, achieving near-EEG accuracy with real-time performance through a novel graph neural network design.

Authors:Yulun Wu, Han Huang, Wenyuan Zhang, Chao Deng, Ge Gao, Ming Gu, Yu-Shen Liu
Title: Sparis: Neural Implicit Surface Reconstruction of Indoor Scenes from Sparse Views
Abstract:
In recent years, reconstructing indoor scene geometry from multi-view images has achieved encouraging accomplishments. Current methods incorporate monocular priors into neural implicit surface models to achieve high-quality reconstructions. However, these methods require hundreds of images for scene reconstruction. When only a limited number of views are available as input, the performance of monocular priors deteriorates due to scale ambiguity, leading to the collapse of the reconstructed scene geometry. In this paper, we propose a new method, named Sparis, for indoor surface reconstruction from sparse views. Specifically, we investigate the impact of monocular priors on sparse scene reconstruction, introducing a novel prior based on inter-image matching information. Our prior offers more accurate depth information while ensuring cross-view matching consistency. Additionally, we employ an angular filter strategy and an epipolar matching weight function, aiming to reduce errors due to view matching inaccuracies, thereby refining the inter-image prior for improved reconstruction accuracy. The experiments conducted on widely used benchmarks demonstrate superior performance in sparse-view scene reconstruction.

Authors:Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, Djamila Aouada
Title: Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
Abstract:
Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained using real and fake image sequences, therefore hindering their generalization capabilities to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hand-free annotations for our additional branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods. The code is available at https://github.com/10Ring/FakeSTormer.
Chinese: FakeSTormer方法通过多任务学习框架和视频级数据合成技术,有效捕捉深度伪造视频中的细微时空不一致性,在多个基准测试中优于现有方法。
English: The proposed FakeSTormer method enhances deepfake video detection by employing a multi-task learning framework and video-level data synthesis to capture subtle spatio-temporal inconsistencies, outperforming existing approaches on multiple benchmarks.

Authors:Lixiong Qin, Ning Jiang, Yang Zhang, Yuhan Qiu, Dingheng Zeng, Jiani Hu, Weihong Deng
Title: Towards Interactive Deepfake Analysis
Abstract:
Existing deepfake analysis methods are primarily based on discriminative models, which significantly limit their application scenarios. This paper aims to explore interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This direction faces challenges such as the lack of datasets and benchmarks, and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) an interactive deepfake analysis system called DFA-GPT, built with a Low-Rank Adaptation (LoRA) module, as a strong baseline for the community. The dataset and code will be made available at https://github.com/lxq1000/DFA-Instruct to facilitate further research.
中文摘要:本文通过指令调优多模态大语言模型探索交互式深度伪造分析,构建了DFA-Instruct数据集和DFA-Bench基准,并提出基于LoRA的DFA-GPT系统作为解决方案。
English Summary: This paper introduces an interactive deepfake analysis system using instruction-tuned multimodal large language models, addressing dataset gaps with DFA-Instruct and DFA-Bench while proposing DFA-GPT as a baseline with LoRA adaptation.

Authors:Feng Han, Kai Chen, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
Title: DuMo: Dual Encoder Modulation Network for Precise Concept Erasure
Abstract:
The exceptional generative capability of text-to-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these models alter the parameters of the backbone network and exert considerable influences on the structural (low-frequency) components of the image, which undermines the model's ability to retain non-target concepts. In this work, we propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimum impairment to non-target concepts. In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on details (high-frequency) components of the image. To minimize the damage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and the prior knowledge from the original skip connection features is introduced to the erasure process. Meanwhile, the phenomenon is observed that distinct erasing preferences for the image structure and details are demonstrated by the EPR at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of EPR module's outputs across different layers and timesteps, automatically balancing the erasure effects and model's generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure, Cartoon Concept Removal and Artistic Style Erasure, clearly outperforming alternative methods. Code is available at https://github.com/Maplebb/DuMo
中文: 提出的DuMo网络通过修改图像高频细节实现精准内容擦除,同时利用TLMO模块自适应调节擦除强度,在保护非目标概念的前提下取得了最优性能。
English: The proposed DuMo network effectively removes inappropriate content from images by modifying high-frequency details while preserving non-target concepts, achieving superior performance through its innovative EPR and TLMO modules.

Authors:Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
Title: Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
Abstract:
Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.
中文摘要:提出的RAGPT框架通过检索增强的动态提示解决了多模态学习中模态缺失的难题,利用跨模态检索和动态提示生成显著提升了多模态变换器的鲁棒性,在三个真实数据集上均展现出最优性能。
English Summary: The proposed RAGPT framework addresses limitations in prompt-based multimodal learning by using retrieval-augmented dynamic prompts to enhance transformer robustness against missing modalities, demonstrating superior performance across three real-world datasets.

Authors:Xuan Yu, Yuxuan Xie, Yili Liu, Haojian Lu, Rong Xiong, Yiyi Liao, Yue Wang
Title: Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction
Abstract:
Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
中文:PanopticRecon++ 提出了一种端到端的全景重建方法,通过可学习的3D高斯模型作为实例查询,结合空间先验和交叉注意力机制,有效提升了机器人技术和仿真中的场景理解能力。
English: PanopticRecon++ introduces an end-to-end method for panoptic reconstruction using learnable 3D Gaussians as instance queries, which integrates spatial priors and cross-attention to enhance scene understanding for robotics and simulation.

Authors:Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang
Title: EliGen: Entity-Level Controlled Image Generation with Regional Attention
Abstract:
Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-level controlled image Generation. Firstly, we put forward regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both spatial precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending its capabilities to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with other open-source models such as IP-Adapter, In-Context LoRA and MLLM, unlocking new creative possibilities. The source code, model, and dataset are published at https://github.com/modelscope/DiffSynth-Studio.git.
中文: EliGen提出了一种新颖框架,通过区域注意力和高质量数据集实现精细的实体级图像控制,在精度和质量上超越现有方法,并能扩展到多实体任务及与其他模型集成。
English: EliGen introduces a novel framework with regional attention and a high-quality dataset to enable fine-grained entity-level image control, surpassing existing methods in precision and quality while extending to multi-entity tasks and integration with other models.
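The claim that regional attention "requires no additional parameters" suggests it can be realized as an attention mask that restricts image tokens to the entity prompts whose spatial masks cover them, plus the global prompt. The helper below sketches such a mask under an assumed token layout; it is not the EliGen code.

```python
import torch

def build_regional_attention_mask(
    spatial_masks: torch.Tensor,   # (E, H, W) boolean mask per entity, at latent resolution
    tokens_per_prompt: int,        # text tokens allotted to each entity prompt (assumed layout)
    num_global_tokens: int,        # global prompt tokens, visible to every image token
) -> torch.Tensor:
    """Sketch: each image token may attend to the global prompt tokens and to the
    prompt tokens of entities whose spatial mask covers it. True = attention allowed."""
    E, H, W = spatial_masks.shape
    num_image_tokens = H * W
    num_text_tokens = num_global_tokens + E * tokens_per_prompt
    mask = torch.zeros(num_image_tokens, num_text_tokens, dtype=torch.bool)
    mask[:, :num_global_tokens] = True              # global prompt is always visible
    flat = spatial_masks.reshape(E, -1)              # (E, H*W)
    for e in range(E):
        start = num_global_tokens + e * tokens_per_prompt
        mask[flat[e], start : start + tokens_per_prompt] = True
    return mask

if __name__ == "__main__":
    masks = torch.zeros(2, 16, 16, dtype=torch.bool)
    masks[0, :8] = True   # entity 0 occupies the top half
    masks[1, 8:] = True   # entity 1 occupies the bottom half
    print(build_regional_attention_mask(masks, tokens_per_prompt=8, num_global_tokens=77).shape)
```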

Authors:Ziyang Chen, Wenting Li, Yongjun Zhang, Yabo Wu, Bingshu Wang, Yong Zhao, C. L. Philip Chen
Title: Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer
Abstract:
Constrained by the low-rank bottleneck inherent in attention mechanisms, current stereo matching transformers suffer from limited nonlinear expressivity, which renders their feature representations sensitive to challenging conditions such as reflections. To overcome this difficulty, we present the Hadamard Attention Recurrent Stereo Transformer (HART). HART includes a novel attention mechanism that incorporates the following components: 1) The Dense Attention Kernel (DAK) maps the attention weight distribution into a high-dimensional space over (0, +$\infty$). By removing the upper bound constraint on attention weights, DAK enables more flexible modeling of complex feature interactions. This reduces feature collinearity. 2) The Multi Kernel & Order Interaction (MKOI) module extends the attention mechanism by unifying semantic and spatial knowledge learning. This integration improves the ability of HART to learn features in binocular images. Experimental results demonstrate the effectiveness of our HART. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at https://github.com/ZYangChen/HART.
中文: 现有立体匹配变换器因注意力机制的低秩瓶颈而存在非线性表达能力受限的问题,对反射等挑战性条件敏感,但提出的HART模型通过移除注意力权重上限约束并融合语义-空间学习的新型注意力机制,在KITTI 2012基准测试中取得了最优性能。
English: Current stereo matching transformers face limited nonlinear expressivity due to low-rank attention bottlenecks, making them sensitive to challenging conditions like reflections, but the proposed HART model overcomes this with a novel attention mechanism that removes upper bound constraints and integrates semantic-spatial learning, achieving top performance on the KITTI 2012 benchmark.
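The abstract states only that the Dense Attention Kernel maps attention weights into (0, +$\infty$), removing softmax's upper bound. One unbounded positive map with that property is ELU(x) + 1 applied to the scaled scores; the snippet below uses it purely as an illustrative stand-in for the actual DAK.

```python
import torch
import torch.nn.functional as F

def dense_kernel_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention with an unbounded positive weighting instead of softmax probabilities.
    ELU(x) + 1 is only one example of a map onto (0, +inf); the exact Dense Attention
    Kernel used by HART is not specified by the abstract."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    weights = F.elu(scores) + 1.0      # strictly positive, not capped at 1
    return weights @ v

if __name__ == "__main__":
    q = torch.randn(2, 4, 64, 32)      # (batch, heads, tokens, dim)
    k = torch.randn(2, 4, 64, 32)
    v = torch.randn(2, 4, 64, 32)
    print(dense_kernel_attention(q, k, v).shape)  # (2, 4, 64, 32)
```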

Authors:Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
Title: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Abstract:
Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality \textbf{multimodal textbook} corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
中文摘要:本文提出了一种从教学视频中提取的高质量多模态教材语料库,为视觉语言模型提供了更丰富的知识基础和更好的图文对齐,显著提升了其在知识密集型和推理任务中的表现。
English Summary: This paper introduces a high-quality multimodal textbook corpus derived from instructional videos, which provides richer foundational knowledge and better image-text alignment for Vision-Language Models, significantly enhancing their performance in knowledge-intensive and reasoning tasks.

Authors:Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, Zhenwei Shi
Title: Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model
Abstract:
Generative foundation models have advanced large-scale text-driven natural image generation, becoming a prominent research trend across various vertical domains. However, in the remote sensing field, there is still a lack of research on large-scale text-to-image (text2image) generation technology. Existing remote sensing image-text datasets are small in scale and confined to specific geographic areas and scene types. Besides, existing text2image methods have struggled to achieve global-scale, multi-resolution controllable, and unbounded image generation. To address these challenges, this paper presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image-text dataset comprising 10.5 million image-text pairs, 5 times larger than the previous largest one. The dataset covers a wide range of geographic scenes and contains resolution information, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3 billion parameter generative foundation model based on the diffusion framework to model global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism, enabling users to specify image resolutions. A dynamic condition adaptation strategy is proposed for training and inference to improve image quality. Text2Earth excels in zero-shot text2image generation and demonstrates robust generalization and flexibility across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation. This robust capability surpasses previous models restricted to the basic fixed size and limited scene types. On the previous benchmark dataset, Text2Earth outperforms previous models with an improvement of +26.23 FID and +20.95% Zero-shot Cls-OA metric. Our project page is https://chen-yang-liu.github.io/Text2Earth
中文摘要:本文提出Git-10M数据集和Text2Earth基础模型,解决了遥感领域文本生成图像技术中数据规模小、场景单一的问题,实现了全球尺度、分辨率可控的高质量图像生成。
English Summary: This paper introduces the Git-10M dataset and Text2Earth foundation model to address limitations in remote sensing text-to-image generation, achieving superior performance in global-scale, resolution-controllable image synthesis.

Authors:Libin Lan, Lu Jiang, Tianshu Yu, Xiaojuan Liu, Zhongshi He
Title: FullTransNet: Full Transformer with Local-Global Attention for Video Summarization
Abstract:
Video summarization aims to generate a compact, informative, and representative synopsis of raw videos, which is crucial for browsing, analyzing, and understanding video content. Dominant approaches in video summarization primarily rely on recurrent or convolutional neural networks, and more recently on encoder-only transformer architectures. However, these methods typically suffer from several limitations in parallelism, modeling long-range dependencies, and providing explicit generative capabilities. To address these issues, we propose a transformer-like architecture named FullTransNet with two-fold ideas. First, it uses a full transformer with an encoder-decoder structure as an alternative architecture for video summarization. As the full transformer is specifically designed for sequence transduction tasks, its direct application to video summarization is both intuitive and effective. Second, it replaces the standard full attention mechanism with a combination of local and global sparse attention, enabling the model to capture long-range dependencies while significantly reducing computational costs. This local-global sparse attention is applied exclusively at the encoder side, where the majority of computations occur, further enhancing efficiency. Extensive experiments on two widely used benchmark datasets, SumMe and TVSum, demonstrate that our model achieves F-scores of 54.4% and 63.9%, respectively, while maintaining relatively low computational and memory requirements. These results surpass the second-best performing methods by 0.1% and 0.3%, respectively, verifying the effectiveness and efficiency of FullTransNet.
中文: FullTransNet 提出了一种基于Transformer的编码器-解码器架构,结合局部-全局稀疏注意力机制,在降低计算成本的同时,于标准数据集上实现了最优的视频摘要生成性能。
English: FullTransNet introduces a transformer-based encoder-decoder architecture with local-global sparse attention to efficiently generate video summaries, achieving state-of-the-art performance on benchmark datasets with reduced computational costs.
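The local-global sparse attention on the encoder side can be pictured as a banded mask (each frame token attends to a temporal window around it) augmented with a few global positions that attend everywhere and are attended from everywhere. The mask builder below is a sketch under assumed parameter names, not the FullTransNet implementation.

```python
import torch

def local_global_attention_mask(seq_len: int, window: int, global_idx: list) -> torch.Tensor:
    """Sketch of a local-global sparse attention pattern. True = attention allowed."""
    idx = torch.arange(seq_len)
    mask = (idx[:, None] - idx[None, :]).abs() <= window   # local band of width 2*window+1
    g = torch.tensor(global_idx)
    mask[g, :] = True   # global tokens see every position
    mask[:, g] = True   # every position sees the global tokens
    return mask

if __name__ == "__main__":
    m = local_global_attention_mask(seq_len=12, window=2, global_idx=[0, 6])
    print(m.int())
```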

Authors:Teng Hu, Jiangning Zhang, Ran Yi, Jieyu Weng, Yabiao Wang, Xianfang Zeng, Zhucun Xue, Lizhuang Ma
Title: Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
Abstract:
Employing LLMs for visual generation has recently become a research focus. However, the existing methods primarily transfer the LLM architecture to visual generation but rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Firstly, we propose a Codebook Rearrangement strategy that uses balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability the predicted token is located in the correct cluster, which significantly enhances the generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation. The code is available at: https://github.com/sjtuplayer/IAR.
中文: 本文提出IAR改进自回归视觉生成方法,通过重组视觉码本并采用面向簇的交叉熵损失,有效提升了基于大语言模型的视觉生成效率与质量。
English: This paper introduces IAR, an improved autoregressive visual generation method that enhances LLM-based visual generation by rearranging the codebook and using cluster-oriented loss to boost training efficiency and output quality.
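Given a codebook rearranged so that each contiguous block of indices forms one cluster, the cluster-oriented cross-entropy can be sketched by pooling token probabilities within each block and penalizing the wrong cluster. The loss weighting and its combination with the standard token loss below are assumptions, not the exact IAR objective.

```python
import torch
import torch.nn.functional as F

def cluster_oriented_loss(logits: torch.Tensor, targets: torch.Tensor,
                          cluster_size: int, lam: float = 1.0) -> torch.Tensor:
    """Sketch: token-level cross-entropy plus a cluster-level term that rewards the
    model for placing probability mass on the target token's cluster."""
    B, V = logits.shape
    num_clusters = V // cluster_size
    token_ce = F.cross_entropy(logits, targets)
    # Aggregate token probabilities into cluster probabilities (clusters are contiguous).
    probs = logits.softmax(dim=-1).view(B, num_clusters, cluster_size).sum(dim=-1)
    cluster_targets = targets // cluster_size
    cluster_ce = F.nll_loss(torch.log(probs.clamp_min(1e-9)), cluster_targets)
    return token_ce + lam * cluster_ce

if __name__ == "__main__":
    logits = torch.randn(4, 1024, requires_grad=True)
    targets = torch.randint(0, 1024, (4,))
    loss = cluster_oriented_loss(logits, targets, cluster_size=16)
    loss.backward()
    print(loss.item())
```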

Authors:Mingjia Li, Shuang Li, Tongrui Su, Longhui Yuan, Jian Liang, Wei Li
Title: Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation
Abstract:
Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at https://github.com/BIT-DA/DUSA.
Chinese Summary: 本研究提出DUSA方法,利用基于分数的生成模型中的语义先验,仅通过单个扩散步骤即可实现判别模型的高效测试时适应。
English Summary: This research introduces DUSA, a method that leverages the semantic priors in score-based generative models to enable efficient test-time adaptation of discriminative models using just a single diffusion timestep.
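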

Authors:Binglu Wang, Yao Tian, Shunzhou Wang, Le Yang
Title: Multimodal Large Models Are Effective Action Anticipators
Abstract:
The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation. Code is available at https://github.com/2tianyao1/ActionLLM.git.
Chinese: ActionLLM框架利用大型语言模型将视频序列视为连续标记,通过动作调整模块和跨模态交互块增强多模态理解,在基准数据集上展现出卓越的长期动作预测性能。
English: The ActionLLM framework leverages large language models to treat video sequences as tokens for long-term action anticipation, incorporating an action tuning module and cross-modality interaction to enhance multimodal understanding and achieve superior performance on benchmarks.

Authors:Suho Park, SuBeen Lee, Hyun Seok Seong, Jaejoon Yoo, Jae-Pil Heo
Title: Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation
Abstract:
We propose Foreground-Covering Prototype Generation and Matching to resolve Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we utilize the relationship between support and query prototypes. To achieve this, we utilize two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively serve as an alternative to the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet feature into the learnable tokens to generate class-consistent query prototypes. The generation of the support prototype is conducted symmetrically to that of the query one, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare these query prototypes with support ones to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performances on various datasets validate the effectiveness of the proposed method for FSS. Our official code is available at https://github.com/SuhoPark0706/FCP
中文: 本研究提出前景覆盖原型生成与匹配方法,通过结合SAM和ResNet特征构建类别一致的原型,并利用交叉注意力优化伪掩码,在多个数据集上实现了最先进的少样本分割性能。
English: This study introduces Foreground-Covering Prototype Generation and Matching for Few-Shot Segmentation, leveraging SAM and ResNet features to create class-consistent prototypes and using cross-attention to refine pseudo-masks, achieving state-of-the-art results across multiple datasets.

Authors:Haoxuan Li, Wei song, Peiwu Qin, Xi Yuan, Zhenglin Chen
Title: HCMA-UNet: A Hybrid CNN-Mamba UNet with Axial Self-Attention for Efficient Breast Cancer Segmentation
Abstract:
Breast cancer lesion segmentation in DCE-MRI remains challenging due to heterogeneous tumor morphology and indistinct boundaries. To address these challenges, this study proposes a novel hybrid segmentation network, HCMA-UNet, for lesion segmentation of breast cancer. Our network consists of a lightweight CNN backbone and a Multi-view Axial Self-Attention Mamba (MISM) module. The MISM module integrates Visual State Space Block (VSSB) and Axial Self-Attention (ASA) mechanism, effectively reducing parameters through Asymmetric Split Channel (ASC) strategy to achieve efficient tri-directional feature extraction. Our lightweight model achieves superior performance with 2.87M parameters and 126.44 GFLOPs. A Feature-guided Region-aware loss function (FRLoss) is proposed to enhance segmentation accuracy. Extensive experiments on one private and two public DCE-MRI breast cancer datasets demonstrate that our approach achieves state-of-the-art performance while maintaining computational efficiency. FRLoss also exhibits good cross-architecture generalization capabilities. The source code is available at https://github.com/Haoxuanli-Thu/HCMA-UNet.
Chinese: 本研究提出轻量级混合网络HCMA-UNet,通过新型MISM模块和FRLoss损失函数,在保持计算效率的同时实现了DCE-MRI乳腺癌病灶分割的最优性能。
English: This study introduces HCMA-UNet, a lightweight hybrid network with a novel MISM module and FRLoss function, achieving state-of-the-art breast cancer lesion segmentation in DCE-MRI while maintaining computational efficiency.

Authors:Ruibin Li, Tao Yang, Song Guo, Lei Zhang
Title: RORem: Training a Robust Object Remover with Human-in-the-Loop
Abstract:
Despite the significant advancements, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then utilize human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the following training data generation process. By iterating this process for several rounds, we finally obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained stable diffusion model with this dataset, we obtain our RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. Particularly, RORem improves the object removal success rate over previous methods by more than 18\%. The dataset, source code and trained model are available at https://github.com/leeruibin/RORem.
English Summary: Existing object removal methods face challenges like incomplete removal and blurry synthesis due to limited training data and ambiguous self-supervised learning, prompting the development of RORem through a semi-supervised approach with human feedback that achieves over 18% higher success rates.

Authors:Abdesselam Ferdi
Title: Lightweight G-YOLOv11: Advancing Efficient Fracture Detection in Pediatric Wrist X-rays
Abstract:
Computer-aided diagnosis (CAD) systems have greatly improved the interpretation of medical images by radiologists and surgeons. However, current CAD systems for fracture detection in X-ray images primarily rely on large, resource-intensive detectors, which limits their practicality in clinical settings. To address this limitation, we propose a novel lightweight CAD system based on the YOLO detector for fracture detection. This system, named ghost convolution-based YOLOv11 (G-YOLOv11), builds on the latest version of the YOLO detector family and incorporates the ghost convolution operation for feature extraction. The ghost convolution operation generates the same number of feature maps as traditional convolution but requires fewer linear operations, thereby reducing the detector's computational resource requirements. We evaluated the performance of the proposed G-YOLOv11 detector on the GRAZPEDWRI-DX dataset, achieving an mAP@0.5 of 0.535 with an inference time of 2.4 ms on an NVIDIA A10 GPU. Compared to the standard YOLOv11l, G-YOLOv11l achieved reductions of 13.6% in mAP@0.5 and 68.7% in size. These results establish a new state-of-the-art benchmark in terms of efficiency, outperforming existing detectors. Code and models are available at https://github.com/AbdesselamFerdi/G-YOLOv11.
中文: 研究人员开发了一种名为G-YOLOv11的轻量级计算机辅助诊断系统,通过采用幽灵卷积技术实现X光图像骨折检测,在保持较高检测效率的同时将模型体积减小68.7%,创下了计算效率的新标杆。
English: Researchers have developed a lightweight computer-aided diagnosis system called G-YOLOv11 that uses ghost convolution to detect fractures in X-ray images with high efficiency, achieving state-of-the-art performance in computational speed while reducing model size by 68.7%.
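Ghost convolution (introduced in GhostNet) produces part of the output channels with a standard convolution and the rest with cheap depthwise convolutions over those maps; the module below sketches this building block. Kernel sizes, normalization, and activation are assumptions, and the exact variant used in G-YOLOv11 may differ.

```python
import math
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """GhostNet-style convolution: a primary conv generates a subset of the output
    channels, and depthwise convs generate the remaining "ghost" feature maps."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 1, ratio: int = 2, dw_size: int = 3):
        super().__init__()
        init_ch = math.ceil(out_ch / ratio)
        cheap_ch = init_ch * (ratio - 1)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.SiLU(),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_size, padding=dw_size // 2, groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.SiLU(),
        )
        self.out_ch = out_ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.primary(x)
        y2 = self.cheap(y1)                      # cheap depthwise "ghost" maps
        return torch.cat([y1, y2], dim=1)[:, : self.out_ch]

if __name__ == "__main__":
    print(GhostConv(64, 128)(torch.randn(1, 64, 32, 32)).shape)  # (1, 128, 32, 32)
```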

Authors:Kim Sung-Bin, Kim Jun-Seong, Junseok Ko, Yewon Kim, Tae-Hyun Oh
Title: SoundBrush: Sound as a Brush for Visual Scene Editing
Abstract:
We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at https://soundbrush.github.io/.

Authors:Yuchuan Tian, Jing Han, Chengcheng Wang, Yuchen Liang, Chao Xu, Hanting Chen
Title: DiC: Rethinking Conv3x3 Designs in Diffusion Models
Abstract:
Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operations results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 Convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but it still under-performs our expectations. Further improving the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage. Project page: https://github.com/YuchuanTian/DiC
中文摘要:提出的扩散卷积网络(DiC)通过采用优化的卷积结构、稀疏连接和改进的条件机制,在保持速度优势的同时,其性能显著超越了现有的扩散变换器模型。
English Summary: The proposed Diffusion CNN (DiC) replaces complex transformer architectures with optimized convolutional networks featuring sparse connections and enhanced conditioning, achieving superior performance and faster inference than existing diffusion models.
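One of the listed conditioning improvements, conditional gating, can be illustrated as a condition embedding (e.g. timestep plus class) that produces a per-channel gate on each 3x3 convolution block's residual update. A minimal sketch under that reading; the block structure and names are assumptions, not the released DiC code.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """3x3 conv block whose residual update is gated by a condition embedding.
    Purely convolutional; no self-attention."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Linear(cond_dim, channels)        # per-channel gate from the condition

    def forward(self, x, cond):
        h = self.conv2(F.silu(self.conv1(self.norm(x))))
        g = torch.sigmoid(self.gate(cond))[:, :, None, None]
        return x + g * h                                 # gated residual update

x, cond = torch.randn(2, 64, 32, 32), torch.randn(2, 256)
print(GatedConvBlock(64, 256)(x, cond).shape)            # torch.Size([2, 64, 32, 32])
```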

Authors:Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, Boris Ivanovic, Yue Wang, Marco Pavone
Title: STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes
Abstract:
We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations--parameterized by 3D Gaussians and their velocities--in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding.
中文: STORM是一种创新的时空重建模型,通过Transformer架构从稀疏观测中单次前向推断动态户外场景,在重建质量和效率上均优于现有方法。
English: STORM is a novel spatio-temporal reconstruction model that uses a Transformer architecture to reconstruct dynamic outdoor scenes from sparse observations in a single forward pass, achieving superior performance over existing methods in both reconstruction quality and efficiency.
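The key aggregation step, transporting Gaussians predicted at different frames to a common target timestep using their predicted velocities, reduces to a constant-velocity warp of the Gaussian centers. A minimal sketch assuming exactly that; in STORM the Gaussians and flows themselves come from the feed-forward Transformer.

```
import torch

def warp_gaussians(means, velocities, t_src, t_tgt):
    """Transport 3D Gaussian centers predicted at times t_src to a common target
    time t_tgt using per-Gaussian velocities (constant-velocity assumption).

    means:      (N, 3) Gaussian centers
    velocities: (N, 3) scene-flow velocities
    t_src:      (N,)   timestamp each Gaussian was predicted at
    t_tgt:      scalar target timestamp
    """
    dt = (t_tgt - t_src)[:, None]                        # (N, 1)
    return means + velocities * dt                       # aggregated ("amodal") scene at t_tgt

means = torch.randn(1000, 3)
vel = torch.randn(1000, 3) * 0.1
t_src = torch.rand(1000) * 2.0
print(warp_gaussians(means, vel, t_src, t_tgt=1.0).shape)  # torch.Size([1000, 3])
```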

Authors:Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang
Title: DreamDrive: Generative 4D Scene Modeling from Street View Images
Abstract:
Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.
中文摘要:DreamDrive是一种创新的4D场景生成方法,通过融合生成模型与神经渲染技术,能够基于车辆轨迹合成具有三维一致性的逼真驾驶视频,并通过高质量场景生成提升自动驾驶任务的性能。
English Summary: DreamDrive is a novel 4D scene generation method that combines generative models with neural rendering to synthesize realistic driving videos with 3D consistency from vehicle trajectories, enhancing autonomous driving tasks through high-fidelity scene generation.

Authors:Madeleine Darbyshire, Elizabeth Sklar, Simon Parsons
Title: Exploiting Boundary Loss for the Hierarchical Panoptic Segmentation of Plants and Leaves
Abstract:
Precision agriculture leverages data and machine learning so that farmers can monitor their crops and target interventions precisely. This enables the precision application of herbicide only to weeds, or the precision application of fertilizer only to undernourished crops, rather than to the entire field. The approach promises to maximize yields while minimizing resource use and harm to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method that simultaneously determines leaf count (as an identifier of plant growth) and locates weeds within an image. In particular, our approach aims to improve the segmentation of smaller instances like the leaves and weeds by incorporating focal loss and boundary loss. Not only does this result in competitive performance, achieving a PQ+ of 81.89 on the standard training set, but we also demonstrate we can improve leaf-counting accuracy with our method. The code is available at https://github.com/madeleinedarbyshire/HierarchicalMask2Former.
Chinese: 精准农业利用机器学习实现针对性作物监测与干预,通过结合焦点损失和边界损失的层次全景分割方法,提升了叶片计数和杂草检测的准确性,取得了优异性能。
English: Precision agriculture employs machine learning for targeted crop monitoring and intervention, using a hierarchical panoptic segmentation method with focal and boundary loss to enhance leaf counting and weed detection, achieving high performance and improved accuracy.
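The loss design can be sketched as a weighted sum of a binary focal term and a boundary term computed against a precomputed signed distance map of the ground truth (in the style of Kervadec et al.). The weights and the exact boundary formulation below are assumptions for illustration, not the paper's configuration.

```
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy pixels so small instances matter more."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = torch.exp(-bce)                                 # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def boundary_loss(logits, signed_dist):
    """Boundary loss: predicted foreground probability integrated against a signed
    distance map of the ground-truth boundary (negative inside, positive outside)."""
    return (torch.sigmoid(logits) * signed_dist).mean()

def total_loss(logits, target, signed_dist, w_focal=1.0, w_boundary=0.5):
    return w_focal * focal_loss(logits, target) + w_boundary * boundary_loss(logits, signed_dist)

logits = torch.randn(2, 1, 128, 128)
target = (torch.rand(2, 1, 128, 128) > 0.7).float()
signed_dist = torch.randn(2, 1, 128, 128)                 # from a distance transform of `target`
print(total_loss(logits, target, signed_dist).item())
```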

Authors:Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
Title: Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Abstract:
Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLMs-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.
中文摘要:提出的PCM-Net通过逐块跨模态特征混合机制,在零样本图像描述任务中自适应融合文本特征以修正合成图像的语义偏差,在基准数据集上实现了最优性能。
English Summary: The proposed PCM-Net introduces a patch-wise cross-modal feature mix-up mechanism to address semantic misalignment in zero-shot image captioning by adaptively refining synthetic image features with textual concepts, achieving state-of-the-art performance on benchmark datasets.
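The CLIP-weighted cross-entropy can be pictured as the usual token-level caption loss, re-weighted per sample by the CLIP image-text similarity of the synthetic pair so that cleaner pairs dominate training. A hedged sketch; the normalization of the weights here is an assumption, not the paper's exact formulation.

```
import torch
import torch.nn.functional as F

def clip_weighted_ce(logits, captions, clip_sim, pad_id=0):
    """Token-level cross-entropy, re-weighted per sample by CLIP similarity.

    logits:   (B, T, V) caption decoder outputs
    captions: (B, T)    target token ids
    clip_sim: (B,)      cosine similarity of each synthetic image-text pair
    """
    B, T, V = logits.shape
    ce = F.cross_entropy(logits.reshape(B * T, V), captions.reshape(B * T),
                         ignore_index=pad_id, reduction="none").view(B, T)
    weights = torch.softmax(clip_sim, dim=0) * B          # normalized so the mean weight is ~1
    return (weights[:, None] * ce).sum() / (captions != pad_id).sum()

logits = torch.randn(4, 12, 1000)
captions = torch.randint(1, 1000, (4, 12))
clip_sim = torch.rand(4)
print(clip_weighted_ce(logits, captions, clip_sim).item())
```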

Authors:Wenhao Dong, Yueyang Li, Weiming Zeng, Lei Chen, Hongjie Yan, Wai Ting Siok, Nizhuan Wang
Title: STARFormer: A Novel Spatio-Temporal Aggregation Reorganization Transformer of FMRI for Brain Disorder Diagnosis
Abstract:
Many existing methods that use functional magnetic resonance imaging (fMRI) to classify brain disorders, such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD), often overlook the integration of spatial and temporal dependencies of the blood oxygen level-dependent (BOLD) signals, which may lead to inaccurate or imprecise classification results. To solve this problem, we propose a Spatio-Temporal Aggregation Reorganization Transformer (STARFormer) that effectively captures both spatial and temporal features of BOLD signals by incorporating three key modules. The region of interest (ROI) spatial structure analysis module uses eigenvector centrality (EC) to reorganize brain regions based on effective connectivity, highlighting critical spatial relationships relevant to the brain disorder. The temporal feature reorganization module systematically segments the time series into equal-dimensional window tokens and captures multiscale features through variable window and cross-window attention. The spatio-temporal feature fusion module employs a parallel transformer architecture with dedicated temporal and spatial branches to extract integrated features. The proposed STARFormer has been rigorously evaluated on two publicly available datasets for the classification of ASD and ADHD. The experimental results confirm that the STARFormer achieves state-of-the-art performance across multiple evaluation metrics, providing a more accurate and reliable tool for the diagnosis of brain disorders and biomedical research. The code is available at: https://github.com/NZWANG/STARFormer.
中文:提出的STARFormer模型通过专门模块有效整合fMRI BOLD信号的空间与时间特征,在自闭症和多动症分类任务中取得最优性能,同时解决了以往方法在捕捉这些依赖关系方面的不足。
English: The proposed STARFormer model effectively integrates spatial and temporal features of fMRI BOLD signals through specialized modules, achieving state-of-the-art performance in classifying autism and ADHD while addressing previous methods' limitations in capturing these dependencies.
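The ROI spatial structure analysis step can be illustrated as computing eigenvector centrality from a connectivity matrix and using it to reorder the ROI axis of the BOLD series before tokenization. A minimal NumPy sketch under those assumptions; the connectivity estimate and ordering rule are illustrative, not the paper's exact pipeline.

```
import numpy as np

def reorder_rois_by_ec(bold, connectivity):
    """Reorder ROIs by eigenvector centrality of the connectivity graph.

    bold:         (T, R) BOLD time series with R regions of interest
    connectivity: (R, R) symmetric, non-negative connectivity matrix
    """
    eigvals, eigvecs = np.linalg.eigh(connectivity)
    centrality = np.abs(eigvecs[:, -1])                   # leading eigenvector = EC scores
    order = np.argsort(-centrality)                        # most central regions first
    return bold[:, order], order

T, R = 200, 90
bold = np.random.randn(T, R)
conn = np.abs(np.corrcoef(bold.T))                         # toy connectivity from correlations
reordered, order = reorder_rois_by_ec(bold, conn)
print(reordered.shape, order[:5])
```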

Authors:Runnan Chen, Zhaoqing Wang, Jiepeng Wang, Yuexin Ma, Mingming Gong, Wenping Wang, Tongliang Liu
Title: PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM
Abstract:
Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. This STL module addresses the challenges of label noise and inconsistencies in 2D predictions by refining the pseudo labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from the RGB-D video. (https://github.com/runnanchen/PanoSLAM)
中文: PanoSLAM是首个统一SLAM系统,通过3D高斯泼溅和时空提升模块,将几何重建、3D语义分割和实例分割集成,实现了从RGB-D视频中精确的全景3D重建。
English: PanoSLAM is the first unified SLAM system that integrates geometric reconstruction, 3D semantic segmentation, and instance segmentation using 3D Gaussian Splatting and a Spatial-Temporal Lifting module to achieve accurate panoptic 3D reconstruction from RGB-D videos.

Authors:Runnan Chen, Xiangyu Sun, Zhaoqing Wang, Youquan Liu, Jiepeng Wang, Lingdong Kong, Jiankang Deng, Mingming Gong, Liang Pan, Wenping Wang, Tongliang Liu
Title: OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies
Abstract:
Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting open-vocabulary querying to their training scenes and thus lacking generalizability to novel scenes. In this work, we propose OVGaussian, a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train the 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (https://github.com/runnanchen/OVGaussian).
中文: OVGaussian 是一种创新的开放式词汇三维语义分割框架,通过大规模标注数据集和跨模态学习,实现了跨场景和跨领域的强大泛化能力。
English: OVGaussian is a novel open-vocabulary 3D semantic segmentation framework that leverages a large-scale annotated dataset and cross-modal learning to achieve robust generalization across diverse scenes and domains.
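Generalizable Semantic Rasterization amounts to attaching a predicted semantic feature to each Gaussian and alpha-compositing those features along each ray exactly like color. The per-pixel sketch below illustrates that compositing rule only (the real renderer is tile-based splatting); shapes and names are assumptions.

```
import torch

def composite_semantics(sem_logits, alphas):
    """Alpha-composite per-Gaussian semantic logits along one camera ray.

    sem_logits: (K, C) semantic logits of the K Gaussians hit by the ray, sorted front to back
    alphas:     (K,)   per-Gaussian opacities after projection
    Returns a (C,) semantic logit vector for the pixel (same rule as color compositing).
    """
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)   # (K,)
    weights = alphas * transmittance                              # (K,)
    return (weights[:, None] * sem_logits).sum(dim=0)

sem_logits = torch.randn(16, 20)            # 16 Gaussians, 20 open-vocabulary classes
alphas = torch.rand(16) * 0.5
print(composite_semantics(sem_logits, alphas).shape)              # torch.Size([20])
```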

Authors:Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai
Title: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Abstract:
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate the OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/
中文: OCRBench v2 被提出作为一个全面的双语基准,用于评估大型多模态模型的OCR能力,结果显示大多数模型因在文本定位和逻辑推理等挑战性任务上的局限而得分低于50%。
English: OCRBench v2 is introduced as a comprehensive bilingual benchmark to evaluate Large Multimodal Models' OCR capabilities, revealing that most models score below 50% due to limitations in challenging tasks like text localization and logical reasoning.

Authors:Shi-Feng Peng, Guolei Sun, Yong Li, Hongsong Wang, Guo-Sen Xie
Title: SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation
Abstract:
The primary challenge of cross-domain few-shot segmentation (CD-FSS) is the domain disparity between the training and inference phases, which can exist in either the input data or the target classes. Previous models struggle to learn feature representations that generalize to various unknown domains from limited training domain samples. In contrast, the large-scale visual model SAM, pre-trained on tens of millions of images from various domains and classes, possesses excellent generalizability. In this work, we propose a SAM-aware graph prompt reasoning network (GPRN) that fully leverages SAM to guide CD-FSS feature representation learning and improve prediction accuracy. Specifically, we propose a SAM-aware prompt initialization module (SPI) to transform the masks generated by SAM into visual prompts enriched with high-level semantic information. Since SAM tends to divide an object into many sub-regions, this may lead to visual prompts representing the same semantic object having inconsistent or fragmented features. We further propose a graph prompt reasoning (GPR) module that constructs a graph among visual prompts to reason about their interrelationships and enable each visual prompt to aggregate information from similar prompts, thus achieving global semantic consistency. Subsequently, each visual prompt embeds its semantic information into the corresponding mask region to assist in feature representation learning. To refine the segmentation mask during testing, we also design a non-parametric adaptive point selection module (APS) to select representative point prompts from query predictions and feed them back to SAM to refine inaccurate segmentation results. Experiments on four standard CD-FSS datasets demonstrate that our method establishes new state-of-the-art results. Code: https://github.com/CVL-hub/GPRN.
中文: 提出的SAM感知图提示推理网络(GPRN)利用大规模视觉模型SAM的泛化能力,通过将SAM生成的掩码转化为语义提示并借助图推理确保全局一致性,有效解决了跨域少样本分割中的领域差异问题,在多个数据集上取得了最优性能。
English: The proposed SAM-aware graph prompt reasoning network (GPRN) leverages the generalizability of the large-scale visual model SAM to address domain disparity in cross-domain few-shot segmentation by converting SAM masks into semantic prompts and ensuring global consistency through graph reasoning, achieving state-of-the-art results on multiple datasets.
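The graph prompt reasoning module can be sketched as a similarity graph over the SAM-derived prompt embeddings in which each prompt aggregates features from its most similar neighbors, pulling fragmented sub-region prompts toward a consistent representation. A single-round, hedged sketch; the neighborhood size, mixing weight, and parameterization are illustrative, not the paper's module.

```
import torch
import torch.nn.functional as F

def graph_prompt_reasoning(prompts, topk=4, alpha=0.5):
    """One round of message passing over visual prompt embeddings.

    prompts: (N, D) embeddings of SAM-mask-derived visual prompts
    topk:    number of most similar prompts each node aggregates from
    alpha:   residual mixing weight between original and aggregated features
    """
    normed = F.normalize(prompts, dim=-1)
    sim = normed @ normed.T                               # (N, N) cosine similarity graph
    vals, idx = sim.topk(topk, dim=-1)                    # nearest prompts per node
    attn = torch.softmax(vals, dim=-1)                    # (N, topk)
    aggregated = (attn[..., None] * prompts[idx]).sum(dim=1)  # (N, D)
    return alpha * prompts + (1 - alpha) * aggregated     # globally more consistent prompts

prompts = torch.randn(12, 256)                             # 12 SAM sub-region prompts
print(graph_prompt_reasoning(prompts).shape)               # torch.Size([12, 256])
```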

Authors:Rajat Talak, Charis Georgiou, Jingnan Shi, Luca Carlone
Title: Outlier-Robust Training of Machine Learning Models
Abstract:
Robust training of machine learning models in the presence of outliers has garnered attention across various domains. The use of robust losses is a popular approach and is known to mitigate the impact of outliers. We bring to light two literatures that have diverged in their ways of designing robust losses: one using M-estimation, which is popular in robotics and computer vision, and another using a risk-minimization framework, which is popular in deep learning. We first show that a simple modification of the Black-Rangarajan duality provides a unifying view. The modified duality brings out a definition of a robust loss kernel σ that is satisfied by robust losses in both literatures. Secondly, using the modified duality, we propose an Adaptive Alternation Algorithm (AAA) for training machine learning models with outliers. The algorithm iteratively trains the model by using a weighted version of the non-robust loss, while updating the weights at each iteration. The algorithm is augmented with a novel parameter update rule by interpreting the weights as inlier probabilities, and obviates the need for complex parameter tuning. Thirdly, we investigate convergence of the adaptive alternation algorithm to outlier-free optima. Considering arbitrary outliers (i.e., with no distributional assumption on the outliers), we show that the use of robust loss kernels σ increases the region of convergence. We experimentally show the efficacy of our algorithm on regression, classification, and neural scene reconstruction problems. We release our implementation code: https://github.com/MIT-SPARK/ORT.
中文摘要:本文通过改进的对偶框架提出了鲁棒损失设计的统一视角,并开发了一种自适应交替算法,通过将权重作为内点概率迭代更新来增强含异常值时的模型训练,在多种任务中提升了收敛性和性能。
English Summary: This paper introduces a unified view of robust loss design through a modified duality framework and proposes an Adaptive Alternation Algorithm that enhances model training with outliers by iteratively updating weights as inlier probabilities, improving convergence and performance across various tasks.
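The Adaptive Alternation Algorithm alternates between fitting a weighted version of the non-robust loss and re-estimating per-sample weights as inlier probabilities. The sketch below uses a simple Cauchy-style weight rule on a toy regression problem for illustration; the paper's actual kernel and parameter update rule differ.

```
import torch

def adaptive_alternation(model, x, y, epochs=200, lr=0.1, tau=1.0):
    """Alternate between (1) fitting a weighted squared loss and
    (2) re-estimating per-sample inlier weights from the residuals.
    The weight rule here (Cauchy-style) is illustrative only."""
    weights = torch.ones(len(x))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        # Step 1: minimize the weighted non-robust (squared) loss
        opt.zero_grad()
        residuals = (model(x).squeeze(-1) - y) ** 2
        (weights.detach() * residuals).mean().backward()
        opt.step()
        # Step 2: update weights as inlier probabilities from current residuals
        with torch.no_grad():
            r2 = (model(x).squeeze(-1) - y) ** 2
            weights = tau ** 2 / (tau ** 2 + r2)           # in (0, 1]; small residual -> weight near 1
    return weights                                          # final inlier probabilities

x = torch.linspace(-1, 1, 200)[:, None]
y = 3 * x.squeeze(-1) + 0.05 * torch.randn(200)
y[:20] += 5.0                                               # inject gross outliers
model = torch.nn.Linear(1, 1)
w = adaptive_alternation(model, x, y)
print(model.weight.item(), w[:20].mean().item(), w[20:].mean().item())
```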

Authors:Edwin Arkel Rios, Jansen Christopher Yuanda, Vincent Leon Ghanz, Cheng-Wei Yu, Bo-Cheng Lai, Min-Chun Hu
Title: Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
Abstract:
Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional FGIR deals with classifying different species, UFGIR goes beyond by classifying sub-categories within a species, such as cultivars of a plant. Recently, Vision Transformer-based backbones have allowed methods to obtain outstanding recognition performance in this task, but this comes at a significant computational cost, especially since this task benefits significantly from higher-resolution images. Therefore, techniques such as token reduction have emerged to reduce the computational cost. However, dropping tokens leads to loss of essential information for fine-grained categories, especially as the token keep rate is reduced. To counteract the information loss caused by token reduction, we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers in later locations. Extensive experiments covering more than 2000 runs across diverse settings, including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs. cost for UFGIR by reducing the kept tokens to extremely low ratios of up to 10% while maintaining accuracy competitive with state-of-the-art models. Code is available at: https://github.com/arkel23/CLCA
中文: 针对超细粒度图像识别中令牌削减导致的信息丢失问题,我们提出了跨层聚合分类头和跨层缓存机制,通过在多种数据集和骨干网络上验证,能在削减高达90%令牌的同时保持优异识别精度。
English: To address the information loss from token reduction in ultra-fine-grained image recognition, we introduce a Cross-Layer Aggregation Classification Head and Cross-Layer Cache mechanism, enabling high accuracy with up to 90% token reduction across diverse datasets and backbones.
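The cross-layer cache idea, keeping a summary of each layer so that the classification head can still access information discarded by later token reduction, can be sketched as caching the CLS token at every block and fusing the cache with learned per-layer weights. Shapes and the fusion rule are assumptions for illustration, not the released CLCA modules.

```
import torch
import torch.nn as nn

class CrossLayerAggregationHead(nn.Module):
    """Aggregate a cache of per-layer summary tokens (e.g. the CLS token kept at
    every transformer block) into a single classification feature, so information
    dropped by token reduction in later layers remains usable."""
    def __init__(self, dim, num_classes, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned per-layer mixing
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, cache):                  # cache: list of (B, dim) CLS tokens, one per layer
        stacked = torch.stack(cache, dim=1)    # (B, L, dim)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w[None, :, None] * stacked).sum(dim=1)
        return self.fc(self.norm(fused))

cache = [torch.randn(8, 384) for _ in range(12)]            # cached CLS tokens from 12 blocks
head = CrossLayerAggregationHead(dim=384, num_classes=200, num_layers=12)
print(head(cache).shape)                                     # torch.Size([8, 200])
```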